
Attacking CERN’s big data problem

Session Name: CERN’s Grand Collision Experiment: Cloud Technology Meets Work Culture.

Chris Albrecht
Tim Bell

Chris Albrecht 00:03

Thank you, Barack. Moving on to the next session of the morning, we have “CERN’s Grand Collision Experiment: Cloud Technology Meets Work Culture,” a presentation from Mr. Tim Bell. He’s the infrastructure manager for CERN. Please welcome Tim, he’s right there. Welcome Tim Bell to the stage.

Tim Bell 00:30

Great. Thanks for the chance to come along and talk to you today. My name is Tim Bell, and I’m responsible for the infrastructure. To explain a little about CERN: it’s the European laboratory for particle physics, an organization with a budget of around 700,000,000 pounds a year, and we support 11,000 physicists around the world. We have a very simple mandate: to understand how the universe works and what it’s made of. In order to do that, we basically have to take on a large number of complex experiments. We have 20 member states in Europe that support us in this, and it’s the taxpayers of Europe who fund it. So thank you very much for your contributions.

Tim Bell 01:16

What do physicists worry about when they wake up in the morning? In the 1960s, they came up with a theory of how the universe should work. It’s a jigsaw, and they’ve been gradually fitting the pieces of that jigsaw together. There’s one final piece that we’re closing in on, which you may have heard about last July, called the Higgs boson. That explains why we have mass and why we aren’t traveling around at the speed of light. Beyond that, we have some basic questions. If you look at the Big Bang, with that massive amount of energy, you should produce the same amount of matter and antimatter. However, we’re clearly matter; therefore, where has all the antimatter gone? That’s the kind of question we need to understand. And for those of you who have read Dan Brown’s Angels & Demons, you don’t need to worry: the antimatter isn’t stored at CERN.

Tim Bell 02:02

We have a rather embarrassing situation where we’ve lost 95% of the universe. When we look at how the galaxies move and then add up the amount of mass we can actually find, most of it is dark. We don’t know what it is. The good news is that last year we’d lost 97% of the universe, so we’re gradually getting there. But it’s a serious worry when you’ve only got 5% of the universe understood. So how do we do it? We do it by building the world’s largest scientific experiments and using those to simulate and to execute fundamental particle research. The Large Hadron Collider is the flagship experiment at the moment: 27 kilometers in circumference, 100 meters under the ground between Geneva and the Jura mountains. It takes about 20 minutes to drive from one side to the other. The protons go around 11,000 times a second, just below the speed of light. The tunnel itself contains 1,600 superconducting magnets cooled to 2 degrees above absolute zero, minus 271 degrees centigrade. Inside, there are two tubes about 1 centimeter across, with a pressure about ten times lower than on the moon. We fire large quantities of protons in one direction and large quantities of protons in the other direction, and we collide them in four places.
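
Those figures are easy to sanity-check. A quick back-of-the-envelope calculation (our arithmetic, not from the talk) confirms that 11,000 laps of a 27 km ring per second really is just below the speed of light:

```python
# Sanity check of the LHC numbers quoted above (editorial arithmetic,
# not from the talk): 11,000 laps per second around a 27 km ring.
circumference_km = 27          # LHC ring circumference
laps_per_second = 11_000       # revolutions per second quoted above
c_km_per_s = 299_792.458       # speed of light in km/s

speed_km_per_s = circumference_km * laps_per_second   # ~297,000 km/s
print(f"{speed_km_per_s:,} km/s, i.e. {speed_km_per_s / c_km_per_s:.1%} of c")
```

That works out to roughly 99.1% of the speed of light, consistent with the claim.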

Tim Bell 03:27

In those four places, we have detectors. These are 7,000-ton pieces of equipment about the size of Notre Dame Cathedral. They can be thought of as digital cameras, 100-megapixel cameras; quite large ones, but they take 40,000,000 pictures a second. This produces a petabyte of data every second that needs to be analyzed. There are massive server farms that look after filtering this data down to reasonable levels. What do we get at the end? We get an understanding of collisions. If you imagine a high-speed train colliding with another high-speed train, that is the kind of energy we’re talking about. We produce matter at this point, unusual forms of matter, and then we analyze that. And with this we can then simulate and understand the experimental conditions. At the end of all this, we have a big data challenge. In total, we’re recording about 35 petabytes a year at the moment. When the LHC is upgraded, and it’s currently in the process of an upgrade, we’re expecting that data rate to roughly double. The physicists want to keep this data for 20 years, which means that when you work it all out, we’re certainly heading for exabytes fairly soon. We actually store things on 45,000 tapes. At the moment, this is the easiest and the most economical way.
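
To give a sense of how aggressive that filtering is, here is a rough editorial sketch using only the figures quoted above, and generously assuming the detectors ran all year (they don’t, so the true factor is somewhat smaller):

```python
# Rough sketch of the filtering factor implied by the figures above:
# ~1 PB/s off the detectors versus ~35 PB/year actually recorded.
# Continuous year-round running is assumed, which overstates the raw
# total; the point is the order of magnitude, not a precise number.
raw_rate_pb_per_s = 1.0          # ~1 petabyte/second off the detectors
recorded_pb_per_year = 35.0      # ~35 PB/year kept on tape

seconds_per_year = 365 * 24 * 3600
raw_pb_per_year = raw_rate_pb_per_s * seconds_per_year

reduction = raw_pb_per_year / recorded_pb_per_year
print(f"Raw: {raw_pb_per_year:,.0f} PB/year; roughly 1 part in {reduction:,.0f} is kept")
```

In other words, only about one part in a million of the raw detector output survives to be recorded.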

Tim Bell 04:47

The data center itself was built in the 1970s for mainframes and a Cray, and we’re having the classic problems of traditional data centers. As you can see, the racks are only partially full. We’re finding it very difficult to cool the center. For those of you who are interested, this will soon be going up on Google Street View; we had the car round a few months ago. In two weeks’ time, we’re getting 100,000 people through the computer center as part of CERN’s open days over a weekend. This means that along with being a computer center, we’re also a tourist attraction and part of the physics outreach activities.

Tim Bell 05:24

The good news is that we were able to get a new data center in Hungary. This is needed to allow us to expand and address the increased data coming in. We asked the 20 member states of Europe to provide us with proposals, and in the end the Wigner data center in Hungary won. This will be about a 3-megawatt facility, matching the 3.5-megawatt facility that’s at CERN at the moment. However, that was the good side. The bad side is that, with the current economic situation in Europe, we can’t expect to get additional staff. That means the same number of people have to find a way to run twice as many servers as previously. On top of that, the tool set that we wrote about ten years ago will not scale to this size and is becoming increasingly brittle, requiring large amounts of maintenance. Our physicists were also getting more and more used to cloud services. They don’t want to be told they can’t have resources for weeks; they want to be able to get them in the time it takes to go and get a coffee. Therefore, we have to find ways to become more agile.

Tim Bell 06:30

As part of this reflection, we had to work out what to do to make this happen. One of the fundamental transitions we went through was to turn around and say that CERN computing is not special. Culturally, this is a major problem for a research organization. We’re used to taking blank sheets of paper, inventing things like the world wide web in the ’90s, and designing our own solutions. However, in the computing area, companies like Google are way ahead of us in terms of scale, so we ought to be able to build on what they’ve done rather than having to invent everything ourselves. Where we find things that we think we’re special on, fundamentally it’s because we’ve failed to understand the concepts, not because we actually are special.

Tim Bell 07:14

We had to adopt the Google toolchain, a standard approach that Google has published. It asks you to break up IT infrastructure services into a series of smaller components. We’ve been a generally open source organization ever since we put the world wide web into the public domain; it’s part of the culture of CERN. So we attached a set of open source projects to the various areas. None of these is necessarily intended to be the best solution in the world; it’s more a question of finding something which is good enough, adopting it, and failing early, because only by trying something and failing do you really understand what you need. Two items stand out: Puppet as the configuration management system and OpenStack as the orchestration engine.

So, having chosen that, that’s the technology problem solved. We then had to work out how to do this with the people; clearly, this kind of transition is not just a technology and software problem. In order to do it within a fixed resource envelope, what we had to do was freeze the existing tool sets and identify a set of experienced people who were used to running services, but also add to them a set of newcomers with a fresh outlook. Many of these newcomers had already experienced some of these open source tools as part of their university studies or their first job. By combining these people, we were able to form a small team who could take on rewriting the entire infrastructure that had taken us about ten years to write previously. What we found as part of this has been an interesting change: the newcomers, who used to come to us on short-term contracts of one or two years, found themselves leaving CERN with a set of skills immediately relevant to the market. Rather than knowing a custom CERN product, they were leaving with OpenStack or Puppet skills, which were in high demand, and that greatly helps staff motivation. On top of that, participating in communities outside of CERN helps to build the DevOps culture, the sharing. And participating in the governance and organization of these communities means that we can assist and contribute back. Along with code, we also help with outreach; as an example, I’m an elected member of the OpenStack management board and help to represent the various members of the community within that organization.
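
For readers curious what that “resources in the time it takes to get a coffee” self-service looks like in practice, here is a minimal, hypothetical sketch using the openstacksdk Python client; the cloud name, image, flavor, and network below are illustrative placeholders, not CERN’s actual configuration:

```python
# A minimal sketch of self-service provisioning against an OpenStack
# cloud, using the openstacksdk client. All resource names below are
# hypothetical placeholders, not CERN's real setup.
import openstack

# Credentials are read from clouds.yaml or OS_* environment variables.
conn = openstack.connect(cloud="example-cloud")

# Ask the orchestration layer for a small VM; this is the kind of
# request a physicist can make in minutes instead of waiting weeks
# for physical hardware.
server = conn.create_server(
    name="analysis-node-01",
    image="base-linux",          # hypothetical image name
    flavor="m1.medium",          # hypothetical flavor
    network="physics-net",       # hypothetical tenant network
    wait=True,                   # block until the VM is ACTIVE
)
print(server.name, server.status)
```

The point of the design is that compute becomes an API call against a shared infrastructure layer rather than a ticket to a siloed team.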

Tim Bell 09:36

That having been said, there were some interesting cultural barriers we encountered as part of this. Now, to make it clear, this is a generalization about people and culture, not a reflection on the staff at CERN, but there was a clear feeling of: don’t be hasty, don’t rush it, take it slowly. Unfortunately, you can’t do this kind of transition in an incremental fashion. It’s not an evolution, it’s a revolution. We had problems moving from the silo model to the layered model that comes along with the cloud. Clearly, if you are a manager who owns a complete silo, you have the hardware budget, you have the resources, and at that point you want to keep ownership of them. You don’t want to give up your budget to the infrastructure or the platform-as-a-service layer, and you don’t want to give up your staff to building up and strengthening those layers.

Tim Bell 10:26

We had the knowledge experts on the previous systems, which they had written themselves. They used to have the newcomers sit down next to them for months to learn the magic. Now they’re just one part of an overall global community rather than being the knowledge management specialists. That’s a difficult transition for people who were used to being in that position. And then finally, we have the people who consider their server and application precious. These are the guys who come in and check the lights in the morning to make sure the computer is okay and, just before they go home at night, pop by to check that everything is all right. For some of them, there are genuinely pieces of technology which are precious; for the majority, however, it’s a cultural question of understanding how cloud models work and helping design applications to avoid that sad situation.

Tim Bell 11:22

So where are we at the moment? We started the work about 18 months ago. We have the toolchain ready and in production, running in both the computer center at CERN and the computer center at Wigner. At the moment we’re running three OpenStack clouds, around 50,000 cores in total. One of these is in the CERN computer center; the two others are actually running on the large filter farms right next to the experiments, and those are being used while the LHC is being upgraded. The aim is that by 2015, when the LHC starts up again and we start to see the large data volumes pouring in again, we’ll have moved over to a largely virtualized and cloudified environment, aiming for about 300,000 cores with 90% managed by the cloud.

Tim Bell 12:09

So in summary, we’ve been forced to make this major transition. That having been said, it actually has significant benefits in terms of the long-term affordability and sustainability of IT operations. We found that participating in the open source communities has helped that cultural transition by showing people how to work, and equally the staff have been highly motivated by the fact that they get recognition from outside and it increases their ability to obtain a job following their time at CERN. And then finally, we contribute both through code contributions and through the outreach work, where we can explain to people what we’re doing; as a publicly funded organization, the information is generally available and we publish what we do.

Tim Bell 12:57

So, I’m happy to answer any questions now, but since I think we’re quite tight on time, you’re welcome to drop me an email at tim.bell@cern.ch if you have anything to ask. Thank you.