
Big Data: Uncovering The Secrets of Our Universe At CERN

CERN is best known these days as the research organization which operates the Large Hadron Collider – the largest and most complicated science experiment ever undertaken, which aims to explain mysteries behind the creation of the universe.

That’s not where the story starts, though – CERN was established in 1954 by 12 European nations as a collaborative project. It has been responsible for numerous scientific breakthroughs, mainly through particle acceleration and collision, culminating in the 2012 discovery of the long-sought Higgs boson.

As well as dealing with matters of physics linked to the Big Bang, CERN’s experiments are firmly in the realms of Big Data. The LHC alone generates 30 petabytes every year – that’s approximately 15 trillion pages of printed text – enough to fill 600 million standard filing cabinets!
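As a quick sanity check on those comparisons, the quoted figures can be divided through to see what they imply. The sketch below assumes decimal units (1 petabyte = 10^15 bytes), which the article does not specify; the variable names are purely illustrative.

```python
# Figures quoted in the article; decimal units are an assumption (1 PB = 10**15 bytes).
PETABYTE = 10**15
annual_data_bytes = 30 * PETABYTE   # "30 petabytes every year"
printed_pages = 15 * 10**12         # "15 trillion pages of printed text"
filing_cabinets = 600 * 10**6       # "600 million standard filing cabinets"

# What the comparisons imply about page size and cabinet capacity.
bytes_per_page = annual_data_bytes / printed_pages
pages_per_cabinet = printed_pages / filing_cabinets

print(f"Implied bytes per printed page: {bytes_per_page:,.0f}")
print(f"Implied pages per filing cabinet: {pages_per_cabinet:,.0f}")
```

Under that assumption, the figures work out to roughly 2,000 characters of plain text per page and 25,000 pages per cabinet, so the three numbers are at least consistent with one another.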

So in this post I’m going to have a look at how CERN uses Big Data in the search for the Big Bang – another example of Big Data analytics pushing back the boundaries of what we can achieve as a species.

CERN had been operating for over half a century before the LHC was activated in 2008, and in that time its experiments have generated an ever-increasing amount of data. Data from the first experiment was stored entirely on one computer – albeit one the size of a building!

When physicists wanted to access the data, they had to travel to Geneva to personally retrieve it. This problem led to the development of CERNnet, a network of interconnected computers across the United States and Europe – and eventually in 1989 to the creation of the World Wide Web (WWW) - itself a CERN project.

Today, around 8,000 analysts work on the four main experiments which involve the LHC. They can all remotely access and analyze any of the data generated, in near real-time. Their work involves recreating conditions within particle colliders, similar to the state the universe was in during the immediate fractions of a second following the Big Bang. This allows them to probe questions related to matter and anti-matter, dark matter, and extra dimensions beyond the four we are already familiar with.

Crunching all of the data collected from monitoring 600 million particle collisions per second would require more processing power than any one organization has at its disposal. To get around this problem, CERN initiated the construction of the Worldwide LHC Computing Grid, utilising computer facilities available to the universities and research groups collaborating on the project, as well as private data and computing centers.

This “distributed computing” gives the experiment access to processing power and storage capacity which would be far too costly to build into one data center. It has other advantages over a centralized system – the data can be accessed at greater speed by researchers wherever they are in the world, and if disaster strikes at one location, multiple mirrors of the project exist elsewhere. After all, the data is immensely valuable – since 2008 it has cost about $5.5 billion per year to collect!
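To make the idea concrete, here is a toy sketch of the split, process and merge pattern behind this kind of distributed analysis. It is not the grid's actual software: the "events" are random numbers, the "sites" are local threads standing in for remote computing centers, and every function name is hypothetical.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
import random


def make_events(n, seed=0):
    """Generate n synthetic 'event energies' (arbitrary units, not real LHC data)."""
    rng = random.Random(seed)
    return [rng.uniform(0.0, 100.0) for _ in range(n)]


def split(events, n_sites):
    """Divide the events into n_sites roughly equal chunks, one per hypothetical site."""
    return [events[i::n_sites] for i in range(n_sites)]


def site_histogram(chunk, bin_width=10.0):
    """Work done independently at one site: count events per energy bin."""
    return Counter(int(energy // bin_width) for energy in chunk)


def grid_histogram(events, n_sites=4):
    """Fan the chunks out to workers, then merge the partial histograms."""
    chunks = split(events, n_sites)
    with ThreadPoolExecutor(max_workers=n_sites) as pool:
        partials = list(pool.map(site_histogram, chunks))
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


if __name__ == "__main__":
    events = make_events(10_000)
    merged = grid_histogram(events)
    # The merged result matches what a single machine would have computed.
    assert merged == site_histogram(events)
    print(sorted(merged.items()))
```

The point of the design is that each chunk can be processed without knowing about the others, so adding more sites adds capacity, and only the small partial results have to travel back to be combined.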

An even more widely-distributed network is involved in the LHC@home program, which uses donated processing power from anyone who wants to get involved to carry out calculations on users’ home computers, in a similar fashion to SETI@home and Folding@home. This program runs simulations designed to discover optimum configurations for running the LHC colliders, and does not involve data gathered from the experiments themselves.

The actual experiment data is primarily collected through light sensors – which are basically cameras, although ones capable of taking pictures at 100 megapixel resolution, and quick enough to capture events taking place on a sub-atomic scale.

Over the years, the amount of data generated has increased in volume and velocity, and CERN has developed methods of adding new computing power to the grid “on the fly”, to cope with spikes – this has been particularly vital following the recently-completed upgrade of the LHC which almost doubled its energy output.   

Since its establishment in the 1950s, generations of scientists have used the data created by CERN’s experiments to build their careers. Numerous breakthroughs have been made which have increased our understanding of how and why the universe works. None of this would have been possible if it wasn’t for CERN’s commitment to innovating in the field of Big Data. The impact of those innovations (if it wasn’t for them, you wouldn’t be reading this article, or any article on the web!) has been just as earth-shattering as the colliders which crash protons together at speeds very close to the speed of light!


I hope you found this post interesting. I am always keen to hear your views on the topic and invite you to comment with any thoughts you might have.

About : Bernard Marr is a globally recognized expert in analytics and big data. He helps companies manage, measure, analyze and improve performance using data.

His new book is: Big Data: Using Smart Big Data, Analytics and Metrics To Make Better Decisions and Improve Performance.


Comment by Glenn Davis on May 27, 2015 at 9:07pm

One would expect, considering the cost of storing and moving terabytes of Big Data, that data compression would be a major Big Data technology and discipline. But it isn't; compression is rarely seen as part of a solution. For example, many of the books on Big Data don't even mention compression. Why do you think compression is just an 'also-ran' among Big Data-related technologies?

Comment by Sione Palu on May 26, 2015 at 12:55am

Big data has been the main domain of particle physics for half a century, since the establishment of a large linear accelerator at Lawrence Livermore at the University of California in 1952, SLAC (Stanford Linear Accelerator Center) a decade later (1962), and Fermilab in 1967, a decade and a half later. Those accelerators were bigger than CERN's at the time, so they collected more data. Scientific computing (the analysis of massive datasets) started in particle physics, and other related disciplines (both physics and engineering) followed. CERN now has the biggest accelerator, but it wasn't the biggest in its earlier years.
