The Panama Papers – how did they pull off history’s biggest data leak?

Find out how Data to Value’s Graph Data software partners Neo4j and Linkurious have been used in the Panama Papers investigation.

Recently there has been a lot of interest in the newly published Panama Papers. This giant trove of data is said to contain a whopping 11.5 million documents, or 2.6TB of data. This completely dwarfs previous leaks such as the 1.7GB WikiLeaks scandal or the 30GB Ashley Madison leak. It took two years, more than 400 journalists and cutting-edge technology to process all of this information and gain valuable insight.

The data was leaked from Mossack Fonseca, one of the world’s leading firms in incorporating offshore entities. It was then gradually transferred via encrypted chat to a German journalist who worked at the Süddeutsche Zeitung (SZ). The real work began shortly after the data started pouring in: the SZ was not able to make sense of a dataset that size and contacted the International Consortium of Investigative Journalists (ICIJ) to find a way of handling these millions of documents. The ICIJ was both efficient and prudent in handling the data. The data and its copies were stored on drives encrypted with VeraCrypt, an open-source tool. The team chose Apache Solr as the main search server, coupled with Apache Tika, a toolkit that detects and extracts metadata and text from over a thousand different file types. This enabled seamless, near real-time search across different file types, such as PDFs, Word documents and emails. A custom UI based on Blacklight was put on top of the solution for ease of use. Once the platform was built, each of the more than 400 journalists needed only a link and a randomly generated password to start discovering interesting data.
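To give a feel for why extracting text first makes every file type searchable in the same way, here is a minimal sketch of an inverted index in plain Python. It is not the ICIJ's setup and does not use Solr or Tika: the file names, the sample text and the helper functions `tokenize`, `build_index` and `search` are all invented for illustration, and the sample text simply stands in for whatever an extraction tool would produce.

```python
from collections import defaultdict
import re


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(documents):
    """Map each token to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return index


def search(index, query):
    """Return the ids of documents that contain every token in the query."""
    tokens = tokenize(query)
    if not tokens:
        return set()
    result = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        result &= index.get(token, set())
    return result


# Hypothetical extracted text: one entry per file, whatever its original format.
documents = {
    "email-001.eml": "Please register the new offshore company this week.",
    "contract-17.pdf": "Shareholder agreement for the offshore company.",
    "memo-03.docx": "Quarterly update on client accounts.",
}

index = build_index(documents)
print(sorted(search(index, "offshore company")))
```

Once the text is out of the PDF, email or Word file, the index no longer cares where it came from, which is the property that lets one search box cover many formats. A production search server adds ranking, stemming, metadata fields and much more on top of this basic idea.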

To make sense of the highly connected and complex data, the investigators decided to enlist the help of two of our software partners, Neo4j and Linkurious. Neo4j, the world’s leading graph database, made it easy to find and analyse complex connections, because graphs store data as nodes, properties and edges. Linkurious, a graph visualisation platform, helped the journalists navigate this ocean of data and uncover unique insights into the offshore banking world by showing the relationships between banks, clients, offshore companies and their lawyers.
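The node, property and edge model can be illustrated with a few lines of plain Python. This is only a toy sketch, not Neo4j and not the investigators' data model: every entity name, relationship type and helper function below is made up for the example.

```python
from collections import deque

# Nodes: id -> properties. All names are invented for illustration.
nodes = {
    "client_a": {"label": "Client", "name": "Example Client"},
    "lawyer_b": {"label": "Lawyer", "name": "Example Lawyer"},
    "company_c": {"label": "Company", "name": "Example Holdings Ltd"},
    "bank_d": {"label": "Bank", "name": "Example Bank"},
}

# Edges: (source, relationship type, target).
edges = [
    ("client_a", "REPRESENTED_BY", "lawyer_b"),
    ("lawyer_b", "REGISTERED", "company_c"),
    ("company_c", "HOLDS_ACCOUNT_AT", "bank_d"),
]


def neighbours(node_id):
    """Yield the nodes connected to node_id, ignoring edge direction."""
    for source, _, target in edges:
        if source == node_id:
            yield target
        elif target == node_id:
            yield source


def shortest_path(start, goal):
    """Breadth-first search for the shortest chain of connections."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in neighbours(path[-1]):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None


path = shortest_path("client_a", "bank_d")
print(" -> ".join(nodes[n]["name"] for n in path))
```

The question "how is this client connected to that bank?" becomes a path search over edges rather than a series of table joins, which is the kind of query a graph database is designed to answer at scale.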

The entire dataset of the Panama Papers is expected to be released in early May. For more interesting articles about finding meaning in data, visit our website and follow us on LinkedIn or Twitter.

Tags: Big Data, Data leak, Panama Papers