Open sourcing spot the difference

Capital One UK’s Data Science team has been focused on moving from proprietary (paid-for) software to open source for some time now.

There are several key benefits to making this change. Open source software is prevalent in academia which makes it much easier for our new starters to hit the ground running, building models and analysing data on day one with the company (the switch has also been a terrific development opportunity for my team to learn new skills). Our team now has greater and quicker access to cutting-edge techniques and approaches; as soon as a package is available, our team can install and get moving rather than having to wait for software updates and upgrade projects. Finally, there is a real cost benefit in using open source software.

Our switch to open source has been a journey requiring the team to learn a heap of new skills. Migrating large codebases from our legacy systems has involved a lot of work, and after two years, everything we do uses open source. Having come so far along this transformation, the time was right to start giving back to the open source community. Today, I’m excited to announce our first analytic package in R: dataCompareR.

In the Data Science team we often have to move code across environments or re-code from one language into another. The key to making sure this has been done correctly is the ability to compare two datasets (before and after) to make sure they are the same – if they’re not, you want to find out where they are different and why. Historically, our software had a handy procedure to do this, but we didn’t feel we had an equivalent in R. So, we thought: “Let’s build it!”

With dataCompareR you are able to point the package at two datasets. It will compare the two and highlight any differences. Simple. The process to build dataCompareR was actually a load of fun. After some initial planning the entire team, comprising 18 people, spent two full days in hackathon mode building the functionality, and testing harnesses and outputs for the package. That’s a lot of coffee and pizza! It was a great couple of days and the team learnt a lot – both about R and how to work together in an agile manner.
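For anyone curious what pointing the package at two datasets looks like, here is a minimal sketch. It assumes dataCompareR is already installed from CRAN; the `before` and `after` data frames and their columns are made up purely for illustration, and `id` is assumed to be the column that identifies matching rows.

```r
# Assumes the package is installed: install.packages("dataCompareR")
library(dataCompareR)

# Two small, hypothetical datasets: the "before" and "after" of a migration
before <- data.frame(id = 1:3, balance = c(100, 250, 75))
after <- before
after$balance[2] <- 255  # introduce one deliberate difference

# Compare the two data frames, matching rows on the (assumed) key column "id"
comparison <- rCompare(before, after, keys = "id")

# Print a summary of where the two datasets match and where they differ
summary(comparison)
```

The `keys` argument is optional; without it, rows are compared in the order they appear.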

We feel pretty proud of the dataCompareR package and would love people to start using it and give us some feedback.

So get yourself over to CRAN and enjoy!
