Pretty much every data rookie starts with Excel. It is a wonderful program for storing, cleaning and analysing (yes, you read that correctly) your data.
Strictly speaking, Excel isn’t free, but really – who pays for it these days? If you buy a Windows PC or laptop it’ll usually come pre-installed, and if you get a new PC at work your employer will have it pre-installed for you. If you’re prepared to look the other way, there are guys who know guys who can get you a copy that fell off the back of a lorry, but I wouldn’t endorse that. That would be wrong.
While Excel is a great place to start, once you get into it you quickly realise just how long it takes to clean your data. Worse still, Excel is the source of a great deal of that wasted time!
Never mind – there is a new generation of free data cleaning programs for non-programmers that promise to clean your data quicker, easier and with a lot less hassle, and I’ll introduce those that I know of here, namely OpenRefine, Trifacta Wrangler and DataKleenr.
Importing data into Excel is extremely simple, and supports import from Microsoft Access, from the web, text sources, an SQL server, XML, CSV and – thanks to the Data Connection Wizard – a whole host of other data sources. Perhaps the most-used Excel import function though is your fingers. Excel is particularly good with manual data entry and copy/paste operations, which make it the most used program for storing, cleaning and analysing data.
Excel has lots of functions that can help with your data cleaning, such as Remove Duplicates, Find and Replace, Spell Checker, and loads more formulae, such as TRIM(), CLEAN() and SUBSTITUTE().
On the other hand, Excel has some severe disadvantages. Trailing and leading spaces can be difficult to spot and, although there are formulae to remove them, the procedure to do so can be awkward and quite lengthy:
This is a common problem with cleaning your data in Excel – functions and formulae, although not technically difficult to use, could have been implemented with more style by Microsoft. Would it have been so difficult to have created a button that with a single click would have removed all the trailing and leading spaces from selected columns or the whole worksheet?
The more advanced user could of course build macros and create custom buttons, but they shouldn’t have to. Trailing and leading spaces are such a common data cleaning problem that the Microsoft Excel team should have dealt with it already.
Another issue is that some Excel functions operate on selected data, whereas others act on the whole worksheet. If you select a column of data and use Find to identify certain characters, it will identify only those characters in your chosen column. If you now use Replace it will change all such characters in the entire worksheet – which is probably not what you wanted to do, and you may have unwittingly introduced new errors into your data without being aware of it.
The safest way to clean your data in Excel is to copy an individual column to a separate worksheet, perform all your cleaning operations in isolation until you’re happy with the result, then copy your cleaned data to your original sheet (or better still, to a new sheet that stores only clean data). The repeated use of Copy, Paste and using multiple worksheets to clean your data can become extremely messy.
All in all, I would heartily recommend Excel to data novices. It is a very powerful program, is easy to work with and easy to share your work with others. It also has extensive support, forums, books, email courses and all sorts of other help, so if you get stuck with something it’s easy to get support. On the other hand it won’t take you long before you start to get frustrated at just how much time you waste in cleaning your data by adjusting to Excel’s idiosyncracies.
Fear not, though, for there are alternatives to Excel. Read on, dear reader…
OpenRefine (previously Google Refine) has the reputation of being ‘Excel on steroids’, and is a powerful data cleaning tool for text and numerical data that uses your web browser as an interface.
A typical OpenRefine project consists of:
Importing your data to OpenRefine is fairly simple and leads to the creation of a table of records where each column has a title and each row is numbered. Many data formats are supported for both import and export, including various CSV formats, Excel spreadsheets, JSON, XML and XHTML.
Once imported, there are lots of tools and features offered by OpenRefine to work on your data. All procedures are recorded, making it possible to browse and undo at any time, which often proves extremely useful.
Most operations on data are row-based, column-based, or cell-based.
Row-based operations are limited to marking and deleting selected rows, and frustratingly there is no Add Rows functionality. Filtered and facetted searches are a powerful means to explore, search, and edit data, allowing rows to be selected by listing the distinct values in a column and the number of instances.
As you would expect, simple column operations such as renaming/deleting a column or sorting rows based on the values found in this column are relatively simple. OpenRefine also offers much more powerful column-based operation features, such as adding a column based on the values found in an existing column or by fetching a URL. You can also merge datasets with common columns, like performing a Join operation with a relational database.
OpenRefine also offers a variety of common transformations such as trimming leading and trailing whitespaces of a string.
OpenRefine, although it has a steeper learning curve than Excel, is much more powerful. Just by using the common transformations you can trim hours off your data cleaning, and the powerful undo/redo functionality take away a lot of the stress compared with Excel.
Although OpenRefine has a lot of well-thought out features, I find it quite annoying that a number of common, simple operations on data are un-necessarily complicated, requiring more mouse clicks than it really needs to. I’ve also found that lengthy data cleaning sessions lead to a degradation in performance, leading to either crashes or a need to save, close and re-open.
OpenRefine doesn’t have extensive support, although there are a few How-To videos and a book. If you get stuck and don’t know where to get help you may be out of luck.
All in all, OpenRefine is powerful, can save time and stress, but can also increase stress in some circumstances.
I think it might be fair to say that OpenRefine is like Marmite – you’ll either love it or hate it!
Trifacta Wrangler (previously Data Wrangler), unlike Excel and OpenRefine, is a semi-automated data cleaning tool for a variety of data types, including text and numerical data, binary data and several others.
When you download and install the application you’re required to create an account and log in each time you use it. Apparently all your data wrangling operations and workflows are done locally in your machine, and the account creation process is to help you keep the program up-to-date.
On opening the application and loading a dataset (supported formats: CSV, TSV, JSON, Excel), Trifacta Wrangler tries to figure out the data types of each column, then gives a visual summary of your dataset. Helpfully, it gives a horizontally-stacked bar chart at the top of each column to show the proportion of valid, mismatched and missing values, and this is a great help in identifying errors.
Data cleaning with Trifacta Wrangler is manual and column-based, but is very much easier to use than Excel or OpenRefine. Identify an erroneous data pattern in a given column and Trifacta Wrangler will automatically apply your data cleaning operation throughout the column, although you have to manually identify the errors in your data and provide your own solution to the issues. You can split and merge columns easily and it also has other, more complex features that can be accessed via the Transformer.
The focus of Trifacta Wrangler is firmly on business data, and on preparing your data for porting to Tableau. If you need help, it has a built-in tutorial, but little other support.
Trifacta Wrangler is a very impressive program that is visually pleasing and easy to work with. Its advantage over Excel and OpenRefine is in its semi-automation, allowing you to slash your data cleaning operations to a fraction of the typical time taken. If OpenRefine is not to your taste, Trifacta Wrangler just might be.
DataKleenr is the newest entrant into the free data cleaning software space, and is a fully-automated solution for cleaning text, numerical and binary data.
It is cloud-based, so there’s no need to download or install anything, and all you need is a browser and an internet connection. You need to create an account and log in each time, and all your data cleaning operations are performed on the cloud, encrypted and saved to your private personal workspace, where you can return to your projects at any time.
When you load a dataset (currently, only comma-separated CSV files are supported), DataKleenr automatically detects the type of data for each column and assigns it a label of Continuous, Ordinal or Nominal, and gives a visual stacked bar chart summary of your data showing the proportion of cleaned, uncleaned and missing values. You can change any of the data type assignments via a simple dropdown menu.
Data cleaning is column-based in DataKleenr, but is mostly already done at the upload stage. It has intelligent algorithms that automatically decide how to clean your data, so mostly you will only need to check the final result. Click on a variable to check the details for an individual variable.
For ordinal and nominal data, DataKleenr decides on the major families contained in each variable, corrects any spelling errors, and excludes data that it considers to be invalid. Bar charts help to show the distribution and order of categories. All decisions can be countermanded simply and easily, and DataKleenr learns from this, improving performance of future data cleaning operations.
For continuous data, invalid entries, such as text, symbols or mixed data types are automatically purged. Statistical outliers are identified and visualised with a Box-and-Whiskers plot and can be excluded individually, en-masse or not at all.
When you’re finished, you can download your data as a comma-separated CSV file.
DataKleenr has advantages over other programs in that it is fully automated, intuitive and simple, typically completing data cleaning in a few minutes. On the other hand, DataKleenr is not a tool for all types of data, nor is it meant to be. The focus is firmly on science-type data, on doing the simple things well and on providing visual aids at all stages of data cleaning to produce data that is analysis ready quickly and easily.
If you need help with DataKleenr there are few resources (not surprising since it’s so new), but there is an accompanying book and you can contact the company directly who will help with any issues you might have.
While there are many data cleaning programs in the commercial space, few of them are offered for free. Some offer free time-limited trials or demos, while others offer free programs with limited functionality. There are also free data cleaning plugins for use in programs like R, but you need to be a confident programmer to be able to use them.
For non-programmers looking to clean their data for free there are few tools available, and I’ve reviewed those that I know of above.
Microsoft Excel is the obvious starting point for pretty much everybody, but it has its disadvantages. The next generation of tools, such as OpenRefine, Trifacta Wrangler and DataKleenr are quicker, easier and more accurate options for those wishing to be more efficient in their data cleaning.
What do you think?
Do you know of any other free data cleaning programs for non-programmers?
What are your experiences with these programs?
Join the debate below and let me know your thoughts...
About the Author
Lee Baker is an award-winning software creator with a passion for turning data into a story.
A proud Yorkshireman, he now lives by the sparkling shores of the East Coast of Scotland. Physicist, statistician and programmer, child of the flower-power psychedelic ‘60s, it’s amazing he turned out so normal!
Turning his back on a promising academic career to do something more satisfying, as the CEO and co-founder of Chi-Squared Innovations he now works double the hours for half the pay and 10 times the stress - but 100 times the fun!
PS - Don't forget to connect with me in Twitter: @eelrekab
Other DSC Articles by the same Author
Disclaimer: DataKleenr is owned by Chi-Squared Innovations