Home » Uncategorized

How to Write an R Function to Match and Merge 2 Files (like VLOOKUP)

  • RayHall 

Matching and merging 2 files is task I find myself doing all of the time. Historically, I’ve used VLOOKUP in MS Excel and just worked around any limitations. Finally, I bit the bullet and wrote an R Function that does the trick faster, and with more flexibility. 

2808365327

(Image added by the Editor, source)

What is a VLOOKUP?

Before we get into it, let’s look at how Excel’s VLOOKUP function works so it is clear what we’re reproducing in R. VLOOKUP is used to copy data from one dataset to another based on matching values. “Dataset” in this case can refer to a column, table, sheet, etc. For example, you may have one sheet that has contact info for your customers. From your email program, you have a list of email addresses with an action taken on an email campaign. Now you want to combine your contact data with your email open/click data. With VLOOKUP, you can match by the “email” column in each dataset and copy over the open/click data to the contact data. 

Example:

So why not just use a VLOOKUP?

Great question. If you are only doing VLOOKUPs here and there, and on relatively small datasets, it may not make sense to invest your time in writing a function in R. However, if you find yourself using VLOOKUPs routinely, or needing several in a single project, you can save yourself a lot of time in the long term by investing a little time now, developing an R function.  

VLOOKUPs also have some limitations that can be overcome with R. First, VLOOKUPs only read left to right, so you have to make sure your lookup value (what you are matching on) is to the left of the data you wish to copy. Second, a VLOOKUP formula only handles one column at a time. If you need to copy multiple columns, it takes multiple VLOOKUP formulas. VLOOKUPs can also cause performance issues. Several VLOOKUPs in a large dataset can slow things down or crash Excel all together. In addition to being more efficient, an R function can overcome all of these issues.

Why R?

Well… this is my tutorial, and I chose R. The argument here isn’t really pro-R specifically, it’s an argument for an automated function vs manual work. So, if Python is your thing, use Python. Use whatever language you want. That said, if you work with data often and you’re looking for a language to learn, R is great. It’s easy to learn, free, and pretty powerful. As you will see here, even if you are new to R, this tutorial will be pretty easy to follow and implement.

Read full article and get the R code, here.