Creating a data extractor is usually simple in the mind of a developer, although in reality it’s usually not so simple. Data extraction as a concept is fairly straightforward: Get the data from A and copy it to destination B. However, as most developers know, the devil is in the detail.
Whenever a data company, such as IHS
, has a large data set and a complex set of rules to access the data required to build a software development kit (SDK) that allows customers to retrieve a product, it is expected that there will be some particular way to query the data as well as several ways to get the query result. This was the case for IHS and their product Enerdeq
. I faced many challenges while creating this data loader for IHS. Here are a few things I learned in the process.
Decisions, decisions, decisions
Usually, someone else provides the requirements for the solution. Once these parameters are in place, you have to work with what is provided and comply with the requirements to make your customer happy. Fortunately, IHS had SDKs for Java, .NET, C#, and plain WSDL, and, in my opinion, they worked just fine. That being said, a skeptic never trusts someone else’s code unless proven otherwise.
In this particular experience, part of the requirement was to integrate the extraction solution with Microsoft’s SQL Server Integration Services (SSIS
). I have fair knowledge on WSDL and, when presented with the option, I thought choosing WSDL over coding to provide a solution was a no brainer. This led me to start working on a proof of concept.
Although using Java was not an option for this project, most of the development work I’ve done has been in Java, so I am comfortable with the usual “Java world” tool sets. I chose SoapUI, a tool for API testing, to work on my proof of concept. In little time, I had a piece of software that was able to query the IHS services and retrieve the responses and data sets. I thought this was an adequate solution and started doing the same thing using Microsoft’s SSIS client implementation, which turned out to be a big mistake.
There is one question I’ve asked myself since I started coding: Why are companies paying Microsoft for IDEs when there is little support and unfinished libraries? I finally succeeded making the WSDL client inside SSIS, and after going through issues such as the IHS service not complying 100% with the WSDL standard, I found a catastrophic problem that would not allow me to use WSDL along with SSIS in the same way.
The problem was that the response data (in XML) was too complex for SSIS; it took 4 hours to generate the model (in a i5, 8gb RAM, 64bit Windows box). Every time I would open the solution or open between the “Control Flow” and “Data Flow”, it would either take large amounts of time, or the Visual Studio window would freeze. This path was not an option anymore.
The solution I ended up with was a Dynamic Link Library (DLL) along a Script Component in SSIS, and after dealing with some issues, the extractor worked.
Accept your destiny
Once the overall architecture was set, everything was straight forward on the coding side, but there were some issues with the IHS side. The IHS API and the query language were clear and easy to understand. However, there was a problem with the documentation. The software and services IHS provided were adequate, but the documentation was outdated. This caused delays in where to find certain information in the XML response.
When dealing with incremental data loads, one of the most important tasks is to correctly select the new data in order to avoid duplication and/or transfer overkill. At first, this didn’t seem like a problem since IHS provides a way to filter which data sets were modified from a given date to date. If you are doing this on daily basis, then you get the feeling you’re ready to get going on the right path. The problem is when you only know that a certain entity was modified; if that entity contains 50+ years worth of data, which in the Oil industry can include production, tests, contract history, etc, then selecting the data is not as straight forward as it first looked. IHS provides a way to filter some of this data, such as getting production information for a certain month, year or date range; but if you don’t really know what changed, you will have to make a full load for a certain entity from time to time.
If you make SDKs that are supposed to work on some standard, always make sure your implementation and API comply 100% with it, and please test it.
Never overestimate the power of a tool.
If something is not clear or may cause confusion, talk to your provider.
Being proficient in coding multi-threaded applications is a must.