Hi All, I'm curious what people see as their greatest challenges in preparing files for analysis. At my company we are highly focused on the front-end issues involved in data file preparation, and I thought it would be helpful to know the challenges others face in getting a file ready to be used. Thank you in advance for your thoughts!
Allison, In my opinion the most critical step in preparing data for any analytic solution is to correctly define the business problem and perform the necessary due diligence to identify relevant data. The nature of the business problem and the context in which the solution will be deployed help me understand what data is relevant, and I prefer to prepare only this data. An incomplete understanding can lead to significant loss of time, waste of resources, and even termination of the project. Furthermore, regularly including data that does not relate to the event I am trying to model will eventually produce a false relationship. Translating the business context is a very conceptual step that contrasts with the technical data preparation steps that follow. This contrast is a major pain point for my team. We find it takes continuous attention not to wander and spend valuable resources preparing data that will not be used.
For example, the business problem will usually help me understand how I should handle null values in a data set. If the objective demands a lower degree of accuracy, I could replace the null values with the average of the other observations. But this practice can make confidence intervals too narrow, because it reduces the variation within my dataset. Therefore, I would not take this approach on a project that demands highly accurate confidence intervals.
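To make the confidence-interval point concrete, here is a minimal Python sketch. The observations are made up purely for illustration, and the helper names `mean_impute` and `ci_half_width` are hypothetical. It uses a simple normal-approximation interval for the mean, which is only one of several ways to compute one.

```python
import math
import statistics


def mean_impute(values):
    """Replace None entries with the mean of the observed entries."""
    observed = [v for v in values if v is not None]
    fill = statistics.mean(observed)
    return [fill if v is None else v for v in values]


def ci_half_width(values, z=1.96):
    """Half-width of a normal-approximation 95% interval for the mean."""
    return z * statistics.stdev(values) / math.sqrt(len(values))


# Made-up observations; None marks a null value.
raw = [12.0, None, 15.0, 9.0, None, 18.0, 11.0, None, 14.0, 13.0]

observed = [v for v in raw if v is not None]
imputed = mean_impute(raw)

# The imputed values sit exactly on the mean, so they add no spread
# but do increase the sample size. Both effects shrink the interval.
observed_width = ci_half_width(observed)
imputed_width = ci_half_width(imputed)

print(f"observed only: +/- {observed_width:.3f}")
print(f"mean-imputed:  +/- {imputed_width:.3f}")
```

The interval from the imputed data comes out narrower even though no new information was added, which is the overstated confidence described above.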
Similarly, the process or structure of the business problem will help me decide if I want to engineer new variables/features. If the given dataset describes the critical components of the events I am attempting to model, I probably will not spend time engineering new variables. But if my data does not directly measure the outcome I am attempting to model, I will spend significant time transforming my variables into the most useful form. I may even merge in additional data, such as census information, which significantly increases the data prep effort.
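As a rough sketch of what that extra effort looks like, the Python below derives one new variable and merges in an outside lookup. Everything here is an assumption for illustration: the field names, the `prepare` helper, and the census-style values are invented, and a real project would more likely use a dataframe library than plain dictionaries.

```python
# Made-up customer records; field names are hypothetical.
customers = [
    {"id": 1, "zip": "30301", "total_spend": 1200.0, "orders": 8},
    {"id": 2, "zip": "60601", "total_spend": 300.0, "orders": 2},
    {"id": 3, "zip": "99999", "total_spend": 0.0, "orders": 0},
]

# Made-up census-style lookup keyed by ZIP code (values are not real).
census = {
    "30301": {"median_income": 52000},
    "60601": {"median_income": 78000},
}


def prepare(record, lookup):
    """Return a copy of the record with one engineered and one merged field."""
    row = dict(record)

    # Engineered variable: average order value, guarding against zero orders.
    if row["orders"]:
        row["avg_order_value"] = row["total_spend"] / row["orders"]
    else:
        row["avg_order_value"] = None

    # Left join on ZIP code: keep the record even when there is no match.
    match = lookup.get(row["zip"])
    row["median_income"] = match["median_income"] if match else None
    return row


prepared = [prepare(c, census) for c in customers]

for row in prepared:
    print(row)
```

Note that the merge introduces a new null (the unmatched ZIP code), so adding outside data also brings back the null-handling decision discussed earlier.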
I strive to customize my data preparation procedures to the business context on each project because 1) I believe it provides more accurate results and 2) I believe it helps eliminate unnecessary preparation of data that is available but not directly relevant. I'd be very interested to hear about others' experiences.