I am not a Data Scientist, although I do have an intermediate-level understanding of AI / ML.
I am curious as to whether commercial or open source data profiling tools might be used to discover patterns in a dataset prior to using the dataset as training data.
Big data has posed a volume challenge to some of the profiling tools (IBM, Informatica, Oracle, etc.), but most vendors have addressed or are continuing to address this issue and will connect to the big data platform directly (as opposed to physically importing the data into the profiling tool's work space). I have some limited experience with IBM and Hadoop in this area. In addition, a training dataset is typically smaller than the production volume, which may allow use of the tool without the additional connections, at least for training and testing.
My question is: would a data profiling tool be helpful in obtaining a basic level of information about the contents of text datasets (ranges of values in a specific field, nulls or missing values, cardinality, etc.)? The profiling tools generate summary and detail reports, and these might also be used in creating tags for unlabeled data items.
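To make the kind of per-field summary concrete, here is a minimal sketch in plain Python of what a profiling pass might report: missing values, cardinality, and the range of values per field. It is not how any of the commercial tools work internally. The sample data, the `profile_fields` helper, and the rule that an empty string counts as missing are all assumptions made for illustration.

```python
import csv
import io

# Hypothetical sample data; an empty string stands in for a missing value.
SAMPLE_CSV = """id,age,city
1,34,Boston
2,,Chicago
3,51,Boston
4,29,
"""

def profile_fields(rows):
    """Return per-field counts of missing values, cardinality, and min/max."""
    collected = {}
    for row in rows:
        for field, value in row.items():
            stats = collected.setdefault(field, {"missing": 0, "values": set()})
            if value is None or value.strip() == "":
                stats["missing"] += 1
            else:
                stats["values"].add(value.strip())

    summary = {}
    for field, stats in collected.items():
        values = stats["values"]
        try:
            # Order numerically when every distinct value parses as a number.
            ordered = sorted(values, key=float)
        except ValueError:
            # Otherwise fall back to plain string ordering.
            ordered = sorted(values)
        summary[field] = {
            "missing": stats["missing"],
            "cardinality": len(values),
            "min": ordered[0] if ordered else None,
            "max": ordered[-1] if ordered else None,
        }
    return summary

rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
summary = profile_fields(rows)
for field, stats in summary.items():
    print(field, stats)
```

A real profiling tool would add type inference, pattern detection, and sampling for large volumes; this only shows the shape of the output being asked about.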
The information obtained from just one execution of a data profiling tool could then be used to enhance the training data for the first training iteration with a more accurate set of information (vs. a random approach) and could possibly reduce the number of training iterations needed.
I appreciate any answers/advice as well as any comments from anyone who has tried to use a data profiling tool to initially discover characteristics of data items prior to using a set of data for training.
If I am understanding correctly, I think you are asking whether there is some set of 'exploratory' actions that are taken most of the time when a dataset is first presented and that, given what is learned, allow you to determine a direction in which to proceed?
Yes, but it may also cut down on the time spent during training iterations 'cleaning' the data or determining the relative quality of the data.
There have been a couple of blog entries and articles recently stating that Data Scientists are spending 80% of their time handling data quality issues and only 20% performing Data Science tasks.
I asked this question after attending a DSC-hosted webinar on 12/4/2018 titled 'AI Models and Active Learning', presented by Figure Eight, which described their 'human in the loop' platform and its ability to use ML to create high-quality training data. I was reminded of the data quality issues (structured and unstructured) that I have seen in every enterprise I have consulted with. The Figure Eight approach is excellent, although the many iterations may not be needed if the quality can be adjusted prior to training the model.
The combination of 2 approaches here may be desirable:
1. Use a data profiling tool to make adjustments to the training data first.
2. Use Figure Eight (or a similar platform) to make more improvements (tagging as well as other corrections).
Since running a data profiling tool carries less of a performance hit than the number of iterations needed (mentioned in the presentation), this might be a workable approach.
The ideal approach is that the data has already been profiled and the quality issues addressed as part of a Data Governance initiative but my experience has shown that this seldom occurs.
Well, take for instance the 'Tidyverse' set of packages in R: you can read in a file and then call a function named 'problems', which gives you summary information and some initial pertinent information in a single command.
I went back and listened to the webinar. I think you are possibly referring to something that does this in between iterations: something that stops working on data with a low rate of return and focuses on data that shows greater potential?
I am far from the decider on such matters, but it reminds me of Apriori, where you are not going to get more information from a subset than you would from a superset. So let us say we ran an initial PCA and learned that certain attributes explained a high level of variation. I am not sure that those attributes would later stop providing a certain level of explanation.
Where I think the question is worth continued wrestling with is the case where the data set is so large that working through all the data can be as costly as the iterations. Is there something to be gained by working with smaller portions of the data, where the information loss is outweighed by the time gained?
For example: working with 2 TB and getting a 5% error rate in 2 hours, versus working with only a portion of the columns (300 GB) and getting a 7% error rate in 20 minutes?
I believe that your synopsis is correct. The only difference here is that I would suggest doing the profiling prior to any iterations. I recently worked with a very large data set to determine the data quality before releasing the data to any analytical activities.
What we found was that the data points of interest were either absent altogether or were expressed in varying data types or with varying data content.
The best way for me to express this is using the example of a binary data item (defined as character) to express True or False. In the data set we examined, we found the following in the field we were examining: T, F, True, False, t, f, Y, N, Yes, No, as well as 0 and 1 etc.
For this example, the profiling exercise would reveal counts of each value in the field and allow for these to be changed (via ETL to a new dataset) to a uniform value.
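The True/False example above can be sketched in a few lines of Python. This is only an illustration, not the ETL process we actually used: the `standardize_flag` helper, the two lookup sets, and the choice to map unrecognized or empty values to `None` are all assumptions.

```python
from collections import Counter

# Assumed lookup sets; extend them as profiling reveals more spellings.
TRUE_VALUES = {"t", "true", "y", "yes", "1"}
FALSE_VALUES = {"f", "false", "n", "no", "0"}

def standardize_flag(raw):
    """Map one raw value to True, False, or None (missing or unrecognized)."""
    if raw is None:
        return None
    key = str(raw).strip().lower()
    if key in TRUE_VALUES:
        return True
    if key in FALSE_VALUES:
        return False
    return None

# Values like those found in the field, plus an empty and an unknown entry.
observed = ["T", "F", "True", "False", "t", "f", "Y", "N",
            "Yes", "No", "0", "1", "", "maybe"]

# Step 1: the profiling view, i.e. counts of each distinct raw value.
raw_counts = Counter(observed)

# Step 2: the standardized view after mapping to a uniform value.
clean_counts = Counter(standardize_flag(value) for value in observed)

print(raw_counts)
print(clean_counts)
```

Keeping the unrecognized values as `None` rather than guessing means they show up in the cleaned counts and can be reviewed by a person before training.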
My question is: in cases like this, would doing the profiling and standardization first, before any iterations, be a benefit?
There are a number of tools out there, Alteryx being one. My previous company built our own customized tool which delivers the results that you are looking for. We created an automated tool that delivered three outputs on a given file:
b) Diagnostic report that looked at cardinality and missing values within each field, as well as statistical results (mean, standard deviation, min and max).
c) Frequency distribution reports for each field.
These reports were produced for each file that we worked with in building data science solutions.
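For readers who want a feel for outputs b) and c), here is a rough sketch using only the Python standard library. It is not the customized tool described above; the sample records and the `diagnostic_report` and `frequency_distribution` helpers are hypothetical, and the sample standard deviation is an assumed choice.

```python
import statistics
from collections import Counter

# Hypothetical records; None marks a missing value.
records = [
    {"amount": 10.0, "region": "east"},
    {"amount": 20.0, "region": "west"},
    {"amount": None, "region": "east"},
    {"amount": 30.0, "region": None},
]

def diagnostic_report(records, field):
    """Cardinality and missing count, plus basic statistics for numeric fields."""
    values = [r[field] for r in records if r[field] is not None]
    report = {
        "missing": len(records) - len(values),
        "cardinality": len(set(values)),
    }
    numeric = [v for v in values if isinstance(v, (int, float))]
    # Only report statistics when every present value is numeric.
    if numeric and len(numeric) == len(values):
        report["mean"] = statistics.mean(numeric)
        # Sample standard deviation needs at least two data points.
        report["stdev"] = statistics.stdev(numeric) if len(numeric) > 1 else 0.0
        report["min"] = min(numeric)
        report["max"] = max(numeric)
    return report

def frequency_distribution(records, field):
    """Count how often each value occurs, with missing values counted under None."""
    return Counter(r[field] for r in records)

amount_report = diagnostic_report(records, "amount")
region_report = diagnostic_report(records, "region")
region_freq = frequency_distribution(records, "region")

print(amount_report)
print(region_report)
print(region_freq)
```

Running both functions over every field of every incoming file gives a per-file report similar in spirit to the one described.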
Thank you - this is the information I was looking for.
Were you able to automate the process so that, after establishing basic rules for quality, there was minimal human intervention prior to delivery of the data to the data science tool?