I am not a Data Scientist, although I do have an intermediate-level understanding of AI / ML.
I am curious as to whether commercial or open source data profiling tools might be used to discover patterns in a dataset prior to using the dataset as training data.
Big data has posed a volume challenge for some of the profiling tools (IBM, Informatica, Oracle, etc.), but most vendors have addressed or are continuing to address this issue and will connect to the big data platform directly (as opposed to physically importing the data into the profiling tool's workspace). I have some limited experience with IBM and Hadoop in this area. In addition, a training dataset is typically smaller than the production volume, which may allow use of the tool without the additional connections, at least for training and testing.
My question is: would a data profiling tool be helpful in obtaining a basic level of information about the contents of text datasets (ranges of values in a specific field, nulls or missing values, cardinality, etc.)? The profiling tools generate summary and detail reports, and these might also be used in creating tags for unlabeled data items.
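To make concrete what I mean by a "basic level of information", here is a minimal sketch in plain Python of the kind of per-field summary I have in mind. It is not how any of the commercial tools work internally. The function names (`profile_column`, `profile_rows`), the `MISSING` marker set, and the sample records are all hypothetical and only there for illustration.

```python
from collections import Counter

# Assumption: these strings count as "missing"; a real dataset may need a different set.
MISSING = {"", "null", "none", "na", "n/a"}

def profile_column(values):
    """Summarise one field given as a list of raw string values (or None)."""
    present = [
        v.strip() for v in values
        if v is not None and v.strip().lower() not in MISSING
    ]
    # Collect the values that can be read as numbers.
    numeric = []
    for v in present:
        try:
            numeric.append(float(v))
        except ValueError:
            pass
    counts = Counter(present)
    summary = {
        "rows": len(values),
        "missing": len(values) - len(present),
        "cardinality": len(counts),            # number of distinct non-missing values
        "top_values": counts.most_common(3),   # most frequent values with their counts
    }
    # Report a numeric range only if every non-missing value parsed as a number.
    if present and len(numeric) == len(present):
        summary["min"] = min(numeric)
        summary["max"] = max(numeric)
    return summary

def profile_rows(rows):
    """Profile every field that appears in a list of dict records."""
    columns = sorted({key for row in rows for key in row})
    return {col: profile_column([row.get(col) for row in rows]) for col in columns}

# Hypothetical sample records, as they might come out of csv.DictReader.
sample = [
    {"age": "34", "country": "US", "comment": "late delivery"},
    {"age": "", "country": "US", "comment": "great"},
    {"age": "51", "country": "DE", "comment": None},
    {"age": "27", "country": "N/A", "comment": "late delivery"},
]

report = profile_rows(sample)
for col, summary in report.items():
    print(col, summary)
```

A low-cardinality field such as `country` in this sample is the sort of thing I imagine could be turned into a candidate tag, while the missing-value counts would flag fields that need cleaning before training.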
The information obtained from just one execution of a data profiling tool could then be used to prepare the training data for the first training iteration based on what the data actually contains (vs. a random approach), and could possibly reduce the number of training iterations needed.
I appreciate any answers/advice as well as any comments from anyone who has tried to use a data profiling tool to initially discover characteristics of data items prior to using a set of data for training.