Hi,
I am currently working on a problem where the dataset contains 200+ features. Let's call them the code features (e.g. no.of.loops, memoryInst, loadInst, etc.), together with the flags that were used to compile code that has those characteristics.
The flags are represented as strings:
This is just dummy data.
   snippet      FlagsUsed                          no.of.loops  loadInst  memoryInst
1  Mergesort    " -a -b -c -d=10 -e -f =19 -c "    1            0         10
2  Bubblesort   " -a -c -f=230 "                   2            5         3
3  MatrixMulti  " -f=20 -z -f12 -f2f "             0            10        4
I need some help with how these flags should be represented in the data. I have tried one-hot encoding and the dummy-variable method, but both have disadvantages:
1) One-hot encoding: it does not preserve order information, e.g. the -c flag is in 3rd position in the first snippet, in 2nd position in the second snippet, and does not exist at all in the third snippet.
2) Dummy-variable method: there are 200+ different types/levels/factors of these flags, and the dummy-variable method creates a feature for each level/factor, which is not feasible.
Also, a flag can repeat within a single string (in the 1st snippet, -c appears twice).
I am thinking of some clever hashing that would retain information about the sequence of the flags and their values (toggle flag = 0/1, threshold flag = {lower, upper}). The problem with hashing is that, in the future, I have to predict these flags from the other features (the code features), and if I hash the flags I won't be able to reverse the hash.
So I am thinking of some fixed-size vector representation that could be reversed, so that I can recover each flag from a numeric or hex code.
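To make the idea concrete, here is a minimal Python sketch of one possible reversible encoding. It is only an illustration, not a recommended solution. The names tokenize, build_vocab, encode, decode and the max_len parameter are all made up for this example. It assumes that "-f =19" in the dummy data is a spacing slip for "-f=19", and that a flag without "=" is a toggle.

```python
import re

PAD = 0  # ID reserved for padding, so real flags start at 1

def tokenize(flag_string):
    """Split a flag string into (name, value) pairs, keeping order and repeats."""
    # Assumption: stray spaces around "=" are typos, so "-f =19" becomes "-f=19".
    cleaned = re.sub(r"\s*=\s*", "=", flag_string.strip())
    tokens = []
    for tok in cleaned.split():
        name, _, value = tok.partition("=")
        tokens.append((name, value if value else None))  # None marks a toggle flag
    return tokens

def build_vocab(flag_strings):
    """Map every distinct flag name to a small integer ID (1, 2, 3, ...)."""
    names = sorted({name for s in flag_strings for name, _ in tokenize(s)})
    return {name: i + 1 for i, name in enumerate(names)}

def encode(flag_string, vocab, max_len):
    """Return two fixed-length lists: flag IDs in order, and their values."""
    pairs = tokenize(flag_string)[:max_len]  # anything past max_len is dropped
    ids = [vocab[name] for name, _ in pairs]
    values = [value for _, value in pairs]
    pad = max_len - len(ids)
    return ids + [PAD] * pad, values + [None] * pad

def decode(ids, values, vocab):
    """Rebuild the (normalised) flag string from the two lists."""
    inverse = {i: name for name, i in vocab.items()}
    parts = []
    for i, v in zip(ids, values):
        if i == PAD:
            continue
        parts.append(inverse[i] if v is None else f"{inverse[i]}={v}")
    return " ".join(parts)

flags = [" -a -b -c -d=10 -e -f =19 -c ", " -a -c -f=230 ", " -f=20 -z -f12 -f2f "]
vocab = build_vocab(flags)
ids, values = encode(flags[0], vocab, max_len=8)
print(ids)
print(decode(ids, values, vocab))
```

Position in the ID list carries the order, repeated flags simply appear twice, and the vocabulary lookup makes the encoding reversible as long as the string is not longer than max_len. Values are kept as strings here, so they would still need converting to numbers before being used as model inputs or targets.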
Can anyone please guide me or point me in the right direction? I would be thankful!
Tags: R, extraction, feature, hashing, learning, machine, modeling
I dealt with a very similar problem in the past, and I built a table of flag vectors (see here) to address it.
Basically, say you have 3 features A, B, and C.
The values can be strings of arbitrary length.
Each observation was coded as follows in a hash table, with the hash key representing the observation: "A~a2|B~b4|C~c34" for an observation with value a2 for the first feature, b4 for the second feature, and c34 for the third feature. This assumes that neither | nor ~ is a character present in the data itself. The value attached to this hash key was the frequency of this particular combination in the data set. I wrote a Perl script to handle this, making extensive use of regular expression processing.
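For readers who prefer Python to Perl, the same idea can be sketched roughly as follows. This is not the original script, only an illustration of the key format described above. The helper names make_key and parse_key and the sample observations are made up for the example.

```python
from collections import Counter

def make_key(observation):
    """Join feature~value pairs with "|", e.g. "A~a2|B~b4|C~c34"."""
    # Assumption: neither "|" nor "~" occurs in any feature name or value.
    return "|".join(f"{name}~{value}" for name, value in observation.items())

def parse_key(key):
    """Recover the feature -> value mapping from a key."""
    return dict(pair.split("~", 1) for pair in key.split("|"))

# Hypothetical observations with three features A, B and C.
observations = [
    {"A": "a2", "B": "b4", "C": "c34"},
    {"A": "a2", "B": "b4", "C": "c34"},
    {"A": "a1", "B": "b4", "C": "c7"},
]

# The hash table: key = combination of values, value = its frequency.
counts = Counter(make_key(obs) for obs in observations)

for key, freq in counts.most_common():
    print(key, freq, parse_key(key))
```

Because the key is built from plain separators rather than a one-way hash, every key can be split back into its feature values, which is the reversibility asked for in the question.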