Handling variable-length string features and predicting them from other features


I am currently working on a problem where the dataset contains 200+ features (let's call them the code features, e.g. no.of.loops, memoryInst, loadInst, etc.) together with the flags that were used to compile code having those characteristics.

The flags are represented as strings:

This is just dummy data.

   snippet       FlagsUsed                          no.of.loops   loadInst   memoryInst
1  Mergesort     " -a -b -c -d=10 -e -f =19  -c "   1             0          10
2  Bubblesort    " -a -c -f=230 "                   2             5          3
3  MatrixMulti   " -f=20 -z -f12 -f2f "             0             10         4

I need some help with how these flags should be represented in the data. I have tried one-hot encoding and the dummy_variable method, but both have disadvantages:

1) One-hot encoding: does not preserve order information, e.g. the -c flag is in 3rd position in the first snippet, in 2nd position in the second snippet, and does not exist at all in the third snippet.

2) Dummy_variable method: there are 200+ different types/levels/factors of these flags, and the dummy_variable method creates a feature for each level/factor, which is not feasible.

Also, the flags can repeat in a single string (1st snippet, -c repeats).
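To make the two limitations above concrete, here is a minimal Python sketch (the function names `bag_of_flags` and `one_hot_flags` are hypothetical, chosen for illustration only). It assumes flags are separated by whitespace and that a value, if present, follows an `=` sign. It shows that a count vector cannot tell two orderings apart, and that a binary indicator vector additionally loses repeats.

```python
def flag_names(flag_string):
    """Return the flag names in order, dropping any '=value' suffix."""
    return [token.partition("=")[0] for token in flag_string.split()]


def bag_of_flags(flag_string, vocabulary):
    """Count how often each vocabulary flag occurs (order is discarded)."""
    names = flag_names(flag_string)
    return [names.count(flag) for flag in vocabulary]


def one_hot_flags(flag_string, vocabulary):
    """Mark presence/absence of each vocabulary flag (order and repeats are discarded)."""
    names = set(flag_names(flag_string))
    return [1 if flag in names else 0 for flag in vocabulary]


vocabulary = ["-a", "-b", "-c"]

# Two different orderings collapse to the same vector.
same_counts = bag_of_flags("-a -b -c", vocabulary) == bag_of_flags("-c -b -a", vocabulary)

# A repeated flag is visible in the counts but not in the indicator vector.
counts = bag_of_flags("-a -c -c", vocabulary)        # [1, 0, 2]
indicators = one_hot_flags("-a -c -c", vocabulary)   # [1, 0, 1]
```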

I am thinking of some clever hashing that would preserve information about the sequence of the flags and their values (toggle flag = 0/1, threshold flag = {lower, upper}). The problem with hashing is that I will later have to predict these flags from the other features (the code features), and if I hash the flags I won't be able to reverse the hash.

I am thinking of some fixed-size vector representation that can be reversed, so that I can recover the flag from a numeric or hex number.

Can anyone please guide me or point me in the right direction? I would be thankful!
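One possible shape for such a representation, sketched in Python under stated assumptions: each distinct flag name gets an integer id from a lookup table, the id sequence is padded to a fixed length, and the values of threshold flags are kept in a parallel vector. Because the lookup table is stored, the mapping can be inverted, unlike a hash. All names here (`tokenize`, `build_vocab`, `encode`, `decode`, `PAD`) are hypothetical, and the sketch assumes whitespace-separated flags with values attached as `name=value`.

```python
PAD = 0  # id reserved for "no flag in this slot"


def tokenize(flag_string):
    """Split a flag string into (name, value) pairs; value is None for toggle flags."""
    pairs = []
    for token in flag_string.split():
        name, sep, value = token.partition("=")
        pairs.append((name, value if sep else None))
    return pairs


def build_vocab(flag_strings):
    """Assign each distinct flag name an integer id, starting at 1 (0 is padding)."""
    vocab = {}
    for flag_string in flag_strings:
        for name, _ in tokenize(flag_string):
            vocab.setdefault(name, len(vocab) + 1)
    return vocab


def encode(flag_string, vocab, max_len):
    """Return two fixed-length vectors: flag ids (in order) and their values."""
    pairs = tokenize(flag_string)
    if len(pairs) > max_len:
        raise ValueError("flag string has more flags than max_len")
    padding = max_len - len(pairs)
    ids = [vocab[name] for name, _ in pairs] + [PAD] * padding
    values = [value for _, value in pairs] + [None] * padding
    return ids, values


def decode(ids, values, vocab):
    """Invert encode(): rebuild the flag string from the two vectors."""
    inverse = {flag_id: name for name, flag_id in vocab.items()}
    tokens = []
    for flag_id, value in zip(ids, values):
        if flag_id == PAD:
            continue
        name = inverse[flag_id]
        tokens.append(name if value is None else f"{name}={value}")
    return " ".join(tokens)


flag_strings = ["-a -b -c -d=10 -e -c", "-a -c -f=230"]
vocab = build_vocab(flag_strings)
ids, values = encode("-a -c -f=230", vocab, max_len=6)
restored = decode(ids, values, vocab)
```

Order and repeats survive because the position in the vector is the position in the flag string. Whether a model can usefully predict such id sequences from the code features is a separate question that this sketch does not address.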

Tags: R, extraction, feature, hashing, learning, machine, modeling


Replies to This Discussion

I dealt with a very similar problem in the past, and I built a table of flag vectors (see here) to address it.

Basically, say you have 3 features A, B, and C.

  • A can take on 3 different values a1, a2, a3
  • B can take on 5 different values b1, b2, b3, b4, b5
  • C can take on 200 different values c1, c2, ... ,c200

The values can be strings of arbitrary length.

Each observation was stored in a hash table, with the hash key encoding the observation: "A~a2|B~b4|C~c34" for an observation with value a2 for the first feature, b4 for the second feature, and c34 for the third feature. This assumes that neither | nor ~ appears in the data itself. The value attached to each hash key was the frequency of that particular combination in the data set. I wrote a Perl script to handle this, making extensive use of regular expression processing.
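The original was a Perl script that is not shown here. As a rough illustration of the same idea, here is a minimal Python sketch with hypothetical helper names (`make_key`, `parse_key`, `count_combinations`). It uses ~ between a feature and its value and | between features, and assumes neither character occurs in the data.

```python
from collections import Counter


def make_key(observation):
    """Build a key such as 'A~a2|B~b4|C~c34' from a feature -> value mapping."""
    return "|".join(f"{feature}~{value}" for feature, value in observation.items())


def parse_key(key):
    """Recover the feature -> value mapping from a key built by make_key()."""
    return dict(part.split("~", 1) for part in key.split("|"))


def count_combinations(observations):
    """Map each distinct feature combination to its frequency in the data set."""
    return Counter(make_key(observation) for observation in observations)


observations = [
    {"A": "a2", "B": "b4", "C": "c34"},
    {"A": "a2", "B": "b4", "C": "c34"},
    {"A": "a1", "B": "b4", "C": "c7"},
]
table = count_combinations(observations)
```

Since the key is a readable string and not a numeric hash, `parse_key` can turn it back into the original feature values.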

