Python vs R: 4 Implementations of Same Machine Learning Technique

Actually, this is about two R versions (standard and improved), a Python version, and a Perl version of a new machine learning technique recently published here. We asked for help to translate the original Perl script to Python and R, and finally decided to work with Naveenkumar Ramaraju, who is currently pursuing a master's in Data Science at Indiana University. So the Python and R versions are from him.

We believe that this code comparison and translation will be very valuable to anyone learning Python or R with the purpose of applying it to data science and machine learning.

The code

The source code is easy to read and has deliberately been made longer than needed to provide enough detail, avoid complicated iterations, and facilitate maintenance. The main output file is hdt-out2.txt. The input data set is HDT-data3.txt. To see what the code is trying to accomplish, read this article (see section 4 after clicking; it has been updated). In short, it is an algorithm that classifies blog posts as popular or not based on extracted features (mostly keywords in the title).
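To give a rough idea of what classifying posts by title keywords can look like, here is a toy sketch in Python. It is not the published algorithm, nor any of the linked scripts: the sample titles, the pv scores, the `threshold` and `min_count` parameters, and all function names are invented for illustration.

```python
from collections import defaultdict
import re

def title_keywords(title):
    """Lowercase a title and return its distinct alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", title.lower()))

def train(articles):
    """Map each keyword to (count, average pv) over the titles containing it."""
    totals = defaultdict(lambda: [0, 0.0])
    for title, pv in articles:
        for word in title_keywords(title):
            totals[word][0] += 1
            totals[word][1] += pv
    return {w: (n, s / n) for w, (n, s) in totals.items()}

def is_good(title, stats, threshold, min_count=2):
    """Mark a title as good if any frequent-enough keyword has a high average pv."""
    for word in title_keywords(title):
        n, avg = stats.get(word, (0, 0.0))
        if n >= min_count and avg > threshold:
            return True
    return False

# Hypothetical training data: (title, pv score)
articles = [
    ("Python vs R for data science", 9.1),
    ("Learning Python in 10 days", 8.7),
    ("Weekly digest", 5.2),
    ("Weekly digest, March", 5.0),
]
stats = train(articles)
print(is_good("Python cheat sheet", stats, threshold=8.0))    # True
print(is_good("Weekly digest, April", stats, threshold=8.0))  # False
```

The dictionaries here play the role that hashes play in the Perl original; the actual scripts use a more elaborate set of features and rules.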

The code has been written in Perl, R and Python. Perl and Python run faster than R. Click on the relevant link below to access the source code, available as a text file. The code, originally written in Perl, was translated to Python and R by Naveenkumar Ramaraju.

For those learning Python or R, this is a great opportunity.

Note regarding the R implementation

Required library: hash (R doesn't have a built-in hash or dictionary type without imports). You can use either of the script files below.

  • The standard version is a literal translation of the Perl code, keeping the same variable names as far as possible.
  • The improved version uses functions, more data frames and a more R-like approach to reduce running time (~30% faster) and the number of lines of code. Variable names differ from the Perl version. The output file uses a comma (,) as the delimiter between IDs.

Instructions to run: Place the R file and HDT-data3.txt (input file) in the root folder of your R environment. Execute the '.R' file in RStudio or from the command line:

  > Rscript HDT_improved.R

R is known to be slow at text parsing. The code could be optimized further by using data frames if all inputs were either within double quotes or had no quotes at all.

Julia version

This was added by Andre Bieler. The comments below are from him.

For what it's worth, I did a quick translation from Python to Julia (v0.5) and attached a link to the file below; feel free to share it. I stayed as close as possible to the Python version, so it is not necessarily the most "Julian" code. A few remarks about benchmarking, since I see this briefly mentioned in the text:

  • This code is absolutely not tuned for performance, since everything is done in global scope. (In Julia it would be good practice to put everything in small functions.)
  • Generally, for run times of only a few tenths of a second, Python will be faster due to the compilation times of Julia.

Julia really starts paying off for longer execution times. Click here to get the Julia code. 

Comment

Comment by Jonathan D. Stallings on July 15, 2017 at 4:40am

I forgot to show the ptm timings for R. I'm using a 2015 MacBook Pro (2.5 GHz Intel Core i7, 16 GB 1600 MHz DDR3).

R

> proc.time() - ptm
   user  system elapsed
  5.000   0.042   5.045

R - Improved

> proc.time() - ptm
   user  system elapsed
  3.460   0.016   3.469
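
For readers who want comparable numbers from the Python version, the `proc.time() - ptm` pattern above has a rough analogue in Python's standard `time` module. The sketch below is only one way to do it; `run_classifier` is a hypothetical placeholder for the script's real work, not a function from the linked code.

```python
import time

def run_classifier():
    """Hypothetical stand-in for the real script's work."""
    return sum(i * i for i in range(100_000))

def timed(func):
    """Return (result, elapsed wall-clock seconds, CPU seconds) for one call."""
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    result = func()
    cpu = time.process_time() - cpu_start
    wall = time.perf_counter() - wall_start
    return result, wall, cpu

result, wall, cpu = timed(run_classifier)
print(f"elapsed: {wall:.3f} s, cpu (user + system): {cpu:.3f} s")
```

`time.process_time()` reports user plus system CPU time for the current process as a single number, so it does not split the two the way R's output does.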

Comment by Jonathan D. Stallings on July 15, 2017 at 4:34am

For those interested in trying the R scripts, you may have encoding issues that break the for loop, resulting in the following: Error in tolower(line) : invalid multibyte string 1

This small change worked well for me in both R scripts. 

line = tolower(iconv(line,"WINDOWS-1252","UTF-8"))
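
The same class of problem can appear in Python when a file is read with the wrong encoding. Here is a minimal sketch, assuming (as the fix above suggests, though it is not confirmed) that the data file is Windows-1252 encoded; the sample bytes and the temporary file are invented for the demonstration.

```python
import os
import tempfile

# Invented sample: "Café Data" encoded as Windows-1252 (0xE9 is "é").
raw = b"Caf\xe9 Data\n"

with tempfile.NamedTemporaryFile(delete=False, suffix=".txt") as tmp:
    tmp.write(raw)
    path = tmp.name

try:
    # Declaring the encoding when opening the file decodes every line
    # correctly, so lower() operates on proper text.
    with open(path, encoding="cp1252") as handle:
        lines = [line.rstrip("\n").lower() for line in handle]
finally:
    os.remove(path)

print(lines)
```

Reading the same bytes as UTF-8 would raise a `UnicodeDecodeError`, which is the Python counterpart of the "invalid multibyte string" error in R.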

Results from running the script on my Mac (OS16):

[1] "Average pv: 6.83339217671537"
[1] "Number of articles marked as good: 225 (real number is 1079)"
[1] "Number of articles marked as bad: 2391 (real number is 1079)"
[1] "Avg pv: articles marked as good: 8.09631614978183"
[1] "Avg pv: articles marked as bad: 6.71454738627624"
[1] "Number of false positive: 26  (bad marked as good)"
[1] "Number of false negative: 880  (good marked as bad)"
[1] "Number of articles: 2616"
[1] "Error Rate:  0.346330275229358"
[1] "Number of feature values: 16711 (marked as good: 49)"
[1] "Aggregation factor: 29.3326390189103"
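
The reported error rate can be checked from the counts in this output. The short Python sketch below only redoes the arithmetic, with illustrative variable names. The counts imply 1537 truly bad articles, which suggests (this is an inference, not something the scripts confirm) that the "real number" printed on the "marked as bad" line repeats the count of truly good articles.

```python
# Counts copied from the output above.
n_articles = 2616
marked_good = 225
marked_bad = 2391
false_positive = 26   # bad marked as good
false_negative = 880  # good marked as bad

# Error rate = misclassified articles / all articles.
error_rate = (false_positive + false_negative) / n_articles

# Recover the true class sizes from the confusion counts.
true_positive = marked_good - false_positive
real_good = true_positive + false_negative
real_bad = n_articles - real_good

print(round(error_rate, 6))  # 0.34633
print(real_good, real_bad)   # 1079 1537
```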

Comment by Tim Bollman on April 6, 2017 at 1:20pm

I made a Perl 6 version and a cleaned-up Perl 5 version at https://github.com/Tim-Tom/scratch/tree/master/HDT. Perl 6 currently runs about 30 times slower, but they are in the process of revamping the IO subsystem, so hopefully that will speed up soon.

The cleaned-up version fixes a few bugs with input (for example, it no longer reads the header line as data, and it strips out all symbols instead of handling them case by case). It also runs very slightly slower because I attempt to use what I think was the file's actual encoding instead of the default. If you change it to just the default encoding, the performance is the same.

Comment by Paul McLeod on February 23, 2017 at 11:25am

The Perl version really is rather longhanded and a good candidate for translation, although the algorithm is one which is very suited to Perl, and perhaps does not let Python or certainly R really sing, given its hashes and regexp approach. 

This is a great exercise.  It's interesting to see them in comparison.

I would actually be pretty interested in Perl 6 compute times. I imagine the code would be shorter (even if just for sigils, but structurally also) and yet perform badly at this stage. So how would it compare to R?
