Actually, this is about two R versions (standard and improved), a Python version, and a Perl version of a new machine learning technique recently published here. We asked for help to translate the original Perl script to Python and R, and finally decided to work with Naveenkumar Ramaraju, who is currently pursuing a master's in Data Science at Indiana University. So the Python and R versions are from him.
We believe that this code comparison and translation will be very valuable to anyone learning Python or R with the purpose of applying it to data science and machine learning.
The source code is easy to read and has deliberately made longer than needed to provide enough details, avoid complicated iterations, and facilitate maintenance.The main output file is hdt-out2.txt. The input data set is HDT-data3.txt. You need to read this article (see section 4 after clicking, it has been updated) to check out what the code is trying to accomplish. In short, it is an algorithm to classify blog posts as popular or not based on extracted features (mostly, keywords in the title.)
The code has been written in Perl, R and Python. Perl and Python run faster than R. Click on the relevant link below to access the source code, available as a text file. The code, originally written in Perl, was translated to Python and R by Naveenkumar Ramaraju.
For those learning Python or R, this is a great opportunity.
Note regarding the R implementation
Required library: hash (R doesn't have inbuilt hash or dictionary without imports.) You can use any one of below script files.
Instructions to run: Place the R file and HDT-data3.txt (input file) in root folder of R environment. Execute the '.R' file in R studio or using command line script: > Rscript HDT_improved.R R is known to be slow in text parsing. We can optimize further if all inputs are within double quotes or no quotes at all by using data frames.
This was added by Andre Bieler. The comments below are from him.
For what its worth, I did a quick translation from Python to Julia (v0.5) and attached a link to the file below, feel free to share it. I stayed as close as possible to the Python version, so it is not necessarily the most "Julian" code. A few remarks about benchmarking since I see this briefly mentioned in the text:
Julia really starts paying off for longer execution times. Click here to get the Julia code.
Top DSC Resources