What do you think of the open source data mining software offered by WEKA?
Here are some of my questions:
I used to have memory trouble when dealing with larger data. Not big data, but when I was working with a 50 MB dataset (on a computer with 4 GB of RAM), Weka ran out of memory. I was trying to fit a Bayes network model...
I'm not sure whether the problem was me or the software, but I have heard of a lot of people who had the same problem...
R is a programming language, so it's a rather different product. Still, Weka has some useful filters that allow one to do data munging as in R. RapidMiner has a better UI, IMHO. Also, RapidMiner can embed the weka.jar file and access all the methods/filters that Weka provides (but not the visualization resources), whereas the opposite is not true. You can also use RWeka in R, which gives the same functionality as weka.jar in RapidMiner (again, without the visualization resources).
Not at all. You need Java only if you want to write a program that uses the routines. In this case, it generates the code for you.
No, you can use it with a console w/o any Java knowledge.
In addition to the Explorer, one can use the Experimenter interface. There you can easily spread a run across multiple machines.
One point I should stress is that you get some visualization features just by loading the dataset, so it's good for exploratory analysis (a point for Weka compared to RapidMiner).
The default is ARFF. CSV, C4.5, XRFF and some variants are also accepted. You can also open a URL or a database via JDBC.
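To make the ARFF layout concrete, here is a minimal sketch that turns CSV text into ARFF text using only the Python standard library. The function name `csv_to_arff` and the sample data are made up for illustration. It assumes a header row, no missing values, and attribute names and values that need no quoting. Columns whose values all parse as numbers become `numeric`, and everything else becomes a nominal attribute.

```python
import csv
import io


def csv_to_arff(csv_text, relation="dataset"):
    """Convert CSV text (with a header row) into ARFF text.

    Illustrative helper, not part of Weka. Assumes no missing values
    and no names or values that would need quoting.
    """
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, data = rows[0], rows[1:]

    def is_number(value):
        try:
            float(value)
            return True
        except ValueError:
            return False

    lines = ["@relation " + relation, ""]
    for col, name in enumerate(header):
        values = [row[col] for row in data]
        if all(is_number(v) for v in values):
            lines.append("@attribute %s numeric" % name)
        else:
            # Nominal attribute: distinct values in order of first appearance.
            distinct = list(dict.fromkeys(values))
            lines.append("@attribute %s {%s}" % (name, ",".join(distinct)))
    lines.append("")
    lines.append("@data")
    lines.extend(",".join(row) for row in data)
    return "\n".join(lines) + "\n"


sample = "outlook,temperature,play\nsunny,85,no\novercast,83,yes\nrainy,70,yes\n"
arff = csv_to_arff(sample, relation="weather")
print(arff)
```

In practice Weka can read CSV itself, so a converter like this is mainly useful for seeing what the ARFF header declares: a relation name, one typed attribute per column, then the data rows.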
Yes. Actually, RapidMiner does that.
Nice, but some of them, like the boundary visualizer, are not in the most obvious place (a UI problem).
Not an enterprise version, but Pentaho offers support, since Pentaho is the main sponsor of the project.
I recently spent some time comparing Weka and R (using RStudio). The quick answer is that while I do like Weka, I find that I prefer R for multiple reasons. As to specifics, Luiz's comments and observations pretty much align with mine. I will add the following:
I use Weka for different tasks. It is very easy to use for the first steps of a research project, when you are trying to manipulate the data in different ways.
I personally do not like RapidMiner, mainly because it is not transparent enough, not as much as Weka at least.
Regarding R, I am a physicist, so I traditionally prefer Matlab... even though it is not designed for the same purposes.
You can also use much of Weka (but not all) from R via the RWeka package.
You need to install Weka, before RWeka.
If you wish to explore WEKA over Hadoop further, please check out the three-part blog post series from Mark Hall, who is a core developer of WEKA.
*Do you need to know Java to use it?
There is a GUI version which, in the latest versions, allows you to use most of the functions and is much more user-friendly than it used to be.
*Do you need to know Java to use it in batch mode?
Yes you do.
*Can you use it in batch mode?
Yes, if you know Java.
*Are there limitations on the size of the data sets?
The batch version is not limited (it depends on the hardware, of course, and requires a lot of RAM).
The GUI version allows you to run algorithms on data sets of roughly 300K rows x 100 columns. I personally had 12M RAM and it wasn't able to process more than this.
* What is it best at?
I used it for logistic and multinomial regressions, and the results were good, even compared to expensive software like SPSS.
I use Weka with Jython. It is useful, but alas there is no interop with CPython (NumPy and Pandas). Any JVM language will do: Scala, Clojure, Groovy, JRuby...
Clojure Data Analysis Cookbook has a decent tutorial on Weka + Clojure if you feel like doing some Lisp necromancy :-) I'm playing with it now - it's fun.
Weka mostly runs in memory. More precisely, it runs in Java's heap, and you can increase the heap size if it is not enough.
For bigger datasets, you can try the Knowledge Flow interface, which supports streamed data processing, but of course only with a limited set of classifiers.
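As a rough illustration of tweaking the heap, the sketch below builds a `java` command line that starts Weka with a larger maximum heap through the JVM's standard `-Xmx` option. The helper name `weka_command` and the jar location `weka.jar` are assumptions for the example. `weka.gui.GUIChooser` is used here as the GUI entry point, so adjust it if your installation starts Weka differently.

```python
def weka_command(heap="2g", jar="weka.jar", main_class="weka.gui.GUIChooser"):
    """Build the argument list for starting Weka with a given maximum heap.

    Illustrative helper: `jar` is an assumed path to your weka.jar.
    """
    return ["java", "-Xmx" + heap, "-cp", jar, main_class]


cmd = weka_command(heap="4g")
print(" ".join(cmd))
# To actually launch it (requires Java and weka.jar at that path):
#   import subprocess; subprocess.run(cmd, check=True)
```

The heap value cannot usefully exceed the physical RAM of the machine, so this only helps when the default heap, not the hardware, is the limit.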
I understand that WEKA is for prototype development. I've talked to various ML researchers that have used it and they liked it, but none of them worked with big data. If you want something more scalable that is also open-source, I'd recommend R.
If you wanted to train a clusterer on data transformed into the PCA space using Weka, then it is quite simple using the FilteredClusterer along with the PrincipalComponentsFilter.
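A sketch of what that could look like from the command line, built here as a Python argument list. Several things are assumptions to check against the documentation for your Weka version: that the PCA filter's full class name is `weka.filters.unsupervised.attribute.PrincipalComponents`, that `FilteredClusterer` takes the filter via `-F` and the base clusterer via `-W`, and that options after `--` are passed on to the base clusterer. The helper name, the file name `data.arff`, and the choice of `SimpleKMeans` are illustrative only.

```python
import shlex


def filtered_clusterer_command(arff_path, num_clusters=3, jar="weka.jar"):
    """Build a command that clusters PCA-transformed data with Weka.

    Illustrative helper; class names and flags are assumptions based on
    Weka's usual command-line conventions.
    """
    filter_spec = "weka.filters.unsupervised.attribute.PrincipalComponents"
    return [
        "java", "-cp", jar,
        "weka.clusterers.FilteredClusterer",
        "-t", arff_path,                      # training data
        "-F", filter_spec,                    # filter applied before clustering
        "-W", "weka.clusterers.SimpleKMeans", # base clusterer
        "--", "-N", str(num_clusters),        # options for the base clusterer
    ]


cmd = filtered_clusterer_command("data.arff", num_clusters=4)
print(shlex.join(cmd))
```

The point of wrapping the filter this way is that the PCA transformation is learned from the training data and then applied consistently to anything the clusterer later sees, rather than being run as a separate preprocessing step.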
Lawrence Levin said: