
What do you think of the open source data mining software offered by WEKA?

Here are some of my questions:

  • How does it compare with R or Rapid Miner?
  • Do you need to know Java to use it?
  • Do you need to know Java to use it in batch mode?
  • Can you use it in batch mode?
  • Are there limitations on the size of the data sets? Is it an "in-memory" product, like R? Can you easily build a Map Reduce layer on top of it?
  • What is it best at?
  • What kind of formats does it accept as input / output?
  • Can you run it via an API?
  • What do you think of its visualization capabilities?
  • Do they offer an Enterprise version? Cost?



Replies to This Discussion

I used to have some trouble with memory when dealing with a lot of data. Not big data: even with a 50 MB dataset (on a machine with 4 GB of RAM), Weka overflowed memory while I was trying to fit a Bayes network model.

I'm not sure whether the problem was me or the software, but I've heard from a lot of people who hit the same issue.

  • How does it compare with R or Rapid Miner?

R is a programming language, so it's a rather different product. Still, Weka has some useful filters that allow you to do data munging much as you would in R. RapidMiner has a better UI, IMHO. Also, RapidMiner can embed the weka.jar file and access all the methods/filters that Weka provides (but not the visualization resources), while the opposite is not true. Similarly, you can use RWeka in R, which gives you the same functionality as weka.jar does in RapidMiner (again without the visualization resources).

  • Do you need to know Java to use it?

Not at all. You need Java only if you want to write a program that calls Weka's routines, and in that case it generates the code for you.

  • Do you need to know Java to use it in batch mode?

No, you can use it from a console without any Java knowledge.

  • Can you use it in batch mode?

Yes.
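For concreteness, here is a minimal sketch of batch use from the console; the jar path and data file name are placeholders, not from the thread:

```shell
# Train a J48 decision tree and evaluate it with 10-fold cross-validation,
# entirely from the command line (no GUI, no Java code written by you).
# /path/to/weka.jar and data.arff are placeholder paths.
java -cp /path/to/weka.jar weka.classifiers.trees.J48 -t data.arff -x 10
```

Here `-t` names the training file and `-x` the number of cross-validation folds; every Weka classifier class accepts the same evaluation options, so swapping in a different algorithm is just a matter of changing the class name.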

  • Are there limitations on the size of the data sets? Is it an "in-memory" product, like R? Can you easily build a Map Reduce layer on top of it?

In addition to the Explorer, you can use the Experimenter interface, where you can easily spread a job across multiple machines.

  • What is it best at?

One point I should highlight is that you get some visualization features just by loading the dataset, so it's good for exploratory analysis (a point for Weka compared to RapidMiner).

  • What kind of formats does it accept as input / output?

The default is ARFF. CSV, C4.5, XRFF, and some variants are also accepted. You can also load data from a URL or from a database via JDBC.
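For reference, ARFF is a plain-text format with a small header that declares each attribute before the data rows; a toy example (not from the thread):

```
@relation weather
@attribute outlook {sunny, overcast, rainy}
@attribute temperature numeric
@attribute play {yes, no}

@data
sunny,85,no
overcast,64,yes
```

Because the header carries the attribute types, Weka does not have to guess column types the way it must with CSV.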

  • Can you run it via an API?

Yes. Actually, RapidMiner does that.

  • What do you think of its visualization capabilities?

Nice, but some of them, like the boundary visualizer, are not in the most obvious place (a UI problem).

  • Do they offer an Enterprise version? Cost?

Not an enterprise version as such, but Pentaho offers commercial support, since Pentaho is the main sponsor of the project.

I recently spent some time comparing Weka and R (using RStudio). The quick answer is that while I do like Weka, I find that I prefer R, for multiple reasons. As to specifics, Luiz's comments and observations pretty much align with mine. I will add the following:

  • Weka is not that hard to learn but the GUI is not as well documented or intuitive as RStudio. I would recommend getting the book 'Data Mining' by Witten and Frank. There is also a MOOC that provides a very good set of videos on how to use Weka (see https://weka.waikato.ac.nz/dataminingwithweka/course). 
  • Weka provides 3 ways to use the software: the GUI, a Java API, and a command line interface (CLI). The RStudio environment is, IMHO, a much much better GUI to work in than the Weka GUI.
  • I find that R is a much more flexible environment to work in. I wanted to do some analysis that combined PCA with hierarchical clustering. This was a very easy process to set up in R but proved to be too difficult for Weka. While Weka provides modules for both PCA and clustering, I was unable to combine them in the manner I desired without resorting to writing custom Java code.
  • The Weka GUI provides several built-in 'visualization' panels but these are very limited, especially when compared to what can be done with R packages like ggplot. Visualization of data is one of the big strengths of R. In Weka the approach to visualization is focused on understanding behavior of the AI algorithms, rather than the data sets. 
  • Manipulation of data sets is much easier in R than in Weka.
  • Help documentation is much better in R.

I use Weka for different tasks; it is very handy in the first steps of a research project, when you are trying to manipulate the data in different ways.

I personally do not like RapidMiner, mainly because it is not transparent enough, not as much as Weka at least.

Regarding R: I am a physicist, so I traditionally prefer MATLAB... even though it is not designed for the same purposes.

You can also use much of Weka (but not all) from R via the RWeka package:

http://cran.r-project.org/web/packages/RWeka/index.html

You need to install Weka before RWeka.

If you'd like to explore WEKA over Hadoop further, check out the three-part blog series from Mark Hall, the core developer of WEKA.

http://markahall.blogspot.co.nz/2013/10/weka-and-hadoop-part-1.html

*Do you need to know Java to use it?

There is a GUI version which, in the latest releases, lets you use most of the functions and is much more user-friendly than it used to be.

*Do you need to know Java to use it in batch mode?

Yes you do.

*Can you use it in batch mode?

Yes, if you know Java 

*Are there limitations on the size of the data sets?

The batch version is not limited (within hardware constraints, of course; it requires a lot of RAM).

The GUI version let me run algorithms on data sets of roughly 300K rows × 100 columns. I personally had 12 GB of RAM, and it wasn't able to process more than that.

* What is it best at? 

I used it for logistic and multinomial regression; the results were good, even compared to expensive software like SPSS.

I use Weka with Jython. It is useful, but alas there is no interop with CPython (NumPy and pandas). Any JVM language will do: Scala, Clojure, Groovy, JRuby...

The Clojure Data Analysis Cookbook has a decent tutorial on Weka + Clojure if you feel like doing some Lisp necromancy :-) I'm playing with it now, and it's fun.

Mostly, Weka runs in memory. Well, actually in Java's heap, and you can increase the heap size if the default is not enough.
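The heap tweak mentioned above is a standard JVM flag rather than anything Weka-specific; a sketch (the 4 GB value is just an example):

```shell
# Launch Weka with a 4 GB maximum heap instead of the JVM default.
# Size the -Xmx value to your machine and dataset.
java -Xmx4g -jar weka.jar
```

If you see "OutOfMemoryError" while loading a dataset in the Explorer, raising `-Xmx` this way is usually the first thing to try.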

For bigger datasets, you can try the Knowledge Flow interface, which supports streamed data processing, though of course with a limited set of classifiers.

 

I understand that WEKA is for prototype development. I've talked to various ML researchers who have used it, and they liked it, but none of them worked with big data. If you want something more scalable that is also open source, I'd recommend R.

If you want to train a clusterer on data transformed into the PCA space using Weka, it is quite simple using the FilteredClusterer along with the PrincipalComponents filter.

Cheers,

Mark.
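As a rough sketch of the FilteredClusterer approach from the command line; the class and option names below are my best recollection of the Weka API and may differ by version, and the paths are placeholders:

```shell
# Cluster in PCA space: FilteredClusterer first applies the filter,
# then hands the transformed data to the base clusterer (k-means here).
# Verify class and option names against your Weka version's documentation.
java -cp /path/to/weka.jar weka.clusterers.FilteredClusterer \
    -t data.arff \
    -F weka.filters.unsupervised.attribute.PrincipalComponents \
    -W weka.clusterers.SimpleKMeans
```

The same combination can be configured in the Explorer by choosing FilteredClusterer and setting its filter and clusterer properties, with no code at all.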

Lawrence Levin said:

  • I find that R is a much more flexible environment to work in. I wanted to do some analysis that combined PCA with hierarchical clustering. This was a very easy process to set up in R but proved to be too difficult for Weka. While Weka provides modules for both PCA and clustering, I was unable to combine them in the manner I desired without resorting to writing custom Java code.

  • How does it compare with R or Rapid Miner?
    • In my opinion it's simpler to use than R or RapidMiner. I use R a lot for data preparation, but I find Weka more user-friendly. I like the Experimenter, which really helps compare the performance of different algorithms when I start analyzing a new set of data. Weka also seems to have better default values than RapidMiner: I usually get better classification results with out-of-the-box algorithms and parameters in Weka than in RapidMiner.
  • Do you need to know Java to use it?
    • No, the GUI is very easy to use. Filters and meta-classifiers let you build quite complex analyses without any code.
  • Do you need to know Java to use it in batch mode?
    • Not if you use the console.
  • Can you use it in batch mode?
    • Yes through the console.
  • Are there limitations on the size of the data sets? Is it an "in-memory" product, like R? Can you easily build a Map Reduce layer on top of it?
    • Memory is limited, and as far as I know you cannot build MapReduce on top of it.
    • The Experimenter lets you run it on multiple machines, but each classification run still happens on one machine. It's useful for 10-fold cross-validation or for testing multiple classification algorithms on multiple data sets in parallel, but it does not speed up a single run on a single data set.
  • What is it best at?
    • Initial analysis of data, comparison of classification or clustering algorithms
  • What kind of formats does it accept as input / output?
    • It's quite limited. The default format is ARFF. It does CSV but cannot handle some situations, like line feeds within double quotes. I tend to convert everything to ARFF using R (RWeka) before loading the data into Weka.
  • Can you run it via an API?
    • Yes, you can use it as a java library.
  • What do you think of its visualization capabilities?
    • I appreciate the ROC curves, and boundary visualizer.
  • Do they offer an Enterprise version? Cost?
    • I don't think so.

@ Lawrence Levin, you didn't like Weka because you had to resort to a programming language, so instead you chose R? That's embarrassing. Also, the reply to your post came from the co-author of the third edition of the book you mention. IMHO, if you want to code, use R. If you want to design solutions, use Weka.

© 2019   Data Science Central ®
