Interesting picture that went viral on Facebook. We've had plenty of discussions about Python versus R on DSC. This picture is trying to convince us that Python is superior to Java. It is about a tiny piece of code to draw a pyramid.

This raises several questions:

  • Is Java faster than Python? If yes, under what circumstances? And by how much? 
  • Does the speed of an algorithm depend more on the design (quick sort versus naive sort) or on the architecture (Map-Reduce) than on the programming language used to code it?
  • For data science, does Python offer better libraries (including for visualization), easier to install, than Java? What about the learning curve?
  • Is Java more popular than Python for data science, mathematics, string processing, or NLP?
  • Is it better to write simple code (like the Java example above) or compact, hard-to-read code (like the Python example)? You can write Python code that is much longer for this pyramid project (like the Java example) but far easier to read, yet executes just as fast. The converse is also true. What are the advantages of writing the compact, hard-to-read code?
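
To make the last question concrete, here is a minimal sketch of the same pyramid written twice in Python, once compactly and once with every step spelled out. The picture itself is not reproduced here, so the left-aligned pyramid of asterisks and both function names are assumptions for illustration only.

```python
def pyramid_compact(rows):
    """Compact version: one expression builds the whole pyramid."""
    return "\n".join("*" * (i + 1) for i in range(rows))


def pyramid_explicit(rows):
    """Longer version: each step is spelled out with explicit loops."""
    lines = []
    for row_number in range(1, rows + 1):
        line = ""
        for _ in range(row_number):
            line += "*"
        lines.append(line)
    return "\n".join(lines)


# Both functions return the same text; only the style differs.
print(pyramid_compact(5))
```

Which of the two is "easier to read" is exactly what the question is asking, so treat this as a starting point for the discussion rather than an answer.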

Replies to This Discussion

To be fair, that's not a complete Python program, it's only a method. The actual equivalent Python code would be more like:

#!/usr/bin/env python
def create_pyramid(rows):
    for i in range(rows):
        print('*' * (i + 1))

create_pyramid(5)

Also, the big savings in the Python program is the ability to use multiplication to repeat a string. Handy, but kind of weird to read. Finally, just for fun, here's an alternative way to write the Java program using lambdas and streams.

import java.util.function.*;
import java.util.stream.*;

public class Pyramid {

    // In Python: '*' * 2 = '**'. Here: repeat.apply(2, "*") = "**"
    static BiFunction<Integer, String, String> repeat =
        (i, s) -> new String(new char[i]).replace("\0", s);

    public static void main(String[] args) {
        createPyramid(5);
    }

    public static void createPyramid(int rows) {
        IntStream.rangeClosed(1, rows)
            .mapToObj(i -> repeat.apply(i, "*"))
            .forEach(System.out::println);
    }
}

Forget the number of lines of code. What matters in data science is speed, flexibility and readability.

The biggest strength I see in Python is its versatility and simplicity, and its ability to interact so easily with C/C++. When you are talking about crunching big data in general, you have to work at the machine level to save yourself time. C and C++ are harder to interface with from Java, not to mention how much more difficult it would be to port such solutions, the lack of pointer safety, etc. Many Python modules are written in C.
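
As a rough illustration of how little ceremony the C interop needs, here is a minimal sketch that calls the C standard library's `strlen` from Python using the standard `ctypes` module. It assumes a POSIX system (Linux or macOS), where `ctypes.CDLL(None)` exposes the already-loaded C library; on Windows the library would have to be loaded differently.

```python
import ctypes

# Assumption: a POSIX system, where CDLL(None) gives access to the symbols
# of the C standard library already loaded into the Python process.
libc = ctypes.CDLL(None)

# Declare the C signature: size_t strlen(const char *s);
libc.strlen.argtypes = [ctypes.c_char_p]
libc.strlen.restype = ctypes.c_size_t

# Call the C function directly with a bytes object.
length = libc.strlen(b"pyramid")
print(length)
```

Real numerical libraries usually go through compiled extension modules rather than `ctypes`, but the idea is the same: the heavy lifting happens in C and Python acts as the glue.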

Data science is not easy. The more time you spend trying to figure out why your code is not working, the less time you spend refining your ETL procedures, algorithms, and statistical computations. Python frees you up for that needed additional time. With Java, I am scrolling through copious amounts of stack trace messages to figure out what went wrong, and constantly waiting for garbage collection to finish.

That being said, Java shines in SOA development because it is the perfect language for wrapping components within an enterprise framework. However, I personally would not use it for data science applications.

  • Is Java faster than Python? If yes, under what circumstances? And by how much? 

Python can be tuned with a number of libraries and techniques that make it competitive with benchmark C and Fortran libraries, but generally, Python will be slower than Java because Java is compiled into bytecode and Python is interpreted. That said, it can be faster overall once you realize how much can be done with Python and factor in programmer time and cost.

  • Does the speed of an algorithm depend more on the design (quick sort versus naive sort) or on the architecture (Map-Reduce) than on the programming language used to code it?

Yes and no. Map Reduce (MR) is a data algorithm that supports distributed processing; various sorting algorithms may not. It depends on the architecture and nature of the problem. In most cases, the algorithm speed trumps all other factors with the exception of overall architecture. If you are running on a slow platform, it's going to be slow.
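
To put a number on "algorithm speed trumps other factors", here is a small sketch that counts comparisons for a naive selection sort and for Python's built-in `sorted` on the same data. The `selection_sort` function and the `Counted` wrapper class are hypothetical helpers written just for this illustration; selection sort stands in for the "naive sort" in the question.

```python
import random


def selection_sort(values):
    """Naive O(n^2) sort; returns (sorted_list, number_of_comparisons)."""
    items = list(values)
    comparisons = 0
    for i in range(len(items)):
        smallest = i
        for j in range(i + 1, len(items)):
            comparisons += 1
            if items[j] < items[smallest]:
                smallest = j
        items[i], items[smallest] = items[smallest], items[i]
    return items, comparisons


class Counted:
    """Wraps a value and counts every '<' comparison made on it."""
    count = 0

    def __init__(self, value):
        self.value = value

    def __lt__(self, other):
        Counted.count += 1
        return self.value < other.value


random.seed(0)
data = [random.random() for _ in range(1000)]

naive_sorted, naive_comparisons = selection_sort(data)

# The built-in sort only uses '<', so the wrapper sees every comparison.
Counted.count = 0
builtin_sorted = [c.value for c in sorted(Counted(v) for v in data)]
builtin_comparisons = Counted.count

print("selection sort comparisons:", naive_comparisons)
print("built-in sort comparisons: ", builtin_comparisons)
```

The gap between the two counts grows with the input size, and that gap is independent of the language the algorithm is written in.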

  • For data science, does Python offer better libraries (including for visualization), easier to install, than Java? What about the learning curve?

Java and Python have a lot of libraries. It again depends on what you are doing. Python has some great but evolving visualization libraries. Java has some great but expensive ones. Python is easier to learn and use - if the documentation exists. Bokeh for example is great, but the documentation lags development. 

  • Is Java more popular than Python for data science, mathematics, string processing, or NLP?

Python seems to be winning the war for hearts and minds, even among R users, in the data science domain due to its open source direction and general flexibility. And based on what I've seen, Python is much better for attacking problems than Java due to the overhead of Java syntax.

  • Is it better to write simple code (like the Java example above) or compact, hard-to-read code (like the Python example)? You can write Python code that is much longer for this pyramid project (like the Java example) but far easier to read, yet executes just as fast. The converse is also true. What are the advantages of writing the compact, hard-to-read code?

If you are new to Python, the code can look hard, but once you recognize Python idioms, it becomes much easier. Having coded in a large (1M+) Java codebase many years ago, I'll take Python any day of the week. The advantage of compact, hard-to-read code is that it is typically faster: someone used one or more of the many profiling tools and found the fastest way to do something. I do a lot of Python code reviews and revisions, and most of the hard-to-read code was someone writing something in a very inelegant, non-Pythonic way (using loops rather than NumPy vector operations, for example). Here are some ways to speed up your Python, with some benchmarks that are interesting.
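
For anyone who wants to check this kind of claim on their own code, here is a minimal sketch using the standard library's `timeit` module. It compares an explicit loop with a list comprehension rather than with NumPy, purely to stay dependency-free; the two function names are made up for the example, and the timings will differ from machine to machine.

```python
import timeit


def squares_loop(n):
    """Explicit loop: append one result at a time."""
    result = []
    for i in range(n):
        result.append(i * i)
    return result


def squares_comprehension(n):
    """Idiomatic version: a single list comprehension."""
    return [i * i for i in range(n)]


# Run each version the same number of times and report total wall time.
loop_time = timeit.timeit(lambda: squares_loop(1_000), number=200)
comp_time = timeit.timeit(lambda: squares_comprehension(1_000), number=200)

print(f"loop:          {loop_time:.4f}s")
print(f"comprehension: {comp_time:.4f}s")
```

Measure before rewriting anything: the idiomatic version is often the faster one, but only a benchmark on your own data tells you by how much.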

Generally speaking, write good code that is easy to read - in any language.

Here are my experiences and opinions.

Is Java faster than Python? If yes, under what circumstances? And by how much?


For exploratory analysis, and for modest data sets, Python is generally fast enough and works very well. I've found that for more production-oriented analysis, for example ones that produce 100K+ distinct data products, Python's slower speed, lack of multi-threading, and general pain in deploying Python apps make Python a non-starter. An analysis run that might take hours on the JVM would take days in Python.

On the other hand, writing analysis code in Java is no fun, although lambdas and streams in Java 8 do take away some of that pain. But you don't have to write in Java to take advantage of the JVM. Scala is a popular choice (and a favorite of mine), or you can use Jython, JRuby, Groovy, Kotlin, or Clojure.

For data science, does Python offer better libraries (including for visualization), easier to install, than Java?


Python's visualization libraries like Bokeh, Seaborn and good ol' Matplotlib are hands-down better than what's on the JVM. On Java, generally folks seem to use JFreeChart, although Java now ships with some standard charts and NASA has used FXyz for some orbital visualization apps.

Historically, installing, packaging and deploying Python apps has been horrible. If you are a person who is just doing your own analysis, this generally won't be a problem as you can just use pip. But if you work on larger projects it's a huge issue. I was involved in a large Python-based project with many internal modules and external dependencies. Getting the thing to run was a horrible experience and usually involved talking to the build guru, the one person who knew how to get the dependencies installed correctly. With Docker, this pain can be alleviated somewhat, but I don't expect most data scientists will be building their own Docker images. Java, with its dependency-aware build tools (Maven, SBT, Gradle) and its usually platform-independent libraries, does not have this same type of install/deploy issue.

Is Java faster? Yes, by at least two orders of magnitude. Python is an interpreted scripting language. For that purpose it's reasonably good. But for large systems it has various problems. The lack of a strong type system is one of those. When you are doing little things, it might not matter, but if your program runs for a day-and-a-half before you find out that it's got a typecast error, that kind of sucks. 
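
Here is a small sketch of the kind of late failure described above. The functions `total_cost` and `long_running_job` are hypothetical, and the "hours of work" are only imagined; the point is that the mistake on the last line of the job is not noticed until execution reaches it.

```python
from typing import List


def total_cost(prices: List[float], tax_rate: float) -> float:
    """Sum the prices and add tax."""
    return sum(prices) * (1 + tax_rate)


def long_running_job(records: List[str]) -> float:
    # Imagine this step takes many hours on real data.
    cleaned = [record.strip() for record in records]
    # Bug: the strings were never converted to floats. Python only
    # notices when this line actually executes.
    return total_cost(cleaned, 0.2)


try:
    long_running_job([" 1.5 ", " 2.0 "])
except TypeError as exc:
    print("failed only at the last step:", exc)
```

The type hints do nothing at run time, but with them in place a static checker such as mypy (assuming it is part of your workflow) can flag the mismatched argument before the job is ever started.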

Scala is actually a very nice middle ground of a language. It is more compact than Python, faster than Java, more strongly typed than Java but uses implicit types, and supports all of the Java ecosystem.

In terms of the language itself, Julia is better than all of the above. It's more compact than Scala and faster than Scala and is strongly typed. It just doesn't have the library support yet. 

One should use the programming language that will provide the lowest time to production.

Matrix manipulation: R, Matlab, Mathematica. I know SAS has a proc to manipulate matrices.

Great graphics: Mathematica and Tableau

Graph network analysis: Mathematica (much faster than R)

Simulations: Matlab

Text manipulation, NLP: Mathematica and then Python. Java has a great API for NLP.

Great all-purpose object-oriented language: Java

Great procedural language: C

Touchscreens: Lingo, Objective C

Games: C++

Prepare, split, model, test: all can be done in a few steps in R.

....

Here is a somewhat objective article that compares a number of languages. I believe that the table has been updated (on the main Julia web site).

http://istc-bigdata.org/index.php/open-big-data-computing-with-julia/
Article by MIT Chemistry Professor.

If you read through the article (second figure-chart), you will see that Julia is slower (with one exception) than FORTRAN and about the same speed as JavaScript (with one exception) but generally much faster than other environments such as MATLAB, Python, and R. The benchmarks may not be representative of what individuals might want to do in general.

One takeaway is that if you are already proficient in FORTRAN or JavaScript (or C for that matter), you might be better off sticking with them. If you need to learn a new environment (or a first environment), then maybe learn Julia.

Another takeaway applies if you are concerned with data mining, modeling, or clean-up with relatively large data sources: you probably want a software package that allows you to do at least one of the three things very rapidly. If commercial or shareware packages are not likely to be sufficiently fast, then you may need to develop something. If you develop something, then suitably optimized Julia, JavaScript, FORTRAN or C may be appropriate. Because we have extensive experience with highly optimized C and FORTRAN code (which has consistently been 8-12 times as fast as C++ or compiled Java), we will likely stick with those two languages.

I did not look for any write-ups for how suitable Julia is for parallel programming.

If lines of code actually mattered, Scala (which compiles to the Java Virtual Machine) easily does it in one line:


def create_pyramid(rows: Int) = Seq.tabulate(rows)(i => println("*" * (i + 1)))

In terms of language performance, I think it's important to keep in mind that computationally intensive libraries in Python (e.g. NumPy, Pandas) are often written partly in C. While that can also be the case for some JVM libraries, it is much less necessary. With Scala, you get a syntax with expressiveness on par with Python with the runtime and deployment benefits of the Java environment.

PS: Here's the command-line executable form:

  

scala -e 'def create_pyramid(rows: Int) = Seq.tabulate(rows)(i => println("*" * (i + 1))); create_pyramid(5)'


or

scala -e 'Seq.tabulate(5)(i => println("*" * (i + 1)))'


But then again, lines-of-code games are silly. :-) Java's notoriously verbose, and an easy target.

Good points. I'm just going to add a pointer for folks to the very nice ND4J library (http://nd4j.org/), which is basically a JVM version of NumPy. There's a Scala wrapper for ND4J at https://github.com/deeplearning4j/nd4s

Scala is a very interesting animal. I was at JavaOne in 2008 when Martin Odersky showed Scala off to the world. At the time I said that the next big language will look a lot like Scala. I still believe that, but I don't think it is Scala. Once you get into implicits and some of the crazy left-associative versus right-associative stuff, it can get confusing.

But for present purposes, perhaps a bigger issue is its marriage to the Java Platform. There has always been a bit of an impedance mismatch between the Java community and the numerics community. Scala deals with some of that. For instance, the numerics guys want complex numbers as a native type. You can make that clean enough in Scala that you don't feel totally embarrassed when you do complex arithmetic. But there are also issues dealing with special hardware. Java requires floating point calculations to match the standard test suites exactly. But what if I've got some special hardware to do some of the most expensive calculations 100x faster and they only match the standard test results to 14 decimal places? That's a non-starter in Java, but the numerics guys think that's a big issue. And the Big Data people probably echo that because they need hyper-performance, and every little bit helps. Scala inherits these deficiencies from Java. In all likelihood there is a way to get around it using compiler macros, but that's getting into another Scala minefield.

I have high hopes for Julia. If I had to lay down a bet on which language would be the next big thing, I'd put my money on Julia. They seem to be doing a very good job and they've learned a lot from things like Scala. And their performance is pretty awesome. They just don't have a complete ecosystem yet.

This is mostly silly. Java is much faster than Python. If you write a program in Python to, say, take the inverse of a large dense matrix and a program employing the same algorithm in Java, the Java program will run 100x faster, maybe more. But Java wasn't designed for solving computational problems. It's quite good at it, but the Java Grande group that set out to make numerical Java never got great support. Python's "strength" is as a glue language. It lets you call a program written in a faster language to invert that matrix. Note that matrix inversion isn't something you should try to do. Most numerics experts would say that if you are inverting matrices you are doing something wrong. But it's a rich example. If you don't like it, think eigenvalues instead.

Regardless, since Python is interpreted, you can have syntax errors that you don't know about until you get there...at run time. Compiled languages have real advantages for serious work. You know that they are syntactically correct.

The initial example here is also very flawed. For loops? Yuck! For loops are another sign that you're doing things wrong. Basically, your choice of language should be driven by the problem that you are solving and the algorithmic approach you take. I was recently confronted by a problem in matrix chain multiplication. A solution using loops is ugly and hard to verify in any language. But I was able to give an efficient solution in a single line of Mathematica because of the intelligent way Mathematica does memoization. That one line is easy to verify.

My prediction is that in about 3 years, Python will largely go away because Julia is pretty much better at everything Python does, and is easier to verify, and is about as fast as C or Fortran. Java? It's trying to keep up with Scala, but has too much baggage. Scala is actually a very good language for most data science and computational stuff. It's more compact than Python and more logical too.
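
For readers who have not met the matrix chain multiplication problem mentioned above, here is a minimal memoized sketch in Python. It is not the Mathematica one-liner from the post (which isn't shown); it is only an illustration of the same idea using `functools.lru_cache`, and the function name `matrix_chain_cost` is made up for the example.

```python
from functools import lru_cache


def matrix_chain_cost(dims):
    """Minimum number of scalar multiplications for a chain of matrices.

    Matrix k in the chain has shape dims[k] x dims[k + 1].
    """
    dims = tuple(dims)

    @lru_cache(maxsize=None)
    def cost(i, j):
        # Cheapest way to multiply matrices i..j (inclusive).
        if i == j:
            return 0
        # Try every split point k and keep the cheapest combination.
        return min(
            cost(i, k) + cost(k + 1, j) + dims[i] * dims[k + 1] * dims[j + 1]
            for k in range(i, j)
        )

    return cost(0, len(dims) - 2)


# Three matrices: 10x30, 30x5, 5x60.
print(matrix_chain_cost([10, 30, 5, 60]))  # prints 4500
```

The recurrence is a single expression and the cache removes the repeated subproblems, which is roughly why a memoizing language lets you write it in one line.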
