Companies build or rent grid machines when their data doesn't fit into HDFS, or when the latency of parallel interconnects in the cloud is too high. This review explores the overlap of the two paradigms at opposite ends of the parallel processing latency spectrum. The comparison is almost poetic and leads to many other comparisons in languages, interfaces, formats, and hardware, but there is surprisingly little overlap.
Your Laptop Is A Supercomputer
To put things in perspective, 60 years ago, "computer" was a job title. When the Wu-Tang Clan dropped 36 Chambers, the bottom-ranking machine in the TOP500 was a quad-core Cray. Armed with your current machine, you should be able to dip your toes into any project before diving in head first. Take a small slice of the data first to get a glimpse of the obstacles ahead. Start with 1/10 of the data, then 1/8, then 1/4, and keep going until your machine can't handle it anymore. Usually by that time, your project will have encountered problems that can't be fixed simply by getting a bigger computer.
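The "grow the slice until it hurts" approach can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `probe_scaling`, the `process` callback, the fraction schedule, and the 60-second budget are all hypothetical names and values, not part of any particular toolchain.

```python
import random
import time


def probe_scaling(data, process, fractions=(0.1, 0.125, 0.25, 0.5, 1.0),
                  budget_seconds=60.0, seed=0):
    """Run `process` on growing random slices of `data`.

    Stops after the first slice that exceeds `budget_seconds` (an assumed
    threshold for "my machine can't handle it anymore").
    Returns a list of (fraction, sample_size, elapsed_seconds) tuples.
    """
    rng = random.Random(seed)  # fixed seed so repeated runs see the same slices
    timings = []
    for frac in fractions:
        size = max(1, int(len(data) * frac))
        sample = rng.sample(data, size)  # random slice without replacement
        start = time.perf_counter()
        process(sample)
        elapsed = time.perf_counter() - start
        timings.append((frac, size, elapsed))
        if elapsed > budget_seconds:
            break  # the laptop has hit its limit; time to think bigger
    return timings


if __name__ == "__main__":
    # Toy workload: sorting stands in for the real analysis step.
    records = list(range(1_000_000))
    for frac, size, elapsed in probe_scaling(records, sorted):
        print(f"{frac:>6.3f} of data ({size} rows): {elapsed:.3f}s")
```

Plotting elapsed time against sample size gives a rough idea of how the workload scales before any money is spent on hardware.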
Depending on the kind of problem you are solving, building your own Beowulf cluster out of old commodity hardware might be the way to go. If you need a constant run of physics simulations or BLAST alignments, a load of wholesale off-lease laptops should get the job done for under $2,000.
Password hashing and Bitcoin farms use ASICs and FPGAs. In these cases, the latency of interconnects is much less important than single-thread processing speed.
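Hash searching tolerates slow interconnects because the search space splits into chunks that never need to talk to each other. The sketch below illustrates that shape in Python. The function names, the use of SHA-256, and the three-letter lowercase search space are assumptions chosen for illustration, and threads are used only to keep the example portable; a real farm would spread the chunks across separate processes or dedicated chips.

```python
import hashlib
import string
from concurrent.futures import ThreadPoolExecutor
from itertools import product


def search_chunk(target_hex, first_char, length, alphabet):
    """Try every candidate starting with `first_char`; return a match or None.

    Each chunk is fully independent: no data is exchanged between workers.
    """
    for tail in product(alphabet, repeat=length - 1):
        candidate = first_char + "".join(tail)
        if hashlib.sha256(candidate.encode()).hexdigest() == target_hex:
            return candidate
    return None


def brute_force(target_hex, length=3, alphabet=string.ascii_lowercase, workers=4):
    """Split the search space by first character and scan the chunks concurrently."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(
            lambda c: search_chunk(target_hex, c, length, alphabet), alphabet
        )
        for found in results:
            if found is not None:
                return found
    return None


if __name__ == "__main__":
    target = hashlib.sha256(b"hpc").hexdigest()
    print(brute_force(target))
```

The only communication is handing out a chunk and collecting a result, which is why this kind of work scales on hardware with almost no interconnect at all.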
Move To The Cloud
You don't need to go through the hassle of wiring and configuring a cluster for a short-term project. The hourly cost savings of running your own servers quickly diminish as you struggle through the details of MIS: DevOps, provisioning, deployment, hardware failure, etc. Small development shops and big enterprises like Netflix are happy to pay premiums for a managed solution. We have a staggering variety of SLAs available today as service providers compete to capture new markets.
When your cluster can't quite handle the demands of your workload, rent a few servers from the cloud to handle the overflow.
Use your cluster to handle sensitive private data, and shift non-critical data to a public cloud.
Companies like EMC use graphics cards in cloud clusters to handle vector arithmetic. It works great for a specific subset of business solutions that use SVMs and other kernel methods.
Vectorization is at the heart of optimizing parallel processes. Understanding how your code uses low-level libraries will help you write faster code. De-vectorized R code is a well-known performance killer.
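The same effect shows up in Python, which makes for a compact illustration. The sketch below computes a sum of squares twice: once with an explicit interpreted loop and once as a single NumPy call. The function names are hypothetical, and actual timings depend on the machine and the NumPy build, so none are quoted here.

```python
import time

import numpy as np


def sum_squares_loop(values):
    """De-vectorized version: one interpreted multiply-add per element."""
    total = 0.0
    for v in values:
        total += v * v
    return total


def sum_squares_vectorized(values):
    """Vectorized version: the whole array goes to compiled code in one call."""
    arr = np.asarray(values, dtype=np.float64)
    return float(np.dot(arr, arr))


if __name__ == "__main__":
    data = np.random.default_rng(0).random(1_000_000)
    for fn in (sum_squares_loop, sum_squares_vectorized):
        start = time.perf_counter()
        result = fn(data)
        print(f"{fn.__name__}: {result:.3f} in {time.perf_counter() - start:.4f}s")
```

Both functions return the same number up to floating-point rounding; the difference is where the loop runs.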
Julia: The convergence of Big Data and HPC
"Julia makes it easy to connect to a bunch of machines—collocated or not, physical or virtual—and start doing distributed computing without any hassle. You can add and remove machines in the middle of jobs, and Julia knows how to serialize and deserialize your data without you having to tell it. References to data that lives on another machine are a first-class citizen in Julia like functions are first-class in functional languages. This is not the traditional HPC model for parallel computing but it isn’t Hadoop either. It’s somewhere in between. We believe that the traditional HPC and “Big Data” worlds are converging, and we’re aiming Julia right at that convergence point." -Julia development team.
Julia is designed to handle the vectorization for you: its compiler turns plain loops into efficient machine code, so de-vectorized code can run as fast as, or faster than, vectorized code.
New Compile Targets via LLVM
Scripting languages are built on top of low-level libraries like BLAS, so under the hood you are often actually running Fortran.
Python can be efficient because libraries like NumPy have optimized how they use underlying libraries.
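A small sketch of what that delegation looks like from the Python side: the same matrix product written as a pure-Python triple loop and as a single NumPy call. The function names are hypothetical. For float64 arrays, NumPy typically hands the `@` operation to whichever BLAS library it was built against, though the exact backend depends on the installation.

```python
import numpy as np


def matmul_python(a, b):
    """Matrix product as a plain triple loop over nested lists."""
    n, m, p = len(a), len(b), len(b[0])
    out = [[0.0] * p for _ in range(n)]
    for i in range(n):
        for k in range(m):
            a_ik = a[i][k]
            for j in range(p):
                out[i][j] += a_ik * b[k][j]
    return out


def matmul_numpy(a, b):
    """Same product, handed off to NumPy's compiled matrix-multiply routine."""
    return np.asarray(a, dtype=np.float64) @ np.asarray(b, dtype=np.float64)


if __name__ == "__main__":
    a = [[1.0, 2.0], [3.0, 4.0]]
    b = [[5.0, 6.0], [7.0, 8.0]]
    print(matmul_python(a, b))
    print(matmul_numpy(a, b))
    # np.show_config() reports which BLAS/LAPACK libraries this NumPy build uses.
```

On small inputs the two versions are indistinguishable; the gap opens up as the matrices grow and the interpreted inner loop starts to dominate.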
LLVM acts as the middleman between the scripting languages and machine code.
AMD has bet its future on the convergence of the CPU and GPU with its Heterogeneous System Architecture (HSA) and OpenCL. Most data scientists will never write such low-level code, but it is worth noting in this review.