All Blog Posts Tagged 'Apache' (24)

Spark Troubleshooting, Part 1 - Ten Challenges

“The most difficult thing is finding out why your job is failing, which parameters to change. Most of the time, it’s OOM errors…”…


Added by Sara Petrie on September 30, 2021 at 6:00am — No Comments

Building machine learning models in Apache Spark using SCALA in 6 steps


When building machine learning models, data scientists spend most of their time on two main tasks:

Pre-processing and Cleaning

The major portion of the time goes into collecting, understanding, analysing, and cleaning the data, and then building features. All of these steps are critical to building successful machine learning…
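The cleaning and feature-building steps described above can be sketched in a few lines. This is an illustrative stand-in using plain Python dicts rather than the Spark/Scala DataFrames the post is about; the column names ("age", "income") and the derived feature are hypothetical.

```python
# Sketch of the pre-processing and feature-building steps: drop
# incomplete rows, normalise types, then derive a feature column.

def clean(records):
    """Drop rows with missing values and normalise types."""
    cleaned = []
    for r in records:
        if r.get("age") is None or r.get("income") is None:
            continue  # drop incomplete rows
        cleaned.append({"age": int(r["age"]), "income": float(r["income"])})
    return cleaned

def build_features(records):
    """Derive a simple feature: income per year of age."""
    return [dict(r, income_per_age=r["income"] / r["age"]) for r in records]

raw = [{"age": 30, "income": 50000}, {"age": None, "income": 1000},
       {"age": 40, "income": 80000}]
features = build_features(clean(raw))
```

In Spark the same two stages would typically be DataFrame transformations (filters, casts, `withColumn`), but the shape of the work — clean first, then derive features — is the same.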


Added by Rohit Walimbe on April 21, 2019 at 9:00pm — 1 Comment

Apache Kafka + KSQL + TensorFlow for Data Scientists via Python + Jupyter Notebook

Why would a data scientist use Kafka, Jupyter, Python, KSQL, and TensorFlow all together in a single notebook?

There is an impedance mismatch between model development using Python and its Machine Learning tool stack and a scalable, reliable data platform. The former is what you need for quick and easy prototyping to build analytic models. The latter is what you need to use for data ingestion, preprocessing, model deployment and monitoring at scale. It…


Added by Kai Waehner on January 22, 2019 at 10:00am — No Comments

Scalable IoT ML Platform with Apache Kafka + Deep Learning + MQTT

I built a scenario for a hybrid machine learning infrastructure leveraging Apache Kafka as a scalable central nervous system. The public cloud is used for training analytic models at extreme scale (e.g. using TensorFlow and TPUs on Google Cloud Platform (GCP) via Google ML Engine). The predictions (i.e.…


Added by Kai Waehner on August 1, 2018 at 11:00pm — 1 Comment

Model Serving: Stream Processing vs. RPC / REST - A Deep Learning Example with TensorFlow and Kafka

Machine Learning / Deep Learning models can be used in different ways to make predictions. My preferred way is to deploy an analytic model directly into a stream processing application (like Kafka Streams or KSQL). You could e.g. use the …
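The "model inside the stream processor" idea can be shown in miniature. This is a toy sketch in plain Python, not Kafka Streams or KSQL; the threshold model and the event stream are made up for illustration.

```python
# Minimal sketch of embedding a model directly in a stream processor
# (the Kafka Streams style), rather than calling it over RPC/REST.

def model_predict(event):
    """Toy analytic model: flag events whose value exceeds a threshold."""
    return "alert" if event["value"] > 0.8 else "ok"

def process_stream(events):
    # The model runs in-process for every record: no network hop and
    # no separate model server, so predictions flow with the stream.
    for event in events:
        yield {**event, "prediction": model_predict(event)}

stream = [{"id": 1, "value": 0.3}, {"id": 2, "value": 0.9}]
results = list(process_stream(stream))
```

The RPC/REST alternative would replace `model_predict` with a remote call per record, which adds latency and a second service to operate — the trade-off the post discusses.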


Added by Kai Waehner on July 8, 2018 at 4:26pm — No Comments

Apache Hadoop Admin Tips and Tricks

In this post I will share some tips I learned after using the Apache Hadoop environment for several years and running many workshops and courses. The information here targets Apache Hadoop around version 2.9, but it can definitely be extended to other similar versions.

These are considerations for when building or using a Hadoop cluster. Some apply specifically to the Cloudera distribution. Anyway, hope it…


Added by Renata Ghisloti Duarte Souza Gra on May 24, 2018 at 5:00pm — No Comments

Deep Learning Infrastructure for Extreme Scale with the Apache Kafka Open Source Ecosystem

I presented a new talk at "Codemotion Amsterdam 2018" this week, discussing how Apache Kafka and machine learning relate, and how to build a machine learning infrastructure for extreme scale.

Long version of the title:

"Deep Learning at Extreme Scale (in the Cloud) with the Apache Kafka Open Source Ecosystem - How to Build a Machine Learning Infrastructure with Kafka, Connect, Streams, KSQL, etc."

As always, I want to share the slide deck. The talk was…


Added by Kai Waehner on May 8, 2018 at 9:30pm — No Comments

Record linking with Apache Spark’s MLlib & GraphX

The challenge

Recently a colleague asked me to help her with a data problem that seemed very straightforward at first glance.

She had purchased a small set of data from the chamber of commerce (Kamer van Koophandel: KvK) that contained roughly 50k small companies (5–20 FTE), which can be hard to find online.

She noticed that many of those companies share the same address,…
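The shared-address observation is the heart of the graph step: records that share an address form edges, and linked clusters are the connected components. A minimal stand-in for the GraphX approach, using union-find over toy data (the company IDs and addresses are invented):

```python
# Treat shared addresses as edges between company records and find
# connected components with union-find — a tiny stand-in for the
# GraphX connected-components step described in the post.

def find(parent, x):
    while parent[x] != x:
        parent[x] = parent[parent[x]]  # path compression
        x = parent[x]
    return x

def cluster_by_address(records):
    parent = {r["id"]: r["id"] for r in records}
    by_addr = {}
    for r in records:
        by_addr.setdefault(r["address"], []).append(r["id"])
    for ids in by_addr.values():
        for other in ids[1:]:
            parent[find(parent, ids[0])] = find(parent, other)  # union
    clusters = {}
    for r in records:
        clusters.setdefault(find(parent, r["id"]), []).append(r["id"])
    return list(clusters.values())

companies = [
    {"id": "A", "address": "Damrak 1"},
    {"id": "B", "address": "Damrak 1"},
    {"id": "C", "address": "Keizersgracht 5"},
]
groups = cluster_by_address(companies)
```

At 50k records this fits comfortably on one machine; GraphX earns its keep when the record set or the edge-generation step (e.g. fuzzy matching via MLlib) no longer does.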


Added by Tom Lous on April 4, 2017 at 11:00pm — 5 Comments

Characteristics of Good Visual Analytics and Data Discovery Tools

Visual Analytics and Data Discovery allow analysis of big data sets to find insights and valuable information. This is much more than just classical Business Intelligence (BI). See this article for more details and motivation: "Using Visual Analytics to Make Better Decisions: the Death Pill Example". Let's take a look at important characteristics to choose the right tool for…


Added by Kai Waehner on July 27, 2016 at 10:00pm — No Comments

Apache Beam - Create Data Processing Pipelines

At the Data Science Association our members often complain about the major data engineering problem of finding the right tools and programming models to build both robust data processing pipelines and efficient ETL processes for data transformation and integration.…


Added by Michael Walker on May 19, 2016 at 10:00pm — No Comments

How to Architect a Big Data Application to Unleash Its Full Potential

For a world that's churning out and recording infinite volumes of data every second, where dependency on data is rising steeply, the need to implement a Big Data architecture becomes natural.

Big Data solutions can resolve specific big data problems and requirements for data analysis, curation, capturing, sharing, searching,…


Added by Ritesh Gujrati on May 5, 2016 at 3:30am — No Comments

Two Sides of "Big?" Data

The ongoing pursuit of data solutions occupies the mindshare of consumers, vendors and service providers alike as they invest considerable amounts of time, cost and effort. Past attempts to conquer data have resulted in solutions that combined databases, applications and tools with limited success. We are still struggling with a few unresolved, persistent legacy challenges, such as:

Data everywhere

Today, every enterprise has huge data…


Added by Suhas Marathe on February 23, 2016 at 9:48am — No Comments

Self-Learn Yourself Apache Spark in 21 Blogs – #5

In Blog 5, we will look at Apache Spark's languages with some basic hands-on examples. Click through for a quick read of the other blogs in this Apache Spark learning series.

With our cloud setup of Apache Spark in place, we are now ready to develop big data Spark applications. Before getting started with building Spark applications, let's review the languages that can be used to develop them. Spark has APIs for Scala, Java, Python, and R, along with SQL support.

Scala – It’s the language…


Added by Kumar Chinnakali on January 23, 2016 at 3:32am — No Comments

Celebrate the Big Data Problems – #2

How do you identify the number of buckets for a Hive table when writing HiveQL DDLs?

The dataottam team has come up with a blog-sharing initiative called “Celebrate the Big Data Problems”. In this series of blogs we will share our big data problems using the CPS (Context, Problem, Solution) framework.


Bucketing is another…
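Since the post's own answer is truncated here, one common heuristic for the bucket-count question (an assumption on my part, not necessarily the CPS solution the post gives): size each bucket near the HDFS block size, and round up to a power of two so bucketed joins between tables can align. The block-size default below is illustrative.

```python
# Heuristic for choosing a Hive bucket count: aim for buckets around
# the HDFS block size (assumed 256 MB here), rounded up to a power of
# two so that bucket counts of joined tables divide evenly.
import math

def num_buckets(table_size_mb, target_bucket_mb=256):
    raw = max(1, math.ceil(table_size_mb / target_bucket_mb))
    return 2 ** math.ceil(math.log2(raw))  # next power of two

buckets = num_buckets(10_000)  # a hypothetical 10 GB table
```

The chosen count then goes into the DDL's `CLUSTERED BY (...) INTO n BUCKETS` clause; changing it later requires rewriting the table, which is why it is worth estimating up front.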


Added by Kumar Chinnakali on January 21, 2016 at 7:41pm — No Comments

Celebrate the Big Data Problems – #1

Daily we face many big data problems in production, PoCs, and beyond. Do we have any common repository to collect and share them? No, as we know, we don't. As always, dataottam looks forward to sharing these learnings with the community, so that others facing the same kinds of problems can benefit. And…


Added by Kumar Chinnakali on January 15, 2016 at 11:30pm — No Comments

Self-Learn Yourself Apache Spark in 21 Blogs – #4

In Blog 4, we will look at Apache Spark Core and its ecosystem, and at running Apache Spark on the AWS Cloud. Click through for a quick read of blogs 1–3 in this learning series.

Apache Spark has many components, including Spark Core, which is responsible for task scheduling, memory management, fault recovery, and interacting with storage…


Added by Kumar Chinnakali on January 12, 2016 at 8:00am — No Comments

Self-Learn Yourself Apache Spark in 21 Blogs – #3

In Blog 3, we will look at Apache Spark's history and its role as a unified platform for big data. You may also like to have a quick read of blog 1 and blog 2.

Spark was initially started by Matei Zaharia at UC Berkeley's AMPLab in 2009, and open sourced in 2010…


Added by Kumar Chinnakali on January 9, 2016 at 9:00pm — 1 Comment

Self-Learn Yourself Apache Spark in 21 Blogs – #2

In this blog we will share the titles for learning Apache Spark, cover the basics of Hadoop, one of the core big data tools, and explain the motivations for Apache Spark, which is not a replacement for Apache Hadoop but a companion to it.

Blog 1 – Introduction to Big Data

Blog 2 – Hadoop, Spark’s Motivations

Blog 3 – Apache Spark’s History and Unified Platform for Big Data

Blog 4 – Apache Spark’s First Step – AWS, Apache Spark

Blog 5 – Apache Spark Languages with basic…


Added by Kumar Chinnakali on January 8, 2016 at 9:00pm — No Comments

5 Reasons Apache Spark is So Awesome

Those who follow big data technology news probably know about Apache Spark, and how it’s popularly known as the Hadoop Swiss Army Knife. For those not so familiar, Spark is a cluster computing framework for data analytics designed to speed up and simplify common data-crunching and analytics tasks. Spark is certainly creating buzz in the big data world, but why? What’s so special about this…


Added by Ritesh Gujrati on January 8, 2016 at 2:30am — No Comments

Self-Learn Yourself Apache Spark in 21 Blogs - #1

We have received many requests from friends who are constantly reading our blogs to provide them a complete guide to sparkle in Apache Spark. So here we have come up with learning initiative called “Self-Learn Yourself Apache Spark in 21 Blogs".

We have drilled through various sources and archives to provide a clear learning path for you to understand and excel in Apache Spark. These 21 blogs, which will be written over a period of time, will be a complete guide for you to understand and…


Added by Kumar Chinnakali on December 30, 2015 at 3:00am — No Comments


© 2021 TechTarget, Inc.