18 Big Data tools you need to know!!

In today’s digital transformation, big data gives organizations an edge: they can analyze customer behavior and hyper-personalize every interaction, which leads to cross-selling, improved customer experience and, ultimately, more revenue. The market for big data has grown steadily as more and more enterprises have adopted a data-driven strategy. While Apache Hadoop is the most well-established tool for analyzing big data, there are thousands of big data tools out there, all of them promising to save you time and money and to help you uncover never-before-seen business insights.
 
I have selected a few to get you going…
Avro: Developed by Doug Cutting, it is a data serialization system used to encode the schema of Hadoop files.
 
Cassandra: is a distributed, open source database designed to handle large amounts of data spread across commodity servers while providing a highly available service. It is a NoSQL solution that was initially developed by Facebook and is used by many organizations, such as Netflix, Cisco and Twitter.
 
Drill: An open source distributed system for performing interactive analysis on large-scale datasets. It is similar to Google’s Dremel, and is managed by Apache.
 
Elasticsearch: An open source search engine built on Apache Lucene. It is written in Java and can power extremely fast searches that support your data discovery applications.
 
Flume: is a framework for populating Hadoop with data from web servers, application servers and mobile devices. It is the plumbing between sources and Hadoop.
 
HCatalog: is a centralized metadata management and sharing service for Apache Hadoop. It allows for a unified view of all data in Hadoop clusters and allows diverse tools, including Pig and Hive, to process any data elements without needing to know physically where in the cluster the data is stored.
 
Impala: provides fast, interactive SQL queries directly on your Apache Hadoop data stored in HDFS or HBase using the same metadata, SQL syntax (Hive SQL), ODBC driver and user interface (Hue Beeswax) as Apache Hive. This provides a familiar and unified platform for batch-oriented or real-time queries.
 
JSON: Many of today’s NoSQL databases store data in the JSON (JavaScript Object Notation) format, which has become popular with web developers.
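
To make the format concrete, here is a minimal sketch using Python's standard json module. The profile record and its field names are made up for illustration; they are not taken from any particular database.

```python
import json

# Hypothetical user-profile record, similar in shape to what a
# document-oriented store might hold (all field names are assumptions).
profile = {
    "user_id": 42,
    "name": "Ada",
    "interests": ["hadoop", "kafka"],
    "address": {"city": "London", "country": "UK"},
}

encoded = json.dumps(profile, sort_keys=True)  # Python dict -> JSON text
decoded = json.loads(encoded)                  # JSON text -> Python dict

print(encoded)
print(decoded["address"]["city"])
```

Nested objects and lists survive the round trip unchanged, which is a large part of why the format suits document-style records.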
 
Kafka: is a distributed publish-subscribe messaging system capable of handling all of the data flow activity on a consumer website and processing that data. This type of data (page views, searches, and other user actions) is a key ingredient in the current social web.
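
The publish-subscribe idea can be sketched in a few lines of plain Python. This is a toy in-memory model, not Kafka's client API: the ToyBroker class, the topic name and the consumer names are all hypothetical, and it assumes only that messages are appended to a per-topic log and that each consumer tracks its own read position.

```python
from collections import defaultdict


class ToyBroker:
    """In-memory stand-in for a publish-subscribe broker (not Kafka's API)."""

    def __init__(self):
        self._log = defaultdict(list)     # topic -> append-only list of messages
        self._offsets = defaultdict(int)  # (topic, consumer) -> next index to read

    def publish(self, topic, message):
        """Append a message to the end of the topic's log."""
        self._log[topic].append(message)

    def poll(self, topic, consumer):
        """Return messages this consumer has not seen yet, then advance its offset."""
        start = self._offsets[(topic, consumer)]
        messages = self._log[topic][start:]
        self._offsets[(topic, consumer)] = len(self._log[topic])
        return messages


broker = ToyBroker()
broker.publish("page_views", {"user": "u1", "url": "/home"})
broker.publish("page_views", {"user": "u2", "url": "/search?q=hadoop"})

analytics_batch = broker.poll("page_views", "analytics")  # first two events
broker.publish("page_views", {"user": "u1", "url": "/pricing"})
audit_batch = broker.poll("page_views", "audit")          # all three events
analytics_next = broker.poll("page_views", "analytics")   # only the new event
```

Because each consumer keeps its own position, the analytics and audit readers see the same stream independently, without the publisher knowing about either of them.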
 
MongoDB: is a document-oriented NoSQL database developed as open source. It comes with full index support, the flexibility to index any attribute, and the ability to scale horizontally without affecting functionality.
 
Neo4j: is a graph database that boasts performance improvements of up to 1000x or more compared with relational databases.
 
Oozie: is a workflow processing system that lets users define a series of jobs written in multiple languages, such as MapReduce, Pig and Hive, and then intelligently links them to one another. Oozie allows users to specify dependencies between jobs.
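
The idea of linking jobs through dependencies can be illustrated with a short Python sketch. This is not Oozie's own workflow format (Oozie workflows are defined in XML); the job names are hypothetical, and the example only shows how a valid run order falls out of declared dependencies, using the standard library's graphlib module (Python 3.9+).

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical workflow: each job maps to the set of jobs it depends on.
dependencies = {
    "ingest_logs": set(),
    "clean_with_pig": {"ingest_logs"},
    "aggregate_with_hive": {"clean_with_pig"},
    "mapreduce_sessionize": {"clean_with_pig"},
    "publish_report": {"aggregate_with_hive", "mapreduce_sessionize"},
}

# static_order() yields every job only after all of its dependencies.
run_order = list(TopologicalSorter(dependencies).static_order())
print(run_order)
```

The two middle jobs depend only on the cleaning step, so a scheduler is free to run them in either order or in parallel; the final report always comes last.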
 
Pig: is a Hadoop-based language developed by Yahoo. It is relatively easy to learn and is adept at very deep, very long data pipelines.
 
Storm: is a free, open source system for distributed real-time computing. Storm makes it easy to reliably process unstructured data flows in real time. Storm is fault-tolerant and works with nearly all programming languages, though Java is typically used. Originally open-sourced by Twitter, Storm is now an Apache project.
 
Tableau: is a data visualization tool with a primary focus on business intelligence. You can create maps, bar charts, scatter plots and more without the need for programming. They recently released a web connector that lets you connect to a database or API, giving you the ability to use live data in a visualization.
 
ZooKeeper: is an open source service that provides centralized configuration and name registration for large distributed systems.
 
Every day, more tools are added to the big data technology stack, and it is extremely difficult to keep up with each and every one. Select a few that you can master and keep upgrading your knowledge.
