Apache Drill is a low-latency distributed query engine for large-scale datasets, including structured and semi-structured/nested data. Inspired by Google’s Dremel, Drill is designed to scale to several thousand nodes and query petabytes of data at the interactive speeds that BI/analytics environments require.
Apache Drill includes a distributed execution environment, purpose-built for large-scale data processing. At the core of Apache Drill is the “Drillbit” service which…Continue
Added by Raghavan Madabusi on March 31, 2015 at 3:10pm — No Comments
Continued from Art of Data Science, part 1
During the execution of a data science initiative, one person has to constantly think big and keep the business application of the project in mind. Maybe this is needed in any IT project. But the…Continue
Added by INSOFE on March 31, 2015 at 1:00am — No Comments
Finding insight within one data stream is a challenge. Finding insight from multiple streams can be significantly more so. The simple example? Two different databases created independently of each other that claim to capture the same kind of data. The larger the dataset, the more challenges we face aligning columns, de-duping content, making sure we don’t overwrite newer data with old data, and otherwise cleaning and preparing data for analysis. Ask anyone who has worked trying to align…Continue
Added by Anne Russell on March 30, 2015 at 4:00pm — No Comments
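The alignment problem described in the excerpt above (de-duping two sources without overwriting newer data with old) can be sketched as follows. This is only an illustration, not the post author's method: the function name `merge_keep_newest`, the record fields `id`, `email`, and `updated`, and the sample rows are all hypothetical.

```python
from datetime import date


def merge_keep_newest(*sources):
    """Merge record lists, keeping the most recently updated record per id.

    Assumes every record is a dict with an "id" key and a comparable
    "updated" value (both are illustrative field names).
    """
    merged = {}
    for records in sources:
        for rec in records:
            current = merged.get(rec["id"])
            # Replace only when the incoming record is strictly newer,
            # so older data never overwrites newer data.
            if current is None or rec["updated"] > current["updated"]:
                merged[rec["id"]] = rec
    return merged


# Two hypothetical databases that claim to capture the same customers.
crm = [
    {"id": 1, "email": "ann@old.example", "updated": date(2014, 1, 5)},
    {"id": 2, "email": "bob@example.com", "updated": date(2015, 3, 1)},
]
billing = [
    {"id": 1, "email": "ann@new.example", "updated": date(2015, 2, 9)},
    {"id": 2, "email": "bob@stale.example", "updated": date(2013, 7, 4)},
]

result = merge_keep_newest(crm, billing)
```

Real datasets rarely share a clean key or a trustworthy timestamp, which is where most of the alignment effort goes.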
In this post, we’ll use a supervised machine learning technique called logistic regression to predict delayed flights. But before we proceed, I would like to offer my condolences to the families of the victims of the Germanwings tragedy.
This analysis is conducted using a public data set that can be obtained here:…Continue
Have you ever wondered how corporations manage the analytical assets that are mission-critical to their business? A bank or a telecom service provider may have more than 100 predictive model assets developed over time, yet face the important question of how to effectively manage, store, share, or archive them.
“The next breakthrough in data analysis may not be in individual algorithms, but in the ability to rapidly combine, deploy, and…Continue
Fresh data is usually pristine. It’s data in its clearest, most accurate form – straight from the customer or client. If you’ve put measures in place to cut back on data input errors, such as form validation, you can be reasonably sure that the newest records in your CRM are the “latest and greatest”.
If your CRM has been active for some time, you’ll have a…Continue
Added by Martin Doyle on March 27, 2015 at 1:00am — No Comments
The full version is always published Monday. Starred articles or sections are new additions or updated content, posted between Thursday and Sunday.
Added by Vincent Granville on March 25, 2015 at 8:00am — No Comments
Good data is the driving force behind successful marketing. Data can be analyzed to determine what your customers are looking for, what will drive them to purchase, and to establish a best prospect profile. According to a report by GlobalSpec, the primary marketing goals for manufacturers are customer acquisition (43%) and lead generation (29%), with 54% planning to increase marketing spend.…Continue
Added by Larisa Bedgood on March 25, 2015 at 5:00am — No Comments
During my 30-year analytics career, prospective employers and clients have often asked me: "How can you help us with data-driven insights when you have not worked in this industry before?" I argue for greater emphasis on machine learning skills in data scientists, and on their partnership with domain experts, as an effective pathway to bring data science to a business.
Clearly, the description of data scientist as the mythical unicorn who has computer science skills,…Continue
It probably comes as no surprise, but we talk to a lot of data scientists at CrowdFlower. We like learning the tools they use, the programs that make their lives easier, and how everything works together. Today, we're really pleased to unveil the first of a three-part series about the data science ecosystem. Here it is in infographic form because, let's face it, everybody likes infographics: …Continue
I received an unsolicited email today from Pedro Marcus of DataOnFocus. I usually don't even open such emails because of the volume I get each day, but this one was actually very interesting, so I'm sharing it with you.
Free data mining books…Continue
Guest blog post.
Growth hacking is turning out to be one of the hottest fields for data analysts & scientists. Although there is controversy about the term & its specific meaning, the general connotation implies a function, activity or person primarily focused on growing a set of metrics such as users, revenue, visits &…Continue
When Apple CEO Tim Cook finally unveiled his company’s new Apple Watch in a widely publicized rollout earlier this month, most of the press coverage centered on its cost ($349 to start) and whether it would be as popular among consumers as the iPod or iMac.
Nitin Indurkhya saw things differently.
“I think the…Continue
Added by Peter Bruce on March 23, 2015 at 4:30am — No Comments
Let’s walk through an example of predictive analytics using a data set that most people can relate to: prices of cars. In this case, we have a data set with historical Toyota Corolla prices along with related car attributes.
Let’s load in the Toyota Corolla file and check…
This is part two of the series. In part one, we used a linear regression model to predict the prices of used Toyota Corollas. There is some overlap in the material for those reading this post first. If you have already read part one on linear regression, you can safely skip to the section where I apply neural networks to the same data set.
In this post, we will…Continue
I made a recent discovery that I would like to share with the community. In my previous blog, I introduced a special algorithmic shell that distributes stocks based on their price movements (along the x-axis) and volume movements (y-axis). Using this shell, it is possible to visualize the trading behaviours of dozens of stocks simultaneously. I noticed one day that the stocks seemed to be lining up in formation. I decided to test the accuracy of my visual interpretation. Below I present the…Continue
Added by Don Philip Faithful on March 22, 2015 at 5:22am — No Comments
Added by Athanassios Hatzis on March 21, 2015 at 5:30am — No Comments
When we try to build classification models from training data, the proportions of the target classes affect the accuracy of predictions. This is an experiment to measure the impact of these proportions.
Let us say you are trying to predict which visitors to your website would buy a product. You collect historical data about the visitors' characteristics and actions and also whether they bought something or not. This is the model building data…
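One reason class proportions matter can be shown with a small sketch. It is not the experiment from the post. It only computes the accuracy of a baseline that always predicts the most common class. The helper `majority_baseline_accuracy` and the buyer counts are hypothetical.

```python
from collections import Counter


def majority_baseline_accuracy(labels):
    """Accuracy of a model that always predicts the most common class."""
    counts = Counter(labels)
    most_common_count = counts.most_common(1)[0][1]
    return most_common_count / len(labels)


# Hypothetical visitor data: 1 = bought something, 0 = did not.
for buyers in (50, 20, 5):
    labels = [1] * buyers + [0] * (100 - buyers)
    accuracy = majority_baseline_accuracy(labels)
    print(f"{buyers}% buyers: baseline accuracy {accuracy:.2f}")
```

The more skewed the classes, the higher this do-nothing baseline climbs, so raw accuracy on skewed data should be compared against it before drawing conclusions about a model.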
Asking questions is easy. It’s so easy that, as askers, we often don’t think about the quality of our questions. Poorly framed questions waste everyone’s time—yours included—because they require the answerer to make assumptions. When it comes to asking analysts to explore a problem you’re trying to solve, better questions will drive better analysis and, ultimately, more actionable answers.
Here’s an example:
Marketer: “How many people converted from paid ad…Continue
Added by Derek Steer on March 19, 2015 at 11:49am — No Comments
A Visual Studio 2013 demo project including the WebpageDownloader and LinkCrawler can be downloaded here.
The US digital universe currently doubles in size approximately every three years. In fact, Hewlett Packard estimates that by the end of this decade, the digital universe will be measured in ‘Brontobytes’, which…Continue