What are the most challenging issues when dealing with analytic tools?

As an analytic or data science professional, what are the software bottlenecks / nightmares in your daily job? In my case, my challenges are:

  1. Google or Bing programmable APIs: there is very little documentation, no technical support (other than a Google group where nobody answers your questions), and no training that I am aware of on how to use the APIs. Code is provided in several languages, but the Perl version is not offered for the most recent services, the Python code has many bugs, and only the Java code seems robust.
  2. Excel: when you compute a metric such as a standard deviation or a percentile, what formula is behind Excel's computations? Excel is well known for using non-standard definitions, its random number generator has been highly criticized, and many people report buggy formulas. What are your thoughts on this, and how can one get better accuracy from Excel? Does this bother you?
  3. Hadoop: is there a way to leverage Hadoop for problems that are very difficult to implement in a distributed architecture, such as testing the discriminating power of thousands of vectors (each vector consisting of 5-15 attributes) in the context of fraud detection or scoring technology? Or finding the optimum vector out of a universe of trillions of trillions of potential vectors (by vector, I mean a set of attributes)?
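
To make the Excel question in item 2 concrete, here is a minimal Python sketch of how much the answer can depend on which definition is used. It relies only on the standard `statistics` module; the sample numbers are made up for illustration. As far as I know, the population and sample standard deviations correspond to Excel's `STDEV.P` and `STDEV.S`, and the "inclusive" and "exclusive" quantile methods correspond to `PERCENTILE.INC` and `PERCENTILE.EXC`, but treat that mapping as an assumption to check against your own spreadsheet.

```python
import statistics


def compare_definitions(values):
    """Return competing definitions of the same metrics for one data set."""
    return {
        # Divides by n (population formula).
        "population_stdev": statistics.pstdev(values),
        # Divides by n - 1 (sample formula).
        "sample_stdev": statistics.stdev(values),
        # Quartiles when the data are treated as including the extremes.
        "quartiles_inclusive": statistics.quantiles(values, n=4, method="inclusive"),
        # Quartiles when the data are treated as a sample from a wider population.
        "quartiles_exclusive": statistics.quantiles(values, n=4, method="exclusive"),
    }


sample = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up numbers, for illustration only
for name, value in compare_definitions(sample).items():
    print(name, value)
```

On this small sample the two standard deviations differ (2.0 versus about 2.14), and the third quartile is 5.5 under one method and 6.5 under the other, so it is worth knowing which definition a given tool applies before comparing results.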

Thanks for your help!

Replies to This Discussion

With SQL, merging two data sets that come from two different databases is challenging when the join key is encoded in two different, incompatible formats (e.g. German characters are coded one way in the first database and a different way in the second, or German words are removed).
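
One common workaround is to normalize the key on both sides before joining. The sketch below is only an illustration in Python, not a SQL solution: it assumes one source stores umlauts directly ("Müller") while the other writes them out in ASCII ("MUELLER"), and the names `normalize_key` and `join_on_normalized_key` are hypothetical helpers invented for this example. It cannot recover keys from which words were removed entirely.

```python
import unicodedata

# Assumed convention: umlauts are written out as two ASCII letters.
UMLAUT_MAP = str.maketrans({"ä": "ae", "ö": "oe", "ü": "ue"})


def normalize_key(raw):
    """Map differently encoded spellings of a key to one comparable form."""
    # Compose characters first so "u" + combining diaeresis becomes "ü",
    # then casefold (which also turns "ß" into "ss").
    text = unicodedata.normalize("NFC", raw).casefold()
    text = text.translate(UMLAUT_MAP)
    # Decompose and drop any remaining accents (for example "é" becomes "e").
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch)).strip()


def join_on_normalized_key(left_rows, right_rows):
    """Inner-join two lists of (key, value) pairs on the normalized key."""
    right_index = {normalize_key(key): value for key, value in right_rows}
    return [
        (key, value, right_index[normalize_key(key)])
        for key, value in left_rows
        if normalize_key(key) in right_index
    ]


customers = [("Müller", 1), ("Straße 5", 2)]           # hypothetical rows from database A
orders = [("MUELLER", "A-17"), ("Strasse 5", "B-02")]  # hypothetical rows from database B
print(join_on_normalized_key(customers, orders))
```

In practice the same idea can be applied by adding a normalized key column to each table before the join, so that the database compares like with like.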

I would not recommend Excel for serious analytics due to mistakes found in many Excel sheets.

I think highly of Hadoop, but you are right: not every algorithm can be implemented efficiently in a distributed architecture. Nevertheless, Hadoop, as a representative of NoSQL technology, is better than any SQL-based tool when analyzing big data. The main problem with SQL on big data is that table joins become very slow. Hadoop, however, doesn't remedy this drawback of SQL; instead, your data need to be almost ready for analysis, with no joins required, before you bring them into Hadoop. You can join tables in Hadoop as well, but I don't advise doing so, as Hadoop would instantly lose its advantages (there could be a few exceptions, but they don't change the overall picture).
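
To illustrate what "almost ready for analysis" can look like, here is a rough sketch in plain Python that imitates the map and reduce steps. It is not Hadoop code and uses no Hadoop API. It assumes the records are already denormalized, so each one carries every attribute plus a fraud label, and it scores each candidate set of attributes by how well its buckets separate fraud from non-fraud. The records, the attribute names, and the purity-based score are all made up for illustration.

```python
from collections import Counter, defaultdict
from itertools import combinations

# Hypothetical, already denormalized records: no join is needed at analysis time.
RECORDS = [
    {"country": "DE", "channel": "web", "fraud": 1},
    {"country": "DE", "channel": "store", "fraud": 0},
    {"country": "FR", "channel": "web", "fraud": 1},
    {"country": "FR", "channel": "store", "fraud": 0},
    {"country": "DE", "channel": "web", "fraud": 1},
    {"country": "FR", "channel": "store", "fraud": 0},
]


def map_phase(record, attribute_sets, label="fraud"):
    """Emit ((attribute_set, bucket), label) pairs for a single record."""
    for attrs in attribute_sets:
        bucket = tuple(record[a] for a in attrs)
        yield (attrs, bucket), record[label]


def reduce_phase(pairs):
    """Count labels per bucket, then sum the majority counts per attribute set."""
    counts = defaultdict(Counter)
    for key, lab in pairs:
        counts[key][lab] += 1
    correct = Counter()
    for (attrs, _bucket), labels in counts.items():
        correct[attrs] += max(labels.values())
    return correct


def score_attribute_sets(records, attributes, max_size=2):
    """Return the share of records matching their bucket's majority label."""
    attribute_sets = [
        combo
        for size in range(1, max_size + 1)
        for combo in combinations(attributes, size)
    ]
    pairs = (pair for record in records for pair in map_phase(record, attribute_sets))
    correct = reduce_phase(pairs)
    return {attrs: correct[attrs] / len(records) for attrs in attribute_sets}


scores = score_attribute_sets(RECORDS, ["country", "channel"])
# Prefer the highest score; break ties in favor of fewer attributes.
best = max(scores, key=lambda attrs: (scores[attrs], -len(attrs)))
print(best, scores[best])
```

Each record is processed independently in the map step, which is the part that distributes well. The number of candidate attribute sets still grows combinatorially, so this sketch does nothing to shrink a search space of trillions of vectors; it only shows that scoring a given list of candidates needs no joins once the data are flattened.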

I think Excel has its advantages, since an analyst without an IT background can operate it easily, but you are right that it has its shortcomings. At present there are many small tools, such as esProc and esCalc, that can help with such problems conveniently: they handle complicated data processing without demanding much IT expertise. So a combination of Excel and some of these tools may be a good choice for some analysts.

Java and SQL have powerful computational abilities, but they are too complex for general users, as they require a specialized IT background. There are, however, many other solutions that can help make up for these shortcomings.

© 2013   Data Science Central
