
How come there is no mention at all about the underlying hardware?

I am just very curious about the fact that most of the information, blog posts and discussions are focused on software and methods of analysis.

How come no one is interested in the underlying hardware?

Is the actual server cluster too far removed from the big data user?

Or are compute and storage now so abstract that people do not know where they are or what they are? It is a very interesting phenomenon to me, since it is bound to reflect how much people value the extensive hardware infrastructure.

Yet people are worried about the carbon impact of big data servers, so they must know that there are big data servers somewhere running all the data analysis.

What's the story?


Replies to This Discussion

Perhaps no one is interested in the underlying hardware because no one person has all the pieces of the puzzle, from the concepts and algorithms all the way down to the code and hardware. It is a very rare person who does it all, which is why this new "data science developer" position is in demand: hardly anyone fits the bill.

Companies think that moving to Hadoop will fix many of their issues. Then they realize how expensive their little cluster will be to maintain. People also think that the "cloud" is the answer until they fully understand how much data needs to move around.

Moore's law is coming to an end, but there have been enough hardware enhancements in the past several years to keep up with the increase in data.

So when people sign up, for example, to the Amazon EC2 service and choose this configuration:

Medium Instance

3.75 GB memory
2 EC2 Compute Unit (1 virtual core with 2 EC2 Compute Unit)
410 GB instance storage
32-bit or 64-bit platform
I/O Performance: Moderate
API name: m1.medium

then that is it; that is what you get for your application to run. You cannot tweak the system for better bus speed, switching, or memory latency.

People who want better performance have to move up to an

Extra Large Instance

15 GB memory
8 EC2 Compute Units (4 virtual cores with 2 EC2 Compute Units each)
1,690 GB instance storage
64-bit platform
I/O Performance: High
API name: m1.xlarge

The choice between configurations is under user control, but the hardware details of the underlying machine are not.
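As a rough sketch of what that looks like in practice (using the boto3 Python SDK and a placeholder AMI ID, neither of which comes from this thread), requesting an instance is just naming a type from the catalogue:

import boto3  # AWS SDK for Python, used here purely for illustration

ec2 = boto3.client("ec2", region_name="us-east-1")

# You choose an instance type from Amazon's catalogue...
response = ec2.run_instances(
    ImageId="ami-12345678",    # placeholder AMI ID, not a real image
    InstanceType="m1.medium",  # the only "hardware" decision exposed to you
    MinCount=1,
    MaxCount=1,
)

# ...but nothing in the API lets you ask for a particular bus speed,
# switch fabric, or memory latency; those choices stay with the provider.
print(response["Instances"][0]["InstanceId"])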

I guess then that the expertise to optimize the system and extract the maximum cost/performance ratio is left in the hands of the cloud infrastructure provider, who is best suited to optimize that ratio.

A question for any Amazon EC2 or other IaaS users out there:

Do you get a 50% price reduction every year to keep up with Moore's law?

Or do you get a faster CPU every year at the same price? If yes, how do you know that it is faster?
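One crude way to check the last point, sketched below purely as an illustration (the workload and its size are made up), is to time the same fixed CPU-bound job on last year's instance and this year's and compare the wall-clock times:

import time

def cpu_workload(n=5_000_000):
    # Fixed, CPU-bound job: sum of squares over a large range.
    total = 0
    for i in range(n):
        total += i * i
    return total

# Run the identical workload on the old instance and the new one;
# a lower elapsed time means the CPU really is faster.
start = time.perf_counter()
cpu_workload()
elapsed = time.perf_counter() - start
print(f"workload took {elapsed:.2f} s on this instance")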

People who cannot define their computing requirements (hardware) are not the people you should be taking advice from on designing Big Data architectures.

It is one thing to engage in theoretical poetry and another thing to do the prose of building systems with optimal price / performance / profitability ratios. But to get that, it is usually best to run your own ops and optimize for your own tasks, instead of punting it to AWS and overpaying through the nose.

A word of caution: hardware is often omitted from discussions because there is just not enough room to cover everything in every venue and format. But keep asking and sometimes you'll get answers.

Thanks for the comments.

I am not looking for the actual hardware configuration to use.

I am trying to understand how much people care about the underlying hardware, or not.

So far, I am leaning towards the view that most cloud infrastructure users do not really care about the hardware details, probably from lack of control rather than lack of concern.

Amazon has improved the specifications of their instances in the past. By how much/often I no longer recall (haven't used AWS in almost a year), but they have made improvements. These updates perhaps have not kept up with Moore's law, but they have reduced pricing, which means the amount of computing power per dollar has increased.
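A back-of-the-envelope sketch (the prices and speed-up below are hypothetical, just to show the arithmetic) of how a price cut plus a modest CPU bump translates into compute per dollar, compared with a Moore's-law doubling over two years:

# Hypothetical figures for illustration only.
hourly_price_2011 = 0.16      # USD/hour for a mid-size instance
hourly_price_2013 = 0.12      # USD/hour after announced price cuts
relative_speed_2013 = 1.3     # same instance class, modestly faster CPU

compute_per_dollar_2011 = 1.0 / hourly_price_2011
compute_per_dollar_2013 = relative_speed_2013 / hourly_price_2013

improvement = compute_per_dollar_2013 / compute_per_dollar_2011
moores_law = 2 ** (2 / 2)  # doubling every two years, over a two-year span

print(f"cloud compute/dollar grew {improvement:.1f}x; "
      f"Moore's law would predict {moores_law:.1f}x")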

If users want to tweak hardware at the lowest levels, then the cloud is not for them. Flexibility is always one of the major tradeoffs with any system. For a fast turnaround, I would rather forsake fine-tuning and get a baseline server running. It might not work in the long haul, but this scenario is where the cloud excels.

People do care about hardware, but they are often focused on speed, memory and physical storage. One of the selling points of Hadoop is that you can use commodity hardware. Task too big to fit in memory? Apply MapReduce.
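As a minimal sketch of that idea (a word count written in the Hadoop Streaming style, with stdin/stdout as the interface; the script and file names are made up for illustration): each mapper only ever sees its own chunk of the input, so no single machine has to hold the whole dataset in memory.

import sys

def mapper(lines):
    # Map step: emit (word, 1) for every word in this node's chunk of the input.
    for line in lines:
        for word in line.split():
            print(f"{word}\t1")

def reducer(lines):
    # Reduce step: input arrives sorted by key, so counts for a word are adjacent.
    current, count = None, 0
    for line in lines:
        word, value = line.rsplit("\t", 1)
        if word != current:
            if current is not None:
                print(f"{current}\t{count}")
            current, count = word, 0
        count += int(value)
    if current is not None:
        print(f"{current}\t{count}")

if __name__ == "__main__":
    # Simulate the pipeline locally:
    #   cat big_text_file | python wordcount.py map | sort | python wordcount.py reduce
    if sys.argv[1] == "map":
        mapper(sys.stdin)
    else:
        reducer(sys.stdin)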

Hardware used to cost more than a developer's time. Nowadays the inverse is true.



Diya Soubra said:

A question for any Amazon EC2 or other IaaS users out there:

Do you get a 50% price reduction every year to keep up with Moore's law?

Or do you get a faster CPU every year at the same price? If yes, how do you know that it is faster?

Thank you for the constructive feedback.

This statement pointed out to me another viewpoint that I had not considered before.

"Hardware used to cost more than a developer's time. Nowadays the inverse is true."

I found this article that puts forward an interesting point about Moore's law and the cloud.

http://www.wired.com/cloudline/2011/12/moores-law-cloud/

Given his argument, people doing analytics must be very happy the cloud exists!

I found one cloud provider that lets customers select the details of the server in terms of memory and speed.

http://www.cloudsigma.com/
