When our customers ask us what the best data warehouse is for their growing company, we consider the answer based on their specific needs. Usually, they need nearly real-time data for a low price without the need to maintain data warehouse infrastructure. In this case, we advise them to use modern data warehouses such as **Redshift, BigQuery, or Snowflake**.

Most of the modern data warehouse solutions are aimed at working with raw data, that makes them Big Data compatible. This approach allows you to redefine your analytics on the fly without a need to re-ingest your data stored in a warehouse.

In this post, we would like to dig deeper into the factors to consider while choosing a data warehouse. Here they are:

- Volume of data
- Dedicated human resources for the support and maintenance
- Scalability: horizontally vs. vertically
- Pricing models

You need to know the estimates of the volume of data that you will be dealing with. If it’s a clear cut scenario that you are working with datasets ranging in the hundreds of TBs or in petabytes, then it is strongly advised to go for non-relational databases. The architecture of such databases that supports working with huge data sets is ingrained in their DNA.

On the other hand, many relational databases have really great time-proven query optimizers. You can consider them as an option for an analytics warehouse as soon as your dataset fits into one single node.

Let’s look at some mathematics relevant to dataset size:

- The Postgres, MySQL, MSSQL, and many other RDBMS sweet spot is up to 1TB of data involved in analytics. If this size is exceeded, you may experience degraded performance.
- Amazon Redshift, Google BigQuery, Snowflake, and Hadoop-based solutions support a dataset size up to multiple petabytes in an optimal manner.

Another important aspect to evaluate is whether you have any dedicated resources for the maintenance, support, and fixes for your database, if any. This facet plays an important role in the comparison.

If you have dedicated resources for the support and maintenance, you have earned yourself a lot more options in choosing the database.

You can opt for creating your own Big Data warehousing options based on something like Hadoop or Greenplum. These systems **do require significant setup, maintenance engineering resources, and** **skilled personnel.**

But in the case that you don’t have any dedicated resources for maintenance then you limit your options a bit. We advise going for a modern data warehouse solution like Redshift, BigQuery, or Snowflake. As an administrator or user, you don’t need to worry about deploying, hosting, resizing VMs, handling replication, or encryption. You can start using it by issuing SQL commands.

When you start working with a database, you expect it to be scalable enough to support your further growth. Broadly, database scalability can be achieved in two ways, **horizontally or vertically**.

Horizontal scalability refers to the addition of more machines, whereas vertical scalability means the addition of resources into a single node to increase its capability.

**Redshift** provides easy scalable options. With a few mouse clicks, you can increase the number of nodes and configure them to meet your needs. Redshift scales very well until you hit about 100TB of data that is processed at once in a query. Computing capacity of your Redshift cluster will always rely on a count of nodes in your cluster, unlike some other data warehouse options.

It’s a point where such solutions as **BigQuery** come into play. There’s virtually no cluster capacity as BigQuery can allocate up to 2000 slots, which is the equivalence of nodes in Redshift. Also, due to this multi-tenancy strategy, even when customers’ concurrency demands grow, BigQuery scales seamlessly with those demands and can go over this limitation of 2000 slots if required.

BigQuery relies on Colossus, which is Google’s latest generation distributed file system. Colossus allows BigQuery users to scale to dozens of petabytes in storage seamlessly, without paying the penalty of attaching much more expensive computing resources.

**Snowflake** is built on Amazon S3 cloud storage and its storage layer holds all the diverse data, tables, and query results. Because this storage layer is engineered to scale completely independent of computing resources, it ensures that maximum scalability can be achieved effortlessly for big data warehousing and analytics.

In addition to this, Snowflake offers multiple virtual warehouses, at nearly any scale and concurrency, which can simultaneously operate on the same data while fully enforcing global system-wide transactional integrity and keeping it scalable.

If you go with the self-hosted option like Hadoop, your pricing will consist mostly of VM or hardware bills. AWS provides an EMR solution which is a cost efficient option to consider when going with Hadoop stuff.

Drilling down further into RedShift vs BigQuery vs Snowflake, all of them offer on-demand pricing, but each one comes with their own unique pricing model flavor.

**Amazon RedShift** offers three pricing models:

- On-Demand Pricing: no upfront commitments and cost, you simply pay an hourly rate depending upon the types and number of nodes in your cluster. Here, one important factor often ignored is that the rates do vary depending upon the region. The rates cover both computation and data storage.
- Spectrum Pricing: you merely pay for the bytes scanned while querying against Amazon S3.
- Reserved Instance Pricing: if you are sure that you will be running on RedShift for at least a few years, then you can save up to 75% over on-demand rates by opting for reserved instance pricing.

**Google BigQuery** offers scalable, flexible pricing options and charges for data storage, streaming inserts, and for querying data, but loading and exporting data are free of charge. The pricing strategy of BigQuery is quite unique as it is based on a rate per GB for storage and a bytes scanned rate for queries. Also, it offers cost control mechanisms that enable you to cap your daily costs to an amount that you choose. It offers a long-term pricing model as well.

**Snowflake** offers on-demand pricing, which is similar to BigQuery and Redshift Spectrum. Unlike BigQuery, computational usage is billed on a per-second basis rather than bytes scanned, with a minimum of 60 seconds. Snowflake decouples data storage from computation, and hence the billing is individual for both of them.

The storage rate starts at around $40/TB/month for the standard edition and remains the same for other editions. On the other hand, for compute the rate starts at $2.00 per hour for standard and goes up to $4.00 per hour for the enterprise edition.

General advice for choosing a data warehouse, which we usually provide to our customers, is the following:

- Use index optimized RDBMS such as Postgres, MySQL, or MSSQL when data volumes are much less than 1TB of data in total and much less than 500M rows per analyzed table, and the whole database can fit into a single node.
- Use modern data warehouses like Redshift, BigQuery, or Snowflake when your data volume is between 1TB and 100TB. Also consider Hadoop with Hive, Spark SQL, or Impala as a solution if you have access to this expertise and you can allocate dedicated human resources to support that.
- When your data volume is over 100TB use BigQuery, Snowflake, Redshift Spectrum, or the self-hosted Hadoop equivalent solution.

© 2019 Data Science Central ® Powered by

Badges | Report an Issue | Privacy Policy | Terms of Service

**Most Popular Content on DSC**

To not miss this type of content in the future, subscribe to our newsletter.

**Technical**

- Free Books and Resources for DSC Members
- Learn Machine Learning Coding Basics in a weekend
- New Machine Learning Cheat Sheet | Old one
- Advanced Machine Learning with Basic Excel
- 12 Algorithms Every Data Scientist Should Know
- Hitchhiker's Guide to Data Science, Machine Learning, R, Python
- Visualizations: Comparing Tableau, SPSS, R, Excel, Matlab, JS, Pyth...
- How to Automatically Determine the Number of Clusters in your Data
- New Perspectives on Statistical Distributions and Deep Learning
- Fascinating New Results in the Theory of Randomness
- Long-range Correlations in Time Series: Modeling, Testing, Case Study
- Fast Combinatorial Feature Selection with New Definition of Predict...
- 10 types of regressions. Which one to use?
- 40 Techniques Used by Data Scientists
- 15 Deep Learning Tutorials
- R: a survival guide to data science with R

**Non Technical**

- Advanced Analytic Platforms - Incumbents Fall - Challengers Rise
- Difference between ML, Data Science, AI, Deep Learning, and Statistics
- How to Become a Data Scientist - On your own
- 16 analytic disciplines compared to data science
- Six categories of Data Scientists
- 21 data science systems used by Amazon to operate its business
- 24 Uses of Statistical Modeling
- 33 unusual problems that can be solved with data science
- 22 Differences Between Junior and Senior Data Scientists
- Why You Should be a Data Science Generalist - and How to Become One
- Becoming a Billionaire Data Scientist vs Struggling to Get a $100k Job
- Why do people with no experience want to become data scientists?

**Articles from top bloggers**

- Kirk Borne | Stephanie Glen | Vincent Granville
- Ajit Jaokar | Ronald van Loon | Bernard Marr
- Steve Miller | Bill Schmarzo | Bill Vorhies

**Other popular resources**

- Comprehensive Repository of Data Science and ML Resources
- Statistical Concepts Explained in Simple English
- Machine Learning Concepts Explained in One Picture
- 100 Data Science Interview Questions and Answers
- Cheat Sheets | Curated Articles | Search | Jobs | Courses
- Post a Blog | Forum Questions | Books | Salaries | News

**Archives**: 2008-2014 | 2015-2016 | 2017-2019 | Book 1 | Book 2 | More

**Most popular articles**

- Free Book and Resources for DSC Members
- New Perspectives on Statistical Distributions and Deep Learning
- Time series, Growth Modeling and Data Science Wizardy
- Statistical Concepts Explained in Simple English
- Machine Learning Concepts Explained in One Picture
- Comprehensive Repository of Data Science and ML Resources
- Advanced Machine Learning with Basic Excel
- Difference between ML, Data Science, AI, Deep Learning, and Statistics
- Selected Business Analytics, Data Science and ML articles
- How to Automatically Determine the Number of Clusters in your Data
- Fascinating New Results in the Theory of Randomness
- Hire a Data Scientist | Search DSC | Find a Job
- Post a Blog | Forum Questions

## You need to be a member of Data Science Central to add comments!

Join Data Science Central