Redshift knocked it out of the park, in my opinion. In fact, the appeal of Redshift is much like the appeal of the Summarizer to me: it takes a non-trivial business problem (executing expressive SQL over medium data) and completely undercuts the mainstream solutions’ price points by trading away generality for ease-of-use and performance.
And sure, it has some flaws: COPY failures, load-induced cluster crashes, dashboard slowness, and so on. But I don’t care. I found workarounds for all of them over the course of a few days. Moreover, my communication with the Redshift team has been nothing but constructive, and given what I heard the beta days were like, they’re making real progress. I have no doubt that the service is only going to get better and cheaper.
Oddly enough, Redshift isn’t going to sell because devs think it’s super-duper-whizz-bang. It’s going to sell because it took a problem and an industry famous for its opaque pricing, high TCO, and unreliable results and completely turned it on its head. MPP was never a bad idea. Selling it that way was. Yes, the effective TCO is closer to $5k/TB/year than to the stated $1k/TB/year, but the pricing scheme is transparent, and it’s still half to a quarter of the price of the other MPP alternatives.
It was a cloud-based, fully managed and hosted solution, which would allow us to scale to massive data volumes as needed, and it is built on Postgres, making it easy to use and integrate with other tools.
Furthermore, Redshift was designed for exactly the use case we had in mind: running complex queries against very large sets of structured data. Redshift can scan terabytes of data in minutes, which means our users get answers to their questions quickly: in seconds to minutes, depending on query complexity and data volume.
Last but not least, using Redshift is incredibly cost-effective, which is necessary for us to provide our users with a reasonably priced service. The on-demand pricing structure allows us to pay only for what we need, and easily add nodes for more data storage and processing power as we add more data from users.
Compared to traditional data warehouses, where you have to pay an upfront license fee as well as hardware and maintenance costs, Redshift provides huge savings. According to Amazon (so this data point may be somewhat biased), running an old-school data warehouse costs $19,000–$25,000 per terabyte per year. Redshift, on the other hand, boasts that it costs only $1,000 per terabyte per year at its lowest pricing tier.
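To make those numbers concrete, here's a back-of-the-envelope comparison using the per-terabyte figures quoted above; the 10 TB warehouse size is a made-up example, not a real deployment:

```python
# Back-of-the-envelope comparison of the per-TB-per-year figures quoted
# above. The warehouse size is a hypothetical example.

TRADITIONAL_PER_TB_YEAR = (19_000, 25_000)  # Amazon's estimate for traditional warehouses
REDSHIFT_PER_TB_YEAR = 1_000                # Redshift's lowest pricing tier

def annual_cost(data_tb, per_tb_year):
    """Yearly cost of warehousing `data_tb` terabytes at a given rate."""
    return data_tb * per_tb_year

data_tb = 10  # hypothetical warehouse size
low = annual_cost(data_tb, TRADITIONAL_PER_TB_YEAR[0])
high = annual_cost(data_tb, TRADITIONAL_PER_TB_YEAR[1])
redshift = annual_cost(data_tb, REDSHIFT_PER_TB_YEAR)

print(f"Traditional: ${low:,}-${high:,} per year")   # $190,000-$250,000
print(f"Redshift:    ${redshift:,} per year")        # $10,000
print(f"Ratio:       {low // redshift}x-{high // redshift}x cheaper")
```

At Amazon's list prices the gap is roughly 19–25x; even at the $5k/TB/year effective TCO mentioned earlier, the savings remain several-fold.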
Beyond all the technical bells and whistles, though, the real value of Redshift is that our customers can do ad hoc queries on all of their analytics data whenever they want. They don’t need to worry about the ETL process, or the costs of hosting their data; all they need to do is log in.
This allows us to provide a unified business intelligence solution for app product teams – from marketers to product managers to data scientists to engineers. Our dashboards allow anyone to get insights about how their users are behaving in their app with just a few clicks. And more technical users can dig even deeper, going beyond the dashboards to look at custom metrics and answer really complex questions.
Every month Heyzap holds a company-wide hack day. Roughly two years ago, several of our engineers collaborated and built a unified analytics database to help us make faster and more informed business decisions. The hack day project looked great and had obvious value, but the database backend couldn’t keep up after a few days. That’s when we started using Amazon Redshift, which worked perfectly for this type of workload. Amazon Redshift allows our project to thrive internally, and it continues to be our “secret weapon” for investigating problems and making the best decisions for Heyzap.
The move to Redshift also significantly improved dashboard query performance. Before building any aggregates, we ran a test of our existing hardware against a two-node Redshift cluster that cost a quarter as much per month. Redshift performed roughly 200% faster than the traditional SQL Server setup we had been using.
Given the ease with which you can spin up additional nodes, we could test the cost/benefit tradeoffs by simply adding processing capacity when needed versus building additional aggregates, which took significant time and labor. We also adopted a strategy of buying only the capacity we needed for the processing demand we had at the time.
We learned a few things as well. Due to its parallel nature, Redshift likes to consume files for load in smaller bites than SQL Server does. We needed to parallelize the process of breaking up the source files so that loading did not become a bottleneck.
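The splitting step can be sketched roughly as follows. Redshift's COPY loads fastest when given many smaller files rather than one large one (ideally a multiple of the cluster's slice count); file names and the chunk count below are illustrative, not our production code:

```python
# Sketch of the file-splitting step described above: break one large
# newline-delimited source file into N part files that COPY can load
# in parallel. Names and chunk counts are illustrative.

import os
import tempfile

def split_file(path, num_chunks, out_dir):
    """Split a newline-delimited file into up to `num_chunks` part files,
    keeping whole lines intact so each part is independently loadable."""
    with open(path) as f:
        lines = f.readlines()
    chunk_size = -(-len(lines) // num_chunks)  # ceiling division
    parts = []
    for i in range(num_chunks):
        chunk = lines[i * chunk_size:(i + 1) * chunk_size]
        if not chunk:
            break
        part_path = os.path.join(out_dir, f"part_{i:04d}")
        with open(part_path, "w") as out:
            out.writelines(chunk)
        parts.append(part_path)
    return parts

# Example: split a 10-line file into 4 parts. In practice each part
# would be uploaded to S3 and loaded with one COPY over the common
# key prefix, letting each slice pull files in parallel.
with tempfile.TemporaryDirectory() as d:
    src = os.path.join(d, "events.csv")
    with open(src, "w") as f:
        f.write("".join(f"row{i}\n" for i in range(10)))
    parts = split_file(src, 4, d)
    print(len(parts))  # 4 part files of up to 3 lines each
```

Running several of these splitters concurrently (one per source file) was what kept the load step from becoming the bottleneck.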
We also gained a big advantage in scalability. For many of the pre-processing steps, we needed processing capacity only intermittently. So, we used that capacity only when needed. For the Redshift DB itself, we developed an architecture that would scale up additional nodes only when the demand was there.
A database that can support a dimensional model has to meet certain technical requirements.
The main operations on such a database are aggregations and joins. These two operations are expensive on traditional databases, which are intended to be data stores for applications. Try joining a 100-million-row table to a 500-million-row table in MySQL; even Postgres would buckle.
We needed something specialized for analytics. Luckily for us, there exists a cheap, easy-to-use, easy-to-maintain database optimized for analytical queries such as joins and aggregations: Amazon Redshift on AWS.
A Redshift database with 2 TB of disk space, sufficient for our needs, costs a little over $4,000 a year.
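For concreteness, that works out to an effective per-terabyte rate (simple arithmetic on the figures above; the comparison rate is the low end of Amazon's traditional-warehouse estimate quoted earlier):

```python
# Effective per-TB rate implied by the figures above (illustrative arithmetic).
cluster_cost_per_year = 4_000   # "a little over $4000 a year"
disk_tb = 2
traditional_low = 19_000        # low end of Amazon's traditional-warehouse estimate

per_tb_year = cluster_cost_per_year / disk_tb
print(f"${per_tb_year:,.0f} per TB per year")        # $2,000 per TB per year
print(f"{traditional_low / per_tb_year:.1f}x cheaper than a traditional warehouse")
```

Double the headline $1,000/TB/year tier, in other words, but still nearly an order of magnitude below the traditional alternative.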