Have you struggled in your data science function because of underlying data processing issues? Here are the data processing architectures of 4 top web companies to help you overcome those issues.
Nextdoor - Offline Data Composition
The first step was to define our SLA. We decided that the data needed to be fresh, but not up to the second. An SLA of a couple of hours was just fine for what we needed. Once we knew the schedule on which we would need to pull the data, we began defining a pipeline.
Nextdoor has partnered with over 1,300 agencies across the US. We provide them tools to very specifically geo-target content in order to keep it relevant and unique to individual neighborhoods.
One such tool is the Map and Metrics page. It displays information about the agency’s area that enables them to better target content by displaying anonymized reach and engagement data overlaid on a map.
500px - Analytics & Reporting Infrastructure
When I first started, I was presented with two sources of data that I could work with: Splunk and MySQL. MySQL could give stats like the total number of likes on a single photo, but it couldn’t tell me how many likes the photo got in the last hour. Splunk is a search engine that runs on top of logs. Wait… what does that mean? Logs can tell you a lot about user behaviour. When a user likes a photo, that action is written to a log file as a timestamped record. So by looking at the logs, you can find out how many likes a photo got in the last hour by going through and counting every line where you see a like event on that particular photo.
But log files are just text files. Getting value out of them is difficult because they need to be parsed. This is where Splunk comes in. Splunk is a tool that allows you to search and analyze log data. I could ask questions about user behaviour:
On average, how many photos does a user upload in a single session?
How many Android users used our platform in the last month?
Splunk would then go through the logs line by line very quickly, and count the number of events that match the query.
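To make the counting idea concrete, here is a minimal Python sketch of answering "how many likes did this photo get in the last hour?" by scanning log lines. It is not Splunk's query language or 500px's actual setup: the log format, the sample lines, and the `count_likes` helper are all assumptions made up for illustration.

```python
from datetime import datetime, timedelta

# Hypothetical log format (an assumption, not 500px's real one):
# "<ISO timestamp> <event> photo_id=<id> user_id=<id>"
SAMPLE_LOG = """\
2015-06-01T11:10:00 like photo_id=42 user_id=7
2015-06-01T11:30:00 like photo_id=42 user_id=8
2015-06-01T11:45:00 upload photo_id=43 user_id=7
2015-06-01T11:50:00 like photo_id=99 user_id=9
2015-06-01T10:15:00 like photo_id=42 user_id=3
"""


def count_likes(log_text, photo_id, now, window=timedelta(hours=1)):
    """Count 'like' events on one photo with a timestamp in (now - window, now]."""
    count = 0
    for line in log_text.splitlines():
        parts = line.split()
        if len(parts) < 3:
            continue  # skip blank or malformed lines
        timestamp = datetime.fromisoformat(parts[0])
        event = parts[1]
        # Turn "key=value" tokens into a dict of string fields.
        fields = dict(p.split("=", 1) for p in parts[2:] if "=" in p)
        if (
            event == "like"
            and fields.get("photo_id") == str(photo_id)
            and now - window < timestamp <= now
        ):
            count += 1
    return count


now = datetime(2015, 6, 1, 12, 0, 0)
likes_last_hour = count_likes(SAMPLE_LOG, 42, now)
print(likes_last_hour)
```

The parsing step is the part that a log search tool takes off your hands. Here it is a few lines only because the made-up format is very regular.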
Netflix - Data Analytics Platform
The current architecture’s primary interface is the viewing service, which is segmented into a stateful and stateless tier. The stateful tier has the latest data for all active views stored in memory. Data is partitioned into N stateful nodes by a simple mod N of the member’s account id. When stateful nodes come online they go through a slot selection process to determine which data partition will belong to them. Cassandra is the primary data store for all persistent data. Memcached is layered on top of Cassandra as a guaranteed low latency read path for materialized, but possibly stale, views of the data.
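The "simple mod N" partitioning described above can be sketched in a few lines of Python. This is only an illustration of the idea, not Netflix's code: the function name, the node count, and the account ids are all hypothetical.

```python
def partition_for(account_id: int, num_nodes: int) -> int:
    """Return the index of the stateful node that owns this account (simple mod N)."""
    return account_id % num_nodes


NUM_NODES = 4  # assumed value of N for this example
ACCOUNT_IDS = [1001, 1002, 1003, 1004, 1005]  # made-up member account ids

# Group accounts by the partition (node slot) they map to.
assignments = {}
for account_id in ACCOUNT_IDS:
    slot = partition_for(account_id, NUM_NODES)
    assignments.setdefault(slot, []).append(account_id)

for slot in sorted(assignments):
    print(slot, assignments[slot])
```

Because the mapping depends only on the account id and N, any stateless caller can compute which node to ask without a lookup table. The trade-off is that each account has exactly one owning node.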
We started with a stateful architecture design that favored consistency over availability in the face of network partitions (for background, see the CAP theorem). At that time, we thought that accurate data was better than stale or no data. Also, we were pioneering running Cassandra and Memcached in the cloud, so starting with a stateful solution allowed us to mitigate the risk of failure for those components. The biggest downside of this approach was that failure of a single stateful node would prevent 1/Nth of the member base from writing to or reading from their viewing history. via Netflix
Do not use the same back-end data store across microservices. You want the team for each microservice to choose the database that best suits the service. Moreover, with a single data store it’s too easy for microservices written by different teams to share database structures, perhaps in the name of reducing duplication of work. You end up with the situation where if one team updates a database structure, other services that also use that structure have to be changed too.
Breaking apart the data can make data management more complicated, because the separate storage systems can more easily get out of sync or become inconsistent, and foreign keys can change unexpectedly. You need to add a tool that performs master data management (MDM) by operating in the background to find and fix inconsistencies. For example, it might examine every database that stores subscriber IDs, to verify that the same IDs exist in all of them (there aren’t missing or extra IDs in any one database). You can write your own tool or buy one. Many commercial relational database management systems (RDBMSs) do these kinds of checks, but they usually impose too many requirements for coupling, and so don’t scale. via nginx
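The subscriber ID check described above could look roughly like the following Python sketch. It covers only the "find" half of the job, and it assumes each service's IDs have already been exported into an in-memory set. The store names, the IDs, and the `find_id_inconsistencies` helper are hypothetical, not part of any real MDM product.

```python
def find_id_inconsistencies(stores):
    """Map each store name to the sorted IDs it lacks relative to all stores combined.

    An ID that is "extra" in one store shows up here as missing from the others.
    Stores that hold every known ID are left out of the result.
    """
    all_ids = set().union(*stores.values())
    report = {}
    for name, ids in stores.items():
        missing = all_ids - ids
        if missing:
            report[name] = sorted(missing)
    return report


# Made-up snapshots of the subscriber IDs held by three separate services.
STORES = {
    "billing": {1, 2, 3},
    "profiles": {1, 2, 3, 4},
    "recommendations": {1, 3, 4},
}

report = find_id_inconsistencies(STORES)
print(report)
```

A real background job would also have to decide which store is authoritative before fixing anything, which this sketch deliberately leaves open.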
Swipely - Data Processing Architecture
AWS Data Pipeline serves an integral role in Swipely’s new data processing architecture, coordinating the processing and transformation of data between different compute and storage services. The company amasses all user actions, payment events, and external data inputs as facts in Amazon Relational Database Service (Amazon RDS) instances. Swipely engineers define the data transformations that map these raw facts to sophisticated analytics that can be efficiently queried to render on a web page or view on a mobile device. The resulting analytics documents are indexed and stored in another Amazon RDS instance.
AWS Data Pipeline rebuilds the analytics documents from the facts every night by creating Amazon Elastic MapReduce (Amazon EMR) clusters, executing activities on the clusters according to Swipely’s data transformation definitions, and then shutting the clusters down once the activities complete. Swipely only incurs resource usage costs for the duration required to complete the activities. The data pipeline architecture allows Swipely to mash up historical payments data with social media and user actions, providing its clients with integrated, insightful reports on demand. Moreover, Swipely can update its analytics to enable new reports and features by simply updating the data transformation definitions, as AWS Data Pipeline will rebuild all the historical data dependencies overnight.
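The idea of rebuilding every analytics document from raw facts using a set of transformation definitions can be illustrated with a small Python sketch. It does not use AWS Data Pipeline or Amazon EMR and is not Swipely's code. The fact fields, the transformation names, and the `rebuild_documents` function are assumptions chosen only to show why changing a definition is enough to change every report.

```python
from collections import defaultdict

# Hypothetical raw facts; the field names are assumptions for this example.
FACTS = [
    {"merchant": "cafe", "type": "payment", "amount": 12.50},
    {"merchant": "cafe", "type": "payment", "amount": 7.50},
    {"merchant": "cafe", "type": "social_mention", "amount": 0},
    {"merchant": "deli", "type": "payment", "amount": 20.00},
]


def total_revenue(facts):
    """Sum the amounts of all payment facts."""
    return sum(f["amount"] for f in facts if f["type"] == "payment")


def mention_count(facts):
    """Count social media mention facts."""
    return sum(1 for f in facts if f["type"] == "social_mention")


# Transformation definitions: analytics field name -> function over a merchant's facts.
TRANSFORMATIONS = {"total_revenue": total_revenue, "mention_count": mention_count}


def rebuild_documents(facts, transformations):
    """Rebuild one analytics document per merchant from scratch."""
    by_merchant = defaultdict(list)
    for fact in facts:
        by_merchant[fact["merchant"]].append(fact)
    return {
        merchant: {name: fn(merchant_facts) for name, fn in transformations.items()}
        for merchant, merchant_facts in by_merchant.items()
    }


documents = rebuild_documents(FACTS, TRANSFORMATIONS)
print(documents)
```

Since the documents are derived entirely from the facts, adding an entry to `TRANSFORMATIONS` and running the rebuild again backfills the new field across all historical data, which mirrors the overnight rebuild described above.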
This article is compiled by Banjog.