Robinhood hits the bullseye with AWS Data Lake

Robinhood always knew they had to keep a close eye on the scalability of their application. With almost one million people on the waiting list even before launch, they were fully aware that despite starting off as a monolithic application, they would eventually have to turn to microservices as the business expanded.

And expanded, they did. In the three years following its inception in 2013, Robinhood experienced exponential growth, ingesting and processing approximately 10 terabytes of data every day while overseeing more than 3 billion dollars in transactions.

Most fintech start-ups find it easier to lay the groundwork by deploying monoliths. This was especially true for Robinhood, who had begun their journey with a paper-thin DevOps team of only two developers. However, as the company grew in stature, so did their team.

Robinhood's Early Challenges

By 2016, they started to realize the limitations of their tightly coupled storage and compute architecture. They were using Amazon Elasticsearch Service (now Amazon OpenSearch Service) and Amazon Redshift as their analytics and data warehouse solutions. As the app's data grew into the petabyte range, this early architecture struggled to cope. So instead of spending more money scaling up a system that was quickly becoming outdated, Robinhood opted to future-proof themselves by building an environment that could expand in tandem with the company’s rapid growth.

Scalability, however, was not the only challenge they faced. The Robinhood team was also scrambling to organize data that was scattered across systems, because there was no unified interface for the different types of data being processed and queried. The engineers were keen on developing a system that was friendlier to non-technical employees, particularly the data science team.

After some extensive research, which included drawing inspiration from the modern microservices architectures of Netflix and Uber, a consensus was reached among their engineers. The answer was clear: in order to function with minimal effort and operational overhead, they needed to leverage a managed cloud solution in the form of an AWS Data Lake.

Services used in the AWS Data Lake

Amazon Managed Streaming for Apache Kafka (MSK) - Ingestion

Robinhood ingests terabytes of data every day. The data enters the platform, via streams or in batches, through Amazon Managed Streaming for Apache Kafka (MSK) at the ingestion layer. Amazon MSK is a fully managed Kafka service that securely ingests data in real time with minimal operational overhead.
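
Robinhood has not published its producer code, but a minimal sketch of how an application event might be published to an MSK topic could look like this. The broker address, topic name, and event fields are hypothetical, and kafka-python is just one of several compatible client libraries:

```python
# Hypothetical sketch: publishing an in-app event to a topic on an
# Amazon MSK cluster. Broker address, topic, and payload are placeholders.
import json
from kafka import KafkaProducer  # pip install kafka-python

producer = KafkaProducer(
    # MSK exposes a list of bootstrap brokers; TLS is the typical transport.
    bootstrap_servers=["b-1.example-msk.amazonaws.com:9094"],
    security_protocol="SSL",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

# Publish one event; the brokers replicate it and retain it until the
# topic's retention period expires.
producer.send("app-events", {"user_id": 42, "action": "buy", "symbol": "AAPL"})
producer.flush()
```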

Amazon S3 - General storage

Data older than the retention period at the ingestion layer is persisted to Amazon S3 using the open-source Kafka Connect framework. Given the sheer volume of data being ingested, it was imperative that Robinhood settle on a storage solution offering high resiliency along with cost-efficiency and performance, all of which S3 provided. Amazon S3 Glacier was used as cold storage to archive source data stores.
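
Kafka Connect sinks are configured rather than coded. Below is a hedged sketch, assuming the open-source Confluent S3 sink connector and the standard Kafka Connect REST API; the connector name, worker address, topic, bucket, and flush size are all placeholders, not Robinhood's actual setup. The boto3 lifecycle rule at the end shows one way aged raw objects can be transitioned to S3 Glacier:

```python
import boto3
import requests  # pip install requests

# Register an S3 sink with the Kafka Connect REST API (worker address is
# a hypothetical internal endpoint).
connector = {
    "name": "s3-sink-app-events",
    "config": {
        "connector.class": "io.confluent.connect.s3.S3SinkConnector",
        "tasks.max": "2",
        "topics": "app-events",
        "s3.bucket.name": "example-lake-raw",
        "s3.region": "us-east-1",
        "storage.class": "io.confluent.connect.s3.storage.S3Storage",
        "format.class": "io.confluent.connect.s3.format.json.JsonFormat",
        "flush.size": "10000",  # records written per S3 object
    },
}
requests.post("http://connect.example.internal:8083/connectors", json=connector).raise_for_status()

# Separately, a bucket lifecycle rule ages raw data into S3 Glacier.
s3 = boto3.client("s3")
s3.put_bucket_lifecycle_configuration(
    Bucket="example-lake-raw",
    LifecycleConfiguration={
        "Rules": [{
            "ID": "archive-raw-to-glacier",
            "Filter": {"Prefix": "topics/app-events/"},
            "Status": "Enabled",
            "Transitions": [{"Days": 90, "StorageClass": "GLACIER"}],
        }]
    },
)
```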

Amazon Redshift - Storage and analysis

Robinhood used Amazon Redshift's powerful analytics capabilities as a business intelligence tool to draw insights from derived datasets. Redshift is more than just a data warehouse: it can also analyze both structured and semi-structured data across data lakes using SQL and machine learning.
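
As an illustration only (not Robinhood's actual queries), here is how a BI-style aggregation over a derived dataset could be issued through the Redshift Data API; the cluster identifier, database, user, and table names are assumptions:

```python
# Hypothetical sketch: running an aggregation on a derived dataset via
# the Redshift Data API. All identifiers below are placeholders.
import boto3

client = boto3.client("redshift-data", region_name="us-east-1")

response = client.execute_statement(
    ClusterIdentifier="example-analytics-cluster",
    Database="analytics",
    DbUser="bi_user",
    Sql="""
        SELECT symbol, COUNT(*) AS trades, SUM(amount) AS volume
        FROM derived.trades
        WHERE trade_date >= DATEADD(day, -7, CURRENT_DATE)
        GROUP BY symbol
        ORDER BY volume DESC
        LIMIT 10;
    """,
)
# The call is asynchronous; poll describe_statement / get_statement_result
# with this id to fetch the rows.
print("statement id:", response["Id"])
```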

AWS Glue - Processing

One of the biggest challenges that Robinhood faced was processing the large bundles of raw datasets produced by in-app events and load balancer logs. A large chunk of these datasets was usually irrelevant, so Robinhood’s immediate task was to separate the small portion of meaningful data from the rest. AWS Glue was used as a distributed data processing tool to process and split these datasets into smaller sets. All relevant data was then persisted to S3, where Glue crawlers and the Glue Data Catalog (metastore) generated schemas and made the datasets discoverable.
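
A minimal sketch of such a Glue ETL job follows; the catalog database, table, field names, and S3 path are hypothetical, and this is illustrative rather than Robinhood's actual script:

```python
# Hypothetical Glue ETL sketch: filter meaningful records out of raw logs
# and write them back to S3 in a columnar format for querying.
from awsglue.context import GlueContext
from awsglue.transforms import Filter
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())

# Read raw log data that a Glue crawler has already cataloged.
raw = glue_context.create_dynamic_frame.from_catalog(
    database="raw_logs", table_name="app_events"
)

# Keep only the small slice of relevant records (e.g. real user actions,
# not health checks or bot traffic).
relevant = Filter.apply(frame=raw, f=lambda row: row["event_type"] == "user_action")

# Persist the filtered dataset to S3 as Parquet for downstream discovery.
glue_context.write_dynamic_frame.from_options(
    frame=relevant,
    connection_type="s3",
    connection_options={"path": "s3://example-lake/processed/app_events/"},
    format="parquet",
)
```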

Amazon Athena - Querying

Amazon Athena offered everything Robinhood wanted in an interactive querying service, especially since they primarily used Presto, the engine Athena is built on. The previous architecture lacked a unified platform for querying multiple types of datasets, which posed a huge problem for their team of data scientists. With Athena's out-of-the-box integration with the AWS Glue Data Catalog, they could seamlessly query the processed data in Amazon S3. Aggregated datasets were then sent to Redshift for visualization.
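
As a sketch of that workflow, a query against the Glue-cataloged data in S3 can be submitted with boto3's Athena client; the database, table, and results bucket below are assumptions for illustration:

```python
# Hypothetical sketch: issuing an Athena query over Glue-cataloged data.
# Database, table, and output bucket names are placeholders.
import boto3

athena = boto3.client("athena", region_name="us-east-1")

response = athena.start_query_execution(
    QueryString="""
        SELECT event_type, COUNT(*) AS events
        FROM processed_app_events
        GROUP BY event_type;
    """,
    # The Glue Data Catalog database that the crawlers populated.
    QueryExecutionContext={"Database": "data_lake"},
    # Athena writes result files to an S3 location you designate.
    ResultConfiguration={"OutputLocation": "s3://example-athena-results/"},
)
print("query execution id:", response["QueryExecutionId"])
```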