Case Study: Instagram
Instagram Clone
A simpler version of Instagram, where a user can share photos and follow other users. The "News Feed" for each user consists of the top photos of all the people the user follows.
Requirements and Goals of The System
Functional Requirements
- Users should be able to upload/download/view photos
- Users can perform searches based on photo/video titles.
- Users can follow other users
- The system should be able to generate and display a user's News Feed consisting of top photos from all the people the user follows.
Non Functional Requirements
- Highly Available
- Max 200ms latency for news feed generation
- Consistency can take a hit in the interest of availability; if a user doesn't see a photo for a while, that is acceptable
- Highly reliable: any uploaded photo or video should never be lost
Design Considerations
- Read heavy for photos
- Need efficient storage
- Low latency is expected when viewing photos
- Never lose photos after being uploaded
Capacity Estimation
- Assume 500M total users, with 1M daily active users.
- 2M new photos every day => 23 new photos every second
- Average photo file size => 200KB
- Total space required for 1 day of photos: 2M (pics) * 200KB (avg pic size) => 400GB
- 10 years: 400GB * 365 (days) * 10 (years) ~= 1425TB
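A quick arithmetic sketch of these estimates (plain Python; the GB/TB conversions mirror the rounding used above):

```python
# Back-of-the-envelope check of the photo-storage numbers above.
new_photos_per_day = 2_000_000
avg_photo_size_kb = 200

photos_per_second = new_photos_per_day / 86_400                  # ~23 photos/sec
daily_storage_gb = new_photos_per_day * avg_photo_size_kb / 1e6  # ~400 GB/day
ten_year_storage_tb = daily_storage_gb * 365 * 10 / 1024         # ~1425 TB over 10 years

print(round(photos_per_second), round(daily_storage_gb), round(ten_year_storage_tb))
```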
High level system design
We need to support two scenarios: one to upload photos and one to view/search photos. We need object storage to store the photos and a database for the metadata about them.
Database Schema
We need to store data about users, their uploaded photos, and people they follow.
Photo
PK - PhotoID: int
UserID: int
PhotoPath: varchar(256)
PhotoLat: int
PhotoLon: int
UserLat: int
UserLon: int
CreationDate: datetime
User
PK - UserID: int
Name: varchar(20)
email: varchar(32)
DateOfBirth: datetime
CreationDate: datetime
LastLogin: datetime
UserFollow
PK - UserID1: int
PK - UserID2: int
SQL would be straightforward, but scaling could be an issue. We can instead use a distributed key-value store to enjoy the benefits offered by NoSQL. We can store the pictures themselves in an object store such as S3.
We need to store the relationship between users and photos, as well as the list of people each user follows. We can use a wide-column datastore like Cassandra. For the UserPhoto table, the key would be UserID and the value would be the list of PhotoIDs the user owns, stored in different columns. The same applies to the UserFollow table.
Key-value stores generally maintain a certain number of replicas to offer reliability. In these stores, deletes are not applied instantly; data is retained for a certain number of days to support undeletion.
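As a rough illustration (not actual Cassandra code; a minimal in-memory sketch in Python, with the names made up for this example), the wide-column layout amounts to mapping a key to a growing list of values:

```python
# Minimal in-memory sketch of the wide-column idea: the key is a UserID and
# the "columns" are the PhotoIDs (or followed UserIDs) stored under it.
user_photo = {}   # UserID -> [PhotoID, ...]
user_follow = {}  # UserID -> [followed UserID, ...]

def add_photo(user_id: int, photo_id: int) -> None:
    user_photo.setdefault(user_id, []).append(photo_id)

def follow(follower_id: int, followee_id: int) -> None:
    user_follow.setdefault(follower_id, []).append(followee_id)

add_photo(42, 1001)
follow(42, 7)
print(user_photo[42], user_follow[42])  # [1001] [7]
```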
Data Size Estimation
Let's estimate how much data will be going into each table and how much total storage we will need for 10 years.
User
Assuming each int and datetime is four bytes, each row in the User table will be 68 bytes: UserID (4) + Name (20) + email (32) + DateOfBirth (4) + CreationDate (4) + LastLogin (4). If we have 500M users, we will need ~32GB.
Photo
With the same assumptions, each row will be 284 bytes (seven 4-byte fields plus the 256-byte PhotoPath). If 2M photos are uploaded every day, we will need ~0.5GB of storage per day; for 10 years, we will need ~1.88TB.
UserFollow
Each row in the UserFollow table will be 8 bytes. If we have 500M users, each following 500 other users on average, we will need 1.82TB of storage for the UserFollow table.
Total
In total we would need 32GB + 1.88TB + 1.82 TB ~= 3.7TB
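The same numbers as a small Python check (the per-row byte counts come from the schemas above; GB/TB are binary here):

```python
GB, TB = 2**30, 2**40

users = 500_000_000
photos_per_day = 2_000_000
days = 365 * 10

user_row = 4 + 20 + 32 + 4 + 4 + 4           # UserID, Name, email, DOB, CreationDate, LastLogin = 68 bytes
photo_row = 4 + 4 + 256 + 4 + 4 + 4 + 4 + 4  # seven 4-byte fields + 256-byte path = 284 bytes
follow_row = 4 + 4                           # two UserIDs = 8 bytes

user_total = users * user_row                     # ~32 GB
photo_total = photos_per_day * photo_row * days   # ~1.9 TB
follow_total = users * 500 * follow_row           # ~1.8 TB

print(round(user_total / GB), "GB",
      round(photo_total / TB, 2), "TB",
      round(follow_total / TB, 2), "TB",
      round((user_total + photo_total + follow_total) / TB, 2), "TB total")
```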
Component Design
Photo uploads can be slow as they have to go to disk, whereas reads will be faster, especially when served from cache.
Uploading users can consume all the available connections, since uploading is a slow process. Reads then cannot be served while the system is busy with write requests. Servers have a connection limit, so we should keep this in mind and leave room for reads.
Assume a connection limit of 500 at any time (no more than 500 concurrent uploads or reads). To handle this bottleneck, split reads and writes into separate services.
Reliability And Redundancy
Losing files is not an option for our service; therefore, we will store multiple copies of each file so that if one storage server dies we can retrieve the photo from another.
The same principle applies to other components of the system. If we want high availability, we need multiple replicas of each service running, so that if a few instances die, others are still available. Failover can happen automatically or require manual intervention. Through redundancy we remove single points of failure and provide backup or spare capacity when needed.
Data Sharding
Let's discuss different schemes for metadata sharding:
Partitioning based on UserID:
If we partition based on UserID, we will keep all photos of a user on the same shard. If one DB shard holds 1TB, we will need four shards to store 3.7TB. For better performance and scalability, assume we keep 10 shards.
We can find the shard number with UserID % 10 and store the data there. To uniquely identify any photo in our system, we can append the shard number to each PhotoID.
We can generate PhotoIDs using each shard's own auto-increment sequence; since we append the ShardID to each PhotoID, this makes it unique throughout the system.
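A minimal sketch of this scheme (Python; the 10-shard count and the way the shard number is appended to the PhotoID are the assumptions stated above):

```python
NUM_SHARDS = 10

# Per-shard auto-increment counters; in the real system each shard's DB
# would own its own sequence.
shard_sequences = [0] * NUM_SHARDS

def shard_for_user(user_id: int) -> int:
    return user_id % NUM_SHARDS

def new_photo_id(user_id: int) -> int:
    shard = shard_for_user(user_id)
    shard_sequences[shard] += 1
    # "Append" the shard number to the per-shard sequence so the PhotoID is
    # unique across the whole system: sequence 57 on shard 3 becomes 573.
    return shard_sequences[shard] * NUM_SHARDS + shard

print(new_photo_id(42))  # user 42 maps to shard 2, first photo -> PhotoID 12
```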
Some issues:
- Hot users:
  - the DB is unbalanced; some users have far more viewers than others
  - the DB is unbalanced; some users upload far more pictures than others
- What if we cannot store all of a user's pictures in one shard? If we distribute them across shards, will we get higher latencies?
- A single shard can be a bottleneck for availability (losing it makes that user's data completely unavailable) and latency (lots of users may request pictures from the hot users stored on it)
Partitioning based on PhotoID:
If we generate unique PhotoIDs first and then find the shard number through PhotoID % 10, the above problems are solved. We would not need to append the ShardID to the PhotoID, as the PhotoID is itself unique throughout the system.
We cannot use an auto-incrementing sequence in each shard, because we need to know the PhotoID before we can find the shard where it will be stored. We could instead have a separate database dedicated to generating IDs. If our PhotoID can fit into 64 bits, we can define a table containing only a 64-bit ID field; whenever we need to add a photo, we insert a new row in this table and take that row's ID as the PhotoID for the new photo.
We could run two such key-generating databases, one producing even IDs and the other odd IDs, with a round-robin load balancer in front to pick between them. This removes the single point of failure and protects against downtime. We can use the same approach for generating the other IDs in our system too.
Alternatively we can implement a key generation scheme similar to what was discussed in URL Shortening and TinyURL clone.
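Going back to the even/odd approach, here is a sketch (Python; the two in-memory counters stand in for the two dedicated ID databases, and the toggle stands in for the round-robin load balancer):

```python
from itertools import cycle

NUM_SHARDS = 10

def make_generator(start: int, step: int = 2):
    """Stand-in for one dedicated ID database handing out its own sequence."""
    current = start
    def next_id() -> int:
        nonlocal current
        value = current
        current += step
        return value
    return next_id

even_db = make_generator(0)             # generates 0, 2, 4, ...
odd_db = make_generator(1)              # generates 1, 3, 5, ...
round_robin = cycle([even_db, odd_db])  # stand-in for the load balancer

def new_photo_id() -> int:
    return next(round_robin)()

def shard_for_photo(photo_id: int) -> int:
    return photo_id % NUM_SHARDS

pid = new_photo_id()
print(pid, "-> shard", shard_for_photo(pid))
```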
Plan for growth
We can have a large number of logical partitions to accommodate future data growth, such that in the beginning multiple logical partitions reside on a single physical database server. We can use a separate database for each logical partition; whenever one server accumulates a lot of data, we can migrate some logical partitions from it to another server. We can maintain a config file (or a separate database) that maps our logical partitions to database servers; when we want to move a partition, we only have to update this mapping to announce the change.
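A minimal sketch of such a mapping (Python; the partition count and server names are made-up placeholders):

```python
NUM_LOGICAL_PARTITIONS = 1024  # assumption: far more logical partitions than physical servers

# Logical partition -> physical DB server; initially a few servers host many
# logical partitions each. Moving a partition is just an update to this map.
partition_map = {p: f"db-{p % 4}" for p in range(NUM_LOGICAL_PARTITIONS)}

def server_for_user(user_id: int) -> str:
    logical_partition = user_id % NUM_LOGICAL_PARTITIONS
    return partition_map[logical_partition]

# Migrating logical partition 7 to a new server only touches the map.
partition_map[7] = "db-9"
print(server_for_user(7))  # routes to db-9 after the "migration"
```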
Ranking and News Feed Generation
To create the News Feed for any given user, we need to fetch the latest, most popular and relevant photos of the people the user follows.
For simplicity, assume we need to fetch the top 100 photos for a user's News Feed.
The application server will first get the list of people the user follows and then fetch metadata for the latest 100 photos from each of them.
In the final step, the server will submit all these photos to our ranking algorithm, which will determine the top 100 photos (based on recency, likes, etc.) and return them to the user.
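A simplified sketch of this on-the-fly generation (Python; get_followees, get_latest_photos, and rank_score are hypothetical helpers standing in for the metadata queries and the ranking algorithm):

```python
from typing import Dict, List

FEED_SIZE = 100

def generate_news_feed(user_id: int,
                       get_followees,
                       get_latest_photos,
                       rank_score) -> List[Dict]:
    """Fetch candidates from every followee, rank them, return the top 100."""
    candidates: List[Dict] = []
    for followee_id in get_followees(user_id):
        # Latest 100 photos (metadata only) from each followed user.
        candidates.extend(get_latest_photos(followee_id, limit=FEED_SIZE))
    # Rank on recency, likes, etc. -- whatever rank_score encodes.
    candidates.sort(key=rank_score, reverse=True)
    return candidates[:FEED_SIZE]

# Example with tiny in-memory stand-ins for the helpers:
feed = generate_news_feed(
    42,
    get_followees=lambda uid: [1, 2],
    get_latest_photos=lambda uid, limit: [{"photo_id": uid * 10, "likes": uid}],
    rank_score=lambda photo: photo["likes"],
)
print(feed)
```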
To reduce latency, we may want to pre-generate the News Feed and store it in a separate table.
Pre generate News Feed
We can have dedicated servers that continuously generate users' News Feeds and store them in a UserNewsFeed table. Whenever a user needs the latest photos for their Feed, we simply query this table and return the results.
Whenever these servers need to generate the Feed for a user, they first query the UserNewsFeed table to find the last time the Feed was generated for that user, and then generate new Feed entries from that time onwards.
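A sketch of that incremental step (Python; user_news_feed and the helper functions are hypothetical stand-ins for the UserNewsFeed table and the queries behind it):

```python
import time

# UserID -> {"generated_at": epoch_seconds, "photo_ids": [...]}
user_news_feed = {}

def refresh_feed(user_id: int, get_photos_since, rank_and_trim) -> None:
    """Regenerate a user's feed, only looking at photos newer than the last run."""
    entry = user_news_feed.get(user_id, {"generated_at": 0, "photo_ids": []})
    new_photos = get_photos_since(user_id, entry["generated_at"])
    entry["photo_ids"] = rank_and_trim(entry["photo_ids"], new_photos)
    entry["generated_at"] = int(time.time())
    user_news_feed[user_id] = entry
```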
Approaches for sending News Feed contents to users
- Pull: Clients can pull the Feed from the server on a regular basis or manually whenever they need it.
  - Problem 1: New data will not be shown until the client pulls.
  - Problem 2: Empty responses when nothing new has happened can waste a lot of resources.
- Push: The server pushes data to users as soon as it becomes available. Clients can maintain a long poll with the server. For users who follow a lot of people, and for celebrities with millions of followers, the server has to push data quite frequently.
- Hybrid: Move users who follow a high number of people to a pull-based model and only push data to users who follow a few hundred people. We could also cap pushes to all users at a certain frequency, letting users with a lot of follows/updates pull data regularly.
News Feed Creation With Shared Data
The Feed needs the latest photos from all the people a user follows. For this, we need a mechanism to sort photos by their creation time. We can make this efficient by making the creation time part of the PhotoID, so that we can take advantage of the primary index to sort quickly.
We can use epoch time for this: the PhotoID would then have two parts, the first representing the epoch seconds and the second an auto-incrementing sequence.
To store the number of seconds for the next 50 years, assuming an epoch that starts now, we would need 86400 sec/day * 365 days/year * 50 years ~= 1.6 billion seconds => 31 bits.
On average we expect 23 new photos per second, so we can use 9 bits for the auto-incrementing sequence: every second we can store 2^9 = 512 new photos. We reset the sequence every second.
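A sketch of composing such a PhotoID (Python; the 31/9-bit split follows the estimate above, and the custom epoch start value is an assumption):

```python
import time

EPOCH_START = 1_700_000_000  # assumed custom epoch (seconds); any fixed start works
SEQUENCE_BITS = 9            # 2^9 = 512 photos per second

_sequence = 0
_last_second = -1

def new_photo_id() -> int:
    """PhotoID = (seconds since custom epoch) << 9 | per-second sequence."""
    global _sequence, _last_second
    now = int(time.time()) - EPOCH_START
    if now != _last_second:   # reset the sequence every second
        _sequence = 0
        _last_second = now
    else:                     # a real system would guard against >512 IDs/sec
        _sequence += 1
    return (now << SEQUENCE_BITS) | _sequence

def creation_second(photo_id: int) -> int:
    """Recover the creation time (seconds since the custom epoch) for sorting."""
    return photo_id >> SEQUENCE_BITS

pid = new_photo_id()
print(pid, creation_second(pid))
```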
Cache and load balancing
The service is massive; we should push content closer to users geographically using CDNs.
We can introduce a cache in front of the metadata servers to hold hot database rows, for example Memcache, checked before hitting the db. We can use an LRU policy for eviction; under this policy, we discard the least recently viewed row first.
Using the 80-20 rule (20% of daily photos generating 80% of the read traffic), we can assume that certain photos are so popular that the majority of people view them; caching that 20% of photos and their metadata can make the system noticeably faster.
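A minimal LRU cache sketch (plain Python with OrderedDict; in the real system this role would be played by Memcache or a similar store):

```python
from collections import OrderedDict

class LRUCache:
    """Tiny LRU cache: evicts the least recently viewed entry when full."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.entries: OrderedDict = OrderedDict()

    def get(self, key):
        if key not in self.entries:
            return None
        self.entries.move_to_end(key)         # mark as most recently viewed
        return self.entries[key]

    def put(self, key, value) -> None:
        self.entries[key] = value
        self.entries.move_to_end(key)
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # drop the least recently viewed

# Example: cache hot photo metadata rows before hitting the database.
cache = LRUCache(capacity=2)
cache.put("photo:1", {"likes": 10})
cache.put("photo:2", {"likes": 3})
cache.get("photo:1")
cache.put("photo:3", {"likes": 7})  # evicts photo:2
print(cache.get("photo:2"))         # None
```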