
Twitter Clone

Requirements

Functional

  • Users post new tweets
  • Users follow other users
  • Users can mark tweets as favorites
  • Display timeline of top tweets from all the people the user follows
  • Tweets can contain photos and videos

Non-functional

  • Highly available
  • Timeline generation within 200ms
  • Consistency can be relaxed; it is acceptable if it takes a while for a user to see a new tweet

Extended

  • Search for tweets
  • Reply to a tweet
  • Hot topics
  • Tag other users
  • Tweet notification
  • Who-to-follow suggestions
  • Moments

Capacity Estimation

  • Assume 1 billion users with 200 million daily active users.
  • Assume 100M new tweets daily
  • Assume each user follows 200 people on average

How many favorites per day?

If, on average, each user favorites five tweets per day: 200M users * 5 favorites = 1 billion favorites per day

Total tweet-views?

If, on average, a user visits their timeline twice a day and visits five other people's pages, with about 20 tweets shown per visit:

  • 200M * ((2 + 5) * 20 tweets) = 28 billion tweet-views per day

Storage Estimates

Each tweet has 140 characters and we need 2 bytes per character without compression. Adding 30 bytes of metadata (ID, timestamp, UserID, etc.), the total storage becomes:

  • 100M * (140 * 2 + 30) bytes ≈ 30GB per day
  • In five years => 30GB * 365 * 5 ≈ 55TB
  • We also need to account for users' data, follows, favorites, etc.

Finally, consider that one in five tweets has a photo and one in ten has a video. Assume a photo takes 200KB on average and a video 2MB. This leads to:

  • 100M / 5 photos * 200KB + 100M / 10 videos * 2MB ≈ 24TB per day

Bandwidth Estimate

Total ingress is 24TB per day, which translates to roughly 290MB per second.

With 28B tweet-views per day, assume we must show every photo but that a user watches only every third video they see. The total egress will be:

  • 28B * 280 bytes / 86400 seconds for text => ~93MB/s
  • 28B / 5 * 200KB / 86400 seconds for photos => ~13GB/s
  • 28B / 10 / 3 * 2MB / 86400 seconds for videos => ~22GB/s
  • Total ~ 35GB/s
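
As a sanity check, the estimates above can be reproduced with a short back-of-envelope script (a rough sketch using decimal units; every constant comes from the assumptions stated in this section):

# Back-of-envelope capacity estimates; constants are the assumptions above.
DAY = 86_400                         # seconds per day
DAU = 200_000_000                    # daily active users
NEW_TWEETS = 100_000_000             # new tweets per day
VIEWS = DAU * (2 + 5) * 20           # 2 timeline visits + 5 page visits, 20 tweets each

text_storage = NEW_TWEETS * (140 * 2 + 30)                                # text + metadata
media_storage = NEW_TWEETS // 5 * 200_000 + NEW_TWEETS // 10 * 2_000_000  # photos + videos

egress_text = VIEWS * 280 / DAY
egress_photo = VIEWS / 5 * 200_000 / DAY
egress_video = VIEWS / 10 / 3 * 2_000_000 / DAY

print(f"tweet-views/day:   {VIEWS / 1e9:.0f}B")             # ~28B
print(f"text storage/day:  {text_storage / 1e9:.0f}GB")     # ~30GB
print(f"media storage/day: {media_storage / 1e12:.0f}TB")   # ~24TB
print(f"egress text:       {egress_text / 1e6:.0f}MB/s")    # ~90MB/s
print(f"egress photos:     {egress_photo / 1e9:.0f}GB/s")   # ~13GB/s
print(f"egress videos:     {egress_video / 1e9:.0f}GB/s")   # ~22GB/s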

System API

We can have a REST API to expose the functionality of our service.

Add tweet

URL: /api/tweet Method: PUT

Parameters

  • api_dev_key (string, required):
    • Description: The API developer key of a registered account.
    • Purpose: Used to throttle users based on their allocated quota.
  • tweet_data (string, required):
    • Description: The text of the tweet.
    • Constraints: Typically up to 140 characters.
  • tweet_location (string, optional):
    • Description: The location (longitude, latitude) this Tweet refers to.
  • user_location (string, optional):
    • Description: The location (longitude, latitude) of the user adding the tweet.
  • media_ids (number[], optional):
    • Description: List of media IDs to be associated with the Tweet.
    • Note: All media (photos, videos, etc.) must be uploaded separately.

Returns:

  • A successful post returns the URL to access that tweet; otherwise, an appropriate HTTP error is returned.

Request example:

{
  "api_dev_key": "example_key",
  "tweet_data": "This is a new tweet!",
  "tweet_location": "40.7128,-74.0060",
  "user_location": "40.7128,-74.0060",
  "media_ids": [12345, 67890]
}

Response example:

Success (200 OK)
{
  "url": "https://api.example.com/tweet/1234567890"
}

Invalid request

Error (400 Bad Request)
{
  "error": "api_dev_key and tweet_data are required."
}

Invalid api_dev_key

Error (401 Unauthorized)
{
  "error": "Invalid api_dev_key."
}
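
For illustration, a client call to the add-tweet endpoint above might look like the following sketch (the host name is a placeholder and the requests library is assumed; this is not a prescribed client):

import requests

# Hypothetical client for the add-tweet endpoint described above.
payload = {
    "api_dev_key": "example_key",          # placeholder developer key
    "tweet_data": "This is a new tweet!",
    "tweet_location": "40.7128,-74.0060",
    "user_location": "40.7128,-74.0060",
    "media_ids": [12345, 67890],           # media uploaded separately beforehand
}

resp = requests.put("https://api.example.com/api/tweet", json=payload, timeout=5)
if resp.status_code == 200:
    print("Tweet created at:", resp.json()["url"])
else:
    print("Error:", resp.status_code, resp.json().get("error"))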

High Level System Design

[High-level design diagram]

  • We need to store 100M / 86400 ≈ 1150 tweets per second and read 28 billion / 86400 ≈ 325K tweets per second. This looks like a read-heavy system.
  • We need multiple application servers to serve all these requests, with load balancers in front of them for traffic distribution.
  • We need an efficient DB that can store all the new tweets and that can support a huge number of reads.
  • We need some file storage for photos and videos

On average, the system will receive around 1150 new tweets and 325K read requests per second. This traffic will be distributed unevenly throughout the day; we should keep this in mind while designing the architecture of our system.

DB Schema

We need to store data about users, their tweets, their favorite tweets, and people they follow.

Tweet

PK TweetID: int
   UserID: int
   Content: varchar(140)
   TweetLat: int
   TweetLon: int
   UserLat: int
   UserLon: int
   CreationDate: datetime
   NumFavorites: int

User

PK UserID: int
   Name: varchar(20)
   Email: varchar(32)
   DOB: datetime
   CreationDate: datetime
   LastLogin: datetime

UserFollow

PK UserID1: int
PK UserID2: int

Favorite

PK TweetID: int
PK UserID: int
   CreationDate: datetime

We can use the same considerations we used for the Instagram Clone to figure out which DB to pick.
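
As an illustration only, the schema above could be expressed as relational DDL (SQLite is used here purely so the snippet is runnable; the actual DB choice is discussed above):

import sqlite3

# The four tables from the schema above, expressed as DDL against SQLite.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE User (
    UserID       INTEGER PRIMARY KEY,
    Name         VARCHAR(20),
    Email        VARCHAR(32),
    DOB          DATETIME,
    CreationDate DATETIME,
    LastLogin    DATETIME
);
CREATE TABLE Tweet (
    TweetID      INTEGER PRIMARY KEY,
    UserID       INTEGER REFERENCES User(UserID),
    Content      VARCHAR(140),
    TweetLat     INTEGER,
    TweetLon     INTEGER,
    UserLat      INTEGER,
    UserLon      INTEGER,
    CreationDate DATETIME,
    NumFavorites INTEGER
);
CREATE TABLE UserFollow (
    UserID1 INTEGER,
    UserID2 INTEGER,
    PRIMARY KEY (UserID1, UserID2)
);
CREATE TABLE Favorite (
    TweetID      INTEGER,
    UserID       INTEGER,
    CreationDate DATETIME,
    PRIMARY KEY (TweetID, UserID)
);
""")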

Data Sharding

We need to distribute tweets across multiple machines to maintain good read/write performance.

UserID based Sharding

We could store all the data of a user on one server. We could pass the UserID to a hash function that maps each user to a DB server storing all of the user's tweets, favorites, follows, etc. This approach has a couple of issues:

  • Hot users are queried heavily, concentrating load on the servers that hold them.
  • Over time, some users can end up storing many more tweets or having far more follows than others. Maintaining a uniform distribution of growing user data can be difficult.

To recover from these situations, we either have to repartition/redistribute our data or use consistent hashing, as in the sketch below.
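
A minimal sketch of the two placement strategies, assuming simple in-memory helpers (the class and function names are illustrative): plain modulo hashing remaps almost every user when a shard is added, while a consistent-hashing ring only moves the keys adjacent to the changed shard.

import bisect
import hashlib

def shard_by_modulo(user_id: int, num_shards: int) -> int:
    # Naive placement: hash the UserID and take it modulo the shard count.
    digest = hashlib.md5(str(user_id).encode()).hexdigest()
    return int(digest, 16) % num_shards

class ConsistentHashRing:
    # Each shard owns many points ("virtual nodes") on a hash ring; a user is
    # assigned to the first shard point at or after the hash of their UserID.
    def __init__(self, shards, vnodes=100):
        self.ring = sorted(
            (self._hash(f"{shard}-{v}"), shard) for shard in shards for v in range(vnodes)
        )
        self.keys = [h for h, _ in self.ring]

    @staticmethod
    def _hash(value: str) -> int:
        return int(hashlib.md5(value.encode()).hexdigest(), 16)

    def shard_for(self, user_id: int) -> str:
        idx = bisect.bisect(self.keys, self._hash(str(user_id))) % len(self.ring)
        return self.ring[idx][1]

ring = ConsistentHashRing(["db-0", "db-1", "db-2", "db-3"])
print(shard_by_modulo(42, 4), ring.shard_for(42))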

TweetID-based sharding

Map each TweetID to a random server. To search for tweets, we have to query all servers, each of which returns a set of tweets. A centralized server can aggregate these results and return them to the user.

To generate a user's timeline:

  1. The app server finds all the people the user follows.
  2. The app server sends the query to all DB servers to find tweets from these people.
  3. Each DB server finds the tweets for each user, sorts them by recency, and returns the top tweets.
  4. The app server merges the results and sorts them again to return the top results to the user.

This solves the problem of hot users, but it requires querying all DB partitions to find a user's tweets, which can increase latency. We can improve performance by introducing a cache in front of the DB servers to store hot tweets. A scatter-gather sketch of this flow follows.
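
A rough scatter-gather sketch of steps 1-4, using in-memory dictionaries as stand-ins for the follow table and the DB shards (all names here are illustrative):

import heapq

def generate_timeline(user, follows, shards, top_n=20):
    # follows: user -> list of followed users
    # shards: list of dicts mapping user -> list of (timestamp, tweet_id),
    # i.e. in-memory stand-ins for the DB partitions.
    followees = follows.get(user, [])
    partial_results = []
    for shard in shards:                       # step 2: fan out to every shard
        tweets = []
        for followee in followees:             # step 3: each shard finds tweets
            tweets.extend(shard.get(followee, []))
        tweets.sort(reverse=True)              # newest first
        partial_results.append(tweets[:top_n])
    # step 4: merge the per-shard top lists and keep the overall newest tweets
    merged = heapq.merge(*partial_results, reverse=True)
    return [tweet_id for _, tweet_id in list(merged)[:top_n]]

shards = [{"alice": [(1700000300, 7), (1700000100, 3)]},
          {"bob": [(1700000200, 5)]}]
print(generate_timeline("me", {"me": ["alice", "bob"]}, shards))  # [7, 5, 3]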

Tweet-creation-time-based sharding

We could shard based on tweet creation time so that we can fetch the most recent tweets quickly, but that would create an uneven load distribution, since the server holding the most recent data would be queried far more than the others.

Combine TweetID and Tweet creation time

With this approach we can quickly find the latest tweets. Each TweetID must be universally unique in our system, and each must also contain a timestamp.

We can use epoch time for this. A TweetID will then have two parts: the epoch seconds and an auto-incrementing sequence.

To make a new TweetID, we can take the current epoch time and append an auto-incrementing number to it. We can figure out the shard number from this TweetID and store it there.

If our epoch started today and we want to store the number of seconds for the next 50 years, we would need 86400 sec/day * 365 days * 50 years ≈ 1.6 billion seconds, which fits in 31 bits.

Since on average we expect 1150 new tweets per second, we can allocate 17 bits to store the auto-incremented sequence; this makes our TweetID 48 bits long.

  • Every second we can store 2^17 = 130K new tweets.
  • We can reset the auto-incrementing sequence every second.
  • For performance and fault tolerance, we can have two DB servers generate the auto-incrementing keys, one producing even numbers and one producing odd numbers.
  • If we make the TweetID 64 bits (8 bytes) long, we can store tweets for the next 100 years with millisecond granularity.

We would still have to query all the servers for timeline generation, but our reads and writes will be substantially quicker.

  • No secondary index => reduced write latency
  • While reading, there is no need to filter on creation time, as the primary key has the epoch time included in it.
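
A minimal sketch of this 48-bit ID scheme (31 high bits of epoch seconds, 17 low bits of per-second sequence; the custom epoch value and class name are assumptions for illustration):

import threading
import time

CUSTOM_EPOCH = 1_700_000_000    # illustrative epoch start, in Unix seconds
SEQUENCE_BITS = 17              # 2^17 = ~130K IDs per second

class TweetIdGenerator:
    def __init__(self):
        self._lock = threading.Lock()
        self._last_second = -1
        self._sequence = 0

    def next_id(self) -> int:
        with self._lock:
            now = int(time.time()) - CUSTOM_EPOCH
            if now != self._last_second:        # reset the sequence every second
                self._last_second = now
                self._sequence = 0
            else:
                self._sequence += 1             # assumes < 2^17 tweets per second
            return (now << SEQUENCE_BITS) | self._sequence

def shard_for(tweet_id: int, num_shards: int) -> int:
    # The shard can be derived directly from the ID, e.g. by a simple modulo.
    return tweet_id % num_shards

gen = TweetIdGenerator()
tid = gen.next_id()
print(tid, "-> shard", shard_for(tid, 16))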

Cache

We can introduce a cache in front of the DB servers to cache hot tweets and users. We can use an off-the-shelf solution like Memcached that can store whole tweet objects.

Application servers can check the cache before hitting the DB to quickly fetch tweets. We can use clients' usage patterns to determine how many cache servers we need.

Cache replacement policy

When the cache is full, we could use an LRU policy to discard the least recently viewed tweets first.

More intelligent cache

Using the 80-20 rule, we can assume that 20% of tweets are responsible for 80% of read traffic, so we can try to cache that 20% of the daily read volume from each shard.

Cache latest data

We could try to cache the last three days of data, assuming that most users only go that far back in time.

If we are getting 100 million new tweets, or 30GB of data per day (without photos or videos), then storing three days of tweets requires less than 100GB.

This data can easily fit on one server, but we should replicate it to increase availability. We can use a similar design for photos and videos.

Our cache would be like a hash table where the key is the OwnerID and the value is a doubly linked list containing all the tweets from that user in the past three days. Since we want the most recent tweets first, we always insert new tweets at the head of the linked list, so older tweets end up toward the tail; we can remove from the tail to make more space.
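
A small in-process sketch of this structure (collections.deque plays the role of the doubly linked list, and eviction is simplified to a per-user cap rather than a three-day cutoff):

from collections import deque

class RecentTweetsCache:
    # OwnerID -> deque of (timestamp, tweet), newest at the head (left end).
    # deque is a doubly linked structure: O(1) insert at the head, O(1) pop at the tail.
    def __init__(self, max_per_user=1000):
        self.store = {}
        self.max_per_user = max_per_user

    def add_tweet(self, owner_id, timestamp, tweet):
        tweets = self.store.setdefault(owner_id, deque())
        tweets.appendleft((timestamp, tweet))   # newest first
        if len(tweets) > self.max_per_user:
            tweets.pop()                        # evict the oldest from the tail

    def recent(self, owner_id, n=20):
        return list(self.store.get(owner_id, deque()))[:n]

cache = RecentTweetsCache()
cache.add_tweet(owner_id=7, timestamp=1700000000, tweet="hello")
print(cache.recent(7))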

Timeline generation

Similar to Facebook's Newsfeed

Replication and Fault Tolerance

The system is read-heavy, so we can have multiple secondary DB servers for each DB partition. Secondaries will be used for reads only. All writes will go first to the primary server and will then be propagated to the secondary servers. If the primary server goes down, a secondary can take over.

Load Balancing

We can add LB at three places:

  • Between clients and application servers
  • Between application servers and db replication servers
  • Between aggregation servers and cache servers.

Initially we can use a simple Round Robin approach that distributes requests equally. Since this does not take server load into consideration, we can improve upon it by having the LB periodically query the status of each server and pick the server with the lowest load.
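
A toy comparison of the two policies (the server names and load numbers are made up; a real LB would poll health/load endpoints):

import itertools

servers = ["app-1", "app-2", "app-3"]

# Round Robin: rotate through servers regardless of their load.
rr = itertools.cycle(servers)
def pick_round_robin():
    return next(rr)

# Least loaded: pick the server currently reporting the fewest active requests.
current_load = {"app-1": 12, "app-2": 3, "app-3": 8}
def pick_least_loaded():
    return min(current_load, key=current_load.get)

print(pick_round_robin(), pick_least_loaded())   # app-1 app-2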

Monitoring

Some important metrics to collect could be:

  • New tweets per day/second, daily peak, etc.
  • Timeline delivery stats: how many tweets per day/second the service is delivering
  • Average latency seen by users when refreshing their timeline

By monitoring these counters we can figure out if we need more replication, load balancing, or caching.

Extended Requirements

How to serve feeds

We can get the latest tweets from the people a user follows and merge/sort them by time. We can use pagination to fetch and show tweets, only fetching the top N tweets from the people the user follows; this N could depend on the client's viewport. We could also cache the next set of top tweets to speed things up.
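
A sketch of cursor-style pagination over an already-merged feed (a plain list stands in for the merged timeline; the parameter names are illustrative):

def feed_page(timeline, page_size=20, max_id=None):
    # timeline: list of (tweet_id, payload) sorted newest-first, IDs decreasing.
    # max_id: the last tweet ID the client has already seen; None starts at the top.
    if max_id is not None:
        timeline = [t for t in timeline if t[0] < max_id]
    page = timeline[:page_size]
    next_cursor = page[-1][0] if page else None
    return page, next_cursor

timeline = [(105, "e"), (104, "d"), (103, "c"), (102, "b"), (101, "a")]
page1, cursor = feed_page(timeline, page_size=2)
page2, _ = feed_page(timeline, page_size=2, max_id=cursor)
print(page1, page2)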

Alternatively, we can pre-generate feeds, similar to the Ranking and Timeline Generation approach we used in the Instagram Clone.

Retweet

For each retweet, we can store the ID of the original Tweet in the new Tweet object and not store any content in the retweet object itself.

Trending Topics

Fetch the most frequently occurring hashtags or search queries in the last N seconds and keep updating them every M seconds. We can rank trending topics based on the frequency of tweets, search queries, retweets, or likes. We can give more weight to topics that are shown to more people.
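
A toy sliding-window counter for trending hashtags (an in-memory sketch; a real system would aggregate streaming counts across many servers):

import time
from collections import Counter, deque

class TrendingTopics:
    # Keep (timestamp, hashtag) events from the last `window` seconds and
    # count them on demand.
    def __init__(self, window=300):
        self.window = window
        self.events = deque()

    def record(self, hashtag, ts=None):
        self.events.append((ts if ts is not None else time.time(), hashtag))

    def top(self, n=10, now=None):
        now = now if now is not None else time.time()
        while self.events and self.events[0][0] < now - self.window:
            self.events.popleft()               # drop events outside the window
        return Counter(tag for _, tag in self.events).most_common(n)

topics = TrendingTopics(window=300)
for tag in ["#news", "#news", "#music"]:
    topics.record(tag)
print(topics.top(2))   # [('#news', 2), ('#music', 1)]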

Who to follow/suggestions

We can suggest friends of the people a user follows, giving preference to celebrities (people with many followers).

We could use an algorithm or ML model to shuffle and re-prioritize suggestions to maximize engagement. Useful signals include accounts with rapidly growing followings, common followers, location, shared interests, and more.

Moments

We could use ML to get the top news from different websites for the past one or two hours, figure out the related tweets, prioritize and categorize them (news, support, financial, entertainment, etc.), and then show these articles as trending topics in Moments.

Search

Search involves indexing, ranking, and retrieval of tweets. See Design Twitter Search.