Case Study: Pastebin - rFronteddu/general_wiki GitHub Wiki
Pastebin clone
Let's design a service where users can store plain text. Users of the service will enter a piece of text and get a randomly generated URL to access it.
Requirements and Goals of the System
Functional Requirements
- Users should be able to upload their data and get a unique URL to access it.
- Users will only be able to upload text.
- Data and links will expire automatically after a specific timespan; users can also specify an expiration time.
- Users should optionally be able to pick a custom alias for their paste
Non-functional Requirements
- The system should be highly reliable: any data uploaded should not be lost.
- The system should be highly available. If the service is down, users will not be able to access their pastes.
- Users should be able to access their pastes with minimum latency.
- Generated URLs should not be predictable
More requirements:
- Analytics? How many times a paste was accessed?
- REST API
Design considerations
- To prevent abuse, limit each paste to a maximum size of 10MB.
- To prevent abuse, also cap the length of custom URLs.
Capacity estimation and constraints
The service is read-heavy: there are more reads than new uploads. Assume a 5:1 ratio between reads and writes.
Traffic Estimate:
Assume 1 million new pastes per day; at a 5:1 ratio this gives 5 million reads per day.
- New pastes per second: 1M / (24h * 3600s) ~= 12 pastes per second
- Paste reads per second: 5M / (24h * 3600s) ~= 58 reads per second
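The traffic arithmetic above can be checked with a few lines of Python. This is only a sketch of the estimate; the variable names are illustrative and the inputs are the assumptions already stated in the text (1M pastes per day, 5:1 read/write ratio).

```python
SECONDS_PER_DAY = 24 * 3600  # 86,400 seconds

pastes_per_day = 1_000_000   # assumed daily writes, from the text
read_write_ratio = 5         # assumed 5 reads for every write
reads_per_day = pastes_per_day * read_write_ratio

writes_per_second = pastes_per_day / SECONDS_PER_DAY
reads_per_second = reads_per_day / SECONDS_PER_DAY

print(f"writes/s: {writes_per_second:.1f}")  # writes/s: 11.6
print(f"reads/s:  {reads_per_second:.1f}")   # reads/s:  57.9
```

Rounded up, that is roughly 12 writes and 58 reads per second.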
Storage Estimates
Assume max 10MB of data per paste. Uploaded text is usually much smaller, assume an average of 10KB.
- Storage per day= 1M (pastes) * 10KB = 10GB per day
- To store this data for 10 years we need 10GB * 365 * 10 ~= 36TB
- With 1M pastes per day, we will have 3.6B pastes in 10 years. We need a unique key for each. Using base64 encoding ([A-Z, a-z, 0-9, ., -]), six-character strings are enough: 64^6 ~= 68.7 billion unique strings
- At one byte per character, storing 3.6B keys takes 3.6B * 6 ~= 22GB
- This is negligible compared to the 36TB of paste data. To leave some margin, assume we never use more than 70% of capacity, which raises the total storage requirement to about 51TB (36TB / 0.7)
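As a rough check of the storage figures, the sketch below recomputes them from the stated assumptions (10KB average paste, 10 years of retention, a 64-character alphabet, 70% maximum utilization). The names are illustrative. Because the code does not round intermediate values, its totals come out slightly above the hand-rounded numbers in the list: about 36.5TB of content and about 52TB provisioned.

```python
import math

pastes_per_day = 1_000_000
avg_paste_bytes = 10 * 1000      # 10 KB average, decimal units
days = 365 * 10                  # 10 years of retention

total_pastes = pastes_per_day * days             # 3.65 billion pastes
content_bytes = total_pastes * avg_paste_bytes   # about 36.5 TB

alphabet_size = 64
# Smallest key length whose key space covers every paste.
key_length = math.ceil(math.log(total_pastes, alphabet_size))
key_space = alphabet_size ** key_length
key_bytes = total_pastes * key_length            # one byte per character

# Provision so that the data fills at most 70% of capacity.
provisioned_bytes = (content_bytes + key_bytes) / 0.70

print(f"key length: {key_length}, key space: {key_space:,}")
print(f"content: {content_bytes / 1e12:.1f} TB")
print(f"keys: {key_bytes / 1e9:.1f} GB")
print(f"provisioned: {provisioned_bytes / 1e12:.1f} TB")
```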
Bandwidth Estimates
- For write, we expect 12 pastes per second => 12 * 10KB ~= 120KB per second
- For read, we expect 58 requests per second => 58 * 10KB ~= 0.6MB per second
Memory Estimate
For caching, follow the 80-20 rule: 20% of the pastes generate 80% of the traffic, so we cache that hot 20%. With 5M read requests per day, we need 0.2 * 5M * 10KB ~= 10GB
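The bandwidth and cache figures follow from the same inputs. A minimal sketch, using the rounded per-second rates from the text and illustrative variable names:

```python
avg_paste_bytes = 10 * 1000   # 10 KB average paste, decimal units
writes_per_second = 12        # rounded rate from the traffic estimate
reads_per_second = 58
reads_per_day = 5_000_000

ingress_bytes_per_s = writes_per_second * avg_paste_bytes  # 120,000 B/s = 120 KB/s
egress_bytes_per_s = reads_per_second * avg_paste_bytes    # 580,000 B/s, about 0.6 MB/s

# Cache the hottest 20% of daily reads (integer arithmetic avoids float rounding).
cached_reads = reads_per_day * 20 // 100
cache_bytes = cached_reads * avg_paste_bytes               # 10 GB

print(ingress_bytes_per_s, egress_bytes_per_s, cache_bytes)
```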
System API
POST
POST /api/v1/paste
Headers:
Content-Type: application/json
Body:
{
"api_dev_key": "YOUR_API_DEV_KEY",
"paste_data": "This is the content of the paste.",
"custom_url": "optional-custom-url",
"user_name": "optional-username",
"paste_name": "optional-paste-name",
"expire_date": "optional-expire-date"
}
Success
{
"status": "success",
"message": "Paste created successfully",
"url": "https://pastebin.com/optional-custom-url"
}
Error
{
"status": "error",
"message": "Invalid API key"
}
Parameters
- api_dev_key (string, required): Your API developer key.
- paste_data (string, required): The content of the paste.
- custom_url (string, optional): A custom URL for the paste.
- user_name (string, optional): The username of the person submitting the paste.
- paste_name (string, optional): The name/title of the paste.
- expire_date (string, optional): The expiration date/time for the paste
Returns:
- URL or error
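Before storing anything, the server would likely validate the request body against the parameters above. The following is a hedged sketch, not the service's actual logic: `validate_create_request`, `valid_api_keys`, the 32-character alias cap, and every error message other than "Invalid API key" are assumptions made for illustration.

```python
MAX_PASTE_BYTES = 10 * 1000 * 1000  # 10 MB cap from the design considerations
MAX_CUSTOM_URL_LEN = 32             # hypothetical cap; the text gives no number


def error(message):
    """Build an error response in the same shape as the Error example."""
    return {"status": "error", "message": message}


def validate_create_request(body, valid_api_keys):
    """Return None if the request is acceptable, otherwise an error response."""
    api_key = body.get("api_dev_key")
    if not isinstance(api_key, str) or api_key not in valid_api_keys:
        return error("Invalid API key")

    data = body.get("paste_data")
    if not isinstance(data, str) or not data:
        return error("paste_data is required")
    if len(data.encode("utf-8")) > MAX_PASTE_BYTES:
        return error("paste_data exceeds the 10MB limit")

    custom_url = body.get("custom_url")
    if custom_url is not None and len(custom_url) > MAX_CUSTOM_URL_LEN:
        return error("custom_url is too long")
    return None


# Usage: a valid request passes, a request with an unknown key is rejected.
keys = {"YOUR_API_DEV_KEY"}
print(validate_create_request({"api_dev_key": "YOUR_API_DEV_KEY", "paste_data": "hi"}, keys))
print(validate_create_request({"api_dev_key": "wrong", "paste_data": "hi"}, keys))
```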
GET
GET /api/v1/paste
Headers:
Content-Type: application/json
api_dev_key: YOUR_API_DEV_KEY
api_paste_key: paste_key
Returns: textual data of the paste
DELETE
DELETE /api/v1/paste
Headers:
Content-Type: application/json
api_dev_key: YOUR_API_DEV_KEY
api_paste_key: paste_key
Returns: success or error
Database Design
- We need billions of records
- Small metadata
- Medium objects
- No relationship except if we want to store which user created what paste
- Read heavy
We can use two tables, one to store the pastes and one to store the users
PASTE
PK URLHash: varchar(16)
ContentKey: varchar(512)
ExpirationDate: datetime
UserID: int
CreationDate: datetime
USER
PK UserID: int
Name: varchar(20)
Email: varchar(32)
CreationDate: datetime
LastLogin: datetime
URLHash is the short key of the paste (the equivalent of the TinyURL hash), and ContentKey is the key of the object that stores the contents of the paste.
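To make the two tables concrete, here is a small sketch that creates them in an in-memory SQLite database. SQLite is only a stand-in chosen so the example runs anywhere; the text does not prescribe a database. The snake_case table and column names and the sample key `aB3xZ9` are illustrative.

```python
import sqlite3

# Note: SQLite accepts VARCHAR(n) but does not enforce the length.
SCHEMA = """
CREATE TABLE users (
    user_id        INTEGER PRIMARY KEY,
    name           VARCHAR(20),
    email          VARCHAR(32),
    creation_date  DATETIME,
    last_login     DATETIME
);
CREATE TABLE pastes (
    url_hash         VARCHAR(16) PRIMARY KEY,
    content_key      VARCHAR(512),
    expiration_date  DATETIME,
    user_id          INTEGER REFERENCES users(user_id),
    creation_date    DATETIME
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)

# Store the metadata for one paste; the text itself would live in object storage.
conn.execute(
    "INSERT INTO pastes (url_hash, content_key, creation_date) "
    "VALUES (?, ?, datetime('now'))",
    ("aB3xZ9", "pastes/aB3xZ9"),
)

# A read resolves the URL hash to the object key.
row = conn.execute(
    "SELECT content_key FROM pastes WHERE url_hash = ?", ("aB3xZ9",)
).fetchone()
print(row[0])  # pastes/aB3xZ9
```

Because `url_hash` is the primary key, inserting a second paste with the same hash fails, which is how a duplicate custom alias would surface.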
High Level Design
We need an application server that receives paste requests. We can separate metadata storage (DB) from object storage (S3) so that each can scale independently.
Component Design
Application Layer
- Processes incoming and outgoing requests.
- Write requests: upon receiving a write request, the server generates a six-character random string that serves as the key of the paste (unless the user has provided a custom key). The server then stores the contents of the paste in object storage and the key in the metadata database. If the insert succeeds, the server returns the key to the user; otherwise it returns an error (for example, when a user-provided key is a duplicate).
- Alternatively, we could use a key generation service, as in the TinyURL problem.
- Read requests: upon receiving a read request, the server checks the metadata for the key. If it is present, the server looks up the corresponding object key and retrieves the object from S3; otherwise it returns an error.
- Purging, partitioning, replication, caching, load balancing, and security are handled as in the TinyURL design.
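The write and read paths described above can be sketched in a few lines. This is an illustration under stated assumptions, not the actual service: two dictionaries stand in for the metadata database and S3, `create_paste`, `read_paste`, and `DuplicateKeyError` are hypothetical names, and retrying when a randomly generated key collides is an added assumption (the text only mentions returning an error for a duplicate user-provided key). Keys are drawn with Python's `secrets` module so they are not predictable.

```python
import secrets
import string

# 64-character alphabet matching the capacity estimate: A-Z, a-z, 0-9, '.', '-'
ALPHABET = string.ascii_letters + string.digits + ".-"
KEY_LENGTH = 6

metadata = {}      # stand-in for the metadata DB: paste key -> object key
object_store = {}  # stand-in for S3: object key -> paste text


class DuplicateKeyError(Exception):
    """Raised when a user-provided custom key is already taken."""


def generate_key():
    """Draw a random six-character key from a cryptographically secure source."""
    return "".join(secrets.choice(ALPHABET) for _ in range(KEY_LENGTH))


def create_paste(text, custom_key=None, max_attempts=5):
    if custom_key is not None:
        if custom_key in metadata:
            raise DuplicateKeyError(custom_key)
        key = custom_key
    else:
        # Assumption: retry a few times if a random key happens to collide.
        for _ in range(max_attempts):
            key = generate_key()
            if key not in metadata:
                break
        else:
            raise RuntimeError("could not find a free key")
    object_key = f"pastes/{key}"
    object_store[object_key] = text  # contents go to object storage
    metadata[key] = object_key       # key and object location go to the metadata DB
    return key


def read_paste(key):
    """Return the paste text, or None if the key is unknown."""
    object_key = metadata.get(key)
    if object_key is None:
        return None
    return object_store.get(object_key)


key = create_paste("This is the content of the paste.")
print(len(key), read_paste(key))
```

In a real deployment the existence check and the insert would have to be a single atomic operation (for example, relying on the primary-key constraint) so that two concurrent requests cannot claim the same key.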