Case Study: Tiny URL

Consider a URL shortening service like TinyURL, which provides short aliases that redirect to long URLs.

Requirements and goals of the system

Ask questions to clarify requirements.

Functional Requirements:

  1. Given a URL, the service must generate a shorter and unique alias of it. This is called a short link.
  2. When users access a short link, the service should redirect them to the original link.
  3. Users should optionally be able to pick a custom short link for their URL
  4. Links will expire after a standard default timespan. Users should be able to specify the expiration time.

Non Functional Requirements:

  1. Highly available. Otherwise URL redirection fails.
  2. URL redirection with minimal latency
  3. Shortened links should not be predictable

Others

  1. Analytics; answer questions such as how many times a redirection happened
  2. Service also accessible through REST APIs by other services.

Capacity Estimation And Constraints

The system is read-heavy: there will be many more redirection requests (reads) than new URL shortenings (writes).

Traffic Estimate: Assume 500M new URL shortenings per month; with a 100:1 read/write ratio we get 50B redirections per month.

Traffic

  • With 500M new URLs per month, we have to create ~200 URLs per second
  • With 200 URLs per second and a 100:1 read/write ratio, we get ~20K reads/redirections per second

Storage

Assume we have to store each URL and its shortened alias for 5 years. With 500M new URLs per month we get 30 billion objects to store. If each stored object is ~500 bytes, we will need 15TB of total storage.

Bandwidth

  • For write (incoming data), if we expect 200 URL per second, we get 200*500 = 100KB per second
  • For read (outgoing): We expect 20K redirections, 20K * 500 bytes = 10MB per second

Memory for cache

If we want to cache hot URLs, we can use the 80-20 rule (20% of URLs generate 80% of the traffic). With 20K requests per second we get ~1.7B requests per day; caching 20% of those, at 500 bytes each, requires ~170GB of memory.
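These back-of-the-envelope numbers can be double-checked with a few lines of arithmetic; the snippet below is only a sanity check using the same assumptions stated above.

```python
# Back-of-the-envelope estimates (assumptions from above: 500M new URLs/month,
# 100:1 read/write ratio, ~500 bytes per stored object, 5-year retention).
new_urls_per_month = 500_000_000
seconds_per_month = 30 * 24 * 3600

writes_per_sec = new_urls_per_month / seconds_per_month   # ~193 -> ~200 URLs/s
reads_per_sec = writes_per_sec * 100                       # ~20K redirections/s

objects_stored = new_urls_per_month * 12 * 5               # 30B objects over 5 years
storage_bytes = objects_stored * 500                       # ~15 TB

write_bw = writes_per_sec * 500                            # ~100 KB/s incoming
read_bw = reads_per_sec * 500                              # ~10 MB/s outgoing

requests_per_day = reads_per_sec * 24 * 3600               # ~1.7B requests/day
cache_bytes = requests_per_day * 500 * 0.2                 # ~170 GB for the hot 20%

print(f"{writes_per_sec:.0f} writes/s, {reads_per_sec:.0f} reads/s")
print(f"{storage_bytes / 1e12:.0f} TB storage, {cache_bytes / 1e9:.0f} GB cache")
```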

System API

Once the requirements are finalized, let's define the system API to state what is expected from the system.

We use an API key to limit access to the service API and prevent abuse. For example, we can limit the number of creations/deletions allowed per second for each key.

createURL (api_key, original_url, custom_alias=None, user_name=None, expire_date=None)

Parameters

  • api_key (string): API key of the registered account that requests the use of the shortening service. Can be used to throttle users, set limits etc.
  • original_url (string): Original URL to be shortened
  • custom_alias (string): Custom key for the URL
  • user_name (string): Optional name to be used in encoding
  • expire_date (string): Optional expiration date for the shortened URL

Returns (string): shortened URL or an error code if something went wrong

deleteURL(api_key, url_key)

Parameters

  • api_key (string): API key of the registered account that requests the use of the shortening service. A user can only delete their own mappings.
  • url_key (string): Shortened URL to delete, must have been generated by the registered user.
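As a rough sketch of how these two calls might be implemented server-side (the in-memory url_store, the valid_api_keys set, and the placeholder key generation are illustrative assumptions, not part of the API contract above):

```python
import secrets

# Hypothetical in-memory stand-ins for the real key generator and URL store.
valid_api_keys = {"demo-key"}
url_store = {}   # short key -> {"url": ..., "owner": ..., "expires": ...}

def createURL(api_key, original_url, custom_alias=None,
              user_name=None, expire_date=None):
    # Authenticate (and, in a real service, throttle) the caller by API key.
    if api_key not in valid_api_keys:
        return "ERROR: invalid or rate-limited api_key"
    # Use the custom alias if provided, otherwise a placeholder generated key.
    key = custom_alias or secrets.token_urlsafe(6)[:6]
    if key in url_store:
        return "ERROR: alias already taken"
    url_store[key] = {"url": original_url, "owner": api_key, "expires": expire_date}
    return f"http://tiny.url/{key}"

def deleteURL(api_key, url_key):
    record = url_store.get(url_key)
    # A user can only delete mappings they created.
    if record is None or record["owner"] != api_key:
        return "ERROR: not found or not permitted"
    del url_store[url_key]
    return "OK"

print(createURL("demo-key", "https://example.com/some/very/long/path"))
```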

Database Design

We define the schema to clarify the data flow among the various components and to guide partitioning later.

  1. We need to store billions of records
  2. Each object is fairly small
  3. There are no relationships between records (other than storing which user created a URL)
  4. Service is read heavy

Schema

We need two tables, one for the URL mappings and one for the users.

URL

  • Hash: varchar(16), PK
  • OriginalURL: varchar(512)
  • CreationDate: datetime
  • ExpirationDate: datetime
  • UserID: int

User

  • UserID: int, PK
  • Name: varchar(20)
  • Email: varchar(32)
  • CreationDate: datetime
  • LastLogin: datetime

Since we anticipate billions of rows and we don't need relationships between objects, a NoSQL key-value store seems like a better choice.
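For illustration, the two record shapes might look like this in a key-value store (the field values are made up):

```python
# Records as they might be laid out in a key-value store; the short key
# itself acts as the lookup key for the URL record.
url_record = {
    "Hash": "a8Xb2Q",   # the short key (primary key)
    "OriginalURL": "https://example.com/some/very/long/path",
    "CreationDate": "2024-01-15T10:00:00Z",
    "ExpirationDate": "2026-01-15T10:00:00Z",
    "UserID": 42,
}

user_record = {
    "UserID": 42,
    "Name": "Alice",
    "Email": "alice@example.com",
    "CreationDate": "2023-06-01T09:30:00Z",
    "LastLogin": "2024-01-15T10:00:00Z",
}
```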

Basic System Design And Algorithm

We need to generate a short and unique key for a given URL. Let's consider 2 solutions:

Encoding actual URL

We can compute a hash (using MD5 or SHA-256) of the given URL and encode the hash for display. We can use a base36 ([a-z, 0-9]) or a base62 ([A-Z, a-z, 0-9]) encoding, and if we also add '-' and '.' we can use base64.

If we use base64:

  • a 6-letter key would give 64^6 ≈ 68B possible strings
  • an 8-letter key would give 64^8 ≈ 281T possible strings

Assume the 6-letter encoding is sufficient for our system.

If we use MD5 for the hash, it will produce a 128-bit hash value; after base64 encoding we get a string of more than 21 characters (each base64 character encodes 6 bits of the hash value). Since we only have space for 8 characters per short key, we could take the first 6 or 8 letters for the key. This could result in key duplication, in which case we can choose some other characters from the encoding string or swap some characters.

Some issues:

  • If multiple users enter the same URL, they will get the same shortened URL, which is not acceptable.
  • What if parts of the URL are URL-encoded? Two URLs that are identical except for their URL encoding would map to different short links even though they point to the same resource (think of accessing REST resources through the shortened URL).

Workarounds

  • We can append an increasing sequence number to each URL to make it unique, and then generate a hash of it (even though we don't need to store this sequence number, maintaining it will slow the system a bit and we will have to handle its overflow).
  • We could append the userID to the input URL; if the user is not signed in, we can ask them to choose a uniqueness key, and if there is still a conflict we keep generating keys until we get a unique one (see the sketch below).
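A minimal sketch of this hashing approach with the sequence-number workaround, assuming MD5, URL-safe base64 ('-' and '_' rather than '-' and '.'), and an in-memory set standing in for the key table:

```python
import base64
import hashlib

existing_keys = set()   # stand-in for a lookup against the URL table

def shorten(original_url):
    seq = 0
    while True:
        # Append an increasing sequence number so identical URLs hash differently.
        digest = hashlib.md5(f"{original_url}|{seq}".encode()).digest()
        # URL-safe base64, keep only the first 6 characters for the short key.
        key = base64.urlsafe_b64encode(digest).decode()[:6]
        if key not in existing_keys:
            existing_keys.add(key)
            return key
        seq += 1   # collision: retry with the next sequence number

print(shorten("https://example.com/some/very/long/path"))
```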

TinyURL Approach 1

Offline Key generation

Another approach is to have a standalone Key Generation Service (KGS) that generates random six-letter strings beforehand and stores them in a database. This approach is simple and fast; as soon as a key is used, we must mark it in the database to ensure it doesn't get used again. This structure needs to be locked to prevent multiple servers from grabbing the same keys.
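A sketch of how the KGS might hand out pre-generated keys exactly once, using a lock so that two app servers never receive the same key (the tiny in-memory pool stands in for the key database):

```python
import threading

class KeyGenerationService:
    """Hands out pre-generated six-letter keys exactly once."""

    def __init__(self, pregenerated_keys):
        self._unused = list(pregenerated_keys)   # stand-in for the key DB
        self._lock = threading.Lock()            # prevents two servers getting the same key

    def get_key(self):
        with self._lock:
            if not self._unused:
                raise RuntimeError("key space exhausted")
            key = self._unused.pop()
            # In the real service the key would now be marked as used in the DB.
            return key

kgs = KeyGenerationService(["aB3xQ9", "Zk82Lm", "Q1w2e3"])   # tiny illustrative pool
print(kgs.get_key())
```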

Assuming we want to generate all 68B unique six-letter keys, and we need one byte per character, we can store these keys in:

  • 6 (characters per key) * 68B (keys) = 412GB

  • Single point of failure: The KGS is a single point of failure. We can have a standby replica of the KGS; whenever the primary server dies, the standby server can take over to generate and provide keys.

  • Caching: Each app server can cache some keys to make things faster.

  • Key Lookup: Whenever we get a key lookup, we check the DB; if the key is present we issue an HTTP 302 Redirect, passing the stored URL in the Location field of the response. Otherwise, we issue an HTTP 404 Not Found or redirect back to the homepage (see the sketch after this list).

  • Custom Aliases: We should impose a size limit on custom aliases. Assume at most 16 character key.
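The key-lookup behavior can be sketched as a small handler; this assumes a Flask-style web framework and an in-memory dictionary standing in for the URL DB:

```python
from flask import Flask, redirect, abort

app = Flask(__name__)
url_store = {"a8Xb2Q": "https://example.com/landing"}   # stand-in for the URL DB

@app.route("/<short_key>")
def follow(short_key):
    original = url_store.get(short_key)
    if original is None:
        abort(404)                       # HTTP 404 Not Found (or redirect to the homepage)
    return redirect(original, code=302)  # stored URL goes in the Location header

# app.run() would start the server in a local test.
```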

Partitioning and Replication

To scale our DB, we need to partition it so that it can store information about billions of URLs.

  • Range-Based Partitioning: We could store URLs based on the first letter of the URL or of the hash key, but this could lead to unbalanced servers.
  • Hash-Based Partitioning: We can take a hash of the object and store it in the corresponding partition. The hash function will randomly distribute URLs across partitions. We can use consistent hashing to prevent overloaded partitions (sketched below).
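A minimal consistent-hashing sketch for picking the partition of a short key; the shard names are illustrative, and a production ring would also handle replication and rebalancing:

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Minimal consistent-hash ring mapping a short-URL key to a DB partition."""

    def __init__(self, nodes, vnodes=100):
        # Place several virtual points per node on the ring for smoother balance.
        self.ring = sorted(
            (self._hash(f"{node}#{i}"), node)
            for node in nodes
            for i in range(vnodes)
        )

    @staticmethod
    def _hash(key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def get_node(self, key):
        # The first ring point clockwise from the key's hash owns the key.
        idx = bisect.bisect(self.ring, (self._hash(key),)) % len(self.ring)
        return self.ring[idx][1]

ring = ConsistentHashRing(["url-db-1", "url-db-2", "url-db-3"])
print(ring.get_node("a8Xb2Q"))   # partition that stores this short key
```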

Caching

We can cache frequently accessed URLs using an off-the-shelf solution like Memcached, which can store full URLs with their respective hashes. Before hitting the DB, the application servers can quickly check whether the cache holds the desired URL (a cache-aside sketch follows the list below).

  • How much cache? If we use the 80-20 empirical rule, we can monitor usage patterns and cache ~170GB to cover 20% of the daily traffic. We can easily fit that into a single machine, or use a couple of smaller servers to store all these hot URLs.

  • We should use LFU as the cache eviction policy. We can use a hash map or a similar data structure to store the URLs, their hashes, and their access frequencies.

  • To make cache faster we can replicate the caching servers to distribute the load.

  • Cache Invalidation: Whenever there is a cache miss, we can update the cache and pass the new entry to all the cache replicas; each replica adds the new entry to its own cache.
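A cache-aside sketch of the read path described above; the dictionaries stand in for Memcached and the partitioned URL store:

```python
# Minimal cache-aside read path; both dicts are stand-ins for illustration.
cache = {}                                            # Memcached in production
url_db = {"a8Xb2Q": "https://example.com/landing"}    # partitioned URL store

def resolve(short_key):
    # 1. Check the cache first.
    original = cache.get(short_key)
    if original is not None:
        return original
    # 2. On a miss, fall back to the DB and populate the cache
    #    (and, per the note above, its replicas).
    original = url_db.get(short_key)
    if original is not None:
        cache[short_key] = original
    return original   # None -> the caller issues a 404
```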

TinyURL With Cache

Load Balancer

We can add a load balancer in three places:

  1. Between clients and application servers
  2. Between application servers and database servers
  3. Between application servers and cache servers

Initially we can use a simple round-robin (RR) approach that distributes incoming requests equally among backend servers. This also lets us stop sending traffic to servers that have died and keeps the load balanced (a sketch follows).
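A round-robin balancer is only a few lines; this sketch also skips servers that failed their last health check, matching the note above about dead servers (the server names and health map are illustrative):

```python
from itertools import cycle

backends = ["app-1", "app-2", "app-3"]                     # illustrative backend pool
healthy = {"app-1": True, "app-2": True, "app-3": True}    # updated by health checks
_rr = cycle(backends)

def pick_backend():
    # Plain round robin, skipping servers that failed their last health check.
    for _ in range(len(backends)):
        server = next(_rr)
        if healthy.get(server, False):
            return server
    raise RuntimeError("no healthy backend available")
```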

Purging or DB cleanup

  • We should do lazy cleanup to avoid putting needless pressure on our DB. Our service will only return non-expired results and will delete expired entries when it detects them (sketched below).
  • A separate cleanup service can also run periodically to remove expired links from our storage and cache. This service should be scheduled to run when traffic is expected to be low.
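A sketch of the lazy-expiration check: on lookup, an expired entry is deleted and treated as missing; the in-memory mapping stands in for the URL DB:

```python
from datetime import datetime, timezone

# Stand-in for the URL DB: short key -> (original URL, expiration time).
url_store = {
    "a8Xb2Q": ("https://example.com/landing",
               datetime(2026, 1, 1, tzinfo=timezone.utc)),
}

def lookup(short_key):
    entry = url_store.get(short_key)
    if entry is None:
        return None
    original, expires = entry
    if expires is not None and expires <= datetime.now(timezone.utc):
        del url_store[short_key]   # lazy purge: delete the expired entry on detection
        return None
    return original
```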

Telemetry

Some statistics are worth tracking, such as the country of the visitor and the date and time of access. We have to be careful not to slow down access to popular pages while recording them.

Security and Permissions

We can store the permission level (public/private) with each URL in the DB, and add another table that stores the UserIDs that have permission to see a specific URL.