Youtube or Netflix Clone - rFronteddu/general_wiki GitHub Wiki

Video Sharing Service Clone

Requirements and Goals of the System

Functional Requirements

  • Users should be able to:
    • upload/share/view videos
    • search videos based on titles
    • add and view comments on videos
  • The service should record video stats: likes, dislikes, total number of views, etc.

Non-functional Requirements

  • Highly reliable: uploaded videos should not be lost.
  • Highly available: consistency can take a hit in the interest of availability; if a user doesn't see a new video for a while, that is fine.
  • Users should have a real-time experience while watching videos and should not feel any lag.

Not in scope

  • Video recommendations, most popular videos, channels, subscriptions, watch later, favorites, etc.

Capacity Estimation and Constraints

  • Assume 1.5 billion total users, 800 million of whom are daily active users.
  • If on avg a user views five videos per day, the total video-views per second would be:
    • 800M * 5 / 86400 sec = 46K videos per second
  • Assume the upload-to-view ratio is 1:200, i.e., for every 200 videos viewed, one video is uploaded, giving us 230 videos uploaded per second:
    • 46K / 200 = 230 videos per second
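The traffic numbers above are easy to reproduce; a minimal back-of-envelope sketch using the same assumptions:

```python
# Back-of-envelope traffic estimates from the assumptions above.
DAU = 800_000_000          # daily active users
VIEWS_PER_USER = 5         # average videos viewed per user per day
SECONDS_PER_DAY = 86_400
UPLOAD_TO_VIEW = 200       # 1 video uploaded per 200 videos viewed

views_per_sec = DAU * VIEWS_PER_USER / SECONDS_PER_DAY
uploads_per_sec = views_per_sec / UPLOAD_TO_VIEW

print(f"{views_per_sec:,.0f} views/s")      # ~46K
print(f"{uploads_per_sec:,.0f} uploads/s")  # ~230
```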

Storage Estimates

Assume that 500 hours of video are uploaded every minute. If, on average, one minute of video needs 50MB of storage (videos are stored in multiple formats), the total storage needed for the videos uploaded in one minute would be:

  • 500 hours * 60 min * 50 MB = 1500 GB per minute (25GB per sec)

These estimates ignore video compression and replication.

Bandwidth Estimates

With 500 hours of video uploaded per minute, and assuming each minute of uploaded video consumes 10MB of ingest bandwidth, we would be getting 300GB of uploads every minute:

  • 500 h * 60 min * 10MB = 300GB per min (5GB per sec)

Assuming an upload-to-view ratio of 1:200, we would need 1TB per second of outgoing bandwidth (5GB/s × 200).
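The storage and bandwidth figures follow the same pattern; a quick sketch of the arithmetic (same assumptions as above):

```python
# Storage and bandwidth estimates from the assumptions above.
HOURS_UPLOADED_PER_MIN = 500
MB_PER_VIDEO_MIN_STORED = 50   # all encoded formats combined
MB_PER_VIDEO_MIN_INGEST = 10   # raw upload
UPLOAD_TO_VIEW = 200

minutes_uploaded = HOURS_UPLOADED_PER_MIN * 60                            # 30,000 video-minutes
storage_gb_per_min = minutes_uploaded * MB_PER_VIDEO_MIN_STORED / 1000    # 1,500 GB/min (25 GB/s)
ingress_gb_per_min = minutes_uploaded * MB_PER_VIDEO_MIN_INGEST / 1000    # 300 GB/min (5 GB/s)
egress_gb_per_sec = ingress_gb_per_min / 60 * UPLOAD_TO_VIEW              # ~1 TB/s outgoing
```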

System API

POST /upload/video

  • Parameters:
    • api_dev_key (string)
      • Description: The API developer key of a registered account. This key is used for authentication and to manage user quotas.
      • Required: Yes
    • video_title (string)
      • Description: The title of the video.
      • Required: Yes
    • video_description (string)
      • Description: An optional description of the video.
      • Required: No
    • tags (string[]):
      • Description: Optional tags for the video to help categorize and search for it.
      • Required: No
    • category_id (string)
      • Description: The category of the video, e.g., Film, Song, People, etc.
      • Required: Yes
    • default_language (string)
      • Description: The language of the video content, e.g., English, Mandarin, Hindi, etc.
      • Required: Yes
    • recording_details (string):
      • Description: The location where the video was recorded.
      • Required: Yes
    • video_contents (stream):
      • Description: The video file to be uploaded.
      • Required: Yes

Request Example:

POST /upload/video HTTP/1.1
Host: api.example.com
Content-Type: multipart/form-data
Authorization: Bearer <api_dev_key>
{
  "video_title": "My Vacation",
  "video_description": "A short video of my vacation.",
  "tags": ["vacation", "travel", "fun"],
  "category_id": "travel",
  "default_language": "English",
  "recording_details": "Hawaii, USA",
  "video_contents": "<binary video data>"
}

Response:

  • HTTP Status Code: 202 Accepted
  • Response Body:
    • Description: A message indicating that the upload was successful and that the video encoding process has begun. The user will receive an email once the encoding is completed with a link to access the video.

Response Example:

{
  "message": "Upload successful. Video encoding in progress.",
  "status": "accepted"
}
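A client for this endpoint mainly has to assemble the required fields and the auth header. A minimal sketch that builds (but does not send) the request; the host `api.example.com` and the helper name are illustrative, not part of the API:

```python
API_HOST = "https://api.example.com"  # hypothetical host from the examples above

def build_upload_request(api_dev_key, video_title, video_contents, *,
                         category_id, default_language, recording_details,
                         video_description=None, tags=None):
    """Assemble the POST /upload/video request; required fields are enforced
    by the signature, optional ones are included only when provided."""
    body = {
        "video_title": video_title,
        "category_id": category_id,
        "default_language": default_language,
        "recording_details": recording_details,
        "video_contents": video_contents,  # binary payload in a real client
    }
    if video_description is not None:
        body["video_description"] = video_description
    if tags:
        body["tags"] = tags
    headers = {
        "Authorization": f"Bearer {api_dev_key}",
        "Content-Type": "multipart/form-data",
    }
    return API_HOST + "/upload/video", headers, body
```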

GET /search/video

  • Parameters:
    • api_dev_key (string):
      • Description: The API developer key of a registered account of our service. This key is used for authentication and to manage user quotas.
      • Required: Yes
    • search_query (string):
      • Description: A string containing the search terms.
      • Required: Yes
    • user_location (string):
      • Description: Optional location of the user performing the search.
      • Required: No
    • maximum_videos_to_return (number):
      • Description: Maximum number of results returned in one request.
      • Required: Yes
    • page_token (string):
      • Description: This token will specify a page in the result set that should be returned.
      • Required: No

Request Example:

GET /search/video?api_dev_key=<api_dev_key>&search_query=funny+cats&user_location=New+York&maximum_videos_to_return=10&page_token=XYZ123 HTTP/1.1
Host: api.example.com

Response:

  • HTTP Status Code: 200 OK
  • Response Body:
    • Description: A JSON object containing information about the list of video resources matching the search query. Each video resource will have a video title, a thumbnail URL, a video creation date, and a view count.
    • Response Example:
{
  "videos": [
    {
      "video_id": "abc123",
      "video_title": "Funny Cat Compilation",
      "thumbnail_url": "http://example.com/thumbnails/abc123.jpg",
      "video_creation_date": "2023-05-01T12:34:56Z",
      "view_count": 123456
    },
    {
      "video_id": "def456",
      "video_title": "Cats Doing Funny Things",
      "thumbnail_url": "http://example.com/thumbnails/def456.jpg",
      "video_creation_date": "2023-04-28T11:22:33Z",
      "view_count": 98765
    }
  ],
  "next_page_token": "XYZ456"
}

GET /stream/video

  • Parameters:
    • api_dev_key (string):
      • Description: The API developer key of a registered account of our service. This key is used for authentication and to manage user quotas.
      • Required: Yes
    • video_id (string):
      • Description: A string to identify the video.
      • Required: Yes
    • offset (number):
      • Description: The time in seconds from the beginning of the video from which the stream should start.
      • Required: Yes
    • codec (string):
      • Description: The codec to be used for streaming the video.
      • Required: Yes
    • resolution (string):
      • Description: The resolution in which the video should be streamed.
      • Required: Yes

Request Example:

GET /stream/video?api_dev_key=<api_dev_key>&video_id=abc123&offset=120&codec=h264&resolution=720p HTTP/1.1
Host: api.example.com

Response:

  • HTTP Status Code: 206 Partial Content
  • Response Body:
    • Description: A media stream (a video chunk) starting from the given offset.
  • Headers:
    • Content-Type: The MIME type of the video stream (e.g., video/mp4).
    • Content-Range: The range of bytes being sent in the response.

Response Example:

Headers:
  HTTP/1.1 206 Partial Content
  Content-Type: video/mp4
  Content-Range: bytes 1000000-2000000/5000000

Body: (Binary data of the video chunk starting from the given offset)
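The `Content-Range` header in a 206 response describes exactly which byte span of the file is being served. A small sketch of how a server might compute these headers for a chunk (the function name is illustrative):

```python
def partial_content_headers(start, end, total, mime="video/mp4"):
    """Headers for a 206 response serving bytes [start, end] of a file
    of `total` bytes; the end is clamped to the last valid byte."""
    end = min(end, total - 1)
    return {
        "Status": "206 Partial Content",
        "Content-Type": mime,
        "Content-Range": f"bytes {start}-{end}/{total}",
        "Content-Length": str(end - start + 1),
    }
```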

High Level Design

At a high level, we would need the following components:

  • Processing Queue: Each uploaded video will be pushed to a processing queue to be de-queued later for encoding, thumbnail generation, and storage.
  • Encoder: To encode each video into multiple formats.
  • Thumbnails generator: To generate a few thumbnails for each video.
  • Video and thumbnail storage: To store video and thumbnail files in some distributed file storage.
  • User Database: To store users' info (name, email, address, etc.).
  • Video metadata storage: DB to store all the info about videos like title, file path in the system, uploading user, total views, likes, dislikes, etc. It will also be used to store video comments.

[High-level design diagram]

DB Schema

Video Metadata Storage - MySQL

For each video store:

  • VideoID, Title, Description, Size, Thumbnail, Uploader/User, Total likes, Total dislikes, Total views

For each comment store:

  • CommentID, VideoID, UserID, Comment, TimeOfCreation

User Data Storage - MySQL

  • UserID, Name, email, address, age, registration details etc.
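The two schemas above map directly onto relational tables. An illustrative sketch (SQLite syntax via Python's `sqlite3` for a self-contained example; the actual MySQL column types would differ slightly):

```python
import sqlite3

# In-memory database holding the metadata tables described above.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (
    user_id       INTEGER PRIMARY KEY,
    name          TEXT NOT NULL,
    email         TEXT UNIQUE,
    address       TEXT,
    age           INTEGER,
    registered_at TEXT
);
CREATE TABLE videos (
    video_id    INTEGER PRIMARY KEY,
    title       TEXT NOT NULL,
    description TEXT,
    size_bytes  INTEGER,
    thumbnail   TEXT,
    uploader_id INTEGER REFERENCES users(user_id),
    likes       INTEGER DEFAULT 0,
    dislikes    INTEGER DEFAULT 0,
    views       INTEGER DEFAULT 0
);
CREATE TABLE comments (
    comment_id INTEGER PRIMARY KEY,
    video_id   INTEGER REFERENCES videos(video_id),
    user_id    INTEGER REFERENCES users(user_id),
    body       TEXT NOT NULL,
    created_at TEXT
);
""")
```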

Detailed Component Design

The service is read-heavy, so we need to focus on building a system that can retrieve videos quickly.

We expect the read-to-write ratio to be 200:1, which means that for every video upload there are 200 video views.

Where do we store videos

We can use something like HDFS or GlusterFS, or S3.

How to efficiently manage read traffic

We can segregate read and write traffic. Since we have multiple copies of each video, we can distribute read traffic on different servers.

For metadata, we can have a master-slave configuration where writes go to the master first and are then applied to all slaves.

This configuration can cause some staleness, which is acceptable since it is short-lived, and it improves read performance.

Where to store thumbnails:

There will be a lot more thumbnails than videos.

  • Small files, max 5KB each
  • Read traffic for thumbnails will be huge compared to videos. Users watch one video at a time, but they might be looking at a page that shows 20 thumbnails of other videos.

We could use Bigtable, as it combines multiple files into one block to store on disk and is very efficient at reading small amounts of data. Keeping hot thumbnails cached will also help improve latency; given that they are small, we can store a lot of them in memory.

Video uploads

Videos can be huge; we need to support resuming uploads if the connection drops.

Video Encoding

Newly uploaded videos are stored on the server, which adds an encoding task to the processing queue to encode the video into multiple formats. Once all the encodings are completed, the uploader is notified and the video is made available for viewing/sharing.
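This upload-then-encode pipeline can be sketched with a simple in-process queue; the target formats and function names below are illustrative, and a real system would use a durable distributed queue and actual transcoding jobs:

```python
import queue

# Minimal sketch of the upload -> encode pipeline: the upload handler
# enqueues a task; a worker dequeues it and produces one output per format.
FORMATS = ["1080p", "720p", "480p"]  # illustrative target formats

tasks = queue.Queue()

def handle_upload(video_id):
    """Raw video is stored; encoding is deferred to the queue."""
    tasks.put(video_id)

def encode_worker():
    """Drain the queue, emitting one (video, format) job result each;
    the caller notifies the uploader once all encodings are done."""
    encoded = []
    while not tasks.empty():
        video_id = tasks.get()
        for fmt in FORMATS:
            encoded.append((video_id, fmt))  # stand-in for a real encode job
        tasks.task_done()
    return encoded
```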

[Detailed design diagram]

Metadata Sharding

We have a huge number of new videos every day and a high read load. Therefore, we need to distribute the data across multiple machines so that we can perform read/write operations efficiently.

We have several options for sharding:

Based on UserID

We can try storing all the data for a particular user on one server. We can pass the UserID to the hash function which will map the user to a db server where we will store all the metadata for that user's video.

  • When we query for videos of a user, we can ask our hash function to find the server holding the user's data and then read it from there.
  • To search videos by titles we have to query all servers and each server will return a set of videos. A centralized server can then aggregate and rank these results before returning them to the user.

This approach has a couple of issues:

  • Hot users: popular users receive a disproportionate number of queries.
  • Over time, some users can end up storing many more videos than others. Maintaining a uniform distribution of growing user data is tricky.

We would then have to repartition/redistribute the data, or use consistent hashing to balance the load between servers.

Based on VideoID

We could map each VideoID to a random server where we will store that Video's metadata. To find videos of a user we will query all servers and each server will return a set of videos. A centralized server can aggregate and rank these results before returning them to the user. This solves the problem for hot users by shifting it to hot videos.

We can improve performance by introducing a cache to store hot videos in front of the db servers.
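Sharding by VideoID and the scatter-gather query it implies can be sketched as follows; the shard count and helper names are illustrative:

```python
import hashlib

NUM_SHARDS = 8  # illustrative shard count

def shard_for(video_id: str) -> int:
    """Map a VideoID to a metadata shard (stable across processes,
    unlike Python's built-in hash())."""
    digest = hashlib.md5(video_id.encode()).hexdigest()
    return int(digest, 16) % NUM_SHARDS

def videos_of_user(user_id, shards):
    """Scatter-gather: query every shard for the user's videos,
    then merge and rank centrally (here, by view count)."""
    results = []
    for shard in shards:
        results.extend(v for v in shard if v["uploader"] == user_id)
    return sorted(results, key=lambda v: v["views"], reverse=True)
```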

Video Deduplication

With a huge number of users uploading a massive amount of video data, we will have to deal with widespread duplication. Duplicate videos often differ in aspect ratio or encoding, can contain overlays or additional borders, or can be excerpts from a longer video. This has an impact on many levels:

  • Wasted data storage
  • Wasted cache
  • Wasted network usage
  • Wasted energy

This will also impact user experience, making it harder to search videos.

For this design, deduplication makes the most sense early: inline deduplication can save a lot of resources. As soon as a user starts uploading a video, we can run video matching algorithms (such as block matching, phase correlation, etc.) to find duplicates.

If we already have a copy, we can either stop the upload and use the existing copy, or continue and use the new video if it is of higher quality. We could also intelligently divide the video into smaller chunks so that we only upload the parts that are missing.
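The chunk-level idea can be sketched with simple content fingerprints: only chunks whose hashes the server does not already know need to be uploaded. The chunk size is an illustrative choice, and real video dedup would use perceptual matching rather than exact byte hashes:

```python
import hashlib

CHUNK_SIZE = 4 * 1024 * 1024  # illustrative 4MB chunks

def chunk_fingerprints(data: bytes):
    """SHA-256 of each fixed-size chunk; identical chunks hash identically."""
    return [hashlib.sha256(data[i:i + CHUNK_SIZE]).hexdigest()
            for i in range(0, len(data), CHUNK_SIZE)]

def chunks_to_upload(local: bytes, known: set):
    """Indexes of chunks whose fingerprints the server does not already have."""
    return [i for i, fp in enumerate(chunk_fingerprints(local))
            if fp not in known]
```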

Load Balancing

We should use Consistent Hashing among our cache servers, which will also help balance the load between them. Since we will be using a static hash-based scheme to map videos to hostnames, we also have to use dynamic HTTP redirections to even out the load between servers hosting popular and unpopular videos. We can redirect a client to a less busy server in the same cache location.

Redirection has its drawbacks. First, since our service tries to load-balance locally, it can lead to multiple redirections if the host that receives the redirection can't serve the video. Also, each redirection requires the client to make an additional HTTP request; this slows video startup and can send clients to distant cache locations.
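A minimal consistent-hash ring, of the kind mentioned above for the cache servers, can be sketched like this; virtual replicas smooth out the key distribution, and removing a node only remaps the keys that pointed at it:

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Minimal consistent-hash ring mapping videos to cache servers."""

    def __init__(self, nodes, replicas=100):
        self.replicas = replicas       # virtual points per node
        self._ring = []                # sorted (hash, node) points
        for node in nodes:
            self.add(node)

    def _hash(self, key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def add(self, node):
        for i in range(self.replicas):
            self._ring.append((self._hash(f"{node}#{i}"), node))
        self._ring.sort()

    def remove(self, node):
        self._ring = [(h, n) for h, n in self._ring if n != node]

    def node_for(self, key):
        # First point clockwise from the key's hash (wrapping around).
        h = self._hash(key)
        idx = bisect.bisect(self._ring, (h, "")) % len(self._ring)
        return self._ring[idx][1]
```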

Cache

To serve globally distributed users, we need to push content closer to the user using a large number of geographically distributed video cache servers.

We can also introduce a cache for the metadata servers to cache hot database rows, using, for example, Memcached.

We can use an LRU policy for cache eviction, discarding the least recently used rows first.

To build a more intelligent cache, we can rely on the 80-20 rule: 20% of the daily read volume generates 80% of the traffic, meaning certain videos are so popular that the majority of people view them. We can therefore try to cache that 20% of the daily read volume of videos and metadata.
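The LRU eviction policy mentioned above has a compact implementation in Python using an ordered dict; this is a sketch of the policy itself, not of a distributed cache:

```python
from collections import OrderedDict

class LRUCache:
    """Hot-row cache with least-recently-used eviction."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._data = OrderedDict()   # keeps insertion/usage order

    def get(self, key):
        if key not in self._data:
            return None
        self._data.move_to_end(key)          # mark as recently used
        return self._data[key]

    def put(self, key, value):
        if key in self._data:
            self._data.move_to_end(key)
        self._data[key] = value
        if len(self._data) > self.capacity:
            self._data.popitem(last=False)   # evict least recently used
```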

CDN

A CDN is a system of distributed servers that delivers web content to a user based on the geographic location of the user, the origin of the web page, and a content delivery server.

We can move popular videos to CDNs which will then replicate them in multiple places increasing the chance that they are closer to the users.

Less popular videos can be served directly by our servers.

Fault tolerance

We can use Consistent Hashing for distribution among the db servers, which helps us both replace dead servers and distribute load among them.