Hits - BB-Media-IT/Data-Hub GitHub Wiki
This document details the schemas used to manage the `Hits` product data. During the delivery process, a specific S3 bucket is created for each client. Each bucket has a main folder named `Hits` that contains files in JSONL format. Within this folder, there is a subfolder named `latest`.
Document Structure
This document is organized into the following sections: Details of the S3 Buckets, Update frequency and scope, and File Description.
Details of the S3 Buckets
Each client receives an S3 bucket in which data is organized into folders. Each folder is named after the type of data it contains, which makes every data set easy to identify and process.
Example:
s3://bucket-client/Hits
This folder contains a `latest` subfolder, which is periodically updated with the latest available snapshot, so customers always have access to the most recent data.
Update frequency and scope
- Data in the S3 bucket is updated weekly.
🚀 You can get a weekly updated demo by connecting to the bucket `s3://bb-media-data/hits/` using the AWS CLI or any S3-compatible client, with the endpoint parameter `--endpoint-url https://nyc3.digitaloceanspaces.com`.
Command example
aws s3 --endpoint-url https://nyc3.digitaloceanspaces.com cp s3://bb-media-data/hits/ /Demo/BB-Media --recursive
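The same download can be sketched in Python. This is a minimal sketch, not an official client: it assumes the third-party `boto3` package is installed and that credentials are configured the same way as for the AWS CLI. The function names `local_path_for_key` and `download_demo` are hypothetical helpers introduced for illustration.

```python
from pathlib import Path

ENDPOINT_URL = "https://nyc3.digitaloceanspaces.com"  # endpoint of the demo bucket
BUCKET = "bb-media-data"
PREFIX = "hits/"


def local_path_for_key(key: str, dest_dir: str, prefix: str = PREFIX) -> Path:
    """Map an object key under `prefix` to a file path under `dest_dir`."""
    relative = key[len(prefix):] if key.startswith(prefix) else key
    return Path(dest_dir) / relative


def download_demo(dest_dir: str) -> int:
    """Download every object under PREFIX and return the number of files written."""
    # Imported lazily so the path helper above works without boto3 installed.
    import boto3

    client = boto3.client("s3", endpoint_url=ENDPOINT_URL)
    paginator = client.get_paginator("list_objects_v2")
    count = 0
    for page in paginator.paginate(Bucket=BUCKET, Prefix=PREFIX):
        for obj in page.get("Contents", []):
            key = obj["Key"]
            if key.endswith("/"):  # skip folder placeholder objects
                continue
            target = local_path_for_key(key, dest_dir)
            target.parent.mkdir(parents=True, exist_ok=True)
            client.download_file(BUCKET, key, str(target))
            count += 1
    return count
```

Calling `download_demo("/Demo/BB-Media")` would mirror the layout produced by the recursive copy, so objects under `hits/latest/` land in `/Demo/BB-Media/latest/`.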
File Description
We provide a detailed description of the files contained in the `Hits` folder, explaining the structure and type of data handled by each one. The schemas are also available in YAML.
Hits
Field | Type | Description | Example |
---|---|---|---|
UID | string | Hash identifying the movie or series universally | c4d8bb0055f0cf866ec6f5d16b5471c5 |
IMDBId | string | ID coming from IMDB | tt13016388 |
TMDBId | string | ID coming from TMDB | 108545 |
DateHits | string | First day of the analyzed week | 2024-04-01 |
WeekOn | integer | Week number of the year corresponding to the date the data is processed (1 to 52/53) | 14 |
YearOn | integer | Year corresponding to the date the data is processed | 2024 |
Title | string | Title coming from IMDB | 3 Body Problem |
Year | integer | Year when the movie/series was released, coming from IMDB | 2024 |
Country | string | Country where scores are calculated (ISO 3166-1 alpha-2 code) | US |
Type | string | Content type | Movie |
Genre | string | Primary genre of the title (the most common genre in collaborative databases) | Action |
Scores | array | Array of score entries, one per source (see the Scores table) | View More In Scores |
Position | integer | Title position inside current country and content type | 1 |
DeltaPositionInt | integer | Position change from the previous week to the current one; null if the title was not in the previous week's data | 3 |
Average | numeric | Hits score average inside corresponding Country and Type | 1.55 |
HitsRelative | numeric | (Hits score / Average) - 1: how far the Hits score is above the Average, measured in multiples of the Average | 63.66 |
HitsLocal | numeric | Hits score compared with the Hits score of the same title in its country of origin | 100 |
ReleaseDate | string | Title release date in corresponding country | 2022-08-02 |
HitsCategory | string | Category assigned to a title according to its Hits score | Unicorn |
TrendScore | numeric | Relative score: the difference between the moving average of the last week and the moving average of the previous 2 weeks, divided by the latter | 1.888 |
TrendCategory | string | Category assigned to a title according to its TrendScore | Average Trend |
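As an illustration of how these fields might be consumed, the following sketch parses JSONL lines and builds a ranking for one country and content type. It rests on several assumptions: the two sample records are hypothetical, assembled from a subset of the Example column (they do not represent a real row), each entry in `Scores` is assumed to be an object with `Score` and `Source` keys, and the helper names are made up for this example.

```python
import json
from typing import Dict, Iterable, Iterator, List

# Hypothetical JSONL lines; only a subset of the documented fields is included.
SAMPLE_LINES = [
    json.dumps({
        "UID": "c4d8bb0055f0cf866ec6f5d16b5471c5",
        "Title": "3 Body Problem",
        "Country": "US",
        "Type": "Movie",
        "Position": 1,
        "DeltaPositionInt": 3,
        "Scores": [{"Score": 100, "Source": "YoutubeRaw"}],
    }),
    json.dumps({
        "UID": "00000000000000000000000000000000",  # made-up hash
        "Title": "Hypothetical Title",
        "Country": "US",
        "Type": "Movie",
        "Position": 2,
        "DeltaPositionInt": None,  # null: title was absent the previous week
        "Scores": [],
    }),
]


def iter_records(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one dict per non-empty JSONL line."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def ranking(records: Iterable[dict], country: str, content_type: str) -> List[dict]:
    """Return the records for one Country and Type, ordered by Position."""
    selected = [r for r in records
                if r["Country"] == country and r["Type"] == content_type]
    return sorted(selected, key=lambda r: r["Position"])


def scores_by_source(record: dict) -> Dict[str, float]:
    """Index the Scores array by Source (assumed entry shape: Score + Source)."""
    return {entry["Source"]: entry["Score"] for entry in record.get("Scores", [])}


records = list(iter_records(SAMPLE_LINES))
for rec in ranking(records, "US", "Movie"):
    delta = rec["DeltaPositionInt"]
    change = "new" if delta is None else f"{delta:+d}"
    print(rec["Position"], rec["Title"], change, scores_by_source(rec))
```

To read a delivered file instead of the in-memory sample, pass an open file handle to `iter_records`, since iterating over a text file yields its lines.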
Scores
Field | Type | Description | Example |
---|---|---|---|
Score | numeric | Score value for the given Source | 100 |
Source | string | Name of the score source. Raw scores are unbounded; all others are standardized to the 0-100 range. All scores are calculated for a specific week. Piracy: the number of views and downloads of the title on pirate websites. Twitter: the number of published tweets (from public profiles, using the users' location to determine the country) that include hashtags with the title of the content. Imdb: the number of votes registered on IMDB. Cdb: the number of views from collaborative databases (some websites are filmaffinity, letterboxd, sensacine). Youtube: the number of weekly views of the first 10 videos related to the title. Hits: the weighted average of the sources previously mentioned (20% Piracy + 15% Twitter + 10% Search Engine + 25% Collaborative DB Votes + 20% Collaborative DB Popularity + 10% Youtube). | YoutubeRaw |
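The arithmetic behind the Hits weighting, HitsRelative, and TrendScore can be sketched as follows. This is one reading of the descriptions in the tables, not the production implementation: the weights are the percentages quoted for the Hits source, but the dictionary keys are hypothetical labels (not confirmed `Source` values), and `trend_score` assumes the two moving averages are already computed.

```python
import math
from typing import Mapping

# Percentages quoted for the Hits source; key names are illustrative only.
HITS_WEIGHTS = {
    "Piracy": 0.20,
    "Twitter": 0.15,
    "SearchEngine": 0.10,
    "CollaborativeDbVotes": 0.25,
    "CollaborativeDbPopularity": 0.20,
    "Youtube": 0.10,
}


def weighted_hits(standardized: Mapping[str, float]) -> float:
    """Weighted average of standardized (0-100) component scores.

    Raises KeyError if a component is missing from `standardized`.
    """
    return sum(weight * standardized[name] for name, weight in HITS_WEIGHTS.items())


def hits_relative(hits_score: float, average: float) -> float:
    """(Hits score / Average) - 1, as described for HitsRelative."""
    return hits_score / average - 1


def trend_score(last_week_avg: float, previous_two_weeks_avg: float) -> float:
    """Difference between the two moving averages, divided by the older one."""
    return (last_week_avg - previous_two_weeks_avg) / previous_two_weeks_avg


# The weights add up to 100%, so equal components return the same value.
assert math.isclose(sum(HITS_WEIGHTS.values()), 1.0)
print(weighted_hits({name: 50.0 for name in HITS_WEIGHTS}))
print(hits_relative(100, 1.55))   # a Hits score of 100 against an Average of 1.55
print(trend_score(2.0, 1.0))      # last week's average is double the earlier one
```

With a Hits score of 100 and an Average of 1.55, `hits_relative` gives roughly 63.5, meaning the title scores about 63.5 averages above the Average for its Country and Type.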