Data Modeling ‐ Cumulative Dimensions, Struct and Array Lab - sachit914/datawarehouse GitHub Wiki

how to work with structs and arrays in PostgreSQL.

  • The session focuses on using PostgreSQL's struct and array data types to solve temporal problems in data modeling, improving performance and storage efficiency.

Problem Overview

1. Table Analysis:

  • The player_seasons table contains player statistics for different NBA seasons. Each row corresponds to a player in a specific season.

2. Temporal Issues:

  • Joining this table with downstream tables can introduce temporal shuffling and loss of compression, as it duplicates data unnecessarily.

3. Optimization Goal:

  • Transition to a model where each player has a single row with all seasonal data stored as an array, removing redundant temporal data and improving efficiency.

Data Inspection

dentifying Static vs. Dynamic Data:

  • Attributes like player_name, height, college, country, etc., are static and do not change across seasons, (all this data dosent change on every season).
  • Attributes like games_played, points, rebounds, and assists are dynamic and season-specific all this data changes every season.

Highlighting Redundancy:

  • Static attributes are repeated in every row for the same player, which wastes storage and computation.

Defining a Custom Data Type

Struct for Season Stats

  • A PostgreSQL type named season_stats is created to group season-specific attributes
    • season (integer)
    • games_played (integer)
    • points
    • rebounds,
    • assists (real numbers).
create type season_stats as (
	season INTEGER,
	gp INTEGER,
	pts REAL,
	reb REAL,
	ast Real
)
  • season_stats is a composite type, not a table
  • a composite type is a user-defined structure that groups multiple fields together, much like a row or a record

Purpose of Struct:

  • Consolidates season-specific attributes into a single data type to encapsulate temporal details.
  • By separating static and dynamic data, you eliminate unnecessary duplication. For instance, instead of repeating player name and height for every season, you store them once, and the seasons (with their stats) are grouped in an array of structs.

Designing the New Table (player table which has player data)

  • We start by defining a players table that stores static attributes about players (information that doesn’t change over time, such as name, height, and college).

Player-Level Attributes:

  • A new table, players, is designed with columns for static attributes like player_name, height, college, and country.

Seasonal Data as an Array:

  • An array of season_stats structs is added as a column to store all seasons for a player.

Current Season Field:

  • An additional current_season column is added to keep track of the most recent season for a player.

Primary Key:

  • The combination of player_name and current_season ensures uniqueness for cumulative updates.
create table players (
player_name TEXT,
height TEXT,
college TEXT,
country TEXT,
draft_year Text,
draft_round TEXT,
draft_number Text,
season_stats season_stats[],
current_Season INTEGER,
PRIMARY KEY(player_name,current_season)
)

Data Transformation and Querying

Data Migration Approach:

  • Data from player_seasons is transformed and migrated into the new players table using SQL queries.

Seed Query:

  • A seed query selects the first year’s data (1996) as a starting point for cumulative updates.

Full Outer Join Logic:

  • Subsequent years’ data is joined with the previous state (yesterday) to build cumulative player records. (WE HAVE TWO TABLE, YESTERDAY TABLE FOR THAT TABLE WE ARE JOINNING TODAYS TABLE DATA)

Incremental Data Updates

Cumulative Querying:

  • Data for each subsequent year is joined and appended to the array using a full outer join.

Dynamic Updates:

  • The current_season field is updated with the latest value to facilitate incremental appending.

Query Validation:

  • Queries are executed and validated to ensure the correctness of data structure and transformations.

insert into players
WITH yesterday AS (
    SELECT * FROM players
    WHERE current_season = 2000
),
today AS (
    SELECT * FROM player_seasons
    WHERE season = 2001
)
SELECT 
    COALESCE(t.player_name, y.player_name) AS player_name,
    COALESCE(t.height, y.height) AS height,
    COALESCE(t.college, y.college) AS college,
	COALESCE(t.country, y.country) AS country,
    COALESCE(t.draft_year, y.draft_year) AS draft_year,
    COALESCE(t.draft_round, y.draft_round) AS draft_round,
    COALESCE(t.draft_number, y.draft_number) AS draft_number,
    CASE 
        WHEN y.season_stats IS NULL THEN 
            ARRAY[row(t.season, t.gp, t.pts, t.reb, t.ast)::season_stats]
        when t.season is not null then 
            y.season_stats || ARRAY[row(t.season, t.gp, t.pts, t.reb, t.ast)::season_stats]
		else y.season_stats
    END AS season_stats,
	COALESCE(t.season,y.current_Season + 1) as current_season
FROM today t
FULL OUTER JOIN yesterday y
    ON t.player_name = y.player_name;

Unnesting Data

  • Concept: Nested data structures (e.g., arrays or JSON columns) are common in modern databases. Using the UNNEST function, you can flatten these structures into individual rows.
with unnested as(
	select player_name,
		unnest(season_stats)::season_stats as season_stats
	from players
	where current_season = 2001
)
select player_name,
	(season_stats::season_stats).*
from unnested

final

insert into players
WITH yesterday AS (
    SELECT * FROM players
    WHERE current_season = 2000
),
today AS (
    SELECT * FROM player_seasons
    WHERE season = 2001
)
SELECT 
    COALESCE(t.player_name, y.player_name) AS player_name,
    COALESCE(t.height, y.height) AS height,
    COALESCE(t.college, y.college) AS college,
	COALESCE(t.country, y.country) AS country,
    COALESCE(t.draft_year, y.draft_year) AS draft_year,
    COALESCE(t.draft_round, y.draft_round) AS draft_round,
    COALESCE(t.draft_number, y.draft_number) AS draft_number,
    CASE 
        WHEN y.season_stats IS NULL THEN 
            ARRAY[row(t.season, t.gp, t.pts, t.reb, t.ast)::season_stats]
        when t.season is not null then 
            y.season_stats || ARRAY[row(t.season, t.gp, t.pts, t.reb, t.ast)::season_stats]
		else y.season_stats
    END AS season_stats,
	CASE 
		WHEN t.season is not null then 
			case when t.pts > 20 then 'star'
				when t.pts > 15 then 'good'
				when t.pts > 10 then 'average'
				else 'bad'
			end::scoring_class
			else y.scoring_class
	end as scoring_class,
	case when t.season is not null then 0
		else y.years_since_last_season + 1
			end as years_since_last_season,
	COALESCE(t.season,y.current_Season + 1) as current_season
	
FROM today t
FULL OUTER JOIN yesterday y
    ON t.player_name = y.player_name;

summary

question

  • The actual goal of this SQL process is to manage and update a basketball player's information over different seasons

Track Player Data Across Seasons:

  • Maintain historical data for each player, such as height, college, country, draft details, and performance statistics (season_stats).
  • Aggregate and append new seasonal data (gp, pts, reb, ast) to the season_stats array for each player.

Classify Players by Performance:

  • Determine the performance class (scoring_class) of each player for the current season based on their points per game (pts):
    • "star" for players scoring more than 20 points.
    • "good" for players scoring 15–20 points.
    • "average" for players scoring 10–15 points.
    • "bad" for players scoring less than 10 points.

Handle Players with No Current Season Data:

  • For players who didn't play in the current season (not present in the player_seasons table for 2001), retain their previous season's stats and increment the years_since_last_season field to indicate inactivity.
 CREATE TABLE players (
     player_name TEXT,
     height TEXT,
     college TEXT,
     country TEXT,
     draft_year TEXT,
     draft_round TEXT,
     draft_number TEXT,
     season_stats season_stats[],
     scoring_class scoring_class,
     years_since_last_season INTEGER,
     -- is_active BOOLEAN,
     current_season INTEGER,
     PRIMARY KEY (player_name, current_season)
 );
 CREATE TYPE scoring_class AS
     ENUM ('bad', 'average', 'good', 'star');
insert into players
WITH yesterday AS (
    SELECT * FROM players
    WHERE current_season = 2001
),
today AS (
    SELECT * FROM player_seasons
    WHERE season = 2002
)
SELECT 
    COALESCE(t.player_name, y.player_name) AS player_name,
    COALESCE(t.height, y.height) AS height,
    COALESCE(t.college, y.college) AS college,
	COALESCE(t.country, y.country) AS country,
    COALESCE(t.draft_year, y.draft_year) AS draft_year,
    COALESCE(t.draft_round, y.draft_round) AS draft_round,
    COALESCE(t.draft_number, y.draft_number) AS draft_number,
    CASE 
        WHEN y.season_stats IS NULL THEN 
            ARRAY[row(t.season, t.gp, t.pts, t.reb, t.ast)::season_stats]
        when t.season is not null then 
            y.season_stats || ARRAY[row(t.season, t.gp, t.pts, t.reb, t.ast)::season_stats]
		else y.season_stats
    END AS season_stats,
	CASE 
		WHEN t.season is not null then 
			case when t.pts > 20 then 'star'
				when t.pts > 15 then 'good'
				when t.pts > 10 then 'average'
				else 'bad'
			end::scoring_class
			else y.scoring_class
	end as scoring_class,
	case when t.season is not null then 0
		else y.years_since_last_season + 1
			end as years_since_last_season,
	COALESCE(t.season,y.current_season + 1) as current_season
	
FROM today t
FULL OUTER JOIN yesterday y
    ON t.player_name = y.player_name;
  • we have empty left table player at starting
  • then we have another table of year 1996
  • then we merge first table with second table( yesterdays table with todays table)

analytics

  • which player had improvement from first season to current season