# ETL Pipeline Process for Financial Data

- `wsj_format` mapping: `-1`: Null, `1`: General, `2`: [Reserved, currently unused], `3`: Insurance, `4`: Banking
- `source` mapping: `-1`: Null, `1`: YF (Yahoo Finance), `2`: WSJ (Wall Street Journal)
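
For reference, these mappings can be expressed directly in code. A minimal sketch; the `IntEnum` class and member names are illustrative, not taken from the repository:

```python
from enum import IntEnum

class WsjFormat(IntEnum):
    """Mapping for the wsj_format column."""
    NULL = -1
    GENERAL = 1
    RESERVED = 2   # currently unused
    INSURANCE = 3
    BANKING = 4

class Source(IntEnum):
    """Mapping for the current_source column."""
    NULL = -1
    YF = 1
    WSJ = 2
```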
1. Check rows with `wsj_format = -1`, and rows with `current_source <= 1` and `wsj_format IN (3, 4)`, by scraping the WSJ data source.
   - If there is any change in `wsj_format`, update `current_source` in the `idx_company_profile` table to `2`.
   - If there is no change, proceed to the next step.
2. Check for `current_source = null` in the YF API data source.
   - If data is found, check whether the latest date falls within the retention range.
     - If within range, update `current_source` in the `idx_company_profile` table to `1`.
     - If not, go to step 3.
3. If data is not found in the YF API, check the WSJ data source.
   - If data is found, check whether the latest date falls within the retention range.
     - If within range, update `current_source` in the `idx_company_profile` table to `2`.
     - If not, do not update the source; keep it as is.
4. Scrape data from the YF API for symbols with `current_source = 1` and from the WSJ data source for symbols with `current_source = 2`.
5. Upsert the data to the database.
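
The source-selection logic above can be sketched as follows. This is a minimal illustration: the four helper functions are hypothetical stand-ins for the real WSJ/YF lookups and the retention-range check, which live in the repository's updater scripts.

```python
# Hypothetical helpers standing in for the real lookups.
def scrape_wsj_format(symbol: str) -> int: ...
def yf_has_data(symbol: str) -> bool: ...
def wsj_has_data(symbol: str) -> bool: ...
def within_retention(symbol: str, source: str) -> bool: ...

def resolve_source(symbol: str, wsj_format: int, current_source: int | None) -> int | None:
    # Step 1: re-scrape WSJ for unknown formats, and for insurance/banking
    # rows whose current source is still null or YF.
    if wsj_format == -1 or (current_source is not None
                            and current_source <= 1 and wsj_format in (3, 4)):
        if scrape_wsj_format(symbol) != wsj_format:
            return 2                      # wsj_format changed: switch to WSJ
    # Steps 2-3: symbols with no source yet are probed against YF first,
    # then fall back to WSJ.
    if current_source in (None, -1):
        if yf_has_data(symbol) and within_retention(symbol, "yf"):
            return 1                      # YF has data within retention range
        if wsj_has_data(symbol) and within_retention(symbol, "wsj"):
            return 2                      # WSJ has data within retention range
    return current_source                 # otherwise leave the source as is
```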

- Initialize the Supabase client with the URL and secret key.
- Get symbols from `idx_active_company_profile` with `current_source = 1`.
- Check for a null `yf_currency` in `idx_company_profile` and update it if necessary.
- For each symbol, check the latest date in the database:
  a. If the source has new data, scrape it.
  b. Otherwise, check the next symbol.
- If scraping returned new data, proceed to cleaning. If there is no new data, or the database already has the latest data, exit the process.
- Make adjustments to some columns based on the `wsj_format`. Details are available here.
- Cast column types for the upsert.
- If the currency (`yf_currency`) is not IDR, convert all columns that contain monetary values to IDR using the conversion rate applicable on the date of the financial report (see the sketch after this list).
- Upsert the data in batches; if a batch fails, retry it until the retry limit is reached.
- If the batch upsert is successful, exit the process.
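
A minimal sketch of the IDR conversion step, assuming the data sits in a pandas DataFrame. The column names and the `get_idr_rate` lookup are hypothetical; the real column list and rate source live in the cleaning script.

```python
import pandas as pd

def get_idr_rate(currency: str, date: str) -> float: ...  # hypothetical rate lookup

MONETARY_COLUMNS = ["revenue", "net_income", "total_assets"]  # illustrative names

def convert_to_idr(df: pd.DataFrame, currency: str) -> pd.DataFrame:
    """Convert monetary columns to IDR using the rate on each row's report date."""
    if currency == "IDR":
        return df                                  # already in IDR, nothing to do
    rates = df["date"].map(lambda d: get_idr_rate(currency, d))
    for col in MONETARY_COLUMNS:
        df[col] = df[col] * rates                  # row-wise conversion by report date
    return df
```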
Program to be used for GitHub Actions. Command-line options:
- `-tt`, `--target_table`: target table to update (`str`, required)
- `-bs`, `--batch_size`: batch size (`int`, default `-1`)
- `-bn`, `--batch_number`: batch number (`int`, default `1`)
Example usage: `python scrape_data.py -tt idx_financials_annual -bs -1`
and `python scrape_data.py -tt idx_financials_quarterly -bs -1`
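
A sketch of how the batch options might partition the symbol list across GitHub Actions jobs. The slicing semantics are an assumption (only the flags and their defaults are documented above): `batch_size = -1` is read as "process all symbols", and `batch_number` selects a 1-indexed slice otherwise.

```python
def select_batch(symbols: list[str], batch_size: int, batch_number: int) -> list[str]:
    # Assumed semantics: -1 means "no batching, take every symbol".
    if batch_size == -1:
        return symbols
    start = (batch_number - 1) * batch_size   # batch_number is 1-indexed
    return symbols[start:start + batch_size]

# e.g. select_batch(symbols, batch_size=100, batch_number=2) -> symbols[100:200]
```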

- Initialize the Supabase client with the URL and secret key.
- Get symbols from `idx_active_company_profile` with `current_source = 2`.
- For each symbol, check the latest date in the database:
  a. If the source has new data, scrape it.
  b. Otherwise, check the next symbol.
- If scraping returned new data, save the raw data to CSV and proceed to cleaning. If there is no new data, or the database already has the latest data, exit the process.
- Clean null values:
  a. If this succeeds, proceed.
  b. Otherwise, exit the cleaning step.
- Enrich columns and cast column types for the upsert:
  a. If this succeeds, proceed.
  b. Otherwise, exit the cleaning step and save the partially cleaned data to CSV.
- Upsert the data in batches; if a batch fails, retry it until the retry limit is reached. If it still fails, save the cleaned data to CSV (see the sketch after this list).
- If the batch upsert is successful, exit the process.
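
A minimal sketch of the batched upsert with retries and the CSV fallback, using the `supabase-py` client. The table name, batch size, retry limit, and output filename here are illustrative.

```python
import time
import pandas as pd
from supabase import create_client

def upsert_in_batches(df: pd.DataFrame, url: str, key: str,
                      table: str = "idx_financials_annual",  # illustrative
                      batch_size: int = 100, max_retries: int = 3) -> bool:
    client = create_client(url, key)
    records = df.to_dict(orient="records")
    for start in range(0, len(records), batch_size):
        batch = records[start:start + batch_size]
        for attempt in range(max_retries):
            try:
                client.table(table).upsert(batch).execute()
                break                              # batch succeeded, move on
            except Exception:
                time.sleep(2 ** attempt)           # simple backoff between retries
        else:
            # Retry limit reached: persist the cleaned data for manual recovery.
            df.to_csv("cleaned_data_failed_upsert.csv", index=False)
            return False
    return True
```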
Main program to run the updater. Usage:
- Optional:
  - `-i`, `--infile`: path to a CSV file containing a list of symbols to scrape (for debugging).
  - `-db`, `--save_to_db`: whether to save the cleaned file to the database. Defaults to not saving to the DB; CSV files are always saved.
  - `-a`, `--append`: path to a CSV file to append to; used for resuming scraping (for debugging).
  - `--save_every_symbol`: save a CSV file to `/temp` every time data is scraped for a symbol (for debugging).
- Required:
  - `-q`, `--quarter`: whether to scrape annual or quarterly financial data. Defaults to annual.
Use `-h` for help on usage.
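
A sketch of how these options could be declared with `argparse`. The flags and help text mirror the list above; the `store_true` semantics are an assumption.

```python
import argparse

parser = argparse.ArgumentParser(description="Run the financial data updater.")
parser.add_argument("-i", "--infile", type=str,
                    help="Path to a CSV file with symbols to scrape (debugging)")
parser.add_argument("-db", "--save_to_db", action="store_true",
                    help="Save the cleaned file to the DB (CSVs are always saved)")
parser.add_argument("-a", "--append", type=str,
                    help="Path to a CSV file to append to, for resuming scraping")
parser.add_argument("--save_every_symbol", action="store_true",
                    help="Save a CSV to /temp after each symbol is scraped")
parser.add_argument("-q", "--quarter", action="store_true",
                    help="Scrape quarterly data (defaults to annual)")
args = parser.parse_args()
```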
The cleaner program cleans the data after scraping and can be used standalone, without the scraper. It parses no command-line arguments; it takes either a CSV file or a Supabase client to retrieve the financial data. Saving and upserting the data are done in this script.
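
A sketch of the either/or input the cleaner accepts; the function name and table are assumed, and only the dual-input behavior comes from the description above.

```python
import pandas as pd

def load_financials(csv_path: str | None = None, supabase_client=None) -> pd.DataFrame:
    """Load financial data from a CSV file or, alternatively, from Supabase."""
    if csv_path is not None:
        return pd.read_csv(csv_path)
    if supabase_client is not None:
        resp = supabase_client.table("idx_financials_annual").select("*").execute()
        return pd.DataFrame(resp.data)
    raise ValueError("Provide either a CSV path or a Supabase client")
```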
Program to be used for GitHub Actions. Usage: `-q`/`--quarter` or `-a`/`--annual` to scrape quarterly or annual data, respectively.
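
Since `-q` and `-a` select one of two modes, they could be modeled as a mutually exclusive `argparse` group; a minimal sketch (making the group required is an assumption):

```python
import argparse

parser = argparse.ArgumentParser()
group = parser.add_mutually_exclusive_group(required=True)  # assumed required
group.add_argument("-q", "--quarter", action="store_true",
                   help="Scrape quarterly financial data")
group.add_argument("-a", "--annual", action="store_true",
                   help="Scrape annual financial data")
args = parser.parse_args()
```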