Data Sources and Processing - supertypeai/sectors-kb GitHub Wiki

This section focuses on how we obtain and process the data before storing it inside the database.

List of Columns for idx_financials_annual and idx_financials_quarterly

Metadata:

  • symbol: IDX company symbol ending with JK
  • date: date of the financial report

Income statement metrics:

  • total_revenue: For banks, obtained by adding the Net Interest Income and Non-Interest Income
  • gross_income: Total Revenue - Costs of Goods Sold; Not applicable for banks or insurance companies, since they don't have Costs of Goods Sold
  • operating_income: Gross Income - Operating Expense (excluding Costs of Goods Sold); For banks, obtained by subtracting the Interest Expense, Non-Interest Expense, and Loan Loss Provision from the Total Revenue.
  • pretax_income: Operating Income + Non-Operating Income
  • income_taxes
  • net_income: Earnings Before Tax - Income Tax - Minority Interest
  • ebit: Earnings Before Tax (Pretax Income) + Non-Operating Interest Expense; Not applicable for banks
  • ebitda: EBIT + Deprecation and Amortization; Not applicable for banks or insurance companies
  • diluted_shares_outstanding
  • interest_expense_non_operating: Not applicable for banks

Balance sheet metrics:

  • cash_and_short_term_investments: For companies outsides bank/ insurance companies, equivalent of Cash and Equivalents
  • cash_only: For insurance companies, equivalent of Cash and Equivalents
  • total_cash_and_due_from_banks: For banks, equivalent of Cash and Equivalents
  • total_assets
  • total_non_current_assets: Also known as fixed assets, not applicable for banks or insurance companies
  • total_liabilities
  • total_current_liabilities: Not applicable for banks or insurance companies
  • total_debt
  • total_equity
  • stockholders_equity: Also known as book value equity

Cash flow metrics:

  • net_operating_cash_flow
  • free_cash_flow

Data Sources for idx_financials_annual and idx_financials_quarterly

We have 2 data sources: WSJ and Yahoo Finance (YF) API. We determined the wsj_format by the formatting from WSJ because it follows strict formatting for companies in the same industry, e.g. Banking and Insurance have different formats due to having different metrics to consider. On the other hand, the formatting from YF API is quite random even for companies in the same industry.

We prioritize YF API as the source due to more convenient data retrieval. But for Banking and Insurance, we prioritize WSJ as the source due to the lack of certain metrics and standardization on the YF API. There are 3 steps in deciding how to determine this process:

  1. Filter companies by their industries, then for Banks and Insurance companies, set their data source to WSJ if the company is available in WSJ.
  2. Set the other companies' data source to YF if the company is available in YF. Note: this may still include Banking or Insurance industries if the company is not available in WSJ.
  3. For companies that are not available in YF, set their data source to WSJ if the company is available in WSJ. Note: this may still result in missing data for companies if data does not exist in both sources.

Looking at the 3 steps above, ideally, we would want to have Banking and Insurance company data from WSJ and other industries from YF. Due to that reason, we have a check mechanism to check this periodically ensuring Banking and Insurance industries data are sourced from WSJ. See below to see the wsj_format classification details.

WSJ

Data from WSJ are provided by FactSet, which also provides data for TradingView and MarketWatch.

Example URL:

Note: quarter can be replaced with annual

Formats

There are 4 different formats of financial data for IDX companies:

  1. General 1 (mostly non-insurance/ non-banking companies): https://www.wsj.com/market-data/quotes/ID/XIDX/ADRO/financials/quarter/income-statement
  2. General 2 (mostly companies related to real estate or finance): https://www.wsj.com/market-data/quotes/ID/XIDX/APLN/financials/quarter/income-statement
  3. Insurance (mostly insurance companies): https://www.wsj.com/market-data/quotes/ID/XIDX/AHAP/financials/quarter/income-statement
  4. Banking (mostly banking companies): https://www.wsj.com/market-data/quotes/ID/XIDX/BBCA/financials/quarter/income-statement

Note that WSJ may employ different category classifications compared to IDX

Data Processing

General 1 Format

  • Interest Expense is regarded as Non-Operating Interest Expense
  • Cash & Short Term Investments would be shown as Cash and Equivalents in Frontend
  • Total Debt = Long Term Debt + Short Term Debt & Current Portion of Long Term Debt

If both Long Term Debt and Short Term Debt & Current Portion of Long Term Debt are "-", then we will leave Total Debt as null.

  • Operating Income = Gross Income - SG&A Expense - Other Operating Expense
  • EBIT = Pretax Income + Non-Operating Interest Expense
  • EBITDA = EBIT + Deprecation and Amortization
  • Fixed Assets = Total Assets - Total Current Assets

Total Debt, Operating Income, EBIT and Fixed Assets are not available by default. EBITDA is available, but the value does not conform to our defined formula.

General 2 Format

  • Total Interest Expense is regarded as Non-Operating Interest Expense
  • Cash & Short Term Investments would be shown as Cash and Equivalents in Frontend
  • Gross Income = Operating Income + Selling, General & Admin. Expenses + Other Operating Expense
  • COGS = Total Revenue - Gross Income (not needed in Frontend)
  • EBIT = Pretax Income + Non-Operating Interest Expense
  • EBITDA = EBIT + Deprecation and Amortization
  • Total Non-Current Liabilities = Long Term Debt + Provision for Risks & Charges + Deferred Taxes Credit + Other Liabilities
  • Total Current Liabilities = Total Liabilities - Total Non-Current Liabilities

Gross Income, COGS, EBIT, EBITDA, Total Non-Current Liabilities and Total Current Liabilities are not available by default.

Banking Format

  • Total Cash & Due from Banks would be shown as Cash and Equivalents in Frontend
  • Total Revenue = Net Interest Income + Non-Interest Income

Insurance Format

  • Cash Only would be shown as Cash and Equivalents in Frontend
  • Operating Income Before Interest Expense is regarded as Operating Income
  • Interest Expense, Net of Interest Capitalized is regarded as Non-Operating Interest Expense
  • EBIT = Pretax Income + Non-Operating Interest Expense

Yahoo Finance API

Data are obtained through yfinance Python package (https://pypi.org/project/yfinance/).
Example usage:

import yfinance as yf
ticker = yf.Ticker("ADRO.JK")

# to obtain annual data (return a dataframe)
ticker.income_stmt
ticker.balance_sheet
ticker.cashflow

# to obtain quarterly data (return a dataframe)
ticker.quarterly_income_stmt
ticker.quarterly_balance_sheet
ticker.quarterly_cashflow

Data Processing

For banks, the following metrics are set to null:

  • gross_income
  • ebitda
  • cash_and_short_term_investments
  • total_non_current_assets
  • total_current_liabilities
  • ebit
  • interest_expense_non_operating

Note: total_cash_and_due_from_banks is not available on YF API

For insurance companies, the following metrics are set to null:

  • gross_income
  • ebitda
  • cash_and_short_term_investments
  • total_non_current_assets
  • total_current_liabilities

Note: cash_only is not available on YF API

Data Sources for idx_company_profile

IDX

Data (company profile) are obtained from IDX official website (e.g. https://www.idx.co.id/en/listed-companies/company-profiles/BBCA). Using Selenium to scrape the following information:

"company_name",
"symbol",
"address",
"email",
"phone",
"fax",
"NPWP",
"website",
"listing_date",
"listing_board",
"sub_sector_id",
"industry",
"sub_industry",
"register",
"shareholders",
"directors",
"commissioners",
"audit_committees",
"delisting_date"

The following columns are cleaned so the format and values are more standardized.

"shareholders",
"directors",
"commissioners",
"audit_committees"

Other columns:

  • nologo: Manual input
  • wsj_format: WSJ
  • current_source: YF API and WSJ
  • yf_currency: YF API
  • morningstar_code: MS website
  • ipo_price: IDX E-Ipo

Check https://github.com/supertypeai/sectors-kb/wiki/ETL-Pipeline-Process for more information about wsj_format and current_source

Data Sources for idx_daily_data

Yahoo Finance API

Example usage:

import yfinance as yf
ticker = yf.Ticker("ADRO.JK")

# return the current market cap 
ticker.info.get("marketCap", None)

# return a dataframe containing the daily close price and volume starting from last_date. auto_adjust is set to false because we don't want the price to be adjusted by dividend.
ticker.history(start=last_date, auto_adjust=False)["Close", "Volume"](/supertypeai/sectors-kb/wiki/"Close",-"Volume")

If the current market cap is not available from YF API, we will try to obtain it by scraping the YF website, e.g. https://finance.yahoo.com/quote/BBCA.JK/key-statistics?p=BBCA.JK. If the current market cap is still not available, we will manually derive it using the most recently available market cap and close price (assuming the share number is constant). Occasionally we might also have a null market cap for the past date. In that case, we will manually derive it using the current market cap and close price (if available) or resorting to the most recently available market cap and close price.

Data Sources for idx_key_stats

Yahoo Finance API

Example usage:

import yfinance as yf
ticker = yf.Ticker("ADRO.JK")

# return a dictionary
ticker.info

# return a dataframe containing % share held by insiders and institutions
ticker.major_holders

The following columns are obtained from ticker.info: forward_eps, recommendation_mean, and employee_num. Data from ticker.major_holders are processed into a dictionary and stored in holders_breakdown column.

Scripted Input

Point Summaries

point_summaries is updated using a script. Calculated based on the ticker's metrics on several sections.

Intrinsic Value through Discounted Cash Flow Analysis

Inflation data is obtained from https://www.inflationtool.com/indonesian-rupiah?amount=100&year1=2019&year2={year2}&frequency=yearly
Financial data is obtained from:

  • idx_company_report_mv: historical_valuation, historical_financials, market_cap
  • idx_financials_annual: basic_eps, share_issued
  • idx_sector_reports_calc: historical_valuation

Data Processing:

  • Calculate Average Cyclically Adjusted Earning (future value): price×(1+avg_inflation_rate)**(total_elements−i−1). In simpler terms, it's applying an inflation factor to each element in the list based on its position in the list. The further down the list (higher index i), the more times it applies the inflation factor. The result is a list of future values for each element in the net_income_list.
  • Calculate the sub_sector_roe : (sub_sector_pb_ttm/sub_sector_pe_ttm), sub_sector_npm : (sub_sector_ps_ttm/sub_sector_pe_ttm), ticker_roe : (ticker_pb_ttm/ticker_pe_ttm), ticker_npm : (ticker_ps_ttm/ticker_pe_ttm)
  • Calculate ticker_der: for banking sub sector (19), der formula is ticker_total_liabilities/ticker_total_equity. Instead, der formula is ticker_total_debt/ticker_total_equity
  • Calculate ticker_profit_margin_stability:
  1. Operating Profit Margin (OPM): Net Income/ Total Revenue
  2. Calculate Average OPM & Standard Deviation OPM
  3. Profit Margin Stability: Average Operating Profit Margin/Standard Deviation ​- Calculate ticker_earning_predictability: calculate the correlation between years and net_income
  • Calculate discount_rate:
  1. Beta Calculation: Define a list of beta values [0.33, 0.67, 1, 1.5 , 2]. Then, create a dictionary (data) containing financial metrics and their corresponding thresholds. The thresholds are specified for metrics such as Return on Equity (ROE), Debt-to-Equity Ratio (DER), Net Profit Margin (NPM), Profit Margin Stability, Earning Predictability, and Market Capitalization. Loop through the metrics in the dictionary: a. If the metric is 'roe', 'npm', 'earning_predictability', or 'market_cap': If the metric value is greater than the first threshold, assign the lowest beta (0.33). Otherwise, iterate through the remaining thresholds, and interpolate the beta based on the metric's position between the thresholds. b. If the metric is 'der' or not in the specified metrics: If the metric value is less than the first threshold, assign the lowest beta (0.33). Otherwise, iterate through the remaining thresholds, and interpolate the beta based on the metric's position between the thresholds.
  2. Calculate Median Beta:
  3. Calculate the discount rate using the formula: Discount Rate=0.07+(0.05×Median Beta)
  • Calculate cae_per_share : avg_cae/share_issued
  • Calculate avg_eps : average between cae_per_share & diluted_eps
  • Calculate intrinsic_value :
  1. Calculate the intrinsic value for the next 100 years. For the first 10 years: value = avg_eps×(1+growth_rate_10y)i . If the year is after the first 10 years:value=values[9]×(1+growth_rate_after_10y)(i−10)
  2. Convert to present value: present_value= value / (1+discount_rate)**(i+1) ​3. Calculate the intrinsic value by adding all of the present_value

TradingView

Data of stock's rating are gathered from TradingView. There are two types of stocks rating that are gathered, the technical rating and the analyst rating.

Technical Rating

Technical rating is breakdown into three periods of time, which are daily, weekly, and monthly. Each of technical rating data is collected frequently based on its relevant periods. The structure of the data is displayed as below:

{
  "summary": {
    "buy": 7,
    "sell": 9,
    "neutral": 10,
    "updated_on": "2024-07-21 08:15:02"
  },
  "oscillator": {
    "buy": 1,
    "data": [
      {
        "name": "Relative Strength Index (14)",
        "value": 51,
        "action": "Neutral"
      },
      ...
    ],
    "sell": 1,
    "neutral": 9,
    "updated_on": "2024-07-21 08:15:02"
  },
  "moving_average": {
    "buy": 6,
    "data": [
      {
        "name": "Exponential Moving Average (10)",
        "value": 2316,
        "action": "Sell"
      },
      ...
    ],
    "sell": 8,
    "neutral": 1,
    "updated_on": "2024-07-21 08:15:02"
  }
}

Analyst Rating

Analyst rating is collected once every three months. The structure of the data is displayed as below:

{
  "buy": 2,
  "hold": 3,
  "sell": 0,
  "n_analyst": 10,
  "strong_buy": 4,
  "updated_on": "2024-07-10 00:35:21",
  "strong_sell": 1
}

Data Sources for idx_dividend

SahamIDX

Data Processing

Dividend records are retrieved for a specified date range. Besides the dividend value, we also store the yield in idx_dividend. The dividend yield is calculated by dividing the dividend value for each date with the mean closing price of the stock for the corresponding year. The dividend yield is computed in Python and saved in the table only for the preceding year. Meanwhile, the dividend yield for the current year is dynamically computed at the view level.

Data Sources for idx_stock_split

SahamIDX: Stock Split & Reverse Stock Split

Data Processing

  1. Retrieve (reverse) stock split records from the database for future dates
  2. Retrieve (reverse) stock split records from SahamIDX for future dates
  3. Compare data from step 1 and 2, and perform necessary update and delete, so the records in the database match the ones in the SahamIDX

Note: deletion of record from the database may occur when there is a modification on the stock split date on SahamIDX website

Data Sources for idx_institution_transactions

Morningstar through RapidAPI (MS Finance and Morning Star)

API endpoint:

Example request:

import requests

url = "https://morning-star.p.rapidapi.com/stock/v2/get-ownership"

querystring = {
                "performanceId": ms_code,
                "ownership": ownership, # either "Buyers" or "Sellers"
                "asset": "institution",
            }

headers = {
	"X-RapidAPI-Key": rapid_api_key,
	"X-RapidAPI-Host": "morning-star.p.rapidapi.com"
}

response = requests.get(url, headers=headers, params=querystring)

print(response.json())

Data Processing

The API will return the top 20 institutional buyers and sellers for each symbol in the previous month, and we store the institution name and changeAmount. We calculate the net_transaction by summing over the changeAmount of the top buyers and sellers.

Data Sources for idx_manual_input

Income statement data manually extracted from PDF report. USD-IDR conversion rate used from the last working day of the financial year from here

Data Processing

Data will be processed from an Excel file with a predetermined format to the following structure:

sankey_component:

{
  "links": [
    {
      "value": 6512275000000,
      "source": "Loan",
      "target": "Interest Income"
    },
    // ... (more links)
  ],
  "nodes": [
    {
      "id": "Loan",
      "nodeColor": "hsl(195, 53%, 79%)"
    },
    // ... (more nodes)
  ]
}

income_stmt_metrics:

{
  "provision": 147840000000,
  "net_income": 4052678000000,
  "income_taxes": 975392000000,
  ... (more income statement metrics)
  "int_income_breakdown": [
    {
      "amount": 6512275000000,
      "category": "Loan"
    },
    // ... (more breakdown items)
  ],
  "operating_expense_breakdown": [
    {
      "amount": 11749000000,
      "category": "Fees and commissions"
    },
    // ... (more breakdown items)
  ]
}

Data Sources for idx_esg_score

Data (esg score) are obtained from IDX official website (e.g. https://idx.co.id/secondary/get/esg/detail/BBCA?language=en-us). Using Selenium to scrape the following information:

'symbol'
'last_esg_update_date'
'esg_score'
'controversy_risk'
'environment_risk_score'
'social_risk_score'
'governance_risk_score'

Data Sources for idx_historical_mcap

Because historical market cap data are not available from YF API, currently we retrieve the data from IDX Digital Statistic. Going on forward, we will populate the idx_historical_mcap table based on the data store on idx_daily_data

Data Sources for idx_company_forecast

Data (company_forecast) are obtained from Trading View Website in Forecast section (e.g https://www.tradingview.com/symbols/IDX-TPIA/forecast/)

Backup: Yahoo Finance official website (e.g https://finance.yahoo.com/quote/BBCA/analysis?p=BBCA).

Scrape the following information: revenue forecast amount, eps forecast amount

Data Processing

  • Process numeric value: replace T, B, M, K to number (e.g multiply by 1e12 for T), then save the value as an integer
  • Change datatypes: eps_estimate ( float32 ), revenue_estimate ( float64)

Final Columns

  • symbol: IDX company symbol ending with JK
  • year: the estimate for which year
  • revenue_estimate: estimated revenue based on yahoo finance analysis tab
  • eps_estimate: estimated earnings per share based on yahoo finance analysis tab

Data Sources for idx_ipo_perf

Price performances (7d, 30d, 90d, and 365d) are calculated using price data from YF API, calculated since the first trading date. The first trading date is considered valid if the date is at most 1 day later after the listing date.

Data Sources for sgx_short_sell

All of the data (except symbol variable) is sourced from sgx website using this url "f'https://api2.sgx.com/sites/default/files/reports/short-sell/{today.year}/{month}/website_DailyShortSell{date}1815.txt'".

  • symbol: From sgx_companies table. Left join the sgx short sell data with sgx_companies data using name column, however it used fuzzywuzzy package to take the name similarity, because the company name is not identical

Data Sources for index_daily_data

Daily Indices Price data since January 2019. All the index prices come from Yahoo Finance

  • All prices data in here using IDR except for:
    • STI (Straits Time Index, SGX), use SGD
    • KLSE (Bursa Malaysia), use MYR