Data Sources and Processing - supertypeai/sectors-kb GitHub Wiki
This section focuses on how we obtain and process the data before storing it inside the database.
idx_financials_annual
and idx_financials_quarterly
List of Columns for Metadata:
- symbol: IDX company symbol ending with JK
- date: date of the financial report
Income statement metrics:
- total_revenue: For banks, obtained by adding the Net Interest Income and Non-Interest Income
- gross_income: Total Revenue - Costs of Goods Sold; Not applicable for banks or insurance companies, since they don't have Costs of Goods Sold
- operating_income: Gross Income - Operating Expense (excluding Costs of Goods Sold); For banks, obtained by subtracting the Interest Expense, Non-Interest Expense, and Loan Loss Provision from the Total Revenue.
- pretax_income: Operating Income + Non-Operating Income
- income_taxes
- net_income: Earnings Before Tax - Income Tax - Minority Interest
- ebit: Earnings Before Tax (Pretax Income) + Non-Operating Interest Expense; Not applicable for banks
- ebitda: EBIT + Deprecation and Amortization; Not applicable for banks or insurance companies
- diluted_shares_outstanding
- interest_expense_non_operating: Not applicable for banks
Balance sheet metrics:
- cash_and_short_term_investments: For companies outsides bank/ insurance companies, equivalent of Cash and Equivalents
- cash_only: For insurance companies, equivalent of Cash and Equivalents
- total_cash_and_due_from_banks: For banks, equivalent of Cash and Equivalents
- total_assets
- total_non_current_assets: Also known as fixed assets, not applicable for banks or insurance companies
- total_liabilities
- total_current_liabilities: Not applicable for banks or insurance companies
- total_debt
- total_equity
- stockholders_equity: Also known as book value equity
Cash flow metrics:
- net_operating_cash_flow
- free_cash_flow
idx_financials_annual
and idx_financials_quarterly
Data Sources for We have 2 data sources: WSJ and Yahoo Finance (YF) API. We determined the wsj_format by the formatting from WSJ because it follows strict formatting for companies in the same industry, e.g. Banking and Insurance have different formats due to having different metrics to consider. On the other hand, the formatting from YF API is quite random even for companies in the same industry.
We prioritize YF API as the source due to more convenient data retrieval. But for Banking and Insurance, we prioritize WSJ as the source due to the lack of certain metrics and standardization on the YF API. There are 3 steps in deciding how to determine this process:
- Filter companies by their industries, then for Banks and Insurance companies, set their data source to WSJ if the company is available in WSJ.
- Set the other companies' data source to YF if the company is available in YF. Note: this may still include Banking or Insurance industries if the company is not available in WSJ.
- For companies that are not available in YF, set their data source to WSJ if the company is available in WSJ. Note: this may still result in missing data for companies if data does not exist in both sources.
Looking at the 3 steps above, ideally, we would want to have Banking and Insurance company data from WSJ and other industries from YF. Due to that reason, we have a check mechanism to check this periodically ensuring Banking and Insurance industries data are sourced from WSJ. See below to see the wsj_format classification details.
WSJ
Data from WSJ are provided by FactSet, which also provides data for TradingView and MarketWatch.
Example URL:
- https://www.wsj.com/market-data/quotes/ID/XIDX/ADRO/financials/quarter/income-statement
- https://www.wsj.com/market-data/quotes/ID/XIDX/ADRO/financials/quarter/balance-sheet
- https://www.wsj.com/market-data/quotes/ID/XIDX/ADRO/financials/quarter/cash-flow
Note: quarter
can be replaced with annual
Formats
There are 4 different formats of financial data for IDX companies:
- General 1 (mostly non-insurance/ non-banking companies): https://www.wsj.com/market-data/quotes/ID/XIDX/ADRO/financials/quarter/income-statement
- General 2 (mostly companies related to real estate or finance): https://www.wsj.com/market-data/quotes/ID/XIDX/APLN/financials/quarter/income-statement
- Insurance (mostly insurance companies): https://www.wsj.com/market-data/quotes/ID/XIDX/AHAP/financials/quarter/income-statement
- Banking (mostly banking companies): https://www.wsj.com/market-data/quotes/ID/XIDX/BBCA/financials/quarter/income-statement
Note that WSJ may employ different category classifications compared to IDX
Data Processing
General 1 Format
- Interest Expense is regarded as Non-Operating Interest Expense
- Cash & Short Term Investments would be shown as Cash and Equivalents in Frontend
- Total Debt = Long Term Debt + Short Term Debt & Current Portion of Long Term Debt
If both Long Term Debt and Short Term Debt & Current Portion of Long Term Debt are "-", then we will leave Total Debt as null.
- Operating Income = Gross Income - SG&A Expense - Other Operating Expense
- EBIT = Pretax Income + Non-Operating Interest Expense
- EBITDA = EBIT + Deprecation and Amortization
- Fixed Assets = Total Assets - Total Current Assets
Total Debt, Operating Income, EBIT and Fixed Assets are not available by default. EBITDA is available, but the value does not conform to our defined formula.
General 2 Format
- Total Interest Expense is regarded as Non-Operating Interest Expense
- Cash & Short Term Investments would be shown as Cash and Equivalents in Frontend
- Gross Income = Operating Income + Selling, General & Admin. Expenses + Other Operating Expense
- COGS = Total Revenue - Gross Income (not needed in Frontend)
- EBIT = Pretax Income + Non-Operating Interest Expense
- EBITDA = EBIT + Deprecation and Amortization
- Total Non-Current Liabilities = Long Term Debt + Provision for Risks & Charges + Deferred Taxes Credit + Other Liabilities
- Total Current Liabilities = Total Liabilities - Total Non-Current Liabilities
Gross Income, COGS, EBIT, EBITDA, Total Non-Current Liabilities and Total Current Liabilities are not available by default.
Banking Format
- Total Cash & Due from Banks would be shown as Cash and Equivalents in Frontend
- Total Revenue = Net Interest Income + Non-Interest Income
Insurance Format
- Cash Only would be shown as Cash and Equivalents in Frontend
- Operating Income Before Interest Expense is regarded as Operating Income
- Interest Expense, Net of Interest Capitalized is regarded as Non-Operating Interest Expense
- EBIT = Pretax Income + Non-Operating Interest Expense
Yahoo Finance API
Data are obtained through yfinance Python package (https://pypi.org/project/yfinance/).
Example usage:
import yfinance as yf
ticker = yf.Ticker("ADRO.JK")
# to obtain annual data (return a dataframe)
ticker.income_stmt
ticker.balance_sheet
ticker.cashflow
# to obtain quarterly data (return a dataframe)
ticker.quarterly_income_stmt
ticker.quarterly_balance_sheet
ticker.quarterly_cashflow
Data Processing
For banks, the following metrics are set to null:
- gross_income
- ebitda
- cash_and_short_term_investments
- total_non_current_assets
- total_current_liabilities
- ebit
- interest_expense_non_operating
Note: total_cash_and_due_from_banks is not available on YF API
For insurance companies, the following metrics are set to null:
- gross_income
- ebitda
- cash_and_short_term_investments
- total_non_current_assets
- total_current_liabilities
Note: cash_only is not available on YF API
idx_company_profile
Data Sources for IDX
Data (company profile) are obtained from IDX official website (e.g. https://www.idx.co.id/en/listed-companies/company-profiles/BBCA). Using Selenium to scrape the following information:
"company_name",
"symbol",
"address",
"email",
"phone",
"fax",
"NPWP",
"website",
"listing_date",
"listing_board",
"sub_sector_id",
"industry",
"sub_industry",
"register",
"shareholders",
"directors",
"commissioners",
"audit_committees",
"delisting_date"
The following columns are cleaned so the format and values are more standardized.
"shareholders",
"directors",
"commissioners",
"audit_committees"
Other columns:
- nologo: Manual input
- wsj_format: WSJ
- current_source: YF API and WSJ
- yf_currency: YF API
- morningstar_code: MS website
- ipo_price: IDX E-Ipo
Check https://github.com/supertypeai/sectors-kb/wiki/ETL-Pipeline-Process for more information about wsj_format and current_source
idx_daily_data
Data Sources for Yahoo Finance API
Example usage:
import yfinance as yf
ticker = yf.Ticker("ADRO.JK")
# return the current market cap
ticker.info.get("marketCap", None)
# return a dataframe containing the daily close price and volume starting from last_date. auto_adjust is set to false because we don't want the price to be adjusted by dividend.
ticker.history(start=last_date, auto_adjust=False)["Close", "Volume"](/supertypeai/sectors-kb/wiki/"Close",-"Volume")
If the current market cap is not available from YF API, we will try to obtain it by scraping the YF website, e.g. https://finance.yahoo.com/quote/BBCA.JK/key-statistics?p=BBCA.JK. If the current market cap is still not available, we will manually derive it using the most recently available market cap and close price (assuming the share number is constant). Occasionally we might also have a null market cap for the past date. In that case, we will manually derive it using the current market cap and close price (if available) or resorting to the most recently available market cap and close price.
idx_key_stats
Data Sources for Yahoo Finance API
Example usage:
import yfinance as yf
ticker = yf.Ticker("ADRO.JK")
# return a dictionary
ticker.info
# return a dataframe containing % share held by insiders and institutions
ticker.major_holders
The following columns are obtained from ticker.info
: forward_eps, recommendation_mean, and employee_num. Data from ticker.major_holders
are processed into a dictionary and stored in holders_breakdown
column.
Scripted Input
Point Summaries
point_summaries
is updated using a script. Calculated based on the ticker's metrics on several sections.
Intrinsic Value through Discounted Cash Flow Analysis
Inflation data is obtained from https://www.inflationtool.com/indonesian-rupiah?amount=100&year1=2019&year2={year2}&frequency=yearly
Financial data is obtained from:
- idx_company_report_mv: historical_valuation, historical_financials, market_cap
- idx_financials_annual: basic_eps, share_issued
- idx_sector_reports_calc: historical_valuation
Data Processing:
- Calculate Average Cyclically Adjusted Earning (future value): price×(1+avg_inflation_rate)**(total_elements−i−1). In simpler terms, it's applying an inflation factor to each element in the list based on its position in the list. The further down the list (higher index i), the more times it applies the inflation factor. The result is a list of future values for each element in the net_income_list.
- Calculate the sub_sector_roe : (sub_sector_pb_ttm/sub_sector_pe_ttm), sub_sector_npm : (sub_sector_ps_ttm/sub_sector_pe_ttm), ticker_roe : (ticker_pb_ttm/ticker_pe_ttm), ticker_npm : (ticker_ps_ttm/ticker_pe_ttm)
- Calculate ticker_der: for banking sub sector (19), der formula is ticker_total_liabilities/ticker_total_equity. Instead, der formula is ticker_total_debt/ticker_total_equity
- Calculate ticker_profit_margin_stability:
- Operating Profit Margin (OPM): Net Income/ Total Revenue
- Calculate Average OPM & Standard Deviation OPM
- Profit Margin Stability: Average Operating Profit Margin/Standard Deviation - Calculate ticker_earning_predictability: calculate the correlation between years and net_income
- Calculate discount_rate:
- Beta Calculation: Define a list of beta values [0.33, 0.67, 1, 1.5 , 2]. Then, create a dictionary (data) containing financial metrics and their corresponding thresholds. The thresholds are specified for metrics such as Return on Equity (ROE), Debt-to-Equity Ratio (DER), Net Profit Margin (NPM), Profit Margin Stability, Earning Predictability, and Market Capitalization. Loop through the metrics in the dictionary: a. If the metric is 'roe', 'npm', 'earning_predictability', or 'market_cap': If the metric value is greater than the first threshold, assign the lowest beta (0.33). Otherwise, iterate through the remaining thresholds, and interpolate the beta based on the metric's position between the thresholds. b. If the metric is 'der' or not in the specified metrics: If the metric value is less than the first threshold, assign the lowest beta (0.33). Otherwise, iterate through the remaining thresholds, and interpolate the beta based on the metric's position between the thresholds.
- Calculate Median Beta:
- Calculate the discount rate using the formula: Discount Rate=0.07+(0.05×Median Beta)
- Calculate cae_per_share : avg_cae/share_issued
- Calculate avg_eps : average between cae_per_share & diluted_eps
- Calculate intrinsic_value :
- Calculate the intrinsic value for the next 100 years. For the first 10 years: value = avg_eps×(1+growth_rate_10y)i . If the year is after the first 10 years:value=values[9]×(1+growth_rate_after_10y)(i−10)
- Convert to present value: present_value= value / (1+discount_rate)**(i+1) 3. Calculate the intrinsic value by adding all of the present_value
TradingView
Data of stock's rating are gathered from TradingView. There are two types of stocks rating that are gathered, the technical rating and the analyst rating.
Technical Rating
Technical rating is breakdown into three periods of time, which are daily, weekly, and monthly. Each of technical rating data is collected frequently based on its relevant periods. The structure of the data is displayed as below:
{
"summary": {
"buy": 7,
"sell": 9,
"neutral": 10,
"updated_on": "2024-07-21 08:15:02"
},
"oscillator": {
"buy": 1,
"data": [
{
"name": "Relative Strength Index (14)",
"value": 51,
"action": "Neutral"
},
...
],
"sell": 1,
"neutral": 9,
"updated_on": "2024-07-21 08:15:02"
},
"moving_average": {
"buy": 6,
"data": [
{
"name": "Exponential Moving Average (10)",
"value": 2316,
"action": "Sell"
},
...
],
"sell": 8,
"neutral": 1,
"updated_on": "2024-07-21 08:15:02"
}
}
Analyst Rating
Analyst rating is collected once every three months. The structure of the data is displayed as below:
{
"buy": 2,
"hold": 3,
"sell": 0,
"n_analyst": 10,
"strong_buy": 4,
"updated_on": "2024-07-10 00:35:21",
"strong_sell": 1
}
idx_dividend
Data Sources for SahamIDX
Data Processing
Dividend records are retrieved for a specified date range. Besides the dividend value, we also store the yield in idx_dividend
. The dividend yield is calculated by dividing the dividend value for each date with the mean closing price of the stock for the corresponding year. The dividend yield is computed in Python and saved in the table only for the preceding year. Meanwhile, the dividend yield for the current year is dynamically computed at the view level.
idx_stock_split
Data Sources for Stock Split & Reverse Stock Split
SahamIDX:Data Processing
- Retrieve (reverse) stock split records from the database for future dates
- Retrieve (reverse) stock split records from SahamIDX for future dates
- Compare data from step 1 and 2, and perform necessary update and delete, so the records in the database match the ones in the SahamIDX
Note: deletion of record from the database may occur when there is a modification on the stock split date on SahamIDX website
idx_institution_transactions
Data Sources for Morningstar through RapidAPI (MS Finance and Morning Star)
API endpoint:
- https://ms-finance.p.rapidapi.com/stock/v2/get-ownership
- https://morning-star.p.rapidapi.com/stock/v2/get-ownership
Example request:
import requests
url = "https://morning-star.p.rapidapi.com/stock/v2/get-ownership"
querystring = {
"performanceId": ms_code,
"ownership": ownership, # either "Buyers" or "Sellers"
"asset": "institution",
}
headers = {
"X-RapidAPI-Key": rapid_api_key,
"X-RapidAPI-Host": "morning-star.p.rapidapi.com"
}
response = requests.get(url, headers=headers, params=querystring)
print(response.json())
Data Processing
The API will return the top 20 institutional buyers and sellers for each symbol in the previous month, and we store the institution name
and changeAmount
. We calculate the net_transaction
by summing over the changeAmount
of the top buyers and sellers.
idx_manual_input
Data Sources for Income statement data manually extracted from PDF report. USD-IDR conversion rate used from the last working day of the financial year from here
Data Processing
Data will be processed from an Excel file with a predetermined format to the following structure:
sankey_component
:
{
"links": [
{
"value": 6512275000000,
"source": "Loan",
"target": "Interest Income"
},
// ... (more links)
],
"nodes": [
{
"id": "Loan",
"nodeColor": "hsl(195, 53%, 79%)"
},
// ... (more nodes)
]
}
income_stmt_metrics
:
{
"provision": 147840000000,
"net_income": 4052678000000,
"income_taxes": 975392000000,
... (more income statement metrics)
"int_income_breakdown": [
{
"amount": 6512275000000,
"category": "Loan"
},
// ... (more breakdown items)
],
"operating_expense_breakdown": [
{
"amount": 11749000000,
"category": "Fees and commissions"
},
// ... (more breakdown items)
]
}
idx_esg_score
Data Sources for Data (esg score) are obtained from IDX official website (e.g. https://idx.co.id/secondary/get/esg/detail/BBCA?language=en-us). Using Selenium to scrape the following information:
'symbol'
'last_esg_update_date'
'esg_score'
'controversy_risk'
'environment_risk_score'
'social_risk_score'
'governance_risk_score'
idx_historical_mcap
Data Sources for Because historical market cap data are not available from YF API, currently we retrieve the data from IDX Digital Statistic. Going on forward, we will populate the idx_historical_mcap
table based on the data store on idx_daily_data
idx_company_forecast
Data Sources for Data (company_forecast) are obtained from Trading View Website in Forecast section (e.g https://www.tradingview.com/symbols/IDX-TPIA/forecast/)
Backup: Yahoo Finance official website (e.g https://finance.yahoo.com/quote/BBCA/analysis?p=BBCA).
Scrape the following information: revenue forecast amount, eps forecast amount
Data Processing
- Process numeric value: replace T, B, M, K to number (e.g multiply by 1e12 for T), then save the value as an integer
- Change datatypes: eps_estimate ( float32 ), revenue_estimate ( float64)
Final Columns
- symbol: IDX company symbol ending with JK
- year: the estimate for which year
- revenue_estimate: estimated revenue based on yahoo finance analysis tab
- eps_estimate: estimated earnings per share based on yahoo finance analysis tab
idx_ipo_perf
Data Sources for Price performances (7d, 30d, 90d, and 365d) are calculated using price data from YF API, calculated since the first trading date. The first trading date is considered valid if the date is at most 1 day later after the listing date.
sgx_short_sell
Data Sources for All of the data (except symbol variable) is sourced from sgx website using this url "f'https://api2.sgx.com/sites/default/files/reports/short-sell/{today.year}/{month}/website_DailyShortSell{date}1815.txt'".
- symbol: From sgx_companies table. Left join the sgx short sell data with sgx_companies data using name column, however it used fuzzywuzzy package to take the name similarity, because the company name is not identical
index_daily_data
Data Sources for Daily Indices Price data since January 2019. All the index prices come from Yahoo Finance
- All prices data in here using IDR except for:
- STI (Straits Time Index, SGX), use SGD
- KLSE (Bursa Malaysia), use MYR