
Predicting Bank Holding Company Returns Based on Sentiment Analysis

#SentimentAnalysis, #TimeSeriesForecasting, #Transformers

Emma Mignocchi

Abstract

This project explores the predictive power of sentiment analysis in financial disclosures by analyzing bank holding companies’ Management’s Discussion and Analysis (MD&A), Risk Factors, and Quantitative and Qualitative Disclosures About Market Risk sections. Using sentence-based and word-based semantic analysis, we compare their effectiveness in forecasting bank stock returns within one day of the release of annual reports. Comparing transformer-based sentence semantics with word-level semantics is crucial, as sentence embeddings can capture contextual nuance and phrase-level sentiment shifts that word-based methods may overlook, potentially improving or simplifying current prediction models. To evaluate predictive performance, both traditional econometric models (ARMA and aggregated ARMA) and a deep learning method, Long Short-Term Memory (LSTM) networks, are implemented. Results indicate that sentence-level sentiment scores are significantly more volatile, while word-level models consistently outperform sentence-level models across all three approaches; the best performance is achieved by the aggregated ARMA model using word-level sentiment scores, with roughly 10% lower MSE than the word-based LSTM.

What this project is about

This project investigates the role of sentiment analysis in predicting bank holding company stock price movements following the release of annual reports. While prior research has explored textual sentiment’s impact on banks themselves, existing studies lack transparency regarding the specific banks used. To address this, we shift our focus to bank holding companies and analyze sentiment within key financial report sections.

We use data from 44 bank holding companies spanning 2007 to 2019, examining negative and net sentiment scores extracted from:

  • MD&A (Management’s Discussion and Analysis)
  • Risk Factors
  • Quantitative and Qualitative Disclosures About Market Risk

Additionally, we incorporate bank closing prices (1-day returns) post-earnings release to measure the market reaction.

The primary objectives are:

  1. Compare word-level vs. sentence-level sentiment analysis and evaluate their statistical properties.
  2. Examine whether sentiment scores offer predictive value in stock returns, particularly through ARMA, aggregated ARMA, and LSTM models. Although System GMM was initially considered, it was deemed unsuitable due to the limited dataset (~400 observations), leading to the implementation of ARMA(2,0,1) models as baselines.

Progress

Data Analysis

  • Collected financial reports and stock return data for 44 bank holding companies (filtered to 33 companies with complete data).
  • Conducted word-based vs. sentence-based sentiment analysis, identifying significant differences in sentiment variability.
  • Identified key trends in sentiment over time, particularly around major financial events (e.g., 2010 downturn).

Modeling & Experiments

  • Attempted System GMM, but the small dataset resulted in an unreliable Hansen J-statistic.
  • Developed word-based and sentence-based ARMA(2,0,1) models for baseline analysis.
  • Developed word-based and sentence-based aggregated ARMA models.
  • Developed word-based and sentence-based LSTM models.

Preliminary Findings

  • Sentence-level sentiment scores exhibit greater variance, while word-level scores provide smoother and more stable trends.
  • In the ARMA(2,0,1) models, neither word- nor sentence-based sentiment scores show strong statistical significance as predictors of returns.
  • Aggregated ARMA models tend to produce more conservative predictions overall than LSTM, with sentence-based inputs being less responsive to extremes, whereas word-based inputs drive more pronounced predictions.
  • In contrast, LSTM models exhibit the opposite behavior—sentence-based inputs lead to more extreme predictions, while word-based models remain more conservative. Overall, LSTM predictions are more volatile compared to the aggregated ARMA approach.
  • Overall, the best-performing models across both approaches used word-based sentiment scores. Among all models tested, the aggregated ARMA model with word-based inputs performed best, followed by the LSTM with word-based inputs. Both of these models greatly outperformed the baseline ARMA(2,0,1) models in terms of predictive accuracy and error reduction.

Approach

Main approach: Once the dataset was prepared, I applied two different sentiment analysis methodologies:

1. Sentence-Based Sentiment Analysis (Transformer-Based Approach) For sentence-level sentiment scoring, I utilized the DistilRoberta model, specifically the mrm8488/distilroberta-finetuned-financial-news-sentiment-analysis model, which is fine-tuned on the Financial PhraseBank dataset. The model was trained with the following hyperparameters: Learning Rate: 2e-5, Training Batch Size: 8, Evaluation Batch Size: 8, Optimizer: Adam (betas = (0.9, 0.999), epsilon = 1e-8), LR Scheduler: Linear, Number of Epochs: 5. This model was chosen for its ability to capture contextual nuance at the sentence level, providing more refined sentiment scoring than traditional word-based methods.
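For illustration, here is a minimal sketch of this scoring step. The NLTK sentence splitter and the exact output label names are my assumptions, not details taken from the project code:

```python
from transformers import pipeline
import nltk

nltk.download("punkt", quiet=True)  # sentence tokenizer (assumed splitter choice)

classifier = pipeline(
    "sentiment-analysis",
    model="mrm8488/distilroberta-finetuned-financial-news-sentiment-analysis",
)

def sentence_label_counts(section_text):
    """Count positive/negative/neutral sentences in one report section."""
    sentences = nltk.sent_tokenize(section_text)
    counts = {"positive": 0, "negative": 0, "neutral": 0}
    for result in classifier(sentences, truncation=True):
        # label names assumed to be "positive"/"negative"/"neutral"
        counts[result["label"].lower()] += 1
    return counts
```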

2. Word-Based Sentiment Analysis (Lexicon-Based Approach) For word-level sentiment scoring, I applied Loughran and McDonald’s financial dictionary, following the same methodology as Javid Iqbal & Khalid Riaz in their research. This method involves counting occurrences of positive, negative, and neutral words within each financial report section. Both sentiment analysis methods thus produced counts of positive, negative, and neutral units (words or sentences) for each company and year. To convert these counts into sentiment indices, I applied the widely used sentiment formulas (a minimal implementation sketch follows the formulas):

$$ \text{Negative Sentiment Score} = \frac{\text{Negative Word Count}}{\text{Total Word Count}} \times 100 $$

$$ \text{Net Sentiment Score} = \frac{\text{Positive Count} - \text{Negative Count}}{\text{Positive Count} + \text{Negative Count}} $$

These formulas assign two sentiment scores per section (MD&A, Market Risk Disclosures, Risk Factors) per company per year. Companies missing any of these three sections were removed from the analysis, reducing the dataset to 33 bank-holding companies.
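As a concrete illustration of the index computation, here is a minimal sketch, assuming the Loughran-McDonald word lists have already been loaded into two Python sets (the tokenization rule and variable names are hypothetical, not taken from the project code):

```python
import re

def sentiment_scores(section_text, lm_positive, lm_negative):
    """Compute the two indices from Loughran-McDonald word-list membership."""
    # The LM master dictionary stores words in uppercase, so match on uppercase.
    tokens = re.findall(r"[A-Za-z']+", section_text.upper())
    pos = sum(t in lm_positive for t in tokens)
    neg = sum(t in lm_negative for t in tokens)
    total = len(tokens)
    negative_score = 100.0 * neg / total if total else 0.0
    net_score = (pos - neg) / (pos + neg) if (pos + neg) else 0.0
    return negative_score, net_score
```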

Predictive Modeling: ARMA Model

To analyze the relationship between sentiment scores and stock returns, I implemented an ARMA model (AutoRegressive Moving Average). The ARMA(2,0,1) model was chosen to account for lagged dependencies, capturing the effect of current sentiment scores and past stock returns. The mathematical representation of the ARMA(2,0,1) model used in this analysis is:

$$ R_t = c + \phi_1 R_{t-1} + \phi_2 R_{t-2} + \theta_1 \epsilon_{t-1} + \beta_1 X_{1,t} + \beta_2 X_{2,t} + \dots + \beta_6 X_{6,t} + \epsilon_t $$

where:

  • $R_t$: Stock return at time t.
  • $R_{t-1}, R_{t-2}$: Lagged returns from time t-1 and t-2, representing the autoregressive (AR) terms.
  • $\epsilon_{t-1}$: The lagged error term from time t-1, representing the moving average (MA) component.
  • $X_{1,t}, X_{2,t}, \dots, X_{6,t}$: The six sentiment scores (i.e., the three report sections, each with a net and a negative sentiment score).
  • $c$: Intercept term.
  • $\phi_1, \phi_2$: Coefficients for the autoregressive terms.
  • $\theta_1$: Coefficient for the moving average term.
  • $\beta_1, \beta_2, \dots, \beta_6$: Coefficients for each exogenous variable.
  • $\epsilon_t$: Error term at time t.
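A minimal sketch of fitting this specification with statsmodels is below. The DataFrame and column names are hypothetical; only the (2,0,1) order and the six exogenous scores come from the text:

```python
from statsmodels.tsa.arima.model import ARIMA

# Hypothetical column names for the six exogenous sentiment scores.
sentiment_cols = ["mdna_neg", "mdna_net", "risk_neg", "risk_net",
                  "disc_neg", "disc_net"]

model = ARIMA(
    endog=df["return_1d"],       # 1-day post-release returns
    exog=df[sentiment_cols],     # the six sentiment scores X_{1..6,t}
    order=(2, 0, 1),             # AR(2), no differencing, MA(1)
)
result = model.fit()
print(result.summary())                  # coefficients, p-values, log-likelihood
ssr = float((result.resid ** 2).sum())   # sum of squared residuals
```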

Predictive Modeling: Aggregated ARMA Model

To enhance the robustness of the time series analysis, an aggregated ARMA modeling approach was implemented. This method involves training individual ARMA(2,0,1) models for each bank holding company in the dataset and then averaging their coefficients to construct a composite, sector-wide model. By aggregating across firms, this approach smooths out idiosyncratic effects and provides a more comprehensive view of how sentiment influences financial performance across the banking sector. Unlike the initial ARMA(2,0,1), this aggregated approach included a clear training/testing split within each ARMA(2,0,1) training, enabling more robust evaluation and comparison. The architecture for this model is illustrated below in Figure 1.
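A minimal sketch of the aggregation step, under the same hypothetical naming as above:

```python
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

param_vectors = []
for ticker, bank_df in df.groupby("ticker"):      # one series per holding company
    train = bank_df.iloc[:-n_test]                # per-bank train/test split (n_test assumed)
    res = ARIMA(endog=train["return_1d"],
                exog=train[sentiment_cols],
                order=(2, 0, 1)).fit()
    param_vectors.append(res.params.values)       # identical parameter order in each fit

# Composite, sector-wide coefficient set (simple unweighted average).
aggregated_params = np.mean(param_vectors, axis=0)
```

Predictions over each bank's test window can then be produced by plugging `aggregated_params` back into a model built on that window, for example via statsmodels' `filter(aggregated_params)`.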

Predictive Modeling: LSTM Model

To explore the potential of deep learning for predicting bank performance metrics, I implemented a Long Short-Term Memory (LSTM) network. The model was trained on sequential data where each input sample consisted of 3 consecutive time steps, with 6 features per time step (input shape: (3, 6)). The LSTM layer contained 32 hidden units, followed by a Dense output layer for the final prediction. The model was trained with 50 epochs, a batch size of 16, Adam optimizer, and mean squared error (MSE) as the loss function.
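A minimal Keras sketch of the described architecture, assuming the inputs have already been windowed into arrays of shape (samples, 3, 6) named `X_train` and `y_train` (hypothetical names):

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.Input(shape=(3, 6)),   # 3 time steps x 6 sentiment features
    tf.keras.layers.LSTM(32),       # 32 hidden units
    tf.keras.layers.Dense(1),       # scalar return prediction
])
model.compile(optimizer="adam", loss="mse")
history = model.fit(X_train, y_train, epochs=50, batch_size=16)
```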

Baselines: To assess the effectiveness of sentiment scoring methods, I used Sum of Squared Residuals (SSR) and Log-Likelihood as evaluation metrics in the ARMA regression results. The key comparison was between sentence-based sentiment scores (transformer-based semantics) and word-based sentiment scores (lexicon approach). To contrast the models themselves, I used Mean Squared Error (MSE) to compare across different test sizes. Originally, I planned to replicate the analysis of Javid Iqbal & Khalid Riaz by using the same banks as in their study. However, their paper did not disclose the specific banks used, which required me to compile my own dataset of bank holding companies.

Novelty: The novelty of this project is highlighted in two key ways. First, it provides a direct comparison between Transformer-based sentence-level sentiment analysis and traditional word-level sentiment scoring, addressing a gap in existing research that often emphasizes embeddings or prompt-based approaches without standardized, structured evaluations. By applying consistent sentiment scoring methodologies across both levels of analysis, this study offers a rigorous comparison of their effectiveness in predictive financial modeling. Second, the project focuses specifically on bank holding companies—a subset of financial institutions that has received limited attention in sentiment-driven forecasting research. By isolating sentiment impacts within this sector and examining key regulatory filings, such as MD&A and Risk Factor sections, the study yields more targeted insights. Additionally, by implementing and comparing multiple modeling approaches—including baseline ARMA, aggregated ARMA, and LSTM models—this research demonstrates the relative strengths of each method, highlighting the superior performance of aggregated ARMA and word-based sentiment scores.

Experiments

Data The first step of this project involved gathering financial report data from the JanosAudran/financial-reports-sec repository. The dataset was too large to load in its entirety, so I downloaded it in 10 separate JSONL files to facilitate import into Jupyter Notebook. Once the data was successfully loaded, I examined the filing data, which contained over 20 different sections. To focus on sections that reflected tone and sentiment rather than numerical or logical text, I selected the following three: Management’s Discussion and Analysis (MD&A), Quantitative and Qualitative Disclosures About Market Risk, and Risk Factors. Additionally, I compiled a list of 44 bank holding companies through independent research, ensuring the dataset was specifically tailored to financial institutions. The dataset contained 1-day, 5-day, and 30-day return data, and I chose 1-day returns to capture the immediate market reaction to financial report disclosures. Finally, the original list of 44 companies shrank to 33 after filtering out those missing any of the three sections during the 2007-2019 period.
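For reference, a minimal sketch of the chunked loading step (the file names and the section/field keys are hypothetical, not taken from the project code):

```python
import pandas as pd

# Load the 10 JSONL chunks and stack them into one DataFrame.
frames = [pd.read_json(f"financial_reports_part{i}.jsonl", lines=True)
          for i in range(10)]
filings = pd.concat(frames, ignore_index=True)

# Keep only the three tone-bearing sections (column/key names assumed).
keep = ["section_1A", "section_7", "section_7A"]  # Risk Factors, MD&A, Market Risk
filings = filings[filings["section"].isin(keep)]
```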

Evaluation method The effectiveness of sentiment analysis in predicting stock returns was evaluated using a time-series regression approach, specifically an ARMA model. The evaluation metrics include:

  • Sum of Squared Residuals (SSR) & Mean Squared Error (MSE): Measures the goodness of fit of the model by evaluating the residual errors.
  • Log-Likelihood: Evaluates the likelihood of the observed data given the model parameters.
  • Statistical Significance (p-values): Determines whether sentiment variables have a meaningful impact on return prediction.
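For concreteness, the residual-based metrics reduce to the following (log-likelihood is read directly from the fitted statsmodels result, e.g. `result.llf`):

```python
import numpy as np

def ssr(y_true, y_pred):
    """Sum of squared residuals."""
    return float(np.sum((np.asarray(y_true) - np.asarray(y_pred)) ** 2))

def mse(y_true, y_pred):
    """Mean squared error."""
    return float(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2))
```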

Experimental details Preprocessing and Sentiment Analysis Once the dataset was collected and filtered, sentiment analysis was applied using two distinct methodologies:

  • Sentence-Based Sentiment Analysis using DistilRoberta fine-tuned for financial sentiment analysis (mrm8488/distilroberta-finetuned-financial-news-sentiment-analysis)
  • Word-Based Sentiment Analysis using Loughran & McDonald’s financial sentiment dictionary

After computing sentiment scores for each company’s report sections, the following modeling approaches were implemented:

  • ARMA(2,0,1) Model: Initially included sentiment scores and lagged returns (Return_t-1) as predictors. Lagged returns were excluded due to lack of predictive power.
  • Aggregated ARMA Model: Individual ARMA models for each bank, with aggregated (averaged) coefficients to develop a generalized ARMA model.
  • LSTM Model: Employed to address limitations observed in the ARMA models by capturing complex, non-linear temporal dependencies.

While Javid Iqbal and Khalid Riaz used a System GMM model, my results with this approach were highly inconsistent, suggesting either an insufficient amount of data for the model to generate reliable estimates or potential issues with instrument validity. It is also possible that the structure of the data, such as multicollinearity among explanatory variables or weak instruments, contributed to the poor model performance. Further investigation into alternative model specifications or additional data preprocessing steps may be necessary to improve the robustness of the results.

Hardware Specifications Experiments were conducted on a MacBook Pro M1 (16GB RAM) using Jupyter Notebook with Python libraries:

  • pandas for data manipulation
  • transformers for sentence-based sentiment analysis
  • statsmodels for ARMA modeling
  • TensorFlow for LSTM modeling
  • matplotlib/seaborn for data visualization

Results First, consider the quantitative comparison of word-level vs. sentence-level semantic analysis.

The preliminary statistics reveal that sentence-level semantic analysis exhibits a significantly higher standard deviation, primarily due to the smaller sample size. Additionally, the range of sentiment scores is much broader at the sentence level; for instance, the maximum negative sentiment in Risk Factors (Table 4) reaches 78.5, whereas the corresponding word-level analysis maxes out at just 2.6. Interestingly, while word-level analysis consistently categorizes all three sections as having an overall negative sentiment, sentence-level analysis identifies section 7 (MD&A) as having a net positive sentiment. Moreover, the visual comparison below shows that the MD&A section exhibits a much more pronounced difference between word-based and sentence-based semantic scoring than the Risk Factors section does.

Figure 3 presents a comparison between word-based and sentence-based semantic methods for sentiment analysis. Panels A and C display results using the word-based approach, while Panels B and D use the sentence-based approach. Additionally, Panels A and B focus on the negative sentiment scores from the MD&A sections, whereas Panels C and D represent the negative sentiment scores from the Risk Factor Reports.

As seen above, while the two techniques produce similar results in the Risk Factors section, they diverge significantly in the MD&A sections as well as the Disclosure sections. Although both methods capture a similar trend from 2015 to 2017, the upward trend observed between 2008 and 2012 appears exclusively in the word-based semantic analysis. Analyzing these section-based discrepancies is important, as they help identify the root causes of performance differences between models and highlight where sentence- or word-level approaches may be more effective.

Additionally, the plots below display the Risk Factors' Negative Sentiment Score and returns over time for the 33 companies with complete data. Notably, in 2010, when returns experienced a significant decline, the negative sentiment score was elevated, suggesting that risk factor sentiments were more negative on average across these companies. Similarly, in 2014, an increase in company returns coincided with a decline in negative sentiment within risk reports. However, the negative sentiment score exhibits a much smoother trend compared to the more volatile fluctuations in returns. This visualization was included to highlight the relationship between sentiment trends and market performance over time, providing insight into how sentiment scores may correspond to real-world financial outcomes.

Next, shifting the focus to the performance metrics comparing both techniques. Initially, I considered using the System GMM; however, given the limited dataset—33 companies from 2007 to 2019, totaling approximately 400 observations—the Hansen J statistic was unusually low, suggesting potential misinterpretation of the data. As a result, I opted for individual ARMA models to establish a baseline performance for both metrics.

ARMA(2,0,1) Results

First, I implemented an ARMA(2,0,1) model with the following ACFs:

Both ACFs resemble random noise, with a slight autocorrelation observed at lag 11. The resulting coefficients are shown below:

These preliminary results indicate that word-based semantic scoring performed slightly better based on log-likelihood and sum of squared residuals (SSR). However, depending on the choice of AR and MA parameters, sentence-based scoring outperformed word-based scoring in some cases, suggesting the two scoring methods are comparable in predictive capability for this model. Notably, neither approach demonstrated statistically significant sentiment coefficients, despite the indications of meaningful predictive potential discussed below. A critical limitation of prior studies was the absence of a proper training/testing split, as they focused on coefficient significance alone. To address this, the analyses below incorporate comprehensive training/testing splits; this model serves as an initial exploration into the effectiveness of the two sentiment scoring methods in predictive modeling.

Aggregated ARMA Model Results

Preliminary results seen in Tables 1 and 2 demonstrate that the word-based semantic scoring significantly outperformed the sentence-based approach, reflected by notably lower SSR values in testing. While these SSR values are not directly comparable to those from the initial ARMA(2,0,1) model (which lacked a training/testing split), they provide important insights into model effectiveness. A comprehensive table comparing average SSR per test is presented at the conclusion of this section.

These findings highlight the effectiveness of word-level semantic scoring in predictive financial modeling and suggest potential directions for further enhancing the model's predictive capabilities. Additionally, the word-based model has large coefficients on the negative and net Disclosure sentiment scores that are absent from the sentence-based model's coefficients, indicating that the sentence-based approach fails to extract meaningful sentiment information from company disclosure sections relative to the word-based approach. Finally, the prediction outputs of the word-based and sentence-based models are shown below:

These predictions indicate that sentence-based aggregated ARMA predicts more consistent values, as is reflected in the small coefficients associated with net and negative sentiment scoring in Table 2 above. In contrast, word-based aggregated ARMA has much more variability.

LSTM Model Results

Recall that the models were trained over 50 epochs with a batch size of 16, using the Adam optimizer and mean squared error (MSE) as the loss function. The training process demonstrated stable convergence with no strong evidence of overfitting, as seen in Figures 12 and 13.

The Sum of Squared Residuals (SSR) was computed on the test sets to quantify each model's predictive performance:

  • Word-based LSTM: 0.02126
  • Sentence-based LSTM: 0.04474

The Word-Based LSTM Model demonstrated stronger overall predictive performance than the Sentence-Based LSTM Model, achieving a test SSR approximately half that of the sentence-based model. However, closer examination of the prediction outputs reveals important trade-offs between the two approaches. Figures 14 and 15 below illustrate the predicted versus actual values for both models over the test indices. The Word-Based LSTM closely follows the general trend but shows little polarity and rarely under-predicts. In contrast, the Sentence-Based LSTM produces a wider variety of predictions, although with higher variance and less consistent accuracy.

The Word-Based LSTM produces generally accurate and stable predictions but tends to generate less variation in its outputs. As a result, it captures the central trend of the target variable effectively but may fail to fully reflect extreme fluctuations in bank performance. The Sentence-Based LSTM, on the other hand, produces predictions with a wider range and greater variability. While this allows the model to reflect stronger positive and negative shifts, it also results in higher overall prediction errors, contributing to its larger SSR.

Cross-model Comparison Results

The comparison below reports the MSE for each model and shows that the word-based aggregated ARMA model achieved the lowest MSE. The original single ARMA models are not included in this comparison, as they lacked a proper training/testing split and clearly underperformed relative to the other approaches. Specifically, the training MSE for both the word- and sentence-based single ARMA models ranged from approximately 0.0011 to 0.0012, which is higher than the test MSE of even the worst-performing aggregated ARMA and LSTM models.

When considering the cross-model comparisons, it becomes clear based on Figures 14 and 15 that the LSTM model predicts more extreme ups and downs, whereas in contrast, the aggregated ARMA model provides much more consistent and stable predictions based on Figures 10 and 11. Interestingly, the aggregated ARMA model produces more extreme predictions when using word-based sentiment scores, while the LSTM model produces more extreme predictions with sentence-based scores. One possible explanation is that the aggregated ARMA approach, by averaging coefficients across banks, amplifies the broader trends captured in word-level sentiment, whereas the LSTM model, with its sequential learning capacity, is more responsive to the highly variable nature of sentence-level sentiment.

Summary of Results

This study compared word-level and sentence-level sentiment analysis in predicting bank stock returns using ARMA, aggregated ARMA, and LSTM models. Across all model types, word-based sentiment scores consistently outperformed sentence-based scores in terms of lower error rates. The aggregated ARMA model with word-level inputs achieved the best overall performance, followed by the LSTM model, with both significantly outperforming the baseline single ARMA models. Notably, LSTM models captured more extreme fluctuations, particularly with sentence-based sentiment, while aggregated ARMA models produced more consistent predictions. Much of the difference in model behavior appears driven by the divergence between word- and sentence-level sentiment scores in the MD&A and Disclosure sections, whereas Risk Factors showed more alignment across methods.

These findings have important implications for the use of sentiment analysis in financial forecasting. By demonstrating that word-level sentiment scores offer more stable and reliable predictive signals, this research suggests that simpler, less computationally intensive methods can still provide strong forecasting performance—particularly when combined with models like aggregated ARMA. Additionally, the contrasting behaviors of LSTM and aggregated ARMA models highlight how each interacts differently with word- and sentence-level sentiment data, underscoring the importance of aligning data granularity with the modeling approach in financial forecasting.

Code: CMSC257-FinalPJ.zip

(The content is based on Stanford CS224N’s Custom Final Project.)