Analyzing Sentiment to Predict Stock Prices - minalee-research/cs257-students GitHub Wiki

Riyush Thakur and Jerry Liu

Abstract

In traditional stock analysis, an individual must carefully analyze news articles, company reports, and other text data to predict the future price of a stock. Consequential text data such as earnings reports and news events compel the market to immediately reprice a stock based on the nature of the news. If a language model is capable of understanding these text sources, then an algorithm could efficiently analyze newly revealed text data and swiftly reprice a stock. In this project, we attempt to use vector representations of news articles to predict the 3-day performance of a stock more efficiently than an analyst. Using a dataset of historical news articles and stock prices, we train a model to predict the 3-day price movement of a stock on the day a news article is released.

What This Project is About

The goal of this project is to create a stock trading algorithm capable of predicting a stock's price primarily using public news data. Prior attempts at using news articles to predict stock price fall short in either the volume of news data or the predictive power of the text data. For example, a sentiment score for a news article can be highly correlated with classical variables like earnings per share, which nullifies the additional predictive power of the sentiment score. To address these issues, we use the most comprehensive news dataset currently available: FNSPID, released in 2024. It includes both news data and stock prices. All data ranges from 2009 to 2020.

Specifically, we use word vector representations of articles in the above dataset as an input to predict the price movement of a stock 3 days after the article is released. We evaluate the predictions by measuring their accuracy in predicting future price movement compared to a classical ARIMA time series approach to price prediction. In addition, we test a variety of approaches to creating word vector representations of the articles, such as Word2Vec, BERT, FinBERT, and SBERT. In each case, we use the word vector representations as features for a logistic regression model and a LightGBM decision tree model that predict the stock's price movement. By training various models on different word vector representations, we identify the most effective word vectors and model architecture for the task, while also comparing performance across different stocks.

Description of Our Work

Approach

We have preprocessed the FNSPID news dataset by merging it with a historical stock price dataset. Each row of the dataset is formatted as follows:

Our approach is to append 4 new columns called "Word2Vec", "Bert", "finBert", and "sBert" where each column is a word vector representation of the article title. We generate representation columns by training Word2Vec word vectors on a corpus of all news article titles and incorporating pre-trained BERT family models. For Word2Vec, the sentence representation for each article title is computed as the average word vector of each word in the article title. For example, if a title reads: "Apple releases stunning iPhone Sales ahead of Christmas", we gather word vectors for each word in the title and take an average of the 8 word vectors. This average goes into the corresponding representation column. For the BERT family models, we use the CLS token embedding to represent the sentence in BERT and FinBERT, whereas SBERT (Sentence-BERT) directly provides a single vector representation of the sentence.
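The averaging step for Word2Vec can be sketched as follows. This is a minimal illustration, not our actual pipeline: the function name `average_title_vector`, the lowercase whitespace tokenization, the skipping of out-of-vocabulary words, and the toy 2-dimensional vectors are all assumptions made for the example.

```python
from typing import Dict, List, Sequence


def average_title_vector(
    title: str, word_vectors: Dict[str, Sequence[float]]
) -> List[float]:
    """Average the vectors of the words in a title.

    Assumption: tokens are lowercased and split on whitespace, and words
    missing from the vocabulary are skipped.
    """
    tokens = title.lower().split()
    found = [word_vectors[t] for t in tokens if t in word_vectors]
    if not found:
        raise ValueError("no token of the title is in the vocabulary")
    dim = len(found[0])
    return [sum(vec[i] for vec in found) / len(found) for i in range(dim)]


# Toy 2-dimensional vectors (hypothetical values, not trained embeddings).
toy_vectors = {
    "apple": [1.0, 0.0],
    "releases": [0.0, 1.0],
    "stunning": [1.0, 1.0],
    "sales": [2.0, 2.0],
}
title_vector = average_title_vector("Apple releases stunning sales", toy_vectors)
```

With real Word2Vec vectors the same function would return one fixed-length vector per title, which is what goes into the "Word2Vec" column.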

Additionally, we create helper columns such as 1d-open and 3d-close, which represent the opening and closing prices one and three days later, respectively. These columns enable the calculation of 3 Day Price Change and 3 Day Price Direction. Specifically, we predict the difference between the closing price on the third day and the opening price on the next day. This assumption aligns with the idea that when news is released today, we can trade on it at the market open tomorrow, ensuring that our target variable is logically sound. The sign of this numerical difference (3 Day Price Direction) is the variable we aim to predict using different word vector representations.
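The target construction can be sketched in a few lines. The function name, the list-based inputs indexed by trading day, and the choice to label a zero change as -1 are assumptions for illustration; only the formula (3d-close minus 1d-open) comes from the description above.

```python
from typing import List, Optional, Tuple


def three_day_target(
    opens: List[float], closes: List[float], t: int
) -> Optional[Tuple[float, int]]:
    """Return (3 Day Price Change, 3 Day Price Direction) for news on trading day t.

    change    = close on day t+3 minus open on day t+1
    direction = 1 if the change is positive, otherwise -1 (assumption for ties)
    Returns None when the series is too short to look 3 days ahead.
    """
    if t + 3 >= len(closes) or t + 1 >= len(opens):
        return None
    change = closes[t + 3] - opens[t + 1]
    direction = 1 if change > 0 else -1
    return change, direction


# Hypothetical prices for five consecutive trading days.
opens = [100.0, 101.0, 102.0, 103.0, 104.0]
closes = [100.5, 101.5, 102.5, 104.0, 101.0]

up_example = three_day_target(opens, closes, 0)    # 104.0 - 101.0
down_example = three_day_target(opens, closes, 1)  # 101.0 - 102.0
```

Rows for which `three_day_target` returns `None` have no label and would be dropped before training.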

We create a benchmark by training an ARIMA model to predict the 3-day closing price minus the 1-day opening price for each row, using parameters (1,0,1) and a look-back window of 20 trading days. ARIMA is a well-established standard for quantitative price and trend prediction. By showing an improvement on the 3-day prediction accuracy of the ARIMA model, we validate our approach of using the news article sentiment to predict 3-day price movement.
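For reference, the order (1,0,1) denotes one autoregressive term, no differencing, and one moving-average term. In standard textbook notation (the symbols below are not taken from our code), such a model for a series $y_t$ reads:

$$y_t = c + \phi_1 y_{t-1} + \theta_1 \varepsilon_{t-1} + \varepsilon_t$$

where $c$ is a constant, $\phi_1$ and $\theta_1$ are the fitted autoregressive and moving-average coefficients, and $\varepsilon_t$ is a white-noise error term. With a 20-day look-back window, the coefficients are estimated from the 20 most recent trading days only.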

We have trained a logistic model, a LightGBM decision tree model, and an ARIMA baseline model to predict the 3-day price movement direction of the stock. Accuracy, F1 Score, Precision, Recall, and ROC AUC are used to evaluate predictions with each combination of word vector and model architecture. We also conducted hyperparameter tuning using grid search over training parameters such as learning rate, number of estimators, and number of leaves in the LightGBM approach. The prediction quality metrics across all approaches and configurations allow us to select the best approach to using news sentiment for stock price prediction. The best approach is compared to the ARIMA baseline model for an initial round of validation.

As an additional step, since our non-baseline models (logistic, LightGBM) rely solely on news data, we enhance it by incorporating historical stock price data. This approach is inspired by our reference paper, which employs an LSTM to process both news sentiment and past stock prices. We experiment with this method and introduce an alternative approach where we directly integrate ARIMA predictions into our model. Specifically, we augment the 768-dimensional BERT vector by adding a one-dimensional ARIMA prediction, resulting in a 769-dimensional input vector. This approach indirectly incorporates price information into our model while also enabling a direct comparison between models using news + price data versus those using only price data.
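The augmentation amounts to appending one scalar to the embedding. A minimal sketch, assuming a plain Python list stands in for the 768-dimensional BERT vector and using a made-up forecast value; the helper name `augment_with_arima` is hypothetical:

```python
from typing import List, Sequence


def augment_with_arima(
    embedding: Sequence[float], arima_prediction: float
) -> List[float]:
    """Append the scalar ARIMA forecast as one extra feature."""
    return list(embedding) + [float(arima_prediction)]


embedding = [0.0] * 768                         # placeholder for a BERT vector
features = augment_with_arima(embedding, 1.25)  # 1.25 is a hypothetical forecast
```

Because the embedding components and the price forecast can live on very different scales, standardizing the features before fitting a logistic model may matter; the sketch does not do this.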

Experiments and Results

To begin, we plotted some data distributions.

image

image

We observe that the ups and downs are fairly balanced but not perfectly so, indicating some degree of class imbalance. However, given the nature of stock prices, downsampling the larger class is not a practical approach in real-world scenarios. Therefore, we maintain the dataset as is.

Benchmarking with ARIMA Model

To validate the predictive value of our word vector approach, we establish a benchmark using an ARIMA (AutoRegressive Integrated Moving Average) model. The ARIMA model is trained to predict the 3-day closing stock price minus the 1-day opening price, using historical price data. Since ARIMA is a well-established method for time-series forecasting, it serves as a standard against which we can compare our approach that incorporates news article sentiment.

Model Setup: We use a simple ARIMA(1,0,1) model with 20-day lookback window to predict the 3-day price movement for each day or row in the dataset.

Performance Metric: ARIMA’s predictive accuracy is measured by the number of times it correctly predicts the direction of price movement with -1 mapping to down and 1 mapping to up. As previously mentioned, other binary classification metrics like F1 and AUC are evaluated as well.

Comparison Results: We find that the ARIMA model correctly predicts the price movement on roughly 59% of days when a news article is released (accuracy 0.590). Note that ARIMA does not take in any news data: we are only testing its performance on the subset of dates that follow a news release.
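The threshold-based metrics for the -1/1 labels can be computed as in the sketch below. It is an illustration under the assumption that 1 (up) is the positive class; the function name is hypothetical, and ROC AUC is left out because it needs predicted scores rather than hard labels.

```python
from typing import Dict, Sequence


def direction_metrics(
    y_true: Sequence[int], y_pred: Sequence[int]
) -> Dict[str, float]:
    """Accuracy, precision, recall and F1 for labels in {-1, 1}, with 1 as positive."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == -1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == -1)
    tn = sum(1 for t, p in pairs if t == -1 and p == -1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Hypothetical labels: 2 true positives, 1 false positive, 1 false negative, 1 true negative.
metrics = direction_metrics([1, 1, -1, -1, 1], [1, -1, -1, 1, 1])
```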

Next, we train Word2Vec embeddings on the corpus of article titles and compute embeddings from the pre-trained BERT family models. This step allows us to transform unstructured text data into numerical inputs for our Logistic and LightGBM models. We employ these two machine learning models to predict the 3-day price change and 3-day price direction based on article title representations. For each model architecture, we take an 80-20 time-wise train-test split of the processed data table.
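A time-wise split keeps the earliest 80% of rows for training and the latest 20% for testing, so the model is never evaluated on dates that precede its training data. A minimal sketch, assuming each row is a dict with an ISO-formatted `date` field (the field name and the helper are hypothetical):

```python
from typing import Any, Dict, List, Tuple

Row = Dict[str, Any]


def time_wise_split(
    rows: List[Row], date_key: str = "date", train_fraction: float = 0.8
) -> Tuple[List[Row], List[Row]]:
    """Sort rows chronologically and cut once; no shuffling is involved."""
    ordered = sorted(rows, key=lambda r: r[date_key])
    cut = int(len(ordered) * train_fraction)
    return ordered[:cut], ordered[cut:]


# Ten hypothetical rows in scrambled order; ISO dates sort correctly as strings.
rows = [
    {"date": f"2020-01-{d:02d}", "title": f"headline {d}"}
    for d in (5, 1, 4, 2, 3, 10, 9, 8, 7, 6)
]
train, test = time_wise_split(rows)
```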

Logistic Regression

We fit separate logistic regression models using Word2Vec, Bert, finBert, and sBert representations as features. The trained models make predictions on the test data, and evaluation metrics are computed. The training procedure finds the best estimates of w and b for the equation:

$$P(y = 1 \mid X) = \frac{1}{1 + e^{-(wX + b)}}$$

If P(Y = 1) > 0.5, the model predicts an upward price movement. If P(Y = 1) < 0.5, the model predicts a downward price movement.
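The decision rule can be written out directly from the equation. The weights and inputs below are made-up numbers for illustration, and treating a probability of exactly 0.5 as a downward prediction is an assumption, since the rule above leaves that case open.

```python
import math
from typing import Sequence


def predict_up_probability(
    x: Sequence[float], w: Sequence[float], b: float
) -> float:
    """P(y = 1 | x) = 1 / (1 + exp(-(w.x + b)))."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))


def predict_direction(
    x: Sequence[float], w: Sequence[float], b: float, threshold: float = 0.5
) -> int:
    """Return 1 (up) if the probability exceeds the threshold, else -1 (down)."""
    return 1 if predict_up_probability(x, w, b) > threshold else -1


w, b = [0.5, -0.25], 0.0                             # hypothetical fitted parameters
p_up = predict_up_probability([2.0, 0.0], w, b)      # w.x + b = 1.0
direction_up = predict_direction([2.0, 0.0], w, b)
direction_down = predict_direction([0.0, 4.0], w, b)  # w.x + b = -1.0
```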

Then, we come up with predictions of 3-day price direction and evaluate the prediction accuracy on the test split. As the table below shows, we report accuracy rates across different models as follows: 0.517 using BERT vectors, 0.498 using FinBERT vectors, 0.534 using SBERT vectors, 0.547 using Word2Vec vectors, and 0.590 using ARIMA predictions.

image

From the confusion matrix, we observed some bias in the predictions, with a tendency to predict more 1’s, likely due to class imbalance. However, traditional down-sampling or up-sampling techniques are not viable solutions in this context, as they do not reflect real-world financial market conditions. Therefore, this is the best we could get with this method.

Now, we add the ARIMA prediction to the input vector representations using Hstack([vector, ARIMA_prediction]), which becomes our new input to the model. With this, we get the overall performance shown below:

image

The model performance is now better and closer to the ARIMA result, though it still does not surpass the baseline when we look at all stocks together. Therefore, we turn to the next model, LightGBM, to check whether it yields a better outcome.

LightGBM Classifier

LightGBM is a gradient boosting framework optimized for speed and efficiency. We again initialize a separate LightGBM classifier for each of the embedding types. Model hyperparameters such as learning rate, number of estimators, and number of leaves are set using grid search with 5-fold cross-validation. We arrived at the following hyperparameters: learning rate 0.1, number of estimators 200, and number of leaves 50.
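To give a sense of what the grid search enumerates, the sketch below lists every combination in a small grid and counts the model fits that 5-fold cross-validation implies. The candidate values are hypothetical (only the selected values 0.1, 200, and 50 are reported above); the keys follow the parameter names of LightGBM's scikit-learn interface.

```python
from itertools import product
from typing import Any, Dict, Iterator, List

# Hypothetical candidate grid; only the chosen values are reported in the text.
param_grid: Dict[str, List[Any]] = {
    "learning_rate": [0.01, 0.1],
    "n_estimators": [100, 200],
    "num_leaves": [31, 50],
}


def iter_grid(grid: Dict[str, List[Any]]) -> Iterator[Dict[str, Any]]:
    """Yield one parameter dict per point of the Cartesian product."""
    keys = list(grid)
    for values in product(*(grid[k] for k in keys)):
        yield dict(zip(keys, values))


candidates = list(iter_grid(param_grid))
n_folds = 5
n_fits = len(candidates) * n_folds  # one fit per candidate per fold

reported = {"learning_rate": 0.1, "n_estimators": 200, "num_leaves": 50}
```

Each candidate is scored by its average validation metric over the folds, and the best-scoring one is kept. The cost grows multiplicatively with every added parameter value, which is why grids are usually kept small.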

Similar to the Logistic model, we trained on an 80-20 time-wise train-test split, ensuring that we evaluate generalization performance on unseen data. Then, we come up with predictions of 3-day price direction and evaluate the prediction accuracy on the test split. The results are shown below:

image

We still see some bias due to class imbalance. However, consider the performance after adding ARIMA to our input:

image

We see a great reduction in class bias, and our Bert model (accuracy: 0.5913) slightly outperforms the ARIMA baseline (accuracy: 0.5904). This comes from a greatly improved true positive rate at the cost of a slightly worse false positive rate.

Stock-Level Results

While our overall results across all stocks do not show a significant improvement over the baseline ARIMA model, the stock-level results are more promising.

Without adding ARIMA to our model, we see that sometimes we get better performances with some of the embeddings, for example, word2vec for Netflix (NFLX):

and Bert for Nvidia (NVDA)

The method of news+ARIMA actually works very well for some stocks too. For example, for the stock of Oracle (ORCL), the following results are obtained:

Here, we see that finBert (accuracy: 0.678) significantly outperformed ARIMA (0.607), along with other embeddings like word2vec and sBert.

It demonstrates that incorporating news vectors improves model performance compared to using ARIMA alone, indicating that our model effectively utilizes news information for better predictions.

Evaluation of LSTM (Preliminary)

As a cherry on top, we include some results that are not the main focus of this study, but are interesting to see. Below, we show the results from training an LSTM model on news sentiment using the BERT sentiment model, a pretrained model that directly classifies sentiment into three categories: positive, negative, and neutral. The LSTM model is structured in a way that takes in each day's sentiment vector (NA becomes [0.33, 0.33, 0.33]) alongside price data, with a window size of 10 days. The criterion is MSE loss and optimizer is Adam. The training is run for 70 epochs.
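The input layout described above can be sketched as follows. The per-day feature order (price first, then the three sentiment probabilities), the use of `None` for days without news, and the helper name are assumptions; targets are omitted for brevity.

```python
from typing import List, Optional, Sequence

NEUTRAL = [0.33, 0.33, 0.33]  # stand-in for days without news, as in the text


def build_windows(
    prices: Sequence[float],
    sentiments: Sequence[Optional[Sequence[float]]],
    window: int = 10,
) -> List[List[List[float]]]:
    """Return sliding windows shaped (window, 4): [price, s1, s2, s3] per day."""
    steps = [
        [float(price)] + list(s if s is not None else NEUTRAL)
        for price, s in zip(prices, sentiments)
    ]
    return [steps[i:i + window] for i in range(len(steps) - window + 1)]


# Twelve hypothetical trading days; only day 0 has a news sentiment vector.
prices = [100.0 + d for d in range(12)]
sentiments = [[0.7, 0.1, 0.2]] + [None] * 11
windows = build_windows(prices, sentiments)
```

Under this layout each time step has 4 features, so the `input_size` of the model would be 4 and each batch would have shape (batch, 10, 4).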

Snapshot of model architecture:

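import torch.nn as nn

class LSTMModel(nn.Module):
    def __init__(self, input_size, hidden_size=64, num_layers=2, output_size=1):
        super().__init__()
        # Stacked LSTM over inputs shaped (batch, window, input_size).
        self.lstm = nn.LSTM(input_size, hidden_size, num_layers, batch_first=True)
        # Linear head mapping the final hidden state to the prediction.
        self.fc = nn.Linear(hidden_size, output_size)

    def forward(self, x):
        lstm_out, _ = self.lstm(x)
        # Use only the output at the last time step of the window.
        return self.fc(lstm_out[:, -1, :])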

For evaluation, we used Mean Absolute Percentage Error (MAPE), the same metric employed in our reference paper, and compared our LSTM-based approach against ARIMA. The results are summarized in the table below:

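| Stock | News Count | MAPE (LSTM vs True) | MAPE (ARIMA vs True) |
|-------|------------|---------------------|----------------------|
| INTC  | 1844       | 112.1116%           | 223.7873%            |
| NFLX  | 2245       | 130.5877%           | 455.1711%            |
| NVDA  | 2483       | 116.6089%           | 787.8857%            |
| ORCL  | 2180       | 100.8102%           | 143.2080%            |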

Here's an example confusion matrix for NVDA. (Others show similar results.)

image

Our model achieved significantly lower MAPE scores across all tested stocks compared to ARIMA, even though it showed no clear advantage in the binary classification task.

Upon further investigation, we observed that the LSTM model tends to predict very small 3-day price changes, close to zero, effectively indicating no change from the current value. As a result, its MAPE remains around 100%: when the prediction is near zero, the absolute error is roughly equal to the true value, so each percentage error is close to 100%.
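This effect is easy to reproduce with made-up numbers. In the sketch below, the price changes are hypothetical; a forecast of exactly zero yields a MAPE of 100%, while a forecast with the correct sign but three times the magnitude yields 200%.

```python
from typing import Sequence


def mape(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Mean absolute percentage error in percent; assumes no true value is zero."""
    total = sum(abs((t - p) / t) for t, p in zip(y_true, y_pred))
    return 100.0 * total / len(y_true)


true_changes = [2.0, -1.5, 0.5, -4.0]   # hypothetical 3-day price changes
always_zero = [0.0, 0.0, 0.0, 0.0]      # "no change" forecast
overshoot = [6.0, -4.5, 1.5, -12.0]     # correct direction, 3x the magnitude

mape_zero = mape(true_changes, always_zero)
mape_overshoot = mape(true_changes, overshoot)
```

The zero forecast gets every direction wrong or undefined yet scores better on MAPE than a forecast that gets every direction right, which is why a low MAPE alone says little about usefulness for the direction task.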

On the other hand, ARIMA, despite often predicting the price direction correctly, exhibits higher variance, leading to MAPE scores exceeding 100% and sometimes reaching several hundred percent. This discrepancy does not necessarily imply that the LSTM is a better model, but rather suggests that it is not learning properly.

To improve performance, further fine-tuning and architectural adjustments are required to help the LSTM model truly learn from the data rather than defaulting to near-zero predictions. In addition, we think MAPE is not a good measure for this task and that more standard ones like R^2 would be more informative.

Conclusion

We demonstrate that incorporating news insights by representing the article titles as word vectors can improve model performance compared to using ARIMA alone. Across all stocks the gain is marginal (LightGBM with Bert vectors plus ARIMA: 0.5913 versus 0.5904 for ARIMA), while for individual stocks such as ORCL it is larger (finBert plus ARIMA: 0.678 versus 0.607). This suggests that articles capture new information that aids in price prediction. Additionally, our heuristic approach to capturing sentiment turns the qualitative insights from the articles into quantities for training our logistic and LightGBM models. We also show that configuration choices such as the pretrained corpus, the model, and the volume of training data can significantly impact the final testing accuracy. Since finBert is fine-tuned on financial text, its word vectors can better capture the sentiment of the text for the model to understand. With further refining of model parameters to the specific task and dataset, we believe better results can be achieved. It is important to note that there is a risk of overfitting on our specific dataset of financial news for technology stocks. Over-optimizing model parameters may jeopardise the model's ability to generalise to other stocks. Ultimately, our analysis shows how language models create an opportunity to improve predicted stock return by capturing information that purely quantitative metrics fail to incorporate.

Further Work

Our analysis raises further questions as to the usefulness of an LSTM model to assign each news article a particular sentiment score. In principle, we believe an LSTM would better capture sentiment by taking into account news articles further in the past. Our preliminary results unfortunately do not add predictive value mostly because the LSTM seems to "reward hack" the MAPE error evaluation metric by consistently predicting minimal change in stock price. This result shows the importance of choosing training parameters that force the model to learn meaningful relationships from training data rather than simply gaming the reward mechanism. We also believe there is potential to improve model results by training a deep neural network for stock prediction. Ultimately, we've shown a proof of concept that sentiment data can complement quantitative price data to deliver better stock prediction results. Further research questions revolve around how to best model news data for incorporation in the prediction task.

Here's a repo of our code: https://github.com/JerryChinaL/nlp-final-code/tree/main