Stock Price Prediction: Spec - mikec964/chelmbigstock GitHub Wiki
Andy Webber created a stock price prediction program. It is under the chelmbigstock/chelmbigstock directory.
The program uses ridge regression to make predictions. This document explains what is used as the feature set and the target value, and how they are extracted from the raw stock data.
Features
The features are adjusted closing prices in a range of dates. The date range is specified by the parameters below.
- reference_dates
reference_dates are the first dates of the date ranges. If a reference date is not an open date of the stock market, the first open date after it is taken as the first date of the range. When two or more reference_dates are specified, each one defines a separate date range.
- train_days
train_days is the number of open dates between the first date (from reference_dates) and the last date of a range. It counts only open dates, not weekends or holidays. The last date is exclusive, that is, it is not included in the range.
- train_increment
Every (train_increment)th open date in the range, starting from the first date, is taken as a feature. Other dates are ignored.
ex)
reference_dates = ['2015-01-01']
train_days = 9
train_increment = 3
Jan 2, 2015 is used as the first day because Jan 1, 2015 is a holiday. The next feature date is the third open day after the first day, that is, Jan 7, 2015. The one after that is Jan 12, 2015. The next candidate, Jan 15, 2015, is not taken because it is the ninth open date after the first day and train_days is exclusive. So the feature dates are
[ '2015-01-02', '2015-01-07', '2015-01-12' ]
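The date selection in this example can be sketched in Python as follows. This is only an illustration, not the program's actual code: the helper names (is_open, open_days_from, feature_dates) are hypothetical, and the market calendar is assumed to be "weekdays minus a hard-coded holiday list", which is enough for this example only.

```python
from datetime import date, timedelta

# Assumption: a tiny hard-coded holiday list that only covers this example.
HOLIDAYS = {date(2015, 1, 1)}


def is_open(day):
    """Treat a weekday that is not a listed holiday as an open date."""
    return day.weekday() < 5 and day not in HOLIDAYS


def open_days_from(start, count):
    """Return the first `count` open dates on or after `start`."""
    days = []
    day = start
    while len(days) < count:
        if is_open(day):
            days.append(day)
        day += timedelta(days=1)
    return days


def feature_dates(reference_date, train_days, train_increment):
    """Take every train_increment-th open date; train_days is exclusive."""
    window = open_days_from(reference_date, train_days)
    return [d.isoformat() for d in window[::train_increment]]


print(feature_dates(date(2015, 1, 1), train_days=9, train_increment=3))
# ['2015-01-02', '2015-01-07', '2015-01-12']
```

Because the window holds exactly train_days open dates starting at the first open date, the date that is train_days open dates after the first one is never included.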
Target
The target value is the stock price on a certain date in the future. The future date is specified by the parameter below.
- future_day
The number of open dates between the first date (from reference_dates) and the future date.
ex)
reference_dates = ['2015-01-01']
future_day = 10
Jan 2, 2015 is used as the first day because Jan 1, 2015 is a holiday. The target date is the 10th open date from the first day, that is, Jan 16, 2015.
2015-01-16
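The target date lookup can be sketched the same way. Again, this is a hedged illustration: target_date and is_open are hypothetical names, and the calendar is assumed to be weekdays minus a hard-coded holiday list that only covers this example.

```python
from datetime import date, timedelta

# Assumption: a tiny hard-coded holiday list that only covers this example.
HOLIDAYS = {date(2015, 1, 1)}


def is_open(day):
    """Treat a weekday that is not a listed holiday as an open date."""
    return day.weekday() < 5 and day not in HOLIDAYS


def target_date(reference_date, future_day):
    """Return the date that is future_day open dates after the first open date."""
    day = reference_date
    while not is_open(day):          # move to the first open date
        day += timedelta(days=1)
    remaining = future_day
    while remaining > 0:             # then step forward one open date at a time
        day += timedelta(days=1)
        if is_open(day):
            remaining -= 1
    return day.isoformat()


print(target_date(date(2015, 1, 1), future_day=10))
# 2015-01-16
```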
Train Set, CV Set
The stock symbols the program uses are stored in the data/stocks_read.txt file. Some stocks are used as train data, and others are used as CV data.
- max_stocks
This parameter specifies how many stocks are used. It is the total of the train and CV stocks.
- cv_factor
This parameter determines what portion of the stocks goes into the cross-validation (CV) set and what portion stays in the training set. For example, if it is 3, every third stock goes into the CV set.
ex)
max_stocks = 40
cv_factor = 4
The first 40 stocks are taken from stocks_read.txt. Every fourth stock goes into the CV set, so 30 stocks are train data and 10 are CV data.
From stocks in the train set, stock prices in the date range explained above are extracted and used as train data. From stocks in the CV set, stock prices in the same date range are used as CV data.
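The train/CV split described in this section might look like the sketch below. The function name split_stocks and the symbol names are made up for illustration. Which position within each group of cv_factor stocks goes to the CV set is not stated here, so the sketch assumes it is the last one (the 4th, 8th, 12th, and so on).

```python
def split_stocks(symbols, max_stocks, cv_factor):
    """Split the first max_stocks symbols into train and CV lists."""
    selected = symbols[:max_stocks]
    train, cv = [], []
    for position, symbol in enumerate(selected, start=1):
        # Assumption: every cv_factor-th stock (by 1-based position) goes to CV.
        if position % cv_factor == 0:
            cv.append(symbol)
        else:
            train.append(symbol)
    return train, cv


# Placeholder symbols standing in for the contents of stocks_read.txt.
symbols = [f"SYM{i:02d}" for i in range(1, 51)]
train, cv = split_stocks(symbols, max_stocks=40, cv_factor=4)
print(len(train), len(cv))
# 30 10
```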
Test Set
The stock data used as the train/CV set is also used as the test set. The test data is extracted by using the parameter below.
- test_dates
This is similar to reference_dates. test_dates specify the first dates of the test date ranges. If a test date is not an open date of the stock market, the first open date after it is taken as the first date of the range. When two or more test_dates are specified, each one defines a separate date range.
The train_days and train_increment parameters that define the train date range are also used for the test date range.
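As a rough sketch of how the same parameters could be applied to several test_dates, the example below extracts one row of feature prices per test date from a single stock's price list. Everything here is an assumption for illustration: the function name feature_rows, the made-up prices, the use of the stock's own trading dates as the open-date calendar, and the choice to skip a range when there is not enough data.

```python
from bisect import bisect_left


def feature_rows(prices, start_dates, train_days, train_increment):
    """Build one feature row per start date.

    prices: list of (iso_date, adjusted_close), sorted by date, open dates only.
    """
    dates = [d for d, _ in prices]
    rows = []
    for start in start_dates:
        first = bisect_left(dates, start)   # first open date on or after start
        window = prices[first:first + train_days]
        if len(window) < train_days:
            continue                        # assumption: skip incomplete ranges
        rows.append([close for _, close in window[::train_increment]])
    return rows


# Made-up adjusted closing prices for one hypothetical stock.
prices = [
    ("2015-01-02", 10.0), ("2015-01-05", 10.5), ("2015-01-06", 10.2),
    ("2015-01-07", 10.8), ("2015-01-08", 11.0), ("2015-01-09", 10.9),
    ("2015-01-12", 11.3), ("2015-01-13", 11.1), ("2015-01-14", 11.4),
    ("2015-01-15", 11.6), ("2015-01-16", 11.2),
]
test_dates = ["2015-01-01", "2015-01-05"]
rows = feature_rows(prices, test_dates, train_days=9, train_increment=3)
print(rows)
# [[10.0, 10.8, 11.3], [10.5, 11.0, 11.1]]
```

ISO-formatted date strings sort in chronological order, so a plain binary search is enough to find the first open date on or after each test date.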