
CNN for Short-Term Stocks Prediction using Tensorflow

Summary

In machine learning, a convolutional neural network (CNN, or ConvNet) is a class of neural networks that has been applied successfully to image recognition and analysis. In this project I approached this class of models and tried to apply it to stock market prediction, combining stock prices with sentiment analysis. The network was implemented with TensorFlow, starting from the official online tutorial. In this article, I will describe the following steps: dataset creation, CNN training and evaluation of the model.

Dataset

This section briefly describes the procedure used to build the dataset, the data sources and the sentiment analysis performed.

Ticks

In order to build a dataset, I first chose a sector and a time period to focus on. I decided to pick the Healthcare sector and the time range between 4th January 2016 and 30th September 2017, to be later split into a training set and an evaluation set. The list of ticks was downloaded from nasdaq.com, keeping only companies with Mega, Large or Mid capitalization. Starting from this list of ticks, stock and news data were retrieved using Google Finance and the Intrinio API respectively.

Stocks Data

As mentioned above, stock data was retrieved from the Google Finance historical API ("https://finance.google.com/finance/historical?q={tick}&startdate={startdate}&output=csv", for each tick in the list).
The time unit is the day and the value I kept is the Close price. For training purposes, missing days have been filled using linear interpolation (pandas.DataFrame.interpolate).
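The interpolation step can be sketched as follows. This is a minimal example with hypothetical prices (the Google Finance endpoint above has since been retired, so assume the CSV has already been loaded into a DataFrame indexed by date):

```python
import pandas as pd

# Hypothetical Close series for one tick, with two missing days in between
prices = pd.DataFrame(
    {"Close": [10.0, None, None, 13.0]},
    index=pd.date_range("2016-01-04", periods=4, freq="D"),
)

# Fill the gaps linearly, as done for the training data
prices["Close"] = prices["Close"].interpolate(method="linear")
print(prices["Close"].tolist())  # [10.0, 11.0, 12.0, 13.0]
```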

News Data and Sentiment Analysis

For each tick, I downloaded the related news from "https://api.intrinio.com/news.csv?ticker={tick}". Data are in CSV format with the following columns:
TICKER,FIGI_TICKER,FIGI,TITLE,PUBLICATION_DATE,URL,SUMMARY. Here is an example:

"AAAP,AAAP:UW,BBG007K5CV53,"3 Stocks to Watch on Thursday: Advanced Accelerator Application SA(ADR) (AAAP), Jabil Inc (JBL) and Medtronic Plc. (MDT)",2017-09-28 15:45:56 +0000,http://articlefeeds.nasdaq.com/~r/nasdaq/symbols/~3/ywZ6I5j5mIE/3-s... Market News Stock Advice amp Trading Tips Most major U S indices rose Wednesday with financial stocks leading the way popping 1 3 The 160 S amp P 500 Index gained 0 4 the 160 Dow Jones Industrial Average surged 0 3 and the 160".

News items have been de-duplicated based on the title. Finally, only the TICKER, PUBLICATION_DATE and SUMMARY columns were kept.
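In pandas this de-duplication and column selection can be done in two lines. The frame below is a hypothetical miniature of the downloaded news data:

```python
import pandas as pd

# Hypothetical news frame: the same story appears twice with the same title
df_news = pd.DataFrame({
    "TICKER": ["AAAP", "AAAP", "ALXN"],
    "TITLE": ["3 Stocks to Watch", "3 Stocks to Watch", "Alexion update"],
    "PUBLICATION_DATE": ["2017-09-28", "2017-09-28", "2017-09-29"],
    "SUMMARY": ["Market news ...", "Market news ...", "Company news ..."],
})

# Drop duplicate titles, then keep only the columns of interest
df_news = df_news.drop_duplicates(subset="TITLE")
df_news = df_news[["TICKER", "PUBLICATION_DATE", "SUMMARY"]]
print(len(df_news))  # 2
```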

Sentiment Analysis was performed on the SUMMARY column using Loughran and McDonald Financial Sentiment Dictionary for financial sentiment analysis, implemented in the pysentiment python library.

This library offers both a tokenizer, which also performs stemming and stop-word removal, and a method to score a tokenized text. The value chosen from the get_score method as a proxy of the sentiment is the Polarity, computed as:

(#Positives - #Negatives)/(#Positives + #Negatives)

import pysentiment as ps

lm = ps.LM()
df_news['SUMMARY_SCORES'] = df_news.SUMMARY.map(lambda x: lm.get_score(lm.tokenize(str(x))))
df_news['POLARITY'] = df_news['SUMMARY_SCORES'].map(lambda x: x['Polarity'])

Days with no news are assigned a Polarity of 0.
Finally, data was grouped by tick and date, summing up the Polarity score for days in which a tick has more than one news item.

Full Dataset

By merging stocks and news data, we get a dataset as follows, with all the days from 2016-01-04 to 2017-09-30 for 154 ticks, with the close value of the stock and the respective polarity value:

Date Tick Close Polarity
2017-09-26 ALXN 139.700000 2.333332
2017-09-27 ALXN 139.450000 3.599997
2017-09-28 ALXN 138.340000 1.000000
2017-09-29 ALXN 140.290000 -0.999999
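The merge itself can be sketched as a left join of the price frame onto the news frame, so every trading day keeps its Close value and days without news fall back to Polarity 0 (column names here mirror the table above but are otherwise hypothetical):

```python
import pandas as pd

# Hypothetical frames: one row per (Date, Tick) for prices, sparse rows for news
stocks = pd.DataFrame({
    "Date": ["2017-09-26", "2017-09-27"],
    "Tick": ["ALXN", "ALXN"],
    "Close": [139.70, 139.45],
})
news = pd.DataFrame({
    "Date": ["2017-09-26"], "Tick": ["ALXN"], "Polarity": [2.333332],
})

# Left join keeps every trading day; days without news get Polarity 0
full = stocks.merge(news, on=["Date", "Tick"], how="left")
full["Polarity"] = full["Polarity"].fillna(0)
```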

CNN with TensorFlow

In order to get started with Convolutional Neural Networks in TensorFlow, I used the official tutorial as a reference. It shows how to use layers to build a convolutional neural network model that recognizes the handwritten digits in the MNIST data set. To make this work for our purpose, we need to adapt the input data and the network.

Data Model

The input data has been modelled such that a single features element is a 154x100x2 tensor:

  • 154 ticks;
  • 100 consecutive days;
  • 2 channels, one for the stock price and one for the polarity value.

Labels instead are modelled as a vector of length 154, where each element is 1 if the corresponding stock rose on the next day, and 0 otherwise.

In this way, there is a sliding time window of 100 days, so the first 100 days can't be used as labels. The training set contains 435 entries, while the evaluation set contains 100.
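The sliding-window construction described above can be sketched as follows. The arrays `close` and `polarity` are hypothetical (days, ticks) matrices, i.e. the merged dataset pivoted to one column per tick:

```python
import numpy as np

def make_windows(close, polarity, window=100):
    """Build (ticks, window, 2) feature tensors and next-day rise labels.

    close, polarity: (days, ticks) arrays -- hypothetical names for the
    merged dataset pivoted to one column per tick.
    """
    features, labels = [], []
    for t in range(window, close.shape[0]):
        # Stack price and polarity channels over the previous `window` days
        win = np.stack([close[t - window:t], polarity[t - window:t]], axis=-1)
        features.append(win.transpose(1, 0, 2))            # (ticks, window, 2)
        # Label: 1 if the stock rose on day t compared to day t-1
        labels.append((close[t] > close[t - 1]).astype(np.int32))
    return np.asarray(features), np.asarray(labels)
```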

Convolutional Neural Network

The CNN has been built starting from the example in TensorFlow's tutorial and then adapted to this use case. The first two convolutional and pooling layers both have height equal to 1, so they perform convolutions and poolings on single stocks; the last convolutional layer has height equal to 154, to learn correlations between stocks. Finally, there are the dense layers, with the last one of length 154, one output per stock.


The network has been dimensioned in a way that it could be trained in a couple of hours on this dataset using a laptop. Part of the code is reported here:

def cnn_model_fn(features, labels, mode):
    """Model function for CNN."""
    # Input Layer
    input_layer = tf.reshape(tf.cast(features["x"], tf.float32), [-1, 154, 100, 2])

    # Convolutional Layer #1 (height 1: convolves over single stocks)
    conv1 = tf.layers.conv2d(
        inputs=input_layer,
        filters=32,
        kernel_size=[1, 5],
        padding="same",
        activation=tf.nn.relu)

    # Pooling Layer #1
    pool1 = tf.layers.max_pooling2d(inputs=conv1, pool_size=[1, 2], strides=[1, 2])

    # Convolutional Layer #2
    conv2 = tf.layers.conv2d(
        inputs=pool1,
        filters=8,
        kernel_size=[1, 5],
        padding="same",
        activation=tf.nn.relu)

    # Pooling Layer #2
    pool2 = tf.layers.max_pooling2d(inputs=conv2, pool_size=[1, 5], strides=[1, 5])

    # Convolutional Layer #3 (height 154: learns correlations between stocks)
    conv3 = tf.layers.conv2d(
        inputs=pool2,
        filters=2,
        kernel_size=[154, 5],
        padding="same",
        activation=tf.nn.relu)

    # Pooling Layer #3
    pool3 = tf.layers.max_pooling2d(inputs=conv3, pool_size=[1, 2], strides=[1, 2])

    # Dense Layer
    pool3_flat = tf.reshape(pool3, [-1, 154 * 5 * 2])
    dense = tf.layers.dense(inputs=pool3_flat, units=512, activation=tf.nn.relu)
    dropout = tf.layers.dropout(
        inputs=dense, rate=0.4, training=mode == tf.estimator.ModeKeys.TRAIN)

    # Logits Layer
    logits = tf.layers.dense(inputs=dropout, units=154)

    predictions = {
        # Generate predictions (for PREDICT and EVAL mode)
        "classes": tf.argmax(input=logits, axis=1),
        "probabilities": tf.nn.softmax(logits, name="softmax_tensor")
    }

    if mode == tf.estimator.ModeKeys.PREDICT:
        return tf.estimator.EstimatorSpec(mode=mode, predictions=predictions)

    # Calculate Loss (for both TRAIN and EVAL modes)
    multiclass_labels = tf.reshape(tf.cast(labels, tf.int32), [-1, 154])
    loss = tf.losses.sigmoid_cross_entropy(
        multi_class_labels=multiclass_labels, logits=logits)

    # Configure the Training Op (for TRAIN mode)
    if mode == tf.estimator.ModeKeys.TRAIN:
        optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.001)
        train_op = optimizer.minimize(
            loss=loss,
            global_step=tf.train.get_global_step())
        return tf.estimator.EstimatorSpec(mode=mode, loss=loss, train_op=train_op)

Evaluation

In order to evaluate the performance of the model, no standard metrics were used; instead, a simulation closer to a practical use of the model was built.

Assuming to start with an initial capital (C) equal to 1, for each day of the evaluation set we divide the capital in N equal parts, where N goes from 1 to 154.

We put C/N on the top N stocks that our model predicts with the highest probabilities, 0 on the others.

At this point we have a vector A that represents our daily allocation, and we can compute the daily gain/loss (delta) as A multiplied by the percentage variation of each stock for that day.

We end up with a new capital C = C + delta, which we can re-invest on the next day.

At the end, we will have a capital greater or smaller than 1, depending on the goodness of our choices.
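The simulation loop described above can be sketched with NumPy. The function name and its inputs are hypothetical: `probs` are the model's per-stock rise probabilities and `pct_change` the daily percentage variations, both as (days, stocks) arrays:

```python
import numpy as np

def simulate(probs, pct_change, n):
    """Daily top-N re-investment simulation (hypothetical helper).

    probs: (days, stocks) predicted rise probabilities.
    pct_change: (days, stocks) percentage variation of each stock that day.
    """
    capital = 1.0
    for day_probs, day_change in zip(probs, pct_change):
        top = np.argsort(day_probs)[-n:]        # indices of the top-N stocks
        allocation = np.zeros_like(day_probs)
        allocation[top] = capital / n           # put C/N on each of them
        capital += float(np.sum(allocation * day_change))  # delta for the day
    return capital
```

With n equal to the total number of stocks, this reduces to the equal-split baseline discussed below.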

A good baseline for the model is N=154: this represents the generic performance of all the stocks, and it models the scenario in which we divide the capital equally among all of them. This produces a gain of around 4.27%.

For evaluation purposes, the data has been corrected, removing the days in which the market was closed.

The performance of the model, for different values of N, is reported in the picture below.

The red dotted line is the 0 baseline, while the orange line is the baseline with N=154.
The best performance is obtained with N=12, with a gain of around 8.41%, almost twice the market baseline.
For almost every N greater than 10 we get a decent performance, better than the baseline, while values of N that are too small degrade the performance.

Conclusion

It has been very interesting to try TensorFlow and CNNs for the first time and to apply them to financial data.
This is a toy example, using a quite small dataset and network, but it shows the potential of these models.

Please feel free to provide feedback and advice, or simply to get in touch with me on LinkedIn.


© 2019 Data Science Central