Transformer Stock Prediction Part 1

2024-07-08

I was listening to an episode of the Acquired Podcast where they were discussing Renaissance Technologies. They're a secretive hedge fund that uses machine learning (probably) to make a lot of money. At one point, the hosts hypothesized that RenTech discovered the underlying transformer architecture that powers modern large language models and have been using it for years. There is no way to verify this, and I think it's probably not true. But, knowing a little about algorithmic trading (I took a class in grad school 1) and a little about machine learning2, I decided to take a stab at it.

What are we doing here?

I'll keep this simple. We are trying to use a sequence of days of stock data from multiple stocks to predict the next day's stock information for every stock. We're basically attempting to forecast the entire market.

We're going to do this using a transformer model. We're going to pretty closely follow the architecture of the model in the original paper but we're only going to use the encoder. If we were attempting to generate a sequence of stock prices, we would use the decoder part of the architecture as well, but we're going to keep it relatively simple and just predict a single day's values.

Why this might work

Language is inherently a time series. The order of words matters. The stock market is a time series. The order of stock prices matters. So, it's not a huge leap to think that a model that is good at predicting the next word in a sentence might be good at predicting the next stock price. I'm not going to go into detail about how self-attention-based transformers work, but if you're interested and technically inclined, you should read the original paper.

Why this probably won't work

The interesting thing about work like this is that if it does work, it means it can make money. If it can make money, it means that someone is already doing it. If they're already doing it, they're probably doing it better than a random guy playing with Tensorflow on his laptop. So, odds are it doesn't work. But we can have fun along the way.

The plan

Let's organize this project a bit.

  1. Get data and select features
  2. Preprocess data
  3. Create a model
  4. Train and evaluate the model

Each of these will probably be its own post.

Let's talk data.

False start

My first thought was to grab a few random stocks and ETFs and load them into a fairly simple model. I pulled some CSVs of five years of data for AAPL, AMC, AMD, BA, BBY, CHWY, DOCU, GLD, GM, GME, JNJ, JPM, MSFT, NVDA, QQQ, SPY, UAL. I basically chose tech stocks, index funds, and a few meme stocks. I thought it would be interesting to see if the model could predict the meme stocks, since they basically don't follow any broader market patterns.

I grabbed as many features as I could and built a fairly complex model using some LSTM layers, a couple of dense layers, and the requisite dropout, etc. I trained it for 20 epochs and it did terribly. The big thing I saw was huge MAPE values and standard deviations. Like, in the hundreds. So I'm hundreds of percentage points off in terms of accuracy. I was seeing things like 400 +/- 200. Which is obviously bad. However, mean directional accuracy was around 60%, which is better than random. But if we think about this from an operational perspective, if you get the directionality of a stock right 60% of the time but the magnitude is off by a significant amount, you're still losing money. So, it's no good.
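
For concreteness, here's roughly what I mean by those two metrics. This is a quick NumPy sketch, not the exact code from my training script:

import numpy as np

def mape(y_true, y_pred):
    # Mean absolute percentage error: how far off the predictions are, in percent.
    return np.mean(np.abs((y_true - y_pred) / y_true)) * 100

def mean_directional_accuracy(y_true, y_pred):
    # How often the predicted day-over-day direction (up/down) matches the actual direction.
    true_dir = np.sign(np.diff(y_true))
    pred_dir = np.sign(np.diff(y_pred))
    return np.mean(true_dir == pred_dir)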

There were two things to be done: use a more sophisticated model or add more data. I'm going to do both. But let's talk about data first.

Data 3

This part was actually easy. I searched for "stock data" and found a Kaggle dataset with all of NASDAQ up to 2020. It looks like this was a direct download of Yahoo Finance data, which is convenient, because that's what I was using before. It's basically one CSV of OHLCV data per ticker. It includes 5,884 stocks. This is what we'll use for testing.
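
Each ticker is its own CSV in the usual Yahoo Finance layout, so loading one looks roughly like this (assuming the standard Yahoo column names):

import pandas as pd

# Standard Yahoo Finance columns: Date, Open, High, Low, Close, Adj Close, Volume
df = pd.read_csv("AAPL.csv", index_col="Date", parse_dates=True)
print(df[["Open", "High", "Low", "Close", "Volume"]].tail())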

Features

This part shows my level of sophistication here. As I have mentioned previously, I'm a brute force engineer. We're going to throw everything at the wall. But I'm thinking about this from two perspectives: one based on technical indicators (financial stuff) and one based on signal processing (thanks again Acquired for helping me think of this as a signal processing problem).

Financial features

We're grabbing everything we can think of. This includes:

The point with these is to capture as much information about the stock as possible4. I'm doing literally no thinking or complex feature engineering here. I'm just grabbing everything I can think of.
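
To give a flavor of what I mean (these are just a couple of representative indicators, not my actual list), computing a few of them with pandas looks something like this:

import pandas as pd

def add_some_technical_features(df):
    # A couple of representative indicators; the real script piles on many more.
    df["sma_20"] = df["Close"].rolling(window=20).mean()   # 20-day simple moving average
    df["daily_return"] = df["Close"].pct_change()          # day-over-day percent change

    # 14-day RSI (relative strength index), simple-moving-average flavor
    delta = df["Close"].diff()
    gain = delta.clip(lower=0).rolling(window=14).mean()
    loss = -delta.clip(upper=0).rolling(window=14).mean()
    df["rsi_14"] = 100 - 100 / (1 + gain / loss)
    return df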

Signal processing features

I was a little more creative here. I grabbed some things I could think of, but also did a little bit of research. I'm not going to go into detail about what these are, but I'll link to the Wikipedia page for each.

So, the main thing we want to do is extract some salient data from each stock as if it were a signal. Think of the standard waveform properties like amplitude, frequency, and phase, but also some more complex things like entropy and autocorrelation. The idea is to analyze each signal by breaking it down into its constituent parts and seeing if we can predict the next value based on those parts.
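
Here's a rough sketch of the kind of thing I mean, pulling a few of these out of a closing-price series with NumPy (not my exact feature code, and the windowing details are glossed over):

import numpy as np

def some_signal_features(close_prices):
    # Treat the closing-price series as a signal and pull out a few summary values.
    x = np.asarray(close_prices, dtype=float)
    x = x - x.mean()                                     # remove the DC offset

    spectrum = np.abs(np.fft.rfft(x))                    # magnitude spectrum
    freqs = np.fft.rfftfreq(len(x), d=1.0)               # cycles per trading day
    dominant_freq = freqs[np.argmax(spectrum[1:]) + 1]   # strongest non-zero frequency

    # Spectral entropy: how spread out the signal's energy is across frequencies.
    p = spectrum / spectrum.sum()
    spectral_entropy = -np.sum(p * np.log2(p + 1e-12))

    # Lag-1 autocorrelation: how strongly today resembles yesterday.
    autocorr_1 = np.corrcoef(x[:-1], x[1:])[0, 1]

    return {"dominant_freq": dominant_freq,
            "spectral_entropy": spectral_entropy,
            "autocorr_1": autocorr_1}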

A note on features

I can't give a good answer why I am using so many technical features and so few signal processing features. I think it's because I'm more comfortable with the technical indicators. I'm going to try to add more signal processing features in the future. But for now, this is what we have.

Data processing

It's hard for me to describe how painful data processing can be. But maybe this story about getting the stock pricing data to a point where I could actually use it will help you understand.

So, when we left off we had decided on a boatload of features, we had a dataset with 5,884 stocks (all of NASDAQ up to 2020), and we were ready to fit the data into a model. But all is not so simple.

The general process is:

  1. Load the data into a pandas dataframe
  2. Add technical features
  3. Add signal processing features
  4. Backfill and forward fill missing data
  5. Normalize the data
  6. Send the data to our training script

I first tested this out with about 20 stocks and things worked pretty well. But once I tried the whole dataset I ran into some issues. The process died in the middle of step 3. With all the features we were adding, we were running out of memory. So we had to change things up.

First, I replaced some pandas code with iteration. I decided to iterate through each stock, one at a time, adding the features to a separate dataframe. Then I saved the scaled features to disk as an .npy file. So, something like this:

import os
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

for filepath in filepaths:
    data = pd.read_csv(filepath, index_col="Date", parse_dates=True)  # Yahoo-style CSV with a Date column
    data = add_technical_features(data)
    data = add_signal_processing_features(data)
    data = data.reindex(pd.date_range(start=data.index.min(), end=data.index.max()))  # fill in missing dates
    data.ffill(inplace=True)  # forward fill NaNs
    data.bfill(inplace=True)  # backfill any NaNs left at the start

    features = data[[ ... long list of features here ... ]]
    scaler = MinMaxScaler(feature_range=(0, 1))  # scale each feature to [0, 1]
    features_scaled = scaler.fit_transform(features)
    stock_name = os.path.basename(filepath).split('.')[0]

    np.save(os.path.join(output_dir, f"{stock_name}_scaled.npy"), features_scaled)  # save the processed array

This is mainly so I don't have to do data processing each time, but I could write some sort of dynamic data generator to load in the data as we go. But that sounds like work. Regardless, this worked out pretty well. I was able to get all the data processed and saved to disk. I loaded a subset of the data in chunks to the training script and trained a small, throwaway model for testing. It worked. Well, at least the process didn't die. Well, this process didn't die. The next one died, but not this one. So, progress.

The obvious issue

The clever among you may see the obvious problem here. What happens when we load all the data? We're going to have to load all of the data into memory to train the model. How much memory is all this anyway? This is smaller than most LLM datasets, right? The problem is that this is a time series problem, so we are creating sequences, basically a sliding window of values. The default sequence length I was using was 60 days, so this drastically increased the size of the dataset. I don't know about you, but I don't have 768GB of RAM on my computer. Were I a big hedge fund, I could just throw a few H100s at the problem and be done with it. But alas, I am not. So, we'll have to actually reduce the size of the dataset.
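
To make the blow-up concrete, building the sequences looks roughly like this (a sketch; the names and the 60-day default are just illustrative). Every window is a full copy of seq_len rows, so the windowed array ends up roughly seq_len times the size of the raw data:

import numpy as np

def make_sequences(features_scaled, seq_len=60):
    # Each window copies seq_len rows, which is why memory explodes.
    X, y = [], []
    for i in range(len(features_scaled) - seq_len):
        X.append(features_scaled[i:i + seq_len])   # seq_len days of features
        y.append(features_scaled[i + seq_len])     # the following day's values
    return np.array(X), np.array(y)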

We can do that in a few ways. We can reduce the number of features we're using, we can reduce the number of stocks we're using, and/or we can reduce our sequence length. I first tried to reduce the number of features, drastically cutting down the number of technical indicators I was using. We went down to 12 features overall. This helped a lot, but not quite enough. ~400GB of RAM still won't cut it. I have 64GB on my machine and I can only safely allocate up to 58GB to my VM. So, we have more work to do.

The next thing I did was reduce the data type to float16. This drastically reduced the memory usage, but it also costs us precision, given how small some of the numbers are. But we do what we have to do. At this point, we're down to sub-100GB. Still not enough.
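
Concretely, that's just a cast when the arrays come back off disk, something like this (path is a placeholder for one of the saved .npy files):

import numpy as np

# float32 -> float16 halves the bytes per value, at the cost of precision.
features_scaled = np.load(path).astype(np.float16)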

Well, now we can reduce the size of the sequences and see what we get. I reduced the sequence length to 20 days. This helped a lot. We were down to about 50GB of memory usage. This is still a lot, but I can allocate that to my VM. So, we're good to go. Or are we?

The not so obvious issue

One thing I didn't account for is that the training process also needs a bit of memory in addition to the memory needed to load the data. But even loading the data into the training script was a bit of a problem. This is where data generators step in.

Data generators are incredibly powerful, and TensorFlow has a whole robust API for managing streaming and batched data with models. But I wasn't really in the mood to learn all that. So I wrote a simple data generator that just serves the data in chunks. It looks like this:

def data_generator(X, y, batch_size):
    num_samples = len(X)
    while True:  # loop forever; Keras stops each epoch after steps_per_epoch batches
        for offset in range(0, num_samples, batch_size):
            batch_X = X[offset:offset+batch_size]  # one batch of sequences
            batch_y = y[offset:offset+batch_size]  # the matching targets

            yield batch_X, batch_y

Okay, so what are we doing here? We're creating a standard generator that will yield batches of data. This way we don't have to load all the data into memory at once. We can load it in chunks. This is a bit slower, but it's the only way we can get this to work. We can then use this generator in our training script like so:

train_gen = data_generator(X_train, y_train, batch_size)
val_gen = data_generator(X_val, y_val, batch_size)

model.fit(
    train_gen,
    validation_data=val_gen,
    epochs=epochs,
    steps_per_epoch=len(X_train) // batch_size,
    validation_steps=len(X_val) // batch_size,
)
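
For reference, the "robust API" route I skipped would be to wrap this same generator in tf.data. Something like the sketch below, where seq_len and num_features are placeholders; this isn't what I actually ran:

import tensorflow as tf

# Wrap the existing generator in a tf.data pipeline so TensorFlow handles prefetching.
train_ds = tf.data.Dataset.from_generator(
    lambda: data_generator(X_train, y_train, batch_size),
    output_signature=(
        tf.TensorSpec(shape=(None, seq_len, num_features), dtype=tf.float16),
        tf.TensorSpec(shape=(None, num_features), dtype=tf.float16),
    ),
).prefetch(tf.data.AUTOTUNE)

# The dataset is infinite (the generator loops forever), so steps_per_epoch is still required.
model.fit(train_ds, epochs=epochs, steps_per_epoch=len(X_train) // batch_size)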

I'm fairly certain I'm on the right track so far, but I won't know for sure until I actually have a full model I can train with. But I'm getting ahead of myself. I still have to build the model, which I'll go through next time. But at least I can get to the training step! I mean, it dies immediately upon training, but still a W. For now. I can tell you for a fact I'm going to run into memory issues again. There's a reason why H100s are so in demand.


  1. Funny story, in that very same class we had an assignment to perform fundamental and technical analysis of a stock. I randomly picked GME and wound up deciding it was severely undervalued, so I bought a bunch of shares at $4. I held on to those shares for years. I eventually sold them at $18. Had I held out for another two months I could have sold them at $300. 

  2. I should preface that I really don't know what I'm doing here. I haven't built a model from scratch in a while, and I definitely haven't built a transformer before. My explanations and reasoning are probably inaccurate. So, think about this as an adventure in mediocre data science. 

  3. Warning: most data science is data. This post is mainly about data mangling. In the next one, we'll actually talk about the model. 

  4. Okay, this is an absurd number of features. In no world is this necessary or in any way a good idea. But I'm just throwing things at the wall knowing that I'll need to clean up a little later. Brute force.