An interesting (if unsuccessful) look into predicting horse races via machine learning with F#

Has Zone’s head of .NET development, Andy Butland, found a foolproof way to beat the bookies using a decision tree algorithm in F#? Not just yet…


A few work colleagues and I have started to get interested in machine learning and have taken on a practical (well, kind of) project to see what insights we can mine from a large set of horse racing data. There will hopefully be a few posts following this one outlining different strategies and technologies we’ve looked into as part of this.

My first steps into this area have been to see if I could implement a decision tree algorithm in F# that could predict which horses would place based on a number of different features.

Caveats and credits

Let me start this post with a couple of caveats. First, this is a long way from a how-to article — I’m just feeling my way with this and have posted this in part because I find it interesting, and mostly to share what I’ve done so far with my colleagues. Secondly, none of us have got rich from this yet! Nonetheless it’s always more fun learning with something real and this is a nice data set to work with.

In terms of credit for helping me get as far as I have, almost all goes to Mathias Brandewinder, author of Machine Learning Projects for .NET Developers.

I can strongly recommend this book — not only for its clear examples of how different algorithms can be applied in .Net, but also as a great, practical introduction to F#, which was new to me. I wouldn’t say it is easy going — in this area there’s a fair bit of maths involved of course — but he explains all the steps very clearly.

And hopefully after that recommendation he won’t mind me referencing and adapting some code from the work he did on that book — which he’s open-sourced on GitHub — particularly from Chapter 6, where he does a similar exercise using the Titanic passenger list data set.

Some F# pointers

For anyone like me coming to F# from a C# background, it’s likely this section will be useful to make sense of the code that follows. It’s not a comprehensive introduction, but should cover enough to follow the rest of the article. Feel free to skip if you know F# already.

In Visual Studio, starting from a new Visual F# Library project you get a couple of sample files created. There’s one with a .fs extension, used to create modules, analogous to C# classes. And one with a .fsx extension that’s for scripting. This scripting option is one of the features that makes F# a very suitable language for experimental, data analysis work. You are able to run scripts, or parts of them, in the F# interactive window for immediate execution. What’s more, previous executions are held in memory, so if for example you’ve loaded a fairly large data set, you can tweak and re-run functions operating on that data without having to wait for it to reload.

In an F# project we can pull in NuGet references in just the same way as any other .Net one, for example:

We then reference them in our script like this (using open instead of using as we would in C#):
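The original snippet isn’t shown here, but for a script using the FSharp.Data package it might look something like this (the assembly path is an assumption — it depends on the package version and where NuGet restored it):

```fsharp
// Reference the downloaded assembly from the packages folder...
#r @"..\packages\FSharp.Data\lib\net45\FSharp.Data.dll"

// ...then open the namespace, as we would with 'using' in C#
open FSharp.Data
```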

We can also load modules from our project into our scripts like this:
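Something along these lines, with the module file name purely illustrative:

```fsharp
// Compile and load the module's source file into the scripting session
#load "RaceAnalysis.fs"

// Then open the module just like a namespace
open RaceAnalysis
```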

In this project I’ve utilised another very handy F# feature that we don’t have in C#, which is type providers. Specifically I’m using a CSV type provider, which can load a CSV file with headers, sample it to determine the types of each column used and create a type to describe that data which can then be used in code.

I’ve used the IgnoreErrors flag because I ran into a problem with the sampling. By default it uses the first 1,000 rows to examine the data and determine the types of each field. In our data we had a date field that was populated for the first 1,000 rows but had some missing values thereafter. As such the type provider determined to use DateTime rather than a DateTime? (a nullable date time). There are other ways around that by providing parts of the schema yourself, to override the behaviour, but for me just ignoring a few rows was fine.
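A sketch of what that looks like, with the file name an assumption based on the description above:

```fsharp
open FSharp.Data

// Infer a typed representation of the data from the CSV file's headers and a
// sample of its rows; IgnoreErrors skips rows that don't match the inferred
// schema (such as the rows with missing dates mentioned above)
type RaceDetails = CsvProvider<"races.csv", IgnoreErrors = true>

let data = RaceDetails.Load("races.csv")
```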

I’m very much just scratching the surface here, but there are a few key things to note about F# functions for someone coming from a C# background. Firstly, there’s no return statement. The function simply returns the last evaluated value. So the following simple function returns a boolean of true if the horse placed in the first three positions:
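A minimal version of that function might read (assuming the finishing position is held as a string, as described below):

```fsharp
// Returns true when the horse finished in the first three positions.
// Note there's no return keyword: the last expression is the result.
let isPlaced position =
    position = "1" || position = "2" || position = "3"
```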

The name of the function above is isPlaced and it takes one parameter called position. It’s a string, but we aren’t specifying that in the signature as we do in C#. Nonetheless it’s still statically typed. F# is able, in most cases, to analyse the code and infer the types of function parameters. In cases where it can’t you can specify them, but the usual style is not to do so unless you have to, leading to the terser syntax idiomatic of F#.

Tuples we have in C# of course, but it seems they are used much more frequently in F# code. They are a comma-separated collection of values used for creating ad hoc data structures, which group together related values.

Many operators, such as the or (||) used above, are the same as used in C#, but there are also quite a few differences. For a full reference, see here.

Important to reference for our purposes is the pipe operator (|>). What this does is say “take the output of the previous function and pass it as the last parameter to the next one”. Using this operator you can create chained function calls, in some ways quite similar to what we see with LINQ in C#.

So in this function for example we take in our data set, pipe it to a sequence function to take a limited number of records, pipe that to a function that takes an average value from the result of another function, and finally pipes that into a print function so we can see the result.
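An illustrative sketch of such a pipeline (the Rows property comes from the CSV type provider; the Form column is an assumption):

```fsharp
// Take a sample of rows, average a numeric field over them, print the result
data.Rows
|> Seq.take 1000
|> Seq.averageBy (fun row -> float row.Form)
|> printfn "Average form over the sample: %f"
```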

F# has a stack of support for operations on collections, some of which are seen in the code samples that follow. Seq represents sequences, which in C# are implementations of IEnumerable. Most operations are obvious (take, averageBy). Map is worth mentioning though: it results in a new collection whose elements are the results of applying the given function to each element in the collection.

There’s a couple of examples of discriminated unions in the following code. These aren’t particularly familiar to a C# developer, but offer a powerful way for a function to handle a restricted set of different types and act on each accordingly, a bit like a switch statement on types.
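A simple example, unrelated to the racing code, shows the shape of a discriminated union and the pattern match that consumes it:

```fsharp
// A value of type Shape is exactly one of these two cases
type Shape =
    | Circle of float            // radius
    | Rectangle of float * float // width, height

// Pattern matching handles each case; the compiler warns if one is missed
let area shape =
    match shape with
    | Circle r -> System.Math.PI * r * r
    | Rectangle (w, h) -> w * h
```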

Evaluating a baseline

With this model we’re going to look to see if we can predict whether or not a horse is placed in a race, based on the values of one or more features we can potentially know before the race begins. For simplification I’m taking placed as finished in the top three places — in real races, with more riders, more positions may pay out for a placed (or each-way) bet.

For any predictive model it’s valuable to start with a baseline, likely the simplest possible probability estimate we can make. In this case it would simply be to guess for a particular horse, is it going to place or not, based on no other information at all.

To do that (and for the rest of the script code) there’s a bit of set up — referencing the necessary libraries, creating types via type provider, loading up our dataset and setting up some variable values and helper functions:
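Pulling the earlier pieces together, that set-up might look roughly like this (paths, file and column names are all assumptions):

```fsharp
#r @"..\packages\FSharp.Data\lib\net45\FSharp.Data.dll"
open FSharp.Data

// Typed access to the racing data via the CSV type provider
type RaceDetails = CsvProvider<"races.csv", IgnoreErrors = true>

// Work with a fixed sample of the data
let sampleSize = 50000
let dataset =
    RaceDetails.Load("races.csv").Rows
    |> Seq.take sampleSize

// Helper: did the horse finish in the first three?
let isPlaced position =
    position = "1" || position = "2" || position = "3"
```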

Then we can run this statement to determine the chance of picking a placed horse simply by randomly picking from the list of riders:
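The statement itself might be along these lines (the Position column is an assumption):

```fsharp
// The proportion of rows where the horse placed: i.e. the chance that a
// horse picked blindly from the list turns out to have placed
dataset
|> Seq.averageBy (fun row -> if isPlaced row.Position then 1.0 else 0.0)
|> printfn "Chance of a random pick placing: %.3f"
```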

The result for the 50,000 row sample is 29.006%, or putting it another way, the majority of horses are not placed, and so if we were going to predict for a single given horse, with no other information, whether it would place or not, we’d predict that it wouldn’t. And we’d have a 70.994% chance of being right.

Analysing with decision stumps

We can now move on to look at cases where, given the value of a particular feature of the horse or race, can we improve this prediction? The features I’m going to look at — at this point, mostly as they are easy to deal with from the data set rather than necessarily being likely to affect the result of the race — are the sex and age of the horse (colt, mare, filly etc.), the country of origin and the form going into the race.

The first two are held as string data in the dataset so we can use them as is. Form though is recorded as a decimal value, with numbers less than one indicating the horse is improving its position in recent races. Given that, we can define our features, and the label we are looking to predict (whether the horse will place), as follows, extracting the appropriate fields from the RaceDetail type we defined using the type provider:
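A sketch of those definitions, with the column names (Position, SexAndAge, Origin, Form) as assumptions:

```fsharp
// The label we want to predict
let placed (row: RaceDetails.Row) =
    if isPlaced row.Position then "Placed" else "Not Placed"

// Features held as strings can be used directly
let sexAndAge (row: RaceDetails.Row) = row.SexAndAge
let origin (row: RaceDetails.Row) = row.Origin

// Form is a decimal; below one means the horse is improving
let form (row: RaceDetails.Row) =
    if row.Form < 1.0m then "Improving" else "Not improving"
```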

In order to make our code more general, and to potentially extend to other features, we first need to create a function that can act as our “classifier”. We want to be able to say, “given a particular observation (an instance of our RaceDetail type), using a given feature (e.g. form), predict a given label (e.g. placed)”.

Firstly we define a helper function, that calculates the most frequent label found for the provided parameter group, which is defined as a sequence of tuples of a feature and a label, drawn from a sample of observations. We first count by the second value of the tuple (the label), so we get a list of labels and the number of times it’s found. Then we find which label is found the most. And finally return it.
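That helper might be written like this (the function name is mine):

```fsharp
// Given a sequence of (feature value, label) tuples, find the label
// that occurs most often in the group
let mostFrequentLabelIn group =
    group
    |> Seq.countBy snd   // (label, count) pairs
    |> Seq.maxBy snd     // the pair with the highest count
    |> fst               // just the label
```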

So given this input for our origin filter:
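For example, a group of (origin, label) tuples like this (the values are made up for illustration):

```fsharp
[ "IRE", "Not Placed"
  "IRE", "Placed"
  "IRE", "Not Placed" ]
|> mostFrequentLabelIn
```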

We’d get the following output: “Not Placed”

Then a second helper function that uses the first to obtain a grouped result of all the feature values along with the most frequently found label for that value. It takes three parameters — the data set, and two functions as defined above, one that returns the feature value and one the label value.

Firstly we map the data set to a sequence of tuples, made up of the feature and label values. We then group by the feature. And finally map to another sequence made up of the feature value and the most frequently found label in the group.
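That might look like the following sketch, building on the first helper (again, the name is mine):

```fsharp
// For each distinct feature value in the data set, find the most
// frequently occurring label
let labelsByFeatureValue data feature label =
    data
    |> Seq.map (fun obs -> feature obs, label obs) // (feature, label) tuples
    |> Seq.groupBy fst                             // group by feature value
    |> Seq.map (fun (value, group) -> value, mostFrequentLabelIn group)
```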

And now to our main function that creates the general purpose classifier function. It takes the same parameters — a data set, a function to retrieve a feature and a function to retrieve a label value — and returns a new function, one that will take an observation (e.g. a race detail) and return a label (e.g. “placed” or “not placed”).

First of all we use the helper function above to group together observations that have the same value for the selected feature (e.g. “origin”), and find the most frequent label (e.g. “placed/not placed”) by group. We then return a function that, for a given observation, will find the group with the matching feature value and make its prediction using the most frequent label for that group.
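Adapting the pattern from the book’s Titanic chapter, a sketch might be:

```fsharp
// Learn the most frequent label per feature value from the sample, then
// return a classifier function from observation to predicted label
let learn data feature label =
    let groups =
        labelsByFeatureValue data feature label
        |> Map.ofSeq
    // The classifier: look up the observation's feature value and
    // answer with that group's most frequent label
    fun obs -> groups.[feature obs]
```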

We can then use the above functions to create classifier functions for each of our candidate features, like this:

And test them like this — by checking for each observation whether the predicted value matches the actual value, and seeing how often we get it right:
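Concretely, that might look like the following (names are mine):

```fsharp
// One decision stump per candidate feature
let bySex = learn dataset sexAndAge placed
let byOrigin = learn dataset origin placed
let byForm = learn dataset form placed

// How often does a classifier's prediction match what actually happened?
let accuracy classifier =
    dataset
    |> Seq.averageBy (fun obs -> if classifier obs = placed obs then 1.0 else 0.0)

[ "Sex and age", bySex; "Origin", byOrigin; "Form", byForm ]
|> List.iter (fun (name, classifier) -> printfn "%s: %.3f" name (accuracy classifier))
```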

We get the following results:

Which, to all intents and purposes, gives us zero improvement on our baseline. For any given feature value, if we had to predict the result, we’d be best predicting that the horse isn’t placed, same as we would with no information at all.

Maybe disappointing, but actually not surprising when you consider it. Clearly if there were one single feature that could help predict the binary choice of whether or not a horse would place in a race, likely someone would have found it by now! Placing in a race is a relatively unlikely event.

As discussed initially I’m modelling this analysis on the work presented in Mathias Brandewinder’s book using the Titanic data set. In that case, the label under test — whether a given passenger survived or not — is more of a close-run thing. As it turned out, the majority of women survived, whereas the majority of men didn’t. Similarly for ticket class, more than 50% in first class survived, which wasn’t the case for those in third. So in that data set there were some features that even by themselves had some predictive value — knowing the feature value would in some cases lead to you predict “survived”, even if the majority of passengers didn’t. We don’t have that in our horse racing data by the looks of it, at least for the features currently under test.

Growing stumps into trees

Although a decision stump didn’t add any predictive value, it’s possible a decision tree might. This is where we take several features and see if together they can improve on our model. So, for example, if a horse is showing improving form, and is also of a particular sex and from a certain country, that combination may just prove more fruitful for picking winners. Going back to the Titanic example, if the passenger was a woman in first class, then her survival chances were (relative to other passengers at least) quite good — and those two features in tandem have more predictive value.

We don’t want to pick from our set of features at random though; rather, at each branch in the tree we ask the question that gives us the most information gain in terms of arriving at our answer. Taking a trivial, illustrative example, if we were playing Guess Who? and got an affirmative answer to “are they a man?”, following up with “are they wearing earrings?” wouldn’t add much information compared to other questions that could be asked.

Mathematically this is represented by the concepts of entropy and information gain, which can be calculated in F# like this. More evenly distributed samples give us less information (each guess is as good as another), and have a higher entropy value:
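Adapted from the book, an entropy function might look like this (using natural logarithms; the choice of base only scales the values):

```fsharp
// Shannon entropy of the labels in a sample: zero when every label is the
// same, highest when the labels are evenly distributed
let entropy label data =
    let size = data |> Seq.length
    data
    |> Seq.countBy label
    |> Seq.sumBy (fun (_, count) ->
        let p = float count / float size
        - p * log p)
```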

Given the fact that with lower entropy we have better information, we can create a function to identify the weighted average entropy of each group after we split a data set on a given feature:
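A sketch, building on the entropy function above:

```fsharp
// Weighted average entropy of the groups we get by splitting the data on a
// feature: the lower the result, the more informative the split
let splitEntropy label feature data =
    let size = data |> Seq.length |> float
    data
    |> Seq.groupBy feature
    |> Seq.sumBy (fun (_, group) ->
        let groupSize = group |> Seq.length |> float
        (groupSize / size) * entropy label group)
```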

Testing the above like this:
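Along these lines, assuming the feature functions defined earlier:

```fsharp
// The entropy of the sample itself, then the weighted entropy after
// splitting on each candidate feature
printfn "Overall: %.4f" (entropy placed dataset)

[ "Sex and age", sexAndAge; "Origin", origin; "Form", form ]
|> List.iter (fun (name, feature) ->
    printfn "%s: %.4f" name (splitEntropy placed feature dataset))
```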

Leads to these results:

Thus we see that the horse’s form looks to be — marginally, admittedly — the most informative, which seems on the right lines. After all, dedicated punters spend a fair bit of time studying this before placing their bets.

We can then do a split again:
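For instance, splitting first on form and then measuring each feature within the resulting groups:

```fsharp
dataset
|> Seq.groupBy form
|> Seq.iter (fun (formValue, group) ->
    printfn "Form = %s" formValue
    [ "Sex and age", sexAndAge; "Origin", origin; "Form", form ]
    |> List.iter (fun (name, feature) ->
        printfn "  %s: %.4f" name (splitEntropy placed feature group)))
```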

Leading to results like this:

Note that for each group, as we’d expect, we don’t get any additional information gain by splitting on the same feature (form) again. But the next group that’s most advantageous to split on may vary for each base group.

Based on the above, we can extend to a fully-fledged decision tree.

Worth just flagging again here that much of the code in this section is slightly adapted from the Titanic data study in the Machine Learning Projects for .Net Developers book, open-sourced here. The implementation there is more sophisticated — in particular with handling missing values and using generics to support a variety of data sets rather than just one. I’ve removed those features, partly as I didn’t strictly need them for this analysis, but also just to reduce the concept count when working with (for me) a new language and paradigms.

First of all we define some types: a Feature and a Label, both of which are functions taking a RaceDetail observation and returning a string value. And a NamedFeature consisting of a tuple of a string name and a Feature:

Then a type for the Tree. The Tree can result in one of two things — either we have reached an Answer, containing the label value as a string. Or we have a decision Stump, which we define as a tuple of a NamedFeature and a map from each feature value to another Tree, which in turn may lead to an Answer or another Stump.
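Together, those types might be sketched as follows (with RaceDetails.Row standing in for the observation type):

```fsharp
// A feature and a label both extract a string value from an observation
type Feature = RaceDetails.Row -> string
type Label = RaceDetails.Row -> string

// A feature together with a display name
type NamedFeature = string * Feature

// A tree is either a final answer, or a stump: the feature we split on,
// plus a subtree for each value that feature can take
type Tree =
    | Answer of string
    | Stump of NamedFeature * Map<string, Tree>
```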

We next need a function that handles a particular observation. If we’ve got an answer, that’s what we return. Otherwise we recurse down the tree:
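A sketch of that function:

```fsharp
// Walk the tree for a single observation: an Answer is returned directly,
// otherwise follow the branch matching the observation's feature value
// (this simplified version assumes every value was seen when growing the tree)
let rec decide tree observation =
    match tree with
    | Answer label -> label
    | Stump ((_, feature), branches) ->
        decide branches.[feature observation] observation
```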

We can then grow our decision tree via a function that takes a collection of features, another of filters for these features (explained later — we may not want to include all features), our data set and the label we want to predict. Further explanation on the next code sample is in the inline comments:
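Adapted and simplified from the book’s version, the function might look like this (with the missing-value handling stripped out, as mentioned above):

```fsharp
// Grow a tree from the data; 'filters' can veto candidate features at each
// branch (e.g. no entropy gain, or branches that would be too small)
let rec growTree filters data (label: Label) (features: Map<string, Feature>) =
    // Keep only the features every filter accepts for this subset of data
    let candidates =
        features
        |> Map.filter (fun _ feature ->
            filters |> List.forall (fun accept -> accept data label feature))
    if Map.isEmpty candidates then
        // Nothing left to split on: answer with the most frequent label
        data |> Seq.countBy label |> Seq.maxBy snd |> fst |> Answer
    else
        // Split on the feature with the lowest weighted entropy
        let name, feature =
            candidates
            |> Map.toSeq
            |> Seq.minBy (fun (_, f) -> splitEntropy label f data)
        // Don't reuse the same feature further down this branch
        let remaining = Map.remove name features
        // Grow a subtree for each value the chosen feature takes
        let branches =
            data
            |> Seq.groupBy feature
            |> Seq.map (fun (value, group) ->
                value, growTree filters group label remaining)
            |> Map.ofSeq
        Stump ((name, feature), branches)
```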

Lastly we can add a handy recursive helper function for displaying the tree in an easy-to-read format:
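A sketch:

```fsharp
// Pretty-print the tree, indenting two spaces per level of depth
let rec display depth tree =
    let padding = String.replicate (2 * depth) " "
    match tree with
    | Answer label -> printfn "%s-> %s" padding label
    | Stump ((name, _), branches) ->
        branches
        |> Map.iter (fun value subtree ->
            printfn "%s? %s = %s" padding name value
            display (depth + 1) subtree)
```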

To use the above tree, we first define our label and features:
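For example:

```fsharp
let label : Label = placed

let features =
    [ "Sex and age", sexAndAge
      "Origin", origin
      "Form", form ]
    |> Map.ofList
```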

We can also define some filters. This is useful to try to mitigate the danger of “over-fitting”, which is where we create a model that fits our test data extremely well, but doesn’t perform well as a predictor. A couple of filters we might apply are to only use features that give us a positive entropy gain, and to remove tree branches that have very small sample sizes.
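Sketches of those two filters (the minimum leaf size of 10 is an arbitrary choice for illustration):

```fsharp
// Only split when doing so actually reduces entropy...
let entropyGain data label feature =
    splitEntropy label feature data < entropy label data

// ...and only when every branch keeps a reasonable number of observations
let leafSize minSize data _ feature =
    data
    |> Seq.groupBy feature
    |> Seq.forall (fun (_, group) -> Seq.length group > minSize)

let filters = [ entropyGain; leafSize 10 ]
```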

Then we can grow our tree:
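Which, with the pieces described above, is a single call:

```fsharp
let tree = growTree filters dataset label features
```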

We can then view our results in two ways, firstly by viewing the tree itself:
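Using the display helper from earlier, starting at depth zero:

```fsharp
display 0 tree
```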

Which gives us:

And with a check on how our predictions compare with the actual placings from the historical data.
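That check might read:

```fsharp
// The proportion of observations where the tree's prediction matches
// what actually happened
dataset
|> Seq.averageBy (fun obs -> if decide tree obs = label obs then 1.0 else 0.0)
|> printfn "Correct predictions: %.5f"
```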

Which comes back as 70.994%.

Sound familiar? Yes, unfortunately even with this decision tree we still aren’t able to improve our predictive ability. Which is clear from the display of the tree too. Only a split based on form and then into the sex of the horse is getting past our filters, and then for each of those we still see we’d be best off assuming a horse in that category won’t place in the race.

So, rather disappointing so far, but just to get something out, let’s remove the leaf minimum size filter. Without this we don’t have a model we really expect to rely on as many of the predictive recommendations will be based on far too small sample sizes. But at least we do get some predictions:

Testing it out

At that point, certainly completely jumping the gun with this poor excuse for a predictive model, I figured I’d try it out. I picked out a horse running at Aintree in the 2:20 that the model suggested would be likely to place, based on sex, origin and form — and found a colt, from France, with improving form. At 16/1, so a long shot, but just maybe with my data insight I’d have an edge.

Unfortunately, he came in ninth.

Out of nine.

Oh well, I won’t be giving up the day job quite yet.
