Teaching a .NET developer new tricks: machine learning with ML.NET
Zone’s head of .NET development, Andy Butland, predicts a bright future for machine learning framework ML.NET…
I’ve been looking at the topic of machine learning for a couple of months now, and in particular how progress can be made by a .NET developer like myself, who has no particular background in data science. It’s certainly an exciting time to be working in this space, with the availability of services and libraries making machine learning problems open to more generalist developers.
Data science experts are generally working within a language with a greater historic association with the subject, such as Python or R, and getting a grounding in one of these is still likely a prerequisite for really getting into the topic. For developers happier working in .NET languages such as C# and F# though, there’s still plenty to dig into.
Personally, inspired by the book Machine Learning for .NET Developers, I’ve worked on a sample problem to try to predict the results of horse races. I’ve been spectacularly unsuccessful from a ‘get rich quick’ point of view, but it’s been interesting and useful learning how to build a low-level decision tree using F#. At the other extreme, I’ve worked with the Custom Vision Service, part of the Azure Cognitive Services, which provides a suite of algorithms available for use as hosted services that you can treat as ‘black boxes’ in terms of how they operate, working with the APIs to train models and using them to make predictions.
Most recently, I’ve started investigating ML.NET — a machine learning framework built for use by .NET developers — which provides a way in to the topic somewhere in between the two previous examples in terms of level of abstraction.
Picking a topic
When looking for a dataset with which to start investigating a new machine learning technology, Kaggle offers a wealth of structured data that is available to download and use. I picked two datasets — one was a set of country-based geographic and demographic data, and the other the results of a survey for how happy people in a particular country claim to be, leading to a ‘happiness score’ and rank for each.
My idea was to investigate if a model based on the demographic data could be used to predict the happiness score for a country. In addition, it would be interesting to see which factors have the most influence, and for some, in which direction they act.
Before opening Visual Studio, I first did some data manipulation in good old Excel. Steps involved importing the two datasets and a list of country ISO codes from Wikipedia, adding some look-up formulae and making some manual updates — matching up country names and steering clear of various political issues such as disputed territories — to end up with a single sheet containing the combined dataset that I could output as a CSV file.
During the following investigations I often switched to simpler data files, with the same format but with obvious examples such as all columns set to zero apart from one with a clear positive or negative correlation. By temporarily using these I could sanity check the results, and make sure I’d not missed something obvious like the direction of a result or a mismatched column.
With the input data prepared, I then created a new console app project in Visual Studio using .NET Core and added the Microsoft.ML NuGet package. At the time of writing the latest version available was 0.11. Obviously this is still a zero-point release, so the code shown in this article and held in the code repository could change; although there’s already been some churn in the APIs, it’s likely they are settling down now.
When working with ML.NET the first step is to create a new instance of MLContext, which will be required as a parameter for most other methods involved with training and evaluating a model.
After that we need to import the data, which is done via the LoadFromTextFile method, to which we need to provide the path to the data and some details of the file format. We also provide a type parameter — in my case CountryData — a custom class within which we define the columns we wish to load, and attribute them with their position in the input file. Unless any pre-processing is required, it makes sense to simply define any numeric or boolean fields as floats, as that’s the type that’ll be required for the training steps.
Initially I modelled these as more appropriate types, but then found errors such as System.InvalidOperationException: ‘Column ‘Population’ has values of I8 which is not the same as earlier observed type of R4.’. This was resolved simply by using floating point values as the field type.
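Putting those first steps together, a minimal sketch might look like the following. The column names and positions here are illustrative rather than the actual Kaggle schema, and the API is shown roughly as it stood around this version of ML.NET:

```csharp
using Microsoft.ML;
using Microsoft.ML.Data;

// Illustrative input class — the real dataset has more columns.
// Everything is a float to match what the trainers expect.
public class CountryData
{
    [LoadColumn(0)] public float Population;
    [LoadColumn(1)] public float NetMigration;
    [LoadColumn(2)] public float Gdp;
    [LoadColumn(3)] public float Phones;
    [LoadColumn(4)] public float InfantMortality;
    [LoadColumn(5)] public float HappinessScore;
}

class Program
{
    static void Main()
    {
        // The context is the entry point for all ML.NET operations;
        // a fixed seed keeps runs repeatable.
        var mlContext = new MLContext(seed: 0);

        // Load the combined CSV prepared earlier.
        IDataView data = mlContext.Data.LoadFromTextFile<CountryData>(
            "country-data.csv", hasHeader: true, separatorChar: ',');
    }
}
```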
Before training the model, a step often taken is to divide the data into a training and testing set. The former is used to train the model and the latter to evaluate it. This separation is important in that it ensures the evaluation is using data the model hasn’t already seen as part of the training process, and hence undergoes a fairer review of its predictive potential.
We can create that split, at a ratio we provide, using the MulticlassClassification.TrainTestSplit method available from the MLContext.
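As a sketch — note the exact location of the split method has moved between releases:

```csharp
// In the 0.x previews the split sat on the task catalogs and
// returned a (train, test) tuple; a 20% test fraction is typical.
var (trainingData, testData) =
    mlContext.MulticlassClassification.TrainTestSplit(data, testFraction: 0.2);

// From ML.NET 1.0 the equivalent lives on the data catalog instead:
// var split = mlContext.Data.TrainTestSplit(data, testFraction: 0.2);
// then use split.TrainSet and split.TestSet.
```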
Training a model
The next step is to train the model which, when using ML.NET, involves constructing a training pipeline with the various steps necessary for the process. This might include further data cleansing if you haven’t been able to do it in the previous stages of preparing the input data file.
Missing values are a typical situation faced. ML.NET provides some straightforward ways to deal with them, allowing you to replace missing values with a default, or with the minimum, maximum or mean of the rest of the data in the column. There’s no option, at least yet, for applying a prediction to the missing values. This might be feasible and useful if the column with the missing data correlates with other columns, from which it might be clear that using the mean value is likely an over or underestimate. This could potentially be done using a separate model training operation if the effort is worthwhile.
Another preparation step often required is converting categorical data — e.g. values drawn from a fixed set of categories — into numerical forms that the model can then work with.
Given the data source I was using was already fully numerical and had no missing values (likely having been cleaned before being uploaded to Kaggle), I didn’t have to worry about these, but still implemented some steps for illustration that you’ll see in the code sample below.
With prepared data we can then pipe on additional tasks to rename and concatenate columns into the standard names expected for model training. The value we want to predict is renamed to ‘Label’, and the set of columns used to make the prediction is concatenated into ‘Features’.
Finally, we apply a regression algorithm to the pipeline. There are a number to pick from and I experimented with a few based on decision trees and forests to see which gave the best results.
With the pipeline in place, we can call the Fit method, passing in the data. The model, now held in memory, can be saved to a file to then be exposed by other applications, such as a web API.
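The full training flow might look something like this sketch, reusing the hypothetical CountryData columns from earlier. The missing-value replacement step is included purely for illustration, as discussed above, and the exact method and enum names shifted somewhat between preview releases:

```csharp
using System.IO;
using Microsoft.ML;
using Microsoft.ML.Transforms;

// Illustrative: replace any missing GDP values with the column mean.
var pipeline = mlContext.Transforms.ReplaceMissingValues(
        "Gdp", replacementMode: MissingValueReplacingEstimator.ReplacementMode.Mean)
    // Rename the value we want to predict to the expected "Label" column...
    .Append(mlContext.Transforms.CopyColumns("Label", "HappinessScore"))
    // ...and concatenate the predictive columns into "Features".
    .Append(mlContext.Transforms.Concatenate(
        "Features", "Population", "NetMigration", "Gdp", "Phones", "InfantMortality"))
    // Apply the chosen regression algorithm — FastForest gave the best
    // results in my experiments.
    .Append(mlContext.Regression.Trainers.FastForest());

// Train the model on the training portion of the data.
ITransformer model = pipeline.Fit(trainingData);

// Persist it for use elsewhere, e.g. behind a web API.
// (From 1.0 the schema is passed too:
//  mlContext.Model.Save(model, trainingData.Schema, "model.zip").)
using (var stream = File.Create("model.zip"))
    mlContext.Model.Save(model, stream);
```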
Evaluating a model
To evaluate the model, we use the dataset retained following the initial load as a testing set. We apply that to the model via the Transform method to get back another dataset (IDataView), this time containing predictions. By evaluating that we get some statistics indicating how successful our model was in predicting the score compared with the actual scores we know from the testing data.
The RSquared property gives us the value of an evaluation metric also known as the “coefficient of determination”. The closer the value is to 1, the better the model.
RMS is another metric, calculated as the square root of the average of the squares of the errors. Here we’re aiming for lower values.
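A sketch of the evaluation step follows; note the metric property names have also shifted between releases (RootMeanSquaredError was simply Rms in earlier previews):

```csharp
// Run the trained model over the held-back testing set.
IDataView predictions = model.Transform(testData);

// Compare the predicted scores against the known ones.
var metrics = mlContext.Regression.Evaluate(predictions);

Console.WriteLine($"R-squared: {metrics.RSquared:0.###}");   // closer to 1 is better
Console.WriteLine($"RMS error: {metrics.RootMeanSquaredError:0.###}"); // lower is better
```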
Note: ConsoleTable comes from this handy library for outputting to the console in a tabular view. It’s just to help make the results more readable.
Using the model for predictions
Once we have the model trained, evaluated to our satisfaction and saved to a file, we can then use it to make individual predictions (likely the use case for predictive models in practice). To do that, we construct an instance of our data object, populating the fields we have available to drive the prediction — in our case the country’s demographic data.
We can pass that to a function retrieved from the model and then read off the predicted value populated on the appropriate field of our class.
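A minimal sketch of that prediction step, assuming the CountryData class from earlier and a hypothetical output class:

```csharp
using Microsoft.ML.Data;

// ML.NET writes the regression prediction to a column named "Score",
// which we map onto a friendlier property name.
public class CountryPrediction
{
    [ColumnName("Score")]
    public float HappinessScore;
}

// Create a prediction function from the trained model.
// (In the 0.x previews this was model.CreatePredictionEngine<TIn, TOut>(mlContext);
// from 1.0 it hangs off the model catalog as shown.)
var engine = mlContext.Model.CreatePredictionEngine<CountryData, CountryPrediction>(model);

// Populate the fields we have available — illustrative values only.
var input = new CountryData { Population = 66_000_000, Gdp = 39_000 /* , ... */ };
CountryPrediction prediction = engine.Predict(input);
Console.WriteLine($"Predicted happiness score: {prediction.HappinessScore}");
```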
Further investigations into the results
As mentioned earlier, there are several regression algorithms available for selection when tackling this type of statistical problem with ML.NET. Someone better versed in the mathematics than I am could likely justify in advance which would perform better, but my method was simply to try them out and see.
My initial choice — based on that selected in the online documentation tutorial for regression problems — was the FastTree algorithm. That did OK, but not brilliantly, giving an RSquared score of 0.53. Having experimented with the others though, the best result I found was from the FastForest selection, scoring 0.72. This seems in line with the general expectation that random forests outperform single decision trees.
Although I had a model that was predicting results, it was acting rather as a black box, and it would be interesting to know a bit more about how it was reaching its predictive decisions. In particular, the question of which factors were considered most important was relevant.
ML.NET provides a means for discovering this via a regression method called PermutationFeatureImportance, available from the context.
You’ll see in the code below that this was slightly awkward to use, as I needed access to the LastTransformer property on the model, but this isn’t exposed via the ITransformer interface used to reference the model up to now. I had to cast it back to the concrete type, which then requires a hard reference to the type of algorithm used. No great problem for this sample application when I could just recompile, but if you wanted to be able to select the algorithm at run-time there would need to be a better way.
From that though we can combine the results for each feature with their names to output them in order of importance. For this dataset, GDP, phone availability (likely itself closely correlated with GDP) and infant mortality looked to be the key factors. It was also interesting to note that for the decision tree, one factor (GDP) dominated, whereas for the random forest algorithm, the influence of each factor had a wider spread.
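The awkward cast and the ranking step might be sketched as follows, using type names roughly as they stand in later releases and the hypothetical feature columns from earlier:

```csharp
using System.Linq;
using Microsoft.ML.Data;
using Microsoft.ML.Trainers.FastTree;

// PFI needs the concrete last transformer, so cast away from ITransformer —
// the generic parameter ties us to the chosen algorithm at compile time.
var chain = (TransformerChain<RegressionPredictionTransformer<FastForestRegressionModelParameters>>)model;
var lastTransformer = chain.LastTransformer;

// The data must have passed through the earlier pipeline steps first.
IDataView transformedData = model.Transform(trainingData);

// Permute each feature in turn and measure the impact on the metrics.
var pfi = mlContext.Regression.PermutationFeatureImportance(
    lastTransformer, transformedData, permutationCount: 3);

// Pair each feature's mean R-squared change with its name and rank by impact
// (the larger the drop when permuted, the more important the feature).
string[] featureNames = { "Population", "NetMigration", "Gdp", "Phones", "InfantMortality" };
var ranked = featureNames
    .Select((name, i) => (Name: name, Impact: pfi[i].RSquared.Mean))
    .OrderBy(f => f.Impact);

foreach (var (name, impact) in ranked)
    Console.WriteLine($"{name}: {impact:0.####}");
```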
Direction of correlations
At this point we can see which features are considered most important by the model, but not the direction they act. We might expect some to be obvious, such as greater GDP and lower infant mortality leading to an increase in the reported happiness score. But we can’t tell that from the metrics for sure, and for some — net migration to take a hot political topic — the direction of influence may not be obvious.
To tackle this, I stepped away from ML.NET for the actual calculations, as there’s little need for the library here. We just need to see how two factors correlate with each other, which we can do via a statistical function (thanks StackOverflow).
ML.NET has a handy GetColumn method that allows us to get the data from a given column within the loaded and transformed dataset as a standard enumerable, to which we can apply the necessary calculation.
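As a sketch, assuming the column names used earlier (in the 0.x previews GetColumn also took the context as an argument):

```csharp
using System;
using System.Linq;
using Microsoft.ML.Data;

// Pull two columns out of the transformed dataset as plain arrays.
var scores = transformedData.GetColumn<float>("Label").ToArray();
var migration = transformedData.GetColumn<float>("NetMigration").ToArray();

Console.WriteLine($"Correlation: {Pearson(scores, migration):0.###}");

// Standard Pearson correlation coefficient: covariance divided by the
// product of standard deviations; the sign gives the direction.
static double Pearson(float[] xs, float[] ys)
{
    double meanX = xs.Average(), meanY = ys.Average();
    double cov = 0, varX = 0, varY = 0;
    for (int i = 0; i < xs.Length; i++)
    {
        double dx = xs[i] - meanX, dy = ys[i] - meanY;
        cov += dx * dy;
        varX += dx * dx;
        varY += dy * dy;
    }
    return cov / Math.Sqrt(varX * varY);
}
```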
This confirmed the expected direction for the key predictive factors and, for the case of net migration, showed a positive correlation with happiness score.
I found this an interesting exercise and was pleased to see how straightforward it was to use ML.NET for these types of problems for someone with a .NET developer background but more of a layman’s interest in data science problems. Learning Python or R is of course a possibility for someone in this position, but it’s a lot to take on, especially when you consider all the mathematical subject matter that’s also on the learning path. It’s nice to see this is something Microsoft is rapidly developing, as well as supporting the use of models generated from other platforms like TensorFlow.
I suspect, like any framework, there will be times when you look to go beyond it and come up against issues or things it can’t yet handle — which I perhaps did in places, though these could of course be down to the limits of my current understanding — but I could usually find a way around them, and am looking forward to monitoring its development.
If you’d like to review the code and data from this example more fully, it’s available here on GitHub.