Machine learning uses statistical techniques to give computer systems the ability to ‘learn’ rather than being explicitly programmed. By learning from historical inputs, we’re able to improve the accuracy of our predictions & constantly refine the model with new data.
A machine learning model can be drawn as below. In the top left, we have our data, which gets fed into the model. That model computes an output & passes it to the learner. The learner receives both the newly computed values and ‘the truth’, that is, the actual result of the event. The learner then looks at the difference between the predicted and actual outputs. Based on this difference, it adjusts the parameters that feed the model. Parameters are the factors that the model uses to make decisions.
So, let’s look at an example. In our database, we have the number of hours that each student studied and their corresponding exam grade. This data is pushed into our model, which uses the parameters we’ve set to predict what the student’s grade will be. In the above example, we have initial parameters shown in the top right. So, if a student studies for 1 hour, the model will predict they’ll get 40% in their exam.
The model then passes its prediction of 40% on to the learner. The learner compares this to the actual exam results achieved by students and adjusts the parameters accordingly. So now, when a student who has studied for 1 hour gets passed into the model, it will predict a result of 45%, rather than the original 40%, as a result of the learning process the model has gone through.
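To make that loop a little more concrete, here’s a minimal Python sketch of a model and a learner working together. It rests on a few assumptions that aren’t part of the example above: the data points are made up, the model is a straight line (grade = intercept + slope × hours), and the learner uses gradient descent to adjust the parameters. The function names are purely illustrative too.

```python
# Hypothetical training data: (hours studied, actual exam grade in %).
DATA = [(1, 45), (2, 55), (3, 65), (4, 75), (5, 85)]


def predict(hours, params):
    """The model: turn an input into a prediction using the current parameters."""
    intercept, slope = params
    return intercept + slope * hours


def mean_squared_error(params, data):
    """Average squared gap between the predicted and the actual grades."""
    return sum((predict(h, params) - g) ** 2 for h, g in data) / len(data)


def learn(params, data, learning_rate=0.01):
    """The learner: compare predictions with 'the truth' and nudge the parameters.

    This is one step of gradient descent on the mean squared error
    (an assumed choice of learning rule).
    """
    intercept, slope = params
    n = len(data)
    # Error of each prediction (predicted - actual), summed over the data,
    # once as-is and once weighted by the input value.
    error_sum = sum(predict(h, params) - g for h, g in data)
    weighted_error_sum = sum((predict(h, params) - g) * h for h, g in data)
    intercept -= learning_rate * 2 * error_sum / n
    slope -= learning_rate * 2 * weighted_error_sum / n
    return (intercept, slope)


# Assumed initial parameters, chosen so that 1 hour of study predicts 40%.
initial_params = (30.0, 10.0)

params = initial_params
for _ in range(5000):
    params = learn(params, DATA)

print("prediction for 1 hour before learning:", predict(1, initial_params))
print("prediction for 1 hour after learning: ", round(predict(1, params), 1))
print("error before:", mean_squared_error(initial_params, DATA))
print("error after: ", mean_squared_error(params, DATA))
```

With this made-up data, the prediction for a student who studies for 1 hour starts at 40% and should settle very close to 45% once the learner has run, which mirrors the shift described above. A real learner would also hold back some data to check that the model generalises, which this sketch skips.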
There are tonnes of statistical models that we can use to help us predict an outcome based on certain inputs. Many of these are extremely complex & take a significant amount of time to explain and understand. So in this post, I want to focus on two major modelling techniques.
The first and probably most common type of model is a regression model. In its simplest form, this finds the relationship between two numeric variables. As in the above example, we find that relationship & continuously refine it to improve the accuracy of the model. A few regression models we could use are:
The next possible model is a decision tree, also known as a classification tree when the outcome it predicts is a category. A decision tree is a graphical representation of possible solutions to a decision based on certain conditions. Decision trees can be very powerful: they’re easy for the business to understand and act on, without any statistical background. Below, we have a very simple tree. This looks at whether you should keep or sell your investment in a particular company. This model says that, if there has been a major political event and the currency has been affected, then you should sell your shares. Most of us can probably say with a good deal of confidence that this model would not help us in managing our investment portfolio, but through the addition of further parameters & model training, it may do in the future.
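A tree like this is really just a set of nested yes/no questions, so it translates almost directly into code. The Python sketch below is a hand-written version of the simple tree described above, not a trained model. The function and argument names are made up for illustration, and it assumes that every branch other than ‘major political event and currency affected’ ends in ‘keep’.

```python
def keep_or_sell(major_political_event: bool, currency_affected: bool) -> str:
    """Walk the simple keep/sell tree from the root down to a leaf."""
    # First question: has there been a major political event?
    if major_political_event:
        # Second question: has the currency been affected?
        if currency_affected:
            return "sell"
        return "keep"  # assumed leaf: event happened, currency unaffected
    return "keep"      # assumed leaf: no major political event


# Try every combination of answers.
for event in (True, False):
    for currency in (True, False):
        print(f"event={event}, currency={currency} ->", keep_or_sell(event, currency))
```

Training a decision tree means letting an algorithm choose which questions to ask, and in which order, based on historical data rather than writing them by hand as we have here.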
We can also use basic statistical methods in our modelling, such as correlations, hypothesis testing and summary statistics (mean, median, mode, standard deviation, etc.).
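As a quick illustration of the summary statistics and correlation mentioned above, here’s a small Python sketch using the standard library’s statistics module. The study hours and grades are invented for the example, and the correlation is calculated by hand (Pearson’s r) so that each step is visible. Hypothesis testing isn’t covered here.

```python
import math
import statistics

# Hypothetical data: hours studied and exam grade (%) for six students.
hours = [1, 2, 2, 3, 4, 5]
grades = [42, 55, 51, 68, 71, 88]

# Summary statistics for the hours studied.
print("mean:  ", statistics.mean(hours))
print("median:", statistics.median(hours))
print("mode:  ", statistics.mode(hours))
print("stdev: ", statistics.stdev(hours))  # sample standard deviation


def pearson_correlation(xs, ys):
    """Pearson's r: how closely two variables move together (-1 to 1)."""
    mean_x = statistics.mean(xs)
    mean_y = statistics.mean(ys)
    # Sum of the products of each pair's deviations from the means.
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariance / (spread_x * spread_y)


r = pearson_correlation(hours, grades)
print("correlation between hours and grades:", round(r, 3))
```

A correlation close to 1 suggests that grades rise as study hours rise in this made-up data set, which is the kind of relationship a regression model would then try to quantify.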
In subsequent articles, we’ll look at some simple machine learning algorithms using the Spark MLlib module. This will bring some of the above to life and help to cement our understanding of machine learning. Once we’ve got a baseline understanding, we’ll progress to more complex models using real-world data to make predictions on the value of stock prices, among other interesting topics.
We’ll also look at some high-profile examples of machine learning (e.g. the Netflix recommendation engine) and try to unpick, at a high level, how they may work & the machine learning models that they use to make their predictions.