Machine Learning

A high level look at machine learning

Machine learning uses statistical techniques to give computer systems the ability to ‘learn’ rather than being explicitly programmed. By learning from historical inputs, we’re able to achieve far greater accuracy in our predictions & constantly refine the model with new data.

What parts comprise a machine learning model?

A machine learning model can be drawn as below. In the top left, we have our data, which gets fed into the model. That model computes an output & passes it to the learner. The learner receives both the newly computed values and ‘the truth’ – that is, the actual result of the event. The learner will then look at the difference between the predicted and actual outputs. Based on this, it will adjust the parameters that feed the model – parameters are the factors that the model uses to make decisions.

So, let’s look at an example. In our database, we have the number of hours that a student studies for and their corresponding exam grade. This data is pushed into our model, which uses the parameters we’ve set to predict what the student’s grade will be. In the above example, we have the initial parameters shown in the top right. So, if a student studies for 1 hour, the model will predict they’ll get 40% in their exam.

The model then passes its prediction of 40% on to the learner. The learner compares this to the actual exam results achieved by students and adjusts the parameters accordingly. So now, when a student that’s studied for 1 hour gets passed into the model, it will predict a result of 45%, rather than the original 40%, as a result of the learning process the model has gone through.
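
To make that loop concrete, here’s a minimal sketch of a model/learner pair. The linear form of the model, the learning rate and the hours/grade numbers are all assumptions for illustration – they aren’t taken from the example above.

```python
# A minimal sketch of the model/learner loop: the model predicts a grade from
# hours studied, the learner compares the prediction to 'the truth' and nudges
# the parameters. All numbers here are illustrative assumptions.

data = [(1, 45), (2, 55), (3, 62), (5, 78), (8, 90)]  # (hours studied, actual grade)

weight, bias = 5.0, 35.0   # initial parameters: 1 hour of study -> 40%
learning_rate = 0.01

for _ in range(1000):
    for hours, actual in data:
        predicted = weight * hours + bias           # the model makes its prediction
        error = predicted - actual                  # the learner checks it against the truth
        weight -= learning_rate * error * hours     # ...and adjusts the parameters
        bias -= learning_rate * error

print(f"After learning, 1 hour of study is predicted to score {weight + bias:.0f}%")
```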

What kind of models can we use?

There are tonnes of statistical models that we can use to help us predict the outcome based on certain input parameters. Many of these are extremely complex & take a significant amount of time to explain and understand. So in this post, I want to focus on two major modelling techniques.

The first and probably most common type of model is a regression model. This finds the relationship between two numeric variables. As in the above example, we find that relationship & continuously refine it to improve the accuracy of the model. A few regression models we could use are listed below, with a short sketch after the list:

  • Linear – with linear regression, we find the relationship between two variables by using a line of best fit (a straight line)
  • Logistic – used for binary problems (e.g. does weighing more increase diabetes risk – YES / NO)
  • Polynomial – here, we find the relationship between two variables, but the line of best fit does not have to be straight
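
As a rough illustration of the linear vs. polynomial idea – the hours/grade numbers are made up, and NumPy’s polyfit is just one of many ways to fit these models:

```python
import numpy as np

# Illustrative data only: hours studied vs. exam grade
hours = np.array([1, 2, 3, 5, 8])
grade = np.array([45, 55, 62, 78, 90])

# Linear regression: a straight line of best fit
linear_fit = np.polyfit(hours, grade, deg=1)

# Polynomial regression: the line of best fit no longer has to be straight
poly_fit = np.polyfit(hours, grade, deg=2)

print("Linear prediction for 4 hours:", np.polyval(linear_fit, 4))
print("Polynomial prediction for 4 hours:", np.polyval(poly_fit, 4))
```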

The next possible model is a decision tree, also known as a classification tree. A decision tree is a graphical representation of possible solutions to a decision based on certain conditions. Decision trees can be very powerful; they’re easy for the business to understand and take action on, without any statistical background. Below, we have a very simple tree. This looks at whether you should keep or sell your investment in a particular company. This model says that, if there has been a major political event and the currency has been affected, then you should sell your shares. Most of us can probably say with a good deal of confidence that this model would not help us in managing our investment portfolio, but through the addition of further parameters & model training, it may do so in the future.
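
As a hedged sketch of that keep/sell tree – the features, toy data and choice of scikit-learn are all assumptions for illustration, not part of the example above:

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Illustrative features: [major_political_event, currency_affected] (1 = yes, 0 = no)
X = [[1, 1], [1, 0], [0, 1], [0, 0]]
# Illustrative decisions: 1 = sell, 0 = keep
y = [1, 0, 0, 0]

tree = DecisionTreeClassifier(max_depth=2).fit(X, y)

# Printing the learned rules shows why trees are so easy to understand and act on
print(export_text(tree, feature_names=["political_event", "currency_affected"]))
```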

We can also use basic statistical methods – such as correlations, hypothesis testing and summary statistics (mean, median, mode, standard deviation, etc.) – in our modelling.
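
For example, a couple of those basic statistics can be computed in a few lines (reusing the illustrative hours/grade numbers from earlier):

```python
import numpy as np

hours = np.array([1, 2, 3, 5, 8])
grade = np.array([45, 55, 62, 78, 90])

# Correlation between hours studied and grade achieved
print("Correlation:", np.corrcoef(hours, grade)[0, 1])

# Summary statistics for the grades
print("Mean:", grade.mean(), "Median:", np.median(grade), "Std dev:", grade.std())
```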

Types of machine learning

Supervised Learning

Supervised learning is where we provide the model with the actual outputs from the data. This lets it build a picture of the data and form links between the historic parameters (or features) that have influenced the output. To put a formula onto supervised learning, it would be as below, where Y is the predicted output produced by the model and X is the input data. So, by executing a function against X, we can predict Y.

Y = f(X)

The goal of supervised learning is to be able to model the influence of the input parameters (X) so well that we can accurately predict the output (Y).

Supervised learning is used for regression, classification, decision trees and neural networks; a few examples are included below, with a short sketch after the list:

  • Regression aims to predict the numeric value of something, given a set of input parameters. For example, we could approximate the price of a car, given its mileage, age, brand, MOT status, etc.
  • Classification is about classifying things – for example, given the input parameters of an email (sender, subject line, email body), it should be classified as ‘spam’ or ‘not spam’.
  • Decision trees can be used for loan approval, based on income ranges, credit rating, criminal records and more
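
Here is a hedged sketch of the spam/not-spam classification example – the emails, labels and choice of a simple Naive Bayes classifier are illustrative assumptions, not from the post itself:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Illustrative training data: email text and whether it was spam
emails = [
    "win a free prize now",
    "meeting agenda for monday",
    "cheap loans click here",
    "lunch with the project team",
]
labels = ["spam", "not spam", "spam", "not spam"]

# Turn the text into numeric features, then fit a simple classifier
vectoriser = CountVectorizer()
X = vectoriser.fit_transform(emails)
model = MultinomialNB().fit(X, labels)

print(model.predict(vectoriser.transform(["free prize inside"])))
```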

As supervised learning forms the bulk of machine learning projects in the real world, this will be our main focus.

Unsupervised learning

Unsupervised learning is where we do not provide the model with the actual outputs from the data. Unsupervised learning aims to model the underlying structure or distribution in the data to learn more about the data. One of the most popular use-cases of unsupervised learning is association rules – where we uncover rules that describe a large chunk of our data. For example, Amazon uses this kind of learning to state that people who bought this also bought that.
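
A rough sketch of the ‘people who bought this also bought that’ idea – the baskets below are made up, and this simple co-occurrence count stands in for a full association-rule algorithm such as Apriori:

```python
from collections import Counter
from itertools import combinations

# Illustrative shopping baskets
baskets = [
    {"bread", "butter", "jam"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"tea", "milk"},
]

# Count how often each pair of items is bought together
pair_counts = Counter()
for basket in baskets:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

# The most common pairs suggest "people who bought X also bought Y" rules
print(pair_counts.most_common(3))
```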

Semi-supervised learning

Semi-supervised learning sits in the middle of supervised and unsupervised. It’s where only some of our input parameters have associated outputs. So, for some sets of input parameters we receive the actual result/output, and for others we don’t.

A good use-case for semi-supervised learning is web page classification. Let’s say we wish to classify web pages as ‘news’, ‘learning’, ‘entertainment’, etc. It’s very cheap and easy to crawl the web and extract a list of web pages, but it’s very expensive for humans to sit and classify them manually. So, we may choose to classify a sub-set of the data manually & use that to help train the model.
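
One common approach here is self-training: fit a model on the manually labelled subset, let it pseudo-label the pages it’s confident about, and retrain. The features, threshold and classifier below are all illustrative assumptions:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Illustrative features for web pages (e.g. counts of 'news' words vs. 'entertainment' words)
X_labelled = np.array([[5, 0], [4, 1], [0, 6], [1, 5]])
y_labelled = np.array(["news", "news", "entertainment", "entertainment"])
X_unlabelled = np.array([[6, 1], [0, 7], [3, 3]])

# 1. Train on the small, manually classified subset
model = LogisticRegression().fit(X_labelled, y_labelled)

# 2. Pseudo-label the unlabelled pages the model is confident about
probabilities = model.predict_proba(X_unlabelled)
confident = probabilities.max(axis=1) > 0.7
pseudo_labels = model.predict(X_unlabelled)[confident]

# 3. Retrain on the labelled data plus the confidently pseudo-labelled pages
X_combined = np.vstack([X_labelled, X_unlabelled[confident]])
y_combined = np.concatenate([y_labelled, pseudo_labels])
model = LogisticRegression().fit(X_combined, y_combined)

print(model.predict([[5, 1]]))
```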

Summarising model types:
  • Supervised learning has the output for every input & we can use these to accurately train the model
  • Unsupervised learning has no outputs for any inputs & is left to find links / trends / patterns in the data
  • Semi-supervised has some outputs for some inputs and can have a mixture of supervised and unsupervised techniques applied to these problems.