Load the data, split it into training and test sets, and fit a model

In [1]:
import sklearn
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

data = load_iris()

# split our x and y to be training and testing data (30% testing)
x_train, x_test, y_train, y_test = train_test_split(data.data, data.target, test_size=0.3)

# fit the model to the training data
model = LogisticRegression()
model.fit(x_train, y_train)

# predict on the test data and show the first 5 predictions
model.predict(x_test)[:5]
Out[1]:
array([2, 2, 0, 1, 2])
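
The integers in the output above are class indices. As a minimal sketch, they can be mapped back to species names through the `target_names` array that `load_iris()` returns. The `random_state=0` and `max_iter=1000` arguments are assumptions added only to make the sketch repeatable and quiet; they are not part of the original cell, so the predictions will differ from the output shown above.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

data = load_iris()

# random_state=0 is an assumption, used only to make this sketch repeatable
x_train, x_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.3, random_state=0
)

# max_iter=1000 is an assumption to avoid convergence warnings
model = LogisticRegression(max_iter=1000).fit(x_train, y_train)

# class indices for the first 5 test samples, then the matching species names
predicted = model.predict(x_test)[:5]
predicted_names = data.target_names[predicted]
print(list(zip(predicted, predicted_names)))
```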

Check model score on training data

In [2]:
model.score(x_train, y_train)
Out[2]:
0.9714285714285714

Check model score on test data

The model score changes depending on which samples were selected for the 30% test set

In [3]:
model.score(x_test, y_test)
Out[3]:
0.9555555555555556
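
To see how much the test score depends on the split, one option is to repeat the split with several seeds and compare the scores. This is a sketch only: the seeds 0 to 4 and `max_iter=1000` are illustrative assumptions, not values from the original notebook.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

data = load_iris()

split_scores = []
for seed in range(5):  # seeds 0-4 are arbitrary illustrative values
    x_train, x_test, y_train, y_test = train_test_split(
        data.data, data.target, test_size=0.3, random_state=seed
    )
    # max_iter=1000 is an assumption to avoid convergence warnings
    model = LogisticRegression(max_iter=1000)
    model.fit(x_train, y_train)
    # accuracy on the held-out 30% for this particular split
    split_scores.append(model.score(x_test, y_test))

print(split_scores)
```

Fixing `random_state` to a single value makes one split repeatable, but it does not remove the dependence on that split.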

Cross Validation Scores - how accurate is the model?

Cross validation averages the score over several different train/test splits, which reduces the impact of any single split on the model score

In [4]:
# k-fold cross validation
from sklearn.model_selection import cross_val_score

# cross_val_score fits a fresh clone of the model on each fold, so pass in an unfitted estimator:
model = LogisticRegression()

# run 10-fold cross validation (cv=10): fit and score the model 10 times, each time holding out a different tenth of the data
cross_val_score(model, data.data, data.target, cv=10)
Out[4]:
array([1.        , 1.        , 1.        , 0.93333333, 0.93333333,
       0.93333333, 0.8       , 0.93333333, 1.        , 1.        ])
In [5]:
# store the scores in a variable
scores = cross_val_score(model, data.data, data.target, cv=10)

# Calculate the mean and standard deviation across the 10 fold scores
scores.mean(), scores.std()
Out[5]:
(0.9533333333333334, 0.059999999999999984)
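
What `cross_val_score` does for a classifier with `cv=10` can be sketched by hand: split the data into 10 stratified folds, fit a fresh copy of the model on nine of them, and score it on the remaining one. The loop below is an illustration rather than the library's actual implementation; `max_iter=1000` is an assumption to avoid convergence warnings, so the scores may differ slightly from the output above.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

data = load_iris()
# max_iter=1000 is an assumption to avoid convergence warnings
model = LogisticRegression(max_iter=1000)

# For classifiers, an integer cv is interpreted as stratified k-fold without shuffling
folds = StratifiedKFold(n_splits=10)

manual_scores = []
for train_idx, test_idx in folds.split(data.data, data.target):
    fold_model = clone(model)  # unfitted copy, so folds do not influence each other
    fold_model.fit(data.data[train_idx], data.target[train_idx])
    manual_scores.append(fold_model.score(data.data[test_idx], data.target[test_idx]))

manual_scores = np.array(manual_scores)
library_scores = cross_val_score(model, data.data, data.target, cv=10)

print(manual_scores.mean(), manual_scores.std())
```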

Now check another model type - which is more accurate?

We will do the same for k nearest neighbours

In [6]:
# import the KNeighbors model
from sklearn.neighbors import KNeighborsClassifier
model = KNeighborsClassifier()

# run 10-fold cross validation (cv=10): fit and score the model 10 times, each time holding out a different tenth of the data
cross_val_score(model, data.data, data.target, cv=10)
Out[6]:
array([1.        , 0.93333333, 1.        , 1.        , 0.86666667,
       0.93333333, 0.93333333, 1.        , 1.        , 1.        ])
In [7]:
# store the scores in a variable
scores = cross_val_score(model, data.data, data.target, cv=10)

# Calculate the mean and standard deviation across the 10 fold scores
scores.mean(), scores.std()
Out[7]:
(0.9666666666666668, 0.04472135954999579)

The k-nearest-neighbours model has a slightly higher mean accuracy (0.967 vs 0.953) and a lower standard deviation (0.045 vs 0.060) than the logistic regression model.

We've got the best model, now which hyperparameters should we use?

scikit-learn can search for the best hyperparameter value for us. In the example below, we find the best number of neighbours for the k-nearest-neighbours model

In [20]:
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline

model = KNeighborsClassifier()

# Find the best number of neighbours for the k-nearest-neighbours model.
# make_pipeline(model) = wraps the model in a Pipeline; the step is named after the lowercased class name
# range(1, 10, 2) = try k = 1, 3, 5, 7 and 9
# cv = evaluate every k with 10-fold cross validation
# parameter = pipeline step name + double underscore + the parameter you're trying to tune
best = GridSearchCV(make_pipeline(model), {'kneighborsclassifier__n_neighbors': range(1, 10, 2)},
                    cv=10, return_train_score=True)

best.fit(data.data, data.target)
best.score(data.data, data.target)
best.best_estimator_.named_steps['kneighborsclassifier'].n_neighbors
Out[20]:
9
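
Besides the fitted estimator, a `GridSearchCV` object also exposes the winning parameters, the best mean cross-validation score, and the per-candidate results. The sketch below repeats the search so that it runs on its own and then reads those attributes; it omits the `iid` argument, which recent scikit-learn releases no longer accept, and the variable name `search` is an illustrative choice.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

data = load_iris()

search = GridSearchCV(
    make_pipeline(KNeighborsClassifier()),
    {'kneighborsclassifier__n_neighbors': range(1, 10, 2)},
    cv=10,
)
search.fit(data.data, data.target)

# the winning parameter setting and its mean cross-validated accuracy
print(search.best_params_)
print(search.best_score_)

# mean and standard deviation of the fold scores for every candidate k
for k, mean, std in zip(
    search.cv_results_['param_kneighborsclassifier__n_neighbors'],
    search.cv_results_['mean_test_score'],
    search.cv_results_['std_test_score'],
):
    print(k, round(mean, 3), round(std, 3))
```

Looking at all candidates is useful because neighbouring values of k often score almost identically, in which case the choice of "best" is not very meaningful.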

Create confusion matrix to validate accuracy

In [27]:
from sklearn.metrics import confusion_matrix

y =           [-1, 1, -1, -1, -1, 1]
y_predicted = [-1, -1, -1, 1, -1, 1]

# The confusion matrix describes:
# True positive (bottom right)
# True Negative (top left) 
# False Positive (top right)
# False Negative (bottom left)
confusion_matrix(y, y_predicted)
Out[27]:
array([[3, 1],
       [1, 1]])
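
For a binary problem, the four cells of the matrix can be unpacked with `ravel()` and turned into the usual summary metrics. A minimal sketch using the same `y` and `y_predicted` as above, treating 1 as the positive class:

```python
from sklearn.metrics import confusion_matrix

y =           [-1, 1, -1, -1, -1, 1]
y_predicted = [-1, -1, -1, 1, -1, 1]

# row-major order: true negative, false positive, false negative, true positive
tn, fp, fn, tp = confusion_matrix(y, y_predicted).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)  # 4 of 6 predictions are correct
precision = tp / (tp + fp)                  # 1 of 2 predicted positives is correct
recall = tp / (tp + fn)                     # 1 of 2 actual positives is found

print(accuracy, precision, recall)
```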

Create a classification report, for when we have more than two classes

In [32]:
from sklearn.metrics import classification_report
y_true =      [0, 1, 2, 2, 2]
y_predicted = [0, 0, 2, 2, 1]

# target_names must be listed in the same order as the sorted labels (0, 1, 2), so each label gets the right name
target_names = ['class0', 'class1', 'class2']

print(classification_report(y_true, y_predicted, target_names=target_names))

# precision = of the samples predicted as a class, the fraction that really belong to that class
# recall = of the samples that really belong to a class, the fraction we predicted correctly. So class0 is 1.00 because wherever y is actually 0, we always correctly predict it to be 0
# f1-score = the harmonic mean of precision and recall
# support = the number of occurrences of each class in y_true
             precision    recall  f1-score   support

     class0       0.50      1.00      0.67         1
     class1       0.00      0.00      0.00         1
     class2       1.00      0.67      0.80         3

avg / total       0.70      0.60      0.61         5
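
The per-class numbers in the report can be reproduced by hand from the definitions. The sketch below uses plain Python; `per_class_metrics` is a hypothetical helper written for illustration and is not part of scikit-learn.

```python
y_true =      [0, 1, 2, 2, 2]
y_predicted = [0, 0, 2, 2, 1]

def per_class_metrics(y_true, y_predicted, label):
    """Hypothetical helper: precision, recall, f1 and support for one class."""
    pairs = list(zip(y_true, y_predicted))
    true_positives = sum(1 for t, p in pairs if t == label and p == label)
    predicted_as_label = sum(1 for _, p in pairs if p == label)
    support = sum(1 for t, _ in pairs if t == label)

    # guard against division by zero when a class is never predicted or never occurs
    precision = true_positives / predicted_as_label if predicted_as_label else 0.0
    recall = true_positives / support if support else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1, support

metrics = {
    label: per_class_metrics(y_true, y_predicted, label)
    for label in sorted(set(y_true))
}

for label, (precision, recall, f1, support) in metrics.items():
    print(label, round(precision, 2), round(recall, 2), round(f1, 2), support)
```

For example, class 0 is predicted twice but is correct only once, which gives a precision of 0.50, while the single true class 0 sample is found, which gives a recall of 1.00.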
