In [1]:
from sklearn.datasets import load_breast_cancer
import pandas as pd

data = load_breast_cancer()

#create a dataframe which has feature_names as column names
df = pd.DataFrame(data['data'], columns = data['feature_names'])

#Add target as a column to the dataframe
df['target'] = pd.DataFrame(data['target']).replace(set(data['target']), data['target_names'])
In [2]:
df.sample(5)
Out[2]:
mean radius mean texture mean perimeter mean area mean smoothness mean compactness mean concavity mean concave points mean symmetry mean fractal dimension ... worst texture worst perimeter worst area worst smoothness worst compactness worst concavity worst concave points worst symmetry worst fractal dimension target
245 10.480 19.86 66.72 337.7 0.10700 0.05971 0.04831 0.03070 0.1737 0.06440 ... 29.46 73.68 402.8 0.1515 0.10260 0.11810 0.06736 0.2883 0.07748 benign
153 11.150 13.08 70.87 381.9 0.09754 0.05113 0.01982 0.01786 0.1830 0.06105 ... 16.30 76.25 440.8 0.1341 0.08971 0.07116 0.05506 0.2859 0.06772 benign
66 9.465 21.01 60.11 269.4 0.10440 0.07773 0.02172 0.01504 0.1717 0.06899 ... 31.56 67.03 330.7 0.1548 0.16640 0.09412 0.06517 0.2878 0.09211 benign
264 17.190 22.07 111.60 928.3 0.09726 0.08995 0.09061 0.06527 0.1867 0.05580 ... 29.33 140.50 1436.0 0.1558 0.25670 0.38890 0.19840 0.3216 0.07570 malignant
15 14.540 27.54 96.73 658.8 0.11390 0.15950 0.16390 0.07364 0.2303 0.07077 ... 37.13 124.10 943.2 0.1678 0.65770 0.70260 0.17120 0.4218 0.13410 malignant

5 rows × 31 columns

Explore the data

In [3]:
#validate how many NA values you have per column
df.isna().sum()
Out[3]:
mean radius                0
mean texture               0
mean perimeter             0
mean area                  0
mean smoothness            0
mean compactness           0
mean concavity             0
mean concave points        0
mean symmetry              0
mean fractal dimension     0
radius error               0
texture error              0
perimeter error            0
area error                 0
smoothness error           0
compactness error          0
concavity error            0
concave points error       0
symmetry error             0
fractal dimension error    0
worst radius               0
worst texture              0
worst perimeter            0
worst area                 0
worst smoothness           0
worst compactness          0
worst concavity            0
worst concave points       0
worst symmetry             0
worst fractal dimension    0
target                     0
dtype: int64
In [4]:
#count the amount of each Y value
counts = df['target'].value_counts()
counts
Out[4]:
benign       357
malignant    212
Name: target, dtype: int64
In [5]:
#plot the output
import seaborn as sns
sns.barplot(counts.index, counts)
Out[5]:
<matplotlib.axes._subplots.AxesSubplot at 0x1a19dd4198>

Standardize & encode data, ready for the model

When we’re getting our data ready for our machine learning models, it’s important to consider scaling and encoding.

Scaling is a method used to standardise the range of data. This is important as if one field stores age (between 18 and 90) and another stores salary (between 10,000 and 200,000), the machine learning algorithm might bias its results towards the larger numbers, as it may assume they’re more important. SciKitLearn state that “If a feature has a variance that is orders of magnitude larger that others, it might dominate the objective function and make the estimator unable to learn from other features correctly as expected.”

Using this SciKitLearn library, we can convert each feature to have a mean of zero and a standard deviation of 1; removing the potential bias in the model.

For some models, this is an absolute requirement, as certain algorithms expect that your data is normally distributed and centre around zero.

Encoding is simple – machine learning algorithms can only accept numerical features. If you have input variables of Male & Female, we can encode them to be 0 or 1 so that they can be used in the machine learning model.

In [7]:
#machine learning algorithms cannot understand strings. So we need to convert string values to be numeric. We do that using the label encoder function
#we can then standardise the scale, so columns with larger numbers do not cause bias
# make_pipeline = makes a dictionary of the model values (key = k and values = values)
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
    
y = df['target']    
X = df.drop('target', axis=1)

y = LabelEncoder().fit_transform(y)
X = StandardScaler().fit_transform(X)

# PCA (principal component analysis) aims to reduce the number of dimensions in the dataset, without losing those which are very relevant to the model
# it provides a score, you can drop those with poor scores.
X_pc = PCA(n_components=2).fit_transform(X)

# Create a dataframe with the now scaled and encoded x's
pd.DataFrame({'PC1': X[:, 0], 'PC2': X[:, 1], 'Y': y}).sample(10)
Out[7]:
PC1 PC2 Y
220 -0.135558 -1.426410 0
208 -0.288925 0.756379 0
228 -0.428092 1.089149 0
286 -0.621222 0.342161 0
13 0.489274 1.084495 1
346 -0.587140 -0.090674 0
295 -0.101476 -1.400813 0
249 -0.740508 -1.014519 0
77 1.114105 -0.730617 1
254 1.511725 0.009390 1
In [ ]: