Handling Missing Values

In [1]:
'''
Dataset from: https://github.com/dataoptimal/posts/blob/master/data%20cleaning%20with%20python%20and%20pandas/property%20data.csv

'''
import pandas as pd
df = pd.read_csv("property.csv")
In [2]:
df
Out[2]:
PID ST_NUM ST_NAME OWN_OCCUPIED NUM_BEDROOMS NUM_BATH SQ_FT
0 100001000.0 104.0 PUTNAM Y 3 1 1000
1 100002000.0 197.0 LEXINGTON N 3 1.5 --
2 100003000.0 NaN LEXINGTON N NaN 1 850
3 100004000.0 201.0 BERKELEY 12 1 NaN 700
4 NaN 203.0 BERKELEY Y 3 2 1600
5 100006000.0 207.0 BERKELEY Y NaN 1 800
6 100007000.0 NaN WASHINGTON NaN 2 HURLEY 950
7 100008000.0 213.0 TREMONT Y 1 1 NaN
8 100009000.0 215.0 TREMONT Y na 2 1800
In [6]:
'''
CHECKING IF WE HAVE MISSING VALUES:
-----------------------------------
We have out of the box functionality with Pandas

Check if a column has any nulls: df['column name'].isnull()
Check if anywhere in the dataframe has nulls: df.isnull().values.any()

Sometimes, however, these functions do not catch all missing values, because they may appear as NA, N/A, n.a., null, etc., which pandas does not treat as missing by default. In that case we can tell pandas which strings to consider missing:
'''
missing_values = ["n/a", "na", "--"]
df = pd.read_csv("property.csv", na_values=missing_values)
df

'''Now, they all show as NaN'''
Out[6]:
PID ST_NUM ST_NAME OWN_OCCUPIED NUM_BEDROOMS NUM_BATH SQ_FT
0 100001000.0 104.0 PUTNAM Y 3.0 1 1000.0
1 100002000.0 197.0 LEXINGTON N 3.0 1.5 NaN
2 100003000.0 NaN LEXINGTON N NaN 1 850.0
3 100004000.0 201.0 BERKELEY 12 1.0 NaN 700.0
4 NaN 203.0 BERKELEY Y 3.0 2 1600.0
5 100006000.0 207.0 BERKELEY Y NaN 1 800.0
6 100007000.0 NaN WASHINGTON NaN 2.0 HURLEY 950.0
7 100008000.0 213.0 TREMONT Y 1.0 1 NaN
8 100009000.0 215.0 TREMONT Y NaN 2 1800.0
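The null-detection calls mentioned above can be tried out on a small made-up frame (a sketch with toy data, not the property dataset), including a per-column count via `isnull().sum()`:

```python
import pandas as pd
import numpy as np

# Toy frame with one missing value (illustrative data only)
toy = pd.DataFrame({"beds": [3, np.nan, 1], "baths": [1, 2, 1]})

print(toy["beds"].isnull())       # boolean Series, True where missing
print(toy.isnull().values.any())  # True: at least one null somewhere
print(toy.isnull().sum())         # per-column count of nulls
```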
In [15]:
'''What if we want to replace missing values? Well, we could:

1) Replace them with a static value
2) Replace them with the median of the column
3) Replace them with the mean of the column

(Pick one option -- after the first fillna runs, the later ones find nothing left to fill.)
'''

# Option 1: static value (55 here, matching the output below)
df['NUM_BEDROOMS'].fillna(55, inplace=True)

# Option 2: median of the column
median = df['NUM_BEDROOMS'].median()
df['NUM_BEDROOMS'].fillna(median, inplace=True)

# Option 3: mean of the column
mean = df['NUM_BEDROOMS'].mean()
df['NUM_BEDROOMS'].fillna(mean, inplace=True)
In [16]:
df
Out[16]:
PID ST_NUM ST_NAME OWN_OCCUPIED NUM_BEDROOMS NUM_BATH SQ_FT
0 100001000.0 104.0 PUTNAM Y 3.0 1 1000.0
1 100002000.0 197.0 LEXINGTON N 3.0 1.5 NaN
2 100003000.0 NaN LEXINGTON N 55.0 1 850.0
3 100004000.0 201.0 BERKELEY 12 1.0 NaN 700.0
4 NaN 203.0 BERKELEY Y 3.0 2 1600.0
5 100006000.0 207.0 BERKELEY Y 55.0 1 800.0
6 100007000.0 NaN WASHINGTON NaN 2.0 HURLEY 950.0
7 100008000.0 213.0 TREMONT Y 1.0 1 NaN
8 100009000.0 215.0 TREMONT Y 55.0 2 1800.0
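On newer pandas versions, calling `fillna` with `inplace=True` on a single column can trigger a chained-assignment warning; assigning the result back is the safer idiom. A minimal sketch on toy data (not the property dataset):

```python
import pandas as pd
import numpy as np

toy = pd.DataFrame({"beds": [3.0, np.nan, 1.0, np.nan]})

# Assign the filled column back instead of mutating a slice in place
toy["beds"] = toy["beds"].fillna(toy["beds"].median())
print(toy["beds"].tolist())  # nulls replaced with the median of [3.0, 1.0] = 2.0
```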
In [20]:
'''
We could choose to fill null values with the value that immediately precedes them in the dataframe
'''
df.fillna(method='pad', limit=1)
Out[20]:
PID ST_NUM ST_NAME OWN_OCCUPIED NUM_BEDROOMS NUM_BATH SQ_FT
0 100001000.0 104.0 PUTNAM Y 3.0 1 1000.0
1 100002000.0 197.0 LEXINGTON N 3.0 1.5 1000.0
2 100003000.0 197.0 LEXINGTON N 55.0 1 850.0
3 100004000.0 201.0 BERKELEY 12 1.0 1 700.0
4 100004000.0 203.0 BERKELEY Y 3.0 2 1600.0
5 100006000.0 207.0 BERKELEY Y 55.0 1 800.0
6 100007000.0 207.0 WASHINGTON Y 2.0 HURLEY 950.0
7 100008000.0 213.0 TREMONT Y 1.0 1 950.0
8 100009000.0 215.0 TREMONT Y 55.0 2 1800.0
In [21]:
'''
We could also do the same, but backfill (i.e. take the value that comes after the null)
'''
df.fillna(method='bfill', limit=1)
Out[21]:
PID ST_NUM ST_NAME OWN_OCCUPIED NUM_BEDROOMS NUM_BATH SQ_FT
0 100001000.0 104.0 PUTNAM Y 3.0 1 1000.0
1 100002000.0 197.0 LEXINGTON N 3.0 1.5 850.0
2 100003000.0 201.0 LEXINGTON N 55.0 1 850.0
3 100004000.0 201.0 BERKELEY 12 1.0 2 700.0
4 100006000.0 203.0 BERKELEY Y 3.0 2 1600.0
5 100006000.0 207.0 BERKELEY Y 55.0 1 800.0
6 100007000.0 213.0 WASHINGTON Y 2.0 HURLEY 950.0
7 100008000.0 213.0 TREMONT Y 1.0 1 1800.0
8 100009000.0 215.0 TREMONT Y 55.0 2 1800.0
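Note that `fillna(method=...)` has been deprecated in recent pandas releases; `ffill()` and `bfill()` do the same job and also accept `limit`. A sketch on toy data (not the property dataset):

```python
import pandas as pd
import numpy as np

toy = pd.DataFrame({"sqft": [1000.0, np.nan, np.nan, 800.0]})

# Forward-fill at most one consecutive null, then backfill at most one
f = toy["sqft"].ffill(limit=1)
b = toy["sqft"].bfill(limit=1)
print(f.tolist())  # second null stays NaN: limit=1 fills only one in a row
print(b.tolist())  # first null stays NaN for the same reason
```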
In [ ]:
'''
We could also choose to:
1) Drop an entire row, if it includes a null
2) Drop an entire column that includes a null
3) Drop columns below a non-null threshold. With thresh set to 70% of the row count, any column that is more than 30% null gets dropped
'''

# Option 1
df.dropna()

# Option 2
df.dropna(axis=1)

# Option 3
df.dropna(thresh=int(df.shape[0] * .7), axis=1)

Drop columns we don't need

In [14]:
df.drop(['PID', 'SQ_FT'], axis=1)
Out[14]:
ST_NUM ST_NAME OWN_OCCUPIED NUM_BEDROOMS NUM_BATH
0 104.0 PUTNAM Y 3.0 1
1 197.0 LEXINGTON N 3.0 1.5
2 NaN LEXINGTON N 55.0 1
3 201.0 BERKELEY 12 1.0 NaN
4 203.0 BERKELEY Y 3.0 2
5 207.0 BERKELEY Y 55.0 1
6 NaN WASHINGTON NaN 2.0 HURLEY
7 213.0 TREMONT Y 1.0 1
8 215.0 TREMONT Y 55.0 2
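Like fillna, `drop` returns a new frame by default, so the original keeps its columns unless the result is assigned back (or `inplace=True` is passed). A toy sketch, not the property dataset:

```python
import pandas as pd

toy = pd.DataFrame({"PID": [1, 2], "ST_NAME": ["PUTNAM", "TREMONT"], "SQ_FT": [1000, 900]})

# drop() returns a copy; assign it back to keep the change
toy = toy.drop(["PID", "SQ_FT"], axis=1)
print(toy.columns.tolist())  # only ST_NAME remains
```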
In [ ]: