Handling Missing Values
- nikolaos giannakoulis

- Nov 27, 2019
- 4 min read
Updated: Jun 15, 2022
Introduction
There are many ways data can end up with missing values. For example
A 2 bedroom house wouldn't include an answer for How large is the third bedroomSomeone being surveyed may choose not to share their income
Python libraries represent missing numbers as nan which is short for "not a number". You can detect which cells have missing values, and then count how many there are in each column with the command:
missing_val_count_by_column = (data.isnull().sum())
print(missing_val_count_by_column[missing_val_count_by_column > 0Most libraries (including scikit-learn) will give you an error if you try to build a model using data with missing values. So you'll need to choose one of the strategies below.
Solutions
1) A Simple Option: Drop Columns with Missing Values
If your data is in a DataFrame called original_data, you can drop columns with missing values. One way to do that is
data_without_missing_values = original_data.dropna(axis=1)In many cases, you'll have both a training dataset and a test dataset. You will want to drop the same columns in both DataFrames. In that case, you would write
cols_with_missing = [col for col in original_data.columns
if original_data[col].isnull().any()]
reduced_original_data = original_data.drop(cols_with_missing, axis=1)
reduced_test_data = test_data.drop(cols_with_missing, axis=1)If those columns had useful information (in the places that were not missing), your model loses access to this information when the column is dropped. Also, if your test data has missing values in places where your training data did not, this will result in an error.
So, it's somewhat usually not the best solution. However, it can be useful when most values in a column are missing
2) A Better Option: Imputation
Imputation fills in the missing value with some number. The imputed value won't be exactly right in most cases, but it usually gives more accurate models than dropping the column entirely.
This is done with
from sklearn.impute import SimpleImputer
my_imputer = SimpleImputer()
data_with_imputed_values = my_imputer.fit_transform(original_data)The default behavior fills in the mean value for imputation. Statisticians have researched more complex strategies, but those complex strategies typically give no benefit once you plug the results into sophisticated machine learning models.
One (of many) nice things about Imputation is that it can be included in a scikit-learn Pipeline. Pipelines simplify model building, model validation and model deployment.
3) An Extension To Imputation
Imputation is the standard approach, and it usually works well. However, imputed values may by systematically above or below their actual values (which weren't collected in the dataset). Or rows with missing values may be unique in some other way. In that case, your model would make better predictions by considering which values were originally missing. Here's how it might look:
# make copy to avoid changing original data (when Imputing)
new_data = original_data.copy()
# make new columns indicating what will be imputed
cols_with_missing = (col for col in new_data.columns
if new_data[col].isnull().any())
for col in cols_with_missing:
new_data[col + '_was_missing'] = new_data[col].isnull()
# Imputation
my_imputer = SimpleImputer()
new_data = pd.DataFrame(my_imputer.fit_transform(new_data))
new_data.columns = original_data.columnsIn some cases this approach will meaningfully improve results. In other cases, it doesn't help at all.
Example (Comparing All Solutions)
We will see am example predicting housing prices from the Melbourne Housing data. To master missing value handling, fork this notebook and repeat the same steps with the Iowa Housing data. Find information about both in the Data section of the header menu.
Basic Problem Set-up
In [1]:
import pandas as pd
# Load data
melb_data = pd.read_csv('../input/melbourne-housing-snapshot/melb_data.csv')
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
melb_target = melb_data.Price
melb_predictors = melb_data.drop(['Price'], axis=1)
# For the sake of keeping the example simple, we'll use only numeric predictors.
melb_numeric_predictors = melb_predictors.select_dtypes(exclude=['object'])
Create Function to Measure Quality of An Approach
We divide our data into training and test. If the reason for this is unfamiliar, review Welcome to Data Science.
We've loaded a function score_dataset(X_train, X_test, y_train, y_test) to compare the quality of diffrent approaches to missing values. This function reports the out-of-sample MAE score from a RandomForest.
HideIn [2]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(melb_numeric_predictors,
melb_target,
train_size=0.7,
test_size=0.3,
random_state=0)
def score_dataset(X_train, X_test, y_train, y_test):
model = RandomForestRegressor()
model.fit(X_train, y_train)
preds = model.predict(X_test)
return mean_absolute_error(y_test, preds)
Get Model Score from Dropping Columns with Missing Values
In [3]:
cols_with_missing = [col for col in X_train.columns
if X_train[col].isnull().any()]
reduced_X_train = X_train.drop(cols_with_missing, axis=1)
reduced_X_test = X_test.drop(cols_with_missing, axis=1)
print("Mean Absolute Error from dropping columns with Missing Values:")
print(score_dataset(reduced_X_train, reduced_X_test, y_train, y_test))
Mean Absolute Error from dropping columns with Missing Values:
/opt/conda/lib/python3.6/site-packages/sklearn/ensemble/forest.py:246: FutureWarning: The default value of n_estimators will change from 10 in version 0.20 to 100 in 0.22.
"10 in version 0.20 to 100 in 0.22.", FutureWarning)
190206.25740935552
Get Model Score from Imputation
In [4]:
from sklearn.impute import SimpleImputer
my_imputer = SimpleImputer()
imputed_X_train = my_imputer.fit_transform(X_train)
imputed_X_test = my_imputer.transform(X_test)
print("Mean Absolute Error from Imputation:")
print(score_dataset(imputed_X_train, imputed_X_test, y_train, y_test))
Mean Absolute Error from Imputation:
/opt/conda/lib/python3.6/site-packages/sklearn/ensemble/forest.py:246: FutureWarning: The default value of n_estimators will change from 10 in version 0.20 to 100 in 0.22.
"10 in version 0.20 to 100 in 0.22.", FutureWarning)
183672.59241765435
Get Score from Imputation with Extra Columns Showing What Was Imputed
In [5]:
imputed_X_train_plus = X_train.copy()
imputed_X_test_plus = X_test.copy()
cols_with_missing = (col for col in X_train.columns
if X_train[col].isnull().any())
for col in cols_with_missing:
imputed_X_train_plus[col + '_was_missing'] = imputed_X_train_plus[col].isnull()
imputed_X_test_plus[col + '_was_missing'] = imputed_X_test_plus[col].isnull()
# Imputation
my_imputer = SimpleImputer()
imputed_X_train_plus = my_imputer.fit_transform(imputed_X_train_plus)
imputed_X_test_plus = my_imputer.transform(imputed_X_test_plus)
print("Mean Absolute Error from Imputation while Track What Was Imputed:")
print(score_dataset(imputed_X_train_plus, imputed_X_test_plus, y_train, y_test))
Mean Absolute Error from Imputation while Track What Was Imputed:
/opt/conda/lib/python3.6/site-packages/sklearn/ensemble/forest.py:246: FutureWarning: The default value of n_estimators will change from 10 in version 0.20 to 100 in 0.22.
"10 in version 0.20 to 100 in 0.22.", FutureWarning)
184784.40251186382
Conclusion
As is common, imputing missing values allowed us to improve our model compared to dropping those columns. We got an additional boost by tracking what values had been imputed.






Comments