Using fancyimpute in Python
In a recent Kaggle competition that I participated in, I faced the challenge of imputing missing values as effectively as possible. Since it was a competition, the criterion was to get the maximum possible accuracy, which depended largely on how the missing data was handled. A few Kagglers suggested using R’s MICE package for this purpose. As my code was in Python, I went hunting for an alternative, and that’s when I stumbled upon “fancyimpute”.
Fancyimpute is available with Python 3.6 and consists of several imputation algorithms. In this article I will focus on using KNN to impute numerical and categorical variables. KNN, or K-nearest neighbors, fills each missing value from the k rows most similar to the row being imputed, where the similarity between two rows is measured by the mean squared difference over the features that both rows have observed.
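To make the idea concrete, here is a minimal NumPy sketch of nearest-neighbor imputation with a mean-squared-difference distance. It is not fancyimpute's actual implementation: the function name knn_impute, the toy array, and the inverse-distance weighting of the neighbors are my own assumptions for illustration.

```python
import numpy as np

def knn_impute(X, k=2):
    """Fill NaNs in X from the k nearest rows (illustrative sketch only)."""
    X = np.asarray(X, dtype=float)
    missing = np.isnan(X)
    filled = X.copy()
    n_rows = X.shape[0]

    for i in range(n_rows):
        if not missing[i].any():
            continue
        # Distance from row i to every other row: mean squared difference
        # over the features that both rows have observed.
        dists = np.full(n_rows, np.inf)
        for j in range(n_rows):
            if j == i:
                continue
            shared = ~missing[i] & ~missing[j]
            if shared.any():
                dists[j] = np.mean((X[i, shared] - X[j, shared]) ** 2)

        for col in np.where(missing[i])[0]:
            # Only rows that observed this feature can donate a value.
            donors = np.where(~missing[:, col] & np.isfinite(dists))[0]
            if donors.size == 0:
                continue  # nothing to borrow from; leave the NaN in place
            nearest = donors[np.argsort(dists[donors])[:k]]
            # Assumed weighting: closer rows count more. The small constant
            # avoids division by zero for identical rows.
            weights = 1.0 / (dists[nearest] + 1e-6)
            filled[i, col] = np.average(X[nearest, col], weights=weights)
    return filled

# Toy data: the first row is close to rows 2 and 3, far from row 4.
toy = np.array([
    [1.0, 2.0, np.nan],
    [1.1, 2.1, 3.0],
    [0.9, 1.9, 2.8],
    [8.0, 9.0, 10.0],
])
toy_filled = knn_impute(toy, k=2)
```

With k=2 the missing value in the first row is filled from the two similar rows, so it lands between 2.8 and 3.0 rather than being pulled toward the outlying fourth row.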
You can install fancyimpute with pip using pip install fancyimpute, and then import the required modules from it. In the code snippet below I am imputing the numerical data in my training data set.
# Impute missing values using KNN
from fancyimpute import KNN

# Use the 2 nearest rows which have a feature to fill in each row's missing features
imputer = KNN(k=2)
# traindata is expected to contain only numerical columns
trainfillna = imputer.fit_transform(traindata)
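One practical detail: as far as I can tell, fit_transform hands back a plain NumPy array, so the column names and index of the original DataFrame are lost. The sketch below shows one way to put them back. The toy DataFrame and its column names are made up, and column means stand in for the imputer output only so that the example runs without fancyimpute installed.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the training data.
traindata = pd.DataFrame({
    "age": [22.0, np.nan, 35.0],
    "fare": [7.25, 71.3, np.nan],
})

# Stand-in for imputer.fit_transform(traindata): any imputer that returns
# a NumPy array of the same shape would slot in here.
trainfillna = traindata.fillna(traindata.mean()).to_numpy()

# Put the column names and row index back on the imputed array.
trainfilled_df = pd.DataFrame(
    trainfillna, columns=traindata.columns, index=traindata.index
)
```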
Before imputing categorical variables with fancyimpute, you have to encode the strings as numerical values. In the code snippet below I am using ordinal encoding to encode the categorical variables in my training data, leaving the missing values in place so that they can then be imputed using KNN.
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

encoder = OrdinalEncoder()

# list of categorical variables
cat_cols = traindatacat.columns

# This function encodes the non-null values of a column and writes them
# back in place of the original strings; nulls are left untouched
def ordinalencode(data):
    nonulls = np.array(data.dropna())
    impute_reshape = nonulls.reshape(-1, 1)
    impute_ordinal = encoder.fit_transform(impute_reshape)
    data.loc[data.notnull()] = np.squeeze(impute_ordinal)
    return data

# encode all the categorical columns in the data set by looping over them
for column in cat_cols:
    traindatacat[column] = ordinalencode(traindatacat[column])
Now the data set traindatacat has encoded categorical variables. Missing values can be imputed using the same KNN technique that was used above for numerical features.
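One step worth keeping in mind is getting back to category labels after imputation. KNN fills in averages of the neighbors' codes, so an imputed value can be a non-integer such as 1.4. The sketch below rounds the imputed codes to the nearest valid code and decodes them with the encoder's inverse_transform. The category names and the "imputed" numbers are made up for illustration, and rounding is only a rough heuristic, since ordinal codes impose an order that nominal categories do not really have.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical categorical column; the category names are invented.
colors = np.array([["red"], ["green"], ["blue"], ["green"]], dtype=object)

encoder = OrdinalEncoder()
encoder.fit(colors)  # categories are sorted: blue=0, green=1, red=2

# Pretend an imputer returned these codes; 1.4 and 1.6 stand in for
# imputed values that fell between two categories.
imputed_codes = np.array([[2.0], [1.4], [0.0], [1.6]])

# Round to the nearest code and clip to the valid range. Averaging
# imputers stay in range on their own, but regression-based ones can overshoot.
n_categories = len(encoder.categories_[0])
valid_codes = np.clip(np.round(imputed_codes), 0, n_categories - 1)

# Map the codes back to the original labels.
decoded = encoder.inverse_transform(valid_codes)
```

Note that decoding a whole data set this way needs one fitted encoder per column, because each column has its own set of categories.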
In this manner fancyimpute can be easily used to replace missing values in huge data sets. Compared to commonly used imputing techniques like replacing with the median or the mean, this method gave me better model accuracy. The only drawback of this package is that it works only on numerical data. Hence, categorical variables need to be encoded before imputing.
Another algorithm in fancyimpute that is more robust than KNN is MICE (Multiple Imputation by Chained Equations). MICE imputes each feature that has missing values by regressing it on the other features, cycling through the features over several rounds. Use the code snippet below to run MICE:
from fancyimpute import IterativeImputer
mice_impute = IterativeImputer()
traindatafill = mice_impute.fit_transform(traindata)
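For intuition about what "chained equations" means, here is a small NumPy sketch that imputes each column by linear regression on the other columns and repeats the cycle a few times. It is a simplification under my own assumptions, not what IterativeImputer does internally: the function name and toy data are hypothetical, it uses plain least squares, and it produces a single deterministic imputation, whereas full MICE draws several imputed data sets.

```python
import numpy as np

def chained_regression_impute(X, n_rounds=10):
    """Impute NaNs by cycling linear regressions over columns (sketch only)."""
    X = np.asarray(X, dtype=float)
    missing = np.isnan(X)

    # Start from a simple guess: the mean of each column.
    col_means = np.nanmean(X, axis=0)
    filled = np.where(missing, col_means, X)

    for _ in range(n_rounds):
        for col in range(X.shape[1]):
            rows_missing = missing[:, col]
            if not rows_missing.any():
                continue
            # Predictors: all other columns (current guesses included),
            # plus a column of ones for the intercept.
            others = np.delete(filled, col, axis=1)
            design = np.column_stack([np.ones(len(others)), others])
            # Fit only on rows where this column was actually observed.
            coef, *_ = np.linalg.lstsq(
                design[~rows_missing], filled[~rows_missing, col], rcond=None
            )
            # Replace the guesses for this column with the predictions.
            filled[rows_missing, col] = design[rows_missing] @ coef
    return filled

# Toy data following y = 2x + 1, with one value missing in each column.
mice_toy = np.array([
    [0.0, 1.0],
    [np.nan, 3.0],
    [2.0, 5.0],
    [3.0, np.nan],
    [4.0, 9.0],
    [5.0, 11.0],
])
mice_filled = chained_regression_impute(mice_toy)
```

Because each column's guesses feed the regression for the other column, the estimates improve from round to round and settle close to the values implied by the linear relationship.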
IterativeImputer was merged into scikit-learn from fancyimpute. However, it can still be imported from fancyimpute.
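If you prefer to import it from scikit-learn directly, the sketch below shows the import path as I understand it. At the time of writing the class is still marked experimental there, so an extra enabling import is needed first. The toy array is made up and stands in for the training data.

```python
import numpy as np

# IterativeImputer is flagged experimental in scikit-learn, so this
# enabling import must come before importing the class itself.
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Hypothetical toy data standing in for traindata.
sk_toy = np.array([
    [1.0, 2.0],
    [2.0, np.nan],
    [3.0, 6.0],
    [np.nan, 8.0],
    [5.0, 10.0],
])

sk_imputer = IterativeImputer(max_iter=10, random_state=0)
sk_imputed = sk_imputer.fit_transform(sk_toy)
```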
There are some other interesting algorithms to explore in fancyimpute, such as SimpleFill, MatrixFactorization, and SoftImpute. You can try them out and find which works best for your data. Refer to the fancyimpute documentation for more information.