Earlier this year, our data science team of three was gathered in a small meeting room, looking over plots describing feature behavior for a new model. We were considering productionizing the model, and for us a crucial step in that process involves poking and prodding the model to find out how it works. In this case, that meant pulling up partial dependence plots for each feature and checking how sensibly each one behaved.

Staring at these plots, one thing quickly became obvious: certain categorical features were behaving strangely. Some label-encoded categorical features had strong model responses at their imputed values, making for strange bumps and dips in their PDP plots. After chatting a bit about what these meant and what to do about it, we decided to investigate what was going on with these features by modifying our imputation strategy. Instead of imputing by mean or median for every feature, we’d instead impute a value of -1 for all our categorical features, and see what the new partial dependence plots would look like. We were thinking that if the strong model response moves with the changed imputed value, that would be a clear sign that the model is relying on the presence of a missing value, as opposed to intuiting a relationship with the feature values.

On that note we ended the meeting. Now it was time to get back to the laptop, and start coding up a custom imputer to handle these categorical variables!

The first thing I wanted to do was look at the code behind scikit-learn's `SimpleImputer`. I was curious about two things:

- How did they store the calculated values, for future imputation?
- How did they apply those stored values when imputing?

Time to dive through the source code for `SimpleImputer`. Reading through it, I was able to answer both questions: imputed values are stored in the `statistics_` attribute as a NumPy array, and applied to null values via boolean masks.
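To make that concrete, here's a small sketch (the toy data is mine, not from our project) that fits a `SimpleImputer` and inspects its fitted state:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy data with one missing value per column
X = np.array([[1.0, 10.0],
              [np.nan, 20.0],
              [3.0, np.nan]])

imputer = SimpleImputer(strategy="mean")
imputer.fit(X)

# The per-column imputed values live in the statistics_ attribute
print(imputer.statistics_)  # column means: [2.0, 15.0]

# transform() locates nulls with a boolean mask and fills in the stored values
print(imputer.transform(X))
```

Because the fitted values sit in a plain array, overwriting entries of `statistics_` is enough to change what gets imputed, which is exactly what we ended up exploiting.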

With that in mind, a design for a quick-and-dirty implementation came together: we needed a way to identify categorical features, and to change the stored imputed values to -1 for those features.

So first, let's see what we can do to identify categorical features. This is hard to automate in general, but the defining trait of a categorical feature is that it draws from a fixed set of possible values. Assuming a reasonably large dataset, we can check what percentage of the feature's data is composed of unique values. In Python, that function looks something like this (source here):

```python
def is_categorical(array, percent_unique_cutoff=0.1):
    # Ignore missing values when inspecting the feature
    test_array = array[~np.isnan(array)]
    # Any non-integer values mean the feature is continuous
    not_int = (test_array.astype(int) != test_array).sum()
    if not_int:
        return False
    # Categorical features have few unique values relative to their length
    percent_unique = len(np.unique(test_array)) / len(array)
    return percent_unique < percent_unique_cutoff
```
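A quick sanity check of the heuristic on synthetic data (the function body is repeated here so the snippet runs on its own; the example arrays are illustrative):

```python
import numpy as np

def is_categorical(array, percent_unique_cutoff=0.1):
    test_array = array[~np.isnan(array)]
    not_int = (test_array.astype(int) != test_array).sum()
    if not_int:
        return False
    percent_unique = len(np.unique(test_array)) / len(array)
    return percent_unique < percent_unique_cutoff

# Label codes drawn from a small set: few unique values, so categorical
codes = np.array([0.0, 1.0, 2.0] * 20 + [np.nan] * 3)
print(is_categorical(codes))  # True (3 unique values across 63 entries)

# Continuous measurements: non-integer values short-circuit to False
measurements = np.random.default_rng(0).normal(size=63)
print(is_categorical(measurements))  # False
```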

Now that we have a way to identify categorical features, we need to integrate it into an imputer. As we noted before, when the imputer is fit, the imputed values are stored in the `statistics_` attribute as a NumPy array. The easiest thing to implement would be to modify the `statistics_` array, changing the imputed values for the identified categorical features to our desired value of -1. Specifically, we'd make a new class inheriting from `SimpleImputer`, and modify the `__init__` and `fit` methods to include this new functionality.

What that looks like in code:

```python
import numpy as np
from sklearn.impute._base import _get_mask
from sklearn.impute import SimpleImputer


class DTypeImputer(SimpleImputer):
    def __init__(self, missing_values=np.nan, strategy="mean", fill_value=None,
                 verbose=0, copy=True, add_indicator=False,
                 categorical_fill_value=-1):
        super().__init__(missing_values, strategy, fill_value, verbose,
                         copy, add_indicator)
        self.categorical_fill_value = categorical_fill_value

    def fit(self, X, y=None):
        super().fit(X, y)
        X = self._validate_input(X)  # needed to change this into a np.array
        mask = _get_mask(X, np.nan)
        masked_X = np.ma.masked_array(X, mask=mask)
        # find categorical features
        categorical_mask = np.apply_along_axis(is_categorical, 0, masked_X)
        # apply categorical fill value
        self.statistics_[categorical_mask] = self.categorical_fill_value
        return self
```

With that code implemented, we used this new imputer to generate another set of partial dependence plots. Looking at those, we saw that the strong model response followed the change in imputed value, telling us that the model was getting predictive power out of whether that feature had missing data or not. In some cases, the most powerful signal is whether or not we even have the data!

Looking back at the code I wrote almost a year ago, I'm struck both by how much I learned from doing this, and by how, well, unnecessary it was. If I were smarter, I would have manually identified the categorical features and used scikit-learn's `ColumnTransformer` to impute continuous and categorical features separately.
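For what it's worth, that alternative is only a few lines. Here's a minimal sketch, assuming the categorical columns are identified by hand (the toy data and column indices are illustrative, not from our project):

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer

# Toy data: column 0 is a label-encoded categorical, column 1 is continuous
X = np.array([[1.0, 10.5],
              [np.nan, np.nan],
              [3.0, 12.5]])

# Assign each column to an imputation strategy by hand, no heuristic needed
imputer = ColumnTransformer([
    ("categorical", SimpleImputer(strategy="constant", fill_value=-1), [0]),
    ("continuous", SimpleImputer(strategy="mean"), [1]),
])

# Categorical nulls become -1; continuous nulls get the column mean
X_imputed = imputer.fit_transform(X)
```

No subclassing, no private `_get_mask` import, and the categorical columns are explicit instead of guessed.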

At the same time, however, I wouldn’t be nearly as familiar with imputation as I am today. All in all, it’s an exercise I’m glad I went through.