# How to Improve Univariate Feature Imputation with Just One Line of Code

Imagine you’re on Zillow.com looking at million-dollar houses to buy or rent, all the while having exactly 5 dollars in your bank account (we’ve all been there, I don’t judge). You see the most beautiful house, with a lovely garden, in a neighborhood that seems to have better neighbors than the lousy ones you currently have. The price seems okay, but the number of rooms is missing from the advert. With no information on such an important factor, can we still say the house is worth the price? If not, should we let a model predict the price of a house when such critical data is missing?

In most online courses, dealing with missing data does NOT get enough attention. ML frameworks are not big fans of missing data, and real-world data sets are messy, which complicates the data cleaning process, so this is a critical skill to learn when building accurate models. The most common tool the scikit-learn Python package offers for handling missing data is its univariate feature imputation class, “SimpleImputer”, which fills the gaps in a column using a statistic such as the mean, median, or mode of that column’s observed values. Without diving straight into more sophisticated methods such as multivariate feature imputation, can we revamp this univariate approach to provide more accurate results?

## What’s so bad about the univariate approach?

Since we are on the topic of house prices, I imported the Melbourne housing price data from Kaggle and randomly removed 20% of the values from the ‘Rooms’ column. Taking the average number of rooms would give us fractions, so a better approach here is to impute the missing data with the most frequent value (mode) of the number of rooms, which in this case was 3.
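Here is a minimal sketch of that setup. To keep it self-contained I use a tiny made-up table instead of the Kaggle file, and the column name `Bathroom`, the helper column names, and the random seed are my own assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy stand-in for the Melbourne data; the values are made up for illustration.
df = pd.DataFrame({
    "Rooms":    [2, 3, 3, 4, 3, 5, 2, 3, 4, 3],
    "Bathroom": [1, 1, 2, 2, 1, 3, 1, 2, 2, 1],
})

# Keep the true values so we can score the imputation later.
df["Rooms_actual"] = df["Rooms"]

# Blank out 20% of 'Rooms' at random (the seed is arbitrary).
rng = np.random.default_rng(seed=0)
missing_idx = rng.choice(df.index, size=int(0.2 * len(df)), replace=False)
df["Rooms"] = df["Rooms"].astype(float)
df.loc[missing_idx, "Rooms"] = np.nan

# Univariate imputation: every gap gets the column's most frequent value.
imputer = SimpleImputer(strategy="most_frequent")
df["Rooms_imputed"] = imputer.fit_transform(df[["Rooms"]]).ravel()
```

Note that every missing house gets the same value here, no matter what else we know about it.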

To measure the deviation between the actual and imputed values, we create a new column holding the difference for each house and then take its average. This gives a deviation of ~0.67, meaning that our imputation is off by ~0.67 rooms per house on average when filling in missing values for the number of rooms.
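A sketch of that metric could look like the following. I am assuming the difference is taken as an absolute value (so that over- and under-estimates don't cancel out), and the numbers in the table are made up rather than taken from the Melbourne data.

```python
import pandas as pd

# Made-up example: true room counts next to the values an imputer filled in.
houses = pd.DataFrame({
    "Rooms_actual":  [2, 4, 5, 3],
    "Rooms_imputed": [3, 3, 3, 3],
})

# Per-house absolute difference, then its average.
houses["deviation"] = (houses["Rooms_actual"] - houses["Rooms_imputed"]).abs()
mean_deviation = houses["deviation"].mean()
print(mean_deviation)  # 1.0
```

Whether you average over all houses or only over the ones that were imputed changes the number, so it is worth fixing that choice before comparing methods.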

Let’s go back to that moment of disappointment when you realized the number of rooms for your dream house was not in the advert. How would *you* go about estimating the number of rooms in that specific house? Personally, if the number of bathrooms is given, I would look at other houses that have the same number of bathrooms and take the most common number of rooms among them. So if your house of interest has 6 bathrooms, is it fair to assume that it has just 3 rooms, which is the most common value that “SimpleImputer” would fill in?

## How can we improve this?

As mentioned above, we can look at the most common occurrence by taking another variable into account, such as the number of bathrooms.

Grouping the houses by number of bathrooms shows that the most common number of rooms follows an overall increasing pattern (as expected). This gives a fairer estimate than applying the overall statistic (3 rooms) to every house while neglecting the other variables at play. Therefore, if your dream house has 6 bathrooms, you would impute 5 rooms instead of 3, which is likely closer to the actual value.
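In pandas, this idea fits in a single statement. The sketch below is one way to write it, not necessarily the exact line used for the results in this post; the toy values and the column name `Bathroom` are assumptions.

```python
import numpy as np
import pandas as pd

# Toy data (made up): 'Rooms' has one gap per bathroom count.
df = pd.DataFrame({
    "Bathroom": [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3],
    "Rooms":    [2, 2, 3, np.nan, 3, 3, 4, np.nan, 4, 4, np.nan],
})

# The "one line": fill each gap with the most common 'Rooms' value
# among houses that have the same number of bathrooms.
df["Rooms"] = df["Rooms"].fillna(
    df.groupby("Bathroom")["Rooms"].transform(lambda s: s.mode().iloc[0])
)
```

This version assumes every bathroom group has at least one observed `Rooms` value. A group with no observed values would make `s.mode()` empty and `.iloc[0]` would raise an error, so real data may need a fallback.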

And you can easily add even more features, such as the type of the house, to bring the imputed values even closer to the actual figures.
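A sketch of grouping by two features is below. Finer groups are more likely to end up with no observed values at all, so I added a fallback chain (both features, then bathrooms only, then the overall mode). That fallback, the helper `group_mode`, the column names `Bathroom` and `Type`, and the toy values are my own assumptions.

```python
import numpy as np
import pandas as pd

def group_mode(s):
    """Most frequent observed value in a group, or NaN if the group has none."""
    modes = s.mode()
    return modes.iloc[0] if not modes.empty else np.nan

# Toy data (made up). The single 't' house has no peers of the same type.
df = pd.DataFrame({
    "Type":     ["h", "h", "h", "u", "u", "u", "t", "h"],
    "Bathroom": [2, 2, 2, 1, 1, 1, 2, 1],
    "Rooms":    [4, 4, np.nan, 2, 2, np.nan, np.nan, 3],
})

# Candidate fill values, from most specific to least specific.
by_both = df.groupby(["Bathroom", "Type"])["Rooms"].transform(group_mode)
by_bath = df.groupby("Bathroom")["Rooms"].transform(group_mode)
overall = df["Rooms"].mode().iloc[0]

# Use the most specific estimate available for each missing value.
df["Rooms_imputed"] = df["Rooms"].fillna(by_both).fillna(by_bath).fillna(overall)
```

In this toy table the lone `t` house cannot be filled from its own (bathrooms, type) group, so it falls back to the mode for houses with 2 bathrooms.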

And finally, comparing the deviation metric we created for the two approaches, we can clearly see an improvement with the new method, which still carries out univariate feature imputation but uses other variables in the process.

Even though this is a minor modification to the univariate feature imputation process, it can make a big difference in how closely the imputed values match the real ones. I hope you find this useful when dealing with missing data!