Table of Contents
- Types of Missing Data
- Detecting Missing Data in Python
- Handling Missing Data: Techniques
- Python Libraries for Missing Data Handling
- Best Practices for Handling Missing Data
- Conclusion
- References
1. Types of Missing Data
Before handling missing data, it’s critical to understand why it’s missing. Statisticians classify missing data into three types based on the “missingness mechanism”—the relationship between missingness and other variables in the dataset.
1.1 Missing Completely at Random (MCAR)
Missingness is unrelated to any observed or unobserved variables. For example:
- A sensor randomly fails to record temperature readings.
- A survey respondent accidentally skips a question (not due to the question’s content).
Key trait: The missing data is a random subset of the dataset. Statistical properties (e.g., mean, variance) of the observed data remain representative of the full dataset.
1.2 Missing at Random (MAR)
Missingness is related to observed variables but not the unobserved missing values themselves. For example:
- Income data is missing more frequently for younger respondents (age is observed).
- Blood pressure readings are missing for patients who visited the clinic on weekends (visit day is observed).
Key trait: The missingness pattern can be explained by other variables in the dataset. With proper adjustment (e.g., imputation using observed variables), bias can be minimized.
1.3 Missing Not at Random (MNAR)
Missingness is related to the unobserved missing values or unmeasured variables. This is the most problematic type. For example:
- Employees with very high salaries refuse to report their income (missingness depends on the unobserved salary itself).
- Patients with severe symptoms avoid a follow-up survey (missingness linked to unmeasured health severity).
Key trait: MNAR data is biased and hard to correct. Solutions often require domain knowledge or sensitivity analysis.
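The three mechanisms can be made concrete with a small simulation. The sketch below is illustrative only: the synthetic population, the missingness probabilities, and names such as df_sim and observed_means are assumptions made for this example, not part of any real dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 10_000

# Synthetic population (assumed for illustration): income rises with age.
age = rng.integers(20, 65, size=n)
income = 20_000 + 1_000 * age + rng.normal(0, 5_000, size=n)
df_sim = pd.DataFrame({"Age": age, "Income": income})

# MCAR: every income value has the same 30% chance of being missing.
mcar_mask = rng.random(n) < 0.30

# MAR: younger respondents (an observed variable) skip income more often.
mar_mask = rng.random(n) < np.where(df_sim["Age"] < 35, 0.60, 0.10)

# MNAR: high earners (the unobserved value itself) withhold income more often.
high_earner = df_sim["Income"] > df_sim["Income"].quantile(0.75)
mnar_mask = rng.random(n) < np.where(high_earner, 0.70, 0.10)

true_mean = df_sim["Income"].mean()
observed_means = {
    "MCAR": df_sim.loc[~mcar_mask, "Income"].mean(),
    "MAR": df_sim.loc[~mar_mask, "Income"].mean(),
    "MNAR": df_sim.loc[~mnar_mask, "Income"].mean(),
}

print(f"True mean income: {true_mean:,.0f}")
for mechanism, mean in observed_means.items():
    print(f"{mechanism}: observed mean = {mean:,.0f} (bias {mean - true_mean:+,.0f})")
```

In this setup the MCAR mean stays close to the true mean, the MAR mean drifts upward because younger, lower-income respondents are under-represented, and the MNAR mean drifts downward because high earners drop out. The MAR bias could be reduced by conditioning on Age, which is observed; the MNAR bias cannot be corrected from the observed columns alone.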
2. Detecting Missing Data in Python
Before handling missing data, we first need to identify where and how much data is missing. Python offers powerful tools for this, combining statistical summaries and visualizations.
2.1 Statistical Detection with Pandas
Pandas, the workhorse of data manipulation, provides simple methods to quantify missing values.
Example Dataset
We’ll use a sample dataset with missing values for demonstration:
import pandas as pd
import numpy as np
# Create a DataFrame with missing values
data = {
'Age': [25, np.nan, 30, 35, np.nan, 40],
'Income': [50000, 60000, np.nan, 75000, 80000, np.nan],
'Gender': ['M', 'F', np.nan, 'F', 'M', 'F']
}
df = pd.DataFrame(data)
Key Pandas Functions
- isnull() / notnull(): Identify missing values (returns a boolean DataFrame).
- sum(): Count missing values per column (or per row with axis=1).
- mean(): Calculate the fraction of missing values (multiply by 100 for a percentage).
# Check for missing values (True = missing)
print(df.isnull())
# Count missing values per column
print("\nMissing values per column:\n", df.isnull().sum())
# Percentage of missing values per column
print("\nPercentage missing per column:\n", df.isnull().mean().round(2) * 100)
Output:
Age Income Gender
0 False False False
1 True False False
2 False True True
3 False False False
4 True False False
5 False True False
Missing values per column:
Age 2
Income 2
Gender 1
dtype: int64
Percentage missing per column:
Age 33.0
Income 33.0
Gender 17.0
dtype: float64
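The same functions also work row-wise. As a small sketch that rebuilds the sample DataFrame so it runs on its own (the variable names missing_per_row, incomplete_rows, and overall_fraction are chosen here for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Age": [25, np.nan, 30, 35, np.nan, 40],
    "Income": [50000, 60000, np.nan, 75000, 80000, np.nan],
    "Gender": ["M", "F", np.nan, "F", "M", "F"],
})

# Number of missing cells in each row (axis=1 sums across columns).
missing_per_row = df.isnull().sum(axis=1)
print(missing_per_row.tolist())  # [0, 1, 2, 0, 1, 1]

# Rows that contain at least one missing value.
incomplete_rows = df[df.isnull().any(axis=1)]
print(len(incomplete_rows))  # 4

# Share of missing cells across the whole table (5 of 18 cells).
overall_fraction = float(df.isnull().to_numpy().mean())
print(round(overall_fraction, 3))  # 0.278
```

Row-level counts are useful for spotting records that are mostly empty, which column totals alone do not reveal.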
2.2 Visual Detection with missingno
The missingno library (install with pip install missingno) visualizes missing data patterns, making it easier to spot trends.
Key Visualizations
- Matrix Plot: Shows missingness patterns across rows.
- Bar Plot: Summarizes missing values per column.
- Heatmap: Reveals correlations between missingness of variables.
import missingno as msno
import matplotlib.pyplot as plt
# Matrix plot: Rows = observations, Columns = features (white = missing)
msno.matrix(df)
plt.title("Missing Data Matrix")
plt.show()
# Bar plot: bar height = number of non-missing values per column (shorter bar = more missing)
msno.bar(df)
plt.title("Data Completeness per Column")
plt.show()
# Heatmap: Correlation of missingness between columns
msno.heatmap(df)
plt.title("Missingness Correlation Heatmap")
plt.show()
Interpretation:
- The matrix plot highlights rows with clustered missing values (e.g., row 2 has missing Income and Gender).
- The bar plot quantifies missingness (e.g., 33% of Age values are missing).
- The heatmap shows if missingness in one column correlates with another (e.g., Income and Gender missingness may be linked).
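The numbers behind such a heatmap can be approximated with pandas alone. The sketch below assumes the heatmap is based on the correlation between per-column missingness indicators (nullity correlation); the names nullity and nullity_corr are chosen for this example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Age": [25, np.nan, 30, 35, np.nan, 40],
    "Income": [50000, 60000, np.nan, 75000, 80000, np.nan],
    "Gender": ["M", "F", np.nan, "F", "M", "F"],
})

# 1 where a value is missing, 0 where it is present.
nullity = df.isnull().astype(int)

# Pearson correlation between the missingness indicators of each column pair.
nullity_corr = nullity.corr()
print(nullity_corr.round(2))
```

A positive value (here Income vs. Gender) means the two columns tend to be missing in the same rows; a negative value (here Age vs. Income) means one tends to be present when the other is missing. With only six rows these values are unstable and serve purely as an illustration.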
3. Handling Missing Data: Techniques
Once missing data is detected, the next step is to handle it. The choice of method depends on the type of missingness (MCAR/MAR/MNAR), dataset size, and analysis goals.
3.1 Deletion Methods
Deletion removes rows or columns with missing values. It’s simple but risky, as it reduces sample size.
3.1.1 Listwise Deletion (Complete Case Analysis)
Removes entire rows containing any missing values.
Code:
# Drop rows with any missing values
df_listwise = df.dropna(axis=0) # axis=0 = rows
print("Original rows:", len(df), "| Rows after listwise deletion:", len(df_listwise))
Output:
Original rows: 6 | Rows after listwise deletion: 2
Pros: Simple, no bias introduced if data is MCAR.
Cons: Reduces sample size (critical for small datasets). Risk of bias if data is MAR/MNAR.
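When dropping every incomplete row is too aggressive, dropna() can be restricted. A brief sketch using its subset and thresh parameters on the sample data (rebuilt here so the block is self-contained; the result names are illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Age": [25, np.nan, 30, 35, np.nan, 40],
    "Income": [50000, 60000, np.nan, 75000, 80000, np.nan],
    "Gender": ["M", "F", np.nan, "F", "M", "F"],
})

# Drop rows only when 'Age' is missing; gaps in other columns are tolerated.
df_age_known = df.dropna(subset=["Age"])
print(len(df_age_known))  # 4

# Keep rows that have at least 2 non-missing values.
df_mostly_complete = df.dropna(thresh=2)
print(len(df_mostly_complete))  # 5
```

Both variants keep more rows than full listwise deletion, at the cost of leaving some missing values to handle later.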
3.1.2 Pairwise Deletion
Uses available data for each variable, ignoring missing values in other variables. For example, when calculating the mean of Age, only rows with non-missing Age are used.
Code:
# Mean of 'Age' using pairwise deletion (ignores NaN)
mean_age = df['Age'].mean()
print("Mean Age (pairwise deletion):", mean_age)
Output:
Mean Age (pairwise deletion): 32.5
Pros: Retains more data than listwise deletion.
Cons: Results may be inconsistent (e.g., correlation matrices use different sample sizes for each pair of variables).
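The inconsistency is easy to see by counting how many rows each statistic actually uses. A minimal sketch on the numeric columns of the sample data (observed and pair_counts are names chosen for this example):

```python
import numpy as np
import pandas as pd

df_num = pd.DataFrame({
    "Age": [25, np.nan, 30, 35, np.nan, 40],
    "Income": [50000, 60000, np.nan, 75000, 80000, np.nan],
})

# 1 where a value is observed, 0 where it is missing.
observed = df_num.notna().astype(int)

# Entry (i, j) counts the rows in which both column i and column j are observed.
pair_counts = observed.T @ observed
print(pair_counts)

# DataFrame.corr() excludes missing values pair by pair, so the
# Age-Income correlation below is computed from the shared rows only.
print(df_num.corr())
```

Each column's own statistics use 4 rows, while the Age-Income correlation rests on just 2, so different cells of the same matrix describe different subsets of the data.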
3.2 Imputation Methods
Imputation replaces missing values with estimated ones. It preserves sample size and, when the imputation method suits the missingness mechanism, usually introduces less bias than deletion.
3.2.1 Mean/Median Imputation (Numeric Variables)
Replace missing values with the mean (for normally distributed data) or median (for skewed data/outliers).
Code:
# Impute 'Age' with median (robust to outliers)
df['Age_median_imputed'] = df['Age'].fillna(df['Age'].median())
# Impute 'Income' with mean
df['Income_mean_imputed'] = df['Income'].fillna(df['Income'].mean())
print(df[['Age', 'Age_median_imputed', 'Income', 'Income_mean_imputed']])
Output:
Age Age_median_imputed Income Income_mean_imputed
0 25.0 25.0 50000.0 50000.0
1 NaN 32.5 60000.0 60000.0
2 30.0 30.0 NaN 66250.0
3 35.0 35.0 75000.0 75000.0
4 NaN 32.5 80000.0 80000.0
5 40.0 40.0 NaN 66250.0
Pros: Simple, preserves mean/median of the original data.
Cons: Reduces variance (since imputed values cluster around the mean/median). Not suitable for MNAR data.
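The variance reduction can be checked directly. A small sketch on the Age values from the sample data (age_imputed is a name chosen for this example):

```python
import numpy as np
import pandas as pd

age = pd.Series([25, np.nan, 30, 35, np.nan, 40], name="Age")

# Fill the two gaps with the median of the observed values (32.5).
age_imputed = age.fillna(age.median())

# Series.var() skips NaN, so the first figure uses the 4 observed values only.
print(round(age.var(), 2))          # 41.67
print(round(age_imputed.var(), 2))  # 25.0
```

The two imputed values sit exactly at the centre of the distribution, so they add observations without adding spread. Any analysis that relies on the variance (confidence intervals, correlations) will look more certain than the data justifies.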
3.2.2 Mode Imputation (Categorical Variables)
Replace missing categorical values with the most frequent category (mode).
Code:
# Impute 'Gender' with mode
mode_gender = df['Gender'].mode()[0] # mode() returns a Series; [0] gets the value
df['Gender_mode_imputed'] = df['Gender'].fillna(mode_gender)
print(df[['Gender', 'Gender_mode_imputed']])
Output:
Gender Gender_mode_imputed
0 M M
1 F F
2 NaN F # Mode of 'Gender' is 'F'
3 F F
4 M M
5 F F
Pros: Simple, works for nominal/categorical data.
Cons: Over-represents the mode, distorts relationships between variables.
3.2.3 Constant Value Imputation
Replace missing values with a fixed constant (e.g., 0, 999, or 'Unknown').
Code:
# Impute missing 'Income' with 0 (e.g., for "no income reported")
df['Income_constant_imputed'] = df['Income'].fillna(0)
Pros: Useful for domain-specific cases (e.g., 0 for “no transaction”).
Cons: Arbitrary constant may introduce bias if not justified.
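One way to limit that risk, sketched below as a suggestion rather than a required step, is to record which values were missing before overwriting them. The column names Income_was_missing, Income_filled, and Gender_filled are chosen for this example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Income": [50000, 60000, np.nan, 75000, 80000, np.nan],
    "Gender": ["M", "F", np.nan, "F", "M", "F"],
})

# Record which values were originally missing before filling them.
df["Income_was_missing"] = df["Income"].isna()
df["Income_filled"] = df["Income"].fillna(0)

# For a categorical column, an explicit label keeps missingness visible.
df["Gender_filled"] = df["Gender"].fillna("Unknown")

print(df)
```

With the indicator column in place, a later analysis can still tell a real 0 from a filled-in 0.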
3.2.4 Forward/Backward Fill (Time-Series Data)
For time-series data, missing values can be replaced with the previous (ffill()) or next (bfill()) observed value.
Example Time-Series Data:
# Create a time-series DataFrame
dates = pd.date_range(start='2023-01-01', periods=6)
ts_data = {'Sales': [100, np.nan, 150, np.nan, 200, 250]}
df_ts = pd.DataFrame(ts_data, index=dates)
# Forward fill (carry last observation forward)
df_ts['Sales_ffill'] = df_ts['Sales'].ffill()
# Backward fill (carry next observation backward)
df_ts['Sales_bfill'] = df_ts['Sales'].bfill()
print(df_ts)
Output:
Sales Sales_ffill Sales_bfill
2023-01-01 100.0 100.0 100.0
2023-01-02 NaN 100.0 150.0
2023-01-03 150.0 150.0 150.0
2023-01-04 NaN 150.0 200.0
2023-01-05 200.0 200.0 200.0
2023-01-06 250.0 250.0 250.0
Pros: Preserves temporal trends.
Cons: Assumes the series is stable (no sudden changes); may propagate errors.
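To keep a stale value from being carried across a long gap, ffill() and bfill() accept a limit argument. A short sketch with a hypothetical series that has a three-day gap:

```python
import numpy as np
import pandas as pd

dates = pd.date_range(start="2023-01-01", periods=6)
sales = pd.Series([100, np.nan, np.nan, np.nan, 200, 250], index=dates)

# Carry the last observation forward by at most 1 step; the rest of the gap stays NaN.
sales_ffill_limited = sales.ffill(limit=1)
print(sales_ffill_limited.tolist())  # [100.0, 100.0, nan, nan, 200.0, 250.0]
```

The remaining NaN values then have to be handled explicitly, which is often preferable to silently repeating an old reading.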
3.2.5 Interpolation
Estimates missing values using a mathematical function (e.g., linear, polynomial) based on neighboring data points.
Code:
# Linear interpolation for time-series
df_ts['Sales_interpolated'] = df_ts['Sales'].interpolate(method='linear')
print(df_ts[['Sales', 'Sales_interpolated']])
Output:
Sales Sales_interpolated
2023-01-01 100.0 100.0
2023-01-02 NaN 125.0 # Linear between 100 and 150
2023-01-03 150.0 150.0
2023-01-04 NaN 175.0 # Linear between 150 and 200
2023-01-05 200.0 200.0
2023-01-06 250.0 250.0
Pros: More accurate than forward/backward fill for smooth trends.
Cons: Struggles with non-linear or noisy data.
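When timestamps are unevenly spaced, method='linear' treats the points as equally spaced, while method='time' weights by the actual time gaps. A minimal sketch with a hypothetical three-point series:

```python
import numpy as np
import pandas as pd

# Irregular dates: 1 day between the first two points, 3 days between the last two.
idx = pd.to_datetime(["2023-01-01", "2023-01-02", "2023-01-05"])
s = pd.Series([100.0, np.nan, 200.0], index=idx)

# 'linear' ignores the index and places the missing point halfway between its neighbors.
linear = s.interpolate(method="linear")
print(linear.iloc[1])  # 150.0

# 'time' uses the DatetimeIndex: Jan 2 is one quarter of the way from Jan 1 to Jan 5.
time_based = s.interpolate(method="time")
print(time_based.iloc[1])  # 125.0
```

For evenly spaced data, such as the daily series above, the two methods give the same result.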
3.3 Advanced Imputation Methods
For complex datasets, advanced methods use machine learning or statistical models to predict missing values.
3.3.1 K-Nearest Neighbors (KNN) Imputation
Finds the k most similar rows (neighbors) based on non-missing features and averages their values to impute the missing one.
Code (using fancyimpute):
# Install: pip install fancyimpute
from fancyimpute import KNN
# Select the numeric columns; they are passed to fancyimpute as a NumPy array below
df_numeric = df[['Age', 'Income']].copy() # KNN works with numeric data
df_knn_imputed = KNN(k=3).fit_transform(df_numeric.to_numpy()) # k=3 neighbors
# Convert back to DataFrame
df_knn = pd.DataFrame(df_knn_imputed, columns=['Age_knn', 'Income_knn'])
print(df_knn)
Output (example):
Age_knn Income_knn
0 25.00 50000.0
1 32.50 60000.0 # Imputed from neighbors
2 30.00 61666.7 # Imputed from neighbors
3 35.00 75000.0
4 32.50 80000.0
5 40.00 61666.7
Pros: Captures relationships between features (e.g., Income may correlate with Age).
Cons: Computationally expensive for large datasets; requires scaling features.
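The scaling requirement can be handled in a few lines. The sketch below swaps in scikit-learn's KNNImputer instead of fancyimpute (a substitution made for this example) and standardizes the features first; n_neighbors=2 is an arbitrary choice for such a small table.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer
from sklearn.preprocessing import StandardScaler

df_numeric = pd.DataFrame({
    "Age": [25, np.nan, 30, 35, np.nan, 40],
    "Income": [50000, 60000, np.nan, 75000, 80000, np.nan],
})

# Standardize so Income (tens of thousands) does not dominate Age in the distance.
# StandardScaler ignores NaN when fitting and leaves it in place when transforming.
scaler = StandardScaler()
scaled = scaler.fit_transform(df_numeric)

# Impute each gap from the 2 nearest rows that have that feature observed.
imputer = KNNImputer(n_neighbors=2)
scaled_imputed = imputer.fit_transform(scaled)

# Map the result back to the original units.
df_knn_sklearn = pd.DataFrame(
    scaler.inverse_transform(scaled_imputed),
    columns=df_numeric.columns,
)
print(df_knn_sklearn.round(1))
```

Observed values pass through unchanged; only the gaps are filled. On a six-row table the neighbors are barely meaningful, so treat the output as a demonstration of the workflow rather than a realistic result.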
3.3.2 MICE (Multiple Imputation by Chained Equations)
MICE creates multiple imputed datasets by iteratively predicting missing values using regression models. It accounts for uncertainty by averaging results across datasets.
Code (using fancyimpute):
from fancyimpute import IterativeImputer # MICE implementation
# MICE imputation
mice_imputer = IterativeImputer(max_iter=10, random_state=42)
df_mice_imputed = mice_imputer.fit_transform(df_numeric)
# Convert back to DataFrame
df_mice = pd.DataFrame(df_mice_imputed, columns=['Age_mice', 'Income_mice'])
print(df_mice)
Pros: Widely used and well suited to MAR data; handles complex relationships.
Cons: Computationally intensive; requires tuning (e.g., max_iter).
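A single fit_transform call returns one completed dataset. To obtain several, one possible approach is to run scikit-learn's IterativeImputer repeatedly with sample_posterior=True and different seeds, then pool the estimates. The sketch below uses synthetic data, and the number of imputations m=5 and all variable names are assumptions made for this example.

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (enables the next import)
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(42)
n = 200

# Synthetic data (assumed for illustration): income depends linearly on age.
age = rng.uniform(20, 60, size=n)
income = 20_000 + 1_000 * age + rng.normal(0, 5_000, size=n)
df_sim = pd.DataFrame({"Age": age, "Income": income})

# Hide roughly 30% of incomes completely at random.
df_sim.loc[rng.random(n) < 0.30, "Income"] = np.nan

m = 5  # number of imputed datasets
estimates = []
for seed in range(m):
    # sample_posterior=True draws imputations from the predictive distribution,
    # so each seed yields a different completed dataset.
    imputer = IterativeImputer(sample_posterior=True, max_iter=10, random_state=seed)
    completed = imputer.fit_transform(df_sim)
    estimates.append(completed[:, 1].mean())  # mean income in this completed dataset

pooled_mean = float(np.mean(estimates))
between_var = float(np.var(estimates, ddof=1))
print(f"Pooled mean income: {pooled_mean:,.0f}")
print(f"Between-imputation variance of the mean: {between_var:,.0f}")
```

The spread between the m estimates reflects the uncertainty caused by the missing values; a full analysis would combine it with the within-dataset variance (Rubin's rules) rather than report the pooled mean alone.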
4. Python Libraries for Missing Data Handling
Several libraries simplify missing data workflows. Here are the most popular:
4.1 Pandas
- Purpose: Core data manipulation.
- Key Functions: isnull(), dropna(), fillna(), interpolate().
- Use Case: Basic detection, deletion, and simple imputation (mean/median/mode).
4.2 Scikit-learn
- Purpose: Machine learning utilities.
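- Key Tools:
  - SimpleImputer: Impute with mean/median/mode/constant.
  - IterativeImputer: MICE-style implementation (experimental; it must be enabled with from sklearn.experimental import enable_iterative_imputer before it can be imported from sklearn.impute).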
Example with SimpleImputer:
from sklearn.impute import SimpleImputer
# Impute 'Age' with median
imputer = SimpleImputer(strategy='median')
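# fit_transform returns a 2D array of shape (n_rows, 1); ravel() flattens it into a single column
df['Age_sklearn_imputed'] = imputer.fit_transform(df[['Age']]).ravel()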
4.3 missingno
- Purpose: Visualization of missing data patterns.
- Key Plots: matrix(), bar(), heatmap(), dendrogram() (clusters missingness).
4.4 fancyimpute
- Purpose: Advanced imputation (KNN, MICE, matrix factorization).
- Key Methods: KNN(), IterativeImputer() (MICE), SoftImpute() (matrix factorization).
4.5 impyute
- Purpose: Alternative to fancyimpute with more imputation strategies (e.g., random forest, EM algorithm).
- Example:
from impyute.imputation.cs import fast_knn # KNN imputation
df_impyute_knn = fast_knn(df_numeric.values, k=3)
5. Best Practices for Handling Missing Data
- Understand the Cause: Investigate why data is missing (e.g., human error vs. sensor failure). This guides method selection.
- Visualize First: Use missingno to spot patterns (e.g., clustered missingness in time-series).
- Avoid Deletion Unless Necessary: Only delete data if it’s MCAR and missingness is <5%.
- Impute Strategically:
  - Use mean/median for numeric data with low variance.
  - Use KNN/MICE for high-dimensional data with feature correlations.
  - Use constants (e.g., 'Unknown') for categorical data.
- Validate Imputation: Check if imputed values preserve the original data distribution (e.g., compare histograms of original vs. imputed Age).
- Document Everything: Note which method was used and why (critical for reproducibility).
6. Conclusion
Missing data is not a dead end; it’s an opportunity to refine your dataset. By combining detection tools like missingno, simple imputation with pandas, and advanced methods like MICE, you can transform messy data into reliable insights. Remember: the best approach depends on your data’s missingness type, size, and analysis goals. With these techniques, you’ll build more robust models and make data-driven decisions with confidence.
7. References
- Little, R. J., & Rubin, D. B. (2019). Statistical Analysis with Missing Data (3rd ed.). Wiley.
- Pandas Documentation: Working with Missing Data.
- scikit-learn Documentation: Imputation of Missing Values.
- Shah, A. (2017). missingno: GitHub Repository.
- van Buuren, S. (2018). Flexible Imputation of Missing Data (2nd ed.). Chapman & Hall.