Data Collection and Cleaning
- Explore: I check for columns with missing values and look for relationships that could help fill them in.
- Correlate: I calculate correlations between the column with missing values and the other columns, and focus on the ones with the largest absolute correlation.
- Predict: For each column with missing values, I fit a simple regression model on the rows where both columns are present, using a strongly correlated column as the predictor.
- Impute: I use the fitted model to predict each missing value from the predictor and fill it in.
- Validate: I check that the imputed values are reasonable, for example that they fall within the range of the observed values. If not, I try a different model or predictor.
The key is to think of the strongly correlated column as a stand-in for the missing information. It’s a simple but often effective way to clean up my data.