Kaggle is a vibrant online community for data science and machine learning, providing a platform for learning, sharing, and competition. It’s an invaluable resource for individuals interested in these fields, regardless of their level of experience.
The Kaggle House Prices – Advanced Regression Techniques Competition, in particular, is an excellent starting point for anyone who has completed a data science or machine learning course and is eager to gain practical experience.
Participants are tasked with predicting the final price of residential homes in Ames, Iowa, based on 79 explanatory variables describing various aspects of the properties.

The variables include a vast array of house attributes such as the type of dwelling, the size of the living area, the number of rooms, the year the house was built, the quality and condition of various features, the neighborhood, and many more. The challenge aims to encourage the application of advanced regression techniques and creative feature engineering to build models that can accurately predict house prices, an important task in real estate analytics.
A couple of years ago, right after finishing an online data science bootcamp, I decided to try my hand at the House Prices competition. I found it equally fun and frustrating. I became obsessed with cracking the top 100 on the leaderboard. After much struggle, I finally made it. The code can be found here.
I thought it would be fun to revisit this challenge and write an article about it.
After dusting off the code, I found it held up pretty well. It put me in the 130s on the public leaderboard. I figured I’d tweak the code a bit, get back in the top 100 and write my article. Unfortunately, I got stuck just below 110 and found myself trapped in the same cycle:
- Try anything and everything I can think of
- Review other notebooks and try everything everyone else thought of
- Decide my current notebook is a bloated mess that's hard to work with, so I start another one.
Finally, I found another notebook someone graciously posted here which achieved the score I was looking for. It took a while to unpack the code. In doing so, I found some things that worked, but I didn’t really know how they were derived or why they worked. The biggest difference I found was that this notebook focused much more heavily on feature engineering than I did.

After much effort, I finally stumbled upon “3 simple tricks” that I found particularly helpful. I was able to grind my way to the score I wanted while feeling like I actually understood what was going on. Here they are:
- Use an sklearn pipeline and a “train_and_test” function to organize the code.
- Use visualizations and Pandas groupby queries to brainstorm feature engineering ideas.
- Use the tpot library to help brainstorm ideas for jazzing up the pipeline and using more advanced models.
The full model can be found in this Kaggle notebook. But I hope you will take a stab at the competition first, then compare your code to mine. I got a Kaggle score of 0.11229, which at the time of writing is good enough for rank 84 out of 4,742 entries.
Getting Started
The easiest way to get started on the competition is to join the competition and create a notebook within Kaggle. From the competition page, click the code tab and the New Notebook button.

The first cell in the new notebook will already be populated. If you run the cell it will show you where to get the data. You can then load the data into Pandas DataFrames using the given locations.
sample_submission = pd.read_csv("/kaggle/input/house-prices-advanced-regression-techniques/sample_submission.csv")
train = pd.read_csv("/kaggle/input/house-prices-advanced-regression-techniques/train.csv")
test = pd.read_csv("/kaggle/input/house-prices-advanced-regression-techniques/test.csv")
Next, all we have to do is split the data into the standard X and y for the features and target. Then we will be ready to create the pipeline.
X = train.drop('SalePrice', axis=1)
y = train[['SalePrice']].copy()
y = np.log1p(y)
SalePrice is skewed: a handful of very expensive houses extend the right tail. The log of SalePrice is much closer to a normal distribution.
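If you want to verify this yourself, a quick pair of histograms makes the skew obvious. Here is a minimal sketch, assuming the train DataFrame is loaded as above (the plotting details are just one reasonable choice):

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

fig, axes = plt.subplots(1, 2, figsize=(10, 4))

# Raw SalePrice: long right tail from a few very expensive houses
sns.histplot(train['SalePrice'], ax=axes[0])
axes[0].set_title('SalePrice')

# log1p(SalePrice): much closer to a bell curve
sns.histplot(np.log1p(train['SalePrice']), ax=axes[1])
axes[1].set_title('log1p(SalePrice)')

plt.tight_layout()
plt.show()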

There is an interesting side effect of building a model on the log-transformed target variable. When you use the log of SalePrice as the response variable, the interpretation of the coefficients changes. A one-unit increase in a predictor variable now corresponds to a percentage change in SalePrice, rather than an absolute change.
So, in the log-transformed model, if the coefficient of a predictor variable is 0.01, then a one-unit increase in that predictor is associated with an approximately 1% increase in SalePrice. This means a coefficient can work just as well for a $60,000 house as for a $600,000 house.
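One practical consequence of the log transform: the model’s predictions come back on the log scale, so they need to be inverted with np.expm1 before building a submission. Here is a minimal sketch, assuming a fitted pipeline named pipe (the name is illustrative) and the test and sample_submission DataFrames loaded earlier:

import numpy as np

# Predictions are on the log1p(SalePrice) scale
log_preds = pipe.predict(test)

# Invert the transform to get prices back in dollars
final_preds = np.expm1(log_preds).ravel()

# Fill in the sample submission and save it for Kaggle
submission = sample_submission.copy()
submission['SalePrice'] = final_preds
submission.to_csv('submission.csv', index=False)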
Machine Learning Pipelines: A Key Tool in Model Building
In the realm of machine learning, Sklearn’s pipeline is an indispensable tool that simplifies the process of building and evaluating models. It neatly chains together data transformation steps and the machine learning model in a sequence.

When you fit the pipeline, it seamlessly performs the data transformations before fitting the model with the transformed data.
To demonstrate the usage of pipelines, let’s consider a task. We have our features stored in a DataFrame X and target values in a variable y. Our goal is to create a pipeline that:
- Imputes null values for numerical data with the median
- Imputes null values for text data with the most common value
- Scales the numeric data with StandardScaler
- Uses One Hot Encoding on the text data
Here’s how you can implement it:
from sklearn.pipeline import make_pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

def get_preprocessor(X):
    # Identify numeric columns
    numeric_columns = X.select_dtypes(include=['number']).columns

    # Identify categorical columns
    categorical_columns = X.select_dtypes(include=['object']).columns

    # Create transformers
    numeric_transformer = make_pipeline(
        SimpleImputer(strategy='median'),
        StandardScaler()
    )

    categorical_transformer = make_pipeline(
        SimpleImputer(strategy='most_frequent'),
        OneHotEncoder(handle_unknown='ignore')
    )

    # Combine transformers into a preprocessor step
    return ColumnTransformer(
        transformers=[
            ('num', numeric_transformer, numeric_columns),
            ('cat', categorical_transformer, categorical_columns)
        ]
    )

# The preprocessor can now be used in a pipeline with a final estimator
# model = make_pipeline(get_preprocessor(X), YourModel())
This code has three essential parts:
- Identifying the types of columns: Numeric columns are handled differently from non-numeric ones. We fill null values for numeric columns with the median and for text data with the most common value.
- Creating transformers: We use the make_pipeline function to create a data transformer for each type of column. The numeric transformer imputes values then scales them, and the categorical transformer fills missing data with the most frequent value, then applies One Hot Encoding to the result.
- Combining transformers: We apply different transformers to different columns using the ColumnTransformer (a quick usage sketch follows below).
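To make the flow concrete, here is a minimal usage sketch that drops the preprocessor into a pipeline with a plain LinearRegression (the model choice is just for illustration):

from sklearn.linear_model import LinearRegression

# Chain the preprocessing step with a final estimator
baseline = make_pipeline(get_preprocessor(X), LinearRegression())

# Fitting the pipeline imputes, scales, and encodes X, then fits the model
baseline.fit(X, y)

# predict() runs the same transformations before predicting
preds = baseline.predict(X)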
Next, let’s package this process into a function, train_and_test, which accepts a machine learning model and a data manipulation function as parameters. This allows us to easily test different models and feature engineering approaches.
def train_and_test(model, data_func=None):
    # Use copies so the original data isn't changed
    X_copy = X.copy()
    y_copy = y.copy()
    if data_func:
        data_func(X_copy, y_copy)
    pipe = make_pipeline(
        get_preprocessor(X_copy),
        model
    )
    pipe.fit(X_copy, y_copy)
    evaluate_model(pipe, X_copy, y_copy)
Evaluating the Model with RMSE
💡 RMSE stands for Root Mean Squared Error. Here’s how it works: for each data point, the model’s predicted value is subtracted from the actual value to give the prediction error. Each of these errors is then squared and the results are averaged across all data points. Finally, the square root of this average is taken to give the RMSE.
Because the errors are squared before they are averaged, the RMSE gives a relatively high weight to large errors. This means the RMSE is most useful when large errors are particularly undesirable. Here is the code to evaluate model performance:
from sklearn.model_selection import cross_val_score

def evaluate_model(model, X, y):
    model.fit(X, y)
    # Root Mean Squared Error via 5-fold cross-validation
    rmse_scores = np.sqrt(-cross_val_score(model, X, y,
                                           scoring="neg_mean_squared_error", cv=5))
    rmse_mean = rmse_scores.mean()

    # Calculate R-squared score using cross validation
    r2_scores = cross_val_score(model, X, y, scoring="r2", cv=5)
    r2_mean = r2_scores.mean()

    print(f'mean RMSE with 5 folds: {rmse_mean}')
    print(f'mean R2: {r2_mean}')
    return rmse_mean, r2_mean
The basic idea behind cross-validation is to divide the data into a number of subsets, or ‘folds’.
The model is then trained on all but one of these folds and tested on the remaining fold. This process is repeated with each fold serving as the test set once.
This is often referred to as K-fold cross-validation, where K is the number of folds. Cross-validation gives a better measure of how well your model will perform on unseen data than using a single train-test split.
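With train_and_test and evaluate_model defined, getting a baseline number is a one-liner. A minimal sketch (your exact scores will depend on your preprocessing choices):

from sklearn.linear_model import LinearRegression

# No feature engineering yet: just impute, scale, encode, and cross-validate
train_and_test(LinearRegression())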
Unleashing Exploratory Data Analysis for Feature Engineering
Feature engineering is a crucial phase in the model-building process where you transform existing features and create new ones with the aim of enhancing model performance. A great starting point for feature engineering is to get acquainted with the existing features through Exploratory Data Analysis (EDA). Let’s see how this process can lead us to discover some intriguing insights.

💡 Recommended: Easy Exploratory Data Analysis (EDA) in Python with Visualization
A widely used EDA visualization tool is the heatmap, which provides an overview of feature correlations. Let’s take a closer look at how our features correlate with ‘SalePrice’ – the target feature.
plt.figure(figsize=(4, 10))
sns.heatmap(train.corr(numeric_only=True)[['SalePrice']], annot=True)
plt.title('Correlations with SalePrice')
plt.show()

💡 Recommended: Creating Beautiful Heatmaps with Seaborn
A notable anomaly in this heatmap is the feature ‘OverallCond’, which denotes the overall condition of the house on a scale of 1 to 10 (10 being the best).
Intuitively, we’d expect houses in better condition to fetch higher prices, translating to a strong positive correlation. But surprisingly, ‘OverallCond’ demonstrates a meager correlation of -0.037 with ‘SalePrice’.
This presents an exciting puzzle: can we improve the model’s performance by modifying ‘OverallCond’, crafting a new feature, or simply discarding it? With our pipeline and train_and_test function set up, testing these alternatives is a breeze.
Before we proceed, let’s visualize ‘OverallCond’ vs ‘SalePrice’ on a scatter plot:
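Here is a minimal sketch for that plot, assuming train is still loaded (the styling choices are arbitrary):

import matplotlib.pyplot as plt
import seaborn as sns

# Scatter plot of OverallCond against SalePrice
sns.scatterplot(x=train.OverallCond, y=train.SalePrice, alpha=0.5)
plt.title('OverallCond vs SalePrice')
plt.show()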

The plot seems to suggest a positive correlation, contradicting the correlation matrix. A peek at the histogram of ‘OverallCond’ unveils that the majority of houses have a value of 5.
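A quick count plot shows that concentration at 5 (a minimal sketch):

# Most houses sit at OverallCond == 5
sns.countplot(x=train.OverallCond)
plt.title('Distribution of OverallCond')
plt.show()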
💡 Let’s posit a hypothesis: could the age of the house influence how ‘OverallCond’ affects ‘SalePrice’?
Let’s divide our data into older and newer houses (built before and after 1980, respectively) and plot them against ‘SalePrice’.
older_house = X.YearBuilt < 1980
plot = sns.scatterplot(x=X.OverallCond, y=train.SalePrice, hue=older_house)
legend = plot.legend_
legend.set_title("Built before 1980")
plt.show()

Interesting! It appears that for newer houses, ‘OverallCond’ generally receives a default value of 5. For older houses, however, the ‘OverallCond’ rating seems to matter more.
To capitalize on this observation, we’ll create a new feature, ‘HouseAge’, to represent the age of the house, and another, ‘AgeCond’, to capture the interaction between ‘HouseAge’ and ‘OverallCond’.
def house_age(X, y):
    X['HouseAge'] = X.YrSold - X.YearBuilt
    X['AgeCond'] = X.HouseAge * X.OverallCond

train_and_test(LinearRegression(), house_age)
Incorporating these changes leads to a reduction in the RMSE from .1566 to .1562. While most experiments might not bear fruit and successful ones may bring minor improvements, persisting with this iterative process will gradually lead you to a well-performing model.
Error Residuals for Feature Creation
Error residuals, simply referred to as residuals, depict the gap between the actual and predicted values of a data point. In essence, a residual is the part of an observation that your model’s prediction leaves unexplained. In linear regression, it’s calculated as e = y - ŷ, where ‘y’ denotes the observed value and ‘ŷ’ represents the predicted value from your model.

A healthy model ideally has normally distributed and random residuals. By uncovering patterns within these errors, we can pinpoint the model’s blind spots, fueling us with novel feature creation ideas.
To illuminate this, let’s first establish a function to predict:
def generate_predictions(model, data_func=None):
    X_copy = X.copy()
    y_copy = y.copy()
    if data_func:
        data_func(X_copy, y_copy)
    pipe = make_pipeline(
        get_preprocessor(X_copy),
        model
    )
    pipe.fit(X_copy, y_copy)
    predictions = pipe.predict(X_copy)
    return predictions
With predictions in hand, we calculate and visualize the residuals:
predicted_prices = generate_predictions(LinearRegression(), house_age)
residuals = y.SalePrice - predicted_prices.ravel()

plt.plot(range(len(y)), residuals, 'bo', alpha=.5)
plt.title('Error Residuals')
plt.xlabel('House Index')
plt.ylabel('Residual Value')
plt.show()

The larger negative residuals represent cases where the model greatly over-predicted SalePrice. We can look at these houses and see if we can find some new information that will help the model predict lower prices. We are looking for something negative about these houses that the model didn’t see.
A quick scan reveals that these hard-to-predict homes often have an OverallQual rating below 5 and a SaleCondition that is not “Normal”.
train.loc[np.abs(residuals) > 0.4, ['SaleCondition', 'OverallQual', 'SalePrice']]

Utilizing the groupby function of Pandas, we compare median prices for true versus false conditions, ideally spotting substantial price differences with a reasonable record count for each condition:
train.groupby((train.OverallQual < 5)).agg(dict(SalePrice=['median', 'count']))

We can easily modify the code to test similar conditions:
fltr = (train.SaleCondition=='Abnorml') & (train.OverallQual < 5)
train.groupby(fltr).agg(dict(SalePrice=['median', 'count']))

Now we can create a new feature and see if it helps the model.
def create_new_features(X, y):
    X['HouseAge'] = X.YrSold - X.YearBuilt
    X['AgeCond'] = X.HouseAge * X.OverallCond
    X['QuirkyCondition'] = (X.SaleCondition=='Abnorml') & (X.OverallQual < 5)

train_and_test(LinearRegression(), create_new_features)
The results? A tad better RMSE: mean RMSE with 5 folds: 0.1559. Another small victory. After every model modification, the residuals change, granting you another opportunity to analyze and iterate.
Leveraging Integer Encoding for Categorical Features

One Hot encoding is a popular technique for transforming categorical variables into binary features, especially when there’s no inherent order in the categories and their count is relatively small.
However, for ordinal features like OverallQual, where the categories follow a natural progression from “Poor” to “Excellent”, Integer (or Ordinal) Encoding would be more appropriate.
Here’s how to perform Integer Encoding on a feature:
def find_category_mappings(df, variable, target):
    # First, generate an ordered list of the labels,
    # sorted by the median of the target within each category
    ordered_labels = df.groupby([variable])[target].median().sort_values().index
    # Return the dictionary with mappings
    return {k: i for i, k in enumerate(ordered_labels, 0)}

def integer_encode(df, feature):
    mapping = find_category_mappings(train, feature, 'SalePrice')
    df[feature] = df[feature].map(mapping)
The above functions rank feature values based on the median SalePrice, replacing them with their respective ranks. Consequently, unordered categorical features morph into meaningful ordinal features.
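For example, running the mapping function on a quality-type column such as BsmtQual produces a dictionary like the one sketched below; the exact ordering and values depend on the data, so treat the printed output as illustrative only:

# Build the mapping for one feature and inspect it
mapping = find_category_mappings(train, 'BsmtQual', 'SalePrice')
print(mapping)
# e.g. {'Fa': 0, 'TA': 1, 'Gd': 2, 'Ex': 3}  (illustrative ordering only)

# Apply it to a copy of the training features
X_demo = X.copy()
integer_encode(X_demo, 'BsmtQual')

Doing this one column at a time gets repetitive, so the next helper simply calls integer_encode on a whole list of quality- and condition-style columns: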
def ordinal_encode_features(X):
    integer_encode(X, 'BsmtQual')
    integer_encode(X, 'BsmtCond')
    # ... lots of others omitted for brevity ...
    integer_encode(X, 'GarageQual')
    integer_encode(X, 'GarageCond')
Ordinal Encoding is particularly useful when a categorical feature has many unique values or when creating interaction terms with that feature.
The ‘Neighborhood’ feature is an excellent case in point. A more affluent neighborhood might have distinctive preferences for various features, which we can capture by creating interaction terms, multiplying the integer-encoded ‘Neighborhood’ field with those features.
from sklearn.linear_model import RidgeCV

def neighborhood_features(X):
    X['Hood2'] = X['Neighborhood'].values
    integer_encode(X, 'Neighborhood')
    # Neighborhood interactions
    X['HoodQual'] = X.Neighborhood * X.OverallQual
    X['HoodQual3'] = X.Neighborhood * X.BsmtQual
    # ... [add the other interaction terms here] ...
    X['HoodRooms'] = X.Neighborhood * X.TotRmsAbvGrd
    X['HoodRooms2'] = X.GrLivArea * X.BedroomAbvGr

def data_prep(X, y):
    X['HouseAge'] = X.YrSold - X.YearBuilt
    X['AgeCond'] = X.HouseAge * X.OverallCond
    X['QuirkyCondition'] = (X.SaleCondition=='Abnorml') & (X.OverallQual < 5)
    ordinal_encode_features(X)
    neighborhood_features(X)

train_and_test(RidgeCV(), data_prep)
And the result? A significant improvement in the RMSE score: mean RMSE with 5 folds: 0.1380.
Note that we used the RidgeCV model this time. Ridge regression is suitable when your data exhibits multicollinearity (high correlations among predictor variables), and it can help mitigate overfitting.
Attempting the same with LinearRegression resulted in an unsatisfactory outcome, indicating it’s time to explore more sophisticated models.
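One note on RidgeCV: it picks its regularization strength alpha by cross-validation, and by default it only tries the values (0.1, 1.0, 10.0). Passing a wider grid can be worthwhile; here is a minimal sketch (the grid is just a reasonable guess, not a tuned choice):

import numpy as np
from sklearn.linear_model import RidgeCV

# Search a log-spaced grid of regularization strengths
ridge = RidgeCV(alphas=np.logspace(-3, 3, 13))

train_and_test(ridge, data_prep)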
Exploring Advanced Models and Transformers Using TPOT
Tree-based Pipeline Optimization Tool (TPOT) is a Python library designed to automate the construction and optimization of machine learning pipelines. It uses genetic programming to ease the process of building complex models, especially beneficial for practitioners with limited machine learning expertise.

TPOT treats pipeline creation as a search problem, exploring various data pre-processing steps, feature selection techniques, model selections, and hyperparameter choices to find the pipeline that maximizes performance on your dataset.
It’s worth noting that running TPOT might take some time, but the insights obtained from its suggestions can be valuable. Particularly, it provides initial values for model hyperparameters, which can offer a significant advantage during the hyperparameter tuning process.
The first step is to create a TPOTRegressor object:
from tpot import TPOTRegressor

tpot = TPOTRegressor(generations=5, population_size=20, verbosity=2)
The TPOTRegressor is designed specifically for regression tasks.
The generations parameter indicates the number of rounds the algorithm should run to find the best pipeline; a higher number typically implies a slower but potentially more accurate outcome.
population_size sets how many candidate pipelines the algorithm keeps in its population each generation, and verbosity sets the level of output information.
Keep in mind that running TPOT can be time-consuming, especially as it’s applied across five cross-validation folds in the train_and_test function.
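One way to feed the data to TPOT is to apply the same preprocessing first and let it search on the resulting numeric matrix; here is a rough sketch under those assumptions (your setup may differ):

from scipy import sparse

# TPOT expects a numeric (and, by default, dense) feature matrix,
# so run the same feature engineering and preprocessing first
X_prep = X.copy()
data_prep(X_prep, y)
X_numeric = get_preprocessor(X_prep).fit_transform(X_prep)
if sparse.issparse(X_numeric):
    X_numeric = X_numeric.toarray()

# This search can take a long time
tpot.fit(X_numeric, y.values.ravel())

# Export the best pipeline TPOT found as a Python script
tpot.export('tpot_best_pipeline.py')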
For instance, here is a TPOT recommendation:
Best pipeline: ExtraTreesRegressor(LassoLarsCV(input_matrix, normalize=False), bootstrap=False, max_features=0.8, min_samples_leaf=1, min_samples_split=3, n_estimators=100)
Interpreting this, you start from the center and work outward. Thus, TPOT suggests a pipeline comprising two steps:
- LassoLarsCV(input_matrix, normalize=False)
- ExtraTreesRegressor(bootstrap=False, max_features=0.8, min_samples_leaf=1, min_samples_split=3, n_estimators=100)
However, there’s a caveat.
A pipeline can only end with a machine learning model, and all previous steps must be transformers. Hence, not all suggestions directly fit the standard Scikit-Learn pipeline structure.
What if TPOT recommends two machine learning models in its pipeline?
You can stack them. 👇
Stacking Machine Learning Models
Stacking is a technique where the predictions of individual models are used as input for a final model (also known as a meta-learner) to make the final prediction. Scikit-Learn offers a StackingRegressor for this purpose.

To use the StackingRegressor, we first need to initialize the base models and the final model. Here’s an example:
from sklearn.ensemble import ExtraTreesRegressor, StackingRegressor
from sklearn.linear_model import LassoLarsCV, LinearRegression

# Initialize the base models
base_models = [
    ('lassolarscv', LassoLarsCV(normalize=False)),
    ('extratrees', ExtraTreesRegressor(bootstrap=False, max_features=0.8,
                                       min_samples_leaf=1, min_samples_split=3,
                                       n_estimators=100))
]

# Initialize the final model
final_model = LinearRegression()

# Create the stacking regressor
stack2 = StackingRegressor(
    estimators=base_models,
    final_estimator=final_model
)

train_and_test(stack2, data_prep)
Using this model, the RMSE has dropped to .1294, a pretty significant improvement.
Adding Scalers and Feature Selectors to the Pipeline
Machine learning pipelines can incorporate scalers and feature selectors for improved results.
- Scalers transform the features onto a common scale (for example, standardizing to unit variance or squeezing values into a fixed minimum and maximum range), which improves the performance of some machine learning models.
- Feature selectors, on the other hand, reduce the dimensionality of the data by keeping only the most informative features (see the sketch below).
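To see what these two transformers do on their own, here is a small standalone sketch on just the numeric columns (kept simple on purpose; in the pipeline below they operate on the full preprocessed matrix):

from sklearn.preprocessing import RobustScaler
from sklearn.feature_selection import VarianceThreshold

# Illustration on the numeric columns only, with nulls filled for simplicity
numeric = X.select_dtypes(include=['number']).fillna(0)

# RobustScaler centers on the median and scales by the IQR,
# so outliers have less influence than with StandardScaler
scaled = RobustScaler().fit_transform(numeric)

# VarianceThreshold drops features whose variance falls below the threshold
selector = VarianceThreshold(threshold=0.028)
selected = selector.fit_transform(scaled)

print(f'features before selection: {scaled.shape[1]}')
print(f'features after selection:  {selected.shape[1]}')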

Here is a recommendation from tpot that includes a scaler:
Best Pipeline: XGBRegressor(ElasticNetCV(RobustScaler(input_matrix), l1_ratio=0.1, tol=0.001), learning_rate=0.1, max_depth=9, min_child_weight=6, n_estimators=100, n_jobs=1, objective=reg:squarederror, subsample=0.35000000000000003, verbosity=0)
And here’s one that recommends a feature selector:
Best pipeline: RandomForestRegressor(VarianceThreshold(LassoLarsCV(input_matrix, normalize=False), 0.028), bootstrap=False, max_features=0.4, min_samples_leaf=9, min_samples_split=19, n_estimators=100)
Let’s try out these ideas.
from sklearn.preprocessing import RobustScaler
from sklearn.feature_selection import VarianceThreshold

def train_and_test(model, data_func=None):
    # Use copies so the original data isn't changed
    X_copy = X.copy()
    y_copy = y.copy()
    if data_func:
        data_func(X_copy, y_copy)
    pipe = make_pipeline(
        get_preprocessor(X_copy),
        RobustScaler(),
        VarianceThreshold(.028),
        model
    )
    pipe.fit(X_copy, y_copy)
    evaluate_model(pipe, X_copy, y_copy)

train_and_test(stack2, data_prep)
Another improvement!
Conclusion
This blog post has delved into several powerful tools and strategies that I leveraged to improve my ranking in the Kaggle House Prices competition. Here, we revisited:
- The use of Pipelines and a robust “train_and_test” function to streamline the model training and evaluation process, fostering cleaner, more manageable code.
- The exploration of the Pandas and Seaborn libraries for brainstorming and creating new features. Data visualization, summary statistics, and feature engineering are crucial in building a comprehensive understanding of your dataset and in finding innovative ways to extract more predictive power from it.
- The deployment of TPOT, a Python Automated Machine Learning tool that optimizes machine learning pipelines using genetic programming. It’s a great resource to generate ideas for models, transformers, and pipeline configurations.
The key is to foster a productive cycle of idea generation and rapid testing. Ensuring a clean and organized codebase can significantly ease this process. It might be a bit challenging initially, as it was for me, especially when dealing with bloated notebooks that seem impossible to debug or optimize.
However, with perseverance and the right approach, you can turn this into an enjoyable and highly rewarding journey.
Over time, you will find yourself becoming more adept at navigating through these challenges and devising effective solutions, leading to better results and a deeper understanding of machine learning concepts.
Also check out my other article you’ll probably enjoy:
💡 Recommended: How I Scattered My Fat with Python – Scraping and Analyzing My Nutrition Data From Cronometer.com