There are three components to the project:
- Estimate the sale price of properties based on their “fixed” characteristics, such as neighborhood, lot size, number of stories, etc.
- Estimate the impact of possible renovations to properties from the variation in sale price not explained by the fixed characteristics.
- Determine the features in the housing data that best predict “abnormal” sales (foreclosures, etc.).
This project uses the Ames housing data recently made available on Kaggle.
Below are the goals for section 1:
- Perform any cleaning, feature engineering, and EDA deemed necessary.
- Be sure to remove any houses that are not residential from the dataset.
- Identify fixed features that can predict price.
- Train a model on pre-2010 data and evaluate its performance on the 2010 houses.
- Characterize model. How well does it perform? What are the best estimates of price?
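The residential filter and the year-based split in the goals above could be sketched roughly as follows. This is a minimal illustration on a toy dataframe, not the project's actual code: `MSZoning`, `YrSold`, and `SalePrice` are column names from the Ames data, while the helper names and the choice of which zoning codes count as residential are assumptions.

```python
import pandas as pd

# Toy stand-in for the housing data (values are made up for illustration).
toy = pd.DataFrame({
    "MSZoning": ["RL", "RM", "C (all)", "FV", "RL", "RH"],
    "YrSold": [2006, 2008, 2009, 2009, 2010, 2010],
    "SalePrice": [208500, 181500, 90000, 223500, 250000, 143000],
})

# Assumption: these zoning codes are the ones treated as residential.
RESIDENTIAL_ZONES = ["RH", "RL", "RP", "RM", "FV"]


def keep_residential(df, zone_col="MSZoning"):
    """Drop rows whose zoning code is not in the residential list."""
    return df[df[zone_col].isin(RESIDENTIAL_ZONES)].copy()


def split_by_sale_year(df, test_year=2010, year_col="YrSold"):
    """Return (train, test): sales before test_year, and sales in test_year."""
    train = df[df[year_col] < test_year].copy()
    test = df[df[year_col] == test_year].copy()
    return train, test


residential = keep_residential(toy)
train, test = split_by_sale_year(residential)
print(len(residential), len(train), len(test))
```

Splitting on the sale year rather than at random keeps the evaluation honest: the model never sees any 2010 sale during training.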
What are the costs and benefits of renovate-able features such as quality, condition, and renovations? One way to isolate the effect of the renovate-able features on the sale price is to use the residuals from the first model as the target variable in a second model. The residuals from the first model (training and testing) represent the variance in price unexplained by the fixed characteristics. We will use the renovate-able features as the predictors for this second model.
Below are the goals for section 2:
- Use the renovate-able features in the data as predictors for our second model. This second model will predict the residuals from the first model.
- Assess how well our chosen model has done.
- Identify which renovate-able features are the most important in our second model. Articulate our findings and make sense of the results.
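The two-stage idea (fit on fixed features, then model the residuals with renovate-able features) can be illustrated with a small sketch. Everything here is synthetic: `X_fixed`, `X_reno`, `stage1`, and `stage2` are hypothetical names, and plain linear regression is used only to keep the example short, not because it is the model chosen in this project.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 200

# Synthetic stand-ins: two "fixed" features and one "renovate-able" feature.
X_fixed = rng.normal(size=(n, 2))
X_reno = rng.normal(size=(n, 1))
y = (3.0 * X_fixed[:, 0] - 1.0 * X_fixed[:, 1]
     + 0.5 * X_reno[:, 0] + rng.normal(scale=0.1, size=n))

# Stage 1: explain the target with the fixed features only.
stage1 = LinearRegression().fit(X_fixed, y)
residuals = y - stage1.predict(X_fixed)

# Stage 2: explain what stage 1 missed using the renovate-able feature.
stage2 = LinearRegression().fit(X_reno, residuals)
print(stage2.coef_)
```

In this toy setup the stage 2 coefficient lands close to the 0.5 used to generate the data, because the renovate-able feature is independent of the fixed ones. On real data the two groups of features are usually correlated, so the stage 2 coefficients only capture the part of the effect the fixed features have not already absorbed.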
The SaleCondition feature indicates the circumstances of the house sale. From the data file, we can see that the possibilities are:
- Normal: Normal Sale
- Abnorml: Abnormal Sale (trade, foreclosure, short sale)
- AdjLand: Adjoining Land Purchase
- Alloca: Allocation (two linked properties with separate deeds, typically condo with a garage unit)
- Family: Sale between family members
- Partial: Home was not completed when last assessed (associated with New Homes)
We want to find out whether we can reliably identify which features, if any, predict “abnormal” sales (foreclosures, short sales, etc.).
The goal for section 3 is to determine which features predict the Abnorml category in the SaleCondition feature.
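Framing this as a classification problem means turning SaleCondition into a binary target. A minimal sketch, assuming a one-vs-rest encoding in which every category other than Abnorml is treated as the negative class (the variable names are illustrative):

```python
import pandas as pd

# Toy SaleCondition values; the real column comes from the housing data.
sale_condition = pd.Series(
    ["Normal", "Abnorml", "Partial", "Normal", "Family", "Abnorml"],
    name="SaleCondition",
)

# 1 for abnormal sales, 0 for every other sale condition.
is_abnormal = (sale_condition == "Abnorml").astype(int)

print(is_abnormal.tolist())
print(is_abnormal.mean())  # share of abnormal sales in the toy sample
```

Checking the share of positives early is worthwhile, since abnormal sales are a minority class and plain accuracy can look good even for a model that never predicts them.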
What do we know about our dataset? Our dataset is 1450 x 81. Loading the dataset in pandas, we can easily see that there are multiple nulls in our dataframe. Some columns are of int/float type but should be treated as categorical. Based on the size of our dataframe, we can set the ratio of training to testing data to around 1280:170 (approximately 12% of the dataset will be used for testing).
The above box plot shows that the dataset is messy, with outliers in multiple columns. As part of this study, I want to see how the models perform on data with outliers, so I will keep these outliers for now. Next, we’ll look at the correlation of our variables through a heatmap.
The above heatmap shows that many of our variables are correlated. Multicollinearity could be an issue if we were to use simple linear regression, as it violates the LINE rule and makes the coefficient estimates unstable. We’ll take a look later by comparing a simple linear regression with regularized regression models using Lasso and Ridge. The regularized models will help address the multicollinearity issue.
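Besides reading the heatmap by eye, the strongly correlated pairs can be listed directly from the correlation matrix. The sketch below uses synthetic data and an arbitrary 0.8 threshold; both are assumptions for illustration, and only the column names are borrowed from the Ames data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
area = rng.normal(1500, 400, size=200)

# Synthetic data: room count is built to track living area closely.
toy = pd.DataFrame({
    "GrLivArea": area,
    "TotRmsAbvGrd": area / 250 + rng.normal(0, 0.5, size=200),
    "YrSold": rng.integers(2006, 2011, size=200),
})

corr = toy.corr()

# Keep only the upper triangle so each pair is listed once.
upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(upper).stack()

# Assumed threshold for "strongly correlated".
strong = pairs[pairs.abs() > 0.8]
print(strong)
```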
Moving on to data cleaning and preparation: for most of the variables, an NA entry is valid data that indicates the lack of that particular feature in the house (e.g. NA in Alley means ‘No Alley Access’). However, below are the 3 rows where NA is not a valid input. Since there are only 3 such rows, I decided to drop them, as doing so will not materially affect the dataset. We’ll also change the int/float variables that should be treated as categorical into object/string so that we can dummy the variables. The rest of the missing values are imputed.
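The cleaning steps described above might look something like the sketch below. It runs on a toy dataframe; the "None" label for missing alleys and the use of the median for imputation are assumptions, since the write-up does not say which imputation strategy was used.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Alley": ["Grvl", np.nan, np.nan, "Pave"],
    "MSSubClass": [20, 60, 20, 30],
    "LotFrontage": [65.0, np.nan, 80.0, 60.0],
})

cleaned = toy.copy()

# NA in Alley is meaningful: it stands for "no alley access".
cleaned["Alley"] = cleaned["Alley"].fillna("None")

# MSSubClass is a numeric code for the dwelling type, so treat it as a category.
cleaned["MSSubClass"] = cleaned["MSSubClass"].astype(str)

# Assumed imputation strategy: fill numeric gaps with the column median.
cleaned["LotFrontage"] = cleaned["LotFrontage"].fillna(cleaned["LotFrontage"].median())

# One-hot encode the categorical columns.
encoded = pd.get_dummies(cleaned, columns=["Alley", "MSSubClass"])
print(sorted(encoded.columns))
```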
Taking a look at our target, there are a few things that stand out.
- Our target has a number of outliers. Cleaning outliers is not part of our scope so we will keep the outliers. We will also see how this impacts the performance of our models.
- Our target variable is skewed, so we will take the log of our target to bring its distribution closer to normal.
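The effect of the log transform on skewness can be checked with a short sketch. The prices here are simulated from a log-normal distribution purely for illustration; they are not the Ames sale prices.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Simulated right-skewed prices (an assumption standing in for SalePrice).
prices = pd.Series(rng.lognormal(mean=12.0, sigma=0.4, size=1000), name="SalePrice")
log_prices = np.log(prices)

print("skew before:", prices.skew())
print("skew after: ", log_prices.skew())

# Predictions made on the log scale are mapped back with the exponential.
recovered = np.exp(log_prices)
```

Any model trained on the log target produces log-scale predictions, so errors such as RMSE need to be read on that scale or converted back with `np.exp` before being quoted in dollars.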
Below are some of the key statistics of our target:
SECTION 1 MODELS
The predictors have a somewhat linear relationship with the target. Our target is a continuous figure. Hence, we will look at regression models. We will be looking at and comparing the following models.
- Simple Linear Regression
- LR with Ridge
- LR with Lasso
- Stochastic Gradient Descent
- Support Vector Regression
Below is a summary of the performance of our models. Comparing the models, we can see that the Ridge, Lasso, and SGDR models performed better than the other two. We can also see that the performance of these three models is relatively close. I am choosing Lasso as the final model because it performs on par with the other two, has the lowest RMSE (though the difference from Ridge is small), and gives the lowest mean residuals for both train and test data. On top of that, Lasso reduces the number of variables used in the model, simplifying the interpretation.
Also interesting to note is the extremely negative cross-validated score we get from the LR model, even though it scores very highly when we only score our test data. This is a clear indication that the model is not stable across folds, most likely due to multicollinearity in our predictors.
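A comparison of this kind can be set up along the following lines. This is a sketch on synthetic data with deliberately near-duplicate columns; the `alpha` values, fold count, and split sizes are assumptions, and the sketch is not expected to reproduce the scores reported here.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
n = 300
base = rng.normal(size=(n, 5))

# Append near-duplicate columns to mimic strongly correlated predictors.
X = np.hstack([base, base + rng.normal(scale=0.01, size=base.shape)])
y = base @ np.array([1.0, 0.5, 0.0, -0.5, 2.0]) + rng.normal(scale=0.3, size=n)

X_train, X_test = X[:250], X[250:]
y_train, y_test = y[:250], y[250:]

# Hypothetical hyperparameters, chosen only for the illustration.
models = {
    "linear": LinearRegression(),
    "ridge": Ridge(alpha=1.0),
    "lasso": Lasso(alpha=0.01),
}

results = {}
for name, model in models.items():
    # Scale inside the pipeline so each CV fold is standardized on its own.
    pipe = make_pipeline(StandardScaler(), model)
    cv_r2 = cross_val_score(pipe, X_train, y_train, cv=5, scoring="r2")
    pipe.fit(X_train, y_train)
    rmse = np.sqrt(mean_squared_error(y_test, pipe.predict(X_test)))
    results[name] = {
        "cv_r2_mean": cv_r2.mean(),
        "cv_r2_std": cv_r2.std(),
        "test_rmse": rmse,
    }

for name, row in results.items():
    print(name, row)
```

Reporting both the mean and the spread of the cross-validated scores alongside the held-out RMSE makes it easier to spot a model that looks good on a single split but is unstable across folds.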
Our chosen model, LR with Lasso, was able to halve the number of coefficients used in the model, down to 72. As we are using the log of SalePrice, the coefficients indicate the approximate percentage increase in SalePrice for every standard deviation increase in the predictor.
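The reading of log-scale coefficients as percentages is an approximation that holds for small coefficients. A short sketch of the exact conversion, using made-up coefficient values and a hypothetical helper name:

```python
import math


def pct_change_from_log_coef(beta):
    """Exact percentage change in the target for a one-unit increase in a
    predictor, when the model is fit on the natural log of the target."""
    return 100.0 * (math.exp(beta) - 1.0)


# Illustrative coefficients, not values taken from the fitted model.
for beta in (0.036, 0.10, 0.30):
    print(beta, round(pct_change_from_log_coef(beta), 2))
```

For a coefficient of 0.036 the exact figure is about 3.67 percent, very close to the 3.6 percent shortcut, while for larger coefficients the gap between the shortcut and the exact value widens.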
Interpreting the result, we can see from the coefficient graph on the left that the above-ground living area (GrLivArea) has the most impact on our target. The effect of the next most impactful variable is only half as large. The number of cars that can be parked in the garage is another important factor that determines the price of property in Ames. A few neighbourhoods were shown to have an impact on pricing, with North Ridge Heights having the most positive effect: being in the area gives properties a 3.6 percent increase in SalePrice.
SECTION 2 MODELS
Similar to section 1, we will also be looking at regression models. We will try out the following models.
- Simple Linear Regression
- LR with Ridge
- LR with Lasso
- Stochastic Gradient Descent Regression
- Support Vector Regression
Below is a summary of the performance of our models. Comparing the models, again Ridge, Lasso, and SGDR performed better than the other two. Of the three that performed better, the SGDR model gave the best results. It had the highest cross-validated score with the lowest variance in those scores. Its RMSE on both the train and test data is comparable to that of both Ridge and Lasso.
SGDR was able to reduce the number of coefficients used in the model down to a third.
The coefficients indicate how the renovate-able features relate to the errors of our section 1 model. A positive coefficient indicates underpricing by the section 1 model and a negative coefficient indicates overpricing.
Interpreting the result, we can see that the overall quality and overall condition of the property could explain some of the errors in our section 1 model. Our section 1 model overprices houses with worse condition and quality and underprices houses with better to excellent condition and quality.
Given the low scores from our model, I am not confident that this is a model we can trust. However, in a future study on this dataset we can explore adding the section 2 variables with the highest absolute coefficients to our section 1 model.
SECTION 3 MODELS
This section differs from sections 1 and 2 in that we will be implementing classification models.
I used the following models:
- KNN Classification
- Simple Logistic Regression
- Logistic Regression with L1 Penalty
- Logistic Regression with L2 Penalty
- Linear SVC
Based on the above, the logistic regression with the L1 (lasso) penalty performed best among our models. It gave us a good score as well as the highest true positive rate. It also gave the lowest number of false negatives and false positives among our models.
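The true positive, false positive, and false negative counts used in this comparison come from a confusion matrix. A minimal sketch with an L1-penalized logistic regression on synthetic data follows; the `liblinear` solver, `C=0.5`, and the simple positional split are assumptions rather than the settings used in this project.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix

rng = np.random.default_rng(7)
n = 400
X = rng.normal(size=(n, 6))

# Synthetic labels: only the first two columns carry any signal.
logits = 2.0 * X[:, 0] - 1.5 * X[:, 1] - 1.0
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-logits))).astype(int)

X_train, X_test = X[:300], X[300:]
y_train, y_test = y[:300], y[300:]

# The L1 penalty can shrink uninformative coefficients all the way to zero.
clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.5)
clf.fit(X_train, y_train)

# With labels=[0, 1], ravel() yields counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_test, clf.predict(X_test), labels=[0, 1]).ravel()
print("tn", tn, "fp", fp, "fn", fn, "tp", tp)
print("coefficients:", clf.coef_.round(2))
```

Because abnormal sales are the minority class, the false negative count is often the more informative number to compare across models than overall accuracy.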
Based on the above, we can see that houses in North Ridge are more likely to result in an abnormal sale than others. Older houses (MSSubClass 30) and split or multi-level properties are also more likely to result in an abnormal sale.
On the other hand, newly constructed houses are the least likely to end up in an abnormal sale. Sales with a warranty deed (conventional) are also less likely to be abnormal sales.
The code for this project is available in my GitHub repo: link.