Project: Machine Learning III

A deeper look at housing prices in Ames, Iowa. Our journey from data cleansing to model fitting to final submission.


The Harlem GlobePlotters

A team of data scientists aimed at conquering the worlds of model fitting, multiple linear regression, gradient boosting & putting an end to large residuals.


Our Process

We began our project by brainstorming our objective, importing our data & exploring its intricacies. We ran through countless iterations, had many late nights & even a small mutiny.

Through the process, we worked hard, became good friends & more importantly, created an awesome Machine Learning Model.

Normalization of SalePrice

A review of SalePrice shows a right-skewed distribution; taking the log of SalePrice gives us a much more normal distribution to model.
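A minimal sketch of that transformation, assuming the Kaggle Ames training set has been saved locally as train.csv (the filename and column names follow the competition data):

```python
import numpy as np
import pandas as pd

# Assumes the Kaggle Ames training set has been saved locally as train.csv
train = pd.read_csv("train.csv")

# SalePrice has a long right tail; log1p pulls it toward a normal shape
train["LogSalePrice"] = np.log1p(train["SalePrice"])

# Skewness drops substantially after the transform
print(train[["SalePrice", "LogSalePrice"]].skew())
```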

Impact of GrLivArea and Lot size on Sale Price

A quick scatter plot gives us clues to possible outliers & where they sit in our data.
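Something along these lines, assuming the same train.csv; the 4,000 sq ft / $300,000 cutoffs at the end are only illustrative thresholds for flagging candidates, not the ones we settled on:

```python
import pandas as pd
import matplotlib.pyplot as plt

train = pd.read_csv("train.csv")  # assumed Kaggle Ames training file

# Side-by-side scatter plots of living area and lot size against SalePrice
fig, axes = plt.subplots(1, 2, figsize=(12, 5))
axes[0].scatter(train["GrLivArea"], train["SalePrice"], alpha=0.4)
axes[0].set_xlabel("GrLivArea (sq ft)")
axes[0].set_ylabel("SalePrice")
axes[1].scatter(train["LotArea"], train["SalePrice"], alpha=0.4)
axes[1].set_xlabel("LotArea (sq ft)")
axes[1].set_ylabel("SalePrice")
plt.tight_layout()
plt.show()

# Large homes that sold cheaply are the usual outlier suspects (illustrative cutoffs)
suspects = train[(train["GrLivArea"] > 4000) & (train["SalePrice"] < 300000)]
print(suspects[["Id", "GrLivArea", "SalePrice"]])
```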

Impact of GarageCars and YearRemodel

Missingness

We then turned to dealing with missing values in our data, weighing KNN imputation, mean values & each feature's importance.

Through this process, we increased our model's accuracy by giving it a more normalized dataset without missing values.
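A minimal sketch of the KNN piece, assuming the same train.csv and imputing only the numeric columns; the 5-neighbor setting is illustrative rather than a tuned value:

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

train = pd.read_csv("train.csv")  # assumed Kaggle Ames training file

# KNN imputation applies to numeric columns; categoricals are handled separately
numeric = train.select_dtypes(include=[np.number])

imputer = KNNImputer(n_neighbors=5)  # illustrative neighbor count
imputed = pd.DataFrame(imputer.fit_transform(numeric), columns=numeric.columns)

print(int(numeric.isna().sum().sum()), "missing values before ->",
      int(imputed.isna().sum().sum()), "after")
```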

Ridge Regression

Through this process, we learned our model performed relatively well using the ridge approach.
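A sketch of a ridge fit with cross-validation, assuming the same train.csv; for brevity it uses only the numeric columns with median fills and models log SalePrice, which is simpler than our full pipeline:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

train = pd.read_csv("train.csv")  # assumed Kaggle Ames training file

# Numeric-only features with simple median fills, target on the log scale
X = train.select_dtypes(include=[np.number]).drop(columns=["Id", "SalePrice"])
X = X.fillna(X.median())
y = np.log1p(train["SalePrice"])

# Standardize, then search an alpha grid inside RidgeCV
ridge = make_pipeline(StandardScaler(), RidgeCV(alphas=np.logspace(-3, 3, 25)))
scores = cross_val_score(ridge, X, y, scoring="neg_root_mean_squared_error", cv=5)
print("CV RMSE (log scale):", -scores.mean())
```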

XG Boost

~ Built-in L1 & L2 regularization, which reduces overfitting
~ Built-in routine for handling missing values
~ Feature importance
~ Parameter tuning is a must
Root Mean Square Error: 0.04703920131579207
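A tuning sketch under the same numeric-only assumptions as the ridge example; these hyperparameter values are illustrative starting points, not our final settings:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import cross_val_score
from xgboost import XGBRegressor

train = pd.read_csv("train.csv")  # assumed Kaggle Ames training file

# XGBoost handles NaNs natively, so no imputation is needed here
X = train.select_dtypes(include=[np.number]).drop(columns=["Id", "SalePrice"])
y = np.log1p(train["SalePrice"])

model = XGBRegressor(
    n_estimators=1000,      # illustrative values; tune with a grid or random search
    learning_rate=0.05,
    max_depth=3,
    subsample=0.8,
    colsample_bytree=0.8,
    reg_alpha=0.1,          # L1 regularization
    reg_lambda=1.0,         # L2 regularization
)

scores = cross_val_score(model, X, y, scoring="neg_root_mean_squared_error", cv=5)
print("CV RMSE (log scale):", -scores.mean())
```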

Stacking

Stacking models creates a more robust solution, adjusting for overfitting & bias when employed properly.
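A stacking sketch using scikit-learn's StackingRegressor, again on numeric features only; the choice of base learners and the ridge meta-learner here is illustrative rather than our exact ensemble:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import StackingRegressor
from sklearn.linear_model import LassoCV, RidgeCV
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from xgboost import XGBRegressor

train = pd.read_csv("train.csv")  # assumed Kaggle Ames training file

X = train.select_dtypes(include=[np.number]).drop(columns=["Id", "SalePrice"])
X = X.fillna(X.median())
y = np.log1p(train["SalePrice"])

# Out-of-fold predictions from the base learners feed a simple ridge meta-learner
stack = StackingRegressor(
    estimators=[
        ("lasso", make_pipeline(StandardScaler(), LassoCV())),
        ("xgb", XGBRegressor(n_estimators=500, learning_rate=0.05, max_depth=3)),
    ],
    final_estimator=RidgeCV(),
    cv=5,
)

scores = cross_val_score(stack, X, y, scoring="neg_root_mean_squared_error", cv=5)
print("CV RMSE (log scale):", -scores.mean())
```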

Thank You

For your time.

       

Aaron Festinger
Xingwang Chen
Yan Mu
Alex Guyton

I. Exploratory Data Analysis

Data Type Analysis
Missingness Handling
Correlation Analysis
Feature Engineering
Variable Removal

 

II. Baseline Modeling

Preprocessing
Linear Regression
Cross Validation

 

IV. Secondary Modeling

Linear Boosting
XG Boost
Random Forest
GBM

 

V. Conclusion

Correlation between numerical values

Correlation gives us a quick look at all the numerical features and their relationships with each other.
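A quick sketch of that look, assuming the same train.csv and using seaborn for the heatmap:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

train = pd.read_csv("train.csv")  # assumed Kaggle Ames training file

# Pairwise correlations across the numeric columns
corr = train.select_dtypes(include=[np.number]).corr()

# Features most correlated with SalePrice
print(corr["SalePrice"].sort_values(ascending=False).head(10))

plt.figure(figsize=(12, 10))
sns.heatmap(corr, cmap="coolwarm", center=0)
plt.show()
```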

Lasso Approach

The Lasso regression approach gave us a relatively well-behaved linear fit, with a few abnormalities in the residuals.

Through this process, we learned our model performed relatively well using the lasso approach.
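A lasso sketch matching the ridge example's setup (numeric columns, median fills, log target); LassoCV picks the penalty by cross-validation, and the zeroed coefficients show its built-in feature selection:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LassoCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

train = pd.read_csv("train.csv")  # assumed Kaggle Ames training file

X = train.select_dtypes(include=[np.number]).drop(columns=["Id", "SalePrice"])
X = X.fillna(X.median())
y = np.log1p(train["SalePrice"])

# Standardize, then let LassoCV choose alpha by 5-fold cross-validation
lasso = make_pipeline(StandardScaler(), LassoCV(cv=5))
lasso.fit(X, y)

# Lasso drives some coefficients to exactly zero, giving implicit feature selection
coefs = pd.Series(lasso[-1].coef_, index=X.columns)
print("features kept:", int((coefs != 0).sum()), "of", len(coefs))
```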

ElasticNet

We then turned to ElasticNet to see whether we could get a better result by combining the Lasso & Ridge penalties.

Surprisingly, Lasso gave us a better result than ElasticNet in terms of residuals, as seen below.
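An ElasticNet sketch under the same assumptions; the l1_ratio grid sweeps the mix between the Ridge-like (near 0) and Lasso-like (1.0) ends of the penalty:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import ElasticNetCV
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

train = pd.read_csv("train.csv")  # assumed Kaggle Ames training file

X = train.select_dtypes(include=[np.number]).drop(columns=["Id", "SalePrice"])
X = X.fillna(X.median())
y = np.log1p(train["SalePrice"])

# ElasticNetCV searches both the penalty strength and the L1/L2 mix
enet = make_pipeline(
    StandardScaler(),
    ElasticNetCV(l1_ratio=[0.1, 0.5, 0.7, 0.9, 0.95, 1.0], cv=5),
)

scores = cross_val_score(enet, X, y, scoring="neg_root_mean_squared_error", cv=5)
print("CV RMSE (log scale):", -scores.mean())
```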

—————————————————————

Aaron Festinger: aaron.festinger@gmail.com

Xingwang Chen: xingw.chen@gmail.com

Alex Guyton: thealexguyton@gmail.com

Yan Mu: dannamu2017@gmail.com

—————————————————————