Visualize To Realize: Leveraging EDA For Effective ML

Ready for a deep-dive into the exciting realm of data interpretation ? 📊 Today, Multi-Variate Regression is on the menu. Sounds complex ? Well, by the time we’re done, it’ll feel like a walk in the park 🌳🚶‍♂️

We’ll be looking at how to train a Multi-Variate Linear Regression model, demystifying those complex mathematical equations, and transforming them into simple, understandable concepts. And the best part ? We’ll be visualizing our results, turning abstract numbers into vibrant, understandable graphics 📈

Buckle-up because we’re about to roll up our sleeves and dive into implementing Multi-Variate Linear Regression from scratch using Pytorch 💪

If you haven’t read the previous article of this series, feel free to explore it !

The Dataset: CarDekho

In this study, we will be using the CarDekho dataset by trying to predict car prices. The dataset contains the following features 🔬:

the name of the car
the year it was released
the selling price
the present price
the number of kilometers driven
the type of fuel used (petrol or diesel)
the transmission type (manual or automatic)
the number of times the car has changed hands

Each of these factors significantly contributes to a car’s price, and we will use this data to build a predictive model 🤖

Exploratory Data Analysis

Exploratory Data Analysis (EDA) is a critical preliminary step that involves summarizing the main characteristics of a dataset through visual methods or statistical summaries 📊

It is essential for Machine Learning as it allows us to understand the underlying structure of the data, identify outliers, discover patterns, and test assumptions using powerful visual or quantitative methods, thus providing valuable insights for building efficient and accurate predictive models 🧠💡

First let’s import the necessary packages:

import os
import sys
import torch
import pandas as pd
import numpy as np
import seaborn as sns
import statsmodels.api as sm
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split

Now let’s read the data into a Pandas dataframe. You can download the dataset with this Kaggle link.

df = pd.read_csv(os.path.join('data', 'car_data.csv'))
df.head()

	Car_Name	Year	Selling_Price	Present_Price	Kms_Driven	Fuel_Type	Seller_Type	Transmission
0	ritz	2014	3.35	5.59	27000	Petrol	Dealer	Manual
1	sx4	2013	4.75	9.54	43000	Diesel	Dealer	Manual
2	ciaz	2017	7.25	9.85	6900	Petrol	Dealer	Manual
3	wagon r	2011	2.85	4.15	5200	Petrol	Dealer	Manual
4	swift	2014	4.60	6.87	42450	Diesel	Dealer	Manual

We need to convert the categorical columns as integer ids:

f_continuous = df[['Year', 'Selling_Price', 'Present_Price', 'Kms_Driven', 'Owner']]
f_categorical = pd.get_dummies(df[['Fuel_Type', 'Seller_Type', 'Transmission']])
df = pd.concat([f_continuous, f_categorical], axis=1)

# Drop refundant features
df.drop(['Transmission_Automatic', 'Seller_Type_Dealer', 'Fuel_Type_CNG'], axis=1, inplace=True)
df.head()

	Year	Selling_Price	Present_Price	Kms_Driven	Fuel_Type_Diesel	Fuel_Type_Petrol	Transmission_Manual
0	2014	3.35	5.59	27000	0	1	1
1	2013	4.75	9.54	43000	1	0	1
2	2017	7.25	9.85	6900	0	1	1
3	2011	2.85	4.15	5200	0	1	1
4	2014	4.60	6.87	42450	1	0	1

Individual Feature Distribution

Let’s visualize the individual feature distributions to understand the range, central tendencies, and spread of our data, which in turn helps us identify any skewness, outliers, or anomalies that could impact our model’s performance 📈🔎

df.hist(bins=14, color='steelblue', edgecolor='black', linewidth=1.0, xlabelsize=8, ylabelsize=8, grid=False)
plt.tight_layout(rect=(0, 0, 1.2, 1.2))

For these bar plots, we can see that most cars on sales are:

consuming petrol instead of diesel
had only one owner
are from the 2012-present time range
are manual
sold between 1 and 10, 100 000 Indian Rupees

Heatmap correlation

Let’s introduce our most colorful friend: the Heatmap Correlation ! 🌈 It is a graphical representation that showcases the correlation or relationship between different variables or features in a dataset using color-coded cells.

When two variables are highly correlated, they exhibit a value close to 1 (or -1 for inverse correlation), while a value near 0 indicates a lack of correlation between the variables.

A Heatmap Correlation, in machine learning, help us understand the relationships and dependencies between different features, thus providing a way to identify patterns and make informed decisions during the model-building process 🔧

plt.figure(figsize=(16, 8))
sns.heatmap(df.corr(), square= True, annot=True, fmt='.2f')

<matplotlib.axes._subplots.AxesSubplot at 0x1f5bb762b08>

The correlation between the variables and the selling price is the most useful information for us as it is what we want to predict 🚗💵

Most variables have a strong correlation, except for kilometers driven and number of owners, which are less connected. We could have deleted them, but since they matter in car-buying decisions, we’ll keep them in our analysis 🔍

Pairwise Plots

A pairwise plot, displays a grid of scatter plots, where each plot shows the relationship between the distribution of two variables.

Pairwise plots are useful because they allow us to quickly assess the patterns between variables.

cols_viz = ['Kms_Driven', 'Year', 'Selling_Price', 'Present_Price']
pp = sns.pairplot(df[cols_viz], height=1.8, aspect=1.8,
                  plot_kws=dict(edgecolor="k", linewidth=0.5),
                  diag_kind="kde", diag_kws=dict(shade=True))

fig = pp.fig 
fig.subplots_adjust(top=0.93, wspace=0.3)
t = fig.suptitle('Wine Attributes Pairwise Plots', fontsize=14)

Here, we can see that there’s one of two outliers. An outlier is a data point deviates or differs from the majority of other data points in a dataset.

The year feature has a polynomial correlation with the selling price so a polynomial regression will most likely outperform a standard linear regression.

Instead, the present price has a linear relationship with the present price.

Model Training and Testing

Create training data partition

The train-validation-test split is a process of dividing a dataset into three distinct subsets: the training set, the validation set, and the test set.

The training set is used to train the model, the validation set is used to tune model parameters and evaluate performance during development, and the test set is used to assess the final model’s performance on unseen data.


# Separate the target from the dataFrame
Y = df['Selling_Price']
X = df.drop('Selling_Price', axis=1)

# Convert data to Pytorch tensor
X_t = torch.from_numpy(X.to_numpy()).float()
Y_t = torch.from_numpy(Y.to_numpy()).float().unsqueeze(1)
X_train, X_test, Y_train, Y_test = train_test_split(X_t, Y_t, test_size=0.33, random_state=42)

Multi-Variate Linear Regression

Alright, it’s time to dive into some math now ! 📐

Training a linear model using least square regression is equivalent to minimize the mean squared error:

\[\begin{align} \text{Mse}(\boldsymbol{\hat{y}}, \boldsymbol{y}) &= \frac{1}{n}\sum_{i=1}^{n}{||\hat{y}_i - y_i ||_{2}^{2}} \\ &= \frac{1}{n}||\boldsymbol{X}\boldsymbol{w} - \boldsymbol{y} ||_2^2 \end{align}\]

where \(n\) is the number of samples, \(\hat{y}\) is the predicted value of the model and \(y\) is the true target. The prediction \(\hat{y}\) is obtained by matrix multiplication between the input \(\boldsymbol{X}\) and the weights of the model \(\boldsymbol{w}\).

Minimizing the \(\text{Mse}\) can be achieved by solving the gradient of this equation equals to zero in regards to the weights \(\boldsymbol{w}\):

\[\begin{align} \nabla_{\boldsymbol{w}}\text{Mse} &= 0 \\ (\boldsymbol{X}^\top \boldsymbol{X})^{-1}\boldsymbol{X}^\top \boldsymbol{y} &= \boldsymbol{w} \end{align}\]

For more information on how to find \(\boldsymbol{w}\) please visit the following link.

Building the Model

Let’s build the model using Pytorch:

def add_ones_col(X):
    """Add a column a one to the input torch tensor"""
    x_0 = torch.ones((X.shape[0],), dtype=torch.float32).unsqueeze(1)
    X = torch.cat([x_0, X], dim=1)
    return X

def multi_linear_reg(X, y):
    """Multivariate linear regression function
    
    Args:
        X: A torch tensor for the data.
        y: A torch tensor for the labels.
    """
    X = add_ones_col(X)  # Add a column of ones to X to agregate the bias to the input matrices
    Xt_X = X.T.mm(X)
    Xt_y = X.T.mm(y)

    Xt_X_inv = Xt_X.inverse()
    w = Xt_X_inv.mm(Xt_y)
    return w

def prediction(X, w):
    """Predicts a selling price for each input
    
    Args:
        X: A torch tensor for the data.
        w: A torch tensor for the weights of the linear regression mode.
    """
    X = add_ones_col(X)
    return X.mm(w)

Time to train the model 🎉

# Fit the training set into the model to get the weights
w = multi_linear_reg(X_train, Y_train)

# Predict using matrix multiplication with the weights
Y_pred_train = prediction(X_train, w)
Y_pred_test = prediction(X_test, w)

Computing the Prediction Errors

Now let’s crunch some numbers and calculate the metric to assess our model’s performance.

def mse(Y_true, Y_pred):
    error = Y_pred - Y_true
    return (error.T.mm(error) / Y_pred.shape[0]).item()

def mae(Y_true, Y_pred):
    error = Y_pred - Y_true
    return error.abs().mean().item()

mse_train = mse(Y_train, Y_pred_train)
mae_train = mae(Y_train, Y_pred_train)
print('MSE Train:\t', mse_train)
print('MAE Train:\t', mae_train, end='\n\n')

mse_test = mse(Y_test, Y_pred_test)
mae_test = mae(Y_test, Y_pred_test)
print('MSE Test:\t', mse_test)
print('MAE Test:\t', mae_test, end='\n\n')

MSE Train:   2.808985471725464
MAE Train:   1.1321566104888916

MSE Test:    3.7205495834350586
MAE Test:    1.2941011190414429

The model has an \(\text{Mse}\) of 3.72 on average on the test set. It means that, on average, the squared difference between the predicted values and the actual values is 3.72.

The selling price unit is in 100 000 Indian Rupee, which convert to 1 USD = 82,20 INR at the time of writing. It means that on average, the model has an approximate error of 408 USD on the test set.

Principal Component Analysis (PCA) and 3D Visualization

In this section, we will use PCA to reduce the number of feature to two, in order to visualize the plane of the linear regressor.

Suppose a collection of \(m\) points \(\{\boldsymbol{x}^{(1)}, \dots, \boldsymbol{x}^{(m)}\} \in \mathbb{R}^n\). The principal components analysis aims to reduce the dimensionality of the points while losing the least precision as possible. For each point \(\boldsymbol{x}^{(i)} \in \mathbb{R}^n\) we will find a corresponding code vector \(\boldsymbol{c}^{(i)} \in \mathbb{R}^l\) where \(l\) is smaller than \(n\). Let \(f\) be the encoding function and \(g\) be the decoding function and \(\boldsymbol{D} \in \mathbb{R}^{n,l}\) is the decoding matrix whose columns are orthonormal: \[\begin{align} f(\boldsymbol{x}) &= \boldsymbol{D}^\top \boldsymbol{x} \\ g(f(\boldsymbol{x})) &= \boldsymbol{D}\boldsymbol{D}^\top \boldsymbol{x} \end{align}\]

def cov(X):
    """Computes the covariance of the input
    
        The covariance  matrix gives some sense of how much two values are
        linearly related to each other, as well as the scale of these variables.
        It is computed by (1 / (N - 1)) * (X - E[X]).T (X - E[X]).
    
    Args:
        X: A torch tensor as input.
    """
    X -= X.mean(dim=0, keepdim=True)
    fact = 1.0 / (X.shape[0] - 1)
    cov = fact * X.T.mm(X)
    return cov

def pca(X, target_dim=2):
    """Computes the n^th first principal components of the input
    
    PCA can be implemented using the n^th principal components of the covariance matrix.
    We could have been using an eigen decomposition because the covariance matrix is always squared
    but singular value decomposition does also the trick if we take the right singular vectors
    and perform a matrix multiplication to the right.
    
    Args:
        X: A torch tensor as the input.
        target_dim: An integer for selecting the n^th first components.
    """
    cov_x = cov(X)
    
    U, S, V = torch.svd(cov_x)
    transform_mat = V[:, :target_dim]
    X_reduced = X.mm(transform_mat)
    return X_reduced, transform_mat

Let’s project our data in 2D with PCA:

X_test_pca, _ = pca(X_test, target_dim=2)
X_train_pca, _ = pca(X_train, target_dim=2)

points = torch.cat([X_test_pca[:3], Y_pred_test[:3]], axis=1)
v1 = points[2, :] - points[0, :]
v2 = points[1, :] - points[0, :]
cp = torch.cross(v1, v2)
a, b, c = cp
d = cp.dot(points[2, :])

min_mesh_x = min(X_test_pca[:, 0].min(), X_train_pca[:, 0].min())
max_mesh_x = max(X_test_pca[:, 0].max(), X_train_pca[:, 0].max())
min_mesh_y = min(X_test_pca[:, 1].min(), X_train_pca[:, 1].min())
max_mesh_y = max(X_test_pca[:, 1].max(), X_train_pca[:, 1].max())

mesh_x = np.linspace(min_mesh_x, max_mesh_x, 25)
mesh_y = np.linspace(min_mesh_y, max_mesh_y, 25)
mesh_xx, mesh_yy = np.meshgrid(mesh_x, mesh_y)

mesh_zz = (d - a * mesh_xx - b * mesh_yy) / c

Then, we recreate the prediction plane using three random points of the prediction. More information here.

fig = plt.figure(figsize=(25,7))
ax1 = fig.add_subplot(131, projection='3d')
ax2 = fig.add_subplot(132, projection='3d')
ax3 = fig.add_subplot(133, projection='3d')
axes = [ax1, ax2, ax3]

for ax in axes:
    ax.scatter(X_test_pca[:, 0], X_test_pca[:, 1], Y_test, color='red', edgecolor='black')
    ax.scatter(X_train_pca[:, 0], X_train_pca[:, 1], Y_train, color='green', edgecolor='black')
    ax.scatter(mesh_xx.flatten(), mesh_yy.flatten(), mesh_zz.flatten(), facecolor=(0, 0, 0, 0), s=20, edgecolor='#70b3f0')
    
    ax.set_xlabel('1st component', fontsize=12)
    ax.set_ylabel('2nd component', fontsize=12)
    ax.set_zlabel('Selling Price', fontsize=12)
    ax.ticklabel_format(axis="x", style="sci", scilimits=(0,0))
    ax.ticklabel_format(axis="y", style="sci", scilimits=(0,0))
    ax.ticklabel_format(axis="z", style="sci", scilimits=(0,0))
ax1.view_init(elev=60, azim=50)
ax2.view_init(elev=10, azim=0)
ax3.view_init(elev=-15, azim=140)

Good news ! The prediction plane is fitting pretty well the data 😎

Conclusion

So, here’s the bottom line: EDA helped us understand some trends in our data and we managed to build a pretty decent linear regression model !

It turns out that linear regression can work like magic if your data has a lot of linear correlation.

Remember to add a linear model to your baseline for relatively regression tasks in order to compare it to more sophisticated models. Happy data exploring! 🚀📊

Still eager to learn ? You can jump right in the next article of this series on Logistic Regression.