Linear regression is a statistical technique employed to represent the relationship between a dependent variable (the target) and one or more independent variables (the predictors) by fitting a straight line, referred to as the regression line, to the data points.
Equation:
\[y = \beta_0 + \beta_1x_1 + \beta_2x_2 + \dots + \beta_nx_n + \epsilon\]
where,
\(y\) = Dependent variable
\(x_i\) = Independent variables
\(\beta_i\) = Coefficients/weights for the features
\(\epsilon\) = Error term (residual)
The primary objective is to reduce the discrepancy, or error, between the predicted values and the actual observations through methods such as least squares. This approach is extensively utilized in forecasting, trend analysis, and examining the impact of various factors.
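In ordinary least squares, for example, the coefficients are chosen to minimize the sum of squared residuals over the \(m\) observations:
\[\min_{\beta_0, \beta_1, \dots, \beta_n} \sum_{i=1}^{m} \left( y_i - \beta_0 - \beta_1 x_{i1} - \dots - \beta_n x_{in} \right)^2\]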
Example: Estimating Fuel Consumption with a Linear Regression Model Using Seaborn's Built-in 'mpg' Dataset
Link to dataset: mpg_dataset
Data Dictionary:
The 'mpg' dataset available in Seaborn provides insights into the fuel efficiency and attributes of different automobiles from the 1970s and 1980s. The following data dictionary outlines the significance of each variable within this dataset:
| S.N. | Variables | Data Type | Description |
|---|---|---|---|
| 1 | mpg | Float | Miles per gallon (fuel efficiency) |
| 2 | cylinders | Integer | Number of cylinders in the car's engine |
| 3 | displacement | Float | Engine displacement (in cubic inches) |
| 4 | horsepower | Float | Engine horsepower |
| 5 | weight | Integer | Vehicle weight (in pounds) |
| 6 | acceleration | Float | Time taken to accelerate from 0 to 60 mph (in seconds) |
| 7 | model_year | Integer | Year of the car model (e.g., 70 for 1970) |
| 8 | origin | Categorical | Region of origin of the car (usa, europe, japan) |
| 9 | name | String | Full name of the car (make and model) |
- The dataset comprises four float variables, three integer variables, one categorical variable, and one string variable. The total number of records in the dataset is 398.
- Dependent Variable: mpg
- Independent Variables: cylinders, displacement, horsepower, weight, acceleration, model_year, and origin.
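A minimal sketch of loading the dataset and verifying these counts (assuming seaborn and pandas are installed):

```python
import seaborn as sns

# Load the built-in 'mpg' dataset (398 rows, 9 columns)
df = sns.load_dataset("mpg")

print(df.shape)    # (398, 9)
print(df.dtypes)   # data type of each column
print(df.head())   # first few records
```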
Data Preprocessing:
- To estimate 'mpg', the 'name' variable can be excluded as it does not influence the 'mpg' value.
- There are six missing values in the 'horsepower' variable; consequently, these six rows have been removed from the dataset for further analysis. The final dataset therefore contains 392 records, as sketched below.
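A minimal sketch of these preprocessing steps:

```python
# Drop the 'name' column, which does not influence mpg
df = df.drop(columns=["name"])

# Remove the rows with missing 'horsepower' values
print(df["horsepower"].isna().sum())   # 6 missing values
df = df.dropna(subset=["horsepower"])

print(df.shape)   # (392, 8)
```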
Univariate Analysis:
- cylinders: The count plot indicates that vehicles equipped with 4 cylinders exhibit the highest frequency of observations. Approximately 98% of the vehicles have either 4, 6, or 8 cylinders, while those with 3 or 5 cylinders appear only rarely.
- model_year: The year 1973 records the highest number of vehicles, followed closely by 1978 and then 1976.
- origin: The predominant origin of the vehicles is the USA.
- horsepower: The distribution is right-skewed, with the majority of data concentrated on the left. Most vehicles are equipped with engines producing between 70 and 100 horsepower, with a notable spike at the 150 horsepower mark. A declining trend is observed beyond the 200 horsepower threshold.
- displacement: Similar to horsepower, the displacement variable is also right-skewed, with most data points located on the left. Displacement values are most frequent in the 100–110 cubic inch range, with a declining trend beyond 200 cubic inches.
- weight: The weight variable is characterized by a right-skewed distribution, with the majority of observations on the left side.
- acceleration: The acceleration variable follows a normal distribution.
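The original plots are not reproduced here; a plausible sketch of comparable count plots and histograms with seaborn:

```python
import matplotlib.pyplot as plt

# Count plots for the discrete/categorical variables
for col in ["cylinders", "model_year", "origin"]:
    sns.countplot(data=df, x=col)
    plt.title(f"Count of vehicles by {col}")
    plt.show()

# Histograms for the continuous variables
for col in ["horsepower", "displacement", "weight", "acceleration"]:
    sns.histplot(data=df, x=col, kde=True)
    plt.title(f"Distribution of {col}")
    plt.show()
```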
Bivariate Analysis:
In this section, we will explore the correlation between the dependent variable 'mpg' and various other factors.
- The association between 'mpg' and 'cylinders' indicates that vehicles equipped with 4 cylinders exhibit the highest mpg values. As the number of cylinders increases beyond 4, the mpg values tend to decline, with vehicles having 8 cylinders recording the lowest mpg figures.
- The connection between 'mpg' and 'model_year' reveals that mpg values tend to rise as the model year advances.
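A minimal sketch of comparable bivariate plots (the exact chart types used in the original post are an assumption):

```python
# mpg by number of cylinders
sns.boxplot(data=df, x="cylinders", y="mpg")
plt.title("mpg by number of cylinders")
plt.show()

# average mpg by model year
sns.lineplot(data=df, x="model_year", y="mpg")
plt.title("Average mpg by model year")
plt.show()
```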
Multivariate Analysis:
In this section, we will explore the relationship between the dependent variable 'mpg' and multiple independent variables.
- The correlation between 'mpg' and 'horsepower' across different countries reveals a declining trend in mpg values as horsepower increases. Notably, the most gradual decrease is observed in the USA.
- The association between 'mpg' and 'acceleration' across the three countries indicates an upward trend in mpg values with rising acceleration.
- The relationship between 'mpg' and 'weight' across the three countries demonstrates a downward trend in mpg values as weight increases.
- The connection between 'mpg' and 'displacement' across the three countries also shows a decreasing trend in mpg values with higher displacement.
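A minimal sketch of such plots, splitting each relationship by region of origin:

```python
# mpg vs. each continuous predictor, colored by region of origin
for col in ["horsepower", "acceleration", "weight", "displacement"]:
    sns.lmplot(data=df, x=col, y="mpg", hue="origin")
    plt.show()
```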
Correlation Matrix - Heatmap:
Let’s focus on the target variable 'mpg'.
- The variables 'cylinders', 'displacement', 'horsepower', and 'weight' exhibit an inverse relationship with 'mpg', while 'acceleration' and 'model_year' show a direct relationship with 'mpg'.
- The correlation between 'mpg' and 'acceleration' is the weakest, suggesting that we can exclude 'acceleration' from the dataset for our analysis.
- The variable 'weight' demonstrates the strongest correlation with 'mpg', quantified at -0.83. Therefore, this variable will be crucial in our estimation of 'mpg'.
Now, we will examine whether there are other variables that exhibit high correlation with one another, aside from 'mpg'. In such cases, we will eliminate one of the correlated variables from the dataset, as they possess similar predictive capabilities for 'mpg'. Focusing on 'displacement', we find that it has over 90% correlation with three variables: 'cylinders', 'horsepower', and 'weight'. Consequently, we can also exclude 'displacement' from the dataset.
A pertinent question arises: could the removal of these variables adversely affect the model? While there may be minimal negative impacts, our primary objective is to estimate 'mpg' using the fewest variables possible to mitigate the risk of overfitting.
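A minimal sketch of the correlation heatmap and the subsequent column drops:

```python
# Correlation matrix of the numeric variables (exclude the categorical 'origin')
corr = df.drop(columns=["origin"]).corr()
sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm")
plt.title("Correlation matrix")
plt.show()

# Drop the weakly correlated and redundant predictors
df = df.drop(columns=["acceleration", "displacement"])
```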
Modifications to the Variable 'model_year':
Our objective is to prepare the dataset for modeling purposes, with a particular emphasis on the variable 'model_year'. This variable represents the year of the vehicle's manufacture. To incorporate this variable directly into the model, we can leverage the concept of age in years. Therefore, the next step involves feature engineering to create a new variable derived from 'model_year'. We will establish a new variable called 'age' by calculating the difference between the current date and the 'model_year', thereby determining the 'Age of the Vehicle'.
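One way to derive the 'age' feature is sketched below; the reference year used in the original analysis is not stated, so taking the current year is an assumption:

```python
from datetime import date

# model_year is stored as a two-digit year (e.g., 70 for 1970),
# so convert it to a full year before computing the vehicle's age.
current_year = date.today().year            # assumed reference year
df["age"] = current_year - (1900 + df["model_year"])
df = df.drop(columns=["model_year"])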
Modifications to the Variable 'cylinders':
The bivariate analysis and correlation matrix indicate an inverse relationship between miles per gallon (mpg) and the number of cylinders in a car's engine. The data reveals that vehicles with 4 cylinders account for roughly 50% of the observations, while those with three or five cylinders appear only rarely. Consequently, we will treat the 'cylinders' variable as categorical and encode its values as binary (0/1) dummy variables for the analysis.
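A minimal sketch of this encoding; dropping the 3-cylinder level as the reference category is an assumption, but it is consistent with the cylinders_4 through cylinders_8 columns reported in the coefficient table later:

```python
import pandas as pd

# One-hot encode 'cylinders'; drop_first=True drops the 3-cylinder level
df = pd.get_dummies(df, columns=["cylinders"], drop_first=True)
```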
Modifications to the Variable ‘origin’:
The variable 'origin' is a categorical variable comprising three distinct values: usa, japan, and europe. It is necessary to transform these values into a format that the model can interpret. We will apply the get_dummies() method to the variable 'origin' to convert its values into a binary representation of 0 and 1. Below is the final data structure prepared for modeling:
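A minimal sketch of the origin encoding and the resulting feature set; dropping origin_europe as the reference category is an assumption consistent with the coefficient table below:

```python
# One-hot encode 'origin'; drop_first=True keeps origin_japan and origin_usa
df = pd.get_dummies(df, columns=["origin"], drop_first=True)

print(df.columns.tolist())
# Expected columns: mpg, horsepower, weight, age,
# cylinders_4, cylinders_5, cylinders_6, cylinders_8,
# origin_japan, origin_usa
```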
Modeling
Split the dataset into a training set comprising 80% and a testing set comprising 20%.
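A minimal sketch of the split and model fit (the random_state is an assumption for reproducibility; the original value is not stated):

```python
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

X = df.drop(columns=["mpg"])
y = df["mpg"]

# 80% training set, 20% testing set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = LinearRegression()
model.fit(X_train, y_train)
```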
Modeling Results - Intercept and Coefficients:
| Intercept | 69.041759 |
|---|---|

| Variable | Coefficient |
|---|---|
| horsepower | -0.031005 |
| weight | -0.004994 |
| age | -0.700768 |
| cylinders_4 | 7.312174 |
| cylinders_5 | 7.545131 |
| cylinders_6 | 5.047242 |
| cylinders_8 | 7.831837 |
| origin_japan | 0.970655 |
| origin_usa | -1.640203 |
The intercept term indicates the value of the dependent variable when all independent variables (X's) are set to zero. The size of the coefficients reflects the variation in the dependent variable (y) resulting from a one-unit change in the corresponding independent variable, with positive and negative values indicating the direction of this change. A positive sign denotes a direct proportionality, while a negative sign indicates an inverse proportionality.
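A short sketch of how such a table can be read off the fitted model:

```python
# Collect the fitted intercept and coefficients into a readable table
print("Intercept:", model.intercept_)
coef_table = pd.Series(model.coef_, index=X.columns, name="Coefficient")
print(coef_table)
```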
Assessing the Model's Accuracy – Evaluation Metrics:
| Evaluation Metrics | Values |
|---|---|
| R2 Score | 0.825936 |
| MAE | 2.333349 |
| MSE | 8.868320 |
| RMSE | 2.977972 |
The R2 score of 0.826 indicates that around 83% of the variation in the dependent variable y can be attributed to the independent variables Xs, which is a favorable outcome. The Mean Absolute Error (MAE) stands at 2.33, reflecting a low error rate. The Root Mean Square Error (RMSE), regarded as a more robust evaluation metric, is recorded at 2.98, which is marginally higher than the MAE, as it imposes a greater penalty on the model for larger discrepancies.
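A minimal sketch of computing these metrics on the test set:

```python
import numpy as np
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error

y_pred = model.predict(X_test)

r2 = r2_score(y_test, y_pred)
mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)

print(f"R2: {r2:.3f}, MAE: {mae:.3f}, MSE: {mse:.3f}, RMSE: {rmse:.3f}")
```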
Link to actual and predicted y-values: Actual and Predicted y-value
Visualization - Predicted Model
The Train and Test R-squared scores are quite similar, recorded at 0.844 and 0.826, respectively. Therefore, there is no indication of overfitting in the data.
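The original visualization is not reproduced here; a minimal sketch of one way to compare the train and test R² scores and plot actual versus predicted values:

```python
# Compare train and test R² scores
print("Train R2:", model.score(X_train, y_train))   # ≈ 0.844
print("Test  R2:", model.score(X_test, y_test))     # ≈ 0.826

# Actual vs. predicted mpg on the test set
y_pred = model.predict(X_test)
plt.scatter(y_test, y_pred, alpha=0.7)
plt.plot([y_test.min(), y_test.max()],
         [y_test.min(), y_test.max()], color="red")  # perfect-prediction line
plt.xlabel("Actual mpg")
plt.ylabel("Predicted mpg")
plt.title("Actual vs. predicted mpg (test set)")
plt.show()
```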