Linear regression is a statistical technique employed to represent the relationship between a dependent variable (the target) and one or more independent variables (the predictors) by fitting a straight line, referred to as the regression line, to the data points.
Equation:
\[y = \beta_0 + \beta_1x_1 + \beta_2x_2 + \dots + \beta_nx_n + \epsilon\]
where,
\(y\) = Dependent variable
\(x_i\) = Independent variables
\(\beta_i\) = Coefficients/weights for the features
\(\epsilon\) = Error term (residual)
The primary objective is to reduce the discrepancy, or error, between the predicted values and the actual observations through methods such as least squares. This approach is extensively utilized in forecasting, trend analysis, and examining the impact of various factors.
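In ordinary least squares, for example, the coefficients are chosen to minimize the sum of squared residuals over the \(m\) observations:
\[\min_{\beta_0, \beta_1, \dots, \beta_n} \sum_{i=1}^{m} \left( y_i - \beta_0 - \beta_1 x_{i1} - \dots - \beta_n x_{in} \right)^2\]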
Example: Estimating Fuel Consumption with a Linear Regression Model Using Seaborn's Built-in 'mpg' Dataset
Link to dataset: mpg_dataset
Data Dictionary:
The 'mpg' dataset available in Seaborn provides insights into the fuel efficiency and attributes of different automobiles from the 1970s and 1980s. The following data dictionary outlines the significance of each variable within this dataset:
| S.N. | Variables | Data Type | Description |
|---|---|---|---|
| 1 | mpg | Float | Miles per gallon (fuel efficiency) |
| 2 | cylinders | Integer | Number of cylinders in the car's engine |
| 3 | displacement | Float | Engine displacement (in cubic inches) |
| 4 | horsepower | Float | Engine horsepower |
| 5 | weight | Integer | Vehicle weight (in pounds) |
| 6 | acceleration | Float | Time taken to accelerate from 0 to 60 mph (in seconds) |
| 7 | model_year | Integer | Year of the car model (e.g., 70 for 1970) |
| 8 | origin | Categorical | Region of origin of the car (usa, europe, japan) |
| 9 | name | String | Full name of the car (make and model) |
- The dataset comprises four float variables, three integer variables, one categorical variable, and one string variable. The total number of records in the dataset is 398.
- Dependent Variable: mpg
- Independent Variables: cylinders, displacement, horsepower, weight, acceleration, model_year, and origin.
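A minimal sketch of loading the dataset and verifying these counts (assuming seaborn and pandas are installed):

```python
import seaborn as sns

# Load the built-in 'mpg' dataset (398 rows, 9 columns)
df = sns.load_dataset("mpg")

print(df.shape)    # (398, 9)
print(df.dtypes)   # data type of each column
print(df.head())   # first few records
```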
Data Preprocessing:
- To estimate 'mpg', the 'name' variable can be excluded as it does not influence the 'mpg' value.
- There are six missing values in the 'horsepower' variable; consequently, these six rows have been removed from the dataset for further analysis. The final dataset therefore contains 392 records, as sketched below.
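A minimal sketch of these preprocessing steps:

```python
# Drop the 'name' column, which does not influence mpg
df = df.drop(columns=["name"])

# Remove the rows with missing 'horsepower' values
print(df["horsepower"].isna().sum())   # 6 missing values
df = df.dropna(subset=["horsepower"])

print(df.shape)   # (392, 8)
```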
Univariate Analysis:
- cylinders: The count plot indicates that vehicles equipped with 4 cylinders exhibit the highest frequency of observations. Approximately 98% of the vehicles have either 4, 6, or 8 cylinders, while those with 3 or 5 cylinders appear only rarely.
- model_year: The year 1973 records the highest number of vehicles, followed closely by 1978 and then 1976.
- origin: The predominant origin of the vehicles is the USA.
- horsepower: The distribution is right-skewed, with the majority of data concentrated on the left. Most vehicles are equipped with engines producing between 70 and 100 horsepower, with a notable spike at the 150 horsepower mark. A declining trend is observed beyond the 200 horsepower threshold.
- displacement: Similar to horsepower, the displacement variable is also right-skewed, with most data points located on the left. Displacement values are most frequent in the 100–110 cubic inch range, with a declining trend beyond 200 cubic inches.
- weight: The weight variable is characterized by a right-skewed distribution, with the majority of observations on the left side.
- acceleration: The acceleration variable follows a normal distribution.
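The original plots are not reproduced here; a plausible sketch of comparable count plots and histograms with seaborn:

```python
import matplotlib.pyplot as plt

# Count plots for the discrete/categorical variables
for col in ["cylinders", "model_year", "origin"]:
    sns.countplot(data=df, x=col)
    plt.title(f"Count of vehicles by {col}")
    plt.show()

# Histograms for the continuous variables
for col in ["horsepower", "displacement", "weight", "acceleration"]:
    sns.histplot(data=df, x=col, kde=True)
    plt.title(f"Distribution of {col}")
    plt.show()
```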
Bivariate Analysis:
In this section, we will explore the correlation between the dependent variable 'mpg' and various other factors.
- The association between 'mpg' and 'cylinders' indicates that vehicles equipped with 4 cylinders exhibit the highest mpg values. As the number of cylinders increases beyond 4, the mpg values tend to decline, with vehicles having 8 cylinders recording the lowest mpg figures.
- The connection between 'mpg' and 'model_year' reveals that mpg values tend to rise as the model year advances.
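A minimal sketch of comparable bivariate plots (the exact chart types used in the original post are an assumption):

```python
# mpg by number of cylinders
sns.boxplot(data=df, x="cylinders", y="mpg")
plt.title("mpg by number of cylinders")
plt.show()

# average mpg by model year
sns.lineplot(data=df, x="model_year", y="mpg")
plt.title("Average mpg by model year")
plt.show()
```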
Multivariate Analysis:
In this section, we will explore the relationship between the dependent variable 'mpg' and multiple independent variables.
- The correlation between 'mpg' and 'horsepower' across different countries reveals a declining trend in mpg values as horsepower increases. Notably, the most gradual decrease is observed in the USA.
- The association between 'mpg' and 'acceleration' across the three countries indicates an upward trend in mpg values with rising acceleration.
- The relationship between 'mpg' and 'weight' across the three countries demonstrates a downward trend in mpg values as weight increases.
- The connection between 'mpg' and 'displacement' across the three countries also shows a decreasing trend in mpg values with higher displacement.
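A minimal sketch of such plots, splitting each relationship by region of origin:

```python
# mpg vs. each continuous predictor, colored by region of origin
for col in ["horsepower", "acceleration", "weight", "displacement"]:
    sns.lmplot(data=df, x=col, y="mpg", hue="origin")
    plt.show()
```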
Correlation Matrix - Heatmap:
Let’s focus on the target variable 'mpg'.
- The variables 'cylinders', 'displacement', 'horsepower', and 'weight' exhibit an inverse relationship with 'mpg', while 'acceleration' and 'model_year' show a direct relationship with 'mpg'.
- The correlation between 'mpg' and 'acceleration' is the weakest, suggesting that we can exclude 'acceleration' from the dataset for our analysis.
- The variable 'weight' demonstrates the strongest correlation with 'mpg', quantified at -0.83. Therefore, this variable will be crucial in our estimation of 'mpg'.
Now, we will examine whether there are other variables that exhibit high correlation with one another, aside from 'mpg'. In such cases, we will eliminate one of the correlated variables from the dataset, as they possess similar predictive capabilities for 'mpg'. Focusing on 'displacement', we find that it has over 90% correlation with three variables: 'cylinders', 'horsepower', and 'weight'. Consequently, we can also exclude 'displacement' from the dataset.
A pertinent question arises: could the removal of these variables adversely affect the model? While there may be minimal negative impacts, our primary objective is to estimate 'mpg' using the fewest variables possible to mitigate the risk of overfitting.
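A minimal sketch of the correlation heatmap and the subsequent column drops:

```python
# Correlation matrix of the numeric variables (exclude the categorical 'origin')
corr = df.drop(columns=["origin"]).corr()
sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm")
plt.title("Correlation matrix")
plt.show()

# Drop the weakly correlated and redundant predictors
df = df.drop(columns=["acceleration", "displacement"])
```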
Modifications to the Variable 'model_year':
Our objective is to prepare the dataset for modeling purposes, with a particular emphasis on the variable 'model_year'. This variable represents the year of the vehicle's manufacture. To incorporate this variable directly into the model, we can leverage the concept of age in years. Therefore, the next step involves feature engineering to create a new variable derived from 'model_year'. We will establish a new variable called 'age' by calculating the difference between the current date and the 'model_year', thereby determining the 'Age of the Vehicle'.
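One way to derive the 'age' feature is sketched below; the reference year used in the original analysis is not stated, so taking the current year is an assumption:

```python
from datetime import date

# model_year is stored as a two-digit year (e.g., 70 for 1970),
# so convert it to a full year before computing the vehicle's age.
current_year = date.today().year            # assumed reference year
df["age"] = current_year - (1900 + df["model_year"])
df = df.drop(columns=["model_year"])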
Modifications to the Variable 'cylinders':
The bivariate analysis and correlation matrix indicate an inverse relationship between miles per gallon (mpg) and the number of cylinders in a car's engine. The data reveals that vehicles with 4 cylinders account for roughly 50% of the observations, while those with three or five cylinders appear only rarely. Consequently, we will treat the 'cylinders' variable as categorical and encode its values as binary (0/1) dummy variables for the analysis.
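A minimal sketch of this encoding; dropping the 3-cylinder level as the reference category is an assumption, but it is consistent with the cylinders_4 through cylinders_8 columns reported in the coefficient table later:

```python
import pandas as pd

# One-hot encode 'cylinders'; drop_first=True drops the 3-cylinder level
df = pd.get_dummies(df, columns=["cylinders"], drop_first=True)
```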
Modifications to the Variable ‘origin’:
The variable 'origin' is a categorical variable comprising three distinct values: usa, japan, and europe. It is necessary to transform these values into a format that the model can interpret. We will apply the get_dummies() method to the variable 'origin' to convert its values into a binary representation of 0 and 1. Below is the final data structure prepared for modeling:
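A minimal sketch of the origin encoding and the resulting feature set; dropping origin_europe as the reference category is an assumption consistent with the coefficient table below:

```python
# One-hot encode 'origin'; drop_first=True keeps origin_japan and origin_usa
df = pd.get_dummies(df, columns=["origin"], drop_first=True)

print(df.columns.tolist())
# Expected columns: mpg, horsepower, weight, age,
# cylinders_4, cylinders_5, cylinders_6, cylinders_8,
# origin_japan, origin_usa
```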
Modeling
Split the dataset into a training set comprising 80% and a testing set comprising 20%.
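A minimal sketch of the split and model fit (the random_state is an assumption for reproducibility; the original value is not stated):

```python
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

X = df.drop(columns=["mpg"])
y = df["mpg"]

# 80% training set, 20% testing set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = LinearRegression()
model.fit(X_train, y_train)
```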
Modeling Results - Intercept and Coefficients:
| Intercept | 69.041759 |
|---|---|

| Variable | Coefficient |
|---|---|
| horsepower | -0.031005 |
| weight | -0.004994 |
| age | -0.700768 |
| cylinders_4 | 7.312174 |
| cylinders_5 | 7.545131 |
| cylinders_6 | 5.047242 |
| cylinders_8 | 7.831837 |
| origin_japan | 0.970655 |
| origin_usa | -1.640203 |
The intercept term indicates the value of the dependent variable when all independent variables (X's) are set to zero. The size of the coefficients reflects the variation in the dependent variable (y) resulting from a one-unit change in the corresponding independent variable, with positive and negative values indicating the direction of this change. A positive sign denotes a direct proportionality, while a negative sign indicates an inverse proportionality.
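A short sketch of how such a table can be read off the fitted model:

```python
# Collect the fitted intercept and coefficients into a readable table
print("Intercept:", model.intercept_)
coef_table = pd.Series(model.coef_, index=X.columns, name="Coefficient")
print(coef_table)
```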
Assessing the Model's Accuracy – Evaluation Metrics:
| Evaluation Metrics | Values |
|---|---|
| R2 Score | 0.825936 |
| MAE | 2.333349 |
| MSE | 8.868320 |
| RMSE | 2.977972 |
The R2 score of 0.826 indicates that around 83% of the variation in the dependent variable y can be attributed to the independent variables Xs, which is a favorable outcome. The Mean Absolute Error (MAE) stands at 2.33, reflecting a low error rate. The Root Mean Square Error (RMSE), regarded as a more robust evaluation metric, is recorded at 2.98, which is marginally higher than the MAE, as it imposes a greater penalty on the model for larger discrepancies.
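A minimal sketch of computing these metrics on the test set:

```python
import numpy as np
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error

y_pred = model.predict(X_test)

r2 = r2_score(y_test, y_pred)
mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)

print(f"R2: {r2:.3f}, MAE: {mae:.3f}, MSE: {mse:.3f}, RMSE: {rmse:.3f}")
```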
Link to actual and predicted y-values: Actual and Predicted y-value
Visualization - Predicted Model
The Train and Test R-squared scores are quite similar, recorded at 0.844 and 0.826, respectively. Therefore, there is no indication of overfitting in the data.
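The original visualization is not reproduced here; a minimal sketch of one way to compare the train and test R² scores and plot actual versus predicted values:

```python
# Compare train and test R² scores
print("Train R2:", model.score(X_train, y_train))   # ≈ 0.844
print("Test  R2:", model.score(X_test, y_test))     # ≈ 0.826

# Actual vs. predicted mpg on the test set
y_pred = model.predict(X_test)
plt.scatter(y_test, y_pred, alpha=0.7)
plt.plot([y_test.min(), y_test.max()],
         [y_test.min(), y_test.max()], color="red")  # perfect-prediction line
plt.xlabel("Actual mpg")
plt.ylabel("Predicted mpg")
plt.title("Actual vs. predicted mpg (test set)")
plt.show()
```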