Logistic Regression is a statistical method for binary classification tasks, such as yes/no, spam/not-spam, or survived/not-survived. It estimates the probability that an observation belongs to one of two classes based on its input features. Unlike Linear Regression, which predicts continuous values, Logistic Regression passes a linear combination of the features through a sigmoid function, which keeps the output between 0 and 1. Although it is called regression, it addresses a classification problem with a categorical dependent variable. Its coefficients are estimated with the Maximum Likelihood Method.
Equation:
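\[P(y=1|x) = \frac{1}{1 + e^{-(\beta_0 + \beta_1x_1 + \beta_2x_2 + \dots + \beta_nx_n)}}\]
where,
\(P(y=1|x) \implies \) Probability of class 1 (e.g., survived)
\(\beta_0 \implies \) Intercept
\(\beta_1, \dots, \beta_n \implies \) Coefficients of the input features \(x_1, \dots, x_n\)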
Decision threshold: The predicted probability is converted into a class by comparing it with a threshold, here set at 0.5. When the predicted probability is greater than or equal to 0.5, the model classifies the outcome as 1; if it falls below 0.5, the model classifies it as 0.
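The equation and the threshold rule can be sketched in a few lines of Python. This is a minimal illustration, not a fitted model: the helper names `sigmoid` and `predict_class` and the coefficient values are hypothetical and chosen only so that the example lands exactly on the 0.5 boundary.

```python
import math

def sigmoid(z: float) -> float:
    """Map a real-valued linear score z to a probability between 0 and 1."""
    # Two algebraically equal forms, chosen to avoid overflow in math.exp
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def predict_class(x, beta0, betas, threshold=0.5):
    """Return (probability of class 1, predicted class) for one observation."""
    z = beta0 + sum(b * xi for b, xi in zip(betas, x))
    p = sigmoid(z)
    return p, int(p >= threshold)

# Hypothetical coefficients, for illustration only
p, label = predict_class(x=[1.0, 2.0], beta0=-1.0, betas=[0.5, 0.25])
print(p, label)  # 0.5 1  (z = 0, so the probability sits exactly on the threshold)
```

Because the rule is "greater than or equal to", a probability of exactly 0.5 is assigned to class 1.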
Example: Analysis of Seaborn's 'titanic' dataset to predict survivor traits using Logistic Regression Model
The Titanic met its tragic fate on April 15, 1912, while on its inaugural journey, colliding with an iceberg in the North Atlantic. This engineering marvel was carrying 2,224 individuals, yet only 32% managed to survive, rendering it one of the most catastrophic maritime disasters in history. The incident incited widespread outrage and prompted critical discussions regarding maritime safety, resulting in substantial regulatory reforms. These reforms included mandates for an adequate number of lifeboats for all passengers, enhanced crew training, and the establishment of the International Ice Patrol. A significant contributor to the high casualty rate was the inadequate provision of lifeboats—only 20 were available, with a capacity for approximately 1,178 individuals. Many passengers were reluctant to abandon the ship, mistakenly believing it to be secure, and the ensuing chaos during the sinking exacerbated the dire circumstances.
To gain a deeper understanding of this calamity, we will examine the built-in 'titanic' dataset from the Seaborn library using a Logistic Regression model. The analysis focuses on the characteristics of the survivors, investigating variables such as age, sex, passenger class, and port of embarkation to reveal insights into the survival rates.
Link of Dataset: titanic_dataset
Data Dictionary:
The Titanic dataset available in Seaborn represents a portion of the passenger data collected by the British Board of Trade. It encompasses information regarding the individuals who were on board the RMS Titanic, which tragically sank during its inaugural journey in 1912. This dataset is commonly utilized for illustrating classification algorithms and conducting exploratory data analysis (EDA). The Titanic dataset comprises the following columns:
SN | Variables | Data Type | Description |
---|---|---|---|
1 | survived | Integer | Survival status (0 = No, 1 = Yes) |
2 | pclass | Integer | Passenger class (1 = 1st, 2 = 2nd, 3 = 3rd) |
3 | sex | Categorical | Gender of the passenger (male, female) |
4 | age | Float | Age of the passenger in years. Some values may be NaN. |
5 | sibsp | Integer | Number of siblings or spouses aboard the Titanic |
6 | parch | Integer | Number of parents or children aboard the Titanic |
7 | fare | Float | Passenger fare in British pounds |
8 | embarked | Categorical | Port of embarkation (C = Cherbourg, Q = Queenstown, S = Southampton) |
9 | class | Categorical | Class of travel (First, Second, Third). This is a categorical representation of pclass. |
10 | who | Categorical | Simplified demographic (man, woman, child) |
11 | adult_male | Boolean | Whether the passenger was adult male (True, False) |
12 | deck | Categorical | Deck of the ship where the passenger was located (A through G, NaN for unknown) |
13 | embark_town | Categorical | Town of embarkation (Cherbourg, Queenstown, Southampton) |
14 | alive | Categorical | Survival status (yes, no). This is a categorical representation of survived. |
15 | alone | Boolean | Whether the passenger was traveling alone (True, False) |
The 'titanic' dataset comprises a total of 891 records and contains 15 variables, which include 4 integer variables, 2 float variables, 2 boolean variables, and 7 categorical variables. For the purpose of analysis, the boolean variables have been transformed into a 0/1 format.
Additionally, we have eliminated redundant variables such as 'embarked', 'class', and 'alive', as we will utilize their corresponding variables 'embark_town', 'pclass', and 'survived' in our analysis.
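The two preparation steps described above (dropping the redundant columns and converting the boolean columns to 0/1) might look like the following sketch. The helper name `drop_and_encode` and the two-row toy frame are assumptions made so the example runs without downloading anything; the column names come from the data dictionary.

```python
import pandas as pd

REDUNDANT = ["embarked", "class", "alive"]   # duplicates of embark_town, pclass, survived
BOOLEAN = ["adult_male", "alone"]            # True/False columns to convert to 1/0

def drop_and_encode(df: pd.DataFrame) -> pd.DataFrame:
    """Drop the redundant columns and turn boolean columns into 0/1 integers."""
    out = df.drop(columns=REDUNDANT)
    out[BOOLEAN] = out[BOOLEAN].astype(int)
    return out

# Toy stand-in with the same column names as the real dataset (values are made up)
toy = pd.DataFrame({
    "survived": [0, 1],
    "pclass": [3, 1],
    "adult_male": [True, False],
    "alone": [False, True],
    "embarked": ["S", "C"],
    "class": ["Third", "First"],
    "alive": ["no", "yes"],
    "embark_town": ["Southampton", "Cherbourg"],
})
clean = drop_and_encode(toy)
print(clean)
```

On the real data, the same function could be applied to the frame returned by `seaborn.load_dataset("titanic")`, which fetches the dataset over the network.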
Exploratory Data Analysis:
Missing Value Treatment:
- The variables 'age', 'deck', and 'embark_town' contain missing data.
- The 'deck' variable has a large amount of missing information: 688 of its 891 entries are absent. Consequently, this variable has been excluded from the dataset for subsequent analysis.
- The 'age' variable has 177 missing entries. We retained the variable because of its importance and filled each missing value with the average age of the passenger's ticket class ('pclass'). As shown in the boxplot, the class averages are approximately 38 for pclass=1, 30 for pclass=2, and 25 for pclass=3.
- The 'embark_town' variable has 2 missing entries. We have removed these two rows from the dataset for further analysis. As a result, the dataset now comprises 889 rows.
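The three treatments listed above could be sketched with pandas as follows. This is one possible implementation, not necessarily the one used for the analysis; the function name `treat_missing` and the six-row toy frame are assumptions, with ages picked so the class averages come out at 38 and 25.

```python
import numpy as np
import pandas as pd

def treat_missing(df: pd.DataFrame) -> pd.DataFrame:
    """Drop 'deck', impute 'age' by pclass mean, drop rows without 'embark_town'."""
    out = df.drop(columns=["deck"])
    # Mean age of each ticket class, broadcast back to every row of that class
    class_mean_age = out.groupby("pclass")["age"].transform("mean")
    out["age"] = out["age"].fillna(class_mean_age)
    return out.dropna(subset=["embark_town"]).reset_index(drop=True)

# Toy data (made-up values) with the same kinds of gaps as the real dataset
toy = pd.DataFrame({
    "pclass": [1, 1, 1, 3, 3, 3],
    "age": [40, 36, np.nan, 20, np.nan, 30],
    "deck": ["C", None, "E", None, None, None],
    "embark_town": ["Southampton", "Cherbourg", "Southampton",
                    None, "Queenstown", "Southampton"],
})
clean = treat_missing(toy)
print(clean)
```

In the toy frame, the missing first-class age becomes 38 and the missing third-class age becomes 25, and the row without an embarkation town is removed.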
Analysis of Dependent Variable (survived):
The majority of those on board did not survive. In the final sample dataset of 889 passengers, 549 (62%) perished, while 340 (38%) survived. Our goal is to determine whether the survival of these passengers was merely a matter of luck or if specific advantages or opportunities during the voyage contributed to their survival.
People on Board | 889 |
---|---|
Not-survived | 549 (62%) |
Survived | 340 (38%) |
Analysis of Independent Variables:
alone: This variable is derived from 'sibsp' and 'parch'. If both values are '0', it indicates that the passenger is alone; otherwise, they are not. A significant number of those who perished were alone on the ship, while a higher percentage of survivors were not alone.
alone | Total (889) | Not-survived (549) | Survived (340) |
---|---|---|---|
yes | 535 (60%) | 374 (68%) | 161 (47%) |
no | 354 (40%) | 175 (32%) | 179 (53%) |
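Breakdown tables of this kind (counts per category, plus the share of each category within the not-survived and survived groups) can be produced with `pandas.crosstab`. The sketch below uses a hypothetical helper `survival_breakdown` and an eight-row toy frame with made-up values.

```python
import pandas as pd

def survival_breakdown(df: pd.DataFrame, column: str):
    """Return (counts, column percentages) of `column` against 'survived'."""
    counts = pd.crosstab(df[column], df["survived"])
    # normalize="columns" makes each survived group sum to 100%
    shares = pd.crosstab(df[column], df["survived"], normalize="columns") * 100
    return counts, shares.round(0)

# Toy data, values made up for illustration
toy = pd.DataFrame({
    "alone":    [1, 1, 1, 0, 0, 1, 0, 0],
    "survived": [0, 0, 1, 1, 1, 0, 0, 1],
})
counts, shares = survival_breakdown(toy, "alone")
print(counts)
print(shares)
```

The same call with a different column name, for example 'pclass' or 'sex', would give the corresponding table for that variable.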
adult_male: The vast majority of adult males perished (449 of 537, about 84%). The survival rate for passengers who were not adult males was much higher (252 of 352, about 72%), suggesting that women and children had a greater chance of survival. The 'who' and 'sex' variables are examined next for clarity.
adult_male | Total (889) | Not-survived (549) | Survived (340) |
---|---|---|---|
yes | 537 (60%) | 449 (82%) | 88 (26%) |
no | 352 (40%) | 100 (18%) | 252 (74%) |
who and sex: The majority of fatalities were male: 468 of the 549 passengers who died were male, more than five times the 81 female deaths. Among survivors, the share of females (68%) was more than double that of males (32%), which is consistent with women and children being prioritized for rescue.
who | Total (889) | Not-survived (549) | Survived (340) |
---|---|---|---|
man | 537 (60%) | 449 (82%) | 88 (26%) |
woman | 269 (30%) | 66 (12%) | 203 (60%) |
child | 83 (9%) | 34 (6%) | 49 (14%) |
sex | Total (889) | Not-survived (549) | Survived (340) |
---|---|---|---|
male | 577 (65%) | 468 (85%) | 109 (32%) |
female | 312 (35%) | 81 (15%) | 231 (68%) |
embark_town: Most passengers embarked at Southampton (644), compared with 168 at Cherbourg and only 77 at Queenstown, so Southampton passengers make up the largest share of both the deceased and the survivors. Cherbourg passengers are over-represented among survivors (27% of survivors versus 19% of all passengers).
embark_town | Total (889) | Not-survived (549) | Survived (340) |
---|---|---|---|
Southampton | 644 (72%) | 427 (78%) | 217 (64%) |
Cherbourg | 168 (19%) | 75 (14%) | 93 (27%) |
Queenstown | 77 (9%) | 47 (9%) | 30 (9%) |
pclass: The majority of those holding third-class tickets did not survive (372 of 491, about 76%). Conversely, the survival rate was highest among first-class ticket holders (134 of 214, about 63%).
pclass | Total (889) | Not-survived (549) | Survived (340) |
---|---|---|---|
1 | 214 (24%) | 80 (15%) | 134 (39%) |
2 | 184 (21%) | 97 (18%) | 87 (26%) |
3 | 491 (55%) | 372 (68%) | 119 (35%) |
age: Most passengers were aged between 25 and 35, with a small number of children aged 0 to 4. The number of elderly passengers was generally low, and there is a noticeable decline in the number of passengers after the age of 35.
fare: Most ticket prices fall between 0 and 50, and the number of passengers decreases steadily at higher fares. The distribution is right-skewed.
Correlation Matrix – Heatmap
The variables 'age', 'sibsp', and 'parch' exhibit a minimal correlation with the dependent variable 'survived'. Consequently, we have excluded these three variables from the dataset for subsequent analysis.
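A correlation screen of this kind can be sketched as below. The helper `correlation_with_target` and the toy columns `feature_a` and `feature_b` are hypothetical; they are constructed so that one column tracks the target perfectly and the other is uncorrelated with it.

```python
import pandas as pd

def correlation_with_target(df: pd.DataFrame, target: str = "survived") -> pd.Series:
    """Pearson correlation of every numeric column with the target column."""
    numeric = df.select_dtypes(include="number")
    corr = numeric.corr()[target].drop(target)
    # Order by absolute strength so the weakest predictors end up last
    return corr.reindex(corr.abs().sort_values(ascending=False).index)

# Toy data, made up for illustration
toy = pd.DataFrame({
    "survived":  [0, 0, 1, 1],
    "feature_a": [1, 1, 2, 2],   # moves together with the target
    "feature_b": [1, 2, 1, 2],   # unrelated to the target
})
result = correlation_with_target(toy)
print(result)
```

For the heatmap itself, the full matrix from `numeric.corr()` can be passed to `seaborn.heatmap`, for example with `annot=True` to print the coefficients in each cell.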
Regarding the variables 'sex', 'who', and 'embark_town': these are categorical variables. We applied the pandas get_dummies() function to them to transform their values into binary indicator columns of 0 and 1.
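As a small illustration of this encoding step, the sketch below applies `pandas.get_dummies` to a three-row toy frame (values made up). Passing `dtype=int` is an assumption about the desired output; it yields 0/1 integers rather than True/False.

```python
import pandas as pd

# Toy data with the three categorical columns plus one numeric column
toy = pd.DataFrame({
    "fare": [7.25, 71.28, 21.08],
    "sex": ["male", "female", "male"],
    "who": ["man", "woman", "child"],
    "embark_town": ["Southampton", "Cherbourg", "Queenstown"],
})

# Each category becomes its own 0/1 column, named <column>_<category>
encoded = pd.get_dummies(toy, columns=["sex", "who", "embark_town"], dtype=int)
print(encoded.columns.tolist())
print(encoded["sex_male"].tolist())  # [1, 0, 1]
```

Each original column is replaced by one indicator column per category. `get_dummies` also accepts `drop_first=True`, which drops one indicator per variable when the full set would be redundant; whether that option was used here is not stated.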
Modeling:
Dependent Variable: survived
Independent Variables: pclass, fare, adult_male, alone, sex, who, embark_town
The dataset has been partitioned into a training set that constitutes 75% and a testing set that accounts for 25%. The model was developed using the training dataset through the implementation of a Logistic Regression Model. Predictions were subsequently made utilizing the testing dataset.
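The split, fit, and predict steps can be sketched with scikit-learn as follows. To keep the example self-contained it uses a synthetic feature matrix with 889 rows instead of the prepared Titanic features, and the `random_state` and `max_iter` values are assumptions, so the printed accuracy says nothing about the real model.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in for the prepared data: 889 rows, 3 numeric features
n_rows = 889
X = rng.normal(size=(n_rows, 3))
true_logits = 1.5 * X[:, 0] - 1.0 * X[:, 1]
y = (rng.random(n_rows) < 1.0 / (1.0 + np.exp(-true_logits))).astype(int)

# 75% training data, 25% testing data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# Fit on the training set only
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Predict classes and class-1 probabilities on the held-out test set
y_pred = model.predict(X_test)
y_prob = model.predict_proba(X_test)[:, 1]
accuracy = (y_pred == y_test).mean()
print(X_train.shape[0], X_test.shape[0], round(accuracy, 2))
```

With 889 rows, a 25% test share gives 223 test rows and 666 training rows, which agrees with the 223 predictions summarized in the confusion matrix.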
Link of Actual and Predicted y-value: Actual and Predicted y-value
Classification Report and Confusion Matrix of the Model:
The F1-score of the model stands at 0.79, suggesting that the model performs quite well. Of the 223 predictions on the test set, 111 + 65 = 176 are correct, representing an accuracy of 79%. The Type 1 error is recorded at 30, indicating that the model incorrectly predicted 30 passengers as having survived when they actually did not. Conversely, the Type 2 error is 17, meaning that the model misclassified 17 passengers as not having survived when they actually did. For both error types a lower count is preferable; here the model makes more Type 1 errors (30) than Type 2 errors (17).
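The reported figures can be re-derived from the four confusion matrix counts. The sketch below assumes that 111 is the number of true negatives and 65 the number of true positives; the text gives only their sum, so this assignment is an inference from the class balance. Under that assumption, the 0.79 F1-score matches the support-weighted average of the two per-class F1-scores rather than the F1-score of the survived class alone.

```python
# Counts taken from the text; the tn/tp assignment is an assumption
tn, fp, fn, tp = 111, 30, 17, 65
total = tn + fp + fn + tp

accuracy = (tp + tn) / total

# Class 1 (survived)
precision_pos = tp / (tp + fp)
recall_pos = tp / (tp + fn)
f1_pos = 2 * precision_pos * recall_pos / (precision_pos + recall_pos)

# Class 0 (not survived)
precision_neg = tn / (tn + fn)
recall_neg = tn / (tn + fp)
f1_neg = 2 * precision_neg * recall_neg / (precision_neg + recall_neg)

# Average of the two F1-scores, weighted by the number of actual cases per class
support_neg, support_pos = tn + fp, tp + fn
weighted_f1 = (support_neg * f1_neg + support_pos * f1_pos) / total

print(round(accuracy, 3), round(f1_pos, 3), round(f1_neg, 3), round(weighted_f1, 3))
# 0.789 0.734 0.825 0.792
```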