Analysis of 'titanic' dataset to predict survivor traits using Logistic Regression Model

Logistic Regression is a statistical method for binary classification tasks such as yes/no, spam/non-spam, or survived/not-survived. It estimates the probability that an observation belongs to one of two classes based on input features. Unlike Linear Regression, which predicts continuous values, Logistic Regression passes a linear combination of the features through a sigmoid function, which keeps the output between 0 and 1. Although it is called regression, it addresses a classification problem with a categorical dependent variable, and its coefficients are estimated with the Maximum Likelihood Method.

Equation:

\[P(y=1|x) = \frac{1}{1 + e^{-(\beta_0 + \beta_1x_1 + \beta_2x_2 +...+ \beta_nx_n)}}\]

where,

\(P(y=1|x) \implies \) Probability of class 1 (e.g., survived)

\(\beta_0 \implies \) Intercept

\(\beta_1, \ldots, \beta_n \implies \) Coefficients of the input features \(x_1, \ldots, x_n\)

A decision threshold converts the predicted probability into a class. Here the threshold is set at 0.5: when the predicted probability is greater than or equal to 0.5, the logistic regression model classifies the outcome as 1; when it falls below 0.5, the model classifies it as 0.
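
The equation and the 0.5 threshold can be sketched in a few lines of Python. This is only an illustration: the function names and the coefficient values below are hypothetical and are not taken from the fitted Titanic model.

```python
import math

def sigmoid(z):
    """Map a real number z to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(x, beta0, betas):
    """P(y=1|x) for one observation x, given intercept beta0 and coefficients betas."""
    z = beta0 + sum(b * xi for b, xi in zip(betas, x))
    return sigmoid(z)

def predict_class(x, beta0, betas, threshold=0.5):
    """Return 1 when the probability is at or above the threshold, else 0."""
    return 1 if predict_proba(x, beta0, betas) >= threshold else 0

# Hypothetical intercept and coefficients, chosen only for illustration
beta0, betas = -1.0, [2.0, -0.5]

# z = -1 + 2*1 - 0.5*2 = 0, so the probability is exactly 0.5
print(predict_proba([1.0, 2.0], beta0, betas))
# A probability of 0.5 meets the threshold, so the class is 1
print(predict_class([1.0, 2.0], beta0, betas))
```

Raising or lowering `threshold` trades one kind of misclassification for the other, which is why the value is usually stated explicitly.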

Example: Analysis of Seaborn's 'titanic' dataset to predict survivor traits using Logistic Regression Model

The Titanic met its tragic fate on April 15, 1912, while on its inaugural journey, colliding with an iceberg in the North Atlantic. This engineering marvel was carrying 2,224 individuals, yet only 32% managed to survive, rendering it one of the most catastrophic maritime disasters in history. The incident incited widespread outrage and prompted critical discussions regarding maritime safety, resulting in substantial regulatory reforms. These reforms included mandates for an adequate number of lifeboats for all passengers, enhanced crew training, and the establishment of the International Ice Patrol. A significant contributor to the high casualty rate was the inadequate provision of lifeboats—only 20 were available, with a capacity for approximately 1,178 individuals. Many passengers were reluctant to abandon the ship, mistakenly believing it to be secure, and the ensuing chaos during the sinking exacerbated the dire circumstances.



To gain a deeper understanding of this calamity, we will examine the built-in 'titanic' dataset found in the Seaborn library using a Logistic Regression model. This analysis will focus on the demographics of the survivors, investigating variables such as age, gender, class, and port of embarkation to reveal insights into the survival rates.

Link of Dataset: titanic_dataset

Data Dictionary:

The Titanic dataset available in Seaborn represents a portion of the passenger data collected by the British Board of Trade. It encompasses information regarding the individuals who were on board the RMS Titanic, which tragically sank during its inaugural journey in 1912. This dataset is commonly utilized for illustrating classification algorithms and conducting exploratory data analysis (EDA). The Titanic dataset comprises the following columns:

| SN | Variable | Data Type | Description |
| --- | --- | --- | --- |
| 1 | survived | Integer | Survival status (0 = No, 1 = Yes) |
| 2 | pclass | Integer | Passenger class (1 = 1st, 2 = 2nd, 3 = 3rd) |
| 3 | sex | Categorical | Gender of the passenger (male, female) |
| 4 | age | Float | Age of the passenger in years. Some values may be NaN. |
| 5 | sibsp | Integer | Number of siblings or spouses aboard the Titanic |
| 6 | parch | Integer | Number of parents or children aboard the Titanic |
| 7 | fare | Float | Passenger fare in British pounds |
| 8 | embarked | Categorical | Port of embarkation (C = Cherbourg, Q = Queenstown, S = Southampton) |
| 9 | class | Categorical | Class of travel (First, Second, Third). This is a categorical representation of pclass. |
| 10 | who | Categorical | Simplified demographic (man, woman, child) |
| 11 | adult_male | Boolean | Whether the passenger was adult male (True, False) |
| 12 | deck | Categorical | Deck of the ship where the passenger was located (A through G, NaN for unknown) |
| 13 | embark_town | Categorical | Town of embarkation (Cherbourg, Queenstown, Southampton) |
| 14 | alive | Categorical | Survival status (yes, no). This is a categorical representation of survived. |
| 15 | alone | Boolean | Whether the passenger was traveling alone (True, False) |

The 'titanic' dataset comprises a total of 891 records and contains 15 variables, which include 4 integer variables, 2 float variables, 2 boolean variables, and 7 categorical variables. For the purpose of analysis, the boolean variables have been transformed into a 0/1 format.

Additionally, we have eliminated redundant variables such as 'embarked', 'class', and 'alive', as we will utilize their corresponding variables 'embark_town', 'pclass', and 'survived' in our analysis.
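
A minimal sketch of these two preparation steps with pandas is given below. It runs on a hypothetical three-row sample that borrows the dataset's column names, so it does not need to download the real data; the original analysis may have implemented the steps differently.

```python
import pandas as pd

# Hypothetical three-row sample using the Seaborn column names
df = pd.DataFrame({
    "survived": [0, 1, 1],
    "pclass": [3, 1, 3],
    "embarked": ["S", "C", "S"],
    "class": ["Third", "First", "Third"],
    "adult_male": [True, False, False],
    "embark_town": ["Southampton", "Cherbourg", "Southampton"],
    "alive": ["no", "yes", "yes"],
    "alone": [False, False, True],
})

# Convert the two boolean columns to 0/1 integers
for col in ["adult_male", "alone"]:
    df[col] = df[col].astype(int)

# Drop the columns that repeat information held in other columns
df = df.drop(columns=["embarked", "class", "alive"])

print(df)
```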

Exploratory Data Analysis:

Missing Value Treatment:
  • The variables 'age', 'deck', and 'embark_town' contain missing data.

  • The 'deck' variable exhibits a significant amount of missing information, with 688 out of a total of 891 entries being absent. Consequently, this variable has been excluded from the dataset for subsequent analysis.

  • The 'age' variable has 177 missing entries, but we retained it due to its importance. We imputed each missing value with the average age of the passenger's ticket class ('pclass'). Based on the age distribution by class shown in the boxplot, the average age is 38 for pclass=1, 30 for pclass=2, and 25 for pclass=3.

  • The 'embark_town' variable has 2 missing entries. We have removed these two rows from the dataset for further analysis. As a result, the dataset now comprises 889 rows.
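
The three treatments above can be sketched with pandas as follows. The six-row frame is hypothetical and only mimics the pattern of missing values; the group-mean imputation via `groupby(...).transform("mean")` is one common way to do this and is an assumption about how the original step was coded.

```python
import numpy as np
import pandas as pd

# Hypothetical sample with gaps in 'age', 'deck', and 'embark_town'
df = pd.DataFrame({
    "pclass": [1, 1, 2, 2, 3, 3],
    "age": [40.0, np.nan, 30.0, np.nan, 20.0, 30.0],
    "deck": ["C", None, None, None, None, "E"],
    "embark_town": ["Southampton", "Cherbourg", None,
                    "Southampton", "Queenstown", "Southampton"],
})

# 1. Drop 'deck' because most of its values are missing
df = df.drop(columns=["deck"])

# 2. Fill each missing age with the mean age of that passenger's class
class_mean_age = df.groupby("pclass")["age"].transform("mean")
df["age"] = df["age"].fillna(class_mean_age)

# 3. Remove the rows where 'embark_town' is missing
df = df.dropna(subset=["embark_town"])

print(df)
```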

Exploratory Data Analysis (Titanic)

Analysis of Dependent Variable (survived):

Most of those on board did not survive. According to the final sample dataset, out of the 889 passengers, 549 (62%) perished, while 340 (38%) managed to survive. Our goal is to determine whether the survival of these passengers was merely a matter of luck or whether specific advantages or opportunities during the voyage contributed to their survival.

| Category | Passengers |
| --- | --- |
| People on Board | 889 |
| Not-survived | 549 (62%) |
| Survived | 340 (38%) |

Analysis of Independent Variables:

alone: This variable is derived from 'sibsp' and 'parch'. If both values are '0', it indicates that the passenger is alone; otherwise, they are not. A significant number of those who perished were alone on the ship, while a higher percentage of survivors were not alone.

| alone | Total (889) | Not-survived (549) | Survived (340) |
| --- | --- | --- | --- |
| yes | 535 (60%) | 374 (68%) | 161 (47%) |
| no | 354 (40%) | 175 (32%) | 179 (53%) |
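
A table of this kind, with counts and column-wise percentages, can be produced with `pandas.crosstab`. The sketch below uses a hypothetical eight-row sample rather than the real data, so its numbers will not match the table above.

```python
import pandas as pd

# Hypothetical sample: 1 = yes, 0 = no for both columns
df = pd.DataFrame({
    "alone":    [1, 1, 1, 0, 0, 1, 0, 1],
    "survived": [0, 0, 1, 1, 1, 0, 0, 0],
})

# Counts per combination, with row and column totals labelled 'All'
counts = pd.crosstab(df["alone"], df["survived"], margins=True)

# Share of each 'alone' value within each survival outcome, in percent
shares = pd.crosstab(df["alone"], df["survived"], normalize="columns") * 100

print(counts)
print(shares.round(1))
```

Using `normalize="columns"` matches the layout of the tables in this section, where each percentage is a share of the not-survived or survived column total.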

adult_male: The vast majority of adult males perished (449 of 537). The survival rate for individuals who were not adult males was much higher, suggesting that women and children had a greater chance of survival. A more detailed examination of the 'sex' variable is necessary for clarity.

| adult_male | Total (889) | Not-survived (549) | Survived (340) |
| --- | --- | --- | --- |
| yes | 537 (60%) | 449 (82%) | 88 (26%) |
| no | 352 (40%) | 100 (18%) | 252 (74%) |

who and sex: Most fatalities were male, with male deaths outnumbering female deaths by more than five to one (468 vs. 81). Among the survivors, the share of females (68%) was more than double that of males (32%), which suggests that women and children were prioritized for rescue.

| who | Total (889) | Not-survived (549) | Survived (340) |
| --- | --- | --- | --- |
| man | 537 (60%) | 449 (82%) | 88 (26%) |
| woman | 269 (30%) | 66 (12%) | 203 (60%) |
| child | 83 (9%) | 34 (6%) | 49 (14%) |

| sex | Total (889) | Not-survived (549) | Survived (340) |
| --- | --- | --- | --- |
| male | 577 (65%) | 468 (85%) | 109 (32%) |
| female | 312 (35%) | 81 (15%) | 231 (68%) |
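
The percentages in these tables are shares of each column (for example, 85% of those who died were male). The survival rate within each group is a different quantity, and it can be worked out from the same counts. The short calculation below uses only the figures from the 'sex' table; the variable names are illustrative.

```python
# Counts copied from the 'sex' table: (not survived, survived)
counts = {"male": (468, 109), "female": (81, 231)}

rates = {}
for group, (died, lived) in counts.items():
    total = died + lived
    rates[group] = lived / total  # survival rate within the group
    print(f"{group}: {lived} of {total} survived ({rates[group]:.1%})")

# Ratio of male deaths to female deaths
death_ratio = counts["male"][0] / counts["female"][0]
print(f"male deaths per female death: {death_ratio:.1f}")
```

On these counts, roughly 19% of males survived compared with roughly 74% of females.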

embark_town: Most passengers embarked at Southampton: 644 boarded there, compared with 168 at Cherbourg and only 77 at Queenstown. As a result, Southampton passengers make up the largest share of both the deceased (78%) and the survivors (64%).

| embark_town | Total (889) | Not-survived (549) | Survived (340) |
| --- | --- | --- | --- |
| Southampton | 644 (72%) | 427 (78%) | 217 (64%) |
| Cherbourg | 168 (19%) | 75 (14%) | 93 (27%) |
| Queenstown | 77 (9%) | 47 (9%) | 30 (9%) |

pclass: Most third-class ticket holders did not survive (372 of 491, about 76%). Conversely, the survival rate was highest among first-class ticket holders (134 of 214, about 63%).

| pclass | Total (889) | Not-survived (549) | Survived (340) |
| --- | --- | --- | --- |
| 1 | 214 (24%) | 80 (15%) | 134 (39%) |
| 2 | 184 (21%) | 97 (18%) | 87 (26%) |
| 3 | 491 (55%) | 372 (68%) | 119 (35%) |

age: Most passengers were aged between 25 and 35, with a small number of children aged 0 to 4. The number of elderly passengers was generally low, and there is a noticeable decline in the number of passengers after the age of 35.

fare: Most ticket prices fell between 0 and 50, with the number of passengers decreasing as fares rise beyond 50. The distribution is right-skewed, with a long tail of higher fares.

Correlation Matrix – Heatmap


The variables 'age', 'sibsp', and 'parch' exhibit a minimal correlation with the dependent variable 'survived'. Consequently, we have excluded these three variables from the dataset for subsequent analysis.
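
A screening step of this kind can be sketched by computing each feature's correlation with 'survived' and listing the weak ones. The four-row sample and the 0.1 cutoff below are hypothetical; the text does not state which cutoff was used.

```python
import pandas as pd

# Hypothetical numeric sample, built so that 'age' is uncorrelated with 'survived'
df = pd.DataFrame({
    "survived":   [0, 0, 1, 1],
    "adult_male": [1, 1, 0, 0],
    "fare":       [7.0, 8.0, 50.0, 80.0],
    "age":        [20.0, 30.0, 30.0, 20.0],
})

# Pearson correlation of every feature with the dependent variable
corr = df.corr()["survived"].drop("survived")
print(corr)

# Features whose absolute correlation falls below the assumed cutoff
cutoff = 0.1  # hypothetical value, not taken from the analysis
weak = corr[corr.abs() < cutoff].index.tolist()
print(weak)

# Drop the weakly correlated features
df = df.drop(columns=weak)
```

The same `df.corr()` matrix is what a heatmap of the correlation matrix visualizes.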

Regarding the variables 'sex', 'who', and 'embark_town': these are categorical variables. We applied the pandas get_dummies() function to them to transform their values into binary indicator columns, represented by 0 and 1.
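
As a small illustration of this encoding, the sketch below applies `pandas.get_dummies` to a hypothetical three-row sample. Whether the original analysis dropped one level per variable (for example with `drop_first=True`) is not stated, so all levels are kept here.

```python
import pandas as pd

# Hypothetical sample containing the three categorical variables
df = pd.DataFrame({
    "sex": ["male", "female", "female"],
    "who": ["man", "woman", "child"],
    "embark_town": ["Southampton", "Cherbourg", "Queenstown"],
    "fare": [7.25, 71.28, 7.75],
})

# One 0/1 indicator column per category level; 'fare' is left unchanged
encoded = pd.get_dummies(df, columns=["sex", "who", "embark_town"], dtype=int)

print(encoded.columns.tolist())
print(encoded)
```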

Modeling:

Dependent Variable: survived

Independent Variables: pclass, fare, adult_male, alone, sex, who, embark_town

The dataset has been partitioned into a training set that constitutes 75% and a testing set that accounts for 25%. The model was developed using the training dataset through the implementation of a Logistic Regression Model. Predictions were subsequently made utilizing the testing dataset.
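
The split, fit, and predict workflow can be sketched with scikit-learn as shown below. To keep the example self-contained it uses synthetic data generated with NumPy instead of the Titanic features, and the `random_state` value and default model settings are assumptions rather than the settings used in this analysis.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the feature matrix X and the binary target y
rng = np.random.default_rng(0)
n = 200
X = rng.normal(size=(n, 2))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=n) > 0).astype(int)

# 75% training set, 25% testing set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42  # random_state is an assumed value
)

# Fit on the training set only
model = LogisticRegression()
model.fit(X_train, y_train)

# Predicted classes (0.5 threshold) and class-1 probabilities for the test set
y_pred = model.predict(X_test)
y_prob = model.predict_proba(X_test)[:, 1]

print(X_train.shape, X_test.shape)
print(y_pred[:10])
```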

Link of Actual and Predicted y-value: Actual and Predicted y-value

Classification Report and Confusion Matrix of the Model:

Classification Report

Confusion Matrix

The F1-score of the model stands at 0.79, suggesting that the model performs reasonably well. The model made 111 + 65 = 176 correct predictions out of 223 test records, representing an accuracy of 79%. The Type 1 error (false positives) is 30: the model predicted that 30 passengers survived when they actually did not. The Type 2 error (false negatives) is 17: the model predicted that 17 passengers did not survive when they actually did. Of the two, the Type 2 error is treated as the greater concern here, so keeping it low is preferable.
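
The arithmetic behind these figures can be reproduced directly from the four confusion-matrix counts. The sketch assumes that 111 is the number of true negatives and 65 the number of true positives, which is consistent with the error counts quoted above. The text does not say which averaging its F1-score uses, so the support-weighted average is computed here as one plausible reading.

```python
# Confusion-matrix counts quoted in the text (assumed layout)
tn, fp, fn, tp = 111, 30, 17, 65
total = tn + fp + fn + tp

accuracy = (tn + tp) / total

# Class 1 (survived)
precision_1 = tp / (tp + fp)
recall_1 = tp / (tp + fn)
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)

# Class 0 (not survived)
precision_0 = tn / (tn + fn)
recall_0 = tn / (tn + fp)
f1_0 = 2 * precision_0 * recall_0 / (precision_0 + recall_0)

# F1 averaged over both classes, weighted by the number of actual cases
support_0, support_1 = tn + fp, tp + fn
weighted_f1 = (support_0 * f1_0 + support_1 * f1_1) / total

print(f"test records: {total}")
print(f"accuracy:     {accuracy:.3f}")
print(f"F1 class 0:   {f1_0:.3f}")
print(f"F1 class 1:   {f1_1:.3f}")
print(f"weighted F1:  {weighted_f1:.3f}")
```

Under this assumed layout, both the accuracy and the weighted F1 round to 0.79, while the F1 for the survived class alone is lower, at about 0.73.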





