Analysis of 'titanic' dataset to predict survivor traits using Logistic Regression Model

Logistic Regression is a statistical method for binary classification tasks, such as yes/no, spam/non-spam, or survived/not-survived. It estimates the probability that an observation belongs to one of two classes based on its input features. Unlike Linear Regression, which predicts continuous values, Logistic Regression passes a linear combination of the features through a sigmoid function, which keeps the output between 0 and 1. Although it is called regression, it addresses a classification problem with a categorical dependent variable, and its coefficients are typically estimated by the Maximum Likelihood Method.

Equation:

\[P(y=1|x) = \frac{1}{1 + e^{-(\beta_0 + \beta_1x_1 + \beta_2x_2 +...+ \beta_nx_n)}}\]

where,

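\(P(y=1|x) \implies \) Probability of class 1 (e.g., survived)

\(x_1, x_2, ..., x_n \implies \) Input features

\(\beta_0 \implies \) Intercept

\(\beta_1, \beta_2, ..., \beta_n \implies \) Coefficients of the input features

Decision threshold (e.g., 0.5) converts the probability into a class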

The threshold value is set at 0.5. When the predicted probability is greater than or equal to 0.5, the logistic regression model classifies the outcome as 1; when it falls below 0.5, the model classifies it as 0.
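The equation and the threshold rule can be sketched in a few lines of Python. This is only an illustration: the function names `sigmoid` and `predict_class`, and the intercept and coefficient values, are made up for the example and do not come from the fitted Titanic model.

```python
import math

def sigmoid(z):
    """Map a real number z to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_class(features, intercept, coefficients, threshold=0.5):
    """Return (probability of class 1, predicted class) for one observation."""
    # Linear part of the model: beta_0 + beta_1*x_1 + ... + beta_n*x_n
    z = intercept + sum(b * x for b, x in zip(coefficients, features))
    p = sigmoid(z)
    # Class 1 when the probability reaches the threshold, otherwise class 0
    return p, int(p >= threshold)

# Hypothetical intercept and coefficients, chosen only for illustration
prob, label = predict_class(features=[1.0, 3.0], intercept=0.5, coefficients=[2.0, -1.0])
print(round(prob, 4), label)
```

Here the linear part evaluates to -0.5, so the probability falls below 0.5 and the observation is assigned to class 0.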

Example: Analysis of Seaborn's 'titanic' dataset to predict survivor traits using Logistic Regression Model

The Titanic sank on April 15, 1912, during its maiden voyage, after colliding with an iceberg in the North Atlantic. This engineering marvel was carrying 2,224 individuals, yet only 32% managed to survive, rendering it one of the most catastrophic maritime disasters in history. The incident incited widespread outrage and prompted critical discussions regarding maritime safety, resulting in substantial regulatory reforms. These reforms included mandates for an adequate number of lifeboats for all passengers, enhanced crew training, and the establishment of the International Ice Patrol. A significant contributor to the high casualty rate was the inadequate provision of lifeboats: only 20 were available, with a capacity for approximately 1,178 individuals. Many passengers were reluctant to abandon the ship, mistakenly believing it to be secure, and the ensuing chaos during the sinking exacerbated the dire circumstances.

To gain a deeper understanding of this calamity, we will examine the built-in dataset ‘titanic’ found in the Seaborn library using a Logistic Regression model. This analysis will focus on the demographics of the survivors, investigating variables such as age, sex, passenger class, and port of embarkation to reveal insights into the survival rates.

Link of Dataset: titanic_dataset

Data Dictionary:

The Titanic dataset available in Seaborn represents a portion of the passenger data collected by the British Board of Trade. It encompasses information regarding the individuals who were on board the RMS Titanic, which tragically sank during its inaugural journey in 1912. This dataset is commonly utilized for illustrating classification algorithms and conducting exploratory data analysis (EDA). The Titanic dataset comprises the following columns:

SN	Variables	Data Type	Description
1	survived	Integer	Survival status (0 = No, 1 = Yes)
2	pclass	Integer	Passenger class (1 = 1st, 2 = 2nd, 3 = 3rd)
3	sex	Categorical	Gender of the passenger (male, female)
4	age	Float	Age of the passenger in years. Some values may be NaN.
5	sibsp	Integer	Number of siblings or spouses aboard the Titanic
6	parch	Integer	Number of parents or children aboard the Titanic
7	fare	Float	Passenger fare in British pounds
8	embarked	Categorical	Port of embarkation (C = Cherbourg, Q = Queenstown, S = Southampton)
9	class	Categorical	Class of travel (First, Second, Third). This is a categorical representation of pclass.
10	who	Categorical	Simplified demographic (man, woman, child)
11	adult_male	Boolean	Whether the passenger was adult male (True, False)
12	deck	Categorical	Deck of the ship where the passenger was located (A through G, NaN for unknown)
13	embark_town	Categorical	Town of embarkation (Cherbourg, Queenstown, Southampton)
14	alive	Categorical	Survival status (yes, no). This is a categorical representation of survived.
15	alone	Boolean	Whether the passenger was traveling alone (True, False)

The 'titanic' dataset comprises a total of 891 records and contains 15 variables, which include 4 integer variables, 2 float variables, 2 boolean variables, and 7 categorical variables. For the purpose of analysis, the boolean variables have been transformed into a 0/1 format.

Additionally, we have eliminated redundant variables such as 'embarked', 'class', and 'alive', as we will utilize their corresponding variables 'embark_town', 'pclass', and 'survived' in our analysis.
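The two preparation steps described above (booleans to 0/1, then dropping the redundant columns) might look like this in pandas. The small frame below is hand-made with the same column names as the Seaborn dataset; its values are illustrative and are not real passenger records.

```python
import pandas as pd

# Hand-made stand-in for the dataset (illustrative values only)
df = pd.DataFrame({
    "survived": [0, 1, 1],
    "pclass": [3, 1, 3],
    "embarked": ["S", "C", "S"],
    "class": ["Third", "First", "Third"],
    "embark_town": ["Southampton", "Cherbourg", "Southampton"],
    "alive": ["no", "yes", "yes"],
    "adult_male": [True, False, False],
    "alone": [False, False, True],
})

# Convert the two boolean columns to 0/1 integers
df = df.astype({"adult_male": int, "alone": int})

# Drop the columns that duplicate 'embark_town', 'pclass', and 'survived'
df = df.drop(columns=["embarked", "class", "alive"])

print(df.dtypes)
```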

Exploratory Data Analysis:

Missing Value Treatment:

The variables 'age', 'deck', and 'embark_town' contain missing data.

The 'deck' variable exhibits a significant amount of missing information, with 688 out of a total of 891 entries being absent. Consequently, this variable has been excluded from the dataset for subsequent analysis.

The 'age' variable has 177 missing entries. We retained it because of its importance and filled the missing values with the average age of the passenger's ticket class ('pclass'). As the boxplot of age by class shows, the average age is about 38 for pclass=1, 30 for pclass=2, and 25 for pclass=3.

The 'embark_town' variable has 2 missing entries. We have removed these two rows from the dataset for further analysis. As a result, the dataset now comprises 889 rows.
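A minimal sketch of the three missing-value steps above, assuming a pandas DataFrame with the same column names. The six rows are invented for illustration, so the class means differ from the real ones.

```python
import pandas as pd

# Invented rows; None marks a missing value
df = pd.DataFrame({
    "pclass": [1, 1, 2, 2, 3, 3],
    "age": [40.0, None, 30.0, None, 20.0, 30.0],
    "deck": ["C", None, None, None, None, "E"],
    "embark_town": ["Southampton", "Cherbourg", None,
                    "Southampton", "Queenstown", "Southampton"],
})

# 1. Drop 'deck', which is mostly missing
df = df.drop(columns=["deck"])

# 2. Fill each missing age with the mean age of the same passenger class
class_mean_age = df.groupby("pclass")["age"].transform("mean")
df["age"] = df["age"].fillna(class_mean_age)

# 3. Remove the rows where 'embark_town' is missing
df = df.dropna(subset=["embark_town"]).reset_index(drop=True)

print(df)
```

Using `transform("mean")` keeps the result aligned row by row with the original frame, which is what allows it to be passed straight to `fillna`.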

Analysis of Dependent Variable (survived):

The majority of those on board did not survive. In the final sample dataset, out of the 889 passengers, 549 (62%) perished, while 340 (38%) survived. Our goal is to determine whether the survival of these passengers was merely a matter of luck or if specific advantages or opportunities during the voyage contributed to their survival.

People on Board	889
Not-survived	549 (62%)
Survived	340 (38%)

Analysis of Independent Variables:

alone: This variable is derived from 'sibsp' and 'parch'. If both values are '0', it indicates that the passenger is alone; otherwise, they are not. A significant number of those who perished were alone on the ship, while a higher percentage of survivors were not alone.

alone	Total (889)	Not-survived (549)	Survived (340)
yes	535 (60%)	374 (68%)	161 (47%)
no	354 (40%)	175 (32%)	179 (53%)
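Tables of this kind can be produced with a cross-tabulation. The sketch below rebuilds one row per passenger from the four counts in the 'alone' table and then recomputes the column percentages; whether the original analysis used `pd.crosstab` is not stated, so treat this as one possible approach.

```python
import pandas as pd

# Counts from the 'alone' table: (alone, survived) -> number of passengers
counts = {("yes", 0): 374, ("yes", 1): 161, ("no", 0): 175, ("no", 1): 179}
rows = [
    {"alone": alone, "survived": survived}
    for (alone, survived), n in counts.items()
    for _ in range(n)
]
df = pd.DataFrame(rows)

# Raw counts per group and outcome
table = pd.crosstab(df["alone"], df["survived"])

# Percentage of each outcome column (not-survived / survived) by group
col_pct = pd.crosstab(df["alone"], df["survived"], normalize="columns").mul(100).round(0)

# Survival rate within each group, a complementary view of the same data
survival_rate = df.groupby("alone")["survived"].mean()

print(table)
print(col_pct)
print(survival_rate.round(2))
```

The within-group survival rate is often easier to interpret than the column percentages, because it does not depend on how large each group is.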

adult_male: The vast majority of adult males appear to have perished. The survival rate for individuals who were not adult males was higher, suggesting that children and possibly women had a greater chance of survival. A more detailed examination of the 'sex' variable is necessary for clarity.

adult_male	Total (889)	Not-survived (549)	Survived (340)
yes	537 (60%)	449 (82%)	88 (26%)
no	352 (40%)	100 (18%)	252 (74%)

who and sex: The majority of fatalities were male: 468 males died compared with 81 females, nearly six times as many. Females made up 68% of the survivors, more than double the 32% share for males, which is consistent with women and children being prioritized for rescue.

who	Total (889)	Not-survived (549)	Survived (340)
man	537 (60%)	449 (82%)	88 (26%)
woman	269 (30%)	66 (12%)	203 (60%)
child	83 (9%)	34 (6%)	49 (14%)

sex	Total (889)	Not-survived (549)	Survived (340)
male	577 (65%)	468 (85%)	109 (32%)
female	312 (35%)	81 (15%)	231 (68%)
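A short worked calculation helps separate two different comparisons: the share of all deaths accounted for by each sex, and the death rate within each sex. The counts come from the 'sex' table above; the variable names are only illustrative.

```python
# Counts taken from the 'sex' table above
died = {"male": 468, "female": 81}
total = {"male": 577, "female": 312}

# Death rate within each sex: deaths divided by passengers of that sex
death_rate = {s: died[s] / total[s] for s in died}

# Share of all deaths accounted for by each sex
share_of_deaths = {s: died[s] / sum(died.values()) for s in died}

# Two different ratios that are easy to confuse
rate_ratio = death_rate["male"] / death_rate["female"]
count_ratio = died["male"] / died["female"]

print({s: round(r, 2) for s, r in death_rate.items()})
print(round(rate_ratio, 1), round(count_ratio, 1))
```

The count ratio is close to six, while the ratio of within-group death rates is closer to three, because males also made up the larger group on board.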

embark_town: Most passengers embarked from Southampton (644), compared to 168 from Cherbourg and only 77 from Queenstown, so Southampton passengers form the largest group among both the deceased and the survivors. Cherbourg passengers are over-represented among the survivors: they account for 27% of survivors but only 19% of all passengers.

embark_town	Total (889)	Not-survived (549)	Survived (340)
Southampton	644 (72%)	427 (78%)	217 (64%)
Cherbourg	168 (19%)	75 (14%)	93 (27%)
Queenstown	77 (9%)	47 (9%)	30 (9%)

pclass: The majority of those holding third-class tickets did not survive. Conversely, the survival rate was higher among those with first-class tickets.

pclass	Total (889)	Not-survived (549)	Survived (340)
1	214 (24%)	80 (15%)	134 (39%)
2	184 (21%)	97 (18%)	87 (26%)
3	491 (55%)	372 (68%)	119 (35%)

age: Most passengers were aged between 25 and 35, with a small number of children aged 0 to 4. The number of elderly passengers was generally low, and there is a noticeable decline in the number of passengers after the age of 35.

fare: Most ticket prices fell between 0 and 50, with the number of passengers decreasing as fares rise beyond 50. The distribution is right-skewed.

Correlation Matrix – Heatmap

The variables 'age', 'sibsp', and 'parch' exhibit a minimal correlation with the dependent variable 'survived'. Consequently, we have excluded these three variables from the dataset for subsequent analysis.
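The screening step above can be sketched as follows. The four rows and the 0.1 cut-off are assumptions made for the example (the cut-off used in the original analysis is not stated), and the numbers were chosen so that 'sibsp' has zero correlation with 'survived'.

```python
import pandas as pd

# Invented numeric data, for illustration only
df = pd.DataFrame({
    "survived": [0, 0, 1, 1],
    "fare": [10.0, 20.0, 60.0, 80.0],
    "sibsp": [0, 1, 0, 1],
})

# Pearson correlation of every feature with the dependent variable
corr_with_target = df.corr()["survived"].drop("survived")

# Features whose absolute correlation falls below an assumed cut-off of 0.1
weak = corr_with_target[corr_with_target.abs() < 0.1].index.tolist()

# Remove the weakly correlated features
df_reduced = df.drop(columns=weak)

print(corr_with_target.round(2))
print(weak)
```

The full matrix returned by `df.corr()` is what a heatmap function such as Seaborn's `heatmap` would display.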

Regarding the variables 'sex', 'who', and 'embark_town': These are categorical variables. We applied the pandas get_dummies() function to these variables to transform their values into binary indicator columns, represented by 0 and 1.
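As a small illustration of this encoding, the sketch below applies `pd.get_dummies` to three invented rows. The `dtype=int` argument is a choice made here so that the indicators print as 0/1 rather than True/False.

```python
import pandas as pd

# Three invented rows covering every category of 'who' and 'embark_town'
df = pd.DataFrame({
    "sex": ["male", "female", "female"],
    "who": ["man", "woman", "child"],
    "embark_town": ["Southampton", "Cherbourg", "Queenstown"],
})

# One 0/1 indicator column per category of each listed variable
encoded = pd.get_dummies(df, columns=["sex", "who", "embark_town"], dtype=int)

print(encoded.columns.tolist())
print(encoded)
```

Each original column is replaced by one indicator per category, for example `sex_male` and `sex_female`. Whether the original analysis dropped one level per variable (the `drop_first=True` option) is not stated.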

Modeling:

Dependent Variable: survived

Independent Variables: pclass, fare, adult_male, alone, sex, who, embark_town

The dataset has been partitioned into a training set that constitutes 75% and a testing set that accounts for 25%. The model was developed using the training dataset through the implementation of a Logistic Regression Model. Predictions were subsequently made utilizing the testing dataset.
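The split, fit, and predict steps could be written with scikit-learn roughly as below. The data is a synthetic stand-in with only three of the features, and `random_state=42` and `max_iter=1000` are assumptions made for reproducibility and convergence; none of these values come from the original analysis.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (not the real Titanic records)
n = 40
df = pd.DataFrame({
    "pclass": [1 + (i % 3) for i in range(n)],
    "fare": [10.0 + 2.0 * (i % 7) for i in range(n)],
    "adult_male": [i % 2 for i in range(n)],
})
# Synthetic target: everyone who is not an adult male survives
df["survived"] = 1 - df["adult_male"]

X = df.drop(columns=["survived"])  # independent variables
y = df["survived"]                 # dependent variable

# 75% of the rows for training, 25% for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# Fit the model on the training set only
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Predicted classes and predicted survival probabilities for the test set
y_pred = model.predict(X_test)
y_prob = model.predict_proba(X_test)[:, 1]

print(len(X_train), len(X_test))
```

`predict` applies the 0.5 threshold to the probabilities that `predict_proba` returns, which matches the decision rule described earlier.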

Link of Actual and Predicted y-value: Actual and Predicted y-value

Classification Report and Confusion Matrix of the Model:

The F1-score of the model stands at 0.79, suggesting that the model performs reasonably well. The total number of correct predictions made by the model is 111 + 65 = 176 out of 223 test records, representing an accuracy of 79%. The Type I error count (false positives) is 30: the model predicted 30 passengers as having survived when they actually did not. Conversely, the Type II error count (false negatives) is 17: the model classified 17 passengers as not having survived when they actually did. The Type II error is treated as the greater concern here, so a lower count is preferable.
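The reported figures can be checked by hand from the four confusion-matrix counts. Reading 111 as the true negatives and 65 as the true positives is an assumption, since the paragraph does not label them; the 30 and 17 are the Type 1 and Type 2 errors it describes.

```python
# Counts from the paragraph above (the tn/tp assignment is an assumption)
tn, tp = 111, 65   # correct predictions: not-survived, survived
fp, fn = 30, 17    # Type 1 errors (false positives), Type 2 errors (false negatives)

total = tn + tp + fp + fn
accuracy = (tn + tp) / total

# Metrics for the 'survived' class (class 1)
precision_1 = tp / (tp + fp)
recall_1 = tp / (tp + fn)
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)

# Metrics for the 'not-survived' class (class 0)
precision_0 = tn / (tn + fn)
recall_0 = tn / (tn + fp)
f1_0 = 2 * precision_0 * recall_0 / (precision_0 + recall_0)

# F1 averaged over both classes, weighted by the number of actual cases
support_0, support_1 = tn + fp, tp + fn
weighted_f1 = (support_0 * f1_0 + support_1 * f1_1) / total

print(round(accuracy, 2), round(f1_1, 2), round(f1_0, 2), round(weighted_f1, 2))
```

Under this reading, the F1-score of the 'survived' class alone comes out lower than 0.79, so the reported 0.79 is more likely a weighted average across both classes; the source does not say which variant it reports.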
