Titanic disaster (SQL)

Titanic disaster SQL

Of course you can analyze data with Phyton or R, but how about the good old SQL?

In this example I show you my process to analyse the Titanic disaster dataset purely in SQL.

It's a bit extensive but exciting ... please bear with me to the final conclusion.

Introduction

Also in 2022 SQL is the most popular database management system in the world: Most popular database management systems 2022 | Statista. The Titanic dataset is a relatively simple csv file containing the data of 1306 passengers and their fate in the night of April 15 1912. Let's start!

Preperation

I am using SQL Server 2019 installation and SQL Server Management Studio 18. The dataset is a csv file downloaded from https://data.world/nrippner/titanic-disaster-dataset.

Step 1: Create the database and the table

I created the database and the table here, to enter the data I used the BULK INSERT statement.

use titanic;

drop table if exists dbo.passengers;

create table passengers

(

passengerID int primary key,

pclass smallint not null,

survived smallint not null,

last_name varchar(50),

first_name varchar(100),

sex varchar(10),

age decimal(12,2),

sibsp smallint,

parch smallint,

ticket varchar(50),

fare decimal(12,2),

cabin varchar(30),

embarked varchar(30),

boat varchar(30),

body int,

homedest varchar(50)

)

BULK INSERT dbo.passengers

FROM 'D:\COMPUTER\data\nrippner-titanic-disaster-dataset\nrippner-titanic-disaster-dataset\original\titanic_disaster_original2.csv'

WITH

(

FORMAT='CSV',

FIRSTROW=2

);

Step 2: Check for missing values

Missing values have a significant impact on the data quality thus on the results. For every data analysis task, the phrase "Shit in - shit out" is still valid. First let's find out, how many empty rows we have for each column and return a percentage for it.

with CTE1 as

(

select

cast(100.00 * sum(case when pclass IS NULL then 1 else 0 end)/ max(all_passengers) as decimal(6,1)) as pclass,

cast(100.00 * sum(case when survived IS NULL then 1 else 0 end) / max(all_passengers) as decimal(6,1)) as survived,

cast(100.00 * sum(case when last_name IS NULL then 1 else 0 end) / max(all_passengers) as decimal(6,1)) as lastname,

cast(100.00 * sum(case when first_name IS NULL then 1 else 0 end) / max(all_passengers) as decimal(6,1)) as firstname,

cast(100.00 * sum(case when sex IS NULL then 1 else 0 end) / max(all_passengers) as decimal(6,1)) as sex,

cast(100.00 * sum(case when age IS NULL then 1 else 0 end) / max(all_passengers) as decimal(6,1)) as age,

cast(100.00 * sum(case when sibsp IS NULL then 1 else 0 end) / max(all_passengers) as decimal(6,1)) as sib,

cast(100.00 * sum(case when parch IS NULL then 1 else 0 end) / max(all_passengers) as decimal(6,1)) as parch,

cast(100.00 * sum(case when ticket IS NULL then 1 else 0 end) / max(all_passengers) as decimal(6,1)) as ticket,

cast(100.00 * sum(case when fare IS NULL then 1 else 0 end) / max(all_passengers) as decimal(6,1)) as fare,

cast(100.00 * sum(case when cabin IS NULL then 1 else 0 end) / max(all_passengers) as decimal(6,1)) as cabin,

cast(100.00 * sum(case when embarked IS NULL then 1 else 0 end) / max(all_passengers) as decimal(6,1)) as embarked,

cast(100.00 * sum(case when boat IS NULL then 1 else 0 end) / max(all_passengers) as decimal(6,1)) as boat,

cast(100.00 * sum(case when body IS NULL then 1 else 0 end) / max(all_passengers) as decimal(6,1)) as body,

cast(100.00 * sum(case when homedest IS NULL then 1 else 0 end) / max(all_passengers) as decimal(6,1)) as home

from

(

select count(*) over() as all_passengers, *

from passengers

) sub1

)

select 'pclass' as attribute, pclass as null_percent from CTE1

union all

select 'survived', survived from CTE1

union all

select 'lastname', lastname from CTE1

union all

select 'firstname', firstname from CTE1

union all

select 'sex', sex from CTE1

union all

select 'age', age from CTE1

union all

select 'sibsp', sib from CTE1

union all

select 'parch', parch from CTE1

union all

select 'ticket', ticket from CTE1

union all

select 'fare', survived from CTE1

union all

select 'cabin', cabin from CTE1

union all

select 'embarked', embarked from CTE1

union all

select 'boat', boat from CTE1

union all

select 'body', body from CTE1

union all

select 'homedest', home from CTE1

;

It shows that [age] contains 20% nulls, [cabin] 77.5%, [boat] 62,9%, [body] 90,8% and [homedest](homedestination) 43,1%. While I am not going to use [boat], [body] and [homedest] for my analysis, I left with the critical columns [age] and [cabin]. We will deal with this problem later.

Range of values

Now I have a closer look at [age] and [fare], I calculated the average and range of the values. Both columns are right skewed, means that the max values are more far away from the average than the mins, thus there are outliers on the upper range of the values, which has an impact on the average calculation.

select 'age' as attribute, count (distinct coalesce(age,0)) as dist_ct, min(age) as min_val, avg(age) as avg_val, max(age) as max_val from passengers
union all
select 'fare', count (distinct fare), min(fare), avg(fare), max(fare) from passengers;

Replace or leave missing values?

There are two ways to deal with missing (null) data. Either replace it with something more reasonable or ignore it, e.g. using the WHERE [age] IS NOT NULL statement. I decided to checked what's the impact by replacing the missing values by the average age based on the title, which is the part of the [first_name] column. That seems a good way to get a reasonable detailed number for replacement. To do that, I need to extract the title from the [first_name] column.

With that I created a view to collect the related data:

-- drop view age_replaced;

-- average age by title

-- create view age_replaced as

with CTE1 as
(
select
passengerID,
trim(substring(first_name,1,charindex('.',first_name,1))) as title,
first_name,
last_name,
age
from passengers p
)
,
CTE2 as
(
select
title,
avg(age) as avg_age
from
(
SELECT
[first_name],
trim(substring(first_name,1,charindex('.',first_name,1))) as title,
[sex],
[age]
FROM [titanic].[dbo].[passengers]
) sub1
group by title
)
select
t1.passengerID,
t1.title,
t1.first_name,
t1.last_name,
t1.age,
cast(

-- here the values are replaced
case
when t1.age <1 then 1
when t1.age is null then t2.avg_age
else t1.age
end
as decimal(6,2)
) as replacedage,
t2.avg_age
from CTE1 t1
inner join CTE2 t2 on t1.title = t2.title;

select
p.sex,
AVG(p.age) as avg_original,
avg(replacedage) as avg_replaced,
AVG(p.age) - avg(replacedage) as avg_difference
from passengers p
join [dbo].[age_replaced] r on r.passengerid = p.passengerID
group by p.sex;

Obviously replacing the missing values with the average age according to the passenger’s title does not make a significant impact, so I decided to continue with the age grouping ignoring the missing values (263 null rows out of 1306).

I am satisfied with the result and the data quality so far, now let's dive into the analysis ... this is exciting!

Now I can run the comparison on this view

Analysis of survival

The overall survival rate

The general approach to calculate survival rate base on a subquery using a WINDOW functions in order to get the divisor grouped by the desired dimension. This repeats for the following analysis in a similar way.

/* overall survival */

select
t1.survived,
count(*) as num_of_pass,
max(t1.total_passengers) as total_pass,
format(round(1.00 * count(*) / max(t1.total_passengers),4),'P') as survival_rate,

replicate('|',floor(100 * count(*) / max(t1.total_passengers))) as visual
from

-- subquery for the overall number of passengers
(
select
count(*) over () as total_passengers, passengers.*
from passengers
) t1
group by t1.survived; -- group by the code survived (1 or 0)

Tragically nearly two third of the passengers didn't survive, only 500 of 1306 made it. Now lets dig a bit deeper to understand, how the survivors are distributed by class, sex, age ect.

Survival by sex and class

As we all know from the books, women had a much higher chance of survival than men. The figures confirm that, female passengers in all classes had a survival rate of 72.7% while 4 of 5 men died. Using GROUP BY ROLLUP shows subtotals by sex.

/* survial rate by sex and class */

with CTE1 as
(

select
t1.pclass as pclass,
case
when t1.pclass = 1 then 'First'
when t1.pclass = 2 then 'Second'
when t1.pclass = 3 then 'Third'
else 'False'
end as class,
t1.sex as sex,
-- count(*) as num_of_pass_by_pclass,
max(total_passengers) as all_passengers,
sum(survived) as survived
from
(
select
passengerid as pass_id,
pclass as pclass,
sex as sex,
survived as survived,
count(*) over (partition by pclass, sex) as total_passengers
from passengers
) t1
where survived = 1
group by sex, pclass
)

select
sex,
class,
sum(all_passengers) as all_passengers,
sum(survived) as survived,
cast((100.00 * sum(survived)) / sum(all_passengers) as decimal(5,1)) as 'survived %',
100-cast((100.00 * sum(survived)) / sum(all_passengers) as decimal(5,1)) as 'died %'
from CTE1
group by rollup (sex, class)
;

Survival by age

Average passenger by age

Before we start let's have a look on the average age by class and sex; just for curiosity and to practice the PIVOT statement ;-).

select * from
(
select
pclass,
sex,
age
from passengers
) t
PIVOT(
avg(age)
for sex IN ([female], [male])
) as pvt
order by pclass
;

Passengers in the third class were significantly younger than in the first class (15 years for female and 16 years from male). Did the younger and fitter ones had a better chance of survival?

Creating age bins

The column [age] has 99 distinct values, so it does not make a lot of sense to calculate the survival rate on each of them. More expedient is the approach to group [age] into bins. I did that creating the following helper table:

-- drop table age_ranges;

/* create a table age_ranges */

IF NOT EXISTS (SELECT * FROM sysobjects WHERE name='age_ranges' and xtype='U')
create table age_ranges (agegrp int primary key, agegroup varchar(20), age_from int, age_to int) ;

insert into age_ranges
values
(1,'under 10',0,10),
(2,'between 10 and 20',10,20),
(3,'between 20 and 30',20,30),
(4,'between 30 and 40',30,40),
(5,'between 40 and 50',40,50),
(6,'over 50',50,99),
(7,'missing value',null,null);

select * from age_ranges;
-- truncate table age_grouping;

This basic query matches the helper range table with the fact table

This way an age group is assigned to each passenger

select
passengerid,
age,
r.agegroup
from passengers p
inner join age_ranges r on p.age >= r.age_from and p.age < r.age_to
order by p.age;

Survival by age bin

Now it's pretty straight forward to calculate the survival rate per age bin.

Note, that the total number of passengers in age bins is only 1046, the count of survivors only 427 instead of 500 as null values in the [age] column are not considered (263 nulls).

select
agegrp,
agegroup,
count(agegrp) as count_by_age_group,
format((1.00 * count(agegrp) / max(all_passengers)),'P') as bin_percent,
sum(survived) as count_survived,
format((1.00 * sum(survived) / count(agegrp)),'P') as percent_survived
from (
select
passengerID,
age,
survived,
r.[agegrp],
r.[agegroup],
count(passengerid) over() as all_passengers
from passengers p
/* join with the age ranges table */
inner join age_ranges r on p.age >= r.age_from and p.age < r.age_to
) sub1
group by agegrp,agegroup
order by agegrp;

Infants and children under 10 made up the smallest group of passengers (7.8%), however their survival chance was nearly twice (61%) comparing to adults, which had a survival rate between 36 and 42%.

Bring all together (class, sex and age bins)

Now bring everything together to see the results by age bin, sex and class.

with CTE1 as
(
select
passengerID,
pclass,
sex,
age,
survived,
case when age <= 10 then 1
when age > 10 and age <= 20 then 2
when age > 20 and age <= 30 then 3
when age > 30 and age <= 40 then 4
when age > 40 and age <= 50 then 5
when age > 50 then 6
else 7
end as agesegm
from passengers
where age is not null
)
select
c.pclass,
c.sex,
t.agegroup,
count(passengerid) as no_of_passenger,
sum(c.survived) as survived,
format(1.00 * sum(c.survived) / count(passengerid),'P') as survicedpercent
-- replicate('|',floor(100 * sum(v.survived) / count(passengerid))) as visual
from CTE1 c
inner join age_grouping t on t.agegrp = c.agesegm
group by c.pclass,c.sex, c.agesegm,t.agegroup
order by c.pclass,c.sex, c.agesegm
;

The result list shows 36 rows (3 classes, 2 gender, 6 age groups), which I show the first 10 rows here. But there is a very interesting fact noticeable, when we visualize the data (sorry, I used Power BI for that as SQL really doesn't have the greatest visualization capacities).

The male passengers in the second class had an even lower survival rate than the ones in the third class (except the infants): only 5 of 147 men survived travelling second class (3.4%), the survival rate in the third class for these passenger groups were 15.1% and 33.8% in the first class.

Top 10 survival groups

Ok, that's a lot of numbers. How about we look at the top 10 passenger groups on class, sex and age to see, who had the best chances.

select top 10
a.pclass,
a.sex,
ag.agegroup,
count(pclass) as no_of_passenger,
sum(survived) as survived,
format(1.00 * sum(survived) / count(passengerid),'P') as survicedpercent
from v_agegrouping a (I used a view here to make my life easier)
inner join age_grouping ag on ag.agegrp = a.agesegm
group by a.pclass,ag.agegroup,a.sex
order by (1.00 * sum(survived) / count(passengerid) ) desc, pclass
;

With no surprise the top 10 survivor groups are mainly women, actually 8 out of 10, while male infants made the two remaining groups of survivors in the top 10. Only 14 male passengers are in the top 10 survivor groups but 159 female.

What else can we dig out of the data?

Solo or family travelers: who's survival chances were better?

To answer this question, we need to look at the information about siblings or spouses [sibsp] and parents or children [parch] as well [ticket] and [last_name] in order to figure out, who is a solo traveler or a family member. But this has come challenges:

In first class for example families often travelled with their staff/servants on the same ticket (example A), but these passengers are not part of the family, the [sibsp] and [parch] shows a zero. Shall these "staff" passengers count as solo travelers? On the other hand, in third class several passengers travelling on the same ticket are men travelling together, but probably I would consider them as solo travelers (example B). There also families and couples travelling on different tickets for some reason but not marked as sibsp as shown in example C.

Example A

Example B

Example C

Create the passenger segmentation

Finally I had to make a decision how to determine family vs solo travellers. The most obvious attribute is to consider everyone, who has a value either in the column [sibsp] or [parch] as a familiy member, call it "familiy size".

Passengers having a family size of 0 and a lastname_ticket count (partition by last_name and ticket) is 1 and the count of ticketnumber is 1also, they are considered as solo travelers. Thr remaining passengers are classified as "unclear status", which are by the way 127 passengers. I want to call this segmentation [fam_group].

/* calculate the familysize3 */

with CTE1 as
(
select
passengerID,
survived,
pclass,
last_name,
first_name,
ticket,
cabin,
sex,
age,
sibsp,
parch,
familysize,
lastnamecount,
-- lastnamerow,
lastnameticketcount,
ticketcount,
ticketrow
from
(
select
*,
sibsp + parch as familysize, -- simply add sibsp and parch to determine, whether there is a value > 0
count(last_name) over(partition by last_name) as lastnamecount, -- count the occurance of name by last_name
-- ROW_NUMBER() over(partition by last_name order by last_name) as lastnamerow, -- the last name in order of their occurance
count(last_name) over(partition by last_name, ticket) as lastnameticketcount, -- count the name by last_name and ticketnumber grouping
count(ticket) over(partition by ticket) as ticketcount, -- number of tickets count by the ticket number
ROW_NUMBER() over(partition by ticket order by ticket) as ticketrow -- count of ticket number in order of their occurance
from passengers
) t1
)
select * from
(
Select *,
case
when familysize > 0 then 'Rel' -- here they are or having relatives of some kind -> means family
when familysize = 0 and lastnameticketcount = 1 and ticketcount = 1 then 'Solo' -- the familysize is 0 and there is one ticket per name -> solo
else 'unclear'
end fam_group
from CTE1
) t2
order by last_name,ticket,ticketrow;

Now lets see, if this answers the question. Put the segmentation into a CTE and run the analysis on it:

It shows that having any relatives (spouse or children) slightly increased the chances of survival for male (16.4% ↗ 25.9%) and female (66.9% ↗ 72.4%) passengers comparing to passengers travelling solo. I assume, that travellers with family had a better chance to get into a lifeboat than solo travelers.

One last thing: the survival rate by embarked harbour

The survival rate by embarked harbor is a perfect example, that you sometimes have to look twice to find the story behind your numbers.

select
t1.embarked,
max(pass_embarked) as allpassengers,
sum(survived) as survived,
100.00 * sum(survived) / max(pass_embarked) as survival_rate
from
(
select
passengerid as pass_id,
embarked,
survived,
count(*) over (partition by embarked) as pass_embarked
from passengers
where embarked is not null -- two passengers have no embarked code
) t1
where survived = 1
group by t1.embarked;

It shows, that by boarding the Titanic in Cherbourg, it seems the chance of survival is much higher than travelling from Queenstown or Southampton. But that does not make sense, hence there must be an underlaying information, which cause this result. A reasonable explanation must lay in a different distribution of passengers, for example their passenger class or the cabin, they travelled. As we don't have consistent information about the cabin, let's have a look on the travel-class.

-- the number of passengers embarked
with CTE1 as
(
select
embarked,
pclass,
count(passengerid) as passengercount,
100.00 * count(passengerid) / max(no_pass_emb) as perc_emb_class,
100.00 * count(passengerid) / max(pass_all) as perc_pass_all
from
(
select
passengerid,
pclass,
embarked,
count(passengerid) over(partition by embarked) as no_pass_emb,
count(*) over () as pass_all
from passengers
where embarked is not null
) sub1
group by embarked,pclass
)
,

-- survival rate by embarked
CTE2 as
(
select
embarked,
max(pass_embarked) as embarkedpassengers,
sum(survived) as survived,
100.00 * sum(survived) / max(pass_embarked) as survival_rate
from
(
select
passengerid as pass_id,
embarked,
survived,
count(*) over (partition by embarked) as pass_embarked
from passengers
where embarked is not null
) sub2
where survived = 1
group by embarked
)
-- select * from CTE2;

-- return information
select
case
when c1.embarked = 'C' then 'Cherbourg (FRA)'
when c1.embarked = 'Q' then 'Queenstown (IRL)'
when c1.embarked = 'S' then 'Southampton (ENG)'
else 'missing'
end as embarked_harbour,
c1.pclass,
c1.passengercount,
convert(decimal(6,1),c1.perc_pass_all) as all_pass_perc,
convert(decimal(6,1),c1.perc_emb_class) as embarked_pass_perc,
convert(decimal(6,1),c2.survival_rate) as embarked_survival_perc
from CTE1 c1
inner join CTE2 c2 on c2.embarked=c1.embarked
;

The answer lies in the passenger distribution by passenger class. It reveals that the higher survival rate for Cherbourg is related to the proportional higher number of passengers travelling first class rather in class 2 or 3 (52.2%) and as we know, the survival rate in the first class was much higher than in second or third class. In contrast, boarding from Queenstown, 92% of the passengers travelled in class 3 with a much lower chance of survival as well from Southampton, where 52% of passengers travelled third class.

/* calculate the survival rate by family group */
select
t1.fam_group,
sex,
-- count(*) as num_of_pass_by_famgroup,
max(total_passengers) as allpassengers,
sum(survived) as survived,
format(round(sum(survived) / cast(max(total_passengers) as float),4),'P') as survival_rate
from
(
select
passengerid as pass_id,
fam_group,
survived,
sex,
count(*) over (partition by fam_group, sex) as total_passengers
from CTE2
) t1
where survived = 1
group by t1.fam_group, t1.sex;

Final conclusion and kudos

With this post I not only want to show my abilities using SQL for data analysis, I also feel it exciting and satisfying to find insights about the human fate in one of the biggest passenger ship disasters in history.

And of course I am not the first and only one analyzing the Titanic disaster with SQL, this article was inspired by Do Lee and his article here Kaggle Titanic Competition in SQL | by Do Lee | Towards Data Science.