Loading The Titanic Passenger Data Into PostgreSQL With DBeaver Part 1

The sinking of the RMS Titanic in 1912 has shocked and entertained for over a century. I will admit to being shocked to find a GitHub repo with a CSV-formatted file of the passenger list. This file is an interesting glimpse into the lives of the folks who sailed on the ill-fated ship so long ago, and interesting datasets are always good for examples. I had been looking for a project to show off what I call 'Lifting and Shifting' but others call ETL.

At the PG NYC 2025 conference, Ryan Booz had a session titled Transforming Data with the Power of PostgreSQL and SQL. For many years, ETL was considered the way to move data from one source to another. Ryan argued, rather convincingly, that it should be ELT.

ETL?

ETL is a three-step data integration process (Extract, Transform, Load) used to collect raw data from various sources, process it into a consistent, usable format, and then load it into a central data warehouse or database for analytics and business intelligence.

Spreadsheet conversions are a fairly common task and DBeaver is a slick way to make the change. 

ELT?

The 'T' is often the tricky part. Databases have a lot of tools to manipulate data. So why put off using the power of the database until the end? What if we loaded the data into a database as an intermediate step and used the power of the database to transform the data into its final form? This seems like an obviously better practice. So how do we do it?
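To make the ELT idea concrete, here is a minimal sketch in SQL. The table and column names are invented for illustration and are not part of the Titanic import yet: the load step dumps the raw file into a staging table as-is, and the transform step is plain SQL run afterwards.

    -- Load: raw values land in a staging table, warts and all.
    CREATE TABLE staging_passengers (raw_name text, raw_sex text, raw_fare text);

    -- Transform: plain SQL reshapes the staged data into its final form.
    CREATE TABLE passengers AS
    SELECT trim(raw_name)         AS name,
           raw_sex = 'female'     AS is_female,
           raw_fare::numeric(8,2) AS fare
      FROM staging_passengers;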

Peek At The Data

Using the link above, I downloaded the CSV file of the passenger list. In DBeaver Community Edition, I used File -> Open to access the titanic.csv file. This was done only to investigate the file and does not actually import any data.

The data can be viewed at this point. A quick review shows nothing out of the ordinary. Foreshadowing: there was something wrong. I like to perform this step just to make sure there is nothing obviously wrong with the data - bad encoding, weird line feeds, sudden changes in character sets, and the like. There may be formatting issues (JSON, XML, SGML) or other anomalies. I would rather find them at this step than discover later that the data was wrong from the start.

The E

The next step is to import the data into a new PostgreSQL 18 instance. I connected to the instance, then right-clicked on the tables and selected Import Data.

Then we need to specify the import source as a CSV file. DBeaver can also use a database table, an XML file, or an XLSX file as the source. And yes, you can move from Database X to Database Y too.
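If you prefer the command line to the GUI, the staging table can also be created by hand with SQL. The column names and types below are my assumption based on the commonly circulated Titanic CSV layout; check the header row of your copy of the file before using them.

    -- Staging table for the raw CSV; sized generously since this is an intermediate step.
    CREATE TABLE titanic (
        passengerid integer,
        survived    integer,
        pclass      integer,
        name        text,
        sex         text,
        age         numeric,
        sibsp       integer,
        parch       integer,
        ticket      text,
        fare        numeric,
        cabin       text,
        embarked    text
    );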

Now select Input File.

We browse for the titanic.csv file. At this point we can change the importer settings, but the default values were perfectly fine in this case. Often you have to change the quoting from single to double quotes, or the end-of-line characters. Life is simpler if you select the proper character set so you do not lose data.
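For reference, PostgreSQL's COPY command exposes the same knobs if you ever need to do the load without DBeaver. A sketch using psql's \copy, assuming the staging table above and a UTF-8 file with a header row:

    -- Delimiter, quoting, header handling, and encoding are all explicit options.
    \copy titanic FROM 'titanic.csv' WITH (FORMAT csv, HEADER true, DELIMITER ',', QUOTE '"', ENCODING 'UTF8')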

Select Next and then Preview Data.


We can customize the input settings to change the data types or the column names. Again, the defaults are always a good first choice.

 The preview looks good, at least at this point.


The settings for loading the data into the database can now be set. I went with the defaults, but you may want to bulk load the data, log the queries, or even truncate a table left over from a previous attempt.
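That truncate option is handy after a failed run; it is also a one-liner if you would rather clear the table yourself (table name assumed from the sketch above):

    -- Throw away rows left behind by a failed or partial import attempt.
    TRUNCATE TABLE titanic;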

And we 'Proceed'. At least until we hit an error.


Okay, so it is time to realize we have an issue. It is fairly common when importing data that one (or more than one) record does not comply with what we thought the layout was. We have a name that cannot fit in our varchar(64) specification. At this point, I want ALL the data, so I backed out of the conversion and restarted.

This time I changed the length of the Name column. The heuristics thought that a varchar(64) would suffice, but I changed it to text. PostgreSQL has an amazingly rich set of data types, but your database of choice may not - and you need to check. This is a simple case with one row that does not comply. But often you have a CSV with many rows that are non-compliant, and either the non-compliant rows need to be corrected or the target columns need to be adjusted.
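Starting the import over is not strictly required, either; if the table already exists, widening the column with a quick ALTER and rerunning the load works too (table and column names assumed from the sketch above):

    -- Widen the offending column so every passenger name fits.
    ALTER TABLE titanic ALTER COLUMN name TYPE text;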


Is using TEXT wasteful? Maybe. If I had counted the maximum length of the name column and used that number for a varchar, things would have been better. The trade-off is trying to be perfect on the first conversion versus adjusting as needed. Remember, this is an intermediate stage and we can trim later, if needed.
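If you do want to right-size the column later, the database can tell you exactly how big it needs to be (assumed table and column names again):

    -- Longest name actually present in the loaded data.
    SELECT max(char_length(name)) AS longest_name FROM titanic;

In PostgreSQL specifically, text and varchar are stored the same way under the hood, so the choice is more about enforcing a length constraint than saving disk space; other databases may treat them differently.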


After rerunning the dialogs, we now have all of our data in a table.
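A quick sanity check is worth running at this point (column names assumed from the common Titanic CSV layout; the row count will depend on which version of the file you grabbed):

    -- Confirm the row count and eyeball a few records.
    SELECT count(*) FROM titanic;
    SELECT name, sex, embarked FROM titanic LIMIT 5;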

This is the end of part one. We now have the data from a CSV file loaded into a temporary database. Next comes a serious look at the data to see what we can do to save space, add value to it, or improve its usability.

Do we want to keep 'female' and 'male' or use a binary representation? Would it help to have separate first and last name columns? Do we expand the Embarked data? Do we know what all the columns represent? All that will be in Part 2.


