Loading The Titanic Passenger Data Into PostgreSQL With DBeaver Part 2
In the last edition of this blog, the passenger list data from the HMS Titanic was loaded into a preliminary database. Now it is time to refine.
I am using DBeaver Enterprise 25.2.0. PostgreSQL 18, and Github Copilot with Gpt-4.
Prompt: Any recommendations on improving this table for storage efficiency? This prompt was entered into the DBeaver AI Assistant.
To improve the storage efficiency of the "titanic" table, we can consider the following recommendations:
### 1. **Optimize Data Types**
- **"PassengerId"**: If the number of passengers is not expected to exceed 2,147,483,647, the `int4` type is fine. However, if the range is smaller (e.g., less than 32,767), you could use `int2` (smallint) to save space.
- **"Survived"**: Since this is likely a binary value (0 or 1), you can use `boolean` instead of `int4`.
- **"Pclass"**: If the class is limited to 1, 2, or 3, you can use `int2` (smallint) instead of `int4`.
- **"Sex"**: If this column only contains "male" and "female", consider using an `enum` type or a `char(1)` ('M' or 'F') to save space.
- **"Age"**: If precision beyond one decimal place is not required, consider reducing the precision or using `numeric` with a defined scale.
- **"SibSp" and "Parch"**: If these values are small integers (e.g., number of siblings/spouses or parents/children), you can use `int2` (smallint).
- **"Ticket"**: If ticket numbers are of fixed length, consider using `char(n)` instead of `varchar`.
- **"Fare"**: If precision beyond two decimal places is not required, consider using `numeric` with a defined scale (e.g., `numeric(10,2)`).
- **"Cabin"**: If cabin identifiers are of fixed length, consider using `char(n)` instead of `varchar`.
- **"Embarked"**: If this column only contains a small set of values (e.g., 'C', 'Q', 'S'), consider using an `enum` type or `char(1)`.
Comment: This is what I was looking for. Saving space is always good and these are solid recommendations.
### 2. **Normalize the Table**
- If columns like "Sex", "Embarked", or "Pclass" have repetitive values, consider normalizing them into separate lookup tables. For example:
- Create a "sex" table with "id" and "value" columns.
- Create an "embarked" table with "id" and "value" columns.
- Replace the original columns with foreign keys referencing these tables.
Comment: Interesting idea, but will the overhead of an extra lookup to the other tables offset any gains? This suggestion is not getting a 'warm fuzzy' from me.
### 3. **Remove Redundant Columns**
- If any columns are not used frequently or can be derived from other columns, consider removing them. For example, if "Pclass" can be inferred from another column, it may not need to be stored.
Comment: This is intriguing. Dropping redundant columns would help, but how do I make the inference to determine if and how Pclass status?
### 4. **Indexing**
- Add indexes to columns that are frequently queried, such as "PassengerId", "Survived", or "Embarked". However, keep in mind that indexes increase storage usage, so only index columns that are necessary for performance.
Comment: This is a pretty generic suggestion.
### 5. **Partitioning**
- If the table grows significantly, consider partitioning it by a column like "Pclass" or "Embarked" to improve query performance and storage management.
Comment: Hopefully, this dataset is closed. :-) If this was a growing set, or larger, or needed faster lookups, this would be a good consideration.
### 6. **Compression**
- Use PostgreSQL's `TOAST` mechanism for compressing large text fields like "Name", "Ticket", and "Cabin". This is automatically handled by PostgreSQL for large text/varchar fields, but you can ensure it is enabled.
Comment: This is a reach. Nothing in a row of this data will trigger TOAST.
### Example of an Optimized Table Definition
Here’s an example of how the table could look after applying some of these recommendations:
CREATE TABLE titanic (
"PassengerId" int4 PRIMARY KEY,
"Survived" boolean,
"Pclass" int2,
"Name" text,
"Sex" char(1),
"Age" numeric(4,1),
"SibSp" int2,
"Parch" int2,
"Ticket" varchar(20),
"Fare" numeric(10,2),
"Cabin" varchar(10),
"Embarked" char(1)
);
Comment: At first glance, this looks good. But do I want Age as a Numeric
field instead of an Int?
### Additional Notes
- If you normalize the table, you will need to create additional tables for "Sex" and "Embarked" and replace their columns with foreign keys.
- Always test the impact of these changes on your queries and storage before applying them in production.
Conclusion
Github Copilot did what a DBA would do when looking to optimize this data. It looked for columns where data could be saved. It considered partitioning and compression.
AI is a pattern-matching algorithm, and it is unsettling to realize your career was a pattern.
Next time, we will look at making the changes and see how it looks after the iteration.
Comments
Post a Comment