Transactions and Vacuums or How PostgreSQL Roombas Your Data

February 01, 2025

In the last blog, you saw how PostgreSQL keeps old versions of rows around within a transaction just in case there is a need to unwind or roll back the transaction. So, what happens to those rows that have been replaced after the transaction is complete?

The answer is nothing.

They sit there, taking up disk space.

How Do You Get That Disk Space Back?

PostgreSQL's design keeps the old rows around. In the early days of the PostgreSQL project, the idea was that those old rows could be rolled off to a WORM drive (write-once read many), or you could 'time travel' through the data. Those were grand goals, but sadly, practicality and the cost of disk space set them aside.

Our first example shows the creation of a table and the addition of one row of data.

demo=# create table v (a int);
CREATE TABLE
demo=# insert into v (a) values (1);
INSERT 0 1
demo=#

The first row of data is stored at CTID(0,1)

demo=# select a, ctid from v;
a | ctid
---+-------
1 | (0,1)
(1 row)
demo=#

Now, we will update the row.

demo=# update v set a = a * 2;

UPDATE 1

demo=# select a, ctid from v;

a | ctid

---+-------

2 | (0,2)

(1 row)

demo=#

We can not 'see' (0,1) because we have made an update. But it is still there!

And we can see the number of 'dead' tuples by examining pg_stat_all_tables for the number of 'dead' tuples.

demo=# select relname, n_dead_tup

from pg_stat_all_tables where relname = 'v';

relname | n_dead_tup

---------+------------

v | 1

(1 row)

demo=#

To illustrate this better, we will add a few more rows.

demo=# update v set a = a * 2;

UPDATE 1

demo=# update v set a = a * 2;

UPDATE 1

demo=# update v set a = a * 2;

UPDATE 1

demo=# update v set a = a * 2;

UPDATE 1

demo=# select a, ctid from v;

a | ctid

----+-------

32 | (0,6)

(1 row)

demo=#

The single row in table v has been inserted once and updated 5 times.

demo=# select relname, n_dead_tup, n_live_tup, last_vacuum

from pg_stat_all_tables where relname = 'v';

relname | n_dead_tup | n_live_tup | last_vacuum

---------+------------+------------+-------------

v | 5 | 1 |

(1 row)

demo=#

We can only see the most current row. That is the 'live tuple' above. The older versions of the row are the 'dead' tuples.

Vacuum to the Rescue

To reclaim the space, we use the vacuum command. Note that an automatic vacuum runs automatically on most modern versions of PostgreSQL that will do the cleaning, eventually. Or we can call Vacuum explicitly to clean things up.

demo=# vacuum;

VACUUM

demo=# select relname, n_dead_tup, n_live_tup, last_vacuum

from pg_stat_all_tables where relname = 'v';

relname | n_dead_tup | n_live_tup | last_vacuum

---------+------------+------------+-------------------------------

v | 0 | 1 | 2025-02-01 10:21:01.859963-06

(1 row)

The 'dead' tuple count is set to zero. The disk space taken by those tuples is reclaimed for later use.

Why vacuum?

PostgreSQL’s VACUUM command has to process each table regularly for several reasons:

To recover or reuse disk space occupied by updated or deleted rows.
To update data statistics used by the PostgreSQL query planner.
To update the visibility map, which speeds up index-only scans.
To protect against loss of ancient data due to transaction ID wraparound.

Vacuum maintains a visibility map for each table to track which pages contain only tuples known to be visible to all active transactions. This allows the vacuum to skip these tuples since they are known to be in a known ‘good’ state, speeding up vacuuming. It will enable PostgreSQL to answer some queries using only the index (the corresponding indexes get vacuumed, too) without reference to the underlying table. This provides index-only scans that do not need to dive into the data to answer the query.

Wraparound transaction IDs are something to be worried about. Postgres has a 32-bit transaction ID number that wraps around to zero and starts again. Compare this to MySQL’s running out of AUTO_INCREMENT numbers and refusing to add more rows until corrected. When the wrap happens, old transactions that were in the past now appear to be in the future, and their output becomes invisible. This is a catastrophic data loss! Worse, your data is safe, but you have ZERO access.

To avoid this, it is necessary to vacuum every table in every database at least once every two billion transactions.

Autovacuum

PostgreSQL has an optional but highly recommended feature called autovacuum, which aims to automate the execution of VACUUM and ANALYZE commands. Most recent versions have this turned on, but double-check as detailed below. Autovacuum is highly configurable and keeps you from remembering to run it.

To determine if you have autovacuum and to find out if it is running, ask the server.

demo=# show autovacuum;

autovacuum

------------

(1 row)

demo=#

Questions, comments, and general inquiries

I hope you are enjoying this blog, and I will continue it. But if you have a topic you want to see covered or have a different approach to something I did cover, please leave it in the blog comments.

Comments

igniteFebruary 9, 2025 at 2:05 AM
Hi,
This is really nice content so people understand who all happens in RDBMS world, it's not magic.

I am refreshing your blog weekly to see if there is any news :D (even with the notifications activated)
ReplyDelete
Replies

Add comment

Search This Blog

Dave's PostgreSQL Stuff