Kill Indexes of Nulls

Evgeniy Demin
3 min readJan 2, 2023

--

“Adding an index is a simple solution,” — they say. It is simple, but there is something you should do first. Knowing this could help you maximize the benefit of an index.

The image from Pixabay

With the series of database-related articles, we understand when we can apply a cheaper Hash index instead of a gold-standard B-Tree, getting better performance with a smaller size. You can also use an SQL query to find suitable candidates for Hash indexes in your databases. And last but not least, we know that analyzing the table before populating an index is a must.

In this topic, I want your attention to the data we index. More specifically, on its presence.

Let’s say we have a simple table that stores users and their referrers in our database. Note: we omit database architecture best practices for simplicity. Although, I saw the following design many times in my practice.

CREATE TABLE users (
id bigserial primary key,
full_name varchar not null,
referrer_id bigint
);

Your next task is to show a particular user’s number of referred users. Your query could be something like this:

SELECT COUNT(*) FROM users
WHERE referrer_id = 1111111 -- any referrer user's ID

Sooner or later, you see that the index would improve the performance of the query above, so you decide to add it.

CREATE INDEX users_on_referrer_id_index ON users(referrer_id);

Now the performance is excellent, so the job is done. But let me ask you a question:

Is there anything we should consider and potentially improve?

Yes, there is!

Let’s bring attention to the fact that the referrer_id column in the table can be null according to the schema definition.

Is this a problem? No, but we should know that:

An index allocates the space on the disk for null values as for any other.

Assuming that querying for null values is unnecessary and some tables may have millions or even billions of rows with only a tiny percentage of them having values, the index may take a lot of space on disk size for no reason.

Gladly, in modern databases, such as PostgreSQL, we can improve it by adding a condition to the index, as shown below.

CREATE INDEX users_on_referrer_id_index ON users(referrer_id)
WHERE referrer_id IS NOT NULL;

I prepared an SQL query written in PL/pgSQL that may help you find indexes that store too many null values. I tested it in one of the projects I am working on, and it found 1.1 GB to free from a total of 60 GB.

DO $$
DECLARE
total_count bigint;
column_count bigint;
row record;
BEGIN
DROP TABLE IF EXISTS _null_indexes;

CREATE TEMPORARY TABLE _null_indexes (
table_name text,
index_name text,
column_name text,
space_ratio decimal,
index_size bigint,
potential_index_size bigint,
total_count bigint,
column_count bigint
);

FOR row IN (
SELECT tab.relname as table_name,
cls.relname as index_name,
pa.attname as column_name,
pg_relation_size(cls.relname::text) as size
FROM pg_stat_user_indexes idx
JOIN pg_index pi ON pi.indexrelid = idx.indexrelid
JOIN pg_class cls ON cls.oid = idx.indexrelid
JOIN pg_class tab ON tab.oid = idx.relid
JOIN pg_am am ON am.oid = cls.relam
JOIN pg_attribute pa on cls.oid = pa.attrelid
JOIN pg_type pt on pa.atttypid = pt.oid
WHERE amname = 'btree'
AND indnatts = 1
AND indislive = true
AND indisvalid = true
AND indpred IS NULL
AND typname = 'varchar'
AND indisprimary = false)
LOOP
EXECUTE format('SELECT COUNT(%I) FROM %I', row.column_name, row.table_name) INTO column_count;
EXECUTE format('SELECT COUNT(*) FROM %I', row.table_name) INTO total_count;

IF column_count < total_count THEN
INSERT INTO _null_indexes VALUES (
row.table_name,
row.index_name,
row.column_name,
round(column_count::decimal / (total_count::decimal + 1) * 100, 2),
row.size,
row.size * column_count::decimal / (total_count::decimal + 1),
total_count,
column_count
);
END IF;
END LOOP;
END
$$ LANGUAGE plpgsql;

SELECT index_name, space_ratio, pg_size_pretty(index_size), pg_size_pretty(potential_index_size)
FROM _null_indexes
WHERE space_ratio < 10 OR (space_ratio < 99 AND index_size > 100 * 1024 * 1024)
ORDER BY space_ratio;

I hope you enjoyed the article. Please share your opinion on the topic. I would be happy if we could connect:

--

--

Evgeniy Demin

Ruby & Golang practitioner. Remote expert. Open-source contributor. Beginner blogger. Join my network to grow your expertise.