Kill Indexes of Nulls

3 min readJan 2, 2023

“Adding an index is a simple solution,” — they say. It is simple, but there is something you should do first. Knowing this could help you maximize the benefit of an index.

With the series of database-related articles, we understand when we can apply a cheaper Hash index instead of a gold-standard B-Tree, getting better performance with a smaller size. You can also use an SQL query to find suitable candidates for Hash indexes in your databases. And last but not least, we know that analyzing the table before populating an index is a must.

In this topic, I want your attention to the data we index. More specifically, on its presence.

Let’s say we have a simple table that stores users and their referrers in our database. Note: we omit database architecture best practices for simplicity. Although, I saw the following design many times in my practice.

CREATE TABLE users (
    id bigserial primary key,
    full_name varchar not null,
    referrer_id bigint
);

Your next task is to show a particular user’s number of referred users. Your query could be something like this:

SELECT COUNT(*) FROM users
WHERE referrer_id = 1111111 -- any referrer user's ID

Sooner or later, you see that the index would improve the performance of the query above, so you decide to add it.

CREATE INDEX users_on_referrer_id_index ON users(referrer_id);

Now the performance is excellent, so the job is done. But let me ask you a question:

Is there anything we should consider and potentially improve?

Yes, there is!

Let’s bring attention to the fact that the referrer_id column in the table can be null according to the schema definition.

Is this a problem? No, but we should know that:

An index allocates the space on the disk for null values as for any other.

Assuming that querying for null values is unnecessary and some tables may have millions or even billions of rows with only a tiny percentage of them having values, the index may take a lot of space on disk size for no reason.

Gladly, in modern databases, such as PostgreSQL, we can improve it by adding a condition to the index, as shown below.

CREATE INDEX users_on_referrer_id_index ON users(referrer_id)
WHERE referrer_id IS NOT NULL;

I prepared an SQL query written in PL/pgSQL that may help you find indexes that store too many null values. I tested it in one of the projects I am working on, and it found 1.1 GB to free from a total of 60 GB.

DO $$
    DECLARE
        total_count bigint;
        column_count bigint;
        row record;
    BEGIN
        DROP TABLE IF EXISTS _null_indexes;

        CREATE TEMPORARY TABLE _null_indexes (
            table_name text,
            index_name text,
            column_name text,
            space_ratio decimal,
            index_size bigint,
            potential_index_size bigint,
            total_count bigint,
            column_count bigint
        );

        FOR row IN (
            SELECT tab.relname                         as table_name,
                   cls.relname                         as index_name,
                   pa.attname                          as column_name,
                   pg_relation_size(cls.relname::text) as size
            FROM pg_stat_user_indexes idx
                JOIN pg_index pi ON pi.indexrelid = idx.indexrelid
                JOIN pg_class cls ON cls.oid = idx.indexrelid
                JOIN pg_class tab ON tab.oid = idx.relid
                JOIN pg_am am ON am.oid = cls.relam
                JOIN pg_attribute pa on cls.oid = pa.attrelid
                JOIN pg_type pt on pa.atttypid = pt.oid
            WHERE amname = 'btree'
                AND indnatts = 1
                AND indislive = true
                AND indisvalid = true
                AND indpred IS NULL
                AND typname = 'varchar'
                AND indisprimary = false)
            LOOP
                EXECUTE format('SELECT COUNT(%I) FROM %I', row.column_name, row.table_name) INTO column_count;
                EXECUTE format('SELECT COUNT(*) FROM %I', row.table_name) INTO total_count;

                IF column_count < total_count THEN
                    INSERT INTO _null_indexes VALUES (
                        row.table_name,
                        row.index_name,
                        row.column_name,
                        round(column_count::decimal / (total_count::decimal + 1) * 100, 2),
                        row.size,
                        row.size * column_count::decimal / (total_count::decimal + 1),
                        total_count,
                        column_count
                    );
                END IF;
            END LOOP;
    END
$$ LANGUAGE plpgsql;

SELECT index_name, space_ratio, pg_size_pretty(index_size), pg_size_pretty(potential_index_size)
FROM _null_indexes
WHERE space_ratio < 10 OR (space_ratio < 99 AND index_size > 100 * 1024 * 1024)
ORDER BY space_ratio;

I hope you enjoyed the article. Please share your opinion on the topic. I would be happy if we could connect:

Kill Indexes of Nulls

Written by Evgeniy Demin

No responses yet