In the previous articles of this database series, we learned when a cheaper Hash index can replace the gold-standard B-Tree, giving better performance at a smaller size. You can also use an SQL query to find suitable candidates for Hash indexes in your databases. And last but not least, we know that analyzing a table before populating an index is a must.
In this article, I want to draw your attention to the data we index. More specifically, to its presence.
Let’s say we have a simple table that stores users and their referrers in our database. Note: we omit database design best practices for simplicity, although I have seen the following schema many times in practice.
CREATE TABLE users (
  id bigserial primary key,
  full_name varchar not null,
  referrer_id bigint references users (id) -- nullable by design
);
Your next task is to show the number of users referred by a particular user. Your query could be something like this:
SELECT COUNT(*) FROM users
WHERE referrer_id = 1111111; -- any referrer user's ID
Sooner or later, you notice that an index would improve the performance of the query above, so you decide to add one.
CREATE INDEX users_on_referrer_id_index ON users(referrer_id);
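The effect is easy to sandbox. As a rough sketch — using SQLite through Python's built-in sqlite3 module instead of PostgreSQL, purely because it runs anywhere, and with made-up table contents — the equality query switches from a full table scan to an index search once the index exists:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE users (id INTEGER PRIMARY KEY, full_name TEXT NOT NULL, referrer_id INTEGER)"
)
# Made-up data: every hundredth user was referred by user 7.
con.executemany(
    "INSERT INTO users (full_name, referrer_id) VALUES (?, ?)",
    [(f"user {i}", 7 if i % 100 == 0 else None) for i in range(10_000)],
)

query = "EXPLAIN QUERY PLAN SELECT COUNT(*) FROM users WHERE referrer_id = 7"

# Without an index, the planner has to scan the whole table.
before = con.execute(query).fetchall()

con.execute("CREATE INDEX users_on_referrer_id_index ON users (referrer_id)")

# With the index, the same query becomes an index search.
after = con.execute(query).fetchall()
print(before, after, sep="\n")
```

The plan text differs from PostgreSQL's EXPLAIN, but the shift from a scan to an index lookup is the same idea.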
Now the performance is excellent, so the job is done. But let me ask you a question:
Is there anything we should consider and potentially improve?
Yes, there is!
Let’s bring attention to the fact that the referrer_id column in the table can be null according to the schema definition.
Is this a problem? No, but we should know that:
An index allocates disk space for null values just as it does for any other value.
Given that querying for null values is often unnecessary, and that some tables may have millions or even billions of rows with only a tiny percentage of them having values, the index may take up a lot of disk space for no reason.
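To gauge how much of such an index is dead weight, check what share of rows actually carries a value. A minimal sketch with invented numbers (again SQLite via Python's sqlite3 for portability; in PostgreSQL the same COUNT-based query works, and the pg_stats view exposes a ready-made null_frac per column):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, referrer_id INTEGER)")
# Hypothetical distribution: only 1% of users were referred by someone.
con.executemany(
    "INSERT INTO users (referrer_id) VALUES (?)",
    [(42 if i % 100 == 0 else None,) for i in range(10_000)],
)

# COUNT(column) counts only non-null values; COUNT(*) counts all rows.
total, with_value = con.execute(
    "SELECT COUNT(*), COUNT(referrer_id) FROM users"
).fetchone()
null_share = 100.0 * (total - with_value) / total
print(f"{null_share:.0f}% of indexed rows are NULL")  # → 99% here
```

With numbers like these, almost the entire index is spent on rows nobody queries.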
Fortunately, modern databases such as PostgreSQL let us improve this by adding a condition to the index, as shown below.
CREATE INDEX users_on_referrer_id_index ON users(referrer_id)
WHERE referrer_id IS NOT NULL;
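A query that filters on a concrete value can still use the partial index, because referrer_id = 1111111 implies referrer_id IS NOT NULL. A sketch of this, once more with SQLite through Python's sqlite3 (SQLite supports partial indexes too), checking the query plan:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, referrer_id INTEGER)")
# Invented data: most rows have no referrer.
con.executemany(
    "INSERT INTO users (referrer_id) VALUES (?)",
    [(5 if i % 50 == 0 else None,) for i in range(5_000)],
)
con.execute(
    "CREATE INDEX users_on_referrer_id_index ON users (referrer_id) "
    "WHERE referrer_id IS NOT NULL"
)

# The planner proves the index predicate from the query's WHERE clause,
# so the partial index is still usable for an equality lookup.
plan = con.execute(
    "EXPLAIN QUERY PLAN SELECT COUNT(*) FROM users WHERE referrer_id = 5"
).fetchall()
print(plan)
```

Queries that explicitly look for referrer_id IS NULL, on the other hand, can no longer use this index — the assumption is that we never run them.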
I prepared a script written in PL/pgSQL that may help you find indexes that store too many null values. I tested it on one of the projects I am working on, and it found 1.1 GB that could be freed out of a total of 60 GB.
DROP TABLE IF EXISTS _null_indexes;
CREATE TEMPORARY TABLE _null_indexes (
  table_name varchar,
  index_name varchar,
  column_name varchar,
  space_ratio decimal,
  index_size bigint,
  potential_index_size bigint
);

DO $$
DECLARE
  row record;
  column_count bigint;
  total_count bigint;
BEGIN
  -- Single-column, live, valid, non-partial B-Tree indexes on varchar columns.
  FOR row IN (
    SELECT tab.relname as table_name,
           cls.relname as index_name,
           pa.attname as column_name,
           pg_relation_size(cls.relname::text) as size
    FROM pg_stat_user_indexes idx
    JOIN pg_index pi ON pi.indexrelid = idx.indexrelid
    JOIN pg_class cls ON cls.oid = idx.indexrelid
    JOIN pg_class tab ON tab.oid = idx.relid
    JOIN pg_am am ON am.oid = cls.relam
    JOIN pg_attribute pa ON cls.oid = pa.attrelid
    JOIN pg_type pt ON pa.atttypid = pt.oid
    WHERE amname = 'btree'
      AND indnatts = 1
      AND indislive = true
      AND indisvalid = true
      AND indpred IS NULL
      AND typname = 'varchar'
      AND indisprimary = false)
  LOOP
    -- COUNT(column) counts non-null values; COUNT(*) counts all rows.
    EXECUTE format('SELECT COUNT(%I) FROM %I', row.column_name, row.table_name) INTO column_count;
    EXECUTE format('SELECT COUNT(*) FROM %I', row.table_name) INTO total_count;

    IF column_count < total_count THEN
      INSERT INTO _null_indexes VALUES (
        row.table_name,
        row.index_name,
        row.column_name,
        round(column_count::decimal / (total_count::decimal + 1) * 100, 2),
        row.size,
        row.size * column_count::decimal / (total_count::decimal + 1)
      );
    END IF;
  END LOOP;
END;
$$ LANGUAGE plpgsql;

SELECT index_name, space_ratio, pg_size_pretty(index_size), pg_size_pretty(potential_index_size)
FROM _null_indexes
WHERE space_ratio < 10 OR (space_ratio < 99 AND index_size > 100 * 1024 * 1024)
ORDER BY space_ratio;