Hash index: to be, or not to be?

Evgeniy Demin
3 min read · Dec 24, 2022

This single SQL query can help you find B-Tree indexes that could become Hash indexes, saving you money on storage bills!


We recently compared Hash and B-Tree indexes and learned a lesson about proper benchmarking. In this article, I want to help you find suitable candidates for Hash indexes in your own projects.

I have previously shared my story of why an opportunity like this could lead to a promotion or a pay rise, so give it a try.

Without further ado, let’s look at the query. I’ve added comments describing what each part does.

DO $$
DECLARE
  total_count bigint;
  unique_count bigint;
  rec record;
BEGIN
  -- stores temporary results for the current session
  CREATE TEMPORARY TABLE hash_index_candidates (
    table_name text,
    index_name text, -- B-Tree index to migrate
    column_name text,
    unique_ratio decimal,
    btree_index_size text,
    potential_hash_index_size text, -- assumption-based value
    count bigint
  );
  FOR rec IN (
    SELECT
      t.table_name AS table_name,
      t.index_name AS index_name,
      t.column_name AS column_name,
      pg_size_pretty(t.size) AS index_size
    FROM (
      SELECT
        tab.relname AS table_name,
        cls.relname AS index_name,
        pa.attname AS column_name,
        pg_relation_size(cls.oid) AS size -- by oid: safe for any schema
      FROM pg_stat_user_indexes idx
      JOIN pg_index pi ON pi.indexrelid = idx.indexrelid
      JOIN pg_class cls ON cls.oid = idx.indexrelid
      JOIN pg_class tab ON tab.oid = idx.relid
      JOIN pg_am am ON am.oid = cls.relam
      JOIN pg_attribute pa ON cls.oid = pa.attrelid
      JOIN pg_type pt ON pa.atttypid = pt.oid
      WHERE am.amname = 'btree' -- B-Tree type
        AND pi.indnatts = 1 -- index covers a single column
        AND pi.indislive = true -- check index "health"
        AND pi.indisvalid = true -- check index "health"
        AND pi.indpred IS NULL -- index has no conditions
        AND pt.typname IN ('varchar', 'text', 'char')
        AND pi.indisprimary = false -- index is not primary
        AND pt.typlen = -1 -- type is variable-length
        AND pi.indisunique = false -- index is not unique
    ) t
    WHERE t.size > 104857600 -- index size is more than 100MB
  ) LOOP
    EXECUTE format('SELECT COUNT(%I) FROM %I', rec.column_name, rec.table_name) INTO total_count;
    EXECUTE format('SELECT COUNT(DISTINCT %I) FROM %I', rec.column_name, rec.table_name) INTO unique_count;
    INSERT INTO hash_index_candidates
    VALUES (
      rec.table_name,
      rec.index_name,
      rec.column_name,
      round(unique_count :: decimal / (total_count :: decimal + 1) * 100, 2), -- +1 avoids division by zero
      rec.index_size,
      pg_size_pretty(total_count / 300 * 8192), -- assumes 300 tuples per 8KB hash bucket page
      total_count
    );
  END LOOP;
END
$$ LANGUAGE plpgsql;

SELECT * FROM hash_index_candidates WHERE unique_ratio > 95; -- fetch results

In one of my projects, I found several indexes that would free about 1.2GB if I migrated them to Hash indexes.

But wait a second before doing so. First, we should check that each candidate satisfies the rules of thumb we came up with in the previous article:

  • Values have high cardinality. Ideally, their hash codes have high cardinality as well;
  • Values are queried only with equality operators;
  • Values should be at least 25 characters long.
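For rule 1, it can be worth comparing the cardinality of the raw values with that of their hash codes. A minimal sketch, assuming a hypothetical users table with an indexed email column; hashtext() is PostgreSQL’s built-in 32-bit hash function for text, used here purely for illustration:

```sql
-- Hypothetical table and column names.
SELECT count(DISTINCT email)           AS value_cardinality,
       count(DISTINCT hashtext(email)) AS hash_cardinality
FROM users;
```

If hash_cardinality is noticeably lower than value_cardinality, many values collide into the same hash code, and a Hash index will be less effective.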

The most important part is how the index is used. If you’re confident that the indexed column is queried only with equality operators, then go ahead and migrate.
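A migration might look like the following sketch (the table, column, and index names are hypothetical). Building the Hash index concurrently first avoids locking writes, and the B-Tree index is dropped only once its replacement is in place:

```sql
-- Hypothetical names: users(email) with B-Tree index users_email_idx.
CREATE INDEX CONCURRENTLY users_email_hash_idx ON users USING hash (email);
DROP INDEX CONCURRENTLY users_email_idx;
```

Note that CONCURRENTLY cannot run inside a transaction block, and a failed concurrent build leaves behind an invalid index that you must drop manually.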

Ideally, we could automate this usage check with PostgreSQL’s stored statistics, but that would require a specific database configuration you might not have yet. So I provided something that should work for most of us.

As a side note, rule 3 can be skipped if the resulting Hash index is smaller than the B-Tree one. Rule 1 is covered by the SQL query itself.
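For the curious, the potential_hash_index_size column in the query is simple arithmetic: the row count divided by an assumed 300 tuples per bucket, multiplied by the 8192-byte page size. For example, for a hypothetical 30-million-row column:

```sql
SELECT pg_size_pretty(30000000::bigint / 300 * 8192); -- 100,000 bucket pages ≈ 781 MB
```

This is only a rough estimate; the real size depends on fill factor, overflow pages, and collisions.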

That’s it!

Thank you for your time, and please let me know if it helped you.

