PostgreSQL indexes: Hash vs B-tree

Evgeniy Demin
6 min read · Dec 14, 2022


Do you always know when to use a hash index instead of a b-tree index? How significant would the benefit of that choice be? I don’t.

So I did a little research to come up with a rule of thumb, and in this article, I will share the results.

UPD: The article was updated.

Unfortunately, in the first version of this post, I made a small mistake in the benchmark that was hard to catch. You can read more about it in this article.

Spoiler: hash index is even sweeter now.

I will skip the basics of b-tree and hash indexes, as there are plenty of resources you can read. However, let’s stop for a second on the hash index description from PostgreSQL’s official documentation:

Hash indexes store a 32-bit hash code derived from the value of the indexed column. Hence, such indexes can only handle simple equality comparisons. The query planner will consider using a hash index whenever an indexed column is involved in a comparison using the equal operator.

Unlike the b-tree index, which can store many values without reducing the expected performance, the hash index is limited to 2³² possible hash codes (different values may share the same hash code). Therefore, a growing number of duplicates (in terms of hash codes) negatively affects the index’s performance.

Values have high cardinality. Ideally, their hash codes have high cardinality as well.
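For a quick illustration, PostgreSQL’s internal hashtext() function (undocumented and subject to change, so treat it as a demonstration only) lets you peek at the kind of 32-bit codes a hash index on a text column is built from:

-- two different strings map to two 32-bit codes;
-- collisions are possible, since there are only 2^32 of them
SELECT hashtext('hello'), hashtext('world');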

One of the reasons the b-tree index is the standard choice is its flexibility: it supports all comparison operators. The hash index, on the other hand, supports only equality.

Values are queried only with equality operators.
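Here is a minimal sketch of the practical consequence (the table and column names are made up for the example):

-- a hash index can serve equality lookups only
CREATE TABLE accounts (token varchar);
CREATE INDEX accounts_token_hash ON accounts USING hash (token);

-- the planner may use the hash index for an equality predicate:
SELECT * FROM accounts WHERE token = 'abc123';

-- but never for range or pattern predicates; those need a b-tree:
SELECT * FROM accounts WHERE token > 'abc';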

Now, let’s get to the benchmarks and compare memory consumption and performance.

Before we look into the results, I would like to share the PL/pgSQL functions I used to gather the metrics.

The random_string function below generates a random string of a given length.

-- returns a randomized string of the given length
CREATE OR REPLACE FUNCTION random_string(length integer)
RETURNS text AS
$$
DECLARE
  -- 62 candidate characters: digits, uppercase, lowercase
  chars text[] := '{0,1,2,3,4,5,6,7,8,9,A,B,C,D,E,F,G,H,I,J,K,L,M,N,O,P,Q,R,S,T,U,V,W,X,Y,Z,a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p,q,r,s,t,u,v,w,x,y,z}';
  result text := '';
  i integer := 0;
BEGIN
  FOR i IN 1..length LOOP
    -- pick a uniformly random index in 1..62
    result := result || chars[1 + floor(62 * random())::int];
  END LOOP;
  RETURN result;
END
$$ LANGUAGE plpgsql;
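For example, calling it with a length of 10 produces something like the following (the output is random, so yours will differ):

SELECT random_string(10);
-- => 'x7Bq2LkPfa'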

Let’s define the benchmark function we will use to measure query performance against each index.

-- executes a given parameterized query once per string in strings;
-- returns the average time of a single execution, in milliseconds
CREATE OR REPLACE FUNCTION benchmark(query text, strings varchar[])
RETURNS numeric AS
$$
DECLARE
  _start_ts timestamptz;
  _end_ts timestamptz;
  string varchar;
BEGIN
  _start_ts := clock_timestamp();
  FOREACH string IN ARRAY strings LOOP
    -- the wrapping format() call was redundant (and would choke on
    -- literal % signs), so the query is executed directly
    EXECUTE query USING string;
  END LOOP;
  _end_ts := clock_timestamp();

  RETURN 1000 * (extract(epoch FROM _end_ts - _start_ts)) /
         array_length(strings, 1);
END
$$ LANGUAGE plpgsql;
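As a quick sanity check, you can point it at any parameterized query; the query below is just a placeholder:

-- average per-execution time, in milliseconds, across the three inputs
SELECT benchmark('SELECT length($1)', ARRAY['foo', 'bar', 'baz']);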

And last but not least, the test function below gathers all the metrics of interest.

CREATE OR REPLACE FUNCTION test(length integer, count numeric)
RETURNS TABLE (
  sample_length integer,
  unique_ratio decimal,       -- in percentage
  hash_index_size bigint,     -- in kilobytes
  btree_index_size bigint,    -- in kilobytes
  column_size bigint,         -- table size, in kilobytes
  hash_select_query decimal,  -- in milliseconds
  btree_select_query decimal, -- in milliseconds
  hash_insert_query decimal,  -- in milliseconds
  btree_insert_query decimal  -- in milliseconds
) AS
$$
DECLARE
  strings varchar[];
BEGIN
  CREATE TABLE IF NOT EXISTS hash_table(example varchar);
  CREATE TABLE IF NOT EXISTS btree_table(example varchar);

  INSERT INTO hash_table (SELECT random_string(length) FROM generate_series(1, count));
  INSERT INTO btree_table (SELECT example FROM hash_table);

  ANALYZE hash_table; -- this is critical for the hash index
  ANALYZE btree_table;

  CREATE INDEX IF NOT EXISTS hash_index ON hash_table USING hash(example);
  CREATE INDEX IF NOT EXISTS btree_index ON btree_table USING btree(example);

  ANALYZE hash_table;
  ANALYZE btree_table;

  -- 100 fresh random strings to use as query parameters
  SELECT array_agg(random_string(length)) INTO strings FROM generate_series(1, 100);

  RETURN QUERY
  SELECT (SELECT length(example) FROM hash_table LIMIT 1),
         round(count(DISTINCT example)::decimal / count(*) * 100, 2) AS unique_ratio,
         pg_relation_size('hash_index') / 1024 AS hash_index_size,
         pg_relation_size('btree_index') / 1024 AS btree_index_size,
         pg_table_size('hash_table') / 1024 AS column_size,
         benchmark('SELECT example FROM hash_table WHERE example = $1', strings) AS hash_select_query,
         benchmark('SELECT example FROM btree_table WHERE example = $1', strings) AS btree_select_query,
         benchmark('INSERT INTO hash_table VALUES($1)', strings) AS hash_insert_query,
         benchmark('INSERT INTO btree_table VALUES($1)', strings) AS btree_insert_query
  FROM hash_table;

  DROP TABLE IF EXISTS hash_table;
  DROP TABLE IF EXISTS btree_table;
END
$$ LANGUAGE plpgsql;

Now that we have all of them, we can run the following query to retrieve results for tables with 1000 rows at each given string length.

SELECT (test(length, 1000)).*
FROM (VALUES (3), (5), (7), (10), (25), (100), (255), (355), (512), (755), (835), (1024)) s(length);

Note: we wrap test(length, count) in parentheses to nicely unpack the resulting record into separate columns.
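The charts below were produced by repeating the same call with larger row counts. For example, for a million rows (which takes considerably longer to run):

SELECT (test(length, 1000000)).*
FROM (VALUES (3), (25), (100), (512), (1024)) s(length);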

The results were gathered using PostgreSQL 15.1.

Let’s dive into the memory consumption comparison. Below is a series of charts for 1,000, 10,000, 100,000, and 1,000,000 rows.

[Charts: index and table size vs. string length, for 1,000, 10,000, 100,000, and 1,000,000 rows]

Let’s highlight a few things we see on the charts:

  • Hash indexes are data agnostic: their size depends only on the number of indexed rows, not on the size of the values;
  • The more rows you have, the shorter the strings need to be for the hash index to pay off;
  • In most cases, the hash index consumes much less memory than the stored field itself, while the b-tree index requires more space than the stored field;
  • The length at which the benefit starts lies somewhere between 15 and 30 characters; to keep the rule of thumb simple, let’s say at least 25.

Stored values should be at least 25 characters long.
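If you want to check where your own data falls, the same size helpers used in test() work on any relation; substitute your own table and index names:

-- human-readable sizes for an index and its table
SELECT pg_size_pretty(pg_relation_size('btree_index')) AS index_size,
       pg_size_pretty(pg_table_size('btree_table')) AS table_size;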

Now let’s jump to the performance comparison.

[Charts: average query time vs. string length, for 1,000, 10,000, 100,000, and 1,000,000 rows]

Based on the charts for the performance comparison, we may say:

  • in all cases, the performance is better with the hash index;
  • select queries have a 20–60% boost, increasing with the number of rows;
  • insert queries have a 10–80% boost, increasing with the number of rows;
  • both indexes are fast in absolute terms, so some may consider a gain of about 0.01 milliseconds negligible (you can verify single-query timings yourself, as sketched below).
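To sanity-check a single query yourself, EXPLAIN ANALYZE is the usual tool (shown here against the hash_table created by test(), but any table works):

-- reports the actual execution time and buffer usage of one lookup
EXPLAIN (ANALYZE, BUFFERS)
SELECT example FROM hash_table WHERE example = 'some-value';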

For a million-row table of 1024-character strings, switching to the hash index would:

  • save about 1.5GB of disk space (the hash index is roughly 50 times smaller than the b-tree index);
  • give about a 60–80% performance boost (0.02–0.04ms per query).

The size reduction is enormous! This can save a lot of money on infrastructure costs!

Let’s wrap up with the rule of thumb for choosing a hash index:

  • Values have high cardinality. Ideally, their hash codes have high cardinality as well;
  • Values are queried only with equality operators;
  • Values should be at least 25 characters long.

Follow me and be notified about upcoming articles.
