Hey! I recorded a video course!

My first video course is out and it's called: Build an MVP with Elixir
If you like this article, you will also like the course! Check it out here!

Hey folks, before we begin, a public service announcement:

Respect people who wear glasses.

They paid money to see you.

Okay, that was enough fun for today. Let’s dive into today’s topic: How can you search through millions of usernames using Ecto and Postgres? Let’s find out!

Quick note: This article does not cover full-text search (FTS) with Postgres. It focuses only on searching for short strings like names, movie titles, article names, etc. I will cover FTS in another article in the future.

Imagine that you have a database with a million users. You need to allow admins to search through the users by name. You need to get this out the door quickly, so instead of building a cool, custom search engine, you want to use what’s already at hand. Luckily, Postgres got your back.

In short, there are two good candidates of query methods you can use: (i)like and similarity. (i)like is better if you want to have exact matches with your query. For example, the search term Ber would return Bert or Berry, but not Bart. similarity is a more fuzzy approach and would also return Bart, since it’s not the same, but very similar to Ber.

We will test our name search against an example set of 18.000 first names. The non-trivial size of the set allows us to test the speedup of using indexes. Let’s have a look at how to implement each of them.

🔗 (i)LIKE

Probably the most common search method is the LIKE operator. It checks whether a string contains a search term. So, Bert matches Ber, but Bart does not. You can specify where in the string the search term may exist using the wildcard % sign. For example, a search term of Ber% means that a string must begin with Ber whereas %Ber means the string must end with Ber. If your search term may exist anywhere in the string, you can use two % signs like this: %Ber%. This would match Bert, but also DagoBert.

In our queries, we usually don’t use LIKE, but its sibling ILIKE, which ignores cases in our strings and search terms. So, whereas Ber% LIKE 'bert' doesn’t match, Ber% ILIKE 'bert' does. Now, let’s have a look at how to implement an efficient ILIKE search in Ecto.

🔗 The Migration

First, we build a migration that creates a simple persons table with only one field: a name. We also create an index for that field, which we’ll discuss later. This is how it looks:

defmodule Demo.Repo.Migrations.CreatePersons do
  use Ecto.Migration

  def up do
    create table(:persons) do
      add :name, :string
    end

    execute "CREATE EXTENSION IF NOT EXISTS pg_trgm;"
    
    execute """
      CREATE INDEX persons_name_gin_trgm_idx 
        ON persons 
        USING gin (name gin_trgm_ops);
    """
  end

  def down do
    drop table(:persons)
  end
end

First off, since we use the execute/1 function with custom SQL code, we need to split the migration into up/0 and down/0 since otherwise, Ecto doesn’t know how to roll back our migration.

The two execute/1 statements are the reason our ILIKE search is efficient. The first statement activates the pg_trgm extension, which allows us to search through a GIN index of our names instead of having to scan through every row 1-by-1. The second execute/1 statement creates the GIN index for the name column using the gin_trgm_ops.

Watch out: Don’t create the index with the Ecto variant of it, which is:

# Don't use this
create(index(:persons, [:name], using: "GIN"))

# Because it translates to this:
CREATE INDEX persons_name_index ON persons USING GIN (name varchar_ops);

The Ecto variant will create a GIN index, but by using the varchar_ops and not the gin_trgm_ops. This won’t speed up our text search. So, better use the custom SQL code in the second execute/1 statement to create the GIN index.

Let’s have a look at how to use the index next.

🔗 The Query

Thankfully, Ecto already offers the ilike function, which we can use for our query. This is what it looks like:

defp search(search_term) do
  search_term = "%#{search_term}%"

  query =
    from(
      p in Person,
      where: ilike(p.name, ^search_term)
    )

  Repo.all(query)
end

This will return all names that contain the search term anywhere in their string. If you analyze this query in a database explorer, you will see that it uses our GIN index. We ran this query without the index and it was around 30 times slower (~700ms vs 21s). So, before you push this query to production, please check in your database explorer whether your index is used. You can do that by running the following command:

EXPLAIN ANALYZE SELECT name FROM persons WHERE name ILIKE '%bert%';

It should return an analysis that starts with Bitmap Heap Scan. That means that your index is used. If it returns an analysis beginning with Seq Scan, it doesn’t use your index and scans every row 1-by-1, which you don’t want.

Now, the ILIKE query returns all names that match a certain search term, but it won’t return names that don’t match the search term but are very similar to it. For example, if you search for Ana you might also want to see Anna and Hannah in your results. So, names that don’t match exactly, but are very similar. We can use Postgres’ SIMILARITY operator for that. Let’s check out how.

🔗 Similarity

similarity is one of these functions that appear to be very easy to use, but it’s very easy to get it wrong. So, before you deploy your query to production, please analyze the query using EXPLAIN ANALYZE first. We will get into the details of what can go wrong later on. Let’s first have a look at how to use this operator.

You can re-use the migration for ILIKE also for similiarity. It uses the same GIN index. Run your your migration and then head over to your context to add the following query:

def similarity(name) do
  query =
    from(
      p in Person,
      where: fragment("? % ?", ^name, p.name),
      order_by: {:desc, fragment("? % ?", ^name, p.name)}
    )

  Repo.all(query)
end

This query uses the short-hand version of similarity, the % operator, to compare the search term with the names in the database. It orders the results by their similarity to the search term so that the most similar results are on top. There are a few small, but important details about this query:

  1. You must use the short-hand % operator because the query won’t use the GIN index otherwise. Don’t use similarity(name, ?) in your query, but the short-hand % operator.

  2. If you use the <% or <<% operators, you must put the search term first. So, WHERE bert <% name. If you turn them around, Postgres won’t use your GIN index.

  3. When you use the short-hand % operator, you can’t set the threshold for how similar names must be to be included in the results. If you use the similarity function, you can set it with: similarity(name, ?) > 0.3. If you use the short-hand operator %, the global pg_trgm.similarity_threshold (defaults to 0.3) is used instead. You can configure the global threshold for one session or your entire database using:

SET pg_trgm.similarity_threshold = 0.3;

You can configure the similarity threshold of your Ecto.Repo connection using a Connection prefix (Thanks to Michał for the tip ❤️). You can set it in your configuration like this:

# config.exs

query_args = ["SET pg_trgm.similarity_threshold = 0.4", []]

config :my_app, MyApp.Repo,
  after_connect: {Postgrex, :query!, query_args}

This configuration will set your custom similarity_threshold after connecting to the database.

If the global threshold of 0.3 is a problem for you and you can sacrifice the GIN index, the following variation will help. Be warned that it doesn’t use the GIN index and scans your entire table. Sadly, there’s no free lunch.

def similarity(name, limit \\ 25) do
  query =
    from(
      p in Person,
      order_by: {:desc, fragment("? % ?", ^name, p.name)},
      limit: ^limit
    )

  Repo.all(query)
end

This query has no hard threshold for similarity, but orders the names by their similarity and only returns the top - in this case - 25 names. It isn’t perfect, but it works without a hard threshold.

🔗 Variations of similarity

If you implement a name search, it is important to know the two siblings of the similarity operator, the word_similarity and strict_word_similarity.

One problem with the similarity operator is that it compares a search term against the entire string of a name, ignoring any spaces and therefore, word boundaries. For example, let’s say you want to see the profile of a user with a typical Portuguese name, like João Miguel Mello Fernandes Pereira dos Reis. You probably don’t remember the full name of the person, but only their first name João.

Now, if you search for João using the similarity operator, it will rank the user lower since João shares very few letters with João Miguel Mello Fernandes Pereira dos Reis. A user called João Sausa will have a much higher ranking simply because his name is shorter. The problem is that similarity depends on the length of a name. The Postgres team recognized this problem and created the word_similarity and strict_word_similarity alternatives.

Both alternatives put more importance on whether a search term is similar to any word in the name. Since the search term, João, matches perfectly with the first word in the long name João ... Reis, it will have a similarity of 1. The same holds for the shorter name João Sausa. So, the similarity doesn’t depend on the length of a name anymore.

The difference between word_similarity and strict_word_similarity is subtle but important. word_similarity compares your search term with sub-parts of a word whereas strict_word_similiarity only compares against whole words. So, use word_similarity if you want to return names even when your search term only matches a part of the name, e.g. Bert matches Dagobert. If you want to prioritize full matches, use strict_word_similiarity, e.g. only Dagobert matches Dagobert, but Bert doesn’t.

In order to make use of your GIN index, you must also use the short-hand operators <% for word_similarity and <<% for strict_word_similarity when writing your query.

🔗 Combine ILIKE and Similarity

Now that you understand both methods, ILIKE and similarity, there’s one combination that you might like best:

def ilike_with_similarity(search_term) do
  ilike_search_term = "%#{search_term}%"

  query =
    from(
      p in Person,
      where: ilike(p.name, ^ilike_search_term),
      order_by: {:desc, fragment("? % ?", ^search_term, p.name)}
    )

  Repo.all(query)
end

This query uses ILIKE to filter out any non-matching names and then orders the remaining names by their similarity to the search term. It uses the GIN index for all operations and returns the most relevant results on top. Keep in mind that it has the same limitation as ILIKE which is that it doesn’t return non-matching, but very similar search results like Ana for the search term Anna. It is independent of the global similarity threshold since it only sorts the results of the ILIKE query. If you use the ILIKE method, you should seriously consider this combination for production use.

🔗 Intermediate Wrap-up

Now you know how to use the ILIKE and similarity operators efficiently to search for names and other short texts in your database. If that is all you want to know, feel free to stop reading at this point. Just make sure to EXPLAIN ANALYZE your queries and check that they use the index before going to production.

The remainder of this blog post takes a deep dive into the nitty-gritty details of how Postgres calculates the similarity score. During our research, we were unable to find a comprehensive explanation of the inner workings of the similarity, word_similarity, and strict_word_similarity operators. So, here it is 😊

🔗 A deep-dive into Similarity

To understand the similarity operator, we first need to understand what Tigrams are. In short: A trigram is a group of three consecutive characters taken from a string. So, the Trigrams of the word Bert would be:

> SELECT show_trgm('Bert');
{"  b"," be","ber","ert","rt "}

Trigrams break down each string into a Set of character groups with 3 letters each. A character group may be padded with spaces if they are at the beginning or end of the string (like " b" above). The similarity operator uses these sets to compare two strings and returns how much they overlap on a scale from 0 to 1. The two sets are more similar if they have many Trigrams in common and less similar if they share fewer Trigrams.

🔗 How Postgres calculates similarity

Let’s break down the three names Bert, Bart, and Berry into their Trigrams and calculate the similarity of Bert and Bart and Bert and Berry. Note that Trigrams ignore the case of letters. Everything is lowercase.

The following chart shows which Trigrams occur in both names (X) and which don’t (-). For example in Bert and Bart, the first Trigram " b" occurs in both names, whereas the second Trigram " ba" doesn’t.

Bert:  {"  b"," be","ber","ert","rt "}
Bart:  {"  b"," ba","art","bar","rt "}
Match:    X     -     -     -     X

Bert:  {"  b"," be","ber","ert","rt "}
Berry: {"  b"," be","ber","err","rry","ry "}
Match:    X     X     X     -     -     - 

Postgres calculates the similarity of these sets using the following calculation:

similarity = number of shared trigrams / number of distinct trigrams

In our comparison of Bert vs Bart, both Bert and Bart have 5 Trigrams and 2 Trigrams in common. We can calculate their similarity with:

similarity = 2 / (5 + 5 - 2) = 0.25

In our comparison of Bert vs Berry, Bert has 5 Trigrams whereas Berry has 6. They have 3 Trigrams in common, so their similarity is:

similarity = 3 / (5 + 6 - 3) = 0.375

Based on this calculation, Bert is more similar to Berry than it is to Bart. Let’s quickly double-check these results in Postgres:

> SELECT similarity('Bert', 'Bart'), similarity('Bert', 'Berry');

similarity | similarity
---------- * ----------
0.25       | 0.375

Cool! Our calculations were correct! Now you know exactly how the similarity operator works (Yey? 😄).

🔗 How Postgres calculates word similarities

The word_similarity and strict_word_similarity scores are calculated in a different way than similarity. Postgres uses only parts of the names to calculate the similarity with the search term. So, if we search for João and our user’s name is João Miguel Mello Fernandes Pereira dos Reis, Postgres will only use the first name João to calculate the similarity to the search term since it is the most similar word in the name.

Let’s have a look the similarity scores when calculating the word and strict_word similarity of Bert and Dagobert Duck:

SELECT strict_word_similarity('Bert', 'Dagobert Duck'), 
       word_similarity('Bert', 'Dagobert Duck');

strict_word_similarity | word_similarity
---------------------- * ----------
0.28 (rounded)         | 0.6

The similarities are quite different! The high-level explanation is that Postgres compares the Trigram sets of Bert and the sub-part bert of Dagobert for word_similarity whereas it compares the sets of Bert and Dagobert for strict_word_similarity. Let’s have a look at the Trigram sets and how they overlap:

Bert:     {"  b"," be","ber","ert","rt "}	
bert:                 {"ber","ert","rt "}
Match:      -      -     X     X     X

Bert:                             {"  b"," be","ber","ert","rt "}	
Dabogert: {"  d"," da","dag","ago","gob","obe","ber","ert","rt "}
Match:       -     -     -     -     -     -     X     X     X

So, if we calculate the similarities of these two comparisons, we get the same result:

word_similarity        = 3 / (5 + 3 - 3) = 0.6
strict_word_similarity = 3 / (5 + 9 - 3) ~ 0.28

These calculations match Postgres’s results exactly! I hope this little deep-dive clears up how Postgres calculates the different similarity values.

🔗 Conclusion

And that’s it! I hope you enjoyed this article! If you have questions or comments, let’s discuss them on Twitter. Follow me on Twitter or subscribe to my newsletter below if you want to get notified when I publish the next blog post. If you found this topic interesting, here are some references for further reading:

Until next time. Thanks for reading!

Liked this post?

Get notified about new posts