Hey! I recorded a video course!

My first video course is out and it's called: Build an MVP with Elixir
If you like this article, you will also like the course! Check it out here!

Before we begin, my apologies for the delay in publishing this article. This year’s flu season hit me pretty hard. But being sick reminded me of an odd experience I had recently:

Someone called me on my phone, sneezed, and then hung up.

To be honest, I'm getting sick and tired of these cold calls!

Anyways. Let’s dive into today’s topic: Efficient full-text search with Postgres and Ecto!

🔗 The Problem

My last article explained how to search efficiently through short strings in your database, like names or emails. But what if you want to search through longer texts like book summaries, news articles, or your notes? Unfortunately, the methods from the last article - ILIKE and SIMILARITY - aren’t efficient enough for longer texts. They scan the entire content from start to end, which is too slow.

That’s exactly where Postgres’ full-text search comes in handy. It allows you to break down long texts into a list of keywords, called tsvector, and search through them efficiently using tsquery. So, instead of scanning the entire text of every book, Postgres only compares normalized lists of keywords with each other. This sounds pretty abstract, so let’s have a look at how these two components work.

🔗 The (short) Solution

This article is quite long, so here’s a sneak peek of the final result for texts in a single language. Multi-language support is much more complex, so if you need that, please keep on reading. Let’s say we have a model called Book that contains a title and a summary of a book. Here’s the migration to create the books table:

def up do
  create table(:books) do
    add :title, :text
    add :summary, :text
  end

  execute """
    ALTER TABLE books
      ADD COLUMN searchable tsvector
      GENERATED ALWAYS AS (
        setweight(to_tsvector('english', coalesce(title, '')), 'A') ||
        setweight(to_tsvector('english', coalesce(summary, '')), 'B')
      ) STORED;
  """

  execute """
    CREATE INDEX books_searchable_idx ON books USING gin(searchable);
  """
end

def down do
  drop table(:books)
end 

Here’s how to search for text in the books table:

def search(search_term) do
  query =
    from(b in Book,
      where:
        fragment(
          "searchable @@ websearch_to_tsquery(?)",
          ^search_term
        ),
      order_by: {
        :desc,
        fragment(
          "ts_rank_cd(searchable, websearch_to_tsquery(?), 4)", 
          ^search_term
        )
      }
    )

  Repo.all(query)
end

Okay, let’s walk through the code, starting with the migration.

🔗 Understanding the Migration

Have a look at the first execute statement. On a high level, it extracts a list of keywords from the title and the summary fields, gives them different weights, and stores them in a new column called searchable.

Now, in detail: The first step of implementing full-text search is to convert all texts into tsvectors. In short, tsvectors are sorted lists of keywords that were extracted and normalized from long texts. They help to reduce long texts into a much more searchable data structure. More on them later.

We generate these tsvectors by calling to_tsvector('english', coalesce(title, '')). The coalesce function replaces NULL fields with an empty string; without it, a single NULL title or summary would turn the entire generated value into NULL. tsvectors depend on the language of a text for applying their normalization. In our case, we defined the language of the text as English. Postgres doesn’t support all languages by default, but you can add your own language too. You can check which languages your Postgres instance supports by running:

SELECT cfgname FROM pg_catalog.pg_ts_config;

Next, we want to prioritize books that have the search term in their title. For example, if we search for books about Elixir, we want to see Elixir in Action first and Seven Languages in Seven Weeks second since the first book contains our search term in its title whereas the second one only mentions the search term somewhere deep in its content.

We can give different tsvectors different weights by using the setweight(tsvector, weight) function. Postgres offers four different weights: A -> B -> C -> D where A is the highest. By default, these weights are set to: 1.0, 0.4, 0.2, 0.1. Postgres multiplies these with the rank of a search result to adjust their position in the overall ranking (more on ranking search results later). In our case, we want the title field to have the highest weight. So, we use setweight(to_tsvector(...), 'A') to give it the A weight.
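To make this more concrete, here’s a small, hypothetical example of the weighted and concatenated tsvectors that the generated column produces (the titles and the exact output are made up for illustration; each lexeme carries its weight as a letter):

SELECT setweight(to_tsvector('english', 'Elixir in Action'), 'A') ||
       setweight(to_tsvector('english', 'A book about the Elixir language'), 'B');
-- => roughly: 'action':3A 'book':5B 'elixir':1A,8B 'languag':9B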

After creating the tsvectors and weighting them, we store them in an auto-generated column called searchable. Postgres will automatically generate a value for this field whenever you insert a book or update its title or summary.

This field isn’t strictly necessary. If you worry about storing too much data, you can omit the execute statement for this field and replace searchable in your index and query with the expression behind the AS keyword, i.e. the setweight(...) || setweight(...) operation.
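If you go that route, the GIN index from the migration becomes an expression index instead. A minimal sketch, assuming everything else from the migration above stays the same:

execute """
  CREATE INDEX books_searchable_idx ON books USING gin((
    setweight(to_tsvector('english', coalesce(title, '')), 'A') ||
    setweight(to_tsvector('english', coalesce(summary, '')), 'B')
  ));
"""

Your where and order_by fragments then have to repeat exactly this expression instead of referencing searchable; otherwise, Postgres won’t use the index.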

Now, the second execute statement in our migration creates a GIN index for the searchable field. This is what makes our full-text search efficient because it allows Postgres to find matching books in a highly-optimized data structure instead of having to scan through every row. You can read more about indexes here.

Now that we understand our migration and how it makes the full-text search efficient, let’s have a look at how to query for books.

🔗 Understanding the Query

Our query has two steps.

# Step 1: Filter for books that match our query
where: fragment("searchable @@ websearch_to_tsquery(?)", ^search_term)

We use the fragment function here to execute raw SQL code. The websearch_to_tsquery function sanitizes and normalizes our user input and makes it safe to expose the full-text search to our end users. It returns a tsquery which is also a sorted list of keywords like tsvector but allows us to add search modifiers like AND, OR, NOT, and FOLLOWED BY (more on those later). Using the @@ operator, we check whether the tsvector in the searchable field matches our tsquery. Postgres returns only those books that match our query.
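To get a feeling for what happens here, you can run both functions directly against Postgres (the output below is approximate and depends on your configuration):

SELECT websearch_to_tsquery('english', 'cleaner elixir code');
-- => 'cleaner' & 'elixir' & 'code'

SELECT to_tsvector('english', 'Coding with Elixir makes for cleaner code')
       @@ websearch_to_tsquery('english', 'cleaner elixir code');
-- => true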

# Step 2: Sort the matching books by their rank/relevance.
order_by: { 
  :desc, 
  fragment(
    "ts_rank_cd(searchable, websearch_to_tsquery(?), 4)", 
    ^search_term
  )
}

Next, we sort the matching books based on their rank so that the most relevant books appear at the top. The ts_rank_cd function calculates:

  1. how often search terms appear in the document (frequency)
  2. how close together the search terms are in the document (proximity)
  3. how important the section of the document is in which the search terms appear (priority)

So, a book that contains a search term multiple times, close together, and in its title has a higher rank than a book that contains the search term only a few times, far apart, and in an unimportant section, for example, the appendix.

One note about the number 4 here: When ranking documents, it is important to take their length into account. A long book will include a search term more often than a short essay just by chance. That doesn’t make it more relevant though. You can adjust the rank for the length of a document by providing a normalization factor. In this example, we used 4, which divides the rank by the average distance between the matching search terms. So, an essay with only a few matches that occur close together now receives a higher rank than a long book with more matches, but far apart.
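If you want to see the effect of the normalization factor on your own data, you can compare the raw and the normalized rank side by side. A small sketch against the books table from above (the search term elixir is just an example):

SELECT
  title,
  ts_rank_cd(searchable, websearch_to_tsquery('elixir'))    AS rank_raw,
  ts_rank_cd(searchable, websearch_to_tsquery('elixir'), 4) AS rank_normalized
FROM books
WHERE searchable @@ websearch_to_tsquery('elixir')
ORDER BY rank_normalized DESC;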

Relevant ranking depends heavily on your type of data, so please play around with the normalization factor and the weights until the ranking becomes useful to you.

Be aware that ranking will slow down your query because it needs to visit the tsvector of every matching document to calculate its rank. Unfortunately, there’s no good way around it. You can mitigate this by limiting how many documents you return from the query, as sketched below.
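Here’s a minimal sketch of such a limit, based on the search/1 function from above (the cap of 20 results is an arbitrary example):

def search(search_term, max_results \\ 20) do
  from(b in Book,
    where: fragment("searchable @@ websearch_to_tsquery(?)", ^search_term),
    order_by: {
      :desc,
      fragment("ts_rank_cd(searchable, websearch_to_tsquery(?), 4)", ^search_term)
    },
    limit: ^max_results
  )
  |> Repo.all()
end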

And that’s our query! It returns all books that match our search terms ranked by their “relevance”.

🔗 Testing your setup

Before you push this code to production, make sure that your GIN index is used. Otherwise, your full-text search will be slow. You can check this by running your query in your favorite Postgres explorer and prefixing it with EXPLAIN ANALYZE. For example, you can set up and run the following query:

-- If you have only a few books in your database, force Postgres
-- to use the index. Otherwise, it falls back to a sequence scan 
-- because that's faster than using an index for only a few rows.
SET enable_seqscan = OFF;

-- The example query
EXPLAIN ANALYZE SELECT * FROM books
  WHERE searchable @@ websearch_to_tsquery('prince');

The output of this query should start with Bitmap Heap Scan. If it doesn’t, please double-check that you followed this article correctly.

🔗 Highlighting matches

Postgres offers the ts_headline function to highlight the matches in your results. By default, it wraps the matches in <b>...</b> HTML tags, but you can customize those. Let’s see how to use this option with Ecto.

def search(search_term) do
  query =
    from(b in Book,
      where: 
        fragment("searchable @@ websearch_to_tsquery(?)", ^search_term),
      select: %{
        headline:
          fragment(
            """
            ts_headline(
              'english', 
              CONCAT(title, ' ', summary), 
              websearch_to_tsquery(?),
              'StartSel=<h3>,StopSel=</h3>,MinWords=25,MaxWords=75'
            )
            """,
            ^search_term
          )
      }
    )

  Repo.all(query)
end

This query will return a headline for every match, which is at least 25 words (see MinWords) and at most 75 words long (see MaxWords). Matches are wrapped in h3 HTML tags (see StartSel and StopSel). Thanks to Barry Jones for writing about this in 2016!

🔗 Before you continue

This concludes the “short” solution of how to implement efficient full-text search with Ecto and Postgres. If you copy&paste the code from above, you should have a working solution. The rest of the article takes a deep dive into the following topics:

  • How can you search through documents in different languages?
  • How does Postgres create tsvectors?
  • How can you use search modifiers in tsqueries?
  • What is the difference between plainto_tsquery, phraseto_tsquery, and websearch_to_tsquery?

If you are interested in any of these topics, please keep on reading. Otherwise, thank you very much for your attention and we’ll meet again in my next article! 🤗

Still here? Nice! So, then let’s continue.

🔗 Adding Multi-language Support

Our short solution above only covered the use-case where your texts are in a single language. Full-text search with texts in multiple languages is much more complex. Below, you will find three solutions which cover the main use-cases. They all support index-based full-text search in texts with multiple languages, but they differ in their ability to support two main use-cases: Dynamically adding texts in new languages and reducing the overall storage costs.

When you first create your database, you might have texts in only a few languages. Over time though, you might want to add texts in new languages. Some of the proposed solutions support adding new languages without you having to do anything. Others require completely rebuilding generated fields and indexes, which can be expensive and lock your database temporarily.

The second use-case is pretty practical: How much extra data do you need to store to support the full-text search? Some solutions need only indexes to make the text searchable. Some need large generated fields which multiply the overall storage costs.

The solutions differ on one more dimension: Do you know the language of a text and the language of your search term? If yes, then you probably want to go with solution two or three. But if not, only solution one will be useful to you.

Each solution supports the use-cases differently, so you must decide for yourself which one fits your individual use-case best. Let’s have a look at them.

🔗 Concatenating tsvectors

The first solution is the most basic one: We create tsvectors of the text for every language that we want to support and store the result in one large field. It isn’t a pretty or cheap solution, but it supports the use-case where you don’t know the language of a text or the language of the search term. It just stores tsvectors for all supported languages and searches through them.

Let’s have a look at the migration:

def up do
# create table(:books) ...

  execute """
    ALTER TABLE books
      ADD COLUMN searchable tsvector
      GENERATED ALWAYS AS (
        to_tsvector('english', coalesce(title, '') || ' ' || coalesce(summary, '')) || 
        to_tsvector('spanish', coalesce(title, '') || ' ' || coalesce(summary, '')) || 
        to_tsvector('italian', coalesce(title, '') || ' ' || coalesce(summary, ''))
      ) STORED;
  """

  execute """
    CREATE INDEX books_searchable_idx ON books USING gin(searchable);
  """
end

Just as in our “short” solution, we generate a new field called searchable that stores one large tsvector that contains tsvectors for every supported language. It will bloat your database with a lot of redundant data, given that only the tsvector in the language of the text is relevant but we still store tsvectors for all other languages too. However, it allows us to search in every possible language, like so:

def search(search_term) do
  query = 
    from(
      b in Book,
      where: fragment("searchable @@ websearch_to_tsquery('english', ?)", ^search_term),
      or_where: fragment("searchable @@ websearch_to_tsquery('spanish', ?)", ^search_term),
      or_where: fragment("searchable @@ websearch_to_tsquery('italian', ?)", ^search_term)
    )
  
  Repo.all(query)
end

This solution isn’t cheap or pretty, but it works if you support only a small set of languages and don’t know the language of the text or the search terms. You will have more false positives since a text might match your search term, but not in the intended language. For example, if you search for Zeitgeist - a German word but commonly used in English too - you might see German texts although you wanted English texts instead.

If you want to add new languages, you would need to change and regenerate the searchable field and its index, which can lock your database temporarily. So, this solution neither supports adding new languages very well nor minimizes your storage costs. Let’s have a look at the next solution, which is significantly better, but not without its own downsides.

🔗 Partial Indexes

The second solution works without an autogenerated field but requires knowledge of the language of a text. Using that info, it creates partial indexes, one per language. Have a look at the migration:

def up do
  create table(:books) do
    add :title, :string
    add :summary, :text
    # Note the new field here
    add :language, :string
  end

  for lang <- ["english", "spanish", "italian"] do
    execute """
      CREATE INDEX books_searchable_partial_idx_#{lang} ON books 
        USING GIN(
          to_tsvector('#{lang}', coalesce(title, '') || ' ' || coalesce(summary, ''))
        )
        WHERE language = '#{lang}';
    """
  end
end

The migration creates an index for every supported language that tracks only the texts written in that language. Taken together, the partial indexes are never larger than a single index covering all languages. This solution doesn’t require an autogenerated field and therefore reduces the storage costs significantly. You can add new languages by creating a new migration that adds another partial index for the new language (see the sketch below). So, adding new languages cannot happen at runtime, but requires a new deploy that executes the migration. However, partial indexes without any matching texts have a negligible size, so you could create indexes for all languages that Postgres supports when setting up the database. This would allow you to add texts in new languages dynamically and at runtime.
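For example, a follow-up migration for one more language could look like this sketch (german is just a placeholder; pick whatever language you need):

def up do
  execute """
    CREATE INDEX books_searchable_partial_idx_german ON books
      USING GIN(
        to_tsvector('german', coalesce(title, '') || ' ' || coalesce(summary, ''))
      )
      WHERE language = 'german';
  """
end

def down do
  execute "DROP INDEX books_searchable_partial_idx_german;"
end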

Querying for texts using partial indexes is pretty straightforward:

def search(language, search_term) do
  query =
    from(b in Book,
      where: b.language == ^language,
      where:
        fragment(
          """
            to_tsvector(
              language::regconfig, 
              coalesce(title, '') || ' ' || coalesce(summary, '')
            ) @@ websearch_to_tsquery(?::text::regconfig, ?)
          """,
          ^language,
          ^search_term
        )
    )

  Repo.all(query)
end

This search function searches for books in a particular language that contain our search term. It uses the partial index of that language to make the search efficient. This is probably the best solution in terms of storage costs and adding new languages. It doesn’t require auto-generated fields and you can set up partial indexes for all languages without much extra cost. The migration and query are pretty minimal and not too verbose.

You must know the language of a text, but you don’t necessarily need to know the language of the search term. You could cycle through a list of “most searched in” languages and invoke the search/2 function until you receive a result. It’s less efficient, but might work well enough in your use-case.
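A hypothetical sketch of that fallback, reusing the search/2 function from above with a made-up list of candidate languages:

def search_any_language(search_term, languages \\ ["english", "spanish", "italian"]) do
  # Try each language in turn and return the first non-empty result.
  Enum.find_value(languages, [], fn language ->
    case search(language, search_term) do
      [] -> nil
      results -> results
    end
  end)
end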

The last solution takes a similar approach, but with only a single index instead of many partial ones. But before we dive into that one, a small caveat about working with regconfig in Ecto.

🔗 Casting Strings to Regconfig

Note that we use ?::text::regconfig in the websearch_to_tsquery function. Ecto doesn’t support casting strings to regconfig directly, so we must cast the parameter first to text and let Postgres handle the casting to regconfig. If you do the following, you will receive an error from Ecto:

where: fragment(
  """
    to_tsvector(
      language::regconfig, 
      coalesce(title, '') || ' ' || coalesce(summary, '')
    -- Note that we don't cast to text first:
    ) @@ websearch_to_tsquery(?, ?)
  """,
  # Ecto can't cast this parameter to `regconfig` 
  # directly and throws an error.
  ^language,
  ^search_term
)

🔗 One index for all languages

The last solution consolidates all partial indexes into a single index that uses the language of a book as a dynamic input for building the indexed value. It was inspired by a few different resources. Let’s have a look at the migration:

def up do
  # Create table books ...
  
  execute """
    CREATE FUNCTION lang_to_regconfig(text) RETURNS regconfig
      LANGUAGE sql IMMUTABLE STRICT AS $$
      SELECT $1::regconfig; $$
  """

  execute """
    CREATE INDEX books_language_agnostic_idx ON books 
      USING gin(
        to_tsvector(
            lang_to_regconfig(language), 
            coalesce(title, '') || ' ' || coalesce(summary, '')
        )
      );
  """
end

The migration first creates an immutable function that casts a text to a regconfig language. Without this function, we cannot create the index because the bare language::regconfig cast is not immutable and Postgres only allows immutable functions in index expressions.

The second step is to create the language agnostic index. It uses the lang_to_regconfig function to cast the language of a book into a regconfig language. The rest is identical to the migration for the partial indexes.

And that’s it! We now have a single index for all books that contains, for each book, a tsvector only in its own language. We can use the index for searching like this:

def search(language, search_term) do
  query =
    from(b in Book,
      where:
        fragment(
          "to_tsvector(
            lang_to_regconfig(language), 
            coalesce(title, '') || ' ' || coalesce(summary, '')
          ) @@ websearch_to_tsquery(?::text::regconfig, ?)",
          ^language,
          ^search_term
        )
    )

  Repo.all(query)
end

Note that the left side of our where fragment must match exactly how we constructed the index. So, if we didn’t use the lang_to_regconfig(language) function in our where fragment, Postgres would not use the index. As always, please check your implementation locally using a Postgres explorer and the EXPLAIN ANALYZE prefix before you deploy it into production.

This final solution combines the best approaches of the previous solutions. It doesn’t generate a new field to keep the storage costs minimal. You can add languages dynamically without the need for a new migration and don’t have to create partial indexes for every supported language. You don’t have to filter your books based on their language, but can search through all of them at once.

According to Postgres’ analysis, the single index is significantly less costly (16.52 total cost) than the partial index solution (3212.54 total cost). Sadly, we weren’t able to test these two solutions on a real-world large dataset, so please let us know what works best for you!

🔗 Performance Comparison

We tested the three solutions with ~23k abstracts in 4 languages (German, Italian, Spanish, Dutch) taken from the Gutenberg Dialog project. We used Benchee to benchmark each query. We only searched in a single language (German) with a single search term (a short sentence) that matched only a single abstract. Here are the results:

Name                        ips        average  deviation         median         99th %
concatenated             242.57        4.12 ms    ±39.02%        3.76 ms        8.96 ms
language agnostic        232.71        4.30 ms    ±60.96%        3.54 ms       12.12 ms
partial indexes            2.73      366.62 ms     ±5.86%      367.23 ms      420.01 ms

Comparison: 
concatenated             242.57
language agnostic        232.71 - 1.04x slower +0.175 ms
partial indexes            2.73 - 88.93x slower +362.50 ms

Surprisingly, the “dumb” solution of concatenating all languages was the fastest! Our language agnostic solution with a single index made a decent second place, but the shocking outlier was our partial index search. We checked whether the German partial index was used by analyzing the query and got the following output:

Gather  (cost=1000.00..6737.78 rows=1 width=739) (actual time=1.645..137.773 rows=1 loops=1)
  Workers Planned: 1
  Workers Launched: 1
  ->  Parallel Seq Scan on books  (cost=0.00..5737.68 rows=1 width=739) (actual time=64.755..132.177 rows=0 loops=2)
        Filter: (((language)::text = 'german'::text) AND (to_tsvector((language)::regconfig, (((COALESCE(title, ''::character varying))::text || ' '::text) || COALESCE(summary, ''::text))) @@ '''glut'' & ''inn'' & ''zorn'' & ''gesicht'' & ''trat'''::tsquery))
        Rows Removed by Filter: 11218
Planning Time: 3.470 ms
Execution Time: 137.803 ms

Apparently, Postgres spins up a worker to conduct the partial index search, which slows down the search significantly. At this point, we finished the benchmark, but any tips on how to improve this solution are most welcome!

🔗 A short Interlude

This concludes the section about implementations for multiple languages. We discussed three solutions to implement full-text search with multiple languages: concatenation, partial indexes, and a single language agnostic index. The first and last solution both work well, so it’s up to you to decide which one works best.

The rest of the article dives into smaller Postgres-related questions like:

  1. What are tsvectors and how does Postgres create them?
  2. What are tsqueries and how can you use search modifiers in them?

If you have read this far, congratulations! This certainly isn’t a short or easy blog post, so give yourself a pat on the back and get another coffee. It becomes a bit nit-picky from here on.

🔗 Creating tsvectors

A tsvector is a sorted list of lexemes, which are words normalized to their most basic linguistic form. For example, the words program, programs, and programming all have the same lexeme program. Let’s see how Postgres creates a tsvector from the sentence Coding with Elixir makes for cleaner code.

SELECT to_tsvector('english', 'Coding with Elixir makes for cleaner code');
----------------------------------------------
'cleaner':6 'code':1,7 'elixir':3 'make':4

When creating a tsvector, Postgres applies a whole list of normalizations. They might remove, replace, or group words, but they all serve one purpose: to reduce long texts to their most relevant keywords.

In this case, Postgres applied these normalizations that we can see:

  1. It removed stop-words like is, a, for, with that have little relevance since they occur in almost every text.
  2. It reduced every word to its lexeme, e.g. coding -> code.
  3. It grouped words that reduce to the same lexeme, e.g. [coding, code] -> code.
  4. It sorted the lexemes alphabetically. This allows for more efficient searching for keywords because their position in the list is approximately known.
  5. It kept the position of every lexeme in the text, e.g. code:1,7 occurred as the first and seventh word. Keeping the position allows us to calculate the proximity of search words in the text. More about this later.

Caveat: Postgres returns the best results if you can provide the language of the input text. In our case, we used English, which is the default. If you don’t know the language of your text, you can use the simple configuration instead. It won’t reduce words to their lexemes or remove stop-words; it mostly just lower-cases the tokens (see the comparison below). So, take this as a caveat: if you don’t know the language of your text, your search results and search performance will suffer and you will have to store a lot more data.
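For illustration, here’s the sentence from above run through the simple and the english configuration (output approximate):

SELECT to_tsvector('simple', 'Coding with Elixir makes for cleaner code');
-- => 'cleaner':6 'code':7 'coding':1 'elixir':3 'for':5 'makes':4 'with':2

SELECT to_tsvector('english', 'Coding with Elixir makes for cleaner code');
-- => 'cleaner':6 'code':1,7 'elixir':3 'make':4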

Now that we know how Postgres creates tsvectors, let’s have a look at their counterpart: the tsquery.

🔗 Search Modifiers in tsquery

When you build your tsquery, you can use search modifiers like AND, OR, NOT, and FOLLOWED BY. Let’s see how they work:

  • To search for documents that include both words elixir AND postgres, your tsquery would be to_tsquery('elixir & postgres')
  • To match documents that contain either elixir OR postgres, you can search with to_tsquery('elixir | postgres')
  • To find documents that contain an entire phrase, like efficient search, you can search with: to_tsquery('efficient <-> search')
  • To find documents that don’t include certain words, you can use to_tsquery('!java').
  • To group words, you can use parentheses. For example, to match documents that contain either elixir AND postgres or elixir AND ecto use: to_tsquery('(elixir & postgres) | (elixir & ecto)').
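As a quick sanity check, you can try a grouped query directly against a tsvector (a hypothetical one-liner; it assumes your default text search configuration is english, so both sides are stemmed the same way):

SELECT to_tsvector('english', 'Build full-text search with Ecto and Postgres')
       @@ to_tsquery('(elixir & postgres) | (ecto & postgres)');
-- => true, because the right-hand group (ecto & postgres) matches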

When you write these searches, you probably don’t want to write the search modifiers by hand. That’s why Postgres offers three convenience functions: plainto_tsquery, phraseto_tsquery, and the most important one: websearch_to_tsquery.

  • plainto_tsquery ignores any search modifiers and simply adds AND between each word. So, plainto_tsquery('elixir code') equals to_tsquery('elixir & code') and matches any document that contains both words.
  • phraseto_tsquery also ignores any modifiers and adds <-> between each word. This matches documents that contain an exact phrase like elixir is great. It won’t match documents where these words exist, but in the wrong order, like in Is elixir great?.
  • websearch_to_tsquery is the safest way to allow user input in your full-text queries. It allows all search modifiers but expects them in a human-readable form. OR is simply OR, NOT becomes -, AND stays AND but can be omitted, and phrases have to be put in quotes, like "efficient search". For example:
-- These queries are equal
websearch_to_tsquery('"efficient search" or postgres and -java');
to_tsquery('(efficient <-> search) | (postgres & !java)');

🔗 Conclusion

And that’s it! I hope you enjoyed this article! If you have questions or comments, let’s discuss them on Twitter. Follow me on Twitter or subscribe to my newsletter if you want to get notified when I publish the next blog post.
