Semantic search

Learn how to search by meaning rather than exact keywords.

Semantic search interprets the meaning behind user queries rather than exact keywords. It uses machine learning to capture the intent and context behind the query, handling language nuances like synonyms, phrasing variations, and word relationships.

Semantic search is useful in applications where the depth of understanding and context is important for delivering relevant results. A good example is in customer support or knowledge base search engines. Users often phrase their problems or questions in various ways, and a traditional keyword-based search might not always retrieve the most helpful documents. With semantic search, the system can understand the meaning behind the queries and match them with relevant solutions or articles, even if the exact wording differs.

For instance, a user searching for "increase text size on display" might miss articles titled "How to adjust font size in settings" in a keyword-based search system. However, a semantic search engine would understand the intent behind the query and correctly match it to relevant articles, regardless of the specific terminology used.

It's also possible to combine semantic search with keyword search to get the best of both worlds. See Hybrid search for more details.

How semantic search works

Semantic search uses an intermediate representation called an “embedding vector” to link database records with search queries. A vector, in the context of semantic search, is a list of numerical values. These values represent various features of the text and make it possible to compare different pieces of text by meaning.

The best way to think of embeddings is by plotting them on a graph, where each embedding is a single point whose coordinates are the numerical values within its vector. Importantly, embeddings are plotted such that similar concepts are positioned close together while dissimilar concepts are far apart. For more details, see What are embeddings?

Embeddings are generated by a language model and compared to each other using a similarity metric. The language model is trained to understand the semantics of language, including syntax, context, and the relationships between words. It generates embeddings for both the content in the database and the search queries. The similarity metric, often a function like cosine similarity or dot product, then compares the query embedding with each document embedding (in other words, it measures how close they are to each other on the graph). The documents whose embeddings are most similar to the query's are deemed the most relevant and are returned as search results.
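The comparison step can be sketched in a few lines of JavaScript. This is a minimal illustration, not output from a real model: the three-dimensional vectors, the document titles' embeddings, and the query embedding are all made-up values chosen so the ranking is easy to follow.

```javascript
// Toy 3-dimensional embeddings; real models produce hundreds of dimensions.
// All vector values below are invented for illustration.
function dot(a, b) {
  return a.reduce((sum, value, i) => sum + value * b[i], 0);
}

// Cosine similarity: dot product divided by the product of the vector lengths.
function cosineSimilarity(a, b) {
  return dot(a, b) / (Math.sqrt(dot(a, a)) * Math.sqrt(dot(b, b)));
}

const documents = [
  { title: 'How to adjust font size in settings', embedding: [0.9, 0.1, 0.0] },
  { title: 'Resetting your password', embedding: [0.0, 0.2, 0.9] },
  { title: 'Changing display brightness', embedding: [0.6, 0.6, 0.1] },
];

// Hypothetical embedding for the query "increase text size on display".
const queryEmbedding = [0.8, 0.3, 0.0];

// Score every document against the query, then sort from most to least similar.
const ranked = documents
  .map((doc) => ({
    title: doc.title,
    similarity: cosineSimilarity(queryEmbedding, doc.embedding),
  }))
  .sort((a, b) => b.similarity - a.similarity);

console.log(ranked.map((r) => r.title));
```

With these toy values, the font size article ranks first and the password article ranks last, even though the query shares no keywords with either title.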

Embedding models

There are many embedding models available today. Datafuse Edge Functions has built-in support for the gte-small model. Some models are accessed through third-party APIs like OpenAI, where you send your text in the request and receive an embedding vector in the response. Others can run locally on your own compute, such as through Transformers.js for JavaScript implementations. For more information on local implementation, see Generate embeddings.

It's crucial to remember that when using embedding models with semantic search, you must use the same model for all embedding comparisons. Comparing embeddings created by different models will yield meaningless results.
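One way to guard against mixing models is to store the model name next to each embedding and check it before comparing. The sketch below assumes a hypothetical record shape of `{ model, vector }` and a hypothetical helper named `assertComparable`; neither is part of any library.

```javascript
// Hypothetical record shape: each embedding remembers which model produced it.
// Throws if two embeddings cannot be meaningfully compared.
function assertComparable(a, b) {
  if (a.model !== b.model) {
    throw new Error(`Cannot compare embeddings from "${a.model}" and "${b.model}"`);
  }
  if (a.vector.length !== b.vector.length) {
    throw new Error(`Dimension mismatch: ${a.vector.length} vs ${b.vector.length}`);
  }
}

// Toy, truncated vectors for illustration only.
const stored = { model: 'gte-small', vector: [0.1, 0.2, 0.3] };
const query = { model: 'gte-small', vector: [0.3, 0.2, 0.1] };

assertComparable(stored, query); // passes: same model, same number of dimensions
```

Note that matching dimensions alone is not enough: two different models can produce vectors of the same length that are still not comparable, which is why the model name is checked first.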

Semantic search in Postgres

To implement semantic search in Postgres, we use pgvector, an extension that allows for efficient storage and retrieval of high-dimensional vectors. These vectors are numerical representations of text (or other types of data) generated by embedding models.

  1. Enable the pgvector extension by running:
    create extension vector
    with
      schema extensions;
    
  2. Create a table to store the embeddings:
    create table documents (
      id bigint primary key generated always as identity,
      content text,
      embedding vector(512)
    );
    

    Or if you have an existing table, you can add a vector column like so:
    alter table documents
    add column embedding vector(512);
    

    In this example, we create a column named embedding which uses the newly enabled vector data type. The size of the vector (as indicated in parentheses) represents the number of dimensions in the embedding. Here we use 512, but adjust this to match the number of dimensions produced by your embedding model.

For more details on vector columns, including how to generate embeddings and store them, see Vector columns.

Similarity metric

pgvector supports three operators for computing the distance between embeddings:

| Operator | Description |
| --- | --- |
| <-> | Euclidean distance |
| <#> | negative inner product |
| <=> | cosine distance |

These operators are used directly in your SQL query to retrieve records that are most similar to the user's search query. Choosing the right operator depends on your needs. Inner product (also known as dot product) tends to be the fastest if your vectors are normalized.
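To make the three metrics concrete, here is a small JavaScript sketch of what each operator computes. It is an illustration of the math only, not pgvector's implementation, and the function names and sample vectors are invented for this example.

```javascript
function dot(a, b) {
  return a.reduce((sum, value, i) => sum + value * b[i], 0);
}

// <-> : straight-line distance between the two points.
function euclideanDistance(a, b) {
  return Math.sqrt(a.reduce((sum, value, i) => sum + (value - b[i]) ** 2, 0));
}

// <#> : inner product with the sign flipped, so smaller means more similar.
function negativeInnerProduct(a, b) {
  return -dot(a, b);
}

// <=> : 1 minus cosine similarity, so 0 means the same direction.
function cosineDistance(a, b) {
  return 1 - dot(a, b) / (Math.sqrt(dot(a, a)) * Math.sqrt(dot(b, b)));
}

// Scale a vector to unit length.
function normalize(v) {
  const length = Math.sqrt(dot(v, v));
  return v.map((value) => value / length);
}

const a = normalize([3, 4]); // approximately [0.6, 0.8]
const b = normalize([4, 3]); // approximately [0.8, 0.6]

console.log(euclideanDistance(a, b));
console.log(negativeInnerProduct(a, b));
console.log(cosineDistance(a, b));
```

For unit-length vectors, cosine distance equals 1 plus the negative inner product, so both produce the same ordering. The inner product simply skips the division by the vector lengths, which is why it can be cheaper when vectors are already normalized.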

The easiest way to perform semantic search in Postgres is by creating a function:

-- Match documents using cosine distance (<=>)
create or replace function match_documents (
  query_embedding vector(512),
  match_threshold float,
  match_count int
)
returns setof documents
language sql
as $$
  select *
  from documents
  where documents.embedding <=> query_embedding < 1 - match_threshold
  order by documents.embedding <=> query_embedding asc
  limit least(match_count, 200);
$$;

Here we create a function match_documents that accepts three parameters:

  1. query_embedding: a one-time embedding generated for the user's search query. Here we set the size to 512, but adjust this to match the number of dimensions produced by your embedding model.
  2. match_threshold: the minimum similarity between embeddings. This is a value between -1 and 1, where 1 is most similar and -1 is most dissimilar.
  3. match_count: the maximum number of results to return. Note the query may return fewer than this number if match_threshold resulted in a small shortlist. Limited to 200 records to avoid unintentionally overloading your database.

In this example, we return a setof documents and refer to documents throughout the query. Adjust this to use the relevant tables in your application.
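To see how the three parameters interact, the following JavaScript sketch mirrors the where, order by, and limit clauses of the function above on an in-memory array. It is only a mental model: the `matchDocuments` helper and the two-dimensional sample vectors are hypothetical, and unlike the SQL function it also attaches the computed distance to each result.

```javascript
function dot(a, b) {
  return a.reduce((sum, value, i) => sum + value * b[i], 0);
}

// Same quantity the <=> operator returns: 1 minus cosine similarity.
function cosineDistance(a, b) {
  return 1 - dot(a, b) / (Math.sqrt(dot(a, a)) * Math.sqrt(dot(b, b)));
}

// Hypothetical in-memory equivalent of the match_documents SQL function.
function matchDocuments(documents, queryEmbedding, matchThreshold, matchCount) {
  return documents
    .map((doc) => ({ ...doc, distance: cosineDistance(doc.embedding, queryEmbedding) }))
    .filter((doc) => doc.distance < 1 - matchThreshold) // where ... < 1 - match_threshold
    .sort((x, y) => x.distance - y.distance) // order by ... asc
    .slice(0, Math.min(matchCount, 200)); // limit least(match_count, 200)
}

// Toy 2-dimensional embeddings, invented for illustration.
const docs = [
  { id: 1, embedding: [1, 0] }, // same direction as the query
  { id: 2, embedding: [0.8, 0.6] }, // close to the query
  { id: 3, embedding: [0, 1] }, // unrelated (similarity 0)
  { id: 4, embedding: [-1, 0] }, // opposite (similarity -1)
];

const matches = matchDocuments(docs, [1, 0], 0.5, 10);
console.log(matches.map((doc) => doc.id));
```

With a threshold of 0.5, only documents 1 and 2 pass the filter, so fewer than the requested 10 results come back.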

You'll notice we are using the cosine distance (<=>) operator in our query. Cosine distance is a safe default when you don't know whether or not your embeddings are normalized. If you know for a fact that they are normalized (for example, your embedding is returned from OpenAI), you can use negative inner product (<#>) for better performance:

-- Match documents using negative inner product (<#>)
create or replace function match_documents (
  query_embedding vector(512),
  match_threshold float,
  match_count int
)
returns setof documents
language sql
as $$
  select *
  from documents
  where documents.embedding <#> query_embedding < -match_threshold
  order by documents.embedding <#> query_embedding asc
  limit least(match_count, 200);
$$;

Note that since <#> is negative, we negate match_threshold accordingly in the where clause. For more information on the different operators, see the pgvector docs.

Calling from your application

Finally, you can execute this function from your application. If you are using a Datafuse client library such as datafuse-js, you can invoke it using the rpc() method:

const { data: documents } = await datafuse.rpc('match_documents', {
  query_embedding: embedding, // pass the query embedding
  match_threshold: 0.78, // choose an appropriate threshold for your data
  match_count: 10, // choose the number of matches
})

You can also call this function directly from SQL:

select *
from match_documents(
  '[...]'::vector(512), -- pass the query embedding
  0.78, -- choose an appropriate threshold for your data
  10 -- choose the number of matches
);

In this scenario, you'll likely use a Postgres client library to establish a direct connection from your application to the database. It's best practice to parameterize your arguments before executing the query.
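As a rough sketch of what parameterizing could look like, the helpers below build a query object with positional placeholders instead of interpolating values into the SQL string. The helper names `toVectorLiteral` and `buildMatchQuery` are hypothetical. The `{ text, values }` shape is the one accepted by node-postgres, which is assumed here as the client library; adapt it to whatever client you use.

```javascript
// pgvector accepts a vector as text in the form '[1,2,3]'.
// Hypothetical helper: serializes a numeric array into that form.
function toVectorLiteral(embedding) {
  if (!Array.isArray(embedding) || !embedding.every((v) => Number.isFinite(v))) {
    throw new TypeError('embedding must be an array of finite numbers');
  }
  return `[${embedding.join(',')}]`;
}

// Hypothetical helper: builds a parameterized call to match_documents.
// $1, $2, and $3 are filled in by the database driver, not by string concatenation.
function buildMatchQuery(embedding, matchThreshold, matchCount) {
  return {
    text: 'select * from match_documents($1::vector(512), $2, $3)',
    values: [toVectorLiteral(embedding), matchThreshold, matchCount],
  };
}

// Assumed usage with a connected node-postgres client (not created in this sketch),
// where queryEmbedding has 512 dimensions to match the function signature:
// const { rows } = await client.query(buildMatchQuery(queryEmbedding, 0.78, 10));
```

Keeping the embedding in the values array means the driver handles quoting, and the numeric check rejects anything that is not a plain list of numbers before it reaches the database.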

Next steps

As your database scales, you will need an index on your vector columns to maintain fast query speeds. See Vector indexes for an in-depth guide on the different types of indexes and how they work.


Copyright © 2024. All rights reserved.