
Exact and Approximate Nearest Neighbor Search

The problem of finding the N closest points in a multi-dimensional (vector) space for a given point is known as nearest neighbor search. Two general approaches exist for solving nearest neighbor search:

  • Exact nearest neighbor search calculates the distance between the given point and all points in the vector space. This ensures the best possible accuracy, i.e. the returned points are guaranteed to be the actual nearest neighbors. Since the vector space is explored exhaustively, exact nearest neighbor search can be too slow for real-world use.
  • Approximate nearest neighbor search refers to a group of techniques (e.g., special data structures like graphs and random forests) which compute results much faster than exact nearest neighbor search. The result accuracy is typically "good enough" for practical use. Many approximate techniques provide parameters to tune the trade-off between the result accuracy and the search time.

A nearest neighbor search (exact or approximate) can be written in SQL as follows:
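The general shape of such a query is sketched below (a sketch with placeholders: table stands for the table name, vectors for the vector column, and [...] for elided value and column lists):

```sql
WITH [...] AS reference_vector
SELECT [...]
FROM table
ORDER BY <DistanceFunction>(vectors, reference_vector) ASC
LIMIT <N>
```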

The points in the vector space are stored in a column (called vectors in the example) of array type, e.g. Array(Float64), Array(Float32), or Array(BFloat16). The reference vector is a constant array given as a common table expression. <DistanceFunction> computes the distance between the reference point and all stored points. Any of the available distance functions can be used for that. <N> specifies how many neighbors should be returned.

An exact nearest neighbor search can be performed using the above SELECT query as is. The runtime of such queries is generally proportional to the number of stored vectors and their dimension, i.e. the number of array elements. Also, since ClickHouse performs a brute-force scan of all vectors, the runtime also depends on the number of threads used by the query (see setting max_threads).

One common approach to speed up exact nearest neighbor search is to use a lower-precision float data type. For example, if the vectors are stored as Array(BFloat16) instead of Array(Float32), then the data size is cut in half, and the query runtimes are expected to go down by half as well. This method is known as quantization, and it might reduce the result accuracy despite an exhaustive scan of all vectors. Whether the precision loss is acceptable depends on the use case and typically requires experimentation.

Example

returns
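A minimal sketch of such an example (hypothetical table tab with two-dimensional vectors and made-up data):

```sql
CREATE TABLE tab(id Int32, vec Array(Float32)) ENGINE = MergeTree ORDER BY id;

INSERT INTO tab VALUES (0, [1.0, 2.0]), (1, [3.0, 4.0]), (2, [5.0, 6.0]);

WITH [0.0, 2.0] AS reference_vec
SELECT id, vec
FROM tab
ORDER BY L2Distance(vec, reference_vec) ASC
LIMIT 3;
```

With this data, row 0 is closest to the reference vector (distance 1.0), followed by rows 1 and 2.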


ClickHouse provides a special "vector similarity" index to perform approximate nearest neighbor search.

note

Vector similarity indexes are currently experimental. To enable them, please first run SET allow_experimental_vector_similarity_index = 1. If you run into problems, kindly open an issue in the ClickHouse repository.

Creating a Vector Similarity Index

A vector similarity index can be created on a new table like this:
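A sketch, assuming a hypothetical table tab with a 256-dimensional vector column vectors (the index name idx is arbitrary, and the exact parameter list of vector_similarity depends on the ClickHouse version):

```sql
CREATE TABLE tab
(
    id Int32,
    vectors Array(Float32),
    INDEX idx vectors TYPE vector_similarity('hnsw', 'L2Distance', 256)
)
ENGINE = MergeTree
ORDER BY id;
```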

Alternatively, to add a vector similarity index to an existing table:
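A sketch, again assuming a hypothetical table tab with a 256-dimensional vector column vectors:

```sql
ALTER TABLE tab ADD INDEX idx vectors TYPE vector_similarity('hnsw', 'L2Distance', 256);
```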

Vector similarity indexes are special kinds of skipping indexes (see here and here). Accordingly, the above ALTER TABLE statement only causes the index to be built for future data inserted into the table. To build the index for existing data as well, you need to materialize it:
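A sketch, assuming a hypothetical table tab with an index named idx:

```sql
ALTER TABLE tab MATERIALIZE INDEX idx;
```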

Function <distance_function> must be

  • L2Distance, the Euclidean distance, representing the length of a line between two points in Euclidean space, or
  • cosineDistance, the cosine distance, representing the angle between two non-zero vectors.

For normalized data, L2Distance is usually the best choice, otherwise cosineDistance is recommended to compensate for scale.

<dimensions> specifies the array cardinality (number of elements) in the underlying column. If ClickHouse finds an array with a different cardinality during index creation, the index is discarded and an error is returned.

The optional GRANULARITY parameter <N> refers to the size of the index granules (see here). The default value of 100 million should work reasonably well for most use cases, but it can also be tuned. Tuning is recommended only for advanced users who understand the implications (see below).

Vector similarity indexes are generic in the sense that they can accommodate different approximate search methods. The method actually used is specified by the parameter <type>. As of now, the only available method is HNSW (academic paper), a popular and state-of-the-art technique for approximate vector search based on hierarchical proximity graphs. If HNSW is used as the type, users may optionally specify further HNSW-specific parameters:

These HNSW-specific parameters are available:

  • <quantization> controls the quantization of the vectors in the proximity graph. Possible values are f64, f32, f16, bf16, or i8. The default value is bf16. Note that this parameter does not affect the representation of the vectors in the underlying column.
  • <hnsw_max_connections_per_layer> controls the number of neighbors per graph node, also known as HNSW hyperparameter M. The default value is 32. Value 0 means using the default value.
  • <hnsw_candidate_list_size_for_construction> controls the size of the dynamic candidate list during construction of the HNSW graph, also known as HNSW hyperparameter ef_construction. The default value is 128. Value 0 means using the default value.

The default values of all HNSW-specific parameters work reasonably well in the majority of use cases. We therefore do not recommend customizing the HNSW-specific parameters.

Further restrictions apply:

  • Vector similarity indexes can only be built on columns of type Array(Float32), Array(Float64), or Array(BFloat16). Arrays of nullable and low-cardinality floats such as Array(Nullable(Float32)) and Array(LowCardinality(Float32)) are not allowed.
  • Vector similarity indexes must be built on single columns.
  • Vector similarity indexes may be built on calculated expressions (e.g., INDEX index_name arraySort(vectors) TYPE vector_similarity([...])) but such indexes cannot be used for approximate neighbor search later on.
  • Vector similarity indexes require that all arrays in the underlying column have <dimensions>-many elements - this is checked during index creation. To detect violations of this requirement as early as possible, users can add a constraint for the vector column, e.g., CONSTRAINT same_length CHECK length(vectors) = 256.
  • Likewise, array values in the underlying column must not be empty ([]) or have a default value (also []).

Using a Vector Similarity Index

note

To use vector similarity indexes, setting compatibility must be '' (the default value), or '25.1' or newer.

Vector similarity indexes support SELECT queries of this form:
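A sketch of the supported query shape (placeholders as before; the WHERE clause is optional):

```sql
WITH [...] AS reference_vector
SELECT [...]
FROM table
WHERE [...]        -- optional additional filters
ORDER BY <DistanceFunction>(vectors, reference_vector) ASC
LIMIT <N>
```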

ClickHouse's query optimizer tries to match the above query template and make use of available vector similarity indexes. A query can only use a vector similarity index if the distance function in the SELECT query is the same as the distance function in the index definition.

Advanced users may provide a custom value for setting hnsw_candidate_list_size_for_search (also known as HNSW hyperparameter "ef_search") to tune the size of the candidate list during search (e.g. SELECT [...] SETTINGS hnsw_candidate_list_size_for_search = <value>). The default value of 256 works well in the majority of use cases. Higher setting values mean better accuracy at the cost of slower performance.

If the query can use a vector similarity index, ClickHouse checks that the LIMIT <N> provided in SELECT queries is within reasonable bounds. More specifically, an error is returned if <N> is bigger than the value of setting max_limit_for_vector_search_queries with default value 100. Too large LIMIT values can slow down searches and usually indicate a usage error.

To check if a SELECT query uses a vector similarity index, you can prefix the query with EXPLAIN indexes = 1.

As an example, query

may return
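A sketch of what such an EXPLAIN query and its output may look like (table, column, and index names are assumptions, and the exact output format varies by ClickHouse version):

```sql
EXPLAIN indexes = 1
WITH [...] AS reference_vector
SELECT id
FROM dbpedia
ORDER BY cosineDistance(vectors, reference_vector) ASC
LIMIT 10;
```

with output along the lines of:

```
Expression (Project names)
  Limit (preliminary LIMIT (without OFFSET))
    Sorting (Sorting for ORDER BY)
      Expression (Before ORDER BY)
        ReadFromMergeTree (default.dbpedia)
        Indexes:
          PrimaryKey
            Condition: true
            Parts: 1/1
            Granules: 575/575
          Skip
            Name: idx
            Description: vector_similarity GRANULARITY 100000000
            Parts: 1/1
            Granules: 10/575
```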

In this example, 1 million vectors in the dbpedia dataset, each with dimension 1536, are stored in 575 granules, i.e. 1.7k rows per granule. The query asks for 10 neighbors and the vector similarity index finds these 10 neighbors in 10 separate granules. These 10 granules will be read during query execution.

Vector similarity indexes are used if the output contains Skip together with the name and type of the vector index (in the example, idx and vector_similarity). In this case, the vector similarity index dropped 565 of the 575 granules, i.e. 98% of the data. The more granules can be dropped, the more effective index usage becomes.

tip

To enforce index usage, you can run the SELECT query with setting force_data_skipping_indexes (provide the index name as setting value).
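For example, assuming an index named idx:

```sql
WITH [...] AS reference_vector
SELECT id
FROM tab
ORDER BY L2Distance(vectors, reference_vector) ASC
LIMIT 10
SETTINGS force_data_skipping_indexes = 'idx';
```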

Post-filtering and Pre-filtering

Users may optionally specify a WHERE clause with additional filter conditions for the SELECT query. ClickHouse will evaluate these filter conditions using a post-filtering or pre-filtering strategy. In short, the two strategies differ in the order in which the filters are evaluated:

  • Post-filtering means that the vector similarity index is evaluated first, afterwards ClickHouse evaluates the additional filter(s) specified in the WHERE clause.
  • Pre-filtering means that the filter evaluation order is the other way round.

The strategies have different trade-offs:

  • Post-filtering has the general problem that it may return fewer than the number of rows requested in the LIMIT <N> clause. This situation happens when one or more result rows returned by the vector similarity index fail to satisfy the additional filters.
  • Efficient pre-filtering is generally an unsolved problem. Certain specialized vector databases provide pre-filtering algorithms, but most relational databases (including ClickHouse) will fall back to exact neighbor search, i.e., a brute-force scan without index.

Which strategy is used depends on the filter condition.

Additional filters are part of the partition key

If the additional filter condition is part of the partition key, then ClickHouse will apply partition pruning. As an example, a table is range-partitioned by column year and the following query is run:
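A sketch of such a query (hypothetical table tab, range-partitioned by year):

```sql
WITH [...] AS reference_vector
SELECT id
FROM tab
WHERE year = 2025
ORDER BY L2Distance(vectors, reference_vector) ASC
LIMIT 10;
```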

ClickHouse will prune all partitions except the 2025 one.

Additional filters cannot be evaluated using indexes

If additional filter conditions cannot be evaluated using indexes (primary key index, skipping index), ClickHouse will apply post-filtering.

Additional filters can be evaluated using the primary key index

If additional filter conditions can be evaluated using the primary key (i.e., they form a prefix of the primary key), then:

  • if the filter condition eliminates at least one row within a part, ClickHouse falls back to pre-filtering for the "surviving" ranges within the part;
  • if the filter condition eliminates no rows within a part, ClickHouse performs post-filtering for the part.

In practical use cases, the latter case is rather unlikely.

Additional filters can be evaluated using skipping index

If additional filter conditions can be evaluated using skipping indexes (minmax index, set index, etc.), ClickHouse performs post-filtering. In such cases, the vector similarity index is evaluated first, as it is expected to remove the most rows relative to other skipping indexes.

For finer control over post-filtering vs. pre-filtering, two settings can be used:

Setting vector_search_filter_strategy (default: auto, which implements the above heuristics) may be set to prefilter. This is useful to force pre-filtering in cases where the additional filter conditions are extremely selective. As an example, the following query may benefit from pre-filtering:
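A sketch of such a query (hypothetical books table and an assumed embedding function getEmbedding):

```sql
SELECT bookid, author, title
FROM books
WHERE price < 2.0
ORDER BY cosineDistance(book_vector, getEmbedding('Books on ancient Asian empires'))
LIMIT 10;
```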

Assuming that only a very small number of books cost less than 2 dollars, post-filtering may return zero rows because the top 10 matches returned by the vector index could all be priced above 2 dollars. By forcing pre-filtering (adding SETTINGS vector_search_filter_strategy = 'prefilter' to the query), ClickHouse first finds all books with a price of less than 2 dollars and then executes a brute-force vector search over the found books.

As an alternative approach to resolving the above issue, setting vector_search_postfilter_multiplier (default: 1.0) may be configured to a value > 1.0 (for example, 2.0). The number of nearest neighbors fetched from the vector index is multiplied by the setting value, and the additional filter is then applied to those rows to return LIMIT-many rows. As an example, we can query again but with multiplier 3.0:
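A sketch of such a query (hypothetical books table and an assumed embedding function getEmbedding):

```sql
SELECT bookid, author, title
FROM books
WHERE price < 2.0
ORDER BY cosineDistance(book_vector, getEmbedding('Books on ancient Asian empires'))
LIMIT 10
SETTINGS vector_search_postfilter_multiplier = 3.0;
```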

ClickHouse will fetch 3.0 x 10 = 30 nearest neighbors from the vector index in each part and afterwards evaluate the additional filters. Only the ten closest neighbors will be returned. Note that setting vector_search_postfilter_multiplier can mitigate the problem, but in extreme cases (a very selective WHERE condition), it is still possible that fewer than the requested N rows are returned.

Performance Tuning

Tuning Compression

In virtually all use cases, the vectors in the underlying column are dense and do not compress well. As a result, compression slows down inserts into and reads from the vector column. We therefore recommend disabling compression. To do that, specify CODEC(NONE) for the vector column like this:
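A sketch, with table and index names as placeholders:

```sql
CREATE TABLE tab
(
    id Int32,
    vectors Array(Float32) CODEC(NONE),
    INDEX idx vectors TYPE vector_similarity('hnsw', 'L2Distance', 256)
)
ENGINE = MergeTree
ORDER BY id;
```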

Tuning Index Creation

The life cycle of vector similarity indexes is tied to the life cycle of parts. In other words, whenever a new part with a defined vector similarity index is created, the index is created as well. This typically happens when data is inserted or during merges. Unfortunately, HNSW is known for long index creation times, which can significantly slow down inserts and merges. Vector similarity indexes are ideally used only if the data is immutable or rarely changed.

To speed up index creation, the following techniques can be used:

First, index creation can be parallelized. The maximum number of index creation threads can be configured using server setting max_build_vector_similarity_index_thread_pool_size. For optimal performance, the setting value should be configured to the number of CPU cores.

Second, to speed up INSERT statements, users may disable the creation of skipping indexes on newly inserted parts using session setting materialize_skip_indexes_on_insert. SELECT queries on such parts fall back to exact search. Since inserted parts tend to be small compared to the total table size, the performance impact is expected to be negligible.

Third, to speed up merges, users may disable the creation of skipping indexes on merged parts using session setting materialize_skip_indexes_on_merge. This, in conjunction with statement ALTER TABLE [...] MATERIALIZE INDEX [...], provides explicit control over the life cycle of vector similarity indexes. For example, index creation can be deferred until all data has been ingested or until a period of low system load, such as the weekend.
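Taken together, a deferred-build workflow might look like this (table and index names are placeholders):

```sql
SET materialize_skip_indexes_on_insert = 0;
SET materialize_skip_indexes_on_merge = 0;

-- ... bulk ingestion of vector data ...

-- build the vector similarity index explicitly during a low-load period:
ALTER TABLE tab MATERIALIZE INDEX idx;
```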

Tuning Index Usage

SELECT queries need to load vector similarity indexes into main memory to use them. To avoid repeatedly loading the same vector similarity index into main memory, ClickHouse provides a dedicated in-memory cache for such indexes. The bigger this cache is, the fewer unnecessary loads will happen. The maximum cache size can be configured using server setting vector_similarity_index_cache_size. By default, the cache can grow up to 5 GB in size.

The current size of the vector similarity index cache is shown in system.metrics:
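For example (the exact metric name is an assumption and may differ between versions):

```sql
SELECT metric, value
FROM system.metrics
WHERE metric = 'VectorSimilarityIndexCacheBytes';
```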

The cache hits and misses for a query with some query id can be obtained from system.query_log:
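For example (the profile event names are assumptions and may differ between versions):

```sql
SELECT ProfileEvents['VectorSimilarityIndexCacheHits']   AS hits,
       ProfileEvents['VectorSimilarityIndexCacheMisses'] AS misses
FROM system.query_log
WHERE type = 'QueryFinish' AND query_id = '<query_id>';
```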

For production use cases, we recommend sizing the cache large enough that all vector indexes remain in memory at all times.

Administration and Monitoring

The on-disk size of vector similarity indexes can be obtained from system.data_skipping_indices:
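A query along these lines (the column selection is a sketch):

```sql
SELECT database, table, name, formatReadableSize(data_compressed_bytes) AS size
FROM system.data_skipping_indices
WHERE type = 'vector_similarity';
```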

Example output:

Differences to Regular Skipping Indexes

Like all regular skipping indexes, vector similarity indexes are constructed over granules, and each indexed block consists of GRANULARITY = [N]-many granules ([N] = 1 by default for normal skipping indexes). For example, if the primary index granularity of the table is 8192 (setting index_granularity = 8192) and GRANULARITY = 2, then each indexed block will contain 16384 rows. However, data structures and algorithms for approximate neighbor search are inherently row-oriented. They store a compact representation of a set of rows and also return rows for vector search queries. This causes some rather unintuitive differences in the way vector similarity indexes behave compared to normal skipping indexes.

When a user defines a vector similarity index on a column, ClickHouse internally creates a vector similarity "sub-index" for each index block. The sub-index is "local" in the sense that it only knows about the rows of its containing index block. In the previous example and assuming that a column has 65536 rows, we obtain four index blocks (spanning eight granules) and a vector similarity sub-index for each index block. A sub-index is theoretically able to return the rows with the N closest points within its index block directly. However, since ClickHouse loads data from disk to memory at the granularity of granules, sub-indexes extrapolate matching rows to granule granularity. This is different from regular skipping indexes which skip data at the granularity of index blocks.

The GRANULARITY parameter determines how many vector similarity sub-indexes are created. Bigger GRANULARITY values mean fewer but larger vector similarity sub-indexes, up to the point where a column (or a column's data part) has only a single sub-index. In that case, the sub-index has a "global" view of all column rows and can directly return all granules of the column (part) with relevant rows (there are at most LIMIT [N]-many such granules). In a second step, ClickHouse will load these granules and identify the actually best rows by performing a brute-force distance calculation over all rows of the granules. With a small GRANULARITY value, each of the sub-indexes returns up to LIMIT [N]-many granules. As a result, more granules need to be loaded and post-filtered.

Note that the search accuracy is equally good in both cases; only the processing performance differs. It is generally recommended to use a large GRANULARITY for vector similarity indexes and to fall back to a smaller GRANULARITY value only in case of problems like excessive memory consumption of the vector similarity structures. If no GRANULARITY was specified for vector similarity indexes, the default value is 100 million.

