Choosing the best model for semantic search


Semantic search is transforming search technology by providing more accurate and relevant results. However, with many embedding models available, choosing the right one can be challenging. This guide will help you understand the key factors to consider when selecting a model to build semantic search.

Overview

In this guide, we will use the open-source search engine Meilisearch to perform the semantic searches. For the purpose of these tests, we're using the entry tier of Meilisearch Cloud (i.e., the Build plan).

This guide will cover the following models:

| Model/Service | Dimensions | Context Length |
|---|---|---|
| Cohere embed-english-v3.0 | 1024 | 512 |
| Cohere embed-english-light-v3.0 | 384 | 512 |
| Cohere embed-multilingual-v3.0 | 1024 | 512 |
| Cohere embed-multilingual-light-v3.0 | 384 | 512 |
| OpenAI text-embedding-3-small | 1536 | 8192 |
| OpenAI text-embedding-3-large | 3072 | 8192 |
| Mistral | 1024 | 8192 |
| VoyageAI voyage-2 | 1024 | 4000 |
| VoyageAI voyage-large-2 | 1536 | 16000 |
| VoyageAI voyage-multilingual-2 | 1024 | 32000 |
| Jina Colbert v2 | 128, 96, or 64 | 8192 |
| OSS all-MiniLM-L6-v2 | 384 | 512 |
| OSS bge-small-en-v1.5 | 384 | 512 |
| OSS bge-large-en-v1.5 | 1024 | 512 |

Factors to consider

1. Results relevancy

Relevancy is crucial for effective search, as it ensures that users find the most pertinent results quickly. In the realm of semantic search, achieving a balance between relevancy and speed is essential to provide a seamless user experience. It's important to consider the tradeoffs of vector search vs full-text search.

When selecting a model, consider your specific use case, such as the need for multilingual support, handling multi-modal data, or addressing domain-specific requirements. If you have a highly specialized use case or need to support a particular language, it may be beneficial to explore models that can be trained on your data or opt for multilingual models.

The performance difference between a very small model and a large model is not always substantial. Smaller models are generally less expensive and faster, making them a practical choice in many scenarios. Therefore, it is often worth considering smaller models for their cost-effectiveness and speed.

Additionally, you should always consider the context you're providing to the model. In Meilisearch, this comes in the form of a document template. The more accurately the template describes the data, the better the search results will be, leading to a more satisfying user experience.
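In Meilisearch, the document template is part of the embedder settings and uses Liquid syntax. The sketch below builds such a settings payload for a hypothetical movies index (the index name, field names, and embedder name are assumptions for illustration, not values from the benchmarks in this guide):

```python
import json

# Embedder settings for a hypothetical "movies" index.
# The documentTemplate is a Liquid template: the more faithfully the
# rendered text describes each document, the better the embeddings.
embedder_settings = {
    "default": {
        "source": "openAi",
        "model": "text-embedding-3-small",
        "documentTemplate": (
            "A movie titled '{{doc.title}}', "
            "described as: {{doc.overview}}"
        ),
    }
}

# This payload would be sent to Meilisearch with something like:
#   PATCH /indexes/movies/settings/embedders
# using your instance URL and API key.
print(json.dumps(embedder_settings, indent=2))
```

Keeping the template short and descriptive also reduces the number of tokens sent to the embedding provider, which matters for both latency and cost.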

2. Search performance

Time is money, and the web is no different. Nowadays, search-as-you-type is the baseline for customer-facing applications. Saving users time greatly enhances their satisfaction and keeps them engaged with your platform.

To achieve lightning-fast search performance, consider using a local model to minimize latency by eliminating the need for round trips to the embedding service. If you need to use a remote model, then hosting your search service (e.g., your Meilisearch database) in close proximity to the embedding service can significantly reduce latency.
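Before committing to a provider, it's worth timing the embedding call from where your search service actually runs. Here is a minimal timing harness (the `fake_embed` function is a stand-in; swap in your real client call):

```python
import time
from statistics import median

def time_embedder(embed, queries, warmup=1):
    """Return the median latency (in ms) of `embed` over the given queries."""
    for q in queries[:warmup]:
        embed(q)  # warm up connections and caches
    timings = []
    for q in queries:
        start = time.perf_counter()
        embed(q)  # e.g. a call to your embedding API or local model
        timings.append((time.perf_counter() - start) * 1000)
    return median(timings)

# Stand-in embedder for illustration; replace with a real API call.
def fake_embed(text):
    return [0.0] * 384

latency_ms = time_embedder(fake_embed, ["red dress", "running shoes"] * 5)
print(f"median latency: {latency_ms:.2f}ms")
```

Running this from your production region, rather than your laptop, gives numbers comparable to the benchmark table below.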

The table below showcases latency benchmarks for various local embedding models and embedding APIs. All requests originate from a Meilisearch instance hosted on AWS (London datacenter).

| Model/Service | Latency |
|---|---|
| Cloudflare bge-small-en-v1.5 | ±800ms |
| Cloudflare bge-large-en-v1.5 | ±500ms |
| Cohere embed-english-v3.0 | ±170ms |
| Cohere embed-english-light-v3.0 | ±160ms |
| Local gte-small | ±20ms |
| Local all-MiniLM-L6-v2 | ±10ms |
| Local bge-small-en-v1.5 | ±20ms |
| Local bge-large-en-v1.5 | ±60ms |
| Mistral | ±200ms |
| Jina Colbert v2 | ±400ms |
| OpenAI text-embedding-3-small | ±460ms |
| OpenAI text-embedding-3-large | ±750ms |
| VoyageAI voyage-2 | ±350ms |
| VoyageAI voyage-large-2 | ±400ms |

Here you can see some clear winners in terms of latency. Unfortunately, latency is not the same as throughput, so we also need to take a close look at indexing time.

3. Indexing performance

Indexing performance is another critical aspect when comparing search solutions. The embedding model's performance directly impacts indexing speed, and the speed at which your data can be indexed determines the overall efficiency and scalability of your search solution.

Local models without GPUs may have slower indexing due to limited processing power. In contrast, third-party services offer varying speeds and limitations based on their infrastructure and service agreements. It is essential to evaluate these factors to ensure that your chosen model and service can meet your requirements effectively.

Several factors come into play when optimizing indexing. Again, latency plays a big role: reducing the time it takes for data to travel between your application and the model will always improve your experience. Additionally, the maximum payload size the API accepts, the provider's rate limiting, and the model's number of dimensions can all influence the efficiency and scalability of the indexing process.
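In practice, this means batching documents so that each embedding request stays within the provider's payload and rate limits. A simple sketch (the limits shown are placeholders; check your provider's documentation for real values):

```python
def batch_documents(docs, max_batch_size=96, max_chars=100_000):
    """Group docs into batches respecting a count limit and a rough size limit."""
    batches, current, current_chars = [], [], 0
    for doc in docs:
        size = len(doc)
        # Start a new batch when adding this doc would exceed either limit.
        if current and (len(current) >= max_batch_size
                        or current_chars + size > max_chars):
            batches.append(current)
            current, current_chars = [], 0
        current.append(doc)
        current_chars += size
    if current:
        batches.append(current)
    return batches

docs = [f"product description {i}" for i in range(250)]
batches = batch_documents(docs, max_batch_size=96)
print(len(batches))  # 250 docs in batches of at most 96 -> 3 batches
```

Larger batches mean fewer round trips (and thus less cumulative latency), but each batch must stay under the API's request-size ceiling.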

The benchmark below compares the indexing of 10,000 e-commerce documents (with automatic embedding generation):

| Model/Service | Indexation Time |
|---|---|
| Cohere embed-english-v3.0 | 43s |
| Cohere embed-english-light-v3.0 | 16s |
| OpenAI text-embedding-3-small | 95s |
| OpenAI text-embedding-3-large | 151s |
| Cloudflare bge-small-en-v1.5 | 152s |
| Cloudflare bge-large-en-v1.5 | 159s |
| Jina Colbert v2 | 375s |
| VoyageAI voyage-large-2 | 409s |
| Mistral | 409s |
| Local all-MiniLM-L6-v2 | 880s |
| Local bge-small-en-v1.5 | 3379s |
| Local bge-large-en-v1.5 | 9132s |

4. Pricing

While local embedders are free, most services charge per million tokens. Here's a breakdown of the pricing for each platform:

  • Cohere:
    • $0.10 per million tokens
  • OpenAI:
    • $0.13 per million tokens for text-embedding-3-large
    • $0.02 per million tokens for text-embedding-3-small
  • Cloudflare:
    • $0.011/1,000 Neurons
  • Jina:
    • $0.18 per million tokens
  • Mistral:
    • $0.10 per million tokens
  • VoyageAI:
    • $0.10 per million tokens for voyage-2
    • $0.12 per million tokens for voyage-large-2
    • $0.12 per million tokens for voyage-multilingual-2
  • Local model: Free
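To compare providers for your workload, it helps to translate per-token prices into a monthly estimate. A rough sketch using the prices listed above (the token volume is an assumption you'd replace with your own estimate):

```python
# Prices in USD per million tokens, taken from the list above.
PRICE_PER_MILLION_TOKENS = {
    "cohere embed-english-v3.0": 0.10,
    "openai text-embedding-3-small": 0.02,
    "openai text-embedding-3-large": 0.13,
    "mistral": 0.10,
    "voyageai voyage-2": 0.10,
    "voyageai voyage-large-2": 0.12,
    "jina colbert v2": 0.18,
}

def monthly_cost(model, tokens_per_month):
    """Estimated monthly embedding cost in USD."""
    return tokens_per_month / 1_000_000 * PRICE_PER_MILLION_TOKENS[model]

# Example: embedding 50M tokens per month, cheapest first.
for model in sorted(PRICE_PER_MILLION_TOKENS,
                    key=PRICE_PER_MILLION_TOKENS.get):
    print(f"{model:32s} ${monthly_cost(model, 50_000_000):.2f}")
```

Remember that query-time embedding also consumes tokens, so include search traffic, not just indexing, in your estimate.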

As your search needs grow and scale, it may become more cost-effective to invest in your own GPU machine. By having your own hardware, you can have greater control over the performance and scalability of your search solution and potentially reduce costs in the long run.

It is often best to start with a well-known model from the list provided. They are generally easy to set up, and you will easily find community resources to help you. As your needs evolve, you can consider migrating the model to a cloud provider like AWS. Many services offer this option, allowing you to leverage their infrastructure for improved performance and scalability.

Alternatively, you can choose an equivalent open-source model to self-host, giving you even more flexibility and control over your search solution in the long term. Please note that optimizing local models for performance or high volume may require scaling your infrastructure accordingly.




Going further

While this article provides a comprehensive overview, we did not delve deeply into optimization techniques. There are several additional optimizations that can be explored to further enhance the performance of semantic search.

Here is a list of additional areas to investigate when choosing a model for your search experience:

  • Experiment with different presets (query vs. document) for models that offer this option to potentially improve relevancy
  • Evaluate specialized models for specific applications to assess their performance and suitability for your use case
  • Explore models that provide a reranking function to further refine search results
  • Test higher-tier accounts on each platform to check for improved performance and reduced rate limiting
  • Investigate parameters for receiving quantized data directly from the API to optimize data transfer and processing
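On the last point: quantization shrinks embeddings by storing each dimension in fewer bits. Providers and Meilisearch use their own schemes, so this is only a minimal int8 sketch to illustrate the size/precision trade-off:

```python
def quantize_int8(vector):
    """Scale a float vector into int8 range; returns (ints, scale)."""
    scale = max(abs(v) for v in vector) / 127 or 1.0
    return [round(v / scale) for v in vector], scale

def dequantize(ints, scale):
    """Recover approximate float values from the quantized ints."""
    return [i * scale for i in ints]

vec = [0.12, -0.5, 0.33, 0.01]
q, scale = quantize_int8(vec)
approx = dequantize(q, scale)
# Each value is recovered to within one quantization step (scale),
# while storage drops from 4 bytes to 1 byte per dimension.
print(q, [round(a, 3) for a in approx])
```

With thousands of dimensions per document, this 4x reduction in vector size can meaningfully cut transfer time and index size.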

Conclusion

| Model/Service | Dimensions | Context Length | Latency | Indexation Time | Pricing (per million tokens) |
|---|---|---|---|---|---|
| Cohere embed-english-v3.0 | 1024 | 512 | ±170ms | 43s | $0.10 |
| Cohere embed-english-light-v3.0 | 384 | 512 | ±160ms | 16s | $0.10 |
| OpenAI text-embedding-3-small | 1536 | 8192 | ±460ms | 95s | $0.02 |
| OpenAI text-embedding-3-large | 3072 | 8192 | ±750ms | 151s | $0.13 |
| Mistral | 1024 | 8192 | ±200ms | 409s | $0.10 |
| VoyageAI voyage-2 | 1024 | 4000 | ±350ms | 330s | $0.10 |
| VoyageAI voyage-large-2 | 1536 | 16000 | ±400ms | 409s | $0.12 |
| Jina Colbert v2 | 128, 96, or 64 | 8192 | ±400ms | 375s | $0.18 |
| OSS all-MiniLM-L6-v2 | 384 | 512 | ±10ms | 880s | Free |
| OSS bge-small-en-v1.5 | 384 | 512 | ±20ms | 3379s | Free |
| OSS bge-large-en-v1.5 | 1024 | 512 | ±60ms | 9132s | Free |

Choosing the right model and service for semantic search involves carefully balancing several key factors: relevancy, search performance, indexation performance, and cost.

Each option presents its own set of trade-offs:

  • Cloud-based services like Cohere and OpenAI offer excellent relevancy and reasonable latency, with Cohere's embed-english-light-v3.0 standing out for its balance of speed and performance.
  • Local models provide the fastest search latency but may struggle with indexation speed on limited hardware.
  • Emerging services like Mistral and VoyageAI show promise with competitive pricing and performance.
  • Open-source models offer cost-effective solutions for those willing to manage their own infrastructure.

Ultimately, the best choice depends on your specific use case, budget, and performance requirements. For many applications, starting with a cloud-based service like Cohere or OpenAI provides a good balance of ease of use, performance, and cost. As your needs grow, consider exploring local or specialized models, or contact Meilisearch's sales team for tailored solutions.


Meilisearch is an open-source search engine enabling developers to build state-of-the-art experiences while enjoying simple, intuitive DX.

For more things Meilisearch, you can join the community on Discord or subscribe to the newsletter. You can learn more about the product by checking out its roadmap and participating in product discussions.