Choosing the best model for semantic search


Semantic search is transforming search technology by providing more accurate and relevant results. However, with many embedding models available, choosing the right one can be challenging. This guide will help you understand the key factors to consider when selecting a model to build semantic search.

Overview

In this guide, we will use the open-source search engine Meilisearch to perform the semantic searches. For the purpose of these tests, we're using the entry tier of Meilisearch Cloud (i.e., the Build plan).

This guide will cover the following models:

| Model/Service | Dimensions | Context Length |
|---|---|---|
| Cohere embed-english-v3.0 | 1024 | 512 |
| Cohere embed-english-light-v3.0 | 384 | 512 |
| Cohere embed-multilingual-v3.0 | 1024 | 512 |
| Cohere embed-multilingual-light-v3.0 | 384 | 512 |
| OpenAI text-embedding-3-small | 1536 | 8192 |
| OpenAI text-embedding-3-large | 3072 | 8192 |
| Mistral | 1024 | 8192 |
| VoyageAI voyage-2 | 1024 | 4000 |
| VoyageAI voyage-large-2 | 1536 | 16000 |
| VoyageAI voyage-multilingual-2 | 1024 | 32000 |
| Jina Colbert v2 | 128, 96, or 64 | 8192 |
| OSS all-MiniLM-L6-v2 | 384 | 512 |
| OSS bge-small-en-v1.5 | 384 | 512 |
| OSS bge-large-en-v1.5 | 1024 | 512 |

Factors to consider

1. Results relevancy

Relevancy is crucial for effective search, as it ensures that users find the most pertinent results quickly. In the realm of semantic search, achieving a balance between relevancy and speed is essential to provide a seamless user experience. It's important to consider the tradeoffs of vector search vs full-text search.

When selecting a model, consider your specific use case, such as the need for multilingual support, handling multi-modal data, or addressing domain-specific requirements. If you have a highly specialized use case or need to support a particular language, it may be beneficial to explore models that can be trained on your data or opt for multilingual models.

The performance difference between a very small model and a large model is not always substantial. Smaller models are generally less expensive and faster, making them a practical choice in many scenarios. Therefore, it is often worth considering smaller models for their cost-effectiveness and speed.

Additionally, you should always consider the context you're providing to the model. In Meilisearch, this comes in the form of a document template. The more accurately the template describes the data, the better the search results will be, leading to a more satisfying user experience.
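In Meilisearch, the document template is part of the embedder settings and uses Liquid syntax. The sketch below builds such a settings payload for a hypothetical movies index (the index name, field names, and embedder name are assumptions for illustration, not values from the benchmarks in this guide):

```python
import json

# Embedder settings for a hypothetical "movies" index.
# The documentTemplate is a Liquid template: the more faithfully the
# rendered text describes each document, the better the embeddings.
embedder_settings = {
    "default": {
        "source": "openAi",
        "model": "text-embedding-3-small",
        "documentTemplate": (
            "A movie titled '{{doc.title}}', "
            "described as: {{doc.overview}}"
        ),
    }
}

# This payload would be sent to Meilisearch with something like:
#   PATCH /indexes/movies/settings/embedders
# using your instance URL and API key.
print(json.dumps(embedder_settings, indent=2))
```

Keeping the template short and descriptive also reduces the number of tokens sent to the embedding provider, which matters for both latency and cost.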

2. Search performance

Time is money, and the web is no different. Nowadays, search-as-you-type is the baseline for customer-facing applications. Saving users time greatly enhances their satisfaction and keeps them engaged with your platform.

To achieve lightning-fast search performance, consider using a local model to minimize latency by eliminating the need for round trips to the embedding service. If you need to use a remote model, then hosting your search service (e.g., your Meilisearch database) in close proximity to the embedding service can significantly reduce latency.
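Before committing to a provider, it's worth timing the embedding call from where your search service actually runs. Here is a minimal timing harness (the `fake_embed` function is a stand-in; swap in your real client call):

```python
import time
from statistics import median

def time_embedder(embed, queries, warmup=1):
    """Return the median latency (in ms) of `embed` over the given queries."""
    for q in queries[:warmup]:
        embed(q)  # warm up connections and caches
    timings = []
    for q in queries:
        start = time.perf_counter()
        embed(q)  # e.g. a call to your embedding API or local model
        timings.append((time.perf_counter() - start) * 1000)
    return median(timings)

# Stand-in embedder for illustration; replace with a real API call.
def fake_embed(text):
    return [0.0] * 384

latency_ms = time_embedder(fake_embed, ["red dress", "running shoes"] * 5)
print(f"median latency: {latency_ms:.2f}ms")
```

Running this from your production region, rather than your laptop, gives numbers comparable to the benchmark table below.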

The table below showcases latency benchmarks for various local embedding models and embedding APIs. All requests originate from a Meilisearch instance hosted on AWS (London datacenter).

| Model/Service | Latency |
|---|---|
| Cloudflare bge-small-en-v1.5 | ±800ms |
| Cloudflare bge-large-en-v1.5 | ±500ms |
| Cohere embed-english-v3.0 | ±170ms |
| Cohere embed-english-light-v3.0 | ±160ms |
| Local gte-small | ±20ms |
| Local all-MiniLM-L6-v2 | ±10ms |
| Local bge-small-en-v1.5 | ±20ms |
| Local bge-large-en-v1.5 | ±60ms |
| Mistral | ±200ms |
| Jina Colbert v2 | ±400ms |
| OpenAI text-embedding-3-small | ±460ms |
| OpenAI text-embedding-3-large | ±750ms |
| VoyageAI voyage-2 | ±350ms |
| VoyageAI voyage-large-2 | ±400ms |

Here you can see some clear winners in terms of latency. Unfortunately, latency is not the same as throughput, so we also need to take a close look at indexing time.

3. Indexing performance

Indexing performance is another critical aspect when comparing search solutions. The embedding model's performance directly impacts indexing speed, and the speed at which your data can be indexed determines the overall efficiency and scalability of your search solution.

Local models without GPUs may have slower indexing due to limited processing power. In contrast, third-party services offer varying speeds and limitations based on their infrastructure and service agreements. It is essential to evaluate these factors to ensure that your chosen model and service can meet your requirements effectively.

Several factors come into play when optimizing indexing. Again, latency plays a big role: reducing the time it takes for data to travel between your application and the model will always improve your experience. Additionally, the maximum payload size the API accepts, the provider's rate limiting, and the model's number of dimensions can all influence the efficiency and scalability of the indexing process.
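In practice, this means batching documents so that each embedding request stays within the provider's payload and rate limits. A simple sketch (the limits shown are placeholders; check your provider's documentation for real values):

```python
def batch_documents(docs, max_batch_size=96, max_chars=100_000):
    """Group docs into batches respecting a count limit and a rough size limit."""
    batches, current, current_chars = [], [], 0
    for doc in docs:
        size = len(doc)
        # Start a new batch when adding this doc would exceed either limit.
        if current and (len(current) >= max_batch_size
                        or current_chars + size > max_chars):
            batches.append(current)
            current, current_chars = [], 0
        current.append(doc)
        current_chars += size
    if current:
        batches.append(current)
    return batches

docs = [f"product description {i}" for i in range(250)]
batches = batch_documents(docs, max_batch_size=96)
print(len(batches))  # 250 docs in batches of at most 96 -> 3 batches
```

Larger batches mean fewer round trips (and thus less cumulative latency), but each batch must stay under the API's request-size ceiling.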

The benchmark below compares the indexing of 10,000 e-commerce documents (with automatic embedding generation):

| Model/Service | Indexation Time |
|---|---|
| Cohere embed-english-v3.0 | 43s |
| Cohere embed-english-light-v3.0 | 16s |
| OpenAI text-embedding-3-small | 95s |
| OpenAI text-embedding-3-large | 151s |
| Cloudflare bge-small-en-v1.5 | 152s |
| Cloudflare bge-large-en-v1.5 | 159s |
| Jina Colbert v2 | 375s |
| VoyageAI voyage-large-2 | 409s |
| Mistral | 409s |
| Local all-MiniLM-L6-v2 | 880s |
| Local bge-small-en-v1.5 | 3379s |
| Local bge-large-en-v1.5 | 9132s |

4. Pricing

While local embedders are free, most services charge per million tokens. Here's a breakdown of the pricing for each platform:

  • Cohere:
    • $0.10 per million tokens
  • OpenAI:
    • $0.13 per million tokens for text-embedding-3-large
    • $0.02 per million tokens for text-embedding-3-small
  • Cloudflare:
    • $0.011/1,000 Neurons
  • Jina:
    • $0.18 per million tokens
  • Mistral:
    • $0.10 per million tokens
  • VoyageAI:
    • $0.10 per million tokens for voyage-2
    • $0.12 per million tokens for voyage-large-2
    • $0.12 per million tokens for voyage-multilingual-2
  • Local model: Free
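To compare providers for your workload, it helps to translate per-token prices into a monthly estimate. A rough sketch using the prices listed above (the token volume is an assumption you'd replace with your own estimate):

```python
# Prices in USD per million tokens, taken from the list above.
PRICE_PER_MILLION_TOKENS = {
    "cohere embed-english-v3.0": 0.10,
    "openai text-embedding-3-small": 0.02,
    "openai text-embedding-3-large": 0.13,
    "mistral": 0.10,
    "voyageai voyage-2": 0.10,
    "voyageai voyage-large-2": 0.12,
    "jina colbert v2": 0.18,
}

def monthly_cost(model, tokens_per_month):
    """Estimated monthly embedding cost in USD."""
    return tokens_per_month / 1_000_000 * PRICE_PER_MILLION_TOKENS[model]

# Example: embedding 50M tokens per month, cheapest first.
for model in sorted(PRICE_PER_MILLION_TOKENS,
                    key=PRICE_PER_MILLION_TOKENS.get):
    print(f"{model:32s} ${monthly_cost(model, 50_000_000):.2f}")
```

Remember that query-time embedding also consumes tokens, so include search traffic, not just indexing, in your estimate.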

As your search needs grow and scale, it may become more cost-effective to invest in your own GPU machine. By having your own hardware, you can have greater control over the performance and scalability of your search solution and potentially reduce costs in the long run.

It is often best to start with a well-known model from the list provided. They are generally easy to set up, and you will easily find community resources to help you. As your needs evolve, you can consider migrating the model to a cloud provider like AWS. Many services offer this option, allowing you to leverage their infrastructure for improved performance and scalability.

Alternatively, you can choose an equivalent open-source model to self-host, giving you even more flexibility and control over your search solution in the long term. Please note that optimizing local models for performance or high volume may require scaling your infrastructure accordingly.




Going further

While this article provides a comprehensive overview, we did not delve deeply into optimization techniques. There are several additional optimizations that can be explored to further enhance the performance of semantic search.

Here is a list of additional areas to investigate when choosing a model for your search experience:

  • Experiment with different presets (query vs. document) for models that offer this option to potentially improve relevancy
  • Evaluate specialized models for specific applications to assess their performance and suitability for your use case
  • Explore models that provide a reranking function to further refine search results
  • Test higher-tier accounts on each platform to check for improved performance and reduced rate limiting
  • Investigate parameters for receiving quantized data directly from the API to optimize data transfer and processing
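On the last point: quantization shrinks embeddings by storing each dimension in fewer bits. Providers and Meilisearch use their own schemes, so this is only a minimal int8 sketch to illustrate the size/precision trade-off:

```python
def quantize_int8(vector):
    """Scale a float vector into int8 range; returns (ints, scale)."""
    scale = max(abs(v) for v in vector) / 127 or 1.0
    return [round(v / scale) for v in vector], scale

def dequantize(ints, scale):
    """Recover approximate float values from the quantized ints."""
    return [i * scale for i in ints]

vec = [0.12, -0.5, 0.33, 0.01]
q, scale = quantize_int8(vec)
approx = dequantize(q, scale)
# Each value is recovered to within one quantization step (scale),
# while storage drops from 4 bytes to 1 byte per dimension.
print(q, [round(a, 3) for a in approx])
```

With thousands of dimensions per document, this 4x reduction in vector size can meaningfully cut transfer time and index size.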

Conclusion

| Model/Service | Dimensions | Context Length | Latency | Indexation Time | Pricing (per million tokens) |
|---|---|---|---|---|---|
| Cohere embed-english-v3.0 | 1024 | 512 | ±170ms | 43s | $0.10 |
| Cohere embed-english-light-v3.0 | 384 | 512 | ±160ms | 16s | $0.10 |
| OpenAI text-embedding-3-small | 1536 | 8192 | ±460ms | 95s | $0.02 |
| OpenAI text-embedding-3-large | 3072 | 8192 | ±750ms | 151s | $0.13 |
| Mistral | 1024 | 8192 | ±200ms | 409s | $0.10 |
| VoyageAI voyage-2 | 1024 | 4000 | ±350ms | 330s | $0.10 |
| VoyageAI voyage-large-2 | 1536 | 16000 | ±400ms | 409s | $0.12 |
| Jina Colbert v2 | 128, 96, or 64 | 8192 | ±400ms | 375s | $0.18 |
| OSS all-MiniLM-L6-v2 | 384 | 512 | ±10ms | 880s | Free |
| OSS bge-small-en-v1.5 | 384 | 512 | ±20ms | 3379s | Free |
| OSS bge-large-en-v1.5 | 1024 | 512 | ±60ms | 9132s | Free |

Choosing the right model and service for semantic search involves carefully balancing several key factors: relevancy, search performance, indexation performance, and cost.

Each option presents its own set of trade-offs:

  • Cloud-based services like Cohere and OpenAI offer excellent relevancy and reasonable latency, with Cohere's embed-english-light-v3.0 standing out for its balance of speed and performance.
  • Local models provide the fastest search latency but may struggle with indexation speed on limited hardware.
  • Emerging services like Mistral and VoyageAI show promise with competitive pricing and performance.
  • Open-source models offer cost-effective solutions for those willing to manage their own infrastructure.

Ultimately, the best choice depends on your specific use case, budget, and performance requirements. For many applications, starting with a cloud-based service like Cohere or OpenAI provides a good balance of ease of use, performance, and cost. As your needs grow, consider exploring local or specialized models, or contact Meilisearch's sales team for tailored solutions.


Meilisearch is an open-source search engine enabling developers to build state-of-the-art experiences while enjoying simple, intuitive DX.

For more things Meilisearch, you can join the community on Discord or subscribe to the newsletter. You can learn more about the product by checking out its roadmap and participating in product discussions.