In our continued effort to equip developers and organizations with advanced search tools, we are thrilled to announce several new features in the latest Preview API for Azure AI Search. These enhancements are designed to reduce vector index size and to give you more granular control over, and insight into, your search index when building Retrieval-Augmented Generation (RAG) applications.

MRL Support for Quantization

Matryoshka Representation Learning (MRL) is a new technique that introduces a different form of vector compression. It works independently of existing quantization methods and can be combined with them. MRL lets you truncate embeddings without significant semantic loss, offering a balance between vector size and information retention.

This technique works by training embedding models so that information density increases towards the beginning of the vector. As a result, even when using only a prefix of the original vector, much of the key information is preserved, allowing for shorter vector representations without a substantial drop in performance.
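To make the prefix idea concrete, here is a minimal sketch of truncating an embedding and rescaling it to unit length. The vectors are made-up toy values, not output from a real embedding model, and the helper names `truncate_embedding` and `cosine` are hypothetical.

```python
import math


def truncate_embedding(vector, dims):
    """Keep the first `dims` components and rescale the result to unit length."""
    if not 0 < dims <= len(vector):
        raise ValueError("dims must be between 1 and len(vector)")
    prefix = vector[:dims]
    norm = math.sqrt(sum(x * x for x in prefix))
    if norm == 0.0:
        return list(prefix)
    return [x / norm for x in prefix]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy 8-dimensional vectors with most of their magnitude near the front,
# imitating (by assumption) how an MRL-trained model front-loads information.
doc = [0.9, 0.3, 0.2, 0.1, 0.05, 0.02, 0.01, 0.01]
query = [0.8, 0.4, 0.1, 0.1, 0.02, 0.03, 0.01, 0.02]

full_score = cosine(query, doc)
short_score = cosine(truncate_embedding(query, 4), truncate_embedding(doc, 4))
print(round(full_score, 4), round(short_score, 4))
```

With vectors shaped like this, the similarity computed from half of the dimensions stays close to the full-length similarity, which is the property MRL training aims for.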

OpenAI has integrated MRL into their ‘text-embedding-3-small’ and ‘text-embedding-3-large’ models, making them adaptable for use in scenarios where compressed embeddings are needed while maintaining high retrieval accuracy. You can read more about the underlying research in the official paper [1] or learn about the latest OpenAI embedding models in their blog.

Storage Compression Comparison

Table 1.1 below highlights the different configurations for vector compression, comparing standard uncompressed vectors, Scalar Quantization (SQ), and Binary Quantization (BQ) with and without MRL. The compression ratio shows how much the vector index size can be reduced, which can yield significant cost savings. You can find more about our vector index size limits in “Service limits for tiers and SKUs” in the Azure AI Search documentation on Microsoft Learn.

Table 1.1: Vector Index Size Compression Comparison

| Configuration | *Compression Ratio |
| --- | --- |
| Uncompressed | – |
| SQ | 4x |
| BQ | 28x |
| **MRL + SQ (1/2 and 1/3 truncation dimension respectively) | 8x-12x |
| **MRL + BQ (1/2 and 1/3 truncation dimension respectively) | 64x – 96x |

Note: Compression ratios depend on embedding dimensions and truncation. For instance, using “text-embedding-3-large” with 3072 dimensions truncated to 1024 dimensions can result in 96x compression with Binary Quantization.

*All compression methods listed above may experience slightly lower compression ratios due to overhead introduced by the index data structures. See “Memory overhead from selected algorithm” for more details.

**The compression impact when using MRL depends on the value of the truncation dimension. We recommend using either 1/2 or 1/3 of the original dimensions to preserve embedding quality (see below).
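The ratios above can be reproduced with simple arithmetic. The sketch below assumes that uncompressed vectors use 32 bits per dimension, that SQ stores 8 bits per dimension, and that BQ stores 1 bit per dimension. The function name `compression_ratio` is hypothetical, and the figures are theoretical ratios before any index overhead.

```python
def compression_ratio(original_dims, stored_dims, stored_bits, original_bits=32):
    """Ratio of the uncompressed vector size to the stored vector size."""
    return (original_dims * original_bits) / (stored_dims * stored_bits)


# text-embedding-3-large produces 3072-dimensional vectors.
print(compression_ratio(3072, 3072, 8))  # SQ only            -> 4.0
print(compression_ratio(3072, 1536, 8))  # MRL 1/2 + SQ       -> 8.0
print(compression_ratio(3072, 1024, 8))  # MRL 1/3 + SQ       -> 12.0
print(compression_ratio(3072, 3072, 1))  # BQ only (no overhead) -> 32.0
print(compression_ratio(3072, 1536, 1))  # MRL 1/2 + BQ       -> 64.0
print(compression_ratio(3072, 1024, 1))  # MRL 1/3 + BQ       -> 96.0
```

Truncation and quantization multiply: cutting the dimensions to 1/3 and storing 1 bit instead of 32 gives 3 x 32 = 96x.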

Quality Retainment Table:

Table 1.2 provides a detailed view of the quality retainment when using MRL with quantization across different models and configurations. The results indicate the impact on Mean NDCG@10 across a subset of MTEB datasets, showing that high levels of compression can still preserve up to 99% of search quality, particularly with BQ and MRL.

Table 1.2: Impact of MRL on Mean NDCG@10 Across MTEB Subset

| Model Name | Original Dimension | MRL Dimension | Quantization Algorithm | No Rerank (% Δ) | Rerank 2x Oversampling (% Δ) |
| --- | --- | --- | --- | --- | --- |
| OpenAI text-embedding-3-small | 1536 | 512 |  | -2.00% (Δ = 1.155) | -0.0004% (Δ = 0.0002) |
| OpenAI text-embedding-3-small | 1536 | 512 |  | -15.00% (Δ = 7.5092) | -0.11% (Δ = 0.0554) |
| OpenAI text-embedding-3-small | 1536 | 768 |  | -2.00% (Δ = 0.8128) | -1.60% (Δ = 0.8128) |
| OpenAI text-embedding-3-small | 1536 | 768 |  | -10.00% (Δ = 5.0104) | -0.01% (Δ = 0.0044) |
| OpenAI text-embedding-3-large | 3072 | 1024 |  | -1.00% (Δ = 0.616) | -0.02% (Δ = 0.0118) |
| OpenAI text-embedding-3-large | 3072 | 1024 |  | -7.00% (Δ = 3.9478) | -0.58% (Δ = 0.3184) |
| OpenAI text-embedding-3-large | 3072 | 1536 |  | -1.00% (Δ = 0.3184) | -0.08% (Δ = 0.0426) |
| OpenAI text-embedding-3-large | 3072 | 1536 |  | -5.00% (Δ = 2.8062) | -0.06% (Δ = 0.0356) |

Table 1.2 compares the relative point differences in Mean NDCG@10 between an uncompressed index and indexes that use MRL dimensions of 1/3 and 1/2 of the original dimensions, across OpenAI text-embedding models.
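For readers less familiar with the metric, the following is a small sketch of how NDCG@10 and a relative change between two rankings can be computed. It uses the linear-gain form of DCG (some evaluations use an exponential gain instead), and the relevance judgments are invented for illustration; they are not taken from MTEB or from our experiments.

```python
import math


def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain over the first k results (linear gain)."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(ranked_relevances, all_relevances, k=10):
    """DCG of the returned ranking divided by the DCG of the ideal ranking."""
    ideal = dcg_at_k(sorted(all_relevances, reverse=True), k)
    return dcg_at_k(ranked_relevances, k) / ideal if ideal > 0 else 0.0


# Hypothetical graded relevance labels for one query.
judged = [3, 3, 2, 1, 0]
baseline_ranking = [3, 3, 2, 1, 0]    # uncompressed index returns the ideal order
compressed_ranking = [3, 2, 3, 0, 1]  # compressed index swaps a few results

baseline = ndcg_at_k(baseline_ranking, judged)
compressed = ndcg_at_k(compressed_ranking, judged)
relative_change = (compressed - baseline) / baseline * 100
print(f"{baseline:.4f} {compressed:.4f} {relative_change:.2f}%")
```

Averaging this score over every query in a dataset gives Mean NDCG@10. The percentage columns in the table express the change relative to the uncompressed baseline.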

Key Takeaways:

99% Search Quality with BQ + MRL + Oversampling: Combining Binary Quantization (BQ) with oversampling and Matryoshka Representation Learning (MRL) retained 99% of the original search quality in the dataset and embedding combinations we tested, even with up to 96x compression, making it well suited to reducing storage while maintaining high retrieval performance.
Flexible Embedding Truncation: MRL enables dynamic embedding truncation with minimal accuracy loss, providing a balance between storage efficiency and search quality.
No Latency Impact Observed: Our experiments also indicated that using MRL had no noticeable latency impact, supporting efficient performance even at high compression rates.
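The interplay of binary quantization, oversampling, and reranking can be illustrated with a toy two-stage search. This is a conceptual sketch only: the service's internal implementation is not described in this post, and the sign-bit quantizer, Hamming distance, dot-product rescoring, and the function name `search` are all assumptions chosen for illustration.

```python
def binarize(vector):
    """Binary quantization: keep 1 bit per dimension (1 if positive, else 0)."""
    return [1 if x > 0 else 0 for x in vector]


def hamming(a, b):
    """Number of positions where two bit lists differ."""
    return sum(x != y for x, y in zip(a, b))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def search(query, docs, k, oversampling=2.0, truncate_to=None):
    """Retrieve k * oversampling candidates on compressed vectors, then rerank."""
    dims = truncate_to or len(query)
    q_bits = binarize(query[:dims])
    pool_size = max(k, int(k * oversampling))
    # Stage 1: cheap candidate retrieval using truncated, binarized vectors.
    candidates = sorted(
        docs, key=lambda doc_id: hamming(q_bits, binarize(docs[doc_id][:dims]))
    )[:pool_size]
    # Stage 2: rescore the candidates with the full-precision vectors.
    reranked = sorted(candidates, key=lambda doc_id: dot(query, docs[doc_id]), reverse=True)
    return reranked[:k]


docs = {
    "a": [0.8, 0.2, -0.1, 0.5],
    "b": [0.1, 0.1, -0.1, 0.1],
    "c": [0.9, -0.05, -0.3, 0.4],
    "d": [-0.5, -0.5, 0.5, -0.5],
}
query = [0.9, 0.1, -0.2, 0.4]

print(search(query, docs, k=2, oversampling=1.0))  # -> ['a', 'b']
print(search(query, docs, k=2, oversampling=2.0))  # -> ['c', 'a']
```

Without oversampling, document "c" never reaches the rerank stage because one of its bits differs from the query. With 2x oversampling it enters the candidate pool, and full-precision rescoring moves it to the top.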

For more details on how MRL works and how to implement it, visit the MRL Support Documentation.
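As a rough sketch of what such a configuration might look like, the snippet below builds a compression entry for an index definition as a Python dictionary. The property names (`truncationDimension`, `rerankWithOriginalVectors`, `defaultOversampling`) and the `binaryQuantization` kind are assumptions based on the preview REST API and are not spelled out in this post, and the compression name is hypothetical. Please confirm the exact schema in the MRL Support Documentation before relying on it.

```python
import json

# Assumed shape of one entry in the index's vectorSearch.compressions list.
compression = {
    "name": "mrl-binary-compression",   # hypothetical name
    "kind": "binaryQuantization",       # assumed kind value for BQ
    "truncationDimension": 1024,        # 1/3 of 3072 for text-embedding-3-large
    "rerankWithOriginalVectors": True,  # rescore candidates with full vectors
    "defaultOversampling": 2,           # retrieve 2x candidates before reranking
}

vector_search = {"compressions": [compression]}
print(json.dumps(vector_search, indent=2))
```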

Targeted Vector Filtering

Targeted Vector Filtering allows you to apply filters specifically to the vector component of hybrid search queries. This fine-grained control ensures that your filters enhance the relevance of vector search results without inadvertently affecting keyword-based searches.
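The behavior can be pictured with a toy hybrid query in which the filter restricts only the vector recall set. This is a conceptual sketch with made-up documents and hypothetical helper names; it does not show the actual request syntax or how the service evaluates filters internally.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def keyword_search(query_terms, docs):
    """Toy text ranking: order documents by how many query terms they contain."""
    scored = [
        (sum(term in doc["text"].lower().split() for term in query_terms), doc["id"])
        for doc in docs
    ]
    ranked = sorted(scored, key=lambda pair: -pair[0])
    return [doc_id for score, doc_id in ranked if score > 0]


def vector_search(query_vector, docs, vector_filter=None):
    """Toy vector ranking; the optional filter applies to this recall set only."""
    pool = [doc for doc in docs if vector_filter is None or vector_filter(doc)]
    ranked = sorted(pool, key=lambda doc: -dot(query_vector, doc["vector"]))
    return [doc["id"] for doc in ranked]


docs = [
    {"id": "hotel-1", "text": "Budget hotel near airport", "category": "hotel", "vector": [0.9, 0.1]},
    {"id": "hotel-2", "text": "Luxury hotel downtown", "category": "hotel", "vector": [0.2, 0.8]},
    {"id": "guide-1", "text": "Airport parking guide", "category": "guide", "vector": [0.95, 0.05]},
]

keyword_hits = keyword_search(["airport"], docs)
vector_hits = vector_search([1.0, 0.0], docs, lambda doc: doc["category"] == "hotel")
print(keyword_hits)  # -> ['hotel-1', 'guide-1']
print(vector_hits)   # -> ['hotel-1', 'hotel-2']
```

The guide document is excluded from the vector results by the filter but is still returned by the keyword side of the query.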

Sub-Scores

Sub-Scores provide granular scoring information for each recall set contributing to the final search results. In hybrid search scenarios, where multiple factors like vector similarity and text relevance play a role, Sub-Scores offer transparency into how each component influences the overall ranking.
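To illustrate what a per-component breakdown can look like, the sketch below fuses two recall sets and records each set's contribution alongside the final score. It assumes Reciprocal Rank Fusion with a constant of 60 as the fusion method, which this post does not state, and the output shape and the name `fuse_with_subscores` are hypothetical rather than the API's response format.

```python
def fuse_with_subscores(recall_sets, k=60):
    """Fuse ranked lists with Reciprocal Rank Fusion, keeping each contribution."""
    results = {}
    for name, ranked_ids in recall_sets.items():
        for rank, doc_id in enumerate(ranked_ids, start=1):
            entry = results.setdefault(doc_id, {"score": 0.0, "subscores": {}})
            contribution = 1.0 / (k + rank)
            entry["subscores"][name] = contribution
            entry["score"] += contribution
    return sorted(results.items(), key=lambda item: item[1]["score"], reverse=True)


recall_sets = {
    "text": ["a", "c"],    # keyword ranking
    "vector": ["b", "a"],  # vector ranking
}

fused = fuse_with_subscores(recall_sets)
for doc_id, info in fused:
    print(doc_id, round(info["score"], 5), info["subscores"])
```

Document "a" ranks first because both recall sets contribute to its score, and the breakdown shows how much each one added.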

Text Split Skill by Tokens

The Text Split Skill by Tokens feature enhances your ability to process and manage large text data by splitting text based on token counts. This gives you more precise control over passage (chunk) length, leading to more targeted indexing and retrieval, particularly for documents with extensive content.
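The general idea of token-based chunking can be sketched as follows. The whitespace tokenizer is only a stand-in (a real pipeline would count tokens with the embedding model's tokenizer), and the function name `split_by_tokens` and its `max_tokens` and `overlap` parameters are hypothetical, not the skill's actual settings.

```python
def split_by_tokens(text, max_tokens, overlap=0, tokenize=str.split):
    """Split text into chunks of at most max_tokens tokens, with optional overlap."""
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    if not 0 <= overlap < max_tokens:
        raise ValueError("overlap must be non-negative and smaller than max_tokens")
    tokens = tokenize(text)
    step = max_tokens - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + max_tokens]))
        if start + max_tokens >= len(tokens):
            break  # the last chunk already reaches the end of the text
    return chunks


chunks = split_by_tokens("one two three four five six seven", max_tokens=3, overlap=1)
print(chunks)  # -> ['one two three', 'three four five', 'five six seven']
```

Because the limit is expressed in tokens rather than characters, every chunk is guaranteed to fit a token budget, which matters when chunks are later sent to an embedding model with a fixed input limit.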

For any questions or to share your feedback, feel free to reach out through our Azure Search community.

Getting started with Azure AI Search

Learn more about Azure AI Search and about all the latest features. 
Want to chat with your data? Check out VoiceRAG!
Start creating a search service in the Azure Portal, Azure CLI, the Management REST API, ARM template, or a Bicep file. 
Learn about Retrieval Augmented Generation in Azure AI Search.
Explore our preview client libraries in Python, .NET, Java, and JavaScript, offering diverse integration methods to cater to varying user needs. 
Explore how to create end-to-end RAG applications with Azure AI Studio.

References:
[1] Kusupati, A., Bhatt, G., Rege, A., Wallingford, M., Sinha, A., Ramanujan, V., Howard-Snyder, W., Chen, K., Kakade, S., Jain, P., & Farhadi, A. (2024). Matryoshka Representation Learning. arXiv preprint arXiv:2205.13147. Retrieved from https://arxiv.org/abs/2205.13147

Azure AI Search October Updates: Nearly 100x Compression with Minimal Quality Loss