Vector DB and RAG

 

Vector Stores: Powering AI with Semantic Search

Vector stores (a.k.a. vector databases) are specialized data storage systems designed to store and retrieve high-dimensional vector representations of data like text, images, or audio. They're a key component in retrieval-augmented generation (RAG), semantic search, AI chatbots, and recommendation systems.


🔹 1. What Are Vectors in AI?

In AI, we convert unstructured inputs (like sentences or images) into dense numeric vectors (embeddings) using models like BERT, OpenAI’s text-embedding, or CLIP.
These vectors capture the semantic meaning of the input.

📌 For example:

  • "What is AI?" → [0.12, -0.64, 0.88, ..., 0.34]
  • "Explain artificial intelligence" → maps to a nearby point in vector space

🔹 2. What Is a Vector Store?

A vector store indexes these high-dimensional vectors and supports efficient nearest neighbor search for similarity-based retrieval.

✔ Supports k-Nearest Neighbor (kNN) and Approximate Nearest Neighbor (ANN) search
✔ Returns similar documents/images when queried with a vector
✔ Often integrated with LLMs to provide contextual memory and knowledge retrieval
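At its core, the kNN search a vector store performs can be sketched as a brute-force scan. The document ids and 3-dimensional vectors below are made-up toy data:

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_search(query, database, k=2):
    """Exact kNN: score every stored vector and return the k closest ids."""
    ranked = sorted(database.items(), key=lambda item: euclidean(query, item[1]))
    return [doc_id for doc_id, _ in ranked[:k]]

database = {
    "doc_ai":  [0.9, 0.1, 0.0],
    "doc_ml":  [0.7, 0.3, 0.1],
    "doc_sql": [0.0, 0.1, 0.9],
}

print(knn_search([0.85, 0.15, 0.05], database, k=2))  # ['doc_ai', 'doc_ml']
```

Real vector stores replace this linear scan with ANN indexes (covered below), because scanning billions of vectors per query is far too slow.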


🔹 3. Popular Vector Databases

  • Pinecone: Fully managed, scalable, cloud-native; ideal for RAG pipelines
  • FAISS (by Meta): Open-source, lightning-fast, supports GPU indexing
  • Weaviate: Schema-aware, includes hybrid (symbolic + vector) search
  • Milvus: Open-source, built for billion-scale vector search
  • Chroma: Simple and tightly integrated with LangChain workflows

🔹 4. When Are Vector Stores Used?

Retrieval-Augmented Generation (RAG)
→ Combines search with LLMs to ground answers in external knowledge

Semantic Search
→ Finds documents based on meaning, not just keywords

Image & Video Similarity Search
→ Compare visual embeddings for tasks like face recognition

Personalized Recommendations
→ Suggests content with similar vector profiles


🔹 5. Sample RAG Pipeline Using FAISS + LangChain

from langchain.vectorstores import FAISS
from langchain.embeddings import OpenAIEmbeddings

# Prepare document and embedding
docs = ["Generative AI is a subfield of machine learning", "SQL is used for database querying"]
embedding_model = OpenAIEmbeddings()
vectordb = FAISS.from_texts(docs, embedding_model)

# Ask a semantic question
query = "What is AI used for?"
retrieved_docs = vectordb.similarity_search(query, k=2)
print(retrieved_docs)

✔ This retrieves contextually similar documents, which can be used to augment LLM responses.




🔹 What Is ANN (Approximate Nearest Neighbor)?

In vector search, ANN algorithms help find items in a database whose vector embeddings are closest (most similar) to a query vector—but faster than exact methods like brute-force search.

📌 Why “Approximate”?
Finding exact nearest neighbors in high-dimensional space is computationally expensive. ANN trades a little accuracy for a massive speed boost, which makes it well suited to real-time semantic search.


🔹 Where It’s Used

  • Search engines (e.g., vectorized text search)
  • Recommendation systems
  • RAG (Retrieval-Augmented Generation)
  • Image or video similarity
  • Multimodal embedding search (text-to-image, etc.)

🔹 Popular ANN Techniques

1. Brute-Force Search (Exact kNN)

  • Compares the query vector to every database vector
  • High accuracy, but slow for large datasets
  • Used mostly for evaluation or small datasets


2. Tree-Based Methods

  • KD-Tree: Great for low dimensions (<20D); partitions space using axis-aligned splits.
  • Ball Tree: Similar to KD-Tree but uses hyperspheres; better for clustered data.
  • VP-Tree: Builds partitions from distances between points; used in some metric spaces.

📌 Fast for structured, small to mid-sized vector sets


3. Hashing-Based Methods

  • LSH (Locality-Sensitive Hashing): Hashes similar vectors into the same bucket with high probability.
  • MinHash / SimHash: Specialized for Jaccard or cosine similarity.

✅ Best for extremely fast, but approximate search
Used in early versions of semantic search engines
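The random-hyperplane idea behind SimHash-style LSH fits in a few lines: each hyperplane contributes one bit (the sign of the dot product), and vectors with identical bit-strings land in the same bucket. The hyperplanes below are fixed so the sketch is reproducible; a real implementation draws many of them at random and uses several hash tables:

```python
def lsh_signature(vec, hyperplanes):
    """One bit per hyperplane: 1 if the vector lies on its positive side."""
    bits = ""
    for plane in hyperplanes:
        dot = sum(v * p for v, p in zip(vec, plane))
        bits += "1" if dot >= 0 else "0"
    return bits

# Fixed hyperplanes for reproducibility (normally drawn at random)
planes = [[1.0, -1.0, 0.0], [0.0, 1.0, -1.0], [1.0, 1.0, 1.0]]

a = [0.9, 0.1, 0.0]    # similar to b
b = [0.8, 0.2, 0.1]
c = [-0.5, -0.1, 0.9]  # dissimilar

print(lsh_signature(a, planes), lsh_signature(b, planes))  # same bucket
print(lsh_signature(c, planes))                            # different bucket
```

At query time, only vectors sharing the query’s bucket are compared exactly, which is why LSH is so fast but only approximate.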


4. Graph-Based Approaches

  • HNSW (Hierarchical Navigable Small World): Builds layered graphs for logarithmic traversal; extremely fast and accurate.
  • NSW: Non-hierarchical version; still efficient.

Most widely used in modern vector stores
🔥 FAISS, Weaviate, and Pinecone support HNSW


5. Quantization-Based Methods

  • PQ (Product Quantization): Compresses vectors into smaller subspaces; compares compressed codes.
  • IVF (Inverted File Index): Clusters database vectors, then narrows the search to the relevant partitions.
  • IVF+PQ (IVFPQ): Combines clustering + compression.

✅ Offers a good trade-off between speed and memory efficiency
Common in FAISS deployments at scale
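The IVF idea can be sketched in a few lines: bucket vectors under the nearest “centroid” at index time, then probe only the closest bucket at query time. Real IVF learns centroids with k-means and probes several buckets (the nprobe parameter); the centroids and vectors below are toy values:

```python
import math

def nearest_centroid(vec, centroids):
    return min(range(len(centroids)), key=lambda i: math.dist(vec, centroids[i]))

# Two fixed "cluster centres" (real IVF learns these with k-means)
centroids = [[1.0, 0.0], [0.0, 1.0]]
vectors = {"a": [0.9, 0.1], "b": [0.8, 0.3], "c": [0.1, 0.9], "d": [0.2, 0.8]}

# Index build: bucket each vector under its nearest centroid
buckets = {0: [], 1: []}
for doc_id, vec in vectors.items():
    buckets[nearest_centroid(vec, centroids)].append(doc_id)

# Query: probe only the closest bucket instead of scanning everything
query = [0.9, 0.15]
probe = nearest_centroid(query, centroids)
best = min(buckets[probe], key=lambda doc_id: math.dist(query, vectors[doc_id]))
print(probe, best)  # probes bucket 0, returns 'a'
```

Because only one bucket is scanned, vectors in unprobed buckets can never be returned, which is exactly the accuracy/speed trade-off IVF makes.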


🔹 Choosing the Right ANN Method

  • Small (<10k vectors): Brute-Force, KD-Tree
  • Medium (10k–1M): HNSW, LSH, IVF
  • Large (>1M): HNSW, IVFPQ, PQ


Types of Vector Stores:

🔹 By Architecture and Deployment Type

  1. In-Memory Vector Stores

    • Fastest, but limited by RAM.
    • Ideal for prototyping and small-scale tasks.
    • Example: Chroma (used in LangChain), FAISS (with IndexFlatL2).
  2. Disk-Based / Persistent Vector Stores

    • Scales beyond RAM limits.
    • Useful for production workloads.
    • Examples: Weaviate, Milvus, Qdrant, Vespa.
  3. Cloud-Native Managed Vector Databases

    • Fully hosted with autoscaling, security, replication.
    • Minimal infra setup.
    • Examples: Pinecone, Azure Cognitive Search (vector mode), Google Vertex AI Matching Engine.

🔹 By Search Algorithm Used

  1. Flat Index (Brute Force)

    • Exact, slow.
    • Good for small datasets.
    • Used in FAISS IndexFlatL2.
  2. Quantized Indexes (IVF, PQ)

    • Combines clustering + compression.
    • Balances speed and accuracy.
    • FAISS supports IVF, IVFPQ.
  3. Graph-Based Indexes (HNSW, NSW)

    • Great recall and speed on large sets.
    • Used by Pinecone, Weaviate, FAISS (HNSW), Vespa.
  4. Hashing-Based Stores

    • Based on LSH (Locality Sensitive Hashing).
    • Less common now but useful for certain similarity types.

🔹 By Feature Set

  • Pinecone: Serverless, fully managed, fast HNSW, metadata filtering
  • FAISS: Meta’s open-source library, versatile, GPU-compatible
  • Weaviate: Schema + hybrid search + modular storage
  • Chroma: Lightweight, great for LangChain prototypes
  • Qdrant: Rust-based, fast, filters, re-ranking
  • Milvus: High throughput, GPU/CPU support, billion-scale indexing
  • Elasticsearch / OpenSearch (vector mode): Traditional inverted index + vector hybrid
  • Zilliz: Managed version of Milvus with cloud features


🔹 What Are Vector Libraries?

Vector libraries are in-memory software tools that help compute, index, and search embeddings (high-dimensional vectors) efficiently — usually used during experimentation or local model development.

🧰 Examples:

  • FAISS (by Meta): Fast similarity search and clustering; supports IVF, PQ, and HNSW indexing; GPU acceleration available.
  • Annoy (by Spotify): Optimized for disk-based, memory-efficient approximate nearest neighbor (ANN) search using trees.
  • ScaNN (by Google): Scalable Nearest Neighbors; deep-learning-friendly ANN search that integrates well with TensorFlow.
  • NMSLIB: Non-Metric Space Library supporting HNSW; great for Python-based pipelines.
  • Hnswlib: Lightweight, high-performance C++/Python library for HNSW ANN indexing.

✅ Use Case: Ideal for prototyping vector search, local RAG, or batch similarity scoring.


🔸 What Are Vector Databases?

Vector databases are production-ready services (self-hosted or managed) built to store and search embeddings across billions of vectors, often with metadata filtering, scalability, and indexing baked in.

🧩 Examples:

  • Pinecone: Fully managed, real-time vector search with metadata filtering, hybrid search, and serverless scaling.
  • Weaviate: Open-source + hybrid semantic search (vector + keyword), RESTful APIs, built-in modules (e.g. OpenAI, Cohere).
  • Qdrant: Fast, Rust-based; filtering and re-ranking with payload-aware HNSW support.
  • Milvus: Scalable GPU/CPU support, good for billion-scale search; supports IVF, HNSW, and hybrid indexes.
  • Chroma: Lightweight vector store used in LangChain; great for small-scale local pipelines.

✅ Use Case: Perfect for production-scale RAG, AI chat memory, personalization systems, and semantic enterprise search.


⚖️ Key Differences

  • Scale: Libraries are local, up to millions of vectors; databases are cloud-scale, handling billions of vectors
  • Persistence: Libraries are typically non-persistent; databases persist to disk or cloud
  • Filtering & metadata: Libraries offer minimal support; databases add advanced filtering, tagging and ranking
  • Deployment: Libraries live in a Python or C++ codebase; databases run hosted, in Docker or as managed APIs
  • Use case: Libraries suit prototyping and local dev; databases suit real-time, scalable production use


                            


🧠 1. Retrieval-Augmented Generation (RAG)

Used in LLM-powered applications to fetch relevant documents or facts from a knowledge base before answering.

  • Example: Chatbots with long-term memory, like a customer support bot that recalls manuals or product specs.

๐Ÿ” 2. Semantic Search

Vector DBs retrieve content based on meaning, not exact wording.

  • Example: Searching “startup capital help” returns “small business loans” due to semantic closeness.

๐Ÿค 3. Recommendation Systems

Finds items (products, songs, users) similar in meaning or behavior.

  • Example: “You may also like” suggestions based on vector proximity to your preferences.

📄 4. Document Similarity & Clustering

Used to group and compare content such as emails, contracts, or academic papers.

  • Example: Deduplicating similar FAQs or clustering legal documents by topic.

📷 5. Image & Video Retrieval

Embedding-based search for visual similarity—crucial in media, fashion, and surveillance.

  • Example: “Show me all images similar to this dress.”

🛡️ 6. Cybersecurity & Anomaly Detection

Vectors represent user behavior or network traffic patterns.

  • Example: Spotting fraud by comparing a transaction to a vector profile of normal behavior.

๐ŸŒ 7. Multilingual Applications

Since embeddings from different languages can share the same vector space, a vector DB can do cross-lingual retrieval.

  • Example: Search English documents using a German query.

🎯 8. Personalized Search & Chat Memory

Vector DBs can store user histories, preferences, and chat memory for context-aware AI.

  • Example: A sales AI that “remembers” what features a client liked last week.


🔹 Chroma DB

  • Type: Lightweight, open-source vector store
  • Best for: Prototyping, LangChain experiments, local development
  • Storage: Local (in-memory or persistent file-based)
  • Filtering: Limited metadata filtering compared to Pinecone
  • Indexing: Typically brute-force or simple ANN (less optimized for scale)
  • Integration: Designed with LangChain in mind — super plug-and-play
  • Deployment: Runs easily on your machine or container; no cloud infra needed

Use Case: Fast setup for building RAG chatbots, notebooks, or embedding playgrounds.


🔹 Pinecone

  • Type: Fully managed, cloud-native vector database
  • Best for: Scalable, production-grade RAG pipelines
  • Storage: Distributed, persistent cloud storage
  • Filtering: Advanced — supports metadata filters, namespaces, versioning
  • Indexing: Uses HNSW and optimized sparse-dense hybrid indexes
  • Integration: Works seamlessly with LangChain, OpenAI, Cohere, etc.
  • Deployment: No infrastructure needed — just use their API

Use Case: Recommended when you need millisecond latency, high availability, and scalable search across millions of documents.


🧠 When to Use Which?

  • Fast prototyping or hobby project: Chroma
  • Full-scale production (chatbots, search apps): Pinecone
  • You want cloud scaling & team collaboration: Pinecone
  • Lightweight local dev with minimal setup: Chroma


🔹 1. Chroma DB Sample Code

from langchain.vectorstores import Chroma
from langchain.embeddings import OpenAIEmbeddings
from langchain.document_loaders import TextLoader
from langchain.text_splitter import CharacterTextSplitter

# Load and split document
loader = TextLoader("example.txt")
docs = loader.load()
text_splitter = CharacterTextSplitter(chunk_size=500, chunk_overlap=50)
texts = text_splitter.split_documents(docs)

# Embeddings and Vector Store
embeddings = OpenAIEmbeddings()
chroma_store = Chroma.from_documents(texts, embeddings)

# Retrieve relevant docs
query = "What is Generative AI?"
results = chroma_store.similarity_search(query, k=3)
for doc in results:
    print(doc.page_content)

Ideal for: Local dev, quick experiments, LangChain notebooks.


🔹 2. Pinecone Sample Code

import pinecone
from langchain.vectorstores import Pinecone
from langchain.embeddings import OpenAIEmbeddings

# Init Pinecone (classic pinecone-client v2 API; newer clients use Pinecone(api_key=...))
pinecone.init(api_key="YOUR_API_KEY", environment="us-east1-gcp")
index_name = "langchain-demo"

# Create Index (run once)
if index_name not in pinecone.list_indexes():
    pinecone.create_index(index_name, dimension=1536)

# Prepare Embeddings + Store
embeddings = OpenAIEmbeddings()
docs = ["Generative AI is the ability for machines to create content.",
        "Large language models can perform reasoning tasks.",
        "Vector databases store and retrieve high-dimensional embeddings."]

vector_store = Pinecone.from_texts(docs, embeddings, index_name=index_name)

# Semantic search
query = "What can large language models do?"
results = vector_store.similarity_search(query, k=2)
for r in results:
    print(r.page_content)

Ideal for: Cloud-scale applications, persistent vector storage, enterprise RAG.






๐Ÿ”✨ What Is Generative Search?

Generative Search is the fusion of two powerful AI capabilities:

  1. Semantic Retrieval — finding relevant documents using vector similarity (meaning-based search).
  2. Generative AI — using large language models (LLMs) like GPT to synthesize natural language answers from those documents.

This approach is often implemented using a Retrieval-Augmented Generation (RAG) pipeline.


🔧 How Generative Search Works (Step-by-Step)

  1. User Query
    "What are the benefits of using vector databases?"

  2. Embedding Generation
    → The query is converted into a vector using a model like OpenAIEmbeddings, SentenceTransformers, or Cohere.

  3. Vector Search (Semantic Retrieval)
    → The vector is used to search a vector database (e.g., Pinecone, FAISS, Weaviate) to retrieve the most relevant documents.

  4. Context Injection
    → Retrieved documents are injected into the prompt for the LLM.

  5. LLM Response Generation
    → The LLM (e.g., GPT-4) generates a natural language answer grounded in the retrieved context.


🧠 Why It’s Powerful

  • Grounded responses: Reduces hallucinations by anchoring answers in real data
  • Domain adaptability: Works with custom corpora (legal, medical, enterprise docs)
  • Explainability: You can trace the answer back to source documents
  • Real-time knowledge: Keeps LLMs up to date without retraining

🛠️ Tools for Building Generative Search

  • LangChain: Orchestrates retrieval + generation
  • Pinecone / Weaviate / FAISS: Vector databases for semantic search
  • OpenAI / Cohere / Hugging Face: Embedding + generation models
  • Chroma: Lightweight vector store for local dev

🧪 Sample LangChain RAG Pipeline (Generative Search)

from langchain.chains import RetrievalQA
from langchain.vectorstores import Chroma
from langchain.embeddings import OpenAIEmbeddings
from langchain.llms import OpenAI

# Load vector store
embedding = OpenAIEmbeddings()
vectorstore = Chroma(persist_directory="db", embedding_function=embedding)

# Create RAG chain
qa_chain = RetrievalQA.from_chain_type(
    llm=OpenAI(),
    retriever=vectorstore.as_retriever(),
    return_source_documents=True
)

# Ask a question
query = "What is a vector database and why is it useful?"
result = qa_chain(query)

print("Answer:", result["result"])
print("Sources:", [doc.metadata for doc in result["source_documents"]])


๐Ÿ”๐Ÿง  What Is RAG (Retrieval-Augmented Generation)?

RAG (Retrieval-Augmented Generation) is a powerful architecture that combines information retrieval with generative AI to produce more accurate, grounded, and context-aware responses. It’s the backbone of many modern AI systems like chatbots, enterprise search assistants, and AI copilots.


🔧 How RAG Works (Step-by-Step)

1. User Query

The user asks a question:

"What are the benefits of using vector databases?"


2. Embedding the Query

The query is converted into a dense vector using an embedding model like:

  • OpenAIEmbeddings
  • SentenceTransformers
  • CohereEmbeddings

3. Semantic Retrieval

The vector is used to search a vector database (e.g., Pinecone, FAISS, Weaviate) to find top-k relevant documents based on semantic similarity.


4. Context Injection

The retrieved documents are injected into the prompt for the LLM (e.g., GPT-4, Claude, LLaMA) as context.


5. Response Generation

The LLM generates a natural language answer grounded in the retrieved documents.


🧠 Why Use RAG?

  • Grounded answers: Reduces hallucinations by anchoring responses in real data
  • Dynamic knowledge: No need to retrain the LLM when data changes
  • Domain adaptability: Works with custom corpora (legal, medical, enterprise)
  • Explainability: You can trace answers back to source documents

🛠️ Tools for Building RAG Pipelines

  • Embeddings: OpenAI, Cohere, Hugging Face, Azure
  • Vector store: Pinecone, FAISS, Weaviate, Qdrant, Chroma
  • LLM: GPT-4, Claude, LLaMA, Mistral
  • Frameworks: LangChain, LlamaIndex, Haystack

🧪 Sample RAG Pipeline (LangChain + Chroma)

The code is identical to the generative search pipeline shown earlier: load a persisted Chroma store with OpenAIEmbeddings, wrap it in a RetrievalQA chain with return_source_documents=True, and query it.


🔧🧠 RAG Pipeline: Retrieval-Augmented Generation Explained

A RAG pipeline is a hybrid architecture that combines a retriever (to fetch relevant documents) with a generator (to synthesize answers). It’s the foundation of intelligent systems like AI assistants, enterprise search tools, and domain-specific chatbots.


๐Ÿ” RAG Pipeline Architecture

User Query
   ↓
[Embed Query]
   ↓
[Vector Search in Vector DB (Retriever)]
   ↓
[Top-k Relevant Documents]
   ↓
[Inject into Prompt for LLM (Generator)]
   ↓
[LLM Generates Final Answer]

🧱 Core Components of a RAG Pipeline

  • Embedding model: Converts text into dense vectors (OpenAI, Cohere, Hugging Face)
  • Vector store: Stores and retrieves embeddings (Pinecone, FAISS, Weaviate, Chroma)
  • Retriever: Finds top-k relevant documents (LangChain, LlamaIndex)
  • LLM (generator): Synthesizes answers from context (GPT-4, Claude, LLaMA, Mistral)
  • Orchestrator: Manages the pipeline flow (LangChain, LlamaIndex, Haystack)

🧪 Sample RAG Pipeline Using LangChain + Chroma + OpenAI

from langchain.chains import RetrievalQA
from langchain.vectorstores import Chroma
from langchain.embeddings import OpenAIEmbeddings
from langchain.llms import OpenAI
from langchain.document_loaders import TextLoader
from langchain.text_splitter import CharacterTextSplitter

# Load and split documents
loader = TextLoader("docs/your_knowledge.txt")
docs = loader.load()
splitter = CharacterTextSplitter(chunk_size=500, chunk_overlap=50)
chunks = splitter.split_documents(docs)

# Create vector store
embedding = OpenAIEmbeddings()
vectorstore = Chroma.from_documents(chunks, embedding)

# Build RAG chain
rag_chain = RetrievalQA.from_chain_type(
    llm=OpenAI(),
    retriever=vectorstore.as_retriever(),
    return_source_documents=True
)

# Ask a question
query = "What are the benefits of using vector databases?"
result = rag_chain(query)

print("Answer:", result["result"])
print("Sources:", [doc.metadata for doc in result["source_documents"]])

🎯 When to Use a RAG Pipeline

✅ When your LLM needs access to:

  • Private or domain-specific knowledge
  • Frequently updated content
  • Long-term memory or document context

✅ When you want:

  • Explainable answers (with sources)
  • Reduced hallucination
  • No need to fine-tune the LLM

🧠 Bonus: Enhancements for Production RAG

  • Metadata filtering: Retrieve docs by tags (e.g., date, author, topic)
  • Hybrid search: Combine keyword + vector search
  • Re-ranking: Use a cross-encoder to re-rank retrieved docs
  • Multi-turn memory: Add chat history to the prompt
  • Streaming output: Stream LLM responses for better UX




Now, let’s discuss each of these layers in detail.

 

Embedding Layer

You are already familiar with the embedding layer, as it was covered in the previous sessions on semantic search. The embedding layer is typically the first layer of a RAG model and contains an embedding model trained on a massive data set of text and code. This data set is used to learn the relationships between words and phrases and to create embeddings that represent those relationships. The embedding layer is an important part of RAG models because it allows the system to understand the meaning of the text it is processing and its semantic relationship to the query. It generates embeddings for your text corpus, which lets the RAG model understand the meaning of the query and generate a relevant, informative response. This is essential for a variety of tasks, such as question answering, summarisation and machine translation.

 

Search and Rank Layer

The next layer is the search and rank, or re-rank, layer. This layer is responsible for retrieving relevant information from an external knowledge base, ranking it by its relevance to the input query and presenting it to the generation layer for further processing. It is an essential component of RAG, as it ensures that the retrieved text is accurate, relevant and contextually appropriate. The search and re-rank layer typically consists of two components:

  • A search component that uses various techniques to retrieve relevant documents from the knowledge base

  • A re-rank component that uses a variety of techniques to re-rank the retrieved documents to produce the most relevant results

 

The search component typically uses a technique called semantic similarity. As discussed in the previous session, semantic similarity is a measure of how similar two pieces of text are in terms of their meaning. The search component uses semantic similarity to retrieve documents from a knowledge base that are relevant to the user's query. 

 

The re-rank component of the search typically uses a variety of techniques to re-rank the retrieved documents. These techniques can include the following:

  • Ranking by relevance: The re-rank component can rank the retrieved documents based on how relevant they are to the user's query.

  • Ranking by popularity: The re-rank component can rank the retrieved documents based on how popular they are, such as by measuring the number of times they have been viewed or shared.

  • Ranking by freshness: The re-rank component can rank the retrieved documents based on how recent they are, such as by measuring the date on which they were published.
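These ranking signals are often blended into a single weighted score. The sketch below is a toy illustration: the weights, normalisations and document fields are assumptions, not a standard formula:

```python
# Hypothetical retrieved documents, each with a relevance score from the search step
docs = [
    {"id": "d1", "relevance": 0.90, "views": 100,  "age_days": 400},
    {"id": "d2", "relevance": 0.85, "views": 5000, "age_days": 10},
    {"id": "d3", "relevance": 0.40, "views": 9000, "age_days": 2},
]

def rerank_score(doc, w_rel=0.7, w_pop=0.2, w_fresh=0.1):
    """Weighted blend of relevance, popularity and freshness (toy normalisations)."""
    popularity = min(doc["views"] / 10_000, 1.0)    # cap popularity at 10k views
    freshness = 1.0 / (1.0 + doc["age_days"] / 30)  # decays over a few months
    return w_rel * doc["relevance"] + w_pop * popularity + w_fresh * freshness

reranked = sorted(docs, key=rerank_score, reverse=True)
print([d["id"] for d in reranked])  # ['d2', 'd1', 'd3']
```

Note how the moderately relevant but popular, fresh document (d2) overtakes the slightly more relevant but stale one (d1) once all three signals are blended.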

 

The search and re-rank layer is an important part of RAG models because it allows the model to retrieve and re-rank relevant documents from a knowledge base, which is essential for tasks such as question answering, summarisation and machine translation. In short, the retrieval component finds relevant information from existing information sources, and the re-rank component orders the retrieved information by its relevance to the input query.

 

Generation Layer

The generation layer is typically the last layer of a RAG model. It consists of a foundation large language model trained on a massive data set of text and code. As the name suggests, the generation layer allows the model to generate new text in response to a user’s query: the generative model takes the retrieved information, synthesises all the data and shapes it into a coherent and contextually appropriate response. This is essential for many tasks, such as question answering, summarisation, machine translation and, in particular, generative search (RAG). In the context of search, this layer provides the contextual grounding and natural language capabilities behind generative search.





The first step in the pipeline is to build a vector store that can store documents along with metadata. The typical process involves ingesting the documents, converting the raw text in the documents and then splitting them into chunks based on various chunking strategies. Each chunk is then represented as a vector using an appropriate text embedding model, which is then stored in the vector database.
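The chunking step described above can be sketched with a fixed-size splitter that keeps an overlap between neighbouring chunks (character-based for simplicity; production splitters also respect sentence and paragraph boundaries):

```python
def chunk_text(text, chunk_size=100, overlap=20):
    """Split text into chunk_size-character pieces, overlapping neighbours by `overlap`."""
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break
    return chunks

text = "Retrieval-augmented generation grounds LLM answers in retrieved documents. " * 4
chunks = chunk_text(text, chunk_size=100, overlap=20)
print(len(chunks), [len(c) for c in chunks])  # 4 overlapping chunks
```

Each chunk would then be embedded and stored; the overlap keeps sentences that straddle a chunk boundary retrievable from either side.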

 

The next step is to embed the user query into the same vector space as the documents in the vector store with the embedding model. Once the query is embedded, a semantic search is performed to find the closest embedding from the vector store. The top K entries (chunks or documents) that have the highest semantic overlap with the query are retrieved using various search and indexing strategies that are available in vector databases. 

 

In addition to the semantic search layers for retrieving the top K relevant documents, we also discussed two major strategies to improve the overall performance and responsiveness of the semantic search system:

  • Cache mechanism
  • Re-ranking layer

 

Once the top entries for the query have been retrieved and re-ranked, the next stage is to pass the results to the generative search step. In this final step, the prompt, along with the query, and the relevant documents are passed to the LLM to generate a unique response to the user’s query. The retrieved documents provide context to the LLM, which helps it generate a more accurate response.
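The retrieve-then-inject flow above can be sketched end to end. The toy_embed function is a hypothetical stand-in for a real embedding model (a keyword-count vector, purely for illustration), and the prompt template is likewise an assumption:

```python
import math

def toy_embed(text):
    """Hypothetical embedding: counts a tiny fixed vocabulary, then normalises."""
    vocab = ["vector", "database", "sql", "llm"]
    words = text.lower().split()
    vec = [sum(1.0 for w in words if term in w) for term in vocab]
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

docs = [
    "A vector database stores embeddings for similarity search.",
    "SQL databases store rows in tables.",
]
index = [(doc, toy_embed(doc)) for doc in docs]  # the "vector store"

def retrieve(query, k=1):
    """Semantic retrieval: rank stored documents by dot product with the query vector."""
    q = toy_embed(query)
    ranked = sorted(index,
                    key=lambda pair: sum(a * b for a, b in zip(q, pair[1])),
                    reverse=True)
    return [doc for doc, _ in ranked[:k]]

# Context injection: build the prompt the LLM would receive
query = "What does a vector database do?"
context = "\n".join(retrieve(query, k=1))
prompt = f"Answer using only this context:\n{context}\n\nQuestion: {query}"
print(prompt)
```

In a real pipeline the embedding model, vector database and LLM replace toy_embed, the index list and the final print, but the data flow is exactly this: embed, retrieve top-k, inject as context, then generate.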

 

Overall, retrieval augmented generation combines the strengths of semantic search and large language models to generate more accurate responses to user queries.











