Vector DB and RAG
Vector Stores: Powering AI with Semantic Search
Vector stores (a.k.a. vector databases) are specialized data storage systems designed to store and retrieve high-dimensional vector representations of data like text, images, or audio. They're a key component in retrieval-augmented generation (RAG), semantic search, AI chatbots, and recommendation systems.
🔹 1. What Are Vectors in AI?
In AI, we convert unstructured inputs (like sentences or images) into dense numeric vectors (embeddings) using models like BERT, OpenAI's text-embedding models, or CLIP.
These vectors capture the semantic meaning of the input.
👉 For example:
"What is AI?" → [0.12, -0.64, 0.88, ..., 0.34]
"Explain artificial intelligence" → a nearby point in vector space
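To make this concrete, here is a minimal sketch of how "closeness" in vector space is measured. The 4-dimensional vectors are made-up toy values (real embedding models emit hundreds or thousands of dimensions); only the cosine-similarity formula itself is the real technique.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity: 1.0 = same direction, 0.0 = orthogonal."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 4-dimensional "embeddings" (illustrative values only)
what_is_ai = np.array([0.12, -0.64, 0.88, 0.34])
explain_ai = np.array([0.10, -0.60, 0.90, 0.30])   # semantically close phrasing
sql_query  = np.array([-0.70, 0.20, -0.10, 0.55])  # unrelated topic

print(cosine_similarity(what_is_ai, explain_ai))  # close to 1.0
print(cosine_similarity(what_is_ai, sql_query))   # much lower
```

Nearest-neighbor search in a vector store is essentially this comparison, done efficiently over millions of stored vectors.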
🔹 2. What Is a Vector Store?
A vector store indexes these high-dimensional vectors and supports efficient nearest neighbor search for similarity-based retrieval.
✔ Supports k-Nearest Neighbor (kNN) and Approximate Nearest Neighbor (ANN) search
✔ Returns similar documents/images when queried with a vector
✔ Often integrated with LLMs to provide contextual memory and knowledge retrieval
🔹 3. Popular Vector Databases
| Vector Store | Key Features |
|---|---|
| Pinecone | Fully managed, scalable, cloud-native; ideal for RAG pipelines |
| FAISS (by Meta) | Open-source, lightning-fast, supports GPU indexing |
| Weaviate | Schema-aware, includes hybrid (symbolic + vector) search |
| Milvus | Open-source, built for billion-scale vector search |
| Chroma | Simple and tightly integrated with LangChain workflows |
🔹 4. When Are Vector Stores Used?
✅ Retrieval-Augmented Generation (RAG)
→ Combines search with LLMs to ground answers in external knowledge
✅ Semantic Search
→ Finds documents based on meaning, not just keywords
✅ Image & Video Similarity Search
→ Compare visual embeddings for tasks like face recognition
✅ Personalized Recommendations
→ Suggests content with similar vector profiles
🔹 5. Sample RAG Pipeline Using FAISS + LangChain
from langchain.vectorstores import FAISS
from langchain.embeddings import OpenAIEmbeddings
# Prepare document and embedding
docs = ["Generative AI is a subfield of machine learning", "SQL is used for database querying"]
embedding_model = OpenAIEmbeddings()
vectordb = FAISS.from_texts(docs, embedding_model)
# Ask a semantic question
query = "What is AI used for?"
retrieved_docs = vectordb.similarity_search(query, k=2)
print(retrieved_docs)
✔ This retrieves contextually similar documents, which can be used to augment LLM responses.
🔹 What Is ANN (Approximate Nearest Neighbor)?
In vector search, ANN algorithms help find items in a database whose vector embeddings are closest (most similar) to a query vector—but faster than exact methods like brute-force search.
🔍 Why “Approximate”?
Finding exact neighbors in high-dimensional space is expensive (computationally). ANN trades a bit of accuracy for a massive speed boost—perfect for real-time semantic search.
🔹 Where It’s Used
- Search engines (e.g., vectorized text search)
- Recommendation systems
- RAG (Retrieval-Augmented Generation)
- Image or video similarity
- Multimodal embedding search (text-to-image, etc.)
🔹 Popular ANN Techniques
1. Brute-Force Search (Exact kNN)
✔ Compares query vector to every database vector
✔ High accuracy, slow for large datasets
Used mostly for evaluation or small data
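The brute-force approach can be sketched in a few lines of NumPy. The random database and the slightly perturbed query are illustrative, but the exact-kNN logic (compare against every vector, sort by distance) is the real thing.

```python
import numpy as np

def knn_search(query: np.ndarray, db: np.ndarray, k: int) -> np.ndarray:
    """Exact kNN: compare the query against every database vector (O(n*d))."""
    dists = np.linalg.norm(db - query, axis=1)  # Euclidean distance to each row
    return np.argsort(dists)[:k]                # indices of the k closest vectors

rng = np.random.default_rng(0)
db = rng.normal(size=(1000, 64))               # 1,000 vectors, 64 dimensions
query = db[42] + 0.01 * rng.normal(size=64)    # a lightly perturbed copy of row 42

print(knn_search(query, db, k=3))              # row 42 comes back first
```

This is exactly what a "flat" index does; ANN methods exist because the per-query cost of this scan grows linearly with the database.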
2. Tree-Based Methods
| Algorithm | Description |
|---|---|
| KD-Tree | Great for low dimensions (<20D); partitions space using axis-aligned splits. |
| Ball Tree | Similar to KD-Tree but uses hyperspheres; better for clustered data. |
| VP-Tree | Uses distances between points to build partitions—used in some metric spaces. |
📌 Fast for structured, small to mid-sized vector sets
3. Hashing-Based Methods
| Algorithm | Description |
|---|---|
| LSH (Locality-Sensitive Hashing) | Hashes similar vectors into the same bucket with high probability. |
| MinHash / SimHash | Specialized for Jaccard or cosine similarity. |
✅ Best when you need extremely fast, approximate search
Used in early versions of semantic search engines
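A minimal sketch of the random-hyperplane variant of LSH, using made-up toy vectors: each of 8 random hyperplanes contributes one bit to the bucket signature (the sign of the projection), and vectors pointing in similar directions tend to fall on the same side of each plane, so they hash to the same bucket with high probability.

```python
import numpy as np

def lsh_signature(v: np.ndarray, planes: np.ndarray) -> tuple:
    """Random-hyperplane LSH: one bit per plane, by the sign of the projection."""
    return tuple((planes @ v > 0).astype(int))

rng = np.random.default_rng(1)
planes = rng.normal(size=(8, 4))   # 8 random hyperplanes -> 8-bit bucket signatures

a = np.array([0.9, 0.1, 0.2, 0.0])
b = a + 0.01                       # nearly identical vector, likely the same bucket
print(lsh_signature(a, planes))
print(lsh_signature(b, planes))
```

At query time you only compare the query against vectors in its own bucket (or a few buckets), which is what makes the search sublinear.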
4. Graph-Based Approaches
| Algorithm | Description |
|---|---|
| HNSW (Hierarchical Navigable Small World) | Builds layered graphs for logarithmic traversal—extremely fast & accurate. |
| NSW | Non-hierarchical version; still efficient. |
✅ Most widely used in modern vector stores
🔥 FAISS, Weaviate, and Pinecone support HNSW
5. Quantization-Based Methods
| Algorithm | Description |
|---|---|
| PQ (Product Quantization) | Compresses vectors into smaller subspaces; compares compressed codes. |
| IVF (Inverted File Index) | Clusters database vectors, narrows search to relevant partitions. |
| IVF+PQ (IVFPQ) | Combines clustering + compression. |
✅ Offers a good trade-off between speed and memory efficiency
Common in FAISS deployments at scale
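The IVF idea can be sketched in plain NumPy. This toy version samples database rows as centroids instead of running k-means (which FAISS does), but the search logic is the same: assign every vector to its nearest centroid, then at query time scan only the `nprobe` clusters whose centroids are closest to the query.

```python
import numpy as np

rng = np.random.default_rng(0)
db = rng.normal(size=(2000, 32)).astype(np.float32)

# "Train": pick cluster centroids (FAISS runs k-means; we sample rows for brevity)
centroids = db[rng.choice(len(db), size=16, replace=False)]
# Assign every database vector to its nearest centroid (its inverted list)
assign = np.argmin(np.linalg.norm(db[:, None] - centroids[None], axis=2), axis=1)

def ivf_search(query, nprobe=2, k=3):
    """Scan only the nprobe clusters whose centroids are closest to the query."""
    order = np.argsort(np.linalg.norm(centroids - query, axis=1))[:nprobe]
    cand = np.flatnonzero(np.isin(assign, order))   # candidate vector ids
    dists = np.linalg.norm(db[cand] - query, axis=1)
    return cand[np.argsort(dists)[:k]]

print(ivf_search(db[7]))   # first hit is row 7 itself (distance zero)
```

Product quantization then compresses the vectors inside each list, which is where the additional memory savings of IVFPQ come from.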
🔹 Choosing the Right ANN Method
| Dataset Size | Best ANN Type |
|---|---|
| Small (<10k) | Brute-Force, KD-Tree |
| Medium (10k–1M) | HNSW, LSH, IVF |
| Large (>1M) | HNSW, IVFPQ, PQ |
Types of Vector Stores
🔹 By Architecture and Deployment Type
In-Memory Vector Stores
- Fastest, but limited by RAM.
- Ideal for prototyping and small-scale tasks.
- Example: Chroma (used in LangChain), FAISS (with IndexFlatL2).
Disk-Based / Persistent Vector Stores
- Scales beyond RAM limits.
- Useful for production workloads.
- Examples: Weaviate, Milvus, Qdrant, Vespa.
Cloud-Native Managed Vector Databases
- Fully hosted with autoscaling, security, replication.
- Minimal infra setup.
- Examples: Pinecone, Azure Cognitive Search (vector mode), Google Vertex AI Matching Engine.
🔹 By Search Algorithm Used
Flat Index (Brute Force)
- Exact, slow.
- Good for small datasets.
- Used in FAISS IndexFlatL2.
Quantized Indexes (IVF, PQ)
- Combines clustering + compression.
- Balances speed and accuracy.
- FAISS supports IVF, IVFPQ.
Graph-Based Indexes (HNSW, NSW)
- Great recall and speed on large sets.
- Used by Pinecone, Weaviate, FAISS (HNSW), Vespa.
Hashing-Based Stores
- Based on LSH (Locality Sensitive Hashing).
- Less common now but useful for certain similarity types.
🔹 By Feature Set
| Vector Store | Highlights |
|---|---|
| Pinecone | Serverless, fully managed, fast HNSW, metadata filtering |
| FAISS | Facebook’s open-source, versatile, GPU-compatible |
| Weaviate | Schema + hybrid search + modular storage |
| Chroma | Lightweight, great for LangChain prototypes |
| Qdrant | Rust-based, fast, filters, re-ranking |
| Milvus | High throughput, GPU/CPU support, billion-scale indexing |
| ElasticSearch / OpenSearch (vector mode) | Traditional inverted index + vector hybrid |
| Zilliz | Managed version of Milvus with cloud features |
🔹 What Are Vector Libraries?
Vector libraries are in-memory software tools that help compute, index, and search embeddings (high-dimensional vectors) efficiently — usually used during experimentation or local model development.
🧰 Examples:
| Library | Description |
|---|---|
| FAISS (by Meta) | Fast similarity search and clustering; supports IVF, PQ, and HNSW indexing; GPU acceleration available. |
| Annoy (by Spotify) | Optimized for disk-based and memory-efficient approximate nearest neighbor (ANN) search using trees. |
| ScaNN (by Google) | Deep learning-friendly ANN search with Scalable Nearest Neighbors; integrates well with TensorFlow. |
| NMSLIB | Non-Metric Space Library supporting HNSW; great for Python-based pipelines. |
| Hnswlib | Lightweight, high-performance C++/Python library for HNSW ANN indexing. |
✅ Use Case: Ideal for prototyping vector search, local RAG, or batch similarity scoring.
🔸 What Are Vector Databases?
Vector databases are production-ready services (self-hosted or managed) built to store and search embeddings across billions of vectors, often with metadata filtering, scalability, and indexing baked in.
🧩 Examples:
| Vector DB | Key Features |
|---|---|
| Pinecone | Fully managed, real-time vector search with metadata filtering, hybrid search, and serverless scaling. |
| Weaviate | Open-source + hybrid semantic search (vector + keyword), RESTful APIs, built-in modules (e.g. OpenAI, Cohere). |
| Qdrant | Fast, Rust-based, filtering and re-ranking with payload-aware HNSW support. |
| Milvus | Scalable GPU/CPU support, good for billion-scale search, supports IVF, HNSW, and hybrid indexes. |
| Chroma | Lightweight vector store used in LangChain; great for small-scale local pipelines. |
✅ Use Case: Perfect for production-scale RAG, AI chat memory, personalization systems, and semantic enterprise search.
⚖️ Key Differences
| Feature | Vector Libraries | Vector Databases |
|---|---|---|
| Scale | Local, up to millions of vectors | Cloud-scale, billions of vectors |
| Persistence | Typically non-persistent | Persistent (disk/cloud) |
| Filtering & Metadata | Minimal | Advanced filtering, tagging, ranking |
| Deployment | Python or C++ codebase | Hosted, Docker, or managed APIs |
| Use Case | Prototyping, local dev | Real-time, scalable production use |
🧠 1. Retrieval-Augmented Generation (RAG)
Used in LLM-powered applications to fetch relevant documents or facts from a knowledge base before answering.
- Example: Chatbots with long-term memory, like a customer support bot that recalls manuals or product specs.
🔍 2. Semantic Search
Vector DBs retrieve content based on meaning, not exact wording.
- Example: Searching “startup capital help” returns “small business loans” due to semantic closeness.
🤖 3. Recommendation Systems
Finds items (products, songs, users) similar in meaning or behavior.
- Example: “You may also like” suggestions based on vector proximity to your preferences.
📄 4. Document Similarity & Clustering
Used to group and compare content such as emails, contracts, or academic papers.
- Example: Deduplicating similar FAQs or clustering legal documents by topic.
📷 5. Image & Video Retrieval
Embedding-based search for visual similarity—crucial in media, fashion, and surveillance.
- Example: “Show me all images similar to this dress.”
🛡️ 6. Cybersecurity & Anomaly Detection
Vectors represent user behavior or network traffic patterns.
- Example: Spotting fraud by comparing a transaction to a vector profile of normal behavior.
🌍 7. Multilingual Applications
Since embeddings from different languages can share the same vector space, a vector DB can do cross-lingual retrieval.
- Example: Search English documents using a German query.
🎯 8. Personalized Search & Chat Memory
Vector DBs can store user histories, preferences, and chat memory for context-aware AI.
- Example: A sales AI that “remembers” what features a client liked last week.
🔹 Chroma DB
- Type: Lightweight, open-source vector store
- Best for: Prototyping, LangChain experiments, local development
- Storage: Local (in-memory or persistent file-based)
- Filtering: Limited metadata filtering compared to Pinecone
- Indexing: Typically brute-force or simple ANN (less optimized for scale)
- Integration: Designed with LangChain in mind — super plug-and-play
- Deployment: Runs easily on your machine or container; no cloud infra needed
✅ Use Case: Fast setup for building RAG chatbots, notebooks, or embedding playgrounds.
🔹 Pinecone
- Type: Fully managed, cloud-native vector database
- Best for: Scalable, production-grade RAG pipelines
- Storage: Distributed, persistent cloud storage
- Filtering: Advanced — supports metadata filters, namespaces, versioning
- Indexing: Uses HNSW and optimized sparse-dense hybrid indexes
- Integration: Works seamlessly with LangChain, OpenAI, Cohere, etc.
- Deployment: No infrastructure needed — just use their API
✅ Use Case: Recommended when you need millisecond latency, high availability, and scalable search across millions of documents.
🧠 When to Use Which?
| Scenario | Pick This |
|---|---|
| Fast prototyping or hobby project | Chroma |
| Full-scale production (chatbots, search apps) | Pinecone |
| You want cloud scaling & team collaboration | Pinecone |
| Lightweight local dev with minimal setup | Chroma |
🔹 1. Chroma DB Sample Code
from langchain.vectorstores import Chroma
from langchain.embeddings import OpenAIEmbeddings
from langchain.document_loaders import TextLoader
from langchain.text_splitter import CharacterTextSplitter
# Load and split document
loader = TextLoader("example.txt")
docs = loader.load()
text_splitter = CharacterTextSplitter(chunk_size=500, chunk_overlap=50)
texts = text_splitter.split_documents(docs)
# Embeddings and Vector Store
embeddings = OpenAIEmbeddings()
chroma_store = Chroma.from_documents(texts, embeddings)
# Retrieve relevant docs
query = "What is Generative AI?"
results = chroma_store.similarity_search(query, k=3)
for doc in results:
    print(doc.page_content)
✅ Ideal for: Local dev, quick experiments, LangChain notebooks.
🔹 2. Pinecone Sample Code
import pinecone
from langchain.vectorstores import Pinecone
from langchain.embeddings import OpenAIEmbeddings
# Init Pinecone
pinecone.init(api_key="YOUR_API_KEY", environment="us-east1-gcp")
index_name = "langchain-demo"
# Create Index (run once)
if index_name not in pinecone.list_indexes():
    pinecone.create_index(index_name, dimension=1536)
# Prepare Embeddings + Store
embeddings = OpenAIEmbeddings()
docs = ["Generative AI is the ability for machines to create content.",
"Large language models can perform reasoning tasks.",
"Vector databases store and retrieve high-dimensional embeddings."]
vector_store = Pinecone.from_texts(docs, embeddings, index_name=index_name)
# Semantic search
query = "What can large language models do?"
results = vector_store.similarity_search(query, k=2)
for r in results:
    print(r.page_content)
✅ Ideal for: Cloud-scale applications, persistent vector storage, enterprise RAG.
🔍✨ What Is Generative Search?
Generative Search is the fusion of two powerful AI capabilities:
- Semantic Retrieval — finding relevant documents using vector similarity (meaning-based search).
- Generative AI — using large language models (LLMs) like GPT to synthesize natural language answers from those documents.
This approach is often implemented using a Retrieval-Augmented Generation (RAG) pipeline.
🧠 How Generative Search Works (Step-by-Step)
1. User Query
→ "What are the benefits of using vector databases?"
2. Embedding Generation
→ The query is converted into a vector using a model like OpenAIEmbeddings, SentenceTransformers, or Cohere.
3. Vector Search (Semantic Retrieval)
→ The vector is used to search a vector database (e.g., Pinecone, FAISS, Weaviate) to retrieve the most relevant documents.
4. Context Injection
→ Retrieved documents are injected into the prompt for the LLM.
5. LLM Response Generation
→ The LLM (e.g., GPT-4) generates a natural language answer grounded in the retrieved context.
🧠 Why It’s Powerful
| Feature | Benefit |
|---|---|
| Grounded Responses | Reduces hallucinations by anchoring answers in real data |
| Domain Adaptability | Works with custom corpora (legal, medical, enterprise docs) |
| Explainability | You can trace the answer back to source documents |
| Real-Time Knowledge | Keeps LLMs up-to-date without retraining |
🛠️ Tools for Building Generative Search
| Tool | Role |
|---|---|
| LangChain | Orchestrates retrieval + generation |
| Pinecone / Weaviate / FAISS | Vector database for semantic search |
| OpenAI / Cohere / Hugging Face | Embedding + generation models |
| Chroma | Lightweight vector store for local dev |
🧪 Sample LangChain RAG Pipeline (Generative Search)
from langchain.chains import RetrievalQA
from langchain.vectorstores import Chroma
from langchain.embeddings import OpenAIEmbeddings
from langchain.llms import OpenAI
# Load vector store
embedding = OpenAIEmbeddings()
vectorstore = Chroma(persist_directory="db", embedding_function=embedding)
# Create RAG chain
qa_chain = RetrievalQA.from_chain_type(
    llm=OpenAI(),
    retriever=vectorstore.as_retriever(),
    return_source_documents=True
)
# Ask a question
query = "What is a vector database and why is it useful?"
result = qa_chain(query)
print("Answer:", result["result"])
print("Sources:", [doc.metadata for doc in result["source_documents"]])
🔍🧠 What Is RAG (Retrieval-Augmented Generation)?
RAG (Retrieval-Augmented Generation) is a powerful architecture that combines information retrieval with generative AI to produce more accurate, grounded, and context-aware responses. It’s the backbone of many modern AI systems like chatbots, enterprise search assistants, and AI copilots.
🧠 How RAG Works (Step-by-Step)
1. User Query
The user asks a question:
"What are the benefits of using vector databases?"
2. Embedding the Query
The query is converted into a dense vector using an embedding model like:
- OpenAIEmbeddings
- SentenceTransformers
- CohereEmbeddings
3. Semantic Retrieval
The vector is used to search a vector database (e.g., Pinecone, FAISS, Weaviate) to find top-k relevant documents based on semantic similarity.
4. Context Injection
The retrieved documents are injected into the prompt for the LLM (e.g., GPT-4, Claude, LLaMA) as context.
5. Response Generation
The LLM generates a natural language answer grounded in the retrieved documents.
🧠 Why Use RAG?
| Feature | Benefit |
|---|---|
| Grounded Answers | Reduces hallucinations by anchoring responses in real data |
| Dynamic Knowledge | No need to retrain the LLM when data changes |
| Domain Adaptability | Works with custom corpora (legal, medical, enterprise) |
| Explainability | You can trace answers back to source documents |
🛠️ Tools for Building RAG Pipelines
| Component | Tools |
|---|---|
| Embeddings | OpenAI, Cohere, Hugging Face, Azure |
| Vector Store | Pinecone, FAISS, Weaviate, Qdrant, Chroma |
| LLM | GPT-4, Claude, LLaMA, Mistral |
| Frameworks | LangChain, LlamaIndex, Haystack |
🧪 Sample RAG Pipeline (LangChain + Chroma)
from langchain.chains import RetrievalQA
from langchain.vectorstores import Chroma
from langchain.embeddings import OpenAIEmbeddings
from langchain.llms import OpenAI
# Load vector store
embedding = OpenAIEmbeddings()
vectorstore = Chroma(persist_directory="db", embedding_function=embedding)
# Create RAG chain
qa_chain = RetrievalQA.from_chain_type(
    llm=OpenAI(),
    retriever=vectorstore.as_retriever(),
    return_source_documents=True
)
# Ask a question
query = "What is a vector database and why is it useful?"
result = qa_chain(query)
print("Answer:", result["result"])
print("Sources:", [doc.metadata for doc in result["source_documents"]])
🔧🧠 RAG Pipeline: Retrieval-Augmented Generation Explained
A RAG pipeline is a hybrid architecture that combines a retriever (to fetch relevant documents) with a generator (to synthesize answers). It’s the foundation of intelligent systems like AI assistants, enterprise search tools, and domain-specific chatbots.
🔁 RAG Pipeline Architecture
User Query
↓
[Embed Query]
↓
[Vector Search in Vector DB (Retriever)]
↓
[Top-k Relevant Documents]
↓
[Inject into Prompt for LLM (Generator)]
↓
[LLM Generates Final Answer]
🧱 Core Components of a RAG Pipeline
| Component | Role | Tools |
|---|---|---|
| Embedding Model | Converts text into dense vectors | OpenAI, Cohere, Hugging Face |
| Vector Store | Stores and retrieves embeddings | Pinecone, FAISS, Weaviate, Chroma |
| Retriever | Finds top-k relevant documents | LangChain, LlamaIndex |
| LLM (Generator) | Synthesizes answers from context | GPT-4, Claude, LLaMA, Mistral |
| Orchestrator | Manages the pipeline flow | LangChain, LlamaIndex, Haystack |
🧪 Sample RAG Pipeline Using LangChain + Chroma + OpenAI
from langchain.chains import RetrievalQA
from langchain.vectorstores import Chroma
from langchain.embeddings import OpenAIEmbeddings
from langchain.llms import OpenAI
from langchain.document_loaders import TextLoader
from langchain.text_splitter import CharacterTextSplitter
# Load and split documents
loader = TextLoader("docs/your_knowledge.txt")
docs = loader.load()
splitter = CharacterTextSplitter(chunk_size=500, chunk_overlap=50)
chunks = splitter.split_documents(docs)
# Create vector store
embedding = OpenAIEmbeddings()
vectorstore = Chroma.from_documents(chunks, embedding)
# Build RAG chain
rag_chain = RetrievalQA.from_chain_type(
    llm=OpenAI(),
    retriever=vectorstore.as_retriever(),
    return_source_documents=True
)
# Ask a question
query = "What are the benefits of using vector databases?"
result = rag_chain(query)
print("Answer:", result["result"])
print("Sources:", [doc.metadata for doc in result["source_documents"]])
🎯 When to Use a RAG Pipeline
✅ When your LLM needs access to:
- Private or domain-specific knowledge
- Frequently updated content
- Long-term memory or document context
✅ When you want:
- Explainable answers (with sources)
- Reduced hallucination
- No need to fine-tune the LLM
🔧 Bonus: Enhancements for Production RAG
| Feature | Description |
|---|---|
| Metadata Filtering | Retrieve docs by tags (e.g., date, author, topic) |
| Hybrid Search | Combine keyword + vector search |
| Re-ranking | Use a cross-encoder to re-rank retrieved docs |
| Multi-turn Memory | Add chat history to the prompt |
| Streaming Output | Stream LLM responses for better UX |
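As a rough illustration of the hybrid-search row above, here is a toy blend of a keyword score (Jaccard overlap of tokens) with a vector score (cosine similarity). The documents, stand-in embeddings and `alpha` weight are all made up for the example; production systems use real embeddings and tuned weights (often BM25 for the keyword side).

```python
import numpy as np

docs = [
    "vector databases store embeddings for semantic search",
    "SQL databases use tables and keyword queries",
    "embeddings capture the meaning of text",
]
# Stand-in 2-D embeddings (a real pipeline would call an embedding model)
doc_vecs = np.array([[0.9, 0.1], [0.1, 0.9], [0.8, 0.3]])
query, query_vec = "semantic search with embeddings", np.array([0.85, 0.2])

def hybrid_scores(alpha=0.5):
    """Blend keyword overlap (Jaccard) with cosine similarity."""
    q_tokens = set(query.split())
    kw = np.array([len(q_tokens & set(d.split())) / len(q_tokens | set(d.split()))
                   for d in docs])
    cos = doc_vecs @ query_vec / (np.linalg.norm(doc_vecs, axis=1)
                                  * np.linalg.norm(query_vec))
    return alpha * kw + (1 - alpha) * cos

print(np.argsort(hybrid_scores())[::-1])  # best-to-worst document indices
```

The keyword term keeps exact matches from being drowned out by purely semantic neighbors, which is why hybrid search often beats either signal alone.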
A RAG system can also be viewed as three layers: an embedding layer, a search and re-rank layer, and a generation layer. Now, let’s discuss each of these layers in detail.
Embedding Layer
You are already familiar with the embedding layer, as it was covered in the previous sessions on semantic search. The embedding layer is typically the first layer of a RAG model and contains an embedding model trained on a massive data set of text and code. From this data the model learns the relationships between words and phrases and encodes them as embeddings. The embedding layer generates embeddings for your text corpus and for incoming queries, allowing the system to understand the meaning of the text it processes and its semantic relationship to the query, and so to produce a relevant and informative response. This is essential for a variety of tasks, such as question answering, summarisation and machine translation.
Search and Rank Layer
The next layer is the search and re-rank layer. It retrieves relevant information from an external knowledge base, ranks it by relevance to the input query and presents it to the generation layer for further processing, ensuring that the retrieved text is accurate, relevant and contextually appropriate. The search and re-rank layer typically consists of two components:
A search component that uses various techniques to retrieve relevant documents from the knowledge base
A re-rank component that uses a variety of techniques to re-rank the retrieved documents to produce the most relevant results
The search component typically uses a technique called semantic similarity. As discussed in the previous session, semantic similarity is a measure of how similar two pieces of text are in terms of their meaning. The search component uses semantic similarity to retrieve documents from a knowledge base that are relevant to the user's query.
The re-rank component of the search typically uses a variety of techniques to re-rank the retrieved documents. These techniques can include the following:
Ranking by relevance: The re-rank component can rank the retrieved documents based on how relevant they are to the user's query.
Ranking by popularity: The re-rank component can rank the retrieved documents based on how popular they are, such as by measuring the number of times they have been viewed or shared.
Ranking by freshness: The re-rank component can rank the retrieved documents based on how recent they are, such as by measuring the date on which they were published.
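The three ranking signals above can be combined into a single weighted score. The documents, weights and field names below are hypothetical; the sketch just shows the mechanics of blending relevance, popularity and freshness after min-max normalising each signal.

```python
from datetime import date

# Hypothetical retrieved documents with the signals a re-rank layer might use
retrieved = [
    {"id": "a", "relevance": 0.82, "views": 120,  "published": date(2023, 1, 10)},
    {"id": "b", "relevance": 0.79, "views": 9500, "published": date(2024, 6, 1)},
    {"id": "c", "relevance": 0.91, "views": 40,   "published": date(2020, 3, 5)},
]

def rerank(docs, w_rel=0.6, w_pop=0.2, w_fresh=0.2):
    """Weighted blend of relevance, popularity and freshness (min-max normalised)."""
    def norm(vals):
        lo, hi = min(vals), max(vals)
        return [(v - lo) / (hi - lo) if hi > lo else 1.0 for v in vals]
    rel = norm([d["relevance"] for d in docs])
    pop = norm([d["views"] for d in docs])
    fresh = norm([d["published"].toordinal() for d in docs])
    scored = sorted(zip(docs, rel, pop, fresh),
                    key=lambda t: w_rel * t[1] + w_pop * t[2] + w_fresh * t[3],
                    reverse=True)
    return [d["id"] for d, *_ in scored]

print(rerank(retrieved))
```

In practice the relevance signal often comes from a cross-encoder model rather than the retriever's raw similarity score, but the weighting pattern is the same.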
The search and re-rank layer is an important part of RAG models because it allows the model to retrieve and re-rank relevant documents from a knowledge base, which is essential for numerous tasks, such as question answering, summarisation and machine translation. In short, the retrieval component finds relevant information in existing sources, and the re-rank layer orders that information by its relevance to the input query before it reaches the generation layer.
Generation Layer
The generation layer is typically the last layer of a RAG model and consists of a foundation large language model trained on a massive data set of text and code. As the name suggests, it allows the model to generate new text in response to a user's query: the generative model takes the retrieved information, synthesises it and shapes it into a coherent and contextually appropriate response. This is essential for many tasks, such as question answering, summarisation, machine translation and, in particular, generative search (RAG). In the context of search, this layer provides the context handling and natural language capabilities that make generative search possible.
The first step in the pipeline is to build a vector store that can store documents along with metadata. The typical process involves ingesting the documents, converting the raw text in the documents and then splitting them into chunks based on various chunking strategies. Each chunk is then represented as a vector using an appropriate text embedding model, which is then stored in the vector database.
The next step is to embed the user query into the same vector space as the documents in the vector store with the embedding model. Once the query is embedded, a semantic search is performed to find the closest embedding from the vector store. The top K entries (chunks or documents) that have the highest semantic overlap with the query are retrieved using various search and indexing strategies that are available in vector databases.
In addition to the semantic search layers for retrieving the top K relevant documents, we also discussed two major strategies to improve the overall performance and responsiveness of the semantic search system:
- Cache mechanism
- Re-ranking layer
Once the top entries for the query have been retrieved and re-ranked, the results are passed to the generative search step. In this final step, the prompt, the query and the relevant documents are passed to the LLM, which generates a unique response to the user's query. The retrieved documents provide context to the LLM, helping it generate a more accurate response.
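The context-injection step described above can be sketched as plain string assembly. The retrieved chunks and prompt wording here are illustrative, but this is essentially what an orchestrator like LangChain builds before calling the LLM.

```python
# Hypothetical retrieved chunks; a real pipeline would pull these from the vector store
retrieved_chunks = [
    "Vector databases index embeddings for fast similarity search.",
    "RAG injects retrieved documents into the LLM prompt as context.",
]
query = "Why use a vector database in RAG?"

# Number each chunk so the model can cite its sources
context = "\n\n".join(f"[{i + 1}] {c}" for i, c in enumerate(retrieved_chunks))
prompt = (
    "Answer the question using ONLY the context below. "
    "Cite sources by number.\n\n"
    f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
)
print(prompt)  # this string is what gets sent to the LLM
```

Grounding instructions like "using ONLY the context below" are one of the simplest levers for reducing hallucinations in the final answer.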
Overall, retrieval augmented generation combines the strengths of semantic search and large language models to generate more accurate responses to user queries.