Vector DB and RAG
Vector Stores: Powering AI with Semantic Search
Vector stores (a.k.a. vector databases) are specialized data storage systems designed to store and retrieve high-dimensional vector representations of data like text, images, or audio. They're a key component in retrieval-augmented generation (RAG), semantic search, AI chatbots, and recommendation systems.
🔹 1. What Are Vectors in AI?
In AI, we convert unstructured inputs (like sentences or images) into dense numeric vectors (embeddings) using models like BERT, OpenAI's text-embedding models, or CLIP.
These vectors capture the semantic meaning of the input.
👉 For example:
"What is AI?" → [0.12, -0.64, 0.88, ..., 0.34]
"Explain artificial intelligence" → a nearby point in vector space
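To make this concrete, here is a minimal sketch of how "closeness" in vector space is measured. The 4-dimensional vectors are made-up toy values (real embedding models emit hundreds or thousands of dimensions); only the cosine-similarity formula itself is the real technique.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity: 1.0 = same direction, 0.0 = orthogonal."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 4-dimensional "embeddings" (illustrative values only)
what_is_ai = np.array([0.12, -0.64, 0.88, 0.34])
explain_ai = np.array([0.10, -0.60, 0.90, 0.30])   # semantically close phrasing
sql_query  = np.array([-0.70, 0.20, -0.10, 0.55])  # unrelated topic

print(cosine_similarity(what_is_ai, explain_ai))  # close to 1.0
print(cosine_similarity(what_is_ai, sql_query))   # much lower
```

Nearest-neighbor search in a vector store is essentially this comparison, done efficiently over millions of stored vectors.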
🔹 2. What Is a Vector Store?
A vector store indexes these high-dimensional vectors and supports efficient nearest neighbor search for similarity-based retrieval.
✔ Supports k-Nearest Neighbor (kNN) and Approximate Nearest Neighbor (ANN) search
✔ Returns similar documents/images when queried with a vector
✔ Often integrated with LLMs to provide contextual memory and knowledge retrieval
🔹 3. Popular Vector Databases
| Vector Store | Key Features |
|---|---|
| Pinecone | Fully managed, scalable, cloud-native; ideal for RAG pipelines |
| FAISS (by Meta) | Open-source, lightning-fast, supports GPU indexing |
| Weaviate | Schema-aware, includes hybrid (symbolic + vector) search |
| Milvus | Open-source, built for billion-scale vector search |
| Chroma | Simple and tightly integrated with LangChain workflows |
🔹 4. When Are Vector Stores Used?
✅ Retrieval-Augmented Generation (RAG)
→ Combines search with LLMs to ground answers in external knowledge
✅ Semantic Search
→ Finds documents based on meaning, not just keywords
✅ Image & Video Similarity Search
→ Compare visual embeddings for tasks like face recognition
✅ Personalized Recommendations
→ Suggests content with similar vector profiles
🔹 5. Sample RAG Pipeline Using FAISS + LangChain
from langchain.vectorstores import FAISS
from langchain.embeddings import OpenAIEmbeddings
# Prepare document and embedding
docs = ["Generative AI is a subfield of machine learning", "SQL is used for database querying"]
embedding_model = OpenAIEmbeddings()
vectordb = FAISS.from_texts(docs, embedding_model)
# Ask a semantic question
query = "What is AI used for?"
retrieved_docs = vectordb.similarity_search(query, k=2)
print(retrieved_docs)
✔ This retrieves contextually similar documents, which can be used to augment LLM responses.
🔹 What Is ANN (Approximate Nearest Neighbor)?
In vector search, ANN algorithms help find items in a database whose vector embeddings are closest (most similar) to a query vector—but faster than exact methods like brute-force search.
🔍 Why “Approximate”?
Finding exact neighbors in high-dimensional space is expensive (computationally). ANN trades a bit of accuracy for a massive speed boost—perfect for real-time semantic search.
🔹 Where It’s Used
- Search engines (e.g., vectorized text search)
- Recommendation systems
- RAG (Retrieval-Augmented Generation)
- Image or video similarity
- Multimodal embedding search (text-to-image, etc.)
🔹 Popular ANN Techniques
1. Brute-Force Search (Exact kNN)
✔ Compares query vector to every database vector
✔ High accuracy, slow for large datasets
Used mostly for evaluation or small data
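The brute-force approach can be sketched in a few lines of NumPy. The random database and the slightly perturbed query are illustrative, but the exact-kNN logic (compare against every vector, sort by distance) is the real thing.

```python
import numpy as np

def knn_search(query: np.ndarray, db: np.ndarray, k: int) -> np.ndarray:
    """Exact kNN: compare the query against every database vector (O(n*d))."""
    dists = np.linalg.norm(db - query, axis=1)  # Euclidean distance to each row
    return np.argsort(dists)[:k]                # indices of the k closest vectors

rng = np.random.default_rng(0)
db = rng.normal(size=(1000, 64))               # 1,000 vectors, 64 dimensions
query = db[42] + 0.01 * rng.normal(size=64)    # a lightly perturbed copy of row 42

print(knn_search(query, db, k=3))              # row 42 comes back first
```

This is exactly what a "flat" index does; ANN methods exist because the per-query cost of this scan grows linearly with the database.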
2. Tree-Based Methods
| Algorithm | Description |
|---|---|
| KD-Tree | Great for low dimensions (<20D); partitions space using axis-aligned splits. |
| Ball Tree | Similar to KD-Tree but uses hyperspheres; better for clustered data. |
| VP-Tree | Uses distances between points to build partitions—used in some metric spaces. |
📌 Fast for structured, small to mid-sized vector sets
3. Hashing-Based Methods
| Algorithm | Description |
|---|---|
| LSH (Locality-Sensitive Hashing) | Hashes similar vectors into the same bucket with high probability. |
| MinHash / SimHash | Specialized for Jaccard or cosine similarity. |
✅ Best when you need extremely fast, approximate search
Used in early versions of semantic search engines
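A minimal sketch of the random-hyperplane variant of LSH, using made-up toy vectors: each of 8 random hyperplanes contributes one bit to the bucket signature (the sign of the projection), and vectors pointing in similar directions tend to fall on the same side of each plane, so they hash to the same bucket with high probability.

```python
import numpy as np

def lsh_signature(v: np.ndarray, planes: np.ndarray) -> tuple:
    """Random-hyperplane LSH: one bit per plane, by the sign of the projection."""
    return tuple((planes @ v > 0).astype(int))

rng = np.random.default_rng(1)
planes = rng.normal(size=(8, 4))   # 8 random hyperplanes -> 8-bit bucket signatures

a = np.array([0.9, 0.1, 0.2, 0.0])
b = a + 0.01                       # nearly identical vector, likely the same bucket
print(lsh_signature(a, planes))
print(lsh_signature(b, planes))
```

At query time you only compare the query against vectors in its own bucket (or a few buckets), which is what makes the search sublinear.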
4. Graph-Based Approaches
| Algorithm | Description |
|---|---|
| HNSW (Hierarchical Navigable Small World) | Builds layered graphs for logarithmic traversal—extremely fast & accurate. |
| NSW | Non-hierarchical version; still efficient. |
✅ Most widely used in modern vector stores
🔥 FAISS, Weaviate, and Pinecone support HNSW
5. Quantization-Based Methods
| Algorithm | Description |
|---|---|
| PQ (Product Quantization) | Compresses vectors into smaller subspaces; compares compressed codes. |
| IVF (Inverted File Index) | Clusters database vectors, narrows search to relevant partitions. |
| IVF+PQ (IVFPQ) | Combines clustering + compression. |
✅ Offers a good trade-off between speed and memory efficiency
Common in FAISS deployments at scale
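The IVF idea can be sketched in plain NumPy. This toy version samples database rows as centroids instead of running k-means (which FAISS does), but the search logic is the same: assign every vector to its nearest centroid, then at query time scan only the `nprobe` clusters whose centroids are closest to the query.

```python
import numpy as np

rng = np.random.default_rng(0)
db = rng.normal(size=(2000, 32)).astype(np.float32)

# "Train": pick cluster centroids (FAISS runs k-means; we sample rows for brevity)
centroids = db[rng.choice(len(db), size=16, replace=False)]
# Assign every database vector to its nearest centroid (its inverted list)
assign = np.argmin(np.linalg.norm(db[:, None] - centroids[None], axis=2), axis=1)

def ivf_search(query, nprobe=2, k=3):
    """Scan only the nprobe clusters whose centroids are closest to the query."""
    order = np.argsort(np.linalg.norm(centroids - query, axis=1))[:nprobe]
    cand = np.flatnonzero(np.isin(assign, order))   # candidate vector ids
    dists = np.linalg.norm(db[cand] - query, axis=1)
    return cand[np.argsort(dists)[:k]]

print(ivf_search(db[7]))   # first hit is row 7 itself (distance zero)
```

Product quantization then compresses the vectors inside each list, which is where the additional memory savings of IVFPQ come from.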
🔹 Choosing the Right ANN Method
| Dataset Size | Best ANN Type |
|---|---|
| Small (<10k) | Brute-Force, KD-Tree |
| Medium (10k–1M) | HNSW, LSH, IVF |
| Large (>1M) | HNSW, IVFPQ, PQ |
Types of Vector Stores
🔹 By Architecture and Deployment Type
In-Memory Vector Stores
- Fastest, but limited by RAM.
- Ideal for prototyping and small-scale tasks.
- Example: Chroma (used in LangChain), FAISS (with IndexFlatL2).
Disk-Based / Persistent Vector Stores
- Scales beyond RAM limits.
- Useful for production workloads.
- Examples: Weaviate, Milvus, Qdrant, Vespa.
Cloud-Native Managed Vector Databases
- Fully hosted with autoscaling, security, replication.
- Minimal infra setup.
- Examples: Pinecone, Azure Cognitive Search (vector mode), Google Vertex AI Matching Engine.
🔹 By Search Algorithm Used
Flat Index (Brute Force)
- Exact, slow.
- Good for small datasets.
- Used in FAISS IndexFlatL2.
Quantized Indexes (IVF, PQ)
- Combines clustering + compression.
- Balances speed and accuracy.
- FAISS supports IVF, IVFPQ.
Graph-Based Indexes (HNSW, NSW)
- Great recall and speed on large sets.
- Used by Pinecone, Weaviate, FAISS (HNSW), Vespa.
Hashing-Based Stores
- Based on LSH (Locality Sensitive Hashing).
- Less common now but useful for certain similarity types.
🔹 By Feature Set
| Vector Store | Highlights |
|---|---|
| Pinecone | Serverless, fully managed, fast HNSW, metadata filtering |
| FAISS | Facebook’s open-source, versatile, GPU-compatible |
| Weaviate | Schema + hybrid search + modular storage |
| Chroma | Lightweight, great for LangChain prototypes |
| Qdrant | Rust-based, fast, filters, re-ranking |
| Milvus | High throughput, GPU/CPU support, billion-scale indexing |
| ElasticSearch / OpenSearch (vector mode) | Traditional inverted index + vector hybrid |
| Zilliz | Managed version of Milvus with cloud features |
🔹 What Are Vector Libraries?
Vector libraries are in-memory software tools that help compute, index, and search embeddings (high-dimensional vectors) efficiently — usually used during experimentation or local model development.
🧰 Examples:
| Library | Description |
|---|---|
| FAISS (by Meta) | Fast similarity search and clustering; supports IVF, PQ, and HNSW indexing; GPU acceleration available. |
| Annoy (by Spotify) | Optimized for disk-based and memory-efficient approximate nearest neighbor (ANN) search using trees. |
| ScaNN (by Google) | Deep learning-friendly ANN search with Scalable Nearest Neighbors; integrates well with TensorFlow. |
| NMSLIB | Non-Metric Space Library supporting HNSW; great for Python-based pipelines. |
| Hnswlib | Lightweight, high-performance C++/Python library for HNSW ANN indexing. |
✅ Use Case: Ideal for prototyping vector search, local RAG, or batch similarity scoring.
🔸 What Are Vector Databases?
Vector databases are production-ready services (self-hosted or managed) built to store and search embeddings across billions of vectors, often with metadata filtering, scalability, and indexing baked in.
🧩 Examples:
| Vector DB | Key Features |
|---|---|
| Pinecone | Fully managed, real-time vector search with metadata filtering, hybrid search, and serverless scaling. |
| Weaviate | Open-source + hybrid semantic search (vector + keyword), RESTful APIs, built-in modules (e.g. OpenAI, Cohere). |
| Qdrant | Fast, Rust-based, filtering and re-ranking with payload-aware HNSW support. |
| Milvus | Scalable GPU/CPU support, good for billion-scale search, supports IVF, HNSW, and hybrid indexes. |
| Chroma | Lightweight vector store used in LangChain; great for small-scale local pipelines. |
✅ Use Case: Perfect for production-scale RAG, AI chat memory, personalization systems, and semantic enterprise search.
⚖️ Key Differences
| Feature | Vector Libraries | Vector Databases |
|---|---|---|
| Scale | Local, up to millions of vectors | Cloud-scale, billions of vectors |
| Persistence | Typically non-persistent | Persistent (disk/cloud) |
| Filtering & Metadata | Minimal | Advanced filtering, tagging, ranking |
| Deployment | Python or C++ codebase | Hosted, Docker, or managed APIs |
| Use Case | Prototyping, local dev | Real-time, scalable production use |
🧠 1. Retrieval-Augmented Generation (RAG)
Used in LLM-powered applications to fetch relevant documents or facts from a knowledge base before answering.
- Example: Chatbots with long-term memory, like a customer support bot that recalls manuals or product specs.
🔍 2. Semantic Search
Vector DBs retrieve content based on meaning, not exact wording.
- Example: Searching “startup capital help” returns “small business loans” due to semantic closeness.
🤖 3. Recommendation Systems
Finds items (products, songs, users) similar in meaning or behavior.
- Example: “You may also like” suggestions based on vector proximity to your preferences.
📄 4. Document Similarity & Clustering
Used to group and compare content such as emails, contracts, or academic papers.
- Example: Deduplicating similar FAQs or clustering legal documents by topic.
📷 5. Image & Video Retrieval
Embedding-based search for visual similarity—crucial in media, fashion, and surveillance.
- Example: “Show me all images similar to this dress.”
🛡️ 6. Cybersecurity & Anomaly Detection
Vectors represent user behavior or network traffic patterns.
- Example: Spotting fraud by comparing a transaction to a vector profile of normal behavior.
🌍 7. Multilingual Applications
Since embeddings from different languages can share the same vector space, a vector DB can do cross-lingual retrieval.
- Example: Search English documents using a German query.
🎯 8. Personalized Search & Chat Memory
Vector DBs can store user histories, preferences, and chat memory for context-aware AI.
- Example: A sales AI that “remembers” what features a client liked last week.
🔹 Chroma DB
- Type: Lightweight, open-source vector store
- Best for: Prototyping, LangChain experiments, local development
- Storage: Local (in-memory or persistent file-based)
- Filtering: Limited metadata filtering compared to Pinecone
- Indexing: Typically brute-force or simple ANN (less optimized for scale)
- Integration: Designed with LangChain in mind — super plug-and-play
- Deployment: Runs easily on your machine or container; no cloud infra needed
✅ Use Case: Fast setup for building RAG chatbots, notebooks, or embedding playgrounds.
🔹 Pinecone
- Type: Fully managed, cloud-native vector database
- Best for: Scalable, production-grade RAG pipelines
- Storage: Distributed, persistent cloud storage
- Filtering: Advanced — supports metadata filters, namespaces, versioning
- Indexing: Uses HNSW and optimized sparse-dense hybrid indexes
- Integration: Works seamlessly with LangChain, OpenAI, Cohere, etc.
- Deployment: No infrastructure needed — just use their API
✅ Use Case: Recommended when you need millisecond latency, high availability, and scalable search across millions of documents.
🧠 When to Use Which?
| Scenario | Pick This |
|---|---|
| Fast prototyping or hobby project | Chroma |
| Full-scale production (chatbots, search apps) | Pinecone |
| You want cloud scaling & team collaboration | Pinecone |
| Lightweight local dev with minimal setup | Chroma |
🔹 1. Chroma DB Sample Code
from langchain.vectorstores import Chroma
from langchain.embeddings import OpenAIEmbeddings
from langchain.document_loaders import TextLoader
from langchain.text_splitter import CharacterTextSplitter
# Load and split document
loader = TextLoader("example.txt")
docs = loader.load()
text_splitter = CharacterTextSplitter(chunk_size=500, chunk_overlap=50)
texts = text_splitter.split_documents(docs)
# Embeddings and Vector Store
embeddings = OpenAIEmbeddings()
chroma_store = Chroma.from_documents(texts, embeddings)
# Retrieve relevant docs
query = "What is Generative AI?"
results = chroma_store.similarity_search(query, k=3)
for doc in results:
    print(doc.page_content)
✅ Ideal for: Local dev, quick experiments, LangChain notebooks.
🔹 2. Pinecone Sample Code
import pinecone
from langchain.vectorstores import Pinecone
from langchain.embeddings import OpenAIEmbeddings
# Init Pinecone
pinecone.init(api_key="YOUR_API_KEY", environment="us-east1-gcp")
index_name = "langchain-demo"
# Create Index (run once)
if index_name not in pinecone.list_indexes():
    pinecone.create_index(index_name, dimension=1536)
# Prepare Embeddings + Store
embeddings = OpenAIEmbeddings()
docs = ["Generative AI is the ability for machines to create content.",
"Large language models can perform reasoning tasks.",
"Vector databases store and retrieve high-dimensional embeddings."]
vector_store = Pinecone.from_texts(docs, embeddings, index_name=index_name)
# Semantic search
query = "What can large language models do?"
results = vector_store.similarity_search(query, k=2)
for r in results:
    print(r.page_content)
✅ Ideal for: Cloud-scale applications, persistent vector storage, enterprise RAG.
🔍✨ What Is Generative Search?
Generative Search is the fusion of two powerful AI capabilities:
- Semantic Retrieval — finding relevant documents using vector similarity (meaning-based search).
- Generative AI — using large language models (LLMs) like GPT to synthesize natural language answers from those documents.
This approach is often implemented using a Retrieval-Augmented Generation (RAG) pipeline.
🧠 How Generative Search Works (Step-by-Step)
1. User Query
→ "What are the benefits of using vector databases?"
2. Embedding Generation
→ The query is converted into a vector using a model like OpenAIEmbeddings, SentenceTransformers, or Cohere.
3. Vector Search (Semantic Retrieval)
→ The vector is used to search a vector database (e.g., Pinecone, FAISS, Weaviate) to retrieve the most relevant documents.
4. Context Injection
→ Retrieved documents are injected into the prompt for the LLM.
5. LLM Response Generation
→ The LLM (e.g., GPT-4) generates a natural language answer grounded in the retrieved context.
🧠 Why It’s Powerful
| Feature | Benefit |
|---|---|
| Grounded Responses | Reduces hallucinations by anchoring answers in real data |
| Domain Adaptability | Works with custom corpora (legal, medical, enterprise docs) |
| Explainability | You can trace the answer back to source documents |
| Real-Time Knowledge | Keeps LLMs up-to-date without retraining |
🛠️ Tools for Building Generative Search
| Tool | Role |
|---|---|
| LangChain | Orchestrates retrieval + generation |
| Pinecone / Weaviate / FAISS | Vector database for semantic search |
| OpenAI / Cohere / Hugging Face | Embedding + generation models |
| Chroma | Lightweight vector store for local dev |
🧪 Sample LangChain RAG Pipeline (Generative Search)
from langchain.chains import RetrievalQA
from langchain.vectorstores import Chroma
from langchain.embeddings import OpenAIEmbeddings
from langchain.llms import OpenAI
# Load vector store
embedding = OpenAIEmbeddings()
vectorstore = Chroma(persist_directory="db", embedding_function=embedding)
# Create RAG chain
qa_chain = RetrievalQA.from_chain_type(
    llm=OpenAI(),
    retriever=vectorstore.as_retriever(),
    return_source_documents=True
)
# Ask a question
query = "What is a vector database and why is it useful?"
result = qa_chain(query)
print("Answer:", result["result"])
print("Sources:", [doc.metadata for doc in result["source_documents"]])
🔍🧠 What Is RAG (Retrieval-Augmented Generation)?
RAG (Retrieval-Augmented Generation) is a powerful architecture that combines information retrieval with generative AI to produce more accurate, grounded, and context-aware responses. It’s the backbone of many modern AI systems like chatbots, enterprise search assistants, and AI copilots.
🧠 How RAG Works (Step-by-Step)
1. User Query
The user asks a question:
"What are the benefits of using vector databases?"
2. Embedding the Query
The query is converted into a dense vector using an embedding model like:
- OpenAIEmbeddings
- SentenceTransformers
- CohereEmbeddings
3. Semantic Retrieval
The vector is used to search a vector database (e.g., Pinecone, FAISS, Weaviate) to find top-k relevant documents based on semantic similarity.
4. Context Injection
The retrieved documents are injected into the prompt for the LLM (e.g., GPT-4, Claude, LLaMA) as context.
5. Response Generation
The LLM generates a natural language answer grounded in the retrieved documents.
🧠 Why Use RAG?
| Feature | Benefit |
|---|---|
| Grounded Answers | Reduces hallucinations by anchoring responses in real data |
| Dynamic Knowledge | No need to retrain the LLM when data changes |
| Domain Adaptability | Works with custom corpora (legal, medical, enterprise) |
| Explainability | You can trace answers back to source documents |
🛠️ Tools for Building RAG Pipelines
| Component | Tools |
|---|---|
| Embeddings | OpenAI, Cohere, Hugging Face, Azure |
| Vector Store | Pinecone, FAISS, Weaviate, Qdrant, Chroma |
| LLM | GPT-4, Claude, LLaMA, Mistral |
| Frameworks | LangChain, LlamaIndex, Haystack |
🧪 Sample RAG Pipeline (LangChain + Chroma)
from langchain.chains import RetrievalQA
from langchain.vectorstores import Chroma
from langchain.embeddings import OpenAIEmbeddings
from langchain.llms import OpenAI
# Load vector store
embedding = OpenAIEmbeddings()
vectorstore = Chroma(persist_directory="db", embedding_function=embedding)
# Create RAG chain
qa_chain = RetrievalQA.from_chain_type(
    llm=OpenAI(),
    retriever=vectorstore.as_retriever(),
    return_source_documents=True
)
# Ask a question
query = "What is a vector database and why is it useful?"
result = qa_chain(query)
print("Answer:", result["result"])
print("Sources:", [doc.metadata for doc in result["source_documents"]])
🔧🧠 RAG Pipeline: Retrieval-Augmented Generation Explained
A RAG pipeline is a hybrid architecture that combines a retriever (to fetch relevant documents) with a generator (to synthesize answers). It’s the foundation of intelligent systems like AI assistants, enterprise search tools, and domain-specific chatbots.
🔁 RAG Pipeline Architecture
User Query
↓
[Embed Query]
↓
[Vector Search in Vector DB (Retriever)]
↓
[Top-k Relevant Documents]
↓
[Inject into Prompt for LLM (Generator)]
↓
[LLM Generates Final Answer]
🧱 Core Components of a RAG Pipeline
| Component | Role | Tools |
|---|---|---|
| Embedding Model | Converts text into dense vectors | OpenAI, Cohere, Hugging Face |
| Vector Store | Stores and retrieves embeddings | Pinecone, FAISS, Weaviate, Chroma |
| Retriever | Finds top-k relevant documents | LangChain, LlamaIndex |
| LLM (Generator) | Synthesizes answers from context | GPT-4, Claude, LLaMA, Mistral |
| Orchestrator | Manages the pipeline flow | LangChain, LlamaIndex, Haystack |
🧪 Sample RAG Pipeline Using LangChain + Chroma + OpenAI
from langchain.chains import RetrievalQA
from langchain.vectorstores import Chroma
from langchain.embeddings import OpenAIEmbeddings
from langchain.llms import OpenAI
from langchain.document_loaders import TextLoader
from langchain.text_splitter import CharacterTextSplitter
# Load and split documents
loader = TextLoader("docs/your_knowledge.txt")
docs = loader.load()
splitter = CharacterTextSplitter(chunk_size=500, chunk_overlap=50)
chunks = splitter.split_documents(docs)
# Create vector store
embedding = OpenAIEmbeddings()
vectorstore = Chroma.from_documents(chunks, embedding)
# Build RAG chain
rag_chain = RetrievalQA.from_chain_type(
    llm=OpenAI(),
    retriever=vectorstore.as_retriever(),
    return_source_documents=True
)
# Ask a question
query = "What are the benefits of using vector databases?"
result = rag_chain(query)
print("Answer:", result["result"])
print("Sources:", [doc.metadata for doc in result["source_documents"]])
🎯 When to Use a RAG Pipeline
✅ When your LLM needs access to:
- Private or domain-specific knowledge
- Frequently updated content
- Long-term memory or document context
✅ When you want:
- Explainable answers (with sources)
- Reduced hallucination
- No need to fine-tune the LLM
🔧 Bonus: Enhancements for Production RAG
| Feature | Description |
|---|---|
| Metadata Filtering | Retrieve docs by tags (e.g., date, author, topic) |
| Hybrid Search | Combine keyword + vector search |
| Re-ranking | Use a cross-encoder to re-rank retrieved docs |
| Multi-turn Memory | Add chat history to the prompt |
| Streaming Output | Stream LLM responses for better UX |
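As a rough illustration of the hybrid-search row above, here is a toy blend of a keyword score (Jaccard overlap of tokens) with a vector score (cosine similarity). The documents, stand-in embeddings and `alpha` weight are all made up for the example; production systems use real embeddings and tuned weights (often BM25 for the keyword side).

```python
import numpy as np

docs = [
    "vector databases store embeddings for semantic search",
    "SQL databases use tables and keyword queries",
    "embeddings capture the meaning of text",
]
# Stand-in 2-D embeddings (a real pipeline would call an embedding model)
doc_vecs = np.array([[0.9, 0.1], [0.1, 0.9], [0.8, 0.3]])
query, query_vec = "semantic search with embeddings", np.array([0.85, 0.2])

def hybrid_scores(alpha=0.5):
    """Blend keyword overlap (Jaccard) with cosine similarity."""
    q_tokens = set(query.split())
    kw = np.array([len(q_tokens & set(d.split())) / len(q_tokens | set(d.split()))
                   for d in docs])
    cos = doc_vecs @ query_vec / (np.linalg.norm(doc_vecs, axis=1)
                                  * np.linalg.norm(query_vec))
    return alpha * kw + (1 - alpha) * cos

print(np.argsort(hybrid_scores())[::-1])  # best-to-worst document indices
```

The keyword term keeps exact matches from being drowned out by purely semantic neighbors, which is why hybrid search often beats either signal alone.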
A RAG system can also be viewed as three layers: an embedding layer, a search and re-rank layer, and a generation layer. Now, let’s discuss each of these layers in detail.
Embedding Layer
You are already familiar with the embedding layer, as it was covered in the previous sessions on semantic search. The embedding layer is typically the first layer of a RAG model and contains an embedding model trained on a massive data set of text and code. From this data the model learns the relationships between words and phrases and encodes them as embeddings. The embedding layer generates embeddings for your text corpus and for incoming queries, allowing the system to understand the meaning of the text it processes and its semantic relationship to the query, and so to produce a relevant and informative response. This is essential for a variety of tasks, such as question answering, summarisation and machine translation.
Search and Rank Layer
The next layer is the search and re-rank layer. It retrieves relevant information from an external knowledge base, ranks it by relevance to the input query and presents it to the generation layer for further processing, ensuring that the retrieved text is accurate, relevant and contextually appropriate. The search and re-rank layer typically consists of two components:
A search component that uses various techniques to retrieve relevant documents from the knowledge base
A re-rank component that uses a variety of techniques to re-rank the retrieved documents to produce the most relevant results
The search component typically uses a technique called semantic similarity. As discussed in the previous session, semantic similarity is a measure of how similar two pieces of text are in terms of their meaning. The search component uses semantic similarity to retrieve documents from a knowledge base that are relevant to the user's query.
The re-rank component of the search typically uses a variety of techniques to re-rank the retrieved documents. These techniques can include the following:
Ranking by relevance: The re-rank component can rank the retrieved documents based on how relevant they are to the user's query.
Ranking by popularity: The re-rank component can rank the retrieved documents based on how popular they are, such as by measuring the number of times they have been viewed or shared.
Ranking by freshness: The re-rank component can rank the retrieved documents based on how recent they are, such as by measuring the date on which they were published.
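The three ranking signals above can be combined into a single weighted score. The documents, weights and field names below are hypothetical; the sketch just shows the mechanics of blending relevance, popularity and freshness after min-max normalising each signal.

```python
from datetime import date

# Hypothetical retrieved documents with the signals a re-rank layer might use
retrieved = [
    {"id": "a", "relevance": 0.82, "views": 120,  "published": date(2023, 1, 10)},
    {"id": "b", "relevance": 0.79, "views": 9500, "published": date(2024, 6, 1)},
    {"id": "c", "relevance": 0.91, "views": 40,   "published": date(2020, 3, 5)},
]

def rerank(docs, w_rel=0.6, w_pop=0.2, w_fresh=0.2):
    """Weighted blend of relevance, popularity and freshness (min-max normalised)."""
    def norm(vals):
        lo, hi = min(vals), max(vals)
        return [(v - lo) / (hi - lo) if hi > lo else 1.0 for v in vals]
    rel = norm([d["relevance"] for d in docs])
    pop = norm([d["views"] for d in docs])
    fresh = norm([d["published"].toordinal() for d in docs])
    scored = sorted(zip(docs, rel, pop, fresh),
                    key=lambda t: w_rel * t[1] + w_pop * t[2] + w_fresh * t[3],
                    reverse=True)
    return [d["id"] for d, *_ in scored]

print(rerank(retrieved))
```

In practice the relevance signal often comes from a cross-encoder model rather than the retriever's raw similarity score, but the weighting pattern is the same.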
The search and re-rank layer is an important part of RAG models because it allows the model to retrieve and re-rank relevant documents from a knowledge base, which is essential for numerous tasks, such as question answering, summarisation and machine translation. In short, the retrieval component finds relevant information in existing sources, and the re-rank layer orders that information by its relevance to the input query before it reaches the generation layer.
Generation Layer
The generation layer is typically the last layer of a RAG model and consists of a foundation large language model trained on a massive data set of text and code. As the name suggests, it allows the model to generate new text in response to a user's query: the generative model takes the retrieved information, synthesises it and shapes it into a coherent and contextually appropriate response. This is essential for many tasks, such as question answering, summarisation, machine translation and, in particular, generative search (RAG). In the context of search, this layer provides the context handling and natural language capabilities that make generative search possible.
The first step in the pipeline is to build a vector store that can store documents along with metadata. The typical process involves ingesting the documents, converting the raw text in the documents and then splitting them into chunks based on various chunking strategies. Each chunk is then represented as a vector using an appropriate text embedding model, which is then stored in the vector database.
The next step is to embed the user query into the same vector space as the documents in the vector store with the embedding model. Once the query is embedded, a semantic search is performed to find the closest embedding from the vector store. The top K entries (chunks or documents) that have the highest semantic overlap with the query are retrieved using various search and indexing strategies that are available in vector databases.
In addition to the semantic search layers for retrieving the top K relevant documents, we also discussed two major strategies to improve the overall performance and responsiveness of the semantic search system:
- Cache mechanism
- Re-ranking layer
Once the top entries for the query have been retrieved and re-ranked, the results are passed to the generative search step. In this final step, the prompt, the query and the relevant documents are passed to the LLM, which generates a unique response to the user's query. The retrieved documents provide context to the LLM, helping it generate a more accurate response.
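The context-injection step described above can be sketched as plain string assembly. The retrieved chunks and prompt wording here are illustrative, but this is essentially what an orchestrator like LangChain builds before calling the LLM.

```python
# Hypothetical retrieved chunks; a real pipeline would pull these from the vector store
retrieved_chunks = [
    "Vector databases index embeddings for fast similarity search.",
    "RAG injects retrieved documents into the LLM prompt as context.",
]
query = "Why use a vector database in RAG?"

# Number each chunk so the model can cite its sources
context = "\n\n".join(f"[{i + 1}] {c}" for i, c in enumerate(retrieved_chunks))
prompt = (
    "Answer the question using ONLY the context below. "
    "Cite sources by number.\n\n"
    f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
)
print(prompt)  # this string is what gets sent to the LLM
```

Grounding instructions like "using ONLY the context below" are one of the simplest levers for reducing hallucinations in the final answer.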
Overall, retrieval augmented generation combines the strengths of semantic search and large language models to generate more accurate responses to user queries.