Vector DB, RAG, and Advanced Prompting Interview Questions



Vector Databases & Embeddings

  1. What is an embedding? (Very Frequent) 🧐

    An embedding is a numerical representation of data, like text or an image. It's a list of numbers (a vector) that captures the meaning or semantic content of the data. Words with similar meanings will have similar embedding vectors.

  2. Why do we need embeddings?

    Machine learning models, including LLMs, can't understand text directly. They work with numbers. Embeddings translate the complex, high-level meaning of data into a mathematical format that models can process.

  3. What is a vector database? (Very Frequent) 💾

    A vector database is a special type of database designed to efficiently store and search through embeddings. Instead of searching for exact text matches, it finds items that are semantically similar.

  4. How is a vector database different from a traditional (SQL) database?

    • Traditional DB: Stores structured data (like text, numbers in rows/columns) and finds exact matches. For example, WHERE name = 'John'.

    • Vector DB: Stores unstructured data as embeddings and finds the "closest" or most similar items. For example, "Find documents that are similar to the concept of 'machine learning efficiency'."

  5. What does "similarity search" mean? (Frequent)

    Similarity search, also called a vector search, is the process of finding the vectors in a database that are closest to a given query vector. "Closest" is measured using a distance metric.

  6. What are some common distance metrics used in vector databases?

    • Cosine Similarity: Measures the angle between two vectors. It's very popular for text embeddings because it cares about the orientation (meaning) and not the magnitude.

    • Euclidean Distance: Measures the straight-line distance between the tips of two vectors. It's more intuitive but can be less effective for high-dimensional data like text embeddings.
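The two metrics above can be sketched in a few lines of plain Python. This is a toy illustration of the math, not a production implementation (real systems use optimized libraries like NumPy or the database's built-in scoring):

```python
import math

def cosine_similarity(a, b):
    # cos(a, b) = dot(a, b) / (|a| * |b|); ranges from -1 to 1.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def euclidean_distance(a, b):
    # Straight-line distance between the tips of the two vectors.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

a, b = [1.0, 2.0, 3.0], [2.0, 4.0, 6.0]
print(cosine_similarity(a, b))   # 1.0 — same direction, different magnitude
print(euclidean_distance(a, b))  # ~3.74 — magnitude difference still counts
```

Note how the second vector is just the first one doubled: cosine similarity says they are identical in meaning (same direction), while Euclidean distance still reports them as far apart. That is exactly why cosine is preferred for text embeddings.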

  7. Give me some examples of vector databases.

    Some popular ones are Pinecone, Weaviate, Chroma DB, and Milvus. Some traditional databases like PostgreSQL also have extensions (like pgvector) to handle vector searches.

  8. What is the process of adding data to a vector database?

    It's a two-step process:

    1. Embedding: You take your data (e.g., a paragraph of text) and use an embedding model (like one from OpenAI or Hugging Face) to convert it into a vector.

    2. Indexing (or Upserting): You store this vector, along with any associated metadata (like the original text or a document ID), in the vector database.
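The two steps can be sketched with stand-ins: here `embed` is a toy deterministic function (a real system would call an OpenAI or Hugging Face embedding model), and a plain dict stands in for the vector database collection:

```python
def embed(text: str) -> list[float]:
    # Toy "embedding": a 26-dim character-frequency vector. NOT semantic —
    # a stand-in for a real embedding model call.
    vec = [0.0] * 26
    for ch in text.lower():
        if ch.isalpha():
            vec[ord(ch) - ord("a")] += 1.0
    return vec

store = {}  # in-memory stand-in for a vector database collection

def upsert(doc_id: str, text: str, metadata: dict) -> None:
    # Step 1: embed the raw text. Step 2: store the vector together with
    # its metadata (original text, source, IDs) under a document ID.
    store[doc_id] = {"vector": embed(text), "text": text, "metadata": metadata}

upsert("doc-1", "Vector databases store embeddings.", {"source": "notes"})
print(store["doc-1"]["metadata"])  # {'source': 'notes'}
```

Real vector databases expose essentially this interface (an `upsert`/`add` call taking IDs, vectors, and metadata), plus the index structures that make the later search fast.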

  9. What is an "index" in a vector database?

    An index is a special data structure that organizes the vectors to make searching fast. Without an index, the database would have to compare your query vector to every single vector in the database, which is very slow.

  10. Can you name a common indexing algorithm?

HNSW (Hierarchical Navigable Small World) is a very popular and powerful one. It creates a multi-layered graph of the vectors, allowing for very fast approximate nearest neighbor (ANN) searches.

  11. What is Approximate Nearest Neighbor (ANN) search? Why is it "approximate"?

    ANN is a technique to find "close enough" neighbors without guaranteeing you find the absolute closest one. It's "approximate" because it trades a tiny bit of accuracy for a massive gain in search speed, which is essential for large datasets.

  12. What is "chunking"? Why is it important before creating embeddings? (Frequent)

    Chunking is the process of breaking down large documents into smaller, meaningful pieces before creating embeddings. This is crucial because:

    • Embedding models have a limited input token length.

    • Smaller, more focused chunks lead to better quality embeddings and more precise search results.
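A minimal word-based chunker with overlap might look like this (real pipelines often chunk by sentences or tokens instead, and the sizes here are illustrative):

```python
def chunk_text(text: str, chunk_size: int = 50, overlap: int = 10) -> list[str]:
    """Split text into word-based chunks with a small overlap, so content
    cut at a chunk boundary still appears intact in at least one chunk."""
    words = text.split()
    chunks, start = [], 0
    while start < len(words):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break
        start += chunk_size - overlap  # step back by `overlap` words
    return chunks

doc = " ".join(f"word{i}" for i in range(120))
chunks = chunk_text(doc, chunk_size=50, overlap=10)
print(len(chunks))  # 3 chunks: words 0-49, 40-89, 80-119
```

The overlap is the key design choice: without it, a sentence split across two chunks would be half-represented in each embedding.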


Retrieval-Augmented Generation (RAG)

  1. What is RAG (Retrieval-Augmented Generation)? (Very Frequent) 🗣️

    RAG is a technique that enhances an LLM's response by providing it with relevant information retrieved from an external knowledge source. The LLM uses this information as context to generate a more accurate and up-to-date answer.

  2. Why do we need RAG? What problem does it solve? (Very Frequent)

    RAG primarily solves two major problems with LLMs:

    • Hallucination: It reduces the chance that the LLM will make up facts by grounding its answer in real data.

    • Knowledge Cutoff: It allows the LLM to answer questions about information that is recent or private (e.g., your company's internal data), which it wasn't trained on.

  3. Explain the basic workflow of a RAG system. (Very Frequent)

    1. User Query: The user asks a question.

    2. Retrieve: The system searches a vector database to find text chunks that are semantically relevant to the user's query.

    3. Augment: The retrieved text chunks are combined with the original user query into a detailed prompt.

    4. Generate: This augmented prompt is sent to an LLM, which then generates the final answer based on the provided context.
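The four steps above can be sketched end to end. Here the "retriever" is a keyword-overlap score over a tiny corpus (standing in for a vector database similarity search) and `call_llm` is a stub for a real model API:

```python
corpus = [
    "The library is open from 8am to 10pm on weekdays.",
    "Tuition fees are due at the start of each semester.",
    "The gym offers free classes for enrolled students.",
]

def retrieve(query: str, k: int = 1) -> list[str]:
    # Toy retriever: rank documents by word overlap with the query.
    def score(doc):
        return len(set(query.lower().split()) & set(doc.lower().split()))
    return sorted(corpus, key=score, reverse=True)[:k]

def call_llm(prompt: str) -> str:
    # Stub for a real LLM API call.
    return f"[LLM answer based on a prompt of {len(prompt)} chars]"

query = "when is the library open"                     # 1. User Query
context = "\n".join(retrieve(query))                   # 2. Retrieve
prompt = (                                             # 3. Augment
    f"Answer using only this context:\n{context}\n\nQuestion: {query}"
)
answer = call_llm(prompt)                              # 4. Generate
print(context)
```

Swapping the toy retriever for a vector database query and the stub for a real LLM call gives you a basic but complete RAG system.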

  4. What are the key components of a RAG pipeline?

    You need a Retriever (which queries the vector database) and a Generator (the LLM). You also need a data Indexing pipeline to populate your vector database.

  5. What is the "Retriever" in RAG?

    The Retriever is the component responsible for fetching the relevant information. In a basic RAG system, this is simply the vector database search.

  6. What is the "Generator" in RAG?

    The Generator is the LLM that takes the user's question and the retrieved context and synthesizes them into a final, human-readable answer.

  7. What does it mean to "ground" an LLM?

    Grounding means connecting the LLM's output to a verifiable source of information. RAG is a primary method for grounding an LLM, as the generated answer is based on the retrieved documents.

  8. Can you build a RAG system without a vector database?

    Yes, but it's less common. You could use a traditional full-text search engine (like Elasticsearch) as your retriever. However, vector databases are generally better at understanding the semantic meaning of a query rather than just keyword matches.

  9. How do you evaluate a RAG system?

    You measure two main things:

    • Retrieval Quality: How relevant are the documents the retriever found? (Metrics: Hit Rate, MRR).

    • Generation Quality: How good is the final answer generated by the LLM? (Metrics: Faithfulness, i.e., does it stick to the context, and Answer Relevancy).
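Hit Rate and MRR are simple to compute once you have ranked results per query and a gold relevant document for each. A minimal sketch:

```python
def hit_rate(results: list[list[str]], relevant: list[str], k: int = 5) -> float:
    # Fraction of queries whose relevant document appears in the top-k results.
    hits = sum(1 for ranked, rel in zip(results, relevant) if rel in ranked[:k])
    return hits / len(results)

def mrr(results: list[list[str]], relevant: list[str]) -> float:
    # Mean Reciprocal Rank: average of 1/rank of the first relevant hit.
    total = 0.0
    for ranked, rel in zip(results, relevant):
        if rel in ranked:
            total += 1.0 / (ranked.index(rel) + 1)
    return total / len(results)

# Two queries: the relevant doc is ranked 1st for q1 and 2nd for q2.
ranked_lists = [["d1", "d2"], ["d3", "d4"]]
gold = ["d1", "d4"]
print(hit_rate(ranked_lists, gold, k=2))  # 1.0
print(mrr(ranked_lists, gold))            # (1/1 + 1/2) / 2 = 0.75
```

Generation-side metrics like Faithfulness usually require an LLM-as-judge or human evaluation, so they are harder to reduce to a few lines like this.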

  10. What is the "Lost in the Middle" problem in RAG?

    This refers to the tendency of some LLMs to pay more attention to information at the beginning and end of the context, potentially ignoring important details placed in the middle of the retrieved documents.

  11. What is the difference between RAG and fine-tuning? (Frequent)

    • Fine-tuning: Changes the model's internal knowledge by retraining its weights on new data. It's good for teaching the model a new style, tone, or skill. It is expensive and static.

    • RAG: Changes the external knowledge the model has access to at inference time. It's good for providing factual, up-to-date information. It is cheaper and dynamic.

  12. When would you choose RAG over fine-tuning?

    Choose RAG when your primary goal is to reduce factual inaccuracies and incorporate new or rapidly changing information without constantly retraining the model.

  13. Can you combine RAG and fine-tuning?

    Yes! You can fine-tune a model to be better at using the context provided by a RAG system. For example, you could fine-tune it to always cite its sources from the context or to handle cases where the context doesn't contain the answer.


Advanced Prompting Techniques

  1. What is the difference between Zero-Shot and Few-Shot Prompting? (Very Frequent)

    • Zero-Shot: You ask the model to do something without giving it any examples. It relies entirely on its pre-trained knowledge.

    • Few-Shot: You include a few examples of the task in the prompt to show the model the pattern and format you want. This is also called in-context learning.
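A few-shot prompt is just careful string construction: the examples demonstrate the input-to-output pattern before the actual query. A sketch for sentiment classification (the examples and labels here are made up for illustration):

```python
examples = [
    ("I loved this movie!", "positive"),
    ("Terrible, a waste of time.", "negative"),
]

def few_shot_prompt(query: str) -> str:
    # Instruction, then worked examples, then the new input with an
    # open-ended "Sentiment:" for the model to complete.
    lines = ["Classify the sentiment of each review."]
    for text, label in examples:
        lines.append(f"Review: {text}\nSentiment: {label}")
    lines.append(f"Review: {query}\nSentiment:")
    return "\n\n".join(lines)

print(few_shot_prompt("Surprisingly good acting."))
```

A zero-shot version would send only the instruction and the final review, with no worked examples.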

  2. What is Chain-of-Thought (CoT) Prompting? (Very Frequent) 🧠

    Chain-of-Thought is a prompting technique where you instruct the model to "think step by step" and break down its reasoning process before giving the final answer. This dramatically improves performance on tasks that require logic, math, or multi-step reasoning.

  3. Give an example of a Zero-Shot CoT prompt.

    You simply add the phrase "Let's think step by step" to the end of your question.

    • Example: "A juggler has 16 balls. 6 are red, and the rest are blue. He drops half of the blue balls. How many blue balls are left? Let's think step by step."

  4. What is a "persona" in a prompt? Why is it useful?

    A persona is when you tell the model to "act as" a specific character or expert. For example, "Act as a senior copywriter." This is useful because it primes the model to adopt the tone, style, and knowledge associated with that role, leading to a more tailored and high-quality response.

  5. How can you specify the output format in a prompt?

    You explicitly ask for it. Be very clear.

    • Example: "Extract the key dates and events from the text below. Provide the output as a JSON object with keys 'date' and 'event'."

  6. What are delimiters and why should you use them?

    Delimiters are characters or symbols (like ###, ```, < >) used to separate different parts of your prompt, such as separating the main instruction from the text you want it to process. They make the prompt structure clear to the model and can help prevent prompt injection.

  7. What is Self-Consistency in prompting?

    It's an advanced technique that builds on Chain-of-Thought. You run the same CoT prompt multiple times (with a higher temperature to get diverse reasoning paths) and then take the majority vote on the final answer. This makes the result more robust and reliable.
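The voting step is simple once you have the sampled answers. Here the LLM sampling is stubbed with a fixed list of final answers extracted from five imaginary reasoning paths:

```python
from collections import Counter

def self_consistency(sample_answer, n_samples: int = 5):
    """Run the same CoT prompt n times (each call should sample a fresh,
    diverse reasoning path) and majority-vote the final answers."""
    answers = [sample_answer(i) for i in range(n_samples)]
    return Counter(answers).most_common(1)[0][0]

# Stub: 3 of 5 sampled reasoning paths land on "5", two diverge.
fake_samples = ["5", "5", "4", "5", "6"]
result = self_consistency(lambda i: fake_samples[i])
print(result)  # "5" wins the majority vote
```

In a real system, `sample_answer` would call the LLM with temperature > 0 and parse the final answer out of each completion.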

  8. What is the ReAct (Reason and Act) prompting framework? 🤖

    ReAct is a framework that enables an LLM to interact with external tools (like a calculator or a search engine). The model generates a cycle of:

    1. Thought: The model reasons about what it needs to do next.

    2. Act: The model decides which tool to use and with what input.

    3. Observation: The model receives the output from the tool and uses it for the next "Thought" step.
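The Thought/Act/Observation cycle can be sketched as a loop. Here the "model" is a scripted stub that emits its steps up front; in a real agent, each step would come from a fresh LLM completion that also sees the previous observation:

```python
def calculator(expression: str) -> str:
    # Toy tool. eval() is fine for this fixed demo input, but never
    # eval untrusted input in a real system.
    return str(eval(expression))

TOOLS = {"calculator": calculator}

# Scripted stand-in for LLM output: one tool call, then a final answer.
scripted_steps = [
    {"thought": "I need to compute 12 * 7.", "act": ("calculator", "12 * 7")},
    {"thought": "I have the result.", "act": None, "answer": "84"},
]

def react_loop(steps):
    observation = None
    for step in steps:
        if step["act"] is None:                 # no action: final answer
            return step["answer"]
        tool, tool_input = step["act"]           # Act: choose tool + input
        observation = TOOLS[tool](tool_input)    # Observation: tool output
    return observation

print(react_loop(scripted_steps))  # 84
```

The essential idea is the alternation: the model reasons, acts through a tool, reads the result, and reasons again until it decides it can answer.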

  9. Why is ReAct powerful?

    It combines the reasoning power of an LLM with the factual, real-time knowledge of external tools, overcoming many of the limitations of a standalone LLM.

  10. What is "prompt chaining"?

    This is the process of breaking a complex task into a series of simpler prompts, where the output of one prompt becomes the input for the next. For example, one prompt could extract keywords, and a second prompt could use those keywords to write an article.
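A chain is just function composition where each function wraps one prompt. Both "LLM calls" below are stubbed with simple string logic so the structure is visible:

```python
def extract_keywords(text: str) -> list[str]:
    # Stub for prompt 1: "Extract the key terms from this text."
    # Here: just keep the longer words, lowercased and de-punctuated.
    return [w.strip(".,").lower() for w in text.split() if len(w) > 7]

def write_outline(keywords: list[str]) -> str:
    # Stub for prompt 2: "Write an article outline using these keywords."
    return "Outline covering: " + ", ".join(keywords)

text = "Retrieval-Augmented Generation improves factual accuracy."
outline = write_outline(extract_keywords(text))  # output of step 1 feeds step 2
print(outline)
```

Frameworks like LangChain exist largely to manage exactly this kind of composition (plus retries, parsing, and branching) at scale.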

  11. What is a "System Prompt"?

    A system prompt is a high-level instruction given at the very beginning of a conversation that defines the AI's persona, capabilities, and constraints for the entire session. It's the foundational instruction that guides all subsequent responses.

  12. How do you handle a long document that doesn't fit in the prompt?

    This is a common interview question. The main methods are:

    • Summarization: Create a summary of the document first.

    • Chunking & Iterating: Break the document into chunks and process them one by one (this is related to the "Map-Reduce" or "Refine" patterns).

    • RAG: The best method. Embed the document chunks into a vector database and retrieve only the relevant parts needed to answer a specific question.

  13. What are some tips for writing a good prompt? (Frequent)

    • Be specific and clear.

    • Provide examples (few-shot).

    • Give the model a persona.

    • Use delimiters.

    • Tell it to think step-by-step.

    • Specify the output format.

  14. What is prompt injection?

    Prompt injection is a security risk where a user crafts an input that manipulates the LLM to ignore its original instructions and follow the user's malicious instructions instead.

  15. What is a "negative prompt"?

    Mostly used in image generation, a negative prompt tells the model what you don't want to see in the output. For example, when generating a realistic photo, the negative prompt might be "cartoon, 3d render, blurry".

Putting It All Together (Scenario Questions)

  1. How would you build a Q&A chatbot for your university's website?

    I would use a RAG architecture.

    1. Index: Scrape all the text from the university website, chunk it into logical sections (e.g., by page or paragraph), create embeddings, and store them in a vector database like Chroma DB.

    2. Retrieve & Generate: When a student asks a question, I would embed their question, retrieve the most relevant chunks of text from the database, and then feed that context and the question into an LLM (like one from OpenAI or a free one from Hugging Face) to generate the answer.

  2. An LLM is giving answers that are factually wrong about current events. What is the best way to fix this?

    The best solution is to implement a RAG system. The retriever can be connected to a live news API or a search engine. This way, the LLM is provided with up-to-the-minute information as context before it generates an answer, ensuring its responses are current.

  3. Your RAG system is slow. What are the first places you would look to optimize?

    I would first check the retrieval step. Is the vector database index configured properly? Can the search be made faster? Then, I would look at the LLM inference speed. Can I use a smaller, faster model? Can I use a more optimized inference server?

  4. The final answers from your RAG system are not following the retrieved context correctly. How could you improve this?

    This is a prompt engineering problem. I would refine the prompt sent to the generator (the LLM). I would add a stronger instruction, such as: "You must answer the user's question using ONLY the provided context. If the answer is not in the context, you must say 'I do not have enough information to answer that.'"

  5. What is the "context window" of an LLM? Why is it important for RAG?

    The context window is the maximum number of tokens an LLM can take as input at one time. It's crucial for RAG because it limits how much retrieved information you can pass to the model. You need to make sure your retrieved chunks plus your query fit within this window.
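A common pattern is to greedily keep retrieved chunks, in relevance order, until a rough token budget is spent. The sketch below approximates tokens as whitespace-separated words; a real system would use the model's own tokenizer:

```python
def fit_context(chunks: list[str], query: str, max_tokens: int = 100) -> list[str]:
    """Keep chunks (assumed sorted by relevance) until the budget runs out."""
    budget = max_tokens - len(query.split())  # reserve room for the query
    kept = []
    for chunk in chunks:
        cost = len(chunk.split())
        if cost > budget:
            break
        kept.append(chunk)
        budget -= cost
    return kept

chunks = [("alpha " * 40).strip(), ("beta " * 40).strip(), ("gamma " * 40).strip()]
kept = fit_context(chunks, "what is alpha", max_tokens=100)
print(len(kept))  # only the first two 40-word chunks fit in the ~97-word budget
```

This also has to reserve space for the system prompt and the model's own output tokens, which the toy version ignores.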

  6. What is a "reranker" in a RAG system?

    A reranker is an optional second stage in the retrieval process. The initial vector search might retrieve, say, the top 20 documents. A reranker (which is often a more sophisticated but slower model) then re-evaluates and re-orders these top 20 documents to find the absolute most relevant ones, improving the final quality of the context.
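The two-stage shape is: a cheap score shortlists candidates, then a more expensive scorer re-orders the shortlist. Both scoring functions below are toy stand-ins (a real reranker is typically a cross-encoder model):

```python
def first_pass_score(query: str, doc: str) -> int:
    # Fast, cheap stage-1 score: keyword overlap (stands in for ANN search).
    return len(set(query.split()) & set(doc.split()))

def rerank_score(query: str, doc: str) -> float:
    # Stand-in for a slower, smarter reranker: rewards exact phrase match.
    return 10.0 if query in doc else float(first_pass_score(query, doc))

docs = ["deep learning models", "learning to cook", "deep sea fishing"]
query = "deep learning"

# Stage 1: shortlist the top 2 by the cheap score.
shortlist = sorted(docs, key=lambda d: first_pass_score(query, d), reverse=True)[:2]
# Stage 2: let the expensive scorer pick the best of the shortlist.
best = max(shortlist, key=lambda d: rerank_score(query, d))
print(best)  # "deep learning models"
```

The point of the two stages is cost: the expensive scorer only ever sees the small shortlist, not the whole corpus.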

  7. What is LangChain or LlamaIndex?

    They are popular open-source frameworks that make it much easier to build applications powered by LLMs. They provide pre-built components for creating complex systems like RAG pipelines, prompt chains, and agents that can use tools.

  8. What are LLM Agents?

    An LLM Agent is a system that uses an LLM as its "brain" to make decisions and use tools to accomplish a complex goal. The ReAct framework is a simple example of an agent. An agent can plan, execute actions, observe the results, and continue until the task is done.

  9. How would you decide which embedding model to use?

    It's a trade-off. I would look at the MTEB (Massive Text Embedding Benchmark) leaderboard. For high performance, I might choose a larger, state-of-the-art model. For a faster, cheaper application, I would choose a smaller, more efficient model like all-MiniLM-L6-v2. The key is to balance performance with computational cost.

  10. If you have a very small, specific dataset, would RAG be effective?

    It can be, but it might be a case where fine-tuning is also a good option. If the goal is to teach the model a very specific style or domain language from that dataset, fine-tuning might be better. If the goal is just to answer questions based on the facts in that dataset, RAG is the perfect tool.
