Resume Work and Project Details

🔹 1️⃣ Conceptual / MMM Basics


Q1. What is Media Mix Modeling (MMM)?

MMM quantifies how different marketing channels and external factors drive business outcomes such as sales or conversions, using regression-based or Bayesian models on historical time-series data.

Unlike attribution models, MMM doesn't require user-level data and works even with aggregated spend data.


Q2. What problem does MMM solve?

It helps marketers measure ROI by channel and optimize future spend allocation while accounting for carryover, seasonality, and external effects.


Q3. What is the dependent variable and what are the independent variables in your models?

Dependent variable: Sales or revenue.
Independent variables: Marketing touchpoints (TV, Digital, Print, Radio) and control variables (Price, Competitor, Macro factors, Seasonality).


Q4. How is MMM different from digital attribution models?

Attribution models rely on user-level tracking (cookies, clicks). MMM uses aggregate, privacy-compliant data and captures offline media as well.


Q5. What's the typical data frequency and duration for MMM?

Weekly or monthly data spanning 2–3 years per market or brand, to capture enough variation in spend.


🔹 2️⃣ Data & Pre-Model Preparation

Q6. How do you prepare data for MMM?

Combine all channel and control variables into a single "stack"; align them on time, granularity, and dimension; handle missing values; define variable groups; and normalize metrics using log transforms.


Q7. How do you treat missing data in MMM?

Fill short gaps using interpolation or carry-forward; longer gaps are flagged or filled via business rules.


Q8. Why is outlier handling important before modeling?

Extreme spend spikes can bias coefficients; we cap or winsorize values at 1.5×IQR or use log transforms.


Q9. How do you test multicollinearity?

Use correlation matrix, VIF, or PCA to identify correlated channels. High VIF variables are merged or dropped.
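The VIF check can be sketched in plain NumPy (synthetic illustrative data; `vif` is a helper written for this note, not a product function):

```python
import numpy as np

def vif(X):
    """Variance Inflation Factor per column: 1 / (1 - R^2) from
    regressing each predictor on the remaining ones."""
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    out = []
    for j in range(k):
        y = X[:, j]
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(others, y, rcond=None)
        resid = y - others @ beta
        r2 = 1 - resid.var() / y.var()
        out.append(1.0 / max(1 - r2, 1e-12))
    return out

rng = np.random.default_rng(0)
tv = rng.normal(100, 10, 156)                # weekly TV spend
digital = 0.9 * tv + rng.normal(0, 3, 156)   # highly correlated with TV
radio = rng.normal(50, 5, 156)               # independent channel
print([round(v, 1) for v in vif(np.column_stack([tv, digital, radio]))])
```

A common rule of thumb: VIF above roughly 5–10 flags a variable for merging, dropping, or constraining.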


Q10. What is the role of the Data Dictionary in your product?

It maintains metadata – variable names, groups (touchpoint/control), measure types, granularity, and spend relationships – ensuring consistent modeling inputs.


🔹 3️⃣ Transformations & Feature Engineering

Q11. What is Adstock transformation?

Models the delayed, carryover impact of media: this week's advertising continues to influence sales in later weeks, typically with geometric decay.


Q12. What is Saturation (Diminishing Returns)?

As spend increases, incremental response flattens. Modeled via log, Hill, or S-curve transformations.


Q13. Why apply both Adstock and Saturation?

Adstock handles time decay; saturation handles diminishing marginal response – together they mimic real consumer behavior.
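A toy sketch of the two transformations chained (the decay and half-saturation values here are illustrative, not calibrated):

```python
import numpy as np

def adstock(spend, decay=0.5):
    """Geometric adstock: effect carries over week to week at rate `decay`."""
    out = np.zeros(len(spend))
    carry = 0.0
    for t, x in enumerate(spend):
        carry = x + decay * carry   # this week's spend plus decayed history
        out[t] = carry
    return out

def hill(x, half_sat, slope=1.0):
    """Hill saturation curve in [0, 1): diminishing marginal response."""
    x = np.asarray(x, dtype=float)
    return x**slope / (x**slope + half_sat**slope)

spend = np.array([100.0, 0, 0, 50, 0])  # a burst, silence, a smaller burst
transformed = hill(adstock(spend, decay=0.6), half_sat=80)
print(transformed.round(3))
```

Note the second and third weeks still show response despite zero spend (carryover), and the response to the 100-unit burst is far less than 2× the response to the 50-unit one (saturation).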


Q14. How do you choose decay and saturation parameters?

Calibrated through optimization or grid search; validated via model fit and business logic (e.g., TV has higher carryover than Display).


Q15. What transformations did you commonly use?

log(x+1), log(adstock/c + 1), Hill functions, and interaction terms such as TV×Digital.


Q16. What is meant by scalar configuration?

A file defining variable-wise transformation metadata – variable name, variable group (VG), type, transformation type, decay rate, saturation, targeting factor – used during model runs.


🔹 4️⃣ Modeling & Estimation

Q17. Which algorithms or model types did you use?

Linear Regression, Constrained Regression, and Hierarchical Bayesian (HB) models.


Q18. Why choose Hierarchical Bayesian (HB) modeling?

It pools information across brands or regions, improving stability when individual datasets are sparse.


Q19. What priors are used in Bayesian MMM?

Normal priors for coefficients, positive constraints for touchpoints, hierarchical priors for pooling, and shrinkage terms to control variance.


Q20. How do you impose sign constraints in models?

During optimization, we bound coefficients (e.g., β ≥ 0 for touchpoints; β ≤ 0 for price) to ensure economic interpretability.
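A hedged sketch of sign-constrained estimation using SciPy's bounded least squares on synthetic data (in practice the constrained/Bayesian estimators described above would be used):

```python
import numpy as np
from scipy.optimize import lsq_linear

rng = np.random.default_rng(1)
n = 104  # two years of weekly observations
tv = rng.normal(50, 10, n)
digital = rng.normal(30, 8, n)
price = rng.normal(10, 1, n)
# True data-generating process: media lifts sales, price depresses them.
sales = 2.0 * tv + 1.5 * digital - 5.0 * price + rng.normal(0, 5, n)

X = np.column_stack([tv, digital, price])
# Bounds encode the business priors: media betas >= 0, price beta <= 0.
res = lsq_linear(X, sales, bounds=([0, 0, -np.inf], [np.inf, np.inf, 0]))
print(res.x.round(2))
```

The bounds guarantee the fitted coefficients can never take an economically nonsensical sign, even under noisy or collinear data.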


Q21. What are direct and indirect models?

Direct: Sales ~ Media + Controls.
Indirect: Media → Awareness → Sales, where awareness acts as an intermediate KPI.


Q22. How do you handle non-stationary time series?

Detrend using differencing or include time dummies/seasonal variables.
In HB models, trend is handled via priors and seasonality terms.


Q23. What is the objective function in MMM regression?

Minimize error between predicted and actual KPI (e.g., MAPE or RMSE) subject to business constraints.


Q24. How do you validate model convergence?

Check diagnostics like MAPE, R², posterior trace plots (for Bayesian), and ensure coefficients are stable across re-runs.


Q25. How do you deal with overfitting?

Cross-validation on hold-out weeks, regularization (ridge/lasso), or hierarchical shrinkage.


🔹 5️⃣ Model Diagnostics & Validation

Q26. What metrics do you use to evaluate MMM models?

In-sample and out-of-sample MAPE, R², and Decomposition alignment with historical performance.


Q27. What's an acceptable MAPE range for MMM?

Typically 5–15% at the aggregate level, depending on data noise and market size.


Q28. How do you validate channel contributions?

Compare modeled contribution with business knowledge (e.g., spend share, campaign uplift); ensure no negative ROI for active channels.


Q29. What is the role of the MOR (Modeling Output Report)?

Summarizes model fit metrics, channel contributions, ROAS, elasticities, and decomposition visuals for review.


Q30. How do you ensure coefficients make business sense?

Sign and magnitude checks, sanity bounds, and elasticity range validations.


Q31. How do you compare different model runs or phases?

Use Phase Compare reports – analyze coefficient drift, contribution shifts, and stability across builds.


🔹 6️⃣ Forecasting & Optimization

Q32. How do you create future forecasts?

Apply COR (Carry-Over Rules) – variable-specific rules like copy-year-ago, decay, or moving average – to simulate future spend levels.


Q33. How is forecast accuracy validated?

Compare predicted vs actuals for subsequent periods; monitor MAPE drift.


Q34. What is ROI vs MROI vs Elasticity?

ROI = Return / Spend.
MROI = Incremental Revenue / Incremental Spend (the marginal return on the next dollar).
Elasticity = %ΔSales / %ΔSpend.
These are derived from model coefficients and used in optimizers.
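A toy calculation with made-up numbers to show how the three metrics differ:

```python
# Average ROI over the whole budget.
spend, incremental_revenue = 1_000_000.0, 1_800_000.0
roi = incremental_revenue / spend               # 1.8

# MROI: return on the *next* dollar, i.e. the slope of the response curve.
extra_spend, extra_revenue = 50_000.0, 60_000.0
mroi = extra_revenue / extra_spend              # 1.2 < ROI: saturation

# Elasticity: % change in sales per % change in spend.
pct_spend, pct_sales = 0.10, 0.04               # +10% spend -> +4% sales
elasticity = pct_sales / pct_spend              # 0.4 (inelastic channel)
print(roi, mroi, elasticity)
```

An optimizer shifts budget toward channels with the highest MROI (not ROI), because MROI reflects what the next dollar will earn.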


Q35. How do business teams use model outputs?

Via the Optimizer UI – they simulate budget reallocations to see how changing spend across channels affects future KPIs.


🔹 7️⃣ Collaboration & Workflow

Q36. How do you collaborate with Data Management team?

Align on data mappings, variable definitions, and calendar settings to ensure consistency between stack and model.


Q37. What is the role of the Fact Funnel?

Aggregates modeled outputs (elasticities, contributions) into the optimizer environment for scenario planning.


Q38. How do you handle model refresh cycles?

For each new data phase, re-generate stack, update transformations, rerun estimation, and validate stability versus previous phase.


Q39. How often are models rebuilt or refreshed?

Typically quarterly or semi-annually, depending on data refresh cadence.


🔹 8️⃣ Business Understanding & Insights

Q40. How do you interpret elasticity results?

Elasticity < 1 → low responsiveness; > 1 → high responsiveness.
Helps prioritize high-ROI channels for future investment.


Q41. What kind of insights does MMM provide to clients?

Channel ROI, optimal spend mix, diminishing return curves, cross-channel synergies, and forecasted revenue impact.


Q42. How do you explain MMM outputs to non-technical stakeholders?

Use decomposition charts showing how each channel contributes to total sales and show incremental revenue for spend changes.


Q43. What's one challenge you faced building MMM models?

Managing multicollinearity among digital channels and aligning model outputs with business expectations – solved via feature grouping and constraints.


Q57. What's your approach if two correlated channels both show high coefficients?

Investigate multicollinearity via VIF; either group them, constrain one, or apply shrinkage priors to stabilize.


Q58. How do you decide whether to include a control variable?

Based on significance, theoretical relevance, and impact on other coefficients; test via nested model comparison.


Q59. What are typical pitfalls in MMM?

Multicollinearity, missing data, overfitting, and over-interpretation of coefficients without business validation.


Q60. How would you explain MMM to a non-technical stakeholder in one line?

"MMM helps you understand how each marketing dollar contributes to sales and guides smarter budget allocation."



Project Summary – MMM Product Support Agentic Bot (Interview Answer)

I built an Event-Driven, Durable, and Observable AI Platform that powers a Product Support Agentic Bot for an MMM (Marketing Mix Modeling) product.
The bot helps internal teams and customers by answering questions using company knowledge bases, past Jira tickets, and product documentation.


🔹 What the System Does

  • Automatically ingests incoming user queries from UI or Slack.

  • Fetches and ranks relevant Knowledge Base documents + historical Jira tickets.

  • Uses a self-hosted LLM to generate accurate support responses, explanations, debugging instructions, and step-by-step agentic actions.

  • Handles long-running tasks like:

    • Searching Jira deeply

    • Regenerating PDFs

    • Re-querying vector DB

    • Multi-step agent workflows

  • Provides complete observability into retrieval quality and reasoning.


🔹 Core Problem Solved

MMM product users frequently need help with:

  • Model interpretation

  • Understanding marketing lift results

  • Troubleshooting model configs

  • Debugging pipeline failures

  • Historical issue references

  • Feature explanations

Support teams spend significant time searching Jira/Confluence, so answers get delayed.

My system makes support instant, accurate, and consistent.


🔹 Why the Architecture

The platform is built using an event-driven + durable workflow design so no user request is ever lost, and multi-step agent workflows are orchestrated safely.

Key components:

  • FastAPI API Layer – auth & request validation

  • Kafka/Redpanda – buffers all user requests, handles traffic spikes

  • Temporal.io – manages long-running agentic workflows with durable execution

  • Qdrant Cloud – stores embedded KB + Jira history for retrieval

  • vLLM (self-hosted) – high-throughput inference for Llama models

  • Arize Phoenix – observability for RAG relevance, hallucination checks

🔹 High-Level Workflow

  1. User asks: "Why is my MMM report showing negative ROI?"

  2. FastAPI receives, authenticates, publishes the query to Kafka.

  3. Temporal workflow starts:

    • Retrieve relevant Jira tickets + KB docs from Qdrant

    • Call vLLM for reasoning

    • Validate / filter / trace all steps

  4. Arize Phoenix logs retrieval relevance and the full trace of each step.

  5. Response is returned with citations and explanations.


🔹 Outcome / Impact

  • 60–80% faster support resolution time.

  • Reduces manual Jira searching by agents.

  • Ensures repeatable and verifiable support reasoning.

  • Allows adding new agent skills without changing infra.

  • Enables traceability, debugging, and model evaluation.


The Enterprise Data Flow Architecture

Flow: 

User → Nginx → FastAPI (Middleware/JWT) → Presidio (PII) → Kafka → Temporal Worker → Unstructured.io → RAPTOR/FastEmbed → Qdrant → vLLM → Guardrails → User Response


🧱 Phase 1: Ingress & Security (The Gatekeepers)

This layer handles the raw traffic, security, and cleaning before any heavy lifting happens.

  1. Nginx (Reverse Proxy):

    • Purpose: The first line of defense. It sits in front of FastAPI to handle SSL termination (HTTPS), compress responses (Gzip), and load balance traffic across your API containers.

  2. FastAPI (The Interface):

    • Purpose: The entry point for your application. It defines the REST endpoints (e.g., /chat, /upload). It is asynchronous and high-performance.

  3. Middleware:

    • Purpose: Code that runs before every request. We use it for Rate Limiting (preventing spam) and CORS (allowing your frontend to talk to the backend).

  4. JWT Tokens (Authentication):

    • Purpose: Stateless security. Instead of checking the database for every request, the user sends a signed "JSON Web Token" that proves who they are (Employee vs. Admin).

  5. Presidio (Data Privacy):

    • Purpose: PII Redaction. Before we process any text, Microsoft Presidio scans it for emails, credit card numbers, or SSNs and replaces them with <REDACTED> so we don't leak sensitive data to the AI.

⚡ Phase 2: Event Bus & Orchestration (The Nervous System)

This layer ensures the system never crashes under load and handles long-running tasks.

  1. Kafka / RabbitMQ / Redis (The Event Bus):

    • Purpose: Backpressure Management. If 5,000 users upload PDFs at once, FastAPI doesn't process them instantly. It pushes a "Job Ticket" to Kafka. This decouples the "Ingestion" from the "Processing," ensuring the API never freezes.

  2. Temporal.io (The Orchestrator):

    • Purpose: Durable Execution. It replaces standard Python loops. A Temporal "Worker" picks up the job from Kafka. If the worker crashes mid-task, Temporal remembers the state and restarts the workflow automatically on a healthy node.
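The decoupling idea can be sketched with an in-memory bounded queue standing in for Kafka and a background thread standing in for a Temporal worker (an illustration of the pattern only; it has none of Kafka's durability or Temporal's replay):

```python
import queue
import threading
import time

jobs = queue.Queue(maxsize=100)  # stand-in for the Kafka topic
done = []

def api_handler(job_id):
    """Like FastAPI: enqueue a 'job ticket' and return immediately."""
    jobs.put({"job_id": job_id})
    return {"status": "accepted", "job_id": job_id}

def worker():
    """Like a worker: drain jobs at its own pace, independent of the API."""
    while True:
        job = jobs.get()
        if job is None:          # poison pill -> shut down
            break
        time.sleep(0.001)        # simulate slow processing
        done.append(job["job_id"])

t = threading.Thread(target=worker)
t.start()
acks = [api_handler(i) for i in range(50)]   # a burst of 50 requests
jobs.put(None)
t.join()
print(len(acks), len(done))
```

The API acknowledges all 50 requests instantly even though the worker processes them one at a time; that gap is exactly what the buffer absorbs.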

⚙️ Phase 3: Data Processing & Ingestion (The ETL Pipeline)

This layer turns raw files into "Smart Data" for the AI.

  1. Unstructured.io (Document Loaders):

    • Purpose: Universal parsing. Instead of basic PyPDF, this tool uses computer vision to extract text cleanly from PDFs, ensuring tables and headers are preserved.

  2. Raptor (Retrieval Strategy):

    • Purpose: Recursive Abstractive Processing. A simplistic chunking strategy cuts text randomly. RAPTOR summarizes clusters of chunks recursively. It allows the AI to answer "High-Level" questions (summaries) as well as "Low-Level" questions (specific facts).

  3. Chunking & Tokenization:

    • Purpose: We used RecursiveCharacterTextSplitter to split text into chunks before embedding. Alternatives: Tiktoken for OpenAI models, or SentenceTransformer / AutoTokenizer tokenizers for HuggingFace models.
  4. FastEmbed (BAAI/bge-small):

    • Purpose: Embedding Generation. Converts the cleaned text chunks into Vector Embeddings (lists of numbers) locally on the CPU, saving latency before sending them to the DB.

🧠 Phase 4: Storage & Retrieval (The Memory)

This layer stores the knowledge base efficiently.

  1. Qdrant DB (Vector Database):

    • Purpose: Stores the embeddings. It is the search engine that finds the "Top 5 Jira tickets" relevant to the user's error.

  2. HNSW / IVF (Indexing Algorithms):

    • Purpose: Search Speed. Instead of scanning every single row (slow), Qdrant uses HNSW (Hierarchical Navigable Small World) graphs to find the nearest neighbors in milliseconds.

  3. Hybrid Search:

    • Purpose: Accuracy. Combines "Semantic Search" (Vectors) with "Keyword Search" (BM25). It ensures we match the specific Error Code ERR-503 (Keyword) while also understanding the concept "Server Overload" (Semantic).

  4. Payload Filtering:

    • Purpose: Multi-Tenancy. Using metadata tags to ensure User A searches only User A's documents, not User B's.

  5. CAP Theorem (Concept):

    • Purpose: Database Design Principle. For Qdrant in this cluster, we prioritize A (Availability) and P (Partition Tolerance) over C (Consistency), meaning it's better to show a slightly stale document than to show an error page.
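Hybrid search ultimately has to merge two ranked lists (semantic and keyword). One common merging technique, shown here as an illustration (Qdrant also ships its own fusion options), is Reciprocal Rank Fusion:

```python
def rrf(rankings, k=60):
    """Reciprocal Rank Fusion: combine several ranked lists into one.
    score(doc) = sum over lists of 1 / (k + rank of doc in that list)."""
    scores = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical result lists for a query about a server-overload error.
semantic = ["doc_overload", "doc_timeout", "doc_err503"]   # dense / vector
keyword  = ["doc_err503", "doc_overload", "doc_retry"]     # sparse / BM25
print(rrf([semantic, keyword]))
```

Documents ranked well by both retrievers float to the top, without needing the two score scales to be comparable.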

🤖 Phase 5: Intelligence & Inference (The Brain)

This layer generates the final answer.

  1. vLLM (Inference Engine):

    • Purpose: High-Throughput Serving. We don't use the raw transformers library. vLLM uses "PagedAttention" to serve the Llama-3 model up to 24x faster, handling multiple users simultaneously on the same GPU.

  2. Quantization:

    • Purpose: Cost/Speed Optimization. We compress the model weights from 16-bit to 4-bit (AWQ/GPTQ). This reduces VRAM usage by 70%, allowing us to run a 70B model on cheaper hardware.

  3. Guardrails (NVIDIA NeMo):

    • Purpose: Safety. It checks the input for "Jailbreaks" (users trying to hack the bot) and checks the output for "Hallucinations" (the bot making up facts) before sending the response.

🔭 Phase 6: Observability & Deployment (The Dashboard)

This layer keeps the lights on.

  1. OpenTelemetry:

    • Purpose: Standardized Tracing. It adds a TraceID to every request so we can track it as it hops from Nginx → Kafka → Qdrant.

  2. Arize Phoenix:

    • Purpose: LLM Observability. A dashboard that visualizes the traces. It tells you exactly which chunk was retrieved and helps you debug why the AI gave a bad answer.

  3. Kubernetes (Load Balancing):

    • Purpose: Scaling. It manages the Docker containers. If CPU usage spikes, Kubernetes automatically spins up more FastAPI pods or Temporal Workers to handle the load.

 


IMP Questions and Answers

🧠 Section 3: LLM Engineering & vLLM (26–40)

Q26. Why is vLLM faster than standard HuggingFace Transformers? A: vLLM uses PagedAttention. Standard transformers allocate contiguous memory for the KV Cache (keys/values), leading to fragmentation (wasted memory). PagedAttention splits KV cache into non-contiguous blocks (like OS virtual memory), allowing near 100% GPU memory utilization.

Q27. What is "Continuous Batching" in vLLM? A: In standard batching, the whole batch waits for the slowest sequence to finish. In Continuous Batching (iteration-level scheduling), as soon as one sequence finishes, vLLM ejects it and inserts a new request immediately, keeping the GPU fully saturated.

Q28. What is the "KV Cache" and why does it consume so much memory? A: The KV Cache stores the Key and Value matrices for every token generated so far to avoid re-computing them. For a 70B model with a long context (e.g., PDF text), this cache can grow to gigabytes per user, becoming the main bottleneck for concurrency.

Q29. How do you host Llama 3 70B if you only have 24GB VRAM cards? A: I would use Tensor Parallelism. I can split the model across multiple GPUs (e.g., 4x A10G). vLLM supports this natively. Alternatively, I would use Quantization (AWQ/GPTQ) to compress weights to 4-bit, reducing memory usage by 4x.

Q30. Explain "Quantization" (4-bit vs 16-bit). What are the trade-offs? A: Quantization represents model weights with fewer bits.

  • Pros: Much lower VRAM usage, faster inference (memory bandwidth).

  • Cons: Slight loss in accuracy/reasoning capability.

Q31. How do you prevent the LLM from "Hallucinating" URLs or IDs? A: I use Guardrails (like NVIDIA NeMo or strict prompt instructions). I can also use "Logit Bias" to force the model to output only tokens present in the context, though this is complex. Best approach: Post-processing verification (Regex check).

Q32. What is "Speculative Decoding"? A: A small "Draft Model" generates tokens quickly. The large "Target Model" verifies them in parallel. If the draft is correct, we accept multiple tokens at once, speeding up inference.

Q33. Explain the difference between "Pre-training", "Fine-tuning", and "RAG". A:

  • Pre-training: Teaching the model language (expensive).

  • Fine-tuning: Teaching the model a specific task/style (moderate).

  • RAG: Giving the model temporary knowledge (cheap, real-time). For this Support Bot, RAG is best because Jira tickets change daily.

Q34. How do you handle "Context Window" limits with large PDFs? A: I use strategies like Map-Reduce (summarize chunks, then summarize summaries) or Refine (iteratively update answer). Or simply use a model with a large context window (128k) like Llama-3.1, managed efficiently by vLLM.

Q35. What is "Temperature" and "Top-P" sampling? A:

  • Temperature: Controls randomness. 0.1 makes the model deterministic (good for code/JSON).

  • Top-P (Nucleus): Restricts the token choice to the top subset of probabilities summing to P.

Q36. How do you evaluate if your vLLM deployment is successful? A: I monitor Time Per Output Token (TPOT) and Tokens Per Second (TPS). High TPS means high throughput; low TPOT means low latency for the user.

Q37. Can vLLM handle multiple LoRA adapters? A: Yes, vLLM supports Multi-LoRA serving. We can have one base Llama-3 model and dynamically load different lightweight adapters (e.g., one for "SQL Generation", one for "Chat") for different requests without reloading the base model.

Q38. What is a "System Prompt" vs "User Prompt"? A: System Prompt sets the behavior ("You are a helpful assistant"). User prompt is the specific input. In Llama-3, these are formatted with specific special tokens (<|begin_of_text|>, etc.) which vLLM handles automatically.

Q39. How do you protect the LLM from "Prompt Injection"? A: I sanitize inputs (strip special characters). I also use "Delimiters" in the prompt (e.g., "Analyze the text inside the XML tags") and instruct the model to ignore instructions found inside the user content.

Q40. What is "Semantic Caching" for LLMs? A: Before calling the LLM, we check if a similar query exists in our Vector DB (Redis/Qdrant). If "How to reset password?" was answered recently, we return the cached answer. This saves money and time.
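A sketch of the semantic-cache idea, with a toy bag-of-letters "embedding" standing in for the real model (`embed`, `cosine`, and `answer` are helpers written for this note):

```python
import math

def embed(text):
    """Toy letter-count 'embedding' - a stand-in for a real embedding model."""
    vec = [0.0] * 26
    for ch in text.lower():
        if ch.isalpha():
            vec[ord(ch) - 97] += 1.0
    return vec

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

cache = {}  # query -> (embedding, answer); in practice Redis/Qdrant with a TTL

def answer(query, llm, threshold=0.95):
    q = embed(query)
    for cached_q, cached_a in cache.values():
        if cosine(q, cached_q) >= threshold:
            return cached_a, "cache_hit"      # skip the expensive LLM call
    a = llm(query)
    cache[query] = (q, a)
    return a, "llm_call"

llm_calls = []
def fake_llm(q):
    llm_calls.append(q)
    return f"answer to: {q}"

print(answer("How to reset password?", fake_llm))
print(answer("How to reset a password?", fake_llm))  # near-duplicate
```

The second, slightly reworded query is served from the cache; only one "LLM call" ever happens. The similarity threshold controls how aggressively the cache matches.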


πŸ” Section 4: RAG & Vector Databases (41–55)

Q41. Explain the "Retriever-Reader" architecture. A: The Retriever (Qdrant) finds relevant documents. The Reader (LLM) takes those documents and generates an answer. The quality of the answer depends 100% on the quality of the retrieval.

Q42. Why Qdrant over Pinecone or Milvus? A: Qdrant is written in Rust (fast, memory safe), open-source (can self-host), and has excellent support for HNSW indexing and Payload Filtering, which is critical for filtering Jira tickets by date/tag.

Q43. What is HNSW? A: Hierarchical Navigable Small World. It's an algorithm for Approximate Nearest Neighbor (ANN) search. It builds a multi-layer graph where upper layers are "highways" for fast traversal and lower layers provide fine-grained search.

Q44. What is "Hybrid Search"? A: Combining Sparse Vector (Keyword/BM25) and Dense Vector (Semantic) search.

  • Why: "Error 503" is a keyword. "Server overload" is a concept. Hybrid search catches both.

Q45. How does "Re-ranking" improve RAG? A: The vector DB returns the top 50 documents (fast but less accurate). A Cross-Encoder Model (Re-ranker) reads these 50 pairs carefully and scores them effectively, returning the top 5 truly relevant ones to the LLM.

Q46. What is the optimal "Chunk Size"? A: It depends. For Jira tickets, I chunk by "Issue + Resolution". For PDFs, I use 512 tokens with 50 overlap. Small chunks lose context; large chunks confuse the retrieval. I use RecursiveCharacterTextSplitter.
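A simplified fixed-size chunker with overlap (a stand-in for RecursiveCharacterTextSplitter, which additionally respects separators such as paragraphs and sentences):

```python
def chunk(tokens, size=512, overlap=50):
    """Split a token list into windows of `size` with `overlap` shared tokens,
    so no fact is cut exactly at a chunk boundary."""
    if size <= overlap:
        raise ValueError("size must exceed overlap")
    step = size - overlap
    return [tokens[i:i + size]
            for i in range(0, max(len(tokens) - overlap, 1), step)]

tokens = list(range(1200))           # pretend token ids for a document
chunks = chunk(tokens, size=512, overlap=50)
print([len(c) for c in chunks])      # chunk lengths
```

Each chunk's first 50 tokens repeat the previous chunk's last 50, which is the "50 overlap" mentioned above.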

Q47. How do you handle "Metadata Filtering" efficiently? A: In Qdrant, I create a "Payload Index" on fields like ticket_status or product_version. This allows Qdrant to filter during the graph traversal (pre-filtering), which is much faster than filtering results after search.

Q48. Explain "Embeddings". A: A vector representation of text in N-dimensional space. Semantic similarity = distance between points. I use BAAI/bge-small-en-v1.5 (384 dimensions) for a balance of speed and accuracy.

Q49. What is "Lost in the Middle" phenomenon? A: LLMs pay more attention to the start and end of the context window. If the correct answer is buried in the middle of 10 retrieved documents, the LLM might miss it. Re-ranking helps put the best doc first.

Q50. How do you measure Retrieval Quality? A: I use metrics like Hit Rate (is the correct doc in top-k?) and MRR (Mean Reciprocal Rank). I use Arize Phoenix to track these metrics in production.

Q51. What is "Hypothetical Document Embeddings" (HyDE)? A: Instead of searching with the user's question, the LLM generates a fake answer, and we search for documents similar to that fake answer. This often yields better semantic matches.

Q52. How do you handle PDF tables in RAG? A: Standard text extraction breaks tables. I use LlamaParse or Unstructured.io, which detects table layouts and converts them to Markdown/HTML tables, preserving the row/column structure for the LLM.

Q53. What is "Binary Quantization" in Vector DBs? A: Compressing float32 vectors into 1-bit binaries. It reduces memory usage by 32x and speeds up search significantly, with minimal accuracy loss for high-dimensional vectors.

Q54. How do you handle "Duplicate Documents" in RAG? A: I implement a deduplication step during ingestion using content hashing (MD5/SHA256). If the hash exists, I skip ingestion.
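A sketch of hash-based dedup at ingestion time (the chunking/embedding/upsert step is elided; `seen` would be a persistent store in production):

```python
import hashlib

seen = set()  # in production: a persistent store keyed by content hash

def ingest(doc_text):
    """Skip ingestion if this exact content was already embedded."""
    normalized = doc_text.strip().lower()
    digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
    if digest in seen:
        return "skipped"
    seen.add(digest)
    # ... chunk, embed, and upsert into the vector DB here ...
    return "ingested"

print(ingest("ERR-503: server overload. Fix: scale the worker pool."))
print(ingest("ERR-503: server overload. Fix: scale the worker pool."))  # dup
```

Light normalization (strip/lower) before hashing catches trivially re-formatted duplicates; fuzzier near-duplicates would need embedding-similarity checks instead.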

Q55. What is "Parent Document Retriever"? A: We index small chunks (for better search matching) but return the parent (larger) chunk to the LLM. This gives the LLM full context while keeping search precise.

Q151. Why Qdrant?

Supports:

  • Distributed clusters

  • Payload filtering

  • Fast ANN search

  • Good for RAG pipelines at scale

Q152. What embedding models do you use?

BGE, MPNet, or domain-specific fine-tuned models.

Q153. How do you store metadata?

As payloads: Jira ID, document type, KB tags, timestamps.

Q154. Why shard Qdrant?

To scale with document volume and query throughput.

Q155. How do you handle updates?

Soft delete → upsert → background reindex.

Q156. How do you evaluate retrieval quality?

Arize Phoenix metrics: Recall@K, precision, rank.

Q157. How do you prevent irrelevant chunks?

Hybrid search: vector + keyword filters + metadata.

Q158. How do you ensure fast search?

HNSW index + quantization + locality filtering.

Q159. How do you ensure multi-tenancy?

Namespaces or payload fields + filters.

Q160. How do you monitor Qdrant?

Latency, query timeouts, shard unavailability alerts.

Q161. Why vLLM?

Best throughput for Llama models with PagedAttention + continuous batching.

Q162. How do you load models?

FP16 or 4-bit quantized Llama 3.1/3.2 for cost/performance.

Q163. How do you handle batching?

vLLM auto-batching groups similar-length prompts for max GPU utilization.

Q164. How do you ensure low latency?

  • Warm model

  • Continuous batching

  • GPU affinity

  • Prompt caching

Q165. Why self-host vs API (OpenAI)?

  • Data privacy

  • Cost reduction

  • Predictable latency

  • No rate limits

Q166. How do you monitor GPU usage?

NVIDIA DCGM exporter → Prometheus → Grafana.

Q167. How do you handle inference failures?

Retries inside Temporal, fallback small model.

Q168. How do you avoid hallucination?

RAG grounding + Arize hallucination detectors.

Q169. How do you handle long contexts?

Chunking + retrieval-augmented attention windows.

Q170. How do you add new skills?

Add new workflows and prompts; no infra changes.

Q171. How do you handle multi-model routing?

Router chooses model based on intent classification.

Q172. What is PagedAttention?

Memory-efficient KV caching enabling high throughput.

Q173. Why not TGI or Triton?

vLLM has better throughput for Llama-family models.

Q174. How do you secure your models?

Private VPC, IAM policies, token-based auth.

Q175. How do you test your prompts?

A/B testing + RAG eval metrics.

Q193. How does your RAG pipeline work?

Query → embedding → Qdrant search → reranker → LLM → final answer.

Q194. How do you use Jira tickets?

Embed all ticket text + tags → store in Qdrant → retrieve relevant historical issues.

Q195. How do you make it agentic?

Each agent step = Temporal Activity (search Jira, regenerate report, etc.).

Q196. How do you avoid incorrect agent actions?

Validation layers + guardrails + temporal checks.

Q197. Why is this ideal for MMM product support?

It rapidly answers questions using domain knowledge + historical issues.

Q198. How do you scale to new domains?

Change embeddings + knowledge base, workflows remain same.

Q199. What makes this production-grade?

Durability (Temporal), resilience (Kafka), observability (Arize), scalable inference (vLLM).



Additional Questions and answers :

πŸ›️ Section 1: System Design & Architecture (1–15)

Q1. Design a system for MMM Support Bot to handle 10,000 concurrent requests. What are the key bottlenecks? A: I would use an Event-Driven Architecture.

  • Ingress: Nginx Load Balancer distributing traffic to stateless FastAPI pods on Kubernetes.

  • Bottleneck 1 (Compute): PDF parsing and RAG are CPU-intensive. I'd decouple this using a Message Queue (Kafka) so the API doesn't block.

  • Bottleneck 2 (DB): A single Qdrant instance will choke. I'd use a Sharded Qdrant Cluster.

  • Bottleneck 3 (LLM): Public APIs rate-limit. I'd use self-hosted vLLM on autoscaling GPU nodes.

Q2. Explain the difference between Monolithic and Microservices architecture in the context of this bot. A:

  • Monolith: The API, PDF parser, and RAG logic run in one process. If the PDF parser crashes (OOM error), the API goes down.

  • Microservices: We split them into Ingestion Service (FastAPI), Worker Service (Temporal/Python), and Inference Service (vLLM). If the Worker crashes, the API can still accept requests.

Q3. Why did you choose Event-Driven Architecture over REST for internal communication? A: REST is synchronous; the caller waits. For long-running tasks like "Generate 50-page PDF", HTTP connections would time out. Event-Driven (Kafka) allows "fire and forget": the API pushes a job and immediately returns a job_id to the user, ensuring high responsiveness.

Q4. What is the "Backpressure" problem and how does your architecture solve it? A: Backpressure happens when users send requests faster than workers can process them. Without a queue, the server crashes. By using Kafka/RabbitMQ, we buffer the surge. The workers process at their own pace, effectively "flattening the curve" of traffic.

Q5. How do you ensure "High Availability" (HA) for the Vector Database? A: I deploy Qdrant in a Distributed Mode with a replication factor of 2 or 3. If Node A goes down, Node B serves the traffic. I also use a load balancer in front of the Qdrant cluster.

Q6. What is the CAP theorem and which two did you pick for your Support Bot? A: CAP stands for Consistency, Availability, Partition Tolerance. For a Support Bot, I chose AP (Availability & Partition Tolerance) via Eventual Consistency. It's better for a user to see a slightly stale ticket search result than a 500 Error.

Q7. How do you handle "Idempotency" in the API? A: If a user clicks "Submit" twice, we shouldn't incur the LLM cost twice. I implement an Idempotency Key (usually a hash of the file content + user ID) in Redis. If a second request comes with the same key, we return the cached result.
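A minimal sketch of the pattern with an in-memory dict standing in for Redis (`submit` and `process` are hypothetical names):

```python
import hashlib

results = {}  # idempotency key -> cached response; in practice Redis + TTL

def submit(user_id, file_bytes, process):
    """Return the cached response if this exact (user, content) pair
    was already processed; otherwise do the work once and cache it."""
    key = hashlib.sha256(user_id.encode("utf-8") + file_bytes).hexdigest()
    if key in results:
        return results[key]           # duplicate click: no second LLM cost
    response = process(file_bytes)
    results[key] = response
    return response

calls = []
def process(file_bytes):
    calls.append(file_bytes)          # stands in for the expensive pipeline
    return f"report for {len(file_bytes)} bytes"

print(submit("alice", b"pdf-bytes", process))
print(submit("alice", b"pdf-bytes", process))  # same user + content -> cached
```

Two identical submissions trigger only one call to the expensive pipeline; a different user (or different content) hashes to a new key and is processed normally.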

Q8. Explain "Statelessness" and why it matters for Kubernetes. A: Stateless means the app doesn't save client data (like sessions or uploaded files) in its local memory/disk. This allows Kubernetes to kill/restart pods at will without losing data. I offload state to Redis (sessions) and S3 (files).

Q9. What is a "Circuit Breaker" pattern? A: It prevents cascading failures. If the vLLM service fails 5 times in a row, the Circuit Breaker "trips" and stops sending requests for 30 seconds, returning a fallback error immediately. This gives the vLLM service time to recover.
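A compact, illustrative circuit breaker (a sketch of the pattern, not a library implementation; the thresholds are the ones mentioned above):

```python
import time

class CircuitBreaker:
    """Trips after `max_failures` consecutive errors; rejects calls for
    `reset_after` seconds, then lets one attempt through again (half-open)."""

    def __init__(self, max_failures=5, reset_after=30.0):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None

    def call(self, fn, *args):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                raise RuntimeError("circuit open: failing fast")
            self.opened_at, self.failures = None, 0   # half-open: try again
        try:
            result = fn(*args)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()     # trip the breaker
            raise
        self.failures = 0                             # success resets count
        return result

breaker = CircuitBreaker(max_failures=3, reset_after=30.0)
```

Once tripped, calls fail instantly instead of piling more load onto the struggling downstream service, which is what prevents the cascading failure.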

Q10. How do you secure the internal communication between microservices? A: I use mTLS (Mutual TLS) via a Service Mesh (like Istio or Linkerd) or simple JWT tokens passed between services to ensure only authorized services can talk to the LLM backend.

Q11. Why use Nginx as a Reverse Proxy? A: It handles SSL termination, Gzip compression, and basic load balancing, freeing up the FastAPI application to focus solely on application logic.

Q12. What is "Database Sharding"? A: Splitting a large database into smaller, faster pieces (shards) across multiple servers. For Qdrant, we shard by TenantID (Client ID) so that searching Client A's data doesn't scan Client B's vectors.

Q13. How do you handle "Thundering Herd" problem? A: This occurs when many services retry a failed connection simultaneously. I implement Exponential Backoff with Jitter (randomized delay) in my retry logic to spread out the reconnection attempts.
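The "full jitter" variant of this retry policy can be sketched as (base delay and cap are illustrative):

```python
import random

def backoff_delays(retries, base=0.5, cap=30.0, rng=random.Random(42)):
    """Exponential backoff with full jitter:
    delay_n ~ Uniform(0, min(cap, base * 2**n)).
    Randomizing spreads reconnection attempts so clients don't stampede."""
    return [rng.uniform(0, min(cap, base * (2 ** n))) for n in range(retries)]

print([round(d, 2) for d in backoff_delays(6)])
```

The upper bound doubles on each retry, but the actual sleep is drawn uniformly below it, so thousands of clients recovering at once hit the service at scattered times.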

Q14. Explain "Blue-Green Deployment". A: We run two identical environments (Blue = Production, Green = New Version). We deploy to Green, test it, and then switch the Load Balancer to point to Green. If issues arise, we instantly switch back to Blue.

Q15. How would you design a "Multi-Tenant" architecture for this bot? A: I would add a tenant_id column to every PostgreSQL table and Qdrant payload. The API Middleware extracts the tenant_id from the API Key and enforces a filter on every DB query to ensure data isolation.


⏱️ Section 2: Orchestration & Temporal.io (16–25)

Q16. Why use Temporal instead of just Python asyncio or Celery? A: Celery is a task queue, not an orchestrator. If a worker crashes mid-task, Celery might retry, but it loses local state. Temporal provides Durable Execution—it persists the state of the workflow (history) to a DB. If a process crashes, it resumes exactly where it left off.

Q17. Explain the difference between "Workflows" and "Activities" in Temporal. A:

  • Workflow: The deterministic code that defines the sequence of steps (e.g., "Step 1, then Step 2"). It cannot communicate with the outside world directly.

  • Activity: The code that does the actual work (API calls, DB queries). It can fail and be retried.

Q18. What does "Deterministic" mean in Temporal Workflows? A: It means the code must produce the exact same command sequence if replayed. You cannot use random.randint(), datetime.now(), or thread spawning inside a Workflow, because replay would break.

Q19. How do you handle a "Human-in-the-loop" step (e.g., waiting for approval) in Temporal? A: I use workflow.wait_condition(). The workflow suspends execution (sleeping effectively for free) until it receives a Signal from the API (e.g., user clicks "Approve"). It can wait for days without consuming CPU.

Q20. How do you test a Temporal Workflow? A: Temporal provides a TestWorkflowEnvironment. I can mock the Activities (e.g., mock the PDF generator) and assert that the Workflow executes the correct sequence of steps.

Q21. What is a "Saga Pattern" and how does Temporal implement it? A: Sagas handle distributed transactions. If Step 3 fails, we must "undo" Steps 1 and 2. In Temporal, I use a try/finally block where the finally triggers compensating activities (e.g., "Delete S3 File" if "Send Email" fails).

Q22. What happens if the Temporal Service itself goes down? A: Temporal relies on a persistence layer (Cassandra/Postgres). If the service nodes die, the data is safe in the DB. When nodes restart, they read the DB and resume all workflows.

Q23. Explain "Signals" and "Queries" in Temporal. A:

  • Signal: Pushing data into a running workflow (e.g., "Here is the user's updated prompt").

  • Query: Pulling data out of a running workflow (e.g., "What is the current status?").

Q24. How do you handle versioning in Temporal Workflows? A: If I change the workflow logic, old running workflows might break (non-determinism). I use workflow.patched() to insert branching logic: "If this is an old workflow, take Path A; if new, take Path B."

Q25. Can Temporal use vLLM efficiently? A: Yes. Temporal manages the request flow. It can implement a rate-limiting activity to ensure we don't overwhelm the vLLM server, queuing workflows until capacity is available.


πŸ“¨ Section 6: Event-Driven & Async (Kafka) (71–80)

Q71. Why Kafka over Redis Pub/Sub? A: Redis Pub/Sub is "Fire and Forget" (messages are lost if no one listens). Kafka provides Persistence. If my worker crashes, the message stays in Kafka until it is acknowledged (committed).

Q72. What is a "Consumer Group"? A: A group of workers reading from a topic. Kafka ensures each message is delivered to only one consumer in the group, allowing me to scale processing by adding more workers.

Q73. What is "Dead Letter Queue" (DLQ)? A: If a message fails processing 5 times (e.g., malformed PDF), we move it to a DLQ topic instead of blocking the main queue. Engineers can inspect the DLQ later.

Q74. Explain "At-least-once" vs "Exactly-once" delivery. A:

  • At-least-once: Message might be delivered twice (e.g., if worker crashes before ack). My system must be Idempotent to handle this.

  • Exactly-once: Hard to achieve, uses transactions.

Q75. How do you monitor Kafka lag? A: "Lag" is the difference between the latest message produced and the last message processed. High lag means I need to scale up my workers.

Q76. What is "AsyncIO" event loop? A: A single-threaded loop that pauses tasks waiting for I/O (like DB calls) and runs other tasks. This allows Python to handle thousands of connections on one thread.

Q77. What is the difference between await and yield? A: await pauses execution until an awaitable (coroutine or future) resolves. yield produces a value and pauses, effectively creating a generator (used for streaming).

Q78. How do you handle "Race Conditions" in async code? A: I use asyncio.Lock or DB-level row locking (SELECT ... FOR UPDATE) to ensure two tasks don't modify the same ticket status simultaneously.
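The lost-update scenario and the asyncio.Lock fix can be sketched like this; the ticket-counter state is illustrative, and `asyncio.sleep(0)` stands in for a DB round-trip that yields control:

```python
import asyncio

async def racy_main(n=100):
    """Read-modify-write with an await in between and NO lock: updates get lost."""
    state = {"open_tickets": 0}

    async def update():
        current = state["open_tickets"]     # every task reads before any writes
        await asyncio.sleep(0)              # yield point (e.g., a DB call)
        state["open_tickets"] = current + 1

    await asyncio.gather(*(update() for _ in range(n)))
    return state["open_tickets"]            # far less than n

async def safe_main(n=100):
    """Same update serialized with asyncio.Lock: no updates lost."""
    state = {"open_tickets": 0}
    lock = asyncio.Lock()

    async def update():
        async with lock:                    # only one task in the critical section
            current = state["open_tickets"]
            await asyncio.sleep(0)
            state["open_tickets"] = current + 1

    await asyncio.gather(*(update() for _ in range(n)))
    return state["open_tickets"]            # exactly n
```

`SELECT ... FOR UPDATE` plays the same role at the database layer when the competing writers live in different processes, where an in-process lock cannot help.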

Q79. What is "Graceful Shutdown"? A: When the server receives a SIGTERM signal, it stops accepting new requests but finishes processing current ones before killing the process.

Q80. How does Redis act as a Broker? A: In Celery, Redis stores the list of tasks. Workers "pop" tasks from the list. It's simpler than Kafka but holds data in memory (less durable).


πŸ“Š Section 7: Observability & Evals (81–90)

Q81. What is Arize Phoenix used for? A: It is an observability platform specifically for LLM apps. It traces the execution path (LangChain/LangGraph) and evaluates retrieval quality.

Q82. What is "RAGAS"? A: Retrieval Augmented Generation Assessment. It’s a framework (often used with Arize) to calculate metrics like Faithfulness (did the LLM lie?) and Answer Relevance.

Q83. How do you detect "Hallucinations" in production? A: Arize Phoenix uses an "LLM-as-a-Judge" approach. It uses a strong model (like GPT-4) to grade the Llama-3 response against the retrieved chunks. If the response contains facts not in the chunks, it flags it as hallucination.

Q84. What is "Distributed Tracing" (OpenTelemetry)? A: It assigns a TraceID to a request. As the request moves from Nginx -> FastAPI -> Kafka -> Worker -> Qdrant, every log shares that ID. I can visualize the full "waterfall" to find latency spikes.

Q85. What is the difference between Metrics, Logs, and Traces? A:

  • Metrics: Numbers (CPU usage, Requests/sec).

  • Logs: Text events ("Error: File not found").

  • Traces: The path of a request across services.

Q86. How do you visualize Embeddings? A: Arize Phoenix provides a 3D UMAP visualization. I can see clusters of user queries. If I see a cluster of queries far away from my knowledge base documents, I know I have a "Data Gap".

Q87. What is "Drift" in LLMs? A: When the input distribution changes over time (e.g., users start asking about a new product feature we haven't documented). Monitoring embedding clusters helps detect this.

Q88. How do you handle PII redaction in logs? A: I use libraries like Presidio or simple Regex to mask emails and credit card numbers before sending logs to Arize or Datadog.

Q89. Explain "Golden Dataset". A: A manually curated set of Question-Answer pairs used for regression testing. Before deploying a new prompt, I run the bot against the Golden Dataset to ensure accuracy hasn't dropped.

Q90. What is "Latency P99"? A: The time within which 99% of requests finish. It measures the "worst-case" performance. Optimizing P99 is crucial for enterprise SLAs.
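A nearest-rank percentile sketch for computing P99 from raw request latencies (one of several common percentile definitions; monitoring systems may interpolate instead):

```python
import math

def percentile_latency(latencies_ms, q=0.99):
    """Nearest-rank percentile: the value below which a fraction q of requests finish."""
    ordered = sorted(latencies_ms)
    rank = math.ceil(q * len(ordered))   # 1-based nearest rank
    return ordered[rank - 1]
```

With 100 samples, P99 is the 99th-slowest request, so a single pathological request dominates the metric even when the average looks healthy.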


🚒 Section 8: Behavioural

Q95. How do you handle Secrets in production? A: Never in code. I use AWS Secrets Manager or HashiCorp Vault. K8s injects them as Environment Variables at runtime.

Q96. Behavioral: "Tell me about a time you prioritized Trade-offs." A: "In the POC, I used LlamaParse Cloud (API). It was accurate but slow/costly. For production, I traded off a bit of accuracy for speed/privacy by switching to a local Unstructured library or a fine-tuned vision model hosted on vLLM."

Q97. Behavioral: "How do you handle disagreement on tech stack?" A: "My team wanted to use MongoDB. I argued for PostgreSQL because our data (Users, Tickets) is relational/structured. I created a prototype showing how complex joins would be in Mongo, and the team agreed to use Postgres for metadata and S3 for blobs."

Q98. Scenario: "The RAG system is retrieving irrelevant documents. How do you fix it?" A:

  1. Check Chunking (too small?).

  2. Check Embedding Model (is it domain-specific?).

  3. Implement Hybrid Search (Keywords).

  4. Implement Re-ranking (Cross-encoder).

Q99. Scenario: "The Bot is responding slowly (10s+). Debug it." A:

  1. Check Traces in Arize. Is it Retrieval or Inference?

  2. If Inference: Is vLLM batching correctly? Do we need more GPUs?

  3. If Retrieval: Is Qdrant index optimized?

  4. If Network: Is the payload too large?


Q100. Explain your architecture in one line.

We built an event-driven, durable AI platform using Kafka + Temporal + vLLM + Qdrant + Arize that powers an agentic MMM support bot grounded in the knowledge base and historical Jira tickets.

Q101. Why event-driven?

Event-driven ensures decoupling, natural backpressure handling, async processing, and no request drops during spikes.

Q102. Why durable?

Because support workflows involve long-running tasks (Jira search, multi-step retrieval) — Temporal guarantees we never lose state, even if servers crash.

Q103. Why observable?

RAG systems fail silently. We needed traceability of retrieval quality, hallucination, latency. Arize Phoenix provides that.

Q104. What is the main user flow?

User → FastAPI → Kafka → Temporal Workflow → Qdrant retrieval → vLLM inference → Arize trace → Response.

Q105. Is this microservices-based?

Yes — API, ingestion workers, orchestrators, vector DB, inference engine are independent and communicate via Kafka/Temporal signals.

Q106. Why not synchronous APIs?

Support workflows are long-running; sync APIs time out. Asynchronous event-driven guarantees reliability and scalability.

Q107. What problem does this architecture solve?

Instantly answers MMM product questions using KB + Jira history with resilient, scalable, traceable agentic workflows.

Q108. What design patterns are used?

  • Event Sourcing

  • Saga Pattern

  • CQRS

  • Workflow Orchestration

  • FAN-IN / FAN-OUT pipelines

Q109. How do you ensure scalability?

Scale Kafka partitions, Temporal workers, and vLLM replicas independently.

Q110. Why Kafka and not SQS/SNS?

Kafka supports high throughput, ordering, replays, retention, and consumer groups.


πŸ“Œ SECTION 3 — Event Bus & Kafka (Q119–Q130)

Q119. Why use Kafka/Redpanda?

High-throughput buffer for requests; prevents overload of downstream systems.

Q120. How do you partition data?

Partition key = user_id or tenant_id for ordering; or random for max throughput.
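The key-based choice can be sketched with a stable hash; Kafka's default partitioner uses murmur2, but any stable hash illustrates the property that matters, namely that one tenant always lands on one partition (preserving per-tenant ordering):

```python
import hashlib

def choose_partition(key: str, num_partitions: int) -> int:
    """Map a partition key (e.g., tenant_id) to a partition deterministically.

    Python's builtin hash() is randomized per process, so a stable hash
    like MD5 is used to get the same mapping on every producer.
    """
    digest = hashlib.md5(key.encode()).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions
```

Passing `key=None` to a real producer (random/sticky assignment) trades that ordering guarantee for maximum throughput, which is the choice described above.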

Q122. How do you achieve durability?

Kafka replication factor 3 (prod), fsync enabled, Acks=all.

Q124. Why not use RabbitMQ?

RabbitMQ is great for tasks but lacks replay, retention, scale, streaming, stateful consumers.


πŸ“Œ SECTION 4 — Temporal (Durable Execution) (Q131–Q150)

Q131. Why Temporal.io?

Guarantees long-running workflows never lose state, even if services crash.

Q132. What are durable workflows?

Workflows whose execution state is persisted; can recover from failure exactly at the last step.

Q133. What types of tasks run in Temporal?

  • RAG workflow

  • Query → Qdrant → vLLM

  • Jira search

  • PDF generation

  • Multi-step agentic flows

Q134. What happens on worker crash?

Workflow pauses → Temporal replays history → Worker resumes seamlessly.

Q135. How do you handle long-running tasks?

Use Temporal Activities with heartbeats + retries + timeouts.

Q136. How do you guarantee idempotency?

Workflow ID = request_id ensures the same event never creates duplicate workflows.

Q137. How do you handle versioning?

Temporal's workflow versioning API (e.g., workflow.patched) for backward-compatible workflow updates.

Q138. How do you schedule retries?

Exponential backoff + max attempts + jitter.

Q139. How do you orchestrate agentic steps?

Each step = Activity; RAG pipeline is a Saga with compensating actions.

Q140. How do you call external services safely?

Timeouts, retries, circuit breakers, isolation per activity.

Q141. How do you perform fan-out operations?

Temporal child workflows or parallel activities.

Q142. How do you aggregate results?

FAN-IN: join child workflows via promises.

Q143. How long can your workflows run?

Days → Weeks. Temporal stores complete state.

Q144. How do you signal workflows?

Signals + queries to modify workflow state or send new instructions.

Q145. How do you handle schema changes?

Workflow versioning + payload converters.

Q146. How do you ensure workflow observability?

Temporal UI → execution history → step-by-step replay.

Q147. Why Temporal over Airflow?

Airflow is built for scheduled, batch-oriented DAGs; Temporal is event-driven, real-time, and durable.

Q148. Why not Step Functions?

Temporal provides better dev UX, local testing, deterministic replay, no vendor lock-in.

Q149. How do you scale workflows?

More workers → more concurrency → Kafka partitions scale too.

Q150. What consistency model does Temporal provide?

Strong consistency for workflow state.


πŸ“Œ SECTION 7 — Arize Phoenix (Observability) (Q176–Q185)

Q176. Why Arize Phoenix?

Purpose-built RAG observability for trust, traceability, and quality monitoring.

Q177. What do you track?

  • Retrieval relevance

  • Latency

  • Hallucination

  • Semantic drift

  • Chunk ranking

Q178. How do you integrate Phoenix?

Send input, retrieved docs, LLM output, timestamps, trace_id.

Q179. What metrics matter?

Recall@5, Precision@K, P50/P95 latency, hallucination rate.

Q180. What is RAG attribution?

Shows which chunk influenced the LLM’s final answer.

Q181. How do you debug bad answers?

Trace pipeline → check retrieval → rewriting → final answer.

Q182. How do you detect hallucination?

Embed answer → compare with retrieved context.

Q183. How do you monitor cost?

Tokens per request + GPU time.

Q184. How do you detect pipeline drift?

Difference in embedding clusters over time.

Q185. Why observability is critical?

RAG fails silently → need visibility into each step.




EXAMPLE workflow : 

Here is the step-by-step execution flow for your specific example using the Enterprise Architecture.

We will break this into two separate journeys:

  1. The Write Path: Uploading the Jira Documents (Ingestion).

  2. The Read Path: The User asking "What is LLM?" (Retrieval).


πŸ”„ Scenario 1: The Ingestion Flow (Uploading Jira Docs)

Goal: The user uploads a PDF export of Jira tickets to add to the bot's knowledge base.

1. The Entry (Ingress)

  • User Action: Uploads jira_export.pdf.

  • Tool: Nginx (Reverse Proxy).

  • What happens: Nginx terminates the SSL (decrypts HTTPS) and routes the request to the Ingestion Service.

  • Architecture Note: Nginx protects the internal servers from direct internet exposure.

2. Authentication & Validation

  • Tool: FastAPI + JWT + Pydantic.

  • What happens:

    • FastAPI Middleware checks the Authorization: Bearer <token> header.

    • Pydantic validates that the file type is .pdf and size is under 10MB.

  • Architecture Note: "Stateless Auth". We don't check a session DB; we verify the cryptographic signature of the token.

3. The Handoff (Async Buffer)

  • Tool: Kafka (or Redpanda).

  • What happens: FastAPI does not parse the file. It saves the file to MinIO (S3) and pushes a small message {"job_id": "123", "file_path": "s3://bucket/jira.pdf"} to the ingestion_topic in Kafka.

  • Response: FastAPI immediately returns 202 Accepted: Processing started to the user.

  • Architecture Note: Event-Driven. If 10,000 users upload files simultaneously, the API doesn't crash. It just queues 10,000 messages.

4. The Orchestration (Reliable Execution)

  • Tool: Temporal.io.

  • What happens: A Temporal Worker (Python) listening to the queue picks up the job. It starts a workflow IngestDocumentWorkflow.

  • Architecture Note: Durable Execution. If the server power cuts out now, Temporal remembers "I started Job 123 but didn't finish." When power returns, it restarts the job automatically.

5. The Cleaning (ETL)

  • Tool: Unstructured.io + Microsoft Presidio.

  • What happens:

    • Unstructured.io extracts text from the PDF, handling tables and headers intelligently.

    • Presidio scans the extracted text for PII (names, emails) and redacts them (e.g., replacing john@company.com with <EMAIL_REDACTED>).

6. The "Learning" (Embedding)

  • Tool: FastEmbed (BAAI/bge-small-en).

  • What happens: The Temporal worker chunks the text (e.g., 500 tokens) and converts them into vectors (arrays of numbers like [0.1, -0.5, 0.8...]).

  • Architecture Note: We run this locally on the worker CPU to save cost/latency before hitting the DB.

7. Storage (Memory)

  • Tool: Qdrant.

  • What happens: The vectors are "Upserted" (Upload + Insert) into the Qdrant cluster.

  • Architecture Note: HNSW Indexing. Qdrant immediately builds a graph index so this new data is searchable in milliseconds.


❓ Scenario 2: The Query Flow (User asks "What is LLM?")

Goal: The user asks a question, and the bot answers using the Jira docs we just uploaded.

1. The Request

  • User Action: Types "What is LLM?" in the Chat UI.

  • Tool: FastAPI.

  • What happens: Request hits the /chat endpoint.

2. Observability Start

  • Tool: OpenTelemetry.

  • What happens: A TraceID is generated. We start a timer to measure latency.

3. Vectorizing the Query

  • Tool: FastEmbed.

  • What happens: The API converts the question "What is LLM?" into a vector [0.05, -0.3...].

4. The Search (Retrieval)

  • Tool: Qdrant (Hybrid Search).

  • What happens:

    • Dense Search: Finds vectors mathematically close to the question vector.

    • Keyword Search: Looks for exact matches of "LLM".

    • Payload Filter: Filters results to ensure tenant_id == "user_123" (Security).

  • Result: Qdrant returns the top 3 chunks from jira_export.pdf.

5. The Brain (Inference)

  • Tool: vLLM (hosting Llama-3).

  • What happens: FastAPI constructs a prompt:

    Plaintext
    System: Answer using only the context below.
    Context: <Content from Jira PDF retrieved from Qdrant>
    User: What is LLM?
    

    It sends this to the vLLM service.

  • Architecture Note: Continuous Batching. vLLM slots this request into the GPU alongside 50 other requests, processing them in parallel without wait times.

6. Safety Check

  • Tool: NVIDIA NeMo Guardrails.

  • What happens: Before sending the answer back, Guardrails checks: "Did the model hallucinate?" or "Is it revealing secrets?".

  • Result: Approved.

7. The Logging

  • Tool: Arize Phoenix.

  • What happens: The system logs the TraceID, the retrieved documents, and the final answer to the Arize dashboard.

  • Why: So you can debug later if the answer was wrong.

8. Final Response

  • Tool: FastAPI.

  • What happens: Returns the JSON response to the user.

    "Based on the Jira documents, an LLM (Large Language Model) is mentioned in ticket JIRA-402 as the engine used for the new support bot feature..."


πŸ—️ Architecture Diagram Description (Mental Model)

When describing this, visualize a V-Shape Pipeline:

  1. Left Side (Write/Ingestion): Heavy, Asynchronous, Backend-focused.

    • Tools: Kafka -> Temporal -> Unstructured -> Qdrant.

    • Vibe: "Fire and Forget."

  2. Right Side (Read/Query): Fast, Synchronous, Latency-focused.

    • Tools: FastAPI -> Qdrant -> vLLM.

    • Vibe: "Real-time Speed."

  3. Bottom Layer (Infrastructure):

    • Tools: Kubernetes (hosting containers), Arize Phoenix (watching everything), AWS S3 (holding files).



🎯 Sales Prediction Platform — Interview Q&A

(Comprehensive Technical + Behavioral coverage)

These questions are crafted based on your exact project code — how it works, the tech stack, and the reasoning behind your design choices.
I’ll group them by topic with expected answers, keywords to mention, and bonus tips for extra points.


🧱 1️⃣ Project Overview & Architecture

Q1. Can you explain your Sales Prediction Platform project end-to-end?

Answer:
“My Sales Prediction Platform is a Streamlit-based web app that enables users to upload sales data, clean it automatically, perform feature engineering, and generate forecasts using multiple models like Prophet, ARIMA, SARIMA, and XGBoost.
It also includes authentication via Firebase, dashboard visualizations, and auto-generated PDF reports summarizing metrics such as RMSE, MAE, and MAPE.”

Keywords: Streamlit, Forecasting, Prophet, ARIMA, XGBoost, Firebase Auth, ReportLab PDF, End-to-End Pipeline.

πŸ’‘ Bonus: “The platform is built entirely using a free tech stack — Streamlit Cloud, Firebase Auth, and open-source Python libraries.”


Q2. What is the architecture of your platform?

Answer:
“It’s modular — each feature is separated into its own Python module:

  • cleaning.py → handles missing values and outliers

  • feature_engineering.py → adds lag, date, and rolling features

  • forecasting.py → trains and compares Prophet, ARIMA, SARIMA, XGBoost

  • metrics.py → computes performance metrics (RMSE, MAE, MAPE)

  • pdf_gen.py → creates downloadable reports

  • auth.py + firebase_auth.py → manage secure user login/signup

  • app.py → orchestrates the full Streamlit interface.”

Keywords: Modular design, Separation of concerns, Reusable modules, Scalable architecture.


🧼 2️⃣ Data Preprocessing & Cleaning

Q3. How do you handle missing values in your data?

Answer:
“For numeric columns, I fill missing values using the median; for categorical columns, I use the label ‘Unknown’; and for datetime columns, I use the mode (most frequent date). This ensures continuity and avoids losing records.”

πŸ’‘ Bonus: “Median is more robust than mean against outliers.”


Q4. How do you detect and handle outliers?

Answer:
“I use the IQR (Interquartile Range) method to cap outliers at 1.5×IQR beyond Q1 and Q3.
This prevents sudden spikes from distorting trend models like Prophet.”

Keywords: IQR, Capping, Robust statistics, Prevent distortion.
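A sketch of that capping step with pandas, assuming a numeric column:

```python
import pandas as pd

def cap_outliers_iqr(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Winsorize: clip values beyond Q1 - k*IQR and Q3 + k*IQR."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return series.clip(lower=q1 - k * iqr, upper=q3 + k * iqr)
```

Clipping (rather than dropping) keeps the time series contiguous, which matters for models like Prophet that expect a regular date index.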


Q5. Why do you normalize text columns during cleaning?

Answer:
“To avoid duplicate category entries caused by inconsistent casing or spaces — for instance, ‘New york’, ‘new York’, and ‘New York’ would be treated the same after normalization.”


🧩 3️⃣ Feature Engineering

Q6. What new features did you create to improve forecasts?

Answer:
“I extracted time-based features such as year, month, quarter, day of week, and weekend indicators.
For some datasets, I also added lag features (Sales_lag1, Sales_lag7) and rolling averages (Sales_roll7) to capture temporal dependencies.”

πŸ’‘ Bonus: “These features improve non-seasonal models like XGBoost.”
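A sketch of those lag and rolling features with pandas, assuming the frame is already sorted chronologically and the target column is named Sales:

```python
import pandas as pd

def add_time_features(df: pd.DataFrame, target: str = "Sales") -> pd.DataFrame:
    """Add lag and rolling-average features; assumes df is sorted by date."""
    out = df.copy()
    out[f"{target}_lag1"] = out[target].shift(1)                 # yesterday's sales
    out[f"{target}_lag7"] = out[target].shift(7)                 # same day last week
    out[f"{target}_roll7"] = out[target].rolling(window=7).mean()  # 7-day average
    return out
```

The first rows are NaN by construction (no history yet) and should be dropped or imputed before training.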


Q7. Why is feature engineering critical in forecasting?

Answer:
“Because models like XGBoost and ARIMA don’t inherently understand seasonality — they rely on engineered features to capture trends, cycles, and periodic behaviors.”


Q8. How does Prophet handle seasonality differently?

Answer:
“Prophet automatically models seasonality and trend using additive or multiplicative components, so it doesn’t require manual lag features.”

Keywords: Additive model, changepoints, trend flexibility, holiday effects.


πŸ“ˆ 4️⃣ Forecasting Models

Q9. What forecasting models did you implement?

Answer:
“I implemented Prophet, ARIMA, SARIMA, and XGBoost.
Prophet handles trend + seasonality, ARIMA and SARIMA are traditional time-series models, and XGBoost is a machine-learning regressor that can capture non-linear patterns.”


Q10. How do you choose the best model automatically?

Answer:
“I train all models and compare their RMSE and MAPE scores on a holdout test set.
The model with the lowest RMSE is selected as the best performing one.”

Keywords: Model comparison, Evaluation metrics, RMSE-based ranking.
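The selection step reduces to a min over holdout scores; the shape of the score dictionary here is illustrative:

```python
def select_best_model(scores: dict) -> str:
    """Pick the model with the lowest RMSE on the holdout set.

    `scores` maps model name -> {"rmse": ..., "mape": ...};
    ties resolve to the first model in iteration order.
    """
    return min(scores, key=lambda name: scores[name]["rmse"])
```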


Q11. What’s the difference between ARIMA and SARIMA?

Answer:
“ARIMA models non-seasonal time-series data, while SARIMA adds seasonal components defined by (P, D, Q, s).
SARIMA can handle monthly or weekly seasonal cycles.”

Bonus: “For example, s=12 for monthly sales data.”


Q12. How did you ensure time-series data was split correctly?

Answer:
“I used a chronological split (not random) where the last 30 observations are used for testing.
This preserves temporal integrity.”
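A chronological split is just tail slicing, never shuffling:

```python
def chronological_split(values, test_size=30):
    """Hold out the last `test_size` observations as the test set.

    No shuffling: the model never trains on data from the future,
    which would leak information and inflate metrics.
    """
    if len(values) <= test_size:
        raise ValueError("not enough observations for the requested test size")
    return values[:-test_size], values[-test_size:]
```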


Q13. Why did you include XGBoost for forecasting?

Answer:
“XGBoost can capture complex non-linear relationships and interactions between engineered features (like month, day, lag).
It’s especially useful when classical models underfit or when seasonality changes dynamically.”


Q14. What challenges did you face using Prophet or ARIMA?

Answer:
“Prophet sometimes struggles with very short datasets (<10 rows).
ARIMA can fail to converge on non-stationary data, so differencing and seasonal checks are required.”

Bonus: “I mitigated this by including error handling and fallback models.”


πŸ“Š 5️⃣ Model Evaluation

Q15. Which metrics do you use and why?

Answer:
“I use MAE, RMSE, and MAPE:

  • MAE → average absolute error, easy to interpret

  • RMSE → penalizes large errors

  • MAPE → relative error in percentage, good for business users.”
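Minimal implementations of the three metrics (zero actuals are excluded from MAPE so the denominator is never zero):

```python
import math

def mae(actual, pred):
    """Mean absolute error: average magnitude of the miss, in the target's units."""
    return sum(abs(a - p) for a, p in zip(actual, pred)) / len(actual)

def rmse(actual, pred):
    """Root mean squared error: like MAE but large misses are penalized more."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, pred)) / len(actual))

def mape(actual, pred):
    """Mean absolute percentage error; pairs with actual == 0 are skipped."""
    pairs = [(a, p) for a, p in zip(actual, pred) if a != 0]
    return 100 * sum(abs(a - p) / abs(a) for a, p in pairs) / len(pairs)
```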


Q16. What are the limitations of MAPE?

Answer:
“MAPE becomes unreliable when actual values are near zero — hence I exclude zero values in the denominator.”


Q17. How would you improve model evaluation further?

Answer:
“I could implement rolling-origin cross-validation or time-series CV instead of a single holdout set to better estimate generalization.”


🧾 6️⃣ Reporting & Visualization

Q18. How did you generate PDF reports in your app?

Answer:
“I used ReportLab’s SimpleDocTemplate and Table components to dynamically build a sales forecast report including model name, metrics, and summary details.
The file is then offered as a downloadable button in Streamlit.”

Keywords: ReportLab, Platypus, TableStyle, Streamlit download_button.


Q19. What could you add to make the reports more useful?

Answer:
“I can embed forecast charts, add dataset summary statistics, and include the user’s name and timestamp for audit tracking.”


πŸ” 7️⃣ Authentication & Security

Q20. How does authentication work in your platform?

Answer:
“I used Firebase Authentication integrated with Pyrebase.
Streamlit forms collect user credentials, and Firebase handles registration, login, and password resets securely via its REST API.”


Q21. How do you maintain user sessions in Streamlit?

Answer:
“I store login status (logged_in) and user email in st.session_state, which persists across reruns.”


Q22. Why did you choose Firebase over a local database?

Answer:
“Firebase offers free, scalable, and secure authentication with minimal setup — ideal for a solo developer using a free stack.”


Q23. How do you handle logout?

Answer:
“I simply reset session variables since Firebase doesn’t maintain server sessions for Python clients.”


πŸ’‘ 8️⃣ Improvements, Scaling & Deployment

Q24. What improvements would you make if you had more time?

Answer:

  • Add time-series cross-validation

  • Introduce hyperparameter tuning for ARIMA/SARIMA

  • Implement caching to improve speed

  • Deploy via Streamlit Cloud with custom domain

  • Store forecasts in Firebase Firestore per user.


Q25. How would you deploy this project in production?

Answer:
“I can host the app on Streamlit Cloud or Hugging Face Spaces for free.
Firebase handles authentication, and the app can connect to a Firestore database for storing user history.”


Q26. How do you ensure data security when using Firebase?

Answer:
“I don’t store passwords locally — all auth requests go through Firebase.
API keys are stored securely using Streamlit’s st.secrets.”


Q27. If dataset has multiple stores or regions, how will you extend this app?

Answer:
“I can add a store/region filter and run the forecasting pipeline per subset.
Prophet and XGBoost support multi-series forecasts efficiently.”


🧠 9️⃣ Conceptual Deep-Dive (Advanced-level Questions)

Q28. What’s the mathematical idea behind Prophet’s additive model?

Answer:
“Prophet decomposes the time series into components:

y(t) = g(t) + s(t) + h(t) + ε(t)

where g(t) is trend, s(t) is seasonality, h(t) captures holiday effects, and ε(t) is noise.”


Q29. How would you explain RMSE to a business stakeholder?

Answer:
“It’s the square root of the average squared difference between predictions and actuals —
lower RMSE means your predictions are closer to reality.”


Q30. How does XGBoost handle overfitting?

Answer:
“It uses regularization terms (L1, L2), shrinkage (learning rate), and early stopping to control complexity.”


🧩 10️⃣ Behavioral / Design Thought

Q31. Why did you modularize your project into multiple files?

Answer:
“To keep code maintainable and reusable.
Each file has a single responsibility, making debugging easier and scaling simpler.”


Q32. How does your platform benefit a business user?

Answer:
“It enables even non-technical users to upload data, clean it, generate accurate forecasts, and download reports — all without writing code.”


Q33. What’s unique about your project compared to others?

Answer:
“It’s an end-to-end, free-stack, self-service forecasting tool with explainability, auto model selection, and downloadable reporting built by a solo developer.”


SUMMARY OF YOUR PROJECT STORYLINE FOR INTERVIEW

| Phase | Focus | Tech / Keywords |
| --- | --- | --- |
| Upload & Validation | Data upload, validation, missing checks | Pandas, Validation |
| Cleaning | Imputation, outlier handling | IQR, Median, Mode |
| Feature Engineering | Time features, lag, rolling | Datetime, Lag1, Lag7 |
| Forecasting | Prophet, ARIMA, SARIMA, XGBoost | Auto model selection, RMSE |
| Evaluation | Metrics calculation | RMSE, MAE, MAPE |
| Reporting | PDF generation | ReportLab |
| Security | Firebase authentication | Pyrebase, Streamlit session |
| Deployment | Streamlit Cloud | Free tech stack |



🎯 Project: CSV Reasoning Agent (LangChain + Streamlit + Groq)


🧠 Section 1: Project Concept & Overview



Here is a clean, interview-ready project brief for the code you shared, followed by likely interview questions + strong sample answers.


✅ PROJECT BRIEF (Short, Strong, Interview-Ready)

Project Name: Dynamic CSV Reasoning Agent with Code & Chart Generation
Tech Stack: Streamlit, LangChain, Groq LLaMA-3.1, Pandas, Plotly

πŸ”Ή What the Project Does

This project is an AI-powered data analysis application where a user uploads any CSV file and can ask natural-language questions.
The system automatically:

  1. Parses the data

  2. Uses an LLM (LLaMA-3.1 via Groq API) to reason about the dataset

  3. Calls custom EDA tools (schema, missing values, describe)

  4. The LLM dynamically generates Python code blocks

  5. The platform extracts that code, executes it safely, and displays:

    • Generated code

    • Generated Plotly charts

    • A final natural-language answer

πŸ”Ή Key Features

  • Dynamic Agent Creation: A LangChain agent is initialized on file upload using the dataframe.

  • Custom Tooling:

    • schema_tool → Shows data types

    • missing_tool → Finds missing values

    • describe_tool → Basic profiling

  • LLM-Driven Code Generation:
    The LLM returns Python code in fenced blocks, which the app parses and executes.

  • Plotly Chart Execution:
    The system captures the generated fig object and renders interactive visualizations.

  • Streamlit Front-End:
    Upload → Query → Code → Chart → Final Answer.

πŸ”Ή Your Role (Interview Positioning)

You can say you:

  • Designed the agent architecture using LangChain Runnable pipelines

  • Integrated custom EDA tools

  • Built the full Streamlit UI

  • Implemented safe dynamic code execution

  • Integrated Groq LLaMA model

  • Improved usability through automatic chart extraction and rendering


1. Explain this project in one minute.

Answer:
This is an AI-driven data analysis tool where a user uploads a CSV and can ask natural language questions. I built a LangChain agent that uses LLaMA-3.1 with custom EDA tools like schema, missing values, and statistical summary. The LLM returns Python code blocks for analysis and visualization. My Streamlit app extracts those code blocks, executes them safely, and renders the charts using Plotly. The final answer combines code, insight, and visualizations, making the analysis fully dynamic and automated.


2. Why did you use LangChain instead of directly calling an LLM?

Answer:
LangChain gives a structured way to use tools, define prompt templates, and create a runnable agent pipeline. With it, I can cleanly inject custom tools (schema_tool, missing_tool, describe_tool) and let the model call them when needed. LangChain's Runnable architecture also makes execution, parsing, and tracing much more predictable.


3. How does the agent call tools? What triggers a tool call?

Answer:
The system prompt explicitly says the model has access to tools and should use them for details like schema, missing values, or describe.
When the model identifies a need—like “what are the data types?”—it triggers a tool call with the input. I implemented each tool as a LangChain Tool class, where the function returns the corresponding dataframe information.
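The three tools can be sketched in plain pandas as below. The trailing comment shows the LangChain Tool(name=..., func=..., description=...) wrapping pattern; the exact wiring in the original app may differ.

```python
# Hedged sketch of the three EDA tools in plain pandas.
import pandas as pd

def schema_tool(df: pd.DataFrame) -> str:
    """Column names and data types."""
    return df.dtypes.to_string()

def missing_tool(df: pd.DataFrame) -> str:
    """Per-column missing-value counts."""
    return df.isnull().sum().to_string()

def describe_tool(df: pd.DataFrame) -> str:
    """Basic statistical profile."""
    return df.describe(include="all").to_string()

# Wrapped for the agent (requires langchain):
# from langchain.tools import Tool
# tools = [Tool(name="schema_tool", description="Show column dtypes",
#               func=lambda _: schema_tool(df))]
```

Returning strings (rather than DataFrames) matters here: tool outputs go back into the LLM's context, which expects text.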


4. How do you execute LLM-generated code safely?

Answer:
I isolate execution using a restricted local_vars dictionary that contains only:
df, st, px, and go.
I remove any .show() calls and assume the model prepares a fig object. After executing the code, if fig exists, I plot it via st.plotly_chart.
This limits the execution environment and prevents access to global variables or sensitive operations.
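A minimal sketch of that execution path; the helper name and the empty-builtins choice are illustrative, and a generic namespace stands in for {'df': df, 'st': st, 'px': px, 'go': go}:

```python
def run_generated_code(code: str, local_vars: dict) -> dict:
    """Execute LLM-generated code inside a restricted namespace.

    Sketch only: the real app passes {'df': df, 'st': st, 'px': px,
    'go': go} and then renders local_vars['fig'] via st.plotly_chart.
    """
    # Drop .show() calls; the app displays the fig object itself.
    code = code.replace(".show()", "")
    # Empty globals (no builtins) plus a curated locals dict limits
    # what the generated code can reach.
    exec(code, {"__builtins__": {}}, local_vars)
    return local_vars

# After execution, the app checks for the chart object:
# if "fig" in local_vars:
#     st.plotly_chart(local_vars["fig"])
```

Note this is containment, not a true sandbox: determined code can still escape an empty-builtins namespace, which is why the extension ideas below mention Pyodide or Docker.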


5. How did you extract code blocks from the LLM response?

Answer:
Using regex:

re.findall(r"```python\n(.*?)\n```", agent_output_text, re.DOTALL)

All python code fences are captured.
The remaining text (after removing these blocks) becomes the final natural-language answer.
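Both steps can be sketched as one helper (the function name is illustrative):

```python
import re

# Matches fenced Python blocks in the LLM reply.
CODE_FENCE = re.compile(r"```python\n(.*?)\n```", re.DOTALL)

def split_response(text: str):
    """Return (code_blocks, final_answer) from a raw LLM reply."""
    code_blocks = CODE_FENCE.findall(text)
    # Whatever remains after stripping the fences is the narrative answer.
    final_answer = CODE_FENCE.sub("", text).strip()
    return code_blocks, final_answer
```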


6. Why did you choose Groq + LLaMA-3.1?

Answer:
Groq offers extremely fast inference for LLaMA models with low latency, which significantly improves user experience for iterative querying. For an interactive application like this, speed is crucial.


7. How do you handle errors in user code or LLM-generated code?

Answer:
All execution is wrapped in a try–except block:

try:
    exec(code_block, {}, local_vars)
except Exception as e:
    st.error(f"Error executing plotting code: {e}")

If the generated code is invalid, I safely catch the error and display it without crashing the app.


8. How would you scale or extend this project?

Possible answers:

  • Add authentication & project save history

  • Add more advanced tools: correlation, clustering, forecasting

  • Sandbox the code execution using Pyodide or docker-level execution

  • Deploy on cloud + use async calls for responsiveness

  • Add multi-agent reasoning (one for EDA, one for visualization, one for insights)


9. What are the security risks of executing LLM-generated code?

Answer:
Arbitrary code execution is dangerous. To mitigate:

  • Run code only within a restricted variable scope

  • Strip disallowed keywords (file operations, os commands)

  • Prefer sandboxed environments (Pyodide, containerized worker)

  • Add static code analysis before execution


10. How does Streamlit manage session state in your app? Why is it needed?

Answer:
Streamlit re-runs the script on every interaction.
Session state lets me persist the agent object after upload:

if 'agent' not in st.session_state:
    st.session_state['agent'] = None

Without session state, the agent would recreate on every text input, losing context.


1. What problem does your project solve?

It allows non-technical users to upload any dataset and ask natural-language questions. The system automatically performs EDA, generates Python code, creates visualizations, and gives insights without needing a data analyst.


2. How does your agent architecture work?

I use a LangChain Runnable Agent with custom tools. When a query comes in, the prompt plus tools are sent to LLaMA-3.1, which chooses whether to call a tool or directly answer. The LLM returns an output with Python code blocks and natural-language explanations. The app extracts those and executes them.


3. Why did you use LangChain Runnables instead of Agents API?

Runnables are cleaner, more predictable, and avoid implicit agent behavior. They give full control over prompt → model → output parsing → execution without the complexity of old AgentExecutor APIs.


4. What tools did you implement and why?

  • schema_tool: return column names + data types

  • missing_tool: show missing value counts

  • describe_tool: statistical summary

These represent the most frequently used EDA operations and help the LLM reason more accurately about the data.


5. How does the model know when to call a tool?

The system prompt explicitly instructs the model that tools exist and they should be used when schema/missing/describe details are needed. The LLM decides based on context in the user query.


6. How do you extract Python code from the LLM output?

Using regex for fenced code blocks:

re.findall(r"```python\n(.*?)\n```", text, re.DOTALL)

Everything inside the ```python fences is identified as code.


7. How do you separate the final answer from code?

I remove all code blocks using regex, and whatever remains is the final natural-language explanation.


8. How do you safely execute LLM-generated code?

Execution is isolated within a local variable dictionary:

local_vars = {'df': df, 'st': st, 'px': px, 'go': go}

Dangerous libraries or OS-level access are not available. This limits what the code can do.


9. What if the LLM generates incorrect or failing code?

Execution is wrapped in a try–except block. If an error occurs, it’s shown as a Streamlit error rather than letting the app crash.


10. How do you render charts generated by the model?

I assume the LLM creates a variable named fig. After execution, I check:

if 'fig' in local_vars:
    st.plotly_chart(local_vars['fig'])

This renders the Plotly figure interactively.


11. Why use Plotly instead of Matplotlib?

Plotly provides interactive, responsive charts. Streamlit integrates with Plotly seamlessly, making it ideal for data exploration apps.


12. Why did you choose Groq’s LLaMA-3.1 model?

Groq offers extremely low latency. Since data analysis involves iterative “ask → refine → ask more” loops, fast responses greatly improve UX.


13. How does Streamlit’s session state help your app?

Streamlit re-runs the code on every interaction.
Session state keeps the agent alive after the CSV upload.
Without it, the agent would reset every time you type a question.


14. How does the agent get access to the dataframe?

I bind the dataframe directly into the Runnable pipeline using:

"df": lambda _: df

This keeps the DF consistent for the lifetime of the session.
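The binding is just a constant function: a lambda that ignores whatever input the pipeline passes in and always returns the same dataframe. A dependency-free sketch of that idea (the surrounding chain is shown as comments since it needs langchain installed, and module paths follow current langchain-core conventions):

```python
# In the real pipeline:
# from langchain_core.output_parsers import StrOutputParser
# chain = ({"df": lambda _: df, "question": lambda q: q}
#          | prompt | llm | StrOutputParser())

df = {"rows": 100}        # stand-in for the real DataFrame
bind_df = lambda _: df    # same shape as "df": lambda _: df
```

Whatever input arrives, the same frame object comes back, so every query in the session reasons over identical data.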


15. What are security risks in executing LLM-generated code?

Main risks:

  • Arbitrary code execution

  • Accessing disk

  • Deleting files

  • Running OS/system commands

I mitigate this by isolating execution, blocking access to os, and restricting to safe local variables.


16. How would you make this execution sandbox fully safe?

Possible improvements:

  • Use Pyodide (browser-based Python sandbox)

  • Run code inside a Docker microservice

  • Add static code scanning to detect disallowed keywords

  • Use a restricted AST parser
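The static-scan idea from the list above can be sketched with the standard library's ast module (the deny-list here is illustrative and far from complete):

```python
import ast

# Names whose appearance should block execution; a real deny-list
# would be much longer (getattr, compile, importlib, ...).
BLOCKED = {"os", "sys", "subprocess", "shutil", "eval", "exec",
           "open", "__import__"}

def scan_code(code: str) -> list:
    """Return disallowed names referenced by the code, or [] if clean."""
    hits = []
    for node in ast.walk(ast.parse(code)):
        if isinstance(node, ast.Import):
            hits += [a.name for a in node.names if a.name in BLOCKED]
        elif isinstance(node, ast.ImportFrom) and node.module in BLOCKED:
            hits.append(node.module)
        elif isinstance(node, ast.Name) and node.id in BLOCKED:
            hits.append(node.id)
    return hits
```

Running this before exec lets the app refuse suspicious code outright instead of relying on the restricted namespace alone.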


17. How do you handle different CSV encodings?

I load with:

pd.read_csv(file, encoding="latin1")

Latin-1 accepts any byte value, so this avoids decode errors from common encodings like CP1252 or ISO-8859-1.
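A slightly more defensive variant (hypothetical helper, assuming the uploaded file object supports seek): try UTF-8 first, and fall back to Latin-1 only when decoding fails:

```python
import pandas as pd

def read_csv_lenient(file):
    """Read a CSV, retrying with Latin-1 if UTF-8 decoding fails."""
    try:
        return pd.read_csv(file, encoding="utf-8")
    except UnicodeDecodeError:
        file.seek(0)  # rewind the upload buffer before the retry
        return pd.read_csv(file, encoding="latin1")
```

This keeps correctly-encoded UTF-8 files intact instead of silently reinterpreting their bytes as Latin-1.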


18. What happens when the user uploads a new CSV?

file_uploader uses on_change to clear the old agent:

on_change=lambda: st.session_state.update(agent=None)

A fresh agent is created for the new file.


19. What is the purpose of the StrOutputParser?

It converts the LLM's message output into a plain string.
This string can contain:

  • Tool call results

  • Python code blocks

  • Final explanation

Runnables require explicit parsing, so StrOutputParser is essential.


20. Why didn’t you use the LangChain PandasAgent?

PandasAgent internally executes arbitrary Python code, which is risky and hard to control.
I wanted a simpler, tool-based agent with explicit control over code generation and parsing.


21. What kinds of queries can your agent handle?

  • “Show correlations”

  • “Plot sales trend over time”

  • “Find missing value summary”

  • “Which category has highest profit?”

  • “Give me a bar chart of count by segment”

As long as the LLM can generate valid code, it works.


22. How does the prompt structure affect reasoning?

The system prompt defines the skills of the LLM and describes the tools.
The human prompt contains the actual query.
This separation ensures the model understands the context and roles.


23. What happens internally when a query is submitted?

Flow:
Upload CSV → Initialize agent → Query → Runnable pipeline → LLM output → Extract code → Execute → Render charts → Show final answer.


24. How does your agent handle missing value queries?

The LLM calls missing_tool, which computes null counts.
This is better than letting the LLM “hallucinate missing values.”


25. What is the biggest challenge in building LLM code execution apps?

Balancing flexibility (allowing dynamic Python code) with safety.
LLMs can easily hallucinate imports, variable names, or unsafe code.
Designing the correct environment to avoid crashes is key.


26. How would you extend this app to support SQL?

Add a tool that runs user SQL queries on the DF converted to an in-memory SQLite database.
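A sketch of such a tool using the standard library's sqlite3 (the helper and the table name data are assumptions):

```python
import sqlite3
import pandas as pd

def sql_tool(df: pd.DataFrame, query: str) -> pd.DataFrame:
    """Run a SQL query against the dataframe via an in-memory SQLite DB."""
    conn = sqlite3.connect(":memory:")
    try:
        df.to_sql("data", conn, index=False)   # expose the DF as table `data`
        return pd.read_sql_query(query, conn)
    finally:
        conn.close()
```

This slots naturally into the existing tool list, letting the LLM answer aggregation questions with SQL instead of generated pandas code.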


27. How would you speed up performance for large datasets?

  • Load CSV in chunks

  • Use Dask or Polars

  • Add sample previews before full operations

  • Cache results using Streamlit’s st.cache_data
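On the caching point: st.cache_data behaves like memoization keyed on the function's inputs. The sketch below shows that effect, falling back to functools.lru_cache when Streamlit isn't installed (the loader name and string stand-in are illustrative):

```python
try:
    import streamlit as st
    cache = st.cache_data          # Streamlit re-runs reuse the cached result
except ImportError:
    import functools
    cache = functools.lru_cache(maxsize=None)  # same memoization idea

@cache
def load_preview(path: str) -> str:
    # In the real app this would be pd.read_csv(path).head(); a string
    # stand-in keeps the sketch dependency-free.
    return f"loaded:{path}"
```

Because Streamlit re-runs the whole script on every interaction, caching the load step avoids re-parsing a large CSV for each question.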


28. How will you scale this to production?

  • Move to FastAPI backend

  • Streamlit frontend served separately

  • Add GPU or Groq server for fast inference

  • Use Redis for caching metadata

  • Add user auth and saved analysis sessions


29. What is the role of regex in your app?

Regex extracts code and cleans the final answer.
It also ensures only code inside ```python blocks is executed.


30. What is the most impressive part of this project?

The combination of:

  • Dynamic CSV reasoning

  • Tool-augmented LLM

  • Automatic code generation

  • Executing charts from generated code

  • Full EDA automation

It essentially transforms any dataset into an interactive AI data analyst.
