ViT, Diffusion models, OpenCV, RLHF and RLAIF, LLM poisoning
🧩 Vision Transformer (ViT) — Cheatsheet
1. 💡 What is ViT?
- ViT (Vision Transformer) applies the Transformer architecture (originally developed for NLP) to image recognition tasks.
- It splits an image into patches, embeds them as tokens, and processes them like words in a sentence.
- Proposed by Dosovitskiy et al., 2020 (Google Research).
✅ Key Idea: Treat an image as a sequence of patches instead of a 2D grid.
✅ Goal: Leverage self-attention for long-range spatial relationships.
2. 🧠 Motivation — Why ViT?
| CNNs | Vision Transformers |
|---|---|
| Use convolution kernels for local features | Use global self-attention |
| Limited receptive field | Global context at every layer |
| Strong built-in inductive biases (locality, translation equivariance) | Few inductive biases; learns relationships from data |
| Strong performance on small datasets | Scales better with large data |
3. 🧩 Architecture Overview
Input: image of size H×W×C (e.g., 224×224×3)
Step-by-step Pipeline:
1. Patch Splitting
- Divide the image into fixed-size P×P patches, e.g., 16×16
- Flatten each patch → becomes one token
- Number of patches: N = (H·W)/P²
Example: 224×224 image, patch size 16 → (224/16)² = 196 patches
2. Linear Embedding
- Each flattened patch (size P²·C) is projected to a D-dimensional vector (embedding)
3. Add CLS Token
- A learnable [CLS] token is prepended to the patch embeddings
- Used to aggregate a global image representation (as in BERT for classification)
4. Positional Encoding
- Since Transformers don't know order, positional embeddings are added to every token
5. Transformer Encoder Layers
Each block has:
- Multi-Head Self-Attention (MHSA)
- MLP feed-forward network (2 fully connected layers)
- Layer Normalization + Residual Connections
Formula: z′ = MSA(LN(z)) + z, then z = MLP(LN(z′)) + z′
6. Classification Head
- The final [CLS] token representation is fed into an MLP head → output class probabilities
4. 🧮 Key Equations
- Self-Attention:
Attention(Q, K, V) = softmax(QKᵀ / √d_k) V
where Q, K, V are the query, key, and value matrices computed from the input embeddings.
- Multi-Head Attention:
MultiHead(Q, K, V) = Concat(head₁, …, head_h) Wᴼ
Each head learns different relationships.
5. 🧱 ViT Components Summary
| Component | Role |
|---|---|
| Patch Embedding | Converts image patches to token embeddings |
| Position Embedding | Preserves spatial order |
| Transformer Encoder | Learns global relations |
| Classification Head | Predicts final output |
6. ⚙️ Code Example (Simplified PyTorch)
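A minimal sketch of the pipeline in PyTorch. The hyperparameters (`dim=192`, `depth=4`, `heads=3`) are illustrative small values, not the original ViT-Base settings:

```python
import torch
import torch.nn as nn

class SimpleViT(nn.Module):
    def __init__(self, img_size=224, patch_size=16, in_ch=3,
                 dim=192, depth=4, heads=3, num_classes=10):
        super().__init__()
        num_patches = (img_size // patch_size) ** 2
        # Patch splitting + linear embedding in one step: a strided convolution
        self.patch_embed = nn.Conv2d(in_ch, dim, kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           dim_feedforward=dim * 4,
                                           batch_first=True, norm_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, x):
        B = x.shape[0]
        x = self.patch_embed(x)                  # (B, dim, H/P, W/P)
        x = x.flatten(2).transpose(1, 2)         # (B, N, dim) patch tokens
        cls = self.cls_token.expand(B, -1, -1)   # prepend learnable [CLS]
        x = torch.cat([cls, x], dim=1) + self.pos_embed
        x = self.encoder(x)
        return self.head(x[:, 0])                # classify from [CLS] token

model = SimpleViT()
logits = model(torch.randn(2, 3, 224, 224))
print(logits.shape)  # torch.Size([2, 10])
```

Note the patch embedding trick: a `Conv2d` with `kernel_size == stride == patch_size` is equivalent to splitting into non-overlapping patches and applying a shared linear projection.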
7. 🔍 ViT Variants
| Model | Description |
|---|---|
| DeiT | Data-efficient ViT (trained with less data + distillation) |
| Swin Transformer | Hierarchical ViT with shifted windows |
| ViT-GPT / CLIP-ViT | Used in multimodal models (text + vision) |
| Hybrid ViT | Combines CNN patch extraction with transformer blocks |
8. 📊 Advantages & Limitations
Advantages:
✅ Captures long-range dependencies
✅ Parallelizable (unlike CNN sliding windows)
✅ Scales well with data
Limitations:
❌ Needs large datasets to train
❌ Computationally expensive
❌ Lacks inherent inductive bias (like CNN’s translation invariance)
9. 🧠 Key Interview Questions
- What is the intuition behind Vision Transformers?
- How does ViT differ from CNNs?
- What are image patches, and why are they needed?
- Explain the role of the [CLS] token in ViT.
- What is positional encoding and why do we need it?
- How is attention computed in ViT?
- How is ViT used in multimodal models (like CLIP or DALL·E)?
- What are the challenges of training ViTs?
- Compare ViT and Swin Transformer.
- How can ViTs be fine-tuned for small datasets?
10. 🌍 Real-World Applications
- Image classification (ImageNet)
- Object detection (ViT-Det, DETR)
- Segmentation (Segmenter)
- Multimodal tasks (CLIP, DALL·E)
- Medical imaging, satellite image analysis
🌫️ Diffusion Models — Complete Cheatsheet for Data Science & GenAI Interviews
1. 💡 What are Diffusion Models?
- Diffusion models are generative models that learn to create data by reversing a gradual noising process.
- The idea:
→ Start with an image
→ Gradually add noise until it becomes pure noise
→ Then train a model to reverse this process, turning noise → data.
They're the backbone of:
- DALL·E 2
- Stable Diffusion
- Imagen
- Midjourney
2. 🧩 Core Intuition
Think of it like teaching a model to denoise step by step.
| Step | Direction | Description |
|---|---|---|
| Forward Diffusion | Adds noise | Slowly destroys structure in the image |
| Reverse Diffusion | Removes noise | Model learns to reconstruct clean images |
3. 🧠 High-level Process
🔹 Forward Process (Diffusion / Noise Addition)
We add small Gaussian noise to data over T time steps:
q(x_t | x_{t−1}) = N(x_t; √(1−β_t)·x_{t−1}, β_t·I)
- β_t: variance schedule (small positive number)
- After many steps → x_T ≈ pure noise
🔹 Reverse Process (Denoising / Generation)
Train a neural network (usually a U-Net) to estimate the noise added.
We train it to predict the noise:
L = E_{x₀, ε, t} ‖ε − ε_θ(x_t, t)‖²
At inference, we start from pure noise and iteratively denoise → image.
4. 🔄 Summary Flow
| Stage | Input | Output | Description |
|---|---|---|---|
| Training | Clean image → add noise | Learn to predict noise | Model learns reverse steps |
| Inference | Pure noise | Generated image | Reverse steps to generate |
5. ⚙️ Key Components
| Component | Description |
|---|---|
| U-Net | Core model predicting noise at each step |
| Scheduler / Noise Schedule | Defines how much noise is added each step |
| Timestep Embedding | Encodes current diffusion step |
| Variance () Schedule | Linear, cosine, or learned noise levels |
| Latent Diffusion (Stable Diffusion) | Runs diffusion in latent space (compressed representation) |
6. 📉 Training Objective
Goal: teach the model to predict the noise added at each timestep:
L_simple = E_{x₀, ε, t} ‖ε − ε_θ(x_t, t)‖²
- Simpler than GANs (no discriminator)
- Stable training
- High-quality samples
7. 🧮 Inference Steps (Simplified)
x_{t−1} = (1/√α_t) · ( x_t − ((1−α_t)/√(1−ᾱ_t)) · ε_θ(x_t, t) ) + σ_t·z
Each step gradually cleans the noise → final image.
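A minimal DDPM-style sampling loop. The `predict_noise` stub is a hypothetical stand-in for the trained U-Net, and the schedule values are illustrative:

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)      # linear beta (variance) schedule
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)  # cumulative product of alphas

def predict_noise(x, t):
    # Placeholder for the trained noise-prediction network ε_θ;
    # a real sampler would call the U-Net here.
    return torch.zeros_like(x)

x = torch.randn(1, 32)                     # start from pure Gaussian noise x_T
for t in reversed(range(T)):
    eps = predict_noise(x, torch.tensor([t]))
    coef = (1 - alphas[t]) / (1 - alpha_bars[t]).sqrt()
    x = (x - coef * eps) / alphas[t].sqrt()
    if t > 0:
        # Add fresh noise at every step except the last (sigma_t ≈ sqrt(beta_t))
        x = x + betas[t].sqrt() * torch.randn_like(x)
print(x.shape)
```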
8. 🧱 Types of Diffusion Models
| Model | Description |
|---|---|
| DDPM (Denoising Diffusion Probabilistic Models) | Original diffusion framework (Ho et al., 2020) |
| DDIM (Denoising Diffusion Implicit Models) | Faster sampling (non-stochastic reverse process) |
| Latent Diffusion (LDM) | Runs in latent space (Stable Diffusion) |
| Score-based Models | Train score function ∇log(p(x)) for data distribution |
| Guided Diffusion | Conditional generation (e.g., class or text guided) |
9. 🧠 Stable Diffusion (Real-world Example)
Stable Diffusion = Latent Diffusion Model (LDM)
Pipeline:
1. Encode the image with a VAE (Variational Autoencoder) → latent space; encode the text prompt with a CLIP text encoder
2. Add noise to the latent
3. Model (U-Net conditioned on the CLIP text embedding) learns to denoise
4. Decode the latent → image
Benefits:
- Faster, smaller (latent = compressed)
- Supports text-to-image (via CLIP)
- Can run on consumer GPUs
10. 🎯 Key Interview Questions
- Explain the intuition behind diffusion models.
- What is the difference between forward and reverse diffusion?
- Why is a U-Net used in diffusion models?
- What loss function is used to train diffusion models?
- Compare diffusion models and GANs.
- What is the role of the β schedule?
- How does DDIM differ from DDPM?
- What is Latent Diffusion?
- How is text conditioning added in Stable Diffusion?
- How does classifier-free guidance work?
11. ⚔️ Diffusion vs GANs
| Feature | GAN | Diffusion |
|---|---|---|
| Architecture | Generator + Discriminator | Single denoiser network |
| Training | Adversarial (unstable) | Simple MSE loss (stable) |
| Diversity | May collapse | Excellent diversity |
| Inference | Fast (1 step) | Slow (many steps) |
| Quality | Good | Very high (fidelity & detail) |
12. 🧩 Key Math Recap
| Symbol | Meaning |
|---|---|
| x₀ | Original image |
| x_t | Noised image at timestep t |
| T | Total timesteps |
| β_t | Noise variance at step t |
| α_t = 1 − β_t | Retained signal ratio |
| ᾱ_t | Cumulative product of alphas |
| ε_θ | Predicted noise by model |
13. ⚙️ Simple Implementation Idea
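One way to sketch a single training step in PyTorch. The `TinyDenoiser` MLP is a toy stand-in for the U-Net, and the schedule values are illustrative:

```python
import torch
import torch.nn as nn

T = 1000
betas = torch.linspace(1e-4, 0.02, T)      # linear beta schedule
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)  # cumulative product ᾱ_t

class TinyDenoiser(nn.Module):
    """Toy stand-in for the U-Net: maps (x_t, t) -> predicted noise."""
    def __init__(self, dim=32):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim + 1, 128), nn.ReLU(),
                                 nn.Linear(128, dim))
    def forward(self, x, t):
        t_feat = (t.float() / T).unsqueeze(-1)  # crude timestep embedding
        return self.net(torch.cat([x, t_feat], dim=-1))

model = TinyDenoiser()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

x0 = torch.randn(16, 32)                   # a batch of "clean" data
t = torch.randint(0, T, (16,))             # random timestep per sample
eps = torch.randn_like(x0)                 # the noise we will ask it to predict
ab = alpha_bars[t].unsqueeze(-1)
x_t = ab.sqrt() * x0 + (1 - ab).sqrt() * eps  # forward process in closed form
loss = ((eps - model(x_t, t)) ** 2).mean()    # simple MSE noise-prediction loss
loss.backward()
opt.step()
print(float(loss) > 0)
```

The closed-form line is the key convenience of DDPM training: any `x_t` can be sampled directly from `x0` via `ᾱ_t`, so timesteps can be drawn at random instead of simulating the whole chain.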
14. 🚀 Applications
- Text-to-Image: Stable Diffusion, DALL·E 2
- Image-to-Image: inpainting, super-resolution
- Video Diffusion: Gen-2, RunwayML
- Audio Diffusion: music generation
- 3D Diffusion: generating 3D objects from text
15. 📚 Key Research Papers
| Paper | Description |
|---|---|
| DDPM (Ho et al., 2020) | Original diffusion model |
| Improved DDPM (Nichol & Dhariwal, 2021) | Better β schedules |
| DDIM (Song et al., 2021) | Deterministic faster sampling |
| LDM (Rombach et al., 2022) | Stable Diffusion |
| Guided Diffusion | Conditional sampling |
| SDE Diffusion | Score-based diffusion using stochastic differential equations |
16. 🌈 Quick Summary
✅ Diffusion models learn to reverse noise
✅ Trained with MSE loss
✅ U-Net backbone is common
✅ Outperform GANs in quality & stability
✅ Used in text-to-image, video, and multimodal GenAI
🧠 OpenCV (Computer Vision) — Complete Cheatsheet
1. ⚙️ What is OpenCV?
- OpenCV (Open Source Computer Vision Library) → a fast, open-source library for image processing, computer vision, and machine learning.
- Written in C++, with bindings for Python, Java, C, etc.
- Commonly used with NumPy, Matplotlib, and deep learning frameworks (TensorFlow, PyTorch).
2. 🧩 Importing and Basic Setup
3. 📷 Reading & Displaying Images
🧠 Note: OpenCV uses BGR, not RGB color ordering.
4. 🎨 Image Properties
5. 🧮 Color Conversions
6. ✂️ Image Cropping, Resizing & Rotation
Custom Rotation:
7. 🔍 Drawing Shapes & Text
8. 🧠 Basic Image Operations
a) Arithmetic:
b) Bitwise:
c) Image Blending:
9. 🧹 Image Thresholding
Convert grayscale → binary image.
Adaptive Thresholding:
Otsu’s Threshold:
10. 🎛️ Image Filtering & Blurring
| Filter | Code Example |
|---|---|
| Averaging | cv2.blur(img, (5,5)) |
| Gaussian | cv2.GaussianBlur(img, (5,5), 0) |
| Median | cv2.medianBlur(img, 5) |
| Bilateral | cv2.bilateralFilter(img, 9, 75, 75) |
11. ✨ Edge Detection
Sobel / Laplacian:
12. 🔲 Morphological Operations
Used for noise removal, shape detection.
13. 🧩 Contours (Shape Detection)
14. 📏 Edge & Corner Detection
Harris Corner:
Shi-Tomasi Corners:
15. 🧍♂️ Face Detection (Haar Cascades)
16. 🧰 Video Processing
17. 📐 Geometric Transformations
| Transformation | Code Example |
|---|---|
| Translation | cv2.warpAffine(img, M, (cols, rows)) |
| Rotation | cv2.getRotationMatrix2D(center, angle, scale) |
| Affine | cv2.getAffineTransform(pts1, pts2) |
| Perspective | cv2.getPerspectiveTransform(pts1, pts2) |
18. 📦 Integration with Deep Learning
OpenCV integrates with TensorFlow/PyTorch for inference:
19. 🎯 Key Interview Topics
- What is OpenCV used for?
- Explain the difference between RGB and BGR.
- How do you perform edge detection in OpenCV?
- How do you find contours and draw bounding boxes?
- Explain morphological operations.
- What are Haar cascades and how do they work?
- How do you integrate OpenCV with deep learning models?
- How do you perform color space conversion?
- What are the different blurring techniques?
- How does cv2.Canny detect edges?
20. 🚀 Real-world Applications
- Face & object detection
- License plate recognition
- Gesture & pose tracking
- Image segmentation
- Optical Character Recognition (OCR)
- Image preprocessing for deep learning
21. 🧠 Bonus: Useful Shortcuts
| Operation | Command |
|---|---|
| Draw line | cv2.line(img, p1, p2, color, thickness) |
| Flip image | cv2.flip(img, 1) |
| Concatenate images | cv2.hconcat([img1, img2]), cv2.vconcat([...]) |
| Convert to binary | cv2.threshold(gray, 127, 255, cv2.THRESH_BINARY) |
22. 🧩 Libraries Commonly Used with OpenCV
| Library | Purpose |
|---|---|
| NumPy | Matrix & image manipulation |
| Matplotlib | Displaying images |
| Pillow (PIL) | Image input/output |
| PyTorch / TF | Model integration |
🤖 RLHF vs RLAIF — Complete Cheatsheet
🧩 1. What is RLHF?
RLHF = Reinforcement Learning from Human Feedback.
It's a fine-tuning technique used to align large language models (LLMs) with human preferences, improving the quality, helpfulness, and safety of responses.
🎯 Objective: Instead of optimizing just for "next-token prediction" (as in pre-training), RLHF makes the model optimize for what humans prefer.
⚙️ 2. RLHF Pipeline — Step-by-Step
1. Supervised Fine-Tuning (SFT): fine-tune the base model on high-quality human demonstrations.
2. Reward Model Training: humans rank multiple model responses; a reward model learns to score responses by preference.
3. RL Fine-Tuning (PPO): optimize the policy (the LLM) to maximize the reward model's score, with a KL penalty keeping it close to the SFT model.
🧠 3. Key RLHF Components
| Component | Role |
|---|---|
| SFT Model | Starting policy, trained on human demonstrations |
| Reward Model | Scores responses according to human preference rankings |
| Policy (LLM) | The model being optimized with RL |
| Reference Model | Frozen SFT copy used for the KL penalty |
| PPO | RL algorithm that updates the policy |
🧮 4. PPO Objective Function
maximize E[ r_φ(x, y) ] − β · KL( π_θ(y|x) ‖ π_ref(y|x) )
Where:
- r_φ(x, y): reward model score for prompt x and response y
- π_θ: policy being trained; π_ref: frozen reference (SFT) model
- β: coefficient controlling how far the policy may drift from the reference
🧩 5. Why RLHF is Needed
✅ Aligns model outputs with human expectations
✅ Reduces harmful, toxic, or unhelpful responses
⚖️ 6. Limitations of RLHF
❌ Human annotation is expensive and time-consuming
❌ Hard to scale to the volume of feedback large models need
🤝 7. What is RLAIF?
RLAIF = Reinforcement Learning from AI Feedback.
It's the next evolution of RLHF — where AI models themselves provide the feedback instead of relying solely on human annotators.
🧩 8. RLAIF Pipeline
Same three stages as RLHF, except the preference rankings in step 2 come from an AI labeler (often a stronger or principle-guided model) instead of human annotators.
🧠 9. Why RLAIF is Emerging
- Human feedback is the bottleneck: slow, costly, inconsistent
- AI feedback is cheap, fast, and applicable at massive scale
✅ Goal: Make alignment scalable using “AI teachers” (meta-alignment).
⚙️ 10. RLAIF Workflow Example
1. The policy model generates two candidate responses to a prompt
2. An AI judge, guided by written principles, picks the preferred one
3. These preferences train a reward model (or feed DPO directly)
4. The policy is updated against that feedback
🧩 11. Direct Preference Optimization (DPO) — Simplified RLAIF
RLAIF can also use DPO instead of PPO. DPO directly learns from ranked preferences without a complex RL optimization loop:
L_DPO = −E[ log σ( β ( log(π_θ(y_w|x)/π_ref(y_w|x)) − log(π_θ(y_l|x)/π_ref(y_l|x)) ) ) ]
Where y_w = preferred response, y_l = rejected response.
✅ No separate reward model or RL loop
📊 12. Key Differences — RLHF vs RLAIF
| Feature | RLHF | RLAIF |
|---|---|---|
| Feedback source | Human annotators | AI model |
| Cost | High | Low |
| Scalability | Limited | High |
| Consistency | Varies across annotators | Consistent, but inherits the AI judge's biases |
🚀 13. Real-World Use Cases
- ChatGPT / InstructGPT: RLHF with human preference rankings
- Claude (Anthropic): RLAIF via Constitutional AI
🧠 14. Common Interview Questions
- Explain the RLHF pipeline end to end.
- Why is a KL penalty used during PPO fine-tuning?
- How does RLAIF differ from RLHF?
- How does DPO avoid training a separate reward model?
🧩 15. Bonus — Constitutional AI (Anthropic’s Method)
- The model critiques and revises its own outputs against a written set of principles (a “constitution”)
- Those AI-generated preferences then drive RLAIF training
Example: a principle such as “choose the response that is least likely to be harmful” guides the AI judge’s rankings.
🧭 16. Summary Table
| Aspect | RLHF | RLAIF |
|---|---|---|
| Feedback | Human | AI |
| Best for | High-stakes alignment | Scalable alignment |
🧠 LLM Poisoning & Prompt Injection — Complete Cheatsheet for Data Science / Gen AI Interviews
⚠️ 1. What is LLM Poisoning?
LLM Poisoning refers to maliciously manipulating the training or fine-tuning data of a Large Language Model so that it behaves incorrectly, leaks data, or produces attacker-desired outputs.
🧩 Types of LLM Poisoning Attacks
| Type | Description |
|---|---|
| Data Poisoning | Malicious samples inserted into pre-training or fine-tuning data |
| Backdoor / Trigger Attacks | Hidden trigger phrases cause attacker-chosen behavior |
| Fine-tuning Poisoning | Adversarial examples injected during instruction tuning or RLHF |
🧠 2. Goal of Poisoning
- Inject biased or incorrect behavior
- Plant hidden triggers (backdoors)
- Cause the model to leak sensitive data
🧪 3. Real-World Examples
- Microsoft Tay (2016): chatbot corrupted through poisoned user interactions
- PoisonGPT (2023): demonstration of a tampered open-source model spreading targeted misinformation
🧰 4. Defenses against Poisoning
- Data provenance and source vetting
- Deduplication and filtering of training data
- Anomaly detection on training samples
- Red-teaming and behavioral evaluation before release
💣 5. Prompt Injection Attacks
Prompt Injection means crafting an input prompt that overrides or manipulates system instructions to make the model perform unintended actions.
💥 Types of Prompt Injection
- Direct injection: the attacker types the override into the chat itself (“Ignore previous instructions…”)
- Indirect injection: malicious instructions hidden in retrieved content (web pages, documents, emails)
- Jailbreaks: role-play or obfuscation tricks that bypass safety rules
🧠 6. Prompt Injection Mechanism
LLMs lack a strict separation between:
- System instructions (developer-supplied)
- User input and retrieved context (attacker-controllable)
Hence, attackers exploit this flat context structure to insert overriding commands.
🧩 7. Impact of Prompt Injection
- Leakage of system prompts or private data
- Unauthorized tool or agent actions (emails, API calls)
- Bypassed safety and content filters
🧰 8. Defenses against Prompt Injection
- Input sanitization and clear delimiting of untrusted content
- Privilege separation: least-privilege tools, human approval for sensitive actions
- Output filtering and instruction-hierarchy training
🧭 9. Prompt Injection vs Data Poisoning
| Aspect | Prompt Injection | Data Poisoning |
|---|---|---|
| When | Inference time | Training / fine-tuning time |
| Vector | Crafted prompts or retrieved content | Manipulated training data |
| Persistence | Per-request | Baked into model weights |
🧠 10. Advanced Concepts for Interviews
- Indirect injection through RAG pipelines and tool outputs
- Backdoors that survive safety fine-tuning
🧩 11. Mitigation Best Practices
- Treat all external content as untrusted input
- Validate training data sources; monitor model behavior after deployment
💬 12. Example Interview Questions
- What is the difference between prompt injection and data poisoning?
- How would you defend a RAG system against indirect injection?
- What is a backdoor (trigger) attack on an LLM?
🧩 13. Summary Table
| Threat | Attack stage | Primary defense |
|---|---|---|
| LLM Poisoning | Training | Data vetting & filtering |
| Prompt Injection | Inference | Input isolation & output filtering |