Reinforcement Learning, Agentic Early Experience

 

1. Reinforcement Learning Basics (LLM Context)

In traditional RL, an agent moves through a physical environment. In LLMs, the Environment is the sequence of tokens, and the Agent is the LLM itself.

  • State ($S$): The prompt plus the tokens generated so far.

  • Action ($A$): The next token to be sampled from the vocabulary.

  • Reward ($R$): A scalar value indicating the quality of the completion (often provided by a Reward Model or a rule-based verifier).

  • Policy ($\pi$): The LLM's probability distribution over the vocabulary given the current context.


2. Markov Decision Process (MDP)

LLM generation is treated as a Partially Observable MDP or a standard MDP where the transition is deterministic (adding a token leads to a new string).

  • The Markov Property: The probability of the next token depends only on the current sequence (state), not how we got there.

  • Discount Factor ($\gamma$): In LLMs we often set $\gamma \approx 1$, because the reward (e.g., correctness of the final answer) usually arrives only at the very end of the sequence and should count fully for every token that contributed to it.
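The MDP view above can be sketched in a few lines of Python. This is a toy illustration, not a real tokenizer or reward model; the token strings and the reward rule are invented for the example:

```python
def step(state, action):
    """Deterministic transition: appending a token yields the new state."""
    return state + [action]

def reward(state):
    """Sparse terminal reward: +1 only when generation ends with the
    correct string; every intermediate step gets 0."""
    if state and state[-1] == "<eos>":
        return 1.0 if "".join(state[:-1]) == "2+2=4" else 0.0
    return 0.0

state = ["2", "+", "2", "=", "4"]
state = step(state, "<eos>")   # action = sample the <eos> token
final_reward = reward(state)   # 1.0
```

Note how the environment dynamics are trivial (string concatenation); all of the difficulty lives in the sparse reward.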


3. CoT Reasoning & Verifiers

This is where modern RL departs from simple "next-token prediction."

  • Chain of Thought (CoT): Instead of going $Prompt \rightarrow Answer$, we train the model to go $Prompt \rightarrow Reasoning \rightarrow Answer$.

  • Verifiers (Outcome vs. Process):

    • Outcome-based Reward Models (ORM): Reward the model only if the final answer is correct (e.g., in Math).

    • Process-based Reward Models (PRM): Reward the model for each individual correct reasoning step. This denser feedback is often far more effective for complex, multi-step problem-solving.

  • Beam Search (The Verifier's Tool): A search strategy used during inference to explore multiple reasoning paths. The "Verifier" scores these paths, and Beam Search keeps the $N$ highest-scoring ones to find the best conclusion.
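The verifier-plus-search idea reduces, in its simplest form, to scoring candidates and keeping the top ones. A minimal sketch with a toy verifier (all names and the scoring rule are illustrative):

```python
def best_of_n(paths, verifier, n_keep=1):
    """Score candidate reasoning paths with a verifier and keep the
    n_keep highest-scoring ones (the pruning step of beam-style search)."""
    return sorted(paths, key=verifier, reverse=True)[:n_keep]

def toy_verifier(path):
    """Illustrative stand-in for a PRM: reward correct order of operations."""
    return 1.0 if "2*3=6" in path else 0.1

paths = ["first 2+2=4, then 4*3=12", "first 2*3=6, then 2+6=8"]
best = best_of_n(paths, toy_verifier)  # keeps the correct-order path
```

A real system would score partial paths at every expansion step; here the whole path is scored once for brevity.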


4. Value Functions & Dynamic Programming (DP)

  • State-Value Function $V(s)$: Predicts the total expected reward from the current prompt/token sequence.

  • Action-Value Function $Q(s, a)$: Predicts the reward if we pick a specific next token $a$.

  • DP connection: In LLMs, we rarely use pure DP because the state space (all possible sentences) is infinite. However, the concept of "bootstrapping" (using current estimates to update future ones) is fundamental to the algorithms below.


5. Monte Carlo (MC) vs. Temporal Difference (TD)

  • Monte Carlo: You wait until the LLM finishes the entire paragraph, check the reward, and then update.

    • Pros: Unbiased.

    • Cons: High variance (one bad token at the start can ruin a long, otherwise good sequence).

  • Temporal Difference (TD): You update the model's expectations during the generation of the sentence.

    • LLM Application: Most RLHF (Reinforcement Learning from Human Feedback) pipelines use a mix: PPO-style training fits a value model with TD-style bootstrapping (through GAE) to estimate per-token values and keep variance manageable.


6. Policy Gradient & GAE

  • Policy Gradient: Directly changes the LLM's weights to increase the probability of tokens that led to high rewards.

  • Generalized Advantage Estimation (GAE): A "magic" formula used to calculate the Advantage ($A = Q - V$). It tells us: "How much better was this specific token compared to what the model expected?" GAE allows us to balance bias and variance, making training much more stable.
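GAE itself is short enough to write out. A minimal list-based sketch (one trajectory, not batched; the sample rewards and values are invented):

```python
def gae(rewards, values, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation over one trajectory.
    `values` holds V(s_t) for each step plus one bootstrap value for the
    state after the final action (0.0 for a terminal state)."""
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # one-step TD error
        running = delta + gamma * lam * running  # exponentially weighted sum of deltas
        advantages[t] = running
    return advantages

# Sparse terminal reward over three "tokens", flat value estimates:
adv = gae(rewards=[0.0, 0.0, 1.0], values=[0.5, 0.5, 0.5, 0.0])
```

With $\lambda = 1$ and $\gamma = 1$ this collapses to the Monte Carlo return minus the value baseline; with $\lambda = 0$ it is the pure one-step TD error, which is exactly the bias/variance dial described above.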


7. The "O" Family: PPO, TRPO, and GRPO (Critical for Prep)

These are the algorithms that actually train the LLMs you use today.

| Algorithm | Key Concept | Why it matters for LLMs |
| --- | --- | --- |
| TRPO | Uses a "Trust Region" to ensure updates aren't too large. | Mathematically heavy (requires second-order derivatives); rarely used now. |
| PPO | Clips the policy update so the "new" LLM doesn't drift too far from the "old" LLM. | The industry standard for RLHF. Prevents the model from "collapsing" or becoming nonsensical. |
| GRPO | Group Relative Policy Optimization. Instead of a Critic model, it compares a group of outputs against their own average. | Used by DeepSeek-R1. It significantly saves GPU memory because you don't need a separate Reward/Value model in memory. |

8. Summary Example: Training a Math LLM

  1. Prompt: "What is $2 + 2 \times 3$?"

  2. CoT Generation: The model generates 5 different paths (Beam Search).

  3. Verifier: A PRM (Process Reward Model) looks at each step.

    • Path A: "First $2+2=4$, then $4\times 3=12$" $\rightarrow$ Low Reward (Wrong order).

    • Path B: "First $2 \times 3=6$, then $2+6=8$" $\rightarrow$ High Reward.

  4. Optimization: Using GRPO, the model sees that Path B is better than the group average and updates its weights to favor that "order of operations" reasoning in the future.



9. Function Approximation

In basic RL, we use tables to store values for states (Tabular RL). However, with LLMs, the "state space" is the set of all possible text combinations—which is effectively infinite.

  • The Concept: Instead of a table, we use a Deep Neural Network (the LLM itself) as a "Function Approximator."

  • The Goal: To learn a mapping from a sequence of tokens to a probability distribution (Policy) or a scalar value (Value Function) without having seen that exact sequence before.

  • Generalization: This allows the model to "guess" the value of a state it has never encountered based on its semantic similarity to known states.


10. Temporal Difference (TD) Prediction and Control

This is the "engine" behind how models learn over time without waiting for the very end of a conversation.

TD Prediction (Learning the Value)

  • Bootstrapping: The model updates its estimate of a state's value ($V(s)$) based on the reward it just got plus its estimate of the next state ($V(s')$).

  • Formula Logic: $V(s) \leftarrow V(s) + \alpha \left[ r + \gamma V(s') - V(s) \right]$, i.e., the old estimate is nudged toward the bootstrapped target $r + \gamma V(s')$.

  • LLM Context: This is how Value Models are trained to predict "How good is this partial sentence?" before the sentence is even finished.
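The update rule above fits in one function. A toy sketch using a dictionary as the value table and partial sentences as states (the example states and values are invented):

```python
def td0_update(v, s, r, s_next, alpha=0.1, gamma=1.0):
    """One TD(0) step: nudge V(s) toward the bootstrapped target r + gamma * V(s')."""
    td_target = r + gamma * v.get(s_next, 0.0)
    v[s] = v.get(s, 0.0) + alpha * (td_target - v.get(s, 0.0))
    return v

# The value of a partial sentence is pulled toward the estimate of the
# slightly longer sentence, before any final reward is observed:
v = {"The answer is": 0.0, "The answer is 4": 0.9}
td0_update(v, "The answer is", r=0.0, s_next="The answer is 4")
```

In an actual LLM pipeline the dictionary is replaced by a neural value head, but the target it regresses toward has the same shape.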

TD Control (Finding the Best Policy)

  • SARSA (On-Policy): Learns the value of the action the model is actually taking. It's safer but can be slower.

  • Q-Learning (Off-Policy): Learns the value of the best possible next action, regardless of what the model actually does.

  • LLM Context: Most LLM RL (like PPO) is "On-Policy" during the actual update phase to ensure the model doesn't drift into generating gibberish that it can't recover from.
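The on-policy/off-policy distinction shows up in a single line: which next-state value you bootstrap from. A toy Q-table sketch (state and action names are illustrative):

```python
def q_learning_target(q, s_next, r, gamma=1.0):
    """Off-policy target: bootstrap from the BEST action in the next state."""
    return r + gamma * max(q[s_next].values())

def sarsa_target(q, s_next, a_next, r, gamma=1.0):
    """On-policy target: bootstrap from the action actually taken next."""
    return r + gamma * q[s_next][a_next]

# Toy Q-table with two candidate next tokens:
q = {"s1": {"left": 0.2, "right": 0.8}}
# Q-learning bootstraps from "right" (0.8) even if the policy sampled "left".
```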


11. Policy Gradient Methods

This is the most important family of algorithms for Master's prep because PPO and GRPO are Policy Gradient methods.

  • Direct Optimization: Unlike Q-learning (which picks the highest value), Policy Gradients directly tweak the weights of the LLM to make "good" tokens more likely ($\uparrow P(token)$) and "bad" tokens less likely ($\downarrow P(token)$).

  • The Objective: We maximize $J(\theta) = E[\log \pi_\theta(a|s) \cdot A]$, whose gradient $\nabla_\theta J(\theta) = E[\nabla_\theta \log \pi_\theta(a|s) \cdot A]$ is the policy gradient, where $A$ is the Advantage.

    • If Advantage is positive: "This token was better than average, increase its probability."

    • If Advantage is negative: "This token was worse than average, decrease its probability."
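The sign logic above is visible in the per-token surrogate loss. A scalar sketch (`pg_loss_term` is an illustrative name, not a library function):

```python
import math

def pg_loss_term(log_prob, advantage):
    """REINFORCE-style per-token loss: -log pi(a|s) * A.
    Gradient descent on this raises the probability of positive-advantage
    tokens and lowers it for negative-advantage tokens."""
    return -log_prob * advantage

# log_prob is negative (probability < 1), so the loss sign tracks the advantage:
loss_good = pg_loss_term(math.log(0.1), advantage=+1.0)  # positive loss, pushed down
loss_bad = pg_loss_term(math.log(0.1), advantage=-1.0)   # negative loss
```

In practice this is summed over tokens and batched, but the per-token term is exactly this product.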


12. Trust Region Policy Optimization (TRPO)

Before PPO, there was TRPO. It addressed a massive problem: The Collapse.

  • The Problem: In RL, one bad update can ruin the entire model. If the weights change too much, the LLM starts outputting nonsense, and it can't "unlearn" that because its future data will also be nonsense.

  • The Solution: TRPO uses a mathematical constraint (KL Divergence) to ensure the "new" policy isn't too different from the "old" policy.

  • The Catch: It requires calculating a "Fisher Information Matrix," which is computationally expensive and nearly impossible for LLMs with billions of parameters.


13. Proximal Policy Optimization (PPO)

PPO is the "cleaner, faster brother" of TRPO and is the default for most LLM training.

  • The "Clipping" Trick: Instead of complex math, PPO simply clips the update. If the math says "make this token 10x more likely," PPO says "No, we only allow a 20% change at a time."

  • Objective Function:

    $$L^{CLIP}(\theta) = \mathbb{E}_t\left[\min\left(r_t(\theta)\, A_t,\ \text{clip}(r_t(\theta),\ 1-\epsilon,\ 1+\epsilon)\, A_t\right)\right]$$
    where $r_t(\theta) = \pi_\theta(a_t \mid s_t) / \pi_{\theta_{old}}(a_t \mid s_t)$ is the policy ratio and $\epsilon$ is typically around $0.2$.
  • Why it wins: It’s easy to implement, stable, and works beautifully with the high-dimensional space of LLMs.
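The clipping trick can be sketched for a single token as plain scalar arithmetic (a toy version, not a full training loop; `ratio`, `advantage`, and `eps` are the quantities named in the objective above):

```python
def ppo_clip_objective(ratio, advantage, eps=0.2):
    """Per-token PPO objective: min(ratio * A, clip(ratio, 1-eps, 1+eps) * A).
    Beyond the clip range, a larger policy shift earns no extra credit,
    which caps the effective update size."""
    clipped = max(1.0 - eps, min(1.0 + eps, ratio))
    return min(ratio * advantage, clipped * advantage)

# A 10x probability increase is credited the same as a 1.2x increase:
capped = ppo_clip_objective(ratio=10.0, advantage=1.0)  # 1.2
```

Note the `min` also makes the objective pessimistic for negative advantages: shrinking a bad token's probability past the clip boundary still counts fully against the policy.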


14. Group Relative Policy Optimization (GRPO)

This is the hottest topic in RL for LLMs right now (pioneered by DeepSeek).

  • The Innovation: Traditionally, PPO requires a Critic (Value) model, a second network that predicts the expected future reward of a partial sequence, to calculate the "Baseline." Keeping it in memory roughly doubles the GPU footprint.

  • How GRPO works:

    1. For one prompt, the LLM generates a group of outputs (e.g., 8 different answers).

    2. It calculates the reward for all 8.

    3. The "Baseline" is simply the average reward of that group.

    4. The "Advantage" for each answer is how much better it is than its siblings in that specific group.

  • The Benefit: You can fit much larger models on the same hardware because you delete the Critic/Value model entirely.
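The four steps above fit in a few lines. A sketch of the group-relative advantage (the standard-deviation scaling follows the common GRPO formulation; a zero-variance group simply yields zero advantages):

```python
def grpo_advantages(rewards):
    """Group-relative advantages: each sampled output's reward minus the
    group mean, scaled by the group standard deviation. No critic/value
    model is needed; the group itself is the baseline."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5
    if std == 0.0:
        std = 1.0  # all rewards equal: every advantage is simply 0
    return [(r - mean) / std for r in rewards]

# 8 sampled answers to one prompt, scored 1 (correct) or 0 (wrong):
adv = grpo_advantages([1, 0, 0, 1, 0, 0, 0, 0])
# Correct answers get positive advantage; wrong ones negative.
```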


15. Summary of the RL-LLM Workflow

  1. SFT (Supervised Fine-Tuning): Start with a model that knows how to talk.

  2. Rollout: The model generates multiple sequences (Actions).

  3. Evaluation: A Reward Model or Rule-based Verifier (for Math/Code) scores them.

  4. Advantage Calculation: Use GAE or Group Averaging (GRPO) to see which tokens were "stars."

  5. Update: Use PPO or GRPO to update the model weights within a "Trust Region" so it doesn't break.


16. Comprehensive RL-LLM Example: The "Logical Reasoner"

Imagine we are training an LLM to solve a logic puzzle: "If A is taller than B, and B is taller than C, who is the tallest?"

Step 1: The MDP Setup

  • State ($s_0$): The prompt text.

  • Action ($a_t$): The token "A" produced during the reasoning chain.

  • Transition ($P$): Deterministic. The new state becomes the prompt + "A".

  • Policy ($\pi$): The LLM's current weights deciding which name to pick next.

Step 2: Generation (Beam Search & CoT)

The model doesn't just bark an answer. We force it to use Chain of Thought (CoT).

  • The model generates multiple "paths" using Beam Search to explore high-probability token sequences.

  • Path 1: "Compare A and B... A is taller. Compare B and C... B is taller. So A > B > C. Answer: A."

  • Path 2: "A is taller than B. C is short. Answer: C."

Step 3: Evaluation (The Verifier)

Instead of a human, we use a Verifier (common in o1-style models).

  • Outcome Verifier: Checks the final answer "A" vs "C". "A" gets +1.0, "C" gets 0.0.

  • Process Verifier (PRM): Looks at Path 1 and gives positive rewards to the intermediate logical steps "A > B" and "B > C".

Step 4: The Optimization (GRPO vs. PPO)

  • If using PPO: We would need a separate Value Model to look at the sentence "Compare A and B..." and predict: "I think this sentence has a 0.8 probability of being correct." This requires massive VRAM.

  • If using GRPO: We take 8 different paths the model generated. We calculate the average score of those 8 paths. If Path 1 scored 1.0 and the average was 0.4, Path 1 has a Positive Advantage of 0.6.

Step 5: The Update (Policy Gradient + GAE)

  • We use GAE to smooth out the rewards across the tokens.

  • We use the PPO/GRPO clipping to ensure the model doesn't suddenly become "overconfident" and start repeating "A A A A A" because it got a high reward for "A".


17. Why this is "Important" (The Prep Perspective)

In your Master's studies, you will likely be asked about the "Credit Assignment Problem."

  • Definition: In a long sequence of tokens, which specific token was responsible for the correct answer?

  • The RL Solution:

    1. TD Learning helps estimate values step-by-step.

    2. GAE helps distribute the "credit" (advantage) effectively.

    3. Process Verifiers provide dense rewards so the model doesn't have to "guess" which step was the breakthrough.


18. Key Comparison Table for Fast Review

| Concept | High-Level Definition | LLM Importance |
| --- | --- | --- |
| MDP | Framework of State, Action, Reward. | Defines the text generation "game." |
| Monte Carlo | Reward at the very end. | Simple but slow; used for final answer scoring. |
| TD Learning | Reward/Value updated step-by-step. | Helps the model "anticipate" success mid-sentence. |
| PPO | Policy update with a safety "clip." | The backbone of RLHF (ChatGPT, Claude). |
| GRPO | Group-relative rewards (no Critic). | The secret sauce for efficient reasoning (DeepSeek). |
| CoT + RL | Rewarding the "thinking" steps. | Essential for complex math, coding, and logic. |

Reinforcement Learning for Large Language Models (LLMs) – Descriptive End-to-End Notes


1. Introduction: Why Reinforcement Learning is Needed for LLMs

Large Language Models (LLMs) such as GPT, LLaMA, or Claude are initially trained using self-supervised learning, where the objective is to predict the next token given previous tokens. This training paradigm, known as Maximum Likelihood Estimation (MLE), enables models to learn grammar, facts, and general language patterns at scale.

However, next-token prediction alone does not guarantee that the model will be:

  • Helpful to users

  • Safe and aligned with human values

  • Honest and non-hallucinatory

  • Concise or context-aware

For example, a model may generate a response that is fluent but factually incorrect, unsafe, or misaligned with user intent. These properties are difficult—or impossible—to encode directly into a likelihood-based loss function.

This is where Reinforcement Learning (RL) becomes essential. RL allows us to optimize models against non-differentiable, high-level objectives, such as human preferences, safety guidelines, or task success. In the context of LLMs, RL is primarily used for alignment, i.e., shaping model behavior to better match what humans actually want.


2. Framing Language Modeling as a Reinforcement Learning Problem

To apply reinforcement learning to LLMs, we reinterpret text generation as a sequential decision-making process.

  • The agent is the language model itself.

  • The environment consists of the user prompt, conversation history, and task context.

  • The state is the current text context (prompt + generated tokens so far).

  • The action is the selection of the next token (or a sequence of tokens).

  • The policy is the probability distribution over tokens produced by the LLM.

  • The reward is a scalar signal representing how good the generated response is.

An entire response generation—from the first token to the final token—can be treated as a single episode in reinforcement learning.

Unlike classical RL (e.g., robotics or games), LLMs operate in an extremely large action space (the vocabulary) and often receive rewards only at the end of the episode, making the problem particularly challenging.


3. The Complete LLM Training Pipeline

Modern LLM training typically follows three major phases:

3.1 Pretraining

In the pretraining phase, the model is trained on massive amounts of unlabeled text using next-token prediction. The objective is purely statistical: maximize the likelihood of observed text.

This phase gives the model broad linguistic and world knowledge but does not enforce alignment with human preferences.

3.2 Supervised Fine-Tuning (SFT)

After pretraining, the model is fine-tuned on a smaller, high-quality dataset of human-written prompt–response pairs. This step teaches the model how to respond in a helpful and instruction-following manner.

SFT significantly improves usability but still cannot fully capture nuanced human preferences or safety constraints.

3.3 Reinforcement Learning Alignment

In the final phase, reinforcement learning is applied to directly optimize the model’s outputs according to preference-based reward signals. This phase is where RLHF, RLAIF, or DPO comes into play.


4. Reinforcement Learning from Human Feedback (RLHF)

4.1 Core Motivation

Human judgment is often the best signal for determining whether a model response is good. RLHF leverages this by explicitly training models based on human preferences, rather than predefined rules or heuristics.

4.2 Data Collection via Human Preferences

Instead of asking humans to score responses numerically (which is inconsistent and subjective), humans are shown multiple responses to the same prompt and asked to rank or choose the better one.

This pairwise comparison approach is more reliable and easier for humans.

4.3 Training the Reward Model

A reward model (RM) is trained to predict which response humans would prefer. The reward model takes a prompt–response pair as input and outputs a single scalar reward.

The reward model is usually architecturally similar to the base LLM but smaller in size.

Training uses a pairwise ranking loss, such as the Bradley–Terry loss, which encourages higher rewards for preferred responses and lower rewards for rejected ones.
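The pairwise loss can be sketched as a scalar function (a minimal version, assuming scalar rewards for a chosen and a rejected response; batching over pairs is omitted):

```python
import math

def bradley_terry_loss(r_chosen, r_rejected):
    """Pairwise ranking loss: -log sigmoid(r_chosen - r_rejected).
    The loss shrinks as the preferred response's reward pulls ahead of
    the rejected one's, which is exactly the ordering we want the RM to learn."""
    margin = r_chosen - r_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# A wider reward gap in the right direction gives a smaller loss:
wide_gap = bradley_terry_loss(2.0, 0.0)
narrow_gap = bradley_terry_loss(0.5, 0.0)
```

When the two rewards are equal the loss is $\log 2$, the "coin flip" baseline.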


5. Policy Optimization with Proximal Policy Optimization (PPO)

Once the reward model is trained, it is used as a proxy for human judgment. The LLM is then fine-tuned using a reinforcement learning algorithm, most commonly Proximal Policy Optimization (PPO).

5.1 Why PPO?

Vanilla policy gradient methods are unstable when applied to large neural networks. PPO introduces constraints that prevent excessively large updates to the policy, ensuring stable training.

5.2 PPO Objective in LLMs

The PPO objective maximizes expected reward while restricting how much the new policy can deviate from the old one. This is critical when optimizing models with billions of parameters.

5.3 KL-Divergence Constraint

To prevent the model from drifting too far from the supervised fine-tuned behavior, a KL-divergence penalty is added. The reward is adjusted as:

Reward = Reward_Model_Output − β × KL(New_Policy || Reference_Policy)

This ensures that the model improves alignment without losing linguistic quality.
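The adjusted reward above can be written per token. A scalar sketch, using the simple log-ratio as the per-token KL estimate (function and argument names are illustrative):

```python
def shaped_reward(rm_score, logp_new, logp_ref, beta=0.1):
    """KL-penalized reward: rm_score - beta * (log pi_new - log pi_ref).
    The penalty grows when the new policy puts much more probability on a
    token than the reference (SFT) policy, discouraging drift."""
    return rm_score - beta * (logp_new - logp_ref)

# Same probability as the reference -> no penalty:
unpenalized = shaped_reward(1.0, logp_new=-1.0, logp_ref=-1.0)  # 1.0
# Much higher probability than the reference -> reward is docked:
penalized = shaped_reward(1.0, logp_new=-0.5, logp_ref=-2.0)
```

In practice the reward-model score is often applied only at the final token while the KL term is applied at every token.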


6. Challenges and Failure Modes of RLHF

Despite its effectiveness, RLHF has several challenges:

  • Reward hacking: The model may exploit weaknesses in the reward model.

  • Mode collapse: The model may become overly cautious or generic.

  • High cost: Human feedback collection and PPO training are expensive.

These challenges have motivated simpler and more stable alternatives.


7. Reinforcement Learning from AI Feedback (RLAIF)

RLAIF replaces human annotators with stronger LLMs that act as judges. This approach dramatically reduces cost and improves scalability.

However, it introduces the risk of bias amplification and requires careful prompt and rubric design.


8. Direct Preference Optimization (DPO)

8.1 Motivation for DPO

DPO was introduced to eliminate the complexity of PPO and reward modeling while preserving the benefits of preference-based learning.

8.2 How DPO Works

Instead of training a reward model, DPO directly optimizes the policy using preference pairs. It increases the likelihood of preferred responses relative to rejected ones.

DPO can be derived as a special case of KL-regularized RL, making it theoretically grounded and practically stable.
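Under these assumptions the DPO loss is a one-liner over sequence-level log-probabilities. A scalar sketch (all argument names are illustrative; real implementations sum token log-probs per response):

```python
import math

def dpo_loss(logp_w_policy, logp_l_policy, logp_w_ref, logp_l_ref, beta=0.1):
    """DPO pairwise loss: -log sigmoid(beta * (policy margin - reference margin)),
    where each margin is log p(chosen) - log p(rejected)."""
    policy_margin = logp_w_policy - logp_l_policy
    ref_margin = logp_w_ref - logp_l_ref
    logits = beta * (policy_margin - ref_margin)
    return -math.log(1.0 / (1.0 + math.exp(-logits)))

# If the policy already prefers the chosen response more than the
# reference does, the loss drops below the log(2) baseline:
improved = dpo_loss(-0.5, -2.0, -1.0, -1.0)
```

There is no sampling loop and no reward model call, which is where the stability and simplicity claims come from.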

8.3 Advantages of DPO

  • No explicit reward model

  • No reinforcement learning loop

  • More stable and simpler training


9. RL for Tool Use and Agentic LLMs

When LLMs act as agents—calling tools, APIs, or performing multi-step reasoning—the RL formulation becomes even more important.

Rewards may depend on:

  • Task success

  • Correct tool usage

  • Latency and cost

In such systems, the LLM policy selects not only tokens but also actions, making RL a natural fit.


10. Offline vs Online Reinforcement Learning in LLMs

Most alignment methods today rely on offline RL, where training is done on static datasets to minimize risk.

Online RL, where models learn directly from real-time interactions, offers adaptability but introduces significant safety concerns.


11. Credit Assignment and Long-Horizon Issues

LLMs typically receive a reward only after the full response is generated. Assigning credit to individual tokens is non-trivial.

Common solutions include:

  • Monte Carlo rollouts

  • Token-level advantage estimation

  • Segment-level rewards


12. Safety, Alignment, and Constitutional AI

Safety constraints can be integrated via:

  • Negative rewards for unsafe outputs

  • Rule-based reward shaping

  • Constitutional AI, where models critique and revise their own outputs based on predefined principles


13. Industry Tooling and Practical Stack

  • Pretraining: Megatron-LM, DeepSpeed

  • Supervised Fine-Tuning: Hugging Face Trainer

  • RLHF & DPO: Hugging Face TRL

  • Reward Models: DeBERTa, LLaMA-based classifiers


14. Conceptual Summary

Reinforcement Learning for LLMs is primarily an alignment technique rather than a capability booster. By optimizing models against preference-based rewards, we can move beyond likelihood-based training and align LLM behavior with human values, safety requirements, and real-world usefulness.


15. Interview-Level Closing Statement

Reinforcement learning enables large language models to optimize for human-centric objectives that cannot be expressed through traditional supervised learning. Methods such as RLHF and DPO form the backbone of modern LLM alignment and are critical for deploying safe, helpful, and trustworthy AI systems.





Agent learning via early experience


1. The "Cold Start" Problem

In pure RL (Tabular or Deep), an agent starts with zero knowledge and takes random actions. For an LLM agent, "random actions" mean gibberish text.

  • The Issue: The probability of a random agent generating a coherent 10-step reasoning chain that solves a math problem is effectively zero.

  • The Solution: Early experience—giving the agent a "head start" using existing data before letting it explore on its own.


2. Supervised Fine-Tuning (SFT) as Early Experience

The most common form of early experience for agents is SFT.

  • Concept: You take a pre-trained model and fine-tune it on high-quality, human-curated "trajectories."

  • Agent Perspective: A trajectory is a sequence of transitions, $s_0 \rightarrow a_0 \rightarrow r_1 \rightarrow s_1 \rightarrow a_1 \rightarrow \dots$, i.e., repeated $(State, Action, Reward)$ steps.

  • Goal: To move the model’s initial policy $\pi_0$ into the "neighborhood" of high reward so that when RL starts, it actually hits some positive rewards occasionally.


3. Behavioral Cloning (BC)

Behavioral Cloning is the simplest form of learning from early experience.

  • How it works: The agent treats the expert's actions as labels in a supervised learning task.

  • The "Distribution Shift" Risk: If the agent makes a small mistake and enters a state the expert never visited, it won't know how to recover. This is why we move from BC to RL.


4. Replay Buffers & "Warm" Buffers

In early learning, we don't just use live experience.

  • Offline RL: Learning entirely from a fixed dataset of early experiences without interacting with the environment.

  • Warm-starting the Buffer: Before the agent starts "acting," we fill its memory (Replay Buffer) with "Expert Demonstrations." When the agent starts learning, it samples a mix of its own (clumsy) attempts and the expert's (perfect) attempts.


5. Curiosity-Driven Exploration (Intrinsic Motivation)

When an agent is in the "early experience" phase, external rewards are often sparse (you only get a reward at the very end).

  • The Concept: Give the agent an "Intrinsic Reward" for visiting new states or seeing things it can't predict yet.

  • Benefit: This prevents the agent from getting stuck in one corner of the environment and encourages it to gather diverse early experiences.


6. Learning from "Hindsight" (HER - Hindsight Experience Replay)

This is a brilliant "early experience" hack for agents.

  • The Scenario: The agent tries to reach Goal A but ends up at Goal B. Normally, this is a "fail" (0 reward).

  • The Hack: We tell the agent, "Pretend you were actually TRYING to reach Goal B all along."

  • Result: The agent learns how to reach Goal B from that experience, making even its early failures valuable for learning.
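The relabelling trick can be sketched directly on a stored trajectory (a toy format; real HER operates on replay-buffer transitions with goal-conditioned rewards):

```python
def her_relabel(trajectory, achieved_goal):
    """Hindsight relabelling: pretend the goal we actually reached was the
    intended one, turning a failed episode into a useful training example."""
    relabeled = []
    for state, action, _orig_goal in trajectory:
        reward = 1.0 if state == achieved_goal else 0.0
        relabeled.append((state, action, achieved_goal, reward))
    return relabeled

# Agent aimed for "goal_A" but ended up at "goal_B":
traj = [("start", "move", "goal_A"), ("goal_B", "stop", "goal_A")]
new_traj = her_relabel(traj, achieved_goal="goal_B")
# The final transition now carries reward 1.0 for reaching "goal_B".
```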


7. The Transition: From Imitation to Optimization

The transition from early experience (SFT/BC) to RL (PPO/GRPO) is delicate.

  • KL-Divergence Constraint: During early RL steps, we penalize the model if it moves too far from its "early experience" policy.

  • Why? To ensure it doesn't "forget" the basic language and logic it learned during the SFT phase while it tries to maximize rewards.


8. Interview "Checklist" Questions on Early Experience

  1. Q: Why can't we train an LLM agent using ONLY RL from scratch? A: The state-action space is too vast. The "Reward Signal" would be too sparse for the model to ever find by accident.

  2. Q: What is the difference between Imitation Learning and Reinforcement Learning? A: Imitation Learning (early experience) teaches the model to match a pattern. RL teaches the model to maximize a result.

  3. Q: How does "Expert Iteration" work? A: 1. Agent generates many attempts. 2. We use a Verifier to find the few successful ones. 3. We fine-tune the agent on those successful attempts (creating "new" early experience). 4. Repeat.
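The Expert Iteration loop in Q3 can be sketched with toy stand-ins for the model, the verifier, and the fine-tuning step (all names and the arithmetic task are invented for illustration):

```python
def expert_iteration(generate, verify, finetune, prompts, rounds=2):
    """Expert Iteration: sample attempts, filter with a verifier,
    fine-tune on the survivors, repeat."""
    wins = []
    for _ in range(rounds):
        attempts = [(p, g) for p in prompts for g in generate(p)]
        wins = [(p, g) for p, g in attempts if verify(p, g)]
        finetune(wins)  # the wins become new "early experience"
    return wins

# Toy stand-ins: the "model" proposes candidate answers, the verifier checks them.
def generate(prompt):
    return ["3", "4", "5"]            # candidate answers for "2+2"

def verify(prompt, answer):
    return answer == "4"              # rule-based checker

collected = []
def finetune(wins):
    collected.extend(wins)            # stand-in for an SFT update

final = expert_iteration(generate, verify, finetune, prompts=["2+2"])
```

In the real loop the `generate` step samples from the updated model each round, so the pool of verified wins improves over iterations.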


Summary for your folder:

Agent Learning via Early Experience is about solving the Search Problem. By using SFT, Behavioral Cloning, and Expert Demonstrations, we place the agent in a high-probability zone for success, allowing RL to then fine-tune those behaviors into optimal strategies.



Part 1: Foundations & Markov Processes (Questions 1–10)

  1. Q: What is the "Markov Property"? A: The future state depends only on the current state and action, not on the sequence of events that preceded it.

  2. Q: Define the components of a Markov Decision Process (MDP). A: $(S, A, P, R, \gamma)$ — States, Actions, Transition Probability, Reward Function, and Discount Factor.

  3. Q: What is the role of the Discount Factor ($\gamma$)? A: It determines the importance of future rewards. $\gamma \approx 0$ makes the agent "myopic" (greedy), while $\gamma \approx 1$ makes it "farsighted."

  4. Q: What is a "Policy" ($\pi$)? A: A mapping from states to a probability distribution over actions. It defines the agent's behavior.

  5. Q: Differentiate between $V(s)$ and $Q(s, a)$. A: $V(s)$ is the expected return from being in state $s$. $Q(s, a)$ is the expected return from taking action $a$ in state $s$.

  6. Q: What are the Bellman Equations? A: Recursive equations that decompose the value function into the immediate reward plus the discounted value of the next state.

  7. Q: Explain "Exploration vs. Exploitation." A: Exploration is trying new actions to find better strategies; Exploitation is using known actions that yield high rewards.

  8. Q: What is Model-Free vs. Model-Based RL? A: Model-free (e.g., Q-learning) learns directly from experience. Model-based learns a transition model of the environment to "plan" ahead.

  9. Q: In an LLM context, what is the "State Space"? A: The "State" is the sequence of tokens (prompt + generated text) currently in the context window.

  10. Q: Why is LLM training often viewed as a "Partially Observable" MDP? A: Because the model doesn't "know" the hidden intent of the user perfectly; it only sees the text tokens provided.


Part 2: Temporal Difference & Monte Carlo (Questions 11–20)

  11. Q: How does Monte Carlo (MC) differ from Temporal Difference (TD) learning? A: MC updates after a full episode (high variance, zero bias). TD updates after every step by bootstrapping (low variance, some bias).

  12. Q: What is "Bootstrapping"? A: Updating an estimate (Value Function) based on another estimate of the future, rather than a final outcome.

  13. Q: Explain SARSA vs. Q-Learning. A: SARSA is On-Policy (learns from the current action). Q-Learning is Off-Policy (learns based on the best possible future action).

  14. Q: What is the "Deadly Triad" in RL? A: The combination of Function Approximation, Bootstrapping, and Off-policy learning, which can cause training to diverge/fail.

  15. Q: What is TD(0)? A: The simplest TD method where the value estimate is updated using only one step of lookahead.

  16. Q: Define "Experience Replay." A: Storing past transitions $(s, a, r, s')$ in a buffer and sampling them randomly to break correlation in training data (common in DQN).

  17. Q: What is the "Credit Assignment Problem"? A: The difficulty of determining which specific action in a long sequence was responsible for a final reward.

  18. Q: Why is TD learning preferred for LLMs over pure Monte Carlo? A: Because waiting for a 1,000-token generation to finish before updating (MC) is too slow and high-variance for stable training.

  19. Q: What are "Eligibility Traces"? A: A bridge between TD and MC that keeps a record of which states were visited recently to assign credit more effectively.

  20. Q: Define "n-step TD." A: An approach that looks $n$ steps ahead before bootstrapping, balancing the benefits of both TD and MC.


Part 3: Policy Gradients & Optimization (Questions 21–35)

  21. Q: What is the core idea of Policy Gradient (PG) methods? A: Instead of learning values, PG directly optimizes the policy weights to maximize the expected return using gradient ascent.

  22. Q: What is the "REINFORCE" algorithm? A: The simplest Monte Carlo policy gradient method that scales the gradient by the total return of the episode.

  23. Q: What is the "Advantage Function" $A(s, a)$? A: $A(s, a) = Q(s, a) - V(s)$. It measures how much better an action is compared to the average action in that state.

  24. Q: Why use Actor-Critic models? A: The Actor learns the policy, while the Critic learns the value function to reduce the variance of the policy gradient.

  25. Q: What is the primary problem TRPO (Trust Region Policy Optimization) solves? A: It prevents "catastrophic forgetting" or policy collapse by ensuring updates stay within a "Trust Region" (KL-divergence constraint).

  26. Q: Why is PPO (Proximal Policy Optimization) more popular than TRPO? A: PPO is easier to implement and computationally cheaper because it uses a "clipped" objective instead of complex second-order math.

  27. Q: Explain PPO "Clipping." A: It limits the change in the policy ratio (new/old) to a range (e.g., $0.8$ to $1.2$) so the update isn't too aggressive.

  28. Q: What is GAE (Generalized Advantage Estimation)? A: A method to compute advantages that combines multi-step TD errors to balance bias and variance.

  29. Q: What is GRPO (Group Relative Policy Optimization)? A: A memory-efficient RL algorithm that removes the Critic model by using the average reward of a group of outputs as the baseline.

  30. Q: Why is GRPO significant for "Reasoning" models (like DeepSeek-R1)? A: It saves massive GPU memory, allowing RL to be applied to very large models without needing a second "Value" model in VRAM.

  31. Q: What is the "Policy Ratio" in PPO? A: The probability of an action under the new policy divided by its probability under the old policy.

  32. Q: Define "Entropy Regularization" in RL. A: Adding an entropy term to the loss function to encourage the model to stay "uncertain" and continue exploring.

  33. Q: What is the "KL Penalty" in RLHF? A: A penalty added to the reward to prevent the RL model from deviating too far from the original "safe" SFT (Supervised Fine-Tuned) model.

  34. Q: Compare PPO and DPO (Direct Preference Optimization). A: PPO is an online RL method requiring a reward model; DPO is an offline method that optimizes the policy directly from preference data without a reward model.

  35. Q: What is "Reward Hacking"? A: When an agent finds a loophole in the reward function to get high scores without actually achieving the desired goal (e.g., generating gibberish that a Reward Model likes).


Part 4: RL applied to LLMs (Questions 36–50)

  36. Q: Explain the RLHF pipeline. A: 1. Supervised Fine-Tuning (SFT) $\rightarrow$ 2. Reward Model Training $\rightarrow$ 3. Reinforcement Learning (PPO/GRPO).

  37. Q: What is a "Reward Model" (RM)? A: A model trained on human rankings (e.g., "A is better than B") to output a scalar score representing human preference.

  38. Q: Define ORM vs. PRM. A: ORM (Outcome Reward Model) scores the final answer. PRM (Process Reward Model) scores each individual step of reasoning.

  39. Q: Why are PRMs currently "trending"? A: They allow for better "Reasoning" training by providing "dense" feedback on every logical step, reducing hallucinations.

  40. Q: How does RL help in "Chain of Thought" (CoT)? A: RL rewards the model for generating the correct sequence of steps that lead to the right answer, not just the answer itself.

  41. Q: What is "Self-Correction" in the context of RL? A: Training the model via RL to recognize its own errors in a reasoning chain and backtrack to fix them.

  42. Q: Explain "Best-of-N" sampling. A: Generating $N$ completions and using the Reward Model to pick the one with the highest score.

  43. Q: What is the "Alignment Tax"? A: The potential drop in a model's raw capability or creativity (e.g., in creative writing) after it undergoes strict safety/human alignment.

  44. Q: What are "Verifiers"? A: Automated systems (like Code Compilers or Math Checkers) that provide binary rewards (Success/Fail) to train models on verifiable tasks.

  45. Q: How does RL handle "Sparse Rewards" in LLMs? A: By using Reward Shaping (adding intermediate rewards) or PRMs to give the model feedback before the very end of the text.

  46. Q: What is "Instruction Following" in the RL context? A: Using RL to maximize the reward for strictly adhering to constraints (e.g., "Summarize in exactly 50 words").

  47. Q: Define "Online" vs. "Offline" RL for LLMs. A: Online: The model generates text and gets rewards during training. Offline: The model learns from a pre-collected dataset of trajectories.

  48. Q: What is "RLAIF"? A: Reinforcement Learning from AI Feedback. Using a stronger model (like GPT-4) to provide rewards for a smaller model.

  49. Q: Why does PPO require four models in memory? A: 1. Policy model, 2. Value/Critic model, 3. Reference (original) model, 4. Reward Model.

  50. Q: What is the "KL-Divergence" constraint's mathematical role? A: It acts as a regularizer, ensuring the probability distribution of the new model doesn't shift too violently, preserving the linguistic coherence learned during pre-training.



