Transformers, Attention and Prompt Engineering Interview Questions



Transformers Architecture

  1. What is a Transformer? (Frequent)

    A Transformer is a deep learning architecture introduced in the paper "Attention Is All You Need." It relies entirely on the self-attention mechanism to process sequential data, like text, and avoids the use of recurrent neural networks (RNNs).

  2. What problem do Transformers solve that RNNs couldn't? (Frequent)

    Transformers primarily solve two major problems of RNNs:

    • Long-Range Dependencies: They can capture relationships between words far apart in a sequence more effectively.

    • Parallelization: Unlike the sequential nature of RNNs, Transformers can process all tokens in a sequence simultaneously, making training much faster.

  3. Explain the overall architecture of a Transformer. (Frequent)

    The original Transformer has an Encoder-Decoder structure. The Encoder processes the entire input sequence and creates a rich representation of it. The Decoder then uses this representation, along with the previously generated output, to produce the next token in the output sequence. Both the Encoder and Decoder are stacks of identical layers.

  4. What are the key components of a Transformer layer?

    Each Encoder and Decoder layer has two main sub-layers:

    • A Multi-Head Self-Attention mechanism.

    • A position-wise Feed-Forward Network.

      These are supplemented by Residual Connections and Layer Normalization.

  5. Explain the Encoder and Decoder stacks.

    • Encoder Stack: A series of N identical layers. Each layer takes a sequence of word embeddings and refines them using self-attention to encode contextual information from the entire input sequence.

    • Decoder Stack: Also a series of N identical layers. In addition to self-attention on the output generated so far, it also performs cross-attention over the encoder's output to focus on relevant parts of the input sequence.

  6. What is the role of Positional Encoding? (Frequent)

    The role of Positional Encoding is to inject information about the relative or absolute position of the tokens in the sequence. Since the self-attention mechanism itself is permutation-invariant (it doesn't care about order), this is crucial for the model to understand the sequence's structure.
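    The sinusoidal scheme from the original paper can be sketched in a few lines (a minimal NumPy version; the sizes `max_len=50` and `d_model=16` are illustrative choices):

```python
import numpy as np

def sinusoidal_positional_encoding(max_len: int, d_model: int) -> np.ndarray:
    """Return a (max_len, d_model) matrix of sin/cos positional encodings."""
    positions = np.arange(max_len)[:, np.newaxis]      # (max_len, 1)
    dims = np.arange(0, d_model, 2)[np.newaxis, :]     # (1, d_model / 2)
    angles = positions / np.power(10000.0, dims / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)   # even dimensions: sine
    pe[:, 1::2] = np.cos(angles)   # odd dimensions: cosine
    return pe

pe = sinusoidal_positional_encoding(max_len=50, d_model=16)
# Each row is a unique "position fingerprint" added to that token's embedding.
```

    Because each dimension oscillates at a different frequency, every position gets a distinct vector, and relative offsets correspond to fixed linear transformations.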

  7. Why is Positional Encoding added and not concatenated?

    It's added to the input embeddings rather than concatenated. Concatenation would increase the model dimension (and therefore the parameter count of every subsequent layer), while addition keeps it fixed. Because the embeddings are high-dimensional, the addition doesn't drastically distort the original embedding's meaning but provides a "nudge" in a direction that encodes position.

  8. What are Residual Connections and Layer Normalization? Why are they important? (Frequent)

    • Residual Connections (or Skip Connections) add the input of a sub-layer to its output (x + Sublayer(x)). They help prevent the vanishing gradient problem in deep networks, allowing for deeper models.

    • Layer Normalization stabilizes the training process by normalizing the inputs to each sub-layer. It helps ensure that the data flowing through the network has a consistent mean and variance.

  9. Explain the Feed-Forward Network in a Transformer.

    It's a simple, fully connected neural network applied to each position separately and identically. It consists of two linear transformations with a ReLU activation in between: FFN(x) = max(0, xW1 + b1)W2 + b2. Its purpose is to introduce non-linearity and transform the attention outputs into a more suitable representation for the next layer.

  10. How does the output of a Transformer decoder generate text?

    The decoder's final output is passed through a Linear layer followed by a Softmax function. The Linear layer projects the vector into a vocabulary-sized vector of scores (logits), and the Softmax function converts these scores into probabilities for each possible next word. The word with the highest probability is typically chosen.
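    This final projection step can be sketched as follows (toy sizes; a real model would use a much larger vocabulary and often sampling rather than pure argmax):

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)   # subtract max for numeric stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
d_model, vocab_size = 16, 100               # illustrative sizes
hidden = rng.normal(size=(d_model,))        # decoder's final hidden state
W_out = rng.normal(size=(d_model, vocab_size))
logits = hidden @ W_out                     # one score per vocabulary word
probs = softmax(logits)                     # probabilities, sum to 1
next_token_id = int(np.argmax(probs))       # greedy choice of next token
```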

  11. What are the advantages of Transformers over RNNs/LSTMs? (Frequent)

    • Parallelism: Can process tokens in parallel, leading to faster training.

    • Long-Range Dependencies: Better at capturing context between distant tokens.

    • State-of-the-Art Performance: Have become the foundation for most leading NLP models.

  12. What are the disadvantages of Transformers?

    The primary disadvantage is the computational complexity of the self-attention mechanism, which is quadratic (O(n²·d)) with respect to the sequence length n. This makes it very memory and compute-intensive for long sequences.

  13. What is a Vision Transformer (ViT)?

    A Vision Transformer (ViT) adapts the Transformer architecture for computer vision tasks. It works by splitting an image into a sequence of fixed-size patches, linearly embedding them, adding position embeddings, and feeding this sequence of vectors into a standard Transformer encoder.

  14. Differentiate between Encoder-only, Decoder-only, and Encoder-Decoder models. (Frequent)

    • Encoder-only (e.g., BERT): Designed for understanding context. Good for tasks like classification, sentiment analysis, and named entity recognition. They are bidirectional.

    • Decoder-only (e.g., GPT series): Designed for text generation. They are unidirectional (auto-regressive), meaning they can only look at past tokens to predict the next one.

    • Encoder-Decoder (e.g., T5, BART): Designed for sequence-to-sequence tasks like translation, summarization, or question answering where the input and output can be different lengths.

  15. Explain BERT, GPT, and T5 at a high level.

    • BERT (Bidirectional Encoder Representations from Transformers): An encoder-only model pre-trained to understand language by predicting masked words and next sentences. It's excellent for analysis tasks.

    • GPT (Generative Pre-trained Transformer): A decoder-only model pre-trained on a massive amount of text to predict the next word. It's excellent for text generation.

    • T5 (Text-to-Text Transfer Transformer): An encoder-decoder model that frames all NLP tasks as a "text-to-text" problem, where the model takes text as input and produces text as output.


Attention Mechanism

  1. What is the Attention mechanism? (Frequent)

    Attention is a mechanism that allows a neural network to focus on specific parts of an input sequence when producing an output. It calculates a set of attention weights that determine how much importance or "attention" to pay to each input element.

  2. Explain the concept of Query, Key, and Value. (Frequent)

    This is an analogy from information retrieval systems. For each token, we create three vectors:

    • Query (Q): Represents the current token that is "looking for" information.

    • Key (K): Represents the tokens in the sequence that have information to offer. The query is matched against all keys.

    • Value (V): Represents the actual content of the tokens. The final output is a weighted sum of the values.

  3. How is the attention score calculated? (Frequent)

    The score is calculated by taking the dot product of the Query (Q) vector of the current token with the Key (K) vector of every other token in the sequence.

  4. Explain Scaled Dot-Product Attention. (Frequent)

    This is the specific attention mechanism used in Transformers. The formula is:

    Attention(Q, K, V) = softmax(QKᵀ / √dₖ) V

    It involves three steps:

    1. Calculate dot product scores between all queries and keys.

    2. Scale the scores by dividing by the square root of the key dimension (√dₖ).

    3. Apply a softmax function to get the attention weights, then multiply by the Value matrix.
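    The three steps above can be sketched directly in NumPy (a minimal single-head version with illustrative sizes):

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)   # subtract max for stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V"""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)   # steps 1-2: dot products, then scale
    weights = softmax(scores)         # step 3: normalize to probabilities
    return weights @ V, weights       # weighted sum of the values

rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))   # 4 tokens, d_k = 8
K = rng.normal(size=(4, 8))
V = rng.normal(size=(4, 8))
out, w = scaled_dot_product_attention(Q, K, V)
# Each row of w is a distribution over the 4 input tokens.
```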

  5. Why is the dot product scaled by √dₖ? (Frequent)

    For large values of the key dimension dₖ, the dot products can grow very large in magnitude. This pushes the softmax function into regions where it has extremely small gradients, making learning difficult. Scaling by √dₖ counteracts this effect, leading to more stable training.

  6. What is the role of the Softmax function in attention? (Frequent)

    The softmax function converts the raw attention scores into a probability distribution. The resulting weights are all positive and sum to 1, which can be interpreted as the percentage of attention the model should pay to each input token.

  7. What is Self-Attention? (Frequent)

    Self-attention is when the attention mechanism is applied to a single sequence, relating different positions of that sequence to compute its representation. Here, the Queries, Keys, and Values all come from the same input sequence.

  8. What is Multi-Head Attention? (Frequent)

    Instead of performing a single attention calculation, multi-head attention runs the attention mechanism multiple times in parallel with different, learned linear projections of the Queries, Keys, and Values. The outputs are then concatenated and linearly transformed.
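    A minimal sketch of the per-head projections, concatenation, and output projection (toy sizes; real implementations batch the heads into a single tensor operation rather than looping):

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(X, heads_W, W_O):
    """Run attention once per head with its own Q/K/V projections, then
    concatenate the head outputs and apply a final linear projection."""
    head_outputs = []
    for W_Q, W_K, W_V in heads_W:
        Q, K, V = X @ W_Q, X @ W_K, X @ W_V
        d_k = K.shape[-1]
        weights = softmax(Q @ K.T / np.sqrt(d_k))
        head_outputs.append(weights @ V)
    return np.concatenate(head_outputs, axis=-1) @ W_O

rng = np.random.default_rng(0)
n_tokens, d_model, n_heads = 4, 16, 2
d_head = d_model // n_heads
X = rng.normal(size=(n_tokens, d_model))
heads_W = [tuple(rng.normal(size=(d_model, d_head)) for _ in range(3))
           for _ in range(n_heads)]
W_O = rng.normal(size=(n_heads * d_head, d_model))
out = multi_head_attention(X, heads_W, W_O)   # shape (4, 16)
```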

  9. Why use Multi-Head Attention? (Frequent)

    It allows the model to jointly attend to information from different representation subspaces at different positions. A single attention head might learn to focus on one type of relationship (e.g., subject-verb), while another head learns another (e.g., pronoun-antecedent). This provides a richer representation.

  10. Explain Masked Self-Attention in the Transformer decoder. (Frequent)

    In the decoder, during generation, the model should only be able to attend to previous positions in the output sequence. Masked self-attention achieves this by setting the attention scores for all future tokens to negative infinity before the softmax step, effectively "masking" them out and making their weights zero.
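    The causal mask can be sketched as follows: a strictly upper-triangular mask marks the future positions, whose scores are set to -inf so their softmax weights come out as exactly zero.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def masked_attention_weights(Q, K):
    """Causal mask: future positions get -inf before softmax -> weight 0."""
    n, d_k = Q.shape
    scores = Q @ K.T / np.sqrt(d_k)
    mask = np.triu(np.ones((n, n), dtype=bool), k=1)  # True above the diagonal
    scores = np.where(mask, -np.inf, scores)
    return softmax(scores)

rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))
K = rng.normal(size=(4, 8))
w = masked_attention_weights(Q, K)
# w is lower-triangular: token i attends only to tokens 0..i.
```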

  11. What is Cross-Attention?

    Cross-attention is used in the decoder of an encoder-decoder Transformer. Here, the Queries come from the decoder, while the Keys and Values come from the encoder's output. This allows the decoder to look at and focus on the most relevant parts of the input sequence while generating each token of the output sequence.

  12. How does attention help with long-range dependencies? (Frequent)

    Attention creates direct connections between any two tokens in the sequence, regardless of their distance. The path length between them is just one step. In contrast, RNNs need to pass information sequentially through all intermediate steps, making it hard to maintain context over long distances.

  13. What are Hard Attention and Soft Attention?

    • Soft Attention: The standard mechanism used in Transformers. It's differentiable and assigns a "soft" probabilistic weight to every input token.

    • Hard Attention: Instead of a weighted average, it selects just one part of the input to attend to. It's typically non-differentiable and requires more complex training methods like reinforcement learning.

  14. Can you visualize what the attention mechanism "learns"?

    Yes, by creating an attention heat map. This is a matrix where rows represent the output words (queries) and columns represent the input words (keys). The intensity of the color at each cell shows the weight or "attention" a specific output word pays to a specific input word.

  15. What is local attention?

    A modification of self-attention designed to reduce computational cost. Instead of attending to the entire sequence, each token only attends to a fixed-size window of surrounding tokens. For a fixed window size w, this makes the complexity linear in the sequence length (O(n·w)) instead of quadratic.
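    The windowing idea reduces to a simple mask (a sketch; positions outside the window would have their scores set to -inf before the softmax, just as in causal masking):

```python
import numpy as np

def local_attention_mask(n: int, window: int) -> np.ndarray:
    """Boolean mask: position i may attend only to j with |i - j| <= window."""
    idx = np.arange(n)
    return np.abs(idx[:, None] - idx[None, :]) <= window

mask = local_attention_mask(n=6, window=1)
# Row 3 allows only positions 2, 3 and 4; all other scores would be masked out.
```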


Prompt Engineering

  1. What is Prompt Engineering? (Frequent)

    Prompt engineering is the art and science of designing and refining input text (the "prompt") to guide a large language model (LLM) toward generating a desired and accurate output. It's about communicating your goal clearly to the model.

  2. Why is Prompt Engineering important? (Frequent)

    It's the primary way to interact with and control powerful LLMs. A well-crafted prompt can dramatically improve the quality, relevance, and safety of a model's response without needing to retrain or fine-tune the model itself.

  3. What is a prompt? What are its main components?

    A prompt is the input text given to an LLM. It can include any of these components:

    • Instruction: A specific task or command for the model.

    • Context: External information or background the model can use.

    • Input Data: The specific question or text to be processed.

    • Output Indicator: The format for the output (e.g., "JSON:", "A:").

  4. What is Zero-Shot Prompting? Give an example. (Frequent)

    This is when you ask the model to perform a task without giving it any prior examples of that task.

    • Example: "Translate the following English text to French: 'Hello, how are you?'"

  5. What is Few-Shot Prompting? Give an example. (Frequent)

    This is when you provide a few examples (typically 1 to 5) of the task in the prompt to help the model understand the pattern and desired output format. This is also called in-context learning.

    • Example:

      "English: sea otter -> French: loutre de mer

      English: peppermint -> French: menthe poivrée

      English: cheese -> French:"

  6. What is Chain-of-Thought (CoT) prompting? (Frequent)

    CoT prompting is a technique where you encourage the model to "think step by step" by providing a few-shot example that includes the reasoning process used to get to the final answer. This helps the model break down complex problems.

    • Example:

      "Q: The cafeteria had 23 apples. If they used 20 for lunch and bought 6 more, how many apples do they have?

      A: The cafeteria started with 23 apples. They used 20, so they had 23 - 20 = 3. They bought 6 more, so they have 3 + 6 = 9. The answer is 9.

      Q: [Your new problem here]"

  7. Why does Chain-of-Thought prompting improve results?

    It encourages the model to follow a logical reasoning path rather than jumping to a conclusion. This allocates more computational steps to the problem, reducing errors in tasks requiring arithmetic, commonsense, or symbolic reasoning.

  8. What is a System Prompt or Persona?

    A system prompt is an instruction given at the beginning of a conversation that sets the context, personality, and rules for the AI for the entire interaction. For example: "You are a helpful assistant that explains complex scientific topics to a five-year-old."

  9. What is a Negative Prompt?

    Common in image generation models, a negative prompt specifies what you do not want to see in the output. For example, when generating a photorealistic image, the negative prompt might be "cartoon, anime, watermark, text."

  10. What are some common techniques for writing effective prompts? (Frequent)

    • Be Specific and Clear: Avoid ambiguity.

    • Provide Examples (Few-Shot): Show, don't just tell.

    • Assign a Persona: "Act as a senior software developer..."

    • Use Delimiters: Use ```, ###, or < > to separate instructions from content.

    • Break Down Tasks: Ask the model to think step-by-step.

    • Specify the Output Format: Ask for JSON, a list, a table, etc.

  11. How do you control the output format of a language model?

    You explicitly ask for it in the prompt. For example: "Extract the names of all people mentioned in the text below and return them as a JSON list with the key 'names'."

  12. What is prompt chaining?

    Prompt chaining is the process of breaking a complex task into a series of simpler sub-tasks, where the output of one prompt becomes the input for the next prompt. This creates a multi-step workflow.
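    A sketch of such a two-step chain. `call_llm` is a hypothetical stand-in for any chat-completion call; it is stubbed here so the example is self-contained, and the prompt wording is purely illustrative:

```python
def call_llm(prompt: str) -> str:
    # In a real app this would call an LLM API; the stub just echoes the prompt.
    return f"[LLM output for: {prompt[:40]}...]"

def summarize_then_translate(document: str) -> str:
    # Step 1: the first prompt produces a summary...
    summary = call_llm(f"Summarize the following text in one sentence:\n{document}")
    # Step 2: ...which becomes the input of the second prompt.
    return call_llm(f"Translate the following into French:\n{summary}")

result = summarize_then_translate("A long report about quarterly sales...")
```

    Splitting the task this way also makes each step easier to inspect and debug than one monolithic prompt.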

  13. What is the difference between instruction tuning and prompt engineering?

    • Prompt Engineering: Modifying the input (prompt) to an existing model to get a better output. No model weights are changed.

    • Instruction Tuning: A form of fine-tuning where a pre-trained model is further trained on a large dataset of (instruction, output) pairs. This modifies the model's weights to make it better at following instructions in general.

  14. What is "prompt injection"?

    Prompt injection is a security vulnerability where a user provides malicious input that overrides or ignores the original system prompt, causing the model to behave in unintended ways.

  15. Explain Tree of Thoughts (ToT) prompting.

    ToT is an advanced prompting technique where the model explores multiple reasoning paths ("thoughts") at each step. It evaluates these paths and self-corrects, creating a tree-like structure of reasoning. It's more powerful than Chain-of-Thought for problems that require exploration or strategic lookahead.

  16. What is the ReAct (Reason and Act) framework?

    ReAct is a framework that combines reasoning and action. The LLM generates both a "thought" (a reasoning trace) and an "action" (like using a tool, e.g., a calculator or search engine). The result of the action is then fed back into the model to inform the next thought-action step, creating a powerful interactive loop.

  17. How do you debug a bad prompt?

    1. Simplify: Start with the simplest version of the prompt and gradually add complexity.

    2. Check for Ambiguity: Is there any way your prompt could be misinterpreted?

    3. Adjust Temperature/Top-p: Tweak model parameters to make the output more/less creative.

    4. Add Examples: Move from zero-shot to few-shot to provide more context.

    5. Rephrase: Try asking the question in a completely different way.

  18. What are some tools or frameworks used for prompt engineering?

    Frameworks like LangChain and LlamaIndex help developers build complex applications by managing prompts, chaining them together, and integrating LLMs with external data sources and tools.

  19. How do you evaluate the quality of a prompt?

    Evaluation can be done through:

    • Human Evaluation: Subjective scoring by humans based on relevance, accuracy, and helpfulness.

    • Model-Based Evaluation: Using another powerful LLM (like GPT-4) to score the output of a model based on a rubric.

    • Objective Metrics: For tasks like summarization (ROUGE scores) or code generation (unit test pass rates).

  20. What is self-consistency in prompting?

    It's a technique that improves on Chain-of-Thought. Instead of just generating one reasoning path, you prompt the model to generate multiple paths and then take the majority vote on the final answer. This tends to be more robust and accurate.
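    The voting step can be sketched as follows. `sample_fn` stands in for one sampled Chain-of-Thought completion (a real version would call an LLM with temperature > 0 and extract the final answer from each reasoning path); the sampled answers here are hard-coded for illustration:

```python
from collections import Counter

def self_consistent_answer(sample_fn, n_samples: int = 5) -> str:
    """Sample several reasoning paths and majority-vote on the final answer."""
    answers = [sample_fn() for _ in range(n_samples)]
    return Counter(answers).most_common(1)[0][0]

# Stub standing in for five sampled completions' extracted final answers.
samples = iter(["9", "9", "8", "9", "9"])
answer = self_consistent_answer(lambda: next(samples))  # majority vote
```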

  21. What is the role of temperature in a model's output?

    Temperature is a hyperparameter that controls the randomness of the output.

    • A low temperature (e.g., 0.1) makes the output more deterministic and focused, picking the most likely words. Good for factual tasks.

    • A high temperature (e.g., 0.9) makes the output more random and creative. Good for brainstorming or creative writing.
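    Temperature works by dividing the logits before the softmax, which sharpens or flattens the resulting distribution (a minimal sketch with made-up logits):

```python
import numpy as np

def sample_probs(logits, temperature: float):
    """Softmax over logits scaled by 1 / temperature."""
    z = np.asarray(logits) / temperature
    z = z - z.max()                  # numeric stability
    e = np.exp(z)
    return e / e.sum()

logits = [2.0, 1.0, 0.5]
cold = sample_probs(logits, temperature=0.1)  # nearly all mass on top token
warm = sample_probs(logits, temperature=0.9)  # flatter, more random sampling
```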
