Data Science Cheat Sheets (Final stage)
Statistics for Data Science – Cheat Sheet
1. Types of Statistics
- Descriptive Statistics → Summarizing data (mean, median, mode, variance, std. dev., etc.)
- Inferential Statistics → Drawing conclusions about a population from sample data (hypothesis testing, confidence intervals).
2. Types of Data
- Qualitative (Categorical)
  - Nominal → Categories without order (e.g., gender, colors).
  - Ordinal → Categories with order (e.g., ratings: Poor, Average, Good).
- Quantitative (Numerical)
  - Discrete → Countable (e.g., number of students).
  - Continuous → Measurable (e.g., height, weight).
3. Measures of Central Tendency
- Mean = Average
- Median = Middle value (robust to outliers)
- Mode = Most frequent value
4. Measures of Dispersion
- Range = Max – Min
- Variance = Average squared deviation from the mean
- Standard Deviation (σ) = √Variance
- IQR (Interquartile Range) = Q3 – Q1
- Coefficient of Variation (CV) = (Std. Dev. / Mean) × 100
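These dispersion measures can all be computed with Python's standard library; the `data` list below is a made-up example:

```python
import statistics

data = [4, 8, 15, 16, 23, 42]           # invented sample

range_ = max(data) - min(data)          # Range = Max - Min
var = statistics.pvariance(data)        # population variance
std = statistics.pstdev(data)           # sigma = sqrt(variance)
q1, q2, q3 = statistics.quantiles(data, n=4)   # quartiles
iqr = q3 - q1                           # IQR = Q3 - Q1
cv = std / statistics.mean(data) * 100  # Coefficient of Variation (%)

print(range_, round(std, 2), round(iqr, 2), round(cv, 1))
```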
5. Probability Basics
- Probability = Favorable outcomes / Total outcomes
- Addition Rule: P(A ∪ B) = P(A) + P(B) – P(A ∩ B)
- Multiplication Rule: P(A ∩ B) = P(A) × P(B|A)
6. Probability Distributions
- Binomial Distribution → Discrete; fixed number of trials (e.g., heads in coin tosses).
- Poisson Distribution → Discrete; number of events in a time/space interval (e.g., calls per hour).
- Normal Distribution → Continuous, bell-shaped; mean = median = mode.
- Uniform Distribution → Equal probability for all outcomes.
- Exponential Distribution → Continuous; time between events.
7. Sampling Methods
- Random Sampling
- Stratified Sampling (split into groups, sample from each)
- Cluster Sampling (sample random clusters)
- Systematic Sampling (every k-th item)
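A toy sketch of the four schemes in plain Python (the population and strata are invented; real stratified sampling would usually sample proportionally to stratum size):

```python
import random

random.seed(0)                          # reproducible toy example
population = list(range(1, 101))        # invented IDs 1..100

# Random sampling: every unit has an equal chance
simple = random.sample(population, 10)

# Systematic sampling: every k-th item after a random start
k = 10
start = random.randrange(k)
systematic = population[start::k]

# Stratified sampling: split into strata, sample from each
strata = {"low": population[:50], "high": population[50:]}
stratified = [x for group in strata.values() for x in random.sample(group, 5)]

# Cluster sampling: pick whole clusters at random
clusters = [population[i:i + 10] for i in range(0, 100, 10)]
cluster_sample = [x for c in random.sample(clusters, 2) for x in c]

print(len(simple), len(systematic), len(stratified), len(cluster_sample))
```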
8. Hypothesis Testing
- Formulate hypotheses:
  - Null Hypothesis (H₀): No effect/difference.
  - Alternative Hypothesis (H₁): Significant effect/difference.
- Steps:
  - Select significance level (α, usually 0.05).
  - Choose test (t-test, chi-square, ANOVA).
  - Calculate test statistic & p-value.
  - Compare p-value with α.
- Common Tests:
  - Z-test → Large sample, known variance.
  - T-test → Small sample, unknown variance (one-sample, two-sample, paired).
  - Chi-Square Test → Independence of categorical variables.
  - ANOVA → Compare means of 3+ groups.
  - Mann-Whitney U Test → Non-parametric test for medians.
9. Confidence Intervals
- Formula: CI = sample_mean ± Z × (σ / √n)
- Interpretation: "We are 95% confident the population parameter lies within this range."
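A minimal sketch of this formula with the standard library only; the sample values and the "known" σ are invented:

```python
import math
import statistics

sample = [52, 48, 55, 47, 51, 50, 49, 53]   # invented measurements
n = len(sample)
mean = statistics.mean(sample)
sigma = 4.0          # population std. dev., assumed known
z = 1.96             # z-value for 95% confidence

margin = z * sigma / math.sqrt(n)
ci = (mean - margin, mean + margin)
print(f"95% CI: ({ci[0]:.2f}, {ci[1]:.2f})")
```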
10. Correlation & Covariance
- Covariance: Measures joint variability (can be positive or negative).
- Correlation (r): Standardized measure (–1 ≤ r ≤ 1).
  - +1 = strong positive, –1 = strong negative, 0 = no relation.
11. Regression Basics
- Simple Linear Regression: y = β₀ + β₁x + ε
- Multiple Linear Regression: y = β₀ + β₁x₁ + β₂x₂ + ... + ε
- Key metrics: R², Adjusted R², p-values, F-statistic.
12. Outliers
- Detection methods:
  - |Z-score| > 3
  - IQR method → Outside [Q1 – 1.5×IQR, Q3 + 1.5×IQR]
- Handling: Remove, cap, or transform (e.g., log).
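Both detection methods in a few lines of stdlib Python (invented data). Note that with a sample this small, the z-score rule misses the outlier because the outlier itself inflates the standard deviation, a known caveat of that method:

```python
import statistics

data = [10, 12, 11, 13, 12, 11, 95]          # 95 is the planted outlier

# IQR method: flag points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1, _, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
iqr_outliers = [x for x in data if x < low or x > high]

# Z-score method: flag |z| > 3
mean, std = statistics.mean(data), statistics.pstdev(data)
z_outliers = [x for x in data if abs((x - mean) / std) > 3]

print(iqr_outliers, z_outliers)
```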
13. Bias & Variance
- Bias: Error from wrong assumptions (underfitting).
- Variance: Error from excessive sensitivity to the training data (overfitting).
- Tradeoff → The goal is balance (low bias, low variance).
14. Important Concepts
- Central Limit Theorem (CLT): Sample means from any distribution tend toward a normal distribution as n → ∞.
- Law of Large Numbers: As sample size increases, the sample mean → population mean.
- p-value: Probability of getting results at least as extreme as observed, assuming H₀ is true.
- Type I Error (α): Rejecting H₀ when it is true (false positive).
- Type II Error (β): Failing to reject H₀ when it is false (false negative).
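The p-value definition can be made concrete with a hand-rolled one-sample z-test (all numbers invented; uses only `math`):

```python
import math

mu0 = 100          # H0: population mean = 100
sigma = 15         # known population std. dev.
n = 36
sample_mean = 106

z = (sample_mean - mu0) / (sigma / math.sqrt(n))   # test statistic
p_two_sided = math.erfc(abs(z) / math.sqrt(2))     # 2 * P(Z > |z|)

reject_h0 = p_two_sided < 0.05                     # compare with alpha
print(round(z, 2), round(p_two_sided, 4), reject_h0)
```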
15. Interview Quick Tips
- Always clarify the type of data before choosing a test.
- Use the median & IQR when data is skewed.
- State assumptions (e.g., normality, equal variance) before applying tests.
- Be ready to interpret results (not just calculate them).
SQL Cheatsheet for Data Science Interviews
1. Basics
- Database: Collection of tables
- Table: Rows (records) + Columns (fields)
- Query: Command to interact with the DB
-- Select columns from table
SELECT column1, column2
FROM table_name
WHERE condition;
2. Selecting Data
SELECT * FROM employees; -- all columns
SELECT DISTINCT department FROM employees; -- unique values
SELECT salary AS monthly_salary FROM employees; -- alias
3. Filtering Rows
-- Comparison operators: =, !=, >, <, >=, <=
-- Logical operators: AND, OR, NOT
SELECT * FROM employees
WHERE age BETWEEN 25 AND 35
AND department IN ('HR', 'Finance')
AND name LIKE 'A%';
4. Sorting & Limiting
SELECT * FROM employees
ORDER BY salary DESC, age ASC
LIMIT 10;
5. Aggregate Functions
SELECT department, COUNT(*) AS num_employees, AVG(salary) AS avg_salary
FROM employees
GROUP BY department
HAVING AVG(salary) > 50000;
- Common functions: COUNT, SUM, AVG, MIN, MAX
6. Joins
-- Inner Join (matching rows)
SELECT e.name, d.department_name
FROM employees e
INNER JOIN departments d
ON e.dept_id = d.id;
-- Left Join (all left + matched right)
-- Right Join (all right + matched left)
-- Full Outer Join (all rows)
7. Subqueries
-- In WHERE
SELECT name, salary
FROM employees
WHERE salary > (SELECT AVG(salary) FROM employees);
-- In FROM (derived table)
SELECT dept_id, avg_salary
FROM (SELECT dept_id, AVG(salary) AS avg_salary
FROM employees
GROUP BY dept_id) t;
8. Window Functions
- ROW_NUMBER: unique row index
- RANK / DENSE_RANK: ranking with/without gaps
- NTILE(n): bucket into n groups
SELECT name, department, salary,
RANK() OVER (PARTITION BY department ORDER BY salary DESC) AS dept_rank
FROM employees;
- Running totals / moving averages:
SELECT name, salary,
SUM(salary) OVER (ORDER BY hire_date) AS running_salary
FROM employees;
9. Case Statements
SELECT name,
CASE
WHEN salary > 80000 THEN 'High'
WHEN salary BETWEEN 50000 AND 80000 THEN 'Medium'
ELSE 'Low'
END AS salary_band
FROM employees;
10. Common Table Expressions (CTE)
WITH dept_avg AS (
SELECT dept_id, AVG(salary) AS avg_salary
FROM employees
GROUP BY dept_id
)
SELECT e.name, e.salary, d.avg_salary
FROM employees e
JOIN dept_avg d
ON e.dept_id = d.dept_id;
11. Set Operations
- UNION → combines distinct results
- UNION ALL → includes duplicates
- INTERSECT → common rows
- EXCEPT / MINUS → rows in the first query but not the second
12. Data Cleaning
-- Remove duplicates
SELECT DISTINCT * FROM employees;
-- Handle NULLs
SELECT COALESCE(salary, 0) AS salary FROM employees;
-- String functions
SELECT TRIM(name), UPPER(name), LOWER(name), LENGTH(name), SUBSTRING(name,1,3)
FROM employees;
-- Date functions
SELECT CURRENT_DATE, EXTRACT(YEAR FROM hire_date)
FROM employees;
13. Keys & Constraints
- Primary Key → unique & not null
- Foreign Key → references another table
- Unique, Check, Not Null
14. Performance Tips
- Use indexes on frequently filtered/joined columns
- Avoid SELECT *; fetch only the columns you need
- Use EXPLAIN to check the query plan
- Use CTEs & window functions instead of deeply nested subqueries
15. Popular Interview Queries
- Second highest salary
SELECT MAX(salary)
FROM employees
WHERE salary < (SELECT MAX(salary) FROM employees);
- Top 3 salaries per department
SELECT * FROM (
SELECT name, dept_id, salary,
RANK() OVER (PARTITION BY dept_id ORDER BY salary DESC) AS rnk
FROM employees
) t
WHERE rnk <= 3;
- Find duplicates
SELECT name, COUNT(*)
FROM employees
GROUP BY name
HAVING COUNT(*) > 1;
- Percentage contribution of salary by department
SELECT department,
SUM(salary) AS dept_salary,
SUM(salary)*100.0 / SUM(SUM(salary)) OVER() AS pct_contribution
FROM employees
GROUP BY department;
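Queries like these can be checked quickly against SQLite's in-memory database from Python; the employee rows below are made up:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT, dept_id INT, salary INT)")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?)",
    [("Ann", 1, 90000), ("Bob", 1, 70000), ("Cid", 2, 70000),
     ("Dee", 2, 50000), ("Ann", 1, 90000)],   # note the duplicated row
)

# Second highest salary
second_highest = conn.execute(
    "SELECT MAX(salary) FROM employees "
    "WHERE salary < (SELECT MAX(salary) FROM employees)"
).fetchone()[0]

# Find duplicates
duplicates = conn.execute(
    "SELECT name, COUNT(*) FROM employees GROUP BY name HAVING COUNT(*) > 1"
).fetchall()

print(second_highest, duplicates)
```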
This covers almost all SQL topics that DS interviews usually test: selection, joins, aggregation, window functions, subqueries, CTEs, string/date handling, and performance optimization.
Python for Data Science Cheatsheet
1. Basics
# Variables & Datatypes
x = 10 # int
y = 3.14 # float
name = "Alice" # str
flag = True # bool
# Type & Casting
type(x) # <class 'int'>
int(y), str(x), float(x)
# String Operations
s = "Hello"
s.upper(), s.lower(), s.isupper(), s.islower()
len(s)
s.replace("H", "J")
s.split(" ")
2. Data Structures
List
lst = [1, 2, 3]
lst.append(4)
lst.extend([5,6])
lst.insert(0,0)
lst.pop()
lst.count(2)
lst.sort(); lst.reverse()  # in-place; both return None
Tuple
t = (1,2,3)
t.count(2)
t.index(3)
# Immutable
Dictionary
d = {'a':1, 'b':2}
d['a']
d.get('c', 0)
d.keys(), d.values(), d.items()
d.pop('b')
d.clear()
Set
s = {1,2,3}
s.add(4)
s.remove(2)
s.union({5,6})
s.intersection({3,4})
s.difference({1,5})
3. Loops & Comprehensions
# Loop
for i in range(5):
    print(i)
# List Comprehension
squared = [x**2 for x in range(10) if x%2==0]
# Dictionary Comprehension
squared_dict = {x: x**2 for x in range(5)}
# Set Comprehension
squared_set = {x**2 for x in range(5)}
4. Functions & Lambda
def add(a, b=0):
    return a + b
# Lambda
f = lambda x,y: x+y
map(lambda x: x*2, [1,2,3])
filter(lambda x: x%2==0, [1,2,3,4])
from functools import reduce
reduce(lambda x,y: x+y, [1,2,3])
5. NumPy Basics
import numpy as np
arr = np.array([1,2,3])
arr.ndim
arr.shape
np.zeros((2,3))
np.arange(1,10,2)
np.random.randint(1,100,size=(3,3))
# Operations
arr + 2
np.add(arr, 2)
np.multiply(arr, 2)
np.matmul(np.array([[1,2],[3,4]]), np.array([[2,0],[1,2]]))
# Indexing & Slicing
arr[0], arr[1:3]
np.sum(arr, axis=0)
np.concatenate([arr1, arr2])
np.reshape(arr, (3,1))
6. Pandas Basics
import pandas as pd
# Series
s = pd.Series([1,2,3], index=['a','b','c'])
s.mean(), s.median(), s.mode()
s.describe()
# DataFrame
df = pd.DataFrame({'A':[1,2], 'B':[3,4]})
df.head(5), df.tail(5)
df.info(), df.shape
df.columns, df.dtypes
df['A'], df[['A','B']]
df.isnull().sum()
df.dropna(), df.fillna(0)
df.duplicated(), df.drop_duplicates(inplace=True)
# Indexing
df.loc[0,'A'] # label-based
df.iloc[0,0] # position-based
# GroupBy
df.groupby('A')['B'].mean()
df.agg({'A':['min','max'], 'B':['mean','sum']})
# Merge/Concat
pd.concat([df1, df2], axis=1)
df1.merge(df2, on='key')
# Pivot
pd.pivot_table(df, index='A', columns='B', values='C')
# Map / Replace
df['A'] = df['A'].map(lambda x:x+2)
df['B'].replace(0, np.nan, inplace=True)
7. Data Visualization
import matplotlib.pyplot as plt
import seaborn as sns
# Matplotlib
plt.plot(df['A'])
plt.hist(df['B'], bins=10)
plt.scatter(df['A'], df['B'])
plt.boxplot(df['A'])
plt.pie([10,20,30], labels=['X','Y','Z'])
# Seaborn
sns.histplot(df['A'], kde=True)  # distplot is deprecated
sns.pairplot(df)
sns.heatmap(df.corr(), annot=True)
sns.countplot(df['A'])
sns.boxplot(x='A', y='B', data=df)
sns.violinplot(x='A', y='B', data=df)
8. Data Preprocessing
# Handling missing values
df['A'].fillna(df['A'].mean(), inplace=True)
# Encoding categorical variables
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
le = LabelEncoder()
df['cat'] = le.fit_transform(df['cat'])
ohe = OneHotEncoder()
encoded = ohe.fit_transform(df[['cat']]).toarray()
# Scaling
from sklearn.preprocessing import StandardScaler, MinMaxScaler
scaler = StandardScaler()
df[['num']] = scaler.fit_transform(df[['num']])
minmax = MinMaxScaler()
df[['num']] = minmax.fit_transform(df[['num']])
9. Train-Test Split
from sklearn.model_selection import train_test_split
X = df.drop('target', axis=1)
y = df['target']
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42
)
10. Quick Tips for DS Interviews
- Know Python data structures thoroughly.
- Practice list/dict/set comprehensions.
- Be able to clean, transform, and manipulate data with Pandas.
- Know NumPy for array operations and vectorization.
- Be ready to plot insights using Matplotlib/Seaborn.
- Practice train-test split, encoding, and scaling for ML pipelines.
Machine Learning Cheatsheet (Interview-Focused)
1. Types of ML
- Supervised Learning → Input + output labels (regression, classification).
- Unsupervised Learning → Only input; find patterns (clustering, dimensionality reduction).
- Reinforcement Learning → Agent learns via rewards.
2. Common Algorithms
Regression
- Linear Regression: ( y = \beta_0 + \beta_1x + \epsilon )
- Regularization:
  - Ridge (L2 penalty) → shrinks coefficients.
  - Lasso (L1 penalty) → can set coefficients to zero (feature selection).
Classification
- Logistic Regression: Uses sigmoid → ( P(y=1) = \frac{1}{1+e^{-z}} ).
- Decision Tree: Splits on features to minimize impurity (Gini/entropy).
- Random Forest: Ensemble of decision trees (bagging).
- Gradient Boosting (XGBoost, LightGBM, CatBoost): Trees built sequentially to reduce error.
- SVM: Finds the max-margin hyperplane; kernel trick for non-linear data.
- kNN: Classifies by majority vote of the k nearest neighbors.
Unsupervised
- k-Means: Minimizes within-cluster variance.
- Hierarchical Clustering: Agglomerative/divisive merging of clusters.
- DBSCAN: Density-based clustering.
- PCA: Projects data onto lower dimensions (maximizing variance).
3. Key Concepts
- Bias-Variance Tradeoff:
  - High Bias → Underfitting.
  - High Variance → Overfitting.
- Overfitting Prevention: Cross-validation, regularization, pruning, dropout (NNs).
- Feature Engineering: Encoding (One-Hot, Label), scaling (Standard, MinMax), feature selection.
- Evaluation Metrics:
  - Regression: MSE, RMSE, MAE, ( R^2 ).
  - Classification: Accuracy, Precision, Recall, F1, ROC-AUC, Log Loss.
  - Imbalanced Data → Use Precision/Recall, ROC-AUC, PR-AUC.
- Cross-Validation: k-Fold, Stratified k-Fold (for classification).
4. Probability & Statistics in ML
- Bayes' Theorem:
[
P(A|B) = \frac{P(B|A) P(A)}{P(B)}
]
- Expectation: Mean value of a distribution.
- Variance/Std. Dev.: Spread of the data.
- Normal Distribution: Symmetric; mean = median = mode.
- Central Limit Theorem: The sampling distribution of the mean tends to normal.
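Bayes' theorem is easy to sanity-check numerically; the disease-screening numbers below are the classic invented example (1% prevalence, 99% sensitivity, 5% false-positive rate):

```python
p_disease = 0.01                    # P(A): prevalence
p_pos_given_disease = 0.99          # P(B|A): sensitivity
p_pos_given_healthy = 0.05          # P(B|not A): false-positive rate

# Total probability: P(B) = P(B|A)P(A) + P(B|not A)P(not A)
p_pos = (p_pos_given_disease * p_disease
         + p_pos_given_healthy * (1 - p_disease))

# Bayes: P(A|B) = P(B|A)P(A) / P(B)
p_disease_given_pos = p_pos_given_disease * p_disease / p_pos
print(round(p_disease_given_pos, 3))   # only about 1 in 6 positives are real
```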
5. Neural Networks (Basics)
- Perceptron: ( y = f(\sum w_ix_i + b) ).
- Activation Functions:
  - Sigmoid → (0,1), vanishing gradients.
  - ReLU → efficient, sparse activation.
  - Tanh → (-1,1).
- Backpropagation: Gradient descent on weights using the chain rule.
- Optimizers: SGD, Adam, RMSProp.
6. Ensemble Learning
- Bagging: Parallel training on bootstrapped samples (e.g., Random Forest).
- Boosting: Sequential training to fix previous errors (e.g., AdaBoost, XGBoost).
- Stacking: Combine predictions of multiple models with a meta-learner.
7. Model Selection & Evaluation
- Hyperparameter Tuning: Grid Search, Random Search, Bayesian Optimization.
- Regularization: L1 (Lasso), L2 (Ridge).
- Early Stopping: Stop training when validation loss stops improving.
8. Common Interview Questions (Quick Recall)
- Difference between supervised, unsupervised, and reinforcement learning?
- Why does logistic regression use a sigmoid instead of a linear function?
- What is multicollinearity and how do you detect it?
- Explain the bias-variance tradeoff.
- How do you handle imbalanced datasets?
- Difference between bagging and boosting?
- What is ROC-AUC and when is it better than accuracy?
- Difference between PCA and LDA?
- Explain gradient descent and the effect of the learning rate.
- When to prefer Random Forest vs Gradient Boosting?
Deep Learning Cheatsheet (DS Interview-Focused)
1. What is Deep Learning?
- A subset of machine learning that uses artificial neural networks (ANNs) with multiple hidden layers to learn complex patterns from data.
- Learns hierarchical representations — low-level features in early layers, high-level features in deeper layers.
- Best suited for unstructured data: images, audio, video, text.
2. Core Components of a Neural Network
- Neuron (Perceptron): Basic computational unit
- Weights (w): Learnable parameters
- Bias (b): Shifts the activation
- Activation Function (f): Introduces non-linearity
| Function | Range | Use Case |
|---|---|---|
| Sigmoid ( \frac{1}{1+e^{-x}} ) | (0, 1) | Binary classification output |
| Tanh ( \frac{e^x - e^{-x}}{e^x + e^{-x}} ) | (-1, 1) | Hidden layers |
| ReLU ( \max(0, x) ) | [0, ∞) | Most common for hidden layers |
| Leaky ReLU | (-∞, ∞) | Prevents dying ReLU |
| Softmax | (0, 1) sum=1 | Multi-class output layer |
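The functions in the table can be written out in plain Python (a sketch for intuition, not a library implementation):

```python
import math

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

def relu(x):
    return max(0.0, x)

def leaky_relu(x, alpha=0.01):
    return x if x > 0 else alpha * x

def softmax(xs):
    m = max(xs)                              # subtract max for stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

print(sigmoid(0), relu(-2), softmax([1.0, 2.0, 3.0]))
```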
Training Neural Networks
- Forward Propagation → Compute output-layer predictions.
- Loss Functions:
  - Regression → MSE, MAE.
  - Classification → Cross-Entropy, Hinge Loss.
- Backpropagation → Compute gradients using the chain rule.
- Gradient Descent Variants:
  - Batch GD, Mini-batch GD, Stochastic GD.
- Optimizers: SGD, Adam, RMSProp, Adagrad.
Key Concepts
- Epoch: One pass through the entire training data.
- Batch Size: Number of samples per gradient update.
- Learning Rate: Step size in gradient descent. Too high → divergence; too low → slow convergence.
- Overfitting Prevention:
  - Regularization (L1, L2)
  - Dropout
  - Early Stopping
  - Data Augmentation
Neural Network Architecture
- Input Layer – Features
- Hidden Layers – Transformations/feature learning
- Output Layer – Predictions
Forward & Backpropagation
- Forward pass: Compute predictions
- Loss function: Measure error
- Backward pass (Backpropagation): Compute gradients using the chain rule
- Gradient Descent: Update weights to minimize loss
- Learning Rate (η): Step size during optimization
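The update rule and the learning-rate effect can be seen in a minimal 1-D example, minimizing f(w) = (w − 3)² (all values invented):

```python
def descend(lr, steps=50, w=0.0):
    """Gradient descent on f(w) = (w - 3)**2, minimum at w = 3."""
    for _ in range(steps):
        grad = 2 * (w - 3)      # df/dw
        w -= lr * grad          # update rule: w := w - eta * gradient
    return w

good = descend(lr=0.1)       # converges close to 3
slow = descend(lr=0.001)     # too low: barely moves in 50 steps
diverged = descend(lr=1.1)   # too high: overshoots and blows up
print(round(good, 4), round(slow, 4), round(diverged, 1))
```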
| Optimizer | Description |
|---|---|
| SGD | Basic gradient descent |
| Momentum | Adds previous gradients to speed up convergence |
| RMSProp | Adapts learning rate per parameter |
| Adam | Combines Momentum + RMSProp (most used) |
Common Loss Functions
| Task | Loss |
|---|---|
| Regression | MSE (Mean Squared Error) |
| Binary Classification | Binary Cross-Entropy |
| Multi-class Classification | Categorical Cross-Entropy |
| Ranking | Hinge Loss |
Regularization techniques
| Method | Purpose |
|---|---|
| L1/L2 Regularization | Penalize large weights |
| Dropout | Randomly deactivate neurons |
| Batch Normalization | Normalize activations to stabilize training |
| Early Stopping | Stop training before overfitting |
CNN (Convolutional Neural Networks)
- Use Case: Image data, spatial features.
- Layers:
  - Convolution → extracts features using filters/kernels.
  - Pooling → reduces dimensions (Max Pooling, Avg Pooling).
  - Fully Connected → classification head.
- Concepts:
  - Padding (same vs valid).
  - Stride.
  - Transfer Learning with pre-trained models (ResNet, VGG, Inception).
6. RNN (Recurrent Neural Networks)
- Use Case: Sequential data (time series, NLP).
- Problem: Vanishing/exploding gradients.
- Variants:
  - LSTM (Long Short-Term Memory) → handles long dependencies.
  - GRU (Gated Recurrent Unit) → simpler, fewer parameters.
- Attention Mechanism → focuses on relevant parts of the sequence.
7. Transformers (Modern NLP)
- Self-Attention: Captures relationships between tokens.
- Architecture: Encoder–Decoder.
- Popular Models: BERT, GPT, T5.
- Why Transformers > RNNs: Parallelization, better long-range dependency capture.
8. Autoencoders
- Use Case: Dimensionality reduction, anomaly detection.
- Structure: Encoder (compress) + Decoder (reconstruct).
- Variational Autoencoder (VAE) → generates new samples.
9. Generative Models
- GAN (Generative Adversarial Networks):
  - Generator → creates fake data.
  - Discriminator → distinguishes real vs fake.
- Applications: Image synthesis, text-to-image, deepfakes.
10. Regularization & Normalization
- Dropout → randomly deactivates neurons during training.
- Batch Normalization → normalizes layer outputs, stabilizes training.
- Weight Decay (L2 Regularization) → penalizes large weights.
11. Evaluation Metrics (DL-specific)
- Classification → Accuracy, Precision, Recall, F1, AUC.
- Object Detection → IoU (Intersection over Union), mAP (mean Average Precision).
- Segmentation → Dice Coefficient, Jaccard Index.
- Language Models → Perplexity, BLEU score.
12. Key Deep Learning Tricks
- Weight Initialization: Use He or Xavier for faster convergence
- Learning Rate Scheduling: Reduce LR over time
- Data Augmentation: Improves generalization (especially in vision tasks)
- Transfer Learning: Use pre-trained models for small datasets
Deployment Best Practices
- Save model: .h5 (Keras), .pt (PyTorch)
- Export for inference: ONNX, TF Serving
- Monitor drift: Check model performance post-deployment
- Use GPUs/TPUs for large-scale training
13. Common Interview Questions (Quick Review)
- What is the vanishing gradient problem and how do you solve it?
- Difference between CNN and RNN?
- Why use ReLU over sigmoid?
- What is the role of batch normalization?
- Explain LSTM internals (gates and cell state).
- How does attention work in transformers?
- What is transfer learning and why is it useful?
- How do you prevent overfitting in deep networks?
- Explain dropout and how it works.
- What is the difference between batch size and epoch?
Tips for Interviews
- Focus on the intuitions behind architectures, not just formulas.
- Be ready to draw neural network diagrams and explain data flow.
- Know the pros/cons and real-world applications of CNNs, RNNs, and Transformers.
- Be comfortable with libraries like TensorFlow, Keras, and PyTorch.
NLP Cheatsheet for Data Science
1. Basics
- NLP → Processing & analyzing text data using algorithms & models.
- Applications: Sentiment analysis, chatbots, translation, summarization, topic modeling.
Common terms:
- Corpus → Collection of text
- Token → Word or sentence unit
- Vocabulary → Set of unique tokens
- Stopwords → Common words with little semantic value (e.g., "is", "the")
- Stemming → Reduce words to their root (running → run)
- Lemmatization → Convert to dictionary form (usually better than stemming)
2. Text Preprocessing
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
text = "NLTK is a leading platform for building Python programs!"
# Lowercase
text = text.lower()
# Remove punctuation
text = re.sub(r'[^a-zA-Z]', ' ', text)
# Tokenization
from nltk.tokenize import word_tokenize, sent_tokenize
words = word_tokenize(text)
sentences = sent_tokenize(text)
# Remove stopwords
stop_words = set(stopwords.words('english'))
words = [w for w in words if w not in stop_words]
# Stemming
ps = PorterStemmer()
words_stemmed = [ps.stem(w) for w in words]
# Lemmatization
lemmatizer = WordNetLemmatizer()
words_lemma = [lemmatizer.lemmatize(w) for w in words]
3. Text Representation
3.1 Bag-of-Words (BoW)
- Counts the frequency of each word.
- Example using CountVectorizer:
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(max_features=500)
X = cv.fit_transform(corpus).toarray()
3.2 TF-IDF
- Term Frequency–Inverse Document Frequency → weights words by importance.
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer(max_features=500)
X = tfidf.fit_transform(corpus).toarray()
3.3 Word Embeddings
- Capture the semantic meaning of words.
- Word2Vec, GloVe, FastText
- Libraries: gensim, spacy, tensorflow, pytorch
4. Text Similarity & NLP Tasks
- Cosine Similarity → Measures similarity between vectors
from sklearn.metrics.pairwise import cosine_similarity
similarity = cosine_similarity(vec1, vec2)
- Text Classification → spam detection, sentiment analysis
- Named Entity Recognition (NER) → extract names, locations
- POS Tagging → Part-of-speech tagging (noun, verb, etc.)
- Topic Modeling → LDA (Latent Dirichlet Allocation)
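As a stdlib-only alternative to the sklearn call above, cosine similarity can be computed by hand (the bag-of-words vectors below are toy values):

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy bag-of-words vectors for two short "documents"
doc1 = [1, 1, 0, 1]
doc2 = [1, 1, 1, 0]
print(round(cosine(doc1, doc2), 3))
```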
5. NLP Libraries
- NLTK → Preprocessing, tokenization, stopwords, stemming/lemmatization
- spaCy → Fast tokenization, NER, POS tagging
- gensim → Topic modeling, Word2Vec
- TextBlob → Sentiment analysis
- scikit-learn → TF-IDF, vectorization, ML models
- transformers (Hugging Face) → BERT, GPT, other transformer models
6. Sequence Models for NLP
| Model | Use Case | Key Points |
|---|---|---|
| RNN | Text, sequential data | Maintains hidden state; suffers vanishing gradient |
| LSTM | Long sequences | Solves RNN vanishing gradient; uses gates |
| GRU | Lightweight LSTM | Fewer parameters, faster |
| Transformer | Modern NLP | Uses attention mechanism; parallelizable; BERT, GPT |
7. Pretrained Models
- Word Embeddings: GloVe, Word2Vec, FastText
- Transformers:
  - BERT → Contextual embeddings, masked language modeling
  - GPT → Text generation, autoregressive
  - RoBERTa, DistilBERT → Optimized BERT variants
- Libraries: transformers (Hugging Face), torch, tensorflow
8. NLP Metrics
| Task | Metric |
|---|---|
| Classification | Accuracy, Precision, Recall, F1-score, ROC-AUC |
| Sequence Generation | BLEU, ROUGE, METEOR |
| Language Modeling | Perplexity |
9. Feature Engineering Tips
- Remove stopwords, punctuation, numbers
- Lowercase & normalize text
- Consider n-grams for context (bigrams, trigrams)
- Use TF-IDF or embeddings instead of raw counts
- Handle imbalanced classes (SMOTE, weighted loss)
10. Quick Interview Qs
- Difference between stemming and lemmatization?
- How is TF-IDF better than Bag-of-Words?
- What is a word embedding? Why is it useful?
- Explain the differences between RNN, LSTM, and GRU.
- How does the attention mechanism work in transformers?
- What are common text preprocessing steps?
- Difference between context-free embeddings (Word2Vec) and contextual embeddings (BERT)?
- How do you handle OOV (out-of-vocabulary) words?
- How do you measure similarity between two sentences?
- Popular pretrained NLP models for sentiment analysis?
Generative AI Cheatsheet (Interview-Focused)
1. What is Generative AI?
- Definition: AI that creates new content similar to its training data.
- Content Types: Text, images, audio, video, code, 3D models.
- Applications:
  - Text: Chatbots, summarization, code generation
  - Images: AI art, deepfakes
  - Audio: Music generation, speech synthesis
  - Video: Animation, video synthesis
2. Key Concepts
- Discriminative vs Generative Models:

| Type | Task | Example |
|---|---|---|
| Discriminative | Predict label from input | Logistic Regression, SVM |
| Generative | Learn data distribution & generate data | GAN, VAE, Diffusion Models |

- Latent Space: Compressed representation capturing data features.
- Sampling: Generating new data points from the learned distribution.
3. Popular Generative Models
3.1 GANs (Generative Adversarial Networks)
- Components:
  - Generator → Creates fake data
  - Discriminator → Distinguishes real vs fake
- Objective: Minimax game
[
\min_G \max_D V(D,G) = E_{x\sim P_{data}}[\log D(x)] + E_{z\sim P_z}[\log(1-D(G(z)))]
]
- Applications: Image synthesis, super-resolution, style transfer
3.2 VAEs (Variational Autoencoders)
- Encoder → Maps input to a latent distribution
- Decoder → Reconstructs data from a latent vector
- Probabilistic latent space → Sample new data
- Applications: Image generation, anomaly detection
3.3 Diffusion Models
- Generate data by gradually denoising from Gaussian noise
- Examples: DALL-E 2, Stable Diffusion
3.4 Transformer-Based Generative Models
- GPT (Generative Pretrained Transformer) → Autoregressive text generation
- BERT → Masked language modeling (not autoregressive)
- T5, LLaMA, Falcon → Large language models for text generation
4. Training Techniques
- Adversarial Training: GANs pit a generator against a discriminator
- Reconstruction Loss: VAEs minimize reconstruction loss + KL divergence
- Pretraining + Fine-tuning: Transformers are pretrained on large corpora, then fine-tuned for tasks
5. Evaluation Metrics
| Task | Metric |
|---|---|
| Images | Inception Score (IS), FID (Fréchet Inception Distance) |
| Text | Perplexity, BLEU, ROUGE, METEOR |
| Audio | Signal-to-Noise Ratio, MOS (Mean Opinion Score) |
| General | Human evaluation for realism & quality |
6. Sampling from a VAE
import torch
z = torch.randn(batch_size, latent_dim)  # sample from the latent prior
generated = decoder(z)                   # assumes a trained decoder module
7. Applications in Industry
- Text: ChatGPT, Jasper AI, code generation (Copilot)
- Images: DALL-E, MidJourney, Stable Diffusion
- Audio: Jukebox (OpenAI), speech synthesis
- Video: RunwayML, deepfake creation
- Healthcare: Drug molecule generation
8. Common Interview Questions
- Difference between GAN, VAE, and Diffusion models?
- Explain the generator and discriminator roles in GANs.
- What is latent space? Why is it important?
- How do diffusion models generate images?
- Difference between autoregressive and masked language models?
- How do you evaluate generative models?
- What are the challenges in training GANs?
- How is Generative AI different from traditional ML models?
- Name real-world applications of generative AI.
- How do you prevent mode collapse in GANs?
⚡ Advanced Generative AI Cheatsheet
1. Core Idea
Generative AI = Learning a data distribution (P_{data}(x)) and generating new samples that look like real data.
- Input: Random noise or seed data
- Output: Synthetic images, text, audio, or 3D data
- Key property: Creativity + realism
2. Generative Model Categories
| Type | Examples | Key Idea |
|---|---|---|
| Explicit Density Models | VAE, PixelRNN, Normalizing Flows | Learn (P(x)) explicitly |
| Implicit Density Models | GANs | Learn via adversarial game, no explicit probability |
| Energy-Based Models (EBMs) | Boltzmann Machines | Model data via energy function |
| Autoregressive Models | GPT, PixelCNN | Generate sequentially (factorized probability) |
| Diffusion Models | Denoising Diffusion Probabilistic Models | Gradual noise removal to generate data |
3. GANs (Generative Adversarial Networks)
- Objective: Generator (G) creates data → Discriminator (D) distinguishes real vs fake
- Loss Function:
[
\min_G \max_D V(D,G) = E_{x \sim P_{data}}[\log D(x)] + E_{z \sim P_z}[\log(1 - D(G(z)))]
]
- Variants:
  - DCGAN → Deep Convolutional GAN (images)
  - WGAN → Wasserstein GAN (stable training)
  - CycleGAN → Image-to-image translation without paired data
  - StyleGAN → High-quality, controllable image synthesis
- Common Problems & Solutions:
  - Mode Collapse → Generator produces limited variety
  - Vanishing Gradients → Use WGAN, label smoothing
  - Training instability → Careful learning-rate tuning, batch normalization
4. Variational Autoencoders (VAEs)
- Probabilistic model: Encode input to a latent distribution (q(z|x)) → Sample → Decode
- Loss Function: Reconstruction + KL divergence
[
L = \text{Reconstruction Loss} + D_{KL}(q(z|x) || p(z))
]
- Applications: Image generation, anomaly detection, data compression
5. Diffusion Models
- Idea: Start with noise → iteratively denoise to generate data
- Steps:
  - Forward process: add Gaussian noise to data
  - Reverse process: learn a denoising function
- Popular models: DALL-E 2, Imagen, Stable Diffusion
- Pros: High-quality images, stable training
- Cons: Slow sampling, compute-intensive
6. Transformer-Based Generative Models
- GPT (Autoregressive): Predict the next token → generate text
- BERT (Masked LM): Predict masked tokens → contextual embeddings
- T5 / BART: Seq2Seq → summarization, translation
- LLMs: ChatGPT, LLaMA, Falcon, GPT-4 → text generation, coding, reasoning
Key Components:
- Multi-head Self-Attention
- Positional Encoding
- Feedforward Layers
- Layer Normalization
7. Evaluation Metrics
- Text: Perplexity, BLEU, ROUGE, METEOR
- Images: FID (Fréchet Inception Distance), IS (Inception Score), human evaluation
- Audio: MOS (Mean Opinion Score), SNR (Signal-to-Noise Ratio)
- General: Diversity, novelty, coherence
8. Feature Techniques in Generative AI
- Latent Space Manipulation: Interpolation, style transfer, attribute editing
- Conditional Generation: Generate based on labels or prompts
  - Example: Conditional GAN (cGAN) → generate images conditioned on class
- Prompt Engineering: Critical for text & multimodal generation
9. Practical Implementation Tips
- Data Augmentation: Increases the diversity of training samples
- Transfer Learning: Fine-tune pre-trained models (GPT, Stable Diffusion)
- Compute Optimization: Use mixed precision and distributed training for large models
- Safety & Bias: Check outputs for toxicity, hallucinations, or bias
10. Popular Generative AI Applications
- Text: Chatbots, code generation, story writing
- Images: AI art, avatars, deepfakes, medical image synthesis
- Audio: Music, voice cloning, speech synthesis
- Video: Animation, deepfake videos, scene generation
- Healthcare: Drug discovery, molecule generation
- Marketing: Personalized content creation, ad generation
11. Interview-Focused Questions
- Difference between GAN, VAE, and Diffusion models?
- How does self-attention work in transformers?
- Explain mode collapse and its solutions in GANs.
- How do you evaluate generative models?
- What is latent space and how is it used?
- Explain conditional vs unconditional generation.
- Challenges in training diffusion models.
- Applications of Generative AI in industry.
- How do you fine-tune a pre-trained generative model?
- Ethical considerations & bias in Generative AI.