Data Science Cheat Sheets (Final stage)



📊 Statistics for Data Science – Cheat Sheet


1. Types of Statistics

  • Descriptive Statistics → Summarizing data (mean, median, mode, variance, std. dev., etc.)

  • Inferential Statistics → Drawing conclusions about a population using sample data (hypothesis testing, confidence intervals).


2. Types of Data

  • Qualitative (Categorical)

    • Nominal → Categories without order (e.g., Gender, Colors).

    • Ordinal → Categories with order (e.g., Ratings: Poor, Average, Good).

  • Quantitative (Numerical)

    • Discrete → Countable (e.g., No. of students).

    • Continuous → Measurable (e.g., Height, Weight).


3. Measures of Central Tendency

  • Mean = Average

  • Median = Middle value (robust to outliers)

  • Mode = Most frequent value


4. Measures of Dispersion

  • Range = Max – Min

  • Variance = Average squared deviation from mean

  • Standard Deviation (σ) = √Variance

  • IQR (Interquartile Range) = Q3 – Q1

  • Coefficient of Variation (CV) = (Std. Dev. / Mean) × 100
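These measures are quick to verify with NumPy (a minimal sketch on made-up data):

```python
import numpy as np

data = np.array([2, 4, 4, 4, 5, 5, 7, 9])

data_range = data.max() - data.min()       # Range = Max - Min  -> 7
variance = np.var(data)                    # population variance -> 4.0
std_dev = np.std(data)                     # sigma = sqrt(variance) -> 2.0
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1                              # Interquartile Range -> 1.5
cv = std_dev / data.mean() * 100           # Coefficient of Variation -> 40.0
```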


5. Probability Basics

  • Probability = (Favorable outcomes / Total outcomes)

  • Addition Rule: P(A ∪ B) = P(A) + P(B) – P(A ∩ B)

  • Multiplication Rule: P(A ∩ B) = P(A) × P(B|A)


6. Probability Distributions

  • Binomial Distribution → Discrete, fixed trials (e.g., coin toss).

  • Poisson Distribution → Discrete, number of events in time/space (e.g., calls per hour).

  • Normal Distribution → Continuous, bell-shaped, mean=median=mode.

  • Uniform Distribution → Equal probability for all outcomes.

  • Exponential Distribution → Continuous, time between events.
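NumPy's random generator can draw samples from each of these distributions (illustrative parameters, not from the text):

```python
import numpy as np

rng = np.random.default_rng(42)

binomial = rng.binomial(n=10, p=0.5, size=1000)    # successes in 10 coin tosses
poisson = rng.poisson(lam=3, size=1000)            # events per time interval
normal = rng.normal(loc=0, scale=1, size=1000)     # bell-shaped curve
uniform = rng.uniform(low=0, high=1, size=1000)    # equal probability on [0, 1)
exponential = rng.exponential(scale=2, size=1000)  # time between events
```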


7. Sampling Methods

  • Random Sampling

  • Stratified Sampling (split into groups, sample each)

  • Cluster Sampling (random clusters)

  • Systematic Sampling (every k-th item)


8. Hypothesis Testing

  1. Formulate hypotheses:

    • Null Hypothesis (H₀): No effect/difference.

    • Alternative Hypothesis (H₁): Significant effect/difference.

  2. Steps:

    • Select significance level (α, usually 0.05).

    • Choose a test (t-test, chi-square, ANOVA).

    • Calculate the test statistic & p-value.

    • Compare the p-value with α.

  3. Common Tests:

    • Z-test → Large sample, known variance.

    • T-test → Small sample, unknown variance.

      • One-sample, two-sample, paired.

    • Chi-Square Test → Independence of categorical variables.

    • ANOVA → Compare means of 3+ groups.

    • Mann-Whitney U Test → Non-parametric test comparing two independent groups (rank-based alternative to the two-sample t-test).
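As a sanity check on the mechanics, a two-sample t-statistic can be computed by hand with NumPy (toy samples; in practice `scipy.stats.ttest_ind` handles this, including the p-value):

```python
import numpy as np

# Welch's two-sample t-statistic: (mean_a - mean_b) / standard error
a = np.array([1, 2, 3, 4, 5], dtype=float)
b = np.array([2, 3, 4, 5, 6], dtype=float)

mean_diff = a.mean() - b.mean()
se = np.sqrt(a.var(ddof=1) / len(a) + b.var(ddof=1) / len(b))
t_stat = mean_diff / se   # -1.0 for these toy samples
```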


9. Confidence Intervals

  • Formula:

    CI = sample_mean ± Z * (σ / √n)
    
  • Interpretation: "We are 95% confident the population parameter lies within this range."
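A sketch of the interval computation on made-up data, using the sample standard deviation as the estimate of σ:

```python
import numpy as np

# 95% z-interval for the mean (assumes large n or known sigma)
sample = np.array([12, 15, 14, 10, 13, 14, 16, 12, 13, 11], dtype=float)
n = len(sample)
mean = sample.mean()
sigma = sample.std(ddof=1)   # sample std as estimate of sigma
z = 1.96                     # two-sided critical value for 95%

margin = z * sigma / np.sqrt(n)
ci = (mean - margin, mean + margin)
```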


10. Correlation & Covariance

  • Covariance: Measures joint variability (can be positive/negative).

  • Correlation (r): Standardized measure (–1 ≤ r ≤ 1).

    • +1 = perfect positive, –1 = perfect negative, 0 = no linear relation.
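Both quantities in NumPy (toy data constructed to be perfectly linear):

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5], dtype=float)
y = 2 * x + 1                      # perfect linear relationship

cov_xy = np.cov(x, y)[0, 1]        # sample covariance (ddof=1)
r = np.corrcoef(x, y)[0, 1]        # Pearson correlation, here ~1.0
```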


11. Regression Basics

  • Simple Linear Regression: y = β₀ + β₁x + ε

  • Multiple Linear Regression: y = β₀ + β₁x₁ + β₂x₂ + ... + ε

  • Key metrics: R², Adjusted R², p-values, F-statistic.


12. Outliers

  • Detection methods:

    • |Z-score| > 3

    • IQR method → Outside [Q1 – 1.5×IQR, Q3 + 1.5×IQR]

  • Handling: Remove, cap, or transform (log).
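The IQR rule as code (made-up data with one planted outlier):

```python
import numpy as np

data = np.array([11, 12, 12, 13, 12, 11, 14, 13, 100], dtype=float)

q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr   # fences
outliers = data[(data < lower) | (data > upper)]  # -> [100.]
```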


13. Bias & Variance

  • Bias: Error from wrong assumptions (underfitting).

  • Variance: Error from too much sensitivity to training data (overfitting).

  • Tradeoff → Goal is balance (low bias, low variance).


14. Important Concepts

  • Central Limit Theorem (CLT): Sample means from any distribution tend to follow a normal distribution as n → ∞.

  • Law of Large Numbers: As sample size increases, sample mean → population mean.

  • p-value: Probability of getting results at least as extreme as observed, assuming H₀ is true.

  • Type I Error (α): Rejecting H₀ when it is true (false positive).

  • Type II Error (β): Failing to reject H₀ when it is false (false negative).


15. Interview Quick Tips

  • Always clarify type of data before choosing a test.

  • Use median & IQR when data is skewed.

  • State assumptions (e.g., normality, equal variance) before applying tests.

  • Be ready to interpret results (not just calculate).




🗄️ SQL Cheatsheet for Data Science Interviews


1. Basics

  • Database: Collection of tables

  • Table: Rows (records) + Columns (fields)

  • Query: Command to interact with DB

-- Select columns from table
SELECT column1, column2 
FROM table_name 
WHERE condition;

2. Selecting Data

SELECT * FROM employees;       -- all columns
SELECT DISTINCT department FROM employees;   -- unique values
SELECT salary AS monthly_salary FROM employees; -- alias

3. Filtering Rows

-- Comparison operators: =, !=, >, <, >=, <=
-- Logical operators: AND, OR, NOT
SELECT * FROM employees
WHERE age BETWEEN 25 AND 35
AND department IN ('HR', 'Finance')
AND name LIKE 'A%';

4. Sorting & Limiting

SELECT * FROM employees
ORDER BY salary DESC, age ASC
LIMIT 10;

5. Aggregate Functions

SELECT department, COUNT(*) AS num_employees, AVG(salary) AS avg_salary
FROM employees
GROUP BY department
HAVING AVG(salary) > 50000;
  • Common functions: COUNT, SUM, AVG, MIN, MAX


6. Joins

-- Inner Join (matching rows)
SELECT e.name, d.department_name
FROM employees e
INNER JOIN departments d
ON e.dept_id = d.id;

-- Left Join (all left + matched right)
-- Right Join (all right + matched left)
-- Full Outer Join (all rows)

7. Subqueries

-- In WHERE
SELECT name, salary
FROM employees
WHERE salary > (SELECT AVG(salary) FROM employees);

-- In FROM (derived table)
SELECT dept_id, avg_salary
FROM (SELECT dept_id, AVG(salary) AS avg_salary
      FROM employees
      GROUP BY dept_id) t;

8. Window Functions

  • ROW_NUMBER: unique row index

  • RANK / DENSE_RANK: ranking with/without gaps

  • NTILE(n): bucket into n groups

SELECT name, department, salary,
       RANK() OVER (PARTITION BY department ORDER BY salary DESC) AS dept_rank
FROM employees;
  • Running totals / moving averages:

SELECT name, salary,
       SUM(salary) OVER (ORDER BY hire_date) AS running_salary
FROM employees;

9. Case Statements

SELECT name, 
       CASE 
         WHEN salary > 80000 THEN 'High'
         WHEN salary BETWEEN 50000 AND 80000 THEN 'Medium'
         ELSE 'Low'
       END AS salary_band
FROM employees;

10. Common Table Expressions (CTE)

WITH dept_avg AS (
    SELECT dept_id, AVG(salary) AS avg_salary
    FROM employees
    GROUP BY dept_id
)
SELECT e.name, e.salary, d.avg_salary
FROM employees e
JOIN dept_avg d
ON e.dept_id = d.dept_id;

11. Set Operations

  • UNION → combines distinct results

  • UNION ALL → includes duplicates

  • INTERSECT → common rows

  • EXCEPT / MINUS → rows in first query not in second


12. Data Cleaning

-- Remove duplicates
SELECT DISTINCT * FROM employees;

-- Handle NULLs
SELECT COALESCE(salary, 0) AS salary FROM employees;

-- String functions
SELECT TRIM(name), UPPER(name), LOWER(name), LENGTH(name), SUBSTRING(name,1,3)
FROM employees;

-- Date functions
SELECT CURRENT_DATE, EXTRACT(YEAR FROM hire_date)
FROM employees;

13. Keys & Constraints

  • Primary Key → unique & not null

  • Foreign Key → references another table

  • Unique, Check, Not Null


14. Performance Tips

  • Use indexes on frequently filtered/joined columns

  • Avoid SELECT *, fetch only needed columns

  • Use EXPLAIN to check query plan

  • Use CTEs & window functions instead of deeply nested subqueries


15. Popular Interview Queries

  1. Second highest salary

SELECT MAX(salary) 
FROM employees
WHERE salary < (SELECT MAX(salary) FROM employees);
  2. Top 3 salaries per department

SELECT * FROM (
    SELECT name, dept_id, salary,
           RANK() OVER (PARTITION BY dept_id ORDER BY salary DESC) AS rnk
    FROM employees
) t
WHERE rnk <= 3;
  3. Find duplicates

SELECT name, COUNT(*)
FROM employees
GROUP BY name
HAVING COUNT(*) > 1;
  4. Percentage contribution of salary by department

SELECT department, 
       SUM(salary) AS dept_salary,
       SUM(salary)*100.0 / SUM(SUM(salary)) OVER() AS pct_contribution
FROM employees
GROUP BY department;

This covers almost all SQL topics that DS interviews usually test: selection, joins, aggregation, window functions, subqueries, CTEs, string/date handling, and performance optimization.



๐Ÿ Python for Data Science Cheatsheet


1. Basics

# Variables & Datatypes
x = 10          # int
y = 3.14        # float
name = "Alice"  # str
flag = True     # bool

# Type & Casting
type(x)         # <class 'int'>
int(y), str(x), float(x)

# String Operations
s = "Hello"
s.upper(), s.lower(), s.isupper(), s.islower()
len(s)
s.replace("H", "J")
s.split(" ")

2. Data Structures

List

lst = [1, 2, 3]
lst.append(4)
lst.extend([5,6])
lst.insert(0,0)
lst.pop()
lst.count(2)
lst.sort(), lst.reverse()

Tuple

t = (1,2,3)
t.count(2)
t.index(3)
# Immutable

Dictionary

d = {'a':1, 'b':2}
d['a']
d.get('c', 0)
d.keys(), d.values(), d.items()
d.pop('b')
d.clear()

Set

s = {1,2,3}
s.add(4)
s.remove(2)
s.union({5,6})
s.intersection({3,4})
s.difference({1,5})

3. Loops & Comprehensions

# Loop
for i in range(5):
    print(i)

# List Comprehension
squared = [x**2 for x in range(10) if x%2==0]

# Dictionary Comprehension
squared_dict = {x: x**2 for x in range(5)}

# Set Comprehension
squared_set = {x**2 for x in range(5)}

4. Functions & Lambda

def add(a, b=0):
    return a+b

# Lambda
f = lambda x, y: x + y
list(map(lambda x: x*2, [1,2,3]))          # [2, 4, 6]
list(filter(lambda x: x%2==0, [1,2,3,4]))  # [2, 4]
from functools import reduce
reduce(lambda x, y: x + y, [1,2,3])        # 6

5. NumPy Basics

import numpy as np

arr = np.array([1,2,3])
arr.ndim
arr.shape
np.zeros((2,3))
np.arange(1,10,2)
np.random.randint(1,100,size=(3,3))

# Operations
arr + 2
np.add(arr, 2)
np.multiply(arr, 2)
np.matmul(np.array([[1,2],[3,4]]), np.array([[2,0],[1,2]]))

# Indexing & Slicing
arr[0], arr[1:3]
np.sum(arr, axis=0)
np.concatenate([arr, arr])   # join arrays (any arrays with matching shapes)
np.reshape(arr, (3,1))

6. Pandas Basics

import pandas as pd

# Series
s = pd.Series([1,2,3], index=['a','b','c'])
s.mean(), s.median(), s.mode()
s.describe()

# DataFrame
df = pd.DataFrame({'A':[1,2], 'B':[3,4]})
df.head(5), df.tail(5)
df.info(), df.shape
df.columns, df.dtypes
df['A'], df[['A','B']]
df.isnull().sum()
df.dropna(), df.fillna(0)
df.duplicated(), df.drop_duplicates(inplace=True)

# Indexing
df.loc[0,'A']      # label-based
df.iloc[0,0]       # position-based

# GroupBy
df.groupby('A')['B'].mean()
df.agg({'A':['min','max'], 'B':['mean','sum']})

# Merge/Concat
pd.concat([df1, df2], axis=1)
df1.merge(df2, on='key')

# Pivot
pd.pivot_table(df, index='A', columns='B', values='C')

# Map / Replace
df['A'] = df['A'].map(lambda x: x + 2)
df['B'] = df['B'].replace(0, np.nan)

7. Data Visualization

import matplotlib.pyplot as plt
import seaborn as sns

# Matplotlib
plt.plot(df['A'])
plt.hist(df['B'], bins=10)
plt.scatter(df['A'], df['B'])
plt.boxplot(df['A'])
plt.pie([10,20,30], labels=['X','Y','Z'])

# Seaborn
sns.histplot(df['A'], kde=True)   # distplot is deprecated
sns.pairplot(df)
sns.heatmap(df.corr(), annot=True)
sns.countplot(x='A', data=df)
sns.boxplot(x='A', y='B', data=df)
sns.violinplot(x='A', y='B', data=df)

8. Data Preprocessing

# Handling missing values
df['A'] = df['A'].fillna(df['A'].mean())

# Encoding categorical variables
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
le = LabelEncoder()
df['cat'] = le.fit_transform(df['cat'])
ohe = OneHotEncoder()
encoded = ohe.fit_transform(df[['cat']]).toarray()

# Scaling
from sklearn.preprocessing import StandardScaler, MinMaxScaler
scaler = StandardScaler()
df[['num']] = scaler.fit_transform(df[['num']])
minmax = MinMaxScaler()
df[['num']] = minmax.fit_transform(df[['num']])

9. Train-Test Split

from sklearn.model_selection import train_test_split

X = df.drop('target', axis=1)
y = df['target']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

10. Quick Tips for DS Interviews

  • Know Python data structures thoroughly.

  • Practice list/dict/set comprehensions.

  • Be able to clean, transform, and manipulate data with Pandas.

  • Know NumPy for array operations and vectorization.

  • Be ready to plot insights using Matplotlib/Seaborn.

  • Practice train-test split, encoding, and scaling for ML pipelines.



🧾 Machine Learning Cheatsheet (Interview-Focused)


🔹 1. Types of ML

  • Supervised Learning → Input + Output labels (Regression, Classification).

  • Unsupervised Learning → Only input, find patterns (Clustering, Dimensionality Reduction).

  • Reinforcement Learning → Agent learns via rewards.


🔹 2. Common Algorithms

Regression

  • Linear Regression: y = β₀ + β₁x + ε

  • Regularization:

    • Ridge (L2 penalty) → shrinks coefficients.

    • Lasso (L1 penalty) → can set coefficients to zero (feature selection).
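Ridge has a closed form that makes the shrinkage visible (a pure-NumPy sketch on toy data; in practice `sklearn.linear_model.Ridge`/`Lasso` are the standard tools, and note this sketch also penalizes the intercept, which real implementations avoid):

```python
import numpy as np

# Toy data: y = 1 + 1*x exactly, with an intercept column in X
X = np.array([[1, 0], [1, 1], [1, 2]], dtype=float)
y = np.array([1, 2, 3], dtype=float)

def ridge(X, y, lam):
    # beta = (X'X + lam*I)^-1 X'y
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

beta_ols = ridge(X, y, lam=0.0)   # lam=0 reduces to ordinary least squares -> [1, 1]
beta_l2 = ridge(X, y, lam=1.0)    # coefficients shrink toward zero
```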

Classification

  • Logistic Regression: Uses the sigmoid → P(y=1) = 1 / (1 + e^(−z)).

  • Decision Tree: Splits based on features to minimize impurity (Gini/Entropy).

  • Random Forest: Ensemble of decision trees (bagging).

  • Gradient Boosting (XGBoost, LightGBM, CatBoost): Trees built sequentially to reduce error.

  • SVM: Finds max-margin hyperplane; kernel trick for non-linear data.

  • kNN: Classify based on majority of k-nearest neighbors.

Unsupervised

  • k-Means: Minimizes within-cluster variance.

  • Hierarchical Clustering: Agglomerative/divisive merging of clusters.

  • DBSCAN: Density-based clustering.

  • PCA: Projects data to lower dimensions (maximize variance).


🔹 3. Key Concepts

  • Bias-Variance Tradeoff:

    • High Bias → Underfitting.

    • High Variance → Overfitting.

  • Overfitting Prevention: Cross-validation, regularization, pruning, dropout (NN).

  • Feature Engineering: Encoding (One-Hot, Label), Scaling (Standard, MinMax), Feature selection.

  • Evaluation Metrics:

    • Regression: MSE, RMSE, MAE, R².

    • Classification: Accuracy, Precision, Recall, F1, ROC-AUC, Log Loss.

    • Imbalanced Data → Use Precision/Recall, ROC-AUC, PR-AUC.

  • Cross-Validation: k-Fold, Stratified k-Fold (for classification).
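The index bookkeeping behind k-fold CV can be sketched in a few lines (sklearn's `KFold` does the same, plus shuffling options):

```python
import numpy as np

def kfold_indices(n_samples, k, seed=0):
    # Shuffle indices once, split into k folds; each fold is the test set once
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_samples)
    folds = np.array_split(idx, k)
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, test_idx

splits = list(kfold_indices(10, k=5))
```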


🔹 4. Probability & Statistics in ML

  • Bayes' Theorem:
    P(A|B) = P(B|A) · P(A) / P(B)

  • Expectation: Mean value of distribution.

  • Variance/Std. Dev.: Spread of data.

  • Normal Distribution: Symmetric, mean=median=mode.

  • Central Limit Theorem: Sampling distribution tends to normal.


🔹 5. Neural Networks (Basics)

  • Perceptron: y = f(Σ wᵢxᵢ + b).

  • Activation Functions:

    • Sigmoid → (0,1), vanishing gradients.

    • ReLU → efficient, sparse activation.

    • Tanh → (-1,1).

  • Backpropagation: Gradient descent on weights using chain rule.

  • Optimizers: SGD, Adam, RMSProp.


🔹 6. Ensemble Learning

  • Bagging: Parallel training on bootstrapped samples (e.g., Random Forest).

  • Boosting: Sequential training to fix previous errors (e.g., AdaBoost, XGBoost).

  • Stacking: Combine predictions of multiple models using a meta-learner.


🔹 7. Model Selection & Evaluation

  • Hyperparameter Tuning: Grid Search, Random Search, Bayesian Optimization.

  • Regularization: L1 (Lasso), L2 (Ridge).

  • Early Stopping: Stop training when validation loss doesn’t improve.


🔹 8. Common Interview Questions (Quick Recall)

  1. Difference between supervised, unsupervised, and reinforcement learning?

  2. Why does logistic regression use sigmoid instead of linear function?

  3. What is multicollinearity and how do you detect it?

  4. Explain bias-variance tradeoff.

  5. How do you handle imbalanced datasets?

  6. Difference between bagging and boosting?

  7. What is ROC-AUC and when is it better than accuracy?

  8. Difference between PCA and LDA?

  9. Explain gradient descent and learning rate effect.

  10. When to prefer Random Forest vs Gradient Boosting?




🧠 Deep Learning Cheatsheet (DS Interview-Focused)


🔹 1. What is Deep Learning?

  • A subset of machine learning that uses artificial neural networks (ANNs) with multiple hidden layers to learn complex patterns from data.

  • Learns hierarchical representations — low-level features in early layers, high-level features in deeper layers.

  • Best suited for unstructured data: images, audio, video, text.


🔹 2. Core Components of a Neural Network

  • Neuron (Perceptron): Basic computational unit
    y = f(Σ wᵢxᵢ + b)
  • Weights (w): Learnable parameters

  • Bias (b): Shifts the activation

  • Activation Function (f): Introduces non-linearity



🔹 3. Activation Functions

| Function | Formula | Range | Use Case |
|---|---|---|---|
| Sigmoid | 1 / (1 + e^(−x)) | (0, 1) | Binary classification output |
| Tanh | (e^x − e^(−x)) / (e^x + e^(−x)) | (−1, 1) | Hidden layers |
| ReLU | max(0, x) | [0, ∞) | Most common for hidden layers |
| Leaky ReLU | max(αx, x) | (−∞, ∞) | Prevents dying ReLU |
| Softmax | e^(z_i) / Σ_j e^(z_j) | (0, 1), sums to 1 | Multi-class output layer |
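The activation functions above as NumPy one-liners (a minimal sketch):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    return np.maximum(0.0, x)

def leaky_relu(x, alpha=0.01):
    return np.where(x > 0, x, alpha * x)

def softmax(z):
    e = np.exp(z - z.max())   # subtract max for numerical stability
    return e / e.sum()

probs = softmax(np.array([2.0, 1.0, 0.1]))  # sums to 1, largest logit wins
```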


🔹 4. Training Neural Networks

  • Forward Propagation → Calculate output layer predictions.

  • Loss Functions:

    • Regression → MSE, MAE.

    • Classification → Cross-Entropy, Hinge Loss.

  • Backpropagation → Compute gradients using chain rule.

  • Gradient Descent Variants:

    • Batch GD, Mini-batch GD, Stochastic GD.

    • Optimizers: SGD, Adam, RMSProp, Adagrad.


🔹 5. Key Concepts

  • Epoch: One pass through entire training data.

  • Batch Size: Number of samples per gradient update.

  • Learning Rate: Step size in gradient descent. Too high → divergence; too low → slow.

  • Overfitting Prevention:

    • Regularization (L1, L2)

    • Dropout

    • Early Stopping

    • Data Augmentation


🔹 6. Neural Network Architecture

  • Input Layer – Features

  • Hidden Layers – Transformations/feature learning

  • Output Layer – Predictions


🔹 7. Forward & Backpropagation

  • Forward pass: Compute predictions

  • Loss function: Measure error

  • Backward pass (Backpropagation): Compute gradients using chain rule

  • Gradient Descent: Update weights to minimize loss
    w = w − η · ∂L/∂w

  • Learning Rate (η): Step size during optimization
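The update rule on a toy loss L(w) = (w − 3)², whose minimum is at w = 3 (a minimal sketch):

```python
# Gradient descent on L(w) = (w - 3)^2, where dL/dw = 2*(w - 3)
w = 0.0
eta = 0.1                  # learning rate
for _ in range(100):
    grad = 2 * (w - 3)
    w = w - eta * grad     # w = w - eta * dL/dw

# w has converged very close to the minimum at 3
```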



🔹 8. Optimization Algorithms

| Optimizer | Description |
|---|---|
| SGD | Basic gradient descent |
| Momentum | Adds previous gradients to speed up convergence |
| RMSProp | Adapts learning rate per parameter |
| Adam | Combines Momentum + RMSProp (most used) |


🔹 9. Common Loss Functions

| Task | Loss |
|---|---|
| Regression | MSE (Mean Squared Error) |
| Binary Classification | Binary Cross-Entropy |
| Multi-class Classification | Categorical Cross-Entropy |
| Ranking | Hinge Loss |

🔹 10. Regularization Techniques

| Method | Purpose |
|---|---|
| L1/L2 Regularization | Penalize large weights |
| Dropout | Randomly deactivate neurons |
| Batch Normalization | Normalize activations to stabilize training |
| Early Stopping | Stop training before overfitting |

🔹 11. CNN (Convolutional Neural Networks)

  • Use Case: Image data, spatial features.

  • Layers:

    • Convolution → extracts features using filters/kernels.

    • Pooling → reduces dimensions (Max Pooling, Avg Pooling).

    • Fully Connected → classification head.

  • Concepts:

    • Padding (same vs valid).

    • Stride.

    • Transfer Learning with pre-trained models (ResNet, VGG, Inception).


🔹 12. RNN (Recurrent Neural Networks)

  • Use Case: Sequential data (time series, NLP).

  • Problem: Vanishing/Exploding gradients.

  • Variants:

    • LSTM (Long Short-Term Memory) → handles long dependencies.

    • GRU (Gated Recurrent Unit) → simpler, fewer parameters.

  • Attention Mechanism → focus on relevant parts of sequence.


🔹 13. Transformers (Modern NLP)

  • Self-Attention: Captures relationships between tokens.

  • Architecture: Encoder–Decoder.

  • Popular Models: BERT, GPT, T5.

  • Why Transformers > RNN: Parallelization, better long-range dependency capture.


🔹 14. Autoencoders

  • Use Case: Dimensionality reduction, anomaly detection.

  • Structure: Encoder (compress) + Decoder (reconstruct).

  • Variational Autoencoder (VAE) → generates new samples.


🔹 15. Generative Models

  • GAN (Generative Adversarial Networks):

    • Generator → creates fake data.

    • Discriminator → distinguishes real vs fake.

  • Applications: Image synthesis, text-to-image, deepfakes.


🔹 16. Regularization & Normalization

  • Dropout → randomly deactivate neurons during training.

  • Batch Normalization → normalizes layer outputs, stabilizes training.

  • Weight Decay (L2 Regularization) → penalizes large weights.


🔹 17. Evaluation Metrics (DL-specific)

  • Classification → Accuracy, Precision, Recall, F1, AUC.

  • Object Detection → IoU (Intersection over Union), mAP (mean Average Precision).

  • Segmentation → Dice Coefficient, Jaccard Index.

  • Language Models → Perplexity, BLEU score.


🔹 18. Key Deep Learning Tricks

  • Weight Initialization: Use He or Xavier for faster convergence

  • Learning Rate Scheduling: Reduce LR over time

  • Data Augmentation: Improves generalization (especially in vision tasks)

  • Transfer Learning: Use pre-trained models for small datasets


🔹 19. Deployment Best Practices

  • Save model: .h5 (Keras), .pt (PyTorch)

  • Export for inference: ONNX, TF Serving

  • Monitor drift: Check model performance post-deployment

  • Use GPUs/TPUs for large-scale training


🔹 20. Common Interview Questions (Quick Review)

  1. What is the vanishing gradient problem and how to solve it?

  2. Difference between CNN and RNN?

  3. Why use ReLU over sigmoid?

  4. What is the role of batch normalization?

  5. Explain LSTM internals (gates and cell state).

  6. How does attention work in transformers?

  7. What is transfer learning and why is it useful?

  8. How do you prevent overfitting in deep networks?

  9. Explain dropout and how it works.

  10. What is the difference between batch size and epoch?


📌 Tips for Interviews

  • Focus on intuitions behind architectures, not just formulas.

  • Be ready to draw neural network diagrams and explain data flow.

  • Know pros/cons and real-world applications of CNNs, RNNs, Transformers.

  • Be comfortable with libraries like TensorFlow, Keras, and PyTorch.





🧠 NLP Cheatsheet for Data Science


1. Basics

  • NLP → Process & analyze text data using algorithms & models.

  • Applications: Sentiment analysis, chatbots, translation, summarization, topic modeling.

Common terms:

  • Corpus → Collection of text

  • Token → Word or sentence unit

  • Vocabulary → Set of unique tokens

  • Stopwords → Common words with little semantic value (e.g., “is”, “the”)

  • Stemming → Reduce words to root (running → run)

  • Lemmatization → Convert to dictionary form (better than stemming)


2. Text Preprocessing

import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
# One-time setup: nltk.download('punkt'), nltk.download('stopwords'), nltk.download('wordnet')

text = "NLTK is a leading platform for building Python programs!"

# Lowercase
text = text.lower()

# Remove punctuation
text = re.sub(r'[^a-zA-Z]', ' ', text)

# Tokenization
from nltk.tokenize import word_tokenize, sent_tokenize
words = word_tokenize(text)
sentences = sent_tokenize(text)

# Remove stopwords
stop_words = set(stopwords.words('english'))
words = [w for w in words if w not in stop_words]

# Stemming
ps = PorterStemmer()
words_stemmed = [ps.stem(w) for w in words]

# Lemmatization
lemmatizer = WordNetLemmatizer()
words_lemma = [lemmatizer.lemmatize(w) for w in words]

3. Text Representation

3.1 Bag-of-Words (BoW)

  • Counts frequency of each word.

  • Example using CountVectorizer:

from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(max_features=500)
X = cv.fit_transform(corpus).toarray()

3.2 TF-IDF

  • Term Frequency-Inverse Document Frequency → weighs words by importance.

from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer(max_features=500)
X = tfidf.fit_transform(corpus).toarray()

3.3 Word Embeddings

  • Capture semantic meaning of words.

  • Word2Vec, GloVe, FastText

  • Libraries: gensim, spacy, tensorflow, pytorch


4. Text Similarity & NLP Tasks

  • Cosine Similarity → Measure similarity between vectors

from sklearn.metrics.pairwise import cosine_similarity
similarity = cosine_similarity(vec1, vec2)  # vec1, vec2: 2-D arrays (e.g., TF-IDF rows)
  • Text Classification → spam detection, sentiment analysis

  • Named Entity Recognition (NER) → extract names, locations

  • POS Tagging → Part-of-speech tagging (noun, verb, etc.)

  • Topic Modeling → LDA (Latent Dirichlet Allocation)


5. NLP Libraries

  • NLTK → Preprocessing, tokenization, stopwords, stem/lemma

  • spaCy → Fast tokenization, NER, POS tagging

  • gensim → Topic modeling, Word2Vec

  • TextBlob → Sentiment analysis

  • scikit-learn → TF-IDF, vectorization, ML models

  • transformers (HuggingFace) → BERT, GPT, other transformer models


6. Sequence Models for NLP

| Model | Use Case | Key Points |
|---|---|---|
| RNN | Text, sequential data | Maintains hidden state; suffers vanishing gradient |
| LSTM | Long sequences | Solves RNN vanishing gradient via gates |
| GRU | Lightweight LSTM | Fewer parameters, faster |
| Transformer | Modern NLP | Attention mechanism; parallelizable; BERT, GPT |

7. Pretrained Models

  • Word Embeddings: GloVe, Word2Vec, FastText

  • Transformers:

    • BERT → Contextual embeddings, masked language modeling

    • GPT → Text generation, autoregressive

    • RoBERTa, DistilBERT → Optimized BERT variants

  • Libraries: transformers (Hugging Face), torch, tensorflow


8. NLP Metrics

| Task | Metric |
|---|---|
| Classification | Accuracy, Precision, Recall, F1-score, ROC-AUC |
| Sequence Generation | BLEU, ROUGE, METEOR |
| Language Modeling | Perplexity |

9. Feature Engineering Tips

  • Remove stopwords, punctuation, numbers

  • Lowercase & normalize text

  • Consider n-grams for context (bigrams, trigrams)

  • Use TF-IDF or embeddings instead of raw counts

  • Handle imbalanced classes (SMOTE, weighted loss)
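N-grams can be generated with a plain-Python zip trick (toy sentence; in practice `nltk.ngrams` or `CountVectorizer(ngram_range=...)` do this):

```python
# Word n-grams via zip over shifted copies of the token list
def ngrams(tokens, n):
    return list(zip(*[tokens[i:] for i in range(n)]))

tokens = "the cat sat on the mat".split()
bigrams = ngrams(tokens, 2)    # [('the', 'cat'), ('cat', 'sat'), ...]
trigrams = ngrams(tokens, 3)
```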


10. Quick Interview Qs

  1. Difference between stemming and lemmatization?

  2. How is TF-IDF better than Bag-of-Words?

  3. What is word embedding? Why is it useful?

  4. Explain RNN, LSTM, GRU differences.

  5. How does attention mechanism work in transformers?

  6. What are some text preprocessing steps?

  7. Difference between context-free embeddings (Word2Vec) and contextual embeddings (BERT)?

  8. How to handle OOV (out-of-vocabulary) words?

  9. How to measure similarity between two sentences?

  10. Popular pretrained NLP models for sentiment analysis?





🤖 Generative AI Cheatsheet (Interview-Focused)


1. What is Generative AI?

  • Definition: AI that creates new content similar to training data.

  • Content Types: Text, images, audio, video, code, 3D models.

  • Applications:

    • Text: Chatbots, summarization, code generation

    • Images: AI art, deepfakes

    • Audio: Music generation, speech synthesis

    • Video: Animation, video synthesis


2. Key Concepts

  • Discriminative vs Generative Models:

| Type | Task | Example |
|---|---|---|
| Discriminative | Predict a label from input | Logistic Regression, SVM |
| Generative | Learn the data distribution & generate data | GAN, VAE, Diffusion Models |
  • Latent Space: Compressed representation capturing data features.

  • Sampling: Process of generating new data points from learned distribution.


3. Popular Generative Models

3.1 GANs (Generative Adversarial Networks)

  • Components:

    • Generator → Creates fake data

    • Discriminator → Distinguishes real vs fake

  • Objective: Minimax game
    min_G max_D V(D, G) = E_{x∼P_data}[log D(x)] + E_{z∼P_z}[log(1 − D(G(z)))]

  • Applications: Image synthesis, super-resolution, style transfer

3.2 VAEs (Variational Autoencoders)

  • Encoder → Maps input to latent distribution

  • Decoder → Reconstructs data from latent vector

  • Probabilistic latent space → Sample new data

  • Applications: Image generation, anomaly detection

3.3 Diffusion Models

  • Generate data by gradually denoising from Gaussian noise

  • Examples: DALL-E 2, Stable Diffusion

3.4 Transformer-Based Generative Models

  • GPT (Generative Pretrained Transformer) → Autoregressive text generation

  • BERT → Masked language modeling (not autoregressive)

  • T5, LLaMA, Falcon → Large language models for text generation


4. Training Techniques

  • Adversarial Training: GANs use generator vs discriminator

  • Reconstruction Loss: VAEs minimize reconstruction + KL divergence

  • Pretraining + Fine-tuning: Transformers pretrained on large corpora, fine-tuned for tasks


5. Evaluation Metrics

| Task | Metric |
|---|---|
| Images | Inception Score (IS), FID (Fréchet Inception Distance) |
| Text | Perplexity, BLEU, ROUGE, METEOR |
| Audio | Signal-to-Noise Ratio, MOS (Mean Opinion Score) |
| General | Human evaluation for realism & quality |

6. Code Sketch: Sampling from a VAE

# Assumes a trained `decoder`, plus `batch_size` and `latent_dim`, are already defined
import torch

z = torch.randn(batch_size, latent_dim)  # sample latent vectors from the standard normal prior
generated = decoder(z)                   # decode them into new samples

7. Applications in Industry

  • Text: ChatGPT, Jasper AI, Code generation (Copilot)

  • Images: DALL-E, MidJourney, Stable Diffusion

  • Audio: Jukebox (OpenAI), Speech synthesis

  • Video: RunwayML, DeepFake creation

  • Healthcare: Drug molecule generation


8. Common Interview Questions

  1. Difference between GAN, VAE, and Diffusion models?

  2. Explain the generator and discriminator roles in GANs.

  3. What is latent space? Why is it important?

  4. How do diffusion models generate images?

  5. Difference between autoregressive and masked language models?

  6. How do you evaluate generative models?

  7. What are challenges in training GANs?

  8. How is Generative AI different from traditional ML models?

  9. Name real-world applications of generative AI.

  10. How to prevent mode collapse in GANs?




⚡ Advanced Generative AI Cheatsheet


1. Core Idea

Generative AI = learning a data distribution P_data(x) and generating new samples that look like real data.

  • Input: Random noise or seed data

  • Output: Synthetic images, text, audio, or 3D data

  • Key property: Creativity + Realism


2. Generative Model Categories

| Type | Examples | Key Idea |
|---|---|---|
| Explicit Density Models | VAE, PixelRNN, Normalizing Flows | Learn P(x) explicitly |
| Implicit Density Models | GANs | Learn via an adversarial game; no explicit probability |
| Energy-Based Models (EBMs) | Boltzmann Machines | Model data via an energy function |
| Autoregressive Models | GPT, PixelCNN | Generate sequentially (factorized probability) |
| Diffusion Models | Denoising Diffusion Probabilistic Models | Gradually remove noise to generate data |

3. GANs (Generative Adversarial Networks)

  • Objective: Generator (G) creates data → Discriminator (D) distinguishes real vs fake

  • Loss Function:
    min_G max_D V(D, G) = E_{x∼P_data}[log D(x)] + E_{z∼P_z}[log(1 − D(G(z)))]

  • Variants:

    • DCGAN → Deep Convolutional GAN (images)

    • WGAN → Wasserstein GAN (stable training)

    • CycleGAN → Image-to-image translation without paired data

    • StyleGAN → High-quality controllable image synthesis

  • Common Problems & Solutions:

    • Mode Collapse → Generator produces limited variety

    • Vanishing Gradient → Use WGAN, label smoothing

    • Training instability → Careful learning rate tuning, batch normalization


4. Variational Autoencoders (VAEs)

  • Probabilistic model: Encode input to latent distribution q(z|x) → Sample → Decode

  • Loss Function: Reconstruction + KL divergence
    L = Reconstruction Loss + D_KL(q(z|x) ‖ p(z))

  • Applications: Image generation, anomaly detection, data compression


5. Diffusion Models

  • Idea: Start with noise → Iteratively denoise to generate data

  • Steps:

    1. Forward process: add Gaussian noise to data

    2. Reverse process: learn denoising function

  • Popular models: DALL-E 2, Imagen, Stable Diffusion

  • Pros: High-quality images, stable training

  • Cons: Slow sampling, compute-intensive


6. Transformer-Based Generative Models

  • GPT (Autoregressive): Predict next token → generate text

  • BERT (Masked LM): Predict masked tokens → contextual embeddings

  • T5 / BART: Seq2Seq → summarization, translation

  • LLMs: ChatGPT, LLaMA, Falcon, GPT-4 → text generation, coding, reasoning

Key Components:

  • Multi-head Self-Attention

  • Positional Encoding

  • Feedforward Layers

  • Layer Normalization


7. Evaluation Metrics

  • Text: Perplexity, BLEU, ROUGE, METEOR

  • Images: FID (Fréchet Inception Distance), IS (Inception Score), human evaluation

  • Audio: MOS (Mean Opinion Score), SNR (Signal-to-Noise Ratio)

  • General: Diversity, novelty, coherence


8. Feature Techniques in Generative AI

  • Latent Space Manipulation: Interpolation, style transfer, attribute editing

  • Conditional Generation: Generate based on labels or prompts

    • Example: Conditional GAN (cGAN) → generate images conditioned on class

  • Prompt Engineering: Critical for text & multimodal generation


9. Practical Implementation Tips

  • Data Augmentation: Increases diversity of training samples

  • Transfer Learning: Fine-tune pre-trained models (GPT, Stable Diffusion)

  • Compute Optimization: Use mixed precision, distributed training for large models

  • Safety & Bias: Check outputs for toxicity, hallucinations, or bias


10. Popular Generative AI Applications

  • Text: Chatbots, code generation, story writing

  • Images: AI art, avatars, deepfakes, medical image synthesis

  • Audio: Music, voice cloning, speech synthesis

  • Video: Animation, deepfake videos, scene generation

  • Healthcare: Drug discovery, molecule generation

  • Marketing: Personalized content creation, ad generation


11. Interview-Focused Questions

  1. Difference between GAN, VAE, and Diffusion models?

  2. How does self-attention work in transformers?

  3. Explain mode collapse and solutions in GANs.

  4. How to evaluate generative models?

  5. What is latent space and how is it used?

  6. Explain conditional vs unconditional generation.

  7. Challenges in training diffusion models.

  8. Applications of Generative AI in industry.

  9. How do you fine-tune a pre-trained generative model?

  10. Ethical considerations & bias in Generative AI.








