MLOps - I
1. Foundations of MLOps
What is MLOps?
✅ Definition:
MLOps (Machine Learning Operations) is a set of practices that combines Machine Learning (ML), DevOps, and Data Engineering to deploy, monitor, and maintain ML models in production reliably and efficiently.
It aims to automate and streamline the end-to-end machine learning lifecycle, from data ingestion to model deployment and monitoring.
Core Components of MLOps:
1. Model Development
   - Data preprocessing
   - Feature engineering
   - Model training and evaluation
2. Model Deployment
   - Serving the model via REST APIs or batch pipelines
   - Scalable deployment using Docker, Kubernetes, etc.
3. Model Monitoring
   - Tracking performance drift, data drift, and model accuracy
   - Logging and alerting mechanisms
4. CI/CD for ML
   - Continuous Integration (CI): automated testing of ML pipelines
   - Continuous Delivery (CD): automated deployment of models
5. Model Versioning & Experiment Tracking
   - Tools like MLflow, DVC, or Weights & Biases
   - Reproducibility and rollback
6. Data & Feature Management
   - Feature stores (e.g., Feast, Tecton)
   - Data versioning tools like DVC
Objectives of MLOps:
- Faster model deployment
- Reliable and reproducible results
- Scalable workflows
- Reduced technical debt
- Collaborative development between data scientists and operations teams
Tools Commonly Used in MLOps:
| Category | Tools |
|---|---|
| Version Control | Git, DVC |
| Experiment Tracking | MLflow, Neptune.ai |
| Model Serving | TensorFlow Serving, TorchServe, FastAPI |
| Orchestration | Airflow, Kubeflow, Prefect |
| Deployment | Docker, Kubernetes, AWS SageMaker |
| Monitoring | Prometheus, Grafana, WhyLabs |
MLOps vs DevOps:
| DevOps | MLOps |
|---|---|
| Focuses on app/software development lifecycle | Focuses on ML lifecycle (data, code, model) |
| Continuous Integration/Delivery | CI/CD + Continuous Training/Monitoring |
| Unit testing and static checks | Data validation, model evaluation |
MLOps Lifecycle
The MLOps lifecycle covers the end-to-end process of developing, deploying, and maintaining machine learning models in production. It integrates ML workflows with DevOps principles to ensure automation, scalability, collaboration, and reliability.
1. Problem Definition & Business Understanding
- Identify business goals and success metrics.
- Translate the problem into a machine learning task (classification, regression, etc.).
2. Data Engineering
- Data Collection: ingest data from multiple sources (APIs, DBs, logs).
- Data Validation: check data quality, missing values, and schema.
- Data Versioning: use tools like DVC for reproducibility.
- Data Preprocessing: cleaning, normalization, handling class imbalance.

Tools: Airflow, DVC, Great Expectations, pandas, Spark
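To make the data-validation step concrete, here is a minimal pure-Python sketch. The field names, expected schema, and sample rows are hypothetical examples, not part of any real pipeline; production systems would use a tool like Great Expectations instead.

```python
# Minimal data-validation sketch (illustrative only; field names and
# schema are hypothetical assumptions for this example).
EXPECTED_SCHEMA = {"age": int, "income": float, "churned": int}

def validate_rows(rows):
    """Return (valid_rows, errors) after missing-value and type checks."""
    valid, errors = [], []
    for i, row in enumerate(rows):
        # Missing-value check: every expected field must be present and non-null
        missing = [k for k in EXPECTED_SCHEMA if k not in row or row[k] is None]
        if missing:
            errors.append((i, f"missing fields: {missing}"))
            continue
        # Schema check: field types must match the expected schema
        bad = [k for k, t in EXPECTED_SCHEMA.items() if not isinstance(row[k], t)]
        if bad:
            errors.append((i, f"wrong types: {bad}"))
            continue
        valid.append(row)
    return valid, errors

rows = [
    {"age": 34, "income": 55000.0, "churned": 0},
    {"age": 41, "income": None, "churned": 1},       # missing value
    {"age": "n/a", "income": 42000.0, "churned": 0}, # wrong type
]
valid, errors = validate_rows(rows)
print(len(valid), len(errors))  # 1 2
```

The same idea scales up: validation runs as its own pipeline stage, and rows that fail checks are logged rather than silently dropped.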
3. Feature Engineering & Feature Store
- Derive meaningful features from raw data.
- Store and reuse features across teams and models.

Tools: Feast, Tecton, Featureform
4. Model Development
- Model selection, training, and evaluation.
- Hyperparameter tuning and cross-validation.
- Experiment tracking and versioning.

Tools: Jupyter, scikit-learn, MLflow, Weights & Biases
5. Model Validation & Testing
- Validate the model on holdout/test datasets.
- Evaluate using relevant metrics (accuracy, F1-score, RMSE, etc.).
- Perform fairness, explainability, and robustness checks.

Tools: SHAP, LIME, Fairlearn, EvidentlyAI
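To pin down what accuracy and F1 actually compute, here is a self-contained sketch with the formulas written out in plain Python (the label vectors are made-up examples; in practice you would use `sklearn.metrics`):

```python
def accuracy(y_true, y_pred):
    # Fraction of predictions that match the true labels
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def f1_score(y_true, y_pred, positive=1):
    # F1 = harmonic mean of precision and recall for the positive class
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 0, 0, 1, 1, 1]
print(round(accuracy(y_true, y_pred), 3))  # 0.667
print(round(f1_score(y_true, y_pred), 3))  # 0.75
```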
6. Model Deployment
- Convert models into production-ready APIs or batch jobs.
- Choose a deployment strategy:
  - Batch inference
  - Real-time (REST API)
  - Edge deployment

Tools: Docker, Kubernetes, TensorFlow Serving, TorchServe, FastAPI, Flask
7. Continuous Integration / Continuous Delivery (CI/CD)
- Automate training, testing, and deployment pipelines.
- Enable reproducibility and rollback.

Tools: GitHub Actions, Jenkins, GitLab CI, CircleCI, Argo Workflows
8. Model Monitoring & Management
- Monitor:
  - Model performance (accuracy, latency)
  - Data drift and concept drift
- Alerting and retraining triggers when needed.

Tools: Prometheus, Grafana, WhyLabs, Fiddler, Evidently, Seldon
9. Model Retraining & Feedback Loop
- Retrain models based on new data or performance degradation.
- Automate with continuous training pipelines.

Tools: Kubeflow Pipelines, TFX, Metaflow
Summary Diagram:

```
[Problem] → [Data Engg] → [Feature Engg] → [Model Dev] → [Validation]
     ⬆                                                        ⬇
[Retraining & Feedback Loop] ← [Monitoring] ← [Deployment] ← [CI/CD]
```
⚠️ Challenges in Traditional ML Workflows
Traditional ML workflows often face operational, scalability, and collaboration challenges when moving from model development to production. These issues become more severe in real-world, large-scale applications.
1. Manual and Fragmented Processes
- No automation across data preprocessing, training, validation, and deployment.
- Data scientists write code locally; engineers reimplement it for production, leading to duplication and errors.
2. Poor Reproducibility
- No version control of datasets, models, or code.
- Difficult to reproduce experiments or trace model outputs to exact configurations.

Solution: use Git, DVC, and MLflow for versioning.
3. Hard to Deploy Models into Production
- Trained models are often shared as pickled files or ad-hoc scripts.
- No standardized interface for model serving (e.g., REST API, batch jobs).
- Lack of containerization and scalable serving infrastructure.
4. Lack of Collaboration Between Teams
- Data scientists, ML engineers, and DevOps often work in silos.
- No common pipeline or workflow to hand off models between teams.
5. Model Degradation Over Time
- Once deployed, models aren't monitored for data drift, performance decay, or real-world behavior.
- No system to trigger retraining or alert on poor performance.

Solution: use monitoring tools (EvidentlyAI, Prometheus) and retraining pipelines.
6. No CI/CD or Automated Pipelines
- Manual testing and deployment steps.
- Inability to quickly test new data or retrain models in a reliable way.

Solution: use CI/CD with GitHub Actions, Jenkins, or Kubeflow Pipelines.
7. Data Security and Compliance Issues
- Lack of controls over sensitive data usage.
- Non-compliance with regulations like GDPR can lead to legal risk.
8. Experiment Tracking is Manual or Missing
- Results stored in notebooks or spreadsheets.
- Hard to compare models, tune hyperparameters, or audit outcomes.

Solution: use tools like MLflow, Neptune.ai, or Weights & Biases.
9. Inconsistent Environments
- Code works locally but fails in production due to different Python/library versions or hardware.
- No use of virtual environments, Docker, or reproducible infrastructure.
Summary Table
| Challenge | Consequence | MLOps Solution |
|---|---|---|
| Manual workflows | Slower dev cycles | Automate with pipelines |
| Poor reproducibility | Hard to debug/replicate | Version control (DVC, MLflow) |
| Deployment gap | Models not reaching production | Standardized serving (Docker, REST) |
| Siloed teams | Inefficient handoffs | Collaborative CI/CD workflows |
| Model decay | Business impact | Monitoring + retraining |
| No CI/CD | Risky manual deployments | Automated CI/CD |
| No tracking | Loss of insight | Experiment mgmt tools |
| Env mismatch | Code breaks in prod | Docker, containerization |
Key Goals of MLOps
1. Reproducibility
Goal: ensure that the same results can be consistently reproduced across environments, by any team member.
Why it's important:
- Debug and trace model behavior.
- Ensure scientific and engineering integrity.
- Comply with audits and regulations.
✅ How MLOps helps:
- Code versioning using Git.
- Data versioning with DVC or LakeFS.
- Experiment tracking (MLflow, Weights & Biases).
- Environment isolation (Docker, Conda, virtualenv).
- Metadata logging for all pipeline stages.
⚙️ 2. Automation
Goal: eliminate manual steps and build robust, repeatable workflows for training, testing, and deployment.
Why it's important:
- Reduces human error and effort.
- Enables faster iteration and delivery.
- Standardizes processes across teams.
✅ How MLOps helps:
- CI/CD pipelines (GitHub Actions, Jenkins, Argo Workflows).
- Automated data validation (Great Expectations).
- Managed pipeline services (SageMaker Pipelines, Vertex AI Pipelines).
- Scheduled retraining and model deployment jobs.
3. Scalability
Goal: seamlessly handle increasing data volume, compute demand, and model complexity.
Why it's important:
- ML workloads grow with business and data size.
- Ensures consistent performance across models and teams.
✅ How MLOps helps:
- Containerization (Docker) for portable environments.
- Orchestration using Kubernetes or Kubeflow.
- Distributed computing via Spark, Ray, or Dask.
- Cloud integration (AWS, GCP, Azure) for elastic compute.
4. Monitoring
Goal: continuously track model performance, system health, and data behavior in production.
Why it's important:
- Detect data drift, model decay, and latency issues.
- Prevent silent model failures.
- Enable retraining triggers and alerts.
✅ How MLOps helps:
- Model performance tracking (EvidentlyAI, WhyLabs).
- Data drift detection (Fiddler, Alibi Detect).
- Metrics/log dashboards (Prometheus, Grafana, ELK Stack).
- Alerting via Slack, email, or PagerDuty integrations.
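As a rough illustration of drift detection, here is a hedged stdlib-only sketch that flags a feature when its production mean drifts several standard errors away from the training mean. The threshold and data are made up for the example; real monitoring tools use proper statistical tests (KS test, PSI, etc.).

```python
import statistics

def mean_shift_alert(train_values, prod_values, threshold=3.0):
    """Illustrative heuristic: alert if the production mean moves more than
    `threshold` standard errors away from the training mean."""
    mu = statistics.mean(train_values)
    sigma = statistics.stdev(train_values)
    se = sigma / (len(prod_values) ** 0.5)  # standard error of the mean
    z = abs(statistics.mean(prod_values) - mu) / se
    return z > threshold

train = [10.0, 11.0, 9.5, 10.5, 10.2, 9.8, 10.1, 10.4]
stable = [10.1, 9.9, 10.3, 10.0]    # looks like training data
shifted = [14.0, 14.5, 13.8, 14.2]  # clearly drifted
print(mean_shift_alert(train, stable))   # False
print(mean_shift_alert(train, shifted))  # True
```

In a real pipeline this check would run on a schedule and feed the alerting/retraining triggers described above.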
Summary Table
| Goal | Problem it Solves | MLOps Tools |
|---|---|---|
| Reproducibility | Inconsistent results | Git, DVC, MLflow, Docker |
| Automation | Manual errors, slow cycles | CI/CD, Airflow, Kubeflow |
| Scalability | Data/model growth | Kubernetes, Spark, Cloud |
| Monitoring | Undetected failures | Prometheus, EvidentlyAI, Grafana |
2. Version Control Systems
Git & Git Platforms (GitHub / GitLab / Bitbucket)
1. Git: Version Control System
✅ Definition:
Git is a distributed version control system that helps track changes in source code, collaborate on codebases, and manage different versions of projects.
Why Git is Essential in MLOps:
- Tracks changes in code, configs, and notebooks.
- Enables collaborative model development.
- Provides rollback and branch management.
- Integrates with CI/CD pipelines for automation.
Key Git Concepts:
| Concept | Description |
|---|---|
| `git init` | Initialize a Git repository |
| `git clone` | Copy a remote repo to your local machine |
| `git add` | Stage changes for commit |
| `git commit` | Save staged changes to history |
| `git push` / `git pull` | Upload/download to/from a remote repo |
| `git branch` / `git merge` | Manage multiple versions (branches) of code |
| `git log` | View commit history |
| `.gitignore` | Exclude files from tracking (e.g., .env, large datasets) |
2. Git Hosting Platforms
| Platform | Description | Key MLOps Use |
|---|---|---|
| GitHub | Most popular; free for open source; integrates with GitHub Actions for CI/CD | Collaborations, CI/CD, open-source projects |
| GitLab | Self-hosted or cloud; built-in DevOps pipelines | End-to-end DevOps lifecycle (CI/CD + Repo + Registry) |
| Bitbucket | Integrated with Atlassian (Jira, Confluence) | Enterprise collaboration & issue tracking |
How Git Platforms Support MLOps:

CI/CD Integration
- Run tests, linting, and model evaluation on every commit.
- Deploy models automatically via GitHub Actions, GitLab CI, or Bitbucket Pipelines.
Collaboration
- Pull Requests / Merge Requests for code review and discussion.
- Branch-based workflows (e.g., `dev`, `main`, `experiments`).
Artifacts & Package Management
- GitLab and Bitbucket support storing model artifacts and Docker images.
Security & Access Control
- Role-based access to repositories.
- Secrets and environment variable management for pipelines.
Example: GitHub in an MLOps Pipeline

```mermaid
graph LR
  A[Data Scientist] -->|Push Code| B[GitHub Repo]
  B --> C[GitHub Actions CI/CD]
  C --> D[Model Training Job]
  C --> E[Unit Tests, Linting]
  C --> F[Model Deployment]
```
⚠️ Best Practices in Git for MLOps
- Keep large data and models out of Git; use DVC or cloud storage.
- Use meaningful commit messages.
- Use `.gitignore` wisely.
- Branching strategy: `main`, `dev`, `feature/*`, `experiment/*`.
- Automate pipelines with GitHub Actions/GitLab CI.
DVC (Data Version Control)
✅ Definition:
DVC is an open-source tool that extends Git capabilities to handle versioning of large data files, ML models, and experiments.
Think of DVC as Git for data and ML pipelines.
Why DVC in MLOps?

Traditional Git:
- Can't efficiently version large files (e.g., datasets, `.pkl` or `.h5` models).
- Has no support for ML pipeline steps.

DVC:
- Helps track, version, and share large datasets and model artifacts.
- Supports reproducible experiments and collaboration.
Core Features of DVC
| Feature | Description |
|---|---|
| Data Versioning | Track large files (datasets, models) using `dvc add` instead of Git |
| Pipeline Management | Define ML pipelines in `dvc.yaml` |
| Experiment Tracking | Compare multiple model runs with `dvc exp run` |
| Remote Storage Support | Store data/models in S3, GCS, Azure, SSH, etc. |
| Reproducibility | Automatically captures data, code, and config dependencies |
Basic DVC Workflow

```shell
# 1. Initialize DVC in a Git project
dvc init

# 2. Add a dataset to DVC tracking
dvc add data/train.csv

# 3. Track the DVC metadata in Git
git add data/train.csv.dvc .gitignore
git commit -m "Track training data with DVC"

# 4. Configure remote storage (e.g., S3) and push the data
dvc remote add -d myremote s3://mybucket/path
dvc push

# 5. Create a pipeline stage
dvc run -n train_model -d train.py -d data/train.csv -o model.pkl python train.py

# 6. Visualize the pipeline stages
dvc dag
```

Note: in DVC 2.0+, `dvc run` is superseded by `dvc stage add` followed by `dvc repro`.
Remote Storage Options
| Type | Examples |
|---|---|
| Cloud | AWS S3, GCP, Azure Blob |
| Network | SSH, WebDAV |
| Local | Shared folders, NFS |
Experiment Tracking with DVC

```shell
# Run and track experiments
dvc exp run

# List and compare experiments
dvc exp show

# Apply the best experiment and commit it to Git
dvc exp apply <exp_id>
git commit -am "Best experiment"
```
How DVC Supports MLOps Goals
| MLOps Goal | How DVC Helps |
|---|---|
| Reproducibility | Tracks the exact data, code, and params used in each run |
| Automation | Pipelines can be triggered via CI/CD tools |
| Collaboration | Share .dvc files and let others pull data via `dvc pull` |
| Experiment Mgmt | Run isolated experiments and compare results |
DVC Folder Structure (Example)

```
project/
├── data/              # Large data files (Git-ignored)
│   └── train.csv
├── model.pkl          # Model file (Git-ignored)
├── train.py           # Training script
├── dvc.yaml           # Pipeline definition
├── dvc.lock           # Snapshot of the current run
├── .dvc/              # Internal DVC files
└── .gitignore         # Auto-updated by DVC
```
Tips & Best Practices:
- Never push large data directly to Git.
- Track `.dvc` files in Git so you know which version of data/model you used.
- Integrate DVC with GitHub Actions or GitLab CI for automated ML pipelines.
- Use DVC Studio (GUI) for experiment comparison and collaboration.
MLflow Tracking
✅ What is MLflow Tracking?
MLflow Tracking is a component of the MLflow platform used to log, organize, compare, and query machine learning experiments.
It helps you track model training runs, parameters, metrics, artifacts, and source code — all in a centralized system.
Think of it as an experiment tracker for reproducible and collaborative ML.
Core Components of MLflow Tracking
| Component | Description |
|---|---|
| Run | A single execution of training script (with params, metrics, etc.) |
| Experiment | A collection/group of runs (e.g., all models for one business use case) |
| Parameters (params) | Hyperparameters like learning rate, max_depth |
| Metrics | Quantitative results like accuracy, loss, RMSE |
| Artifacts | Files like models, plots, checkpoints |
| Tags | User-defined labels for filtering and searching |
| Source | Git commit ID or script used in the run |
How to Use MLflow Tracking

✅ Step-by-Step Usage in Code:

```python
import mlflow

# Select (or create) the experiment
mlflow.set_experiment("churn_prediction")

with mlflow.start_run():
    # Log parameters
    mlflow.log_param("max_depth", 5)
    mlflow.log_param("learning_rate", 0.1)

    # Train your model (example)
    model = train_model(...)

    # Log metrics
    mlflow.log_metric("accuracy", 0.89)
    mlflow.log_metric("f1_score", 0.76)

    # Log the model and other artifacts
    mlflow.sklearn.log_model(model, "model")
    mlflow.log_artifact("plots/confusion_matrix.png")
```
MLflow UI

You can launch the UI to view runs:

```shell
mlflow ui
```

- Runs on http://localhost:5000 by default
- Visual comparison of experiments
- Filter/search by metric, param, or tag
Storage Backends (for the Tracking Server)
| Backend | Description |
|---|---|
| Local File System | Default setup; good for quick trials |
| Remote DB (MySQL/Postgres) | Production-ready tracking |
| S3/MinIO/Azure | For storing large artifacts |
| Tracking Server | Can be hosted locally or remotely with REST API access |
MLflow in MLOps Pipelines
| Stage | Use of MLflow |
|---|---|
| Experimentation | Track multiple model versions and their performance |
| CI/CD | Log and compare runs automatically in training pipelines |
| Collaboration | Share experiment dashboards with team |
| Reproducibility | Every run is logged with code, data version, and env metadata |
MLflow + Tools Integration
- MLflow + DVC → combined code/data versioning
- MLflow + GitHub Actions → auto-log runs in CI/CD
- MLflow + Airflow/Kubeflow → schedule and track pipeline steps
- MLflow + Docker/K8s → track runs in containerized/cloud environments
Best Practices
- Use meaningful experiment and run names.
- Use tags to add context (e.g., "model_type: random_forest").
- Log metrics for every epoch/step (e.g., `mlflow.log_metric("loss", val, step=epoch)`).
- Log artifacts such as:
  - Model binaries
  - Plots (confusion matrix, learning curves)
  - JSON/YAML config files
Quick CLI Commands

```shell
mlflow experiments list
mlflow runs list --experiment-name "churn_prediction"
mlflow ui
```
Summary
| Feature | MLflow Tracking |
|---|---|
| Parameters | ✅ |
| Metrics | ✅ |
| Artifacts | ✅ |
| Code tracking | ✅ |
| UI for comparison | ✅ |
| Backend agnostic | ✅ |
| REST API available | ✅ |
Model Versioning in MLOps
✅ What is Model Versioning?
Model versioning refers to the process of tracking, managing, and storing multiple versions of machine learning models over time — including their parameters, training data, code, and artifacts.
Just like code versioning (with Git), model versioning enables reproducibility, rollback, and collaboration.
Why Model Versioning is Important
| Benefit | Description |
|---|---|
| Reproducibility | Recreate a model with the exact same data, code, and hyperparameters |
| Rollback Support | Revert to a previous model if a new one underperforms |
| Performance Tracking | Compare model versions over time or across experiments |
| Collaboration | Share specific versions with teams for review, testing, or deployment |
| Compliance & Audit | Track what was deployed and when (for regulated industries) |
What to Version in a Model
| Component | Why It's Important |
|---|---|
| Model code | Ensures the logic is reproducible |
| Training data & schema | Data changes affect model outcomes |
| Hyperparameters | Key to model performance |
| Model artifact (e.g., .pkl, .pt, .h5) | Needed for loading and inference |
| Evaluation metrics | Needed for comparison |
| Environment | Python, libraries (pip, Conda, Docker) |
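One lightweight way to capture most of these components is to write a metadata file next to each model artifact. The sketch below is illustrative only: the `metadata.json` name and its fields are assumptions for this example, not a standard format.

```python
import json
import pathlib
import tempfile

def save_model_metadata(model_dir, version, params, metrics, git_commit):
    """Write a metadata.json sidecar next to the model artifact
    (illustrative pattern; field names are hypothetical)."""
    meta = {
        "version": version,
        "params": params,          # hyperparameters used for training
        "metrics": metrics,        # evaluation results
        "git_commit": git_commit,  # code version that produced the model
    }
    path = pathlib.Path(model_dir) / "metadata.json"
    path.write_text(json.dumps(meta, indent=2))
    return path

with tempfile.TemporaryDirectory() as d:
    p = save_model_metadata(d, "v1.0.0",
                            {"max_depth": 5}, {"accuracy": 0.89}, "abc123")
    loaded = json.loads(p.read_text())
    print(loaded["version"], loaded["metrics"]["accuracy"])  # v1.0.0 0.89
```

Registries like MLflow store the same kind of metadata for you; the sidecar pattern is mainly useful for simple, file-based workflows.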
Tools for Model Versioning
| Tool | Role |
|---|---|
| MLflow | Tracks models, versions, and metadata |
| DVC | Data/model versioning alongside Git |
| Weights & Biases | Model checkpoints + metrics versioning |
| SageMaker Model Registry | Versioning + deployment-ready |
| MLflow Model Registry | Register, promote, stage/production models |
| Git + Git LFS | Basic support (not ideal for large binary files) |
MLflow Model Versioning Workflow

```python
import mlflow
from mlflow.tracking import MlflowClient

# Log the model (flavor-specific API, e.g., scikit-learn)
mlflow.sklearn.log_model(model, "model")

# Register a model version from a completed run
mlflow.register_model("runs:/<run_id>/model", "ChurnModel")

# View it in the Model Registry UI (MLflow UI → Models tab)

# Change stage (Staging → Production)
client = MlflowClient()
client.transition_model_version_stage(
    name="ChurnModel",
    version=2,
    stage="Production"
)
```
Best Practices for Model Versioning
- Always tag versions with metadata: dataset version, hyperparams, Git commit hash.
- Store artifacts in cloud/remote storage: S3, GCS, or shared buckets.
- Use semantic versioning: v1.0.0, v1.1.0, etc.
- Link models to experiments so you know which experiment produced which version.
- Promote models through stages, e.g., Staging → Production in the MLflow Registry.
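One subtlety with semantic version tags: they only order correctly when compared numerically, not as strings ("v1.10.0" is newer than "v1.9.0" even though it sorts earlier lexically). A minimal sketch:

```python
def parse_semver(tag):
    """Turn a tag like 'v1.10.0' into the tuple (1, 10, 0) so that
    comparisons are numeric, not lexicographic."""
    return tuple(int(part) for part in tag.lstrip("v").split("."))

versions = ["v1.9.0", "v1.10.0", "v1.2.3", "v2.0.0"]
latest = max(versions, key=parse_semver)
print(latest)  # v2.0.0
print(parse_semver("v1.10.0") > parse_semver("v1.9.0"))  # True
```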
Example: Folder Structure with Versioning

```
models/
├── v1/
│   ├── model.pkl
│   ├── metrics.json
│   └── params.yaml
├── v2/
│   ├── model.pkl
│   ├── metrics.json
│   └── params.yaml
```

Or tracked using tools like MLflow:

```
mlruns/
├── 1/
│   └── run_id/
│       ├── metrics/
│       ├── params/
│       └── artifacts/
```
Summary
| Aspect | Notes |
|---|---|
| What to version? | Model, data, code, metrics, params |
| Benefits | Reproducibility, rollback, comparison |
| Tools | MLflow, DVC, W&B, SageMaker |
| Best practice | Link model to source code + data versions |
Model Registry in MLOps
✅ What is a Model Registry?
A Model Registry is a centralized store or service that manages versioned ML models, their metadata, approval stages, and deployment status.
Think of it as a "model management system": like Git for ML models, but with built-in support for staging, tracking, and deployment.
Why Use a Model Registry?
| Need | Purpose |
|---|---|
| Model versioning | Track multiple versions of each model |
| Stage transitions | Move models from "Staging" to "Production" systematically |
| Centralized metadata | Store metrics, source code, tags, artifacts, etc. |
| Governance | Approvals, audit logs, ownership, access control |
| Deployment readiness | Integrates with CI/CD for promoting and serving models |
Key Features of a Model Registry
| Feature | Description |
|---|---|
| Model storage | Central place for all model artifacts |
| Versioning | Keep track of all model versions (e.g., v1, v2, ...) |
| Metrics tracking | Associate evaluation metrics with each version |
| Stage transitions | Move models between stages: None, Staging, Production, Archived |
| Permissions | Control who can approve, deploy, or modify models |
| CI/CD Integration | Automate promotion and deployment pipelines |
Popular Model Registries
| Tool | Highlights |
|---|---|
| MLflow Model Registry | Integrated with MLflow Tracking & Projects |
| SageMaker Model Registry | Native to AWS ecosystem with deployment support |
| Databricks MLflow Registry | Enterprise-grade hosted MLflow |
| Azure ML Model Registry | Built into Azure ML platform |
| Triton Inference Server Registry | NVIDIA-based deployment registry |
| Feast (Feature Registry) | Not for models, but features – still vital |
MLflow Model Registry: Example Workflow

```python
from mlflow.tracking import MlflowClient

# Set up the MLflow client
client = MlflowClient()

# Register a model
result = client.create_registered_model("ChurnModel")

# Add a model version
model_uri = "runs:/<run_id>/model"
client.create_model_version("ChurnModel", model_uri, "<run_id_path>")

# Transition to staging
client.transition_model_version_stage(
    name="ChurnModel",
    version=2,
    stage="Staging"
)

# Move to production after validation
client.transition_model_version_stage(
    name="ChurnModel",
    version=2,
    stage="Production"
)
```
Stages in the Model Registry
| Stage | Purpose |
|---|---|
| None | Model is registered but not yet assigned a stage |
| Staging | Under testing and validation |
| Production | Live model used in the production environment |
| Archived | Deprecated version kept for record or rollback |
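The stage lifecycle can be thought of as a small state machine. Below is a hedged sketch: the allowed-transition table is an illustrative policy for this example, not MLflow's actual rules (MLflow permits arbitrary transitions unless you enforce a policy yourself).

```python
# Illustrative stage-transition policy (an assumed policy, not MLflow's)
ALLOWED = {
    "None": {"Staging", "Archived"},
    "Staging": {"Production", "Archived"},
    "Production": {"Archived"},
    "Archived": set(),  # terminal: archived models stay archived
}

def transition(current, target):
    """Return the new stage, or raise if the move violates the policy."""
    if target not in ALLOWED[current]:
        raise ValueError(f"illegal transition {current} -> {target}")
    return target

stage = "None"
stage = transition(stage, "Staging")     # promote for validation
stage = transition(stage, "Production")  # promote after validation passes
print(stage)  # Production
```

Enforcing such a policy in CI/CD prevents, for example, a model jumping straight from None to Production without validation.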
Example: Model Metadata in the Registry

```yaml
Model: ChurnModel
Version: 3
Stage: Production
Run ID: 8f9c9c872
Metrics:
  Accuracy: 0.91
  F1 Score: 0.87
Tags:
  model_type: RandomForest
  dataset_version: v2.1
```
Best Practices
- Tag models with:
  - Dataset version
  - Git commit hash
  - Hyperparameter config ID
- Automate transitions using CI/CD tools.
- Archive outdated or underperforming models.
- Monitor production models and trigger retraining pipelines as needed.
Summary
| Feature | Purpose |
|---|---|
| Version Control | Track all model versions with metadata |
| Lifecycle Stages | Move models from Staging to Production safely |
| Performance Tracking | Store metrics for comparison |
| Governance | Role-based control, approvals |
| CI/CD Integration | Automate promotion & deployment |
3. Python for MLOps
Virtual Environments (venv, conda)
✅ What is a Virtual Environment?
A virtual environment is an isolated workspace where you can install specific packages and dependencies without affecting the global Python environment.
It ensures reproducibility, dependency management, and environment isolation, all key for collaborative ML projects and MLOps pipelines.
Why Use Virtual Environments in ML/MLOps?
| Reason | Benefit |
|---|---|
| Reproducibility | Same environment across dev, test, and prod |
| Isolation | Avoid package conflicts between projects |
| Control | Lock specific versions of dependencies (e.g., scikit-learn==1.2.2) |
| Automation | Easily export and recreate envs from files (requirements.txt, environment.yml) |
| CI/CD Friendly | Use exact envs in pipelines or Docker images |
⚙️ 1. venv (Python built-in)
Create a venv:

```shell
python -m venv myenv
```

Activate it:

| OS | Command |
|---|---|
| Windows | myenv\Scripts\activate |
| macOS/Linux | source myenv/bin/activate |

Install packages:

```shell
pip install numpy pandas scikit-learn
```

Freeze the environment:

```shell
pip freeze > requirements.txt
```

Recreate the environment elsewhere:

```shell
python -m venv myenv
source myenv/bin/activate
pip install -r requirements.txt
```
2. conda (Anaconda/Miniconda)

Create a conda environment:

```shell
conda create -n ml-env python=3.10
```

Activate it:

```shell
conda activate ml-env
```

Install packages:

```shell
conda install pandas scikit-learn
# or use pip inside the conda env
pip install transformers
```

Export the environment:

```shell
conda env export > environment.yml
```

Recreate from YAML:

```shell
conda env create -f environment.yml
```
venv vs conda – When to Use What
| Feature | venv | conda |
|---|---|---|
| Built-in? | ✅ (Python stdlib) | ❌ (needs Anaconda/Miniconda) |
| Virtual envs | ✅ | ✅ |
| Package manager | pip | conda + pip |
| Handles non-Python deps | ❌ | ✅ (e.g., OpenCV, CUDA) |
| Cross-platform | ✅ | ✅ |
| Best for | Lightweight Python-only projects | Complex projects (e.g., ML/DL) |
Best Practices in MLOps
- Use venv or conda for all ML experiments and pipelines.
- Pin package versions to avoid future incompatibility.
- Commit env files (requirements.txt / environment.yml) to your Git repo.
- Include env setup in CI/CD scripts, Dockerfiles, and Jupyter notebooks.
Sample Files

requirements.txt:

```
pandas==1.5.3
scikit-learn==1.2.2
numpy==1.23.5
```

environment.yml:

```yaml
name: churn-model
channels:
  - defaults
dependencies:
  - python=3.10
  - pandas=1.5.3
  - scikit-learn=1.2.2
  - pip:
      - mlflow==2.2.2
```
argparse and CLI Tools in MLOps
✅ What is argparse?
argparse is a built-in Python module used to create command-line interfaces (CLIs) for your Python scripts.
It allows ML engineers to pass hyperparameters, file paths, and config values at runtime, without modifying code.
Why Use CLI Tools in MLOps?
| Need | How CLI Helps |
|---|---|
| Reproducibility | Parameters are explicitly defined and logged |
| Automation | Easy to run scripts in CI/CD pipelines |
| Reusability | The same script can be reused with different arguments |
| Collaboration | Teammates can run your code without changing it |
argparse – Key Components

```python
import argparse

parser = argparse.ArgumentParser(description="Train a classification model")

# Add arguments
parser.add_argument('--epochs', type=int, default=10, help='Number of epochs')
parser.add_argument('--lr', type=float, default=0.001, help='Learning rate')
parser.add_argument('--model_path', type=str, default='model.pkl', help='Save path')

# Parse arguments
args = parser.parse_args()

# Use them in your script
print(f"Training for {args.epochs} epochs with learning rate {args.lr}")
```

Run from the CLI:

```shell
python train.py --epochs 20 --lr 0.005 --model_path ./models/classifier.pkl
```
Common Argument Types
| Type | Example |
|---|---|
| int | --batch_size 32 |
| float | --dropout 0.25 |
| str | --model_name bert |
| bool (flag) | --use_gpu via action='store_true' |

```python
parser.add_argument('--use_gpu', action='store_true', help='Use GPU for training')
```
Advanced Usage

Choices (restrict options):

```python
parser.add_argument('--optimizer', choices=['adam', 'sgd'], default='adam')
```

Multiple values:

```python
parser.add_argument('--layers', nargs='+', type=int)
# CLI: --layers 128 64 32
```

Config file as input:

```python
parser.add_argument('--config', type=str, help='Path to a YAML or JSON config')
```
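Putting the config-file idea together with argparse: a common pattern is a two-pass parse, where values from the config file become defaults and explicit CLI flags still win. The sketch below uses a JSON config; the keys and file layout are hypothetical examples.

```python
import argparse
import json
import tempfile

def build_args(argv):
    parser = argparse.ArgumentParser()
    parser.add_argument('--config', type=str, help='Path to a JSON config')
    parser.add_argument('--epochs', type=int, default=10)
    parser.add_argument('--lr', type=float, default=0.001)

    # First pass: only look for --config
    known, _ = parser.parse_known_args(argv)
    if known.config:
        with open(known.config) as f:
            # Config values become the new defaults
            parser.set_defaults(**json.load(f))

    # Second pass: explicit CLI flags override config-provided defaults
    return parser.parse_args(argv)

# Write a throwaway config for the demo
with tempfile.NamedTemporaryFile('w', suffix='.json', delete=False) as f:
    json.dump({"epochs": 50}, f)

args = build_args(['--config', f.name, '--lr', '0.01'])
print(args.epochs, args.lr)  # 50 0.01
```

The precedence order (CLI flag > config file > hard-coded default) keeps runs reproducible while still allowing quick overrides.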
Use in ML Pipelines

```shell
python preprocess.py --input data.csv --output clean.csv
python train.py --epochs 50 --lr 0.01
python evaluate.py --model model.pkl --testset test.csv
```
CLI Tools in Real-World MLOps
| Tool | Purpose |
|---|---|
| argparse | Flexible ML scripts with the standard library |
| click | Decorator-based CLI tool, simpler syntax |
| typer | Type-annotated CLIs, great for modern Python |
| fire | Google's auto-generated CLI from functions/classes |
| hydra | Dynamic config management (advanced) |
✅ Best Practices
- Always define default values and help messages.
- Log parsed arguments using print() or logging.
- Group related parameters (e.g., training, data, logging).
- Use argument parsing instead of hardcoding values in notebooks or scripts.
Example: ML Training Script CLI

```shell
python train.py \
  --epochs 100 \
  --lr 0.001 \
  --batch_size 64 \
  --train_path ./data/train.csv \
  --save_model ./models/model.pkl
```
Logging and Error Handling in MLOps
✅ Why It Matters in MLOps
| Need | Benefit |
|---|---|
| Traceability | Track events, parameters, and model behavior |
| Debugging | Identify and fix issues in training or deployment |
| Monitoring | Log model performance, usage, and failures in prod |
| Reproducibility | Logs serve as a historical record for every run |
Python logging Module

Setup Basic Logging

```python
import logging

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s [%(levelname)s] %(message)s",
    handlers=[
        logging.FileHandler("training.log"),
        logging.StreamHandler()
    ]
)
```
Log Levels
| Level | Use Case |
|---|---|
| DEBUG | Internal debugging details |
| INFO | General information (e.g., training started, epoch=3) |
| WARNING | Minor issues (e.g., a missing optional file) |
| ERROR | Runtime errors that don't stop the program |
| CRITICAL | Serious errors (e.g., system failure) |
✅ Example

```python
logging.info("Model training started")
logging.debug(f"Learning rate: {lr}")
logging.warning("Dataset contains null values, filling with mean")
logging.error("Failed to load model checkpoint")
```
Logs in MLOps
| Stage | What to Log |
|---|---|
| Data Ingestion | Missing files, schema mismatches |
| Training | Epochs, loss/accuracy, hyperparameters |
| Evaluation | Metrics (F1, ROC), confusion matrix |
| Deployment | API errors, latency, predictions |
| Monitoring | Model drift, data drift, usage stats |
Error Handling with try/except

✅ Basic Structure

```python
try:
    model = load_model("model.pkl")
except FileNotFoundError as e:
    logging.error(f"Model file not found: {e}")
    raise
```
Handle Specific Errors

```python
try:
    df = pd.read_csv("data.csv")
except FileNotFoundError:
    logging.critical("Data file is missing")
except pd.errors.EmptyDataError:
    logging.warning("CSV is empty")
except Exception as e:
    logging.error(f"Unexpected error: {str(e)}")
```
Best Practices in Logging & Error Handling
| Area | Best Practice |
|---|---|
| Log files | Save logs with a timestamp in the filename (e.g., train_2025_07_24.log) |
| Format | Include timestamp, level, and module |
| Try/Except | Catch exceptions that can be recovered from |
| Alerts | In production, integrate with alert systems (e.g., Slack, PagerDuty) |
| Retention | Store logs for audits and reproducibility (link with DVC/MLflow runs) |
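Following the timestamped-filename practice above, here is a small sketch that wires it up with the standard `logging` module. The `train_YYYY_MM_DD.log` naming pattern is just an example choice.

```python
import datetime
import logging
import os
import tempfile

def setup_run_logger(log_dir):
    """Configure a logger that writes to e.g. train_2025_07_24.log in log_dir."""
    stamp = datetime.datetime.now().strftime("%Y_%m_%d")
    path = os.path.join(log_dir, f"train_{stamp}.log")
    logger = logging.getLogger("train")
    logger.setLevel(logging.INFO)
    handler = logging.FileHandler(path)
    handler.setFormatter(
        logging.Formatter("%(asctime)s [%(levelname)s] %(message)s"))
    logger.addHandler(handler)
    return logger, path

log_dir = tempfile.mkdtemp()
logger, path = setup_run_logger(log_dir)
logger.info("Model training started")
logger.handlers[0].flush()
print(os.path.basename(path).startswith("train_"))  # True
```

Because each day (or each run, if you add a time component) gets its own file, old logs remain available for audits and can be linked to DVC/MLflow run IDs.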
⚙️ Production Logging Tools
| Tool | Purpose |
|---|---|
| Fluentd / Logstash | Log aggregation |
| ELK Stack (Elasticsearch + Kibana) | Log visualization |
| Prometheus + Grafana | Monitoring & alerting |
| Sentry | Real-time error reporting |
| Cloud Logging (AWS CloudWatch, GCP Logging) | Infra + App logs |
๐งช Example: ML Pipeline with Logging
def train_model(config):
    try:
        logging.info(f"Training started with config: {config}")
        model = train(config)
        save_model(model)
        logging.info("Model training completed successfully")
    except Exception:
        logging.exception("Error during training")  # logs the full traceback
        raise
๐ฆ Packaging in MLOps
๐ Why Package ML Projects?
| Purpose | Benefit |
|---|---|
| ♻️ Reproducibility | Consistent environments across machines or teams |
| ๐ Deployability | Easy to deploy to production or cloud |
| ๐ Reusability | Share your code as installable libraries |
| ๐ CI/CD Pipelines | Package can be versioned, tested, deployed |
๐งฐ Tool Overview
| Tool | Use Case | Language |
|---|---|---|
| setuptools | Standard packaging tool (most flexible, low-level) | Python |
| poetry | Modern packaging + dependency + versioning tool | Python |
| pipenv | Simplifies dependency management and virtualenvs | Python |
๐ ️ 1. Packaging with setuptools
✅ Project Structure
mlproject/
│
├── mlproject/
│ ├── __init__.py
│ └── core.py
├── setup.py
├── README.md
└── requirements.txt
๐ง setup.py Example
from setuptools import setup, find_packages

setup(
    name='mlproject',
    version='0.1.0',
    packages=find_packages(),
    install_requires=[
        'numpy',
        'pandas',
        'scikit-learn'
    ],
    entry_points={
        'console_scripts': [
            'ml-run=mlproject.core:main',
        ]
    }
)
๐ฆ Build & Install
python setup.py sdist bdist_wheel   # legacy; the modern equivalent is: python -m build
pip install .
✨ 2. Packaging with poetry (Modern & Clean)
✅ Init Project
poetry new mlproject
cd mlproject
This creates:
mlproject/
│
├── mlproject/
│ └── __init__.py
├── pyproject.toml
└── tests/
๐ง Add Dependencies
poetry add pandas scikit-learn
๐️ pyproject.toml (Auto-managed)
[tool.poetry]
name = "mlproject"
version = "0.1.0"
description = "ML pipeline packaged"
authors = ["Sanjay <sanjay@email.com>"]
[tool.poetry.dependencies]
python = "^3.10"
pandas = "^1.5"
scikit-learn = "^1.3"
[build-system]
requires = ["poetry-core"]
build-backend = "poetry.core.masonry.api"
๐ฆ Build & Install
poetry build
poetry install
๐งช 3. Managing Environments with pipenv
✅ Init Project
pipenv install pandas scikit-learn
This creates:
- Pipfile
- Pipfile.lock
⚙️ Workflow
pipenv shell # Activate virtual environment
pipenv install # Install packages from Pipfile
pipenv graph # Show dependency tree
pipenv run python script.py
๐ When to Use What?
| Tool | Use When... |
|---|---|
| setuptools | You need full control or legacy setup |
| poetry ✅ | You want a modern, all-in-one solution (packaging + deps + publishing) |
| pipenv | You focus more on managing virtualenvs + dependencies, not packaging |
๐งฑ Best Practices
- Always define project metadata (name, version, description).
- Keep dependencies pinned (poetry.lock / Pipfile.lock).
- Split requirements.txt into:
  - requirements.txt (runtime)
  - requirements-dev.txt (dev tools, linters, tests)
- Use entry_points for CLI tools in setup.py or poetry.
๐งฉ Writing Modular & Reusable Code
๐ง Why Modular Code Matters in MLOps
| Benefit | Description |
|---|---|
| ๐ ️ Reusability | Code components (e.g., data loading, training) can be reused across experiments or pipelines. |
| ๐ Maintainability | Bugs are easier to isolate and fix. |
| ๐งช Testability | Unit testing becomes straightforward. |
| ๐ Scalability | Easily plug into CI/CD pipelines and deployment workflows. |
| ๐ฅ Team Collaboration | Clear interfaces and structure improve collaboration. |
๐งฑ 1. Key Principles
✅ Separation of Concerns (SoC)
-
Split code by responsibility (e.g., data loading ≠ model training ≠ evaluation).
✅ Single Responsibility Principle (SRP)
-
Each function/module should do one thing well.
✅ Don’t Repeat Yourself (DRY)
-
Avoid code duplication — use functions, classes, and utility modules.
✅ Loose Coupling & High Cohesion
-
Components should work independently (low coupling), but parts of the same module should work closely (high cohesion).
๐ 2. Recommended Project Structure
ml_project/
├── data/
│ └── data_loader.py
├── models/
│ └── model.py
├── pipelines/
│ └── train_pipeline.py
├── utils/
│ └── helpers.py
├── config/
│ └── config.yaml
├── main.py
└── requirements.txt
- data_loader.py – Load/preprocess data
- model.py – Build model
- train_pipeline.py – Training logic
- helpers.py – Logging, metrics, seed setting, etc.
๐ง 3. Example: Modularizing ML Code
✅ data_loader.py
import pandas as pd

def load_data(path):
    return pd.read_csv(path)
✅ model.py
from sklearn.ensemble import RandomForestClassifier

def get_model():
    return RandomForestClassifier(n_estimators=100, random_state=42)
✅ train_pipeline.py
from data.data_loader import load_data
from models.model import get_model

def train(path):
    df = load_data(path)
    X, y = df.drop('target', axis=1), df['target']
    model = get_model()
    model.fit(X, y)
    return model
✅ main.py
from pipelines.train_pipeline import train

if __name__ == "__main__":
    model = train("data/train.csv")
๐งฐ 4. Utility Patterns
- ✅ Use utils/ for:
  - logger.py – Custom logger setup
  - config.py – Load YAML/JSON config
  - metrics.py – Custom metric functions
- ✅ Avoid putting logic inside __init__.py
- ✅ Keep functions small (ideally <50 lines)
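A seed-setting helper of the kind helpers.py might hold — a minimal sketch using only the standard library (add framework-specific seeding, e.g. NumPy or PyTorch, as needed):

```python
import os
import random

def set_seed(seed: int = 42) -> None:
    """Seed Python's RNG and hash randomization for reproducible runs."""
    random.seed(seed)
    os.environ["PYTHONHASHSEED"] = str(seed)
```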
๐งช 5. Testability Boost
Because each function/module is independent:
-
Easy to write unit tests for each piece.
-
Better integration with pytest and CI tools.
๐ 6. Reusability Patterns in MLOps
| Task | Reusable Component |
|---|---|
| Data prep | data_loader.py, feature transformers |
| Model config | YAML-driven + get_model() |
| Training loop | train_pipeline.py |
| Evaluation | evaluate.py |
| CLI tool | argparse-based wrapper |
๐ฆ 7. Combine with Packaging
If your code is modular:
-
You can package it as a library using setuptools or poetry.
-
Easily integrate into Airflow, Kedro, or Kubeflow pipelines.
✅ Summary
| Tip | Why |
|---|---|
| Use folders like data/, models/, pipelines/ | Logical separation |
| Stick to SRP + DRY principles | Clean, manageable codebase |
| Write pure, testable functions | Better for CI/CD |
| Avoid hardcoding paths/configs | Use YAML/JSON + argparse |
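The last tip — avoiding hardcoded paths and configs — can be sketched with a small JSON config loader; the file name here is only an example:

```python
import json
from pathlib import Path

def load_config(path):
    """Read run settings from a JSON file instead of hardcoding them."""
    return json.loads(Path(path).read_text())
```

Pair this with argparse (e.g. a --config flag defaulting to config/config.json) so the same code runs with different settings per experiment.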
4. Experiment Tracking
๐ MLflow – End-to-End ML Lifecycle Management Tool
๐ What is MLflow?
MLflow is an open-source platform to manage the complete machine learning lifecycle, including:
-
Experiment tracking
-
Model versioning
-
Packaging and reproducibility
-
Deployment
It's framework-agnostic — works with TensorFlow, PyTorch, Scikit-learn, XGBoost, etc.
๐ฆ MLflow Components
| Component | Purpose |
|---|---|
| Tracking | Logs experiments (params, metrics, artifacts, etc.) |
| Projects | Package ML code in a reproducible format |
| Models | Manage and serve trained models |
| Model Registry | Centralized store for model lifecycle management |
๐งช 1. MLflow Tracking
Track:
- Parameters (learning_rate, n_estimators, etc.)
- Metrics (accuracy, loss)
- Artifacts (plots, models, logs)
- Source code versions
๐ง Basic Code Example:
import mlflow
import mlflow.sklearn
from sklearn.ensemble import RandomForestClassifier

with mlflow.start_run():
    model = RandomForestClassifier(n_estimators=100)
    model.fit(X_train, y_train)
    mlflow.log_param("n_estimators", 100)
    mlflow.log_metric("accuracy", model.score(X_test, y_test))
    mlflow.sklearn.log_model(model, "model")
๐ก Output:
-
Logged under an experiment
-
Stored locally or on a remote backend (e.g., S3, GCS, SQL, Azure Blob)
๐ 2. MLflow Projects
-
Standard format to package ML code (
MLprojectfile) -
Enables reproducible training across environments
-
Can specify dependencies using
conda.yaml
# MLproject file
name: my_project
conda_env: conda.yaml
entry_points:
  main:
    parameters:
      alpha: {type: float, default: 0.5}
    command: "python train.py --alpha {alpha}"
๐ง 3. MLflow Models
- Standard format for saving models (mlflow.models)
- Support for:
  - Scikit-learn
  - PyTorch
  - TensorFlow
  - XGBoost
  - Custom Python functions (pyfunc)
๐ง Load Saved Model:
model = mlflow.sklearn.load_model("runs:/<run_id>/model")
preds = model.predict(X)
๐ท️ 4. MLflow Model Registry
Central hub for model lifecycle:
-
Register models from experiments
-
Track versions, stage transitions (Staging → Production)
-
Add descriptions, comments, and annotations
from mlflow.tracking import MlflowClient

client = MlflowClient()
client.create_registered_model("rf_classifier")
client.create_model_version(
    name="rf_classifier",
    source="runs:/<run_id>/model",
    run_id="<run_id>",
)
๐ฅ️ 5. MLflow UI
mlflow ui
- Starts a local web server (default: http://localhost:5000)
- View runs, parameters, metrics, artifacts
- Compare experiments and download models
☁️ 6. MLflow Deployment Support
Deploy ML models to:
-
REST API using mlflow models serve
-
AWS SageMaker
-
Azure ML
-
Docker containers
-
Databricks
๐ 7. MLflow Backend Options
| Storage Type | Usage |
|---|---|
| Local filesystem | Default, quick tests |
| S3/GCS/Azure | Cloud-scale artifact storage |
| SQL database | Run metadata store |
| Remote tracking server | Centralized collaboration for teams |
๐งฐ 8. Best Practices with MLflow
| Practice | Reason |
|---|---|
| Use mlflow.start_run() with meaningful names | Better traceability |
| Use tags (mlflow.set_tags) | Add context like “experiment_type” |
| Log plots and configs as artifacts | Better experiment reproducibility |
| Automate logging inside training scripts | Easier integration into pipelines |
| Use MLproject + conda.yaml | Run anywhere reproducibly |
| Use Model Registry | Manage deployment stages (dev/staging/prod) |
๐ 9. MLflow in MLOps Pipelines
-
Part of CI/CD for ML
-
Used with GitHub Actions, Jenkins, or Kubeflow
-
Combine with tools like:
-
DVC for data versioning
-
Docker/K8s for scalable deployment
-
Airflow for orchestrating pipelines
✅ Summary Table
| Feature | Description |
|---|---|
| Tracking | Log params, metrics, and artifacts |
| Projects | Reproducible packaging of ML code |
| Models | Save/load models in a standard format |
| Registry | Central store to manage models & lifecycle |
| UI | Web interface to compare and view runs |
| Deployment | REST API, Docker, SageMaker, etc. |
๐ Weights & Biases (W&B)
✅ What is W&B?
Weights & Biases (W&B) is a machine learning experiment tracking and collaboration platform. It helps teams:
-
Log, track, and visualize experiments
-
Monitor model performance
-
Collaborate with shared dashboards
-
Manage datasets and model versions
It is framework-agnostic and integrates with tools like TensorFlow, PyTorch, Scikit-learn, Keras, HuggingFace, and Jupyter notebooks.
๐ฏ Key Features of W&B
| Feature | Description |
|---|---|
| Experiment Tracking | Log hyperparameters, metrics, system logs |
| Live Visualizations | Interactive charts for loss, accuracy, etc. |
| Artifacts | Version and track datasets, models, files |
| Sweeps | Hyperparameter optimization at scale |
| Reports | Shareable dashboards and visualizations |
| Collaborative UI | Team dashboard with project/workspace structure |
| Alerts | Slack/email notifications for performance changes |
๐ง 1. Experiment Tracking
Track:
- Hyperparameters (learning_rate, batch_size)
- Metrics (loss, accuracy, F1-score, etc.)
- System info (GPU, RAM, CPU)
- Custom visualizations and plots
Code Example:
import wandb

# Start a new run
wandb.init(project="image-classification")

# Log hyperparameters
wandb.config.learning_rate = 0.001
wandb.config.epochs = 10

# Log metrics in a loop
for epoch in range(10):
    loss = train(...)
    wandb.log({"epoch": epoch, "loss": loss})
๐ฆ 2. Artifacts (Data & Model Versioning)
-
Track versions of datasets, models, or any files.
-
Automatically logs lineage (what data created what model).
-
Enables reproducibility and collaboration.
artifact = wandb.Artifact("my_dataset", type="dataset")
artifact.add_file("data/train.csv")
wandb.log_artifact(artifact)
๐️ 3. W&B Sweeps (Hyperparameter Optimization)
Automate grid/random/Bayesian search over hyperparameters.
Define Sweep Config (YAML):
method: bayes
metric:
  name: accuracy
  goal: maximize
parameters:
  learning_rate:
    min: 0.0001
    max: 0.1
  batch_size:
    values: [16, 32, 64]
Run Sweep:
wandb sweep sweep.yaml
wandb agent <sweep_id>
๐ 4. Reports and Dashboards
-
Custom dashboards with charts, tables, and media
-
Shareable with stakeholders or team members
-
Useful for publishing and presentation
⚙️ 5. System & Environment Logging
-
Logs:
-
Hardware specs (CPU, GPU, memory)
-
Python packages
-
Git commits
-
Terminal outputs
-
Makes experiments more reproducible and traceable
☁️ 6. Hosting Options
| Option | Description |
|---|---|
| wandb.ai | Default cloud-hosted platform |
| Local Server | On-premise or private cloud installation (wandb local) |
| Enterprise | For enterprise-grade access controls, SSO, private hosting |
๐ง 7. Use Cases in MLOps
| Use Case | How W&B Helps |
|---|---|
| Experiment management | Track, visualize, compare model runs |
| Collaboration | Shared dashboards and reports |
| Data versioning | Use artifacts for dataset tracking |
| Model audit trails | Link model versions to specific code and data |
| Automated training | Use sweeps in CI/CD pipelines |
๐ Comparison: W&B vs MLflow
| Feature | Weights & Biases | MLflow |
|---|---|---|
| UI & Visualization | Modern, interactive | Basic |
| Hyperparameter Tuning | Built-in (Sweeps) | External (plugins) |
| Artifact Management | Advanced | Basic |
| Collaboration | Strong team workflows | Less collaborative |
| Integrations | HuggingFace, PyTorch Lightning, etc. | Wide framework support |
| Hosting | Cloud, Local, Enterprise | Cloud, Local |
๐ Best Practices
- Use wandb.config for consistent hyperparameter tracking
- Tag runs with meaningful names
- Use Artifacts for tracking datasets and models
- Organize runs into projects and groups
- Use wandb.log() inside loops for step-wise tracking
- Visualize confusion matrix, ROC, precision-recall as custom plots
✅ Summary
| Feature | Why It Matters |
|---|---|
| Tracking | Log every experiment reliably |
| Sweeps | Automate hyperparameter tuning |
| Artifacts | Enable reproducibility |
| Reports | Share and present ML results |
| Collaboration | Teams can work together effectively |
neptune.ai and comet.ml are two further tools for experiment tracking and model management in the MLOps ecosystem.
๐ neptune.ai
✅ What is neptune.ai?
Neptune.ai is a lightweight, metadata store for experiment tracking, model registry, and collaborative research in ML projects. It provides a centralized dashboard to log, compare, and organize your ML runs and experiments.
๐ฏ Key Features
| Feature | Description |
|---|---|
| Experiment Tracking | Logs hyperparameters, metrics, losses, and artifacts |
| Model Registry | Organize and store production-ready models |
| Interactive UI | Explore experiments via filters, tags, dashboards |
| Lightweight Integration | Minimal code changes to get started |
| Collaboration | Share links, view logs across team projects |
| Scalable | Works for single devs to enterprise teams |
| Notebooks & IDE Integration | Works in Jupyter, Colab, VSCode, etc. |
๐งช Experiment Tracking Example
import neptune

run = neptune.init_run(project="your_workspace/project-name")

# Log hyperparameters
run["hyperparameters"] = {"lr": 0.001, "epochs": 20}

# Log metrics
for epoch in range(20):
    run["train/accuracy"].log(accuracy)
    run["train/loss"].log(loss)

# Log model artifact
run["model"].upload("model.pkl")
run.stop()
๐ฆ Model Registry Example
# Sketch using Neptune's model registry API; key/project values are placeholders
model = neptune.init_model(key="CLS", project="your_workspace/project-name")
model["model/binary"].upload("model.pkl")
๐ neptune.ai vs MLflow
| Feature | neptune.ai | MLflow |
|---|---|---|
| Setup | Cloud-first, easy setup | Requires server setup (for full features) |
| UI | Advanced & customizable | Basic but functional |
| Model Registry | Integrated | Separate module |
| Logging Flexibility | Very high (manual + auto) | Moderate |
| Collaboration | Strong workspace-based | Moderate |
✅ Use Cases
-
Hyperparameter tuning & comparisons
-
Collaborative experiment tracking
-
Production-ready model registry
-
Data scientists working in teams
๐ comet.ml
✅ What is comet.ml?
Comet.ml is a machine learning platform for experiment tracking, collaboration, visualization, and model explainability. It helps you track code, data, experiments, models, and results — in real-time.
๐ฏ Key Features
| Feature | Description |
|---|---|
| Experiment Tracking | Real-time logging of metrics, parameters, and visualizations |
| Code Logging | Automatically logs code diffs, Git info |
| Data & Asset Logging | Track datasets, images, audio, confusion matrices |
| Model Explainability | Visual tools like SHAP, Grad-CAM, etc. |
| Custom Panels | Build dashboards with charts, histograms, text, etc. |
| Team Collaboration | Share results, set visibility, tag versions |
| Offline Mode | Sync runs after training (e.g., on-prem, remote systems) |
๐งช Experiment Tracking Example
from comet_ml import Experiment

experiment = Experiment(
    api_key="your-api-key",
    project_name="your-project",
    workspace="your-workspace"
)

experiment.log_parameters({"lr": 0.001, "batch_size": 32})
experiment.log_metric("accuracy", 0.92)
experiment.log_asset("model.pkl")
๐ Visual Features
-
Compare runs in a table or graph
-
Confusion matrix, precision-recall curves
-
Interactive histograms, image/audio plots
-
Integrated Jupyter and Colab support
๐ง Explainability Features
-
SHAP value visualization
-
Grad-CAM for CNNs
-
Visual debugging with input overlays
๐ comet.ml vs Weights & Biases (W&B)
| Feature | comet.ml | W&B |
|---|---|---|
| Explainability | Built-in (SHAP, Grad-CAM) | Limited |
| Code Tracking | Automatic diffs, commits | Yes |
| Logging Flexibility | High | High |
| Visualization | Advanced, real-time | Interactive, modern UI |
| Offline Logging | Yes | Yes |
| Hyperparam Sweeps | Manual/Basic | Built-in (Sweeps) |
✅ Use Cases
-
Visual tracking of experiments
-
Explainability reports for stakeholders
-
Training on cloud/GPU environments
-
Post-hoc debugging with visual tools
๐งฉ Summary: neptune.ai vs comet.ml
| Feature | neptune.ai | comet.ml |
|---|---|---|
| Focus Area | Experiment tracking + registry | Experiment tracking + visualization |
| Setup | Lightweight | Cloud-first, easy setup |
| Explainability | No (external tools needed) | Yes (SHAP, Grad-CAM, etc.) |
| Visualizations | Moderate | Advanced |
| Artifact Management | Good | Excellent (images, audio, etc.) |
| Offline Mode | Yes | Yes |
| Collaboration | Workspace/projects | Team-based + public sharing |
| Hosting Options | Cloud, On-Prem, Enterprise | Cloud, On-Prem |
๐ TensorBoard — Visualization Toolkit for TensorFlow
✅ What is TensorBoard?
TensorBoard is a web-based visualization tool that helps you monitor and understand your machine learning experiments built using TensorFlow and PyTorch (via plugins or wrappers).
It provides interactive visualizations of:
-
Training progress (loss/accuracy curves)
-
Model graph
-
Histograms of weights and activations
-
Images, audio, and text
-
Embeddings
-
Hyperparameters
๐ง How TensorBoard Works
- You log data (scalars, histograms, images, etc.) using tf.summary APIs.
- Logs are written to a log directory (log_dir).
- You run tensorboard --logdir=path_to_log_dir.
- Access the dashboard via browser (usually http://localhost:6006).
๐งช Basic Code Example
import tensorflow as tf
from tensorflow import keras

# Define model
model = keras.models.Sequential([...])

# TensorBoard callback
tensorboard_callback = tf.keras.callbacks.TensorBoard(log_dir="./logs")

# Train model
model.fit(x_train, y_train, epochs=10, callbacks=[tensorboard_callback])
๐ป Launching TensorBoard
tensorboard --logdir=./logs --port=6006
Then open: http://localhost:6006
๐ ️ Key Features in TensorBoard
| Feature | Purpose |
|---|---|
| Scalars | Plot training/validation loss, accuracy, etc. |
| Graphs | Visualize model architecture and ops |
| Histograms | Track parameter and activation distributions over time |
| Images | Visualize input images, model predictions |
| Text | Display textual logs (e.g., predictions) |
| Audio | For audio signal tracking (e.g., speech models) |
| Embeddings | Project high-dimensional data to 2D/3D |
| Hyperparams | Compare experiment performance for different hyperparameter settings |
๐ฆ Log Custom Data
writer = tf.summary.create_file_writer("logs/custom")

with writer.as_default():
    tf.summary.scalar("loss", 0.24, step=1)
    tf.summary.text("note", "Training started", step=1)
    tf.summary.image("sample_image", image_tensor, step=1)
๐ Use Cases
-
Real-time monitoring during training
-
Debugging model architecture and layer outputs
-
Comparing experiments (e.g., hyperparameter sweeps)
-
Visual storytelling of model performance
๐ TensorBoard vs Other Tools
| Feature | TensorBoard | MLflow UI | W&B / Comet |
|---|---|---|---|
| Real-time plots | ✅ Yes | ✅ Yes | ✅ Yes |
| TensorFlow-native | ✅ Best fit | ⚠️ Requires manual setup | ⚠️ Needs wrappers |
| PyTorch support | ✅ via torch.utils.tensorboard | ✅ | ✅ |
| Model Graph | ✅ Yes | ❌ No | ❌ No |
| Collaboration | ❌ Local only | ✅ | ✅ |
๐ Best Practices
- Use a unique log_dir for each experiment run (e.g., timestamp-based)
- Combine with argparse to track hyperparameters per run
- Use early_stopping + tensorboard_callback for optimal training
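The first practice — a unique, timestamp-based log directory per run — can be sketched like this (the directory layout is an example):

```python
import os
from datetime import datetime

# One directory per run, e.g. logs/run_2025-07-24_10-30-00
log_dir = os.path.join("logs", datetime.now().strftime("run_%Y-%m-%d_%H-%M-%S"))

# Pass log_dir to the callback, e.g.:
# tensorboard_callback = tf.keras.callbacks.TensorBoard(log_dir=log_dir)
```

Because each run writes to its own subdirectory, TensorBoard shows the runs side by side when pointed at the parent logs/ folder.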
5. ML Pipeline Orchestration
๐ What is a Pipeline in MLOps?
✅ Definition:
A pipeline is a sequence of automated, structured steps that process data, train and evaluate machine learning models, and deploy them into production. It ensures reproducibility, scalability, and maintainability of ML workflows.
๐งฑ Key Components of a Typical ML Pipeline:
-
Data Ingestion
-
Load raw data from sources (CSV, databases, APIs, cloud storage, etc.)
-
-
Data Validation & Cleaning
-
Handle missing values, outliers, schema checks, etc.
-
-
Feature Engineering
-
Transform raw data into meaningful features.
-
-
Data Splitting
-
Split into train, validation, and test sets.
-
-
Model Training
-
Train the ML/DL model using the training data.
-
-
Model Evaluation
-
Use metrics (e.g., accuracy, RMSE, F1-score) to evaluate performance.
-
-
Model Tuning
-
Perform hyperparameter optimization.
-
-
Model Serialization
-
Save model (e.g., using joblib, pickle, or ONNX).
-
-
Model Deployment
-
Expose the model via REST API or batch pipeline.
-
-
Monitoring & Feedback Loop
-
Monitor performance and retrain when required.
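Chained as plain functions, the first few stages above form a minimal, framework-free pipeline sketch (the data and step bodies are toy placeholders):

```python
def ingest():
    # Stand-in for loading raw data from a file, API, or database
    return [1.0, 2.0, None, 3.0, 4.0]

def clean(data):
    # Data validation & cleaning stage: drop missing values
    return [x for x in data if x is not None]

def split(data, ratio=0.75):
    # Data splitting stage: train/test partition
    cut = int(len(data) * ratio)
    return data[:cut], data[cut:]

def run_pipeline():
    # Each stage's output feeds the next, like stations on an assembly line
    train, test = split(clean(ingest()))
    return train, test
```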
๐ Why Pipelines Are Important in MLOps:
| Benefit | Description |
|---|---|
| ๐ ️ Automation | Reduces manual intervention |
| ๐ Reproducibility | Same input → same result |
| ⚖️ Scalability | Run at scale using cloud infrastructure |
| ๐ Traceability | Tracks changes, logs, versions |
| ๐งช Modularity | Enables reuse and testing of individual components |
๐ ️ Example Tools for Building Pipelines:
| Tool | Description |
|---|---|
| scikit-learn Pipeline | For basic ML pipelines (preprocessing + model) |
| Airflow | Workflow orchestration for data and ML |
| Kubeflow Pipelines | Kubernetes-native ML pipelines |
| MLflow Pipelines | Production-ready pipelines with experiment tracking |
| Kedro | Python framework for modular ML pipelines |
| ZenML | Clean, reproducible MLOps pipelines |
| TFX (TensorFlow Extended) | TensorFlow-specific ML pipeline framework |
๐งช Basic scikit-learn Pipeline Example:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', RandomForestClassifier())
])

pipeline.fit(X_train, y_train)
๐ก Real-World Analogy
A pipeline is like a factory assembly line:
Raw materials (data) go in, each station (step) transforms or processes it, and finally, a finished product (a deployed ML model) comes out.
⚙️ Manual vs Automated ML Pipelines
๐งญ Definition:
| Aspect | Manual Pipeline | Automated Pipeline |
|---|---|---|
| What it is | A workflow executed step-by-step by hand or through ad hoc scripts | A system where ML workflow stages are orchestrated automatically |
| Example | Writing Python scripts to clean data, train models, evaluate, and manually deploy | Using tools like MLflow Pipelines, Kubeflow, or Airflow to automate each step |
๐ Detailed Comparison:
| Criteria | Manual ML Pipeline | Automated ML Pipeline |
|---|---|---|
| ๐ง๐ป Execution | Done manually (run cell-by-cell or script-by-script) | Orchestrated via scheduler or pipeline engine |
| ๐ Reproducibility | Hard to reproduce exactly unless well-documented | High reproducibility due to versioned, codified steps |
| ๐ Scalability | Not scalable for large or multiple datasets/models | Designed to scale easily across environments |
| ๐งช Testing & Validation | Manual or limited testing | Easy to integrate CI/CD and testing checks |
| ๐ Debugging | Often easier (step-by-step control) | Can be complex depending on the orchestration tool |
| ๐ผ Deployment | Manual model packaging and API setup | Auto-deployment using CI/CD and model registry |
| ⏱ Time Efficiency | Time-consuming and repetitive | Saves time, especially with frequent model retraining |
| ๐ฆ Version Control | Often missing for data, code, and models | Integrated with Git/DVC/MLflow for versioning |
| ๐ Monitoring | Ad hoc or post hoc monitoring | Integrated monitoring/logging (e.g., Prometheus, W&B) |
| ๐ Tooling Examples | Jupyter Notebooks, Bash scripts | Airflow, Kubeflow, MLflow, TFX, ZenML |
๐ง Summary
| Manual Pipelines | Automated Pipelines |
|---|---|
| ✅ Good for quick prototypes and small-scale experiments | ✅ Ideal for production-ready, scalable ML systems |
| ❌ Prone to human error and harder to maintain | ❌ More setup time and tool complexity |
| ✅ Easier to debug early-stage issues | ✅ Enables CI/CD, reproducibility, team collaboration |
๐ก Best Practices
-
Start with manual development in notebooks or scripts to iterate quickly.
-
Gradually modularize and automate components using pipeline tools.
-
Use version control (Git, DVC) and tracking tools (MLflow, W&B) even in manual setups.
-
Move to automated pipelines when:
-
You need frequent retraining
-
You work in a team
-
You’re deploying to production
๐ Apache Airflow – Notes for MLOps
✅ What is Apache Airflow?
Apache Airflow is an open-source workflow orchestration tool designed to programmatically author, schedule, and monitor workflows (called DAGs). It is widely used in MLOps for automating data pipelines, model training, and deployment tasks.
๐ง Core Concepts
| Term | Description |
|---|---|
| DAG (Directed Acyclic Graph) | Defines a workflow as a sequence of tasks with dependencies. |
| Task | A single unit of work (e.g., Python function, Bash command). |
| Operator | Abstraction to run a task. Examples: PythonOperator, BashOperator, DockerOperator. |
| Scheduler | Triggers DAGs based on time or event intervals. |
| Executor | Decides how tasks are run (LocalExecutor, CeleryExecutor, KubernetesExecutor). |
| Task Instance | A specific run of a task at a certain time. |
⚙️ How Airflow Works
- Define a DAG in Python (*.py file).
- Specify tasks using Operators.
- Airflow schedules the DAG based on start_date, schedule_interval, etc.
- Tasks run in the order defined by dependencies.
- Logs, retries, and monitoring are handled via the UI or CLI.
๐ Sample DAG for ML Workflow
from airflow import DAG
from airflow.operators.python import PythonOperator
from datetime import datetime

def preprocess():
    print("Data cleaned")

def train_model():
    print("Model trained")

with DAG('ml_pipeline',
         start_date=datetime(2023, 1, 1),
         schedule_interval='@daily',
         catchup=False) as dag:

    t1 = PythonOperator(task_id='data_preprocessing', python_callable=preprocess)
    t2 = PythonOperator(task_id='model_training', python_callable=train_model)

    t1 >> t2  # Task dependency
๐ Why Airflow for MLOps?
| Feature | Benefit |
|---|---|
| ✅ Automation | Automate ETL, model training, evaluation, deployment |
| ๐ Reusability | Reuse modular components across projects |
| ๐ Scheduling | Run daily/weekly jobs or triggered workflows |
| ๐ง Observability | Track task success/failure, logs, and retries |
| ๐ UI Dashboard | Monitor DAG runs visually |
๐งฐ Common Operators in MLOps
| Operator | Use Case |
|---|---|
| PythonOperator | Call Python preprocessing/training functions |
| BashOperator | Run CLI commands or scripts |
| DockerOperator | Run tasks in isolated containers |
| KubernetesPodOperator | Run tasks as pods in a K8s cluster |
| S3ToGCSOperator, GCSToBigQueryOperator | Move data between cloud storages |
๐ Best Practices
-
Write idempotent tasks (safe to run multiple times).
-
Use XCom for inter-task communication (small data).
-
Store large artifacts in external systems (e.g., S3, GCS, DVC).
-
Use Airflow Variables or Secrets Manager for configs.
-
Monitor DAGs using email alerts, Slack hooks, or Prometheus exporters.
๐งฑ Airflow in ML Lifecycle
| ML Stage | Airflow Role |
|---|---|
| Data Ingestion | Schedule ETL jobs from API, databases |
| Data Validation | Run data checks with Great Expectations |
| Model Training | Trigger Python scripts, notebooks, or Docker containers |
| Model Evaluation | Automate evaluation metrics & logging |
| Model Deployment | Push to model registry or REST API |
| Monitoring | Retrain based on drift detection pipelines |
๐ Alternatives to Airflow
| Tool | Notes |
|---|---|
| Prefect | Easier syntax, better for dynamic workflows |
| Dagster | Strong typing, good for data-first pipelines |
| Luigi | Simpler, more lightweight |
| Kubeflow Pipelines | K8s-native, ML-specific workflows |
☸️ Kubeflow Pipelines (KFP) – Notes for MLOps
✅ What is Kubeflow Pipelines?
Kubeflow Pipelines (KFP) is a component of the Kubeflow ecosystem designed for building, deploying, and managing end-to-end ML workflows on Kubernetes.
It enables data scientists and ML engineers to define reproducible, composable, and scalable pipelines using containers and YAML or Python SDKs.
๐งฑ Key Components
| Component | Description |
|---|---|
| Pipeline | A DAG representing the ML workflow (like Airflow DAG) |
| Component | A self-contained step (usually a Docker container) |
| Step | A single execution of a component |
| Experiment | A group of pipeline runs for comparison |
| Run | A single execution of a pipeline |
| Artifact | Data produced by a component (model, metrics, etc.) |
| Metadata Store | Tracks inputs, outputs, metrics, lineage |
๐ Typical ML Pipeline in Kubeflow
Data Ingestion → Preprocessing → Feature Engineering → Model Training → Evaluation → Deployment
๐ KFP vs Airflow
| Feature | Kubeflow Pipelines | Apache Airflow |
|---|---|---|
| Designed for ML? | ✅ Yes | ❌ General-purpose |
| Kubernetes-native? | ✅ Yes | Optional (via K8sExecutor) |
| Artifact Tracking | ✅ Built-in | ❌ Not by default |
| Built-in UI | ✅ ML-focused | ✅ Generic |
| Notebook Integration | ✅ Strong (Jupyter + Katib) | ❌ Minimal |
| Model Tracking | ✅ Integrated (via MLMD) | ❌ Needs integration |
๐งช Sample KFP Code (Python SDK v2)
from kfp import dsl

@dsl.component
def preprocess_op() -> str:
    return "Data cleaned"

@dsl.component
def train_op() -> str:
    return "Model trained"

@dsl.pipeline(name="ml-pipeline")
def my_pipeline():
    step1 = preprocess_op()
    step2 = train_op()
    step2.after(step1)  # explicit ordering; SDK v2 infers order from data dependencies otherwise
-
Use kfp.compiler.Compiler().compile() to compile into a .json pipeline spec.
-
Deploy with the UI or CLI:
kfp.Client().create_run_from_pipeline_func(...)
๐ Why Use Kubeflow Pipelines?
| Benefit | Description |
|---|---|
| ✅ Scalability | Runs on Kubernetes; each step in a pod |
| ✅ Reproducibility | Pipeline components are versioned and tracked |
| ✅ Modularity | Reuse components like preprocess, train, deploy |
| ✅ UI & Metadata | Visual DAGs, track experiments, parameters |
| ✅ Integration | Katib (AutoML), KFServing (deployment), TensorBoard, etc. |
| ✅ CI/CD | Integrates well with Argo Workflows, Tekton, GitHub Actions |
⚙️ Typical Use Case in MLOps
| Stage | KFP Role |
|---|---|
| Data Preprocessing | Scalable, containerized transformation |
| Feature Engineering | Encapsulated, reusable step |
| Model Training | Train on GPU/TPU in isolated pods |
| Hyperparameter Tuning | Katib integration |
| Evaluation & Metrics | Return as pipeline artifacts |
| Model Registry | Push to MLflow, S3, or Vertex AI Model Registry |
| Deployment | Use KFServing or custom deployment step |
| Monitoring & Retraining | Trigger retrain pipelines based on drift detection |
๐ง Best Practices
- Build reusable components using Docker and `kfp.components.create_component_from_func`.
- Version pipelines and track artifacts using the metadata store.
- Keep inputs/outputs small (for passing between steps); store large files in S3, GCS, etc.
- Use Katib for AutoML, Kubeflow Notebooks for experimentation, and KServe for serving.
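In practice, "keep inputs/outputs small" means passing a storage URI between steps rather than the data itself. A minimal stdlib-only sketch of the pattern (a local temp directory stands in for an S3/GCS bucket, and the helper names are illustrative):

```python
import json
import os
import tempfile

ARTIFACT_ROOT = tempfile.mkdtemp()  # stand-in for an S3/GCS bucket


def save_artifact(name, obj):
    """Write a large artifact to shared storage; return only its URI."""
    path = os.path.join(ARTIFACT_ROOT, name)
    with open(path, "w") as f:
        json.dump(obj, f)
    return path  # a small string: cheap to pass between pipeline steps


def load_artifact(uri):
    """Downstream step loads the data itself from the shared store."""
    with open(uri) as f:
        return json.load(f)


# Step 1 produces data and hands downstream only a URI.
uri = save_artifact("features.json", {"rows": list(range(5))})
# Step 2 receives the URI and fetches the data when it needs it.
data = load_artifact(uri)
print(data["rows"])  # -> [0, 1, 2, 3, 4]
```

Only the URI flows through the orchestrator's metadata, which keeps pipeline state small and lets steps run in separate pods.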
๐ Tools Often Used With KFP
| Tool | Purpose |
|---|---|
| Katib | AutoML & hyperparameter tuning |
| KServe (KFServing) | Model deployment on Kubernetes |
| MinIO / GCS / S3 | Artifact and data storage |
| MLflow / W&B | Model tracking (external) |
| Argo Workflows | Backend engine for pipeline execution |
| TensorBoard | Training logs visualization |
⚙️ Prefect & Luigi – Orchestration Tools for MLOps
✅ What is Prefect?
Prefect is a modern workflow orchestration tool built for dataflow automation. It is Python-native and designed for developer ergonomics, observability, and scalability.
๐ Key Features:
- Pythonic API for defining flows and tasks
- Handles retries, failure notifications, caching
- Real-time observability dashboard (via Prefect Cloud or Prefect Server)
- Supports parameterization, scheduling, and dynamic workflows
- Integrates with Kubernetes, Docker, Dask, and more
๐งฑ Core Concepts:
| Concept | Description |
|---|---|
| Flow | A complete workflow |
| Task | A unit of work inside a flow |
| State | Status of a task (e.g., Success, Failed) |
| Deployment | A versioned, schedulable flow configuration |
| Orion | Prefect 2.0 engine (modern, async-native) |
๐งช Example:
from prefect import flow, task

@task
def extract():
    return [1, 2, 3]

@task
def transform(data):
    return [i * 2 for i in data]

@flow
def etl():
    raw = extract()
    result = transform(raw)
    print(result)

etl()
✅ What is Luigi?
Luigi is a Python-based workflow engine developed by Spotify. It is designed to build complex pipelines of batch jobs, handling dependency resolution and task scheduling.
๐ Key Features:
- Strong dependency graph resolution
- Pythonic task definition
- File-based output targets (e.g., local, HDFS, S3)
- CLI & web UI for monitoring pipelines
- Best suited for ETL & batch data pipelines
๐งฑ Core Concepts:
| Concept | Description |
|---|---|
| Task | Represents a single unit of work |
| Target | Output of a task (e.g., a file) |
| `requires()` | Defines upstream task dependencies |
| `run()` | Logic to perform the task |
| `output()` | Returns the target used to track whether a task has completed |
๐งช Example:
import luigi

class Extract(luigi.Task):
    def output(self):
        return luigi.LocalTarget("data.txt")

    def run(self):
        with self.output().open("w") as f:
            f.write("1,2,3")

class Transform(luigi.Task):
    def requires(self):
        return Extract()

    def output(self):
        return luigi.LocalTarget("transformed.txt")

    def run(self):
        with self.input().open("r") as infile, self.output().open("w") as outfile:
            numbers = map(int, infile.read().split(","))
            doubled = [str(n * 2) for n in numbers]
            outfile.write(",".join(doubled))

luigi.build([Transform()], local_scheduler=True)
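Luigi's completion model — a task counts as done when its output target already exists, so re-runs are idempotent — can be mimicked with a stdlib-only sketch (simplified: targets are plain file paths here, and the class and function names are illustrative, not Luigi's API):

```python
import os
import tempfile


class Task:
    """Toy version of Luigi's completion check."""

    def output(self):
        raise NotImplementedError

    def run(self):
        raise NotImplementedError

    def complete(self):
        # Luigi-style rule: the task is done iff its output target exists
        return os.path.exists(self.output())


def build(task):
    """Run a task only if its output is missing (idempotent re-runs)."""
    if not task.complete():
        task.run()
    return task.output()


workdir = tempfile.mkdtemp()


class Extract(Task):
    def output(self):
        return os.path.join(workdir, "data.txt")

    def run(self):
        with open(self.output(), "w") as f:
            f.write("1,2,3")


path = build(Extract())       # first call runs the task
build(Extract())              # second call is a no-op: output already exists
with open(path) as f:
    print(f.read())           # -> 1,2,3
```

This file-existence convention is why Luigi pipelines can be safely re-launched after a partial failure: finished tasks are skipped automatically.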
๐ Prefect vs Luigi: Feature Comparison
| Feature | Prefect | Luigi |
|---|---|---|
| Language | Python | Python |
| UI | Modern, real-time (Cloud/Server) | Basic web UI |
| Async Support | ✅ Yes (in v2.0 "Orion") | ❌ No |
| Dynamic Workflows | ✅ Supported | ❌ Static only |
| Retry Policies | ✅ Built-in | ❌ Manual |
| Scheduling | ✅ Yes | ✅ Yes |
| Caching | ✅ Native | ❌ Not built-in |
| Cloud Integration | ✅ Prefect Cloud | ❌ Self-host only |
| Use Case Fit | Modern dataflows, MLOps | Batch ETL, legacy pipelines |
| Ease of Use | ✅ High | ⚠️ Verbose, boilerplate-heavy |
๐ฏ When to Use What?
| Use Case | Recommended Tool |
|---|---|
| MLOps Pipelines | ✅ Prefect |
| Batch ETL in legacy systems | ✅ Luigi |
| Need real-time observability | ✅ Prefect |
| Simpler workflows, local use | ๐ก Luigi |
| Production-grade orchestration with retries, caching | ✅ Prefect |
๐ Tools Similar to Prefect/Luigi:
| Tool | Notes |
|---|---|
| Apache Airflow | Best for complex DAGs, most mature |
| Dagster | Strong type-checking, great for analytics workflows |
| Kubeflow Pipelines | Kubernetes-native ML pipelines |
| Flyte | ML-native orchestration, strong type system |
๐งญ DAGs, Scheduling, and Retries in MLOps
๐ 1. DAG (Directed Acyclic Graph)
✅ Definition:
A DAG is a graph-based structure that represents a pipeline where:
- Nodes = Tasks
- Edges = Dependencies
- Acyclic = No loops; task execution moves forward only
๐ Why DAGs?
- Ensures that tasks run in the right order
- Captures dependencies clearly
- Enables parallel execution when dependencies are met
๐ Example:
[Extract Data]
      |
[Preprocess Data]
    /           \
[Train Model]  [Validate Data]
      |
[Deploy Model]
Used by: Airflow, Luigi, Prefect, Kubeflow Pipelines
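The ordering and parallelism a DAG buys you can be shown with a small stdlib-only sketch: Kahn's algorithm groups tasks into "levels", where everything in a level can run in parallel once earlier levels finish (task names mirror the diagram above; real schedulers are far more elaborate):

```python
from collections import defaultdict


def execution_levels(deps):
    """Group DAG tasks into parallelizable levels (Kahn's algorithm)."""
    indegree = {task: len(parents) for task, parents in deps.items()}
    children = defaultdict(list)
    for task, parents in deps.items():
        for parent in parents:
            children[parent].append(task)

    levels = []
    ready = sorted(t for t, n in indegree.items() if n == 0)
    while ready:
        levels.append(ready)
        nxt = []
        for task in ready:
            for child in children[task]:
                indegree[child] -= 1
                if indegree[child] == 0:  # all dependencies satisfied
                    nxt.append(child)
        ready = sorted(nxt)

    if sum(len(level) for level in levels) != len(deps):
        raise ValueError("cycle detected: not a valid DAG")
    return levels


# Each task maps to its upstream dependencies, as in the diagram.
deps = {
    "extract": [],
    "preprocess": ["extract"],
    "train": ["preprocess"],
    "validate": ["preprocess"],
    "deploy": ["train"],
}
print(execution_levels(deps))
# -> [['extract'], ['preprocess'], ['train', 'validate'], ['deploy']]
```

Note that `train` and `validate` land in the same level: with their shared dependency met, an orchestrator may run them concurrently.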
⏰ 2. Scheduling
✅ Definition:
Scheduling is the process of triggering a pipeline or task automatically based on time or event.
๐งญ Types of Schedules:
| Type | Example |
|---|---|
| Time-based | Run every day at 2 AM |
| Interval-based | Every 10 minutes |
| Event-based | Trigger on new file in S3 or data update |
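Interval-based scheduling is just timestamp arithmetic. A minimal stdlib sketch (the `next_runs` helper is illustrative, not a real scheduler API) expands a daily 2 AM schedule into concrete fire times:

```python
from datetime import datetime, timedelta


def next_runs(start, interval, count):
    """Yield the next `count` fire times for an interval-based schedule."""
    t = start
    for _ in range(count):
        t = t + interval
        yield t


anchor = datetime(2024, 1, 1, 2, 0)  # anchor time: Jan 1, 2 AM
runs = list(next_runs(anchor, timedelta(days=1), 3))
print(runs)  # daily at 2 AM: Jan 2, Jan 3, Jan 4
```

Cron-based schedules work the same way conceptually, but compute the next fire time by matching the cron expression's fields rather than adding a fixed interval.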
๐ ️ Tools & Syntax:
- Airflow: uses `cron` expressions or `timedelta`:
  schedule_interval='0 2 * * *'  # Every day at 2 AM
- Prefect: `IntervalSchedule`, `CronSchedule`:
  from prefect.deployments import Deployment
  from prefect.orion.schemas.schedules import IntervalSchedule
  Deployment(flow=etl, schedule=IntervalSchedule(interval=timedelta(days=1)))
๐ Why Scheduling?
- Automates ML pipelines
- Ensures consistency (e.g., daily model retraining)
- Frees up manual effort
๐ 3. Retries
✅ Definition:
Retries refer to automatically re-running a failed task a specific number of times before marking it as failed.
๐ง Why Needed?
- Handles transient failures (e.g., network issues, timeouts)
- Improves pipeline robustness
- Prevents entire pipeline failure due to one flaky task
⚙️ Retry Parameters:
| Parameter | Description |
|---|---|
| `retries` | Max retry attempts |
| `retry_delay` | Time delay between retries |
| `retry_exponential_backoff` | Gradual increase in delay |
๐ง Example (Airflow):
from datetime import timedelta

default_args = {
    'retries': 3,
    'retry_delay': timedelta(minutes=5)
}
๐ง Example (Prefect):
from prefect import task

@task(retries=3, retry_delay_seconds=10)
def unstable_task():
    ...
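Under the hood, retry-with-backoff amounts to re-invoking the task with a growing delay between attempts. A stdlib-only sketch of the mechanism (the `retry` decorator and its parameters are illustrative, not any orchestrator's real API):

```python
import functools
import time


def retry(retries=3, delay=1.0, backoff=2.0, sleep=time.sleep):
    """Re-run a function up to `retries` extra times, multiplying the
    wait between attempts by `backoff` (exponential backoff)."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            wait = delay
            for attempt in range(retries + 1):
                try:
                    return fn(*args, **kwargs)
                except Exception:
                    if attempt == retries:
                        raise          # out of attempts: propagate the error
                    sleep(wait)        # transient failure: wait, then retry
                    wait *= backoff
        return wrapper
    return decorator


calls = []


@retry(retries=3, delay=0.01)
def flaky():
    """Fails twice with a 'transient' error, then succeeds."""
    calls.append(1)
    if len(calls) < 3:
        raise ConnectionError("transient network error")
    return "ok"


print(flaky())  # succeeds on the third attempt -> ok
```

Exponential backoff matters for transient failures like rate limits: spacing retries further apart gives the failing dependency time to recover instead of hammering it.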
๐ Summary Table
| Concept | Purpose | Used In | Notes |
|---|---|---|---|
| DAG | Task dependency management | Airflow, Prefect, Luigi | Must be acyclic |
| Scheduling | Automated triggering of workflows | All major orchestration tools | Can be time or event-based |
| Retries | Handle transient failures | All major tools | Improves pipeline resilience |