MLOps - III
8. CI/CD for ML
What is CI/CD?
| Acronym | Meaning |
|---|---|
| CI | Continuous Integration |
| CD | Continuous Delivery or Continuous Deployment |
CI/CD automates the process of building, testing, and deploying applications to reduce manual work, improve consistency, and speed up delivery cycles.
✅ 1. Continuous Integration (CI)
Goal:
Automatically integrate code from multiple developers, test it, and detect errors early.
Typical Steps in CI:
- Developer pushes code to GitHub/GitLab/Bitbucket
- CI pipeline triggers:
  - Run unit tests
  - Run linting/formatting (e.g., flake8, black)
  - Build application artifacts
  - Generate reports (e.g., test coverage)
Tools:
- GitHub Actions
- GitLab CI
- Jenkins
- CircleCI
- Travis CI
✅ 2. Continuous Delivery (CD)
Goal:
Automatically prepare the application to be deployed in a staging or production environment — but with manual approval for final deployment.
Steps:
- All CI steps
- Deploy to staging
- Run integration tests
- Wait for approval → deploy to production
✅ 3. Continuous Deployment (CD)
Goal:
Fully automate build → test → production deployment with no human approval step.
This is riskier, but well suited to small, frequent releases when tests are reliable.
CI/CD Pipeline Example (ML App)
1. Code pushed to GitHub → triggers pipeline
2. Environment setup
3. Code linting & formatting
4. Unit & model testing
5. Train model (optionally)
6. Store model artifact (e.g., in S3 or MLflow)
7. Build Docker image
8. Deploy to staging or production (e.g., via Kubernetes)
Common CI/CD Tools in MLOps
| Tool | Use |
|---|---|
| GitHub Actions | Git-based CI/CD |
| GitLab CI | Full Git + CI/CD integration |
| Jenkins | Flexible, customizable pipelines |
| ArgoCD | Kubernetes-native CD |
| Tekton | Kubernetes-native CI/CD |
| MLflow / DVC | Model versioning/artifacts |
| Docker + K8s | Containerized deployment |
Why Is CI/CD Important in MLOps?
- Keeps models reproducible
- Automates testing of data pipelines
- Ensures consistent deployment of models
- Avoids "it worked on my machine" issues
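To make the "automates testing of data pipelines" point concrete, here is a toy check of the kind a CI job would run via pytest. The `clean_records` function and its field names are hypothetical, invented for illustration only:

```python
# Hypothetical data-pipeline step plus the kind of unit test a CI job runs.
def clean_records(records):
    """Drop records with missing values and normalize field names."""
    cleaned = []
    for rec in records:
        if any(v is None for v in rec.values()):
            continue  # skip incomplete rows
        cleaned.append({k.strip().lower(): v for k, v in rec.items()})
    return cleaned

def test_clean_records():
    raw = [
        {" Age ": 25, "Income": 50_000},
        {" Age ": None, "Income": 60_000},  # incomplete: should be dropped
    ]
    out = clean_records(raw)
    assert len(out) == 1
    assert out[0] == {"age": 25, "income": 50_000}

test_clean_records()
```

In a CI pipeline this file would simply live under `tests/` and be picked up by a `pytest` step, failing the build if the pipeline logic regresses.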
GitHub Actions
GitHub Actions is a CI/CD (Continuous Integration and Continuous Deployment) tool built into GitHub. It allows you to automate workflows such as building, testing, and deploying code when certain events occur in your repository (like push, pull request, etc.).
Common Use Cases
- CI/CD pipelines (build, test, deploy code)
- Linting and formatting
- Running cron jobs
- Publishing packages
- Automating issues, PRs, labels, etc.
Basic Structure of GitHub Actions
You define workflows using YAML inside the .github/workflows/ folder of your repository.
Example:
```yaml
# .github/workflows/nodejs.yml
name: Node.js CI

on:
  push:
    branches: [ main ]
  pull_request:
    branches: [ main ]

jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Setup Node.js
        uses: actions/setup-node@v4
        with:
          node-version: '20'
      - name: Install dependencies
        run: npm install
      - name: Run tests
        run: npm test
```
⚙️ Key Components
| Component | Description |
|---|---|
| `on` | Triggers (e.g., push, pull_request, schedule) |
| `jobs` | A collection of tasks to run |
| `runs-on` | Environment (e.g., ubuntu-latest) |
| `steps` | Individual commands or actions |
| `uses` | Reusable actions (like actions/checkout) |
| `run` | Shell commands |
✅ Example for Python Project
```yaml
name: Python CI

on: [push]

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: '3.11'
      - name: Install dependencies
        run: |
          python -m pip install --upgrade pip
          pip install -r requirements.txt
      - name: Run tests
        run: pytest
```
Popular Actions
| Action | Purpose |
|---|---|
| `actions/checkout` | Check out repo code |
| `actions/setup-node` | Set up Node.js |
| `actions/setup-python` | Set up Python |
| `docker/build-push-action` | Build & push Docker image |
| `github/super-linter` | Code linting |
Advanced Features
- Matrix builds (test on multiple environments)
- Secrets (store API keys securely)
- Reusable workflows via workflow_call
- Artifacts (store and share test reports, build files, etc.)
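The matrix-build feature mentioned above can be sketched as a workflow fragment; the job name and Python versions here are illustrative, not from any project in these notes:

```yaml
# Sketch: run the same test job across several Python versions
jobs:
  test:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        python-version: ['3.10', '3.11', '3.12']
    steps:
      - uses: actions/checkout@v3
      - uses: actions/setup-python@v4
        with:
          python-version: ${{ matrix.python-version }}
      - run: pip install -r requirements.txt
      - run: pytest
```

Each matrix entry becomes its own job run, so a failure on one Python version is reported independently of the others.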
GitLab CI/CD
GitLab CI/CD is GitLab’s built-in continuous integration and deployment system. Like GitHub Actions, it lets you automate build, test, and deployment pipelines, but it is more tightly integrated into the GitLab platform.
Core Concept: .gitlab-ci.yml
The pipeline is defined in a .gitlab-ci.yml file in the root of your repository.
✅ Simple Example
```yaml
stages:
  - build
  - test
  - deploy

build_job:
  stage: build
  script:
    - echo "Compiling the code..."
    - make

test_job:
  stage: test
  script:
    - echo "Running tests..."
    - make test

deploy_job:
  stage: deploy
  script:
    - echo "Deploying application..."
    - make deploy
  only:
    - main
```
Key Components
| Component | Description |
|---|---|
| `stages` | The pipeline flow (e.g., build → test → deploy) |
| `jobs` | Each job runs a script and belongs to a stage |
| `script` | Shell commands the job will execute |
| `only` / `except` | Control when the job runs (e.g., only on main) |
| `tags` | Used to target specific GitLab Runners |
Common Features
- Built-in Docker support for containerized pipelines
- Manual jobs for approval steps
- Artifacts and caching for build outputs or dependencies
- Environment variables & secrets
- Parallel/matrix jobs
- Triggering other pipelines
- Private or shared runners
Python Example
```yaml
image: python:3.11

stages:
  - test

test:
  stage: test
  script:
    - pip install -r requirements.txt
    - pytest
```
Docker + GitLab CI Example
```yaml
image: docker:latest

services:
  - docker:dind

stages:
  - build

build:
  stage: build
  script:
    - docker build -t myapp:latest .
```
Using Secrets (CI/CD Variables)
Set them in GitLab → Project Settings → CI/CD → Variables, then reference them in your script:
```yaml
script:
  - echo "$SECRET_KEY"
```
Deployment Example with SSH
```yaml
deploy:
  stage: deploy
  script:
    - ssh user@your-server 'cd /var/www/app && git pull && systemctl restart app'
  only:
    - main
```
✳ Comparison with GitHub Actions
| Feature | GitLab CI | GitHub Actions |
|---|---|---|
| Config File | .gitlab-ci.yml | .github/workflows/*.yml |
| Built-in Docker | ✅ Native | ✅ With setup |
| Matrix Build | ✅ Via parallel | ✅ With matrix |
| Community Marketplace | ✅ (less extensive) | ✅ Huge marketplace |
| Integrated UI | Deeply built-in | More plug & play |
In CI/CD, artifacts are files generated during a pipeline run that you want to save, archive, or pass to later stages—like test reports, build outputs, or deployment packages.
Both GitLab CI and GitHub Actions support artifacts, but their usage and syntax differ.
GitLab CI: Artifacts
Basic Usage
```yaml
build_job:
  stage: build
  script:
    - make build
  artifacts:
    paths:
      - build/
```
This saves the build/ folder after build_job runs. These artifacts:
- Are downloadable from the GitLab UI
- Can be passed to later stages (unless expire_in removes them)
With Expiration and Custom Settings
```yaml
test_job:
  stage: test
  script:
    - pytest --junitxml=report.xml
  artifacts:
    paths:
      - report.xml
    expire_in: 1 week
    reports:
      junit: report.xml
```
Key fields:
| Field | Purpose |
|---|---|
| `paths` | Files or directories to save |
| `expire_in` | Auto-delete time (e.g., 1 day, 1 week) |
| `reports` | Special-format reports such as junit, coverage, etc. |
Passing Artifacts to the Next Stage
Artifacts are automatically passed to jobs in later stages, not within the same stage.
```yaml
stages:
  - build
  - test

build:
  stage: build
  script:
    - make build
  artifacts:
    paths:
      - build/

test:
  stage: test
  script:
    - ./test-runner build/
```
GitHub Actions: Artifacts
Save Artifacts
```yaml
- name: Upload build output
  uses: actions/upload-artifact@v4
  with:
    name: build-artifact
    path: build/
```
Download in Another Job
```yaml
- name: Download artifact
  uses: actions/download-artifact@v4
  with:
    name: build-artifact
```
You must split upload and download into separate jobs to share artifacts between them.
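Putting the two snippets together, a sketch of a full workflow with an artifact handoff might look like this; the `make build` and `./test-runner` commands are placeholders for your own build and test steps:

```yaml
# Sketch: build job uploads an artifact, test job downloads it
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - run: make build            # placeholder build command
      - uses: actions/upload-artifact@v4
        with:
          name: build-artifact
          path: build/
  test:
    needs: build                   # runs after build, so the artifact exists
    runs-on: ubuntu-latest
    steps:
      - uses: actions/download-artifact@v4
        with:
          name: build-artifact
          path: build/
      - run: ./test-runner build/  # placeholder test command
```

The `needs: build` line is what enforces the ordering; without it the two jobs would run in parallel and the download would fail.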
What is Jenkins?
Jenkins is an open-source automation server widely used for CI/CD pipelines. It lets you automate building, testing, and deploying applications through pipelines (typically defined in a Jenkinsfile).
Key Concepts
| Concept | Description |
|---|---|
| Job | A build configuration (freestyle or pipeline) |
| Pipeline | Scripted or declarative workflow for CI/CD |
| Agent | A machine (or container) where jobs run |
| Stage | A high-level step (e.g., Build, Test) |
| Step | A single task inside a stage (e.g., shell command) |
| Node | A Jenkins worker (agent) that executes pipelines |
Sample Jenkinsfile (Declarative Pipeline)
```groovy
pipeline {
    agent any

    environment {
        MY_ENV_VAR = 'value'
    }

    stages {
        stage('Build') {
            steps {
                echo 'Building the project...'
                sh 'make build'
            }
        }
        stage('Test') {
            steps {
                echo 'Running tests...'
                sh 'make test'
            }
        }
        stage('Deploy') {
            when {
                branch 'main'
            }
            steps {
                echo 'Deploying to production...'
                sh './deploy.sh'
            }
        }
    }

    post {
        always {
            echo 'Pipeline finished.'
        }
        failure {
            echo 'Pipeline failed!'
        }
    }
}
```
Artifacts in Jenkins
To store and archive files like build outputs or test results:
```groovy
post {
    success {
        archiveArtifacts artifacts: 'build/*.jar', fingerprint: true
    }
}
```
To publish test results:
```groovy
post {
    always {
        junit 'reports/**/*.xml'
    }
}
```
Jenkins Plugins You’ll Need
| Plugin Name | Purpose |
|---|---|
| Pipeline | Enables pipeline-as-code |
| Git | Checkout from Git repositories |
| JUnit | Test reporting |
| Docker Pipeline | Build & run Docker in pipeline |
| Credentials Binding | Secure secret handling |
| SSH | Remote deployments |
| Blue Ocean | Modern UI for pipelines |
Jenkins with Docker
```groovy
pipeline {
    agent {
        docker {
            image 'python:3.11'
            args '-v /var/run/docker.sock:/var/run/docker.sock'
        }
    }
    stages {
        stage('Install') {
            steps {
                sh 'pip install -r requirements.txt'
            }
        }
        stage('Test') {
            steps {
                sh 'pytest'
            }
        }
    }
}
```
Secrets in Jenkins
- Store credentials in Manage Jenkins → Credentials
- Use them in a pipeline:
```groovy
withCredentials([string(credentialsId: 'MY_SECRET_ID', variable: 'MY_SECRET')]) {
    sh 'echo $MY_SECRET'
}
```
What is CircleCI?
CircleCI is a modern cloud-native CI/CD platform known for speed, flexibility, and Docker-first support. It automates building, testing, and deploying your code every time you commit changes.
Config File: .circleci/config.yml
CircleCI uses a YAML file stored in the .circleci/ folder of your repo.
✅ Minimal Example (Node.js)
```yaml
version: 2.1

jobs:
  build:
    docker:
      - image: cimg/node:20.4
    steps:
      - checkout
      - run: npm install
      - run: npm test

workflows:
  build_and_test:
    jobs:
      - build
```
Key Components
| Component | Description |
|---|---|
| `version` | CircleCI configuration version (use 2.1+) |
| `jobs` | Group of steps to run (build/test/deploy) |
| `steps` | Commands in a job (e.g., checkout, run) |
| `workflows` | Defines job orchestration (sequential/parallel) |
| `executors` | Runtime environment (Docker, machine, macOS) |
Docker Support Example
```yaml
jobs:
  build:
    docker:
      - image: cimg/python:3.11
    steps:
      - checkout
      - run:
          name: Install dependencies
          command: pip install -r requirements.txt
      - run:
          name: Run tests
          command: pytest
```
Artifacts in CircleCI
Artifacts are files saved from a job (e.g., logs, coverage reports).
Upload Artifacts
```yaml
- store_artifacts:
    path: test-results/
    destination: test-results
```
Test Reports
```yaml
- store_test_results:
    path: test-results
```
You can see artifacts and test results in the CircleCI UI after job execution.
Environment Variables & Secrets
- Define them via CircleCI Project Settings → Environment Variables
- Reference them directly in your run commands:
```yaml
- run: echo $MY_SECRET_TOKEN
```
Advanced Features
| Feature | Example |
|---|---|
| Workflows | Run jobs in parallel or sequentially |
| Conditional steps | Use when and unless |
| Caching | Speed up builds using save_cache / restore_cache |
| Reusable configs | commands, executors, orbs |
| Matrix builds | Run tests against multiple language versions |
⚙️ Caching Example
```yaml
- restore_cache:
    keys:
      - v1-deps-{{ checksum "package-lock.json" }}
- run: npm install
- save_cache:
    paths:
      - node_modules
    key: v1-deps-{{ checksum "package-lock.json" }}
```
CircleCI vs GitHub Actions vs GitLab CI vs Jenkins
| Feature | CircleCI | GitHub Actions | GitLab CI | Jenkins |
|---|---|---|---|---|
| Hosted | ✅ Yes | ✅ Yes | ✅ Yes | ❌ Self-hosted |
| Docker-native | ✅ Strong | ✅ Good | ✅ Strong | ✅ With config |
| Config as Code | ✅ .yml | ✅ .yml | ✅ .yml | ✅ Groovy DSL |
| Marketplace | ✅ Orbs | ✅ Actions | ⚠️ Few | ✅ Plugins |
| Matrix builds | ✅ Built-in | ✅ Supported | ✅ Parallel jobs | ✅ Scripted |
What is Amazon SageMaker Pipelines?
SageMaker Pipelines is Amazon's CI/CD service for machine learning workflows. It lets you build, automate, and manage ML workflows (like data prep, training, tuning, evaluation, and deployment) using a Python SDK.
It’s similar to Kubeflow Pipelines or Airflow but tightly integrated into AWS SageMaker.
⚙️ Typical Use Case: End-to-End ML Workflow
[Data Prep] → [Feature Engineering] → [Model Training] → [Model Evaluation] → [Model Registration] → [Deployment]
Basic Structure Using the SageMaker Python SDK
```python
from sagemaker.workflow.pipeline import Pipeline
from sagemaker.workflow.steps import ProcessingStep, TrainingStep
from sagemaker.workflow.model_step import ModelStep
from sagemaker.workflow.parameters import ParameterString, ParameterFloat
from sagemaker.processing import ScriptProcessor
from sagemaker.estimator import Estimator
from sagemaker.workflow.pipeline_context import PipelineSession
```
✅ Example: Full ML Pipeline
```python
from sagemaker.workflow.parameters import ParameterString
from sagemaker.workflow.steps import ProcessingStep, TrainingStep
from sagemaker.workflow.pipeline import Pipeline
from sagemaker.processing import ScriptProcessor
from sagemaker.estimator import Estimator
from sagemaker.workflow.pipeline_context import PipelineSession
import sagemaker

# Setup
region = sagemaker.Session().boto_region_name
role = sagemaker.get_execution_role()
pipeline_session = PipelineSession()

# Parameters
input_data = ParameterString(name="InputData", default_value="s3://my-bucket/input.csv")

# Step 1: Preprocessing
processor = ScriptProcessor(
    image_uri=sagemaker.image_uris.retrieve("sklearn", region),
    command=["python3"],
    role=role,
    instance_count=1,
    instance_type="ml.m5.large",
)

processing_step = ProcessingStep(
    name="DataPreprocessing",
    processor=processor,
    inputs=[input_data],
    code="preprocess.py",
    outputs=[...],
)

# Step 2: Training
estimator = Estimator(
    image_uri=sagemaker.image_uris.retrieve("xgboost", region),
    role=role,
    instance_count=1,
    instance_type="ml.m5.xlarge",
    output_path="s3://my-bucket/model/",
)

training_step = TrainingStep(
    name="ModelTraining",
    estimator=estimator,
    inputs={"train": processing_step.properties.ProcessingOutputConfig.Outputs["train_data"].S3Output.S3Uri},
)

# Pipeline Definition
pipeline = Pipeline(
    name="MyMLPipeline",
    parameters=[input_data],
    steps=[processing_step, training_step],
    sagemaker_session=pipeline_session,
)

pipeline.upsert(role_arn=role)
execution = pipeline.start()
```
Key Components of SageMaker Pipelines
| Component | Purpose |
|---|---|
| `ProcessingStep` | Data cleaning, feature engineering, etc. |
| `TrainingStep` | Model training using an Estimator |
| `TransformStep` | Batch inference |
| `ConditionStep` | Add logic based on metrics |
| `ModelStep` | Register model to the Model Registry |
| `CallbackStep` | Integrate with Lambda/custom logic |
| `ParameterString` / `ParameterFloat` | Dynamically pass pipeline inputs |
| `PipelineSession` | Manages interaction with SageMaker |
Benefits
✅ Managed service – no servers to manage
✅ Trackable runs with versioning, lineage, and metadata
✅ Built-in CI/CD for ML
✅ Integration with SageMaker Experiments, Model Registry, and Feature Store
✅ Scalable with on-demand compute and built-in retry logic
Real-World Example Flow
1. Ingest raw CSV from S3
2. Clean & split data (ProcessingStep)
3. Train XGBoost or sklearn model (TrainingStep)
4. Evaluate accuracy, F1 score (ConditionStep)
5. If metrics are good → register model (ModelStep)
6. Deploy to endpoint via Lambda or manual
Related AWS Services
| Service | Purpose |
|---|---|
| S3 | Data input/output |
| SageMaker Studio | GUI for pipelines |
| SageMaker Feature Store | Feature engineering |
| Model Registry | Version & track models |
| Lambda / Step Functions | Extend logic or trigger deployment |
| CloudWatch | Logging & monitoring |
What is ZenML?
ZenML is an open-source MLOps framework built to orchestrate reproducible ML pipelines across tools like MLflow, Airflow, Kubernetes, and SageMaker.
✅ Features:
- Tool-agnostic: plug in TensorFlow, PyTorch, sklearn, etc.
- Built-in support for MLflow, Weights & Biases, GCP, AWS, Kubernetes
- Focus on pipelines, reproducibility, modularity
- Developer-friendly CLI + Python SDK
ZenML Pipeline Example (sketch; the `...` bodies are elided, and the exact decorator import path depends on your ZenML version):
```python
from typing import Any

import pandas as pd
from zenml import pipeline, step  # import path for recent ZenML releases

@step
def ingest_data() -> pd.DataFrame:
    ...

@step
def train_model(data: pd.DataFrame) -> Any:
    ...

@pipeline
def training_pipeline(data_loader, trainer):
    data = data_loader()
    model = trainer(data)

pipeline = training_pipeline(ingest_data, train_model)
pipeline.run()
```
ZenML separates your pipeline into clean steps and supports plugins to execute on local, Kubeflow, Airflow, Vertex AI, etc.
What is TFX (TensorFlow Extended)?
TFX is Google's official end-to-end platform for deploying TensorFlow models in production. It was built to meet internal Google ML production needs.
✅ Features:
- Native integration with the TensorFlow ecosystem
- Standard components: ExampleGen, Trainer, Evaluator, Pusher, etc.
- Works with Apache Beam, Kubeflow Pipelines, Airflow
- Focuses heavily on data validation, model analysis, and serving
TFX Pipeline Example (sketch; Trainer and Pusher arguments are elided):
```python
from tfx.orchestration import pipeline
from tfx.orchestration.local.local_dag_runner import LocalDagRunner
from tfx.components import CsvExampleGen, Trainer, Pusher

example_gen = CsvExampleGen(input_base='data/')
trainer = Trainer(...)
pusher = Pusher(...)

my_pipeline = pipeline.Pipeline(
    pipeline_name='my_pipeline',
    pipeline_root='pipelines/',
    components=[example_gen, trainer, pusher],
)

LocalDagRunner().run(my_pipeline)
```
TFX enforces TensorFlow-specific best practices for data quality, model performance, and deployment.
ZenML vs TFX: Feature Comparison
| Feature | ZenML | TFX |
|---|---|---|
| Language | Python (framework-agnostic) | Python (TensorFlow-focused) |
| ML Framework Support | TensorFlow, PyTorch, sklearn, etc. | TensorFlow only |
| Component Modularity | Highly modular + customizable | Modular (TensorFlow-centric) |
| Orchestrators | Airflow, Kubeflow, MLflow, Prefect | Airflow, Kubeflow |
| Deployment Support | SageMaker, Vertex AI, KServe | TensorFlow Serving, Vertex AI |
| Visualization / Metadata | MLflow, W&B, ZenML UI | TensorBoard, TFX Metadata |
| Pipeline Reproducibility | ✅ Yes | ✅ Yes |
| Local Execution | ✅ Yes | ✅ Yes |
| Ease of Use | Beginner-friendly | More complex, steeper learning curve |
When to Use What?
| Scenario | Use |
|---|---|
| Want a framework-agnostic, modular, easy-to-adopt pipeline | ✅ ZenML |
| Already using TensorFlow and want to follow best practices | ✅ TFX |
| Need to plug into SageMaker, MLflow, K8s, etc. | ✅ ZenML |
| Need advanced model validation, explainability, data skew detection | ✅ TFX |
TL;DR
| ZenML | TFX |
|---|---|
| Flexible, lightweight, easy to start | Powerful, opinionated, deep TensorFlow support |
| Works with any ML/DL framework | TensorFlow-only |
| Ideal for hybrid/multi-cloud & plug-n-play MLOps | Ideal for enterprise-grade TensorFlow pipelines |
9. Monitoring and Logging
What is Drift in Machine Learning?
In production ML, drift refers to changes over time in the data or relationships that the model depends on, which can lead to reduced model accuracy.
There are two main types:
1. Data Drift (a.k.a. Covariate Shift)
Definition:
The distribution of input features (X) changes over time, but the relationship between input and output (P(y|x)) remains the same.
Example:
- A credit scoring model was trained on users from India, but it's now being used in the US.
- Feature distributions such as age, income, or credit history change → data drift.
Detection Methods:
- Statistical tests (e.g., the Kolmogorov-Smirnov test)
- Population Stability Index (PSI)
- Earth Mover's Distance
- Histograms & density plots
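Of the detection methods listed above, PSI is simple enough to sketch in plain Python. Bin edges come from the reference sample; the `1e-6` floor for empty bins and the 0.2 alert threshold are common conventions, not fixed rules:

```python
import math

def psi(expected, actual, bins=10):
    """Population Stability Index between a reference sample and a live sample."""
    lo, hi = min(expected), max(expected)
    edges = [lo + (hi - lo) * i / bins for i in range(1, bins)]

    def fractions(values):
        counts = [0] * bins
        for v in values:
            idx = sum(v > e for e in edges)  # which bin this value falls into
            counts[idx] += 1
        # small floor avoids log(0) for empty bins
        return [max(c / len(values), 1e-6) for c in counts]

    p, q = fractions(expected), fractions(actual)
    return sum((pi - qi) * math.log(pi / qi) for pi, qi in zip(p, q))

# Identical samples give PSI near 0; a common rule of thumb flags PSI > 0.2
ref = [i / 100 for i in range(100)]
assert psi(ref, ref) < 1e-6
assert psi(ref, [x + 0.5 for x in ref]) > 0.2
```

In production you would run this per feature against the training-time distribution and alert when any feature crosses the threshold.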
2. Model Drift (a.k.a. Concept Drift)
Definition:
The relationship between input and target variable (P(y|x)) changes over time, even if input distribution remains stable.
Example:
- A fraud detection model where fraudster behavior evolves (e.g., new tactics)
- The model can no longer accurately map inputs to the correct outcome → model drift.
Detection Methods:
- Monitoring model performance metrics (e.g., accuracy, AUC, F1)
- If model metrics drop but input features haven't changed → model drift
- Concept drift detectors such as:
  - DDM (Drift Detection Method)
  - ADWIN
  - Kullback-Leibler divergence
Drift Comparison
| Aspect | Data Drift | Model Drift |
|---|---|---|
| What changes | Input features distribution (X) | Relationship between X and Y |
| Impact | Can indirectly reduce accuracy | Directly affects model accuracy |
| Detection | PSI, KS test, histograms | Drop in model performance |
| Remediation | Retrain with recent data | Retrain + re-define model logic |
Common Causes of Drift
| Cause | Type |
|---|---|
| Seasonality or time-based shifts | Data Drift |
| Change in user behavior | Model Drift |
| External events (e.g., pandemic) | Both |
| Sensor recalibration or software upgrades | Data Drift |
How to Monitor & Handle Drift
1. Monitoring Tools
- Evidently AI – open-source drift detection (https://evidentlyai.com/)
- WhyLabs, Arize AI, Fiddler, SageMaker Model Monitor
- Custom dashboards with Prometheus/Grafana
2. Detection Frequency
- Daily/weekly batch comparisons
- Real-time if using streaming
3. Actions to Take
- Trigger retraining pipelines
- Use drift detectors in CI/CD workflows
- Incorporate active learning or online learning
Summary
| Term | What is it? | Why it matters |
|---|---|---|
| Data Drift | Input feature distribution changes | Model may make wrong inferences |
| Model Drift | Relationship between X and Y changes | Model becomes inaccurate |
What is Model Performance Monitoring?
Model performance monitoring is the process of tracking, measuring, and analyzing how your ML model behaves in production — ensuring it's still accurate, fair, and reliable after deployment.
Why Is It Important?
Even the best model at training time can degrade in production due to:
- Data drift
- Model drift
- Feature pipeline bugs
- Feedback loops or changing real-world patterns
Without monitoring, you might miss silent failures that hurt business outcomes.
What to Monitor in ML Systems
✅ 1. Performance Metrics
| Metric Type | Example |
|---|---|
| Classification | Accuracy, Precision, Recall, F1, AUC |
| Regression | RMSE, MAE, R² |
| Ranking | MAP, NDCG |
| Business KPIs | Conversion rate, CTR, etc. |
Compare training vs validation vs production performance.
✅ 2. Data Quality & Drift
| What to check | How |
|---|---|
| Missing values | Feature-level monitoring |
| Schema violations | Type, range, shape |
| Data drift | PSI, KS Test |
| Outliers or anomalies | Z-score, IQR, Mahalanobis |
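The z-score check from the table above can be sketched in plain Python; the 3-standard-deviation threshold is a common default, not a universal rule:

```python
import math

def zscore_outliers(values, threshold=3.0):
    """Return indices of values more than `threshold` standard deviations from the mean."""
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    if std == 0:
        return []  # constant data: nothing can be an outlier
    return [i for i, v in enumerate(values) if abs(v - mean) / std > threshold]

data = [10.0] * 30 + [10.2, 9.8, 95.0]  # one obvious anomaly at the end
assert zscore_outliers(data) == [32]
```

IQR-based fences are more robust when extreme values inflate the mean and standard deviation themselves, which is why the table lists both.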
✅ 3. Prediction Distribution
- Is the model outputting the same predictions every time?
- Look for prediction bias or overconfident scores.
✅ 4. Fairness and Bias
- Measure model fairness across sensitive groups (e.g., age, gender).
- Monitor disparities in performance.
✅ 5. Latency and Throughput
- Inference latency (ms/req)
- Request volume
- System resource usage (CPU/GPU, memory)
⚒️ Tools for Model Monitoring
Open-Source
| Tool | Features |
|---|---|
| Evidently AI | Data & model drift, dashboards, reports |
| Prometheus + Grafana | Custom monitoring (great for latency, metrics) |
| MLflow | Experiment tracking (with manual model logs) |
| WhyLogs | Logging and monitoring of data quality |
| Fiddler / Arize AI / TruEra | Monitoring + explainability (SaaS) |
Cloud-Native
| Platform | Monitoring Feature |
|---|---|
| SageMaker Model Monitor | Built-in drift & quality detection |
| Vertex AI (GCP) | Prediction monitoring, alerts |
| Azure ML | Drift + metric monitoring |
| Databricks | MLflow + production metrics |
Monitoring Lifecycle Example
1. Model is deployed (API or batch)
2. User requests come in
3. Log: input data, model predictions, latency
4. Optional: collect true labels later (for supervised metrics)
5. Compare live vs baseline (training) distributions & metrics
6. Trigger alerts / retrain pipelines if performance drops
Sample: Custom Monitoring Loop (Python)
```python
import pandas as pd
from sklearn.metrics import accuracy_score

# 1. Collect live predictions and labels
preds = pd.read_csv("live_predictions.csv")
truth = pd.read_csv("live_labels.csv")

# 2. Calculate performance
acc = accuracy_score(truth["label"], preds["prediction"])

# 3. Trigger alert if accuracy drops
if acc < 0.75:
    print("⚠️ Model accuracy dropped below threshold!")
```
Best Practices
✅ Set performance baselines from training
✅ Store input + predictions + actuals
✅ Monitor in real-time or batch
✅ Set up alerts or retraining triggers
✅ Regularly audit for fairness and explainability
What is Prometheus?
Prometheus is an open-source monitoring and alerting system originally developed by SoundCloud. It’s widely used for real-time metrics collection, alerting, and visualization, especially in DevOps and ML infrastructure.
✅ Why Use Prometheus for ML & MLOps?
- Track model inference metrics (latency, throughput, errors)
- Monitor CPU/GPU usage of ML workloads
- Combine with Grafana for dashboards
- Set up alerts for performance or drift degradation
- Works well with Docker, Kubernetes, FastAPI, Flask, etc.
Core Concepts
| Concept | Description |
|---|---|
| Metric | A time-series data point (e.g., inference_latency_seconds) |
| Labels | Key-value tags for filtering metrics (e.g., model="xgboost") |
| Exporter | Collects metrics from apps (e.g., Python, GPU, Docker) |
| Scraping | Prometheus pulls metrics by scraping a target HTTP endpoint |
| Query | Uses PromQL to query metrics |
| Alertmanager | Sends alerts via email, Slack, PagerDuty, etc. |
Example: Expose Metrics in Python (FastAPI + Prometheus)
```shell
pip install prometheus_client fastapi uvicorn
```
```python
# app.py
from fastapi import FastAPI
from prometheus_client import start_http_server, Summary, Counter
import time
import random

app = FastAPI()

# Metrics
REQUEST_TIME = Summary('inference_latency_seconds', 'Time spent on inference')
REQUEST_COUNT = Counter('inference_requests_total', 'Total inference requests')

@app.get("/predict")
@REQUEST_TIME.time()
def predict():
    REQUEST_COUNT.inc()
    time.sleep(random.uniform(0.1, 0.5))  # simulate inference delay
    return {"result": "cat"}

# Run the Prometheus metrics server on port 8001
start_http_server(8001)
```
Prometheus Configuration (prometheus.yml)
```yaml
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: 'ml-api'
    static_configs:
      - targets: ['localhost:8001']
```
Prometheus scrapes http://localhost:8001/metrics every 15s.
Visualize in Grafana
- Run Prometheus + Grafana using Docker: docker-compose up
- Add Prometheus as a data source in Grafana
- Create dashboards using PromQL, e.g.:
  - inference_latency_seconds_count
  - rate(inference_latency_seconds_sum[1m])
Popular Exporters
| Exporter | Use |
|---|---|
| `prometheus_client` | App-level metrics in Python |
| `node_exporter` | System metrics (CPU, memory) |
| `gpu_exporter` | NVIDIA GPU metrics |
| `kube-state-metrics` | Kubernetes objects |
| `pushgateway` | For short-lived jobs (like batch ML) |
Alerts (via Alertmanager)
Example rule:
```yaml
groups:
  - name: ml-alerts
    rules:
      - alert: HighLatency
        expr: inference_latency_seconds_sum / inference_latency_seconds_count > 0.3
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "High inference latency detected"
```
ML Monitoring Use Cases
| Use Case | Metric |
|---|---|
| Latency | inference_latency_seconds |
| Traffic | inference_requests_total |
| Failure rate | inference_errors_total |
| Resource usage | From node_exporter or gpu_exporter |
| Drift triggers | Custom metrics exposed from model logic |
What is Grafana?
Grafana is an open-source analytics and dashboarding tool used to visualize time-series data from sources like Prometheus, InfluxDB, Elasticsearch, Loki, and many others.
In MLOps, Grafana is often paired with Prometheus to monitor:
- Model inference latency
- Drift signals
- API uptime and errors
- CPU/GPU utilization
- Data pipeline performance
✅ Why Use Grafana?
- Beautiful interactive dashboards
- Flexible PromQL/SQL queries
- Alerting capabilities
- Works with ML/DevOps monitoring tools
- Integration with Slack, email, and PagerDuty for alerts
Key Features
| Feature | Description |
|---|---|
| Panels | Graphs, tables, heatmaps, gauges, logs |
| Variables | Dynamic filters (e.g., model name) |
| Data Sources | Prometheus, Loki, AWS CloudWatch, PostgreSQL, etc. |
| Annotations | Add events or markers to timelines |
| Alerts | Visual + rule-based threshold alerts |
Common ML Use Cases
| Use Case | Panel Type | Metric Source |
|---|---|---|
| Inference latency | Line chart | Prometheus |
| Drift score over time | Graph panel | Evidently/WhyLogs |
| Error rate | Stat panel | Prometheus |
| GPU usage | Gauge / Time series | NVIDIA exporter |
| Feature distribution | Histogram / Heatmap | Custom app metrics |
Example Setup (Local)
1. Docker Compose (Prometheus + Grafana)
```yaml
# docker-compose.yml
version: '3'
services:
  prometheus:
    image: prom/prometheus
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
    ports:
      - "9090:9090"
  grafana:
    image: grafana/grafana
    ports:
      - "3000:3000"
```
2. Start it:
```shell
docker-compose up -d
```
Sample Grafana Dashboard Panels for ML
Inference Latency Panel
- Metric: inference_latency_seconds
- Query: rate(inference_latency_seconds_sum[1m]) / rate(inference_latency_seconds_count[1m])
Error Rate Panel
- Metric: inference_errors_total
- Query: rate(inference_errors_total[5m])
Model Drift Detection Panel
- Metric: feature_drift_score
- Query: avg_over_time(feature_drift_score[1h])
Alerts in Grafana
- Create a panel → set thresholds (e.g., latency > 500 ms)
- Add an alert → define the condition (e.g., average over 5 min)
- Connect Alertmanager / Slack / email
✨ Example Dashboards
| Dashboard | Panels |
|---|---|
| Model Monitoring | Accuracy, F1, latency, requests |
| System Monitoring | CPU, RAM, GPU, disk |
| ETL Pipeline Monitoring | Job success, failure rate, execution time |
| Data Drift Monitor | PSI/KS scores, feature distribution |
Here’s a complete comparison and overview of Evidently AI, WhyLabs, and Seldon Core, three powerful tools in the MLOps & model monitoring ecosystem:
1. Evidently AI
Purpose: Open-source Python library for data & model monitoring, focused on drift detection, data quality, and performance reports.
✅ Use Cases:
- Data & target drift detection
- Feature distribution changes
- Model performance reports
- Offline or in-pipeline monitoring
Integration:
- Python scripts, Jupyter notebooks
- Airflow, Prefect, Kubeflow, etc.
Key Features:
| Feature | Description |
|---|---|
| Data Drift Report | Detects change in feature distributions |
| Target Drift Report | Monitors label distribution changes |
| Classification/Regression Reports | Accuracy, F1, ROC, etc. |
| Data Quality Report | Nulls, type mismatches, etc. |
| Dashboards (Evidently UI) | Serve reports as interactive UI locally or in pipelines |
Example (Python):
```python
from evidently.report import Report
from evidently.metric_preset import DataDriftPreset

report = Report(metrics=[DataDriftPreset()])
report.run(reference_data=ref_df, current_data=prod_df)
report.save_html("drift_report.html")
```
☁️ 2. WhyLabs + WhyLogs
Purpose: Enterprise-grade observability and monitoring platform for ML pipelines and data quality, offering automated logging, drift detection, and dashboards.
✅ Use Cases:
- Continuous production monitoring
- Automated data profiling
- Real-time alerting
- Integration with cloud and on-prem ML workflows
Key Features:
| Feature | Description |
|---|---|
| WhyLogs | Open-source library for logging statistics about data |
| WhyLabs Platform | SaaS platform for dashboards, alerts |
| Segmented Monitoring | Track metrics across different user segments |
| Lightweight Logging | Doesn’t expose raw data (great for compliance) |
| Streaming / Batch | Works in both modes; supports Spark, Pandas, S3, Kafka, etc. |
๐งช Example (Python):
import whylogs as why
profile = why.log(pandas_df).profile()
profile.write(path="profile.bin")
You then upload this profile to WhyLabs using the WhyLabs agent.
๐ 3. Seldon Core
Purpose: Open-source platform for deploying, scaling, and monitoring ML models on Kubernetes. Comparable to KServe (formerly KFServing).
✅ Use Cases:
-
Kubernetes-native model serving
-
A/B testing, canary rollout, multi-model serving
-
Real-time inference monitoring
-
Explainability & drift detection
๐ Key Features:
| Feature | Description |
|---|---|
| MLServer | Fast, multi-language model server |
| Explainers | SHAP, Lime integration out of the box |
| Drift Detectors | Kolmogorov-Smirnov, PSI, etc. |
| Outlier Detectors | Use alibi-detect or custom models |
| Seldon Metrics | Prometheus & Grafana-ready metrics |
| Advanced Routing | Can run A/B, multi-armed bandit deployments |
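The advanced-routing row above covers splitting traffic between model variants. The core multi-armed-bandit idea can be sketched as a seeded epsilon-greedy router in plain Python (variant names and the 10% exploration rate are illustrative — Seldon configures routing declaratively, not via code like this):

```python
import random

def make_router(variants, epsilon=0.1, seed=42):
    """Route requests: mostly the best-performing variant, sometimes explore."""
    rng = random.Random(seed)
    rewards = {v: [] for v in variants}

    def route():
        if rng.random() < epsilon or not any(rewards.values()):
            return rng.choice(variants)  # explore
        # exploit: variant with the best average reward so far
        return max(variants,
                   key=lambda v: sum(rewards[v]) / max(len(rewards[v]), 1))

    def record(variant, reward):
        rewards[variant].append(reward)

    return route, record

route, record = make_router(["model-a", "model-b"])
record("model-a", 0.9)  # feedback: model-a performs well
record("model-b", 0.2)  # feedback: model-b performs poorly
choices = [route() for _ in range(100)]
```

With these rewards the router sends roughly 95% of traffic to `model-a` while still occasionally sampling `model-b`.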
๐งฑ Architecture:
[Kubernetes]
|
[SeldonDeployment YAML]
|
[Model Pods] <---> [Metrics + Monitoring Pods]
|
[Ingress Gateway]
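The `SeldonDeployment` referenced in the diagram is a Kubernetes custom resource. A minimal sketch using a prepackaged scikit-learn server (the name and model URI are placeholders):

```yaml
apiVersion: machinelearning.seldon.io/v1
kind: SeldonDeployment
metadata:
  name: income-classifier
spec:
  predictors:
    - name: default
      replicas: 1
      graph:
        name: classifier
        implementation: SKLEARN_SERVER
        modelUri: gs://my-models/income-classifier
```

Applying this with `kubectl apply -f` creates the model pods, service, and Prometheus-ready metrics endpoints automatically.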
๐ Comparison Table
| Feature/Tool | Evidently AI | WhyLabs + WhyLogs | Seldon Core |
|---|---|---|---|
| Type | Python library/UI | Logging + SaaS platform | Kubernetes deployment |
| Drift Detection | ✅ | ✅ | ✅ (via Alibi Detect) |
| Model Serving | ❌ | ❌ | ✅ |
| Monitoring | ✅ (offline) | ✅ (cloud/streaming) | ✅ (real-time, Prometheus) |
| Alerts | Manual + Grafana | Built-in SaaS alerts | With Prometheus + AlertMgr |
| Integration | Python, Notebooks | Spark, Kafka, S3, Pandas | Kubernetes + Prometheus |
| Visual UI | Local HTML/UI server | WhyLabs dashboard | Grafana integration |
| Open Source | ✅ | Partially (WhyLogs = ✅) | ✅ |
๐ Ideal Tool Based on Need:
| Need | Tool |
|---|---|
| Quick & Local Drift Detection | Evidently AI |
| Enterprise-Grade Logging & SaaS Dashboards | WhyLabs |
| Full ML Deployment + Drift Detection in K8s | Seldon Core |
10. Cloud & Infrastructure for MLOps
๐ง AWS in the ML/MLOps Ecosystem
AWS offers end-to-end tools for data ingestion, training, model deployment, monitoring, and CI/CD.
๐ง Key AWS Services for MLOps
| Category | Service | Purpose |
|---|---|---|
| Storage & Data | S3, Glue, Athena | Data lake, ETL, querying logs/metadata |
| Model Development | SageMaker Studio | IDE for ML dev (like JupyterLab) |
| Model Training | SageMaker Training Jobs | Scalable training on EC2 or Spot |
| Model Deployment | SageMaker Endpoints | Real-time APIs for inference |
| Model Registry | SageMaker Model Registry | Manage model versions and metadata |
| CI/CD | CodePipeline, CodeBuild, Lambda | Automate training/testing/deployment |
| Monitoring & Drift | SageMaker Model Monitor | Detect drift, outliers, quality issues |
| Observability | CloudWatch, Prometheus, Grafana | Metrics, logging, alerting |
| Feature Store | SageMaker Feature Store | Store, reuse, and version features |
| Security & Auth | IAM, KMS, VPC, S3 Policies | Access control and encryption |
๐ MLOps Workflow on AWS
1. Data Collection & Processing
-
Use AWS Glue or S3 + Lambda to collect/clean data
-
Version datasets using DVC or S3 object versioning
2. Model Training
-
Launch training jobs using SageMaker Training
-
Auto-scale compute; log metrics to CloudWatch
3. Model Evaluation & Registration
-
Evaluate metrics, visualize in SageMaker Experiments
-
Register successful model in Model Registry
4. Model Deployment
-
Use SageMaker Inference Endpoints for:
-
Real-time (InvokeEndpoint)
-
Batch (BatchTransform)
-
Configure autoscaling and multi-model endpoints if needed
5. Monitoring & Drift Detection
-
Use SageMaker Model Monitor to:
-
Detect data drift (feature value distribution)
-
Detect model drift (label drift, performance)
-
Log anomalies to CloudWatch
6. CI/CD for ML
-
Automate with:
-
CodePipeline: Orchestration
-
CodeBuild: Build + test steps
-
Step Functions: Complex ML workflows
-
EventBridge: Trigger on file uploads, model updates
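The EventBridge trigger mentioned above matches events with a rule pattern. A sketch of a pattern that fires when a new object lands in a training-data bucket (bucket name and prefix are placeholders; the bucket must have EventBridge notifications enabled):

```json
{
  "source": ["aws.s3"],
  "detail-type": ["Object Created"],
  "detail": {
    "bucket": {"name": ["ml-data-bucket"]},
    "object": {"key": [{"prefix": "raw/"}]}
  }
}
```

The rule's target can be a Lambda, a CodePipeline execution, or a Step Functions state machine that kicks off retraining.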
๐ Example: Real-time Drift Monitoring with SageMaker
from sagemaker.model_monitor import DataCaptureConfig

# Enable capture of inference requests/responses to S3
data_capture_config = DataCaptureConfig(
    enable_capture=True,
    sampling_percentage=100,
    destination_s3_uri="s3://mybucket/captured-data"
)

# Attach it while deploying; `model` is a previously built sagemaker Model
predictor = model.deploy(
    initial_instance_count=1,
    instance_type='ml.m5.large',
    data_capture_config=data_capture_config
)
Then set up a Monitoring Schedule using ModelMonitor.
๐ Observability with CloudWatch + Grafana
-
CloudWatch collects metrics like:
-
Latency
-
Invocation count
-
4xx/5xx errors
-
Custom logs from inference scripts
-
Connect Prometheus to Amazon Managed Grafana for:
-
Model-specific dashboards
-
Drift visualization
-
Alerting via SNS or Slack
๐ง Final Thoughts
| Goal | Tools to Use |
|---|---|
| Local development & experiments | SageMaker Studio, S3 |
| Deployment with monitoring | SageMaker Endpoint + Model Monitor |
| Production CI/CD pipelines | CodePipeline, Step Functions |
| Enterprise monitoring | CloudWatch + Grafana or Prometheus |
☁️ GCP for Machine Learning & MLOps
GCP offers an end-to-end AI/ML ecosystem via tools like Vertex AI, BigQuery, Cloud Functions, and Cloud Monitoring.
๐ง Key GCP Services for MLOps
| MLOps Phase | GCP Tool | Purpose |
|---|---|---|
| Data Storage | Cloud Storage (GCS) | Object store for datasets/models |
| Data Analysis | BigQuery, Dataflow | SQL-based analytics, streaming pipelines |
| ML Platform | Vertex AI | Unified ML lifecycle platform |
| Training | Vertex AI Training | Managed model training on CPUs/GPUs/TPUs |
| Model Registry | Vertex Model Registry | Store and manage model versions |
| Deployment | Vertex AI Endpoints | Real-time/batch model inference |
| Monitoring | Vertex AI Model Monitoring | Monitor drift, skew, performance |
| CI/CD | Cloud Build, Cloud Functions | Automate ML pipeline steps |
| Observability | Cloud Logging & Monitoring | Alerting, visualization |
| Pipelines | Vertex AI Pipelines (Kubeflow) | Orchestration of ML workflows |
| Feature Store | Vertex Feature Store | Central repository of features |
๐ GCP MLOps Workflow Overview
1. Data Storage & Preparation
-
Store raw/processed data in GCS
-
Use Dataflow or Dataprep for batch/stream processing
-
Explore data via BigQuery
2. Model Development & Training
-
Develop locally or in Vertex AI Workbench (JupyterLab)
-
Train using:
-
Vertex AI Training Jobs (managed)
-
Custom containers (e.g., with PyTorch/TensorFlow)
-
TPUs for large-scale deep learning
3. Model Evaluation & Versioning
-
Evaluate metrics post-training
-
Register model in Vertex Model Registry
-
Use Artifact Registry for Docker images
4. Model Deployment
-
Deploy to Vertex AI Endpoints:
-
Real-time predictions via REST API
-
Scalable with autoscaling/load balancing
-
For batch inference: use Batch Prediction Jobs
5. Model Monitoring
-
Vertex AI Model Monitoring handles:
-
Prediction data drift
-
Training-serving skew
-
Feature skew
-
Configure alerts to trigger on threshold breaches
-
Logs can be piped to Cloud Logging for audits
6. CI/CD Pipelines
-
Use Vertex Pipelines (Kubeflow Pipelines on GCP) to:
-
Automate training → evaluation → deployment
-
Integrate with Cloud Build for custom CI steps
# sample Kubeflow component
from kfp.dsl import component

@component
def train_model(input_data: str) -> str:
    ...
๐ GCP Monitoring Stack
| Tool | Use |
|---|---|
| Cloud Monitoring (Stackdriver) | Metrics like latency, errors, usage |
| Cloud Logging | Inference logs, pipeline status |
| Vertex Model Monitoring | Drift, skew, performance metrics |
| BigQuery | Store and analyze monitoring data |
| Grafana (via GKE) | Custom dashboards for model metrics |
๐ Model Drift Monitoring Example
Enable monitoring when deploying model:
gcloud beta ai endpoints deploy-model \
  --model=model-id \
  --display-name="drift-monitored-model" \
  --enable-access-logging \
  --enable-drift-monitoring
Set thresholds via console or REST API.
๐ Popular Use Cases on GCP
| Use Case | GCP Services |
|---|---|
| NLP model deployment | Vertex AI + Cloud Functions |
| Data pipeline with streaming | Pub/Sub + Dataflow |
| Real-time fraud detection | Vertex AI + BigQuery + Monitoring |
| Retail recommender system | Feature Store + Vertex AI + Monitoring |
๐ง Final Thoughts
| Objective | GCP Tools |
|---|---|
| Unified ML lifecycle | Vertex AI |
| Data processing | BigQuery, Dataflow |
| CI/CD | Vertex Pipelines, Cloud Build |
| Observability | Stackdriver, Vertex Monitoring |
| Custom workflows | Kubeflow, GKE, Cloud Functions |
☁️ Azure for MLOps & Machine Learning
Azure offers a comprehensive and scalable platform to manage the entire ML lifecycle — from data ingestion to deployment, monitoring, and retraining.
๐ง Key Azure MLOps Components
| MLOps Phase | Azure Tool | Purpose |
|---|---|---|
| Data Storage | Azure Blob Storage, ADLS | Store datasets, models, logs |
| Data Processing | Azure Data Factory, Synapse | ETL, big data analytics |
| ML Platform | Azure Machine Learning (Azure ML) | Unified ML development/deployment |
| Model Training | Azure ML Compute, Azure Databricks | Train with CPU, GPU, or Spark |
| Experiment Tracking | Azure ML Experiments | Track metrics, parameters, versions |
| Model Registry | Azure ML Model Registry | Central model storage |
| Deployment | Azure ML Endpoints, AKS, ACI | Real-time/batch serving |
| CI/CD Pipelines | Azure DevOps, GitHub Actions | Automate ML lifecycle |
| Monitoring | Azure Monitor, App Insights | Track performance, drift, logs |
| Feature Store (Preview) | Azure ML Feature Store | Reusable features for ML models |
๐ Azure MLOps Workflow Overview
1. Data Ingestion & Storage
-
Use Azure Data Factory or Synapse Pipelines for ingesting data
-
Store datasets in Blob Storage or ADLS Gen2
2. Data Processing & Exploration
-
Use Azure Synapse, Databricks, or Jupyter Notebooks in Azure ML workspace
-
Perform data cleaning, EDA, feature engineering
3. Model Development & Experimentation
-
Work within Azure ML Studio or integrate with VSCode
-
Use Experiment tracking to compare models across metrics & hyperparameters
-
Train models on:
-
Local or remote compute
-
AML Compute Cluster (autoscaling)
-
Databricks Spark cluster
4. Model Versioning & Registry
-
Register successful models into the Model Registry
-
Associate model with training metrics and dataset version
5. Deployment
-
Deploy models to:
-
Managed Endpoints (real-time REST APIs)
-
AKS for production-grade serving
-
ACI for testing/dev workloads
-
Batch Endpoints for offline inference
6. CI/CD with Azure DevOps
-
Use Azure DevOps Pipelines or GitHub Actions
-
Automate:
-
Data validation → model training → evaluation → deployment
-
YAML-based templates and pre-built tasks available
# Azure DevOps pipeline YAML (simplified)
trigger:
  branches:
    include: [main]
jobs:
  - job: TrainModel
    steps:
      - task: AzureMLTrain@1
        inputs:
          workspaceName: 'ml-workspace'
          experimentName: 'churn-model'
7. Monitoring & Retraining
-
Monitor using:
-
Azure Monitor for system metrics
-
Application Insights for API-level logs
-
ML Model Monitoring for data drift, concept drift, and performance
-
Set alerts and automate retraining pipelines via triggers
๐ Drift & Performance Monitoring Example
Azure ML can monitor:
-
Data drift between training & production data
-
Prediction drift and label distribution changes
-
Model performance degradation
from azureml.monitoring import ModelDataCollector
collector = ModelDataCollector("model-name", feature_names=["age", "income"])
collector.collect(data=X_inference)
๐ Integration with Azure Ecosystem
| Azure Service | Role in MLOps Pipeline |
|---|---|
| Azure DevOps | CI/CD, testing, version control |
| Azure Monitor | Real-time logging, alerting |
| Azure Kubernetes | Scalable inference serving |
| Azure Key Vault | Secure management of API keys/secrets |
| Azure Functions | Trigger retraining or workflows |
| Power BI | Visualize model outputs and predictions |
| Azure Logic Apps | No-code orchestration for alerts/retraining |
๐ Model Deployment Options
| Environment | Use Case |
|---|---|
| ACI (Azure Container Instance) | Quick testing/staging |
| AKS (Azure Kubernetes) | Scalable production |
| Local Docker | Custom environment |
| Batch Endpoints | Non-real-time inference jobs |
๐ง Azure MLOps Use Cases
| Use Case | Azure Tools |
|---|---|
| Credit risk scoring | Azure ML + DevOps + AKS |
| Demand forecasting | Azure ML Pipelines + Batch Endpoints |
| Real-time recommendation | Azure ML Endpoints + AKS |
| Automated retraining | Azure DevOps + Azure ML Triggers |
๐งฐ Comparison with Other Clouds
| Capability | Azure ML | GCP Vertex AI | AWS SageMaker |
|---|---|---|---|
| GUI & SDK Support | Strong (Studio + CLI + SDK) | Strong | Strong |
| CI/CD Pipelines | Azure DevOps, GitHub | Vertex Pipelines | SageMaker Pipelines |
| Monitoring & Drift | Azure Monitor + ML monitor | Vertex AI Monitoring | SageMaker Model Monitor |
| Feature Store | Preview | Production-ready | Production-ready |
๐ What is IAM & Access Control?
IAM (Identity and Access Management) is the framework for:
-
Identifying users, services, or machines
-
Controlling what they can access (data, services, resources)
-
Auditing and enforcing security policies
๐ก Why IAM is Critical in MLOps
| Use Case | IAM Role Needed |
|---|---|
| Data access for training | Grant access to S3, Blob, GCS buckets |
| CI/CD pipeline automation | Roles for GitHub Actions, Jenkins, or Azure DevOps |
| Model serving | Access to endpoints, containers, logging |
| Secure secrets handling | Access to key vaults or secret managers |
| Auditing & compliance | Logs of who accessed or changed models/data |
๐ IAM Across Cloud Platforms
1. AWS IAM
-
IAM Users, Groups, Roles, Policies (JSON)
-
Common roles:
-
AmazonS3FullAccess
-
AmazonSageMakerFullAccess
Custom policies with fine-grained permissions
-
Used in:
-
SageMaker Pipelines
-
Lambda, EC2, Step Functions
-
Secrets Manager for API keys
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:PutObject"],
      "Resource": "arn:aws:s3:::ml-data-bucket/*"
    }
  ]
}
2. Azure IAM (RBAC)
-
Uses Azure Active Directory (AAD) for identity
-
Role-Based Access Control (RBAC) manages access
-
Predefined roles:
-
Contributor, Reader, Owner, Azure ML Contributor
-
Custom roles for:
-
Access to storage accounts
-
Running Azure ML pipelines
-
Accessing compute targets
{
  "roleName": "CustomMLRole",
  "permissions": [
    {
      "actions": [
        "Microsoft.MachineLearningServices/*",
        "Microsoft.Storage/*/read"
      ]
    }
  ]
}
3. GCP IAM
-
Service accounts + IAM roles + resource policies
-
Predefined roles like:
-
roles/aiplatform.admin
-
roles/storage.objectViewer
-
Used in:
-
Vertex AI Pipelines
-
BigQuery, GCS
-
Secret Manager
bindings:
  - role: roles/aiplatform.user
    members:
      - serviceAccount:ml-pipeline@my-project.iam.gserviceaccount.com
๐ก️ IAM in CI/CD & MLOps
| MLOps Stage | IAM Role Needed |
|---|---|
| Data preparation | Access to datasets (S3, GCS, ADLS) |
| Model training | Access to compute, logging, secrets |
| CI/CD pipeline | GitHub Actions or Azure DevOps with scoped secrets |
| Model registry | Read/write permissions to register |
| Model deployment | Invoke permissions for endpoints |
| Monitoring | Access to logs, metrics services |
๐ Example: GitHub Actions + AWS IAM for MLOps
-
GitHub Actions deploys model to SageMaker
-
Needs limited-access IAM role
jobs:
  deploy:
    runs-on: ubuntu-latest
    # OIDC role assumption requires permission to request an ID token
    permissions:
      id-token: write
      contents: read
    steps:
      - name: Configure AWS credentials
        uses: aws-actions/configure-aws-credentials@v2
        with:
          role-to-assume: arn:aws:iam::123456789012:role/GitHub-SageMaker-Deploy
          aws-region: us-east-1
๐ Best Practices for IAM in MLOps
✅ Principle of Least Privilege
✅ Rotate credentials & use temporary tokens
✅ Use IAM Roles/Service Accounts over hardcoding credentials
✅ Enable logging & audit trails
✅ Store secrets in Key Vault/Secrets Manager
✅ Apply network policies & endpoint security
✅ Tools to Manage IAM + Secrets
| Tool | Purpose |
|---|---|
| AWS IAM + Secrets Manager | Access control & credential store |
| Azure RBAC + Key Vault | Role-based control & secrets |
| GCP IAM + Secret Manager | Fine-grained permissions & key mgmt |
| HashiCorp Vault | Cross-platform secret store |
| Kubernetes RBAC + ServiceAccounts | For model deployment and services |
Here’s a detailed comparison and usage guide for cloud storage in the MLOps and DevOps context, focusing on AWS S3 (Simple Storage Service) and GCP GCS (Google Cloud Storage).
☁️ Overview
| Feature | Amazon S3 | Google Cloud Storage (GCS) |
|---|---|---|
| Service Name | Amazon Simple Storage Service | Google Cloud Storage |
| Storage Structure | Buckets → Objects | Buckets → Objects |
| URL Format | https://s3.amazonaws.com/bucket/key | https://storage.googleapis.com/bucket/key |
| Access Control | IAM, Bucket Policies, ACLs | IAM, Uniform/Bucket-level Policies |
| Versioning | ✅ Supported | ✅ Supported |
| Encryption | SSE-S3, SSE-KMS, SSE-C | CSE, CMEK, Google-managed keys |
| Lifecycle Mgmt | ✅ (Transitions, Expiry rules) | ✅ (Rules, Policies) |
| Event Triggers | S3 Event Notifications (to Lambda, etc.) | GCS Notifications (Pub/Sub, Cloud Functions) |
๐ง Common MLOps Use Cases
| Task | How S3 / GCS Helps |
|---|---|
| Store raw training data | CSVs, JSON, Parquet in S3 or GCS |
| Save processed features | Feature store intermediates |
| Model artifacts | Store .pkl, .pt, .joblib files |
| Logging / metrics storage | Send logs or model metrics to S3/GCS |
| CI/CD pipelines | Pass artifacts between build stages |
| Model registry (if custom) | Versioned model storage |
๐ ️ How to Use in Practice
๐น AWS S3 Example (Python boto3)
import boto3
s3 = boto3.client('s3')
s3.upload_file('model.pkl', 'ml-bucket', 'models/model.pkl')
s3.download_file('ml-bucket', 'models/model.pkl', 'local_model.pkl')
Set AWS credentials using:
-
IAM role (EC2/SageMaker)
-
~/.aws/credentials file
-
AWS_ACCESS_KEY_ID & AWS_SECRET_ACCESS_KEY env vars
๐น GCS Example (Python google-cloud-storage)
from google.cloud import storage
client = storage.Client()
bucket = client.get_bucket('ml-bucket')
blob = bucket.blob('models/model.pkl')
blob.upload_from_filename('model.pkl')
blob.download_to_filename('local_model.pkl')
Set GCP credentials using:
-
GOOGLE_APPLICATION_CREDENTIALS env var with a service account key JSON
๐ Access Control Tips
| Platform | Recommendation |
|---|---|
| AWS | Use IAM roles with minimal permissions (e.g., s3:GetObject, s3:PutObject) |
| GCP | Use service accounts with specific roles like roles/storage.objectViewer or roles/storage.admin |
⏳ Lifecycle & Cost Management
Both support:
-
Object Lifecycle Rules: move to cold storage (Glacier or Nearline/Coldline)
-
Retention Policies: block deletion of data for X days
-
Auto-delete/expire rules for temp files and old models
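The rules above map to a lifecycle configuration document. A sketch of an S3 lifecycle configuration built as a plain dict (bucket name, prefixes, and day counts are illustrative); it would be applied with boto3's `put_bucket_lifecycle_configuration`:

```python
# Lifecycle config: archive old model artifacts, expire temp files.
lifecycle_config = {
    "Rules": [
        {
            "ID": "archive-old-models",
            "Filter": {"Prefix": "models/"},
            "Status": "Enabled",
            # move artifacts to Glacier after 90 days
            "Transitions": [{"Days": 90, "StorageClass": "GLACIER"}],
        },
        {
            "ID": "expire-temp-files",
            "Filter": {"Prefix": "tmp/"},
            "Status": "Enabled",
            # delete temp objects after a week
            "Expiration": {"Days": 7},
        },
    ]
}

# Applied via (requires AWS credentials, shown for context):
# boto3.client("s3").put_bucket_lifecycle_configuration(
#     Bucket="ml-bucket", LifecycleConfiguration=lifecycle_config)
```

GCS uses the same idea with a slightly different JSON shape (`action`/`condition` pairs) set via `gsutil lifecycle set` or the client library.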
๐งช CI/CD & ML Pipelines Integration
| Tool/Framework | S3 Support | GCS Support |
|---|---|---|
| SageMaker Pipelines | ✅ Native | ❌ |
| Vertex AI | ❌ | ✅ Native |
| MLflow | ✅ Via URI (s3://...) | ✅ Via URI (gs://...) |
| Airflow | ✅ | ✅ |
| Kubeflow | ✅ | ✅ |
| ZenML | ✅ | ✅ |
๐ Versioning
-
Enable versioning in both platforms to track changes:
-
In S3: Go to bucket → Enable versioning
-
In GCS: Set bucket versioning with gsutil or the Console
-
Useful for:
-
Rolling back models
-
Auditing training dataset changes
๐ก️ Security Best Practices
✅ Enable encryption (default in both)
✅ Block public access unless explicitly required
✅ Use signed URLs for limited-time sharing
✅ Monitor with logging (CloudTrail / Cloud Audit Logs)
✅ Use bucket-level access over object-level ACLs
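The signed-URL practice above works by attaching an expiry timestamp plus an HMAC signature that only the storage service can mint and verify. A simplified stdlib sketch of the idea — this is NOT AWS SigV4 or GCS's real signing scheme (real presigned URLs come from `generate_presigned_url` / `generate_signed_url`), just the underlying concept:

```python
import hashlib
import hmac
import time
from urllib.parse import urlencode

SECRET = b"server-side-signing-key"  # never shipped to clients

def sign_url(base_url, expires_in=300, now=None):
    """Attach an expiry timestamp and an HMAC over url+expiry."""
    expires = int(now if now is not None else time.time()) + expires_in
    payload = f"{base_url}?expires={expires}".encode()
    sig = hmac.new(SECRET, payload, hashlib.sha256).hexdigest()
    return f"{base_url}?{urlencode({'expires': expires, 'signature': sig})}"

def verify_url(url, now=None):
    """Reject expired links and links with a forged/tampered signature."""
    base, _, query = url.partition("?")
    params = dict(kv.split("=") for kv in query.split("&"))
    expires = int(params["expires"])
    if (now if now is not None else time.time()) > expires:
        return False  # link expired
    payload = f"{base}?expires={expires}".encode()
    expected = hmac.new(SECRET, payload, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, params["signature"])

url = sign_url("https://storage.example.com/ml-bucket/model.pkl",
               expires_in=300, now=1_700_000_000)
```

Because the secret never leaves the server, a client can share the URL for its lifetime but cannot extend the expiry or point it at another object.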
Here’s a detailed breakdown and comparison of compute services: EC2, GKE, and Lambda — often used in DevOps, MLOps, and scalable microservices environments.
๐งฎ 1. Amazon EC2 (Elastic Compute Cloud)
๐งพ What It Is:
-
Virtual machines (VMs) on demand.
-
Full control over the OS, networking, and storage.
-
Ideal for traditional apps, ML model training, hosting APIs, etc.
๐ง Typical Use Cases:
-
Host model servers (like FastAPI, Flask, or TorchServe)
-
Run batch jobs or cron scripts
-
Train ML models on GPU-enabled instances
-
Run Docker containers (via EC2 + ECS or self-managed)
✅ Pros:
-
Full flexibility (install anything)
-
Scalable (manual or via Auto Scaling Groups)
-
GPU support for ML workloads
⚠️ Cons:
-
Must manage patching, scaling, security
-
Pricing can rise with high uptime
๐ ️ Infra-as-Code Example (Terraform):
resource "aws_instance" "ml_server" {
  ami           = "ami-xxxxxxxx"
  instance_type = "t2.medium"
  key_name      = "your-key"
}
๐งฑ 2. GKE (Google Kubernetes Engine)
๐งพ What It Is:
-
Fully managed Kubernetes (K8s) on Google Cloud.
-
Run containerized apps with built-in scaling, networking, and storage.
๐ง Typical Use Cases:
-
Run microservices (REST APIs, background jobs)
-
Deploy ML inference servers (TF Serving, Triton, custom Flask apps)
-
Deploy ML pipelines (Kubeflow, TFX, MLflow)
✅ Pros:
-
Autoscaling, self-healing pods
-
Native integration with GCP services (BigQuery, GCS, Vertex AI)
-
CI/CD with GitHub/GitLab + Cloud Build or ArgoCD
⚠️ Cons:
-
Requires Kubernetes knowledge
-
Slightly higher learning curve
๐ Useful Tools with GKE:
-
Kubeflow: ML pipelines
-
Argo Workflows: CI/CD or ML pipeline orchestration
-
Istio/Envoy: Service mesh, secure traffic
⚡ 3. AWS Lambda (Serverless)
๐งพ What It Is:
-
Run backend functions in response to events (e.g., S3 upload, HTTP requests, cron).
-
You pay only for compute time used (in milliseconds).
๐ง Typical Use Cases:
-
ML inference for light models
-
Trigger model retraining when new data arrives in S3
-
ETL/ELT tasks on demand
-
Webhook receivers, alert systems
✅ Pros:
-
Zero server management
-
Auto-scaling, highly cost-efficient
-
Works with other AWS services (S3, SNS, DynamoDB)
⚠️ Cons:
-
Limited runtime (max 15 min)
-
Cold start latency (~1s for some languages)
-
Not suitable for large ML models unless optimized
๐ Example:
# handler.py
def lambda_handler(event, context):
    return {"message": "Hello from Lambda!"}
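Because the handler is plain Python, it can be smoke-tested locally before packaging. A quick sketch (the sample event dict and `None` context stand in for what AWS would pass at invocation time):

```python
# handler.py
def lambda_handler(event, context):
    return {"message": "Hello from Lambda!"}

# Local smoke test: invoke the handler the way AWS would,
# with a sample event and no real context object.
response = lambda_handler({"source": "local-test"}, None)
print(response)
```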
Deploy via:
-
AWS Console
-
AWS SAM / Serverless Framework
-
Terraform
๐ Summary Table
| Feature | EC2 | GKE | Lambda |
|---|---|---|---|
| Type | VM | Managed Kubernetes | Serverless Functions |
| Ideal for | ML training/inference | Scalable microservices, ML pipelines | Lightweight functions, events |
| Scaling | Manual / Auto Scaling | Horizontal Pod Autoscaler | Automatic |
| OS Control | Full | Limited to container OS | None |
| Cold Start | No | No | Yes |
| Pricing | Per hour/second | Per node/hour | Per request (ms-based) |
| Infra-as-Code Tools | Terraform, CloudFormation | Terraform, Helm | SAM, Serverless, Terraform |
| Docker Support | Manual via ECS or EKS | Native | Limited (via container Lambda) |
| GPU Support | ✅ Yes | ✅ (with node pools) | ⚠️ Not natively supported |
๐ก Best Practice Guidance
| Scenario | Recommended Compute |
|---|---|
| Model training with GPU | EC2 or GKE (with GPU nodes) |
| Real-time API with low traffic | Lambda or Cloud Functions |
| Batch data processing | Lambda, EC2, or GKE Jobs |
| Large model inference | EC2 or GKE |
| Scalable web app | GKE |
| Orchestrating ML workflows | GKE (Kubeflow, Argo) |
Here’s a concise yet detailed comparison of the major AutoML platforms across AWS, GCP, and Azure: SageMaker Autopilot, Vertex AI, and Azure AutoML — all used for automating ML workflows including preprocessing, training, tuning, and deployment.
๐ Overview Table: AutoML Comparison
| Feature | SageMaker Autopilot (AWS) | Vertex AI AutoML (GCP) | Azure AutoML |
|---|---|---|---|
| Language Support | Python (via SDK, Boto3) | Python (via SDK, REST) | Python (AzureML SDK) |
| UI Available | ✅ SageMaker Studio | ✅ Vertex AI Console | ✅ Azure Studio |
| Model Explainability | ✅ SHAP built-in | ✅ Integrated | ✅ Built-in with visual UI |
| Custom Code Injection | ✅ Custom containers | ⚠️ Limited | ✅ Supported via pipelines |
| Model Deployment | ✅ One-click to endpoint | ✅ Deploy to prediction service | ✅ Deploy to AKS or endpoint |
| Model Type Coverage | Classification, Regression | Vision, Text, Tabular, Forecast | Tabular, Time series, NLP |
| Integration with MLOps | ✅ SageMaker Pipelines | ✅ Vertex AI Pipelines | ✅ Azure ML Pipelines |
| Pricing | Pay-per-job + compute | Pay-per-job + compute | Pay-per-run + compute |
๐ง SageMaker Autopilot (AWS)
✅ Highlights:
-
Input: CSVs or data in S3
-
Handles feature engineering, model tuning, evaluation
-
Gives Jupyter Notebooks of every step (transparency)
-
Easily integrated with SageMaker Pipelines + Endpoints
๐งช Example:
from sagemaker import AutoML

# role: an existing SageMaker execution role ARN
automl = AutoML(
    role=role,
    target_attribute_name="target",
    output_path="s3://my-bucket/output"
)
automl.fit(inputs="s3://my-bucket/input")
๐ Vertex AI AutoML (GCP)
✅ Highlights:
-
Unified with BigQuery, Cloud Storage, Looker
-
Supports Tabular, Text, Vision, and Forecasting
-
Strong low-code/no-code workflow
-
Built-in model evaluation and deploy
๐งช Sample Flow:
-
Upload dataset via console or Python
-
Click “Train New Model”
-
Set target + training options
-
Deploy or export model
Code Example:
from google.cloud import aiplatform

aiplatform.init(project="my-project", location="us-central1")
job = aiplatform.AutoMLTabularTrainingJob(
    display_name="my-tabular-model",
    optimization_prediction_type="classification",
)
# run() trains and returns the resulting Model resource
model = job.run(dataset=my_dataset, target_column="label")
๐ฌ Azure ML AutoML
✅ Highlights:
-
Integrates with Azure Data Factory, Databricks
-
Offers rich UI + code-first SDK
-
Visual model explanations and fairness analysis
-
Deployment to AKS or managed endpoints
๐งช Code Example:
from azureml.train.automl import AutoMLConfig
from azureml.core.experiment import Experiment

automl_config = AutoMLConfig(
    task='classification',
    primary_metric='AUC_weighted',
    training_data=dataset,
    label_column_name='target',
    iterations=20,
)
# ws: an existing azureml.core.Workspace
experiment = Experiment(ws, "automl-exp")
run = experiment.submit(automl_config)
๐งฐ Use Case Recommendations
| Use Case | Recommended Platform |
|---|---|
| AWS-centric pipeline (S3, Athena) | SageMaker Autopilot |
| GCP-first stack (BigQuery, GCS) | Vertex AI |
| Enterprise + UI-driven | Azure AutoML |
| High control over pipeline steps | Azure/SageMaker |
| Forecasting/Time Series | Vertex AI or Azure |
| Vision/NLP | Vertex AI |
๐ Advanced Add-ons
| Feature | SageMaker Autopilot | Vertex AI AutoML | Azure AutoML |
|---|---|---|---|
| SHAP Explanations | ✅ Yes | ✅ Yes | ✅ Yes |
| Custom Pipelines | ✅ via SageMaker Pipeline | ✅ Vertex AI Pipelines | ✅ via Azure Pipelines |
| Hyperparameter Tuning | ✅ Bayesian search | ✅ Auto-tuning | ✅ Bayesian + Bandit |
| Auto-deploy Models | ✅ Yes | ✅ Yes | ✅ Yes |