MLOps - III
8. CI/CD for ML
What is CI/CD?
| Acronym | Meaning |
|---|---|
| CI | Continuous Integration |
| CD | Continuous Delivery or Continuous Deployment |
CI/CD automates the process of building, testing, and deploying applications to reduce manual work, improve consistency, and speed up delivery cycles.
✅ 1. Continuous Integration (CI)
Goal:
Automatically integrate code from multiple developers, test it, and detect errors early.
Typical Steps in CI:
- Developer pushes code to GitHub/GitLab/Bitbucket
- CI pipeline triggers:
  - Run unit tests
  - Run linting/formatting (e.g., flake8, black)
  - Build application artifacts
  - Generate reports (e.g., test coverage)
Tools:
- GitHub Actions
- GitLab CI
- Jenkins
- CircleCI
- Travis CI
✅ 2. Continuous Delivery (CD)
Goal:
Automatically prepare the application to be deployed in a staging or production environment — but with manual approval for final deployment.
Steps:
- All CI steps
- Deploy to staging
- Run integration tests
- Wait for approval → deploy to production
✅ 3. Continuous Deployment (CD)
Goal:
Fully automate build → test → production deployment with no human approval step.
This is riskier, but well suited to small, frequent releases when tests are reliable.
CI/CD Pipeline Example (ML App)
1. Code pushed to GitHub → triggers pipeline
2. Environment setup
3. Code linting & formatting
4. Unit & model testing
5. Train model (optionally)
6. Store model artifact (e.g., in S3 or MLflow)
7. Build Docker image
8. Deploy to staging or production (e.g., via Kubernetes)
Common CI/CD Tools in MLOps
| Tool | Use |
|---|---|
| GitHub Actions | Git-based CI/CD |
| GitLab CI | Full Git + CI/CD integration |
| Jenkins | Flexible, customizable pipelines |
| ArgoCD | Kubernetes-native CD |
| Tekton | Kubernetes-native CI/CD |
| MLflow / DVC | Model versioning/artifacts |
| Docker + K8s | Containerized deployment |
Why Is CI/CD Important in MLOps?
- Keeps models reproducible
- Automates testing of data pipelines
- Ensures consistent deployment of models
- Avoids "it worked on my machine" issues
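To make the "automates testing of data pipelines" point concrete, here is a toy check of the kind a CI job would run via pytest. The `clean_records` function and its field names are hypothetical, invented for illustration only:

```python
# Hypothetical data-pipeline step plus the kind of unit test a CI job runs.
def clean_records(records):
    """Drop records with missing values and normalize field names."""
    cleaned = []
    for rec in records:
        if any(v is None for v in rec.values()):
            continue  # skip incomplete rows
        cleaned.append({k.strip().lower(): v for k, v in rec.items()})
    return cleaned

def test_clean_records():
    raw = [
        {" Age ": 25, "Income": 50_000},
        {" Age ": None, "Income": 60_000},  # incomplete: should be dropped
    ]
    out = clean_records(raw)
    assert len(out) == 1
    assert out[0] == {"age": 25, "income": 50_000}

test_clean_records()
```

In a CI pipeline this file would simply live under `tests/` and be picked up by a `pytest` step, failing the build if the pipeline logic regresses.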
GitHub Actions
GitHub Actions is a CI/CD (Continuous Integration and Continuous Deployment) tool built into GitHub. It allows you to automate workflows such as building, testing, and deploying code when certain events occur in your repository (like push, pull request, etc.).
Common Use Cases
- CI/CD pipelines (build, test, deploy code)
- Linting and formatting
- Running cron jobs
- Publishing packages
- Automating issues, PRs, labels, etc.
Basic Structure of GitHub Actions
You define workflows using YAML inside the .github/workflows/ folder of your repository.
Example:
```yaml
# .github/workflows/nodejs.yml
name: Node.js CI

on:
  push:
    branches: [ main ]
  pull_request:
    branches: [ main ]

jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Setup Node.js
        uses: actions/setup-node@v4
        with:
          node-version: '20'
      - name: Install dependencies
        run: npm install
      - name: Run tests
        run: npm test
```
⚙️ Key Components
| Component | Description |
|---|---|
| `on` | Triggers (e.g., push, pull_request, schedule) |
| `jobs` | A collection of tasks to run |
| `runs-on` | Environment (e.g., ubuntu-latest) |
| `steps` | Individual commands or actions |
| `uses` | Reusable actions (like actions/checkout) |
| `run` | Shell commands |
✅ Example for Python Project
```yaml
name: Python CI

on: [push]

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: '3.11'
      - name: Install dependencies
        run: |
          python -m pip install --upgrade pip
          pip install -r requirements.txt
      - name: Run tests
        run: pytest
```
Popular Actions
| Action | Purpose |
|---|---|
| `actions/checkout` | Check out repo code |
| `actions/setup-node` | Set up Node.js |
| `actions/setup-python` | Set up Python |
| `docker/build-push-action` | Build & push Docker image |
| `github/super-linter` | Code linting |
Advanced Features
- Matrix builds (test on multiple environments)
- Secrets (store API keys securely)
- Reusable workflows via workflow_call
- Artifacts (store and share test reports, build files, etc.)
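The matrix-build feature mentioned above can be sketched as a workflow fragment; the job name and Python versions here are illustrative, not from any project in these notes:

```yaml
# Sketch: run the same test job across several Python versions
jobs:
  test:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        python-version: ['3.10', '3.11', '3.12']
    steps:
      - uses: actions/checkout@v3
      - uses: actions/setup-python@v4
        with:
          python-version: ${{ matrix.python-version }}
      - run: pip install -r requirements.txt
      - run: pytest
```

Each matrix entry becomes its own job run, so a failure on one Python version is reported independently of the others.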
GitLab CI/CD
GitLab CI/CD is GitLab’s built-in continuous integration and deployment system. Like GitHub Actions, it lets you automate build, test, and deployment pipelines, but it is more tightly integrated into the GitLab platform.
Core Concept: .gitlab-ci.yml
The pipeline is defined in a .gitlab-ci.yml file in the root of your repository.
✅ Simple Example
```yaml
stages:
  - build
  - test
  - deploy

build_job:
  stage: build
  script:
    - echo "Compiling the code..."
    - make

test_job:
  stage: test
  script:
    - echo "Running tests..."
    - make test

deploy_job:
  stage: deploy
  script:
    - echo "Deploying application..."
    - make deploy
  only:
    - main
```
Key Components
| Component | Description |
|---|---|
| `stages` | The pipeline flow (e.g., build → test → deploy) |
| `jobs` | Each job runs a script and belongs to a stage |
| `script` | Shell commands the job will execute |
| `only` / `except` | Control when the job runs (e.g., only on main) |
| `tags` | Used to target specific GitLab Runners |
Common Features
- Built-in Docker support for containerized pipelines
- Manual jobs for approval steps
- Artifacts and caching for build outputs or dependencies
- Environment variables & secrets
- Parallel/matrix jobs
- Triggering other pipelines
- Private or shared runners
Python Example
```yaml
image: python:3.11

stages:
  - test

test:
  stage: test
  script:
    - pip install -r requirements.txt
    - pytest
```
Docker + GitLab CI Example
```yaml
image: docker:latest

services:
  - docker:dind

stages:
  - build

build:
  stage: build
  script:
    - docker build -t myapp:latest .
```
Using Secrets (CI/CD Variables)
Set them in GitLab → Project Settings → CI/CD → Variables, then reference them in your script:
```yaml
script:
  - echo "$SECRET_KEY"
```
Deployment Example with SSH
```yaml
deploy:
  stage: deploy
  script:
    - ssh user@your-server 'cd /var/www/app && git pull && systemctl restart app'
  only:
    - main
```
✳ Comparison with GitHub Actions
| Feature | GitLab CI | GitHub Actions |
|---|---|---|
| Config File | .gitlab-ci.yml | .github/workflows/*.yml |
| Built-in Docker | ✅ Native | ✅ With setup |
| Matrix Build | ✅ Via parallel | ✅ With matrix |
| Community Marketplace | ✅ (less extensive) | ✅ Huge marketplace |
| Integrated UI | Deeply built-in | More plug & play |
In CI/CD, artifacts are files generated during a pipeline run that you want to save, archive, or pass to later stages—like test reports, build outputs, or deployment packages.
Both GitLab CI and GitHub Actions support artifacts, but their usage and syntax differ.
GitLab CI: Artifacts
Basic Usage
```yaml
build_job:
  stage: build
  script:
    - make build
  artifacts:
    paths:
      - build/
```
This saves the build/ folder after build_job runs. These artifacts:
- Are downloadable from the GitLab UI
- Can be passed to later stages (unless expire_in removes them)
With Expiration and Custom Settings
```yaml
test_job:
  stage: test
  script:
    - pytest --junitxml=report.xml
  artifacts:
    paths:
      - report.xml
    expire_in: 1 week
    reports:
      junit: report.xml
```
Key fields:
| Field | Purpose |
|---|---|
| `paths` | Files or directories to save |
| `expire_in` | Auto-delete time (e.g., 1 day, 1 week) |
| `reports` | Special-format reports such as junit, coverage, etc. |
Passing Artifacts to the Next Stage
Artifacts are automatically passed to jobs in later stages, not within the same stage.
```yaml
stages:
  - build
  - test

build:
  stage: build
  script:
    - make build
  artifacts:
    paths:
      - build/

test:
  stage: test
  script:
    - ./test-runner build/
```
GitHub Actions: Artifacts
Save Artifacts
```yaml
- name: Upload build output
  uses: actions/upload-artifact@v4
  with:
    name: build-artifact
    path: build/
```
Download in Another Job
```yaml
- name: Download artifact
  uses: actions/download-artifact@v4
  with:
    name: build-artifact
```
You must split upload and download into separate jobs to share artifacts between them.
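Putting the two snippets together, a sketch of a full workflow with an artifact handoff might look like this; the `make build` and `./test-runner` commands are placeholders for your own build and test steps:

```yaml
# Sketch: build job uploads an artifact, test job downloads it
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - run: make build            # placeholder build command
      - uses: actions/upload-artifact@v4
        with:
          name: build-artifact
          path: build/
  test:
    needs: build                   # runs after build, so the artifact exists
    runs-on: ubuntu-latest
    steps:
      - uses: actions/download-artifact@v4
        with:
          name: build-artifact
          path: build/
      - run: ./test-runner build/  # placeholder test command
```

The `needs: build` line is what enforces the ordering; without it the two jobs would run in parallel and the download would fail.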
What is Jenkins?
Jenkins is an open-source automation server widely used for CI/CD pipelines. It lets you automate building, testing, and deploying applications through pipelines (typically defined in a Jenkinsfile).
Key Concepts
| Concept | Description |
|---|---|
| Job | A build configuration (freestyle or pipeline) |
| Pipeline | Scripted or declarative workflow for CI/CD |
| Agent | A machine (or container) where jobs run |
| Stage | A high-level step (e.g., Build, Test) |
| Step | A single task inside a stage (e.g., shell command) |
| Node | A Jenkins worker (agent) that executes pipelines |
Sample Jenkinsfile (Declarative Pipeline)
```groovy
pipeline {
    agent any

    environment {
        MY_ENV_VAR = 'value'
    }

    stages {
        stage('Build') {
            steps {
                echo 'Building the project...'
                sh 'make build'
            }
        }
        stage('Test') {
            steps {
                echo 'Running tests...'
                sh 'make test'
            }
        }
        stage('Deploy') {
            when {
                branch 'main'
            }
            steps {
                echo 'Deploying to production...'
                sh './deploy.sh'
            }
        }
    }

    post {
        always {
            echo 'Pipeline finished.'
        }
        failure {
            echo 'Pipeline failed!'
        }
    }
}
```
Artifacts in Jenkins
To store and archive files like build outputs or test results:
```groovy
post {
    success {
        archiveArtifacts artifacts: 'build/*.jar', fingerprint: true
    }
}
```
To publish test results:
```groovy
post {
    always {
        junit 'reports/**/*.xml'
    }
}
```
Jenkins Plugins You’ll Need
| Plugin Name | Purpose |
|---|---|
| Pipeline | Enables pipeline-as-code |
| Git | Checkout from Git repositories |
| JUnit | Test reporting |
| Docker Pipeline | Build & run Docker in pipeline |
| Credentials Binding | Secure secret handling |
| SSH | Remote deployments |
| Blue Ocean | Modern UI for pipelines |
Jenkins with Docker
```groovy
pipeline {
    agent {
        docker {
            image 'python:3.11'
            args '-v /var/run/docker.sock:/var/run/docker.sock'
        }
    }
    stages {
        stage('Install') {
            steps {
                sh 'pip install -r requirements.txt'
            }
        }
        stage('Test') {
            steps {
                sh 'pytest'
            }
        }
    }
}
```
Secrets in Jenkins
- Store credentials in Manage Jenkins → Credentials
- Use them in a pipeline:
```groovy
withCredentials([string(credentialsId: 'MY_SECRET_ID', variable: 'MY_SECRET')]) {
    sh 'echo $MY_SECRET'
}
```
What is CircleCI?
CircleCI is a modern cloud-native CI/CD platform known for speed, flexibility, and Docker-first support. It automates building, testing, and deploying your code every time you commit changes.
Config File: .circleci/config.yml
CircleCI uses a YAML file stored in the .circleci/ folder of your repo.
✅ Minimal Example (Node.js)
```yaml
version: 2.1

jobs:
  build:
    docker:
      - image: cimg/node:20.4
    steps:
      - checkout
      - run: npm install
      - run: npm test

workflows:
  build_and_test:
    jobs:
      - build
```
Key Components
| Component | Description |
|---|---|
| `version` | CircleCI configuration version (use 2.1+) |
| `jobs` | Group of steps to run (build/test/deploy) |
| `steps` | Commands in a job (e.g., checkout, run) |
| `workflows` | Defines job orchestration (sequential/parallel) |
| `executors` | Runtime environment (Docker, machine, macOS) |
Docker Support Example
```yaml
jobs:
  build:
    docker:
      - image: cimg/python:3.11
    steps:
      - checkout
      - run:
          name: Install dependencies
          command: pip install -r requirements.txt
      - run:
          name: Run tests
          command: pytest
```
Artifacts in CircleCI
Artifacts are files saved from a job (e.g., logs, coverage reports).
Upload Artifacts
```yaml
- store_artifacts:
    path: test-results/
    destination: test-results
```
Test Reports
```yaml
- store_test_results:
    path: test-results
```
You can see artifacts and test results in the CircleCI UI after job execution.
Environment Variables & Secrets
- Define them via CircleCI Project Settings → Environment Variables
- Reference them directly in your run commands:
```yaml
- run: echo $MY_SECRET_TOKEN
```
Advanced Features
| Feature | Example |
|---|---|
| Workflows | Run jobs in parallel or sequentially |
| Conditional steps | Use when and unless |
| Caching | Speed up builds using save_cache / restore_cache |
| Reusable configs | commands, executors, orbs |
| Matrix builds | Run tests against multiple language versions |
⚙️ Caching Example
```yaml
- restore_cache:
    keys:
      - v1-deps-{{ checksum "package-lock.json" }}
- run: npm install
- save_cache:
    paths:
      - node_modules
    key: v1-deps-{{ checksum "package-lock.json" }}
```
CircleCI vs GitHub Actions vs GitLab CI vs Jenkins
| Feature | CircleCI | GitHub Actions | GitLab CI | Jenkins |
|---|---|---|---|---|
| Hosted | ✅ Yes | ✅ Yes | ✅ Yes | ❌ Self-hosted |
| Docker-native | ✅ Strong | ✅ Good | ✅ Strong | ✅ With config |
| Config as Code | ✅ .yml | ✅ .yml | ✅ .yml | ✅ Groovy DSL |
| Marketplace | ✅ Orbs | ✅ Actions | ⚠️ Few | ✅ Plugins |
| Matrix builds | ✅ Built-in | ✅ Supported | ✅ Parallel jobs | ✅ Scripted |
What is Amazon SageMaker Pipelines?
SageMaker Pipelines is Amazon's CI/CD service for machine learning workflows. It lets you build, automate, and manage ML workflows (like data prep, training, tuning, evaluation, and deployment) using a Python SDK.
It’s similar to Kubeflow Pipelines or Airflow but tightly integrated into AWS SageMaker.
⚙️ Typical Use Case: End-to-End ML Workflow
[Data Prep] → [Feature Engineering] → [Model Training] → [Model Evaluation] → [Model Registration] → [Deployment]
Basic Structure Using the SageMaker Python SDK
```python
from sagemaker.workflow.pipeline import Pipeline
from sagemaker.workflow.steps import ProcessingStep, TrainingStep
from sagemaker.workflow.model_step import ModelStep
from sagemaker.workflow.parameters import ParameterString, ParameterFloat
from sagemaker.processing import ScriptProcessor
from sagemaker.estimator import Estimator
from sagemaker.workflow.pipeline_context import PipelineSession
```
✅ Example: Full ML Pipeline
```python
from sagemaker.workflow.parameters import ParameterString
from sagemaker.workflow.steps import ProcessingStep, TrainingStep
from sagemaker.workflow.pipeline import Pipeline
from sagemaker.processing import ScriptProcessor
from sagemaker.estimator import Estimator
from sagemaker.workflow.pipeline_context import PipelineSession
import sagemaker

# Setup
region = sagemaker.Session().boto_region_name
role = sagemaker.get_execution_role()
pipeline_session = PipelineSession()

# Parameters
input_data = ParameterString(name="InputData", default_value="s3://my-bucket/input.csv")

# Step 1: Preprocessing
processor = ScriptProcessor(
    image_uri=sagemaker.image_uris.retrieve("sklearn", region),
    command=["python3"],
    role=role,
    instance_count=1,
    instance_type="ml.m5.large",
)

processing_step = ProcessingStep(
    name="DataPreprocessing",
    processor=processor,
    inputs=[input_data],
    code="preprocess.py",
    outputs=[...],
)

# Step 2: Training
estimator = Estimator(
    image_uri=sagemaker.image_uris.retrieve("xgboost", region),
    role=role,
    instance_count=1,
    instance_type="ml.m5.xlarge",
    output_path="s3://my-bucket/model/",
)

training_step = TrainingStep(
    name="ModelTraining",
    estimator=estimator,
    inputs={"train": processing_step.properties.ProcessingOutputConfig.Outputs["train_data"].S3Output.S3Uri},
)

# Pipeline Definition
pipeline = Pipeline(
    name="MyMLPipeline",
    parameters=[input_data],
    steps=[processing_step, training_step],
    sagemaker_session=pipeline_session,
)

pipeline.upsert(role_arn=role)
execution = pipeline.start()
```
Key Components of SageMaker Pipelines
| Component | Purpose |
|---|---|
| `ProcessingStep` | Data cleaning, feature engineering, etc. |
| `TrainingStep` | Model training using an Estimator |
| `TransformStep` | Batch inference |
| `ConditionStep` | Add logic based on metrics |
| `ModelStep` | Register model to the Model Registry |
| `CallbackStep` | Integrate with Lambda/custom logic |
| `ParameterString` / `ParameterFloat` | Dynamically pass pipeline inputs |
| `PipelineSession` | Manages interaction with SageMaker |
Benefits
✅ Managed service – no servers to manage
✅ Trackable runs with versioning, lineage, and metadata
✅ Built-in CI/CD for ML
✅ Integration with SageMaker Experiments, Model Registry, and Feature Store
✅ Scalable with on-demand compute and built-in retry logic
Real-World Example Flow
1. Ingest raw CSV from S3
2. Clean & split data (ProcessingStep)
3. Train XGBoost or sklearn model (TrainingStep)
4. Evaluate accuracy, F1 score (ConditionStep)
5. If metrics are good → register model (ModelStep)
6. Deploy to endpoint via Lambda or manual
Related AWS Services
| Service | Purpose |
|---|---|
| S3 | Data input/output |
| SageMaker Studio | GUI for pipelines |
| SageMaker Feature Store | Feature engineering |
| Model Registry | Version & track models |
| Lambda / Step Functions | Extend logic or trigger deployment |
| CloudWatch | Logging & monitoring |
What is ZenML?
ZenML is an open-source MLOps framework built to orchestrate reproducible ML pipelines across tools like MLflow, Airflow, Kubernetes, and SageMaker.
✅ Features:
- Tool-agnostic: plug in TensorFlow, PyTorch, sklearn, etc.
- Built-in support for MLflow, Weights & Biases, GCP, AWS, Kubernetes
- Focus on pipelines, reproducibility, modularity
- Developer-friendly CLI + Python SDK
ZenML Pipeline Example (sketch; the `...` bodies are elided, and the exact decorator import path depends on your ZenML version):
```python
from typing import Any

import pandas as pd
from zenml import pipeline, step  # import path for recent ZenML releases

@step
def ingest_data() -> pd.DataFrame:
    ...

@step
def train_model(data: pd.DataFrame) -> Any:
    ...

@pipeline
def training_pipeline(data_loader, trainer):
    data = data_loader()
    model = trainer(data)

pipeline = training_pipeline(ingest_data, train_model)
pipeline.run()
```
ZenML separates your pipeline into clean steps and supports plugins to execute on local, Kubeflow, Airflow, Vertex AI, etc.
What is TFX (TensorFlow Extended)?
TFX is Google's official end-to-end platform for deploying TensorFlow models in production. It was built to meet internal Google ML production needs.
✅ Features:
- Native integration with the TensorFlow ecosystem
- Standard components: ExampleGen, Trainer, Evaluator, Pusher, etc.
- Works with Apache Beam, Kubeflow Pipelines, Airflow
- Focuses heavily on data validation, model analysis, and serving
TFX Pipeline Example (sketch; Trainer and Pusher arguments are elided):
```python
from tfx.orchestration import pipeline
from tfx.orchestration.local.local_dag_runner import LocalDagRunner
from tfx.components import CsvExampleGen, Trainer, Pusher

example_gen = CsvExampleGen(input_base='data/')
trainer = Trainer(...)
pusher = Pusher(...)

my_pipeline = pipeline.Pipeline(
    pipeline_name='my_pipeline',
    pipeline_root='pipelines/',
    components=[example_gen, trainer, pusher],
)

LocalDagRunner().run(my_pipeline)
```
TFX enforces TensorFlow-specific best practices for data quality, model performance, and deployment.
ZenML vs TFX: Feature Comparison
| Feature | ZenML | TFX |
|---|---|---|
| Language | Python (framework-agnostic) | Python (TensorFlow-focused) |
| ML Framework Support | TensorFlow, PyTorch, sklearn, etc. | TensorFlow only |
| Component Modularity | Highly modular + customizable | Modular (TensorFlow-centric) |
| Orchestrators | Airflow, Kubeflow, MLflow, Prefect | Airflow, Kubeflow |
| Deployment Support | SageMaker, Vertex AI, KServe | TensorFlow Serving, Vertex AI |
| Visualization / Metadata | MLflow, W&B, ZenML UI | TensorBoard, TFX Metadata |
| Pipeline Reproducibility | ✅ Yes | ✅ Yes |
| Local Execution | ✅ Yes | ✅ Yes |
| Ease of Use | Beginner-friendly | More complex, steeper learning curve |
When to Use What?
| Scenario | Use |
|---|---|
| Want a framework-agnostic, modular, easy-to-adopt pipeline | ✅ ZenML |
| Already using TensorFlow and want to follow best practices | ✅ TFX |
| Need to plug into SageMaker, MLflow, K8s, etc. | ✅ ZenML |
| Need advanced model validation, explainability, data skew detection | ✅ TFX |
TL;DR
| ZenML | TFX |
|---|---|
| Flexible, lightweight, easy to start | Powerful, opinionated, deep TensorFlow support |
| Works with any ML/DL framework | TensorFlow-only |
| Ideal for hybrid/multi-cloud & plug-n-play MLOps | Ideal for enterprise-grade TensorFlow pipelines |
9. Monitoring and Logging
What is Drift in Machine Learning?
In production ML, drift refers to changes over time in the data or relationships that the model depends on, which can lead to reduced model accuracy.
There are two main types:
1. Data Drift (a.k.a. Covariate Shift)
Definition:
The distribution of input features (X) changes over time, but the relationship between input and output (P(y|x)) remains the same.
Example:
- A credit scoring model was trained on users from India, but it's now being used in the US.
- Feature distributions such as age, income, or credit history change → data drift.
Detection Methods:
- Statistical tests (e.g., the Kolmogorov-Smirnov test)
- Population Stability Index (PSI)
- Earth Mover's Distance
- Histograms & density plots
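Of the detection methods listed above, PSI is simple enough to sketch in plain Python. Bin edges come from the reference sample; the `1e-6` floor for empty bins and the 0.2 alert threshold are common conventions, not fixed rules:

```python
import math

def psi(expected, actual, bins=10):
    """Population Stability Index between a reference sample and a live sample."""
    lo, hi = min(expected), max(expected)
    edges = [lo + (hi - lo) * i / bins for i in range(1, bins)]

    def fractions(values):
        counts = [0] * bins
        for v in values:
            idx = sum(v > e for e in edges)  # which bin this value falls into
            counts[idx] += 1
        # small floor avoids log(0) for empty bins
        return [max(c / len(values), 1e-6) for c in counts]

    p, q = fractions(expected), fractions(actual)
    return sum((pi - qi) * math.log(pi / qi) for pi, qi in zip(p, q))

# Identical samples give PSI near 0; a common rule of thumb flags PSI > 0.2
ref = [i / 100 for i in range(100)]
assert psi(ref, ref) < 1e-6
assert psi(ref, [x + 0.5 for x in ref]) > 0.2
```

In production you would run this per feature against the training-time distribution and alert when any feature crosses the threshold.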
2. Model Drift (a.k.a. Concept Drift)
Definition:
The relationship between input and target variable (P(y|x)) changes over time, even if input distribution remains stable.
Example:
- A fraud detection model where fraudster behavior evolves (e.g., new tactics)
- The model can no longer accurately map inputs to the correct outcome → model drift.
Detection Methods:
- Monitoring model performance metrics (e.g., accuracy, AUC, F1)
- If model metrics drop but input features haven't changed → model drift
- Concept drift detectors such as:
  - DDM (Drift Detection Method)
  - ADWIN
  - Kullback-Leibler divergence
Drift Comparison
| Aspect | Data Drift | Model Drift |
|---|---|---|
| What changes | Input features distribution (X) | Relationship between X and Y |
| Impact | Can indirectly reduce accuracy | Directly affects model accuracy |
| Detection | PSI, KS test, histograms | Drop in model performance |
| Remediation | Retrain with recent data | Retrain + re-define model logic |
Common Causes of Drift
| Cause | Type |
|---|---|
| Seasonality or time-based shifts | Data Drift |
| Change in user behavior | Model Drift |
| External events (e.g., pandemic) | Both |
| Sensor recalibration or software upgrades | Data Drift |
How to Monitor & Handle Drift
1. Monitoring Tools
- Evidently AI – open-source drift detection (https://evidentlyai.com/)
- WhyLabs, Arize AI, Fiddler, SageMaker Model Monitor
- Custom dashboards with Prometheus/Grafana
2. Detection Frequency
- Daily/weekly batch comparisons
- Real-time if using streaming
3. Actions to Take
- Trigger retraining pipelines
- Use drift detectors in CI/CD workflows
- Incorporate active learning or online learning
Summary
| Term | What is it? | Why it matters |
|---|---|---|
| Data Drift | Input feature distribution changes | Model may make wrong inferences |
| Model Drift | Relationship between X and Y changes | Model becomes inaccurate |
What is Model Performance Monitoring?
Model performance monitoring is the process of tracking, measuring, and analyzing how your ML model behaves in production — ensuring it's still accurate, fair, and reliable after deployment.
Why Is It Important?
Even the best model at training time can degrade in production due to:
- Data drift
- Model drift
- Feature pipeline bugs
- Feedback loops or changing real-world patterns
Without monitoring, you might miss silent failures that hurt business outcomes.
What to Monitor in ML Systems
✅ 1. Performance Metrics
| Metric Type | Example |
|---|---|
| Classification | Accuracy, Precision, Recall, F1, AUC |
| Regression | RMSE, MAE, R² |
| Ranking | MAP, NDCG |
| Business KPIs | Conversion rate, CTR, etc. |
Compare training vs validation vs production performance.
✅ 2. Data Quality & Drift
| What to check | How |
|---|---|
| Missing values | Feature-level monitoring |
| Schema violations | Type, range, shape |
| Data drift | PSI, KS Test |
| Outliers or anomalies | Z-score, IQR, Mahalanobis |
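The z-score check from the table above can be sketched in plain Python; the 3-standard-deviation threshold is a common default, not a universal rule:

```python
import math

def zscore_outliers(values, threshold=3.0):
    """Return indices of values more than `threshold` standard deviations from the mean."""
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    if std == 0:
        return []  # constant data: nothing can be an outlier
    return [i for i, v in enumerate(values) if abs(v - mean) / std > threshold]

data = [10.0] * 30 + [10.2, 9.8, 95.0]  # one obvious anomaly at the end
assert zscore_outliers(data) == [32]
```

IQR-based fences are more robust when extreme values inflate the mean and standard deviation themselves, which is why the table lists both.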
✅ 3. Prediction Distribution
- Is the model outputting the same predictions every time?
- Look for prediction bias or overconfident scores.
✅ 4. Fairness and Bias
- Measure model fairness across sensitive groups (e.g., age, gender).
- Monitor disparities in performance.
✅ 5. Latency and Throughput
- Inference latency (ms/req)
- Request volume
- System resource usage (CPU/GPU, memory)
⚒️ Tools for Model Monitoring
Open-Source
| Tool | Features |
|---|---|
| Evidently AI | Data & model drift, dashboards, reports |
| Prometheus + Grafana | Custom monitoring (great for latency, metrics) |
| MLflow | Experiment tracking (with manual model logs) |
| WhyLogs | Logging and monitoring of data quality |
| Fiddler / Arize AI / TruEra | Monitoring + explainability (SaaS) |
Cloud-Native
| Platform | Monitoring Feature |
|---|---|
| SageMaker Model Monitor | Built-in drift & quality detection |
| Vertex AI (GCP) | Prediction monitoring, alerts |
| Azure ML | Drift + metric monitoring |
| Databricks | MLflow + production metrics |
Monitoring Lifecycle Example
1. Model is deployed (API or batch)
2. User requests come in
3. Log: input data, model predictions, latency
4. Optional: collect true labels later (for supervised metrics)
5. Compare live vs baseline (training) distributions & metrics
6. Trigger alerts / retrain pipelines if performance drops
Sample: Custom Monitoring Loop (Python)
```python
import pandas as pd
from sklearn.metrics import accuracy_score

# 1. Collect live predictions and labels
preds = pd.read_csv("live_predictions.csv")
truth = pd.read_csv("live_labels.csv")

# 2. Calculate performance
acc = accuracy_score(truth["label"], preds["prediction"])

# 3. Trigger alert if accuracy drops
if acc < 0.75:
    print("⚠️ Model accuracy dropped below threshold!")
```
Best Practices
✅ Set performance baselines from training
✅ Store input + predictions + actuals
✅ Monitor in real-time or batch
✅ Set up alerts or retraining triggers
✅ Regularly audit for fairness and explainability
What is Prometheus?
Prometheus is an open-source monitoring and alerting system originally developed by SoundCloud. It’s widely used for real-time metrics collection, alerting, and visualization, especially in DevOps and ML infrastructure.
✅ Why Use Prometheus for ML & MLOps?
- Track model inference metrics (latency, throughput, errors)
- Monitor CPU/GPU usage of ML workloads
- Combine with Grafana for dashboards
- Set up alerts for performance or drift degradation
- Works well with Docker, Kubernetes, FastAPI, Flask, etc.
Core Concepts
| Concept | Description |
|---|---|
| Metric | A time-series data point (e.g., inference_latency_seconds) |
| Labels | Key-value tags for filtering metrics (e.g., model="xgboost") |
| Exporter | Collects metrics from apps (e.g., Python, GPU, Docker) |
| Scraping | Prometheus pulls metrics by scraping a target HTTP endpoint |
| Query | Uses PromQL to query metrics |
| Alertmanager | Sends alerts via email, Slack, PagerDuty, etc. |
Example: Expose Metrics in Python (FastAPI + Prometheus)
```shell
pip install prometheus_client fastapi uvicorn
```
```python
# app.py
from fastapi import FastAPI
from prometheus_client import start_http_server, Summary, Counter
import time
import random

app = FastAPI()

# Metrics
REQUEST_TIME = Summary('inference_latency_seconds', 'Time spent on inference')
REQUEST_COUNT = Counter('inference_requests_total', 'Total inference requests')

@app.get("/predict")
@REQUEST_TIME.time()
def predict():
    REQUEST_COUNT.inc()
    time.sleep(random.uniform(0.1, 0.5))  # simulate inference delay
    return {"result": "cat"}

# Run the Prometheus metrics server on port 8001
start_http_server(8001)
```
Prometheus Configuration (prometheus.yml)
```yaml
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: 'ml-api'
    static_configs:
      - targets: ['localhost:8001']
```
Prometheus scrapes http://localhost:8001/metrics every 15s.
Visualize in Grafana
- Run Prometheus + Grafana using Docker: docker-compose up
- Add Prometheus as a data source in Grafana
- Create dashboards using PromQL, e.g.:
  - inference_latency_seconds_count
  - rate(inference_latency_seconds_sum[1m])
Popular Exporters
| Exporter | Use |
|---|---|
| `prometheus_client` | App-level metrics in Python |
| `node_exporter` | System metrics (CPU, memory) |
| `gpu_exporter` | NVIDIA GPU metrics |
| `kube-state-metrics` | Kubernetes objects |
| `pushgateway` | For short-lived jobs (like batch ML) |
Alerts (via Alertmanager)
Example rule:
```yaml
groups:
  - name: ml-alerts
    rules:
      - alert: HighLatency
        expr: inference_latency_seconds_sum / inference_latency_seconds_count > 0.3
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "High inference latency detected"
```
ML Monitoring Use Cases
| Use Case | Metric |
|---|---|
| Latency | inference_latency_seconds |
| Traffic | inference_requests_total |
| Failure rate | inference_errors_total |
| Resource usage | From node_exporter or gpu_exporter |
| Drift triggers | Custom metrics exposed from model logic |
What is Grafana?
Grafana is an open-source analytics and dashboarding tool used to visualize time-series data from sources like Prometheus, InfluxDB, Elasticsearch, Loki, and many others.
In MLOps, Grafana is often paired with Prometheus to monitor:
- Model inference latency
- Drift signals
- API uptime and errors
- CPU/GPU utilization
- Data pipeline performance
✅ Why Use Grafana?
- Beautiful interactive dashboards
- Flexible PromQL/SQL queries
- Alerting capabilities
- Works with ML/DevOps monitoring tools
- Integration with Slack, email, and PagerDuty for alerts
Key Features
| Feature | Description |
|---|---|
| Panels | Graphs, tables, heatmaps, gauges, logs |
| Variables | Dynamic filters (e.g., model name) |
| Data Sources | Prometheus, Loki, AWS CloudWatch, PostgreSQL, etc. |
| Annotations | Add events or markers to timelines |
| Alerts | Visual + rule-based threshold alerts |
Common ML Use Cases
| Use Case | Panel Type | Metric Source |
|---|---|---|
| Inference latency | Line chart | Prometheus |
| Drift score over time | Graph panel | Evidently/WhyLogs |
| Error rate | Stat panel | Prometheus |
| GPU usage | Gauge / Time series | NVIDIA exporter |
| Feature distribution | Histogram / Heatmap | Custom app metrics |
Example Setup (Local)
1. Docker Compose (Prometheus + Grafana)
```yaml
# docker-compose.yml
version: '3'
services:
  prometheus:
    image: prom/prometheus
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
    ports:
      - "9090:9090"
  grafana:
    image: grafana/grafana
    ports:
      - "3000:3000"
```
2. Start it:
```shell
docker-compose up -d
```
Sample Grafana Dashboard Panels for ML
Inference Latency Panel
- Metric: inference_latency_seconds
- Query: rate(inference_latency_seconds_sum[1m]) / rate(inference_latency_seconds_count[1m])
Error Rate Panel
- Metric: inference_errors_total
- Query: rate(inference_errors_total[5m])
Model Drift Detection Panel
- Metric: feature_drift_score
- Query: avg_over_time(feature_drift_score[1h])
Alerts in Grafana
- Create a panel → set thresholds (e.g., latency > 500 ms)
- Add an alert → define the condition (e.g., average over 5 min)
- Connect Alertmanager / Slack / email
✨ Example Dashboards
| Dashboard | Panels |
|---|---|
| Model Monitoring | Accuracy, F1, latency, requests |
| System Monitoring | CPU, RAM, GPU, disk |
| ETL Pipeline Monitoring | Job success, failure rate, execution time |
| Data Drift Monitor | PSI/KS scores, feature distribution |
Here’s a complete comparison and overview of Evidently AI, WhyLabs, and Seldon Core, three powerful tools in the MLOps & model monitoring ecosystem:
1. Evidently AI
Purpose: Open-source Python library for data & model monitoring, focused on drift detection, data quality, and performance reports.
✅ Use Cases:
- Data & target drift detection
- Feature distribution changes
- Model performance reports
- Offline or in-pipeline monitoring
Integration:
- Python scripts, Jupyter notebooks
- Airflow, Prefect, Kubeflow, etc.
Key Features:
| Feature | Description |
|---|---|
| Data Drift Report | Detects change in feature distributions |
| Target Drift Report | Monitors label distribution changes |
| Classification/Regression Reports | Accuracy, F1, ROC, etc. |
| Data Quality Report | Nulls, type mismatches, etc. |
| Dashboards (Evidently UI) | Serve reports as interactive UI locally or in pipelines |
Example (Python):
```python
from evidently.report import Report
from evidently.metric_preset import DataDriftPreset

report = Report(metrics=[DataDriftPreset()])
report.run(reference_data=ref_df, current_data=prod_df)
report.save_html("drift_report.html")
```
☁️ 2. WhyLabs + WhyLogs
Purpose: Enterprise-grade observability and monitoring platform for ML pipelines and data quality, offering automated logging, drift detection, and dashboards.
✅ Use Cases:
- Continuous production monitoring
- Automated data profiling
- Real-time alerting
- Integration with cloud and on-prem ML workflows
Key Features:
| Feature | Description |
|---|---|
| WhyLogs | Open-source library for logging statistics about data |
| WhyLabs Platform | SaaS platform for dashboards, alerts |
| Segmented Monitoring | Track metrics across different user segments |
| Lightweight Logging | Doesn’t expose raw data (great for compliance) |
| Streaming / Batch | Works in both modes; supports Spark, Pandas, S3, Kafka, etc. |
๐งช Example (Python):
import whylogs as why
profile = why.log(pandas_df).profile()
profile.write(path="profile.bin")
You then upload this profile to WhyLabs using the WhyLabs agent.
๐ 3. Seldon Core
Purpose: Open-source platform for deploying, scaling, and monitoring ML models on Kubernetes. Comparable to KServe (formerly KFServing).
✅ Use Cases:
-
Kubernetes-native model serving
-
A/B testing, canary rollout, multi-model serving
-
Real-time inference monitoring
-
Explainability & drift detection
๐ Key Features:
| Feature | Description |
|---|---|
| MLServer | Fast, multi-language model server |
| Explainers | SHAP, Lime integration out of the box |
| Drift Detectors | Kolmogorov-Smirnov, PSI, etc. |
| Outlier Detectors | Use alibi-detect or custom models |
| Seldon Metrics | Prometheus & Grafana-ready metrics |
| Advanced Routing | Can run A/B, multi-armed bandit deployments |
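The advanced-routing row above covers splitting traffic between model variants. The core multi-armed-bandit idea can be sketched as a seeded epsilon-greedy router in plain Python (variant names and the 10% exploration rate are illustrative — Seldon configures routing declaratively, not via code like this):

```python
import random

def make_router(variants, epsilon=0.1, seed=42):
    """Route requests: mostly the best-performing variant, sometimes explore."""
    rng = random.Random(seed)
    rewards = {v: [] for v in variants}

    def route():
        if rng.random() < epsilon or not any(rewards.values()):
            return rng.choice(variants)  # explore
        # exploit: variant with the best average reward so far
        return max(variants,
                   key=lambda v: sum(rewards[v]) / max(len(rewards[v]), 1))

    def record(variant, reward):
        rewards[variant].append(reward)

    return route, record

route, record = make_router(["model-a", "model-b"])
record("model-a", 0.9)  # feedback: model-a performs well
record("model-b", 0.2)  # feedback: model-b performs poorly
choices = [route() for _ in range(100)]
```

With these rewards the router sends roughly 95% of traffic to `model-a` while still occasionally sampling `model-b`.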
๐งฑ Architecture:
[Kubernetes]
|
[SeldonDeployment YAML]
|
[Model Pods] <---> [Metrics + Monitoring Pods]
|
[Ingress Gateway]
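The `SeldonDeployment` referenced in the diagram is a Kubernetes custom resource. A minimal sketch using a prepackaged scikit-learn server (the name and model URI are placeholders):

```yaml
apiVersion: machinelearning.seldon.io/v1
kind: SeldonDeployment
metadata:
  name: income-classifier
spec:
  predictors:
    - name: default
      replicas: 1
      graph:
        name: classifier
        implementation: SKLEARN_SERVER
        modelUri: gs://my-models/income-classifier
```

Applying this with `kubectl apply -f` creates the model pods, service, and Prometheus-ready metrics endpoints automatically.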
๐ Comparison Table
| Feature/Tool | Evidently AI | WhyLabs + WhyLogs | Seldon Core |
|---|---|---|---|
| Type | Python library/UI | Logging + SaaS platform | Kubernetes deployment |
| Drift Detection | ✅ | ✅ | ✅ (via Alibi Detect) |
| Model Serving | ❌ | ❌ | ✅ |
| Monitoring | ✅ (offline) | ✅ (cloud/streaming) | ✅ (real-time, Prometheus) |
| Alerts | Manual + Grafana | Built-in SaaS alerts | With Prometheus + AlertMgr |
| Integration | Python, Notebooks | Spark, Kafka, S3, Pandas | Kubernetes + Prometheus |
| Visual UI | Local HTML/UI server | WhyLabs dashboard | Grafana integration |
| Open Source | ✅ | Partially (WhyLogs = ✅) | ✅ |
๐ Ideal Tool Based on Need:
| Need | Tool |
|---|---|
| Quick & Local Drift Detection | Evidently AI |
| Enterprise-Grade Logging & SaaS Dashboards | WhyLabs |
| Full ML Deployment + Drift Detection in K8s | Seldon Core |
10. Cloud & Infrastructure for MLOps
๐ง AWS in the ML/MLOps Ecosystem
AWS offers end-to-end tools for data ingestion, training, model deployment, monitoring, and CI/CD.
๐ง Key AWS Services for MLOps
| Category | Service | Purpose |
|---|---|---|
| Storage & Data | S3, Glue, Athena | Data lake, ETL, querying logs/metadata |
| Model Development | SageMaker Studio | IDE for ML dev (like JupyterLab) |
| Model Training | SageMaker Training Jobs | Scalable training on EC2 or Spot |
| Model Deployment | SageMaker Endpoints | Real-time APIs for inference |
| Model Registry | SageMaker Model Registry | Manage model versions and metadata |
| CI/CD | CodePipeline, CodeBuild, Lambda | Automate training/testing/deployment |
| Monitoring & Drift | SageMaker Model Monitor | Detect drift, outliers, quality issues |
| Observability | CloudWatch, Prometheus, Grafana | Metrics, logging, alerting |
| Feature Store | SageMaker Feature Store | Store, reuse, and version features |
| Security & Auth | IAM, KMS, VPC, S3 Policies | Access control and encryption |
๐ MLOps Workflow on AWS
1. Data Collection & Processing
-
Use AWS Glue or S3 + Lambda to collect/clean data
-
Version datasets using DVC or S3 object versioning
2. Model Training
-
Launch training jobs using SageMaker Training
-
Auto-scale compute; log metrics to CloudWatch
3. Model Evaluation & Registration
-
Evaluate metrics, visualize in SageMaker Experiments
-
Register successful model in Model Registry
4. Model Deployment
-
Use SageMaker Inference Endpoints for:
-
Real-time (InvokeEndpoint)
-
Batch (BatchTransform)
-
Configure autoscaling and multi-model endpoints if needed
5. Monitoring & Drift Detection
-
Use SageMaker Model Monitor to:
-
Detect data drift (feature value distribution)
-
Detect model drift (label drift, performance)
-
Log anomalies to CloudWatch
6. CI/CD for ML
-
Automate with:
-
CodePipeline: Orchestration
-
CodeBuild: Build + test steps
-
Step Functions: Complex ML workflows
-
EventBridge: Trigger on file uploads, model updates
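The EventBridge trigger mentioned above matches events with a rule pattern. A sketch of a pattern that fires when a new object lands in a training-data bucket (bucket name and prefix are placeholders; the bucket must have EventBridge notifications enabled):

```json
{
  "source": ["aws.s3"],
  "detail-type": ["Object Created"],
  "detail": {
    "bucket": {"name": ["ml-data-bucket"]},
    "object": {"key": [{"prefix": "raw/"}]}
  }
}
```

The rule's target can be a Lambda, a CodePipeline execution, or a Step Functions state machine that kicks off retraining.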
๐ Example: Real-time Drift Monitoring with SageMaker
from sagemaker.model_monitor import DataCaptureConfig

# Enable capture of inference requests/responses to S3
data_capture_config = DataCaptureConfig(
    enable_capture=True,
    sampling_percentage=100,
    destination_s3_uri="s3://mybucket/captured-data"
)

# Attach it while deploying; `model` is a previously built sagemaker Model
predictor = model.deploy(
    initial_instance_count=1,
    instance_type='ml.m5.large',
    data_capture_config=data_capture_config
)
Then set up a Monitoring Schedule using ModelMonitor.
๐ Observability with CloudWatch + Grafana
-
CloudWatch collects metrics like:
-
Latency
-
Invocation count
-
4xx/5xx errors
-
Custom logs from inference scripts
-
Connect Prometheus to Amazon Managed Grafana for:
-
Model-specific dashboards
-
Drift visualization
-
Alerting via SNS or Slack
๐ง Final Thoughts
| Goal | Tools to Use |
|---|---|
| Local development & experiments | SageMaker Studio, S3 |
| Deployment with monitoring | SageMaker Endpoint + Model Monitor |
| Production CI/CD pipelines | CodePipeline, Step Functions |
| Enterprise monitoring | CloudWatch + Grafana or Prometheus |
☁️ GCP for Machine Learning & MLOps
GCP offers an end-to-end AI/ML ecosystem via tools like Vertex AI, BigQuery, Cloud Functions, and Cloud Monitoring.
๐ง Key GCP Services for MLOps
| MLOps Phase | GCP Tool | Purpose |
|---|---|---|
| Data Storage | Cloud Storage (GCS) | Object store for datasets/models |
| Data Analysis | BigQuery, Dataflow | SQL-based analytics, streaming pipelines |
| ML Platform | Vertex AI | Unified ML lifecycle platform |
| Training | Vertex AI Training | Managed model training on CPUs/GPUs/TPUs |
| Model Registry | Vertex Model Registry | Store and manage model versions |
| Deployment | Vertex AI Endpoints | Real-time/batch model inference |
| Monitoring | Vertex AI Model Monitoring | Monitor drift, skew, performance |
| CI/CD | Cloud Build, Cloud Functions | Automate ML pipeline steps |
| Observability | Cloud Logging & Monitoring | Alerting, visualization |
| Pipelines | Vertex AI Pipelines (Kubeflow) | Orchestration of ML workflows |
| Feature Store | Vertex Feature Store | Central repository of features |
๐ GCP MLOps Workflow Overview
1. Data Storage & Preparation
-
Store raw/processed data in GCS
-
Use Dataflow or Dataprep for batch/stream processing
-
Explore data via BigQuery
2. Model Development & Training
-
Develop locally or in Vertex AI Workbench (JupyterLab)
-
Train using:
-
Vertex AI Training Jobs (managed)
-
Custom containers (e.g., with PyTorch/TensorFlow)
-
TPUs for large-scale deep learning
3. Model Evaluation & Versioning
-
Evaluate metrics post-training
-
Register model in Vertex Model Registry
-
Use Artifact Registry for Docker images
4. Model Deployment
-
Deploy to Vertex AI Endpoints:
-
Real-time predictions via REST API
-
Scalable with autoscaling/load balancing
-
For batch inference: use Batch Prediction Jobs
5. Model Monitoring
-
Vertex AI Model Monitoring handles:
-
Prediction data drift
-
Training-serving skew
-
Feature skew
-
Configure alerts to trigger on threshold breaches
-
Logs can be piped to Cloud Logging for audits
6. CI/CD Pipelines
-
Use Vertex Pipelines (Kubeflow Pipelines on GCP) to:
-
Automate training → evaluation → deployment
-
Integrate with Cloud Build for custom CI steps
# sample Kubeflow component
from kfp.dsl import component

@component
def train_model(input_data: str) -> str:
    ...
๐ GCP Monitoring Stack
| Tool | Use |
|---|---|
| Cloud Monitoring (Stackdriver) | Metrics like latency, errors, usage |
| Cloud Logging | Inference logs, pipeline status |
| Vertex Model Monitoring | Drift, skew, performance metrics |
| BigQuery | Store and analyze monitoring data |
| Grafana (via GKE) | Custom dashboards for model metrics |
๐ Model Drift Monitoring Example
Enable monitoring when deploying model:
gcloud beta ai endpoints deploy-model \
  --model=model-id \
  --display-name="drift-monitored-model" \
  --enable-access-logging \
  --enable-drift-monitoring
Set thresholds via console or REST API.
๐ Popular Use Cases on GCP
| Use Case | GCP Services |
|---|---|
| NLP model deployment | Vertex AI + Cloud Functions |
| Data pipeline with streaming | Pub/Sub + Dataflow |
| Real-time fraud detection | Vertex AI + BigQuery + Monitoring |
| Retail recommender system | Feature Store + Vertex AI + Monitoring |
๐ง Final Thoughts
| Objective | GCP Tools |
|---|---|
| Unified ML lifecycle | Vertex AI |
| Data processing | BigQuery, Dataflow |
| CI/CD | Vertex Pipelines, Cloud Build |
| Observability | Stackdriver, Vertex Monitoring |
| Custom workflows | Kubeflow, GKE, Cloud Functions |
☁️ Azure for MLOps & Machine Learning
Azure offers a comprehensive and scalable platform to manage the entire ML lifecycle — from data ingestion to deployment, monitoring, and retraining.
๐ง Key Azure MLOps Components
| MLOps Phase | Azure Tool | Purpose |
|---|---|---|
| Data Storage | Azure Blob Storage, ADLS | Store datasets, models, logs |
| Data Processing | Azure Data Factory, Synapse | ETL, big data analytics |
| ML Platform | Azure Machine Learning (Azure ML) | Unified ML development/deployment |
| Model Training | Azure ML Compute, Azure Databricks | Train with CPU, GPU, or Spark |
| Experiment Tracking | Azure ML Experiments | Track metrics, parameters, versions |
| Model Registry | Azure ML Model Registry | Central model storage |
| Deployment | Azure ML Endpoints, AKS, ACI | Real-time/batch serving |
| CI/CD Pipelines | Azure DevOps, GitHub Actions | Automate ML lifecycle |
| Monitoring | Azure Monitor, App Insights | Track performance, drift, logs |
| Feature Store (Preview) | Azure ML Feature Store | Reusable features for ML models |
๐ Azure MLOps Workflow Overview
1. Data Ingestion & Storage
-
Use Azure Data Factory or Synapse Pipelines for ingesting data
-
Store datasets in Blob Storage or ADLS Gen2
2. Data Processing & Exploration
-
Use Azure Synapse, Databricks, or Jupyter Notebooks in Azure ML workspace
-
Perform data cleaning, EDA, feature engineering
3. Model Development & Experimentation
-
Work within Azure ML Studio or integrate with VSCode
-
Use Experiment tracking to compare models across metrics & hyperparameters
-
Train models on:
-
Local or remote compute
-
AML Compute Cluster (autoscaling)
-
Databricks Spark cluster
4. Model Versioning & Registry
-
Register successful models into the Model Registry
-
Associate model with training metrics and dataset version
5. Deployment
-
Deploy models to:
-
Managed Endpoints (real-time REST APIs)
-
AKS for production-grade serving
-
ACI for testing/dev workloads
-
Batch Endpoints for offline inference
6. CI/CD with Azure DevOps
-
Use Azure DevOps Pipelines or GitHub Actions
-
Automate:
-
Data validation → model training → evaluation → deployment
-
YAML-based templates and pre-built tasks available
# Azure DevOps pipeline YAML (simplified)
trigger:
  branches:
    include: [main]
jobs:
  - job: TrainModel
    steps:
      - task: AzureMLTrain@1
        inputs:
          workspaceName: 'ml-workspace'
          experimentName: 'churn-model'
7. Monitoring & Retraining
-
Monitor using:
-
Azure Monitor for system metrics
-
Application Insights for API-level logs
-
ML Model Monitoring for data drift, concept drift, and performance
-
Set alerts and automate retraining pipelines via triggers
๐ Drift & Performance Monitoring Example
Azure ML can monitor:
-
Data drift between training & production data
-
Prediction drift and label distribution changes
-
Model performance degradation
from azureml.monitoring import ModelDataCollector
collector = ModelDataCollector("model-name", feature_names=["age", "income"])
collector.collect(data=X_inference)
๐ Integration with Azure Ecosystem
| Azure Service | Role in MLOps Pipeline |
|---|---|
| Azure DevOps | CI/CD, testing, version control |
| Azure Monitor | Real-time logging, alerting |
| Azure Kubernetes | Scalable inference serving |
| Azure Key Vault | Secure management of API keys/secrets |
| Azure Functions | Trigger retraining or workflows |
| Power BI | Visualize model outputs and predictions |
| Azure Logic Apps | No-code orchestration for alerts/retraining |
๐ Model Deployment Options
| Environment | Use Case |
|---|---|
| ACI (Azure Container Instance) | Quick testing/staging |
| AKS (Azure Kubernetes) | Scalable production |
| Local Docker | Custom environment |
| Batch Endpoints | Non-real-time inference jobs |
๐ง Azure MLOps Use Cases
| Use Case | Azure Tools |
|---|---|
| Credit risk scoring | Azure ML + DevOps + AKS |
| Demand forecasting | Azure ML Pipelines + Batch Endpoints |
| Real-time recommendation | Azure ML Endpoints + AKS |
| Automated retraining | Azure DevOps + Azure ML Triggers |
๐งฐ Comparison with Other Clouds
| Capability | Azure ML | GCP Vertex AI | AWS SageMaker |
|---|---|---|---|
| GUI & SDK Support | Strong (Studio + CLI + SDK) | Strong | Strong |
| CI/CD Pipelines | Azure DevOps, GitHub | Vertex Pipelines | SageMaker Pipelines |
| Monitoring & Drift | Azure Monitor + ML monitor | Vertex AI Monitoring | SageMaker Model Monitor |
| Feature Store | Preview | Production-ready | Production-ready |
๐ What is IAM & Access Control?
IAM (Identity and Access Management) is the framework for:
-
Identifying users, services, or machines
-
Controlling what they can access (data, services, resources)
-
Auditing and enforcing security policies
๐ก Why IAM is Critical in MLOps
| Use Case | IAM Role Needed |
|---|---|
| Data access for training | Grant access to S3, Blob, GCS buckets |
| CI/CD pipeline automation | Roles for GitHub Actions, Jenkins, or Azure DevOps |
| Model serving | Access to endpoints, containers, logging |
| Secure secrets handling | Access to key vaults or secret managers |
| Auditing & compliance | Logs of who accessed or changed models/data |
๐ IAM Across Cloud Platforms
1. AWS IAM
-
IAM Users, Groups, Roles, Policies (JSON)
-
Common roles:
-
AmazonS3FullAccess
-
AmazonSageMakerFullAccess
Custom policies with fine-grained permissions
-
Used in:
-
SageMaker Pipelines
-
Lambda, EC2, Step Functions
-
Secrets Manager for API keys
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:PutObject"],
      "Resource": "arn:aws:s3:::ml-data-bucket/*"
    }
  ]
}
2. Azure IAM (RBAC)
-
Uses Azure Active Directory (AAD) for identity
-
Role-Based Access Control (RBAC) manages access
-
Predefined roles:
-
Contributor, Reader, Owner, Azure ML Contributor
-
Custom roles for:
-
Access to storage accounts
-
Running Azure ML pipelines
-
Accessing compute targets
{
  "roleName": "CustomMLRole",
  "permissions": [
    {
      "actions": [
        "Microsoft.MachineLearningServices/*",
        "Microsoft.Storage/*/read"
      ]
    }
  ]
}
3. GCP IAM
-
Service accounts + IAM roles + resource policies
-
Predefined roles like:
-
roles/aiplatform.admin
-
roles/storage.objectViewer
-
Used in:
-
Vertex AI Pipelines
-
BigQuery, GCS
-
Secret Manager
bindings:
  - role: roles/aiplatform.user
    members:
      - serviceAccount:ml-pipeline@my-project.iam.gserviceaccount.com
๐ก️ IAM in CI/CD & MLOps
| MLOps Stage | IAM Role Needed |
|---|---|
| Data preparation | Access to datasets (S3, GCS, ADLS) |
| Model training | Access to compute, logging, secrets |
| CI/CD pipeline | GitHub Actions or Azure DevOps with scoped secrets |
| Model registry | Read/write permissions to register |
| Model deployment | Invoke permissions for endpoints |
| Monitoring | Access to logs, metrics services |
๐ Example: GitHub Actions + AWS IAM for MLOps
-
GitHub Actions deploys model to SageMaker
-
Needs limited-access IAM role
jobs:
  deploy:
    runs-on: ubuntu-latest
    # OIDC role assumption requires permission to request an ID token
    permissions:
      id-token: write
      contents: read
    steps:
      - name: Configure AWS credentials
        uses: aws-actions/configure-aws-credentials@v2
        with:
          role-to-assume: arn:aws:iam::123456789012:role/GitHub-SageMaker-Deploy
          aws-region: us-east-1
๐ Best Practices for IAM in MLOps
✅ Principle of Least Privilege
✅ Rotate credentials & use temporary tokens
✅ Use IAM Roles/Service Accounts over hardcoding credentials
✅ Enable logging & audit trails
✅ Store secrets in Key Vault/Secrets Manager
✅ Apply network policies & endpoint security
✅ Tools to Manage IAM + Secrets
| Tool | Purpose |
|---|---|
| AWS IAM + Secrets Manager | Access control & credential store |
| Azure RBAC + Key Vault | Role-based control & secrets |
| GCP IAM + Secret Manager | Fine-grained permissions & key mgmt |
| HashiCorp Vault | Cross-platform secret store |
| Kubernetes RBAC + ServiceAccounts | For model deployment and services |
Here’s a detailed comparison and usage guide for cloud storage in the MLOps and DevOps context, focusing on AWS S3 (Simple Storage Service) and GCP GCS (Google Cloud Storage).
☁️ Overview
| Feature | Amazon S3 | Google Cloud Storage (GCS) |
|---|---|---|
| Service Name | Amazon Simple Storage Service | Google Cloud Storage |
| Storage Structure | Buckets → Objects | Buckets → Objects |
| URL Format | https://s3.amazonaws.com/bucket/key | https://storage.googleapis.com/bucket/key |
| Access Control | IAM, Bucket Policies, ACLs | IAM, Uniform/Bucket-level Policies |
| Versioning | ✅ Supported | ✅ Supported |
| Encryption | SSE-S3, SSE-KMS, SSE-C | CSE, CMEK, Google-managed keys |
| Lifecycle Mgmt | ✅ (Transitions, Expiry rules) | ✅ (Rules, Policies) |
| Event Triggers | S3 Event Notifications (to Lambda, etc.) | GCS Notifications (Pub/Sub, Cloud Functions) |
๐ง Common MLOps Use Cases
| Task | How S3 / GCS Helps |
|---|---|
| Store raw training data | CSVs, JSON, Parquet in S3 or GCS |
| Save processed features | Feature store intermediates |
| Model artifacts | Store .pkl, .pt, .joblib files |
| Logging / metrics storage | Send logs or model metrics to S3/GCS |
| CI/CD pipelines | Pass artifacts between build stages |
| Model registry (if custom) | Versioned model storage |
๐ ️ How to Use in Practice
๐น AWS S3 Example (Python boto3)
import boto3
s3 = boto3.client('s3')
s3.upload_file('model.pkl', 'ml-bucket', 'models/model.pkl')
s3.download_file('ml-bucket', 'models/model.pkl', 'local_model.pkl')
Set AWS credentials using:
-
IAM role (EC2/SageMaker)
-
~/.aws/credentials file
-
AWS_ACCESS_KEY_ID & AWS_SECRET_ACCESS_KEY env vars
๐น GCS Example (Python google-cloud-storage)
from google.cloud import storage
client = storage.Client()
bucket = client.get_bucket('ml-bucket')
blob = bucket.blob('models/model.pkl')
blob.upload_from_filename('model.pkl')
blob.download_to_filename('local_model.pkl')
Set GCP credentials using:
-
GOOGLE_APPLICATION_CREDENTIALS env var with a service account key JSON
๐ Access Control Tips
| Platform | Recommendation |
|---|---|
| AWS | Use IAM roles with minimal permissions (e.g., s3:GetObject, s3:PutObject) |
| GCP | Use service accounts with specific roles like roles/storage.objectViewer or roles/storage.admin |
⏳ Lifecycle & Cost Management
Both support:
-
Object Lifecycle Rules: move to cold storage (Glacier or Nearline/Coldline)
-
Retention Policies: block deletion of data for X days
-
Auto-delete/expire rules for temp files and old models
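The rules above map to a lifecycle configuration document. A sketch of an S3 lifecycle configuration built as a plain dict (bucket name, prefixes, and day counts are illustrative); it would be applied with boto3's `put_bucket_lifecycle_configuration`:

```python
# Lifecycle config: archive old model artifacts, expire temp files.
lifecycle_config = {
    "Rules": [
        {
            "ID": "archive-old-models",
            "Filter": {"Prefix": "models/"},
            "Status": "Enabled",
            # move artifacts to Glacier after 90 days
            "Transitions": [{"Days": 90, "StorageClass": "GLACIER"}],
        },
        {
            "ID": "expire-temp-files",
            "Filter": {"Prefix": "tmp/"},
            "Status": "Enabled",
            # delete temp objects after a week
            "Expiration": {"Days": 7},
        },
    ]
}

# Applied via (requires AWS credentials, shown for context):
# boto3.client("s3").put_bucket_lifecycle_configuration(
#     Bucket="ml-bucket", LifecycleConfiguration=lifecycle_config)
```

GCS uses the same idea with a slightly different JSON shape (`action`/`condition` pairs) set via `gsutil lifecycle set` or the client library.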
๐งช CI/CD & ML Pipelines Integration
| Tool/Framework | S3 Support | GCS Support |
|---|---|---|
| SageMaker Pipelines | ✅ Native | ❌ |
| Vertex AI | ❌ | ✅ Native |
| MLflow | ✅ Via URI (s3://...) | ✅ Via URI (gs://...) |
| Airflow | ✅ | ✅ |
| Kubeflow | ✅ | ✅ |
| ZenML | ✅ | ✅ |
๐ Versioning
-
Enable versioning in both platforms to track changes:
-
In S3: Go to bucket → Enable versioning
-
In GCS: Set bucket versioning with gsutil or the Console
-
Useful for:
-
Rolling back models
-
Auditing training dataset changes
๐ก️ Security Best Practices
✅ Enable encryption (default in both)
✅ Block public access unless explicitly required
✅ Use signed URLs for limited-time sharing
✅ Monitor with logging (CloudTrail / Cloud Audit Logs)
✅ Use bucket-level access over object-level ACLs
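The signed-URL practice above works by attaching an expiry timestamp plus an HMAC signature that only the storage service can mint and verify. A simplified stdlib sketch of the idea — this is NOT AWS SigV4 or GCS's real signing scheme (real presigned URLs come from `generate_presigned_url` / `generate_signed_url`), just the underlying concept:

```python
import hashlib
import hmac
import time
from urllib.parse import urlencode

SECRET = b"server-side-signing-key"  # never shipped to clients

def sign_url(base_url, expires_in=300, now=None):
    """Attach an expiry timestamp and an HMAC over url+expiry."""
    expires = int(now if now is not None else time.time()) + expires_in
    payload = f"{base_url}?expires={expires}".encode()
    sig = hmac.new(SECRET, payload, hashlib.sha256).hexdigest()
    return f"{base_url}?{urlencode({'expires': expires, 'signature': sig})}"

def verify_url(url, now=None):
    """Reject expired links and links with a forged/tampered signature."""
    base, _, query = url.partition("?")
    params = dict(kv.split("=") for kv in query.split("&"))
    expires = int(params["expires"])
    if (now if now is not None else time.time()) > expires:
        return False  # link expired
    payload = f"{base}?expires={expires}".encode()
    expected = hmac.new(SECRET, payload, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, params["signature"])

url = sign_url("https://storage.example.com/ml-bucket/model.pkl",
               expires_in=300, now=1_700_000_000)
```

Because the secret never leaves the server, a client can share the URL for its lifetime but cannot extend the expiry or point it at another object.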
Here’s a detailed breakdown and comparison of compute services: EC2, GKE, and Lambda — often used in DevOps, MLOps, and scalable microservices environments.
๐งฎ 1. Amazon EC2 (Elastic Compute Cloud)
๐งพ What It Is:
-
Virtual machines (VMs) on demand.
-
Full control over the OS, networking, and storage.
-
Ideal for traditional apps, ML model training, hosting APIs, etc.
๐ง Typical Use Cases:
-
Host model servers (like FastAPI, Flask, or TorchServe)
-
Run batch jobs or cron scripts
-
Train ML models on GPU-enabled instances
-
Run Docker containers (via EC2 + ECS or self-managed)
✅ Pros:
-
Full flexibility (install anything)
-
Scalable (manual or via Auto Scaling Groups)
-
GPU support for ML workloads
⚠️ Cons:
-
Must manage patching, scaling, security
-
Pricing can rise with high uptime
๐ ️ Infra-as-Code Example (Terraform):
resource "aws_instance" "ml_server" {
  ami           = "ami-xxxxxxxx"
  instance_type = "t2.medium"
  key_name      = "your-key"
}
๐งฑ 2. GKE (Google Kubernetes Engine)
๐งพ What It Is:
-
Fully managed Kubernetes (K8s) on Google Cloud.
-
Run containerized apps with built-in scaling, networking, and storage.
๐ง Typical Use Cases:
-
Run microservices (REST APIs, background jobs)
-
Deploy ML inference servers (TF Serving, Triton, custom Flask apps)
-
Deploy ML pipelines (Kubeflow, TFX, MLflow)
✅ Pros:
-
Autoscaling, self-healing pods
-
Native integration with GCP services (BigQuery, GCS, Vertex AI)
-
CI/CD with GitHub/GitLab + Cloud Build or ArgoCD
⚠️ Cons:
-
Requires Kubernetes knowledge
-
Slightly higher learning curve
๐ Useful Tools with GKE:
-
Kubeflow: ML pipelines
-
Argo Workflows: CI/CD or ML pipeline orchestration
-
Istio/Envoy: Service mesh, secure traffic
⚡ 3. AWS Lambda (Serverless)
๐งพ What It Is:
-
Run backend functions in response to events (e.g., S3 upload, HTTP requests, cron).
-
You pay only for compute time used (in milliseconds).
๐ง Typical Use Cases:
-
ML inference for light models
-
Trigger model retraining when new data arrives in S3
-
ETL/ELT tasks on demand
-
Webhook receivers, alert systems
✅ Pros:
-
Zero server management
-
Auto-scaling, highly cost-efficient
-
Works with other AWS services (S3, SNS, DynamoDB)
⚠️ Cons:
-
Limited runtime (max 15 min)
-
Cold start latency (~1s for some languages)
-
Not suitable for large ML models unless optimized
๐ Example:
# handler.py
def lambda_handler(event, context):
    return {"message": "Hello from Lambda!"}
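Because the handler is plain Python, it can be smoke-tested locally before packaging. A quick sketch (the sample event dict and `None` context stand in for what AWS would pass at invocation time):

```python
# handler.py
def lambda_handler(event, context):
    return {"message": "Hello from Lambda!"}

# Local smoke test: invoke the handler the way AWS would,
# with a sample event and no real context object.
response = lambda_handler({"source": "local-test"}, None)
print(response)
```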
Deploy via:
-
AWS Console
-
AWS SAM / Serverless Framework
-
Terraform
๐ Summary Table
| Feature | EC2 | GKE | Lambda |
|---|---|---|---|
| Type | VM | Managed Kubernetes | Serverless Functions |
| Ideal for | ML training/inference | Scalable microservices, ML pipelines | Lightweight functions, events |
| Scaling | Manual / Auto Scaling | Horizontal Pod Autoscaler | Automatic |
| OS Control | Full | Limited to container OS | None |
| Cold Start | No | No | Yes |
| Pricing | Per hour/second | Per node/hour | Per request (ms-based) |
| Infra-as-Code Tools | Terraform, CloudFormation | Terraform, Helm | SAM, Serverless, Terraform |
| Docker Support | Manual via ECS or EKS | Native | Limited (via container Lambda) |
| GPU Support | ✅ Yes | ✅ (with node pools) | ⚠️ Not natively supported |
๐ก Best Practice Guidance
| Scenario | Recommended Compute |
|---|---|
| Model training with GPU | EC2 or GKE (with GPU nodes) |
| Real-time API with low traffic | Lambda or Cloud Functions |
| Batch data processing | Lambda, EC2, or GKE Jobs |
| Large model inference | EC2 or GKE |
| Scalable web app | GKE |
| Orchestrating ML workflows | GKE (Kubeflow, Argo) |
Here’s a concise yet detailed comparison of the major AutoML platforms across AWS, GCP, and Azure: SageMaker Autopilot, Vertex AI, and Azure AutoML — all used for automating ML workflows including preprocessing, training, tuning, and deployment.
๐ Overview Table: AutoML Comparison
| Feature | SageMaker Autopilot (AWS) | Vertex AI AutoML (GCP) | Azure AutoML |
|---|---|---|---|
| Language Support | Python (via SDK, Boto3) | Python (via SDK, REST) | Python (AzureML SDK) |
| UI Available | ✅ SageMaker Studio | ✅ Vertex AI Console | ✅ Azure Studio |
| Model Explainability | ✅ SHAP built-in | ✅ Integrated | ✅ Built-in with visual UI |
| Custom Code Injection | ✅ Custom containers | ⚠️ Limited | ✅ Supported via pipelines |
| Model Deployment | ✅ One-click to endpoint | ✅ Deploy to prediction service | ✅ Deploy to AKS or endpoint |
| Model Type Coverage | Classification, Regression | Vision, Text, Tabular, Forecast | Tabular, Time series, NLP |
| Integration with MLOps | ✅ SageMaker Pipelines | ✅ Vertex AI Pipelines | ✅ Azure ML Pipelines |
| Pricing | Pay-per-job + compute | Pay-per-job + compute | Pay-per-run + compute |
๐ง SageMaker Autopilot (AWS)
✅ Highlights:
-
Input: CSVs or data in S3
-
Handles feature engineering, model tuning, evaluation
-
Gives Jupyter Notebooks of every step (transparency)
-
Easily integrated with SageMaker Pipelines + Endpoints
๐งช Example:
from sagemaker import AutoML

# role: an existing SageMaker execution role ARN
automl = AutoML(
    role=role,
    target_attribute_name="target",
    output_path="s3://my-bucket/output"
)
automl.fit(inputs="s3://my-bucket/input")
๐ Vertex AI AutoML (GCP)
✅ Highlights:
-
Unified with BigQuery, Cloud Storage, Looker
-
Supports Tabular, Text, Vision, and Forecasting
-
Strong low-code/no-code workflow
-
Built-in model evaluation and deploy
๐งช Sample Flow:
-
Upload dataset via console or Python
-
Click “Train New Model”
-
Set target + training options
-
Deploy or export model
Code Example:
from google.cloud import aiplatform

aiplatform.init(project="my-project", location="us-central1")
job = aiplatform.AutoMLTabularTrainingJob(
    display_name="my-tabular-model",
    optimization_prediction_type="classification",
)
# run() trains and returns the resulting Model resource
model = job.run(dataset=my_dataset, target_column="label")
๐ฌ Azure ML AutoML
✅ Highlights:
-
Integrates with Azure Data Factory, Databricks
-
Offers rich UI + code-first SDK
-
Visual model explanations and fairness analysis
-
Deployment to AKS or managed endpoints
๐งช Code Example:
from azureml.train.automl import AutoMLConfig
from azureml.core.experiment import Experiment

automl_config = AutoMLConfig(
    task='classification',
    primary_metric='AUC_weighted',
    training_data=dataset,
    label_column_name='target',
    iterations=20,
)
# ws: an existing azureml.core.Workspace
experiment = Experiment(ws, "automl-exp")
run = experiment.submit(automl_config)
๐งฐ Use Case Recommendations
| Use Case | Recommended Platform |
|---|---|
| AWS-centric pipeline (S3, Athena) | SageMaker Autopilot |
| GCP-first stack (BigQuery, GCS) | Vertex AI |
| Enterprise + UI-driven | Azure AutoML |
| High control over pipeline steps | Azure/SageMaker |
| Forecasting/Time Series | Vertex AI or Azure |
| Vision/NLP | Vertex AI |
๐ Advanced Add-ons
| Feature | SageMaker Autopilot | Vertex AI AutoML | Azure AutoML |
|---|---|---|---|
| SHAP Explanations | ✅ Yes | ✅ Yes | ✅ Yes |
| Custom Pipelines | ✅ via SageMaker Pipeline | ✅ Vertex AI Pipelines | ✅ via Azure Pipelines |
| Hyperparameter Tuning | ✅ Bayesian search | ✅ Auto-tuning | ✅ Bayesian + Bandit |
| Auto-deploy Models | ✅ Yes | ✅ Yes | ✅ Yes |