MLOps - III

 

8. CI/CD for ML



What is CI/CD?

Acronym | Meaning
CI | Continuous Integration
CD | Continuous Delivery or Continuous Deployment

CI/CD automates the process of building, testing, and deploying applications to reduce manual work, improve consistency, and speed up delivery cycles.


✅ 1. Continuous Integration (CI)

Goal:

Automatically integrate code from multiple developers, test it, and detect errors early.

Typical Steps in CI:

  • Developer pushes code to GitHub/GitLab/Bitbucket

  • CI pipeline triggers:

    • Run unit tests

    • Run linting/formatting (e.g., flake8, black)

    • Build application artifacts

    • Generate reports (e.g., test coverage)

๐Ÿ” Tools:

  • GitHub Actions

  • GitLab CI

  • Jenkins

  • CircleCI

  • Travis CI


✅ 2. Continuous Delivery (CD)

Goal:

Automatically prepare the application to be deployed in a staging or production environment — but with manual approval for final deployment.

Steps:

  • All CI steps

  • Deploy to staging

  • Run integration tests

  • Wait for approval → deploy to production


✅ 3. Continuous Deployment (CD)

Goal:

Fully automate build → test → production deployment with no human approval step.

This is riskier, but good for small frequent releases if tests are reliable.


๐Ÿ—️ CI/CD Pipeline Example (ML App)

1. Code pushed to GitHub → triggers pipeline
2. Environment setup
3. Code linting & formatting
4. Unit & model testing
5. Train model (optionally)
6. Store model artifact (e.g., in S3 or MLflow)
7. Build Docker image
8. Deploy to staging or production (e.g., via Kubernetes)

Common CI/CD Tools in MLOps

Tool | Use
GitHub Actions | Git-based CI/CD
GitLab CI | Full Git + CI/CD integration
Jenkins | Flexible, customizable pipelines
ArgoCD | Kubernetes-native CD
Tekton | Kubernetes-native CI/CD
MLflow / DVC | Model versioning/artifacts
Docker + K8s | Containerized deployment

Why Is CI/CD Important in MLOps?

  • Keeps models reproducible

  • Automates testing of data pipelines

  • Ensures consistent deployment of models

  • Avoids "it worked on my machine" issues
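The "automates testing" point can be made concrete. Below is a minimal sketch of model sanity tests that a CI pipeline could run with pytest on every push; the `predict` function is a hypothetical stand-in for loading and calling a real trained model:

```python
import numpy as np

def predict(features: np.ndarray) -> np.ndarray:
    # Hypothetical stand-in for a real model: label = 1 if the row mean > 0.5
    return (features.mean(axis=1) > 0.5).astype(int)

# Tests a CI runner (e.g., `pytest tests/`) would execute automatically:

def test_output_shape():
    X = np.zeros((10, 4))
    assert predict(X).shape == (10,)

def test_output_is_binary():
    X = np.random.default_rng(0).random((100, 4))
    assert set(np.unique(predict(X))) <= {0, 1}

def test_deterministic():
    X = np.random.default_rng(1).random((5, 4))
    assert (predict(X) == predict(X)).all()
```

A failing assertion fails the pipeline run, which is exactly how silent model regressions get caught before deployment.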


GitHub Actions

GitHub Actions is a CI/CD (Continuous Integration and Continuous Deployment) tool built into GitHub. It allows you to automate workflows such as building, testing, and deploying code when certain events occur in your repository (like push, pull request, etc.).


Common Use Cases

  1. CI/CD pipelines (build, test, deploy code)

  2. Linting and formatting

  3. Running cron jobs

  4. Publishing packages

  5. Automating issues, PRs, labels, etc.


๐Ÿ“ Basic Structure of GitHub Actions

You define actions using YAML inside the .github/workflows/ folder in your repository.

Example:

# .github/workflows/nodejs.yml
name: Node.js CI

on:
  push:
    branches: [ main ]
  pull_request:
    branches: [ main ]

jobs:
  build:
    runs-on: ubuntu-latest

    steps:
    - uses: actions/checkout@v3
    - name: Setup Node.js
      uses: actions/setup-node@v4
      with:
        node-version: '20'

    - name: Install dependencies
      run: npm install

    - name: Run tests
      run: npm test

⚙️ Key Components

Component | Description
on | Triggers (e.g., push, pull_request, schedule)
jobs | A collection of tasks to run
runs-on | Environment (e.g., ubuntu-latest)
steps | Individual commands or actions
uses | Reusable actions (like actions/checkout)
run | Shell commands

✅ Example for Python Project

name: Python CI

on: [push]

jobs:
  test:
    runs-on: ubuntu-latest

    steps:
    - uses: actions/checkout@v3
    - name: Set up Python
      uses: actions/setup-python@v4
      with:
        python-version: '3.11'

    - name: Install dependencies
      run: |
        python -m pip install --upgrade pip
        pip install -r requirements.txt

    - name: Run tests
      run: pytest

Popular Actions

Action | Purpose
actions/checkout | Check out repo code
actions/setup-node | Set up Node.js
actions/setup-python | Set up Python
docker/build-push-action | Build & push Docker image
github/super-linter | Code linting

Advanced Features

  • Matrix builds (test on multiple environments)

  • Secrets (store API keys securely)

  • Reusable workflows via workflow_call

  • Artifacts (store and share test reports, build files, etc.)


GitLab CI/CD


GitLab CI/CD is GitLab’s built-in continuous integration and deployment system. Like GitHub Actions, it lets you automate build, test, and deployment pipelines, but is more tightly integrated into the GitLab platform.


Core Concept: .gitlab-ci.yml

The pipeline is defined in a .gitlab-ci.yml file in the root of your repository.


✅ Simple Example

stages:
  - build
  - test
  - deploy

build_job:
  stage: build
  script:
    - echo "Compiling the code..."
    - make

test_job:
  stage: test
  script:
    - echo "Running tests..."
    - make test

deploy_job:
  stage: deploy
  script:
    - echo "Deploying application..."
    - make deploy
  only:
    - main

Key Components

Component | Description
stages | The pipeline flow (e.g., build → test → deploy)
jobs | Each job runs a script and belongs to a stage
script | Shell commands that the job will execute
only / except | Control when the job runs (e.g., only on main)
tags | Used to target specific GitLab Runners

Common Features

  • Built-in Docker support for containerized pipelines

  • Manual jobs for approval steps

  • Artifacts and caching for build outputs or dependencies

  • Environment variables & secrets

  • Parallel/Matrix jobs

  • Trigger other pipelines

  • Use private/public runners


๐Ÿ Python Example

image: python:3.11

stages:
  - test

test:
  stage: test
  script:
    - pip install -r requirements.txt
    - pytest

๐Ÿณ Docker + GitLab CI Example

image: docker:latest

services:
  - docker:dind

stages:
  - build

build:
  stage: build
  script:
    - docker build -t myapp:latest .

๐Ÿ” Using Secrets (CI/CD Variables)

Set in GitLab → Project Settings → CI/CD → Variables
Then reference in your script:

script:
  - echo "$SECRET_KEY"

Deployment Example with SSH

deploy:
  stage: deploy
  script:
    - ssh user@your-server 'cd /var/www/app && git pull && systemctl restart app'
  only:
    - main

✳ Comparison with GitHub Actions

Feature | GitLab CI | GitHub Actions
Config File | .gitlab-ci.yml | .github/workflows/*.yml
Built-in Docker | ✅ Native | ✅ With setup
Matrix Build | ✅ Via parallel | ✅ With matrix
Community Marketplace | ✅ (less extensive) | ✅ Huge marketplace
Integrated UI | Deeply built-in | More plug & play


In CI/CD, artifacts are files generated during a pipeline run that you want to save, archive, or pass to later stages—like test reports, build outputs, or deployment packages.

Both GitLab CI and GitHub Actions support artifacts, but their usage and syntax differ.


GitLab CI: Artifacts

Basic Usage

build_job:
  stage: build
  script:
    - make build
  artifacts:
    paths:
      - build/

This saves the build/ folder after the build_job runs. These artifacts:

  • Are downloadable from the GitLab UI

  • Can be passed to later stages (unless expire_in removes them)


With Expiration and Custom Settings

test_job:
  stage: test
  script:
    - pytest --junitxml=report.xml
  artifacts:
    paths:
      - report.xml
    expire_in: 1 week
    reports:
      junit: report.xml

Key fields:

Field | Purpose
paths | Files or directories to save
expire_in | Auto-delete time (e.g., 1 day, 1 week)
reports | Special-format reports like junit, coverage, etc.

Passing Artifacts to the Next Stage

Artifacts are automatically passed to jobs in later stages, not within the same stage.

stages:
  - build
  - test

build:
  stage: build
  script:
    - make build
  artifacts:
    paths:
      - build/

test:
  stage: test
  script:
    - ./test-runner build/

GitHub Actions: Artifacts

Save Artifacts

- name: Upload build output
  uses: actions/upload-artifact@v4
  with:
    name: build-artifact
    path: build/

Download in Another Job

- name: Download artifact
  uses: actions/download-artifact@v4
  with:
    name: build-artifact

To pass artifacts between jobs, the upload must happen in one job and the download in a separate, later job.


What is Jenkins?

Jenkins is an open-source automation server widely used for CI/CD pipelines. It lets you automate building, testing, and deploying applications through pipelines (typically defined in Jenkinsfile).


Key Concepts

Concept | Description
Job | A build configuration (freestyle or pipeline)
Pipeline | Scripted or declarative workflow for CI/CD
Agent | A machine (or container) where jobs run
Stage | A high-level step (e.g., Build, Test)
Step | A single task inside a stage (e.g., shell command)
Node | A Jenkins worker (agent) that executes pipelines

๐Ÿ“ Sample Jenkinsfile (Declarative Pipeline)

pipeline {
    agent any

    environment {
        MY_ENV_VAR = 'value'
    }

    stages {
        stage('Build') {
            steps {
                echo 'Building the project...'
                sh 'make build'
            }
        }

        stage('Test') {
            steps {
                echo 'Running tests...'
                sh 'make test'
            }
        }

        stage('Deploy') {
            when {
                branch 'main'
            }
            steps {
                echo 'Deploying to production...'
                sh './deploy.sh'
            }
        }
    }

    post {
        always {
            echo 'Pipeline finished.'
        }
        failure {
            echo 'Pipeline failed!'
        }
    }
}

Artifacts in Jenkins

To store and archive files like build outputs or test results:

post {
    success {
        archiveArtifacts artifacts: 'build/*.jar', fingerprint: true
    }
}

To publish test results:

post {
    always {
        junit 'reports/**/*.xml'
    }
}

Jenkins Plugins You’ll Need

Plugin Name | Purpose
Pipeline | Enables pipeline-as-code
Git | Checkout from Git repositories
JUnit | Test reporting
Docker Pipeline | Build & run Docker in pipeline
Credentials Binding | Secure secret handling
SSH | Remote deployments
Blue Ocean | Modern UI for pipelines

๐Ÿณ Jenkins with Docker

pipeline {
    agent {
        docker {
            image 'python:3.11'
            args '-v /var/run/docker.sock:/var/run/docker.sock'
        }
    }

    stages {
        stage('Install') {
            steps {
                sh 'pip install -r requirements.txt'
            }
        }
        stage('Test') {
            steps {
                sh 'pytest'
            }
        }
    }
}

๐Ÿ” Secrets in Jenkins

  • Store credentials in Manage Jenkins → Credentials

  • Use in pipeline:

withCredentials([string(credentialsId: 'MY_SECRET_ID', variable: 'MY_SECRET')]) {
    sh 'echo $MY_SECRET'
}


๐Ÿ” What is CircleCI?

CircleCI is a modern cloud-native CI/CD platform known for speed, flexibility, and Docker-first support. It automates building, testing, and deploying your code every time you commit changes.


๐Ÿ“ Config File: .circleci/config.yml

CircleCI uses a YAML file stored in the .circleci/ folder in your repo.


✅ Minimal Example (Node.js)

version: 2.1

jobs:
  build:
    docker:
      - image: cimg/node:20.4
    steps:
      - checkout
      - run: npm install
      - run: npm test

workflows:
  build_and_test:
    jobs:
      - build

Key Components

Component | Description
version | CircleCI configuration version (use 2.1+)
jobs | Group of steps to run (build/test/deploy)
steps | Commands in a job (e.g., checkout, run)
workflows | Defines job orchestration (sequential/parallel)
executors | Runtime environment (Docker, machine, macOS)

๐Ÿณ Docker Support Example

jobs:
  build:
    docker:
      - image: cimg/python:3.11
    steps:
      - checkout
      - run:
          name: Install dependencies
          command: pip install -r requirements.txt
      - run:
          name: Run tests
          command: pytest

Artifacts in CircleCI

Artifacts are files saved from a job (e.g., logs, coverage reports).

Upload Artifacts

- store_artifacts:
    path: test-results/
    destination: test-results

Test Reports

- store_test_results:
    path: test-results

You can see artifacts and test results in the CircleCI UI after job execution.


๐Ÿ” Environment Variables & Secrets

  • Define them via CircleCI Project Settings → Environment Variables

  • Reference them directly in your run commands:

- run: echo $MY_SECRET_TOKEN

Advanced Features

Feature | Example
Workflows | Run jobs in parallel or sequentially
Conditional steps | Use when and unless
Caching | Speed up builds using save_cache / restore_cache
Reusable configs | commands, executors, orbs
Matrix builds | Run tests against multiple language versions

⚙️ Caching Example

- restore_cache:
    keys:
      - v1-deps-{{ checksum "package-lock.json" }}

- run: npm install

- save_cache:
    paths:
      - node_modules
    key: v1-deps-{{ checksum "package-lock.json" }}

CircleCI vs GitHub Actions vs GitLab CI vs Jenkins

Feature | CircleCI | GitHub Actions | GitLab CI | Jenkins
Hosted | ✅ Yes | ✅ Yes | ✅ Yes | ❌ Self-hosted
Docker-native | ✅ Strong | ✅ Good | ✅ Strong | ✅ With config
Config as Code | .yml | .yml | .yml | Groovy DSL
Marketplace | ✅ Orbs | ✅ Actions | ⚠️ Few | ✅ Plugins
Matrix builds | ✅ Built-in | ✅ Supported | ✅ Parallel jobs | ✅ Scripted



What is Amazon SageMaker Pipelines?

SageMaker Pipelines is Amazon's CI/CD service for machine learning workflows. It lets you build, automate, and manage ML workflows (like data prep, training, tuning, evaluation, and deployment) using a Python SDK.

It’s similar to Kubeflow Pipelines or Airflow but tightly integrated into AWS SageMaker.


⚙️ Typical Use Case: End-to-End ML Workflow

[Data Prep] → [Feature Engineering] → [Model Training] → [Model Evaluation] → 
[Model Registration] → [Deployment]

๐Ÿ“ Basic Structure Using SageMaker Python SDK

from sagemaker.workflow.pipeline import Pipeline
from sagemaker.workflow.steps import ProcessingStep, TrainingStep
from sagemaker.workflow.model_step import ModelStep
from sagemaker.workflow.parameters import ParameterString, ParameterFloat
from sagemaker.processing import ScriptProcessor
from sagemaker.estimator import Estimator
from sagemaker.workflow.pipeline_context import PipelineSession

✅ Example: Full ML Pipeline

from sagemaker.workflow.parameters import ParameterString
from sagemaker.workflow.steps import ProcessingStep, TrainingStep
from sagemaker.workflow.pipeline import Pipeline
from sagemaker.processing import ScriptProcessor, ProcessingInput, ProcessingOutput
from sagemaker.estimator import Estimator
from sagemaker.workflow.pipeline_context import PipelineSession
import sagemaker

# Setup
region = sagemaker.Session().boto_region_name
role = sagemaker.get_execution_role()
pipeline_session = PipelineSession()

# Parameters
input_data = ParameterString(name="InputData", default_value="s3://my-bucket/input.csv")

# Step 1: Preprocessing
processor = ScriptProcessor(
    image_uri=sagemaker.image_uris.retrieve("sklearn", region),
    command=["python3"],
    role=role,
    instance_count=1,
    instance_type="ml.m5.large",
)

processing_step = ProcessingStep(
    name="DataPreprocessing",
    processor=processor,
    inputs=[ProcessingInput(source=input_data, destination="/opt/ml/processing/input")],
    code="preprocess.py",
    outputs=[...]
)

# Step 2: Training
estimator = Estimator(
    image_uri=sagemaker.image_uris.retrieve("xgboost", region),
    role=role,
    instance_count=1,
    instance_type="ml.m5.xlarge",
    output_path="s3://my-bucket/model/",
)

training_step = TrainingStep(
    name="ModelTraining",
    estimator=estimator,
    inputs={"train": processing_step.properties.ProcessingOutputConfig.Outputs["train_data"].S3Output.S3Uri},
)

# Pipeline Definition
pipeline = Pipeline(
    name="MyMLPipeline",
    parameters=[input_data],
    steps=[processing_step, training_step],
    sagemaker_session=pipeline_session
)

pipeline.upsert(role_arn=role)
execution = pipeline.start()

Key Components of SageMaker Pipelines

Component | Purpose
ProcessingStep | Data cleaning, feature engineering, etc.
TrainingStep | Model training using Estimator
TransformStep | Batch inference
ConditionStep | Add logic based on metrics
ModelStep | Register model to Model Registry
CallbackStep | Integrate with Lambda/custom logic
ParameterString/Float | Dynamically pass pipeline inputs
PipelineSession | Manages interaction with SageMaker

Benefits

  • Managed service – no servers to manage

  • Trackable runs with versioning, lineage, and metadata

  • Built-in CI/CD for ML

  • Integration with SageMaker Experiments, Model Registry, and Feature Store

  • Scalable with on-demand compute and built-in retry logic


Real-World Example Flow

1. Ingest raw CSV from S3
2. Clean & split data (ProcessingStep)
3. Train XGBoost or sklearn model (TrainingStep)
4. Evaluate accuracy, F1 score (ConditionStep)
5. If metrics are good → register model (ModelStep)
6. Deploy to endpoint via Lambda or manual

Related AWS Services

Service | Purpose
S3 | Data input/output
SageMaker Studio | GUI for pipelines
SageMaker Feature Store | Feature engineering
Model Registry | Version & track models
Lambda / Step Functions | Extend logic or trigger deployment
CloudWatch | Logging & monitoring



What is ZenML?

ZenML is an open-source MLOps framework built to orchestrate reproducible ML pipelines across tools like MLflow, Airflow, Kubernetes, and SageMaker.

✅ Features:

  • Tool-agnostic: plug in TensorFlow, PyTorch, sklearn, etc.

  • Built-in support for MLflow, Weights & Biases, GCP, AWS, Kubernetes

  • Focus on pipelines, reproducibility, modularity

  • Developer-friendly CLI + Python SDK

๐Ÿ“ ZenML Pipeline Example:

from typing import Any

import pandas as pd
from zenml.pipelines import pipeline
from zenml.steps import step

@step
def ingest_data() -> pd.DataFrame:
    ...

@step
def train_model(data: pd.DataFrame) -> Any:
    ...

@pipeline
def training_pipeline(data_loader, trainer):
    data = data_loader()
    model = trainer(data)

pipeline_instance = training_pipeline(ingest_data(), train_model())
pipeline_instance.run()

ZenML separates your pipeline into clean steps and supports plugins to execute on local, Kubeflow, Airflow, Vertex AI, etc.


What is TFX (TensorFlow Extended)?

TFX is Google's official end-to-end platform for deploying TensorFlow models in production. It was built to meet internal Google ML production needs.

✅ Features:

  • Native integration with TensorFlow ecosystem

  • Standard components: ExampleGen, Trainer, Evaluator, Pusher, etc.

  • Works with Apache Beam, Kubeflow Pipelines, Airflow

  • Focuses heavily on data validation, model analysis, serving

๐Ÿ“ TFX Pipeline Example:

from tfx.orchestration import pipeline
from tfx.orchestration.local.local_dag_runner import LocalDagRunner
from tfx.components import CsvExampleGen, Trainer, Pusher

example_gen = CsvExampleGen(input_base='data/')
trainer = Trainer(...)
pusher = Pusher(...)

my_pipeline = pipeline.Pipeline(
    pipeline_name='my_pipeline',
    pipeline_root='pipelines/',
    components=[example_gen, trainer, pusher]
)

LocalDagRunner().run(my_pipeline)

TFX enforces TensorFlow-specific best practices for data quality, model performance, and deployment.


ZenML vs TFX: Feature Comparison

Feature | ZenML | TFX
Language | Python (framework-agnostic) | Python (TensorFlow-focused)
ML Framework Support | TensorFlow, PyTorch, sklearn, etc. | TensorFlow only
Component Modularity | Highly modular + customizable | Modular (TensorFlow-centric)
Orchestrators | Airflow, Kubeflow, MLflow, Prefect | Airflow, Kubeflow
Deployment Support | SageMaker, Vertex AI, KServe | TensorFlow Serving, Vertex AI
Visualization / Metadata | MLflow, W&B, ZenML UI | TensorBoard, TFX Metadata
Pipeline Reproducibility | ✅ Yes | ✅ Yes
Local Execution | ✅ Yes | ✅ Yes
Ease of Use | Beginner-friendly | More complex, steep learning curve

When to Use What?

Scenario | Use
Want a framework-agnostic, modular, easy-to-adopt pipeline | ZenML
Already using TensorFlow and want to follow best practices | TFX
Need to plug into SageMaker, MLflow, K8s, etc. | ZenML
Need advanced model validation, explainability, data skew detection | TFX

TL;DR

ZenML | TFX
Flexible, lightweight, easy to start | Powerful, opinionated, deep TensorFlow support
Works with any ML/DL framework | TensorFlow-only
Ideal for hybrid/multi-cloud & plug-n-play MLOps | Ideal for enterprise-grade TensorFlow pipelines

9. Monitoring and Logging



What is Drift in Machine Learning?

In production ML, drift refers to changes over time in the data or relationships that the model depends on, which can lead to reduced model accuracy.

There are two main types:


1. Data Drift (a.k.a. Covariate Shift)

Definition:
The distribution of input features (X) changes over time, but the relationship between input and output (P(y|x)) remains the same.

Example:

  • A credit scoring model was trained on users from India, but it’s now being used in the US.

  • Feature distributions like age, income, or credit history change → data drift.

Detection Methods:

  • Statistical tests (e.g., Kolmogorov-Smirnov test)

  • Population Stability Index (PSI)

  • Earth Mover’s Distance

  • Histograms & density plots
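As a sketch, the Population Stability Index can be computed directly with NumPy. The 10-bin choice and the 0.1/0.25 thresholds mentioned in the comments are conventional rules of thumb, not fixed rules:

```python
import numpy as np

def psi(expected: np.ndarray, actual: np.ndarray, bins: int = 10) -> float:
    """Population Stability Index between a baseline sample and a production sample."""
    edges = np.histogram_bin_edges(expected, bins=bins)
    e_pct = np.histogram(expected, bins=edges)[0] / len(expected)
    a_pct = np.histogram(actual, bins=edges)[0] / len(actual)
    # Clip to avoid division by zero / log(0) for empty bins
    e_pct = np.clip(e_pct, 1e-6, None)
    a_pct = np.clip(a_pct, 1e-6, None)
    return float(np.sum((a_pct - e_pct) * np.log(a_pct / e_pct)))

rng = np.random.default_rng(42)
baseline = rng.normal(0, 1, 10_000)
stable = rng.normal(0, 1, 10_000)   # same distribution as baseline
shifted = rng.normal(1, 1, 10_000)  # mean shifted by one standard deviation

print(psi(baseline, stable))   # small value (< 0.1): no meaningful drift
print(psi(baseline, shifted))  # large value (> 0.25): significant drift
```

A common interpretation: PSI below 0.1 means stable, 0.1 to 0.25 moderate shift, above 0.25 significant drift worth investigating.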


2. Model Drift (a.k.a. Concept Drift)

Definition:
The relationship between input and target variable (P(y|x)) changes over time, even if input distribution remains stable.

Example:

  • A fraud detection model where fraudster behavior evolves (e.g., new tactics)

  • The model can no longer accurately map inputs to the correct outcome → model drift.

Detection Methods:

  • Monitoring model performance metrics (e.g., accuracy, AUC, F1)

  • If model metrics drop but input features haven’t changed → model drift

  • Concept drift detectors like:

    • DDM (Drift Detection Method)

    • ADWIN

    • Kullback-Leibler divergence
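Since concept drift typically shows up as a drop in live metrics once true labels arrive, a minimal rolling-accuracy monitor can serve as a first detector. The window size and threshold below are illustrative choices, not tuned values:

```python
from collections import deque

class AccuracyDriftMonitor:
    """Flags possible concept drift when rolling accuracy falls below a threshold."""

    def __init__(self, window: int = 100, threshold: float = 0.75):
        self.results = deque(maxlen=window)
        self.threshold = threshold

    def update(self, y_true, y_pred) -> bool:
        """Record one labeled prediction; return True if drift is suspected."""
        self.results.append(int(y_true == y_pred))
        if len(self.results) < self.results.maxlen:
            return False  # wait until the window is full
        return sum(self.results) / len(self.results) < self.threshold

monitor = AccuracyDriftMonitor(window=10, threshold=0.75)
for _ in range(10):
    monitor.update(1, 1)  # warm-up: all predictions correct
alarms = [monitor.update(1, 0) for _ in range(3)]
print(alarms)  # third miss pushes rolling accuracy to 0.7: [False, False, True]
```

Dedicated detectors like DDM or ADWIN apply the same idea with statistically grounded change-point tests instead of a fixed threshold.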


Drift Comparison

Aspect | Data Drift | Model Drift
What changes | Input feature distribution (X) | Relationship between X and Y
Impact | Can indirectly reduce accuracy | Directly affects model accuracy
Detection | PSI, KS test, histograms | Drop in model performance
Remediation | Retrain with recent data | Retrain + re-define model logic

๐Ÿ” Common Causes of Drift

Cause Type
Seasonality or time-based shifts Data Drift
Change in user behavior Model Drift
External events (e.g., pandemic) Both
Sensor recalibration or software upgrades Data Drift

How to Monitor & Handle Drift

1. Monitoring Tools

  • Evidently AI – Open-source for drift detection (https://evidentlyai.com/)

  • WhyLabs, Arize AI, Fiddler, SageMaker Model Monitor

  • Custom dashboards with Prometheus/Grafana

2. Detection Frequency

  • Daily/weekly batch comparisons

  • Real-time if using streaming

3. Actions to Take

  • Trigger retraining pipelines

  • Use drift detectors in CI/CD workflows

  • Incorporate active learning or online learning


Summary

Term | What is it? | Why it matters
Data Drift | Input feature distribution changes | Model may make wrong inferences
Model Drift | Relationship between X and Y changes | Model becomes inaccurate


What is Model Performance Monitoring?

Model performance monitoring is the process of tracking, measuring, and analyzing how your ML model behaves in production — ensuring it's still accurate, fair, and reliable after deployment.


๐Ÿ” Why Is It Important?

Even the best model at training time can degrade in production due to:

  • Data drift

  • Model drift

  • Feature pipeline bugs

  • Feedback loops or changing real-world patterns

Without monitoring, you might miss silent failures that hurt business outcomes.


What to Monitor in ML Systems

✅ 1. Performance Metrics

Metric Type | Example
Classification | Accuracy, Precision, Recall, F1, AUC
Regression | RMSE, MAE, R²
Ranking | MAP, NDCG
Business KPIs | Conversion rate, CTR, etc.

Compare training vs. validation vs. production performance.


✅ 2. Data Quality & Drift

What to check | How
Missing values | Feature-level monitoring
Schema violations | Type, range, shape
Data drift | PSI, KS Test
Outliers or anomalies | Z-score, IQR, Mahalanobis

✅ 3. Prediction Distribution

  • Is the model outputting the same predictions every time?

  • Look for prediction bias or overconfident scores.
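These checks can be sketched in a few lines. The 0.99/0.01 overconfidence cutoffs below are arbitrary illustrative thresholds, not standard values:

```python
import numpy as np

def prediction_health(scores: np.ndarray) -> dict:
    """Basic health checks on a batch of predicted probabilities."""
    return {
        # Identical outputs for every request often indicate a broken feature pipeline
        "constant_output": bool(np.all(scores == scores[0])),
        # Share of near-certain scores; a sudden spike can signal overconfidence
        "overconfident_share": float(np.mean((scores > 0.99) | (scores < 0.01))),
        # Track the mean score over time to spot prediction bias
        "mean_score": float(scores.mean()),
    }

print(prediction_health(np.array([0.2, 0.8, 0.5, 0.995])))
```

In practice these values would be exported as metrics (e.g., to Prometheus) and alerted on rather than printed.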


✅ 4. Fairness and Bias

  • Measure model fairness across sensitive groups (e.g., age, gender).

  • Monitor disparities in performance.


✅ 5. Latency and Throughput

  • Inference latency (ms/req)

  • Request volume

  • System resource usage (CPU/GPU, memory)


⚒️ Tools for Model Monitoring

Open-Source

Tool | Features
Evidently AI | Data & model drift, dashboards, reports
Prometheus + Grafana | Custom monitoring (great for latency, metrics)
MLflow | Experiment tracking (with manual model logs)
WhyLogs | Logging and monitoring of data quality
Fiddler / Arize AI / TruEra | Monitoring + explainability (SaaS)

Cloud-Native

Platform | Monitoring Feature
SageMaker Model Monitor | Built-in drift & quality detection
Vertex AI (GCP) | Prediction monitoring, alerts
Azure ML | Drift + metric monitoring
Databricks | MLflow + production metrics

๐Ÿ” Monitoring Lifecycle Example

1. Model is deployed (API or batch)
2. User requests come in
3. Log: input data, model predictions, latency
4. Optional: collect true labels later (for supervised metrics)
5. Compare live vs baseline (training) distributions & metrics
6. Trigger alerts / retrain pipelines if performance drops

Sample: Custom Monitoring Loop (Python)

import pandas as pd
from sklearn.metrics import accuracy_score

# 1. Collect live predictions and labels
preds = pd.read_csv("live_predictions.csv")
truth = pd.read_csv("live_labels.csv")

# 2. Calculate performance
acc = accuracy_score(truth["label"], preds["prediction"])

# 3. Trigger alert if accuracy drops
if acc < 0.75:
    print("⚠️ Model accuracy dropped below threshold!")

Best Practices

✅ Set performance baselines from training
✅ Store input + predictions + actuals
✅ Monitor in real-time or batch
✅ Set up alerts or retraining triggers
✅ Regularly audit for fairness and explainability


What is Prometheus?

Prometheus is an open-source monitoring and alerting system originally developed by SoundCloud. It’s widely used for real-time metrics collection, alerting, and visualization, especially in DevOps and ML infrastructure.


✅ Why Use Prometheus for ML & MLOps?

  • Track model inference metrics (latency, throughput, errors)

  • Monitor CPU/GPU usage of ML workloads

  • Combine with Grafana for dashboards

  • Setup alerts for performance or drift degradation

  • Works great with Docker, Kubernetes, FastAPI, Flask, etc.


Core Concepts

Concept | Description
Metric | A time-series data point (e.g., inference_latency_seconds)
Labels | Key-value tags for filtering metrics (e.g., model="xgboost")
Exporter | Collects metrics from apps (e.g., Python, GPU, Docker)
Scraping | Prometheus pulls metrics by scraping a target HTTP endpoint
Query | Uses PromQL to query metrics
Alertmanager | Sends alerts via email, Slack, PagerDuty, etc.

Example: Expose Metrics in Python (FastAPI + Prometheus)

pip install prometheus_client fastapi uvicorn

# app.py
from fastapi import FastAPI
from prometheus_client import start_http_server, Summary, Counter
import time
import random

app = FastAPI()

# Metrics
REQUEST_TIME = Summary('inference_latency_seconds', 'Time spent on inference')
REQUEST_COUNT = Counter('inference_requests_total', 'Total inference requests')

@app.get("/predict")
@REQUEST_TIME.time()
def predict():
    REQUEST_COUNT.inc()
    time.sleep(random.uniform(0.1, 0.5))  # simulate inference delay
    return {"result": "cat"}

# Run Prometheus metrics server on port 8001
start_http_server(8001)

๐Ÿ” Prometheus Configuration (prometheus.yml)

global:
  scrape_interval: 15s

scrape_configs:
  - job_name: 'ml-api'
    static_configs:
      - targets: ['localhost:8001']

Prometheus scrapes http://localhost:8001/metrics every 15s.


Visualize in Grafana

  1. Run Prometheus + Grafana using Docker:

docker-compose up

  2. Add Prometheus as a data source in Grafana

  3. Create dashboards using PromQL, e.g.:

inference_latency_seconds_count
rate(inference_latency_seconds_sum[1m])

Popular Exporters

Exporter | Use
prometheus_client | App-level metrics in Python
node_exporter | System metrics (CPU, memory)
gpu_exporter | NVIDIA GPU metrics
kube-state-metrics | Kubernetes objects
pushgateway | For short-lived jobs (like batch ML)

Alerts (via Alertmanager)

Example rule:

groups:
- name: ml-alerts
  rules:
  - alert: HighLatency
    expr: inference_latency_seconds_sum / inference_latency_seconds_count > 0.3
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: "High inference latency detected"

๐Ÿ” ML Monitoring Use Cases

Use Case Metric
Latency inference_latency_seconds
Traffic inference_requests_total
Failure rate inference_errors_total
Resource usage From node_exporter or gpu_exporter
Drift triggers Custom metrics exposed from model logic


What is Grafana?

Grafana is an open-source analytics and dashboarding tool used to visualize time-series data from sources like Prometheus, InfluxDB, Elasticsearch, Loki, and many others.

In MLOps, Grafana is often paired with Prometheus to monitor:

  • Model inference latency

  • Drift signals

  • API uptime and errors

  • CPU/GPU utilization

  • Data pipeline performance


✅ Why Use Grafana?

  • Beautiful interactive dashboards

  • Flexible PromQL/SQL queries

  • Alerting capabilities

  • Works with ML/DevOps monitoring tools

  • Integration with Slack, email, PagerDuty for alerts


Key Features

Feature | Description
Panels | Graphs, tables, heatmaps, gauges, logs
Variables | Dynamic filters (e.g., model name)
Data Sources | Prometheus, Loki, AWS CloudWatch, PostgreSQL, etc.
Annotations | Add events or markers to timelines
Alerts | Visual + rule-based threshold alerts

Common ML Use Cases

Use Case | Panel Type | Metric Source
Inference latency | Line chart | Prometheus
Drift score over time | Graph panel | Evidently/WhyLogs
Error rate | Stat panel | Prometheus
GPU usage | Gauge / Time series | NVIDIA exporter
Feature distribution | Histogram / Heatmap | Custom app metrics

Example Setup (Local)

1. Docker-Compose (Prometheus + Grafana)

# docker-compose.yml
version: '3'
services:
  prometheus:
    image: prom/prometheus
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
    ports:
      - "9090:9090"

  grafana:
    image: grafana/grafana
    ports:
      - "3000:3000"

2. Start It

docker-compose up -d

Sample Grafana Dashboard Panels for ML

Inference Latency Panel

  • Metric: inference_latency_seconds

  • Query:

rate(inference_latency_seconds_sum[1m]) / rate(inference_latency_seconds_count[1m])

Error Rate Panel

  • Metric: inference_errors_total

  • Query:

rate(inference_errors_total[5m])

Model Drift Detection Panel

  • Metric: feature_drift_score

  • Query:

avg_over_time(feature_drift_score[1h])

Alerts in Grafana

  1. Create a panel → Set thresholds (e.g., latency > 500ms)

  2. Add Alert → Define condition (e.g., avg over 5min)

  3. Connect Alert Manager / Slack / Email


✨ Example Dashboards

Dashboard | Panels
Model Monitoring | Accuracy, F1, latency, requests
System Monitoring | CPU, RAM, GPU, disk
ETL Pipeline Monitoring | Job success, failure rate, execution time
Data Drift Monitor | PSI/KS scores, feature distribution


Below is a comparison and overview of Evidently AI, WhyLabs, and Seldon Core, three widely used tools in the MLOps and model monitoring ecosystem:


1. Evidently AI

Purpose: Open-source Python library for data & model monitoring, focused on drift detection, data quality, and performance reports.

✅ Use Cases:

  • Data & target drift detection

  • Feature distribution changes

  • Model performance reports

  • Offline or in-pipeline monitoring

Integration:

  • Python scripts, Jupyter notebooks

  • Airflow, Prefect, Kubeflow, etc.

๐Ÿ” Key Features:

Feature Description
Data Drift Report Detects change in feature distributions
Target Drift Report Monitors label distribution changes
Classification/Regression Reports Accuracy, F1, ROC, etc.
Data Quality Report Nulls, type mismatches, etc.
Dashboards (Evidently UI) Serve reports as interactive UI locally or in pipelines

๐Ÿงช Example (Python):

from evidently.report import Report
from evidently.metric_preset import DataDriftPreset
report = Report(metrics=[DataDriftPreset()])
report.run(reference_data=ref_df, current_data=prod_df)
report.save_html("drift_report.html")

☁️ 2. WhyLabs + WhyLogs

Purpose: Enterprise-grade observability and monitoring platform for ML pipelines and data quality, offering automated logging, drift detection, and dashboards.

✅ Use Cases:

  • Continuous production monitoring

  • Automated data profiling

  • Real-time alerting

  • Integration with cloud & on-prem ML workflows

๐Ÿ” Key Features:

Feature Description
WhyLogs Open-source library for logging statistics about data
WhyLabs Platform SaaS platform for dashboards, alerts
Segmented Monitoring Track metrics across different user segments
Lightweight Logging Doesn’t expose raw data (great for compliance)
Streaming / Batch Works in both modes; supports Spark, Pandas, S3, Kafka, etc.

๐Ÿงช Example (Python):

import whylogs as why
profile = why.log(pandas_df).profile()
profile.write(path="profile.bin")

You then upload this profile to the WhyLabs platform (e.g., via the WhyLabs writer integration or API).


๐Ÿš€ 3. Seldon Core

Purpose: Open-source platform for deploying, scaling, and monitoring ML models on Kubernetes. Sits alongside tools like KServe (formerly KFServing).

✅ Use Cases:

  • Kubernetes-native model serving

  • A/B testing, canary rollout, multi-model serving

  • Real-time inference monitoring

  • Explainability & drift detection

๐Ÿ” Key Features:

Feature Description
MLServer Fast, multi-language model server
Explainers SHAP, Lime integration out of the box
Drift Detectors Kolmogorov-Smirnov, PSI, etc.
Outlier Detectors Use alibi-detect or custom models
Seldon Metrics Prometheus & Grafana-ready metrics
Advanced Routing Can run A/B, multi-armed bandit deployments
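The Kolmogorov-Smirnov drift detector listed above boils down to the largest gap between two empirical CDFs. A minimal stdlib sketch of the two-sample statistic (distance only, no p-value):

```python
import bisect

def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic: the maximum gap
    between the two empirical CDFs (a distance in [0, 1])."""
    a, b = sorted(sample_a), sorted(sample_b)

    def ecdf(sorted_sample, x):
        # fraction of the sample with value <= x
        return bisect.bisect_right(sorted_sample, x) / len(sorted_sample)

    return max(abs(ecdf(a, x) - ecdf(b, x)) for x in sorted(set(a) | set(b)))

# Identical samples → 0; samples from disjoint ranges → 1
print(ks_statistic([1, 2, 3], [1, 2, 3]))     # → 0.0
print(ks_statistic([1, 2, 3], [10, 11, 12]))  # → 1.0
```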

๐Ÿงฑ Architecture:

[Kubernetes]
   |
[SeldonDeployment YAML]
   |
[Model Pods] <---> [Metrics + Monitoring Pods]
   |
[Ingress Gateway]
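The architecture above is driven by a SeldonDeployment custom resource. A minimal sketch (the names and model URI are placeholders), assuming a scikit-learn model served by the prepackaged SKLEARN_SERVER:

```yaml
# Minimal SeldonDeployment sketch — names/URIs are placeholders
apiVersion: machinelearning.seldon.io/v1
kind: SeldonDeployment
metadata:
  name: my-model
spec:
  predictors:
    - name: default
      replicas: 1
      graph:
        name: classifier
        implementation: SKLEARN_SERVER
        modelUri: s3://ml-bucket/models/my-model
```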

๐Ÿ”„ Comparison Table

Feature/Tool Evidently AI WhyLabs + WhyLogs Seldon Core
Type Python library/UI Logging + SaaS platform Kubernetes deployment
Drift Detection ✅ ✅ ✅ (via Alibi Detect)
Model Serving ❌ ❌ ✅
Monitoring ✅ (offline) ✅ (cloud/streaming) ✅ (real-time, Prometheus)
Alerts Manual + Grafana Built-in SaaS alerts With Prometheus + AlertMgr
Integration Python, Notebooks Spark, Kafka, S3, Pandas Kubernetes + Prometheus
Visual UI Local HTML/UI server WhyLabs dashboard Grafana integration
Open Source ✅ Partially (WhyLogs = ✅) ✅

๐Ÿ”— Ideal Tool Based on Need:

Need Tool
Quick & Local Drift Detection Evidently AI
Enterprise-Grade Logging & SaaS Dashboards WhyLabs
Full ML Deployment + Drift Detection in K8s Seldon Core


10. Cloud & Infrastructure for MLOps


๐Ÿง  AWS in the ML/MLOps Ecosystem

AWS offers end-to-end tools for data ingestion, training, model deployment, monitoring, and CI/CD.


๐Ÿ”ง Key AWS Services for MLOps

Category Service Purpose
Storage & Data S3, Glue, Athena Data lake, ETL, querying logs/metadata
Model Development SageMaker Studio IDE for ML dev (like JupyterLab)
Model Training SageMaker Training Jobs Scalable training on EC2 or Spot
Model Deployment SageMaker Endpoints Real-time APIs for inference
Model Registry SageMaker Model Registry Manage model versions and metadata
CI/CD CodePipeline, CodeBuild, Lambda Automate training/testing/deployment
Monitoring & Drift SageMaker Model Monitor Detect drift, outliers, quality issues
Observability CloudWatch, Prometheus, Grafana Metrics, logging, alerting
Feature Store SageMaker Feature Store Store, reuse, and version features
Security & Auth IAM, KMS, VPC, S3 Policies Access control and encryption

๐Ÿ”„ MLOps Workflow on AWS

1. Data Collection & Processing

  • Use AWS Glue or S3 + Lambda to collect/clean data

  • Version datasets using DVC or S3 object versioning

2. Model Training

  • Launch training jobs using SageMaker Training

  • Auto-scale compute; log metrics to CloudWatch

3. Model Evaluation & Registration

  • Evaluate metrics, visualize in SageMaker Experiments

  • Register successful model in Model Registry

4. Model Deployment

  • Use SageMaker Inference Endpoints for:

    • Real-time (InvokeEndpoint)

    • Batch (BatchTransform)

  • Configure autoscaling and multi-model endpoints if needed

5. Monitoring & Drift Detection

  • Use SageMaker Model Monitor to:

    • Detect data drift (feature value distribution)

    • Detect model drift (label drift, performance)

    • Log anomalies to CloudWatch

6. CI/CD for ML

  • Automate with:

    • CodePipeline: Orchestration

    • CodeBuild: Build + test steps

    • Step Functions: Complex ML workflows

    • EventBridge: Trigger on file uploads, model updates


๐Ÿ“Š Example: Real-time Drift Monitoring with SageMaker

from sagemaker.model_monitor import DataCaptureConfig

# Enable inference data capture
data_capture_config = DataCaptureConfig(
    enable_capture=True,
    sampling_percentage=100,
    destination_s3_uri="s3://mybucket/captured-data"
)

# Attach it while deploying (deploy() is a method on a trained
# Model/Estimator object, not on the sagemaker module)
predictor = model.deploy(
    initial_instance_count=1,
    instance_type='ml.m5.large',
    data_capture_config=data_capture_config
)

Then set up a Monitoring Schedule using ModelMonitor.


๐Ÿ“ˆ Observability with CloudWatch + Grafana

  • CloudWatch collects metrics like:

    • Latency

    • Invocation count

    • 4xx/5xx errors

    • Custom logs from inference scripts

  • Connect Prometheus to Amazon Managed Grafana for:

    • Model-specific dashboards

    • Drift visualization

    • Alerting via SNS or Slack


๐Ÿง  Final Thoughts

Goal Tools to Use
Local development & experiments SageMaker Studio, S3
Deployment with monitoring SageMaker Endpoint + Model Monitor
Production CI/CD pipelines CodePipeline, Step Functions
Enterprise monitoring CloudWatch + Grafana or Prometheus



☁️ GCP for Machine Learning & MLOps

GCP offers an end-to-end AI/ML ecosystem via tools like Vertex AI, BigQuery, Cloud Functions, and Cloud Monitoring.


๐Ÿ”ง Key GCP Services for MLOps

MLOps Phase GCP Tool Purpose
Data Storage Cloud Storage (GCS) Object store for datasets/models
Data Analysis BigQuery, Dataflow SQL-based analytics, streaming pipelines
ML Platform Vertex AI Unified ML lifecycle platform
Training Vertex AI Training Managed model training on CPUs/GPUs/TPUs
Model Registry Vertex Model Registry Store and manage model versions
Deployment Vertex AI Endpoints Real-time/batch model inference
Monitoring Vertex AI Model Monitoring Monitor drift, skew, performance
CI/CD Cloud Build, Cloud Functions Automate ML pipeline steps
Observability Cloud Logging & Monitoring Alerting, visualization
Pipelines Vertex AI Pipelines (Kubeflow) Orchestration of ML workflows
Feature Store Vertex Feature Store Central repository of features

๐Ÿ”„ GCP MLOps Workflow Overview

1. Data Storage & Preparation

  • Store raw/processed data in GCS

  • Use Dataflow or Dataprep for batch/stream processing

  • Explore data via BigQuery


2. Model Development & Training

  • Develop locally or in Vertex AI Workbench (JupyterLab)

  • Train using:

    • Vertex AI Training Jobs (managed)

    • Custom containers (e.g., with PyTorch/TensorFlow)

    • TPUs for large-scale deep learning


3. Model Evaluation & Versioning

  • Evaluate metrics post-training

  • Register model in Vertex Model Registry

  • Use Artifact Registry for Docker images


4. Model Deployment

  • Deploy to Vertex AI Endpoints:

    • Real-time predictions via REST API

    • Scalable with autoscaling/load balancing

  • For batch inference: use Batch Prediction Jobs


5. Model Monitoring

  • Vertex AI Model Monitoring handles:

    • Prediction data drift

    • Training-serving skew

    • Feature skew

  • Configure alerts to trigger on threshold breaches

  • Logs can be piped to Cloud Logging for audits


6. CI/CD Pipelines

  • Use Vertex Pipelines (Kubeflow Pipelines on GCP) to:

    • Automate training → evaluation → deployment

    • Integrate with Cloud Build for custom CI steps

# sample Kubeflow Pipelines (KFP v2) component
from kfp.dsl import component

@component
def train_model(input_data: str) -> str:
    ...

๐Ÿ“ˆ GCP Monitoring Stack

Tool Use
Cloud Monitoring (Stackdriver) Metrics like latency, errors, usage
Cloud Logging Inference logs, pipeline status
Vertex Model Monitoring Drift, skew, performance metrics
BigQuery Store and analyze monitoring data
Grafana (via GKE) Custom dashboards for model metrics

๐Ÿ“Š Model Drift Monitoring Example

Enable monitoring when deploying model:

gcloud ai endpoints deploy-model ENDPOINT_ID \
  --model=MODEL_ID \
  --display-name="drift-monitored-model" \
  --enable-access-logging

Drift and skew monitoring itself is attached as a separate model monitoring job (via the console or gcloud ai model-monitoring-jobs create); set thresholds there or via the REST API.


๐Ÿš€ Popular Use Cases on GCP

Use Case GCP Services
NLP model deployment Vertex AI + Cloud Functions
Data pipeline with streaming Pub/Sub + Dataflow
Real-time fraud detection Vertex AI + BigQuery + Monitoring
Retail recommender system Feature Store + Vertex AI + Monitoring

๐Ÿง  Final Thoughts

Objective GCP Tools
Unified ML lifecycle Vertex AI
Data processing BigQuery, Dataflow
CI/CD Vertex Pipelines, Cloud Build
Observability Stackdriver, Vertex Monitoring
Custom workflows Kubeflow, GKE, Cloud Functions



☁️ Azure for MLOps & Machine Learning

Azure offers a comprehensive and scalable platform to manage the entire ML lifecycle — from data ingestion to deployment, monitoring, and retraining.


๐Ÿ”ง Key Azure MLOps Components

MLOps Phase Azure Tool Purpose
Data Storage Azure Blob Storage, ADLS Store datasets, models, logs
Data Processing Azure Data Factory, Synapse ETL, big data analytics
ML Platform Azure Machine Learning (Azure ML) Unified ML development/deployment
Model Training Azure ML Compute, Azure Databricks Train with CPU, GPU, or Spark
Experiment Tracking Azure ML Experiments Track metrics, parameters, versions
Model Registry Azure ML Model Registry Central model storage
Deployment Azure ML Endpoints, AKS, ACI Real-time/batch serving
CI/CD Pipelines Azure DevOps, GitHub Actions Automate ML lifecycle
Monitoring Azure Monitor, App Insights Track performance, drift, logs
Feature Store (Preview) Azure ML Feature Store Reusable features for ML models

๐Ÿ”„ Azure MLOps Workflow Overview

1. Data Ingestion & Storage

  • Use Azure Data Factory or Synapse Pipelines for ingesting data

  • Store datasets in Blob Storage or ADLS Gen2


2. Data Processing & Exploration

  • Use Azure Synapse, Databricks, or Jupyter Notebooks in Azure ML workspace

  • Perform data cleaning, EDA, feature engineering


3. Model Development & Experimentation

  • Work within Azure ML Studio or integrate with VSCode

  • Use Experiment tracking to compare models across metrics & hyperparameters

  • Train models on:

    • Local or remote compute

    • AML Compute Cluster (autoscaling)

    • Databricks Spark cluster


4. Model Versioning & Registry

  • Register successful models into the Model Registry

  • Associate model with training metrics and dataset version


5. Deployment

  • Deploy models to:

    • Managed Endpoints (real-time REST APIs)

    • AKS for production-grade serving

    • ACI for testing/dev workloads

    • Batch Endpoints for offline inference


6. CI/CD with Azure DevOps

  • Use Azure DevOps Pipelines or GitHub Actions

  • Automate:

    • Data validation → model training → evaluation → deployment

  • YAML-based templates and pre-built tasks available

# Azure ML pipeline YAML (simplified)
trigger:
  branches:
    include: [main]

jobs:
- job: TrainModel
  steps:
    - task: AzureMLTrain@1
      inputs:
        workspaceName: 'ml-workspace'
        experimentName: 'churn-model'

7. Monitoring & Retraining

  • Monitor using:

    • Azure Monitor for system metrics

    • Application Insights for API-level logs

    • ML Model Monitoring for data drift, concept drift, and performance

  • Set alerts and automate retraining pipelines via triggers


๐Ÿ“Š Drift & Performance Monitoring Example

Azure ML can monitor:

  • Data drift between training & production data

  • Prediction drift and label distribution changes

  • Model performance degradation

from azureml.monitoring import ModelDataCollector

collector = ModelDataCollector("model-name", feature_names=["age", "income"])
collector.collect(data=X_inference)

๐Ÿ“ˆ Integration with Azure Ecosystem

Azure Service Role in MLOps Pipeline
Azure DevOps CI/CD, testing, version control
Azure Monitor Real-time logging, alerting
Azure Kubernetes Scalable inference serving
Azure Key Vault Secure management of API keys/secrets
Azure Functions Trigger retraining or workflows
Power BI Visualize model outputs and predictions
Azure Logic Apps No-code orchestration for alerts/retraining

๐Ÿ“ Model Deployment Options

Environment Use Case
ACI (Azure Container Instance) Quick testing/staging
AKS (Azure Kubernetes) Scalable production
Local Docker Custom environment
Batch Endpoints Non-real-time inference jobs

๐Ÿง  Azure MLOps Use Cases

Use Case Azure Tools
Credit risk scoring Azure ML + DevOps + AKS
Demand forecasting Azure ML Pipelines + Batch Endpoints
Real-time recommendation Azure ML Endpoints + AKS
Automated retraining Azure DevOps + Azure ML Triggers

๐Ÿงฐ Comparison with Other Clouds

Capability Azure ML GCP Vertex AI AWS SageMaker
GUI & SDK Support Strong (Studio + CLI + SDK) Strong Strong
CI/CD Pipelines Azure DevOps, GitHub Vertex Pipelines SageMaker Pipelines
Monitoring & Drift Azure Monitor + ML monitor Vertex AI Monitoring SageMaker Model Monitor
Feature Store Preview Production-ready Production-ready



๐Ÿ” What is IAM & Access Control?

IAM (Identity and Access Management) is the framework for:

  • Identifying users, services, or machines

  • Controlling what they can access (data, services, resources)

  • Auditing and enforcing security policies


๐Ÿ’ก Why IAM is Critical in MLOps

Use Case IAM Role Needed
Data access for training Grant access to S3, Blob, GCS buckets
CI/CD pipeline automation Roles for GitHub Actions, Jenkins, or Azure DevOps
Model serving Access to endpoints, containers, logging
Secure secrets handling Access to key vaults or secret managers
Auditing & compliance Logs of who accessed or changed models/data

๐ŸŒ IAM Across Cloud Platforms

1. AWS IAM

  • IAM Users, Groups, Roles, Policies (JSON)

  • Common roles:

    • AmazonS3FullAccess

    • AmazonSageMakerFullAccess

    • Custom policies with fine-grained permissions

  • Used in:

    • SageMaker Pipelines

    • Lambda, EC2, Step Functions

    • Secrets Manager for API keys

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:PutObject"],
      "Resource": "arn:aws:s3:::ml-data-bucket/*"
    }
  ]
}

2. Azure IAM (RBAC)

  • Uses Azure Active Directory (AAD) for identity

  • Role-Based Access Control (RBAC) manages access

  • Predefined roles:

    • Contributor, Reader, Owner, Azure ML Contributor

  • Custom roles for:

    • Access to storage accounts

    • Running Azure ML pipelines

    • Accessing compute targets

{
  "roleName": "CustomMLRole",
  "permissions": [
    {
      "actions": [
        "Microsoft.MachineLearningServices/*",
        "Microsoft.Storage/*/read"
      ]
    }
  ]
}

3. GCP IAM

  • Service accounts + IAM roles + resource policies

  • Predefined roles like:

    • roles/aiplatform.admin

    • roles/storage.objectViewer

  • Used in:

    • Vertex AI Pipelines

    • BigQuery, GCS

    • Secret Manager

bindings:
- role: roles/aiplatform.user
  members:
    - serviceAccount:ml-pipeline@my-project.iam.gserviceaccount.com

๐Ÿ›ก️ IAM in CI/CD & MLOps

MLOps Stage IAM Role Needed
Data preparation Access to datasets (S3, GCS, ADLS)
Model training Access to compute, logging, secrets
CI/CD pipeline GitHub Actions or Azure DevOps with scoped secrets
Model registry Read/write permissions to register
Model deployment Invoke permissions for endpoints
Monitoring Access to logs, metrics services

๐Ÿ”„ Example: GitHub Actions + AWS IAM for MLOps

  • GitHub Actions deploys model to SageMaker

  • Needs limited-access IAM role

jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - name: Configure AWS credentials
        uses: aws-actions/configure-aws-credentials@v2
        with:
          role-to-assume: arn:aws:iam::123456789012:role/GitHub-SageMaker-Deploy
          aws-region: us-east-1

๐Ÿ”’ Best Practices for IAM in MLOps

✅ Apply the Principle of Least Privilege
✅ Rotate credentials & use temporary tokens
✅ Use IAM Roles/Service Accounts over hardcoding credentials
✅ Enable logging & audit trails
✅ Store secrets in Key Vault/Secrets Manager
✅ Apply network policies & endpoint security


✅ Tools to Manage IAM + Secrets

Tool Purpose
AWS IAM + Secrets Manager Access control & credential store
Azure RBAC + Key Vault Role-based control & secrets
GCP IAM + Secret Manager Fine-grained permissions & key mgmt
HashiCorp Vault Cross-platform secret store
Kubernetes RBAC + ServiceAccounts For model deployment and services


Here’s a detailed comparison and usage guide for cloud storage in the MLOps and DevOps context, focusing on AWS S3 (Simple Storage Service) and GCP GCS (Google Cloud Storage).


☁️ Overview

Feature Amazon S3 Google Cloud Storage (GCS)
Service Name Amazon Simple Storage Service Google Cloud Storage
Storage Structure Buckets → Objects Buckets → Objects
URL Format https://s3.amazonaws.com/bucket/key https://storage.googleapis.com/bucket/key
Access Control IAM, Bucket Policies, ACLs IAM, Uniform/Bucket-level Policies
Versioning ✅ Supported ✅ Supported
Encryption SSE-S3, SSE-KMS, SSE-C CSE, CMEK, Google-managed keys
Lifecycle Mgmt ✅ (Transitions, Expiry rules) ✅ (Rules, Policies)
Event Triggers S3 Event Notifications (to Lambda, etc.) GCS Notifications (Pub/Sub, Cloud Functions)

๐Ÿง  Common MLOps Use Cases

Task How S3 / GCS Helps
Store raw training data CSVs, JSON, Parquet in S3 or GCS
Save processed features Feature store intermediates
Model artifacts Store .pkl, .pt, .joblib files
Logging / metrics storage Send logs or model metrics to S3/GCS
CI/CD pipelines Pass artifacts between build stages
Model registry (if custom) Versioned model storage

๐Ÿ› ️ How to Use in Practice

๐Ÿ”น AWS S3 Example (Python boto3)

import boto3

s3 = boto3.client('s3')
s3.upload_file('model.pkl', 'ml-bucket', 'models/model.pkl')
s3.download_file('ml-bucket', 'models/model.pkl', 'local_model.pkl')

Set AWS credentials using:

  • IAM role (EC2/SageMaker)

  • ~/.aws/credentials file

  • AWS_ACCESS_KEY_ID & AWS_SECRET_ACCESS_KEY env vars


๐Ÿ”น GCS Example (Python google-cloud-storage)

from google.cloud import storage

client = storage.Client()
bucket = client.get_bucket('ml-bucket')
blob = bucket.blob('models/model.pkl')
blob.upload_from_filename('model.pkl')
blob.download_to_filename('local_model.pkl')

Set GCP credentials using:

  • GOOGLE_APPLICATION_CREDENTIALS env var with service account key JSON


๐Ÿ”’ Access Control Tips

Platform Recommendation
AWS Use IAM roles with minimal permissions (e.g., s3:GetObject, s3:PutObject)
GCP Use service accounts with specific roles like roles/storage.objectViewer or roles/storage.admin

⏳ Lifecycle & Cost Management

Both support:

  • Object Lifecycle Rules: move to cold storage (Glacier or Nearline/Coldline)

  • Retention Policies: block deletion of data for X days

  • Auto-delete/expire rules for temp files and old models
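For S3, such rules are expressed as a declarative lifecycle configuration. A hedged sketch (the prefixes and day counts are placeholder choices): archive model artifacts to Glacier after 90 days and expire temp files after 7:

```json
{
  "Rules": [
    {
      "ID": "archive-old-models",
      "Filter": {"Prefix": "models/"},
      "Status": "Enabled",
      "Transitions": [{"Days": 90, "StorageClass": "GLACIER"}]
    },
    {
      "ID": "expire-temp-files",
      "Filter": {"Prefix": "tmp/"},
      "Status": "Enabled",
      "Expiration": {"Days": 7}
    }
  ]
}
```

GCS expresses equivalent rules in its own lifecycle JSON, settable via gsutil or the console.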


๐Ÿงช CI/CD & ML Pipelines Integration

Tool/Framework S3 Support GCS Support
SageMaker Pipelines ✅ Native ❌
Vertex AI ❌ ✅ Native
MLflow ✅ Via URI (s3://...) ✅ Via URI (gs://...)
Airflow ✅ ✅
Kubeflow ✅ ✅
ZenML ✅ ✅

๐Ÿ“‚ Versioning

  • Enable versioning in both platforms to track changes:

    • In S3: Go to bucket → Enable versioning

    • In GCS: Set bucket versioning with gsutil or Console

  • Useful for:

    • Rolling back models

    • Auditing training dataset changes


๐Ÿ›ก️ Security Best Practices

✅ Enable encryption (default in both)
✅ Block public access unless explicitly required
✅ Use signed URLs for limited-time sharing
✅ Monitor with logging (CloudTrail / Cloud Audit Logs)
✅ Use bucket-level access over object-level ACLs
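The signed URLs mentioned above are time-limited links (S3 presigned URLs, GCS signed URLs). The general mechanism, an HMAC over the path plus an expiry, can be sketched with the stdlib. This shows the concept only, not the real SigV4/V4 signing algorithms; the secret and paths are made up:

```python
import hashlib, hmac, time
from urllib.parse import urlencode

SECRET = b"demo-secret"  # real clouds derive this from your key material

def sign_url(path, expires_in=300, now=None):
    """Illustrative signed URL: HMAC over path + expiry timestamp."""
    expires = int(now if now is not None else time.time()) + expires_in
    msg = f"{path}:{expires}".encode()
    sig = hmac.new(SECRET, msg, hashlib.sha256).hexdigest()
    return f"{path}?{urlencode({'expires': expires, 'sig': sig})}"

def verify(url, now=None):
    path, query = url.split("?", 1)
    params = dict(p.split("=") for p in query.split("&"))
    expires = int(params["expires"])
    if (now if now is not None else time.time()) > expires:
        return False  # link expired
    expected = hmac.new(SECRET, f"{path}:{expires}".encode(),
                        hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, params["sig"])

url = sign_url("/models/model.pkl", expires_in=300, now=1000)
print(verify(url, now=1100))  # → True (within window)
print(verify(url, now=2000))  # → False (expired)
```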


Here’s a detailed breakdown and comparison of compute services: EC2, GKE, and Lambda — often used in DevOps, MLOps, and scalable microservices environments.


๐Ÿงฎ 1. Amazon EC2 (Elastic Compute Cloud)

๐Ÿงพ What It Is:

  • Virtual machines (VMs) on demand.

  • Full control over the OS, networking, and storage.

  • Ideal for traditional apps, ML model training, hosting APIs, etc.

๐Ÿ”ง Typical Use Cases:

  • Host model servers (like FastAPI, Flask, or TorchServe)

  • Run batch jobs or cron scripts

  • Train ML models on GPU-enabled instances

  • Run Docker containers (via EC2 + ECS or self-managed)

✅ Pros:

  • Full flexibility (install anything)

  • Scalable (manual or via Auto Scaling Groups)

  • GPU support for ML workloads

⚠️ Cons:

  • Must manage patching, scaling, security

  • Costs accumulate with uptime: you pay while the instance runs, even when idle

๐Ÿ› ️ Infra-as-Code Example (Terraform):

resource "aws_instance" "ml_server" {
  ami           = "ami-xxxxxxxx"
  instance_type = "t2.medium"
  key_name      = "your-key"
}

๐Ÿงฑ 2. GKE (Google Kubernetes Engine)

๐Ÿงพ What It Is:

  • Fully managed Kubernetes (K8s) on Google Cloud.

  • Run containerized apps with built-in scaling, networking, and storage.

๐Ÿ”ง Typical Use Cases:

  • Run microservices (REST APIs, background jobs)

  • Deploy ML inference servers (TF Serving, Triton, custom Flask apps)

  • Deploy ML pipelines (Kubeflow, TFX, MLflow)

✅ Pros:

  • Autoscaling, self-healing pods

  • Native integration with GCP services (BigQuery, GCS, Vertex AI)

  • CI/CD with GitHub/GitLab + Cloud Build or ArgoCD

⚠️ Cons:

  • Requires Kubernetes knowledge

  • Slightly higher learning curve

๐Ÿš€ Useful Tools with GKE:

  • Kubeflow: ML pipelines

  • Argo Workflows: CI/CD or ML pipeline orchestration

  • Istio/Envoy: Service mesh, secure traffic


⚡ 3. AWS Lambda (Serverless)

๐Ÿงพ What It Is:

  • Run backend functions in response to events (e.g., S3 upload, HTTP requests, cron).

  • You pay only for compute time used (in milliseconds).

๐Ÿ”ง Typical Use Cases:

  • ML inference for light models

  • Trigger model retraining when new data arrives in S3

  • ETL/ELT tasks on demand

  • Webhook receivers, alert systems

✅ Pros:

  • Zero server management

  • Auto-scaling, highly cost-efficient

  • Works with other AWS services (S3, SNS, DynamoDB)

⚠️ Cons:

  • Limited runtime (max 15 min)

  • Cold start latency (~1s for some languages)

  • Not suitable for large ML models unless optimized

๐Ÿ” Example:

# handler.py
def lambda_handler(event, context):
    return {"message": "Hello from Lambda!"}

Deploy via:

  • AWS Console

  • AWS SAM / Serverless Framework

  • Terraform
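Because a handler is a plain function, it can be exercised locally before deployment. A slightly fuller sketch returning an API Gateway proxy-style response (the event here is a hand-built stand-in for a real API Gateway event):

```python
import json

def lambda_handler(event, context):
    """Return an API Gateway proxy-style response (statusCode + JSON body)."""
    name = (event.get("queryStringParameters") or {}).get("name", "Lambda")
    return {
        "statusCode": 200,
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps({"message": f"Hello from {name}!"}),
    }

# Invoke locally with a hand-built event — no AWS needed
resp = lambda_handler({"queryStringParameters": {"name": "MLOps"}}, None)
print(resp["statusCode"], json.loads(resp["body"]))  # → 200 {'message': 'Hello from MLOps!'}
```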


๐Ÿ” Summary Table

Feature EC2 GKE Lambda
Type VM Managed Kubernetes Serverless Functions
Ideal for ML training/inference Scalable microservices, ML pipelines Lightweight functions, events
Scaling Manual / Auto Scaling Horizontal Pod Autoscaler Automatic
OS Control Full Limited to container OS None
Cold Start No No Yes
Pricing Per hour/second Per node/hour Per request (ms-based)
Infra-as-Code Tools Terraform, CloudFormation Terraform, Helm SAM, Serverless, Terraform
Docker Support Manual via ECS or EKS Native Limited (via container Lambda)
GPU Support ✅ Yes ✅ (with node pools) ⚠️ Not natively supported

๐Ÿ’ก Best Practice Guidance

Scenario Recommended Compute
Model training with GPU EC2 or GKE (with GPU nodes)
Real-time API with low traffic Lambda or Cloud Functions
Batch data processing Lambda, EC2, or GKE Jobs
Large model inference EC2 or GKE
Scalable web app GKE
Orchestrating ML workflows GKE (Kubeflow, Argo)


Here’s a concise yet detailed comparison of the major AutoML platforms across AWS, GCP, and Azure: SageMaker Autopilot, Vertex AI, and Azure AutoML — all used for automating ML workflows including preprocessing, training, tuning, and deployment.


๐Ÿ” Overview Table: AutoML Comparison

Feature SageMaker Autopilot (AWS) Vertex AI AutoML (GCP) Azure AutoML
Language Support Python (via SDK, Boto3) Python (via SDK, REST) Python (AzureML SDK)
UI Available ✅ SageMaker Studio ✅ Vertex AI Console ✅ Azure Studio
Model Explainability ✅ SHAP built-in ✅ Integrated ✅ Built-in with visual UI
Custom Code Injection ✅ Custom containers ⚠️ Limited ✅ Supported via pipelines
Model Deployment ✅ One-click to endpoint ✅ Deploy to prediction service ✅ Deploy to AKS or endpoint
Model Type Coverage Classification, Regression Vision, Text, Tabular, Forecast Tabular, Time series, NLP
Integration with MLOps ✅ SageMaker Pipelines ✅ Vertex AI Pipelines ✅ Azure ML Pipelines
Pricing Pay-per-job + compute Pay-per-job + compute Pay-per-run + compute

๐Ÿง  SageMaker Autopilot (AWS)

✅ Highlights:

  • Input: CSVs or data in S3

  • Handles feature engineering, model tuning, evaluation

  • Gives Jupyter Notebooks of every step (transparency)

  • Easily integrated with SageMaker Pipelines + Endpoints

๐Ÿงช Example:

from sagemaker import AutoML
automl = AutoML(role=role,
                target_attribute_name="target",
                output_path="s3://my-bucket/output")
automl.fit(inputs="s3://my-bucket/input")

๐Ÿ” Vertex AI AutoML (GCP)

✅ Highlights:

  • Unified with BigQuery, Cloud Storage, Looker

  • Supports Tabular, Text, Vision, and Forecasting

  • Strong low-code/no-code workflow

  • Built-in model evaluation and deploy

๐Ÿงช Sample Flow:

  • Upload dataset via console or Python

  • Click “Train New Model”

  • Set target + training options

  • Deploy or export model

Code Example:

from google.cloud import aiplatform

aiplatform.init(project="my-project", location="us-central1")
job = aiplatform.AutoMLTabularTrainingJob(
    display_name="my-tabular-model",
    optimization_prediction_type="classification",
)
model = job.run(dataset=my_dataset, target_column="label")  # returns a Model

๐Ÿ”ฌ Azure ML AutoML

✅ Highlights:

  • Integrates with Azure Data Factory, Databricks

  • Offers rich UI + code-first SDK

  • Visual model explanations and fairness analysis

  • Deployment to AKS or managed endpoints

๐Ÿงช Code Example:

from azureml.train.automl import AutoMLConfig
from azureml.core.experiment import Experiment

automl_config = AutoMLConfig(task='classification',
                             primary_metric='AUC_weighted',
                             training_data=dataset,
                             label_column_name='target',
                             iterations=20)

experiment = Experiment(ws, "automl-exp")
run = experiment.submit(automl_config)

๐Ÿงฐ Use Case Recommendations

Use Case Recommended Platform
AWS-centric pipeline (S3, Athena) SageMaker Autopilot
GCP-first stack (BigQuery, GCS) Vertex AI
Enterprise + UI-driven Azure AutoML
High control over pipeline steps Azure/SageMaker
Forecasting/Time Series Vertex AI or Azure
Vision/NLP Vertex AI

๐Ÿš€ Advanced Add-ons

Feature SageMaker Autopilot Vertex AI AutoML Azure AutoML
SHAP Explanations ✅ Yes ✅ Yes ✅ Yes
Custom Pipelines ✅ via SageMaker Pipeline ✅ Vertex AI Pipelines ✅ via Azure Pipelines
Hyperparameter Tuning ✅ Bayesian search ✅ Auto-tuning ✅ Bayesian + Bandit
Auto-deploy Models ✅ Yes ✅ Yes ✅ Yes

