MLOps - II
6. Model Packaging & Reproducibility
Here’s a detailed explanation of how to define environments using requirements.txt and Conda environment files (environment.yml), both of which are crucial in MLOps for ensuring reproducibility, consistency, and portability of ML pipelines.
✅ 1. requirements.txt (For pip)
📌 Purpose:
Used to list Python packages and their versions to be installed via pip.
📌 Example: requirements.txt
numpy==1.24.3
pandas>=1.5.0,<2.0
scikit-learn
mlflow
dvc
tensorflow==2.13.0
matplotlib
💡 Usage:
# Create virtual environment
python -m venv venv
source venv/bin/activate # on Windows: venv\Scripts\activate
# Install dependencies
pip install -r requirements.txt
# Freeze to regenerate file
pip freeze > requirements.txt
✅ When to Use:
- You use pip and venv
- You want simplicity and lightweight environments
- CI/CD pipelines and Docker integrations
✅ 2. environment.yml (For conda)
📌 Purpose:
Defines a Conda environment, including:
- Python version
- Pip packages
- Conda-specific packages
- Channels
📌 Example: environment.yml
name: mlops-env
channels:
  - defaults
  - conda-forge
dependencies:
  - python=3.10
  - numpy
  - pandas
  - scikit-learn
  - matplotlib
  - pip
  - pip:
      - mlflow
      - dvc
      - wandb
💡 Usage:
# Create environment
conda env create -f environment.yml
# Activate environment
conda activate mlops-env
# Export current env
conda env export > environment.yml
✅ When to Use:
- You're using Anaconda/Miniconda
- You need non-Python dependencies (e.g., libgomp, libx11)
- Complex data science stacks (GPU support, etc.)
📌 Side-by-Side Comparison
| Feature | requirements.txt | environment.yml |
|---|---|---|
| Package Manager | pip | conda (+ pip) |
| Language | Python-only | Supports system libs too |
| Format | Plaintext | YAML |
| Virtual Env Tool | venv, virtualenv | conda |
| Portability | Very portable | More robust for data science |
💡 Best Practices
✅ Lock dependencies with versions
✅ Always version control these files (commit to Git)
✅ Use requirements-dev.txt for testing tools like pytest, flake8, etc.
✅ Regenerate files after updates using pip freeze or conda env export
🧪 Bonus: Hybrid conda + pip in ML Projects
Many ML tools (like MLflow, DVC, wandb) are only on PyPI, so you use both:
dependencies:
  - <conda packages>
  - pip:
      - <PyPI packages, e.g., wandb, mlflow>
🐳 What is Docker?
Docker is a platform that allows you to package your application and its dependencies into containers. This helps:
- Avoid "it works on my machine" problems
- Simplify deployment
- Standardize environments
✅ Why Use Docker for ML?
| Benefit | Description |
|---|---|
| Environment Reproducibility | Same code = same result everywhere |
| Easy Deployment | Deploy models as APIs or batch jobs in any cloud/server |
| Dependency Isolation | Avoid conflicts between different projects |
| Portability | Run containers anywhere: local, cloud, CI/CD |
| Scalability | Combine with Kubernetes or ECS for horizontal scaling |
🛠️ Key Docker Concepts
| Concept | Description |
|---|---|
| Dockerfile | Script to build a Docker image |
| Image | Snapshot of environment and code |
| Container | Running instance of an image |
| Volume | Persistent data storage |
| Ports | Used to expose services (e.g., API) |
📌 Sample Dockerfile for ML
# 1. Base Image
FROM python:3.10-slim
# 2. Set working directory
WORKDIR /app
# 3. Copy code and requirements
COPY . .
RUN pip install --upgrade pip
RUN pip install -r requirements.txt
# 4. Default command
CMD ["python", "train.py"]
🧪 Example: ML Project Structure
ml-project/
├── Dockerfile
├── requirements.txt
├── train.py
├── model.pkl
└── utils.py
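For completeness, a minimal train.py that fits this layout might look like the sketch below. The iris dataset and logistic regression are illustrative placeholders for your real training code:

```python
# train.py -- minimal sketch; dataset and model choice are illustrative
import pickle

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

# Load a small built-in dataset and fit a simple classifier
X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=200).fit(X, y)

# Persist the trained model next to the code, as model.pkl
with open("model.pkl", "wb") as f:
    pickle.dump(model, f)

print("model saved")
```

When this script is the Dockerfile's CMD, building and running the image produces model.pkl inside the container; mount a volume if you want to keep it.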
📌 Build and Run Docker
🔧 Build the Image
docker build -t ml-trainer .
▶️ Run the Container
docker run --rm ml-trainer
📌 Run with Port Binding (for APIs)
docker run -p 5000:5000 ml-api
📦 Use Case Examples
| Use Case | Description |
|---|---|
| Model Training | train.py inside Docker — GPU support if needed |
| Model Serving | Flask / FastAPI container exposing REST API |
| Data Pipelines | Combine Docker + Airflow for batch jobs |
| Jupyter Notebook | Run notebooks inside container with EXPOSE 8888 |
| CI/CD Integration | Run tests and training in pipelines |
🔧 GPU Support (Optional)
For TensorFlow or PyTorch GPU:
📌 Dockerfile (PyTorch + CUDA)
FROM pytorch/pytorch:2.1.0-cuda11.8-cudnn8-runtime
WORKDIR /app
COPY . .
RUN pip install -r requirements.txt
CMD ["python", "train.py"]
Then use:
docker run --gpus all ml-gpu-trainer
📌 Best Practices
- Use .dockerignore to skip unnecessary files (like .git, __pycache__)
- Use ENTRYPOINT if you want CLI-style container apps
- Keep image size small with slim or alpine base images
- Avoid running containers as root (USER app)
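The practices above can be combined into a single Dockerfile. The sketch below is illustrative: the `app` user name and the requirements-first layering are choices, not requirements.

```dockerfile
# Illustrative Dockerfile applying the practices above
FROM python:3.10-slim

WORKDIR /app

# Install dependencies first so this layer is cached across code changes
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy the rest of the code (respects .dockerignore)
COPY . .

# Run as a non-root user
RUN useradd --create-home app
USER app

CMD ["python", "train.py"]
```

Copying `requirements.txt` before the rest of the code means dependency installation is only re-run when the requirements change, which keeps rebuilds fast.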
📌 .dockerignore Example
__pycache__/
*.pyc
*.pkl
*.csv
.env
.git
🐳 What is a Dockerfile?
A Dockerfile is a script with step-by-step instructions to build a Docker image (your app + environment).
✅ Sample Dockerfile for a ML Project (Training + Inference)
# Use an official Python base image
FROM python:3.10-slim
# Set the working directory inside the container
WORKDIR /app
# Copy local code to the container
COPY . .
# Install dependencies
RUN pip install --upgrade pip
RUN pip install -r requirements.txt
# Run the training or inference script
CMD ["python", "train.py"]
📦 Typical File Structure
ml-project/
├── Dockerfile
├── docker-compose.yml
├── requirements.txt
├── train.py
├── serve.py (Flask/FastAPI)
└── data/
🧪 Example: requirements.txt
pandas
scikit-learn
flask
joblib
📌 Docker Commands
| Command | What it does |
|---|---|
| docker build -t ml-app . | Builds an image named ml-app |
| docker run ml-app | Runs a container from ml-app |
| docker run -v $(pwd)/data:/app/data ml-app | Mounts local data/ into the container |
| docker run -p 5000:5000 ml-app | Exposes the Flask/FastAPI server |
🧩 What is Docker Compose?
Docker Compose is a tool to run multi-container apps (like an ML API + a DB + message queue) using a single .yml file.
✅ Sample docker-compose.yml for ML App + Postgres + Jupyter
version: "3.8"
services:
  ml-api:
    build:
      context: .
      dockerfile: Dockerfile
    container_name: ml-api
    ports:
      - "5000:5000"
    volumes:
      - .:/app
    command: python serve.py

  jupyter:
    image: jupyter/scipy-notebook
    ports:
      - "8888:8888"
    volumes:
      - ./notebooks:/home/jovyan/work

  db:
    image: postgres:13
    environment:
      POSTGRES_USER: mluser
      POSTGRES_PASSWORD: mlpass
      POSTGRES_DB: mldb
    ports:
      - "5432:5432"
💡 Run Compose
docker-compose up --build
This starts:
- the ml-api service (Flask/FastAPI)
- the jupyter service
- the db service (PostgreSQL)
⚙️ Advanced Tips
| Tip | Description |
|---|---|
| .env file | Store secrets and config |
| depends_on: | Specify container startup order |
| restart: always | Auto-restart crashed containers |
| networks: | Custom network for services to talk to each other |
📌 Optional .env File
POSTGRES_USER=mluser
POSTGRES_PASSWORD=mlpass
POSTGRES_DB=mldb
And in docker-compose.yml:
env_file:
- .env
📌 Common Use Cases
| Use Case | Compose Needed? |
|---|---|
| Train a model once | ❌ Only Dockerfile |
| Serve API + DB | ✅ Yes |
| Jupyter + Training API | ✅ Yes |
| ML Pipeline in CI/CD | ✅ Yes |
Here’s a step-by-step guide to building and running ML containers using Docker. This will help you containerize your ML projects — for both training and inference.
✅ 1. Prepare Your ML Project
Assume your project has this structure:
ml-project/
├── Dockerfile
├── requirements.txt
├── train.py
├── serve.py # (Flask or FastAPI)
├── model.pkl
└── data/
🐳 2. Create Dockerfile
# Use an official Python image
FROM python:3.10-slim
# Set working directory
WORKDIR /app
# Copy files
COPY . .
# Install dependencies
RUN pip install --upgrade pip
RUN pip install -r requirements.txt
# Default command
CMD ["python", "train.py"]
📦 3. Create requirements.txt
numpy
pandas
scikit-learn
flask
joblib
🛠️ 4. Build the Docker Image
Open terminal inside the ml-project/ folder and run:
docker build -t ml-container .
- -t ml-container → tags the image with the name ml-container
- . → builds from the current directory
▶️ 5. Run the Container
A. Train model:
docker run --name train-ml ml-container
This runs train.py inside a container.
B. Serve model as API:
Modify your Dockerfile to:
CMD ["python", "serve.py"]
Then rebuild and run:
docker build -t ml-api .
docker run -p 5000:5000 ml-api
📌 Alternatively: Override CMD at runtime
docker run -p 5000:5000 ml-container python serve.py
📌 6. Mount a Local Volume (For Data or Models)
docker run -v $(pwd)/data:/app/data ml-container
- Mounts your data/ directory into the container
📌 7. Access Container Logs
docker logs train-ml
🧽 8. Clean Up
docker ps -a # List all containers
docker rm train-ml # Remove container
docker rmi ml-container # Remove image
📌 9. Testing the API Inside the Container (Optional)
If serve.py runs a Flask app on port 5000:
curl http://localhost:5000/predict -X POST -H "Content-Type: application/json" -d '{"features": [1, 2, 3]}'
🧪 Example: Simple serve.py for FastAPI
from fastapi import FastAPI
import joblib

# Load the trained model once at startup
model = joblib.load("model.pkl")
app = FastAPI()

@app.post("/predict")
def predict(features: list[float]):
    prediction = model.predict([features])
    return {"prediction": prediction.tolist()}
💡 BONUS: Run in Detached Mode
docker run -d -p 5000:5000 ml-api
- -d detaches the container so it runs in the background
- Use docker logs <container_id> to monitor output
7. Model Deployment
📌 What Are Deployment Strategies?
Deployment strategies define how you release a new model (or model version) into production without disrupting users, compromising performance, or introducing bugs.
These strategies often mirror DevOps deployment approaches but are adapted for ML-specific considerations like drift, accuracy, retraining, and latency.
✅ Common Deployment Strategies in MLOps
1. Recreate / Replace
- Takes down the old model, then deploys the new one.
- ⏳ Downtime expected.
Use When: Non-critical models or low traffic.
kubectl delete deployment old-model
kubectl apply -f new-model.yaml
2. Blue-Green Deployment
- 🟦 Blue = current production model
- 🟩 Green = new model deployed alongside
- 💡 Switch traffic to green after verification.
Pros:
- Instant rollback
- Zero downtime
Cons:
- Requires double resources
Use When: High uptime is critical.
3. Canary Deployment
- 🐦 Roll out to a small subset of users (e.g., 5%)
- Monitor performance (latency, accuracy, user feedback)
- Gradually increase traffic
Pros:
- Safer than a full rollout
- Real-time feedback
Use When: Testing model quality, performance, or monitoring risk of concept/data drift.
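The gradual rollout is usually driven by deterministic bucketing, so a given user always hits the same model across requests. A minimal sketch, where the SHA-256 bucketing and the 5% default are illustrative choices:

```python
import hashlib

def routes_to_canary(user_id: str, canary_percent: int = 5) -> bool:
    """Deterministically map a user into a bucket in [0, 100) and send
    the lowest `canary_percent` buckets to the new (canary) model."""
    digest = hashlib.sha256(user_id.encode()).hexdigest()
    bucket = int(digest, 16) % 100
    return bucket < canary_percent

# The same user always gets the same model, so sessions stay consistent
share = sum(routes_to_canary(f"user-{i}") for i in range(10_000)) / 10_000
print(f"canary share: {share:.1%}")  # close to the configured 5%
```

Increasing the rollout is then just raising `canary_percent`; no user who already saw the canary is moved back.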
4. A/B Testing (Shadow or Split Testing)
- 🅰️ Model A (existing)
- 🅱️ Model B (new)
- Route users randomly (e.g., 50/50) and compare outputs or outcomes
Pros:
- Compare performance metrics (conversion rate, accuracy, etc.)
- User-driven validation
Use When: Comparing two models for business impact.
5. Shadow Deployment
- 🕵️ Serve predictions silently in the background
- The new model runs on real-time data but does not affect users
- Log and compare its predictions vs the live model
Pros:
- No risk to users
- Evaluate on real-world data
Cons:
- Doubles the compute load
Use When: Auditing, regulatory review, or performance benchmarking.
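A shadow setup can be as simple as a wrapper that answers from the live model and only logs the challenger's output. A sketch with stub models standing in for real estimators:

```python
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("shadow")

def predict_with_shadow(features, live_model, shadow_model):
    """Return the live model's prediction to the caller, while logging the
    shadow model's prediction for later offline comparison."""
    live_pred = live_model(features)
    try:
        shadow_pred = shadow_model(features)  # must never affect the user
        log.info("live=%s shadow=%s agree=%s",
                 live_pred, shadow_pred, live_pred == shadow_pred)
    except Exception:
        log.exception("shadow model failed")  # swallow shadow errors
    return live_pred

# Stub models standing in for real estimators
result = predict_with_shadow([1.0, 2.0],
                             live_model=lambda x: "approve",
                             shadow_model=lambda x: "deny")
print(result)  # the user only ever sees the live prediction
```

Note the try/except around the shadow call: a crashing challenger must never take down the live path.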
6. Multi-Armed Bandit
- Like A/B testing, but uses adaptive traffic allocation
- Routes more traffic to the better-performing model dynamically
Use When: Want to maximize reward during testing phase (e.g., maximize clicks or accuracy).
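An epsilon-greedy router is the simplest bandit. In this sketch the success/trial counts and the 10% exploration rate are illustrative:

```python
import random

def choose_model(stats, epsilon=0.1, rng=random):
    """Epsilon-greedy routing: mostly exploit the best-performing model,
    occasionally explore the others. `stats` maps model name ->
    (successes, trials)."""
    if rng.random() < epsilon:
        return rng.choice(list(stats))  # explore
    # exploit: pick the model with the best observed success rate
    return max(stats, key=lambda m: stats[m][0] / max(stats[m][1], 1))

stats = {"model_a": (90, 100), "model_b": (80, 100)}
rng = random.Random(42)  # seeded for reproducibility
picks = [choose_model(stats, rng=rng) for _ in range(1000)]
print(picks.count("model_a"))  # the better model receives the bulk of traffic
```

In a real system the (successes, trials) counts would be updated online from observed rewards, which is what makes the allocation adaptive.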
7. Rolling Update
- Gradually update pods or services
- Often used with Kubernetes (Deployment strategy)
Use When: Containerized models using tools like Kubernetes or Docker Swarm
📦 Tools to Support Model Deployment
| Tool | Use Case |
|---|---|
| FastAPI / Flask | Serving ML models as REST APIs |
| MLflow / TorchServe | Model packaging and deployment |
| Docker + Kubernetes | Scalable containerized deployment |
| Seldon Core / KFServing | K8s-native ML model deployment |
| Triton Inference Server | Optimized inference for deep learning |
| TensorFlow Serving | Serving TensorFlow models in production |
💡 Best Practices
- ✅ Always monitor model performance post-deployment (latency, accuracy, drift)
- ✅ Use feature stores to ensure consistency between training and inference
- ✅ Automate deployment through CI/CD (GitHub Actions, Jenkins, etc.)
- ✅ Use rollback mechanisms (Blue-Green, Canary) in case of failure
✅ Batch Inference vs Real-Time Inference
| Feature | Batch Inference | Real-Time Inference |
|---|---|---|
| Definition | Predictions are made on bulk data at once | Predictions are made instantly per request |
| Latency | High (minutes to hours) | Low (milliseconds to seconds) |
| Trigger | Scheduled (e.g., daily, hourly) | On-demand (API call, event, UI trigger) |
| Use Cases | Monthly credit risk scoring; email spam tagging; customer churn scoring | Fraud detection; chatbots; product recommendations |
| Deployment Mode | Often offline / serverless batch jobs | Usually via REST API or streaming systems |
| Cost Efficiency | More efficient at scale | Expensive if traffic is high |
| Examples | Run via Airflow, Spark, DVC pipelines | Run via FastAPI, Flask, TensorFlow Serving |
📦 Example:
- Batch Inference: Every night at 2 AM, predict churn for 10 million customers and store the results in a database.
- Real-Time Inference: When a user logs in, instantly recommend 5 products based on their activity.
✅ Online Inference vs Offline Inference
These are broader categories related to when and how predictions are generated and delivered.
| Feature | Offline Inference | Online Inference |
|---|---|---|
| Definition | Predictions are precomputed & stored | Predictions are computed on-the-fly |
| When Used | Before the user needs it | When the user triggers it |
| Model Execution | Happens ahead of time | Happens per request |
| Data Source | Static snapshot | Real-time features (API, current session) |
| Storage | Predictions are stored in DB, files, etc. | Predictions returned directly to UI/API |
| Use Cases | Risk scoring; lead prioritization; email categorization | Self-driving car inputs; voice assistants; stock prediction dashboards |
📌 Relationship with Batch/Real-Time:
| | Offline | Online |
|---|---|---|
| Batch | ✅ Yes (classic use) | ❌ No |
| Real-Time | ❌ No | ✅ Yes (classic use) |
- Offline + Batch = Nightly scoring for marketing
- Online + Real-Time = Instant fraud detection on transactions
🧠 Summary Diagram
+-----------------------+-----------------------+
| Batch | Real-Time |
+-----------+-----------------------+-----------------------+
| Offline | ✔ Scheduled Scoring | ✘ Not Common |
| | ✔ Stored Predictions | |
+-----------+-----------------------+-----------------------+
| Online | ✘ Not Practical | ✔ Instant Prediction |
| | | ✔ API-based |
+-----------+-----------------------+-----------------------+
📌 How to Choose?
| Criteria | Prefer Batch/Offline | Prefer Real-Time/Online |
|---|---|---|
| Latency-critical? | ❌ No | ✅ Yes |
| Prediction volume | ✅ High (millions at once) | ❌ One at a time |
| Data freshness | ❌ Static features | ✅ Real-time data needed |
| Infrastructure | ✅ Cheaper and easier | ❌ Needs always-on API + low latency |
| Examples | Marketing, churn scoring | Chatbot, recommender, fraud alerts |
📌 What is Flask?
Flask is a lightweight, easy-to-use Python web framework for building APIs and web applications. In MLOps, it is widely used to serve ML models as REST APIs so they can be accessed in real-time by applications or other services.
✅ Key Features of Flask
- Simple and minimalistic
- Great for small to medium ML projects
- Easily integrates with machine learning libraries (scikit-learn, TensorFlow, PyTorch)
- Supports REST API routes
- Can be containerized (Docker) and deployed to the cloud
📌 Flask Workflow for ML Model Deployment
1. Train your ML model and save it using pickle, joblib, or a framework's native format (like model.h5 for Keras)
2. Create a Flask API app that loads the model and exposes it via an endpoint (e.g., /predict)
3. Test the API using tools like Postman or curl
4. Deploy it using Docker, AWS EC2, or other platforms
📦 Sample Flask App for Model Deployment
model.pkl — Trained ML model file (e.g., a scikit-learn model)
app.py
from flask import Flask, request, jsonify
import pickle
import numpy as np

# Load the trained model
model = pickle.load(open("model.pkl", "rb"))

# Initialize Flask app
app = Flask(__name__)

# Define home route
@app.route('/')
def home():
    return "Welcome to the ML Model API!"

# Define predict route
@app.route('/predict', methods=['POST'])
def predict():
    data = request.get_json(force=True)  # Get JSON payload
    features = np.array(data['features']).reshape(1, -1)
    prediction = model.predict(features)
    return jsonify({'prediction': prediction.tolist()})

# Run the Flask app
if __name__ == '__main__':
    app.run(debug=True)
🧪 Test Request Example
curl -X POST http://127.0.0.1:5000/predict \
-H "Content-Type: application/json" \
-d '{"features": [6.2, 3.4, 5.4, 2.3]}'
📌 Typical Flask Project Structure
ml-flask-app/
│
├── model.pkl
├── app.py
├── requirements.txt
└── Dockerfile (optional for containerization)
🧰 requirements.txt
Flask==2.3.2
numpy
scikit-learn
🐳 Optional: Dockerfile to Containerize
FROM python:3.9
WORKDIR /app
COPY requirements.txt .
RUN pip install -r requirements.txt
COPY . .
CMD ["python", "app.py"]
📌 What is FastAPI?
FastAPI is a high-performance web framework for building APIs with Python 3.7+ based on ASGI (Asynchronous Server Gateway Interface). It's designed for speed and includes automatic data validation, type hints, and auto-generated Swagger documentation.
✅ Why Use FastAPI for ML?
| Feature | FastAPI Advantage |
|---|---|
| ⚡ Speed | Faster than Flask |
| Data validation | Automatic (via Pydantic) |
| 🧪 Interactive Docs | Built-in Swagger & ReDoc |
| Async support | Native support for async I/O |
| 📦 Dependency Injection | Built-in and clean |
🧠 ML Model Deployment with FastAPI – End-to-End Example
1. model.pkl: Trained ML model (e.g., scikit-learn)
Save your model using:
import pickle
pickle.dump(model, open("model.pkl", "wb"))
2. main.py: FastAPI App Code
from fastapi import FastAPI
from pydantic import BaseModel
import numpy as np
import pickle

# Load model
model = pickle.load(open("model.pkl", "rb"))

# Initialize app
app = FastAPI(title="ML Model Inference API")

# Define input schema
class Features(BaseModel):
    features: list[float]

# Root endpoint
@app.get("/")
def read_root():
    return {"message": "Welcome to the FastAPI ML model server!"}

# Predict endpoint
@app.post("/predict")
def predict(data: Features):
    features = np.array(data.features).reshape(1, -1)
    prediction = model.predict(features)
    return {"prediction": prediction.tolist()}
3. requirements.txt
fastapi
uvicorn
numpy
scikit-learn
pydantic
4. 🧪 Run Locally
Install dependencies:
pip install -r requirements.txt
Run the API:
uvicorn main:app --reload
You can now visit:
- Swagger UI: http://127.0.0.1:8000/docs
- ReDoc: http://127.0.0.1:8000/redoc
5. 🧪 Sample Test (via curl or Postman)
curl -X POST "http://127.0.0.1:8000/predict" \
-H "Content-Type: application/json" \
-d "{\"features\": [5.1, 3.5, 1.4, 0.2]}"
🐳 Optional: Dockerfile to Containerize FastAPI App
FROM python:3.9
WORKDIR /app
COPY requirements.txt .
RUN pip install -r requirements.txt
COPY . .
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]
🔥 Bonus: Swagger UI Automatically Available
Thanks to FastAPI, you get Swagger UI without extra effort — excellent for testing, collaboration, or exposing to product teams.
✅ Flask vs FastAPI Quick Recap
| Feature | Flask | FastAPI |
|---|---|---|
| Speed | Slower | ⚡ Faster |
| Type Validation | Manual | ✅ Automatic |
| Async Support | ❌ No | ✅ Yes |
| Docs | ❌ Add-ons | ✅ Built-in |
| Production Ready | ✅ Yes | ✅ Yes |
✅ 1. What is Model Serving?
Model serving is the process of exposing a trained ML model via an API so other applications (like a web or mobile app) can send input data and receive predictions in real-time.
✅ 2. Flask vs FastAPI: Quick Comparison
| Feature | Flask | FastAPI |
|---|---|---|
| Performance | Slower (sync) | Faster (async, Starlette + Pydantic) |
| Type Hint Support | Optional | Fully supports and requires it |
| Validation | Manual or with Flask-RESTful | Built-in via Pydantic |
| Best For | Simpler, legacy projects | High-performance APIs, production ML APIs |
| Learning Curve | Easier for beginners | Slightly steeper but modern |
✅ 3. Serving ML Model Using Flask (Example)
🧪 File: model.pkl
Assume you’ve trained and saved a model using joblib or pickle.
📌 Flask API Code:
from flask import Flask, request, jsonify
import joblib

# Load the model
model = joblib.load("model.pkl")

# Create the app
app = Flask(__name__)

@app.route('/predict', methods=['POST'])
def predict():
    data = request.get_json()
    features = [data['feature1'], data['feature2']]
    prediction = model.predict([features])
    # .tolist() converts the NumPy result into a JSON-serializable value
    return jsonify({'prediction': prediction.tolist()[0]})

if __name__ == '__main__':
    app.run(debug=True)
➕ cURL Test:
curl -X POST http://localhost:5000/predict -H "Content-Type: application/json" \
-d '{"feature1": 1.5, "feature2": 3.2}'
✅ 4. Serving ML Model Using FastAPI (Example)
⚙️ Install FastAPI + Uvicorn:
pip install fastapi uvicorn joblib
📌 FastAPI Code:
from fastapi import FastAPI
from pydantic import BaseModel
import joblib

# Load the model
model = joblib.load("model.pkl")

# Define request schema
class InputData(BaseModel):
    feature1: float
    feature2: float

# Initialize app
app = FastAPI()

@app.post("/predict")
def predict(data: InputData):
    features = [[data.feature1, data.feature2]]
    prediction = model.predict(features)
    # .tolist() converts the NumPy result into a JSON-serializable value
    return {"prediction": prediction.tolist()[0]}
🔥 Run with Uvicorn:
uvicorn your_script_name:app --reload
🧪 Open Swagger UI:
Visit: http://localhost:8000/docs
✅ 5. Production Enhancements
| Area | Suggestions |
|---|---|
| Security | OAuth2 / JWT, rate limiting, API keys |
| Monitoring | Prometheus + Grafana, request logging |
| Error Handling | Graceful 4xx/5xx responses |
| Scaling | Use Docker, gunicorn (Flask), or uvicorn workers |
| CI/CD | Use GitHub Actions, Jenkins, or CodePipeline |
✅ Summary
| Category | Flask | FastAPI |
|---|---|---|
| Simplicity | Easier for starters | Modern, fast, robust |
| Speed | Slower (WSGI) | Faster (ASGI + async I/O) |
| Validation | Manual | Built-in with Pydantic |
| Use Cases | POCs, simple APIs | High-performance ML API, production use |
📌 What is TensorFlow Serving?
TensorFlow Serving is a flexible, high-performance model server for deploying machine learning models built with TensorFlow.
✅ It allows you to:
- Serve models over a gRPC or RESTful API
- Dynamically load new models (versioning support)
- Scale easily in production
🧠 Why TensorFlow Serving?
| Feature | Benefit |
|---|---|
| High performance | Optimized for inference speed |
| 📦 Model versioning | Serve multiple versions |
| Hot swapping | No-downtime model updates |
| 💻 Protocols | Supports REST & gRPC |
| ⚙️ TensorFlow native | Designed specifically for TF models |
🛠️ Core Concepts
1. Model Format
You need to export your TensorFlow model using the SavedModel format:
model.save("my_model/1")
Directory structure should look like:
my_model/
└── 1/
├── saved_model.pb
└── variables/
2. Model Versioning
- Each version is a numbered folder (1/, 2/, etc.)
- You can host multiple versions, and TensorFlow Serving will serve the latest by default.
📌 Serving a Model using Docker
Step 1: Export model in SavedModel format
import tensorflow as tf
model = tf.keras.models.load_model("my_model.h5")
tf.saved_model.save(model, "exported_model/1")
Step 2: Pull TensorFlow Serving Docker image
docker pull tensorflow/serving
Step 3: Run the container
docker run -p 8501:8501 \
--mount type=bind,source=$(pwd)/exported_model,target=/models/my_model \
-e MODEL_NAME=my_model \
-t tensorflow/serving
📌 This serves your model at:
http://localhost:8501/v1/models/my_model:predict
📌 REST API Example
Send a request using curl:
curl -X POST http://localhost:8501/v1/models/my_model:predict \
-H "Content-Type: application/json" \
-d '{"instances": [[5.1, 3.5, 1.4, 0.2]]}'
Response:
{
"predictions": [[0.1, 0.9]]
}
🧰 Optional: Use gRPC (Advanced)
You can also use gRPC instead of REST for lower latency. This requires protobuf stubs and a gRPC client.
🧪 Testing
Test endpoint:
curl http://localhost:8501/v1/models/my_model
Response:
{
"model_version_status": [...],
"model_name": "my_model"
}
📌 Docker Folder Structure Summary
project/
├── exported_model/
│ └── 1/
│ ├── saved_model.pb
│ └── variables/
└── Dockerfile (optional if customizing serving)
🏗️ Production Integration Ideas
- Use NGINX as a reverse proxy for a secure API gateway
- Deploy on Kubernetes (e.g., with the TF Serving Helm chart)
- Integrate with Prometheus/Grafana for monitoring
- Use Triton for multi-framework support (TF + PyTorch)
🔥 1. TorchServe
🧾 What is it?
TorchServe is the official model serving framework for PyTorch developed by AWS and Facebook.
📌 Key Features
| Feature | Description |
|---|---|
| ✅ Native PyTorch support | Built for PyTorch models specifically |
| 📦 Model archiver (.mar) | Packages model + code + config in a .mar file |
| REST & gRPC APIs | Serve models using standard APIs |
| Model versioning | Load/unload multiple versions |
| Metrics & logs | Prometheus integration, model logs |
| Custom handlers | Customize preprocessing/postprocessing logic |
⚙️ How it works
1. Archive your model using torch-model-archiver:
torch-model-archiver --model-name resnet18 \
  --version 1.0 \
  --model-file model.py \
  --serialized-file resnet18.pt \
  --handler image_classifier
2. Serve the model:
torchserve --start --model-store model_store --models resnet=resnet18.mar
3. Run inference via REST:
curl http://127.0.0.1:8080/predictions/resnet -T image.jpg
📌 Directory structure
project/
├── model_store/
│ └── resnet18.mar
├── model.py
└── config.properties
📌 Model Versioning
You can serve multiple .mar files with different versions and configure them in config.properties.
🥡 2. BentoML
🧾 What is it?
BentoML is a framework-agnostic platform to build, package, and deploy machine learning models as microservices.
⚡ Best when you want flexibility, a clean customizable API, automatic Docker builds, and YAML-based config.
📌 Key Features
| Feature | Description |
|---|---|
| Framework-agnostic | Supports PyTorch, TensorFlow, XGBoost, etc. |
| 🧰 CLI & Python SDK | Easy to package and serve |
| 🐳 Docker auto-pack | Generates a container with one command |
| 💪 BentoML "Services" | Write custom API logic using Python |
| 🧪 Local dev server | bentoml serve with hot reload |
| Model registry | Built-in local model management |
📦 Sample Service (for PyTorch)
# service.py (decorator-style BentoML service; exact API varies by BentoML version)
import torch
import bentoml
from bentoml.io import NumpyNdarray

model_ref = bentoml.pytorch.load_model("resnet50:latest")

@bentoml.service()
class ResNetService:
    @bentoml.api(input=NumpyNdarray(), output=NumpyNdarray())
    def predict(self, arr):
        tensor = torch.tensor(arr).unsqueeze(0)
        return model_ref(tensor).detach().numpy()
🔧 CLI Commands
bentoml serve service:ResNetService
bentoml build
bentoml containerize ResNetService:latest
bentoml deploy # for cloud platforms like AWS Lambda, K8s
🔥 TorchServe vs BentoML
| Feature | TorchServe | BentoML |
|---|---|---|
| Framework Support | Only PyTorch | Framework-agnostic |
| 🛠️ Custom API logic | Handlers | Native Python + FastAPI-style syntax |
| 🐳 Dockerization | Manual (or use a custom Dockerfile) | Auto Dockerfile generation |
| REST/gRPC Support | REST/gRPC | REST (gRPC planned) |
| 🧪 Local Dev Experience | Moderate | Excellent (hot reloads, CLI tools) |
| 🧱 Model Registry | .mar archives | Local Bento Store + Cloud Registry |
| 🧰 Monitoring & Metrics | Prometheus support | Prometheus + optional integrations |
| ☁️ Cloud deployment | Manual | AWS Lambda, K8s, SageMaker ready |
✅ When to Use What?
| Use Case | Tool |
|---|---|
| You need official, production-grade PyTorch model serving | TorchServe |
| You want to serve any ML model with full control & flexibility | BentoML |
| You want integrated Docker build and REST API in Python | BentoML |
| You prefer configuration over code with strict control | TorchServe |
📌 What is AWS Lambda?
AWS Lambda is a serverless compute service that lets you run code without provisioning or managing servers. You just upload your code, and Lambda takes care of everything required to run and scale it.
💡 Great for lightweight ML inference, automation, data preprocessing, and event-driven tasks.
⚙️ Core Concepts
| Concept | Description |
|---|---|
| Function | The unit of deployment in Lambda (your code + configuration) |
| Event | Triggers that invoke Lambda (e.g., API Gateway, S3, SNS) |
| Handler | Entry point of the Lambda function |
| Runtime | Language-specific execution environment (Python, Node.js, etc.) |
| Timeout | Max execution time (default 3 sec, max 15 mins) |
| Memory | 128 MB to 10 GB, affects CPU power & pricing |
🔬 Use Cases in MLOps
| Use Case | Example |
|---|---|
| Model Inference (small models) | Serve XGBoost/LightGBM/TinyML |
| 🧹 Data Cleaning Pipelines | Preprocess uploaded data from S3 |
| Event-Driven Triggers | Trigger retraining when new data is added |
| Batch Prediction | Predict on small batches via API |
| 📬 Slack/Email Alerts | Auto-alert on pipeline failures |
| Glue/Athena Orchestration | Trigger downstream processes |
🧪 Sample Python Lambda Function (ML Inference)
# lambda_function.py
import json
import joblib
import numpy as np

# Load your model once per container (ensure it's small, or load from S3)
model = joblib.load("/opt/model.pkl")

def lambda_handler(event, context):
    data = json.loads(event['body'])
    features = np.array(data['features']).reshape(1, -1)
    prediction = model.predict(features)
    return {
        'statusCode': 200,
        'body': json.dumps({'prediction': prediction.tolist()})
    }
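Before zipping anything, the handler can be exercised locally by faking the API Gateway event it will receive. A self-contained sketch, with a stub model standing in for the real model.pkl:

```python
import json

# Stub standing in for the joblib-loaded model in lambda_function.py
class StubModel:
    def predict(self, features):
        return [sum(features[0])]

model = StubModel()

def lambda_handler(event, context):
    data = json.loads(event["body"])
    features = [data["features"]]
    prediction = model.predict(features)
    return {"statusCode": 200,
            "body": json.dumps({"prediction": list(prediction)})}

# Simulate an API Gateway proxy event locally
event = {"body": json.dumps({"features": [1.0, 2.0, 3.0]})}
response = lambda_handler(event, context=None)
print(response["statusCode"], response["body"])
```

Testing the handler this way catches JSON-shape mistakes before you pay for a deploy cycle.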
📦 Packaging ML Models for Lambda
1. Package code + dependencies + model (must be < 250 MB unzipped):
project/
├── lambda_function.py
├── model.pkl
├── requirements.txt
2. Zip everything:
pip install -r requirements.txt -t ./package
cp lambda_function.py model.pkl ./package
cd package && zip -r ../lambda.zip .
3. Upload lambda.zip via the AWS Lambda console or the AWS CLI.
📌 Triggering Lambda
| Trigger Type | Use |
|---|---|
| API Gateway | Create a REST endpoint for inference |
| S3 Upload | Trigger when a new file is added |
| CloudWatch | Run on a schedule (like a cron job) |
| SNS Topic | Alert on specific events |
| Step Functions | As part of an ML pipeline |
🧠 AWS Lambda Limitations
| Limitation | Value |
|---|---|
| Max runtime | 15 mins |
| Max memory | 10 GB |
| Max package size | 250 MB (unzipped) |
| No GPU support | ❌ |
Use AWS SageMaker or ECS for large models/GPU inference.
📌 Permissions (IAM Role)
Make sure the Lambda has a proper execution role with:
- S3 read/write (if using S3)
- CloudWatch Logs
- (Optional) Secrets Manager or Parameter Store access
🧠 Best Practices
| Tip | Description |
|---|---|
| ✅ Use the /opt layer for models | Separate large model files into Lambda Layers |
| 🐳 Local test with Docker | Use the AWS SAM CLI |
| 🧩 Combine with API Gateway | For a RESTful inference API |
| Optimize package size | Use scikit-learn, joblib, lightgbm, etc. with care |
| Use async invocation if needed | For faster parallel executions |
📌 Alternatives for Larger Models
| Tool | Use When |
|---|---|
| AWS SageMaker Endpoint | Large model, GPU needed |
| ECS with Docker | Custom ML containers |
| EKS (Kubernetes) | Complex ML infra with scaling |
| BentoML + Lambda | BentoML can export Lambdas directly |
📌 What is Google Cloud Functions?
Google Cloud Functions is a serverless execution environment on Google Cloud Platform (GCP). You deploy code snippets that automatically run in response to events like HTTP requests, Pub/Sub messages, or Cloud Storage triggers.
⚡ It’s similar to AWS Lambda — perfect for lightweight ML inference, real-time data handling, or orchestrating workflows.
🧠 When to Use GCF in MLOps?
| Use Case | Example |
|---|---|
| ✅ Lightweight model inference | Deploy Scikit-learn or XGBoost models |
| ✅ Event-driven automation | Retrain when new data arrives in GCS |
| ✅ Preprocessing pipelines | Clean/validate data on upload |
| ✅ Alerts & notifications | Trigger Slack/email alerts on failure |
| ✅ API interface for ML models | REST endpoint to serve predictions |
Supported Runtimes
| Language | Status |
|---|---|
| Python | ✅ (3.7–3.11) |
| Node.js | ✅ |
| Go, Java | ✅ |
| .NET, Ruby | ✅ |
Components of a Cloud Function
| Component | Description |
|---|---|
| Trigger | What invokes the function (HTTP, Cloud Pub/Sub, GCS, etc.) |
| Entry Point | Your main function/method |
| Dependencies | Listed in requirements.txt |
| Memory & Timeout | Configurable up to 16 GB & 60 min |
MLOps Use Case Example: Inference API with Scikit-learn
main.py
import functions_framework
import joblib
import numpy as np
from flask import jsonify

# Load the model once at cold start, not on every request
model = joblib.load("model.pkl")

@functions_framework.http
def predict(request):
    try:
        data = request.get_json()
        features = np.array(data['features']).reshape(1, -1)
        prediction = model.predict(features)
        return jsonify({'prediction': prediction.tolist()})
    except Exception as e:
        return jsonify({'error': str(e)}), 500
requirements.txt
flask
joblib
numpy
scikit-learn
Deploying the Function
- Authenticate and set the project:
gcloud auth login
gcloud config set project <your-gcp-project-id>
- Deploy:
gcloud functions deploy predict \
  --runtime python311 \
  --trigger-http \
  --allow-unauthenticated \
  --memory=1024MB \
  --entry-point=predict
- Test via curl or Postman:
curl -X POST <your-function-url> \
  -H "Content-Type: application/json" \
  -d '{"features": [5.1, 3.5, 1.4, 0.2]}'
Directory Structure
project/
├── main.py
├── requirements.txt
└── model.pkl
IAM & Permissions
- For public HTTP functions, use --allow-unauthenticated
- For private triggers, manage IAM with roles like:
  - roles/cloudfunctions.invoker
  - roles/pubsub.subscriber
  - roles/storage.objectViewer
⚙️ Triggers Supported
| Trigger Type | Use Case |
|---|---|
| HTTP | ML inference APIs |
| Cloud Storage | Run when new data is uploaded |
| Pub/Sub | Trigger model retraining |
| Firestore/BigQuery | Event-based ETL |
| Cloud Scheduler | Scheduled jobs (e.g., batch prediction) |
⚡ Limitations
| Attribute | Limit |
|---|---|
| Max timeout | 60 mins |
| Max memory | 16 GB |
| Max deployment package | 500 MB (uncompressed) |
| GPU support | ❌ Not available |
| Cold start | ⏱️ ~1s typical |
For heavy ML workloads, consider Cloud Run or Vertex AI.
✅ Best Practices
| Practice | Tip |
|---|---|
| Use lightweight models | scikit-learn, XGBoost (small) |
| Separate logic from I/O | Helps in testing & scaling |
| Logging | Use print() or logging for Cloud Logging |
| Test locally | With Functions Framework |
| Model versioning | Use GCS to store & load models dynamically |
Cloud Functions vs Cloud Run vs Vertex AI
| Feature | Cloud Functions | Cloud Run | Vertex AI |
|---|---|---|---|
| Serverless | ✅ | ✅ | ✅ |
| ML Inference | ✅ (small) | ✅ (larger) | ✅ |
| GPU Support | ❌ | ❌ | ✅ |
| Cold Starts | Yes | Less frequent | Depends |
| Custom Docker | ❌ | ✅ | ✅ |
Tips for MLOps Workflows
- Store models in Google Cloud Storage
- Use Pub/Sub to automate retraining on new data
- Integrate with Vertex AI Pipelines for orchestration
- Use Secret Manager for secure API keys or credentials
- Monitor with Cloud Logging & Cloud Monitoring
What is Kubernetes?
Kubernetes (K8s) is an open-source platform for automating deployment, scaling, and management of containerized applications.
In MLOps, it's widely used for model serving, training pipelines, autoscaling workloads, and managing distributed ML systems.
Why Use Kubernetes in MLOps?
| Feature | Benefit |
|---|---|
| ✅ Scalability | Scale model inference under load |
| ✅ Portability | Works across cloud/on-prem |
| ✅ Isolation | Manage separate environments (prod/dev/staging) |
| ✅ Reproducibility | Define infra as code (YAML) |
| ✅ Rollback | Revert broken versions easily |
| ✅ Scheduling | Schedule batch jobs, training runs |
| ✅ GPU Support | Yes, with node pools |
What is Helm?
Helm is a package manager for Kubernetes, like pip for Python or apt for Ubuntu.
It lets you:
- Define Kubernetes manifests as templates
- Version and package deployments as charts
- Manage configuration with values.yaml
⚙️ Kubernetes + Helm Workflow for MLOps
Step-by-Step Breakdown:
1. Containerize Your Model
Create a Dockerfile for your model/app:
FROM python:3.11-slim
COPY . /app
WORKDIR /app
RUN pip install -r requirements.txt
CMD ["python", "serve.py"]
2. ☸️ Create Kubernetes Manifests
Basic structure:
- deployment.yaml – app definition
- service.yaml – expose internally or via LoadBalancer
- ingress.yaml – optional (for HTTP routing)
Example: deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ml-model
spec:
  replicas: 2
  selector:
    matchLabels:
      app: ml-model
  template:
    metadata:
      labels:
        app: ml-model
    spec:
      containers:
        - name: ml-container
          image: your-dockerhub/ml-model:latest
          ports:
            - containerPort: 5000
          resources:
            requests:
              cpu: "500m"
              memory: "512Mi"
3. Helm Chart Structure
helm create ml-model
Generated structure:
ml-model/
├── Chart.yaml
├── values.yaml
├── templates/
│ ├── deployment.yaml
│ ├── service.yaml
│ └── ingress.yaml
You can now template your YAML using values.yaml:
templates/deployment.yaml (Helm-style):
spec:
  replicas: {{ .Values.replicaCount }}
  containers:
    - image: "{{ .Values.image.repository }}:{{ .Values.image.tag }}"
values.yaml
replicaCount: 2
image:
  repository: your-dockerhub/ml-model
  tag: latest
4. Deploy Using Helm
# Install Helm Chart
helm install ml-model ./ml-model
# Upgrade version
helm upgrade ml-model ./ml-model
# Uninstall
helm uninstall ml-model
Expose the App
- Use ClusterIP for internal services (e.g., other ML microservices)
- Use LoadBalancer or Ingress for external REST APIs
service.yaml:
apiVersion: v1
kind: Service
metadata:
  name: ml-model-service
spec:
  type: LoadBalancer
  ports:
    - port: 80
      targetPort: 5000
  selector:
    app: ml-model
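The optional ingress.yaml from the manifest list is not shown; a minimal sketch for HTTP routing could look like this (the host and ingress class are assumptions, and an ingress controller such as NGINX must be installed in the cluster):

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: ml-model-ingress
spec:
  ingressClassName: nginx
  rules:
    - host: ml.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: ml-model-service
                port:
                  number: 80
```

With an Ingress in front, the Service can stay type ClusterIP instead of LoadBalancer, since external traffic enters through the ingress controller.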
MLOps-Specific Add-ons
| Tool | Use |
|---|---|
| Kubeflow | End-to-end ML pipelines |
| Prometheus + Grafana | Metrics, monitoring |
| ELK Stack | Logs |
| KEDA | Event-based autoscaling (Pub/Sub, Kafka) |
| Istio | Secure traffic control between ML services |
Example: Real Use Case
“Serve a FastAPI model using Kubernetes & Helm”
- Create a FastAPI prediction app
- Build the image and push it to Docker Hub/GCR
- Write a Helm chart
- Deploy on GKE or Minikube:
helm install iris-predictor ./iris-chart
✅ Best Practices
| Practice | Reason |
|---|---|
| Use Helm for reusable templates | Easy rollout of multiple models |
| Use ConfigMaps/Secrets for credentials | Avoid hardcoding |
| Use resource limits | Prevent cluster overload |
| Use liveness/readiness probes | Auto-heal unhealthy containers |
| Version charts and rollback | Safe deployments |
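For the liveness/readiness-probe practice, a container spec fragment might look like this (the /health path assumes your serving app exposes such an endpoint):

```yaml
containers:
  - name: ml-container
    image: your-dockerhub/ml-model:latest
    livenessProbe:          # restart the container if this starts failing
      httpGet:
        path: /health
        port: 5000
      initialDelaySeconds: 10
      periodSeconds: 15
    readinessProbe:         # stop routing traffic until this succeeds
      httpGet:
        path: /health
        port: 5000
      initialDelaySeconds: 5
      periodSeconds: 10
```

A generous initialDelaySeconds matters for ML containers, since loading a large model at startup can otherwise cause the probe to kill the pod before it is ready.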
What is YAML?
YAML stands for "YAML Ain’t Markup Language" (recursive acronym).
It’s a human-readable data serialization format, often used for configuration files.
Where is YAML used?
| Tool / Framework | YAML Use |
|---|---|
| Kubernetes | Deployment & Service specs (.yaml) |
| Docker Compose | docker-compose.yaml to define multi-container apps |
| GitHub Actions | .github/workflows/*.yml CI/CD pipelines |
| MLflow, Airflow | Task configs, pipelines |
| Ansible, Helm | Infrastructure as Code |
| Kubeflow | ML pipeline definitions |
| PyTorch Lightning | Training configs |
| Streamlit / Gradio | App settings |
Example: Kubernetes Deployment YAML
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ml-app
spec:
  replicas: 2
  selector:
    matchLabels:
      app: ml-app
  template:
    metadata:
      labels:
        app: ml-app
    spec:
      containers:
        - name: ml-container
          image: yourrepo/ml-model:latest
          ports:
            - containerPort: 5000
✅ YAML Features
- Indentation defines structure (use spaces, not tabs)
- Supports scalars, lists (- item), key-value pairs (key: value), and nested objects
- Simple syntax (indentation-based, like Python)
- Comments with #
- Easily converted to/from JSON
- Commonly used for configuration files, Kubernetes manifests, and CI/CD pipelines (e.g., GitHub Actions, GitLab CI)
YAML Syntax Basics:
1. Key-Value Pairs:
name: Sanjay
age: 25
2. Lists:
skills:
  - Python
  - Docker
  - Kubernetes
3. Nested Objects:
database:
  host: localhost
  port: 5432
4. List of Objects:
users:
  - name: Alice
    role: admin
  - name: Bob
    role: user
5. Anchors & Aliases (reuse blocks):
defaults: &default_settings
  retries: 3
  timeout: 5

api_config:
  <<: *default_settings
  base_url: http://api.example.com
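The anchor/alias example above can be checked programmatically. A short sketch using PyYAML (assumed installed; it is not part of the standard library, and its safe_load supports the << merge key) also demonstrates the easy conversion to JSON:

```python
import json
import yaml  # PyYAML -- an assumed dependency, install with `pip install pyyaml`

doc = """
defaults: &default_settings
  retries: 3
  timeout: 5
api_config:
  <<: *default_settings
  base_url: http://api.example.com
"""

data = yaml.safe_load(doc)  # YAML -> plain Python dicts and lists
print(data["api_config"]["retries"])     # merged in from the anchor: 3
print(json.dumps(data, sort_keys=True))  # YAML maps cleanly onto JSON
```

Note that the merge happens at load time: the parsed api_config contains retries and timeout alongside its own base_url key.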
YARN: Yet Another Resource Negotiator
Not related to YAML, though the names sound similar.
✅ What is YARN?
YARN (Yet Another Resource Negotiator) is a core component of Apache Hadoop for cluster resource management.
Purpose:
YARN acts as the Resource Manager in Hadoop’s ecosystem. It:
- Allocates CPU and memory to jobs
- Manages job scheduling & monitoring
- Enables running multiple distributed applications (like Spark, MapReduce, Hive) in a single Hadoop cluster
Components of YARN:
| Component | Description |
|---|---|
| ResourceManager (RM) | Central master that allocates resources |
| NodeManager (NM) | Agent on each worker node to monitor resources |
| ApplicationMaster (AM) | Job-specific manager to request resources from RM |
| Container | Actual compute unit (CPU, memory) running the job |
YARN vs Kubernetes
| Feature | YARN | Kubernetes |
|---|---|---|
| Origin | Big Data (Hadoop Ecosystem) | Cloud-native (Containers) |
| Workload Type | Batch jobs (MapReduce, Spark) | Microservices, ML, API services |
| Resource Type | CPU & memory (fixed JVMs) | Pod-based (containers) |
| Scaling | Manual or with autoscalers | Native autoscaling |
When to Learn YARN in MLOps?
- Only if you're working with Hadoop or Spark clusters
- Most modern MLOps pipelines use Kubernetes or cloud-native platforms instead
✅ Summary:
| Term | Description |
|---|---|
| YAML | Human-friendly config language (used in K8s, CI/CD, Docker Compose, etc.) |
| YARN | Resource manager in Hadoop ecosystem (used in Spark, MapReduce jobs) |