MLOps Interview Questions - Part 1
MLOps Lifecycle: Core Concepts & Principles (Q1-Q20) ⚙️
Q: What is MLOps?
A: MLOps is a set of practices that combines Machine Learning, DevOps, and Data Engineering to reliably and efficiently deploy and maintain ML systems in production.
Q: What is the main goal of MLOps?
A: The main goal is to bridge the gap between model development (training) and deployment in a scalable and repeatable way.
Q: Name the key stages of a typical MLOps lifecycle.
A: Data Collection & Preparation, Model Development, Experiment Tracking, Model Training, Model Versioning, Model Deployment, Monitoring & Governance.
Q: How does MLOps differ from traditional DevOps?
A: MLOps includes additional complexities like data versioning, model retraining, and monitoring for data and concept drift, which are not present in traditional software deployment.
Q: Why is data versioning crucial in MLOps?
A: Data versioning ensures that the model can be reproduced using the exact dataset it was trained on, which is essential for debugging and auditing.
Q: What is "data drift"?
A: Data drift occurs when the statistical properties of the production input data change over time in unexpected ways, causing the model's performance to degrade.
Q: What is "concept drift"?
A: Concept drift occurs when the relationship between the input data and the target variable changes over time, so a model trained on the old relationship makes increasingly poor predictions.
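The drift idea above can be made concrete with a toy check: compare a production sample against a reference (training) sample and flag a shift. This is a minimal sketch in plain Python, with an arbitrary threshold; production systems typically use statistical tests such as Kolmogorov-Smirnov or the Population Stability Index instead.

```python
from statistics import mean, stdev

def mean_shift_alert(reference, production, threshold=2.0):
    """Flag data drift if the production mean moves more than
    `threshold` reference standard deviations away from the
    reference mean. A deliberately simple stand-in for KS/PSI tests."""
    ref_mean, ref_std = mean(reference), stdev(reference)
    shift = abs(mean(production) - ref_mean) / ref_std
    return shift > threshold

reference = [10.1, 9.8, 10.3, 10.0, 9.9, 10.2]  # training-time feature values
stable    = [10.0, 10.1, 9.9]                    # production, no drift
drifted   = [14.5, 15.0, 14.8]                   # production, drifted

print(mean_shift_alert(reference, stable))   # → False
print(mean_shift_alert(reference, drifted))  # → True
```

In practice this check would run per feature on a schedule, with alerts feeding the retraining pipeline described below.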
Q: What is the purpose of a Feature Store?
A: A Feature Store is a centralized repository for managing and serving machine learning features. It ensures consistency between training and serving data.
Q: Why is automated retraining a key MLOps practice?
A: Automated retraining is used to keep the model's performance from degrading due to data or concept drift without manual intervention.
Q: What is the "reproducibility" problem in ML?
A: The reproducibility problem is the difficulty of obtaining the exact same results when re-running the same code, data, and configuration at a later time.
Q: How does MLOps help with reproducibility?
A: By using version control for code, data, and models, and by tracking every experiment's parameters and metrics.
Q: What is the role of a data scientist in an MLOps team?
A: The data scientist focuses on model development, algorithm selection, and feature engineering.
Q: What is the role of an ML Engineer in an MLOps team?
A: The ML Engineer focuses on building the production-ready ML pipelines, deploying models, and building the necessary infrastructure.
Q: What is the "reproducibility crisis" and how does MLOps address it?
A: The reproducibility crisis refers to the difficulty of reproducing scientific results. MLOps addresses it by enforcing strict versioning and tracking of all components.
Q: What is CI/CD/CT in MLOps?
A: CI (Continuous Integration) integrates code changes. CD (Continuous Delivery/Deployment) automates model deployment. CT (Continuous Training) automates model retraining.
Q: What are the main challenges in MLOps?
A: Challenges include managing diverse dependencies, versioning large datasets, monitoring models in production, and ensuring low latency.
Q: How do you handle model governance in MLOps?
A: Model governance involves tracking model lineage, managing approvals, and maintaining an audit trail for regulatory compliance.
Q: Why is containerization (e.g., Docker) important in MLOps?
A: Containerization packages the model and its dependencies into a single unit, ensuring it runs consistently across different environments.
Q: What is "Model Monitoring"?
A: Model Monitoring is the process of tracking a model's performance in production, including its predictions, latency, and resource usage.
Q: What is an "offline" vs. "online" serving environment for ML models?
A: Offline serving is for batch predictions (e.g., daily recommendations). Online serving is for real-time, low-latency predictions (e.g., fraud detection).
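The offline/online distinction can be sketched with a stand-in scoring function; the function names and the weighted-sum "model" here are purely illustrative, not from any serving framework.

```python
def predict(features):
    # Stand-in model: a weighted sum of two features (illustrative only).
    return 0.7 * features["amount"] + 0.3 * features["risk"]

def batch_serve(records):
    """Offline serving: score a whole dataset at once, e.g. a nightly job."""
    return [predict(r) for r in records]

def online_serve(request):
    """Online serving: score one request at low latency, e.g. behind an API."""
    return {"prediction": predict(request)}

nightly_scores = batch_serve([{"amount": 10, "risk": 1},
                              {"amount": 2, "risk": 5}])
live_response = online_serve({"amount": 10, "risk": 1})
```

The code paths look similar, but the operational constraints differ: batch jobs optimize for throughput and cost, while the online path must meet strict latency budgets per request.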
Version Control: DVC, MLflow, and Model Registries (Q21-Q40) 🔄
Q: What is the purpose of a version control system in MLOps?
A: To track and manage changes to code, data, and models, enabling collaboration and reproducibility.
Q: How does Git fit into an MLOps workflow?
A: Git is used to version control the code, scripts, and configuration files of an ML project.
Q: What is the main problem with using Git for data and models?
A: Git is designed for small text files and is inefficient at handling large binary files like datasets and trained models.
Q: What is DVC (Data Version Control)?
A: DVC is an open-source tool built on top of Git that versions large files like data and models without storing them in the Git repository.
Q: How does DVC work with Git?
A: DVC stores pointers to the data files in Git, while the actual data is stored remotely in a cloud storage or shared server.
Q: What command would you use to add a dataset to DVC?
A:
dvc add data/raw_data.csv
Q: What is the purpose of the .dvc file?
A: The .dvc file is a small text file that contains metadata (e.g., file hash, size) for the versioned data. This is what gets committed to Git.
Q: How do you pull a DVC-versioned dataset from a remote storage?
A:
dvc pull
Q: What is MLflow?
A: MLflow is an open-source platform for managing the end-to-end machine learning lifecycle, including experiment tracking, project packaging, and model management.
Q: What are the four main components of MLflow?
A: MLflow Tracking, MLflow Projects, MLflow Models, and MLflow Model Registry.
Q: What is MLflow Tracking used for?
A: MLflow Tracking is used to log parameters, metrics, artifacts (e.g., models), and source code for each experiment run.
Q: What is a "run" in MLflow?
A: A run is a single execution of machine learning code. It's the primary unit of organization in MLflow Tracking.
Q: How do you log a metric (e.g., accuracy) in MLflow?
A:
mlflow.log_metric("accuracy", 0.95)
Q: What is an MLflow Model Registry?
A: An MLflow Model Registry is a centralized repository to collaboratively manage the lifecycle of an MLflow Model, including versioning and stage transitions (e.g., from Staging to Production).
Q: How would you register a model using MLflow?
A:
mlflow.register_model("runs:/<run_id>/<artifact_path>", "ModelName")
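The registry concepts (auto-incrementing versions plus stage transitions) can be illustrated with a toy in-memory sketch. This is a conceptual illustration only, not MLflow's actual implementation; the class and method names are made up.

```python
class ToyModelRegistry:
    """Minimal in-memory registry: auto-incrementing versions per model
    name, each carrying a stage label (None, Staging, Production, Archived)."""

    def __init__(self):
        self._models = {}  # model name -> list of version records

    def register(self, name, artifact_uri):
        versions = self._models.setdefault(name, [])
        versions.append({"version": len(versions) + 1,
                         "uri": artifact_uri,
                         "stage": "None"})
        return versions[-1]["version"]

    def transition(self, name, version, stage):
        self._models[name][version - 1]["stage"] = stage

    def latest(self, name, stage):
        """Return the newest version in the given stage, or None."""
        for v in reversed(self._models[name]):
            if v["stage"] == stage:
                return v
        return None

registry = ToyModelRegistry()
registry.register("churn-model", "runs:/abc123/model")       # version 1
v2 = registry.register("churn-model", "runs:/def456/model")  # version 2
registry.transition("churn-model", v2, "Production")
prod = registry.latest("churn-model", "Production")
```

A real registry adds persistence, access control, and audit logging on top of this core versioning-and-staging idea.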
Q: What is the benefit of using a Model Registry?
A: It provides a clear, central place to manage all model versions, allowing teams to collaborate and deploy models confidently.
Q: How does DVC differ from MLflow?
A: DVC is a data and model versioning tool that works with Git. MLflow is a full-lifecycle management platform, with tracking, model registry, and project packaging capabilities.
Q: Can DVC and MLflow be used together?
A: Yes, they are complementary. You can use DVC to version your data and models, and then use MLflow to track the experiments and manage the model's lifecycle in a registry.
Q: What is the main purpose of a Model Registry?
A: To serve as a single source of truth for all deployed models, tracking their versions, metadata, and status.
Q: What is "artifact" in the context of MLflow?
A: An artifact is any file generated by a run that you want to save, such as the trained model, a plot, or a text file.
Experiment Tracking: Importance & Tools (Q41-Q60) 📊
Q: What is Experiment Tracking?
A: Experiment Tracking is the practice of systematically logging and organizing all the details of your ML experiments, including code, data, hyperparameters, and results.
Q: Why is experiment tracking so important?
A: It helps you understand which models perform best, ensures reproducibility, and provides a clear audit trail of your development process.
Q: Name three popular experiment tracking tools.
A: MLflow, Weights & Biases (W&B), and Comet ML.
Q: What information should you track for each experiment?
A: Hyperparameters, evaluation metrics (accuracy, loss), the version of the data, the code version, and any other relevant artifacts.
Q: How does an experiment tracking tool help with collaboration?
A: It provides a centralized dashboard where team members can view, compare, and reproduce each other's experiments.
Q: How can you compare different experiments using an experiment tracking tool?
A: Most tools provide a dashboard that allows you to plot metrics from different runs on a single chart and sort them by performance.
Q: What is a "run ID" in an experiment tracking system?
A: A run ID is a unique identifier assigned to each run. It's used to retrieve and inspect the details of that specific run.
Q: What is the benefit of logging code version (e.g., Git commit hash) in an experiment?
A: It ensures that you can always go back to the exact code that was used to train a specific model, guaranteeing reproducibility.
Q: What is the difference between experiment tracking and model monitoring?
A: Experiment tracking is for the development phase to track model performance. Model monitoring is for the production phase to track model performance and health after deployment.
Q: How does logging an artifact (like a trained model) in MLflow differ from using Git LFS or DVC?
A: Logging an artifact in MLflow links the model file to a specific run, while Git LFS/DVC simply versions the file regardless of the experiment that produced it.
Q: What is a "dashboard" in the context of experiment tracking?
A: A dashboard is a web-based interface that provides visualizations and summaries of all your experiments, making it easy to compare results.
Q: How does experiment tracking help with hyperparameter tuning?
A: It allows you to track and compare the metrics of different hyperparameter combinations, helping you find the optimal set of values.
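The comparison step can be sketched as follows: given a list of logged runs (the data here is made up for illustration), pick the run with the best tracked metric to find the winning hyperparameters.

```python
# Hypothetical logged runs, as an experiment tracker might return them.
runs = [
    {"run_id": "a1", "params": {"lr": 0.1,   "epochs": 10}, "accuracy": 0.88},
    {"run_id": "b2", "params": {"lr": 0.01,  "epochs": 20}, "accuracy": 0.93},
    {"run_id": "c3", "params": {"lr": 0.001, "epochs": 20}, "accuracy": 0.91},
]

# Rank runs by the logged metric, best first.
best = max(runs, key=lambda r: r["accuracy"])
print(best["run_id"], best["params"])  # → b2 {'lr': 0.01, 'epochs': 20}
```

Tracking dashboards automate exactly this kind of query, but over hundreds of runs and multiple metrics at once.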
Q: What is a "parameter" in the context of an MLflow run?
A: A parameter is a key-value pair used to record an input to your run, such as the learning rate or the number of epochs.
Q: Can I log a custom image or plot to an experiment tracking tool?
A: Yes, most tools allow you to log various artifacts, including plots generated during analysis.
Q: How does MLflow's Projects component help with experiment tracking?
A: It provides a convention for packaging code in a reusable and reproducible way, ensuring consistent experiment execution across different machines.
Q: What is the main benefit of using a hosted experiment tracking service (like W&B) over a local one?
A: Hosted services provide a centralized, shareable platform for teams, with no setup required and a user-friendly UI.
Q: What is a run.log() function in a tracking tool?
A: A generic function to log any kind of data to an experiment, often used for logging custom metrics or artifacts.
Q: What is the purpose of logging a "tag" in an experiment?
A: Tags are used for labeling and grouping experiments. For example, you can tag runs with "production" or "testing" to easily filter them.
Q: How can experiment tracking help in a debugging process?
A: By tracking all the details of each run, you can compare a failed run with a successful one to identify what went wrong (e.g., a change in hyperparameters).
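A toy illustration of that debugging pattern: diff the logged parameters of a failed run against a successful one to surface exactly what changed (the run data is made up).

```python
def diff_params(good, bad):
    """Return the parameters whose values differ between two logged runs."""
    keys = set(good) | set(bad)
    return {k: (good.get(k), bad.get(k))
            for k in keys if good.get(k) != bad.get(k)}

good_run   = {"lr": 0.01, "batch_size": 32, "dropout": 0.1}
failed_run = {"lr": 0.5,  "batch_size": 32, "dropout": 0.1}

print(diff_params(good_run, failed_run))  # → {'lr': (0.01, 0.5)}
```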
Q: What is the difference between a metric and an artifact?
A: A metric is a numerical value (e.g., accuracy, loss). An artifact is a file (e.g., the model file, a plot).
Pipeline Orchestration: Automation & Scheduling (Q61-Q80) 🔄
Q: What is a "pipeline" in MLOps?
A: An ML pipeline is a series of automated steps that represent the end-to-end ML workflow, from data ingestion to model deployment.
Q: Why is pipeline orchestration important in MLOps?
A: It automates the entire ML workflow, ensuring consistency, efficiency, and scalability, and eliminating manual steps.
Q: Name three popular pipeline orchestration tools.
A: Apache Airflow, Kubeflow Pipelines, and Prefect.
Q: What is a "DAG" in pipeline orchestration?
A: A DAG (Directed Acyclic Graph) is a visual representation of a pipeline. It defines the sequence of tasks and their dependencies.
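The DAG idea can be sketched with Python's standard-library graphlib: declare each task's dependencies and let a topological sort produce a valid execution order. The task names are illustrative, not tied to any orchestrator.

```python
from graphlib import TopologicalSorter

# Each task maps to the set of tasks it depends on.
pipeline = {
    "ingest": set(),
    "validate": {"ingest"},
    "preprocess": {"validate"},
    "train": {"preprocess"},
    "evaluate": {"train"},
    "deploy": {"evaluate"},
}

# A topological sort yields an order that respects every dependency.
order = list(TopologicalSorter(pipeline).static_order())
print(order)
```

Because these tasks form a single chain, the order is unique; real pipelines branch, and orchestrators use the same dependency information to decide which tasks can run in parallel.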
Q: What is the difference between a "task" and a "pipeline"?
A: A task is a single, atomic operation within a pipeline (e.g., "preprocess data"). A pipeline is a collection of interconnected tasks.
Q: How does a pipeline orchestrator handle failures?
A: Orchestrators can be configured to automatically retry a failed task or send a notification to a team when a task fails.
Q: What is the purpose of a "scheduler" in a pipeline orchestrator?
A: The scheduler is the component that triggers the execution of pipelines based on a predefined schedule (e.g., every day at midnight).
Q: What is the main benefit of using a managed orchestration service (like Google Cloud Composer)?
A: It removes the operational overhead of setting up and managing the orchestrator's infrastructure.
Q: How does a pipeline orchestrator help with reproducibility?
A: By automating the entire process, it ensures that every time the pipeline runs, the exact same steps are executed, reducing human error.
Q: What is the difference between a "stateless" and a "stateful" pipeline?
A: A stateless pipeline processes data independently. A stateful pipeline remembers information from previous runs, which can complicate reproducibility.
Q: How can you trigger a pipeline run?
A: You can trigger a run manually, on a schedule, or based on an external event (e.g., a new file arriving in a data lake).
Q: What is a "trigger" in the context of an ML pipeline?
A: A trigger is an event or a condition that starts a pipeline run.
Q: What is the purpose of "dependencies" in a pipeline?
A: Dependencies define the order in which tasks must be executed. A task cannot start until its dependent tasks are complete.
Q: How can you visualize a pipeline's progress?
A: Orchestration tools provide a web-based UI that shows the real-time status of each task in a pipeline.
Q: What is the purpose of a "data artifact" in a pipeline?
A: A data artifact is the output of one task that serves as the input for a subsequent task. It ensures a clear hand-off between pipeline steps.
Q: How do you handle secrets (e.g., API keys) in a pipeline?
A: Secrets should be stored in a secure secret management system and accessed by the pipeline through a secure connection.
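A minimal sketch of the consuming side: the pipeline reads the secret from an environment variable injected at runtime by the orchestrator or secret manager, rather than hard-coding it. The variable name here is hypothetical.

```python
import os

def get_api_key():
    """Fetch a credential injected by the secret manager at runtime.
    Failing fast beats silently running with a missing credential."""
    key = os.environ.get("MODEL_API_KEY")  # hypothetical variable name
    if key is None:
        raise RuntimeError("MODEL_API_KEY is not set")
    return key
```

The key never appears in source code or in the Git history; rotating it requires no code change.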
Q: What is the role of a "Kubeflow Pipeline" in MLOps?
A: Kubeflow Pipelines is a platform for building and deploying portable, scalable ML pipelines on Kubernetes.
Q: How does Apache Airflow handle tasks?
A: Airflow uses Python to define DAGs, and each task is an Operator that represents a specific action.
Q: What is the benefit of using an orchestrator for retraining a model?
A: It automates the entire retraining process, from data fetching and preprocessing to model training and deployment, ensuring a continuous loop.
Q: What is the difference between a pipeline orchestrator and a CI/CD tool?
A: A CI/CD tool (e.g., Jenkins) focuses on code integration and deployment. A pipeline orchestrator is designed specifically for the complex dependencies and data flow of ML workflows.
General & Advanced MLOps Concepts (Q81-Q100) 💡
Q: What is a "Model Registry"?
A: A Model Registry is a centralized hub for managing the lifecycle of ML models, including versioning, metadata, and stage transitions.
Q: What is "Model serving"?
A: Model serving is the process of deploying a trained model so that it can receive input data and return predictions.
Q: Name two common model serving frameworks.
A: TensorFlow Serving and TorchServe.
Q: What is "A/B Testing" in MLOps?
A: A/B Testing is a method for comparing two versions of a model by exposing a portion of traffic to each version to determine which performs better.
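A common way to implement the split is a deterministic hash of the user ID, so each user consistently sees the same model variant across requests. A minimal sketch (the 50/50 split ratio and variant names are assumptions):

```python
import hashlib

def assign_variant(user_id, split=0.5):
    """Deterministically route a user to model A or B based on a hash."""
    digest = hashlib.sha256(user_id.encode()).hexdigest()
    bucket = int(digest, 16) % 100 / 100  # stable value in [0, 1)
    return "model_a" if bucket < split else "model_b"

# The assignment is stable: the same user always gets the same variant.
assert assign_variant("user-42") == assign_variant("user-42")
print(assign_variant("user-42"))
```

Hash-based assignment avoids storing per-user state and makes the experiment reproducible when analyzing results later.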
Q: How does MLOps help with cost optimization?
A: By automating pipelines and using scalable infrastructure, MLOps reduces manual effort and optimizes resource usage.
Q: What is a "reproducible environment"?
A: A reproducible environment is a consistent and isolated environment (e.g., a Docker container) that contains all the necessary dependencies to run a project.
Q: What is the difference between a CI/CD pipeline and an MLOps pipeline?
A: An MLOps pipeline adds steps like data validation, model training, and model serving to the standard CI/CD process.
Q: How would you deal with a "data drift" alert?
A: You would investigate the data changes, potentially retrain the model on the new data, and deploy the new version.
Q: What is the role of a "feature store"?
A: A feature store acts as a centralized database for both offline training and online serving features, ensuring consistency.
Q: What is the benefit of using a "ML Metadata Store"?
A: It stores metadata about every run, from data lineage to experiment results, providing a comprehensive audit trail.
Q: What are the main components of a model monitoring system?
A: Components include data drift detection, model performance tracking, and anomaly detection.
Q: What is "CI/CD for ML"?
A: A system that automates the testing and deployment of ML code and models, similar to traditional software but with ML-specific considerations.
Q: What is a "canary deployment" in MLOps?
A: Canary deployment is a strategy where a new version of the model is deployed to a small subset of users before a full rollout.
Q: How does MLOps handle ethical concerns in ML?
A: By providing a framework for tracking model lineage and auditing for bias, MLOps helps ensure models are fair and transparent.
Q: What is a "batch prediction" service?
A: A service that processes large volumes of data at once to generate predictions, typically on a scheduled basis.
Q: How does a "Model Registry" help in a team with multiple data scientists?
A: It provides a single source of truth for which model versions are approved for production, preventing conflicts.
Q: What is the difference between "model versioning" and "model staging"?
A: Versioning is about tracking changes over time (version 1, 2, 3). Staging is about categorizing models based on their readiness (e.g., Staging, Production, Archive).
Q: What is the main challenge of "online inference"?
A: Ensuring low latency and high throughput for real-time predictions.
Q: What is the purpose of a "rollback" in MLOps?
A: Rollback is the process of reverting to a previous, stable version of a model in case of a failure in production.
Q: What is the relationship between MLOps, DevOps, and Data Engineering?
A: MLOps can be seen as the intersection of all three: it uses DevOps principles for automation, Data Engineering practices for data pipelines, and ML principles for model development.