CNN Frequently Asked Interview Questions

Q1. What is a Convolutional Neural Network (CNN)?
A1. A CNN is a type of neural network specialized for processing grid-like data, such as images, using convolutional layers to extract spatial features.


Q2. How do CNNs differ from traditional ANNs?
A2. CNNs use convolutional layers for feature extraction and weight sharing, whereas ANNs rely on fully connected layers for input-output mapping. CNNs are more efficient for image and spatial data processing.


Q3. What is the role of convolutional layers in CNN?
A3. Convolutional layers apply filters (kernels) to the input data to extract features such as edges, textures, and patterns by performing element-wise multiplication and summation.
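As an illustration, here is a minimal pure-Python sketch of a "valid" (no padding) 2D convolution on toy values; real frameworks vectorize this, but the element-wise multiply-and-sum is the same:

```python
# Minimal "valid" 2D convolution in pure Python (toy illustration).
def conv2d(image, kernel):
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    out = [[0] * out_w for _ in range(out_h)]
    for i in range(out_h):
        for j in range(out_w):
            # element-wise multiplication and summation over the window
            out[i][j] = sum(image[i + di][j + dj] * kernel[di][dj]
                            for di in range(kh) for dj in range(kw))
    return out

# A simple vertical-edge detector applied to an image with a left/right step
image = [[0, 0, 1, 1],
         [0, 0, 1, 1],
         [0, 0, 1, 1]]
kernel = [[1, -1],
          [1, -1]]
result = conv2d(image, kernel)  # strongest response at the edge column
```

The output responds only where the image values change horizontally, which is exactly the "edge feature" intuition above.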


Q4. What is padding in CNNs, and why is it used?
A4. Padding adds extra pixels (typically zeros) around the border of the input so that spatial dimensions can be preserved after convolution and information at the edges is not lost.


Q5. What is the significance of strides in convolutional layers?
A5. Strides determine the step size of the kernel as it slides over the input, affecting the size of the output feature map.
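The standard output-size formula ties padding and stride together: out = floor((n + 2p - k) / s) + 1, where n is the input size, k the kernel size, p the padding, and s the stride. A quick sketch:

```python
# Output-size formula for a convolution along one dimension:
# out = floor((n + 2p - k) / s) + 1
def conv_output_size(n, k, p=0, s=1):
    return (n + 2 * p - k) // s + 1

same = conv_output_size(32, 3, p=1, s=1)    # padding 1 keeps 32 -> 32
halved = conv_output_size(32, 3, p=1, s=2)  # stride 2 roughly halves -> 16
valid = conv_output_size(5, 3)              # no padding shrinks 5 -> 3
```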


Q6. What is a receptive field in CNN?
A6. The receptive field is the region of the input image that contributes to the activation of a neuron in a specific layer. It grows as you move deeper into the network.


Q7. What is the role of pooling layers in CNNs?
A7. Pooling layers reduce the spatial dimensions of feature maps, retaining important features while minimizing computation and preventing overfitting.


Q8. Compare max pooling and average pooling.
A8. Max pooling selects the maximum value within a pooling window, emphasizing prominent features, while average pooling computes the mean of values, smoothing the features.
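A small pure-Python comparison of the two on one toy feature map (non-overlapping 2x2 windows):

```python
# 2x2 max pooling vs. average pooling on a toy feature map.
def pool2d(fmap, size=2, mode="max"):
    out = []
    for i in range(0, len(fmap) - size + 1, size):
        row = []
        for j in range(0, len(fmap[0]) - size + 1, size):
            window = [fmap[i + di][j + dj]
                      for di in range(size) for dj in range(size)]
            row.append(max(window) if mode == "max" else sum(window) / len(window))
        out.append(row)
    return out

fmap = [[1, 3, 2, 0],
        [4, 2, 1, 1],
        [0, 1, 5, 6],
        [2, 2, 7, 8]]
maxed = pool2d(fmap, mode="max")  # keeps the strongest activation per window
avged = pool2d(fmap, mode="avg")  # smooths each window to its mean
```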


Q9. What is the purpose of normalization in CNNs?
A9. Normalization, such as batch normalization, ensures that the inputs to each layer are centered and scaled, improving training speed and stability.
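A minimal sketch of the training-time computation for a single feature (real batch norm also tracks running statistics for inference; gamma and beta are the learnable scale and shift):

```python
import math

# Batch normalization for one feature across a batch:
# center by the batch mean, scale by the batch std, then apply gamma/beta.
def batch_norm(xs, gamma=1.0, beta=0.0, eps=1e-5):
    mean = sum(xs) / len(xs)
    var = sum((x - mean) ** 2 for x in xs) / len(xs)
    return [gamma * (x - mean) / math.sqrt(var + eps) + beta for x in xs]

normed = batch_norm([1.0, 2.0, 3.0, 4.0])  # roughly zero-mean, unit-scale
```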


Q10. What is the significance of filters (kernels) in CNNs?
A10. Filters are used to extract specific features like edges, textures, and patterns from the input data through convolution.


Q11. How do CNNs achieve translation invariance?
A11. Convolution is translation-equivariant (a shifted input produces a shifted feature map), and pooling adds a degree of translation invariance, allowing the network to recognize features largely regardless of their position.


Q12. What is overfitting in CNNs, and how can it be mitigated?
A12. Overfitting occurs when the model learns training data too well, including noise. It can be mitigated using dropout, data augmentation, and regularization techniques.


Q13. What is data augmentation, and why is it used in CNNs?
A13. Data augmentation artificially increases the training dataset by applying transformations such as rotation, flipping, and scaling, improving generalization and reducing overfitting.


Q14. What is transfer learning in CNNs?
A14. Transfer learning uses pre-trained CNN models on large datasets as a starting point for new tasks, saving time and improving performance with limited data.


Q15. What is fine-tuning in transfer learning?
A15. Fine-tuning involves updating the weights of a pre-trained model for a specific task, usually by retraining a subset of layers.


Q16. What are popular pre-trained CNN architectures?
A16. Examples include VGG, ResNet, Inception, MobileNet, and EfficientNet.


Q17. Explain the concept of depthwise separable convolution.
A17. Depthwise separable convolution splits standard convolution into two operations: depthwise convolution (spatial filtering) and pointwise convolution (channel mixing), reducing computation.
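The parameter savings are easy to verify with the standard counting formulas (a k x k standard convolution has k*k*C_in*C_out weights; the separable version has k*k*C_in depthwise plus C_in*C_out pointwise weights):

```python
# Parameter counts: standard vs. depthwise separable convolution.
def standard_conv_params(k, c_in, c_out):
    return k * k * c_in * c_out

def separable_conv_params(k, c_in, c_out):
    depthwise = k * k * c_in   # one k x k filter per input channel
    pointwise = c_in * c_out   # 1x1 convolution mixes channels
    return depthwise + pointwise

std = standard_conv_params(3, 64, 128)   # 73728 parameters
sep = separable_conv_params(3, 64, 128)  # 576 + 8192 = 8768 parameters
```

For a 3x3 kernel with 64 input and 128 output channels, the separable form uses roughly 8x fewer parameters, which is the source of MobileNet-style efficiency.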


Q18. What is the difference between shallow and deep CNNs?
A18. Shallow CNNs have fewer layers and are suited for simpler tasks, while deep CNNs have many layers and are capable of learning complex hierarchical features.


Q19. What is the role of fully connected layers in CNNs?
A19. Fully connected layers aggregate features learned by convolutional layers to make predictions.


Q20. What is dropout, and how does it prevent overfitting in CNNs?
A20. Dropout randomly disables neurons during training, reducing reliance on specific neurons and enhancing generalization.
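A sketch of "inverted" dropout at training time: each activation is kept with probability keep_prob and scaled by 1/keep_prob so the expected value is unchanged, which lets the layer become a plain identity at inference time:

```python
import random

# Inverted dropout: keep each activation with probability keep_prob,
# scaling survivors by 1/keep_prob to preserve the expected value.
def dropout(xs, keep_prob, rng):
    return [x / keep_prob if rng.random() < keep_prob else 0.0 for x in xs]

rng = random.Random(0)  # seeded for reproducibility in this toy example
out = dropout([1.0, 1.0, 1.0, 1.0], keep_prob=0.5, rng=rng)
# each entry is either dropped (0.0) or kept and scaled (2.0)
```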


Q21. How do CNNs handle color images compared to grayscale images?
A21. Color images have three input channels (RGB), whereas grayscale images have one. Each filter in the first convolutional layer spans all input channels: its depth equals the number of channels, and the per-channel responses are summed into a single output feature map.


Q22. What is the vanishing gradient problem in deep CNNs, and how is it addressed?
A22. The vanishing gradient problem occurs when gradients become very small during backpropagation, slowing down learning. Solutions include using ReLU activation and batch normalization.


Q23. What is feature extraction in CNNs?
A23. Feature extraction refers to the process where CNNs automatically identify and learn important patterns from input data, replacing manual feature engineering.


Q24. What is the purpose of stride in convolutional layers?
A24. Stride controls the step size of the filter during convolution, affecting output dimensions and computation.


Q25. How does ResNet solve the vanishing gradient problem?
A25. ResNet introduces residual connections (shortcuts) that allow gradients to flow directly across layers, improving training of very deep networks.
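The shortcut idea in one line: the block outputs f(x) + x, so even if the learned transformation f is near zero the block still passes x through unchanged, and gradients flow directly across the addition. A toy sketch:

```python
# Toy residual block: output = f(x) + x.
def residual_block(x, f):
    return [fi + xi for fi, xi in zip(f(x), x)]

# With f as the zero function, the block reduces to the identity mapping,
# which is why very deep residual stacks remain trainable.
out = residual_block([1.0, 2.0, 3.0], lambda xs: [0.0 for _ in xs])
```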


Q26. Explain the concept of dilated convolution.
A26. Dilated convolution spaces out the filter weights, enlarging the receptive field without adding parameters, which is useful for capturing larger context.


Q27. What are skip connections, and why are they useful in CNNs?
A27. Skip connections bypass one or more layers, enabling feature reuse and addressing issues like vanishing gradients and degradation in deep networks.


Q28. What is the difference between semantic segmentation and object detection in CNNs?
A28. Semantic segmentation assigns a label to each pixel, while object detection identifies and localizes objects within an image using bounding boxes.


Q29. What is the purpose of anchor boxes in object detection models like YOLO or Faster R-CNN?
A29. Anchor boxes are predefined templates for bounding boxes used to predict objects of various shapes and sizes.


Q30. What is the difference between upsampling and transposed convolution in CNNs?
A30. Upsampling increases spatial dimensions by repeating values or interpolating, while transposed convolution learns how to upscale features using filters.


Q31. How does batch normalization improve CNN training?
A31. Batch normalization stabilizes activations, accelerates convergence, reduces sensitivity to initialization, and mitigates the vanishing gradient problem.


Q32. What is a feature map in CNNs?
A32. A feature map is the output of convolution or pooling layers, representing extracted features from input data.


Q33. What is global average pooling (GAP) in CNNs?
A33. GAP computes the average of each feature map and reduces spatial dimensions to a single value per map, commonly used before fully connected layers.
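A sketch of the operation: an (n_maps, h, w) stack collapses to a length-n_maps vector, one mean per feature map:

```python
# Global average pooling: each feature map collapses to its mean value.
def global_average_pool(feature_maps):
    return [sum(sum(row) for row in fmap) / (len(fmap) * len(fmap[0]))
            for fmap in feature_maps]

maps = [[[1, 3], [5, 7]],   # mean 4.0
        [[0, 0], [0, 8]]]   # mean 2.0
vec = global_average_pool(maps)  # -> one scalar per feature map
```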


Q34. What is the difference between region proposal networks (RPNs) and sliding window techniques in object detection?
A34. RPNs generate object proposals dynamically, whereas sliding window techniques involve exhaustive searching, which is computationally expensive.


Q35. How does dropout impact the forward and backward passes in CNN training?
A35. Dropout disables neurons randomly during the forward pass, and gradients for disabled neurons are ignored during the backward pass, ensuring robustness.


Q36. What is the role of softmax in CNN-based classification tasks?
A36. Softmax converts logits into probabilities for multi-class classification, ensuring the sum of probabilities across classes equals 1.
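A sketch with the usual max-subtraction trick for numerical stability (subtracting the largest logit leaves the result unchanged but avoids overflow in exp):

```python
import math

# Numerically stable softmax: exponentiate shifted logits, then normalize.
def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])  # probabilities summing to 1
```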


Q37. How does attention improve CNNs?
A37. Attention mechanisms prioritize important regions of input data, enhancing feature extraction and prediction accuracy.


Q38. What is the difference between F1-score and accuracy in CNN evaluation?
A38. Accuracy measures overall correctness, while F1-score balances precision and recall, useful for imbalanced datasets.
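The imbalance point is easy to see on a toy example: a classifier that always predicts the majority class scores high accuracy but zero F1 on the minority (positive) class:

```python
# Accuracy vs. F1 on an imbalanced toy dataset.
def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def f1_score(y_true, y_pred):
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

y_true = [0] * 9 + [1]   # 90% negatives, one positive
y_pred = [0] * 10        # always predict the majority class
acc = accuracy(y_true, y_pred)  # 0.9, looks good
f1 = f1_score(y_true, y_pred)   # 0.0, reveals the failure
```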


Q39. How does YOLO differ from Faster R-CNN in object detection?
A39. YOLO performs object detection in a single forward pass, enabling real-time detection. Faster R-CNN generates region proposals first, leading to higher accuracy but slower speeds.


Q40. What are Feature Pyramid Networks (FPNs) in CNNs?
A40. FPNs build feature maps at multiple scales by combining a bottom-up backbone with a top-down pathway and lateral connections, improving performance on object detection and segmentation tasks.


Q41. What is an epoch in CNN training?
A41. An epoch refers to one full pass through the entire training dataset, ensuring all samples contribute to learning.


Q42. What is the benefit of using depthwise convolution in MobileNet?
A42. Depthwise convolution separates spatial filtering and channel mixing, significantly reducing computation and enabling efficient processing on mobile devices.


Q43. How does Faster R-CNN utilize region proposal networks (RPNs)?
A43. RPNs in Faster R-CNN generate region proposals by sliding a network over the image and predicting object bounds and scores, improving object detection speed and accuracy.


Q44. What are residual connections in ResNet? Why are they important?
A44. Residual connections skip one or more layers by adding input to output, facilitating gradient flow and addressing vanishing gradient issues in deep networks.


Q45. What is the difference between classification and localization in CNNs?
A45. Classification identifies the category of an image, while localization predicts the position (bounding box) of an object within the image.


Q46. What is transfer learning, and how does it apply to CNNs?
A46. Transfer learning involves using pre-trained CNN models to adapt to new tasks, reducing training time and improving performance on limited data.


Q47. What is an anchor box, and why is it used in object detection?
A47. Anchor boxes are predefined bounding boxes of various sizes and aspect ratios used to predict objects, enabling detection of multiple objects in different shapes and scales.
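Matching predictions to anchors or ground truth is typically done with intersection-over-union (IoU); a minimal sketch with corner-format boxes (x1, y1, x2, y2):

```python
# Intersection-over-union for axis-aligned boxes (x1, y1, x2, y2).
def iou(a, b):
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0

overlap = iou((0, 0, 2, 2), (1, 1, 3, 3))  # 1 unit of overlap over 7 of union
```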


Q48. How do CNNs process sequential frames for video analysis?
A48. For video analysis, CNNs can process frames individually, and temporal information can then be combined using recurrent networks (RNN, LSTM) or 3D convolutions.


Q49. What is semantic segmentation, and which CNN architecture is commonly used for it?
A49. Semantic segmentation assigns pixel-level labels to images. Architectures like U-Net and DeepLab are commonly used for segmentation tasks.


Q50. What are the challenges of training deep CNNs?
A50. Challenges include computational cost, vanishing/exploding gradients, overfitting, and requiring large labeled datasets.


Q51. Explain the concept of atrous (dilated) convolution.
A51. Atrous convolution expands the receptive field by inserting spaces between filter weights, capturing larger context without increasing the number of parameters.
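The receptive-field growth follows a simple formula: a kernel of size k with dilation rate d behaves like a sparse kernel of effective size k + (k - 1)(d - 1):

```python
# Effective kernel size of a dilated (atrous) convolution:
# k_eff = k + (k - 1) * (d - 1), where d is the dilation rate.
def effective_kernel_size(k, d):
    return k + (k - 1) * (d - 1)

k3_d1 = effective_kernel_size(3, 1)  # 3: ordinary convolution
k3_d2 = effective_kernel_size(3, 2)  # 5: same 9 weights, wider view
k3_d4 = effective_kernel_size(3, 4)  # 9: context grows, parameters do not
```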


Q52. What is the difference between single-shot and two-stage object detection?
A52. Single-shot methods like YOLO predict bounding boxes directly, while two-stage methods like Faster R-CNN first generate region proposals, then refine and classify them.


Q53. How does ROI pooling work in Faster R-CNN?
A53. ROI pooling converts region proposals into fixed-size feature maps by dividing them into grids and applying max pooling within each grid cell.


Q54. What is SPPNet, and how does it improve CNN efficiency?
A54. Spatial Pyramid Pooling (SPPNet) removes the need for fixed-size inputs by pooling feature maps at several grid levels into a fixed-length representation, and avoids recomputing convolutional features for every region, improving efficiency.


Q55. How does U-Net handle semantic segmentation tasks effectively?
A55. U-Net uses an encoder-decoder structure with skip connections, allowing detailed spatial information to be preserved for pixel-level predictions.


Q56. What is the difference between object detection and instance segmentation in CNNs?
A56. Object detection identifies and localizes objects with bounding boxes, while instance segmentation provides pixel-level masks for each object.


Q57. What is the advantage of feature pyramids in object detection?
A57. Feature pyramids enable multi-scale detection by combining low-level, high-resolution features with high-level, low-resolution features.


Q58. How does the focal loss improve training in object detection models like RetinaNet?
A58. Focal loss addresses class imbalance by down-weighting easy examples and focusing on hard, misclassified examples during training.
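For a single example with predicted probability p for the true class, focal loss is FL(p) = -(1 - p)^gamma * log(p); with gamma > 0, confident "easy" predictions are down-weighted relative to plain cross-entropy:

```python
import math

# Focal loss for one example: -(1 - p)**gamma * log(p).
def focal_loss(p, gamma=2.0):
    return -((1 - p) ** gamma) * math.log(p)

easy = focal_loss(0.9)  # confident correct prediction: tiny loss
hard = focal_loss(0.1)  # badly misclassified example: dominates training
```

Setting gamma = 0 recovers ordinary cross-entropy, which makes the "focusing" effect of gamma explicit.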


Q59. What is the difference between shared weights and independent weights in CNNs?
A59. Shared weights reduce parameter count by applying the same filter across spatial dimensions, while independent weights are unique for each connection, increasing flexibility.


Q60. How does EfficientNet optimize CNN architecture for scalability?
A60. EfficientNet uses compound scaling to adjust depth, width, and resolution uniformly, improving performance and reducing computation.


Q61. What are Vision Transformers (ViTs), and how do they differ from CNNs?
A61. Vision Transformers are models inspired by the transformer architecture used in NLP. Unlike CNNs, which rely on convolutional layers to extract local spatial features, ViTs process image patches as sequences and use attention mechanisms to model global dependencies.


Q62. What is the self-attention mechanism in Vision Transformers?
A62. The self-attention mechanism computes the relationship between different patches of an image, allowing the model to focus on relevant regions while considering contextual information.


Q63. How does transfer learning differ between CNNs and ViTs?
A63. Both use pre-trained models, but ViTs often benefit more from large-scale datasets during pretraining as they require more data than CNNs to perform well on downstream tasks.


Q64. What is self-supervised learning, and how is it applied in computer vision?
A64. Self-supervised learning creates pseudo-labels from unlabeled data for pretraining models. In vision, methods like contrastive learning (e.g., SimCLR, MoCo) and masked image modeling (e.g., MAE) are common.


Q65. How do you address scalability issues when deploying deep learning models like CNNs or ViTs?
A65. Scalability can be addressed by techniques like model quantization, pruning, knowledge distillation, and using efficient architectures (e.g., MobileNet, EfficientNet).


Q66. What is model quantization, and how does it improve deployment?
A66. Model quantization reduces the precision of weights (e.g., from 32-bit to 8-bit), reducing model size and improving inference speed with minimal loss in accuracy.
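A minimal sketch of uniform affine quantization (the scheme underlying typical 8-bit post-training quantization; real toolchains also calibrate ranges and handle per-channel scales):

```python
# Uniform affine quantization: map floats in [min, max] to 8-bit integers,
# then dequantize to recover an approximation of the original values.
def quantize(xs, num_bits=8):
    lo, hi = min(xs), max(xs)
    scale = (hi - lo) / (2 ** num_bits - 1) or 1.0  # guard constant input
    q = [round((x - lo) / scale) for x in xs]
    return q, scale, lo

def dequantize(q, scale, lo):
    return [qi * scale + lo for qi in q]

weights = [-1.0, -0.2, 0.0, 0.7, 1.0]
q, scale, lo = quantize(weights)
approx = dequantize(q, scale, lo)  # close to the originals, within one step
```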


Q67. What is model pruning, and how does it help in deep learning deployment?
A67. Model pruning removes redundant weights or neurons from a network, reducing computational complexity and memory requirements.


Q68. What is knowledge distillation, and how is it used in CNNs?
A68. Knowledge distillation involves training a smaller student model to mimic the behavior of a larger teacher model, achieving comparable performance with reduced computational cost.


Q69. How does masked autoencoder (MAE) work in self-supervised vision tasks?
A69. MAE masks random patches of an image and trains a model to reconstruct the missing regions, learning meaningful representations without labeled data.


Q70. What are adversarial attacks on CNNs, and how do you defend against them?
A70. Adversarial attacks involve perturbing inputs to mislead the model. Defenses include adversarial training, input preprocessing, and using robust architectures.


Q71. How does latency affect real-time deployment of CNNs, and how can it be reduced?
A71. Latency refers to the time taken for inference. It can be reduced using techniques like model optimization, hardware acceleration (e.g., GPUs, TPUs), and batching.


Q72. What is contrastive learning in self-supervised learning?
A72. Contrastive learning trains models by maximizing the similarity between positive pairs (e.g., augmented views of the same image) while minimizing the similarity to negative samples.


Q73. What is the difference between SimCLR and MoCo in contrastive learning?
A73. SimCLR computes contrastive loss using a large batch of positive and negative pairs, while MoCo maintains a memory bank of encoded representations, requiring smaller batches.


Q74. How does edge AI differ from traditional cloud-based AI deployments?
A74. Edge AI performs inference locally on edge devices, reducing latency and bandwidth usage, while cloud AI relies on centralized servers for computation.


Q75. What is federated learning, and how is it applied in vision tasks?
A75. Federated learning trains models collaboratively across multiple devices without sharing raw data, preserving privacy while enabling distributed training for vision tasks like image classification.


Q76. How do gradient-based explainability methods like Grad-CAM work for CNNs?
A76. Grad-CAM generates heatmaps highlighting regions of an image that contribute most to the model's prediction by using gradients of class scores with respect to feature maps.


Q77. What is the role of batch size in training CNNs and ViTs?
A77. Larger batch sizes stabilize training and accelerate convergence but require more memory. Smaller batch sizes may lead to noisier updates but can improve generalization.


Q78. What are synthetic datasets, and how are they used in CNN training?
A78. Synthetic datasets are artificially generated data used to augment or replace real-world data, enabling training when labeled data is scarce or expensive to acquire.


Q79. What is the difference between 2D convolution and 3D convolution?
A79. 2D convolution processes spatial dimensions (height and width), while 3D convolution incorporates temporal or volumetric data (height, width, depth) for video or medical imaging tasks.


Q80. What are lightweight CNN architectures, and why are they important?
A80. Lightweight architectures like MobileNet and SqueezeNet are designed for resource-constrained devices, balancing accuracy and efficiency.



