| Subject | Date | Title |
|---|---|---|
| AI for Image Analysis - 20A30702b (Theory) | Sept. 1, 2025 | UNIT V (Image Processing Using Machine Learning & Real-Time Use Cases) |

Scale-invariant feature transform

Scale-invariant feature transform (SIFT) descriptors provide an alternative representation for image regions. They are very useful for matching images. As demonstrated earlier, simple corner detectors work well when the images to be matched are similar in nature (with respect to scale, orientation, and so on). But if the images have different scales and rotations, SIFT descriptors need to be used to match them. SIFT is not just scale invariant; it still obtains good results when the rotation, illumination, and viewpoint of the images change as well. Let's discuss the primary steps involved in the SIFT algorithm, which transforms image content into local feature coordinates that are invariant to translation, rotation, scale, and other imaging parameters.

Algorithm to compute SIFT descriptors

- Scale-space extrema detection: search over multiple scales and image locations; the locations and characteristic scales are given by a DoG (Difference of Gaussians) detector
- Keypoint localization: select keypoints based on a measure of stability; keep only the strong interest points by eliminating the low-contrast and edge keypoints
- Orientation assignment: compute the best orientation(s) for each keypoint region, which contributes to the stability of matching
- Keypoint descriptor computation: use local image gradients at the selected scale and rotation to describe each keypoint region

As discussed, SIFT is robust with regard to small variations in illumination (due to the gradient and normalization), pose (small affine variation, due to the orientation histogram), scale (by DoG), and intra-class variability (small variations, due to histograms).

With opencv and opencv-contrib

We will first construct a SIFT object and then use the detect() method to compute the keypoints in an image. Every keypoint is a special feature and has several attributes: for example, its (x, y) coordinates, angle (orientation), response (strength of the keypoint), size of the meaningful neighborhood, and so on.
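A minimal sketch of this step, assuming opencv-contrib-python is installed and using a hypothetical image path (in OpenCV 4.4+ the factory is cv2.SIFT_create(); older builds expose it as cv2.xfeatures2d.SIFT_create()):

```python
import cv2

img = cv2.imread('images/monalisa.jpg')        # hypothetical input image path
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

sift = cv2.SIFT_create()                       # construct the SIFT object
kp = sift.detect(gray, None)                   # detect the keypoints
kp, des = sift.compute(gray, kp)               # compute a 128-dimensional descriptor per keypoint

print(len(kp), des.shape)                      # number of keypoints, descriptor matrix shape
# each keypoint carries its own attributes
print(kp[0].pt, kp[0].angle, kp[0].response, kp[0].size)

# draw the keypoints (with size and orientation) on the image and save the result
out = cv2.drawKeypoints(img, kp, None,
                        flags=cv2.DRAW_MATCHES_FLAGS_DRAW_RICH_KEYPOINTS)
cv2.imwrite('sift_keypoints.jpg', out)
```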
RANSAC algorithm

In this example, we will match an image with its affine-transformed version; the two can be considered as if they were taken from different viewpoints. The following steps describe the image matching algorithm:

1. First, we will compute the interest points, or the Harris corners, in both images.
2. A small window around each point will be considered, and the correspondences between the points will then be computed using a weighted sum of squared differences. This measure is not very robust, and it's only usable with slight viewpoint changes.
3. Once the correspondences are found, a set of source and corresponding destination coordinates will be obtained; these are used to estimate the geometric transformation between the two images.
4. A simple estimation of the parameters from the coordinates is not enough, since many of the correspondences are likely to be faulty.
5. The RANdom SAmple Consensus (RANSAC) algorithm is used to robustly estimate the parameters, first by classifying the points into inliers and outliers, and then by fitting the model to the inliers while ignoring the outliers, in order to find matches consistent with an affine transformation.

The next code block shows how to implement the image matching using the Harris Corner features:
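A sketch of this pipeline, assuming scikit-image (skimage) is available; the astronaut test image, the window size, and the RANSAC thresholds are illustrative choices rather than values from the original text:

```python
import numpy as np
from skimage import data
from skimage.color import rgb2gray
from skimage.feature import corner_harris, corner_peaks
from skimage.measure import ransac
from skimage.transform import AffineTransform, warp

# Two "viewpoints" of the same scene: a grayscale image and an affine-warped copy of it.
img_orig = rgb2gray(data.astronaut())
tform = AffineTransform(scale=(0.9, 0.9), rotation=0.15, translation=(20, -10))
img_warped = warp(img_orig, tform.inverse, output_shape=img_orig.shape)

# Step 1: Harris corners in both images (limited to 100 peaks to keep the sketch fast).
coords_orig = corner_peaks(corner_harris(img_orig), min_distance=5,
                           threshold_rel=0.001, num_peaks=100)
coords_warped = corner_peaks(corner_harris(img_warped), min_distance=5,
                             threshold_rel=0.001, num_peaks=100)

def gaussian_weights(window_ext, sigma=1):
    y, x = np.mgrid[-window_ext:window_ext + 1, -window_ext:window_ext + 1]
    g = np.exp(-(x ** 2 + y ** 2) / (2 * sigma ** 2))
    return g / g.sum()

def match_corner(coord, window_ext=5):
    # Step 2: pick the corner in the warped image whose surrounding window has the
    # smallest Gaussian-weighted sum of squared differences with the original window.
    r, c = np.round(coord).astype(np.intp)
    window_orig = img_orig[r - window_ext:r + window_ext + 1,
                           c - window_ext:c + window_ext + 1]
    weights = gaussian_weights(window_ext, sigma=3)
    ssds = []
    for cr, cc in coords_warped:
        window_warped = img_warped[cr - window_ext:cr + window_ext + 1,
                                   cc - window_ext:cc + window_ext + 1]
        if window_warped.shape != window_orig.shape:   # skip windows cut off at the border
            ssds.append(np.inf)
            continue
        ssds.append(np.sum(weights * (window_orig - window_warped) ** 2))
    return coords_warped[np.argmin(ssds)]

# Step 3: build the source/destination coordinate pairs (many will be faulty).
src = np.array([coord for coord in coords_orig])
dst = np.array([match_corner(coord) for coord in coords_orig])

# Steps 4-5: robustly estimate the affine transform with RANSAC;
# 'inliers' marks the correspondences consistent with the estimated model.
model_robust, inliers = ransac((src, dst), AffineTransform, min_samples=3,
                               residual_threshold=2, max_trials=1000)
print('estimated scale/rotation/translation:',
      model_robust.scale, model_robust.rotation, model_robust.translation)
print('number of inlier matches:', inliers.sum())
```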
Image Classification Using CNNs

CNNs are deep neural networks for which the primary input is images. CNNs learn the filters (features) that are hand-engineered in traditional algorithms. This independence from prior knowledge and human effort in feature design is a major advantage. They also reduce the number of parameters to be learned through their shared-weights architecture, and they possess translation-invariance characteristics. In the next subsection, we'll discuss the general architecture of a CNN and how it works.

Conv, pooling, and FC layers – CNN architecture and how it works

A typical CNN consists of one or more convolutional layers, each followed by a nonlinear ReLU activation layer and a pooling layer, and finally one (or more) fully connected (FC) layer followed by an FC softmax layer, for example, in the case of a CNN designed to solve an image classification problem. There can be multiple convolution-ReLU-pooling sequences of layers in the network, making the neural network deeper and useful for solving complex image processing tasks. A minimal sketch of such an architecture is shown below.
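A minimal sketch of this stack in Keras, assuming TensorFlow 2.x; the input size, filter counts, and the 10-class output are illustrative, not from the original text:

```python
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Conv2D, Dense, Dropout, Flatten, MaxPooling2D

model = Sequential([
    Conv2D(32, (3, 3), activation='relu', padding='same', input_shape=(64, 64, 3)),
    MaxPooling2D((2, 2)),
    Conv2D(64, (3, 3), activation='relu', padding='same'),   # a second conv-ReLU-pool block
    MaxPooling2D((2, 2)),
    Flatten(),                                               # 3D volume -> 1D vector
    Dense(128, activation='relu'),                           # FC layer
    Dropout(0.5),                                            # dropout rate p = 0.5
    Dense(10, activation='softmax'),                         # FC softmax output layer
])
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
model.summary()
```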
Convolutional layer

The main building block of a CNN is the convolutional layer. It consists of a bunch of convolution filters (kernels), which we already discussed in detail in Chapter 2 (refer to Textbook 2), Sampling, Fourier Transform, and Convolution. The convolution is applied to the input image using a convolution filter to produce a feature map. The input to the convolutional layer is, for example, the input image, and the convolution filter is also called the kernel. As usual, the convolution operation is performed by sliding the filter over the input: at every location, the sum of the element-wise matrix multiplication goes into the feature map. A convolutional layer is represented by its width and height (the size of a filter is width x height) and its depth (the number of filters). Stride specifies how much the convolution filter is moved at each step (the default value is 1). Padding refers to the layers of zeros that surround the input (generally used to keep the input and output image sizes the same, also known as same padding). For example, a 3 x 3 x 3 convolution filter can be applied to an RGB image, first with valid padding, and then the computation can be repeated with two such filters using stride = 1 and padding = 1. A small NumPy sketch of the sliding-window computation follows.
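A small NumPy sketch of the convolution computation; the 5 x 5 input and the 3 x 3 kernel are made-up values, used only to show how stride and padding affect the output size ('valid' gives 3 x 3 here; padding = 1 with stride = 1, i.e. 'same', gives 5 x 5):

```python
import numpy as np

def conv2d(image, kernel, stride=1, padding=0):
    if padding:
        image = np.pad(image, padding)            # surround the input with layers of zeros
    kh, kw = kernel.shape
    out_h = (image.shape[0] - kh) // stride + 1
    out_w = (image.shape[1] - kw) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            window = image[i * stride:i * stride + kh, j * stride:j * stride + kw]
            out[i, j] = np.sum(window * kernel)   # element-wise multiply, then sum
    return out

img = np.arange(25, dtype=float).reshape(5, 5)    # made-up 5x5 single-channel input
kernel = np.array([[1., 0., -1.],
                   [1., 0., -1.],
                   [1., 0., -1.]])                # a simple vertical-edge filter

print(conv2d(img, kernel).shape)                  # (3, 3)  -> valid padding
print(conv2d(img, kernel, padding=1).shape)       # (5, 5)  -> same padding, stride = 1
```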
Pooling layer

After a convolution operation, a pooling operation is generally performed to reduce dimensionality and the number of parameters to be learned, which shortens the training time, requires less data to train on, and combats overfitting. Pooling layers downsample each feature map independently, reducing the height and width but keeping the depth intact. The most common type of pooling is max pooling, which just takes the maximum value in the pooling window. Contrary to the convolution operation, pooling has no parameters: it slides a window over its input and simply takes the maximum value in the window. As with a convolution, the window size and stride for pooling can be specified.

Non-linearity – ReLU layer

For any kind of neural network to be powerful, it needs to contain non-linearity. The result of the convolution operation is hence passed through a non-linear activation function. ReLU activation is generally used to achieve non-linearity (and to combat the vanishing gradient problem that arises with sigmoid activation). So, the values in the final feature maps are not actually the sums, but the ReLU function applied to them.

FC layer

After the convolutional and pooling layers, a couple of FC layers are generally added to wrap up the CNN architecture. The outputs of both convolutional and pooling layers are 3D volumes, but an FC layer expects a 1D vector of numbers. So, the output of the final pooling layer needs to be flattened to a vector, and that becomes the input to the FC layer. Flattening is simply arranging the 3D volume of numbers into a 1D vector.

Dropout

Dropout is the most popular regularization technique for deep neural networks. Dropout is used to prevent overfitting, and it is typically used to increase the performance (accuracy) of the deep learning task on an unseen dataset. During training, at each iteration, a neuron is temporarily dropped or disabled with some probability p. This means all the inputs and outputs to this neuron are disabled at the current iteration. This hyperparameter p is called the dropout rate, and it's typically a number around 0.5, corresponding to 50% of the neurons being dropped out. A tiny numeric illustration of these operations is given below.
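A tiny NumPy illustration of the operations described above (ReLU, 2 x 2 max pooling with stride 2, and dropout with p = 0.5); the 4 x 4 feature map is made up, and the 1/(1 - p) rescaling is the common "inverted dropout" convention, an assumption not stated in the original text:

```python
import numpy as np

fmap = np.array([[ 1., -2.,  3.,  0.],
                 [-1.,  5., -3.,  2.],
                 [ 4., -6.,  7., -8.],
                 [ 0.,  1., -2.,  9.]])          # made-up feature map (the "sums")

relu = np.maximum(fmap, 0)                       # non-linearity: negative values become 0

# 2x2 max pooling with stride 2: keep the maximum of each non-overlapping 2x2 window
pooled = relu.reshape(2, 2, 2, 2).max(axis=(1, 3))
print(pooled)                                    # [[5. 3.]
                                                 #  [4. 9.]]

# dropout at training time: disable each unit with probability p,
# and rescale the survivors by 1/(1 - p) (inverted dropout)
p = 0.5
mask = (np.random.rand(*pooled.shape) >= p) / (1 - p)
print(pooled * mask)
```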
Image Classification Using Machine Learning Approaches: Decision Trees, Support Vector Machines, Logistic Regression, Code, Important Terms

Decision Trees

- Supervised learning technique for classification or regression problems
- Advantages / Disadvantages
- Bias: error from erroneous assumptions in the learning algorithm (error due to model mismatch)
- Variance: error from sensitivity to small changes in the training set (variation due to the training sample and randomization)
- Bias/variance trade-off
- Pruning (cutting back branches of a fully grown tree to reduce variance and overfitting)
Random forests: an ensemble of decision trees, each trained on a bootstrap sample of the data (bagging) with random subsets of features; averaging the trees reduces the variance. A scikit-learn sketch of both models is shown below.
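A minimal sketch using scikit-learn (assumed available); the Iris dataset and the pruning and forest-size parameters are illustrative choices, not from the original text:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# An unpruned tree has low bias but high variance (it overfits the training sample);
# limiting the depth / cost-complexity pruning trades a little bias for lower variance.
tree = DecisionTreeClassifier(max_depth=3, ccp_alpha=0.01, random_state=0)
tree.fit(X_train, y_train)
print('decision tree accuracy:', tree.score(X_test, y_test))

# A random forest averages many decision trees trained on bootstrap samples,
# which reduces the variance further.
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)
print('random forest accuracy:', forest.score(X_test, y_test))
```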
Support Vector Machines:
- Very popular and widely used supervised learning classification algorithm
- Can be applied to almost everything
- Non-linear spaces
- Maths: https://chatgpt.com/s/t_68b5c22784e88191960e71fbffc4d2f5
The decision function of a kernel SVM classifier is

$f(x) = \operatorname{sign}\left(\sum_{i=1}^{N} \alpha_i\, y_i\, K(x_i, x) + b\right)$

Here, the $x_i$ are the support vectors, $y_i \in \{-1, +1\}$ are their labels, the $\alpha_i$ are the learned weights (Lagrange multipliers), $K$ is the kernel function, and $b$ is the bias term.
Optional: kernel types

1. Linear kernel: $K(x_i, x_j) = x_i \cdot x_j$
2. Polynomial kernel: $K(x_i, x_j) = (x_i \cdot x_j + c)^d$
- SVMs with non-linear kernels add additional dimensions to the data in order to create separation in this way (the kernel implicitly maps the data into a higher-dimensional feature space where a linear separator exists); see the toy sketch after this list
- Advantages
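A toy sketch using scikit-learn (assumed available): on a linearly inseparable dataset (make_moons), a linear kernel struggles while a non-linear polynomial kernel separates the classes; the dataset and hyperparameters are illustrative choices, not from the original text:

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# two interleaving half-moons: not separable by a straight line
X, y = make_moons(n_samples=500, noise=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

for kernel, params in [('linear', {}), ('poly', {'degree': 3, 'coef0': 1})]:
    clf = SVC(kernel=kernel, C=1.0, **params)
    clf.fit(X_train, y_train)
    print(kernel, 'kernel accuracy:', round(clf.score(X_test, y_test), 3))
```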
Logistic Regression
$p(x) = \dfrac{e^{\beta_0 + \beta_1 x}}{1 + e^{\beta_0 + \beta_1 x}}$
Here, p(x) = P(default = 1 | balance = x) is the probability of default when the balance equals x. It has a value between 0 and 1. Logistic regression fits the $\beta_0$ and $\beta_1$ parameters; these are the intercept and slope coefficients of the model. The fitted curve is not linear: we can make it linear with the help of the logit (log-odds) transformation, $\log\left(\frac{p(x)}{1 - p(x)}\right) = \beta_0 + \beta_1 x$. A small scikit-learn sketch follows.
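A small sketch of fitting a logistic regression with scikit-learn (assumed available); the synthetic "balance"/"default" data below is made up purely to illustrate the API:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
balance = rng.uniform(0, 2500, size=1000).reshape(-1, 1)        # hypothetical credit-card balances
p_true = 1 / (1 + np.exp(-(-10 + 0.0055 * balance.ravel())))    # made-up "true" default probability
default = rng.binomial(1, p_true)                               # 0 = no default, 1 = default

# weak regularization and more iterations so the fit is close to plain maximum likelihood
clf = LogisticRegression(C=1e5, max_iter=1000)
clf.fit(balance, default)

b0, b1 = clf.intercept_[0], clf.coef_[0, 0]                     # fitted beta_0 and beta_1
print('beta_0 =', b0, 'beta_1 =', b1)

# predicted probability p(x) = e^(b0 + b1 x) / (1 + e^(b0 + b1 x)) for a balance of 2000
print(clf.predict_proba([[2000.]])[0, 1])
```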
Introduction to Real-Time Use Cases

A few real-time use cases:

- A fully convolutional model for detecting objects: YOLO (v2)
- Deep segmentation with DeepLab (v3)
- Transfer learning: what it is and when to use it
- Deep style transfer with cv2, using a pretrained torch-based deep learning model

Introducing YOLO v2

YOLO is a very popular, fully convolutional algorithm that is used for detecting objects in images. It gives a very high accuracy rate compared to other algorithms and also runs in real time. As the name suggests, this algorithm looks only once at an image: it requires only one forward propagation pass to make accurate predictions. In this section, we will detect objects in images with a fully convolutional network (FCN) deep learning model. Given an image with some objects (for example, animals, cars, and so on), the goal is to detect the objects in those images using a pre-trained YOLO model, with bounding boxes.

Deep semantic segmentation with DeepLab V3+

In this section, we'll discuss how to use a deep learning FCN to perform semantic segmentation of an image. Before diving into further details, let's clarify the basic concepts.

Semantic segmentation

Semantic segmentation refers to an understanding of an image at the pixel level; that is, we want to assign each pixel in the image an object class (a semantic label). It is a natural step in the progression from coarse to fine inference. It achieves fine-grained inference by making dense predictions that infer a label for every pixel, so that each pixel is labeled with the class of its enclosing object or region.

Transfer learning – what it is, and when to use it

Transfer learning is a deep learning strategy that reuses knowledge gained from solving one problem by applying it to a different, but related, problem. For example, let's say we have three types of flowers, namely a rose, a sunflower, and a tulip. We can use standard pre-trained models, such as VGG16/19, ResNet50, or InceptionV3 (pretrained on ImageNet with 1,000 output classes, which can be found at https://gist.github.com/yrevar/942d3a0ac09ec9e5eb3a), to classify the flower images, but our model wouldn't be able to correctly identify them, because these flower categories were not learned by the models. In other words, they are classes the model is not aware of.

Transfer learning with Keras

Pre-trained models are trained on large, comprehensive image classification problems. The convolutional layers act as a feature extractor and the fully connected (FC) layers act as a classifier, for example, in the context of cat vs. dog image classification with a conv net. A minimal sketch is shown below.
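A minimal transfer-learning sketch with Keras (assuming TensorFlow 2.x): VGG16's convolutional base is reused as a frozen feature extractor, and only a new FC classifier is trained. The 3-class flower setup and the 'flowers' directory path are hypothetical, not from the original text:

```python
from tensorflow.keras import Model
from tensorflow.keras.applications import VGG16
from tensorflow.keras.layers import Dense, Flatten

base = VGG16(weights='imagenet', include_top=False, input_shape=(224, 224, 3))
base.trainable = False                       # freeze the pre-trained convolutional layers

x = Flatten()(base.output)
x = Dense(256, activation='relu')(x)         # new FC classifier on top of the extracted features
out = Dense(3, activation='softmax')(x)      # 3 classes: rose, sunflower, tulip (hypothetical)

model = Model(inputs=base.input, outputs=out)
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
model.summary()

# Hypothetical training call, assuming flower images are arranged in
# 'flowers/<class_name>/*.jpg' folders:
# from tensorflow.keras.preprocessing.image import ImageDataGenerator
# gen = ImageDataGenerator(rescale=1. / 255).flow_from_directory('flowers', target_size=(224, 224))
# model.fit(gen, epochs=5)
```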