Introduction to Computer Vision

Computer Vision

Computer vision is a multidisciplinary field of artificial intelligence (AI) and computer science that focuses on enabling computers to interpret and understand visual information from the world. Its primary goal is to teach machines to replicate and even surpass human vision capabilities. This involves extracting meaningful information from images or videos, allowing computers to make decisions, recognize objects, and understand their surroundings. In parallel, researchers in computer vision have been developing mathematical techniques for recovering the three-dimensional shape and appearance of objects in imagery. Enabling computers to comprehend visual data involves several essential phases, including Image Acquisition, Image Pre-processing, Feature Extraction, Object Detection, and more, which we will discuss in more detail during this course.

Computer vision is used today in a wide variety of real-world applications, which include:

  • Optical character recognition (OCR): Reading handwritten words and automatic number plate recognition.
  • Machine inspection: Parts inspection for quality assurance.
  • 3D model building (photogrammetry): The fully automatic generation of 3D models from aerial photos.
  • Medical imaging: Interpret and analyze medical images, such as X-rays, MRIs, and CT scans, to aid in diagnosis.
  • Automotive safety: Detecting obstacles and pedestrians to assist drivers.

Sensors and Image Formation

Digital Images

A digital image is a visual representation of an object, scene, or subject that has been captured, created, or processed in a digital format. To represent an image in a computer system, a digital image is stored as an array of numbers (discrete elements called pixels). These numbers represent intensity (gray level, or each color band), range, X-ray absorption coefficient, and so on.

Array of numbers for representing the image
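As a quick check of this array view, you can load an image with OpenCV and inspect it as a NumPy array. This is a minimal sketch; 'your_image.jpg' is a placeholder path you would replace with your own file.

import cv2

# Load an image; OpenCV returns it as a NumPy array (rows x cols x channels)
image = cv2.imread('your_image.jpg')

# Dimensions of the array: height, width, and number of color channels
print('Shape:', image.shape)

# Intensity values of the pixel at row 0, column 0 (B, G, R order in OpenCV)
print('Top-left pixel:', image[0, 0])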

Intensity Image Sensors

In any imaging system, three main elements are needed for the job to be done correctly:

  1. Aperture: An opening (or "pupil") that limits the amount and angle of incoming light.
  2. Optical System: Lenses that concentrate light from a specific point in the scene onto a single point in the image.
  3. Imaging photosensitive surface: Film or a sensor, usually a plane, that captures the received light.

Basic elements of an imaging device

Digital Camera

A digital camera is an electronic device that captures and stores photographs and videos in a digital format. Unlike traditional film cameras, digital cameras use image sensors to convert light into digital data, which can be saved, displayed, and edited on various devices. Different types of image sensors are used in digital cameras, such as CCD (Charge-Coupled Device) and CMOS (Complementary Metal-Oxide-Semiconductor). The figure below shows a CCD camera imaging a 3D scene: discrete cells convert light energy into electrical charges, which are represented as small numbers when input to a computer.

A CCD camera imaging a vase

Camera Pinhole Model

The pinhole camera model, also known as the camera obscura, is a simple and fundamental concept in the field of optics and photography. The model serves as a theoretical basis for understanding how images are formed in a camera. The pinhole camera model describes the mathematical relationship between the coordinates of a point in three-dimensional space and its projection onto the image plane of an ideal pinhole camera, where the camera aperture is described as a point and no lenses are used to focus light. The figure below presents this model:

Camera Pinhole Model

Perspective Projection Equation

In this section, based on the pinhole camera model, we derive simple equations that describe the projection of a scene point onto the image plane. To simplify the derivation and avoid an inverted image, we treat the image plane as if it were positioned in front of the pinhole, at a distance equal to the focal length. We also define the origin of the camera's coordinate system at the pinhole (note: this is a 3D XYZ coordinate frame). According to the image below, we can compute the 2D position in the image plane of a 3D point given in the camera coordinate system:

Perspective Projection

By similar triangles, the 2D position of the point in the image plane is: $$x = f\,\frac{X}{Z}\qquad y = f\,\frac{Y}{Z}$$
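As a quick worked example (with numbers chosen purely for illustration): a point at $X = 0.2\,$m, $Y = 0.1\,$m, $Z = 2\,$m seen through a camera with focal length $f = 8\,$mm projects to $x = 8 \cdot 0.2 / 2 = 0.8\,$mm and $y = 8 \cdot 0.1 / 2 = 0.4\,$mm on the image plane.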

We can also compute the camera's field of view, which is in fact a parameter related to the image plane size. The field of view (FOV) of a camera refers to the extent of the observable world that can be seen through the camera lens or sensor, and it is typically measured in degrees.

As shown in the picture, we can calculate the field of view (FOV) of the camera using the following simple method:

Camera Field of View

$$\textrm{field of view} (\theta) \rightarrow \tan (\theta/2) = \frac{w/2}{f} $$
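As a short numeric illustration with assumed values: for a sensor of width $w = 6.4\,$mm and focal length $f = 8\,$mm, $\tan(\theta/2) = 3.2 / 8 = 0.4$, so $\theta \approx 2 \arctan(0.4) \approx 43.6^\circ$.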

Camera vs Image Plane Coordinate

Camera Coordinate System {C}:

  • A 3D coordinate system (X,Y,Z) – units say, in meters
  • Origin at the center of projection
  • Z axis points outward along optical axis
  • X points right, Y points down

Image Plane Coordinate System {π}:

  • A 2D coordinate system (x,y) – units in mm
  • Origin at the intersection of the optical axis with the image plane
  • In real systems, this is where the CCD or CMOS plane is

Image Buffer vs Image Plane

An image buffer is, in fact, a memory area or data structure used in computer graphics and imaging to temporarily store and manipulate image data during rendering or post-processing. Overall, the image buffer facilitates the conversion from the image plane (measured in millimeters) to pixel coordinates. To compare the image plane and the image buffer:

Image plane and Image Buffer

Image Plane {π}:

  • The real image is formed on the CCD plane
  • (x,y) units in mm
  • Origin in center (principal point)

Image Buffer {I}:

  • Digital (or pixel) image
  • (row, col) indices
  • We can also use $(x_{im}, y_{im})$
  • Origin in upper left

Conversion Between Image Plane and Pixel Image Coordinates

Suppose the pixel image has its central point (principal point) positioned at pixel coordinates $(c_x, c_y)$, with pixel spacing (size of pixels) measuring $(s_x, s_y)$ in millimeters.

Then for the conversion between the two coordinates we will have:

$$x = (x_{im} - c_{x}) s_{x} \qquad x_{im} = x/s_{x} + c_{x} $$

$$ y = (y_{im} - c_{y}) s_{y} \qquad y_{im} = y/s_{y} + c_{y} $$

Combining these equations with what we have learned about the camera coordinate system, we can write a direct transformation from the camera coordinate system to the image buffer as below:

$$ x_{im} = (f / s_{x}) X/Z + c_{x} \qquad y_{im} = (f / s_{y}) Y/Z + c_{y}$$
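A minimal Python sketch of these formulas; the helper name camera_to_pixel and the focal length, pixel sizes, and principal point below are assumed example values rather than real calibration data.

# Project a 3D point in the camera frame to pixel coordinates using
# x_im = (f / s_x) X/Z + c_x  and  y_im = (f / s_y) Y/Z + c_y
def camera_to_pixel(X, Y, Z, f=8.0, s_x=0.01, s_y=0.01, c_x=320.0, c_y=240.0):
    # f in mm, s_x and s_y in mm per pixel, (c_x, c_y) in pixels (assumed values)
    x_im = (f / s_x) * X / Z + c_x
    y_im = (f / s_y) * Y / Z + c_y
    return x_im, y_im

# Example: a point 2 m in front of the camera, slightly to the right and below
print(camera_to_pixel(0.2, 0.1, 2.0))  # X, Y, Z must share the same unit (here meters)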

Intrinsic Camera Matrix

The intrinsic camera matrix, often referred to as the camera calibration matrix or simply the camera matrix, is a fundamental component in computer vision and computer graphics. It describes the intrinsic properties of a camera, which are necessary for the conversion of 3D world points into 2D image points. In this section we represent the direct transformation from the camera coordinate system to the image buffer as a matrix multiplication, with the help of the camera intrinsic matrix. We can project 3D points onto 2D with a matrix multiplication and treat the result as a 2D point in homogeneous coordinates, so we divide through by the last element as below:

$$\begin{split} \begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ \end{pmatrix} \begin{pmatrix} X \\ Y \\ Z \\ 1 \\ \end{pmatrix} =\begin{pmatrix} X \\ Y \\ Z \\ \end{pmatrix} \rightarrow \tilde{x} =\begin{pmatrix} X \\ Y \\ Z \\ \end{pmatrix} \sim \begin{pmatrix} X/Z \\ Y/Z \\ 1 \\ \end{pmatrix} \end{split}$$

We also define the camera intrinsic matrix as below:

$$\begin{split} K =\begin{pmatrix} f/s_{x} & 0 & c_{x} \\ 0 & f/s_{y} & c_y \\ 0 & 0 & 1 \\ \end{pmatrix} \xrightarrow[\text{in units of pixels}]{\text{express focal length}} K =\begin{pmatrix} f_x & 0 & c_{x} \\ 0 & f_y & c_y \\ 0 & 0 & 1 \\ \end{pmatrix} \end{split}$$

So, to project 3D points represented in the camera coordinate system to the 2D image plane in matrix form, we have:

$$\begin{split}\begin{pmatrix} x_1 \\ x_2 \\ x_3 \\ \end{pmatrix} = K\begin{pmatrix} 1& 0& 0& 0\\ 0& 1& 0& 0\\ 0& 0& 1& 0\\ \end{pmatrix} \prescript{C}{}{\begin{pmatrix} X \\ Y \\ Z\\ 1\\ \end{pmatrix}}, \qquad \begin{pmatrix} x \\ y \\ 1 \\ \end{pmatrix} =\begin{pmatrix} x_1/x_3 \\ x_2/x_3 \\ 1 \\ \end{pmatrix}, \qquad K= \begin{pmatrix} f_x & 0 & c_{x} \\ 0 & f_y & c_y \\ 0 & 0 & 1 \\ \end{pmatrix} \end{split}$$

To see this:

$$\begin{split} \begin{pmatrix} f_x & 0 & c_{x} \\ 0 & f_y & c_y \\ 0 & 0 & 1 \\ \end{pmatrix} \begin{pmatrix} 1& 0& 0& 0\\ 0& 1& 0& 0\\ 0& 0& 1& 0\\ \end{pmatrix} \prescript{C}{}{\begin{pmatrix} X \\ Y \\ Z\\ 1\\ \end{pmatrix}} = \begin{pmatrix} f_x & 0 & c_{x} \\ 0 & f_y & c_y \\ 0 & 0 & 1 \\ \end{pmatrix} \begin{pmatrix} X \\ Y \\ Z\\ \end{pmatrix} =\begin{pmatrix} f_xX+c_xZ\\ f_yY+c_yZ\\ Z\\ \end{pmatrix} =\begin{pmatrix} f_xX/Z+c_x\\ f_yY/Z+c_y\\ 1\\ \end{pmatrix} \end{split}$$
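A small NumPy sketch of this matrix form; the intrinsic values in K are assumed for illustration, not taken from a real calibration.

import numpy as np

# Assumed intrinsic matrix (focal lengths and principal point in pixels)
K = np.array([[800.0,   0.0, 320.0],
              [  0.0, 800.0, 240.0],
              [  0.0,   0.0,   1.0]])

# Canonical projection matrix [I | 0]
P0 = np.hstack([np.eye(3), np.zeros((3, 1))])

# A 3D point in the camera frame, in homogeneous coordinates
X_c = np.array([0.2, 0.1, 2.0, 1.0])

x = K @ P0 @ X_c          # homogeneous 2D point (x1, x2, x3)
pixel = x[:2] / x[2]      # divide through by the last element
print(pixel)              # -> [400. 280.]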

Complete Perspective Projection

The most complete form of the projection is the projection of a 3D point in the world coordinate system to a point $(x_{im},y_{im})$ in the pixel image. We can derive the equations for this type of projection with the help of the extrinsic camera matrix.

Extrinsic Camera Matrix

The extrinsic camera matrix, often denoted as [R|t], represents the camera's pose in the world coordinate system. It describes the camera's orientation (rotation) and position (translation) with respect to the 3D world.

For the complete perspective projection, if the 3D points are given in world coordinates, we first need to transform them to the camera coordinate system using a homogeneous transformation matrix:

$$ ^C {P} = \prescript{C}{W}{H}. \prescript{W}{}{P} =\begin{pmatrix} \prescript{C}{W}{R} & \prescript{C}{}{t_{Worg}}\\ 0 & 1\\ \end{pmatrix} \prescript{W}{}{P} $$

We can write this as an extrinsic camera matrix that performs the rotation and translation, followed by a projection from 3D to 2D.

$$ M_{ext} = \left( \prescript{C}{W}{R} \quad \prescript{C}{}{t_{Worg}} \right) =\begin{pmatrix} r_{11} & r_{12} & r_{13} & t_{X}\\ r_{21} &r_{22} &r_{23} & t_{Y}\\ r_{31} &r_{32} &r_{33} & t_{Z}\\ \end{pmatrix} $$

Finally, the projection of a 3D point $^WP$ in the world to a point $(x_{im},y_{im})$ in the pixel image is as below:

$$ \begin{pmatrix} x_1\\ x_2\\ x_3\\ \end{pmatrix} = K M_{ext} \prescript{W}{}{\begin{pmatrix} X \\ Y \\ Z\\ 1\\ \end{pmatrix}}, \qquad x_{img}=x_1/x_3, \quad y_{img} = x_2/x_3 $$

where $K$ is the camera intrinsic matrix and $M_{ext}$ is the 3×4 extrinsic camera matrix.
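To tie the pieces together, here is a minimal NumPy sketch of the complete projection; the intrinsic values, rotation, and translation are assumed example values, not real calibration data.

import numpy as np

# Assumed intrinsic matrix (pixels)
K = np.array([[800.0,   0.0, 320.0],
              [  0.0, 800.0, 240.0],
              [  0.0,   0.0,   1.0]])

# Assumed extrinsic matrix [R | t]: the camera frame coincides with the world
# frame except for a 0.5 m translation along the optical axis
R = np.eye(3)
t = np.array([[0.0], [0.0], [0.5]])
M_ext = np.hstack([R, t])                 # 3x4

# A 3D point in world coordinates, in homogeneous form
P_w = np.array([0.2, 0.1, 1.5, 1.0])

x = K @ M_ext @ P_w                       # (x1, x2, x3)
x_img, y_img = x[0] / x[2], x[1] / x[2]
print(x_img, y_img)                       # -> 400.0 280.0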

Image Filtering

Image filtering is a technique used in digital image processing to enhance or modify an image. It involves applying a certain mathematical operation to an image to highlight or suppress particular features or characteristics within the image. Filters can be applied for various purposes, including Noise Reduction, Sharpening, Blurring or Smoothing, Edge Detection, color enhancement, and more. In this section we aim to provide a quick overview of different approaches and principles linked to the techniques used in image filtering.

Gray Level Transformation

Gray level transformation involves changing the intensity levels of pixels in an image without changing its basic structure. It is a technique used to manipulate the contrast or brightness of an image. This process primarily involves changing the intensity values of pixels in a grayscale image according to a predefined function or formula, and it is often used to enhance image quality by adjusting the brightness or contrast levels. The function should be chosen carefully so that it changes the contrast and brightness over the desired range of input pixel values. There are different functions that can be used for this kind of transformation; one of the most important is gamma correction. Gamma correction is a common method for illumination enhancement and is defined as:

$$ I^{'} = I_{max}.(\frac{I}{I_{max}})^\gamma $$

where $I^{'}$ is the corrected image, $I_{max}$ is the maximum intensity value of the original image, $I$ is the original image, and $\gamma$ is the parameter. For different values of $\gamma$, the resulting image has different enhancement results, as shown in the figure below. When $\gamma < 1$, low-intensity pixels are boosted more than high-intensity pixels; when $\gamma > 1$, the opposite effect is produced; and when $\gamma = 1$, the input and output intensities are equal.

An example of gamma correction, the enhanced images with different parameters γ. (a) Original image. (b) $\gamma=0.1$ (c) $\gamma= 0.3$ (d) $\gamma= 0.8$ (e) $\gamma=1.2$ (f) $\gamma=1.5$ (g) The curve along with different parameters $\gamma$.
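A minimal sketch of gamma correction with OpenCV and NumPy; 'your_image.jpg' is a placeholder path and the chosen gamma is just an example value.

import cv2
import numpy as np

# Load an image as grayscale
image = cv2.imread('your_image.jpg', cv2.IMREAD_GRAYSCALE)

gamma = 0.5          # try values below and above 1
I_max = 255.0        # maximum intensity of an 8-bit image

# Apply I' = I_max * (I / I_max) ** gamma
corrected = (I_max * (image / I_max) ** gamma).astype(np.uint8)

cv2.imshow('Original Image', image)
cv2.imshow('Gamma Corrected', corrected)
cv2.waitKey(0)
cv2.destroyAllWindows()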

Spatial (neighborhood) Filtering

Spatial (neighborhood) filtering is a fundamental concept in image processing that involves manipulating the pixel values of an image based on their spatial locations. The goal of spatial filtering is to enhance or extract certain features or information from an image by applying a filter or kernel to the image's pixels; to do so, it considers the neighboring pixels as well as the pixel itself.

According to the image, to filter an image using this method we proceed as below:

  1. Begin with a filter or mask or "kernel", denoted as "w," which has dimensions of "m x n."

  2. Apply this filter to the image "f" which has dimensions of "M x N."

  3. Calculate the sum of products by multiplying the filter coefficients with the corresponding pixels under the filter.

  4. Slide the filter over the image, applying it at each point in the image.

  5. This process is also known as "cross-correlation," where the filter is moved across the image, and the operations are performed at each position.

The cross-correlation formula can be written as below:

$$ g(x,y) = \sum_{s=-m/2}^{s=m/2} \sum_{t=-n/2}^{t=n/2} w(s,t) f(x+s, y+t) = w(x, y)\otimes f(x, y) $$
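To make the formula concrete, here is a minimal sketch of cross-correlation written directly with NumPy loops (the helper cross_correlate is illustrative and written for clarity rather than speed); in practice you would use cv2.filter2D as in the code further below.

import numpy as np

def cross_correlate(f, w):
    # f: 2D grayscale image, w: 2D kernel with odd dimensions m x n
    m, n = w.shape
    a, b = m // 2, n // 2
    g = np.zeros_like(f, dtype=np.float64)
    # Pad the image so the kernel can also be applied at the borders
    padded = np.pad(f.astype(np.float64), ((a, a), (b, b)), mode='edge')
    for x in range(f.shape[0]):
        for y in range(f.shape[1]):
            # Sum of products of the kernel coefficients and the pixels under them
            g[x, y] = np.sum(w * padded[x:x + m, y:y + n])
    return g

# Example usage with a small random image and a 3x3 averaging kernel
img = np.random.randint(0, 256, (5, 5))
print(cross_correlate(img, np.ones((3, 3)) / 9.0))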

The mask or kernel can be selected in different sizes and structures. For example, in the figure below, a box filter (average filter) is applied to an image, showing the original and resulting images. What happens to the picture?

As you can see in the pictures, we smooth the picture by averaging, so instead of a sharp edge in the original image we get a smooth edge in the filtered image. This type of filter is very useful for smoothing noisy images and effectively diminishing such unwanted distortions. You can also see the picture below before and after applying an 11-by-11 box filter.

You can create filters of different sizes and apply them to an image with OpenCV in Python using the code below. What is the effect of the kernel size on the picture?

NOTE: If you don't have the OpenCV module already installed, you can easily install it using pip install opencv-python or from the Ubuntu repositories using sudo apt install python3-opencv.

import cv2
import numpy as np

# Load an image
image = cv2.imread('your_image.jpg')

# Define the size of the box filter (e.g., 3x3)
box_filter_size = (3, 3)

# Create a box filter kernel
box_filter_kernel = np.ones(box_filter_size, np.float32) / (box_filter_size[0] * box_filter_size[1])

# Apply the filter using cv2.filter2D
filtered_image = cv2.filter2D(image, -1, box_filter_kernel)

# Display the original and filtered images
cv2.imshow('Original Image', image)
cv2.imshow('Filtered Image', filtered_image)
cv2.waitKey(0)
cv2.destroyAllWindows()

Gaussian Filter

A Gaussian filter, also known as a Gaussian blur or Gaussian smoothing filter, is a type of filter used in image processing to perform smoothing or blurring operations on an image. A Gaussian filter is usually preferable to a box filter because it is more versatile and suitable for tasks where you want to reduce noise while preserving image details and edges, and it provides a more natural and visually pleasing blur. You can observe the filter and its impact on image smoothing in the images below, and apply it with OpenCV's built-in function using the provided code:

Gaussian Filter

import cv2

# Load an image
image = cv2.imread('your_image.jpg')

# Define the kernel size (must be an odd number)
kernel_size = (5, 5)

# Apply Gaussian blur to the image
blurred_image = cv2.GaussianBlur(image, kernel_size, 0)

# Display the original and blurred images
cv2.imshow('Original Image', image)
cv2.imshow('Blurred Image', blurred_image)
cv2.waitKey(0)
cv2.destroyAllWindows()

Sharpening Spatial Filters

In the context of image processing, when you apply a smoothing filter (like a Gaussian filter) to an image, you are essentially "integrating" or averaging the pixel values in a local neighborhood around each pixel. The size and shape of this neighborhood are determined by the kernel you are using for convolution. In contrast, differentiation in the context of images refers to finding the rate of change, or gradient, of pixel values in an image. This is analogous to differentiation in calculus, where you find the rate of change of a function. In image processing, this concept is used to sharpen images, detect edges and other sharp changes in intensity, and extract other features.

To illustrate, if you have a 1D image (for example, a row of an image), the first and second derivatives are defined as below and are calculated by applying the corresponding kernels shown in the image.

  • First derivative (can also do central difference):

$$ \frac{\partial f}{\partial x} \approx f(x+1) - f(x) $$

  • Second derivative:

$$ \frac{\partial^2 f}{\partial x^2} \approx f(x+1) - 2f(x) +f(x-1) $$

Also, the figure below shows how we can detect the edge in a smoothed step edge of a signal using the first and second derivatives.
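A minimal NumPy sketch of these 1D derivative approximations; the signal below is a made-up smoothed step edge used only for illustration.

import numpy as np

# A made-up 1D "row" containing a smoothed step edge
f = np.array([10, 10, 10, 12, 30, 48, 50, 50, 50], dtype=np.float64)

# First derivative: f(x+1) - f(x)
first = f[1:] - f[:-1]

# Second derivative: f(x+1) - 2 f(x) + f(x-1)
second = f[2:] - 2 * f[1:-1] + f[:-2]

print(first)   # largest where the intensity changes fastest (at the edge)
print(second)  # changes sign (zero crossing) at the middle of the edge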

Edge Operators for 2D Images

There are many different kernels for edge detection in a 2D image, but the most commonly used edge operators are the Sobel and Laplacian operators, shown in the image below. The Sobel kernels approximate the derivative in the X and Y directions, while the Laplacian filter detects areas of rapid intensity change, such as edges.

You can apply the Sobel filter to different images using the code below:



import cv2
import numpy as np
import matplotlib.pyplot as plt

# Read the image
image = cv2.imread('your_image.jpg', cv2.IMREAD_GRAYSCALE)

# Check if the image was loaded
if image is None:
    print("Error: Could not open or find the image.")
    exit()

# Apply Sobel filter in x-direction (use a signed output type, then take the absolute value,
# so that negative gradients are not clipped to zero)
sobel_x = cv2.Sobel(image, cv2.CV_64F, 1, 0, ksize=3)
sobel_x = cv2.convertScaleAbs(sobel_x)

# Apply Sobel filter in y-direction
sobel_y = cv2.Sobel(image, cv2.CV_64F, 0, 1, ksize=3)
sobel_y = cv2.convertScaleAbs(sobel_y)

# Display the results
plt.figure(figsize=(10, 10))

plt.subplot(1, 3, 1), plt.imshow(image, cmap='gray')
plt.title('Original Image'), plt.xticks([]), plt.yticks([])

plt.subplot(1, 3, 2), plt.imshow(sobel_x, cmap='gray')
plt.title('Sobel X'), plt.xticks([]), plt.yticks([])

plt.subplot(1, 3, 3), plt.imshow(sobel_y, cmap='gray')
plt.title('Sobel Y'), plt.xticks([]), plt.yticks([])


plt.tight_layout()
plt.show()

Here are the original image and the results after applying the filter in the different directions:
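The Laplacian operator mentioned above can be applied in a similar way; this is a minimal sketch using OpenCV's built-in cv2.Laplacian, again with 'your_image.jpg' as a placeholder path.

import cv2

# Load the image in grayscale
image = cv2.imread('your_image.jpg', cv2.IMREAD_GRAYSCALE)

# Apply the Laplacian with a signed output type, then convert back to 8-bit
laplacian = cv2.Laplacian(image, cv2.CV_64F, ksize=3)
laplacian = cv2.convertScaleAbs(laplacian)

cv2.imshow('Laplacian', laplacian)
cv2.waitKey(0)
cv2.destroyAllWindows()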
