Object Detection - ZYL-Harry/Machine_Learning_study GitHub Wiki

  • We can use neural networks not only to classify an object, but also to detect one

Object localization

  • Training set contains:
    classification labels
    bounding boxes (containing bx (x-coordinate of the center point), by (y-coordinate of the center point), bh (height), bw (width))
    Example:
    image
  • The target label y:
    y = [pc, bx, by, bh, bw, c1, c2, c3], where pc indicates whether any object is present and c1, c2, c3 are the class indicators
  • loss (squared error):
    if y1 = 1 (an object is present): L(ŷ, y) = (ŷ1 − y1)² + (ŷ2 − y2)² + … + (ŷ8 − y8)²
    if y1 = 0 (no object): L(ŷ, y) = (ŷ1 − y1)²

Landmark detection

  • landmark: only care about some specific points, so the network outputs their coordinates directly
  • establish a neural network with datasets (images and labels) and train it to find the specific points such as the corners of the eyes, ...
  • applications: face detection, pose detection, computer graphics
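The label layout described above can be sketched as a flat vector: a presence flag followed by the (x, y) coordinates of each landmark. The number of landmarks and the coordinate values here are made-up for illustration.

```python
import numpy as np

# Hypothetical landmark label: [is_face, l1x, l1y, ..., lkx, lky]
k = 4  # number of landmarks (assumed here, e.g. four eye corners)
landmarks = np.array([[0.30, 0.40], [0.42, 0.41],
                      [0.58, 0.41], [0.70, 0.40]])  # normalized (x, y) pairs
y = np.concatenate(([1.0], landmarks.flatten()))    # 1.0 = a face is present
```

The network is then trained with a regression loss on this vector, so the landmark count k must be fixed in advance.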

Object detection

Sliding windows detection

  • procedure:
    step 1: learning---establish a neural network
    a. get a dataset of a specific kind of object with labels
    b. establish a neural network (the output's shape is (1 * 1 * 1)) and train it, so the network can determine whether that kind of object is in an image
    step 2: prediction---sliding windows detection
    a. pick a window of a certain size and place it on the image
    b. take the part of the image bounded by the window and feed it into the neural network to determine whether that kind of object is in it
    c. shift the window by a specific stride to bound another part of the image, and feed that part into the neural network in the same way
    d. repeat step c until the window has slid over every region of the image
    e. repeat the above steps with a larger window
  • disadvantage: high computational cost, since every window position requires a separate forward pass
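The prediction loop above can be sketched as follows; `classify` stands in for the trained network from step 1 (here replaced by a trivial brightness test, purely for illustration):

```python
import numpy as np

def classify(patch):
    # stand-in for the trained CNN: returns 1 if the window "contains" the
    # object (here: mean intensity above a threshold, for illustration only)
    return int(patch.mean() > 0.5)

def sliding_windows(image, window, stride):
    """Slide a square window over the image; collect positions where
    the classifier fires (step 2 a-d of the procedure above)."""
    detections = []
    H, W = image.shape
    for top in range(0, H - window + 1, stride):
        for left in range(0, W - window + 1, stride):
            patch = image[top:top + window, left:left + window]
            if classify(patch):
                detections.append((top, left, window))
    return detections

image = np.zeros((8, 8))
image[2:5, 3:6] = 1.0                 # a bright 3x3 "object"
hits = sliding_windows(image, window=3, stride=1)
```

Step e corresponds to calling `sliding_windows` again with a larger `window`; the cost of all these independent forward passes is exactly the disadvantage noted above.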

Convolutional implementation of sliding windows

  • Idea:
    instead of running forward propagation on many subsets of the image independently, combine all these subsets into one computation and share the work in the regions of the image that they have in common
  • Example:
    image
    image
  • Disadvantage: the position of the bounding box is not very accurate
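The shared-computation idea can be demonstrated in the simplest case: a dense layer applied to a flattened w * w window is equivalent to a single w * w convolution filter, so one convolution pass over the whole image reproduces every stride-1 window evaluation at once. This toy check uses plain numpy (no deep-learning framework assumed):

```python
import numpy as np

def conv2d(img, kernel):
    # valid convolution (really cross-correlation), stride 1
    kh, kw = kernel.shape
    H, W = img.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(img[i:i + kh, j:j + kw] * kernel)
    return out

rng = np.random.default_rng(0)
img = rng.standard_normal((8, 8))
w = rng.standard_normal((5, 5))   # dense-layer weights reshaped as a kernel

# sliding windows: run "dense(flatten(window))" for each 5x5 window
slow = np.array([[img[i:i + 5, j:j + 5].flatten() @ w.flatten()
                  for j in range(4)] for i in range(4)])

# convolutional implementation: one pass shares all overlapping work
fast = conv2d(img, w)
```

The same trick generalizes to full networks by turning every fully connected layer into a convolution, which is what the lecture's example illustrates.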

Yolo object detection algorithm

  • Prerequisite: get a trained neural network for the objects, whose output's shape is (m * m * 8), where the image is divided into an m * m grid

bounding box predictions

  • Process:
    1. divide the image into m * m grid cells
    2. for each grid cell, get a label y (8-dimensional)---each object is assigned to only one grid cell (the grid cell that contains that object's midpoint in the training image)
    3. then, with an input image X (w * h * 3), use the neural network to get an output Y (m * m * 8)
  • Specify the bounding boxes:
    1. from the neural network, get the output (m * m * 8), which tells whether the object is in each grid cell
    2. for each output (1 * 1 * 8), if the object is present, get a bounding box (a rectangle with the predicted parameters bx, by, bh, bw, where bx, by locate the midpoint within the cell and bh, bw are relative to the cell size, so they can exceed 1)
    image
  • There are several ways to make the Yolo algorithm perform better
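Decoding one cell's prediction into image coordinates can be sketched as below; the convention (bx, by in [0, 1] within the cell, bh, bw as multiples of the cell size) follows the description above, while the grid and image sizes are arbitrary assumptions:

```python
def decode_cell(i, j, bx, by, bh, bw, grid=3, img_w=300, img_h=300):
    """Convert a cell-relative YOLO box to image pixel corners.

    (i, j): grid row/column; bx, by in [0, 1] locate the midpoint within
    the cell; bh, bw are height/width as multiples of the cell size
    (they may exceed 1 when the object spans several cells).
    """
    cell_w, cell_h = img_w / grid, img_h / grid
    cx = (j + bx) * cell_w            # midpoint x in pixels
    cy = (i + by) * cell_h            # midpoint y in pixels
    w, h = bw * cell_w, bh * cell_h
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)

# box centred in the middle cell of a 3x3 grid on a 300x300 image
box = decode_cell(1, 1, bx=0.5, by=0.5, bh=1.0, bw=1.0)
```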

Intersection over union(IoU)

  • function: a measure of the overlap between two bounding boxes
  • formula:
    IoU = (size of the intersection) / (size of the union)
  • "Correct" if IoU ≥ 0.5 (a common threshold; a higher threshold is stricter)
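The formula above translates directly into code for axis-aligned boxes given as corner coordinates:

```python
def iou(box_a, box_b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2) corners."""
    # corners of the intersection rectangle
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)  # 0 if no overlap
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

overlap = iou((0, 0, 2, 2), (1, 1, 3, 3))  # intersection 1, union 7
```

The `max(0, ...)` terms make disjoint boxes score exactly 0.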

Non-max suppression

  • Process:
    1. get all the bounding boxes predicted with some probability pc of containing an object
    2. discard the boxes with low probability (e.g. pc ≤ 0.6)
    3. pick the remaining box with the maximal probability as a prediction
    4. discard any remaining box with IoU ≥ 0.5 with that picked box
    5. repeat steps 3-4 until every remaining box has been picked or discarded
  • Concept: output the maximal-probability classifications, and suppress the close-by ones that are non-maximal
  • Example:
    image
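The greedy loop of steps 3-5 can be sketched like this (the IoU helper is repeated so the block is self-contained; the example boxes and scores are made up):

```python
def iou(a, b):
    # IoU of two (x1, y1, x2, y2) boxes
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union

def non_max_suppression(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: keep the highest-scoring box, drop its overlaps, repeat."""
    order = sorted(range(len(boxes)), key=lambda k: scores[k], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [k for k in order
                 if iou(boxes[k], boxes[best]) < iou_threshold]
    return keep

boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)]
scores = [0.9, 0.8, 0.7]    # the second box heavily overlaps the first
keep = non_max_suppression(boxes, scores)
```

The second box is suppressed by the first (IoU ≈ 0.68 ≥ 0.5), while the distant third box survives, which is exactly the "suppress the close-by non-maximal ones" behaviour described above.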

Anchor boxes

  • Problem: handle overlapping objects (their midpoints fall in the same grid cell)
  • Method:
    1. pre-define k different shapes called "anchor boxes"; the output of the neural network's shape then becomes (m * m * (k * 8))
    image
    2. each object in a training image is assigned to the grid cell that contains the object's midpoint, and to the anchor box with the highest IoU with the object's shape
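The anchor assignment in step 2 compares only box shapes, so the IoU can be computed with both boxes aligned at the same centre. A minimal sketch, with assumed anchor shapes (one wide, one tall):

```python
def iou_wh(wh1, wh2):
    # IoU of two boxes aligned at the same centre: compares shapes only
    w1, h1 = wh1
    w2, h2 = wh2
    inter = min(w1, w2) * min(h1, h2)
    return inter / (w1 * h1 + w2 * h2 - inter)

anchors = [(2.0, 1.0), (1.0, 2.0)]  # assumed: a wide and a tall anchor shape
gt = (0.9, 1.8)                     # a tall ground-truth box (width, height)

# assign the object to the anchor box with the highest shape IoU
best = max(range(len(anchors)), key=lambda a: iou_wh(anchors[a], gt))
```

Here the tall object matches the tall anchor, so its label goes into that anchor's slice of the grid cell's (k * 8)-dimensional output.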

the entire Yolo object detection algorithm

  • Training set:
    Input: images, with labels y produced manually
    Output: y = [pc, bx, by, bh, bw, c1, c2, c3, ...(repeated for the other anchors)]
    The shape of the output: {num_grid_cells_width} * {num_grid_cells_height} * {num_anchors * 8 (the original dimension of label y)}
    image
  • Making predictions:
    image
  • Outputting the non-max suppressed outputs:
    1. for each grid cell, get two predicted bounding boxes (one per anchor box)
    2. get rid of the low-probability predictions
    3. for each class, use non-max suppression to generate the final predictions
    image

Regions with Convolutional Neural Networks (R-CNN)

image
image

Semantic segmentation

Object detection vs. Semantic segmentation

image

Learning

image
image

Transpose convolutions