Object Detection - ZYL-Harry/Machine_Learning_study GitHub Wiki

  • We can use neural networks not only to classify an object, but also to detect one

Object localization

  • Training set contains:
    classification labels
    bounding boxes (containing bx (x-coordinate of the center point), by (y-coordinate of the center point), bh (height), bw (width))
    Example:
    image
  • The target label y:
    y = [pc, bx, by, bh, bw, c1, c2, c3], where pc indicates whether any object is present and c1, c2, c3 are the class indicators
  • loss (squared error):
    if y1 = 1 (an object is present): L(ŷ, y) = (ŷ1 − y1)² + (ŷ2 − y2)² + … + (ŷ8 − y8)²
    if y1 = 0 (no object): L(ŷ, y) = (ŷ1 − y1)²

Landmark detection

  • landmark: only care about some specific points, so the network outputs their coordinates directly
  • establish a neural network with datasets (images and labels) and train it to find the specific points such as the corners of the eyes, ...
  • applications: face detection, pose detection, computer graphics
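The label layout described above can be sketched as a flat vector: a presence flag followed by the (x, y) coordinates of each landmark. The number of landmarks and the coordinate values here are made-up for illustration.

```python
import numpy as np

# Hypothetical landmark label: [is_face, l1x, l1y, ..., lkx, lky]
k = 4  # number of landmarks (assumed here, e.g. four eye corners)
landmarks = np.array([[0.30, 0.40], [0.42, 0.41],
                      [0.58, 0.41], [0.70, 0.40]])  # normalized (x, y) pairs
y = np.concatenate(([1.0], landmarks.flatten()))    # 1.0 = a face is present
```

The network is then trained with a regression loss on this vector, so the landmark count k must be fixed in advance.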

Object detection

Sliding windows detection

  • procedure:
    step 1: learning---establish a neural network
    a. get a dataset of a specific kind of object with labels
    b. establish a neural network (the output's shape is (1 * 1 * 1)) and train it, so the network can determine whether that kind of object is in an image
    step 2: prediction---sliding windows detection
    a. pick a window of a certain size and place it on the image
    b. take the part of the image bounded by the window and feed it into the neural network to determine whether that kind of object is in it
    c. shift the window by a specific stride to bound another part of the image, and feed that part into the neural network in the same way
    d. repeat step c until the window has slid over every region of the image
    e. repeat the above steps with a larger window
  • disadvantage: high computational cost, since every window position requires a separate forward pass
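The prediction loop above can be sketched as follows; `classify` stands in for the trained network from step 1 (here replaced by a trivial brightness test, purely for illustration):

```python
import numpy as np

def classify(patch):
    # stand-in for the trained CNN: returns 1 if the window "contains" the
    # object (here: mean intensity above a threshold, for illustration only)
    return int(patch.mean() > 0.5)

def sliding_windows(image, window, stride):
    """Slide a square window over the image; collect positions where
    the classifier fires (step 2 a-d of the procedure above)."""
    detections = []
    H, W = image.shape
    for top in range(0, H - window + 1, stride):
        for left in range(0, W - window + 1, stride):
            patch = image[top:top + window, left:left + window]
            if classify(patch):
                detections.append((top, left, window))
    return detections

image = np.zeros((8, 8))
image[2:5, 3:6] = 1.0                 # a bright 3x3 "object"
hits = sliding_windows(image, window=3, stride=1)
```

Step e corresponds to calling `sliding_windows` again with a larger `window`; the cost of all these independent forward passes is exactly the disadvantage noted above.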

Convolutional implementation of sliding windows

  • Idea:
    instead of running forward propagation on many subsets of the image independently, combine all these subsets into one computation and share the work in the regions of the image that they have in common
  • Example:
    image
    image
  • Disadvantage: the position of the bounding box is not very accurate
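The shared-computation idea can be demonstrated in the simplest case: a dense layer applied to a flattened w * w window is equivalent to a single w * w convolution filter, so one convolution pass over the whole image reproduces every stride-1 window evaluation at once. This toy check uses plain numpy (no deep-learning framework assumed):

```python
import numpy as np

def conv2d(img, kernel):
    # valid convolution (really cross-correlation), stride 1
    kh, kw = kernel.shape
    H, W = img.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(img[i:i + kh, j:j + kw] * kernel)
    return out

rng = np.random.default_rng(0)
img = rng.standard_normal((8, 8))
w = rng.standard_normal((5, 5))   # dense-layer weights reshaped as a kernel

# sliding windows: run "dense(flatten(window))" for each 5x5 window
slow = np.array([[img[i:i + 5, j:j + 5].flatten() @ w.flatten()
                  for j in range(4)] for i in range(4)])

# convolutional implementation: one pass shares all overlapping work
fast = conv2d(img, w)
```

The same trick generalizes to full networks by turning every fully connected layer into a convolution, which is what the lecture's example illustrates.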

Yolo object detection algorithm

  • Prerequisite: get a trained neural network for the objects, whose output's shape is (m * m * 8), where the image is divided into an m * m grid

bounding box predictions

  • Process:
    1. divide the image into m * m grid cells
    2. for each grid cell, get a label y (8-dimensional)---each object is assigned to only one grid cell (the grid cell that contains that object's midpoint in the training image)
    3. then, with an input image X (w * h * 3), use the neural network to get an output Y (m * m * 8)
  • Specify the bounding boxes:
    1. from the neural network, get the output (m * m * 8), which tells whether the object is in each grid cell
    2. for each output (1 * 1 * 8), if the object is present, get a bounding box (a rectangle with the predicted parameters bx, by, bh, bw, where bx, by locate the midpoint within the cell and bh, bw are relative to the cell size, so they can exceed 1)
    image
  • There are several ways to make the Yolo algorithm perform better
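Decoding one cell's prediction into image coordinates can be sketched as below; the convention (bx, by in [0, 1] within the cell, bh, bw as multiples of the cell size) follows the description above, while the grid and image sizes are arbitrary assumptions:

```python
def decode_cell(i, j, bx, by, bh, bw, grid=3, img_w=300, img_h=300):
    """Convert a cell-relative YOLO box to image pixel corners.

    (i, j): grid row/column; bx, by in [0, 1] locate the midpoint within
    the cell; bh, bw are height/width as multiples of the cell size
    (they may exceed 1 when the object spans several cells).
    """
    cell_w, cell_h = img_w / grid, img_h / grid
    cx = (j + bx) * cell_w            # midpoint x in pixels
    cy = (i + by) * cell_h            # midpoint y in pixels
    w, h = bw * cell_w, bh * cell_h
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)

# box centred in the middle cell of a 3x3 grid on a 300x300 image
box = decode_cell(1, 1, bx=0.5, by=0.5, bh=1.0, bw=1.0)
```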

Intersection over union(IoU)

  • function: a measure of the overlap between two bounding boxes
  • formula:
    IoU = (size of the intersection) / (size of the union)
  • "Correct" if IoU ≥ 0.5 (a common threshold; a higher threshold is stricter)
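The formula above translates directly into code for axis-aligned boxes given as corner coordinates:

```python
def iou(box_a, box_b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2) corners."""
    # corners of the intersection rectangle
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)  # 0 if no overlap
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

overlap = iou((0, 0, 2, 2), (1, 1, 3, 3))  # intersection 1, union 7
```

The `max(0, ...)` terms make disjoint boxes score exactly 0.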

Non-max suppression

  • Process:
    1. get all the bounding boxes predicted with some probability pc of containing an object
    2. discard the boxes with low probability (e.g. pc ≤ 0.6)
    3. pick the remaining box with the maximal probability as a prediction
    4. discard any remaining box with IoU ≥ 0.5 with that picked box
    5. repeat steps 3-4 until every remaining box has been picked or discarded
  • Concept: output the maximal-probability classifications, and suppress the close-by ones that are non-maximal
  • Example:
    image
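The greedy loop of steps 3-5 can be sketched like this (the IoU helper is repeated so the block is self-contained; the example boxes and scores are made up):

```python
def iou(a, b):
    # IoU of two (x1, y1, x2, y2) boxes
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union

def non_max_suppression(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: keep the highest-scoring box, drop its overlaps, repeat."""
    order = sorted(range(len(boxes)), key=lambda k: scores[k], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [k for k in order
                 if iou(boxes[k], boxes[best]) < iou_threshold]
    return keep

boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)]
scores = [0.9, 0.8, 0.7]    # the second box heavily overlaps the first
keep = non_max_suppression(boxes, scores)
```

The second box is suppressed by the first (IoU ≈ 0.68 ≥ 0.5), while the distant third box survives, which is exactly the "suppress the close-by non-maximal ones" behaviour described above.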

Anchor boxes

  • Problem: handle overlapping objects (their midpoints fall in the same grid cell)
  • Method:
    1. pre-define k different shapes called "anchor boxes"; the output of the neural network's shape then becomes (m * m * (k * 8))
    image
    2. each object in a training image is assigned to the grid cell that contains the object's midpoint, and to the anchor box with the highest IoU with the object's shape
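The anchor assignment in step 2 compares only box shapes, so the IoU can be computed with both boxes aligned at the same centre. A minimal sketch, with assumed anchor shapes (one wide, one tall):

```python
def iou_wh(wh1, wh2):
    # IoU of two boxes aligned at the same centre: compares shapes only
    w1, h1 = wh1
    w2, h2 = wh2
    inter = min(w1, w2) * min(h1, h2)
    return inter / (w1 * h1 + w2 * h2 - inter)

anchors = [(2.0, 1.0), (1.0, 2.0)]  # assumed: a wide and a tall anchor shape
gt = (0.9, 1.8)                     # a tall ground-truth box (width, height)

# assign the object to the anchor box with the highest shape IoU
best = max(range(len(anchors)), key=lambda a: iou_wh(anchors[a], gt))
```

Here the tall object matches the tall anchor, so its label goes into that anchor's slice of the grid cell's (k * 8)-dimensional output.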

the entire Yolo object detection algorithm

  • Training set:
    Input: images, with labels y produced manually
    Output: y = [pc, bx, by, bh, bw, c1, c2, c3, ...(repeated for the other anchors)]
    The shape of the output: {num_grid_cells_width} * {num_grid_cells_height} * {num_anchors * 8 (the original dimension of label y)}
    image
  • Making predictions:
    image
  • Outputting the non-max suppressed outputs:
    1. for each grid cell, get two predicted bounding boxes (one per anchor box)
    2. get rid of the low-probability predictions
    3. for each class, use non-max suppression to generate the final predictions
    image

Regions with Convolutional Neural Networks (R-CNN)

image
image

Semantic segmentation

Object detection vs. Semantic segmentation

image

Learning

image
image

Transpose convolutions