Training Tutorial
For a great, quick introduction to network training, I highly recommend this.
We'll describe how to use the LArCV code and our custom Caffe to fine-tune a GoogLeNet model, pre-trained on ImageNet, to distinguish neutrino events from empty off-beam events.
Note: we assume you've looked at all the introductory material explaining how convolutional neural networks work. This page just covers how training with Caffe and LArCV works.
Getting the scripts
Clone the repository
git clone https://github.com/LArbys/UBImageNetTune
Setting up the environments
The following is the software we need to set up for the training to work. Go into the UBImageNetTune folder. On nudot, the script setup_env.sh should set up the environment variables in your shell. Note: you will probably have your own version of LArCV. Change config/setup_larcv.sh to point to it. (But please don't check the change into the repo.)
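In practice, the sequence looks something like this (a sketch; the exact variables live in the config/ scripts):

cd UBImageNetTune
# point this at your own LArCV build first, if you have one
# (edit config/setup_larcv.sh)
source setup_env.sh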
CUDA
Libraries providing an API to the GPU so we can run code on it. Required by Caffe.
- you can set it up using config/setup_cuda.sh
ROOT 6
Inescapable.
- set it up using config/setup_root6.sh (if it's not already set up.)
LArCV
This is how we store the images, along with the metadata used as ground truth.
- For this tutorial, it is OK to use the default version pointed to in config/setup_larcv.sh.
- The default copy in the script is supposed to be a "stable" version, which I will only update if things work.
- However, once you start doing things like writing your own image-manipulation operations that get passed into the network, or defining your own images for training, you will need your own version. Follow the instructions on the LArCV website to build it.
Caffe LArbys
This is our custom Caffe.
- I maintain a "stable" copy that can be set up using config/setup_caffe_env.sh
Caffe in a nutshell
Just a few quick words on how Caffe is organized. For more (and mostly accurate) details, go here.
Network models in Caffe are defined as directed acyclic graphs. The nodes of the graph are types of operations, or "layers". These layers might perform convolutions, pooling, ReLU activations, etc.
The way one specifies which layers are in the graph, and how they are configured and connected, is through Google Protocol Buffer files. For more info on Google Protobufs, go here. These are human-readable files that basically contain blocks (or "messages") of parameters for each layer one wants to construct.
Here is an example of one such block in the prototxt file that describes our training network:
layer {
  name: "conv1/7x7_s2"
  type: "Convolution"
  bottom: "data"
  top: "conv1/7x7_s2"
  param {
    lr_mult: 0.1
    decay_mult: 0
  }
  param {
    lr_mult: 1
    decay_mult: 0
  }
  convolution_param {
    num_output: 64
    pad: 3
    kernel_size: 7
    stride: 2
    weight_filler {
      type: "msra"
    }
    bias_filler {
      type: "constant"
      value: 0.2
    }
  }
}
There is a layer message with a bunch of parameters. The type of layer is specified, and there are a bunch of parameter messages inside that are passed to the layer to configure it.
We will pass such prototxt files to caffe to specify the training and testing networks. Actually, the way we do the training in this tutorial (through caffe's built-in binary) is to give it another protocol buffer, solver_rmsprop.prototxt, which configures caffe's solver. The solver runs the training, performing forward and backward passes and managing the strategy used to update the network weights.
Side note: visualizing a network
There is a great [webtool here](http://ethereon.github.io/netscope/#/editor) available to help visualize (and debug) caffe network prototxts. You paste your prototxt into the editor pane on the left and it will draw the network on the right. Tip: to use this, you have to copy the prototxt into your copy buffer. On the Mac you can do the following:
cat my-network.prototxt | pbcopy
In this command, the text of the prototxt is printed to standard out (cat my-network.prototxt). But instead of going to the terminal, it is piped (using '|') into the command 'pbcopy', which puts it into your copy buffer.
You can then go into the webpage, and hit COMMAND-v to paste the text onto the left.
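On Linux, assuming xclip is installed, an equivalent is:

cat my-network.prototxt | xclip -selection clipboard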
If you do this with the training network we are using (bvlc_googlenet_train.prototxt), you should see the rendered network graph.
Running the training
Setting up the environment is probably the hardest part.
We will be using pre-trained weights for GoogLeNet, located at /mnt/disk1/taritree/caffe_models/bvlc_googlenet.caffemodel. For convenience, make a symlink to it:
ln -s /mnt/disk1/taritree/caffe_models/bvlc_googlenet.caffemodel bvlc_googlenet.caffemodel
Note that training takes hours to days, maybe even weeks. We need to be able to disconnect from the terminal and do other things with our lives. This is where a utility like screen (or nohup) comes in handy. I'll leave it to you to figure out how to use them.
Once in a screen session (remember, you'll have to set up your environment again by sourcing setup_env.sh), start the training by running:
caffe train -solver solver_rmsprop.prototxt -weights bvlc_googlenet.caffemodel -gpu 0
We're training the network on GPU 0. We specify the GPU so that we don't interfere with someone else's job. To check what is running on the GPUs, use the command
nvidia-smi
You'll see output like this:
+------------------------------------------------------+
| NVIDIA-SMI 352.68     Driver Version: 352.68         |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX TIT...  Off  | 0000:01:00.0      On |                  N/A |
| 22%   27C    P8    15W / 250W |     56MiB / 12284MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX TIT...  Off  | 0000:02:00.0     Off |                  N/A |
| 22%   24C    P8    13W / 250W |     23MiB / 12287MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0      1711    G   /usr/bin/X                                      32MiB |
+-----------------------------------------------------------------------------+
This says that X-windows is using GPU 0. Note that -gpu 0 will have the network run on GPU 1, while -gpu 1 will run on GPU 0. I don't know why it's numbered so oddly. Training/analysis jobs using caffe and the GPU will show up as caffe or python and will take up most of the memory.
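If you want to keep an eye on the GPUs while a job runs, one option is to poll nvidia-smi once a second:

watch -n 1 nvidia-smi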
Back in the training session, you'll see a lot of output. Stuff like this:
I0803 22:48:29.655294 14763 caffe.cpp:219] Starting Optimization
I0803 22:48:29.655308 14763 solver.cpp:279] Solving GoogleNet
I0803 22:48:29.655311 14763 solver.cpp:280] Learning Rate Policy: inv
I0803 22:48:29.659339 14763 solver.cpp:337] Iteration 0, Testing net (#0)
I0803 22:48:36.443246 14763 solver.cpp:404] Test net output #0: loss1/top-1 = 0.51
I0803 22:48:36.443315 14763 solver.cpp:404] Test net output #1: loss3/loss3 = 3.47291 (* 1 = 3.47291 loss)
I0803 22:48:36.443325 14763 solver.cpp:404] Test net output #2: loss3/top-1 = 0.51
I0803 22:48:36.443333 14763 solver.cpp:404] Test net output #3: uBloss1/loss1 = 0.773115 (* 0.3 = 0.231934 loss)
I0803 22:48:36.443341 14763 solver.cpp:404] Test net output #4: uBloss2/loss1 = 0.775098 (* 0.3 = 0.232529 loss)
I0803 22:48:36.443348 14763 solver.cpp:404] Test net output #5: uBloss2/top-1 = 0.49
I0803 22:48:37.248733 14763 solver.cpp:228] Iteration 0, loss = 4.89485
I0803 22:48:37.248782 14763 solver.cpp:244] Train net output #0: loss2/loss1 = 0.756721 (* 0.3 = 0.227016 loss)
I0803 22:48:37.248791 14763 solver.cpp:244] Train net output #1: loss3/loss3 = 4.32175 (* 1 = 4.32175 loss)
I0803 22:48:37.248796 14763 solver.cpp:244] Train net output #2: loss3/top-1 = 0.5
I0803 22:48:37.248802 14763 solver.cpp:244] Train net output #3: uBloss1/loss1 = 1.15361 (* 0.3 = 0.346084 loss)
I0803 22:48:37.248816 14763 sgd_solver.cpp:106] Iteration 0, lr = 0.0002
I0803 22:48:53.979476 14763 solver.cpp:228] Iteration 20, loss = 23.7716
I0803 22:48:53.979528 14763 solver.cpp:244] Train net output #0: loss2/loss1 = 9.10038 (* 0.3 = 2.73012 loss)
I0803 22:48:53.979537 14763 solver.cpp:244] Train net output #1: loss3/loss3 = 14.0202 (* 1 = 14.0202 loss)
I0803 22:48:53.979542 14763 solver.cpp:244] Train net output #2: loss3/top-1 = 0.625
I0803 22:48:53.979549 14763 solver.cpp:244] Train net output #3: uBloss1/loss1 = 23.4045 (* 0.3 = 7.02135 loss)
I0803 22:48:53.979557 14763 sgd_solver.cpp:106] Iteration 20, lr = 0.000199701
What's going on? At each so-called iteration, a batch of images is passed into the net. The average error, or loss, is calculated for that batch and used to update the weights of the network via back-propagation. The accuracy percentage is also reported for the batch. But why are there three loss values?
GoogLeNet has three places in the network where it is asked to classify the image. The three places are roughly in the beginning, the middle, and the very end of the network. The idea is to develop good convolutional filters at all points of the network capable of classifying the event. This kind of strategy is out of style now. In any case, the error of the prediction is called the loss and the three predictions made by GoogLeNet are weighted, with the last prediction weighted the strongest.
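You can check the weighting against the iteration-0 training output above: 4.32175 (weighted by 1) plus the two 0.3-weighted losses, 0.227016 and 0.346084, gives 4.89485, exactly the total loss reported.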
This information is critical in determining how well your network is training. You'll want to see the loss fall over time and the accuracy (of both the training and testing set) move higher. However, because we are randomly sampling images in batches, there will be some stochastic behavior in the loss and accuracy. We want to monitor these values over time.
Stop the training by using CTRL-C.
You'll see caffe make two files: X.caffemodel and X.solverstate.
The caffemodel file contains the current parameter values of the network. We'll use such files to configure a network after we're done training in order to evaluate images. One can also use the values to initialize the weights for training, just like we did with the bvlc_googlenet.caffemodel file. The solver state stores the information needed to continue training where one left off. By typing caffe with no arguments, one can see the arguments needed to do such things.
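For example, to resume from a snapshot, one can hand the solver state back to caffe (a sketch, assuming a snapshot was taken at iteration 2500 with the prefix set in the solver file):

caffe train -solver solver_rmsprop.prototxt -snapshot snapshot_rmsprop_googlenet_iter_2500.solverstate -gpu 0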
OK. Now remove the files that were made. We will restart the training, now sending all output into a text file:
caffe train -solver solver_rmsprop.prototxt -weights bvlc_googlenet.caffemodel -gpu 0 >& log_1a.txt
OK, go into another screen session. (A new shell session, so you'll have to set up the environment again.) Use the script plot_training.py to make a PNG plotting the loss and accuracy:
python plot_training.py log_1a.txt
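If you just want a quick look at the loss without making a plot, you can pull it straight out of the log. A minimal sketch (adjust the field numbers if your caffe build formats its log lines differently):

grep "Iteration .*, loss = " log_1a.txt | awk '{gsub(",","",$6); print $6, $9}'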
After several hours of training, you should see something like this:
On the right, you'll want to see both the training (black) and test (red) accuracy rise together. If the training accuracy rises while the testing accuracy stays low, you're overtraining. If this occurs, you'll have to try a number of techniques to prevent it.
Configuring the training
There are a number of parameters at your disposal to control the training. These are in the solver_rmsprop.prototxt file. Note that there are many different optimizers one can use to train these networks; this is an example using RMSProp (for others, see the Caffe documentation). We'll go over some of the more basic dials:
# The train/test net protocol buffer definition
train_net: "bvlc_googlenet_train.prototxt"
test_net: "bvlc_googlenet_val.prototxt"
iter_size: 1
# test_iter specifies how many forward passes the test should carry out.
# In the case of MNIST, we have test batch size 100 and 100 test iterations,
# covering the full 10,000 testing images.
test_iter: 100
# Carry out testing every 100 training iterations.
test_interval: 100
# The base learning rate, momentum and the weight decay of the network.
base_lr: 0.00004
momentum: 0.0
weight_decay: 0.0
# The learning rate policy
lr_policy: "inv"
gamma: 0.0002
power: 0.75
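# (For reference, Caffe's "inv" policy computes
#  lr = base_lr * (1 + gamma * iter)^(-power),
#  so the rate decays smoothly away from base_lr.)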
# Display every 20 iterations
display: 20
# The maximum number of iterations
max_iter: 100000
# snapshot intermediate results
snapshot: 2500
snapshot_prefix: "snapshot_rmsprop_googlenet"
# solver mode: CPU or GPU
solver_mode: GPU
type: "RMSProp"
rms_decay: 0.90
Useful dials
- base_lr: this is the factor that controls how big the changes to the network parameters can be. Too small and the training goes nowhere. Too big and the loss will stay large or diverge. The rule of thumb I've seen is to find the value at which the training starts to diverge and divide by 2. You'll start, monitor, stop, and adjust to find the best hyper-parameter.
- train_net: this tells caffe the prototxt to use to configure the training network
- test_net: this tells caffe the prototxt to use to configure the testing network
- iter_size: the effective training batch size equals batch_size (set in the network prototxt) times iter_size
- test_iter: the number of test images averaged over equals test_iter times the batch_size of the testing net
- weight_decay: constrains the weights to fall back towards zero over time. A form of regularization.
- max_iter: number of iterations to run
- snapshot: number of iterations between caffe saving snapshots. This is useful because one can let the network train until it starts to become overtrained, then stop the training and use the snapshot from before overtraining occurred. (A method referred to as 'early stopping'.)
The Network Prototxt Files
Above, bvlc_googlenet_train.prototxt and bvlc_googlenet_val.prototxt define the network. Refer to the Caffe documentation for how to set these up. In this tutorial, I'll go into more detail about how we use them to tell Caffe to read in images stored in the LArCV format.
In the network prototxt files you'll find that the first layer defines the data.
layer {
  name: "data"
  type: "ROOTData"
  top: "data"
  top: "label"
  root_data_param {
    batch_size: 8
    filler_config: "filler.cfg"
    filler_name: "train"
  }
}
type: "ROOTData"
tells caffe to use our interface (kept in app/APICaffe
) in the LArCV repo. batch_size
of course controls the size of the batch. The configuration of the LArCV interface occurs in another file, filler.cfg
. Note that such config files can contain many configurations. We select the one we want with the filler_name
variable.
The filler.cfg file (follow the link to look at it) is an extension of a LArCV process-driver configuration file.
In the ProcessType field, you'll see a list of processor names. One can write a process to perform whatever image manipulation one needs before sending the image in; this is useful for implementing data-augmentation strategies. The basic one in the file is ADCThreshold, which defines the minimum and maximum range of values in the image. The associated ProcessName field gives a unique name to each of the different processes.
The last process should always be SimpleFiller. It takes the image produced by the most recent filter in the list and gives it to Caffe.
These processors require parameters, of course. Following the above fields, you'll see a block of parameters called ProcessList. This is where the parameters for the different processes go. Which process a block of parameters is sent to is determined by its unique name from ProcessName.
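To make these relationships concrete, here is a hedged sketch of what such a configuration block might look like. The overall structure follows the fields described above, but the parameter names inside ADCThreshold (MinADC, MaxADC) are illustrative placeholders, not values copied from the repo's filler.cfg:

train: {
  # ordered list of processor types, run on each image in sequence
  ProcessType: ["ADCThreshold","SimpleFiller"]
  # a unique name for each processor; keys in ProcessList must match these
  ProcessName: ["ADCThreshold","SimpleFiller"]
  ProcessList: {
    ADCThreshold: {
      # hypothetical parameters: clamp pixel values to this range
      MinADC: 10.0
      MaxADC: 100.0
    }
    # the last process hands the finished image (and its label) to Caffe
    SimpleFiller: {}
  }
}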