# 2016-08-07: Studying LRCN Activity Recognition (HenglinShi/LSTM_LIP_READING GitHub Wiki)

References:

Project source: https://people.eecs.berkeley.edu/~lisa_anne/LRCN_video

# Network Architecture

The Input Layer produces three blobs: `samples`, `clip_markers`, and `labels`. The `samples` blob runs through a CaffeNet-style CNN, while `clip_markers` and `labels` are only reshaped:

- `samples`: Convolution → ReLU → Pooling → LRN → Convolution → ReLU → Pooling → LRN → Convolution → ReLU → Convolution → ReLU → Convolution → ReLU → Pooling → InnerProduct → ReLU → Dropout → Reshape
- `clip_markers`: Reshape
- `labels`: Reshape

The reshaped features and clip markers then feed the recurrent part:

- LSTM → Dropout → InnerProduct

Finally, the output is compared against the reshaped labels by Softmax and Loss.

# Input

## Samples

A 4-D blob of shape 384 * 3 * 227 * 227, where 384 is the number of frames in a batch, 3 is the number of channels, and 227 * 227 is the image size.

However, the contents of the input look strange: all values appear to be integers. (To check: this may be because they are flow images.)
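As a quick shape sanity check, here is a NumPy sketch. The 24-clips-of-16-frames split is taken from the reshape section below, and 227 is the usual CaffeNet crop size:

```python
import numpy as np

time_steps, num_seqs = 16, 24          # from the reshape layers below
batch_size = time_steps * num_seqs     # 384 frames per batch

# Stand-in for the input blob: 384 frames of 3 x 227 x 227
# (227 is the standard CaffeNet crop size).
samples = np.zeros((batch_size, 3, 227, 227), dtype=np.float32)

print(samples.shape)  # (384, 3, 227, 227)
```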

## Labels

A vector of length 384, one label per frame.

## Clip_markers

A vector of length 384, one marker per frame.

# Before the Reshape

## Labels and clip_markers

Both pass through unchanged until their Reshape layers.

## Samples

`net.blobs['fc6'].data.shape` output: 384 * 4096, i.e. one 4096-dimensional fc6 feature per frame.

# After the Reshape

## Labels

- `net.blobs['reshape-label'].data.shape`
- Output: 16 * 24
- dim 1: number of time steps, so each sequence contains 16 time steps
- dim 2: number of sequences, so 24 sequences are fed to the LSTM in each batch
- All sequences are preprocessed to the same length of 16 time steps. Each column holds the frames of one sequence, so all labels within a column are the same.
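To see the column structure concretely, here is a small NumPy sketch. The per-clip labels below are made up; the point is that a time-major flat vector (flat index = t * num_seqs + n) reshaped to 16 * 24 puts each clip in one column:

```python
import numpy as np

time_steps, num_seqs = 16, 24

# Hypothetical per-clip labels, one per sequence.
clip_labels = np.arange(num_seqs)

# Time-major packing: frame t of clip n sits at flat index t*num_seqs + n,
# so the flat label vector is the clip labels tiled time_steps times.
flat_labels = np.tile(clip_labels, time_steps)          # shape (384,)
reshaped = flat_labels.reshape(time_steps, num_seqs)    # shape (16, 24)

# Every column holds a single clip, so its label is constant.
assert all((reshaped[:, n] == clip_labels[n]).all() for n in range(num_seqs))
```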

## Clip_markers

- `net.blobs['reshape-cm'].data.shape`
- Output: 16 * 24
- dim 1: number of time steps, so each sequence contains 16 time steps
- dim 2: number of sequences, so 24 sequences are fed to the LSTM in each batch
- Same structure as above
- The first row is all zeros, marking the start of each sequence; all other entries are 1.
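The marker layout can be sketched the same way (NumPy; assumes the time-major packing described above, i.e. flat index = t * num_seqs + n):

```python
import numpy as np

time_steps, num_seqs = 16, 24

# With time-major packing, the t = 0 frames of all 24 clips occupy the
# first 24 entries of the flat 384-vector, so marking sequence starts
# means zeroing exactly that prefix.
flat_cm = np.ones(time_steps * num_seqs, dtype=np.float32)
flat_cm[:num_seqs] = 0

cm = flat_cm.reshape(time_steps, num_seqs)  # (16, 24)
assert (cm[0] == 0).all()   # first row: start of every sequence
assert (cm[1:] == 1).all()  # remaining rows are all 1
```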

## Samples

`net.blobs['fc6-reshape'].data.shape` output: 16 * 24 * 4096
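Shape-wise, the feature reshape is just a view change (NumPy sketch; 4096 is the fc6 width from above):

```python
import numpy as np

time_steps, num_seqs, feat_dim = 16, 24, 4096

# Per-frame fc6 features, one 4096-d row per frame.
fc6 = np.zeros((time_steps * num_seqs, feat_dim), dtype=np.float32)

# Reshape to (time, sequence, feature) for the LSTM.
fc6_reshaped = fc6.reshape(time_steps, num_seqs, feat_dim)

print(fc6_reshaped.shape)  # (16, 24, 4096)
```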

# Conclusion

As a result, the inputs of our experiment and of this project are the same.