Algorithms

The relationship and differences between the Viterbi algorithm and beam search

Similarities

Both find the optimal path by computing path scores. Viterbi is usually used for sequence labeling, with scores provided by a trained CRF, while beam search obtains its scores from the vocabulary-sized probability distribution that softmax outputs at each time step.

Differences

Beam search

At each prediction step, beam search keeps the top-k highest-probability paths. Suppose the vocabulary has size 3 with tokens a, b, c, and the beam width is 2. When decoding in a seq2seq task:

When generating the 1st word, keep the 2 highest-probability words, say a and c, so the current candidate sequences are a and c. When generating the 2nd word, feed a and c in as prefixes, producing 6 new sequences: aa, ab, ac, ca, cb, cc, and again keep the 2 highest-scoring ones, say aa and cb. Repeat this process until the end-of-sequence symbol is generated, and finally output the 2 highest-scoring sequences.

  • Key points:
  1. Based on the idea of greedy search.
  2. The decoder's prediction in seq2seq can be viewed as a breadth-first graph search. Because the candidate set at each step is huge (vocabulary sizes are on the order of 100k), the low-probability branches are pruned away. This greatly shrinks the search space, so the result is an approximate solution rather than the global optimum; beam search is a heuristic algorithm.
  3. Beam search is only needed at test time; during training the correct answer is known, so no search is necessary. It is most useful when the search space is very large, as in language generation, where each step picks one word from a vocabulary on the order of 100k; beam search greatly reduces the computation (see the sketch below).
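
To make this concrete, here is a minimal beam search sketch in plain Python. The step_log_probs callback is a hypothetical stand-in for the decoder's per-step softmax output; this is an illustration, not any particular library's implementation.

import math

def beam_search(step_log_probs, eos, beam_width=2, max_len=10):
    # Each beam entry is (sequence, cumulative log-probability).
    beams = [([], 0.0)]
    finished = []
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            # step_log_probs(prefix) -> {token: log-probability of that token
            # coming next}, standing in for the decoder softmax at this step.
            for tok, logp in step_log_probs(seq).items():
                candidates.append((seq + [tok], score + logp))
        # Pruning: keep only the beam_width highest-scoring expansions.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for seq, score in candidates[:beam_width]:
            (finished if seq[-1] == eos else beams).append((seq, score))
        if not beams:
            break
    finished.extend(beams)  # beams that never emitted eos still count
    finished.sort(key=lambda c: c[1], reverse=True)
    return finished[:beam_width]

# Toy next-token distribution over {a, b, c, <eos>} (arbitrary numbers).
def toy_step(prefix):
    probs = {'a': 0.4, 'b': 0.1, 'c': 0.3, '<eos>': 0.2}
    return {tok: math.log(p) for tok, p in probs.items()}

print(beam_search(toy_step, '<eos>'))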

Viterbi

Mostly used for sequence labeling. Starting from the n states at time step t, compute the score of transitioning to each of the n states at time step t+1 (transition score * emission score). At every time step, each state keeps only the highest-scoring path reaching it, and at the last time step the optimal path is recovered by backtracking.

  • Key points:
  1. Viterbi is based on dynamic programming.
  2. At each step, the best score of every current state is computed from the best scores over all possible states at the previous step, which guarantees a globally optimal solution.
  3. Suited to finding the optimal path in graphs of small search width, i.e., when there are few choices at each step, as in CRFs and HMMs (see the sketch below).
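
Below is a minimal Viterbi sketch in plain Python. The start, trans, and emit tables are hypothetical log-space scores (so transition and emission scores are added rather than multiplied), as an HMM or CRF would provide.

def viterbi(obs, states, start, trans, emit):
    # dp[s] = best score of any path ending in state s at the current step.
    dp = {s: start[s] + emit[s][obs[0]] for s in states}
    back = []
    for o in obs[1:]:
        prev, dp, ptr = dp, {}, {}
        for s in states:
            # Best previous state for s: max over all states of the last step.
            best = max(states, key=lambda p: prev[p] + trans[p][s])
            dp[s] = prev[best] + trans[best][s] + emit[s][o]
            ptr[s] = best
        back.append(ptr)
    # Backtrack from the best final state to recover the optimal path.
    last = max(states, key=lambda s: dp[s])
    path = [last]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return list(reversed(path)), dp[last]

# Toy 2-state tagging example with made-up log scores.
states = ['B', 'I']
start = {'B': 0.0, 'I': -2.0}
trans = {'B': {'B': -1.0, 'I': -0.5}, 'I': {'B': -0.7, 'I': -1.2}}
emit = {'B': {'x': -0.3, 'y': -1.5}, 'I': {'x': -1.0, 'y': -0.4}}
print(viterbi(['x', 'y', 'x'], states, start, trans, emit))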

The attention mechanism in seq2seq comes mainly in two flavors, Luong and Bahdanau attention, which differ as follows:

  1. When computing the attention score (alignment) at time step t, Luong uses the current hidden state h_{t}, whereas Bahdanau uses h_{t-1}.
  2. For the context vector, Luong concatenates h_{t} and c_{t} to form the vector used to predict y_{t}, and this vector is also concatenated with y_{t} as additional features to form the decoder input at the next time step. Bahdanau instead concatenates c_{t} with h_{t-1} to update h_{t-1}, then uses the updated h_{t-1} to generate h_{t}, which is also used to predict y_{t}.
    See this link for details.
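
The two scoring functions can be sketched in a few lines of PyTorch. This is a minimal illustration, not either paper's full model; W, Wa, Ua, and va are hypothetical learned parameters, and enc_outs stands for the encoder hidden states.

import torch
import torch.nn as nn

hidden = 8
enc_outs = torch.randn(5, hidden)  # encoder hidden states, source length 5
h_t = torch.randn(hidden)          # decoder state at step t (used by Luong)
h_prev = torch.randn(hidden)       # decoder state at step t-1 (used by Bahdanau)

# Luong ("general" form): score(h_t, h_s) = h_t^T W h_s, using the current h_t.
W = nn.Linear(hidden, hidden, bias=False)
luong_align = torch.softmax(enc_outs @ W(h_t), dim=0)

# Bahdanau (additive form): score(h_{t-1}, h_s) = va^T tanh(Wa h_{t-1} + Ua h_s).
Wa = nn.Linear(hidden, hidden, bias=False)
Ua = nn.Linear(hidden, hidden, bias=False)
va = nn.Linear(hidden, 1, bias=False)
bahdanau_align = torch.softmax(va(torch.tanh(Wa(h_prev) + Ua(enc_outs))).squeeze(-1), dim=0)

# Either way, the context vector c_t is the alignment-weighted sum of the
# encoder states; the two variants then combine it with h_t or h_{t-1} as
# described above.
c_t = luong_align @ enc_outs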

Feature extraction with 3D convolution -- a brief introduction to the C3D network and code walkthrough

3D convolutional feature extraction is mostly used for video, because the extra temporal dimension a video has over an image (the number of frames) combines with the height and width of 2D images to form a 3D volume. In 2D convolution, convolving one kernel with one image (single-channel convolution) still produces an image, and convolving one kernel with multiple images (multi-channel convolution, where the per-channel results are summed into a single image; the kernel's extent along the channel dimension equals the number of channels, see this link for details) also still produces an image. A 3D convolution, by contrast, produces a volumetric 3D feature map; concretely, in C3D every 16 frames are eventually turned into a fixed-length tensor of size 4096. The convolution process is explained in the C3D code analysis below.
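
The difference in output shape is easy to verify directly; a minimal sketch with made-up sizes:

import torch
import torch.nn as nn

img = torch.randn(1, 3, 112, 112)       # one RGB image: (N, C, H, W)
clip = torch.randn(1, 3, 16, 112, 112)  # a 16-frame clip: (N, C, T, H, W)

# One 2D kernel over a multi-channel image -> a single 2D feature map.
conv2d = nn.Conv2d(3, 1, kernel_size=3, padding=1)
print(conv2d(img).shape)   # torch.Size([1, 1, 112, 112])

# One 3D kernel over the clip -> a single volumetric (3D) feature map.
conv3d = nn.Conv3d(3, 1, kernel_size=3, padding=1)
print(conv3d(clip).shape)  # torch.Size([1, 1, 16, 112, 112])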

The C3D network consists of 8 convolutional layers, 5 pooling layers, two fully connected layers, and a final softmax layer; the full architecture is illustrated in the paper. The PyTorch definition of the network is as follows:

import torch
import torch.nn as nn

class C3D(nn.Module):
    """
    The C3D network.
    """

    def __init__(self, num_classes, pretrained=False):
        super(C3D, self).__init__()

        self.conv1 = nn.Conv3d(3, 64, kernel_size=(3, 3, 3), padding=(1, 1, 1))
        self.pool1 = nn.MaxPool3d(kernel_size=(1, 2, 2), stride=(1, 2, 2))

        self.conv2 = nn.Conv3d(64, 128, kernel_size=(3, 3, 3), padding=(1, 1, 1))
        self.pool2 = nn.MaxPool3d(kernel_size=(2, 2, 2), stride=(2, 2, 2))

        self.conv3a = nn.Conv3d(128, 256, kernel_size=(3, 3, 3), padding=(1, 1, 1))
        self.conv3b = nn.Conv3d(256, 256, kernel_size=(3, 3, 3), padding=(1, 1, 1))
        self.pool3 = nn.MaxPool3d(kernel_size=(2, 2, 2), stride=(2, 2, 2))

        self.conv4a = nn.Conv3d(256, 512, kernel_size=(3, 3, 3), padding=(1, 1, 1))
        self.conv4b = nn.Conv3d(512, 512, kernel_size=(3, 3, 3), padding=(1, 1, 1))
        self.pool4 = nn.MaxPool3d(kernel_size=(2, 2, 2), stride=(2, 2, 2))

        self.conv5a = nn.Conv3d(512, 512, kernel_size=(3, 3, 3), padding=(1, 1, 1))
        self.conv5b = nn.Conv3d(512, 512, kernel_size=(3, 3, 3), padding=(1, 1, 1))
        self.pool5 = nn.MaxPool3d(kernel_size=(2, 2, 2), stride=(2, 2, 2), padding=(0, 1, 1))

        self.fc6 = nn.Linear(8192, 4096)
        self.fc7 = nn.Linear(4096, 4096)
        self.fc8 = nn.Linear(4096, num_classes)

        self.dropout = nn.Dropout(p=0.5)

        self.relu = nn.ReLU()

        self.__init_weight()

        if pretrained:
            self.__load_pretrained_weights()

    def __init_weight(self):
        # Kaiming initialization for the conv kernels, zeros for the biases.
        for m in self.modules():
            if isinstance(m, nn.Conv3d):
                nn.init.kaiming_normal_(m.weight)
                if m.bias is not None:
                    nn.init.constant_(m.bias, 0)

    def __load_pretrained_weights(self):
        # In the original repo this copies pretrained C3D weights into the
        # model; omitted here to keep the listing self-contained.
        raise NotImplementedError

    def forward(self, x):

        x = self.relu(self.conv1(x))
        x = self.pool1(x)

        x = self.relu(self.conv2(x))
        x = self.pool2(x)

        x = self.relu(self.conv3a(x))
        x = self.relu(self.conv3b(x))
        x = self.pool3(x)

        x = self.relu(self.conv4a(x))
        x = self.relu(self.conv4b(x))
        x = self.pool4(x)

        x = self.relu(self.conv5a(x))
        x = self.relu(self.conv5b(x))
        x = self.pool5(x)

        x = x.view(-1, 8192)
        x = self.relu(self.fc6(x))
        x = self.dropout(x)
        x = self.relu(self.fc7(x))
        x = self.dropout(x)

        logits = self.fc8(x)

        return logits

# C3D(
#   (conv1): Conv3d(3, 64, kernel_size=(3, 3, 3), stride=(1, 1, 1), padding=(1, 1, 1))
#   (pool1): MaxPool3d(kernel_size=(1, 2, 2), stride=(1, 2, 2), padding=0, dilation=1, ceil_mode=False)
#   (conv2): Conv3d(64, 128, kernel_size=(3, 3, 3), stride=(1, 1, 1), padding=(1, 1, 1))
#   (pool2): MaxPool3d(kernel_size=(2, 2, 2), stride=(2, 2, 2), padding=0, dilation=1, ceil_mode=False)
#   (conv3a): Conv3d(128, 256, kernel_size=(3, 3, 3), stride=(1, 1, 1), padding=(1, 1, 1))
#   (conv3b): Conv3d(256, 256, kernel_size=(3, 3, 3), stride=(1, 1, 1), padding=(1, 1, 1))
#   (pool3): MaxPool3d(kernel_size=(2, 2, 2), stride=(2, 2, 2), padding=0, dilation=1, ceil_mode=False)
#   (conv4a): Conv3d(256, 512, kernel_size=(3, 3, 3), stride=(1, 1, 1), padding=(1, 1, 1))
#   (conv4b): Conv3d(512, 512, kernel_size=(3, 3, 3), stride=(1, 1, 1), padding=(1, 1, 1))
#   (pool4): MaxPool3d(kernel_size=(2, 2, 2), stride=(2, 2, 2), padding=0, dilation=1, ceil_mode=False)
#   (conv5a): Conv3d(512, 512, kernel_size=(3, 3, 3), stride=(1, 1, 1), padding=(1, 1, 1))
#   (conv5b): Conv3d(512, 512, kernel_size=(3, 3, 3), stride=(1, 1, 1), padding=(1, 1, 1))
#   (pool5): MaxPool3d(kernel_size=(2, 2, 2), stride=(2, 2, 2), padding=(0, 1, 1), dilation=1, ceil_mode=False)
#   (fc6): Linear(in_features=8192, out_features=4096, bias=True)
#   (fc7): Linear(in_features=4096, out_features=4096, bias=True)
#   (fc8): Linear(in_features=4096, out_features=5, bias=True)
#   (dropout): Dropout(p=0.5, inplace=False)
#   (relu): ReLU()
# )

As the code above shows, the network is largely convolution followed by pooling, with a few places where two convolutional layers are stacked before a single pooling layer. The shape of the input data as it flows through the network is shown below:

model = C3D(5) # instantiate a C3D network with 5 output classes
Input = torch.randn(10,3,16,112,112) # batch_size=10, channel_num=3, frame_num=16, H=112, W=112

conv1: torch.Size([10, 64, 16, 112, 112])
pool1: torch.Size([10, 64, 16, 56, 56])
conv2: torch.Size([10, 128, 16, 56, 56])
pool2: torch.Size([10, 128, 8, 28, 28])
conv3a: torch.Size([10, 256, 8, 28, 28])
conv3b: torch.Size([10, 256, 8, 28, 28])
pool3: torch.Size([10, 256, 4, 14, 14])
conv4a: torch.Size([10, 512, 4, 14, 14])
conv4b: torch.Size([10, 512, 4, 14, 14])
pool4: torch.Size([10, 512, 2, 7, 7])
conv5a: torch.Size([10, 512, 2, 7, 7])
conv5b: torch.Size([10, 512, 2, 7, 7])
pool5: torch.Size([10, 512, 1, 4, 4])

reshape: torch.Size([10, 8192])
ffn1: torch.Size([10, 4096])
ffn2: torch.Size([10, 4096])

output: torch.Size([10, 5])

For all three dimensions during convolution the formula is changed_length = (length + 2*padding - kernel_size)/stride + 1. In this model the padding is 1, the kernel_size is 3, and the stride is 1 in all three dimensions, so changed_length = (length + 2*1 - 3)/1 + 1 = length, i.e., convolution leaves all three dimensions unchanged, and only the subsequent pooling changes their sizes. Note also that each 3D kernel indeed extracts one volumetric 3D feature map, unlike the 2D feature map produced by a 2D convolution.
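
The shape listing above can be reproduced with forward hooks; a minimal sketch, assuming the C3D class and imports defined earlier:

model = C3D(5)

def report(name):
    def hook(module, inputs, output):
        print(f'{name}: {tuple(output.shape)}')
    return hook

# Hook every named submodule except the shared relu/dropout modules,
# which fire many times per forward pass.
for name, module in model.named_children():
    if name not in ('relu', 'dropout'):
        module.register_forward_hook(report(name))

with torch.no_grad():
    model(torch.randn(10, 3, 16, 112, 112))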

VGG-16

A classic image feature extraction model with 16 weight layers (13 convolutional and 3 fully connected). It takes an input of shape (3, 224, 224) and outputs a distribution of shape (1, 1, class number); 1000 classes below. Note that this particular reimplementation leaves the first convolution of each block unpadded and pads the pooling layers instead, which still arrives at the canonical 7 * 7 * 512 feature map before the fully connected layers. The structure is as follows:

import torch
import torch.nn as nn
import torch.nn.functional as F
 
 
class VGG16(nn.Module):

    def __init__(self):
        super(VGG16, self).__init__()
        
        # 3 * 224 * 224
        self.conv1_1 = nn.Conv2d(3, 64, 3) # 64 * 222 * 222
        self.conv1_2 = nn.Conv2d(64, 64, 3, padding=(1, 1)) # 64 * 222* 222
        self.maxpool1 = nn.MaxPool2d((2, 2), padding=(1, 1)) # pooling 64 * 112 * 112
        
        self.conv2_1 = nn.Conv2d(64, 128, 3) # 128 * 110 * 110
        self.conv2_2 = nn.Conv2d(128, 128, 3, padding=(1, 1)) # 128 * 110 * 110
        self.maxpool2 = nn.MaxPool2d((2, 2), padding=(1, 1)) # pooling 128 * 56 * 56
        
        self.conv3_1 = nn.Conv2d(128, 256, 3) # 256 * 54 * 54
        self.conv3_2 = nn.Conv2d(256, 256, 3, padding=(1, 1)) # 256 * 54 * 54
        self.conv3_3 = nn.Conv2d(256, 256, 3, padding=(1, 1)) # 256 * 54 * 54
        self.maxpool3 = nn.MaxPool2d((2, 2), padding=(1, 1)) # pooling 256 * 28 * 28
        
        self.conv4_1 = nn.Conv2d(256, 512, 3) # 512 * 26 * 26
        self.conv4_2 = nn.Conv2d(512, 512, 3, padding=(1, 1)) # 512 * 26 * 26
        self.conv4_3 = nn.Conv2d(512, 512, 3, padding=(1, 1)) # 512 * 26 * 26
        self.maxpool4 = nn.MaxPool2d((2, 2), padding=(1, 1)) # pooling 512 * 14 * 14
        
        self.conv5_1 = nn.Conv2d(512, 512, 3) # 512 * 12 * 12
        self.conv5_2 = nn.Conv2d(512, 512, 3, padding=(1, 1)) # 512 * 12 * 12
        self.conv5_3 = nn.Conv2d(512, 512, 3, padding=(1, 1)) # 512 * 12 * 12
        self.maxpool5 = nn.MaxPool2d((2, 2), padding=(1, 1)) # pooling 512 * 7 * 7
        
        # view
        
        self.fc1 = nn.Linear(512 * 7 * 7, 4096)
        self.fc2 = nn.Linear(4096, 4096)
        self.fc3 = nn.Linear(4096, 1000)
        # softmax 1 * 1 * 1000
        
    def forward(self, x):
        
        # x.size(0) is the batch size
        in_size = x.size(0)
        
        out = self.conv1_1(x) # 222
        out = F.relu(out)
        out = self.conv1_2(out) # 222
        out = F.relu(out)
        out = self.maxpool1(out) # 112
        
        out = self.conv2_1(out) # 110
        out = F.relu(out)
        out = self.conv2_2(out) # 110
        out = F.relu(out)
        out = self.maxpool2(out) # 56
        
        out = self.conv3_1(out) # 54
        out = F.relu(out)
        out = self.conv3_2(out) # 54
        out = F.relu(out)
        out = self.conv3_3(out) # 54
        out = F.relu(out)
        out = self.maxpool3(out) # 28
        
        out = self.conv4_1(out) # 26
        out = F.relu(out)
        out = self.conv4_2(out) # 26
        out = F.relu(out)
        out = self.conv4_3(out) # 26
        out = F.relu(out)
        out = self.maxpool4(out) # 14
        
        out = self.conv5_1(out) # 12
        out = F.relu(out)
        out = self.conv5_2(out) # 12
        out = F.relu(out)
        out = self.conv5_3(out) # 12
        out = F.relu(out)
        out = self.maxpool5(out) # 7
        
        # flatten to (batch_size, 512 * 7 * 7)
        out = out.view(in_size, -1)
        
        out = self.fc1(out)
        out = F.relu(out)
        out = self.fc2(out)
        out = F.relu(out)
        out = self.fc3(out)
        
        out = F.log_softmax(out, dim=1)
        
        
        return out
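
A quick sanity check of the shapes, as a minimal sketch:

model = VGG16()
x = torch.randn(2, 3, 224, 224)  # a batch of 2 RGB images
out = model(x)
print(out.shape)  # torch.Size([2, 1000]), log-probabilities over 1000 classes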