
# VGSL Specs - rapid prototyping of mixed conv/LSTM networks for images

The Variable-size Graph Specification Language (VGSL) enables the specification of a
neural network, composed of convolutions and LSTMs, that can process variable-sized
images, from a very short definition string.

## Applications: What are VGSL Specs good for?

VGSL Specs are designed specifically to create networks for:

* Variable-size images as input. (In one or both dimensions!)
* Output of images (heat maps), sequences (like text), or categories.
* Convolutions and LSTMs as the main computing components.
* Fixed-size images are OK too!

### Model string inputs and outputs

A neural network model is described by a string that specifies the input spec, the
output spec, and the layer specs in between. Example:

[1,0,0,3 Ct5,5,16 Mp3,3 Lfys64 Lfx128 Lrx128 Lfx256 O1c105]

The first 4 numbers specify the size and type of the input, following the TensorFlow
convention for an image tensor: [batch, height, width, depth]. Batch is currently
ignored, but may eventually be used to indicate a training mini-batch size. Height
and/or width may be zero, allowing them to be variable. A non-zero value of height
and/or width means that all input images are expected to be of that size, and will be
bent to fit if needed. Depth needs to be 1 for greyscale and 3 for color. As a special
case, a depth other than 1 combined with a height of 1 causes the image to be treated
from input as a sequence of vertical pixel strips. **NOTE that throughout, x and y are
REVERSED from conventional mathematics,** to use the same convention as TensorFlow.
The reason TF adopts this convention is to eliminate the need to transpose images on
input, since adjacent memory locations in images increase x and then y, while adjacent
memory locations in tensors in TF, and in NetworkIO in tesseract, increase the
rightmost index first, then the next one to the left, and so on, like C arrays.
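To make the [batch, height, width, depth] convention concrete, here is a minimal
Python sketch (a hypothetical helper for illustration only; Tesseract does this
parsing in its own C++ code) that reads the leading four numbers of a spec string
and reports which dimensions are variable and whether the vertical-strip special
case applies:

```python
def parse_input_spec(spec):
    """Parse the leading 'batch,height,width,depth' numbers of a VGSL spec.

    Hypothetical helper for illustration only; Tesseract does this parsing
    in its own C++ code.
    """
    # Strip the opening bracket and take the first whitespace-separated word.
    first_word = spec.strip().lstrip('[').split()[0]
    batch, height, width, depth = (int(v) for v in first_word.split(','))
    return {
        'batch': batch,                  # currently ignored by Tesseract
        'height': height or 'variable',  # 0 means variable
        'width': width or 'variable',
        'depth': depth,                  # 1 = greyscale, 3 = color
        # Special case: height 1 with a different depth means the image is
        # treated as a 1-d sequence of vertical pixel strips.
        'vertical_strips': height == 1 and depth != 1,
    }

print(parse_input_spec('[1,0,0,3 Ct5,5,16 Mp3,3 Lfys64 Lfx128 Lrx128 Lfx256 O1c105]'))
print(parse_input_spec('[1,1,0,48 Lbx256 O1c105]'))
```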

The last "word" is the output specification and takes the form:

O(2|1|0)(l|s|c)n output layer with n classes.
  2 (heatmap) Output is a 2-d vector map of the input (possibly at
    different scale). (Not yet supported.)
  1 (sequence) Output is a 1-d sequence of vector values.
  0 (category) Output is a 0-d single vector value.
  l uses a logistic non-linearity on the output, allowing multiple
    hot elements in any output vector value. (Not yet supported.)
  s uses a softmax non-linearity, with one-hot output in each value.
  c uses a softmax with CTC. Can only be used with s (sequence).
  NOTE Only O1s and O1c are currently supported.

The number of classes is ignored (being there only for compatibility with TensorFlow),
as the actual number is taken from the unicharset.
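As an illustration of the grammar above, a small regex-based sketch (hypothetical,
not Tesseract's own parser) that decodes an output word such as O1c105:

```python
import re

# Illustrative only: mirrors the O(2|1|0)(l|s|c)<n> form described above.
OUTPUT_RE = re.compile(r'^O([012])([lsc])(\d+)$')

def parse_output_spec(word):
    m = OUTPUT_RE.match(word)
    if not m:
        raise ValueError(f'not an output spec: {word!r}')
    return {
        'dimensions': int(m.group(1)),   # 2 = heatmap, 1 = sequence, 0 = category
        'non_linearity': {'l': 'logistic', 's': 'softmax', 'c': 'softmax + CTC'}[m.group(2)],
        'classes': int(m.group(3)),      # ignored: the real count comes from the unicharset
    }

print(parse_output_spec('O1c105'))
```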

## Syntax of the middle layers

NOTE that *all* ops input and output the standard TF convention of a 4-d tensor:
[batch, height, width, depth], *regardless* of any collapsing of dimensions. This
greatly simplifies things, and allows the VGSLSpecs class to track changes to the
values of width and height, so they can be correctly passed in to LSTM operations,
and used by any downstream CTC operation.

NOTE: in the descriptions below, <d> is a numeric value, and literals are described
using regular expression syntax.

NOTE: Whitespace is allowed between ops.

### Functional ops

C(s|t|r|l|m)<y>,<x>,<d> Convolves using a y,x window, with no shrinkage,
  random infill, d outputs, with s|t|r|l|m non-linear layer.
F(s|t|r|l|m)<d> Fully-connected with s|t|r|l|m non-linearity and d outputs.
  Reduces height, width to 1. Connects to every y,x,depth position of the input,
  reducing height, width to 1, producing a single <d> vector as the output.
  Input height and width *must* be constant.
  For a sliding-window linear or non-linear map that connects just to the
  input depth, and leaves the input image size as-is, use a 1x1 convolution
  eg. Cr1,1,64 instead of Fr64.
L(f|r|b)(x|y)[s]<n> LSTM cell with n outputs.
  The LSTM must have one of:
    f runs the LSTM forward only.
    r runs the LSTM reversed only.
    b runs the LSTM bidirectionally.
  It will operate on either the x- or y-dimension, treating the other dimension
  independently (as if part of the batch).
  s (optional) summarizes the output in the requested dimension, outputting
    only the final step, collapsing the dimension to a single element.
LS<n> Forward-only LSTM cell in the x-direction, with built-in Softmax.
LE<n> Forward-only LSTM cell in the x-direction, with built-in softmax,
  with binary Encoding.

In the above, (s|t|r|l|m) specifies the type of the non-linearity:

s = sigmoid
t = tanh
r = relu
l = linear (i.e., No non-linearity)
m = softmax
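As a rough illustration of the shape behaviour described above, the following
hypothetical Python helpers (not Tesseract code) compute output shapes for the C, F
and L ops, with 0 standing for a dimension that is variable in the spec:

```python
def conv_shape(h, w, d, y, x, out_d):
    # C(s|t|r|l|m)<y>,<x>,<d>: no shrinkage, so height/width pass through; depth becomes out_d.
    return h, w, out_d

def fully_connected_shape(h, w, d, out_d):
    # F(s|t|r|l|m)<d>: connects to every y,x,depth position; needs constant height and width.
    assert h > 0 and w > 0, 'F requires constant (non-zero) input height and width'
    return 1, 1, out_d

def lstm_shape(h, w, d, dim, summarize, n):
    # L(f|r|b)(x|y)[s]<n>: depth becomes n; with 's' the chosen dimension collapses to 1.
    if summarize:
        h, w = (h, 1) if dim == 'x' else (1, w)
    return h, w, n

print(lstm_shape(0, 0, 16, 'y', True, 64))   # Lfys64 on a variable-size input -> (1, 0, 64)
```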

Examples:

Cr5,5,32 Runs a 5x5 Relu convolution with 32 depth/number of filters.

Lfx128 runs a forward-only LSTM in the x-dimension with 128 outputs, treating the
y-dimension independently.

Lfys64 runs a forward-only LSTM in the y-dimension with 64 outputs, treating the
x-dimension independently, and collapses the y-dimension to 1 element.

### Plumbing ops

The plumbing ops allow the construction of arbitrarily complex graphs. Something
currently missing is the ability to define macros for generating, say, an inception
unit in multiple places.

[...] Execute ... networks in series (layers).
(...) Execute ... networks in parallel, with their output concatenated in depth.
S<y>,<x> Rescale 2-D input by shrink factor y,x, rearranging the data by
  increasing the depth of the input by factor xy.
  **NOTE** that the TF implementation of VGSLSpecs has a different S that is
  not yet implemented in Tesseract.
Mp<y>,<x> Maxpool the input, reducing each (y,x) rectangle to a single value.
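The shape effect of the two scaling ops can be sketched the same way (a hypothetical
illustration; exact edge handling and rounding are up to Tesseract):

```python
def maxpool_shape(h, w, d, y, x):
    # Mp<y>,<x>: each y-by-x rectangle becomes a single value; depth is unchanged.
    # Variable dimensions (0) stay variable; plain integer division stands in
    # for whatever edge handling Tesseract actually does.
    return (h // y if h else 0), (w // x if w else 0), d

def rescale_shape(h, w, d, y, x):
    # S<y>,<x>: shrink height/width by y,x and fold the data into depth (factor x*y).
    return (h // y if h else 0), (w // x if w else 0), d * x * y

print(maxpool_shape(0, 0, 16, 3, 3))   # Mp3,3 on a variable-size, 16-deep input -> (0, 0, 16)
print(rescale_shape(48, 0, 1, 3, 1))   # S3,1 on a 48-tall greyscale input -> (16, 0, 3)
```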

### Full Example: A 1-D LSTM capable of high quality OCR

[1,1,0,48 Lbx256 O1c105]

As layer descriptions: (Input layer is at the bottom, output at the top.)

O1c105: Output layer produces 1-d (sequence) output, trained with CTC,
  outputting 105 classes.
Lbx256: Bi-directional LSTM in x with 256 outputs
1,1,0,48: Input is a batch of 1 image of height 48 pixels in greyscale, treated
  as a 1-dimensional sequence of vertical pixel strips.
[]: The network is always expressed as a series of layers.
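A small NumPy sketch of what the 1,1,0,48 input means in practice, assuming a
hypothetical 48 x 500 normalized greyscale line image (the exact memory layout inside
Tesseract may differ; this only illustrates the vertical-strip interpretation):

```python
import numpy as np

# Hypothetical normalized greyscale text line: 48 pixels tall, 500 pixels wide.
line_image = np.zeros((48, 500), dtype=np.float32)

# Under the 1,1,0,48 input spec this becomes [batch=1, height=1, width=500, depth=48]:
# each 48-pixel column of the image is one "vertical pixel strip" in a 1-d sequence.
tensor = line_image.T.reshape(1, 1, 500, 48)
print(tensor.shape)   # (1, 1, 500, 48)
```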

This network works well for OCR, as long as the input image is carefully normalized
in the vertical direction, with the baseline and meanline in constant places.

### Full Example: A multi-layer LSTM capable of high quality OCR

[1,0,0,1 Ct5,5,16 Mp3,3 Lfys64 Lfx128 Lrx128 Lfx256 O1c105]

As layer descriptions: (Input layer is at the bottom, output at the top.)

O1c105: Output layer produces 1-d (sequence) output, trained with CTC,
  outputting 105 classes.
Lfx256: Forward-only LSTM in x with 256 outputs
Lrx128: Reverse-only LSTM in x with 128 outputs
Lfx128: Forward-only LSTM in x with 128 outputs
Lfys64: Dimension-summarizing LSTM, summarizing the y-dimension with 64 outputs
Mp3,3: 3 x 3 Maxpool
Ct5,5,16: 5 x 5 Convolution with 16 outputs and tanh non-linearity
1,0,0,1: Input is a batch of 1 image of variable size in greyscale
[]: The network is always expressed as a series of layers.
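Roughly, the tensor shape ([batch, height, width, depth]) evolves through this stack
as sketched below for a hypothetical 48 x 500 greyscale input; the pooled sizes are
approximate and 0 marks a dimension that is variable in the spec:

```python
# Rough shape trace ([batch, height, width, depth]) for a hypothetical 48 x 500
# greyscale line; pooled sizes are approximate, 0 marks a spec-time variable dimension.
trace = [
    ('1,0,0,1 (input)', (1, 48, 500, 1)),
    ('Ct5,5,16',        (1, 48, 500, 16)),   # convolution: no shrinkage, depth -> 16
    ('Mp3,3',           (1, 16, 166, 16)),   # 3x3 maxpool: height and width roughly / 3
    ('Lfys64',          (1, 1, 166, 64)),    # summarizing LSTM collapses y, depth -> 64
    ('Lfx128',          (1, 1, 166, 128)),
    ('Lrx128',          (1, 1, 166, 128)),
    ('Lfx256',          (1, 1, 166, 256)),
    ('O1c105',          (1, 1, 166, 105)),   # per-step softmax over classes, decoded with CTC
]
for op, shape in trace:
    print(f'{op:16s} -> {shape}')
```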

The summarizing LSTM makes this network more resilient to vertical variation in the
position of the text.

## Variable size inputs and summarizing LSTMs

Note that currently the only way of collapsing a dimension of unknown size to a known
size (1) is through the use of a summarizing LSTM. A single summarizing LSTM will
collapse one dimension (x or y), leaving a 1-d sequence. The 1-d sequence can then be
collapsed in the other dimension to make a 0-d categorical (softmax) or embedding
(logistic) output.

For OCR purposes then, the height of the input images must either be fixed and scaled
vertically to 1 by the top layer (using Mp or S), or, to allow variable-height images,
a summarizing LSTM must be used to collapse the vertical dimension to a single value.
A summarizing LSTM can also be used with a fixed-height input.
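A hypothetical sanity check of this constraint (illustration only, not Tesseract's
validation): if the input height is variable (0), look for a y-summarizing LSTM
(Lfys/Lrys/Lbys) somewhere in the spec.

```python
import re

def collapses_variable_height(spec):
    """Heuristic check of the constraint above (illustration only): a
    variable-height input (height == 0) needs a y-summarizing LSTM
    (Lfys/Lrys/Lbys) somewhere in the spec."""
    words = spec.strip().lstrip('[').rstrip(']').split()
    height = int(words[0].split(',')[1])
    if height != 0:
        return True   # fixed height: Mp or S layers can scale it down instead
    return any(re.match(r'^L[frb]ys\d+$', w) for w in words[1:])

print(collapses_variable_height('[1,0,0,1 Ct5,5,16 Mp3,3 Lfys64 Lfx128 Lrx128 Lfx256 O1c105]'))  # True
print(collapses_variable_height('[1,0,0,1 Lbx256 O1c105]'))                                      # False
```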
