# VGSL Specs - rapid prototyping of mixed conv/LSTM networks for images
The Variable-size Graph Specification Language (VGSL) enables the specification of a neural network, composed of convolutions and LSTMs, that can process variable-sized images, from a very short definition string.
## Applications: What are VGSL Specs useful for?
VGSL Specs are designed specifically to create networks for:

* Variable-sized images as the input (in one or both dimensions!)
* Outputting an image (heat map), a sequence (like text), or a category.
* Convolutions and LSTMs as the main computing components.
* Fixed-size images are OK too!
### Model string inputs and outputs
A neural network model is described by a string that describes the input spec, the output spec, and the layers spec in between. Example:

```
[1,0,0,3 Ct5,5,16 Mp3,3 Lfys64 Lfx128 Lrx128 Lfx256 O1c105]
```
The first 4 numbers specify the size and type of the input, and follow the TensorFlow convention for an image tensor: [batch, height, width, depth]. Batch is currently ignored, but may eventually be used to indicate a training mini-batch size. Height and/or width may be zero, allowing them to be variable. A non-zero value of height and/or width means that all input images are expected to be of that size, and will be bent to fit if needed. Depth needs to be 1 for greyscale and 3 for color. As a special case, a different value of depth combined with a height of 1 causes the image to be treated from input as a sequence of vertical pixel strips. **NOTE that, contrary to conventional mathematics, x and y are reversed,** following the same convention as TensorFlow. The reason TF adopts this convention is to eliminate the need to transpose images on input, since adjacent memory locations in an image advance x and then y, while adjacent memory locations in a TF tensor, and in tesseract's NetworkIO, advance the rightmost index first, then the next-to-rightmost, and so on, just like a C array.
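To make that index order concrete, here is a minimal numpy sketch (illustrative only, not Tesseract code) showing that a [batch, height, width, depth] tensor is just the usual row-major image layout with a batch axis added, and that a pixel at image coordinates (x, y) is addressed as `tensor[0, y, x]`:

```python
import numpy as np

# A 3-channel colour image, 60 pixels high and 100 wide, stored the usual
# way: row (y) first, then column (x), then channel - i.e. C order.
height, width, depth = 60, 100, 3
image = np.arange(height * width * depth, dtype=np.float32).reshape(height, width, depth)

# The VGSL/TF input tensor is [batch, height, width, depth]; no transpose is
# needed, just an added batch axis, because both layouts advance the rightmost
# index (depth) fastest, then width (x), then height (y) in memory.
tensor = image[np.newaxis, ...]            # shape (1, 60, 100, 3)

x, y = 7, 5                                # pixel at column 7, row 5
assert np.array_equal(tensor[0, y, x], image[y, x])   # note the y-before-x order
print(tensor.shape)                        # (1, 60, 100, 3)
```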
The last "word" is the output specification, which takes the form:
```
O(2|1|0)(l|s|c)n output layer with n classes.
  2 (heatmap) Output is a 2-d vector map of the input (possibly at
    different scale). (Not yet supported.)
  1 (sequence) Output is a 1-d sequence of vector values.
  0 (category) Output is a 0-d single vector value.
  l uses a logistic non-linearity on the output, allowing multiple
    hot elements in any output vector value. (Not yet supported.)
  s uses a softmax non-linearity, with one-hot output in each value.
  c uses a softmax with CTC. Can only be used with s (sequence).
  NOTE Only O1s and O1c are currently supported.
```
The number of classes is ignored (it is there only for compatibility with TensorFlow), as the actual number is taken from the unicharset.
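As an illustration only (this is not Tesseract's parser), the output spec above can be matched with a regular expression; `parse_output_spec` and `OUTPUT_SPEC` below are hypothetical names, and the class count is parsed even though, as noted, Tesseract takes the real number from the unicharset:

```python
import re

# Hypothetical helper matching the O(2|1|0)(l|s|c)<n> pattern described above.
OUTPUT_SPEC = re.compile(r"^O([012])([lsc])(\d+)$")

def parse_output_spec(spec: str):
    m = OUTPUT_SPEC.match(spec)
    if not m:
        raise ValueError(f"not an output spec: {spec!r}")
    dims, nonlinearity, classes = int(m.group(1)), m.group(2), int(m.group(3))
    return dims, nonlinearity, classes

print(parse_output_spec("O1c105"))   # (1, 'c', 105): sequence output, softmax with CTC, 105 classes
```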
## Syntax of the middle layers
NOTE that *all* ops input and output the standard TF convention of a 4-d tensor: [batch, height, width, depth], *regardless* of any collapsing of dimensions. This greatly simplifies things, and allows the VGSLSpecs class to track changes to the values of width and height, so they can be correctly passed in to LSTM operations and used by any downstream CTC operation.

NOTE: in the descriptions below, `<d>` is a numeric value, and literals are described using regular expression syntax.

NOTE: whitespace is allowed between ops.
### Functional ops
```
C(s|t|r|l|m)<y>,<x>,<d> Convolves using a y,x window, with no shrinkage,
  random infill, d outputs, with s|t|r|l|m non-linear layer.
F(s|t|r|l|m)<d> Fully-connected with s|t|r|l|m non-linearity and d outputs.
  Reduces height, width to 1. Connects to every y,x,depth position of the input,
  reducing height, width to 1, producing a single <d> vector as the output.
  Input height and width *must* be constant.
  For a sliding-window linear or non-linear map that connects just to the
  input depth, and leaves the input image size as-is, use a 1x1 convolution
  eg. Cr1,1,64 instead of Fr64.
L(f|r|b)(x|y)[s]<n> LSTM cell with n outputs.
  The LSTM must have one of:
    f runs the LSTM forward only.
    r runs the LSTM reversed only.
    b runs the LSTM bidirectionally.
  It will operate on either the x- or y-dimension, treating the other dimension
  independently (as if part of the batch).
  s (optional) summarizes the output in the requested dimension, outputting
    only the final step, collapsing the dimension to a single element.
LS<n> Forward-only LSTM cell in the x-direction, with built-in Softmax.
LE<n> Forward-only LSTM cell in the x-direction, with built-in softmax,
  with binary Encoding.
```

In the above, `(s|t|r|l|m)` specifies the type of the non-linearity:

```
s = sigmoid
t = tanh
r = relu
l = linear (i.e. no non-linearity)
m = softmax
```
Examples:
`Cr5,5,32` Runs a 5x5 Relu convolution with 32 depth/number of filters.

`Lfx128` Runs a forward-only LSTM, in the x-dimension, with 128 outputs, treating the y-dimension independently.

`Lfys64` Runs a forward-only LSTM in the y-dimension with 64 outputs, treating the x-dimension independently, and collapses the y-dimension to 1 element.
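The following shape-bookkeeping sketch is illustrative only (it is not the VGSLSpecs implementation); it assumes the non-shrinking convolution described above and traces how the [batch, height, width, depth] shape evolves through the example spec from the top of the page, with 0 standing for a variable dimension:

```python
# Shape bookkeeping for [1,0,0,3 Ct5,5,16 Mp3,3 Lfys64 Lfx128 Lrx128 Lfx256 O1c105].
# 0 means "variable"; every intermediate result stays a 4-d shape, as noted above.

def div(dim, factor):
    # A variable dimension stays variable; a fixed one is shrunk by the factor.
    return 0 if dim == 0 else dim // factor

shape = [1, 0, 0, 3]                                  # input: [batch, height, width, depth]
print("input   ", shape)

shape = [shape[0], shape[1], shape[2], 16]            # Ct5,5,16: depth -> 16, no shrinkage
print("Ct5,5,16", shape)

shape = [shape[0], div(shape[1], 3), div(shape[2], 3), shape[3]]   # Mp3,3: pool y,x by 3
print("Mp3,3   ", shape)

shape = [shape[0], 1, shape[2], 64]                   # Lfys64: y summarized to 1, depth 64
print("Lfys64  ", shape)

for n in (128, 128, 256):                             # Lfx128, Lrx128, Lfx256: change depth only
    shape = [shape[0], shape[1], shape[2], n]
print("Lfx256  ", shape)

shape = [shape[0], shape[1], shape[2], 105]           # O1c105: 105 classes per sequence step
print("O1c105  ", shape)
```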
### Plumbing ops

The plumbing ops allow the construction of arbitrarily complex graphs. Something currently missing is the ability to define macros for generating, say, an inception unit in multiple places.
```
[...] Execute ... networks in series (layers).
(...) Execute ... networks in parallel, with their output concatenated in depth.
S<y>,<x> Rescale 2-D input by shrink factor y,x, rearranging the data by
  increasing the depth of the input by factor xy.
  NOTE that the TF implementation of VGSLSpecs has a different S that is
  not yet implemented in Tesseract.
Mp<y>,<x> Maxpool the input, reducing each (y,x) rectangle to a single value.
```
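A quick numpy illustration (not Tesseract code) of what the parallel `(...)` op means: the branches see the same input and their outputs are concatenated in the depth (last) dimension, so their heights and widths must agree; the branch shapes below are made up for the example.

```python
import numpy as np

# Pretend outputs of two parallel branches over the same input,
# e.g. something like (Cr3,3,16 Cr5,5,16) - illustrative shapes only.
branch_a = np.zeros((1, 20, 50, 16), dtype=np.float32)
branch_b = np.zeros((1, 20, 50, 16), dtype=np.float32)

# Parallel op: concatenate along depth (the last axis).
merged = np.concatenate([branch_a, branch_b], axis=3)
print(merged.shape)   # (1, 20, 50, 32)
```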
### Full Example: A 1-D LSTM capable of high quality OCR
```
[1,1,0,48 Lbx256 O1c105]
```
As layer descriptions (input layer at the bottom, output at the top):
```
O1c105: Output layer produces 1-d (sequence) output, trained with CTC,
  outputting 105 classes.
Lbx256: Bi-directional LSTM in x with 256 outputs.
1,1,0,48: Input is a batch of 1 image of height 48 pixels in greyscale, treated
  as a 1-dimensional sequence of vertical pixel strips.
[]: The network is always expressed as a series of layers.
```
This network works well for OCR, as long as the input image is carefully normalized in the vertical direction, with the baseline and meanline in constant places.
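A rough numpy sketch of what the `1,1,0,48` input spec implies (an assumption about the layout for illustration, not taken from Tesseract's input code): a 48-pixel-tall greyscale line image becomes a 1-d sequence of width-many vertical strips, i.e. a [1, 1, width, 48] tensor.

```python
import numpy as np

# A greyscale text line, 48 pixels high and 300 wide, stored as (height, width).
line = np.random.rand(48, 300).astype(np.float32)

# Height 1 + depth 48: each column of pixels becomes the depth vector of one
# sequence step, giving a [batch=1, height=1, width=300, depth=48] tensor.
strips = line.T[np.newaxis, np.newaxis, :, :]
print(strips.shape)                                     # (1, 1, 300, 48)
assert np.array_equal(strips[0, 0, 10], line[:, 10])    # step 10 == pixel column 10
```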
### Full Example: A multi-layer LSTM capable of high quality OCR

```
[1,0,0,1 Ct5,5,16 Mp3,3 Lfys64 Lfx128 Lrx128 Lfx256 O1c105]
```

As layer descriptions (input layer at the bottom, output at the top):
```
O1c105: Output layer produces 1-d (sequence) output, trained with CTC,
  outputting 105 classes.
Lfx256: Forward-only LSTM in x with 256 outputs.
Lrx128: Reverse-only LSTM in x with 128 outputs.
Lfx128: Forward-only LSTM in x with 128 outputs.
Lfys64: Dimension-summarizing LSTM, summarizing the y-dimension with 64 outputs.
Mp3,3: 3 x 3 Maxpool.
Ct5,5,16: 5 x 5 Convolution with 16 outputs and tanh non-linearity.
1,0,0,1: Input is a batch of 1 image of variable size in greyscale.
[]: The network is always expressed as a series of layers.
```
The summarizing LSTM makes this network more resilient to vertical variation in the position of the text.
## Variable-sized inputs and summarizing LSTMs
Note that currently the only way of collapsing a dimension of unknown size to a known size (of 1) is through the use of a summarizing LSTM. A single summarizing LSTM will collapse one dimension (x or y), leaving a 1-d sequence. The 1-d sequence can then be collapsed in the other dimension to make a 0-d categorical (softmax) or embedding (logistic) output.

For OCR purposes, then, the height of the input images must either be fixed and scaled down vertically (using Mp or S) by the upper layers, or, to allow variable-height images, a summarizing LSTM must be used to collapse the vertical dimension to a single value. A summarizing LSTM can also be used with a fixed-height input.
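An illustrative shape sketch of this two-step collapse (not Tesseract code, using the op shapes described above; the choice of `Lfys64` followed by `Lfxs96` is just an example spec fragment):

```python
# Collapsing both variable dimensions with summarizing LSTMs, e.g. Lfys64 then Lfxs96.
# 0 marks a variable dimension.
shape = [1, 0, 0, 3]                      # variable-height, variable-width colour input
shape = [shape[0], 1, shape[2], 64]       # Lfys64: y summarized -> a 1-d sequence along x
print("after Lfys64:", shape)             # [1, 1, 0, 64]
shape = [shape[0], shape[1], 1, 96]       # Lfxs96: x summarized -> a single 0-d vector
print("after Lfxs96:", shape)             # [1, 1, 1, 96] -> suitable for a 0-d (category) output
```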