Working with Ground Truth - wanghaisheng/awesome-ocr GitHub Wiki
Training ocropy requires images of text lines along with ground truth, i.e. the correct transcription of each line. The ground truth can also be used to estimate the error rate, and thus the accuracy, of the recognition. Ground truth can be generated manually by correcting the recognized text with the command ocropus-gtedit.
The file and folder structure in ocropy follows fixed rules. Assume you have images of a book which are all saved in one folder. For each page a new subfolder is created, named with an incrementing four-digit number (e.g. 0001). In such a subfolder there are several files for each line: the line image (.png), the recognized text (.txt), and the ground truth text (.gt.txt). Not all of these files are present at the beginning.
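The layout described above can be checked with a short script. The following is only a sketch and not part of ocropy: the helper name summarize_book is hypothetical, and it assumes that line images end in .bin.png, recognized text in .txt, and ground truth in .gt.txt.

```python
import os


def summarize_book(book_dir):
    """Count line images, recognized-text files and ground-truth files per page.

    Hypothetical helper, not part of ocropy. Assumes one subfolder per page
    and per-line files ending in .bin.png, .txt and .gt.txt.
    """
    summary = {}
    for page in sorted(os.listdir(book_dir)):
        page_dir = os.path.join(book_dir, page)
        if not os.path.isdir(page_dir):
            continue
        names = os.listdir(page_dir)
        images = [n for n in names if n.endswith(".bin.png")]
        truth = [n for n in names if n.endswith(".gt.txt")]
        # .gt.txt also ends in .txt, so exclude it from the recognized text
        recognized = [n for n in names
                      if n.endswith(".txt") and not n.endswith(".gt.txt")]
        summary[page] = (len(images), len(recognized), len(truth))
    return summary
```

A page whose ground-truth count is lower than its image count still has lines waiting for correction.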
Placeholders such as ? or * are useful for running a command on several files. For example, the pattern temp/????/??????.bin.png refers to all (binarized) line images (numbered with six digits) on all pages (numbered with four digits) in the folder temp.
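The same placeholders are understood by Python's standard glob module, which can help when scripting around the ocropus commands. This is a minimal sketch; the function name line_images is hypothetical and the pattern is the one quoted above.

```python
import glob
import os


def line_images(book_dir):
    """Return all binarized line images of a book, sorted by page and line.

    Hypothetical helper: '?' matches exactly one character, so '????'
    matches four-character page folders and '??????' six-character line names.
    """
    pattern = os.path.join(book_dir, "????", "??????.bin.png")
    return sorted(glob.glob(pattern))
```

Because ? matches exactly one character, folders or files whose names have a different length are skipped.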
After running ocropus-rpred, a correction page can be created, e.g.
> ocropus-gtedit html temp/????/??????.bin.png -o temp-correction.html
Note that this command assumes that the (binarized) image files as well as the recognized text (txt) are already present. See also https://github.com/tmbdev/ocropy/blob/master/run-test for an example of the whole workflow. The correction page can be opened with Firefox:
> firefox temp-correction.html &
The page shows the images of all the text lines, with the recognized text below each line. The input field containing the recognized text can be corrected manually to generate the ground truth. For example, if the letter I is missing from the word REVIEW in the recognized text, add it. The result can then be saved again (as an HTML page).
The last step is to extract the ground truth, i.e. the correct transcription for each line. This can be done with the command:
> ocropus-gtedit extract temp-correction.html
which generates all the .gt.txt files.
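Once the .gt.txt files exist, they can be compared with the recognized text to estimate the error rate mentioned at the top of this page. The following is a rough sketch, not ocropy's own evaluation: the function names are hypothetical, the files are assumed to be UTF-8, and each NNNNNN.gt.txt is assumed to sit next to a NNNNNN.txt with the recognized text. The error rate is taken as the character-level edit distance divided by the number of ground-truth characters.

```python
import glob
import io
import os


def edit_distance(a, b):
    """Levenshtein distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute
        prev = cur
    return prev[-1]


def character_error_rate(book_dir):
    """Total edit distance divided by total ground-truth length."""
    errors = 0
    total = 0
    pattern = os.path.join(book_dir, "????", "??????.gt.txt")
    for gt_path in glob.glob(pattern):
        txt_path = gt_path[:-len(".gt.txt")] + ".txt"
        if not os.path.exists(txt_path):
            continue  # no recognized text for this line yet
        with io.open(gt_path, encoding="utf-8") as f:
            truth = f.read().strip()
        with io.open(txt_path, encoding="utf-8") as f:
            recognized = f.read().strip()
        errors += edit_distance(recognized, truth)
        total += len(truth)
    return errors / float(total) if total else 0.0
```

Lines without recognized text are skipped rather than counted as errors, which is a design choice of this sketch.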
Two of the subcommands of ocropus-gtedit are described above. There are some more subcommands which can be handy when working with the ground truth. The complete list is:
- html: Generates an HTML page containing the line images as well as the recognized text, which is useful for manual correction.
  Example call:
  ocropus-gtedit html temp/????/??????.bin.png -o temp-correction.html
- extract: Extracts the text from the HTML page after the correction.
  Example call:
  ocropus-gtedit extract temp-correction.html
- text: Creates one text file containing the contents of all the individual text files given as input.
  Example call:
  ocropus-gtedit text -o gen-text.txt temp/????/??????.txt
- org: The same as text, but the file structure is different (org-style).
  Example call:
  ocropus-gtedit org -o org.txt temp/????/??????.txt
- write: Writes the text of the individual lines into individual files in the book directory, reading it from a single text file that contains all the text (the format must match the output of text or org).
  Example call:
  ocropus-gtedit write -x .test.txt gen-text.txt temp
Each subcommand also has a help option -h, which lists the other possible parameters, e.g. ocropus-gtedit write -h.
The structure of the text file or org file looks like the following:
https://github.com/ling0322/webdict https://github.com/daya-prac/Rime_custom_dict
Preparing Chinese lexicon text
1. Splitting by line
import os
import shutil
import subprocess

# Create the directory if it does not exist yet
def make_dirs(path):
    if not os.path.isdir(path):
        os.makedirs(path)

# Return the number of lines in a file (0 if the file does not exist)
def get_total_lines(file_path):
    if not os.path.exists(file_path):
        return 0
    out = subprocess.check_output(['wc', '-l', file_path])
    return int(out.split()[0])

# split_file_by_row: split a file by lines
# filepath: the file to split
# new_filepath: directory in which the new files are created
# row_cnt: maximum number of lines per file
# suffix_type: suffix style of the new files; '-d' (as in GNU split) gives
#              two digits, an empty string gives two letters
# return: list of the files produced by the split
def split_file_by_row(filepath, new_filepath, row_cnt, suffix_type='-d'):
    tmp_dir = os.path.join(new_filepath, 'split_file_by_row')
    make_dirs(tmp_dir)
    # Passing the arguments as a list avoids shell quoting problems
    command = ['split', '-l', str(row_cnt), '-a', '2']
    if suffix_type:
        command.append(suffix_type)
    # The trailing separator makes split write its pieces into tmp_dir
    command += [filepath, tmp_dir + os.sep]
    subprocess.check_call(command)
    filelist = sorted(os.listdir(tmp_dir))
    # Move the pieces up into new_filepath and drop the temporary folder
    for fn in filelist:
        shutil.move(os.path.join(tmp_dir, fn), os.path.join(new_filepath, fn))
    os.rmdir(tmp_dir)
    return [os.path.join(new_filepath, fn) for fn in filelist]
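Where the external split command is not available, the same chunking can be sketched in pure Python. This is an assumption-laden alternative, not the code used above: the function name split_lines is hypothetical, output files are simply numbered 00, 01, and so on, and files are handled in binary mode so that no text encoding has to be guessed.

```python
import itertools
import os


def split_lines(filepath, out_dir, row_cnt):
    """Write consecutive chunks of at most row_cnt lines to out_dir/00, out_dir/01, ...

    Hypothetical pure-Python stand-in for `split -l`. Returns the list of
    files written, in order.
    """
    if not os.path.isdir(out_dir):
        os.makedirs(out_dir)
    written = []
    with open(filepath, "rb") as src:
        for index in itertools.count():
            # islice pulls the next row_cnt lines from the open file
            chunk = list(itertools.islice(src, row_cnt))
            if not chunk:
                break
            out_path = os.path.join(out_dir, "%02d" % index)
            with open(out_path, "wb") as dst:
                dst.writelines(chunk)
            written.append(out_path)
    return written
```

Reading the file lazily keeps memory use bounded by row_cnt lines, which matters for inputs with millions of lines.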
http://stackoverflow.com/questions/22970951/split-each-line-in-a-text-file-into-new-text-files-via-command-line http://stackoverflow.com/questions/21093626/split-file-using-awk
tr '\n' ' ' < input_filename
➜ ocr_text tr ' ' '\n' <单字-1.txt > result.txt
➜ ocr_text wc -l result.txt
12373376 result.txt
awk '{ close(fn); fn = $1 ".txt"; print >> fn }' TS129.txt
awk '{print $1}' file | sort -u | wc -l
➜ ocr_text awk '{print $1}' result.txt | sort -u | wc -l
1521
➜ ocr_text awk 'NR==FNR{a[i]=$0;i++}NR>FNR{print a[j]" "$0;j++}' result2.txt result_uniq.txt result3.txt > result.txt
➜ ocr_text wc -l result.txt
352794 result.txt
➜ ocr_text sort result.txt | uniq -u | wc -l
10735
➜ ocr_text sort result.txt | uniq -u > result_uniq.txt
➜ ocr_text sort result_uniq.txt | uniq -u | wc -l
10735
cat test1.txt | sort | uniq
sort file | uniq -u | wc -l
➜ ocr_text sort result.txt| uniq > result_uniq.txt
➜ ocr_text awk 'NR==FNR{a[i]=$0;i++}NR>FNR{print a[j]" "$0;j++}' result2.txt result_uniq.txt result3.txt > result.txt
http://stackoverflow.com/questions/16327566/unique-lines-in-bash http://stackoverflow.com/questions/618378/select-unique-or-distinct-values-from-a-list-in-unix-shell-script
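The commands above mix two different operations: sort | uniq keeps one copy of every distinct line, whereas sort | uniq -u keeps only the lines that occur exactly once. A small Python sketch makes the difference explicit; the function names are hypothetical, and Python's default string ordering may differ from the locale-dependent ordering of sort.

```python
from collections import Counter


def distinct_lines(lines):
    """Like `sort | uniq`: every distinct line once, sorted."""
    return sorted(set(lines))


def singleton_lines(lines):
    """Like `sort | uniq -u`: only lines that occur exactly once, sorted."""
    counts = Counter(lines)
    return sorted(line for line, n in counts.items() if n == 1)
```

For a deduplicated word list, distinct_lines is usually the one wanted; singleton_lines silently drops every entry that was repeated.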