Working with Ground Truth - wanghaisheng/awesome-ocr GitHub Wiki

To train ocropy you need images of text lines together with their ground truth, i.e. the correct transcriptions. The ground truth can also be used to estimate the error rate (equivalently, the accuracy) of the recognition. You can produce the ground truth manually by correcting the recognized text with the command ocropus-gtedit.

File structure

The files and folders in ocropy follow fixed rules. Assume you have the page images of a book saved in one folder. For each page a subfolder is created, named with an incrementing four-digit number (e.g. 0001). In each subfolder there are several files per text line: the line image (.png), the recognized text (.txt), and the ground truth text (.gt.txt). Not all of these files are present at the beginning.

When calling the commands on several files, it is useful to use placeholders such as ? or *. For example, the pattern temp/????/??????.bin.png refers to all binarized line images (numbered with six digits) on all pages (numbered with four digits) in the folder temp.
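The same placeholder pattern can be used from Python to check which files already exist for each line. This is only a minimal sketch: the function name list_line_files is hypothetical, and it assumes that the recognized text and the ground truth sit next to each line image with the suffixes .txt and .gt.txt, as described in the file structure above.

```python
import glob
import os


def list_line_files(book_dir):
    """Yield (image, has_txt, has_gt) for every binarized line image."""
    # Same placeholders as on the command line: 4-digit page, 6-digit line
    pattern = os.path.join(book_dir, "????", "??????.bin.png")
    for image in sorted(glob.glob(pattern)):
        base = image[: -len(".bin.png")]
        yield image, os.path.exists(base + ".txt"), os.path.exists(base + ".gt.txt")


if __name__ == "__main__":
    # "temp" is the example book folder used throughout this page
    for image, has_txt, has_gt in list_line_files("temp"):
        print(image, "txt" if has_txt else "-", "gt" if has_gt else "-")
```

Lines that show "-" in the last column still need ground truth.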

Correction page

After running ocropus-rpred we can create a correction page, e.g.

> ocropus-gtedit html temp/????/??????.bin.png -o temp-correction.html

Note that this command assumes that the binarized line images and the recognized text (.txt) files are already present. See https://github.com/tmbdev/ocropy/blob/master/run-test for an example of the whole workflow. The correction page can be opened with Firefox:

> firefox temp-correction.html &

[Screenshot: correction page]

You see the images of all the text lines, and below each image an input field containing the recognized text. Correct the text in these input fields manually to produce the ground truth. In the screenshot, for example, the missing letter I has to be added to the word REVIEW. Save the result again as an HTML page.

Extract ground truth

The last step is to extract the ground truth, i.e. the correct transcription of each line, from the corrected page:

> ocropus-gtedit extract temp-correction.html

This generates all the .gt.txt files.
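Once the .gt.txt files exist, the error rate mentioned in the introduction can be estimated by comparing each recognized line with its ground truth. The following is a rough sketch, not ocropy's own evaluation code: the function names are hypothetical, the files are assumed to be UTF-8, and the error rate is taken to be the total character edit distance divided by the total ground truth length.

```python
import glob
import os


def edit_distance(a, b):
    """Levenshtein distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # delete ca
                           cur[j - 1] + 1,              # insert cb
                           prev[j - 1] + (ca != cb)))   # substitute or match
        prev = cur
    return prev[-1]


def character_error_rate(book_dir):
    """Total edit distance divided by total ground truth length."""
    errors = total = 0
    for gt_path in glob.glob(os.path.join(book_dir, "????", "??????.gt.txt")):
        txt_path = gt_path[: -len(".gt.txt")] + ".txt"
        if not os.path.exists(txt_path):
            continue  # no recognition result for this line yet
        with open(gt_path, encoding="utf-8") as f:
            truth = f.read().strip()
        with open(txt_path, encoding="utf-8") as f:
            recognized = f.read().strip()
        errors += edit_distance(recognized, truth)
        total += len(truth)
    return errors / total if total else 0.0
```

For the REVIEW example above, a line recognized as REVEW contributes one error out of six ground truth characters.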

All subcommands of `gtedit`

Two of the subcommands of ocropus-gtedit are described above. There are some more subcommands which can be handy when working with ground truth. The complete list is:

html: Generates an HTML page containing the line images and the recognized text, which is useful for manual correction. Example call: ocropus-gtedit html temp/????/??????.bin.png -o temp-correction.html
extract: Extracts the text from the HTML page after correction. Example call: ocropus-gtedit extract temp-correction.html
text: Creates a single text file containing the contents of all the individual text files given as input. Example call: ocropus-gtedit text -o gen-text.txt temp/????/??????.txt
org: The same as text, but the output file is structured differently (org style). Example call: ocropus-gtedit org -o org.txt temp/????/??????.txt
write: Writes the text of the individual lines from one combined text file back into individual files in the book directory. The combined file must have the format produced by text or org. Example call: ocropus-gtedit write -x .test.txt gen-text.txt temp

Each subcommand also has a help option -h that lists its other parameters, e.g. ocropus-gtedit write -h.

The structure of the text file or org file looks like the following:

[Screenshots: text format, org format]

Preparing Chinese word-list text

Sources: https://github.com/ling0322/webdict https://github.com/daya-prac/Rime_custom_dict

1. Split by line

import os
import shutil
import subprocess


# Create the directory if it does not exist yet
def make_dirs(path):
    os.makedirs(path, exist_ok=True)


# Count the lines in a file (0 if the file does not exist)
def get_total_lines(file_path):
    if not os.path.exists(file_path):
        return 0
    with open(file_path, 'rb') as f:
        return sum(1 for _ in f)


# split_file_by_row: split a file into chunks of at most row_cnt lines
# filepath: the file to split
# new_filepath: directory that receives the new files
# row_cnt: maximum number of lines per file
# suffix_type: suffix style passed to split ('-d' for two digits, '' for two letters)
# return: list of the files produced by the split
def split_file_by_row(filepath, new_filepath, row_cnt, suffix_type='-d'):
    # Nothing to split if the file is missing or empty
    if get_total_lines(filepath) == 0:
        return []

    # Split into a temporary subdirectory so that only the new chunks are listed
    tmp_dir = os.path.join(new_filepath, 'split_file_by_row')
    make_dirs(tmp_dir)

    command = ['split', '-l', str(row_cnt), '-a', '2']
    if suffix_type:
        command.append(suffix_type)
    command += [filepath, tmp_dir + os.sep]
    subprocess.check_call(command)

    # Move the chunks into the target directory and remove the temporary one
    filelist = sorted(os.listdir(tmp_dir))
    for fn in filelist:
        shutil.move(os.path.join(tmp_dir, fn), os.path.join(new_filepath, fn))
    os.rmdir(tmp_dir)

    return [os.path.join(new_filepath, fn) for fn in filelist]
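As an alternative to calling the external split command, the same line-based splitting can be sketched in pure Python. The function name split_lines, the prefix parameter, the two-digit numeric suffix, and the UTF-8 encoding are assumptions made for illustration and are not part of the original notes.

```python
import itertools
import os


def split_lines(src, out_dir, rows_per_file, prefix="part"):
    """Write src into out_dir as prefix00, prefix01, ... with at most
    rows_per_file lines in each file, and return the list of new paths."""
    os.makedirs(out_dir, exist_ok=True)
    written = []
    with open(src, encoding="utf-8") as f:
        for index in itertools.count():
            # Take the next rows_per_file lines from the open file
            chunk = list(itertools.islice(f, rows_per_file))
            if not chunk:
                break
            path = os.path.join(out_dir, "%s%02d" % (prefix, index))
            with open(path, "w", encoding="utf-8") as out:
                out.writelines(chunk)
            written.append(path)
    return written
```

This avoids the temporary directory, because the function itself knows which files it has written.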

http://stackoverflow.com/questions/22970951/split-each-line-in-a-text-file-into-new-text-files-via-command-line http://stackoverflow.com/questions/21093626/split-file-using-awk

tr '\n' ' ' < input_filename
➜  ocr_text tr  ' ' '\n' <单字-1.txt  > result.txt
➜  ocr_text wc -l result.txt                      
 12373376 result.txt

awk '{ close(fn); fn = $1 ".txt"; print >> fn }' TS129.txt
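The awk one-liner above appends each line to a file named after its first whitespace-separated field. A rough Python equivalent might look like the sketch below; the function name split_by_first_field and the out_dir parameter are hypothetical, and unlike awk it skips blank lines instead of writing them to a file called ".txt".

```python
import os


def split_by_first_field(src, out_dir):
    """Append every line of src to out_dir/<first field>.txt."""
    os.makedirs(out_dir, exist_ok=True)
    with open(src, encoding="utf-8") as f:
        for line in f:
            fields = line.split()
            if not fields:
                continue  # blank line: no first field to name a file after
            target = os.path.join(out_dir, fields[0] + ".txt")
            # Open in append mode so repeated keys accumulate in one file
            with open(target, "a", encoding="utf-8") as out:
                out.write(line)
```

A first field that contains a path separator would need extra handling before it is used as a file name.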

awk '{print $1}' file | sort -u | wc -l

➜ ocr_text awk '{print $1}' result.txt | sort -u | wc -l
1521
➜ ocr_text awk 'NR==FNR{a[i]=$0;i++}NR>FNR{print a[j]" "$0;j++}' result2.txt result_uniq.txt result3.txt > result.txt
➜ ocr_text wc -l result.txt
352794 result.txt
➜ ocr_text sort result.txt | uniq -u | wc -l
10735
➜ ocr_text sort result.txt | uniq -u > result_uniq.txt
➜ ocr_text sort result_uniq.txt | uniq -u | wc -l
10735

cat test1.txt | sort | uniq

sort file | uniq -u | wc -l
➜ ocr_text sort result.txt | uniq > result_uniq.txt
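The notes use both uniq and uniq -u, which do different things: sort | uniq (like sort -u) keeps one copy of every distinct line, while sort | uniq -u keeps only the lines that occur exactly once. The sketch below illustrates the difference in Python; the function names are hypothetical, and Python's default string ordering may differ from the locale-dependent order of sort.

```python
from collections import Counter


def distinct_lines(lines):
    """Like `sort | uniq` (or `sort -u`): every different line once, sorted."""
    return sorted(set(lines))


def lines_occurring_once(lines):
    """Like `sort | uniq -u`: only the lines that appear exactly once, sorted."""
    counts = Counter(lines)
    return sorted(line for line, n in counts.items() if n == 1)


if __name__ == "__main__":
    sample = ["b", "a", "b", "c"]
    print(distinct_lines(sample))        # all three different lines
    print(lines_occurring_once(sample))  # "b" is dropped because it repeats
```

For building a list of distinct dictionary entries, the first behaviour is usually the one wanted.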

➜ ocr_text awk 'NR==FNR{a[i]=$0;i++}NR>FNR{print a[j]" "$0;j++}' result2.txt result_uniq.txt result3.txt > result.txt

http://stackoverflow.com/questions/16327566/unique-lines-in-bash http://stackoverflow.com/questions/618378/select-unique-or-distinct-values-from-a-list-in-unix-shell-script
