SourceInspector - Gnurro/FinetuneReFormatter GitHub Wiki

SourceInspector Mode

Shows the content of plaintext .txt files and checks for common issues in raw text data.
(Screenshot: SourceInspector Mode in LineEnd mode)
The token count can be shown at the top by clicking the [Count tokens] button. The 'Instant token count' checkbox recounts tokens on every change to the text, but may make ReFormatter unresponsive when handling large texts.

Line Checking Modes

There are currently three modes for checking newline issues. The mode is selected using the dropdown menu at the top.
(Screenshot: Newline Mode Selector)
If SourceInspector finds 'bad newlines', clicking the [Move cursor to bad line] button moves the text editor cursor to the end of the first 'bad line' found in the text.
(Navigating through all found locations is planned for upcoming versions.)
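
To illustrate what "the end of the first bad line" means as a cursor position, here is a minimal sketch. It is not taken from baseGUI.py: the function name, the `is_bad` callback, and the use of a plain character offset (rather than whatever index format the GUI toolkit expects) are all assumptions made for this example.

```python
def offset_of_first_bad_line_end(text, is_bad):
    """Return the character offset just after the first bad line, or None.

    `is_bad` is a hypothetical callback that takes one line (without its
    newline) and returns True if the active checking mode rejects it.
    """
    offset = 0
    for line in text.split("\n"):
        if is_bad(line):
            # The cursor goes to the end of the line, before its newline.
            return offset + len(line)
        # Skip past this line and the newline character that follows it.
        offset += len(line) + 1
    return None


sample = "Good.\nBad\nAlso bad"
position = offset_of_first_bad_line_end(sample, lambda line: not line.endswith("."))
```

In this example `sample[:position]` is the text up to and including the first bad line, so the cursor would sit directly after "Bad".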

LineEnd

LineEnd mode checks the end of each line for the presence of certain characters/strings and considers lines not ending in any of them bad.
The current list of proper 'line enders': '.', '!', '?', '<|endoftext|>', '”', '“', ':', '—', '*', ')', '_', '’', ']', ',', '"'
(These will be configurable through settings in upcoming versions, but can currently be changed by editing line 353 of baseGUI.py.)
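
The LineEnd check can be sketched roughly as follows. This is an illustration of the rule described above, not the code in baseGUI.py. The function name is hypothetical, and this sketch also flags empty lines and lines with trailing whitespace, which may differ from what ReFormatter does.

```python
# Line enders copied from the list above.
LINE_ENDERS = (
    ".", "!", "?", "<|endoftext|>", "”", "“", ":", "—",
    "*", ")", "_", "’", "]", ",", '"',
)


def find_bad_line_ends(text):
    """Return 0-based indices of lines that do not end in a line ender."""
    return [
        index
        for index, line in enumerate(text.split("\n"))
        # str.endswith accepts a tuple and matches any of its entries.
        if not line.endswith(LINE_ENDERS)
    ]


bad = find_bad_line_ends("A full sentence.\nA broken\nline.\nTitle")
```

Here the second and fourth lines ("A broken" and "Title") are reported, because neither ends in one of the listed strings.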

InLine

InLine mode checks for the presence of 'line enders' anywhere in each line, and considers lines that lack any 'line enders' bad.
This mode will not consider headlines or paragraph headers bad, but still spots lines that most likely contain only a fragment of a sentence.
Uses the same list of 'line enders' as LineEnd mode (see above).
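
A minimal sketch of the InLine rule, under the same assumptions as the description above: a line is bad only if none of the line enders occurs anywhere in it. The function name is hypothetical and the code is not taken from baseGUI.py.

```python
# Line enders copied from the LineEnd list.
LINE_ENDERS = (
    ".", "!", "?", "<|endoftext|>", "”", "“", ":", "—",
    "*", ")", "_", "’", "]", ",", '"',
)


def find_lines_without_enders(text):
    """Return 0-based indices of lines that contain no line ender at all."""
    return [
        index
        for index, line in enumerate(text.split("\n"))
        if not any(ender in line for ender in LINE_ENDERS)
    ]


sample = "Chapter 1: Start\nHe said hello. Then he\nleft without a\nword."
bad = find_lines_without_enders(sample)
```

Only the third line ("left without a") is reported. The header line passes because it contains a colon, and the second line passes because it contains a full stop, even though neither of them ends in a line ender.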

NoDoubles

NoDoubles mode is simpler than the other two modes, and considers empty lines bad, as these indicate the presence of multiple newline characters in a row.
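
The NoDoubles rule can be sketched like this. The function name is hypothetical, and two details are assumptions of this sketch rather than documented behaviour: lines containing only whitespace are not treated as empty, and a single newline at the very end of the text is not counted as an empty line.

```python
def find_empty_lines(text):
    """Return 0-based indices of completely empty lines."""
    lines = text.split("\n")
    # A single trailing newline produces one final empty string when
    # splitting; drop it, since it is not a doubled newline.
    if text.endswith("\n"):
        lines = lines[:-1]
    return [index for index, line in enumerate(lines) if line == ""]


bad = find_empty_lines("First.\n\nSecond.\nThird.\n")
```

Here only the empty line between "First." and "Second." is reported.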

Warnings

EndOfText

This warning is shown if the '<|endoftext|>' string is not present at the very end of the text.

Newline at end

This warning is shown if there is a newline character at the very end of the text.
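
Both warnings amount to simple checks on the end of the text, as in the sketch below. The function name and the returned labels are made up for this example, and "the very end" is read literally, so a text ending in '<|endoftext|>' followed by a newline would trigger both warnings here. ReFormatter itself may handle that case differently.

```python
END_OF_TEXT = "<|endoftext|>"


def end_of_text_warnings(text):
    """Return the names of the warnings that apply to `text`."""
    warnings = []
    if not text.endswith(END_OF_TEXT):
        warnings.append("EndOfText")
    if text.endswith("\n"):
        warnings.append("Newline at end")
    return warnings


clean = end_of_text_warnings("The end.<|endoftext|>")
trailing = end_of_text_warnings("The end.\n")
```

In this example `clean` is an empty list, while `trailing` contains both warnings.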