file_analysis.py - haltosan/RA-python-tools GitHub Wiki
Overview
At the top of the file, we have some variables that should be changed depending on your system and what you're working on.
Variable | Use |
---|---|
pwd | The directory your python shell will be in when it starts up |
defaultRegex | This will be used by collect() if no argument is given |
Here are some variable name conventions I used.
- chk - check file (leftover data that wasn't processed; you'll need to check this data again for important information)
- text - string
- texts - list of text
- outl - out list
- outs - out string
By 'file', I mean a list of strings like the one returned from get('example.txt').
Predicates
Predicates are functions used primarily by the cleaning functions. They return true or false depending on whether the text matches a specific format. The Predicates.py module (included as p) already contains many predicate functions, so refer to the module and its docs for more information.
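As a rough illustration of the predicate idea, here is a minimal sketch of what predicates like p.blank and p.long might look like (these are stand-ins, not the actual Predicates.py implementations; the length cutoff is an assumption):

```python
# Hypothetical stand-ins for predicates like p.blank and p.long;
# the real implementations live in Predicates.py.
def blank(text: str) -> bool:
    """True if the line is empty or only whitespace."""
    return text.strip() == ''

def long(text: str) -> bool:
    """True if the line is longer than a few characters (cutoff assumed here)."""
    return len(text.strip()) > 3

print(blank('   '))   # True
print(long('hello'))  # True
```

Any function with this shape (one string in, one bool out) can be passed wherever a predicate is expected.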
Cleaning
Cleaning functions remove undesirable text, which can make NLP and regex writing easier. Here are some simple examples:
raw = get('example.txt')
withoutBlanks = cleanFile(raw, pred = p.blank)
withoutShortLines = cleanFile(withoutBlanks, pred = p.long)
withPageString = cleanFile(withoutShortLines, pred = p.hasPage, negatePred = True)
save(withPageString, 'output.txt')
Check Files
calcScores(raw, check)
Given a raw text file (separated by pages, as in get("input.txt", splt="PAGE")) and the corresponding check file (lines that are missing from the output), it returns a list that indicates errors per page. By error, I mean how many lines are missing from the output on that specific page.
This can be useful for finding out whether specific pages need to be redone and where the errors are.
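A minimal sketch of the per-page error count described above (this is an assumed implementation, not the library's calcScores; it simply counts how many check-file lines appear on each page):

```python
def calc_scores(raw_pages, check):
    """Sketch of calcScores(): count, per page, how many check-file lines
    (lines missing from the output) came from that page."""
    return [sum(1 for line in check if line and line in page)
            for page in raw_pages]

pages = ['Adams Baker', 'Clark Davis']     # two pages of raw text
missing = ['Baker', 'Clark', 'Davis']      # lines missing from the output
print(calc_scores(pages, missing))         # [1, 2]
```

Page 1 is missing one line, page 2 is missing two, so page 2 is the better candidate for redoing.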
findLarge(n, scores)
Returns the indices in scores of any number larger than or equal to n. This can be used with calcScores() to determine whether specific pages need to be redone.
getSame(chk, other, rems=[' ', ',', '"'])
Given a chk (check) file and some other file, return a list of all lines that are the same in the two files (after running clean() with rems).
getContext(chk, other, rems=[' ', ',', '"'])
Creates a context file from a chk (check) file and some other file (the raw text file). Just like getSame(), it runs clean() with rems.
A context file contains, for each missing line, the line before, the missing line, and the line after. The 'missing line' is just a line from the check file. The surrounding lines are recorded so the data can be ordered correctly. Context files are used in the merge function.
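To make the before/line/after structure concrete, here is a minimal sketch of building context triples (an assumed implementation that skips the clean() step the real getContext performs):

```python
def get_context(chk, other):
    """Sketch of the context-file idea: for each missing line, record the
    line before and the line after it in the raw file, so the merge step
    can later put the missing line back in the right place."""
    context = []
    for line in chk:
        if line in other:
            i = other.index(line)
            before = other[i - 1] if i > 0 else ''
            after = other[i + 1] if i < len(other) - 1 else ''
            context.append([before, line, after])
    return context

raw = ['Adams, Amy', 'Baker, Bo', 'Clark, Cal']
missing = ['Baker, Bo']
print(get_context(missing, raw))
# [['Adams, Amy', 'Baker, Bo', 'Clark, Cal']]
```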
merge(inName, contextName, regex=defaultRegex)
Given the inName of an output file (after regex) and the contextName (name of the context file), return a collect()-ed file with the lines from the check file merged in.
Clean Functions
Syntax
clean(text: str, rems : Union[str,list] = None, negate : bool = False) -> str
Removes all characters/strings in rems from text. negate is currently a work in progress and kept mainly for compatibility with other functions.
cleanFile(texts: list, pred : predicateFunction = None, cleaner : cleanerFunction = None, cleanerArg=None, negatePred=False, negateClean=False) -> list
Scans through texts and removes all lines that make pred (a predicate type function) false. It then runs cleaner (a clean type function) on each line with the cleanerArg. negatePred removes lines that make pred true. negateClean negates the clean type function.
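The filter-then-clean pipeline described above can be sketched as follows (an assumed implementation of the two-phase behavior, not the library's code; the clean() helper here is also a stand-in):

```python
def clean_file(texts, pred=None, cleaner=None, cleaner_arg=None,
               negate_pred=False, negate_clean=False):
    """Sketch of cleanFile(): drop lines that make pred false (or true,
    when negate_pred is set), then run cleaner on each surviving line."""
    out = []
    for line in texts:
        if pred is not None:
            keep = pred(line)
            if negate_pred:
                keep = not keep
            if not keep:
                continue
        if cleaner is not None:
            line = cleaner(line, cleaner_arg, negate_clean)
        out.append(line)
    return out

def clean(text, rems, negate=False):
    """Stand-in clean(): strip every string in rems out of text."""
    for r in (rems or []):
        text = text.replace(r, '')
    return text

print(clean_file(['a#b', '', 'c$d'],
                 pred=lambda t: t != '',
                 cleaner=clean, cleaner_arg=['#', '$']))  # ['ab', 'cd']
```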
cleanWords(text: str, pred, negate=False) -> str
Removes all words (text separated by spaces) in text that make pred false. Can negate pred.
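A minimal sketch of the word-level filtering (an assumed implementation; the repeated-letters predicate is a stand-in for p.repeatedLetters):

```python
def clean_words(text, pred, negate=False):
    """Sketch of cleanWords(): keep only words for which pred is true
    (or false, when negate is set)."""
    return ' '.join(w for w in text.split(' ') if pred(w) != negate)

# Stand-in for p.repeatedLetters: a word made of one repeated letter.
repeated = lambda w: len(set(w)) == 1 and len(w) > 1

print(clean_words('It contains sssss every wish', repeated, negate=True))
# 'It contains every wish'
```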
cleanChars(text: str, pred : predicateFunction, negate=False) -> str
Removes all characters from text that make pred false. Can negate pred.
cleanColumn(texts: list, n: int, cleaner=cleanChars, cleanerArg=p.nameChar) -> list
Given a csv formatted texts list (like after get('someFile.csv')
), run cleaner (a cleaner type function) on each line of the column numbered n (starting at index 0). The cleanerArg is passed to cleaner.
charStrip(texts: Union[str,list], chars: Union[str,list], negate=False) -> list
Strip off chars characters from texts. negate for compatibility only.
borderBlocks(i, f, quiet = True)
Helper function for ghostBuster()
ghostBuster(f : list, quiet = True)
Given file f, finds all lines that are out of place or clearly wrong. It does this by looking at the alphabetical ordering and flagging lines that start with a letter that is out of order. Returns a list of anomalies, each preceded by the letters expected around that line.
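The alphabetical-order check can be sketched like this (an assumed simplification of ghostBuster: it only compares each line's first letter against its immediate neighbors):

```python
def ghost_buster(lines):
    """Sketch of ghostBuster()'s core idea: flag any line whose first
    letter falls outside the alphabetical range of its neighbors."""
    anomalies = []
    for i in range(1, len(lines) - 1):
        prev, cur, nxt = (l[:1].upper()
                          for l in (lines[i - 1], lines[i], lines[i + 1]))
        # Only flag when the neighbors themselves are in order,
        # so a single bad line doesn't also implicate its neighbors.
        if not (prev <= cur <= nxt) and prev <= nxt:
            anomalies.append((f'expected between {prev} and {nxt}', lines[i]))
    return anomalies

print(ghost_buster(['Adams', 'Baker', 'zzGarbled', 'Clark', 'Davis']))
# [('expected between B and C', 'zzGarbled')]
```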
fileStrip(f : list, cleanerArg=',. ', maxCol = 3)
Runs charStrip() on each column, up to maxCol. maxCol is the index (so column number - 1) of the last column we care about. cleanerArg is passed on to charStrip().
bestPages(dirname='21', regex=R1)
Scans through the directory dirname and runs regex on each file. The file with the most regex matches is then reported. You can use this to figure out which OCR settings work best for a specific document, and by how much.
Examples
Removing bad characters from a string
>>> string1 = "This# is$ a bu3cket"
>>> clean(string1, rems = ['#', '$', '3'])
'This is a bucket'
Sanitizing a text file until it becomes clean, usable text.
>>> raw = get('example.txt')
>>> raw
['This# is$ a bu3cket', ' ', "There's more..........", '', '', '', "It contains sssss every man's dying wish"]
>>> noBlankLines = cleanFile(raw, pred = p.long)
>>> noBlankLines
['This# is$ a bu3cket', "There's more..........", "It contains sssss every man's dying wish"]
>>> removeExtraChars = cleanFile(noBlankLines, cleaner = clean, cleanerArg = ['#', '$', '3'])
>>> removeExtraChars
['This is a bucket', "There's more..........", "It contains sssss every man's dying wish"]
>>> strippedChars = cleanFile(removeExtraChars, cleaner = charStrip, cleanerArg = '.')
>>> strippedChars
['This is a bucket', "There's more", "It contains sssss every man's dying wish"]
>>> removeSss = cleanFile(strippedChars, cleaner = cleanWords, cleanerArg = p.repeatedLetters, negateClean = True)
>>> removeSss
['This is a bucket', "There's more", "It contains every man's dying wish"]
- p.long() removes short lines as well as blank lines
- clean() removes every character that matches the blacklist
- charStrip() strips all the period characters off the ends of each line
- cleanWords() removes every word that makes p.repeatedLetters() true (because negateClean is set)
Additionally, lambda expressions can be passed in as long as they meet the specifications of a predicate function.
>>> noJarjar = cleanFile(["Me-sa Jarjar Binks"], cleaner = cleanWords, cleanerArg = lambda txt : not txt == "Jarjar")
>>> noJarjar
['Me-sa Binks']
Collect
Usage
collect(text: Union[str, list], regex=defaultRegex, spaceMatches=False) -> tuple
If you used the NLP notebook, this is the same function; it just has a few more features.
Given some text, return matches and non_matches based on the regex. spaceMatches has several possible values: True (put a blank line in matches if there is no match), False (don't add anything to matches if there is no match), "keep" (put the entire line in matches if there is no match).
Returns (matches, non_matches).
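A minimal sketch of the matching behavior, assuming a simple named-group regex (the regex below is only an illustration, not the library's defaultRegex):

```python
import re

def collect(texts, regex=r'(?P<first>\w+) (?P<last>\w+), (?P<info>.+)',
            space_matches=False):
    """Sketch of collect(): split lines into matches and non-matches.
    space_matches: True -> blank entry in matches for a non-matching line,
    False -> nothing added, 'keep' -> the whole line is added to matches."""
    matches, non_matches = [], []
    for line in texts:
        m = re.search(regex, line)
        if m is not None:
            matches.append(list(m.groupdict().values()))
        else:
            non_matches.append(line)
            if space_matches is True:
                matches.append('')
            elif space_matches == 'keep':
                matches.append(line)
    return matches, non_matches

m, n = collect(['John Smith, English', '---page 3---'])
print(m)  # [['John', 'Smith', 'English']]
print(n)  # ['---page 3---']
```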
Modification
If you want a specific ordering for the matches, edit the following lines (about 10 and 25 lines into the function):
if i is not None:
matches.append(list(i.groupdict().values()))
# matches.append((i.groupdict()['first'].strip() + ' ' + i.groupdict()['last'].strip(), i.groupdict()['info']))
The commented line is an example of how to change the list order.
CSV Functions
Syntax
csvJoin(texts: list) -> str
Returns texts as a csv string.
csvSplit(text: str) -> list
Converts text (a csv string) to a list of texts.
csvText(text: str) -> str
Converts text to a csv string.
csvColumn(texts: list, n: int, safe=True) -> list
Gets column number n from texts (a csv table, like the one from get("table.csv")). If safe is false, it will throw errors when out of bounds (such as when a line isn't n cells long). If safe is true, it will just fill in blank cells.
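The safe/unsafe behavior can be sketched as follows (an assumed implementation built on the standard csv module, not the library's code):

```python
import csv

def csv_column(texts, n, safe=True):
    """Sketch of csvColumn(): pull column n from a list of csv lines.
    With safe=True, rows shorter than n+1 cells yield a blank cell
    instead of raising an IndexError."""
    out = []
    for line in texts:
        row = next(csv.reader([line])) if line else []
        if safe:
            out.append(row[n] if n < len(row) else '')
        else:
            out.append(row[n])
    return out

print(csv_column(['a1,b1,c1', 'a2,b2,c2', ''], 0))  # ['a1', 'a2', '']
```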
csvMergeColumn(fullTexts: list, column: list, n: int) -> list
Replaces column number n from fullTexts with column.
Examples
Create a csv string from a list
>>> csvJoin(['Smith, John', 'English'])
'"Smith, John",English'
Extract the first column from a csv table
>>> raw = get('test.csv')
>>> raw
['a1,b1,c1', 'a2,b2,c2', 'a3,b3,c3', '']
>>> csvColumn(raw, 0)
['a1', 'a2', 'a3', '']
General Functions
Syntax
init()
Runs on start each time. I use this to move to my working directory (see the pwd variable at the top of the file).
get(fileName: str, splt='\n') -> list
Reads file fileName and returns a list where each element is a line (with the default splt). splt can be used to split the text up some other way (for example, get('foo.txt', splt='PAGE') would make each element a page).
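A minimal sketch of get() and the splt behavior (an assumed implementation; the real function may differ in details like encoding handling):

```python
import tempfile

def get(file_name, splt='\n'):
    """Sketch of get(): read file_name and split its contents on splt."""
    with open(file_name) as f:
        return f.read().split(splt)

# Demo: split a file on the marker 'PAGE' so each element is a page.
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as f:
    f.write('first pagePAGEsecond page')
    name = f.name

print(get(name, splt='PAGE'))  # ['first page', 'second page']
```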
save(text: Union[str, list], out: str, csvStyle=False)
Saves text to a file called out. The csvStyle argument decides whether the list will be saved in a csv-safe format (for example, a 2d list would be saved as a table).
find(item: str, texts: list) -> str
Returns the index in texts where item is found, if the two are sort of similar (by this I mean one is a substring of the other).
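The substring-either-way match can be sketched as (an assumed implementation; the fallback value for no match is a guess):

```python
def find(item, texts):
    """Sketch of find(): index of the first line that is 'sort of similar'
    to item, meaning one string is a substring of the other."""
    for i, line in enumerate(texts):
        if item in line or line in item:
            return i
    return -1  # assumed no-match sentinel

print(find('Smith', ['Jones, Bob', 'Smith, John', 'Doe, Jane']))  # 1
```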
batchPrint(lst:list, batchSize = 10)
Prints out batchSize lines at a time from lst. Useful for printing very large files in a manageable format.
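The batching logic can be sketched with a generator (an assumed decomposition: batchPrint() presumably prints each chunk and waits for input before showing the next):

```python
def batches(lst, batch_size=10):
    """Sketch of batchPrint()'s chunking: yield batch_size lines at a time."""
    for i in range(0, len(lst), batch_size):
        yield lst[i:i + batch_size]

print(list(batches(['a', 'b', 'c', 'd', 'e'], 2)))
# [['a', 'b'], ['c', 'd'], ['e']]
```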
Examples
Read a file in and print the contents of it
>>> raw = get('example.txt')
>>> raw
['Each', 'line', 'is', 'its', 'own', 'cell', 'in', 'the', 'array']
>>> printFile(raw)
Each
line
is
its
own
cell
in
the
array
Saving a list to a .txt file
>>> texts = ['yo ho', 'haul together', 'hoist the colors high']
>>> save(texts, 'lyrics.txt')
>>> get('lyrics.txt')
['yo ho', 'haul together', 'hoist the colors high']
Modifying and saving a text file as a .csv file
>>> raw = get('test.csv')
>>> raw
['a1,b1,c1', 'a2,b2,c2', 'a3,b3,c3', '']
>>> raw[-1] = 'a4,b4,c4'
>>> save(raw, 'test2.csv', csvStyle = True)
Misc Functions
Syntax
headerGrab(texts, headerPred, quite = True)
Finds anything in texts that makes headerPred true and appends it onto all following lines, until a new header is found (by headerPred). With quite turned off, you can manually accept each header. Interactive mode is a work in progress.
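The tag-lines-with-their-header idea can be sketched like this (an assumed implementation without the interactive mode; the all-caps lambda is a stand-in for p.mostlyCaps):

```python
def header_grab(texts, header_pred):
    """Sketch of headerGrab(): tag each line with the most recent header,
    where header_pred decides what counts as a header."""
    out, header = [], None
    for line in texts:
        if header_pred(line):
            header = line
        elif header is not None:
            out.append([line, header])
    return out

mostly_caps = lambda t: t.isupper()  # stand-in for p.mostlyCaps
print(header_grab(['SENIORS', 'Joe Smith', 'JUNIORS', 'Ann Lee'], mostly_caps))
# [['Joe Smith', 'SENIORS'], ['Ann Lee', 'JUNIORS']]
```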
infoGrab(fname, outName)
Pulls info off of names in file fname and saves it to outName.
slapThatInfoOn(texts)
Reorders texts so names and the corresponding info are all on the same line. p.REGULAR_EXPRESSION needs to be set beforehand to recognize names.
Examples
headerGrab
>>> headerGrab(get('exampleFile.txt'), p.mostlyCaps, quite = True)
If the file was:
SENIORS
Joe Smith
Frank Stevenson
the function returns:
[Joe Smith, SENIORS]
[Frank Stevenson, SENIORS]
infoGrab(get('namesAndMajors.txt'), 'output.txt')
John Smith, English
would become ["John Smith", "English"].
slapThatInfoOn(get('example.txt'))
If the file was:
John Smith
Science
1922
Bob Smith
Math
1935
It becomes:
John Smith Science 1922
Bob Smith Math 1935
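The regrouping shown above can be sketched as follows (an assumed implementation; the name regex here is a hypothetical stand-in for whatever p.REGULAR_EXPRESSION is set to):

```python
import re

# Hypothetical stand-in for p.REGULAR_EXPRESSION: "Firstname Lastname".
NAME_RE = re.compile(r'^[A-Z][a-z]+ [A-Z][a-z]+$')

def slap_that_info_on(texts):
    """Sketch of slapThatInfoOn(): start a new record at each name line
    and append the following info lines onto it."""
    out = []
    for line in texts:
        if NAME_RE.match(line):
            out.append(line)
        elif out:
            out[-1] += ' ' + line
    return out

print(slap_that_info_on(
    ['John Smith', 'Science', '1922', 'Bob Smith', 'Math', '1935']))
# ['John Smith Science 1922', 'Bob Smith Math 1935']
```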