file_analysis.py - haltosan/RA-python-tools GitHub Wiki
Overview
At the top of the file, we have some variables that should be changed depending on your system and what you're working on.
Variable | Use |
---|---|
pwd | The directory your python shell will be in when it starts up |
defaultRegex | This will be used by collect() if no argument is given |
Here are some variable name conventions I used.
- chk - check file (leftover data that wasn't processed; you'll need to check this data again for important information)
- text - string
- texts - list of text
- outl - out list
- outs - out string
By 'file', I mean a list of strings like the one returned from get('example.txt').
Predicates
Predicates are functions used primarily by the cleaning functions. They return true or false depending on whether the text matches a specific format. The Predicates.py module (included as p) already contains many predicate functions, so refer to the module and its docs for more information.
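As a rough illustration of the predicate idea, here is a minimal sketch of what predicates like p.blank and p.long might look like (these are stand-ins, not the actual Predicates.py implementations; the length cutoff is an assumption):

```python
# Hypothetical stand-ins for predicates like p.blank and p.long;
# the real implementations live in Predicates.py.
def blank(text: str) -> bool:
    """True if the line is empty or only whitespace."""
    return text.strip() == ''

def long(text: str) -> bool:
    """True if the line is longer than a few characters (cutoff assumed here)."""
    return len(text.strip()) > 3

print(blank('   '))   # True
print(long('hello'))  # True
```

Any function with this shape (one string in, one bool out) can be passed wherever a predicate is expected.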
Cleaning
Cleaning functions remove undesirable text, which can make NLP and regex writing easier. Here are some simple examples:
raw = get('example.txt')
withoutBlanks = cleanFile(raw, pred = p.blank)
withoutShortLines = cleanFile(withoutBlanks, pred = p.long)
withPageString = cleanFile(withoutShortLines, pred = p.hasPage, negatePred = True)
save(withPageString, 'output.txt')
Check Files
calcScores(raw, check)
Given a raw text file (separated by pages, as in get("input.txt", splt="PAGE")) and the corresponding check file (lines that are missing from the output), it returns a list that indicates errors per page. By error, I mean how many lines are missing from the output on that specific page.
This can be useful for finding out whether specific pages need to be redone and where the errors are.
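A minimal sketch of the per-page error count described above (this is an assumed implementation, not the library's calcScores; it simply counts how many check-file lines appear on each page):

```python
def calc_scores(raw_pages, check):
    """Sketch of calcScores(): count, per page, how many check-file lines
    (lines missing from the output) came from that page."""
    return [sum(1 for line in check if line and line in page)
            for page in raw_pages]

pages = ['Adams Baker', 'Clark Davis']     # two pages of raw text
missing = ['Baker', 'Clark', 'Davis']      # lines missing from the output
print(calc_scores(pages, missing))         # [1, 2]
```

Page 1 is missing one line, page 2 is missing two, so page 2 is the better candidate for redoing.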
findLarge(n, scores)
Returns the indices in scores of any number larger than or equal to n. This can be used with calcScores() to determine whether specific pages need to be redone.
getSame(chk, other, rems=[' ', ',', '"'])
Given a chk (check) file and some other file, return a list of all lines that are the same in the two files (after running clean() with rems).
getContext(chk, other, rems=[' ', ',', '"'])
Creates a context file from a chk (check) file and some other file (the raw text file). Just like getSame(), it runs clean() with rems.
A context file contains, for each missing line, the line before, the missing line, and the line after. The 'missing line' is just a line from the check file. The surrounding lines are recorded so the data can be ordered correctly. Context files are used in the merge function.
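To make the before/line/after structure concrete, here is a minimal sketch of building context triples (an assumed implementation that skips the clean() step the real getContext performs):

```python
def get_context(chk, other):
    """Sketch of the context-file idea: for each missing line, record the
    line before and the line after it in the raw file, so the merge step
    can later put the missing line back in the right place."""
    context = []
    for line in chk:
        if line in other:
            i = other.index(line)
            before = other[i - 1] if i > 0 else ''
            after = other[i + 1] if i < len(other) - 1 else ''
            context.append([before, line, after])
    return context

raw = ['Adams, Amy', 'Baker, Bo', 'Clark, Cal']
missing = ['Baker, Bo']
print(get_context(missing, raw))
# [['Adams, Amy', 'Baker, Bo', 'Clark, Cal']]
```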
merge(inName, contextName, regex=defaultRegex)
Given the inName of an output file (after regex) and the contextName (name of the context file), return a collect()-ed file with the lines from the check file merged in.
Clean Functions
Syntax
clean(text: str, rems : Union[str,list] = None, negate : bool = False) -> str
Removes all characters/strings in rems from text. negate is currently a work in progress and kept mainly for compatibility with other functions.
cleanFile(texts: list, pred : predicateFunction = None, cleaner : cleanerFunction = None, cleanerArg=None, negatePred=False, negateClean=False) -> list
Scans through texts and removes all lines that make pred (a predicate type function) false. It then runs cleaner (a clean type function) on each line with the cleanerArg. negatePred removes lines that make pred true. negateClean negates the clean type function.
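The filter-then-clean pipeline described above can be sketched as follows (an assumed implementation of the two-phase behavior, not the library's code; the clean() helper here is also a stand-in):

```python
def clean_file(texts, pred=None, cleaner=None, cleaner_arg=None,
               negate_pred=False, negate_clean=False):
    """Sketch of cleanFile(): drop lines that make pred false (or true,
    when negate_pred is set), then run cleaner on each surviving line."""
    out = []
    for line in texts:
        if pred is not None:
            keep = pred(line)
            if negate_pred:
                keep = not keep
            if not keep:
                continue
        if cleaner is not None:
            line = cleaner(line, cleaner_arg, negate_clean)
        out.append(line)
    return out

def clean(text, rems, negate=False):
    """Stand-in clean(): strip every string in rems out of text."""
    for r in (rems or []):
        text = text.replace(r, '')
    return text

print(clean_file(['a#b', '', 'c$d'],
                 pred=lambda t: t != '',
                 cleaner=clean, cleaner_arg=['#', '$']))  # ['ab', 'cd']
```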
cleanWords(text: str, pred, negate=False) -> str
Removes all words (text separated by spaces) in text that make pred false. Can negate pred.
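A minimal sketch of the word-level filtering (an assumed implementation; the repeated-letters predicate is a stand-in for p.repeatedLetters):

```python
def clean_words(text, pred, negate=False):
    """Sketch of cleanWords(): keep only words for which pred is true
    (or false, when negate is set)."""
    return ' '.join(w for w in text.split(' ') if pred(w) != negate)

# Stand-in for p.repeatedLetters: a word made of one repeated letter.
repeated = lambda w: len(set(w)) == 1 and len(w) > 1

print(clean_words('It contains sssss every wish', repeated, negate=True))
# 'It contains every wish'
```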
cleanChars(text: str, pred : predicateFunction, negate=False) -> str
Removes all characters from text that make pred false. Can negate pred.
cleanColumn(texts: list, n: int, cleaner=cleanChars, cleanerArg=p.nameChar) -> list
Given a csv formatted texts list (like after get('someFile.csv')
), run cleaner (a cleaner type function) on each line of the column numbered n (starting at index 0). The cleanerArg is passed to cleaner.
charStrip(texts: Union[str,list], chars: Union[str,list], negate=False) -> list
Strip off chars characters from texts. negate for compatibility only.
borderBlocks(i, f, quiet = True)
Helper function for ghostBuster()
ghostBuster(f : list, quiet = True)
Given file f, finds all lines that are out of place or clearly wrong. It does this by looking at the alphabetical ordering and flagging lines that start with a letter that is out of order. Returns a list of anomalies, each preceded by the letters expected around that line.
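The alphabetical-order check can be sketched like this (an assumed simplification of ghostBuster: it only compares each line's first letter against its immediate neighbors):

```python
def ghost_buster(lines):
    """Sketch of ghostBuster()'s core idea: flag any line whose first
    letter falls outside the alphabetical range of its neighbors."""
    anomalies = []
    for i in range(1, len(lines) - 1):
        prev, cur, nxt = (l[:1].upper()
                          for l in (lines[i - 1], lines[i], lines[i + 1]))
        # Only flag when the neighbors themselves are in order,
        # so a single bad line doesn't also implicate its neighbors.
        if not (prev <= cur <= nxt) and prev <= nxt:
            anomalies.append((f'expected between {prev} and {nxt}', lines[i]))
    return anomalies

print(ghost_buster(['Adams', 'Baker', 'zzGarbled', 'Clark', 'Davis']))
# [('expected between B and C', 'zzGarbled')]
```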
fileStrip(f : list, cleanerArg=',. ', maxCol = 3)
Runs charStrip() on each column, up to maxCol. maxCol is the index (so column number - 1) of the last column we care about. cleanerArg is passed on to charStrip().
bestPages(dirname='21', regex=R1)
Scans through the directory dirname and runs regex on each file. The file with the most regex matches is then reported. You can use this to figure out which OCR settings work best for a specific document, and by how much.
Examples
Removing bad characters from a string
>>> string1 = "This# is$ a bu3cket"
>>> clean(string1, rems = ['#', '$', '3'])
'This is a bucket'
Sanitizing a text file until it becomes clean, usable text.
>>> raw = get('example.txt')
>>> raw
['This# is$ a bu3cket', ' ', "There's more..........", '', '', '', "It contains sssss every man's dying wish"]
>>> noBlankLines = cleanFile(raw, pred = p.long)
>>> noBlankLines
['This# is$ a bu3cket', "There's more..........", "It contains sssss every man's dying wish"]
>>> removeExtraChars = cleanFile(noBlankLines, cleaner = clean, cleanerArg = ['#', '$', '3'])
>>> removeExtraChars
['This is a bucket', "There's more..........", "It contains sssss every man's dying wish"]
>>> strippedChars = cleanFile(removeExtraChars, cleaner = charStrip, cleanerArg = '.')
>>> strippedChars
['This is a bucket', "There's more", "It contains sssss every man's dying wish"]
>>> removeSss = cleanFile(strippedChars, cleaner = cleanWords, cleanerArg = p.repeatedLetters, negateClean = True)
>>> removeSss
['This is a bucket', "There's more", "It contains every man's dying wish"]
- p.long() removes short lines as well as blank lines
- clean() removes every character that matches the blacklist
- charStrip() strips all the period characters off the ends of each line
- cleanWords() removes every word that makes p.repeatedLetters() true (because negateClean is set)
Additionally, lambda expressions can be passed in as long as they meet the specifications of a predicate function.
>>> noJarjar = cleanFile(["Me-sa Jarjar Binks"], cleaner = cleanWords, cleanerArg = lambda txt : not txt == "Jarjar")
>>> noJarjar
['Me-sa Binks']
Collect
Usage
collect(text: Union[str, list], regex=defaultRegex, spaceMatches=False) -> tuple
If you used the NLP notebook, this is the same function; it just has a few more features.
Given some text, return matches and non_matches based on the regex. spaceMatches has several possible values: True (put a blank line in matches if there is no match), False (don't add anything to matches if there is no match), "keep" (put the entire line in matches if there is no match).
Returns (matches, non_matches).
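A minimal sketch of the matching behavior, assuming a simple named-group regex (the regex below is only an illustration, not the library's defaultRegex):

```python
import re

def collect(texts, regex=r'(?P<first>\w+) (?P<last>\w+), (?P<info>.+)',
            space_matches=False):
    """Sketch of collect(): split lines into matches and non-matches.
    space_matches: True -> blank entry in matches for a non-matching line,
    False -> nothing added, 'keep' -> the whole line is added to matches."""
    matches, non_matches = [], []
    for line in texts:
        m = re.search(regex, line)
        if m is not None:
            matches.append(list(m.groupdict().values()))
        else:
            non_matches.append(line)
            if space_matches is True:
                matches.append('')
            elif space_matches == 'keep':
                matches.append(line)
    return matches, non_matches

m, n = collect(['John Smith, English', '---page 3---'])
print(m)  # [['John', 'Smith', 'English']]
print(n)  # ['---page 3---']
```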
Modification
If you want a specific ordering for the matches, edit the following lines (about 10 and 25 lines into the function):
if i is not None:
matches.append(list(i.groupdict().values()))
# matches.append((i.groupdict()['first'].strip() + ' ' + i.groupdict()['last'].strip(), i.groupdict()['info']))
The commented line is an example of how to change the list order.
CSV Functions
Syntax
csvJoin(texts: list) -> str
Returns texts as a csv string.
csvSplit(text: str) -> list
Converts text (a csv string) to a list of texts.
csvText(text: str) -> str
Converts text to a csv string.
csvColumn(texts: list, n: int, safe=True) -> list
Gets column number n from texts (a csv table, like the one from get("table.csv")). If safe is false, it will throw errors when out of bounds (such as when a line isn't n cells long). If safe is true, it will just fill in blank cells.
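The safe/unsafe behavior can be sketched as follows (an assumed implementation built on the standard csv module, not the library's code):

```python
import csv

def csv_column(texts, n, safe=True):
    """Sketch of csvColumn(): pull column n from a list of csv lines.
    With safe=True, rows shorter than n+1 cells yield a blank cell
    instead of raising an IndexError."""
    out = []
    for line in texts:
        row = next(csv.reader([line])) if line else []
        if safe:
            out.append(row[n] if n < len(row) else '')
        else:
            out.append(row[n])
    return out

print(csv_column(['a1,b1,c1', 'a2,b2,c2', ''], 0))  # ['a1', 'a2', '']
```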
csvMergeColumn(fullTexts: list, column: list, n: int) -> list
Replaces column number n from fullTexts with column.
Examples
Create a csv string from a list
>>> csvJoin(['Smith, John', 'English'])
'"Smith, John",English'
Extract the first column from a csv table
>>> raw = get('test.csv')
>>> raw
['a1,b1,c1', 'a2,b2,c2', 'a3,b3,c3', '']
>>> csvColumn(raw, 0)
['a1', 'a2', 'a3', '']
General Functions
Syntax
init()
Runs on start each time. I use this to move to my working directory (see the pwd variable at the top of the file).
get(fileName: str, splt='\n') -> list
Reads file fileName and returns a list where each element is a line (with the default splt). splt can be used to split the text up some other way (for example, get('foo.txt', splt='PAGE') would make each element a page).
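A minimal sketch of get() and the splt behavior (an assumed implementation; the real function may differ in details like encoding handling):

```python
import tempfile

def get(file_name, splt='\n'):
    """Sketch of get(): read file_name and split its contents on splt."""
    with open(file_name) as f:
        return f.read().split(splt)

# Demo: split a file on the marker 'PAGE' so each element is a page.
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as f:
    f.write('first pagePAGEsecond page')
    name = f.name

print(get(name, splt='PAGE'))  # ['first page', 'second page']
```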
save(text: Union[str, list], out: str, csvStyle=False)
Saves text to a file called out. The csvStyle argument decides whether the list will be saved in a csv-safe format (for example, a 2d list would be saved as a table).
find(item: str, texts: list) -> str
Returns the index in texts where item is found, if the two are sort of similar (by this I mean one is a substring of the other).
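The substring-either-way match can be sketched as (an assumed implementation; the fallback value for no match is a guess):

```python
def find(item, texts):
    """Sketch of find(): index of the first line that is 'sort of similar'
    to item, meaning one string is a substring of the other."""
    for i, line in enumerate(texts):
        if item in line or line in item:
            return i
    return -1  # assumed no-match sentinel

print(find('Smith', ['Jones, Bob', 'Smith, John', 'Doe, Jane']))  # 1
```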
batchPrint(lst:list, batchSize = 10)
Prints out batchSize lines at a time from lst. Useful for printing very large files in a manageable format.
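The batching logic can be sketched with a generator (an assumed decomposition: batchPrint() presumably prints each chunk and waits for input before showing the next):

```python
def batches(lst, batch_size=10):
    """Sketch of batchPrint()'s chunking: yield batch_size lines at a time."""
    for i in range(0, len(lst), batch_size):
        yield lst[i:i + batch_size]

print(list(batches(['a', 'b', 'c', 'd', 'e'], 2)))
# [['a', 'b'], ['c', 'd'], ['e']]
```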
Examples
Read a file in and print the contents of it
>>> raw = get('example.txt')
>>> raw
['Each', 'line', 'is', 'its', 'own', 'cell', 'in', 'the', 'array']
>>> printFile(raw)
Each
line
is
its
own
cell
in
the
array
Saving a list to a .txt file
>>> texts = ['yo ho', 'haul together', 'hoist the colors high']
>>> save(texts, 'lyrics.txt')
>>> get('lyrics.txt')
['yo ho', 'haul together', 'hoist the colors high']
Modifying and saving a text file as a .csv file
>>> raw = get('test.csv')
>>> raw
['a1,b1,c1', 'a2,b2,c2', 'a3,b3,c3', '']
>>> raw[-1] = 'a4,b4,c4'
>>> save(raw, 'test2.csv', csvStyle = True)
Misc Functions
Syntax
headerGrab(texts, headerPred, quite = True)
Finds anything in texts that makes headerPred true and appends it onto all following lines, until a new header is found (by headerPred). With quite turned off, you can manually accept each header. Interactive mode is a work in progress.
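The tag-lines-with-their-header idea can be sketched like this (an assumed implementation without the interactive mode; the all-caps lambda is a stand-in for p.mostlyCaps):

```python
def header_grab(texts, header_pred):
    """Sketch of headerGrab(): tag each line with the most recent header,
    where header_pred decides what counts as a header."""
    out, header = [], None
    for line in texts:
        if header_pred(line):
            header = line
        elif header is not None:
            out.append([line, header])
    return out

mostly_caps = lambda t: t.isupper()  # stand-in for p.mostlyCaps
print(header_grab(['SENIORS', 'Joe Smith', 'JUNIORS', 'Ann Lee'], mostly_caps))
# [['Joe Smith', 'SENIORS'], ['Ann Lee', 'JUNIORS']]
```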
infoGrab(fname, outName)
Pulls info off of names in file fname and saves it to outName.
slapThatInfoOn(texts)
Reorders texts so names and the corresponding info are all on the same line. p.REGULAR_EXPRESSION needs to be set beforehand to recognize names.
Examples
headerGrab
>>> headerGrab(get('exampleFile.txt'), p.mostlyCaps, quite = True)
If the file was:
SENIORS
Joe Smith
Frank Stevenson
the function returns:
[Joe Smith, SENIORS]
[Frank Stevenson, SENIORS]
infoGrab(get('namesAndMajors.txt'), 'output.txt')
John Smith, English
would become ["John Smith", "English"].
slapThatInfoOn(get('example.txt'))
If the file was:
John Smith
Science
1922
Bob Smith
Math
1935
It becomes:
John Smith Science 1922
Bob Smith Math 1935
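The regrouping shown above can be sketched as follows (an assumed implementation; the name regex here is a hypothetical stand-in for whatever p.REGULAR_EXPRESSION is set to):

```python
import re

# Hypothetical stand-in for p.REGULAR_EXPRESSION: "Firstname Lastname".
NAME_RE = re.compile(r'^[A-Z][a-z]+ [A-Z][a-z]+$')

def slap_that_info_on(texts):
    """Sketch of slapThatInfoOn(): start a new record at each name line
    and append the following info lines onto it."""
    out = []
    for line in texts:
        if NAME_RE.match(line):
            out.append(line)
        elif out:
            out[-1] += ' ' + line
    return out

print(slap_that_info_on(
    ['John Smith', 'Science', '1922', 'Bob Smith', 'Math', '1935']))
# ['John Smith Science 1922', 'Bob Smith Math 1935']
```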