Python, NLP, and Working with Text
Python and PDFs
How to Extract Words from PDFs with Python
PyPDF2 - To convert simple, text-based PDF files into text readable by Python.
PyPDF docs - applies to PyPDF2
textract github - Extract text from any document. No muss. No fuss.
textract docs - This package provides a single interface for extracting content from any type of file, without any irrelevant markup. To convert non-trivial, scanned PDF files into text readable by Python.
Python & NLP
Introduction to Natural Language Processing with Python Jess Bowden, May 31, 2016
PDFMiner Python PDF parser and analyzer.
PyPDF2 A utility to read and write PDFs with Python.
Digital Writing With Python 11 session course by Allison Parrish.
Text Programming from A-Z - An earlier version of the course above, also by Allison Parrish.
A (Really) Gentle Introduction to NLP in Python - Jan 15, 2020
Word2Vec
Word2Vec to Transformers - The evolution of word embeddings, notes from CS224n.
FlashText
FlashText - Quickly extract Keywords from sentence or Replace keywords in sentences. Good for large numbers of words or texts.
Python Regular Expression Matching
* : match zero or more
+ : match one or more
? : match none or one
{n,m} : match from n to m of the preceding group of characters
. (dot) : wildcard, matches any single character except for newline (\n)
\d : any digit, 0 through 9 (opposite: \D)
\w : any letter, number, or underscore (opposite: \W)
\s : any space, tab, or newline character (opposite: \S)
[a-zA-Z0-9] : make your own ranges for character matching (opposite: [^a-zA-Z0-9], with the ^ inside the brackets)
^\d+ : match string that begins with one or more digits (returns digits)
\d+$ : match string that ends with one or more digits (returns digits)
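A quick sketch of these tokens in action with Python's re module (the patterns and sample strings are invented for illustration):

```python
import re

# * matches zero or more, so bare 'a' matches too
re.findall(r'ab*', 'a ab abb')       # ['a', 'ab', 'abb']
# {n,m} bounds repetition; matching is greedy, so 2024 yields '202'
re.findall(r'\d{2,3}', '7 42 2024')  # ['42', '202']
# ^\d+ anchors to the start of the string and returns the digits
re.search(r'^\d+', '12 monkeys').group()  # '12'
```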
Using re.compile(), search(), findall(), and group():
import re
phoneNumRegex = re.compile(r'\d\d\d-\d\d\d-\d\d\d\d') # Remember to use a raw (r'...') string
mo = phoneNumRegex.search('My number is 415-555-4242.') # Returns first instance
print('Phone number found: ' + mo.group())
phoneNumRegex.findall('Cell: 415-555-9999 Work: 212-555-0000') # Returns a list of strings
Match word boundaries
is_pub = with_amenity.amenity.str.contains(r'\bpub\b')
pubs = with_amenity[is_pub]
pubs.amenity.count().compute()
The \b word boundaries prevent matching substrings of longer words such as "public".
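The same \b trick works with plain re, independent of pandas or Dask (sample strings invented for illustration):

```python
import re

pub_re = re.compile(r'\bpub\b')
pub_re.search('the old pub')     # matches
pub_re.search('public library')  # None: \b blocks the partial-word match
```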
Easy way to get capitalized and uncapitalized words from a Pandas dataframe
with_name = data[data.name.notnull()]
with_amenity = data[data.amenity.notnull()]
is_starbucks = with_name.name.str.contains('[Ss]tarbucks')
is_dunkin = with_name.name.str.contains('[Dd]unkin')
starbucks = with_name[is_starbucks]
dunkin = with_name[is_dunkin]
From Out-of-Core Dataframes in Python: Dask and OpenStreetMap
Counter
A Counter is a container that keeps track of how many times equivalent values are added. It can be used to implement the same algorithms for which bag or multiset data structures are commonly used in other languages.
import collections
print(collections.Counter(['a', 'b', 'c', 'a', 'b', 'b']))
c = collections.Counter()
print('Initial :', c)
c.update('abcdaab')
print('Sequence:', c)
c.update({'a': 1, 'd': 5})
print('Dict    :', c)
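Beyond update(), a Counter supports most_common(), zero-default lookup, and subtraction, which is handy for word counts (the sample sentence is invented):

```python
import collections

words = 'the cat sat on the mat the end'.split()
counts = collections.Counter(words)
print(counts.most_common(1))  # [('the', 3)]
print(counts['missing'])      # 0 -- missing keys count as zero, no KeyError
counts.subtract(['the'])      # decrement counts in place
print(counts['the'])          # 2
```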
Unicode & Character Sets
Python 3.5.2 Unicode HOWTO Python Docs
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets Joel Spolsky (2003) Might be a bit out of date.
Pragmatic Unicode, or, How do I stop the pain?
# Python 3: str is a sequence of unicode code points; bytes is a sequence of bytes
my_string = "Hello World"
type(my_string) # <class 'str'>
my_bytes = b"Hello World"
type(my_bytes) # <class 'bytes'>
"Hello" == b"Hello" # False: str and bytes never compare equal
# Escapes: \uXXXX takes 4 hex digits, \UXXXXXXXX takes 8
my_unicode = "Hi \u2119\u01b4\u2602\u210c\xf8\u1f24"
type(my_unicode) # <class 'str'>
# .encode() and .decode()
my_unicode.encode() # str -> bytes. Only str has .encode()
my_bytes.decode() # bytes -> str. Only bytes has .decode()
# Moving between bytes and unicode, and back again.
# Note: 'ascii' encoding covers only 128 characters (code points 0-127), unlike utf-8.
my_utf8 = my_unicode.encode('utf-8')
my_unicode2 = my_utf8.decode('utf-8')
# Error handling options (similar with .decode())
my_unicode.encode('ascii', 'replace')
my_unicode.encode('ascii', 'xmlcharrefreplace')
my_unicode.encode('ascii', 'ignore')
# Reading files
# Default text mode returns str, decoded with the platform default encoding
open("hello.txt", "r").read() # Returns str
# Binary mode returns the raw bytes
open("hello.txt", "rb").read() # Returns bytes
# Pass an explicit encoding to decode a utf-8 file correctly
open("hi_utf8.txt", "r", encoding="utf-8").read() # Returns str
# Reading a utf-8 file in "rb" mode returns the raw, undecoded bytes
open("hi_utf8.txt", "rb").read() # Returns bytes
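A round trip through the error handlers above, using 'café' (one non-ASCII character, é = U+00E9) as a stand-in string:

```python
s = "caf\xe9"  # 'café'
# utf-8 round trip is lossless
round_trip = s.encode('utf-8').decode('utf-8')
print(round_trip == s)                        # True
# The three error handlers degrade differently under ascii:
print(s.encode('ascii', 'replace'))           # b'caf?'
print(s.encode('ascii', 'xmlcharrefreplace')) # b'caf&#233;'
print(s.encode('ascii', 'ignore'))            # b'caf'
```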
Pro Tips
Pro tip #1: Bytes on the outside, unicode on the inside. Encode and decode at edges.
Pro tip #2: Always know whether you have a byte string or a unicode string, and which encoding the bytes use (e.g. utf-8, or iso-8859-1, a.k.a. Latin-1).
Pro tip #3: Test your unicode handling, e.g. by generating exotic non-ASCII test data.
Facts of Life
1: I/O is always bytes
2: Need more than 256 symbols
3: Need both bytes and unicode
4: Can't infer encodings, so figure out how you're going to get that information.
5: Declared encodings can be wrong
Reading and working with text files
# Read lines of text into a python list, stripping trailing whitespace, and view the text
messages = [line.rstrip() for line in open('./data/textfile')]
for message_no, message in enumerate(messages[:10]):
    print(message_no, message)
# Read TSV file into pandas, adding custom column labels
import csv
import pandas
messages = pandas.read_csv('./data/textfile', sep='\t', quoting=csv.QUOTE_NONE,
                           names=["label", "message"])
# View aggregate text statistics by sentence label
messages.groupby('label').describe()
# Find character length of sentences using a lambda function
messages['length'] = messages['message'].map(lambda text: len(text))
print(messages.head())
# Print summary statistics of sentence length
messages.length.describe()
# Plot sentence lengths in a histogram
messages.length.plot(bins=20, kind='hist')
# Create multiple histogram plots of sentence length by label
messages.hist(column='length', by='label', bins=50)
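The same pipeline can be sketched end-to-end on an in-memory frame; the labels and messages below are invented stand-ins for the TSV data:

```python
import pandas

messages = pandas.DataFrame({
    'label':   ['ham', 'spam', 'ham'],
    'message': ['see you at noon', 'WIN A FREE PRIZE NOW!!!', 'ok'],
})
# Character length per message, then summary statistics per label
messages['length'] = messages['message'].map(len)
print(messages['length'].tolist())  # [15, 23, 2]
print(messages.groupby('label')['length'].describe())
```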
Tokenization and Part of speech tagging with TextBlob
from textblob import TextBlob
# Tokenization
TextBlob("Hello world, how is it going?").words
# Create list of (word, POS) pairs
TextBlob("Hello world, how is it going?").tags
# Normalize words into their base form (lemmas)
def split_into_lemmas(message):
    message = message.lower()
    words = TextBlob(message).words
    return [word.lemma for word in words]

messages.message.head().apply(split_into_lemmas)