Python, NLP, and Working with Text
Python and PDFs
How to Extract Words from PDFs with Python
PyPDF2 - To convert simple, text-based PDF files into text readable by Python.
PyPDF docs - applies to PyPDF2
textract github - Extract text from any document. No muss. No fuss.
textract docs - This package provides a single interface for extracting content from any type of file, without any irrelevant markup. To convert non-trivial, scanned PDF files into text readable by Python.
Python & NLP
Introduction to Natural Language Processing with Python Jess Bowden, May 31, 2016
PDFMiner Python PDF parser and analyzer.
PyPDF2 A utility to read and write PDFs with Python.
Digital Writing With Python 11 session course by Allison Parrish.
Text Programming from A-Z - An earlier version of the course above, also by Allison Parrish.
A (Really) Gentle Introduction to NLP in Python - Jan 15, 2020
Word2Vec
Word2Vec to Transformers - The evolution of word embeddings, notes from CS224n.
FlashText
FlashText - Quickly extract Keywords from sentence or Replace keywords in sentences. Good for large numbers of words or texts.
Python Regular Expression Matching
* : match zero or more
+ : match one or more
? : match none or one
{n,m} : match from n to m of the preceding group of characters
. (dot) : wildcard, matches any single character except for newline (\n)
\d : any digit, 0 through 9 (opposite: \D)
\w : any letter, number, or underscore (opposite: \W)
\s : any space, tab, or newline character (opposite: \S)
[a-zA-Z0-9] : make your own ranges for character matching (opposite: [^a-zA-Z0-9], with the ^ inside the brackets)
^\d+ : match string that begins with one or more digits (returns digits)
\d+$ : match string that ends with one or more digits (returns digits)
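A quick sketch of these tokens in action with Python's re module (the patterns and sample strings are invented for illustration):

```python
import re

# * matches zero or more, so bare 'a' matches too
re.findall(r'ab*', 'a ab abb')       # ['a', 'ab', 'abb']
# {n,m} bounds repetition; matching is greedy, so 2024 yields '202'
re.findall(r'\d{2,3}', '7 42 2024')  # ['42', '202']
# ^\d+ anchors to the start of the string and returns the digits
re.search(r'^\d+', '12 monkeys').group()  # '12'
```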
Using re.compile(), search(), findall(), and group():
import re
phoneNumRegex = re.compile(r'\d\d\d-\d\d\d-\d\d\d\d') # Remember to use a raw (r'...') string
mo = phoneNumRegex.search('My number is 415-555-4242.') # Returns first instance
print('Phone number found: ' + mo.group())
phoneNumRegex.findall('Cell: 415-555-9999 Work: 212-555-0000') # Returns a list of strings
Match word boundaries
is_pub = with_amenity.amenity.str.contains(r'\bpub\b')
pubs = with_amenity[is_pub]
pubs.amenity.count().compute()
The \b word boundaries prevent matching substrings of longer words such as "public".
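The same \b trick works with plain re, independent of pandas or Dask (sample strings invented for illustration):

```python
import re

pub_re = re.compile(r'\bpub\b')
pub_re.search('the old pub')     # matches
pub_re.search('public library')  # None: \b blocks the partial-word match
```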
Easy way to get capitalized and uncapitalized words from a Pandas dataframe
with_name = data[data.name.notnull()]
with_amenity = data[data.amenity.notnull()]
is_starbucks = with_name.name.str.contains('[Ss]tarbucks')
is_dunkin = with_name.name.str.contains('[Dd]unkin')
starbucks = with_name[is_starbucks]
dunkin = with_name[is_dunkin]
From Out-of-Core Dataframes in Python: Dask and OpenStreetMap
Counter
A Counter is a container that keeps track of how many times equivalent values are added. It can be used to implement the same algorithms for which bag or multiset data structures are commonly used in other languages.
import collections
print(collections.Counter(['a', 'b', 'c', 'a', 'b', 'b']))
c = collections.Counter()
print('Initial :', c)
c.update('abcdaab')
print('Sequence:', c)
c.update({'a': 1, 'd': 5})
print('Dict    :', c)
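Beyond update(), a Counter supports most_common(), zero-default lookup, and subtraction, which is handy for word counts (the sample sentence is invented):

```python
import collections

words = 'the cat sat on the mat the end'.split()
counts = collections.Counter(words)
print(counts.most_common(1))  # [('the', 3)]
print(counts['missing'])      # 0 -- missing keys count as zero, no KeyError
counts.subtract(['the'])      # decrement counts in place
print(counts['the'])          # 2
```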
Unicode & Character Sets
Python 3.5.2 Unicode HOWTO Python Docs
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets Joel Spolsky (2003) Might be a bit out of date.
Pragmatic Unicode, or, How do I stop the pain?
# Python 3: str is a sequence of unicode code points; bytes is a sequence of bytes
my_string = "Hello World"
type(my_string) # <class 'str'>
my_bytes = b"Hello World"
type(my_bytes) # <class 'bytes'>
"Hello" == b"Hello" # False: str and bytes never compare equal
# Escapes: \uXXXX takes 4 hex digits, \UXXXXXXXX takes 8
my_unicode = "Hi \u2119\u01b4\u2602\u210c\xf8\u1f24"
type(my_unicode) # <class 'str'>
# .encode() and .decode()
my_unicode.encode() # str -> bytes. Only str has .encode()
my_bytes.decode() # bytes -> str. Only bytes has .decode()
# Moving between bytes and unicode, and back again.
# Note: 'ascii' encoding covers only 128 characters (code points 0-127), unlike utf-8.
my_utf8 = my_unicode.encode('utf-8')
my_unicode2 = my_utf8.decode('utf-8')
# Error handling options (similar with .decode())
my_unicode.encode('ascii', 'replace')
my_unicode.encode('ascii', 'xmlcharrefreplace')
my_unicode.encode('ascii', 'ignore')
# Reading files
# Default text mode returns str, decoded with the platform default encoding
open("hello.txt", "r").read() # Returns str
# Binary mode returns the raw bytes
open("hello.txt", "rb").read() # Returns bytes
# Pass an explicit encoding to decode a utf-8 file correctly
open("hi_utf8.txt", "r", encoding="utf-8").read() # Returns str
# Reading a utf-8 file in "rb" mode returns the raw, undecoded bytes
open("hi_utf8.txt", "rb").read() # Returns bytes
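A round trip through the error handlers above, using 'café' (one non-ASCII character, é = U+00E9) as a stand-in string:

```python
s = "caf\xe9"  # 'café'
# utf-8 round trip is lossless
round_trip = s.encode('utf-8').decode('utf-8')
print(round_trip == s)                        # True
# The three error handlers degrade differently under ascii:
print(s.encode('ascii', 'replace'))           # b'caf?'
print(s.encode('ascii', 'xmlcharrefreplace')) # b'caf&#233;'
print(s.encode('ascii', 'ignore'))            # b'caf'
```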
Pro Tips
Pro tip #1: Bytes on the outside, unicode on the inside. Encode and decode at edges.
Pro tip #2: Always know whether you have a byte string or a unicode string, and which encoding the bytes use (e.g. utf-8, or iso-8859-1, a.k.a. Latin-1).
Pro tip #3: Test your unicode handling, e.g. by generating exotic non-ASCII test data.
Facts of Life
1: I/O is always bytes
2: Need more than 256 symbols
3: Need both bytes and unicode
4: Can't infer encodings, so figure out how you're going to get that information.
5: Declared encodings can be wrong
Reading and working with text files
# Read lines of text into a python list, stripping trailing whitespace, and view the text
messages = [line.rstrip() for line in open('./data/textfile')]
for message_no, message in enumerate(messages[:10]):
    print(message_no, message)
# Read TSV file into pandas, adding custom column labels
import csv
import pandas
messages = pandas.read_csv('./data/textfile', sep='\t', quoting=csv.QUOTE_NONE,
                           names=["label", "message"])
# View aggregate text statistics by sentence label
messages.groupby('label').describe()
# Find character length of sentences using a lambda function
messages['length'] = messages['message'].map(lambda text: len(text))
print(messages.head())
# Print summary statistics of sentence length
messages.length.describe()
# Plot sentence lengths in a histogram
messages.length.plot(bins=20, kind='hist')
# Create multiple histogram plots of sentence length by label
messages.hist(column='length', by='label', bins=50)
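The same pipeline can be sketched end-to-end on an in-memory frame; the labels and messages below are invented stand-ins for the TSV data:

```python
import pandas

messages = pandas.DataFrame({
    'label':   ['ham', 'spam', 'ham'],
    'message': ['see you at noon', 'WIN A FREE PRIZE NOW!!!', 'ok'],
})
# Character length per message, then summary statistics per label
messages['length'] = messages['message'].map(len)
print(messages['length'].tolist())  # [15, 23, 2]
print(messages.groupby('label')['length'].describe())
```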
Tokenization and Part of speech tagging with TextBlob
from textblob import TextBlob
# Tokenization
TextBlob("Hello world, how is it going?").words
# Create list of (word, POS) pairs
TextBlob("Hello world, how is it going?").tags
# Normalize words into their base form (lemmas)
def split_into_lemmas(message):
    message = message.lower()
    words = TextBlob(message).words
    return [word.lemma for word in words]

messages.message.head().apply(split_into_lemmas)