Clean UK postcodes - lvphj/epydemiology GitHub Wiki

Python function to clean and extract correctly formatted UK postcode data in a Pandas dataframe.

phjCleanUKPostcodeVariable()

df = epy.phjCleanUKPostcodeVariable(phjDF,
                                    phjRealPostcodeSer = None,
                                    phjOrigPostcodeVarName = 'postcode',
                                    phjNewPostcodeVarName = 'postcodeClean',
                                    phjNewPostcodeStrLenVarName = 'postcodeCleanStrLen',
                                    phjPostcodeCheckVarName = 'postcodeCheck',
                                    phjMissingValueCode = 'missing',
                                    phjMinDamerauLevenshteinDistanceVarName = 'minDamLevDist',
                                    phjBestAlternativesVarName = 'bestAlternatives',
                                    phjPostcode7VarName = 'postcode7',
                                    phjPostcodeAreaVarName = 'postcodeArea',
                                    phjSalvageOutwardPostcodeComponent = True,
                                    phjCheckByOption = 'format',
                                    phjDropExisting = False,
                                    phjPrintResults = True)

Description

Postcodes are often stored in a database field as part of an address. When they are typed by hand or transcribed from written notes, typing errors and partially known postcodes are common, so a postcode variable usually holds a mix of correct, incorrect and partial values. This function extracts correctly formatted postcodes and corrects some commonly occurring transcription errors. Where only the outward component (first half) of the postcode is recorded, the function can attempt to salvage just that component. It also extracts the postcode area (the first 1 or 2 letters). The cleaned postcode (with no spaces and in 7-character format), the outward and inward components, and the postcode area are all stored in new variables that are added to the original dataframe.

This function uses one of two methods to extract postcode information:

checking the postcode is correctly 'formatted' using a regex;
comparing the postcode to a database of all known postcodes and, if the postcode does not exist, determining the most likely alternatives based on Damerau-Levenshtein distance and on the physical position of inserted or transposed characters on the keyboard. This method makes use of the fast damerau-levenshtein library (pyxdameraulevenshtein) that needs to be installed in the Python environment.

The regex used to determine whether postcodes are correctly formatted is a modified version of a regex published at https://en.wikipedia.org/wiki/Talk:Postcodes_in_the_United_Kingdom (accessed 22 Mar 2016). (This page is also stored locally as a PDF entitled, "Talk/Postcodes in the United Kingdom - Wikipedia, the free encyclopedia".)

The function takes, as two of its arguments, a Pandas dataframe containing a column of postcode data, and the name of that postcode column. It returns the same dataframe with some additional, postcode-related columns. The additional columns returned are:

'postcodeClean' (column name is user-defined through phjNewPostcodeVarName argument)

This variable will contain the correctly formatted components of the postcode, either the whole postcode or the outward component (first half of postcode). Postcodes that are incorrectly formatted or have been entered as missing values will contain the missing value code (e.g. 'missing').
'postcodeCheck' (column name is user-defined through the phjPostcodeCheckVarName argument; referred to as postcodeFormatCheck in the steps described below)

This is a binary variable that contains True if a correctly formatted postcode component can be extracted, either the whole postcode or the outward component only. Otherwise, it contains False.
'postcode7' (column name is user-defined through the phjPostcode7VarName argument)

This variable contains correctly formatted complete postcodes in 7-character format. For postcodes that contain 5 letters, the outward and inward components will be separated by 2 spaces; for postcodes that contain 6 letters, the outward and inward components will be separated by 1 space; and postcodes that contain 7 letters will contain no spaces. This format of postcodes is often used in postcode lookup tables.
'postcodeOutward' (defined as a group name in the regular expression and, therefore, not user-definable)

This variable contains the outward component of the postcode (first half of postcode). This variable may contain a correctly formatted postcode string (2 to 4 characters) even when the variable containing the inward postcode string contains the missing value code.
'postcodeInward' (defined as a group name in the regular expression and, therefore, not user-definable)

This variable contains the inward component of the postcode (second half of postcode). It is possible that this variable may contain a missing value whilst the postcodeOutward variable contains a correctly-formatted postcode string (2 to 4 characters).
'postcodeArea' (column name is user-defined through the phjPostcodeAreaVarName argument)

This variable contains the postcode area (first one or two letters) taken from correctly formatted outward postcode components.
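
The 7-character format and the postcode area described above can be illustrated with a minimal sketch. The helper names to_postcode7 and postcode_area are hypothetical and are not part of epydemiology; the sketch assumes the outward and inward components have already been validated and separated.

```python
import re


def to_postcode7(outward, inward):
    """Join outward and inward parts, padding the gap so the result is 7 characters."""
    n_spaces = 7 - len(outward) - len(inward)
    return outward + " " * n_spaces + inward


def postcode_area(outward):
    """Return the leading letters of an outward code (the postcode area)."""
    match = re.match(r"[A-Z]+", outward)
    return match.group(0) if match else None


print(repr(to_postcode7("B1", "1NJ")))    # 'B1  1NJ' (two spaces)
print(repr(to_postcode7("NP4", "5DG")))   # 'NP4 5DG' (one space)
print(repr(to_postcode7("CH64", "7TE")))  # 'CH647TE' (no spaces)
print(postcode_area("W1A"))               # W
```

The padding rule follows directly from the description: the number of spaces is whatever is needed to bring the combined length up to 7.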

If postcodes are checked using a regex, the function proceeds as follows:

Postcode data are cleaned by removing all spaces and punctuation marks and converting all letters to uppercase. Missing values and strings that cannot possibly be a postcode (e.g. all-numeric data) are converted to the missing value code. The cleaned strings are stored temporarily in the postcodeClean variable.
Correctly formatted postcodes (in the postcodeClean column) are identified using the regular expression and the postcodeFormatCheck variable is set to True. Outward and inward components are extracted and stored in the relevant columns.
Postcodes that are incorrectly formatted undergo an error-correction step where common typos and mis-transcriptions are corrected. After this process, the format of the corrected postcode is checked again using the regex and the postcodeFormatCheck variable is set to True where the corrected string matches. Outward and inward components are extracted and stored in the relevant columns.
If the phjSalvageOutwardPostcodeComponent argument is set to True (default), the function attempts to salvage just the outward postcode component. The postcode strings in the postcodeClean variable are tested using the outward component of the regex to determine whether the first 2 to 4 characters represent a correctly formatted outward component of a postcode. If so, postcodeFormatCheck is set to True and the partial string is extracted and stored in the postcodeOutward column.
Common typos and mis-transcriptions are corrected once again and the string is tested against the regex to determine whether the first 2 to 4 characters represent a correctly formatted outward component of a postcode. If so, postcodeFormatCheck is set to True and the partial string is extracted and stored in the postcodeOutward column.
For any postcode strings that have not been identified as a complete or partial match to the postcode regex, the postcodeClean variable is set to the missing value code.
The postcode area is extracted from the postcodeOutward variable and stored in the postcodeArea variable.
The function returns the dataframe containing the additional columns.
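
The first step above (basic cleaning) can be sketched in plain Python as follows. This is an illustration only, not the library's code: the helper name basic_clean is hypothetical, and the library additionally recognises free-text entries such as 'NOT NOWN' as missing, which this sketch does not attempt.

```python
import re

MISSING = "missing"  # mirrors the default value of phjMissingValueCode


def basic_clean(raw, missing=MISSING):
    """Strip spaces and punctuation, uppercase, and flag impossible values."""
    if raw is None:
        return missing
    # Keep only letters and digits, then convert to uppercase
    cleaned = re.sub(r"[^A-Za-z0-9]", "", str(raw)).upper()
    # Empty strings and all-numeric strings cannot be postcodes
    if cleaned == "" or cleaned.isdigit():
        return missing
    return cleaned


for raw in ["CH5 4HE", "    ???NP4-5DG_*#   ", "no idea", "", "12345"]:
    print(repr(raw), "->", basic_clean(raw))
```

For the inputs shown, the results are CH54HE, NP45DG, NOIDEA, missing and missing respectively; strings such as NOIDEA survive this step and are only rejected later when they fail the format check.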

If postcodes are checked against a list of correct postcodes, the function proceeds in a similar way, except that incorrect postcodes are compared with correct postcodes using the Damerau-Levenshtein distance, weighted for the physical distance of inserted or transposed characters on a standard QWERTY keyboard.
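
To make the distance-based comparison concrete, here is a minimal pure-Python sketch. It computes the unweighted, restricted Damerau-Levenshtein distance (also called optimal string alignment) and keeps only candidates within a distance of 1. It does not use pyxdameraulevenshtein and does not apply the keyboard-distance weighting described above; the function names osa_distance and closest_postcodes are hypothetical.

```python
def osa_distance(a, b):
    """Restricted Damerau-Levenshtein (optimal string alignment) distance."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all characters of a[:i]
    for j in range(cols):
        d[0][j] = j  # insert all characters of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
            # Transposition of two adjacent characters
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[rows - 1][cols - 1]


def closest_postcodes(candidate, real_postcodes, max_distance=1):
    """Return the real postcodes at the smallest distance, if within max_distance."""
    scored = [(osa_distance(candidate, p), p) for p in real_postcodes]
    best = min((score for score, _ in scored), default=None)
    if best is None or best > max_distance:
        return []
    return [p for score, p in scored if score == best]


real = ["AB123CF", "AB124DF", "AB123CV"]
print(osa_distance("NP45DG", "NP54DG"))    # 1 (a single transposition)
print(closest_postcodes("AB123CD", real))  # ['AB123CF', 'AB123CV']
```

Both AB123CF and AB123CV differ from AB123CD by one substitution, so an unweighted distance cannot separate them; this is the kind of tie that a keyboard-position weighting is intended to break.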

Function parameters

The function takes the following parameters:

phjDF

Pandas dataframe containing a variable that contains postcode information.
phjRealPostcodeSer (default = None)

If the postcodes are to be compared to real postcodes, this variable should refer to a Pandas Series of genuine postcodes.
phjOrigPostcodeVarName (default = 'postcode')

The name of the variable that contains postcode information.
phjNewPostcodeVarName (default = 'postcodeClean')

The name of the variable that the function creates that will contain 'cleaned' postcode data. The postcodes stored in this column will contain no whitespace. Therefore, A1 2BC will be entered as A12BC. Also, the 'cleaned' postcode may be only the outward component if that is the only correctly formatted data. If the user wants to view only complete postcodes, use phjPostcode7VarName. Strings where no valid postcode data has been extracted will be stored as the missing value string.
phjNewPostcodeStrLenVarName (default = 'postcodeCleanStrLen')

Name of the variable that will be created to contain the length of the postcode.
phjPostcodeCheckVarName (default = 'postcodeCheck')

A binary variable that the function will create that indicates whether the whole postcode (or, if only 2 to 4 characters are entered, the outward component of the postcode) is either correctly formatted or matches the list of real postcodes supplied, depending on which check was requested.
phjMissingValueCode (default = 'missing')

String used to indicate a missing value. This cannot be np.nan because DataFrame.update() does not apply NaN values when updating.
phjMinDamerauLevenshteinDistanceVarName (default = 'minDamLevDist')

Name of variable that will be created to contain the DL distance.
phjBestAlternativesVarName (default = 'bestAlternatives')

Name of variable that will be created to contain best (or closest matching) postcodes from the list of real postcodes.
phjPostcode7VarName (default = 'postcode7')

The name of the variable that the function creates that will contain 'cleaned' postcode data in 7-character format. Postcodes can contain 5 to 7 characters. In postcodes that consist of 5 characters, the outward and inward components will be separated by 2 spaces; in postcodes that consist of 6 characters, they will be separated by 1 space; and in postcodes that consist of 7 characters there will be no spaces. This format is commonly used in lookup tables that link postcodes to other geographical information.

phjPostcodeAreaVarName (default = 'postcodeArea')

The name of the variable that the function creates that will contain the postcode area (the first 1, 2 or, in very rare cases, 3 letters).

phjSalvageOutwardPostcodeComponent (default = True)

Indicates whether user wants to attempt to salvage some outward postcode components from postcode strings.

phjCheckByOption (default = 'format')

Select method to use to check postcodes. The default is 'format', which checks the format of the postcode using a regular expression. The alternative is 'dictionary', which calculates the Damerau-Levenshtein distance from each postcode in the list of supplied postcodes and chooses the closest matches based on the DL distance and on the physical distance of inserted or transposed characters on a standard QWERTY keyboard. The 'dictionary' option makes use of the fast damerau-levenshtein library (pyxdameraulevenshtein) which, therefore, needs to be installed in the Python environment.

phjDropExisting (default = False)

If set to True, the function will automatically drop any pre-existing columns that have the same name as those columns that need to be created. If set to False, the function will halt.

phjPrintResults (default = False)

If set to True, the function will print information to screen as it proceeds.
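
The note under phjMissingValueCode about DataFrame.update() can be checked with a small pandas sketch. The helper apply_update and the column contents are made up for illustration; the point is only that NaN values in the updating dataframe leave the original values untouched, whereas a string code such as 'missing' is written through.

```python
import numpy as np
import pandas as pd


def apply_update(values, updates):
    """Update a one-column dataframe in place and return the resulting values."""
    df = pd.DataFrame({"pcdClean": values})
    df.update(pd.DataFrame({"pcdClean": updates}))
    return df["pcdClean"].tolist()


# NaN in the updating frame is ignored, so the bad value 'NOIDEA' survives
print(apply_update(["NP45DG", "NOIDEA"], [np.nan, np.nan]))

# A string sentinel is applied, so the bad value is overwritten
print(apply_update(["NP45DG", "NOIDEA"], [np.nan, "missing"]))
```

The first call leaves both values as they were; the second replaces only the second value with 'missing'.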

Exceptions raised

None.

Returns

By default, function returns the original dataframe with added columns containing postcode data.

Other notes

The regex used to check the format of postcodes is given below and is a modification of the regex found at https://en.wikipedia.org/wiki/Talk:Postcodes_in_the_United_Kingdom (accessed 22 Mar 2016). The regex was modified slightly to allow for optional space between first and second parts of postcode (even though, in this library, all the whitespace is removed before comparing with the regex). Also, the original did not find old Norwich postcodes of the form NOR number-number-letter nor old Newport postcodes of form NPT number-letter-letter. The regex was changed so it consisted of two named components recognising the outward (first half) and inward (second half) of the postcode which could be compiled into a single regex, separated by whitespace (if required).

postcodeOutwardRegex = '''(?P<postcodeOutward>(?:^GIR(?=\s*0AA$)) |                 # Identifies special postcode GIR 0AA
                                              (?:^NOR(?=\s*[0-9][0-9][A-Z]$)) |     # Identifies old Norwich postcodes of format NOR number-number-letter
                                              (?:^NPT(?=\s*[0-9][A-Z][A-Z]$)) |     # Identifies old Newport (South Wales) postcodes of format NPT number-letter-letter
                                              (?:^(?:(?:A[BL]|B[ABDFHLNRSTX]?|C[ABFHMORTVW]|D[ADEGHLNTY]|E[HNX]?|F[KY]|G[LUY]?|H[ADGPRSUX]|I[GMPV]|JE|K[ATWY]|L[ADELNSU]?|M[EKL]?|N[EGNPRW]?|O[LX]|P[AEHLOR]|R[GHM]|S[AEGKLMNOPRSTY]?|T[ADFNQRSW]|UB|W[ADFNRSV]|YO|ZE)[1-9]?[0-9] |  # Identifies standard outward code e.g. L4, L12, CH5, CH64
                                                  (?:(?:E|N|NW|SE|SW|W)1|EC[1-4]|WC[12])[A-HJKMNPR-Y]|(?:SW|W)(?:[1-9]|[1-9][0-9])|EC[1-9][0-9]|WC99))    # Identifies the odd London-based postcodes
                           )'''

postcodeInwardRegex = '''(?P<postcodeInward>(?<=NOR)(?:\s*[0-9][0-9][A-Z]$) |      # Picks out the unusual format of old Norwich postcodes (including leading space)
                                            (?:[0-9][ABD-HJLNP-UVW-Z]{2}$)         # Picks out standard number-letter-letter end of postcode
                         )'''
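
As an illustration of how two named components of this kind can be compiled into a single expression, the sketch below uses a deliberately simplified outward pattern rather than the full expression above (it knows nothing about valid area codes, GIR, NOR or NPT). The variable and function names are hypothetical. The re.VERBOSE flag is what allows the whitespace and # comments inside the pattern strings.

```python
import re

# Simplified stand-ins for the two components; NOT the full regex above
outward_sketch = r'''(?P<postcodeOutward>
    ^[A-Z]{1,2}[0-9][0-9A-Z]?      # one or two letters, a digit, optional digit or letter
)'''
inward_sketch = r'''(?P<postcodeInward>
    [0-9][ABD-HJLNP-UVW-Z]{2}$     # a digit followed by two permitted letters
)'''

# Whole postcode: outward, optional whitespace, inward
full_re = re.compile(outward_sketch + r'\s*' + inward_sketch, flags=re.VERBOSE)
# Outward component only, used when salvaging partial postcodes
outward_re = re.compile(outward_sketch, flags=re.VERBOSE)


def split_postcode(cleaned):
    """Return (outward, inward) for a full match, else None."""
    m = full_re.match(cleaned)
    return (m.group('postcodeOutward'), m.group('postcodeInward')) if m else None


def salvage_outward(cleaned):
    """Return the leading outward component if one can be found, else None."""
    m = outward_re.match(cleaned)
    return m.group('postcodeOutward') if m else None


print(split_postcode('NP45DG'))    # ('NP4', '5DG')
print(split_postcode('CH647TE'))   # ('CH64', '7TE')
print(split_postcode('AB123CD'))   # None, because C is not a permitted inward letter
print(salvage_outward('AB123CD'))  # AB12
```

Backtracking does the work of deciding where the outward component ends: for NP45DG the optional fourth character is given up so that 5DG can match the inward pattern.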

Example

Clean postcode based on format alone

# Create a test dataframe that contains a postcode variable and some other empty variables
# that have the same names as the new variables that will be created. Setting the 'phjDropExisting'
# argument to True will automatically drop pre-existing variables before running the function.
# Some of the variables in the test dataframe are not duplicated and are present to show that the
# function preserves those variables intact.

import numpy as np
import pandas as pd
import re
import epydemiology as epy   # provides the 'epy' alias used below

# Create test dataframe
myTestPostcodeDF = pd.DataFrame({'postcode': ['NP45DG',
                                              'CH647TE',
                                              'CH5 4HE',
                                              'GIR 0AA',
                                              'NOT NOWN',
                                              'GIR0AB',
                                              'NOR12A',
                                              'no idea',
                                              'W1A 1AA',
                                              'missin',
                                              'NP4  OGH',
                                              'P012 OLL',
                                              'p01s',
                                              'ABCD',
                                              '',
                                              'ab123cd',
                                              'un-known',
                                              'B1    INJ',
                                              'AB123CD',
                                              'No idea what the postcode is',
                                              '    ???NP4-5DG_*#   '],
                                 'pcdClean': np.nan,
                                 'pcd7': np.nan,
                                 'postcodeOutward': np.nan,
                                 'someOtherCol': np.nan})

# Run function to extract postcode data
print('\nStart dataframe\n===============\n')
print(myTestPostcodeDF)
print('\n')

myTestPostcodeDF = epy.phjCleanUKPostcodeVariable(phjDF = myTestPostcodeDF,
                                                  phjRealPostcodeSer = None,
                                                  phjOrigPostcodeVarName = 'postcode',
                                                  phjNewPostcodeVarName = 'pcdClean',
                                                  phjNewPostcodeStrLenVarName = 'pcdCleanStrLen',
                                                  phjPostcodeCheckVarName = 'pcdFormatCheck',
                                                  phjMissingValueCode = 'missing',
                                                  phjMinDamerauLevenshteinDistanceVarName = 'minDamLevDist',
                                                  phjBestAlternativesVarName = 'bestAlternatives',
                                                  phjPostcode7VarName = 'pcd7',
                                                  phjPostcodeAreaVarName = 'pcdArea',
                                                  phjSalvageOutwardPostcodeComponent = True,
                                                  phjCheckByOption = 'format',
                                                  phjDropExisting = True,
                                                  phjPrintResults = True)

print('\nReturned dataframe\n==================\n')
print(myTestPostcodeDF)

This produces the following output:

Start dataframe
===============

                        postcode  pcdClean  pcd7  postcodeOutward  \
0                         NP45DG       NaN   NaN              NaN   
1                        CH647TE       NaN   NaN              NaN   
2                        CH5 4HE       NaN   NaN              NaN   
3                        GIR 0AA       NaN   NaN              NaN   
4                       NOT NOWN       NaN   NaN              NaN   
5                         GIR0AB       NaN   NaN              NaN   
6                         NOR12A       NaN   NaN              NaN   
7                        no idea       NaN   NaN              NaN   
8                        W1A 1AA       NaN   NaN              NaN   
9                         missin       NaN   NaN              NaN   
10                      NP4  OGH       NaN   NaN              NaN   
11                      P012 OLL       NaN   NaN              NaN   
12                          p01s       NaN   NaN              NaN   
13                          ABCD       NaN   NaN              NaN   
14                                     NaN   NaN              NaN   
15                       ab123cd       NaN   NaN              NaN   
16                      un-known       NaN   NaN              NaN   
17                     B1    INJ       NaN   NaN              NaN   
18                       AB123CD       NaN   NaN              NaN   
19  No idea what the postcode is       NaN   NaN              NaN   
20              ???NP4-5DG_*#          NaN   NaN              NaN   

    someOtherCol  
0            NaN  
1            NaN  
2            NaN  
3            NaN  
4            NaN  
5            NaN  
6            NaN  
7            NaN  
8            NaN  
9            NaN  
10           NaN  
11           NaN  
12           NaN  
13           NaN  
14           NaN  
15           NaN  
16           NaN  
17           NaN  
18           NaN  
19           NaN  
20           NaN  



Correctly and incorrectly formatted postcodes (BEFORE ERROR CORRECTION):
False    10
True      7
Name: pcdFormatCheck, dtype: int64
                        postcode                 pcdClean pcdFormatCheck  \
0                         NP45DG                   NP45DG           True   
1                        CH647TE                  CH647TE           True   
2                        CH5 4HE                   CH54HE           True   
3                        GIR 0AA                   GIR0AA           True   
4                       NOT NOWN                  missing            NaN   
5                         GIR0AB                   GIR0AB          False   
6                         NOR12A                   NOR12A           True   
7                        no idea                   NOIDEA          False   
8                        W1A 1AA                   W1A1AA           True   
9                         missin                  missing            NaN   
10                      NP4  OGH                   NP4OGH          False   
11                      P012 OLL                  P012OLL          False   
12                          p01s                     P01S          False   
13                          ABCD                     ABCD          False   
14                                                missing            NaN   
15                       ab123cd                  AB123CD          False   
16                      un-known                  missing            NaN   
17                     B1    INJ                    B1INJ          False   
18                       AB123CD                  AB123CD          False   
19  No idea what the postcode is  NOIDEAWHATTHEPOSTCODEIS          False   
20              ???NP4-5DG_*#                      NP45DG           True   

    pcdCleanStrLen  
0                6  
1                7  
2                6  
3                6  
4                7  
5                6  
6                6  
7                6  
8                6  
9                7  
10               6  
11               7  
12               4  
13               4  
14               7  
15               7  
16               7  
17               5  
18               7  
19              23  
20               6  



Correctly and incorrectly formatted postcodes (AFTER ERROR CORRECTION):
True     10
False     7
Name: pcdFormatCheck, dtype: int64
                        postcode pcdClean pcdFormatCheck  pcdCleanStrLen
0                         NP45DG   NP45DG           True               6
1                        CH647TE  CH647TE           True               7
2                        CH5 4HE   CH54HE           True               6
3                        GIR 0AA   GIR0AA           True               6
4                       NOT NOWN  missing            NaN               7
5                         GIR0AB   GIR0AB          False               6
6                         NOR12A   NOR12A           True               6
7                        no idea   NO1DEA          False               6
8                        W1A 1AA   W1A1AA           True               6
9                         missin  missing            NaN               7
10                      NP4  OGH   NP40GH           True               6
11                      P012 OLL  PO120LL           True               7
12                          p01s     PO15          False               4
13                          ABCD     ABCD          False               4
14                                missing            NaN               7
15                       ab123cd  AB123CD          False               7
16                      un-known  missing            NaN               7
17                     B1    INJ    B11NJ           True               5
18                       AB123CD  AB123CD          False               7
19  No idea what the postcode is  missing          False              23
20              ???NP4-5DG_*#      NP45DG           True               6



Final working postcode dataframe
================================

                        postcode pcdClean pcdFormatCheck  pcdCleanStrLen  \
0                         NP45DG   NP45DG           True               6   
1                        CH647TE  CH647TE           True               7   
2                        CH5 4HE   CH54HE           True               6   
3                        GIR 0AA   GIR0AA           True               6   
4                       NOT NOWN  missing            NaN               7   
5                         GIR0AB  missing          False               6   
6                         NOR12A   NOR12A           True               6   
7                        no idea  missing          False               6   
8                        W1A 1AA   W1A1AA           True               6   
9                         missin  missing            NaN               7   
10                      NP4  OGH   NP40GH           True               6   
11                      P012 OLL  PO120LL           True               7   
12                          p01s     PO15           True               4   
13                          ABCD  missing          False               4   
14                                missing            NaN               7   
15                       ab123cd     AB12           True               7   
16                      un-known  missing            NaN               7   
17                     B1    INJ    B11NJ           True               5   
18                       AB123CD     AB12           True               7   
19  No idea what the postcode is  missing          False              23   
20              ???NP4-5DG_*#      NP45DG           True               6   

       pcd7 postcodeOutward postcodeInward pcdArea  
0   NP4 5DG             NP4            5DG      NP  
1   CH647TE            CH64            7TE      CH  
2   CH5 4HE             CH5            4HE      CH  
3   GIR 0AA             GIR            0AA     GIR  
4       NaN             NaN            NaN     NaN  
5       NaN             NaN            NaN     NaN  
6   NOR 12A             NOR            12A     NOR  
7       NaN             NaN            NaN     NaN  
8   W1A 1AA             W1A            1AA       W  
9       NaN             NaN            NaN     NaN  
10  NP4 0GH             NP4            0GH      NP  
11  PO120LL            PO12            0LL      PO  
12      NaN            PO15            NaN      PO  
13      NaN             NaN            NaN     NaN  
14      NaN             NaN            NaN     NaN  
15      NaN            AB12            NaN      AB  
16      NaN             NaN            NaN     NaN  
17  B1  1NJ              B1            1NJ       B  
18      NaN            AB12            NaN      AB  
19      NaN             NaN            NaN     NaN  
20  NP4 5DG             NP4            5DG      NP  



Returned dataframe
==================

                        postcode  someOtherCol pcdClean pcdFormatCheck  \
0                         NP45DG           NaN   NP45DG           True   
1                        CH647TE           NaN  CH647TE           True   
2                        CH5 4HE           NaN   CH54HE           True   
3                        GIR 0AA           NaN   GIR0AA           True   
4                       NOT NOWN           NaN  missing            NaN   
5                         GIR0AB           NaN  missing          False   
6                         NOR12A           NaN   NOR12A           True   
7                        no idea           NaN  missing          False   
8                        W1A 1AA           NaN   W1A1AA           True   
9                         missin           NaN  missing            NaN   
10                      NP4  OGH           NaN   NP40GH           True   
11                      P012 OLL           NaN  PO120LL           True   
12                          p01s           NaN     PO15           True   
13                          ABCD           NaN  missing          False   
14                                         NaN  missing            NaN   
15                       ab123cd           NaN     AB12           True   
16                      un-known           NaN  missing            NaN   
17                     B1    INJ           NaN    B11NJ           True   
18                       AB123CD           NaN     AB12           True   
19  No idea what the postcode is           NaN  missing          False   
20              ???NP4-5DG_*#              NaN   NP45DG           True   

    pcdCleanStrLen     pcd7 postcodeOutward postcodeInward pcdArea  
0                6  NP4 5DG             NP4            5DG      NP  
1                7  CH647TE            CH64            7TE      CH  
2                6  CH5 4HE             CH5            4HE      CH  
3                6  GIR 0AA             GIR            0AA     GIR  
4                7      NaN             NaN            NaN     NaN  
5                6      NaN             NaN            NaN     NaN  
6                6  NOR 12A             NOR            12A     NOR  
7                6      NaN             NaN            NaN     NaN  
8                6  W1A 1AA             W1A            1AA       W  
9                7      NaN             NaN            NaN     NaN  
10               6  NP4 0GH             NP4            0GH      NP  
11               7  PO120LL            PO12            0LL      PO  
12               4      NaN            PO15            NaN      PO  
13               4      NaN             NaN            NaN     NaN  
14               7      NaN             NaN            NaN     NaN  
15               7      NaN            AB12            NaN      AB  
16               7      NaN             NaN            NaN     NaN  
17               5  B1  1NJ              B1            1NJ       B  
18               7      NaN            AB12            NaN      AB  
19              23      NaN             NaN            NaN     NaN  
20               6  NP4 5DG             NP4            5DG      NP

Clean postcodes based on real postcode and identify closest matches

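import numpy as np
import pandas as pd
import re
import epydemiology as epy   # provides the 'epy' alias used below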

# N.B. When calculating best alternative postcodes, only postcodes that are within
# 1 DL distance are considered.

# Create a Pandas series that could contain all the postcodes in the UK
realPostcodesSer = pd.Series(['NP4 5DG','CH647TE','CH5 4HE','W1A 1AA','NP4 0GH','PO120LL','AB123CF','AB124DF','AB123CV'])

# Create test dataframe
myTestPostcodeDF = pd.DataFrame({'postcode': ['NP45DG',
                                              'CH647TE',
                                              'CH5 4HE',
                                              'GIR 0AA',
                                              'NOT NOWN',
                                              'GIR0AB',
                                              'NOR12A',
                                              'no idea',
                                              'W1A 1AA',
                                              'missin',
                                              'NP4  OGH',
                                              'P012 OLL',
                                              'p01s',
                                              'ABCD',
                                              '',
                                              'ab123cd',
                                              'un-known',
                                              'B1    INJ',
                                              'AB123CD',
                                              'No idea what the postcode is',
                                              '    ???NP4-5DG_*#   '],
                                 'pcdClean': np.nan,
                                 'pcd7': np.nan,
                                 'postcodeOutward': np.nan,
                                 'someOtherCol': np.nan})

# Run function to extract postcode data
print('\nStart dataframe\n===============\n')
print(myTestPostcodeDF)
print('\n')

myTestPostcodeDF = epy.phjCleanUKPostcodeVariable(phjDF = myTestPostcodeDF,
                                                  phjRealPostcodeSer = realPostcodesSer,
                                                  phjOrigPostcodeVarName = 'postcode',
                                                  phjNewPostcodeVarName = 'pcdClean',
                                                  phjNewPostcodeStrLenVarName = 'pcdCleanStrLen',
                                                  phjPostcodeCheckVarName = 'pcdFormatCheck',
                                                  phjMissingValueCode = 'missing',
                                                  phjMinDamerauLevenshteinDistanceVarName = 'minDamLevDist',
                                                  phjBestAlternativesVarName = 'bestAlternatives',
                                                  phjPostcode7VarName = 'pcd7',
                                                  phjPostcodeAreaVarName = 'pcdArea',
                                                  phjSalvageOutwardPostcodeComponent = True,
                                                  phjCheckByOption = 'dictionary',
                                                  phjDropExisting = True,
                                                  phjPrintResults = True)

print('\nReturned dataframe\n==================\n')
print(myTestPostcodeDF)

This produces the following output:

Start dataframe
===============

                        postcode  pcdClean  pcd7  postcodeOutward  \
0                         NP45DG       NaN   NaN              NaN   
1                        CH647TE       NaN   NaN              NaN   
2                        CH5 4HE       NaN   NaN              NaN   
3                        GIR 0AA       NaN   NaN              NaN   
4                       NOT NOWN       NaN   NaN              NaN   
5                         GIR0AB       NaN   NaN              NaN   
6                         NOR12A       NaN   NaN              NaN   
7                        no idea       NaN   NaN              NaN   
8                        W1A 1AA       NaN   NaN              NaN   
9                         missin       NaN   NaN              NaN   
10                      NP4  OGH       NaN   NaN              NaN   
11                      P012 OLL       NaN   NaN              NaN   
12                          p01s       NaN   NaN              NaN   
13                          ABCD       NaN   NaN              NaN   
14                                     NaN   NaN              NaN   
15                       ab123cd       NaN   NaN              NaN   
16                      un-known       NaN   NaN              NaN   
17                     B1    INJ       NaN   NaN              NaN   
18                       AB123CD       NaN   NaN              NaN   
19  No idea what the postcode is       NaN   NaN              NaN   
20              ???NP4-5DG_*#          NaN   NaN              NaN   

    someOtherCol  
0            NaN  
1            NaN  
2            NaN  
3            NaN  
4            NaN  
5            NaN  
6            NaN  
7            NaN  
8            NaN  
9            NaN  
10           NaN  
11           NaN  
12           NaN  
13           NaN  
14           NaN  
15           NaN  
16           NaN  
17           NaN  
18           NaN  
19           NaN  
20           NaN  



Correctly and incorrectly formatted postcodes (BEFORE ERROR CORRECTION):
False    12
True      5
Name: pcdFormatCheck, dtype: int64
                        postcode                 pcdClean pcdFormatCheck  \
0                         NP45DG                   NP45DG           True   
1                        CH647TE                  CH647TE           True   
2                        CH5 4HE                   CH54HE           True   
3                        GIR 0AA                   GIR0AA          False   
4                       NOT NOWN                  missing            NaN   
5                         GIR0AB                   GIR0AB          False   
6                         NOR12A                   NOR12A          False   
7                        no idea                   NOIDEA          False   
8                        W1A 1AA                   W1A1AA           True   
9                         missin                  missing            NaN   
10                      NP4  OGH                   NP4OGH          False   
11                      P012 OLL                  P012OLL          False   
12                          p01s                     P01S          False   
13                          ABCD                     ABCD          False   
14                                                missing            NaN   
15                       ab123cd                  AB123CD          False   
16                      un-known                  missing            NaN   
17                     B1    INJ                    B1INJ          False   
18                       AB123CD                  AB123CD          False   
19  No idea what the postcode is  NOIDEAWHATTHEPOSTCODEIS          False   
20              ???NP4-5DG_*#                      NP45DG           True   

    pcdCleanStrLen  
0                6  
1                7  
2                6  
3                6  
4                7  
5                6  
6                6  
7                6  
8                6  
9                7  
10               6  
11               7  
12               4  
13               4  
14               7  
15               7  
16               7  
17               5  
18               7  
19              23  
20               6  



Correctly and incorrectly formatted postcodes (AFTER ERROR CORRECTION):
False    10
True      7
Name: pcdFormatCheck, dtype: int64
                        postcode pcdClean pcdFormatCheck  pcdCleanStrLen
0                         NP45DG   NP45DG           True               6
1                        CH647TE  CH647TE           True               7
2                        CH5 4HE   CH54HE           True               6
3                        GIR 0AA   GIR0AA          False               6
4                       NOT NOWN  missing            NaN               7
5                         GIR0AB   GIR0AB          False               6
6                         NOR12A   NOR1ZA          False               6
7                        no idea   NO1DEA          False               6
8                        W1A 1AA   W1A1AA           True               6
9                         missin  missing            NaN               7
10                      NP4  OGH   NP40GH           True               6
11                      P012 OLL  PO120LL           True               7
12                          p01s     PO15          False               4
13                          ABCD     ABCD          False               4
14                                missing            NaN               7
15                       ab123cd  AB123CD          False               7
16                      un-known  missing            NaN               7
17                     B1    INJ    B11NJ          False               5
18                       AB123CD  AB123CD          False               7
19  No idea what the postcode is  missing          False              23
20              ???NP4-5DG_*#      NP45DG           True               6


Consider first postcode entry: GIR0AA
   Returned list of edits: [4, None]

Consider first postcode entry: GIR0AA
   Returned list of edits: [4, None]

Consider first postcode entry: GIR0AB
   Returned list of edits: [5, None]

Consider first postcode entry: NOR1ZA
   Returned list of edits: [4, None]

Consider first postcode entry: NO1DEA
   Returned list of edits: [5, None]

Consider first postcode entry: PO15
   Returned list of edits: [4, None]

Consider first postcode entry: ABCD
   Returned list of edits: [4, None]

Consider first postcode entry: AB123CD
   Returned list of edits: [1, ['AB123CF', 'AB123CV']]

Consider first postcode entry: B11NJ
   Returned list of edits: [4, None]

Consider first postcode entry: AB123CD
   Returned list of edits: [1, ['AB123CF', 'AB123CV']]


Final working postcode dataframe
================================

                        postcode pcdClean pcdFormatCheck  pcdCleanStrLen  \
0                         NP45DG   NP45DG           True             6.0   
1                        CH647TE  CH647TE           True             7.0   
2                        CH5 4HE   CH54HE           True             6.0   
3                        GIR 0AA  missing          False             6.0   
4                       NOT NOWN  missing            NaN             7.0   
5                         GIR0AB  missing          False             6.0   
6                         NOR12A  missing          False             6.0   
7                        no idea  missing          False             6.0   
8                        W1A 1AA   W1A1AA           True             6.0   
9                         missin  missing            NaN             7.0   
10                      NP4  OGH   NP40GH           True             6.0   
11                      P012 OLL  PO120LL           True             7.0   
12                          p01s  missing          False             4.0   
13                          ABCD  missing          False             4.0   
14                                missing            NaN             7.0   
15                       ab123cd  missing          False             7.0   
16                      un-known  missing            NaN             7.0   
17                     B1    INJ  missing          False             5.0   
18                       AB123CD  missing          False             7.0   
19  No idea what the postcode is  missing          False            23.0   
20              ???NP4-5DG_*#      NP45DG           True             6.0   

       pcd7 postcodeOutward postcodeInward  minDamLevDist    bestAlternatives  \
0   NP4 5DG             NP4            5DG            NaN                 NaN   
1   CH647TE            CH64            7TE            NaN                 NaN   
2   CH5 4HE             CH5            4HE            NaN                 NaN   
3       NaN             NaN            NaN            4.0                 NaN   
4       NaN             NaN            NaN            NaN                 NaN   
5       NaN             NaN            NaN            5.0                 NaN   
6       NaN             NaN            NaN            4.0                 NaN   
7       NaN             NaN            NaN            5.0                 NaN   
8   W1A 1AA             W1A            1AA            NaN                 NaN   
9       NaN             NaN            NaN            NaN                 NaN   
10  NP4 0GH             NP4            0GH            NaN                 NaN   
11  PO120LL            PO12            0LL            NaN                 NaN   
12      NaN             NaN            NaN            4.0                 NaN   
13      NaN             NaN            NaN            4.0                 NaN   
14      NaN             NaN            NaN            NaN                 NaN   
15      NaN             NaN            NaN            1.0  [AB123CF, AB123CV]   
16      NaN             NaN            NaN            NaN                 NaN   
17      NaN             NaN            NaN            4.0                 NaN   
18      NaN             NaN            NaN            1.0  [AB123CF, AB123CV]   
19      NaN             NaN            NaN            NaN                 NaN   
20  NP4 5DG             NP4            5DG            NaN                 NaN   

   pcdArea  
0       NP  
1       CH  
2       CH  
3      NaN  
4      NaN  
5      NaN  
6      NaN  
7      NaN  
8        W  
9      NaN  
10      NP  
11      PO  
12     NaN  
13     NaN  
14     NaN  
15     NaN  
16     NaN  
17     NaN  
18     NaN  
19     NaN  
20      NP  



Returned dataframe
==================

                        postcode  someOtherCol pcdClean pcdFormatCheck  \
0                         NP45DG           NaN   NP45DG           True   
1                        CH647TE           NaN  CH647TE           True   
2                        CH5 4HE           NaN   CH54HE           True   
3                        GIR 0AA           NaN  missing          False   
4                       NOT NOWN           NaN  missing            NaN   
5                         GIR0AB           NaN  missing          False   
6                         NOR12A           NaN  missing          False   
7                        no idea           NaN  missing          False   
8                        W1A 1AA           NaN   W1A1AA           True   
9                         missin           NaN  missing            NaN   
10                      NP4  OGH           NaN   NP40GH           True   
11                      P012 OLL           NaN  PO120LL           True   
12                          p01s           NaN  missing          False   
13                          ABCD           NaN  missing          False   
14                                         NaN  missing            NaN   
15                       ab123cd           NaN  missing          False   
16                      un-known           NaN  missing            NaN   
17                     B1    INJ           NaN  missing          False   
18                       AB123CD           NaN  missing          False   
19  No idea what the postcode is           NaN  missing          False   
20              ???NP4-5DG_*#              NaN   NP45DG           True   

    pcdCleanStrLen     pcd7 postcodeOutward postcodeInward  minDamLevDist  \
0              6.0  NP4 5DG             NP4            5DG            NaN   
1              7.0  CH647TE            CH64            7TE            NaN   
2              6.0  CH5 4HE             CH5            4HE            NaN   
3              6.0      NaN             NaN            NaN            4.0   
4              7.0      NaN             NaN            NaN            NaN   
5              6.0      NaN             NaN            NaN            5.0   
6              6.0      NaN             NaN            NaN            4.0   
7              6.0      NaN             NaN            NaN            5.0   
8              6.0  W1A 1AA             W1A            1AA            NaN   
9              7.0      NaN             NaN            NaN            NaN   
10             6.0  NP4 0GH             NP4            0GH            NaN   
11             7.0  PO120LL            PO12            0LL            NaN   
12             4.0      NaN             NaN            NaN            4.0   
13             4.0      NaN             NaN            NaN            4.0   
14             7.0      NaN             NaN            NaN            NaN   
15             7.0      NaN             NaN            NaN            1.0   
16             7.0      NaN             NaN            NaN            NaN   
17             5.0      NaN             NaN            NaN            4.0   
18             7.0      NaN             NaN            NaN            1.0   
19            23.0      NaN             NaN            NaN            NaN   
20             6.0  NP4 5DG             NP4            5DG            NaN   

      bestAlternatives pcdArea  
0                  NaN      NP  
1                  NaN      CH  
2                  NaN      CH  
3                  NaN     NaN  
4                  NaN     NaN  
5                  NaN     NaN  
6                  NaN     NaN  
7                  NaN     NaN  
8                  NaN       W  
9                  NaN     NaN  
10                 NaN      NP  
11                 NaN      PO  
12                 NaN     NaN  
13                 NaN     NaN  
14                 NaN     NaN  
15  [AB123CF, AB123CV]     NaN  
16                 NaN     NaN  
17                 NaN     NaN  
18  [AB123CF, AB123CV]     NaN  
19                 NaN     NaN  
20                 NaN      NP
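
In the output above, the minDamLevDist and bestAlternatives columns appear to hold, for entries that fail the dictionary check, the smallest Damerau-Levenshtein edit distance to any postcode in realPostcodesSer and the real postcodes found at that distance. As a rough illustration of the idea only (this is not the code used inside epydemiology), the sketch below computes the restricted "optimal string alignment" variant of the distance in plain Python. The names osa_distance and best_alternatives are hypothetical and are not part of the package.

```python
def osa_distance(a, b):
    """Restricted Damerau-Levenshtein (optimal string alignment) distance.

    Counts insertions, deletions, substitutions and transpositions of
    adjacent characters, each at a cost of 1.
    """
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i                      # delete all of a[:i]
    for j in range(cols):
        d[0][j] = j                      # insert all of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution or match
            # Transposition of two adjacent characters
            if (i > 1 and j > 1
                    and a[i - 1] == b[j - 2]
                    and a[i - 2] == b[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


def best_alternatives(candidate, real_postcodes):
    """Return (minimum distance, list of real postcodes at that distance).

    Hypothetical helper: spaces are removed from the real postcodes
    before comparison so they match the cleaned, space-free format.
    """
    cleaned = [p.replace(' ', '').upper() for p in real_postcodes]
    distances = [osa_distance(candidate, p) for p in cleaned]
    best = min(distances)
    return best, [p for p, dist in zip(cleaned, distances) if dist == best]


# Same real postcodes as realPostcodesSer in the example above
real_postcodes = ['NP4 5DG', 'CH647TE', 'CH5 4HE', 'W1A 1AA', 'NP4 0GH',
                  'PO120LL', 'AB123CF', 'AB124DF', 'AB123CV']

print(best_alternatives('AB123CD', real_postcodes))
# → (1, ['AB123CF', 'AB123CV'])
```

For AB123CD this agrees with the values shown in rows 15 and 18. Unlike the function's output, the sketch always returns the nearest postcodes, however distant; the output above lists no alternatives for entries with larger distances (for example GIR0AA at distance 4), which the sketch does not attempt to reproduce.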