Clean UK postcodes - lvphj/epydemiology GitHub Wiki
Python function to clean and extract correctly formatted UK postcode data in a Pandas dataframe.
phjCleanUKPostcodeVariable()
df = epy.phjCleanUKPostcodeVariable(phjDF,
phjRealPostcodeSer = None,
phjOrigPostcodeVarName = 'postcode',
phjNewPostcodeVarName = 'postcodeClean',
phjNewPostcodeStrLenVarName = 'postcodeCleanStrLen',
phjPostcodeCheckVarName = 'postcodeCheck',
phjMissingValueCode = 'missing',
phjMinDamerauLevenshteinDistanceVarName = 'minDamLevDist',
phjBestAlternativesVarName = 'bestAlternatives',
phjPostcode7VarName = 'postcode7',
phjPostcodeAreaVarName = 'postcodeArea',
phjSalvageOutwardPostcodeComponent = True,
phjCheckByOption = 'format',
phjDropExisting = False,
phjPrintResults = True)
Description
Postcodes are often stored in a database field as part of an address. When they are typed by hand or transcribed from written notes, they are frequently entered incorrectly, either because of typing errors or because the full postcode is not known. A postcode variable will therefore typically contain many correct postcodes alongside incorrect or partial entries. This function extracts correctly formatted postcodes and corrects some commonly occurring transcription errors to produce a correctly formatted postcode. Where only the outward component (first half) of the postcode is recorded, the function can attempt to salvage just that component. Finally, the function extracts the postcode area (the first 1 or 2 letters of the postcode). The cleaned postcode (both with no spaces and in 7-character format), the outward and inward components, and the postcode area are stored in new variables that are added to the original dataframe.
This function uses one of two methods to extract postcode information:
-
checking the postcode is correctly 'formatted' using a regex;
-
comparing the postcode to a database of all known postcodes and, if the postcode does not exist, determining the most likely alternatives based on Damerau-Levenshtein distance and on the physical position of inserted or transposed characters on the keyboard. This method makes use of the fast damerau-levenshtein library (pyxdameraulevenshtein) that needs to be installed in the Python environment.
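To make the second method more concrete, the following is a minimal pure-Python sketch of the restricted Damerau-Levenshtein (optimal string alignment) distance and of picking the closest real postcodes. It is an illustration only: the function itself relies on pyxdameraulevenshtein, the names `damerau_levenshtein` and `closest_matches` are hypothetical, and the keyboard-position weighting is not included.

```python
def damerau_levenshtein(a: str, b: str) -> int:
    """Restricted Damerau-Levenshtein (optimal string alignment) distance."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i                      # delete all of a[:i]
    for j in range(cols):
        d[0][j] = j                      # insert all of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
            if (i > 1 and j > 1
                    and a[i - 1] == b[j - 2]
                    and a[i - 2] == b[j - 1]):
                # transposition of two adjacent characters
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


def closest_matches(candidate, real_postcodes, max_distance=1):
    """Return (minimum distance, closest postcodes or None).

    Hypothetical helper: alternatives are only reported when the minimum
    distance does not exceed max_distance. Assumes real_postcodes is not empty.
    """
    scored = [(damerau_levenshtein(candidate, p.replace(' ', '')), p.replace(' ', ''))
              for p in real_postcodes]
    best = min(dist for dist, _ in scored)
    if best > max_distance:
        return best, None
    return best, sorted(p for dist, p in scored if dist == best)


real = ['NP4 5DG', 'CH647TE', 'CH5 4HE', 'W1A 1AA', 'NP4 0GH',
        'PO120LL', 'AB123CF', 'AB124DF', 'AB123CV']
print(closest_matches('AB123CD', real))   # (1, ['AB123CF', 'AB123CV'])
```

Reporting both the distance and the list of candidates leaves the final choice to the user when several real postcodes are equally close.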
The regex used to determine whether postcodes are correctly formatted is a modified version of a regex published at https://en.wikipedia.org/wiki/Talk:Postcodes_in_the_United_Kingdom (accessed 22 Mar 2016). (This page is also stored locally as a PDF entitled, "Talk/Postcodes in the United Kingdom - Wikipedia, the free encyclopedia".)
The function takes, as two of its arguments, a Pandas dataframe containing a column of postcode data, and the name of that postcode column. It returns the same dataframe with some additional, postcode-related columns. The additional columns returned are:
-
'postcodeClean' (column name is user-defined through phjNewPostcodeVarName argument)
This variable will contain the correctly formatted components of the postcode, either the whole postcode or the outward component (first half of postcode). Postcodes that are incorrectly formatted or have been entered as missing values will contain the missing value code (e.g. 'missing').
-
'postcodeCheck' (column name is user-defined through the phjPostcodeCheckVarName argument)
This is a binary variable that contains True if a correctly formatted postcode component can be extracted, either the whole postcode or the outward component only. Otherwise, it contains False.
-
'postcode7' (column name is user-defined through the phjPostcode7VarName argument)
This variable contains correctly formatted complete postcodes in 7-character format. For postcodes that contain 5 characters, the outward and inward components will be separated by 2 spaces; for postcodes that contain 6 characters, the outward and inward components will be separated by 1 space; and postcodes that contain 7 characters will contain no spaces. This format of postcodes is often used in postcode lookup tables.
-
'postcodeOutward' (defined as a group name in the regular expression and, therefore, not user-definable)
This variable contains the outward component of the postcode (first half of postcode). It is possible that this variable may contain a correctly formatted postcode string (2 to 4 characters) whilst the variable containing the inward postcode string contains the missing value code.
-
'postcodeInward' (defined as a group name in the regular expression and, therefore, not user-definable)
This variable contains the inward component of the postcode (second half of postcode). It is possible that this variable may contain a missing value whilst the postcodeOutward variable contains a correctly-formatted postcode string (2 to 4 characters).
-
'postcodeArea' (column name is user-defined through the phjPostcodeAreaVarName argument)
This variable contains the postcode area (first one or two letters) taken from correctly formatted outward postcode components.
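The 7-character format and the postcode area described above can be sketched as follows. This is a minimal illustration rather than the library's own code, and the helper names `to_postcode7` and `postcode_area` are hypothetical.

```python
import re


def to_postcode7(outward: str, inward: str) -> str:
    """Pad the outward component to 4 characters, then append the
    3-character inward component, giving a 7-character string."""
    return outward.ljust(4) + inward


def postcode_area(outward: str):
    """Return the leading letters of an outward component, or None."""
    match = re.match(r'[A-Z]+', outward)
    return match.group(0) if match else None


print(repr(to_postcode7('B1', '1NJ')))    # 'B1  1NJ'  (2 spaces)
print(repr(to_postcode7('NP4', '5DG')))   # 'NP4 5DG'  (1 space)
print(repr(to_postcode7('CH64', '7TE')))  # 'CH647TE'  (no spaces)
print(postcode_area('W1A'))               # W
```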
If postcodes are checked using a regex, the function proceeds as follows:
-
Postcode data are cleaned by removing all spaces and punctuation marks and converting all letters to uppercase. Missing values and strings that cannot possibly be a postcode (e.g. all-numeric data) are converted to the missing value code. The cleaned strings are stored temporarily in the postcodeClean variable.
-
Correctly formatted postcodes (in the postcodeClean column) are identified using the regular expression and the postcodeCheck variable is set to True. Outward and inward components are extracted and stored in the relevant columns.
-
Postcodes that are incorrectly formatted undergo an error-correction step where common typos and mis-transcriptions are corrected. After this process, the format of the corrected postcode is checked again using the regex and the postcodeCheck variable is set to True if appropriate. Outward and inward components are extracted and stored in the relevant columns.
-
If the phjSalvageOutwardPostcodeComponent argument is set to True (default), the function attempts to salvage just the outward postcode component. The postcode strings in the postcodeClean variable are tested using the outward component of the regex to determine whether the first 2 to 4 characters represent a correctly formatted outward component of a postcode. If so, postcodeCheck is set to True and the partial string is extracted and stored in the postcodeOutward column.
-
Common typos and mis-transcriptions are corrected once again and the string is tested against the regex to determine whether the first 2 to 4 characters represent a correctly formatted outward component of a postcode. If so, postcodeCheck is set to True and the partial string is extracted and stored in the postcodeOutward column.
-
For any postcode strings that have not been identified as a complete or partial match to the postcode regex, the postcodeClean variable is set to the missing value code.
-
The postcode area is extracted from the postcodeOutward variable and stored in the postcodeArea variable.
-
The function returns the dataframe containing the additional columns.
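The initial cleaning step in the list above can be sketched as follows. This is a minimal illustration under stated assumptions: the helper name `basic_clean` is hypothetical, and it only covers what the step describes (strip spaces and punctuation, convert to uppercase, map empty or all-numeric strings to the missing value code). The example output further down suggests the function also recognises some other strings, such as 'NOT NOWN', as missing; that is not reproduced here.

```python
import re

MISSING = 'missing'   # mirrors the default of phjMissingValueCode


def basic_clean(raw, missing_code=MISSING):
    """Strip whitespace and punctuation, uppercase, and flag impossible values."""
    # Treat None and NaN (NaN is the only value not equal to itself) as missing
    if raw is None or (isinstance(raw, float) and raw != raw):
        return missing_code
    # Remove every character that is not a letter or digit
    cleaned = re.sub(r'[\W_]+', '', str(raw)).upper()
    # Empty or purely numeric strings cannot be postcodes
    if cleaned == '' or cleaned.isdigit():
        return missing_code
    return cleaned


for value in ['CH5 4HE', '   ???NP4-5DG_*#   ', 'no idea', '', '12345']:
    print(repr(value), '->', basic_clean(value))
```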
If postcodes are checked against a list of correct postcodes, the function proceeds in a similar way except that incorrect postcodes are compared with correct postcodes using the Damerau-Levenshtein distance, weighted for the physical distance of inserted or transposed characters on a standard QWERTY keyboard.
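The idea of physical distance between keys can be sketched as below. This is only an illustration: the key coordinates, the row offsets and the name `key_distance` are assumptions, and the page does not document how the function actually measures or applies keyboard distance.

```python
import math

# Rows of a QWERTY keyboard (digits and letters only)
_ROWS = ['1234567890', 'QWERTYUIOP', 'ASDFGHJKL', 'ZXCVBNM']
# Assumed horizontal stagger of each row, in key widths
_OFFSETS = [0.0, 0.5, 0.75, 1.25]

# Map each key to an (x, y) position in units of one key width
KEY_POS = {key: (col + _OFFSETS[row], float(row))
           for row, keys in enumerate(_ROWS)
           for col, key in enumerate(keys)}


def key_distance(a: str, b: str) -> float:
    """Euclidean distance between two keys under the assumed layout."""
    (x1, y1), (x2, y2) = KEY_POS[a.upper()], KEY_POS[b.upper()]
    return math.hypot(x1 - x2, y1 - y2)


# 'D' mistyped as 'F' (adjacent key) versus 'D' mistyped as 'V'
print(key_distance('D', 'F'))
print(key_distance('D', 'V'))
```

Under this layout a substitution of F for D counts as a shorter slip than V for D, which is the kind of information that could be used to rank alternatives that share the same Damerau-Levenshtein distance.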
Function parameters
The function takes the following parameters:
-
phjDF
Pandas dataframe containing a variable that contains postcode information.
-
phjRealPostcodeSer (default = None)
If the postcodes are to be compared to real postcodes, this variable should refer to a Pandas Series of genuine postcodes.
-
phjOrigPostcodeVarName (default = 'postcode')
The name of the variable that contains postcode information.
-
phjNewPostcodeVarName (default = 'postcodeClean')
The name of the variable that the function creates to contain 'cleaned' postcode data. The postcodes stored in this column will contain no whitespace; therefore, A1 2BC will be entered as A12BC. The 'cleaned' postcode may consist of only the outward component if that is the only correctly formatted data. If the user wants to view only complete postcodes, use phjPostcode7VarName. Strings from which no valid postcode data could be extracted will be stored as the missing value code.
-
phjNewPostcodeStrLenVarName (default = 'postcodeCleanStrLen')
Name of the variable that will be created to contain the length of the postcode.
-
phjPostcodeCheckVarName (default = 'postcodeCheck')
A binary variable that the function will create that indicates whether the whole postcode (or, if only 2 to 4 characters are entered, the outward component of the postcode) is either correctly formatted or matches the list of real postcodes supplied, depending on which check was requested.
-
phjMissingValueCode (default = 'missing')
String used to indicate a missing value. This cannot be np.nan because the DataFrame.update() function does not update NaN values.
-
phjMinDamerauLevenshteinDistanceVarName (default = 'minDamLevDist')
Name of variable that will be created to contain the DL distance.
-
phjBestAlternativesVarName (default = 'bestAlternatives')
Name of variable that will be created to contain best (or closest matching) postcodes from the list of real postcodes.
-
phjPostcode7VarName (default = 'postcode7')
The name of the variable that the function creates to contain 'cleaned' postcode data in 7-character format. Postcodes can contain 5 to 7 characters. In postcodes that consist of 5 characters, the outward and inward components will be separated by 2 spaces; in postcodes that consist of 6 characters, they will be separated by 1 space; and in postcodes that consist of 7 characters there will be no spaces. This format is commonly used in lookup tables that link postcodes to other geographical information.
- phjPostcodeAreaVarName (default = 'postcodeArea')
The name of the variable that the function creates that will contain the postcode area (the first 1, 2 or, in very rare cases, 3 letters).
- phjSalvageOutwardPostcodeComponent (default = True)
Indicates whether user wants to attempt to salvage some outward postcode components from postcode strings.
- phjCheckByOption (default = 'format')
Select the method used to check postcodes. The default is 'format', which checks the format of the postcode using a regular expression. The alternative is 'dictionary', which calculates the Damerau-Levenshtein distance from each postcode in the list of supplied postcodes and chooses the closest matches based on the DL distance and on the physical distance of inserted or transposed characters on a standard QWERTY keyboard. The 'dictionary' option makes use of the fast damerau-levenshtein library (pyxdameraulevenshtein) which, therefore, needs to be installed in the Python environment.
- phjDropExisting (default = False)
If set to True, the function will automatically drop any pre-existing columns that have the same name as those columns that need to be created. If set to False, the function will halt.
- phjPrintResults (default = True)
If set to True, the function will print information to screen as it proceeds.
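The note on phjMissingValueCode above says that np.nan cannot be used because DataFrame.update() does not update NaN values. The following small sketch shows that pandas behaviour in isolation; the dataframes are made up for illustration and are not part of the library.

```python
import numpy as np
import pandas as pd

# Working column of cleaned strings
df = pd.DataFrame({'postcodeClean': ['NP45DG', 'ABCD']})

# Proposed updates: NaN for row 0, a string missing-value code for row 1
updates = pd.DataFrame({'postcodeClean': [np.nan, 'missing']})

# update() modifies df in place and skips NaN values in the other frame
df.update(updates)

print(df['postcodeClean'].tolist())   # ['NP45DG', 'missing']
```

Row 0 keeps its original value because the NaN in the update is ignored, whereas the string code in row 1 is written through. Using a string code is therefore one way to mark values as missing via update().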
Exceptions raised
None.
Returns
By default, function returns the original dataframe with added columns containing postcode data.
Other notes
The regex used to check the format of postcodes is given below and is a modification of the regex found at https://en.wikipedia.org/wiki/Talk:Postcodes_in_the_United_Kingdom (accessed 22 Mar 2016). The regex was modified slightly to allow for optional space between first and second parts of postcode (even though, in this library, all the whitespace is removed before comparing with the regex). Also, the original did not find old Norwich postcodes of the form NOR number-number-letter nor old Newport postcodes of form NPT number-letter-letter. The regex was changed so it consisted of two named components recognising the outward (first half) and inward (second half) of the postcode which could be compiled into a single regex, separated by whitespace (if required).
postcodeOutwardRegex = r'''(?P<postcodeOutward>(?:^GIR(?=\s*0AA$)) | # Identifies special postcode GIR 0AA
(?:^NOR(?=\s*[0-9][0-9][A-Z]$)) | # Identifies old Norwich postcodes of format NOR number-number-letter
(?:^NPT(?=\s*[0-9][A-Z][A-Z]$)) | # Identifies old Newport (South Wales) postcodes of format NPT number-letter-letter
(?:^(?:(?:A[BL]|B[ABDFHLNRSTX]?|C[ABFHMORTVW]|D[ADEGHLNTY]|E[HNX]?|F[KY]|G[LUY]?|H[ADGPRSUX]|I[GMPV]|JE|K[ATWY]|L[ADELNSU]?|M[EKL]?|N[EGNPRW]?|O[LX]|P[AEHLOR]|R[GHM]|S[AEGKLMNOPRSTY]?|T[ADFNQRSW]|UB|W[ADFNRSV]|YO|ZE)[1-9]?[0-9] | # Identifies standard outward code e.g. L4, L12, CH5, CH64
(?:(?:E|N|NW|SE|SW|W)1|EC[1-4]|WC[12])[A-HJKMNPR-Y]|(?:SW|W)(?:[1-9]|[1-9][0-9])|EC[1-9][0-9]|WC99)) # Identifies the odd London-based postcodes
)'''
postcodeInwardRegex = r'''(?P<postcodeInward>(?<=NOR)(?:\s*[0-9][0-9][A-Z]$) | # Picks out the unusual format of old Norwich postcodes (including leading space)
(?:[0-9][ABD-HJLNP-UVW-Z]{2}$) # Picks out standard number-letter-letter end of postcode
)'''
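Because the two patterns above contain layout whitespace and inline comments, they only work when compiled in verbose mode. The sketch below shows one way to combine and use them; it repeats the patterns (without comments) so that it runs on its own. The names `fullRegex`, `outwardOnlyRegex` and `split_postcode` are hypothetical, and whether the library compiles the patterns in exactly this way is an assumption.

```python
import re

postcodeOutwardRegex = r'''(?P<postcodeOutward>(?:^GIR(?=\s*0AA$)) |
    (?:^NOR(?=\s*[0-9][0-9][A-Z]$)) |
    (?:^NPT(?=\s*[0-9][A-Z][A-Z]$)) |
    (?:^(?:(?:A[BL]|B[ABDFHLNRSTX]?|C[ABFHMORTVW]|D[ADEGHLNTY]|E[HNX]?|F[KY]|G[LUY]?|H[ADGPRSUX]|I[GMPV]|JE|K[ATWY]|L[ADELNSU]?|M[EKL]?|N[EGNPRW]?|O[LX]|P[AEHLOR]|R[GHM]|S[AEGKLMNOPRSTY]?|T[ADFNQRSW]|UB|W[ADFNRSV]|YO|ZE)[1-9]?[0-9] |
    (?:(?:E|N|NW|SE|SW|W)1|EC[1-4]|WC[12])[A-HJKMNPR-Y]|(?:SW|W)(?:[1-9]|[1-9][0-9])|EC[1-9][0-9]|WC99))
    )'''

postcodeInwardRegex = r'''(?P<postcodeInward>(?<=NOR)(?:\s*[0-9][0-9][A-Z]$) |
    (?:[0-9][ABD-HJLNP-UVW-Z]{2}$)
    )'''

# re.VERBOSE ignores the layout whitespace inside the patterns
fullRegex = re.compile(postcodeOutwardRegex + r'\s*' + postcodeInwardRegex,
                       flags=re.VERBOSE)
outwardOnlyRegex = re.compile(postcodeOutwardRegex, flags=re.VERBOSE)


def split_postcode(cleaned):
    """Return (outward, inward) for a complete postcode, else None."""
    match = fullRegex.match(cleaned)
    if match is None:
        return None
    return match.group('postcodeOutward'), match.group('postcodeInward')


print(split_postcode('NP45DG'))    # ('NP4', '5DG')
print(split_postcode('GIR0AB'))    # None
print(outwardOnlyRegex.match('AB123CD').group('postcodeOutward'))   # AB12
```

The named groups are what give the postcodeOutward and postcodeInward columns their fixed names.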
Example
Clean postcode based on format alone
# Create a test dataframe that contains a postcode variable and some other empty variables
# that have the same names as the new variables that will be created. Setting the 'phjDropExisting'
# argument to True will automatically drop pre-existing variables before running the function.
# Some of the variables in the test dataframe are not duplicated and are present to show that the
# function preserves those variables intact.
import numpy as np
import pandas as pd
import epydemiology as epy
# Create test dataframe
myTestPostcodeDF = pd.DataFrame({'postcode': ['NP45DG',
'CH647TE',
'CH5 4HE',
'GIR 0AA',
'NOT NOWN',
'GIR0AB',
'NOR12A',
'no idea',
'W1A 1AA',
'missin',
'NP4 OGH',
'P012 OLL',
'p01s',
'ABCD',
'',
'ab123cd',
'un-known',
'B1 INJ',
'AB123CD',
'No idea what the postcode is',
' ???NP4-5DG_*# '],
'pcdClean': np.nan,
'pcd7': np.nan,
'postcodeOutward': np.nan,
'someOtherCol': np.nan})
# Run function to extract postcode data
print('\nStart dataframe\n===============\n')
print(myTestPostcodeDF)
print('\n')
myTestPostcodeDF = epy.phjCleanUKPostcodeVariable(phjDF = myTestPostcodeDF,
phjRealPostcodeSer = None,
phjOrigPostcodeVarName = 'postcode',
phjNewPostcodeVarName = 'pcdClean',
phjNewPostcodeStrLenVarName = 'pcdCleanStrLen',
phjPostcodeCheckVarName = 'pcdFormatCheck',
phjMissingValueCode = 'missing',
phjMinDamerauLevenshteinDistanceVarName = 'minDamLevDist',
phjBestAlternativesVarName = 'bestAlternatives',
phjPostcode7VarName = 'pcd7',
phjPostcodeAreaVarName = 'pcdArea',
phjSalvageOutwardPostcodeComponent = True,
phjCheckByOption = 'format',
phjDropExisting = True,
phjPrintResults = True)
print('\nReturned dataframe\n==================\n')
print(myTestPostcodeDF)
This produces the following output:
Start dataframe
===============
postcode pcdClean pcd7 postcodeOutward \
0 NP45DG NaN NaN NaN
1 CH647TE NaN NaN NaN
2 CH5 4HE NaN NaN NaN
3 GIR 0AA NaN NaN NaN
4 NOT NOWN NaN NaN NaN
5 GIR0AB NaN NaN NaN
6 NOR12A NaN NaN NaN
7 no idea NaN NaN NaN
8 W1A 1AA NaN NaN NaN
9 missin NaN NaN NaN
10 NP4 OGH NaN NaN NaN
11 P012 OLL NaN NaN NaN
12 p01s NaN NaN NaN
13 ABCD NaN NaN NaN
14 NaN NaN NaN
15 ab123cd NaN NaN NaN
16 un-known NaN NaN NaN
17 B1 INJ NaN NaN NaN
18 AB123CD NaN NaN NaN
19 No idea what the postcode is NaN NaN NaN
20 ???NP4-5DG_*# NaN NaN NaN
someOtherCol
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
5 NaN
6 NaN
7 NaN
8 NaN
9 NaN
10 NaN
11 NaN
12 NaN
13 NaN
14 NaN
15 NaN
16 NaN
17 NaN
18 NaN
19 NaN
20 NaN
Correctly and incorrectly formatted postcodes (BEFORE ERROR CORRECTION):
False 10
True 7
Name: pcdFormatCheck, dtype: int64
postcode pcdClean pcdFormatCheck \
0 NP45DG NP45DG True
1 CH647TE CH647TE True
2 CH5 4HE CH54HE True
3 GIR 0AA GIR0AA True
4 NOT NOWN missing NaN
5 GIR0AB GIR0AB False
6 NOR12A NOR12A True
7 no idea NOIDEA False
8 W1A 1AA W1A1AA True
9 missin missing NaN
10 NP4 OGH NP4OGH False
11 P012 OLL P012OLL False
12 p01s P01S False
13 ABCD ABCD False
14 missing NaN
15 ab123cd AB123CD False
16 un-known missing NaN
17 B1 INJ B1INJ False
18 AB123CD AB123CD False
19 No idea what the postcode is NOIDEAWHATTHEPOSTCODEIS False
20 ???NP4-5DG_*# NP45DG True
pcdCleanStrLen
0 6
1 7
2 6
3 6
4 7
5 6
6 6
7 6
8 6
9 7
10 6
11 7
12 4
13 4
14 7
15 7
16 7
17 5
18 7
19 23
20 6
Correctly and incorrectly formatted postcodes (AFTER ERROR CORRECTION):
True 10
False 7
Name: pcdFormatCheck, dtype: int64
postcode pcdClean pcdFormatCheck pcdCleanStrLen
0 NP45DG NP45DG True 6
1 CH647TE CH647TE True 7
2 CH5 4HE CH54HE True 6
3 GIR 0AA GIR0AA True 6
4 NOT NOWN missing NaN 7
5 GIR0AB GIR0AB False 6
6 NOR12A NOR12A True 6
7 no idea NO1DEA False 6
8 W1A 1AA W1A1AA True 6
9 missin missing NaN 7
10 NP4 OGH NP40GH True 6
11 P012 OLL PO120LL True 7
12 p01s PO15 False 4
13 ABCD ABCD False 4
14 missing NaN 7
15 ab123cd AB123CD False 7
16 un-known missing NaN 7
17 B1 INJ B11NJ True 5
18 AB123CD AB123CD False 7
19 No idea what the postcode is missing False 23
20 ???NP4-5DG_*# NP45DG True 6
Final working postcode dataframe
================================
postcode pcdClean pcdFormatCheck pcdCleanStrLen \
0 NP45DG NP45DG True 6
1 CH647TE CH647TE True 7
2 CH5 4HE CH54HE True 6
3 GIR 0AA GIR0AA True 6
4 NOT NOWN missing NaN 7
5 GIR0AB missing False 6
6 NOR12A NOR12A True 6
7 no idea missing False 6
8 W1A 1AA W1A1AA True 6
9 missin missing NaN 7
10 NP4 OGH NP40GH True 6
11 P012 OLL PO120LL True 7
12 p01s PO15 True 4
13 ABCD missing False 4
14 missing NaN 7
15 ab123cd AB12 True 7
16 un-known missing NaN 7
17 B1 INJ B11NJ True 5
18 AB123CD AB12 True 7
19 No idea what the postcode is missing False 23
20 ???NP4-5DG_*# NP45DG True 6
pcd7 postcodeOutward postcodeInward pcdArea
0 NP4 5DG NP4 5DG NP
1 CH647TE CH64 7TE CH
2 CH5 4HE CH5 4HE CH
3 GIR 0AA GIR 0AA GIR
4 NaN NaN NaN NaN
5 NaN NaN NaN NaN
6 NOR 12A NOR 12A NOR
7 NaN NaN NaN NaN
8 W1A 1AA W1A 1AA W
9 NaN NaN NaN NaN
10 NP4 0GH NP4 0GH NP
11 PO120LL PO12 0LL PO
12 NaN PO15 NaN PO
13 NaN NaN NaN NaN
14 NaN NaN NaN NaN
15 NaN AB12 NaN AB
16 NaN NaN NaN NaN
17 B1 1NJ B1 1NJ B
18 NaN AB12 NaN AB
19 NaN NaN NaN NaN
20 NP4 5DG NP4 5DG NP
Returned dataframe
==================
postcode someOtherCol pcdClean pcdFormatCheck \
0 NP45DG NaN NP45DG True
1 CH647TE NaN CH647TE True
2 CH5 4HE NaN CH54HE True
3 GIR 0AA NaN GIR0AA True
4 NOT NOWN NaN missing NaN
5 GIR0AB NaN missing False
6 NOR12A NaN NOR12A True
7 no idea NaN missing False
8 W1A 1AA NaN W1A1AA True
9 missin NaN missing NaN
10 NP4 OGH NaN NP40GH True
11 P012 OLL NaN PO120LL True
12 p01s NaN PO15 True
13 ABCD NaN missing False
14 NaN missing NaN
15 ab123cd NaN AB12 True
16 un-known NaN missing NaN
17 B1 INJ NaN B11NJ True
18 AB123CD NaN AB12 True
19 No idea what the postcode is NaN missing False
20 ???NP4-5DG_*# NaN NP45DG True
pcdCleanStrLen pcd7 postcodeOutward postcodeInward pcdArea
0 6 NP4 5DG NP4 5DG NP
1 7 CH647TE CH64 7TE CH
2 6 CH5 4HE CH5 4HE CH
3 6 GIR 0AA GIR 0AA GIR
4 7 NaN NaN NaN NaN
5 6 NaN NaN NaN NaN
6 6 NOR 12A NOR 12A NOR
7 6 NaN NaN NaN NaN
8 6 W1A 1AA W1A 1AA W
9 7 NaN NaN NaN NaN
10 6 NP4 0GH NP4 0GH NP
11 7 PO120LL PO12 0LL PO
12 4 NaN PO15 NaN PO
13 4 NaN NaN NaN NaN
14 7 NaN NaN NaN NaN
15 7 NaN AB12 NaN AB
16 7 NaN NaN NaN NaN
17 5 B1 1NJ B1 1NJ B
18 7 NaN AB12 NaN AB
19 23 NaN NaN NaN NaN
20 6 NP4 5DG NP4 5DG NP
Clean postcodes based on real postcodes and identify closest matches
import numpy as np
import pandas as pd
import epydemiology as epy
# N.B. When calculating best alternative postcodes, only postcodes that are within
# 1 DL distance are considered.
# Create a Pandas series that could contain all the postcodes in the UK
realPostcodesSer = pd.Series(['NP4 5DG','CH647TE','CH5 4HE','W1A 1AA','NP4 0GH','PO120LL','AB123CF','AB124DF','AB123CV'])
# Create test dataframe
myTestPostcodeDF = pd.DataFrame({'postcode': ['NP45DG',
'CH647TE',
'CH5 4HE',
'GIR 0AA',
'NOT NOWN',
'GIR0AB',
'NOR12A',
'no idea',
'W1A 1AA',
'missin',
'NP4 OGH',
'P012 OLL',
'p01s',
'ABCD',
'',
'ab123cd',
'un-known',
'B1 INJ',
'AB123CD',
'No idea what the postcode is',
' ???NP4-5DG_*# '],
'pcdClean': np.nan,
'pcd7': np.nan,
'postcodeOutward': np.nan,
'someOtherCol': np.nan})
# Run function to extract postcode data
print('\nStart dataframe\n===============\n')
print(myTestPostcodeDF)
print('\n')
myTestPostcodeDF = epy.phjCleanUKPostcodeVariable(phjDF = myTestPostcodeDF,
phjRealPostcodeSer = realPostcodesSer,
phjOrigPostcodeVarName = 'postcode',
phjNewPostcodeVarName = 'pcdClean',
phjNewPostcodeStrLenVarName = 'pcdCleanStrLen',
phjPostcodeCheckVarName = 'pcdFormatCheck',
phjMissingValueCode = 'missing',
phjMinDamerauLevenshteinDistanceVarName = 'minDamLevDist',
phjBestAlternativesVarName = 'bestAlternatives',
phjPostcode7VarName = 'pcd7',
phjPostcodeAreaVarName = 'pcdArea',
phjSalvageOutwardPostcodeComponent = True,
phjCheckByOption = 'dictionary',
phjDropExisting = True,
phjPrintResults = True)
print('\nReturned dataframe\n==================\n')
print(myTestPostcodeDF)
This produces the following output:
Start dataframe
===============
postcode pcdClean pcd7 postcodeOutward \
0 NP45DG NaN NaN NaN
1 CH647TE NaN NaN NaN
2 CH5 4HE NaN NaN NaN
3 GIR 0AA NaN NaN NaN
4 NOT NOWN NaN NaN NaN
5 GIR0AB NaN NaN NaN
6 NOR12A NaN NaN NaN
7 no idea NaN NaN NaN
8 W1A 1AA NaN NaN NaN
9 missin NaN NaN NaN
10 NP4 OGH NaN NaN NaN
11 P012 OLL NaN NaN NaN
12 p01s NaN NaN NaN
13 ABCD NaN NaN NaN
14 NaN NaN NaN
15 ab123cd NaN NaN NaN
16 un-known NaN NaN NaN
17 B1 INJ NaN NaN NaN
18 AB123CD NaN NaN NaN
19 No idea what the postcode is NaN NaN NaN
20 ???NP4-5DG_*# NaN NaN NaN
someOtherCol
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
5 NaN
6 NaN
7 NaN
8 NaN
9 NaN
10 NaN
11 NaN
12 NaN
13 NaN
14 NaN
15 NaN
16 NaN
17 NaN
18 NaN
19 NaN
20 NaN
Correctly and incorrectly formatted postcodes (BEFORE ERROR CORRECTION):
False 12
True 5
Name: pcdFormatCheck, dtype: int64
postcode pcdClean pcdFormatCheck \
0 NP45DG NP45DG True
1 CH647TE CH647TE True
2 CH5 4HE CH54HE True
3 GIR 0AA GIR0AA False
4 NOT NOWN missing NaN
5 GIR0AB GIR0AB False
6 NOR12A NOR12A False
7 no idea NOIDEA False
8 W1A 1AA W1A1AA True
9 missin missing NaN
10 NP4 OGH NP4OGH False
11 P012 OLL P012OLL False
12 p01s P01S False
13 ABCD ABCD False
14 missing NaN
15 ab123cd AB123CD False
16 un-known missing NaN
17 B1 INJ B1INJ False
18 AB123CD AB123CD False
19 No idea what the postcode is NOIDEAWHATTHEPOSTCODEIS False
20 ???NP4-5DG_*# NP45DG True
pcdCleanStrLen
0 6
1 7
2 6
3 6
4 7
5 6
6 6
7 6
8 6
9 7
10 6
11 7
12 4
13 4
14 7
15 7
16 7
17 5
18 7
19 23
20 6
Correctly and incorrectly formatted postcodes (AFTER ERROR CORRECTION):
False 10
True 7
Name: pcdFormatCheck, dtype: int64
postcode pcdClean pcdFormatCheck pcdCleanStrLen
0 NP45DG NP45DG True 6
1 CH647TE CH647TE True 7
2 CH5 4HE CH54HE True 6
3 GIR 0AA GIR0AA False 6
4 NOT NOWN missing NaN 7
5 GIR0AB GIR0AB False 6
6 NOR12A NOR1ZA False 6
7 no idea NO1DEA False 6
8 W1A 1AA W1A1AA True 6
9 missin missing NaN 7
10 NP4 OGH NP40GH True 6
11 P012 OLL PO120LL True 7
12 p01s PO15 False 4
13 ABCD ABCD False 4
14 missing NaN 7
15 ab123cd AB123CD False 7
16 un-known missing NaN 7
17 B1 INJ B11NJ False 5
18 AB123CD AB123CD False 7
19 No idea what the postcode is missing False 23
20 ???NP4-5DG_*# NP45DG True 6
Consider first postcode entry: GIR0AA
Returned list of edits: [4, None]
Consider first postcode entry: GIR0AA
Returned list of edits: [4, None]
Consider first postcode entry: GIR0AB
Returned list of edits: [5, None]
Consider first postcode entry: NOR1ZA
Returned list of edits: [4, None]
Consider first postcode entry: NO1DEA
Returned list of edits: [5, None]
Consider first postcode entry: PO15
Returned list of edits: [4, None]
Consider first postcode entry: ABCD
Returned list of edits: [4, None]
Consider first postcode entry: AB123CD
Returned list of edits: [1, ['AB123CF', 'AB123CV']]
Consider first postcode entry: B11NJ
Returned list of edits: [4, None]
Consider first postcode entry: AB123CD
Returned list of edits: [1, ['AB123CF', 'AB123CV']]
Final working postcode dataframe
================================
postcode pcdClean pcdFormatCheck pcdCleanStrLen \
0 NP45DG NP45DG True 6.0
1 CH647TE CH647TE True 7.0
2 CH5 4HE CH54HE True 6.0
3 GIR 0AA missing False 6.0
4 NOT NOWN missing NaN 7.0
5 GIR0AB missing False 6.0
6 NOR12A missing False 6.0
7 no idea missing False 6.0
8 W1A 1AA W1A1AA True 6.0
9 missin missing NaN 7.0
10 NP4 OGH NP40GH True 6.0
11 P012 OLL PO120LL True 7.0
12 p01s missing False 4.0
13 ABCD missing False 4.0
14 missing NaN 7.0
15 ab123cd missing False 7.0
16 un-known missing NaN 7.0
17 B1 INJ missing False 5.0
18 AB123CD missing False 7.0
19 No idea what the postcode is missing False 23.0
20 ???NP4-5DG_*# NP45DG True 6.0
pcd7 postcodeOutward postcodeInward minDamLevDist bestAlternatives \
0 NP4 5DG NP4 5DG NaN NaN
1 CH647TE CH64 7TE NaN NaN
2 CH5 4HE CH5 4HE NaN NaN
3 NaN NaN NaN 4.0 NaN
4 NaN NaN NaN NaN NaN
5 NaN NaN NaN 5.0 NaN
6 NaN NaN NaN 4.0 NaN
7 NaN NaN NaN 5.0 NaN
8 W1A 1AA W1A 1AA NaN NaN
9 NaN NaN NaN NaN NaN
10 NP4 0GH NP4 0GH NaN NaN
11 PO120LL PO12 0LL NaN NaN
12 NaN NaN NaN 4.0 NaN
13 NaN NaN NaN 4.0 NaN
14 NaN NaN NaN NaN NaN
15 NaN NaN NaN 1.0 [AB123CF, AB123CV]
16 NaN NaN NaN NaN NaN
17 NaN NaN NaN 4.0 NaN
18 NaN NaN NaN 1.0 [AB123CF, AB123CV]
19 NaN NaN NaN NaN NaN
20 NP4 5DG NP4 5DG NaN NaN
pcdArea
0 NP
1 CH
2 CH
3 NaN
4 NaN
5 NaN
6 NaN
7 NaN
8 W
9 NaN
10 NP
11 PO
12 NaN
13 NaN
14 NaN
15 NaN
16 NaN
17 NaN
18 NaN
19 NaN
20 NP
Returned dataframe
==================
postcode someOtherCol pcdClean pcdFormatCheck \
0 NP45DG NaN NP45DG True
1 CH647TE NaN CH647TE True
2 CH5 4HE NaN CH54HE True
3 GIR 0AA NaN missing False
4 NOT NOWN NaN missing NaN
5 GIR0AB NaN missing False
6 NOR12A NaN missing False
7 no idea NaN missing False
8 W1A 1AA NaN W1A1AA True
9 missin NaN missing NaN
10 NP4 OGH NaN NP40GH True
11 P012 OLL NaN PO120LL True
12 p01s NaN missing False
13 ABCD NaN missing False
14 NaN missing NaN
15 ab123cd NaN missing False
16 un-known NaN missing NaN
17 B1 INJ NaN missing False
18 AB123CD NaN missing False
19 No idea what the postcode is NaN missing False
20 ???NP4-5DG_*# NaN NP45DG True
pcdCleanStrLen pcd7 postcodeOutward postcodeInward minDamLevDist \
0 6.0 NP4 5DG NP4 5DG NaN
1 7.0 CH647TE CH64 7TE NaN
2 6.0 CH5 4HE CH5 4HE NaN
3 6.0 NaN NaN NaN 4.0
4 7.0 NaN NaN NaN NaN
5 6.0 NaN NaN NaN 5.0
6 6.0 NaN NaN NaN 4.0
7 6.0 NaN NaN NaN 5.0
8 6.0 W1A 1AA W1A 1AA NaN
9 7.0 NaN NaN NaN NaN
10 6.0 NP4 0GH NP4 0GH NaN
11 7.0 PO120LL PO12 0LL NaN
12 4.0 NaN NaN NaN 4.0
13 4.0 NaN NaN NaN 4.0
14 7.0 NaN NaN NaN NaN
15 7.0 NaN NaN NaN 1.0
16 7.0 NaN NaN NaN NaN
17 5.0 NaN NaN NaN 4.0
18 7.0 NaN NaN NaN 1.0
19 23.0 NaN NaN NaN NaN
20 6.0 NP4 5DG NP4 5DG NaN
bestAlternatives pcdArea
0 NaN NP
1 NaN CH
2 NaN CH
3 NaN NaN
4 NaN NaN
5 NaN NaN
6 NaN NaN
7 NaN NaN
8 NaN W
9 NaN NaN
10 NaN NP
11 NaN PO
12 NaN NaN
13 NaN NaN
14 NaN NaN
15 [AB123CF, AB123CV] NaN
16 NaN NaN
17 NaN NaN
18 [AB123CF, AB123CV] NaN
19 NaN NaN
20 NaN NP