Extract Ordnance Survey grid reference and convert to longitude and latitude - lvphj/epydemiology GitHub Wiki
Python function to extract Ordnance Survey grid references from a column of free text and convert them to latitude and longitude
phjConvertOSGridRefToLatLong()
myDF = epy.phjConvertOSGridRefToLatLong(phjDF,
phjOrigGridRefVarName = 'origGridRef',
phjExtrGridRefStrVarName = 'extrGridRefStr',
phjFmtGridRefStrVarName = 'fmtGridRefStr',
phjAccuracyVarName = 'accuracy',
phjTruncateAccuracy = None,
phjErrorMsgVarName = 'errorMsg',
phjLatLongVarNameList = ['lat','long'],
phjPrintResults = False)
Description
The function extracts Ordnance Survey grid references (of format AB1212 to AB1234512345) from a column of text. The extracted grid reference is formatted and, if required by the user, truncated. The grid reference can also be converted to latitude and longitude (WGS84) using the OSGB library by Toby Thurston.
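As a rough illustration of the extraction step described above, the following is a minimal sketch using a regular expression. It is not the code used by phjConvertOSGridRefToLatLong(); the pattern and the helper name extract_grid_ref are assumptions made for this example, and the real function may handle edge cases (such as unequal easting and northing lengths) differently.

```python
import re

# Hypothetical pattern (an assumption, not the library's own):
# two letters, a run of digits, then an optional second run of digits
# separated from the first by whitespace.
GRID_REF_PATTERN = re.compile(r"\b([A-Za-z]{2})\s*(\d{2,10})(?:\s+(\d{2,5}))?\b")


def extract_grid_ref(text):
    """Return (letters, easting, northing) as strings, or None if not found."""
    if not isinstance(text, str):
        # Missing values such as NaN are not strings.
        return None
    match = GRID_REF_PATTERN.search(text)
    if match is None:
        return None
    letters, first, second = match.groups()
    if second is None:
        # Digits were written as a single run, so split them in half.
        if len(first) % 2 != 0:
            return None
        half = len(first) // 2
        first, second = first[:half], first[half:]
    # Easting and northing must have the same number of digits (2 to 5 each).
    if len(first) != len(second) or not 2 <= len(first) <= 5:
        return None
    return letters.upper(), first, second


print(extract_grid_ref("Found at approx. NT 349657"))
print(extract_grid_ref("ns099262"))
print(extract_grid_ref("Not known"))
```

This sketch simply returns None when no usable grid reference is found, whereas the function described here records a message in the error column.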
Function parameters
-
phjDF
Dataframe containing a column of Ordnance Survey grid references with 4, 6, 8 or 10 digits.
-
phjOrigGridRefVarName (default = 'origGridRef')
Name of the variable containing grid references.
-
phjExtrGridRefStrVarName (default = 'extrGridRefStr')
Name of new column that will be used to store the extracted grid reference string.
-
phjFmtGridRefStrVarName (default = 'fmtGridRefStr')
Name of the column that will be used to store the grid reference string after it has been formatted to a consistent format.
-
phjAccuracyVarName (default = 'accuracy')
Name of the column that will be used to store the accuracy of the grid reference. For example, a grid reference of the format AB1212 defines the bottom-left corner of a 1km square, i.e. accuracy is 1000m. In contrast, a grid reference of format AB1234512345 defines the bottom-left corner of a 1m square, i.e. accuracy is 1m.
If the grid reference is truncated (using the phjTruncateAccuracy argument) then the phjAccuracyVarName column will contain the truncated accuracy, or a coarser value if the original grid reference was already less precise.
-
phjTruncateAccuracy (default = None)
Desired accuracy (in metres) to which extracted grid references are truncated. Allowed values are 1000, 100, 10, 1 or None (default). With accuracy set to 100, a grid reference of AB1234512345 will be truncated to AB123123. However, a grid reference of AB1212 will not be affected.
-
phjErrorMsgVarName (default = 'errorMsg')
If a grid reference cannot be extracted from the text, a message will be generated and stored in this column.
-
phjLatLongVarNameList (default = ['lat','long'])
List of length 2 containing names of columns to be used for storing latitude and longitude values (WGS84).
If phjLatLongVarNameList = None, then latitude and longitude values are not generated.
-
phjPrintResults (default = False)
Indicates whether intermediate results (including the returned dataframe) should be printed to screen as the function progresses.
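The relationship between digit count, accuracy and truncation described for phjAccuracyVarName and phjTruncateAccuracy can be sketched as follows. This is an illustrative sketch only, not the function's own code; the helper names accuracy_in_metres and truncate_grid_ref are hypothetical.

```python
# Number of digits kept in each of easting and northing for each accuracy.
DIGITS_FOR_ACCURACY = {1000: 2, 100: 3, 10: 4, 1: 5}


def accuracy_in_metres(easting, northing):
    """Accuracy implied by digit count: 2 digits each -> 1000 m, 5 each -> 1 m."""
    if len(easting) != len(northing) or not 2 <= len(easting) <= 5:
        raise ValueError("easting and northing must each have 2 to 5 digits")
    return 10 ** (5 - len(easting))


def truncate_grid_ref(letters, easting, northing, target_accuracy=None):
    """Return (formatted grid reference, accuracy in metres) after truncation."""
    if target_accuracy is not None and target_accuracy not in DIGITS_FOR_ACCURACY:
        raise ValueError("target_accuracy must be 1000, 100, 10, 1 or None")
    current = accuracy_in_metres(easting, northing)
    if target_accuracy is None or current >= target_accuracy:
        # Already at the requested accuracy or coarser, so leave unchanged.
        return letters + easting + northing, current
    digits = DIGITS_FOR_ACCURACY[target_accuracy]
    return letters + easting[:digits] + northing[:digits], target_accuracy


print(truncate_grid_ref("AB", "12345", "12345", 100))  # ('AB123123', 100)
print(truncate_grid_ref("AB", "12", "12", 100))        # ('AB1212', 1000)
```

The second call shows why the accuracy column can hold a coarser value than the one requested: a 4-digit grid reference cannot be made more precise by truncation.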
Exceptions raised
-
AssertionError
An AssertionError is raised if function arguments are entered incorrectly.
Returns
Pandas dataframe containing all the columns of the original dataframe together with a series of additional columns that contain the information relating to the OS grid reference.
Other notes
None.
Example
A dataframe contains a column of data that contains Ordnance Survey grid references. The grid references can be in a variety of formats and the map label can be uppercase or lowercase. There may also be other text in the column in addition to the grid reference.
import numpy as np
import pandas as pd
import epydemiology as epy

df = pd.DataFrame({'job_name':list(range(1,11)),
'grid_reference': [' NT 337 708 ',
'ns099262',
'SO 414687 (this is in England)',
'NX 79812 6883',
'Not known',
'Found at approx. NT 349657',
np.nan,
'NO 50878',
'NT1639165764',
'NS 4362']})
The dataframe looks like:
job_name grid_reference
0 1 NT 337 708
1 2 ns099262
2 3 SO 414687 (this is in England)
3 4 NX 79812 6883
4 5 Not known
5 6 Found at approx. NT 349657
6 7 NaN
7 8 NO 50878
8 9 NT1639165764
9 10 NS 4362
OS grid references can be extracted and converted to latitude and longitude using:
df = epy.phjConvertOSGridRefToLatLong(phjDF = df,
phjOrigGridRefVarName = 'grid_reference',
phjExtrGridRefStrVarName = 'extrGridRefStr',
phjFmtGridRefStrVarName = 'fmtGridRefStr',
phjAccuracyVarName = 'accuracy',
phjTruncateAccuracy = 100,
phjErrorMsgVarName = 'error_message',
phjLatLongVarNameList = ['latitude','longitude'],
phjPrintResults = False)
print('Updated dataframe')
print('=================')
print(df)
The returned dataframe is:
Updated dataframe
=================
job_name grid_reference extrGridRefStr fmtGridRefStr \
0 1 NT 337 708 NT 337 708 NT337708
1 2 ns099262 NS099262 NS099262
2 3 SO 414687 (this is in England) SO 414687 SO414687
3 4 NX 79812 6883 NX 79812 6883 NX798688
4 5 Not known NaN NaN
5 6 Found at approx. NT 349657 NT 349657 NT349657
6 7 NaN NaN NaN
7 8 NO 50878 NO 50878 NaN
8 9 NT1639165764 NT1639165764 NT163657
9 10 NS 4362 NS 4362 NS4362
accuracy error_message latitude \
0 100.0 55.925680
1 100.0 55.492513
2 100.0 52.313278
3 100.0 54.999220
4 NaN Unable to extract grid reference NaN
5 100.0 55.880029
6 NaN Unable to extract grid reference NaN
7 NaN Discrepancy in accuracy of easting and northing NaN
8 100.0 55.877154
9 1000.0 55.825629
longitude
0 -3.062585
1 -5.010675
2 -2.861017
3 -3.880599
4 NaN
5 -3.042154
6 NaN
7 NaN
8 -3.339375
9 -4.507828
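The latitude and longitude columns are produced by the OSGB library. As background only, the sketch below shows how a formatted grid reference such as NT337708 maps to numeric National Grid eastings and northings in metres, which is the kind of value a grid-to-latitude/longitude conversion starts from. The helper name grid_ref_to_easting_northing is hypothetical, the code is not taken from the OSGB library or from epydemiology, and it does not check that the letter pair is a square actually used on the National Grid.

```python
# Grid letters run from A to Z with I omitted, arranged in 5 x 5 blocks.
GRID_LETTERS = "ABCDEFGHJKLMNOPQRSTUVWXYZ"


def grid_ref_to_easting_northing(grid_ref):
    """Convert e.g. 'NT337708' to (easting, northing) in metres.

    The result is the bottom-left corner of the square the reference defines.
    """
    letters, digits = grid_ref[:2].upper(), grid_ref[2:]
    half = len(digits) // 2
    if not digits.isdigit() or len(digits) % 2 != 0 or not 2 <= half <= 5:
        raise ValueError("expected two letters followed by 4, 6, 8 or 10 digits")
    first = GRID_LETTERS.index(letters[0])   # raises ValueError for 'I'
    second = GRID_LETTERS.index(letters[1])
    # Offsets of the 100 km square, counted from the grid's false origin.
    east_100km = ((first - 2) % 5) * 5 + (second % 5)
    north_100km = 19 - (first // 5) * 5 - (second // 5)
    # Pad the digits on the right to metres within the 100 km square.
    easting = int(digits[:half].ljust(5, "0"))
    northing = int(digits[half:].ljust(5, "0"))
    return east_100km * 100000 + easting, north_100km * 100000 + northing


print(grid_ref_to_easting_northing("NT337708"))  # (333700, 670800)
```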