Extract Ordnance Survey grid reference and convert to longitude and latitude - lvphj/epydemiology GitHub Wiki

Python function to extract Ordnance Survey grid references from a column of free text and convert them to longitude and latitude

phjConvertOSGridRefToLatLong()

myDF = epy.phjConvertOSGridRefToLatLong(phjDF,
                                        phjOrigGridRefVarName = 'origGridRef',
                                        phjExtrGridRefStrVarName = 'extrGridRefStr',
                                        phjFmtGridRefStrVarName = 'fmtGridRefStr',
                                        phjAccuracyVarName = 'accuracy',
                                        phjTruncateAccuracy = None,
                                        phjErrorMsgVarName = 'errorMsg',
                                        phjLatLongVarNameList = ['lat','long'],
                                        phjPrintResults = False)

Description

The function extracts Ordnance Survey grid references (of format AB1212 to AB1234512345) from a column of text. The extracted grid reference is formatted and, if required by the user, truncated. The grid reference can also be converted to latitude and longitude (WGS84) using the OSGB library by Toby Thurston.
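The extraction step described above can be illustrated with a regular expression. This is a minimal sketch only, not the code used inside phjConvertOSGridRefToLatLong(): the pattern `GRID_REF_RE` and the helper `extract_grid_ref()` are hypothetical names introduced here. The sketch assumes a grid reference is two letters followed by 4 to 10 digits that may be separated by whitespace, and it rejects any reference with an odd number of digits. The real function may apply different rules; for instance, the example output further down this page shows it accepting 'NX 79812 6883', which this sketch would reject.

```python
import re

# Hypothetical pattern (not taken from epydemiology): two letters, then
# 4 to 10 digits that may be separated by whitespace.
GRID_REF_RE = re.compile(r'\b([A-Za-z]{2})\s*((?:\d\s*){4,10})(?!\d)')


def extract_grid_ref(text):
    """Return a compact grid reference such as 'NT337708', or None."""
    if not isinstance(text, str):
        return None  # e.g. NaN in a pandas column
    match = GRID_REF_RE.search(text)
    if match is None:
        return None
    letters = match.group(1).upper()
    digits = re.sub(r'\s+', '', match.group(2))
    if len(digits) % 2 != 0:
        return None  # easting and northing would differ in length
    return letters + digits


print(extract_grid_ref('Found at approx. NT 349657'))
print(extract_grid_ref('Not known'))
```

Returning None for unusable input mirrors the idea of flagging rows where no grid reference could be found, although the real function records a message in the error column instead.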

Function parameters

  1. phjDF

    Dataframe containing a column of Ordnance Survey grid references, either 4, 6, 8 or 10 digit.

  2. phjOrigGridRefVarName (default = 'origGridRef')

    Name of the variable containing grid references.

  3. phjExtrGridRefStrVarName (default = 'extrGridRefStr')

    Name of new column that will be used to store the extracted grid reference string.

  4. phjFmtGridRefStrVarName (default = 'fmtGridRefStr')

    Name of the column that will be used to store the grid reference string after it has been formatted to a consistent format.

  5. phjAccuracyVarName (default = 'accuracy')

    Name of the column that will be used to store the accuracy of the grid reference in metres. For example, a grid reference of the format AB1212 defines the bottom-left corner of a 1km square, i.e. accuracy is 1000m. In contrast, a grid reference of format AB1234512345 defines the bottom-left corner of a 1m square, i.e. accuracy is 1m.

    If the grid reference is truncated (using the phjTruncateAccuracy argument) then the phjAccuracyVarName column will contain the truncated accuracy, or a coarser value if the original grid reference was already less precise than the requested accuracy.

  6. phjTruncateAccuracy (default = None)

    Desired accuracy (in metres) to which extracted grid references are truncated. Allowed values are 1000, 100, 10, 1 or None (default). With accuracy set to 100, a grid reference of AB1234512345 will be truncated to AB123123. However, a grid reference of AB1212 will not be affected because it is already less precise than 100m.

  7. phjErrorMsgVarName (default = 'errorMsg')

    If a grid reference cannot be extracted from the text, a message will be generated and stored in this column.

  8. phjLatLongVarNameList (default = ['lat','long'])

    List of length 2 containing names of columns to be used for storing latitude and longitude values (WGS84).

    If phjLatLongVarNameList = None, then latitude and longitude values are not generated.

  9. phjPrintResults (default = False)

    Indicates whether intermediate results (including the returned dataframe) should be printed to screen as the function progresses.
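The relationship between the number of digits, the value stored in the phjAccuracyVarName column and the effect of phjTruncateAccuracy can be sketched in plain Python. This is an illustration of the arithmetic described in the parameter list, not the function's actual implementation. The helper names `accuracy_in_metres()` and `truncate_grid_ref()` are hypothetical, and both assume a compact grid reference (two letters followed by an even number of digits, with no spaces).

```python
def accuracy_in_metres(grid_ref):
    """Accuracy implied by a compact grid reference such as 'AB123123'."""
    digits_per_axis = (len(grid_ref) - 2) // 2  # 2 letters, then easting + northing
    return 10 ** (5 - digits_per_axis)          # 2 digits -> 1000 m, 5 digits -> 1 m


def truncate_grid_ref(grid_ref, accuracy=None):
    """Truncate a compact grid reference to the requested accuracy in metres."""
    if accuracy is None:
        return grid_ref
    if accuracy not in (1000, 100, 10, 1):
        raise ValueError('accuracy must be 1000, 100, 10, 1 or None')
    letters, digits = grid_ref[:2], grid_ref[2:]
    half = len(digits) // 2
    easting, northing = digits[:half], digits[half:]
    # Never keep more digits than the reference has: a coarse reference
    # such as 'AB1212' is returned unchanged.
    keep = min(half, {1000: 2, 100: 3, 10: 4, 1: 5}[accuracy])
    return letters + easting[:keep] + northing[:keep]


print(truncate_grid_ref('AB1234512345', 100))  # AB123123
print(truncate_grid_ref('AB1212', 100))        # AB1212
print(accuracy_in_metres('AB1212'))            # 1000
```

Truncation drops trailing digits from the easting and northing separately rather than rounding, so the truncated reference still identifies the bottom-left corner of a larger square that contains the original point.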

Exceptions raised

  1. AssertionError

    An AssertionError is raised if function arguments are entered incorrectly.

Returns

Pandas dataframe containing all the columns of the original dataframe together with a series of additional columns that contain the information relating to the OS grid reference.

Other notes

None.

Example

A dataframe contains a column of data that contains Ordnance Survey grid references. The grid references can be in a variety of formats and the map label can be uppercase or lowercase. There may also be other text in the column in addition to the grid reference.

import numpy as np
import pandas as pd

df = pd.DataFrame({'job_name':list(range(1,11)),
                   'grid_reference':  ['  NT    337  708    ',
                                       'ns099262',
                                       'SO 414687  (this is in England)',
                                       'NX 79812 6883',
                                       'Not known',
                                       'Found at approx. NT 349657',
                                       np.nan,
                                       'NO 50878',
                                       'NT1639165764',
                                       'NS 4362']})

The dataframe looks like:

   job_name                   grid_reference
0         1               NT    337  708    
1         2                         ns099262
2         3  SO 414687  (this is in England)
3         4                    NX 79812 6883
4         5                        Not known
5         6       Found at approx. NT 349657
6         7                              NaN
7         8                         NO 50878
8         9                     NT1639165764
9        10                          NS 4362

OS grid references can be extracted and converted to latitude and longitude using:

import epydemiology as epy

df = epy.phjConvertOSGridRefToLatLong(phjDF = df,
                                      phjOrigGridRefVarName = 'grid_reference',
                                      phjExtrGridRefStrVarName = 'extrGridRefStr',
                                      phjFmtGridRefStrVarName = 'fmtGridRefStr',
                                      phjAccuracyVarName = 'accuracy',
                                      phjTruncateAccuracy = 100,
                                      phjErrorMsgVarName = 'error_message',
                                      phjLatLongVarNameList = ['latitude','longitude'],
                                      phjPrintResults = False)

print('Updated dataframe')
print('=================')
print(df)

The returned dataframe is:

Updated dataframe
=================
   job_name                   grid_reference extrGridRefStr fmtGridRefStr  \
0         1               NT    337  708         NT 337 708      NT337708   
1         2                         ns099262       NS099262      NS099262   
2         3  SO 414687  (this is in England)      SO 414687      SO414687   
3         4                    NX 79812 6883  NX 79812 6883      NX798688   
4         5                        Not known            NaN           NaN   
5         6       Found at approx. NT 349657      NT 349657      NT349657   
6         7                              NaN            NaN           NaN   
7         8                         NO 50878       NO 50878           NaN   
8         9                     NT1639165764   NT1639165764      NT163657   
9        10                          NS 4362        NS 4362        NS4362   

   accuracy                                    error_message   latitude  \
0     100.0                                                   55.925680   
1     100.0                                                   55.492513   
2     100.0                                                   52.313278   
3     100.0                                                   54.999220   
4       NaN                 Unable to extract grid reference        NaN   
5     100.0                                                   55.880029   
6       NaN                 Unable to extract grid reference        NaN   
7       NaN  Discrepancy in accuracy of easting and northing        NaN   
8     100.0                                                   55.877154   
9    1000.0                                                   55.825629   

   longitude  
0  -3.062585  
1  -5.010675  
2  -2.861017  
3  -3.880599  
4        NaN  
5  -3.042154  
6        NaN  
7        NaN  
8  -3.339375  
9  -4.507828