update_listing_scrapper_data - zebrainvest/postgres GitHub Wiki
update_listing_scrapper_data
Stored Procedure
Requirement
A stored procedure that finds records in listing_scrapper_data
table (LSD) with a properly formatted postcode, and a UPRN that is zero or empty and/or its address is empty, and attempts to match records in either price_paid
table (PP) or the energy performance table (certificates_depc
table, DEPC), whichever appropriate, via the following rules (in order top to bottom), copying at each step the UPRN and address from matched PP/DEPC records to LSD:
- LSD properties'
postcode
, sold price (transactionhistoryprice
), and sold year (transactionhistoryyear
), against PP properties'postcode
,price_paid
, and sold year (date_part('year', sold_date::timestamp)
) - LSD properties'
postcode
, EPC date (epc_date
), an EPC rating (epc_rating
as a letter), against DEPC properties'postcode
,lodgement_date
, andcurrent_energy_rating
- LSD properties'
postcode
, current energy performance (epc_current_value
), and potential energy performance (epc_potential_value
), against DEPC properties'postcode
,current_energy_efficiency
, andpotential_energy_efficiency
- Same as step 3, but comparing both postcodes without the Unit component (i.e. the last two letters in the postcode)
Implementation
This is a stored procedure in the production propertytoolkit database.
Updates uprn
and address
columns from listing_scrapper_data
table (where address is empty/null or uprn is zero or null, and postcode is nonempty) taking these values from matching records from either price_paid
or certificates_depc
table, matched via the following sequence of strategies (in order):
- from postcode + price_paid + year, i.e. (
transactionhistoryprice
,transactionhistoryyear
) =price_paid
.(price_paid
,date_part('year', sold_date::timestamp)
), provided these values are nonempty and nonzero (commits all changes at the end of the update) - loops through a list of (1000 by default) postcode ranges (committing at the end of each loop):
- postcode + epc_date + epc_rating, i.e. (
epc_date::date
,epc_rating
) =certificates_depc
.(lodgement_date
,current_energy_rating
) -uprn
here is obtained viauprn_addresses
match - postcode + epc_current_value + epc_potential_value, i.e. (
epc_date::date
,epc_rating
) =certificates_depc
.(current_energy_efficiency
,potential_energy_efficiency
) -uprn
here is obtained viauprn_addresses
match - same as 2.ii. but matching only partial postcodes (without the Unit component, i.e. the last two letters)
- postcode + epc_date + epc_rating, i.e. (
Notes:
- Records are updated only when the the matches produce in a single match, i.e. if a LSD matches two PP different records for a given postcode + price + year, then no update is made; same goes for the other matching strategies.
- DEPC's UPRN value is often NULL or NaN, but can have multiple records for the same postcode+address, some of them with a UPRN
- this process obtains the UPRN value from
uprn_addresses
table by joining it with DEPC by postcode and address - the
uprn_addresses
table is generated/maintained by theupdate_uprn_addresses
stored procedure, which is not run regularly, but should be run every time DEPC tables get updated
- this process obtains the UPRN value from
- A system cron job (root user's crontab) runs this stored procedure every two hours.