update_listing_scrapper_data - zebrainvest/postgres GitHub Wiki

update_listing_scrapper_data Stored Procedure

Requirement

A stored procedure that finds records in listing_scrapper_data table (LSD) with a properly formatted postcode, and a UPRN that is zero or empty and/or its address is empty, and attempts to match records in either price_paid table (PP) or the energy performance table (certificates_depc table, DEPC), whichever appropriate, via the following rules (in order top to bottom), copying at each step the UPRN and address from matched PP/DEPC records to LSD:

  1. LSD properties' postcode, sold price (transactionhistoryprice), and sold year (transactionhistoryyear), against PP properties' postcode, price_paid, and sold year (date_part('year', sold_date::timestamp))
  2. LSD properties' postcode, EPC date (epc_date), an EPC rating (epc_rating as a letter), against DEPC properties' postcode, lodgement_date, and current_energy_rating
  3. LSD properties' postcode, current energy performance (epc_current_value), and potential energy performance (epc_potential_value), against DEPC properties' postcode, current_energy_efficiency, and potential_energy_efficiency
  4. Same as step 3, but comparing both postcodes without the Unit component (i.e. the last two letters in the postcode)

Implementation

This is a stored procedure in the production propertytoolkit database.

Updates uprn and address columns from listing_scrapper_data table (where address is empty/null or uprn is zero or null, and postcode is nonempty) taking these values from matching records from either price_paid or certificates_depc table, matched via the following sequence of strategies (in order):

  1. from postcode + price_paid + year, i.e. (transactionhistoryprice, transactionhistoryyear) = price_paid.(price_paid, date_part('year', sold_date::timestamp)), provided these values are nonempty and nonzero (commits all changes at the end of the update)
  2. loops through a list of (1000 by default) postcode ranges (committing at the end of each loop):
    1. postcode + epc_date + epc_rating, i.e. (epc_date::date, epc_rating) = certificates_depc.(lodgement_date, current_energy_rating) - uprn here is obtained via uprn_addresses match
    2. postcode + epc_current_value + epc_potential_value, i.e. (epc_date::date, epc_rating) = certificates_depc.(current_energy_efficiency, potential_energy_efficiency) - uprn here is obtained via uprn_addresses match
    3. same as 2.ii. but matching only partial postcodes (without the Unit component, i.e. the last two letters)

Notes:

  • Records are updated only when the the matches produce in a single match, i.e. if a LSD matches two PP different records for a given postcode + price + year, then no update is made; same goes for the other matching strategies.
  • DEPC's UPRN value is often NULL or NaN, but can have multiple records for the same postcode+address, some of them with a UPRN
    • this process obtains the UPRN value from uprn_addresses table by joining it with DEPC by postcode and address
    • the uprn_addresses table is generated/maintained by the update_uprn_addresses stored procedure, which is not run regularly, but should be run every time DEPC tables get updated
  • A system cron job (root user's crontab) runs this stored procedure every two hours.