Config - NYCPlanning/db-data-library GitHub Wiki
# Config

The `Config` class might seem intimidating at first glance, but it essentially does one thing: it constructs a list of variables from the template YAML files and then passes those variables as arguments to the Data Library ingestion process.
For some datasets, however, there is an additional step: when the dataset requires processing, a Python script is run to return source-type-dependent configuration back to the ingestor. This is achieved using the `@property`
decorator. To read more about `@property`, check out the blog post here.
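The `@property` pattern is plain Python; a minimal sketch of why it is handy here (the class body and template contents below are illustrative, not the library's actual code):

```python
class Config:
    """Minimal sketch (not the actual library class) showing how
    @property lets a derived value be read like a plain attribute."""

    def __init__(self, raw_template: dict):
        self._template = raw_template

    @property
    def source_type(self) -> str:
        # Computed fresh on each access; no parentheses needed at the call site
        return list(self._template["dataset"]["source"].keys())[0]


config = Config({"dataset": {"source": {"socrata": {"uid": "abcd-1234"}}}})
print(config.source_type)  # socrata
```

Because `source_type` is a property, downstream code can branch on `config.source_type` as if it were a stored field, while the value is actually derived from the template on each access.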
## Source Types

The first key in the `source` section of the template determines the type of source we are getting. It can be a script, a url, or a socrata link.
```python
def source_type(self) -> str:
    """determine the type of the source, either url, socrata or script"""
    template = self.parsed_unrendered_template
    source = template["dataset"]["source"]
    return list(source.keys())[0]
```
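Assuming templates shaped like the ones the function above indexes into, the detection logic can be exercised on its own (the template contents here are made up for illustration):

```python
# Hypothetical parsed templates, one per source type; only the FIRST key
# under "source" matters for detection.
templates = {
    "url": {"dataset": {"source": {"url": {"path": "https://example.com/data.csv", "subpath": ""}}}},
    "socrata": {"dataset": {"source": {"socrata": {"uid": "abcd-1234", "format": "csv"}}}},
    "script": {"dataset": {"source": {"script": "my_dataset"}}},
}

for expected, template in templates.items():
    source = template["dataset"]["source"]
    detected = list(source.keys())[0]  # same expression as source_type
    assert detected == expected
print("all three source types detected")
```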
## `compute`

Once the source type is determined, the crucial next step for `Config` is to call the `compute` function, which organizes the parsed results from the templates and reformats them so that they are accessible to the ingest functions. For socrata, this process looks like the following:
```python
if self.source_type == "socrata":
    # For socrata we are computing the url and add the url object to the config file
    _uid = self.parsed_unrendered_template["dataset"]["source"]["socrata"][
        "uid"
    ]
    _format = self.parsed_unrendered_template["dataset"]["source"]["socrata"][
        "format"
    ]
    config = self.parsed_rendered_template(version=self.version_socrata(_uid))
    if _format == "csv":
        url = f"https://data.cityofnewyork.us/api/views/{_uid}/rows.csv"
    if _format == "geojson":
        url = f"https://nycopendata.socrata.com/api/geospatial/{_uid}?method=export&format=GeoJSON"
    options = config["dataset"]["source"]["options"]
    geometry = config["dataset"]["source"]["geometry"]
    config["dataset"]["source"] = {
        "url": {"path": url, "subpath": ""},
        "options": options,
        "geometry": geometry,
    }
```
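The branching above boils down to picking one of two Socrata export endpoints based on the requested format; a standalone sketch (the function name and error handling are my own additions, and the version/options handling from the real code is omitted):

```python
def socrata_export_url(uid: str, fmt: str) -> str:
    """Build the export URL for a Socrata dataset, mirroring the
    endpoint strings in the compute() snippet above."""
    if fmt == "csv":
        return f"https://data.cityofnewyork.us/api/views/{uid}/rows.csv"
    if fmt == "geojson":
        return (
            f"https://nycopendata.socrata.com/api/geospatial/{uid}"
            "?method=export&format=GeoJSON"
        )
    raise ValueError(f"unsupported socrata format: {fmt}")


print(socrata_export_url("abcd-1234", "csv"))
# https://data.cityofnewyork.us/api/views/abcd-1234/rows.csv
```

Note that after this step the `socrata` key is gone: the source is rewritten as a plain `url` source, so the rest of the pipeline only ever has to handle urls.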
## Python Script

For certain datasets, the source requires additional processing with an associated Python script. The script is invoked in the lines below, inside the `compute` function.
```python
script_name = _config["dataset"]["source"]["script"]
module = importlib.import_module(f"library.script.{script_name}")
scriptor = module.Scriptor(config=config)
url = scriptor.runner()
```
Note that the `Scriptor` class is imported dynamically from the dataset's script module, and the final output url (the temporary path set for the object) is returned by that dataset's `runner()` function.
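To make the contract concrete, here is a hypothetical `Scriptor`: `runner()` produces a local file and returns its path for `compute` to treat as the source url. Everything below, including the module name in the comment, is an illustrative assumption, not an actual script from the repo:

```python
# Hypothetical library/script/my_dataset.py — a minimal Scriptor sketch.
# Real scripts would download and transform source data; this one just
# writes a small temporary CSV and returns its path.
import csv
import tempfile


class Scriptor:
    def __init__(self, config: dict):
        self.config = config

    def runner(self) -> str:
        # Write processed records to a temporary location ...
        with tempfile.NamedTemporaryFile(
            mode="w", suffix=".csv", delete=False, newline=""
        ) as f:
            writer = csv.writer(f)
            writer.writerow(["id", "name"])
            writer.writerow([1, "example"])
            path = f.name
        # ... and hand that path back to compute() as the source "url"
        return path


url = Scriptor(config={}).runner()
print(url.endswith(".csv"))  # True
```

Because `compute` only needs the returned path, each dataset's script is free to do arbitrary work internally as long as it exposes a `Scriptor` with a `runner()` method.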
## Improvement

It should be noted that the following lines do not currently do what the comment claims: the template passed to `Validator` is parsed, not unparsed.

```python
# Validate unparsed, unrendered file
Validator(self.parsed_unrendered_template)()
```