Config - NYCPlanning/db-data-library GitHub Wiki

Config might seem intimidating at first look, but it essentially does one thing: it constructs a list of variables from the template yml files and then passes those variables as arguments to the data library ingestion process.

For some datasets, however, there is an additional step: when processing is required on the dataset, a Python script is run to pass source-type-dependent configuration back to the ingestor. This is achieved using the @property decorator. To read more about @property, check out the blog post here.
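As a quick illustration (a hypothetical sketch, not code from this repo), @property lets a method be read like a plain attribute while its value is still computed each time it is accessed, which is how Config can expose values such as source_type as self.source_type rather than self.source_type():

class MiniConfig:
    def __init__(self, parsed_template: dict):
        self.parsed_template = parsed_template

    @property
    def source_type(self) -> str:
        # read without parentheses; recomputed from the template on each access
        return list(self.parsed_template["dataset"]["source"].keys())[0]

config = MiniConfig({"dataset": {"source": {"socrata": {"uid": "abcd-1234", "format": "csv"}}}})
print(config.source_type)  # -> "socrata"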

Source Types

The first key in the source section of the template determines the type of source we are getting. It could be a script, a url, or a socrata link.

@property
def source_type(self) -> str:
    """determine the type of the source, either url, socrata or script"""
    template = self.parsed_unrendered_template
    source = template["dataset"]["source"]
    return list(source.keys())[0]
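For context, here is a hypothetical sketch of the three source shapes a template might contain (the top-level keys come from the code in this page; the field values are illustrative):

# hypothetical examples of the three source shapes
url_source = {"url": {"path": "https://example.com/data.csv", "subpath": ""}}
socrata_source = {"socrata": {"uid": "abcd-1234", "format": "csv"}}
script_source = {"script": "some_dataset_script"}

for source in (url_source, socrata_source, script_source):
    print(list(source.keys())[0])  # url, socrata, script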

compute

Once the source type is determined, the crucial next step for Config is to call the compute function, which organizes the parsed results from the templates and reformats them so that they are accessible to the ingest functions. For socrata, this process looks like the following:

        if self.source_type == "socrata":
            # For socrata we compute the url and add the url object to the config file
            _uid = self.parsed_unrendered_template["dataset"]["source"]["socrata"][
                "uid"
            ]
            _format = self.parsed_unrendered_template["dataset"]["source"]["socrata"][
                "format"
            ]
            config = self.parsed_rendered_template(version=self.version_socrata(_uid))

            if _format == "csv":
                url = f"https://data.cityofnewyork.us/api/views/{_uid}/rows.csv"
            if _format == "geojson":
                url = f"https://nycopendata.socrata.com/api/geospatial/{_uid}?method=export&format=GeoJSON"

            options = config["dataset"]["source"]["options"]
            geometry = config["dataset"]["source"]["geometry"]
            config["dataset"]["source"] = {
                "url": {"path": url, "subpath": ""},
                "options": options,
                "geometry": geometry,
            }
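The version passed to parsed_rendered_template comes from version_socrata(_uid). That implementation is not shown on this page, but a minimal sketch of the idea, assuming Socrata's view-metadata endpoint and its rowsUpdatedAt field (the repo's actual logic may differ), could look like this:

import requests
from datetime import datetime, timezone

def version_socrata(uid: str) -> str:
    # hypothetical sketch: derive a version string from the last time
    # the dataset's rows were updated, per Socrata's view metadata
    metadata = requests.get(
        f"https://data.cityofnewyork.us/api/views/{uid}.json"
    ).json()
    last_updated = datetime.fromtimestamp(metadata["rowsUpdatedAt"], tz=timezone.utc)
    return last_updated.strftime("%Y%m%d")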

Python Script

For certain datasets, the source requires additional processing with an associated Python script. The script is called in the lines below, inside the compute function.

script_name = _config["dataset"]["source"]["script"]
module = importlib.import_module(f"library.script.{script_name}")
scriptor = module.Scriptor(config=config)
url = scriptor.runner()

Note that the Scriptor class is imported from the dataset's script module, and the final output url (the temporary path where the processed object is written) is returned by that script's runner() function.
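As a rough sketch of what such a script could look like (the class name Scriptor and the runner() method come from the snippet above; the source URL, file suffix, and temp-file handling are purely illustrative assumptions):

import requests
from tempfile import NamedTemporaryFile

class Scriptor:
    def __init__(self, config: dict):
        self.config = config

    def runner(self) -> str:
        # hypothetical: fetch the raw file, write it to a temporary path,
        # and return that path so the ingestor can treat it as the source url
        response = requests.get("https://example.com/raw-data.csv")
        response.raise_for_status()
        with NamedTemporaryFile(suffix=".csv", delete=False) as f:
            f.write(response.content)
            return f.name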

Improvement

It should be noted that the lines below do not currently do what the comment says: the object being validated is the parsed, unrendered template, not an unparsed, unrendered file.

        # Validate unparsed, unrendered file
        Validator(self.parsed_unrendered_template)()
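
Assuming the intent is simply to describe what actually happens, the smallest fix is to correct the comment (whether the validator should instead receive the raw, unparsed file is a separate question):

        # Validate parsed, unrendered template
        Validator(self.parsed_unrendered_template)()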