Archive - NYCPlanning/db-data-library GitHub Wiki

The purpose of the archive function is to move the ingested dataset into the S3 bucket.

Looking at the design of the archive action gives a sense of the function-driven design pattern with which the data library is implemented.

call to Ingestor

self.ingestor = Ingestor()

However, creating the Ingestor object in the step above does not yet complete the setup. The ingestor takes an input, output_format, from the archive action to determine which wrapper ingestion function to run.

ingestor_of_format = getattr(self.ingestor, output_format)

Once the output format has determined which ingestor function to use, that function is called in the line below to start ingesting. It also returns the parameters required for archiving.

        # Initiate ingestion
        output_files, version, acl = ingestor_of_format(path, *args, **kwargs)
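
The dispatch described above can be sketched in isolation. This is a minimal illustration, not the library's code: the `Ingestor` class here is a hypothetical stand-in with two made-up format methods, the `ingest` helper and the returned values are assumptions, and only the `getattr` lookup and the three-value return mirror the lines shown above.

```python
class Ingestor:
    """Hypothetical stand-in for the library's Ingestor (assumption, not the real class)."""

    def csv(self, path, *args, **kwargs):
        # Pretend to ingest, then report what the archive step needs:
        # output files, version, and access control setting.
        return [f"{path}.csv"], kwargs.get("version", "20220420"), "public-read"

    def geojson(self, path, *args, **kwargs):
        return [f"{path}.geojson"], kwargs.get("version", "20220420"), "public-read"


def ingest(ingestor, path, output_format, *args, **kwargs):
    """Look up the method named after output_format and call it."""
    ingestor_of_format = getattr(ingestor, output_format, None)
    if not callable(ingestor_of_format):
        # Fail with a clear message instead of a bare AttributeError
        raise ValueError(f"unsupported output format: {output_format}")
    return ingestor_of_format(path, *args, **kwargs)


output_files, version, acl = ingest(
    Ingestor(), "templates/example", "csv", version="20220101"
)
```

Selecting the method by name keeps the archive action free of per-format branching: supporting another format only requires adding a method with that name to the ingestor.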

latest

One important task that this script handles is versioning on the S3 space. This involves knowing all the versions available on the space and setting the latest version, which many other DCP workflows depend on to access the datasets correctly. The archive code uses a boolean flag to place the dataset in the latest folder if asked, updates the metadata so that the latest carries its version number, and removes any previous version from the latest folder.
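
The versioning steps described above can be sketched without touching S3 at all. In this illustration a plain dictionary stands in for the bucket, and the key layout `name/version/filename`, the `publish` and `available_versions` helpers, and the metadata shape are all assumptions made for the example rather than the library's actual implementation.

```python
from typing import Dict, List


def publish(bucket: Dict[str, dict], name: str, version: str,
            files: Dict[str, bytes], latest: bool = False) -> None:
    """Store files under name/version/ and, if asked, refresh name/latest/."""
    for filename, body in files.items():
        bucket[f"{name}/{version}/{filename}"] = {
            "body": body,
            "metadata": {"version": version},
        }
    if latest:
        prefix = f"{name}/latest/"
        # Remove whatever a previous version left in the latest folder
        for key in [k for k in bucket if k.startswith(prefix)]:
            del bucket[key]
        # Copy the new files in, tagging them with their version number
        for filename, body in files.items():
            bucket[prefix + filename] = {
                "body": body,
                "metadata": {"version": version},
            }


def available_versions(bucket: Dict[str, dict], name: str) -> List[str]:
    """List every version folder for a dataset, excluding 'latest'."""
    versions = {k.split("/")[1] for k in bucket if k.startswith(f"{name}/")}
    versions.discard("latest")
    return sorted(versions)


bucket: Dict[str, dict] = {}
publish(bucket, "example", "20220101",
        {"example.csv": b"old", "example.zip": b"old"}, latest=True)
publish(bucket, "example", "20220420", {"example.csv": b"new"}, latest=True)
latest_keys = sorted(k for k in bucket if k.startswith("example/latest/"))
```

After the second call, the latest folder holds only the files of the newer version, while both dated folders remain available.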

It is worth highlighting that there are a few different ways the program can determine the version of a file. It can be set manually by the user from the main accessor function, e.g. library archive -v 20220420. It can also be taken from the Config yml template, or default to the date on which the archive action is run. When a dataset is queried from the Socrata API, the snippet below is run to determine the latest available update for that dataset.

    def version_socrata(self, uid: str) -> str:
        """using the socrata API, collect the 'data last update' date"""
        metadata = requests.get(
            f"https://data.cityofnewyork.us/api/views/{uid}.json"
        ).json()
        version = datetime.fromtimestamp(metadata["rowsUpdatedAt"]).strftime("%Y%m%d")
        return version
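
The version sources described above can be combined in a small helper. This is a sketch under stated assumptions: the function names are hypothetical, and the precedence (command-line value first, then the config value, then today's date) is assumed for illustration since the page does not spell it out. The second helper repeats the timestamp conversion from `version_socrata`, but pins it to UTC so the result does not depend on the machine's time zone, which differs from the local-time conversion in the snippet above.

```python
from datetime import date, datetime, timezone
from typing import Optional


def resolve_version(cli_version: Optional[str] = None,
                    config_version: Optional[str] = None,
                    today: Optional[date] = None) -> str:
    """Pick a version string; the precedence here is an assumption."""
    if cli_version:
        # e.g. the value given with: library archive -v 20220420
        return cli_version
    if config_version:
        # value read from the Config yml template
        return config_version
    # Default: date on which the archive action is run
    return (today or date.today()).strftime("%Y%m%d")


def version_from_epoch(seconds: int) -> str:
    """Format an epoch timestamp (such as rowsUpdatedAt) as YYYYMMDD in UTC."""
    return datetime.fromtimestamp(seconds, tz=timezone.utc).strftime("%Y%m%d")


manual = resolve_version(cli_version="20220420", config_version="20210101")
from_config = resolve_version(config_version="20210101")
default = resolve_version(today=date(2022, 4, 20))
socrata = version_from_epoch(1650456000)  # noon UTC on 20 April 2022
```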