Good practices for managing dependent objects and features (import, typing) - GlacioHack/xdem GitHub Wiki

There are several good practices to remember when creating inter-dependent objects and features, in order to build a logical package structure and avoid circularity issues; we use them throughout GeoUtils and xDEM. (Original posts here and here)

Problem

Let's take the example of a class Object that implements a method feature() (in our case, Object could be Raster and feature() could be reproject()).

We separate objects and features into separate Python modules, so Object depends on feature by importing a _feature() function from a separate feature.py module (that is the natural way of doing it, and good practice).

However, we might also want Object to appear within feature.py on different occasions:

  • To declare the type of Object as a possible input of feature(),
  • To check user input at runtime by doing isinstance(user_input, Object),
  • Or to create a new Object by calling a class method of Object.

This problem can scale badly with multiple objects that depend on each other through multiple features...

Additionally, within a _feature method, the layering can also become complex in the case of multiple backends (chunked operations, Dask/multiprocessing support), for instance:

  • Our object's method Object.feature() calls a main function feature(),
  • The parent function feature() dispatches towards either _feature() (core function, unchunked), _feature_dask() (Dask backend), or _feature_multiproc() (multiprocessing backend),
  • The _feature_dask() and _feature_multiproc() backends rely on similar block/chunk logic, so both call a _feature_chunk(), which itself calls _feature() underneath.

So how do we scale this reliably?

Solutions for structure

In terms of layers, the mental model we have is this (example for GeoUtils):

Domain objects (RasterBase / VectorBase / PointCloudBase; inherited by accessors and main objects)
↓
Feature API (parent functions for reproject, proximity, interpolate, grid, for any backend)
↓
Execution backends (Dask / Multiprocessing / Direct core)
↓
Chunked function logic (only for Dask/MP)
↓
Core functions

So we need a structure like this, either for each feature or for a group of features:

geoutils/
├── feature/                 # feature subsystem
│   ├── __init__.py
│   ├── api.py
│   ├── core.py
│   ├── chunked.py
│   └── backends/
│       ├── __init__.py
│       ├── dask.py
│       └── multiprocessing.py
└── object/
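As a hedged sketch of this layering, the parent function in api.py might dispatch like the toy example below. The feature()/_feature() names follow the example above; the chunking and backend bodies are illustrative stand-ins (real Dask/multiprocessing wiring is omitted):

```python
def _feature(arr):
    # Core function (core.py): operates directly on the full, unchunked data.
    return [x * 2 for x in arr]

def _feature_chunk(chunk):
    # Shared block/chunk logic (chunked.py), used by both chunked backends.
    return _feature(chunk)

def _feature_dask(arr, chunksize=2):
    # Dask backend stand-in (backends/dask.py): maps the chunk logic over blocks.
    chunks = [arr[i:i + chunksize] for i in range(0, len(arr), chunksize)]
    return [x for c in chunks for x in _feature_chunk(c)]

def _feature_multiproc(arr, chunksize=2):
    # Multiprocessing backend stand-in (backends/multiprocessing.py):
    # same chunk logic, different executor.
    chunks = [arr[i:i + chunksize] for i in range(0, len(arr), chunksize)]
    return [x for c in chunks for x in _feature_chunk(c)]

def feature(arr, backend="core"):
    # Parent function (api.py): single entry point that dispatches per backend.
    dispatch = {"core": _feature, "dask": _feature_dask, "multiproc": _feature_multiproc}
    return dispatch[backend](arr)
```

Keeping the dispatch table in api.py means the domain objects only ever import feature(), and each backend module stays importable on its own.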

Solution for imports/typing

There are several aspects to consider to avoid issues with imports and facilitate typing:

  1. Non-runtime type checking (in function declaration)

To solve static, non-runtime typing, a good practice is to use typing's TYPE_CHECKING at the top of the file to isolate typing-only imports.

For example, instead of doing:

# In feature.py

from geoutils import Raster  # Creates a circular dependency

def myfeature(input: Raster, ...):
    ...

One should do:

# In feature.py

from __future__ import annotations  # Defers annotation evaluation, so Raster needs no runtime import

from typing import TYPE_CHECKING

if TYPE_CHECKING:
    from geoutils import Raster  # Only runs during type checking, no circularity issue

def myfeature(input: Raster, ...):
    ...

  2. Duck typing (for runtime user-input checks)

To check object types at runtime, a good practice is to use duck typing (or typing.Protocol in certain cases, but protocols fit less well with our objects).

This means doing hasattr(obj, "obj_attr") instead of isinstance(obj, Object), thereby removing the need to import Object during runtime.

For example, instead of doing:

# In feature.py

from geoutils import Raster  # Creates a circular dependency

def myfeature(input, ...):
    # Check user input is correct
    if not isinstance(input, Raster):
        raise ValueError("Wrong input")
    # Feature uses some specific attributes...
    return func(input.crs)

One should do:

# In feature.py

def myfeature(input, ...):
    if not hasattr(input, "crs"):
        raise ValueError("Wrong input, did not implement 'crs'")
    # Feature uses specific attributes safely
    return func(input.crs)
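Where a stricter contract than a bare hasattr check is desired, typing.Protocol can formalize the duck-typed interface. A minimal sketch, assuming a hypothetical HasCRS protocol that lists only the attributes the feature actually uses (runtime_checkable allows isinstance() against the protocol, still without importing the concrete Object):

```python
from typing import Protocol, runtime_checkable

@runtime_checkable
class HasCRS(Protocol):
    # Only the attributes the feature relies on; isinstance() checks presence, not types
    crs: str

def myfeature(input):
    if not isinstance(input, HasCRS):
        raise ValueError("Wrong input, did not implement 'crs'")
    return input.crs
```

Note that runtime_checkable only verifies that the attribute exists, so this is duck typing with a named, documentable contract rather than a real type check.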

  3. Lazy/In-method imports (for runtime object instantiation)

Sometimes a cross-import is absolutely necessary, for instance to create a new instance of Object within feature(). (For the same Object this can usually be circumvented by calling a class method on self, such as our from_array, but the issue can be unavoidable when working with different objects, like creating Object1 from Object2.feature().) In this case, we can simply import "lazily", i.e. from within the method, so that the import only triggers at runtime and avoids circularity issues.

For example, instead of doing:

# In feature.py, called by the Raster module

# Creates a circular dependency if a feature of PointCloud needs to do the same for Raster
from geoutils import PointCloud

def myfeature(input, ...):
    pc = func(input)
    return PointCloud(pc)

One should do:

# In feature.py, called by the Raster module

def myfeature(input, ...):
    from geoutils import PointCloud  # Only imported at runtime, when the function is called
    pc = func(input)
    return PointCloud(pc)
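Putting the three practices together, a feature module might look like the following hedged sketch. The geoutils imports mirror the examples above; the function name and the conversion body are hypothetical placeholders:

```python
from __future__ import annotations

from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # Practice 1: typing-only imports, never executed at runtime
    from geoutils import PointCloud, Raster

def raster_to_pointcloud(input: Raster) -> PointCloud:
    # Practice 2: duck-typed runtime check, no Raster import needed
    if not hasattr(input, "crs"):
        raise ValueError("Wrong input, did not implement 'crs'")
    # Practice 3: lazy import, only triggered once we actually build the new object
    from geoutils import PointCloud
    return PointCloud(...)
```

The module itself imports cleanly even before geoutils is fully initialized, since both cross-imports are deferred out of module load time.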