Instructions for developers - hzovaro/spaxelsleuth GitHub Wiki
Tests
In the tests/ directory you will find a series of tests which check that the code is working as expected. Most (but not all) of these run as GitHub Actions automatically on push or pull requests to/from the dev or main branches on a subset of the full Hector and SAMI data sets.
- Unit tests test low-level functions used throughout the code, e.g. functions in `utils.density` for calculating the electron density.
- Integration tests test the end-to-end functionality of spaxelsleuth by running `make_df` on test data sets in different configurations and using assertion tests to ensure that the output is as expected (e.g., checking that spaxels where the H$\alpha$ flux is NaN do not have finite SFRs, as these are calculated from the H$\alpha$ flux).
- Regression tests check for changes against "reference" DataFrames that are assumed to represent the "ground truth".
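As a concrete illustration, the NaN-SFR check mentioned above boils down to an assertion of the following kind (the column names here are illustrative only, not necessarily the actual spaxelsleuth column names):

```python
import numpy as np
import pandas as pd

# Toy DataFrame standing in for a spaxelsleuth output DataFrame;
# column names are illustrative only.
df = pd.DataFrame({
    "HALPHA (total)": [1.5, np.nan, 3.0],
    "SFR (total)":    [0.1, np.nan, 0.2],
})

# Spaxels with NaN Halpha fluxes must not have finite SFRs, since the SFR
# is derived from the Halpha flux.
cond_nan_ha = df["HALPHA (total)"].isna()
assert not np.isfinite(df.loc[cond_nan_ha, "SFR (total)"]).any()
```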
Unit tests are located in `tests/unit_tests` and the integration & regression tests are located in `tests/integration_tests` (as of May 2024):
```
integration_tests/
├── hector_integration_test.py
├── hector_regression_test.py
├── sami_integration_test.py
├── sami_regression_test.py
└── test_config.json
unit_tests/
├── test_config.json
├── test_continuum.py
├── test_electron_density.py
├── test_get_label.py
├── test_io.py
├── test_load_user_config.py
├── test_temperature_and_metallicity.py
└── test_velocity.py
```
Running tests
The simplest way to run the tests is to use `pytest`. I strongly recommend running these tests any time you make even minor changes to the code.
```
cd tests/unit_tests
pytest  # Will run all unit tests
cd ../integration_tests
pytest  # Will run both integration & regression tests for SAMI and Hector
```

Note that you must be in the same directory as the Python test scripts for them to run.
Important note: prior to creating new DataFrames during the integration and regression tests, any existing DataFrames with the same input parameters in the `tests/output/` directory, e.g. from previous runs of these tests, are permanently deleted in order to save disk space.
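The cleanup behaviour described above can be sketched roughly as follows (the function name and glob pattern are illustrative, not spaxelsleuth's actual implementation):

```python
from pathlib import Path

def delete_existing_outputs(output_dir, pattern="*.hd5"):
    """Delete any existing output DataFrames in output_dir matching pattern,
    returning the names of the deleted files. Illustrative sketch only."""
    deleted = []
    for f in Path(output_dir).glob(pattern):
        f.unlink()
        deleted.append(f.name)
    return deleted
```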
Updating the reference DataFrames
Sometimes, you will make changes to the code that render the reference DataFrames obsolete - e.g., perhaps you've changed the way that SFRs are calculated, you've spotted a bug in how a column is calculated, you've added or removed columns, or you've changed some column names. In these cases, the regression tests will fail, and you will need to update them by doing the following:
- Before you do anything, write new unit tests to test any new functionality that you've added - e.g. if you've changed the SFR calculation, write a new unit test to make sure it works as expected.
- Then, run the integration tests.
- If the integration tests pass, then you can modify the regression test file to check for changes in any columns that should not have been modified by your changes.
If you've modified the values in one or more columns, e.g. the "SFR (total)" column, then change the line

```python
for c in [c for c in df_old.columns if c not in ["timestamp", "df_fname_tag"]]:
```

to

```python
for c in [c for c in df_old.columns if c not in ["timestamp", "df_fname_tag", "SFR (total)"]]:
```

This will stop the code from checking for changes to the "SFR (total)" column.
If you've added extra columns, then the assertion

```python
assert len(added_cols) == 0, f"The following columns in df_new do not exist in df_old: {', '.join(added_cols)}"
```

will fail. Comment it out, and use a debugger to check that `added_cols` contains the new columns you've added.
If you've removed some columns, then the assertion

```python
assert len(removed_cols) == 0, f"The following columns in df_old are missing in df_new: {', '.join(removed_cols)}"
```

will fail. Again, comment this line out and use a debugger to check that `removed_cols` contains the columns you've removed.
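For reference, `added_cols` and `removed_cols` amount to simple column-set differences between the two DataFrames; a minimal sketch (`compare_dataframes` itself may do more than this):

```python
import pandas as pd

# Minimal sketch of how added_cols / removed_cols might be derived when
# comparing a new DataFrame against a reference one.
df_old = pd.DataFrame(columns=["timestamp", "SFR (total)", "old col"])
df_new = pd.DataFrame(columns=["timestamp", "SFR (total)", "new col"])

added_cols = [c for c in df_new.columns if c not in df_old.columns]
removed_cols = [c for c in df_old.columns if c not in df_new.columns]
```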
If you've renamed some columns, then simply rename them in the reference DataFrame `df_reference` before passing it to `compare_dataframes()`:

```python
df_reference = df_reference.rename(columns={
    "old_column_name": "new_column_name",
})
compare_dataframes(df_new, df_reference)
```
- Once you've made the appropriate changes and the regression tests pass, it's time to update the reference DataFrames. Unless you are running the tests on local data (see Running integration & regression tests locally), these are tracked by git, so you will first have to delete the old ones using
```
cd tests/reference/hector/
git rm *.hd5
```

and copy over the new ones:

```
cd tests/output/hector/
cp *.hd5 ../../reference/hector/
cd ../../reference/hector/
git add *.hd5
git commit -m "Updated reference DataFrames"
```

If you are running these tests locally then you can simply `rm` and `cp` the files without using git.
- Remove any changes you've made to the regression tests, e.g. using `git checkout hector_regression_test.py`.
- Finally, run the regression tests one more time to double-check that the files have been copied over correctly.
Running integration & regression tests locally
Due to the hard GitHub repository size limit, it is not feasible to run the integration and regression tests via GitHub Actions for large numbers of galaxies. Additionally, it is not possible to upload individual files >100 MB in size to GitHub unless you use Git Large File Storage (LFS); I experimented with LFS and decided against it, since it was unintuitive and cumbersome to use and can leave the repository in a broken state from which the only recovery is to permanently delete and re-create it. As a result, the current automatic tests are not very comprehensive, since they only run on a small number of galaxies (a single galaxy in the case of Hector, due to the large file sizes!).
I strongly recommend that future developers run their own tests on a larger number of galaxies using files stored locally:
- First, run `pytest` on the Hector integration & regression tests to make sure they do not return any errors on the single galaxy:

```
cd tests/integration_tests
pytest hector*
```
- Make a copy of `tests/integration_tests/test_config.json` called `tests/integration_tests/local_test_config.json`.
- Create a new directory to store the test data set, e.g. `tests/local_test_data/hector/`. You can put these files anywhere you like. Note that the directory structure in `data/` must match the expected locations of the different data products in the survey submodule, e.g. `io.hector`. For Hector, as of May 2024, the directory structure looks like this:
```
tests/local_test_data/hector/
├── data
│   ├── cont_subtracted
│   ├── cubes
│   ├── emission_cubes
│   ├── gauss_fits
│   └── initial_stel_kin
├── output
└── reference
```
- Modify `local_test_config.json` to point to the directories containing your test data set. Note that you can use relative or absolute paths.
```json
"hector": {
    "output_path": "../local_test_data/hector/output/",
    "reference_output_path": "../local_test_data/hector/reference/",
    "input_path": "../local_test_data/hector/data/",
    "data_cube_path": "../local_test_data/hector/data/cubes"
}
```
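Before running the tests, it can be worth sanity-checking that every path in your local config actually exists on disk; a minimal sketch (the helper name is mine, not part of spaxelsleuth):

```python
import json
from pathlib import Path

def missing_config_paths(config_file, survey="hector"):
    """Return the config entries for `survey` whose paths do not exist.
    An empty dict means all paths are present. Illustrative sketch only."""
    with open(config_file) as f:
        paths = json.load(f)[survey]
    return {key: p for key, p in paths.items() if not Path(p).exists()}
```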
Important note: I added `tests/local_test_data` to the `.gitignore` to make sure your local test files will NOT be added to the git repository. If you decide to place them elsewhere within the repository then you must be careful not to accidentally add them to the repository!
- Then, copy the input files for a larger number of galaxies into the appropriate subfolders within `tests/local_test_data/hector/data/`.
- Run `tests/integration_tests/hector_integration_test.py` to create new "reference" DataFrames against which future DataFrames can be compared:

```
cd tests/integration_tests
python hector_integration_test.py local_test_config.json
```
This will run `hector_integration_test.py` but will read & write files to the directories specified in `local_test_config.json` instead of the default `test_config.json`.
- IFF you are satisfied that they do not contain any bugs or errors, copy the newly made DataFrames from `tests/local_test_data/hector/output/` to `tests/local_test_data/hector/reference/`:

```
cp tests/local_test_data/hector/output/*.hd5 tests/local_test_data/hector/reference/
```
- Run `python hector_regression_test.py local_test_config.json` one more time to check that you've copied the DataFrames over correctly.
- Repeat step 7 whenever any changes are made to the codebase.
To update the reference DataFrames for the local tests, simply follow the steps in Updating the reference DataFrames, but make sure to pass `local_test_config.json` as the input argument to the integration/regression tests, and be sure not to accidentally add any of the output files to the git repository.
Updating the Hector input catalogue
`io.hector.make_metadata_df()` reads an input catalogue (stored in `data/0_2cubed_galaxies_v0_01.txt`) to get "metadata", e.g. stellar masses, redshifts, etc., for each galaxy. The precise format and contents of this file will undoubtedly change with future data releases. It's therefore important to update `make_metadata_df()` appropriately so that it correctly reads in the catalogue.
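For example, a whitespace-delimited catalogue of this kind might be read with pandas along these lines (the column names here are purely illustrative; check the actual catalogue header before updating `make_metadata_df()`):

```python
from io import StringIO

import pandas as pd

# Stand-in for data/0_2cubed_galaxies_v0_01.txt; columns are illustrative.
catalogue = StringIO(
    "ID z mstar\n"
    "G001 0.05 10.2\n"
    "G002 0.08 10.8\n"
)
df_metadata = pd.read_csv(catalogue, sep=r"\s+").set_index("ID")
```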
Important note: If you update the file format of the input catalogue (e.g. change it from a `.txt` file to `.fits`) and find that the integration/regression tests pass locally but fail when run remotely as part of a GitHub Action, it may be because GitHub can't "see" the file. Make sure you have added the appropriate extension to the `package_data` entry in `setup.py`, e.g.:

```python
package_data={
    "spaxelsleuth": ["**/*.csv", "**/*.json", "**/*.txt", "**/*.fits"]
}
```