Using spaxelsleuth with other data - hzovaro/spaxelsleuth GitHub Wiki

spaxelsleuth is designed to be modular such that it is straightforward to implement support for processing data from new sources.

How does it work?

spaxelsleuth.io.io

spaxelsleuth.io.io is the module containing generic functions for creating and loading both the "metadata" and the "full" spaxelsleuth DataFrames. It contains the following functions:

make_df(mysurvey)

make_df(mysurvey) creates and writes the full spaxelsleuth DataFrame for data from source mysurvey.

  1. The "metadata" DataFrame corresponding to mysurvey is loaded, if mysurvey implements the relevant functions (see make_metadata_df() and load_metadata_df() below).
  2. Under the hood, it calls mysurvey.process_galaxies() for each galaxy in mysurvey (and runs these across multiple cores to reduce computation time), and consolidates the output into a single DataFrame in which each row represents a pixel or spatial bin and each column represents an associated quantity, e.g. v_gas. See the process_galaxies() section below for the required inputs and outputs of this function.
  3. If a metadata DataFrame is present, it is merged with the full DataFrame on galaxy ID.
  4. utils.add_columns() is then called, which calculates additional data products, e.g. SFRs, extinctions, metallicities, etc. based on the columns returned by mysurvey.process_galaxies() and adds them to the DataFrame. Data quality and S/N cuts are optionally applied in this step via flags that can be set by the user.
  5. The DataFrame is then saved to file in the output directory (specified in the configuration file as settings["mysurvey"]["output_path"]). The filename can be specified by the user as an input argument; otherwise, it follows the formula
mysurvey_<bin_type>_<ncomponents>-comp_<extcorr>_minSNR=<eline_SNR_min>_minANR=<eline_ANR_min>_<debug>_<df_fname_tag>.hd5.
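As a concrete illustration, the naming formula above can be sketched as a small helper. This is a hypothetical reimplementation: the real logic lives inside make_df(), and the exact `<extcorr>` and `<debug>` tokens used here are assumptions.

```python
# Hypothetical sketch of the filename formula above; parameter names mirror
# make_df()'s keyword arguments, but the exact tokens are assumptions.
def df_fname(survey, bin_type, ncomponents, correct_extinction,
             eline_SNR_min, eline_ANR_min, debug=False, df_fname_tag=None):
    fname = (f"{survey}_{bin_type}_{ncomponents}-comp"
             f"_{'extcorr' if correct_extinction else 'noextcorr'}"
             f"_minSNR={eline_SNR_min}_minANR={eline_ANR_min}")
    if debug:
        fname += "_DEBUG"
    if df_fname_tag is not None:
        fname += f"_{df_fname_tag}"
    return fname + ".hd5"

fname = df_fname("mysurvey", "default", 1, True, 5, 3)
# e.g. "mysurvey_default_1-comp_extcorr_minSNR=5_minANR=3.hd5"
```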

load_df()

load_df(mysurvey, ...) loads DataFrames created using make_df(mysurvey, ...). See the docstring for detailed information about options and parameters that can be passed.

make_metadata_df()

make_metadata_df(mysurvey, **kwargs) calls mysurvey.make_metadata_df(**kwargs).

load_metadata_df()

load_metadata_df(mysurvey) calls mysurvey.load_metadata_df().

and so on for the other survey-specific functions.

Requirements

In order to implement support for data from a new source (which we will call mysurvey in this example), carry out the following steps:

1. Edit your configuration file

Add a new entry to your local configuration file called mysurvey with key/value pairs as follows:

    "mysurvey": {
        "data_cube_path": "mysurvey/input/",  // Path to the raw data cubes. 
        "input_path": "mysurvey/input",       // Path to other data products.
        "output_path": "mysurvey/output",     // Path where spaxelsleuth DataFrames will be saved.
        "flux_units": 1e-16,  // Flux/flux density units of the input data; e.g., 1e-16 corresponds to units of 10^-16 erg s^-1 cm^-2 (Å^-1).
        "ncomponents": [      // Emission line fitting options, e.g. if your data set has available emission line products resulting from 1, 2 and 3-component fits. 
            1,
            2,
            3
        ],
        "bin_types": [        // Spatial binning scheme options. "default" refers to unbinned data.
            "default"
        ],
        "eline_list": [       // List of emission lines that have been fitted to your data set.
            "OII3726",
            "OII3729",
            "HBETA",
            "OIII5007",
            "OI6300",
            "HALPHA",
            "NII6583",
            "SII6716",
            "SII6731"
        ],
        "sigma_inst_kms": 40  // Instrumental resolution (measured as the sigma of a Gaussian LSF).
    },
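Within mysurvey.py, these settings can then be used to validate inputs before processing. A minimal sketch, using a plain dict that mirrors the entry above in place of spaxelsleuth's real settings object:

```python
# Sketch: the "mysurvey" config entry above as a plain Python dict
# (in spaxelsleuth, the same structure is exposed via the settings object).
settings = {
    "mysurvey": {
        "data_cube_path": "mysurvey/input/",
        "input_path": "mysurvey/input",
        "output_path": "mysurvey/output",
        "flux_units": 1e-16,
        "ncomponents": [1, 2, 3],
        "bin_types": ["default"],
        "eline_list": ["OII3726", "OII3729", "HBETA", "OIII5007",
                       "OI6300", "HALPHA", "NII6583", "SII6716", "SII6731"],
        "sigma_inst_kms": 40,
    },
}

# Validate a requested bin_type/ncomponents combination before processing
bin_type, ncomponents = "default", 1
assert bin_type in settings["mysurvey"]["bin_types"]
assert ncomponents in settings["mysurvey"]["ncomponents"]
```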

2. Create a new module called mysurvey

mysurvey should contain the following functions:

  1. process_galaxies() (required)
  2. make_metadata_df() (optional)

process_galaxies(args)

Inputs

args, which is a list that is expanded as follows:

gg, gal, ncomponents, bin_type, df_metadata, kwargs = args 

where

  • gg is the index of galaxy with ID gal in the list of galaxies processed by make_df()
  • gal is the galaxy ID, which must be of type int rather than str
  • ncomponents is the number of kinematic components fitted to the emission lines (must be a valid entry in settings["mysurvey"]["ncomponents"])
  • bin_type is the binning scheme used (must be a valid entry in settings["mysurvey"]["bin_types"])
  • df_metadata is the metadata DataFrame (if no such DataFrame is available, this parameter will be None)
  • kwargs is a dict of keyword arguments that are passed as extra arguments to make_df(). For example, say your survey contains data products corresponding to two different ways to fit stellar kinematics. Then you could pass a keyword argument to make_df(), for instance "stekin_fit_moments=2", and access this from within process_galaxies as kwargs["stekin_fit_moments"].

Note that you don't have to actually use any of these arguments if you don't want to. For example, if mysurvey only has data in a single binning scheme then the bin_type argument may be unused; similarly, there may be no need for you to access the information in df_metadata.

Outputs

process_galaxies() must return a tuple of the form

(rows_arr, colnames)

where

  • rows_arr is a 2D array such that each row represents measurements from a single spaxel (or spatial bin) and each column represents a different quantity.
  • colnames is a list of strings corresponding to the columns in rows_arr.
    • These must be in the same order - i.e., colnames[0] must correspond to the quantity in column rows_arr[:, 0] and so on.
    • The galaxy ID must be stored as a column as "ID".
    • Column names must follow the naming conventions on the column descriptions page - e.g., total Hα fluxes must be stored as "HALPHA (total)" and not "Halpha (total)" or "HALPHA (tot)", otherwise other spaxelsleuth functions will not work.
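The required output structure can be sketched with toy data as follows (the 2×2 maps, column names, and galaxy ID below are purely illustrative):

```python
import numpy as np

# Toy example: two 2x2 maps flattened into rows_arr, one row per spaxel
halpha_map = np.array([[1.0, 2.0], [3.0, 4.0]])
vgas_map = np.array([[10.0, 20.0], [30.0, 40.0]])
gal = 12345  # galaxy ID (must be an int)

rows_list = [halpha_map.ravel(), vgas_map.ravel()]
colnames = ["HALPHA (total)", "v_gas (component 1)"]

# The galaxy ID must be included as the "ID" column
rows_list.append([gal] * halpha_map.size)
colnames.append("ID")

# Transpose so that each row is a spaxel and each column a quantity;
# colnames[i] labels rows_arr[:, i]
rows_arr = np.array(rows_list).T
assert rows_arr.shape == (4, 3)
```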

make_metadata_df()

Inputs

No inputs are required, although you can define input arguments if you wish; any such arguments are passed through from io.make_metadata_df() as keyword arguments.

Outputs

The function itself must return nothing, but it must save a metadata DataFrame to settings["mysurvey"]["output_path"]. Any filename can be used, with the caveat that load_metadata_df() must load the DataFrame from that same location.

load_metadata_df()

Inputs

None.

Outputs

The metadata DataFrame. It is recommended that you raise a FileNotFoundError exception if the DataFrame cannot be found.

Template

Below is a template you can use to get started. For now, this file must be called mysurvey.py and be saved in the spaxelsleuth/io/ folder.

import logging
from pathlib import Path

import numpy as np
import pandas as pd

from spaxelsleuth.config import settings
from spaxelsleuth.utils.misc import _2d_map_to_1d_list
# etc.

logger = logging.getLogger(__name__)

# Paths
input_path = Path(settings["mysurvey"]["input_path"])
output_path = Path(settings["mysurvey"]["output_path"])
data_cube_path = Path(settings["mysurvey"]["data_cube_path"])


def process_galaxies(args):
    # Extract input arguments
    gg, gal, ncomponents, bin_type, df_metadata, kwargs = args 
    
    # Get the x and y coordinates corresponding to measurements in your data. 
    # For example, if you want to extract data points only from pixels in a 
    # boolean mask (here `mask`, `nx` and `ny` are placeholders for your own data):
    y_c_list, x_c_list = np.where(mask)

    # Calculate stuff and store in a dict, such that the keys are the column names and the values are 2D maps of each quantity.
    _2dmap_dict = {}
    _2dmap_dict["HALPHA (total)"] = halpha_total_flux_map
    _2dmap_dict["HALPHA error (total)"] = halpha_total_flux_map_err
    # etc.

    # Access keyword arguments passed to io.make_df() as follows:
    some_kwarg = kwargs["some_kwarg"]
    some_other_kwarg = kwargs["some_other_kwarg"]
    # etc. 

    """
    Tips:
     - Check out io.sami, io.hector, io.s7 and io.lzifu for examples of how to process data in different formats. 
     - You can make use of functions in utils, misc, etc. to automate some calculations. For instance, the D4000Å break strength can be calculated using continuum.compute_d4000(). 
     - Most of the time, the input data format will be in the form of 2D images. Use utils.misc._2d_map_to_1d_list() to convert 2D arrays into 1D arrays.
    """

    # Convert 2D maps to 1D rows 
    rows_list = []
    colnames = list(_2dmap_dict.keys())
    for colname in colnames:
        rows = _2d_map_to_1d_list(_2dmap_dict[colname], x_c_list, y_c_list, nx, ny)
        rows_list.append(rows)

    # Add galaxy ID 
    rows_list.append([gal] * len(x_c_list))
    colnames.append("ID")

    # Transpose so that each row represents a single pixel & each column a measured quantity.
    rows_arr = np.array(rows_list).T

    logger.info(f"Finished processing galaxy {gal} ({gg})")

    return rows_arr, colnames


def make_metadata_df(arg1, arg2):
    # Load a .csv file containing galaxy metadata 
    df_metadata = pd.read_csv("some_metadata_file")

    # Do some calculations, e.g. calculate the angular scale
    df_metadata["kpc per arcsec"] = df_metadata["D_A (Mpc)"] * 1e3 * np.pi / 180.0 / 3600.0

    # Use input arguments 
    if arg1 == "method 1":
        pass  # then compute some quantity using method 1
    elif arg1 == "method 2":
        pass  # then compute it using method 2, etc. 

    # Rename some columns to ensure consistency with spaxelsleuth column conventions, e.g. ensure that the galaxy names are stored in the "ID" column
    df_metadata = df_metadata.rename(columns={
            "galaxy_name": "ID",
        })

    """
    Tips:
     - Check out io.sami, io.hector, io.s7 and io.lzifu for examples of how to calculate various galaxy metadata properties.
    """

    # Save to file 
    df_metadata.to_hdf(output_path / "mysurvey_metadata.hd5", key="metadata")

    return 


def load_metadata_df():
    if not (output_path / "mysurvey_metadata.hd5").exists():
        raise FileNotFoundError(
            f"File {output_path / 'mysurvey_metadata.hd5'} not found. Did you remember to run make_metadata_df() first?"
        )
    df_metadata = pd.read_hdf(output_path / "mysurvey_metadata.hd5")
    return df_metadata

Usage

To use your newly created submodule for handling mysurvey data, do the following:

from spaxelsleuth import load_user_config
load_user_config("/path/to/custom/config/file/.spaxelsleuthconfig.json")
from spaxelsleuth.io.io import make_df, load_df, make_metadata_df, load_metadata_df

nthreads = 10
ncomponents = 1
eline_SNR_min = 3
eline_ANR_min = 3
correct_extinction = True


# Create and load the metadata DataFrame
make_metadata_df(survey="mysurvey", arg1="method 1", arg2=10)  # Note that you can pass in arguments to mysurvey.make_metadata_df() here
df_metadata = load_metadata_df(survey="mysurvey")


# Create the DataFrame
make_df(survey="mysurvey",
        bin_type="default",
        ncomponents=ncomponents,
        eline_SNR_min=eline_SNR_min,
        eline_ANR_min=eline_ANR_min,
        correct_extinction=correct_extinction,
        metallicity_diagnostics=[
            "N2Ha_PP04",
        ],
        nthreads=nthreads,
        some_kwarg=999,
        some_other_kwarg=False,
)

# Load the DataFrame
df = load_df(
        survey="mysurvey",
        bin_type="default",
        ncomponents=ncomponents,
        eline_SNR_min=eline_SNR_min,
        eline_ANR_min=eline_ANR_min,
        correct_extinction=correct_extinction,
)

Adding your survey submodules to spaxelsleuth

If you have created a submodule for a particular survey and think it might be useful for others, feel free to create a branch and make a pull request!
