# Directory Structures and yaml Files in MoSeq Pipeline (dattalab/moseq2-app GitHub Wiki)
The currently accepted depth data extensions are:

- `.dat` (raw depth files from our kinect2 data acquisition software)
- `.tar.gz` (compressed depth files from our kinect2 data acquisition software)
- `.avi` (compressed depth files from the `moseq2-extract` CLI)
- `.mkv` (generated from Microsoft's recording software for the Azure Kinect)
The kinect2nidaq acquisition software produces 3-5 files after a recording:
- depth.dat
- depth_ts.txt
- metadata.json
- nidaq.dat (optional; generated when the Nidaq data stream box is checked)
- RGB.mp4 (optional; generated when the RGB stream box is checked)
The depth.dat file is a 3D depth video stored in raw byte form. Each pixel of each movie frame is a little-endian unsigned 16-bit integer (uint16) representing the distance from the camera, in millimeters.
The depth_ts.txt file records the timestamp of each video frame in plain text. The file has 2 columns separated by a single space. The first column contains the camera's hardware timestamps in ms, while the second column contains timestamps from the NIDAQ, if you enabled data capture from it; otherwise, the second column is populated with zeros. The MoSeq analysis pipeline only uses the first column.
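For example, the raw depth movie and timestamps can be loaded with NumPy. This is a minimal sketch, not the pipeline's own loader; the 512x424 frame size is an assumption based on the Kinect v2 default, so check the DepthResolution field in your metadata.json for the actual frame size.

```python
# Minimal sketch of loading raw MoSeq depth data with NumPy.
# ASSUMPTION: 512x424 frames (the Kinect v2 depth resolution).
import os
import tempfile
import numpy as np

WIDTH, HEIGHT = 512, 424  # assumed Kinect v2 depth resolution

def read_depth_frames(path, width=WIDTH, height=HEIGHT):
    """Load a raw little-endian uint16 depth movie as (n_frames, height, width)."""
    pixels = np.fromfile(path, dtype="<u2")  # little-endian unsigned 16-bit
    return pixels.reshape(-1, height, width)

def read_timestamps(path):
    """Return the first (hardware) column of depth_ts.txt, in milliseconds."""
    return np.loadtxt(path)[:, 0]

# Demo with a synthetic 3-frame recording:
tmpdir = tempfile.mkdtemp()
np.zeros((3, HEIGHT, WIDTH), dtype="<u2").tofile(os.path.join(tmpdir, "depth.dat"))
with open(os.path.join(tmpdir, "depth_ts.txt"), "w") as f:
    f.write("0.0 0\n33.3 0\n66.7 0\n")

frames = read_depth_frames(os.path.join(tmpdir, "depth.dat"))
ts = read_timestamps(os.path.join(tmpdir, "depth_ts.txt"))
print(frames.shape, len(ts))  # (3, 424, 512) 3
```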
The metadata.json file contains the following information in JSON format:
- mouse name
- session name
- time of the recording
- NIDAQ-specific parameters (not important for typical behavioral recordings)
- video-specific parameters (i.e., resolution, data type)
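For illustration, a metadata.json might look like the following. The field names mirror the acquisition metadata stored in the extraction h5 files (see the h5 structure at the end of this page); the values here are hypothetical.

```json
{
  "SubjectName": "mouse01",
  "SessionName": "saline_day1",
  "StartTime": "2021-06-01T10:30:00",
  "DepthResolution": [512, 424],
  "DepthDataType": "UInt16[]",
  "IsLittleEndian": true,
  "NidaqChannels": 0,
  "NidaqSamplingRate": 0.0,
  "ColorDataType": "Byte[]",
  "ColorResolution": [1920, 1080]
}
```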
We recommend recording more than 10 hours of depth video (~1 million frames at 30 frames per second) to ensure quality MoSeq models.
Each MoSeq project is contained within a base directory.
To better organize the extraction, modeling, and analysis results, you can copy the MoSeq notebooks to the base directory and cd into the base directory before launching the notebooks.
When using the CLI, you should be able to see the base directory with ls from your current working directory; use cd to change directories as needed. To better organize the output, specify <base_dir> as both the input directory and the output directory in the CLI commands if your working directory is not <base_dir>.
At the beginning of the MoSeq pipeline, the base directory should contain separate subfolders for each depth recording session. The directory structure is as shown below:
```
.                                 ** current working directory
└── <base_dir>/                   ** base directory with all depth recordings
    ├── session_1/                ** folder containing all of a single session's data
    │   ├── depth.dat             ** depth data - the recording itself
    │   ├── depth_ts.txt          ** timestamps - txt file of the frame timestamps (2 columns: recording timestamps in ms and nidaq timestamps)
    │   └── metadata.json         ** metadata - json file that contains the rodent's info (group, subjectName, etc.)
    ├── ...
    └── session_n/
        ├── depth.dat
        ├── depth_ts.txt
        └── metadata.json
```
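The session layout above can be discovered programmatically. A minimal sketch, assuming each session folder contains the depth.dat file shown in the tree (this is an illustrative helper, not part of the MoSeq API):

```python
# Minimal sketch: discover session folders under a base directory by looking
# for the depth.dat file each kinect2 session should contain.
import tempfile
from pathlib import Path

def find_sessions(base_dir):
    """Return session folders that contain a depth.dat recording."""
    return sorted(p.parent for p in Path(base_dir).glob("*/depth.dat"))

# Demo with a synthetic base directory:
base = Path(tempfile.mkdtemp())
for name in ("session_1", "session_2"):
    (base / name).mkdir()
    (base / name / "depth.dat").touch()

sessions = find_sessions(base)
print([s.name for s in sessions])  # ['session_1', 'session_2']
```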
The first time you run the MoSeq2 Extract Modeling Notebook on your data, the Notebook will generate the yaml files needed by the analysis pipeline.
You can find more information in the section describing yaml files in MoSeq pipeline.
After running the generate progress.yaml cell, a progress.yaml file will be added to the base directory if the file doesn't already exist.
MoSeq uses a config.yaml file to hold all the configurable parameters in the pipeline. After running the generate config.yaml cell, a config.yaml file will be added to the base directory.
MoSeq uses a flip classifier to orient the mouse's head to always point right during the extraction. After running the download flip classifier cell, a file with .pkl extension will be added to the base directory.
After running the generate progress.yaml, generate config.yaml, and download flip classifier cells, the directory structure is as shown below:
```
.                                 ** current working directory
└── <base_dir>/
    ├── config.yaml                             ** - NEW FILE -
    ├── progress.yaml                           ** - NEW FILE -
    ├── flip_classifier_k2_c57_10to13weeks.pkl  ** - NEW FILE -
    ├── session_1/
    │   ├── depth.dat
    │   ├── depth_ts.txt
    │   └── metadata.json
    ├── ...
    └── session_n/
        ├── depth.dat
        ├── depth_ts.txt
        └── metadata.json
```
Running the Interactive ROI Detection Tool in the MoSeq2 Extract Modeling Notebook is an optional step. After running the tool, the notebook generates a session_config.yaml in the base directory that is later used in the extraction step. The directory structure is as shown below:
```
.                                 ** current working directory
└── <base_dir>/
    ├── config.yaml
    ├── session_config.yaml       ** - NEW FILE -
    ├── progress.yaml
    ├── flip_classifier_k2_c57_10to13weeks.pkl
    ├── session_1/
    │   ├── depth.dat
    │   ├── depth_ts.txt
    │   └── metadata.json
    ├── ...
    └── session_n/
        ├── depth.dat
        ├── depth_ts.txt
        └── metadata.json
```
A folder called proc that contains all the extraction results is generated within each session sub-folder. The proc folder contains roi.tiff, first_frame.tiff, bground.tiff, results_00.yaml, results_00.h5 and results_00.mp4.
```
.                                 ** current working directory
└── <base_dir>/
    ├── config.yaml
    ├── session_config.yaml
    ├── progress.yaml
    ├── flip_classifier_k2_c57_10to13weeks.pkl
    ├── session_1/
    │   ├── ...
    │   └── proc/                 ** - NEW FOLDER -
    │       ├── roi.tiff          ** the detected arena
    │       ├── first_frame.tiff  ** the first frame of the recording
    │       ├── bground.tiff      ** the background of the recording
    │       ├── results_00.yaml   ** yaml file storing extraction parameters
    │       ├── results_00.h5     ** h5 file storing the extraction
    │       └── results_00.mp4    ** extracted video
    ├── ...
    └── session_n/
        ├── ...
        └── proc/                 ** - NEW FOLDER -
            ├── roi.tiff
            ├── ...
            ├── results_00.yaml
            ├── results_00.h5
            └── results_00.mp4
```
The following cell will search for the proc/ subfolders containing the extraction output, and copy them to a single aggregate_results/ folder. An index file called moseq2-index.yaml will also be generated with metadata for all extracted sessions.
After running the aggregate results cell, a folder called aggregate_results will be generated in the base directory. The aggregate_results/ folder contains all the data you need to run the rest of the pipeline. The PCA and modeling step will use data in this folder.
```
.                                 ** current working directory
└── <base_dir>/
    ├── aggregate_results/        ** - NEW FOLDER -
    │   ├── session_1_results_00.h5    ** session 1 compressed extraction + metadata
    │   ├── session_1_results_00.yaml  ** session 1 extraction parameters
    │   ├── session_1_results_00.mp4   ** session 1 extracted video
    │   ├── ...
    │   ├── session_n_results_00.h5    ** session n compressed extraction + metadata
    │   ├── session_n_results_00.yaml  ** session n extraction parameters
    │   └── session_n_results_00.mp4   ** session n extracted video
    ├── config.yaml
    ├── moseq2-index.yaml         ** - NEW FILE -
    ├── session_1/
    ├── ...
    └── session_n/
```
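The copy-and-rename performed by the aggregate results cell can be sketched as follows, assuming the `<session>_results_00.*` naming shown in the tree above. This toy version omits the part that writes moseq2-index.yaml; see the notebook for the real implementation.

```python
# Toy sketch of the aggregation step: copy each session's proc/ outputs into
# a single aggregate_results/ folder, prefixing each file with the session name.
import shutil
import tempfile
from pathlib import Path

def aggregate(base_dir):
    """Copy results_00.* files from every <session>/proc/ into aggregate_results/."""
    out = Path(base_dir) / "aggregate_results"
    out.mkdir(exist_ok=True)
    for proc in Path(base_dir).glob("*/proc"):
        session = proc.parent.name
        for f in proc.iterdir():
            if f.name.startswith("results_00"):
                shutil.copy(f, out / f"{session}_{f.name}")
    return out

# Demo with a synthetic extracted session:
base = Path(tempfile.mkdtemp())
(base / "session_1" / "proc").mkdir(parents=True)
(base / "session_1" / "proc" / "results_00.h5").touch()
out = aggregate(base)
print(sorted(p.name for p in out.iterdir()))  # ['session_1_results_00.h5']
```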
After running the train PCA step, a new folder called _pca will be generated, and the PCA results (pca.h5, pca.yaml, pca_components.png, and pca_scree.png) will be stored in it.
```
.                                 ** current working directory
└── <base_dir>/
    ├── _pca/                     ** - NEW FOLDER -
    │   ├── pca.h5                ** - NEW FILE - pca model compressed file
    │   ├── pca.yaml              ** - NEW FILE - pca model YAML metadata file
    │   ├── pca_components.png    ** - NEW FILE -
    │   └── pca_scree.png         ** - NEW FILE -
    ├── aggregate_results/
    ├── config.yaml
    ├── moseq2-index.yaml
    ├── session_1/
    ├── ...
    └── session_n/
```
After running the apply PCA step, a pca_scores.h5 will be added to the _pca folder.
```
.                                 ** current working directory
└── <base_dir>/
    ├── _pca/
    │   ├── pca.h5
    │   ├── pca.yaml
    │   ├── pca_components.png
    │   ├── pca_scree.png
    │   └── pca_scores.h5         ** - NEW FILE - depth video PC scores
    ├── aggregate_results/
    ├── config.yaml
    ├── moseq2-index.yaml
    ├── session_1/
    ├── ...
    └── session_n/
```
After computing model-free changepoints, changepoints.h5 and changepoints_dist.png will be added to _pca.
```
.                                 ** current working directory
└── <base_dir>/
    ├── _pca/
    │   ├── pca.h5
    │   ├── pca_scores.h5
    │   ├── ...
    │   ├── changepoints.h5           ** - NEW FILE - HDF5 file containing the computed changepoints for each session, used to produce the block duration distribution plot
    │   └── changepoints_dist.pdf/png ** - NEW FILES - distribution of behavioral block durations captured by the PCs
    ├── aggregate_results/
    ├── config.yaml
    ├── moseq2-index.yaml
    ├── session_1/
    ├── ...
    └── session_n/
```
Running the train AR-HMM step will generate the folder specified by base_model_path if it doesn't already exist. Trained model(s) will be stored in that directory.
After training a single model (model.p), the directory structure will be as shown below.
```
.                                 ** current working directory
└── <base_dir>/
    ├── _pca/
    ├── aggregate_results/
    ├── <base_model_path>/
    │   └── model.p               ** - NEW FILE -
    ├── moseq2-index.yaml
    ├── config.yaml
    ├── session_1/
    ├── ...
    └── session_n/
```
After training multiple models (e.g., 3 models: model1.p, model2.p, and model3.p), the directory structure will be as shown below.
```
.                                 ** current working directory
└── <base_dir>/
    ├── _pca/
    ├── aggregate_results/
    ├── <base_model_path>/
    │   ├── model1.p              ** - NEW FILE -
    │   ├── model2.p              ** - NEW FILE -
    │   └── model3.p              ** - NEW FILE -
    ├── moseq2-index.yaml
    ├── config.yaml
    ├── session_1/
    ├── ...
    └── session_n/
```
The Setup Directory Structure for Analyzing Model(s) cell in the MoSeq2-Analysis-Visualization-Notebook detects all the models, creates a model-specific folder for each one, and copies each model into its folder. For example, if there are three models in base_model_path (e.g., model1.p, model2.p, and model3.p), the directory structure will be as shown below.
```
.                                 ** current working directory
└── <base_dir>/
    ├── _pca/
    ├── aggregate_results/
    ├── <base_model_path>/
    │   ├── model1.p
    │   ├── model2.p
    │   ├── model3.p
    │   ├── model1/               ** - NEW FOLDER -
    │   │   └── model1.p
    │   ├── model2/               ** - NEW FOLDER -
    │   │   └── model2.p
    │   └── model3/               ** - NEW FOLDER -
    │       └── model3.p
    ├── moseq2-index.yaml
    ├── config.yaml
    ├── session_1/
    ├── ...
    └── session_n/
```
After running the syllable labeler tool, syllable crowd movies for the model of interest will be generated in its model-specific folder. In this example, there are three models in base_model_path (model1.p, model2.p, and model3.p), and model1.p is the model of interest.
```
.                                 ** current working directory
└── <base_dir>/
    ├── _pca/
    ├── aggregate_results/
    ├── <base_model_path>/
    │   ├── model1.p
    │   ├── model2.p
    │   ├── model3.p
    │   ├── model1/
    │   │   ├── model1.p
    │   │   ├── syll_info.yaml    ** - NEW FILE - yaml file that stores the syllable names and descriptions
    │   │   └── crowd_movies/     ** - NEW FOLDER - syllable crowd movies
    │   │       ├── syll1.mp4
    │   │       ├── ...
    │   │       └── sylln.mp4
    │   ├── model2/
    │   │   └── model2.p
    │   └── model3/
    │       └── model3.p
    ├── ...
```
Generally, these files store metadata, configuration parameters, or file paths.
This notebook generates a progress.yaml file that stores the file paths to data generated from this notebook including extraction data files, PC scores from the extractions, and model results.
This file is project-specific. Each time you start a new project with MoSeq, a new progress.yaml file is generated.
Below we show the contents of the progress.yaml file and what each field is for.
| name | description |
|---|---|
| base_dir | path to the data directory with all depth recordings |
| config_file | path to the config.yaml file |
| index_file | path to the moseq2-index.yaml file |
| train_data_dir | path to the aggregate_results folder |
| pca_dirname | path to the folder containing PCA results |
| scores_filename | file name containing PCA scores |
| scores_path | path to the PCA scores |
| changepoints_path | file name or path to the file containing model-free changepoints |
| base_model_path | folder where all models are saved |
| model_session_path | path to the one model you have selected for analysis, and used in the analysis pipeline |
| crowd_dir | path to the folder that saves crowd movies. This folder should be a subdirectory of the model_path |
| plot_path | folder where most plots are saved, excluding the plots generated during the PCA step |
| session_config | path to the session_config.yaml file (see below for description) |
| syll_info | path to the syll_info.yaml file (see below for description) |
| df_info_path | path to syllable statistics dataframe. Contains the same information as the mean_df but is saved in a different location and format. |
If your notebook kernel is shut down, you can load the progress file to 'restore' your progress. The progress file may not correctly track MoSeq pipeline operations that were executed outside this notebook (for example, if you were to run PCA using the command line interface). If necessary, you can manually modify the paths in the progress file or the corresponding progress_paths dictionary to access the output of these external operations.
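At its core, progress.yaml is a flat mapping of names to paths. A stdlib-only sketch of loading it into a progress_paths-style dict follows; the notebook itself uses a proper YAML library, and this toy parser assumes one "key: value" pair per line with no nesting.

```python
# Toy sketch: load a flat progress.yaml into a progress_paths-style dict.
# ASSUMPTION: one "key: value" pair per line, no nesting or lists.
import os
import tempfile

def load_flat_yaml(path):
    """Parse 'key: value' lines into a dict, skipping blanks and comments."""
    paths = {}
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#") or ":" not in line:
                continue
            key, _, value = line.partition(":")
            paths[key.strip()] = value.strip()
    return paths

# Demo with a synthetic progress file:
tmp = os.path.join(tempfile.mkdtemp(), "progress.yaml")
with open(tmp, "w") as f:
    f.write("base_dir: ./\nconfig_file: ./config.yaml\nindex_file: ./moseq2-index.yaml\n")

progress_paths = load_flat_yaml(tmp)
print(progress_paths["config_file"])  # ./config.yaml
```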
We recommend running the notebooks from the folder where your data is located so the results are better organized. In that case, you can specify the base_dir like ./ (or the current folder).
If you run the MoSeq2 Extract Modeling Notebook to initialize or restore a progress.yaml file, a progress.yaml file will be generated if no such file exists in the base directory. When generating the progress.yaml file, the program scans the folder that stores all the depth recordings (the base directory) to determine the progress of the analysis pipeline. Otherwise, the program tries to find the progress.yaml file or the last saved checkpoint to determine the progress of the analysis pipeline.
- If there is a `progress.yaml` file in the directory, the information will be loaded into the `progress_paths` dictionary. The `check_progress` function will print progress bars for each pipeline step in the notebook. The extraction progress bar indicates the total number of extracted sessions detected in the provided `base_dir` path, and it prints the names of the sessions that haven't been extracted. Note: this progress does not reflect the contents of the aggregate_results/ folder.
- The remaining progress bars are derived from reading the paths in the `progress_paths` dictionary; each bar fills up if its path is found.
The notebook generates a config.yaml that holds all configurable parameters for all steps in the MoSeq pipeline, such as extraction parameters and PCA parameters.
The file is initialized with default values we found to work best for the common C57BL/6J mouse strain.
The config.yaml file holds settings for each MoSeq package in clearly demarcated sections.
Parameters will be added to this file as you progress through the notebook. The config file can be used to run an identical pipeline in future analyses.
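As an illustration only, the moseq2-extract section might contain entries like the following. The parameter names are taken from the extraction parameters recorded in the results h5 files (listed at the end of this page); the values here are placeholders, not the shipped defaults.

```yaml
# Illustrative fragment only - not the full config.yaml shipped by the notebook.
# moseq2-extract parameters
crop_size: [80, 80]              # size of the cropped mouse image, in pixels
min_height: 10                   # minimum mouse height above the floor, in mm
max_height: 100                  # maximum mouse height above the floor, in mm
fps: 30                          # frames per second of the recording
flip_classifier: flip_classifier_k2_c57_10to13weeks.pkl
bg_roi_depth_range: [650, 750]   # depth range for arena floor detection, in mm
```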
The notebook generates a session_config.yaml that holds the configurable extraction parameters for each session.
Each session entry contains the same parameters as the moseq2-extract section of the example config file.
During initialization, the depth of the bucket in each session is detected and the values are stored in session_config.yaml.
If you use the Interactive ROI Detection Tool to configure parameters for specific sessions, the new parameters are stored in session_config.yaml.
The file structure looks like the following:
```yaml
session1_name:
  moseq2_extract_parameter1: value1
  moseq2_extract_parameter2: value2
session2_name:
  moseq2_extract_parameter1: value1
  moseq2_extract_parameter2: value2
```
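Conceptually, each session's entry overrides the global config.yaml defaults for that session. A toy sketch of that merge using plain dicts (illustration only; the real merge logic lives in the moseq2-app/moseq2-extract code):

```python
# Toy sketch: per-session parameters from session_config.yaml override the
# global defaults from config.yaml. Dict-merge semantics assumed for illustration.
def params_for(session_name, defaults, session_config):
    """Return the global defaults with any session-specific overrides applied."""
    merged = dict(defaults)
    merged.update(session_config.get(session_name, {}))
    return merged

defaults = {"min_height": 10, "max_height": 100}         # from config.yaml
session_config = {"session1_name": {"max_height": 120}}  # from session_config.yaml

print(params_for("session1_name", defaults, session_config))
print(params_for("session2_name", defaults, session_config))
```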
During the aggregating results step, the proc/ subfolders generated by extraction are copied to a single aggregate_results/ folder. The notebook generates a moseq2-index.yaml from the metadata for all extracted sessions. The aggregate_results/ folder contains all the data you need to run the rest of the pipeline. The PCA and modeling step will use data in this folder.
Important Note: The index file contains UUIDs to map each session to a specific extraction. If you want to re-extract session(s), delete the existing moseq2-index.yaml file and re-aggregate the extracted results to keep the moseq2-index.yaml updated. Not doing so may cause KeyErrors in the PCA and modeling steps.
The syllable labeler widget in the MoSeq2 Analysis Visualization Notebook generates a syll_info.yaml that saves the syllable names and descriptions.
The contents of the file look like the following:
```yaml
0:
  label: walk
  desc: ''
  crowd_movie_path: /data/saline-amphetamine/model/crowd_movies/syllable_sorted-id-00_(usage)_original-id-64.mp4
  sorted_id: 0
  sort_type: usage
  original_id: 64
1:
  label: ''
  desc: ''
  crowd_movie_path: /data/saline-amphetamine/model/crowd_movies/syllable_sorted-id-01_(usage)_original-id-75.mp4
  sorted_id: 1
  sort_type: usage
  original_id: 75
```
The top-level number denotes the syllable label, and the indented data contain the syllable description and crowd movie links.
If you ever need to access the extracted raw data, you can look into the results_00.h5 file directly.
Each hdf5 file produced by the MoSeq extractions contains the following data structure:
```
/
- /frames
- /frames_mask
- /timestamps
/metadata
/metadata/acquisition
- /metadata/acquisition/ColorDataType
- /metadata/acquisition/ColorResolution
- /metadata/acquisition/DepthDataType
- /metadata/acquisition/DepthResolution
- /metadata/acquisition/IsLittleEndian
- /metadata/acquisition/NidaqChannels
- /metadata/acquisition/NidaqSamplingRate
- /metadata/acquisition/SessionName
- /metadata/acquisition/StartTime
- /metadata/acquisition/SubjectName
/metadata/extraction
- /metadata/extraction/background
- /metadata/extraction/extract_version
- /metadata/extraction/first_frame
- /metadata/extraction/first_frame_idx
- /metadata/extraction/flips
- /metadata/extraction/last_frame_idx
/metadata/extraction/parameters
- /metadata/extraction/parameters/angle_hampel_sig
- /metadata/extraction/parameters/angle_hampel_span
- /metadata/extraction/parameters/bg_roi_depth_range
- /metadata/extraction/parameters/bg_roi_dilate
- /metadata/extraction/parameters/bg_roi_erode
- /metadata/extraction/parameters/bg_roi_fill_holes
- /metadata/extraction/parameters/bg_roi_gradient_filter
- /metadata/extraction/parameters/bg_roi_gradient_kernel
- /metadata/extraction/parameters/bg_roi_gradient_threshold
- /metadata/extraction/parameters/bg_roi_index
- /metadata/extraction/parameters/bg_roi_shape
- /metadata/extraction/parameters/bg_roi_weights
- /metadata/extraction/parameters/bg_sort_roi_by_position
- /metadata/extraction/parameters/bg_sort_roi_by_position_max_rois
- /metadata/extraction/parameters/cable_filter_iters
- /metadata/extraction/parameters/cable_filter_shape
- /metadata/extraction/parameters/cable_filter_size
- /metadata/extraction/parameters/camera_type
- /metadata/extraction/parameters/centroid_hampel_sig
- /metadata/extraction/parameters/centroid_hampel_span
- /metadata/extraction/parameters/chunk_overlap
- /metadata/extraction/parameters/chunk_size
- /metadata/extraction/parameters/cluster_type
- /metadata/extraction/parameters/compress
- /metadata/extraction/parameters/compress_chunk_size
- /metadata/extraction/parameters/compress_threads
- /metadata/extraction/parameters/compute_raw_scalars
- /metadata/extraction/parameters/config_file
- /metadata/extraction/parameters/crop_size
- /metadata/extraction/parameters/delete
- /metadata/extraction/parameters/detected_true_depth
- /metadata/extraction/parameters/dilate_iterations
- /metadata/extraction/parameters/erode_iterations
- /metadata/extraction/parameters/flip_classifier
- /metadata/extraction/parameters/flip_classifier_smoothing
- /metadata/extraction/parameters/fps
- /metadata/extraction/parameters/frame_dtype
- /metadata/extraction/parameters/frame_trim
- /metadata/extraction/parameters/graduate_walls
- /metadata/extraction/parameters/manual_set_depth_range
- /metadata/extraction/parameters/mapping
- /metadata/extraction/parameters/max_height
- /metadata/extraction/parameters/min_height
- /metadata/extraction/parameters/model_smoothing_clips
- /metadata/extraction/parameters/movie_dtype
- /metadata/extraction/parameters/noise_tolerance
- /metadata/extraction/parameters/num_frames
- /metadata/extraction/parameters/output_dir
- /metadata/extraction/parameters/output_file
- /metadata/extraction/parameters/pixel_format
- /metadata/extraction/parameters/progress_bar
- /metadata/extraction/parameters/recompute_bg
- /metadata/extraction/parameters/skip_completed
- /metadata/extraction/parameters/spatial_filter_size
- /metadata/extraction/parameters/tail_filter_iters
- /metadata/extraction/parameters/tail_filter_shape
- /metadata/extraction/parameters/tail_filter_size
- /metadata/extraction/parameters/temporal_filter_size
- /metadata/extraction/parameters/threads
- /metadata/extraction/parameters/tracking_model_init
- /metadata/extraction/parameters/tracking_model_ll_clip
- /metadata/extraction/parameters/tracking_model_ll_threshold
- /metadata/extraction/parameters/tracking_model_mask_threshold
- /metadata/extraction/parameters/tracking_model_segment
- /metadata/extraction/parameters/use_cc
- /metadata/extraction/parameters/use_plane_bground
- /metadata/extraction/parameters/use_tracking_model
- /metadata/extraction/parameters/widen_radius
- /metadata/extraction/parameters/write_movie
- /metadata/extraction/roi
- /metadata/extraction/true_depth
- /metadata/uuid
/scalars
- /scalars/angle
- /scalars/area_mm
- /scalars/area_px
- /scalars/centroid_x_mm
- /scalars/centroid_x_px
- /scalars/centroid_y_mm
- /scalars/centroid_y_px
- /scalars/height_ave_mm
- /scalars/length_mm
- /scalars/length_px
- /scalars/velocity_2d_mm
- /scalars/velocity_2d_px
- /scalars/velocity_3d_mm
- /scalars/velocity_3d_px
- /scalars/velocity_theta
- /scalars/width_mm
- /scalars/width_px
```
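To poke at these datasets directly, h5py works well. The sketch below builds a tiny synthetic file with a few of the paths above so it runs standalone (the frame shape and scalar values are made up); in practice, point the path at a real results_00.h5. It assumes h5py and numpy are installed.

```python
# Sketch of inspecting an extraction h5 file with h5py (ASSUMED installed).
# A tiny synthetic file mirroring a few paths from the structure above is
# created here so the snippet is self-contained.
import os
import tempfile
import h5py
import numpy as np

path = os.path.join(tempfile.mkdtemp(), "results_00.h5")

# Build a synthetic stand-in for a real results_00.h5:
with h5py.File(path, "w") as f:
    f["frames"] = np.zeros((5, 80, 80), dtype="uint8")           # cropped depth frames
    f["scalars/velocity_2d_mm"] = np.arange(5, dtype="float64")  # one value per frame
    f["metadata/acquisition/SubjectName"] = "mouse01"

# Read it back the same way you would read a real extraction file:
with h5py.File(path, "r") as f:
    n_frames = f["frames"].shape[0]
    velocity = f["scalars/velocity_2d_mm"][()]
    subject = f["metadata/acquisition/SubjectName"][()].decode()

print(n_frames, velocity.mean(), subject)  # 5 2.0 mouse01
```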