FRB Pipeline Requirements - federatedcloud/FRB_pipeline GitHub Wiki
Definitions
- Fast Radio Bursts (FRBs) - rare astrophysical phenomena that occur as transient high energy pulses or bursts in radio astronomy data
- Radio Frequency Interference (RFI) - radio frequency energy generated by non-astrophysical sources
- Modulation Index - normalized standard deviation of intensity across frequency
- Dynamic Spectrum - intensity as a function of time and frequency
- psrfits - a standard data storage format for pulsar data files based on the FITS standard. Both mean pulse profile and streamed multi-channel full-polarisation data are supported.
- PRESTO (PulsaR Exploration and Search TOolkit) - a software suite developed by Scott Ransom for pulsar search and analysis
- PALFA2 pipeline - The PALFA project searches for radio pulsars using the Arecibo Observatory and a pipeline maintained by astronomers including Robert Wharton and Laura Spitler
- rsw_pipeline - Robert Wharton's pipeline, adapted from the PALFA2 pipeline and streamlined to search for single pulses
- spitler_pipeline - a complete reproduction and automation of the PALFA2 pipeline as adapted by Laura Spitler, built on the condensed rsw_pipeline
- Production - the code in full operational mode. Requirements specified for production runs need not be met during prototyping.
Background
The detection of Fast Radio Bursts (FRBs) began in 2007, when Duncan Lorimer and his student David Narkevic discovered a transient high-energy burst in archival data from a pulsar survey; that first detection is often called the Lorimer Burst. It has since been discovered that some of these phenomena repeat, though their origin remains unknown. FRBs are a current area of interest to radio astronomers for continued discovery and study.
Laura Spitler discovered another FRB in Arecibo pulsar survey data during her time as a PhD student at Cornell University working with Jim Cordes and Shami Chatterjee. She developed a modification to the PALFA2 pipeline to search for candidate single pulses within the data and calculate the modulation index to filter out candidates from RFI sources. Her pipeline is now an established method to search for FRBs, and more methods continue to be explored.
Purpose
A new software pipeline will be written to assist researchers in searching for single pulse candidates amongst pulsar survey data and filtering those candidates based on their desired criteria. Previous methods used in the PALFA2 pipeline and the methods that Laura Spitler developed will be made available as part of the pipeline. The entire process of the pipeline will be configurable by the user for each search. Existing algorithms and user-designed algorithms will be interchangeable, and the order of process steps will be defined by the user. Furthermore, algorithms will be able to be tuned and adjusted to ideal search criteria by the user. These features will enable continued searches for FRBs, as well as expand the possible types of searches that can be done.
System Overview
Aristotle Federated Cloud
This project is one scientific use case out of eight that are being utilized to demonstrate the power of cloud computing for scientific research as part of the Aristotle Federated Cloud project. This is a joint effort between Cornell University CAC, the University at Buffalo CCR, and the University of California Santa Barbara Department of Computer Science. An instance of the pipeline will be hosted on Aristotle for scientists to use on existing data to search for FRBs.
Container
The software for this pipeline, including any necessary dependencies, will be available as a single Docker container via Docker Hub.
Pipeline Software
Every run: Set-up
- Pipeline reads a configuration file (config) provided by user. The config file specifies which operations to carry out on the data, in the form of Python module names and, optionally, specified input parameter values
- Set up directories based on config for candidate search
- Pipeline code executes the specified processing modules, maintaining state and passing output from modules to other modules if necessary
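The requirements leave the configuration format open; as one illustration only, an INI-style file (a hypothetical format, parsed here with the standard-library `configparser`) could name each processing module as a section, in run order, with optional parameter overrides as keys:

```python
import configparser

# Hypothetical config: each [section] names a processing module to run,
# in order; keys inside a section override that module's default parameters.
EXAMPLE_CONFIG = """
[rfifind]
time = 2.0

[single_pulse_search]
threshold = 5.0
"""

config = configparser.ConfigParser()
config.read_string(EXAMPLE_CONFIG)

# The section order doubles as the execution order of the pipeline methods.
run_order = config.sections()
params = {name: dict(config[name]) for name in run_order}
```

With this layout, the set-up step reduces to importing each named module and calling it in section order.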
Tasks which may be performed
- Data pre-processing
- Might include smoothing, decimation, etc
- Find and Remove RFI (optional)
- Pipeline detects interference and contaminated frequency channels
- Creates a mask or other list that can be used to discard candidates that are RFI
- Find Candidates
- Pipeline uses user-provided rules and specs to search for signals potentially caused by astrophysical sources
- Score Candidates
- Pipeline inspects and classifies each of the found signal candidates
- Classification may include, e.g., calculation of the modulation index for each signal candidate
- Display Candidates
- Pipeline produces, e.g., dynamic spectra or other plots for user inspection
Specific Requirements
External Interface
- Users shall run the pipeline from the command line by connecting via SSH to the VM or Docker container hosting the code
- Production runs shall take as input a list of raw data files to process
Availability
- An instance of the pipeline will be hosted on Aristotle for scientists to use on new and existing data to search for FRBs
- All software written for the pipeline will be free and open source, available on GitHub (BSD-3 license preferred)
- The container will be available on Docker Hub
Portability
- Pipeline software will be containerized to increase portability for use by members of PALFA and the astronomical community
- This project is extensible to operation on data from other current telescopes, such as the Green Bank Telescope, the Jansky Very Large Array, the Parkes Observatory and others, in addition to the next generation of radio telescopes such as MeerKAT and the Square Kilometre Array
Security
- The software implements no security measures of its own, and relies on the security of the container/VM host
- Data access is based on user permissions
Database
- The tracking database shall record all production runs, including the parameters, configuration and the input file list
- The tracking database shall record either the location or the payload of all output data products
Performance
- In production, ideal processing of a typical single 268s beam of the PALFA dataset will take 1-2 core-hours, subject to the performance of the user-defined methods
- Pipeline overhead must be no more than 5% of total run time
- User-defined method performance is up to the user
Reliability
- Subsequent runs with the same configuration shall produce the same results as the prior run on the same data.
Maintainability
- Python code must be written to follow PEP8 standards
Functional Requirements
Set-up
- Must be written in Python 3.6
- Must be compatible with a recent version of PRESTO
- Must operate on various types of data:
- Standard astrophysical data (including psrfits)
- FITS files that need to be combined (such as Arecibo data)
- FITS files that have been combined
- Must be modular
- Each method must be able to be optionally turned on or off
- Each method must have self-contained functionality
- Users must be able to create and add new methods/modules
- Must be configurable via a configuration file
- Configuration file must detail which methods are run and in which order
- Configuration file must allow parameters used in methods/algorithms to be adjusted by user
- Pipeline must read configuration file
- Pipeline must load necessary modules based on configuration file
- Pipeline must run methods in order specified in configuration file
- Pipeline must accommodate a variety of parameter inputs in configuration file
- Must pass inputs and outputs of methods between the pipeline and other methods
- Each method must take in a dictionary as a parameter
- Each method must return a dictionary as an output
- Each method must write any data that will be necessary to other methods into the dictionary
- Each method must be able to obtain necessary input data from the dictionary
- Error checking
- Must check configuration file inputs for basic errors
- Must generate error logs for all methods/modules that were run
- Time summary
- Must measure the time profile of each method run
- Must optionally output time summary for each run
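The dictionary-passing and timing requirements above can be sketched together. The names used here (`state`, `time_summary`, `smooth_window`, `smooth_data`) are illustrative, not part of the specification:

```python
import time

def timed(method):
    """Record a method's wall-clock run time into the shared dictionary
    under a 'time_summary' key (key name is an assumption)."""
    def wrapper(state):
        start = time.perf_counter()
        state = method(state)
        state.setdefault("time_summary", {})[method.__name__] = (
            time.perf_counter() - start)
        return state
    return wrapper

@timed
def smooth_data(state):
    """Example module: takes the shared dict, reads its inputs from it,
    writes its output back, and returns the dict for the next method."""
    data = state["time_series"]
    window = state.get("smooth_window", 2)
    # Simple boxcar (moving-average) smoothing as a stand-in pre-processing step.
    state["time_series"] = [
        sum(data[max(0, i - window + 1): i + 1])
        / len(data[max(0, i - window + 1): i + 1])
        for i in range(len(data))
    ]
    return state
```

A run is then just folding the dictionary through the configured methods in order: `for method in methods: state = method(state)`.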
Find and Remove RFI
- Must optionally use the `rfifind` method from PRESTO
- Must optionally generate a mask and/or masked dynamic spectra
- Must optionally employ user-defined RFI search and removal
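Since this step wraps a PRESTO tool, the module will likely shell out to the `rfifind` executable. A minimal sketch that builds (but does not run) the command line; the flag values shown are examples, with `-time` setting the integration time per statistics block and `-o` the output basename:

```python
def rfifind_command(infile, outbase, time_sec=2.0):
    """Build (but do not execute) a PRESTO rfifind command line.
    The resulting list can be passed to subprocess.run()."""
    return ["rfifind", "-time", str(time_sec), "-o", outbase, infile]
```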
Find Candidates
- Must optionally use the `prepsubband` and `single_pulse_search` methods from PRESTO
- Must optionally use the flood fill algorithm
- Must optionally use the friend-of-friends algorithm
- Must be extensible to other candidate search algorithms
- Must optionally employ user-defined candidate search(es)
- Must produce and store search results (including candidate metrics) in a candidate list
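The flood fill requirement refers to growing a cluster of connected above-threshold pixels in the time-frequency plane. A minimal 4-connected sketch over a boolean mask (the row/column layout is an assumption):

```python
def flood_fill(mask, start):
    """Collect the 4-connected island of True pixels containing `start`
    in a 2-D boolean mask (rows = time samples, cols = frequency channels)."""
    stack, island = [start], set()
    while stack:
        t, f = stack.pop()
        if (t, f) in island:
            continue
        if not (0 <= t < len(mask) and 0 <= f < len(mask[0])):
            continue
        if not mask[t][f]:
            continue
        island.add((t, f))
        stack.extend([(t + 1, f), (t - 1, f), (t, f + 1), (t, f - 1)])
    return island
```

Each island found this way becomes one candidate, with its extent in time and frequency available as candidate metrics.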
Classify/Score Candidates
- Must include one or more classifications which identify candidates as caused by RFI
- Must optionally perform the modulation index calculation
- Must be able to write masked dynamic spectra
- Must be able to combine `.singlepulse` files
- Must be extensible to other candidate scoring algorithms as specified in the config file
- Must reduce selected candidates list based on method-defined or user-defined criteria
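The modulation index follows the definition given earlier: the standard deviation of intensity across frequency channels, normalized by the mean intensity. A minimal sketch (using the population standard deviation; normalization conventions vary):

```python
import numpy as np

def modulation_index(spectrum):
    """Modulation index of a candidate's spectrum: standard deviation of
    intensity across frequency channels divided by the mean intensity."""
    spectrum = np.asarray(spectrum, dtype=float)
    return float(spectrum.std() / spectrum.mean())
```

A broadband astrophysical pulse spreads power across channels and yields a low value, while narrowband RFI concentrates power in few channels and yields a high value, which is why thresholding on this statistic helps discard RFI candidates.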
Display Candidates
- Must optionally generate output for user inspection of candidates:
- Candidate metrics
- Candidate score (modulation index, etc.)
- Dynamic Spectra
- Plots
Logging
- An activity log for a given run of the code shall be recorded
- The activity log shall minimally include all the parameters needed to re-run the analysis, and a full list of all data products produced by that run
- The activity log shall be uploaded to a tracking database
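The activity log could be one structured record per run; a sketch (the schema and key names are assumptions, not specified by the requirements) capturing the re-run parameters and the product list:

```python
import json
import time

def make_log_entry(config_params, input_files, data_products):
    """Assemble one activity-log entry containing everything needed
    to re-run the analysis plus the products the run produced."""
    return {
        "timestamp": time.strftime("%Y-%m-%dT%H:%M:%S"),
        "parameters": config_params,
        "input_files": input_files,
        "data_products": data_products,
    }

def write_activity_log(path, entry):
    """Serialize the entry as JSON, ready for upload to the tracking database."""
    with open(path, "w") as f:
        json.dump(entry, f, indent=2)
```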
Clean-Up
- All intermediary data products, defined as all data products not specified as output, shall be removed from disk at the end of a production run of the code