FRB Pipeline Requirements - federatedcloud/FRB_pipeline GitHub Wiki
Definitions
- Fast Radio Bursts (FRBs) - rare astrophysical phenomena that occur as transient high energy pulses or bursts in radio astronomy data
- Radio Frequency Interference (RFI) - radio frequency energy generated by non-astrophysical sources
- Modulation Index - normalized standard deviation of intensity across frequency
- Dynamic Spectrum - intensity as a function of time and frequency
- psrfits - a standard data storage format for pulsar data files based on the FITS standard. Both mean pulse profile and streamed multi-channel full-polarisation data are supported.
- PRESTO (PulsaR Exploration and Search TOolkit) - a software suite developed by Scott Ransom for pulsar search and analysis
- PALFA2 pipeline - The PALFA project searches for radio pulsars using the Arecibo Observatory and a pipeline maintained by astronomers including Robert Wharton and Laura Spitler
- rsw_pipeline - Robert Wharton's pipeline, adapted from the PALFA2 pipeline and streamlined to search for single pulses
- spitler_pipeline - a complete reproduction and automation of the PALFA2 pipeline as adapted by Laura Spitler, built on the condensed rsw_pipeline
- Production - the code in full operational mode. Requirements specified for production runs need not be met during prototyping.
Background
The detection of Fast Radio Bursts (FRBs) began in 2007, when Duncan Lorimer and his student David Narkevic discovered a transient high-energy burst in archival data from a pulsar survey; that first detection is often called the Lorimer Burst. It has since been discovered that some of these phenomena repeat, though their origin remains unknown. FRBs are a current area of interest to radio astronomers for continued discovery and study.
Laura Spitler discovered another FRB in Arecibo pulsar survey data during her time as a PhD student at Cornell University working with Jim Cordes and Shami Chatterjee. She developed a modification to the PALFA2 pipeline to search for candidate single pulses within the data and calculate the modulation index to filter out candidates from RFI sources. Her pipeline is now an established method to search for FRBs, and more methods continue to be explored.
Purpose
A new software pipeline will be written to assist researchers in searching for single pulse candidates amongst pulsar survey data and filtering those candidates based on their desired criteria. Previous methods used in the PALFA2 pipeline and the methods that Laura Spitler developed will be made available as part of the pipeline. The entire process of the pipeline will be configurable by the user for each search. Existing algorithms and user-designed algorithms will be interchangeable, and the order of process steps will be defined by the user. Furthermore, algorithms will be able to be tuned and adjusted to ideal search criteria by the user. These features will enable continued searches for FRBs, as well as expand the possible types of searches that can be done.
System Overview
Aristotle Federated Cloud
This project is one scientific use case out of eight that are being utilized to demonstrate the power of cloud computing for scientific research as part of the Aristotle Federated Cloud project. This is a joint effort between Cornell University CAC, the University at Buffalo CCR, and the University of California Santa Barbara Department of Computer Science. An instance of the pipeline will be hosted on Aristotle for scientists to use on existing data to search for FRBs.
Container
The software for this pipeline, including any necessary dependencies, will be available as a single Docker container via Docker Hub.
Pipeline Software
Every run: Set-up
- Pipeline reads a configuration file (config) provided by user. The config file specifies which operations to carry out on the data, in the form of Python module names and, optionally, specified input parameter values
- Set up directories based on config for candidate search
- Pipeline code executes the specified processing modules, maintaining state and passing output from modules to other modules if necessary
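The requirements leave the configuration format open; as one illustration only, an INI-style file (a hypothetical format, parsed here with the standard-library `configparser`) could name each processing module as a section, in run order, with optional parameter overrides as keys:

```python
import configparser

# Hypothetical config: each [section] names a processing module to run,
# in order; keys inside a section override that module's default parameters.
EXAMPLE_CONFIG = """
[rfifind]
time = 2.0

[single_pulse_search]
threshold = 5.0
"""

config = configparser.ConfigParser()
config.read_string(EXAMPLE_CONFIG)

# The section order doubles as the execution order of the pipeline methods.
run_order = config.sections()
params = {name: dict(config[name]) for name in run_order}
```

With this layout, the set-up step reduces to importing each named module and calling it in section order.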
Tasks which may be performed
- Data pre-processing
- Might include smoothing, decimation, etc
- Find and Remove RFI (optional)
- Pipeline detects interference and contaminated frequency channels
- Creates a mask or other list that can be used to discard candidates that are RFI
- Find Candidates
- Pipeline uses user-provided rules and specs to search for signals potentially caused by astrophysical sources
- Score Candidates
- Pipeline inspects and classifies each of the found signal candidates
- Classification may include, e.g., calculation of the modulation index for each signal candidate
- Display Candidates
- Pipeline produces, e.g., dynamic spectra or other plots for user inspection
Specific Requirements
External Interface
- Users shall run the pipeline from the command line by connecting via SSH to the VM or Docker container hosting the code
- Production runs shall take as input a list of raw data files to process
Availability
- An instance of the pipeline will be hosted on Aristotle for scientists to use on new and existing data to search for FRBs
- All software written for the pipeline will be free and open source, available on GitHub (BSD-3 license preferred)
- The container will be available on Docker Hub
Portability
- Pipeline software will be containerized to increase portability for use by members of PALFA and the astronomical community
- This project is extensible to operation on data from other current telescopes, such as the Green Bank Telescope, the Jansky Very Large Array, the Parkes Observatory and others, in addition to the next generation of radio telescopes such as MeerKAT and the Square Kilometre Array
Security
- The software implements no security measures of its own, and relies on the security of the container/VM host
- Data access is based on user permissions
Database
- The tracking database shall record all production runs, including the parameters, configuration and the input file list
- The tracking database shall record either the location or the payload of all output data products
Performance
- In production, ideal processing of a typical single 268s beam of the PALFA dataset will take 1-2 core-hours, subject to the performance of the user-defined methods
- Pipeline overhead must be no more than 5% of total run time
- User-defined method performance is up to the user
Reliability
- Subsequent runs with the same configuration shall produce the same results as the prior run on the same data.
Maintainability
- Python code must be written to follow PEP8 standards
Functional Requirements
Set-up
- Must be written in Python 3.6
- Must be compatible with a recent version of PRESTO
- Must operate on various types of data:
- Standard astrophysical data (including psrfits)
- FITS files that need to be combined (such as Arecibo data)
- FITS files that have been combined
- Must be modular
- Each method must be able to be optionally turned on or off
- Each method must have self-contained functionality
- Users must be able to create and add new methods/modules
- Must be configurable via a configuration file
- Configuration file must detail which methods are run and in which order
- Configuration file must allow parameters used in methods/algorithms to be adjusted by user
- Pipeline must read configuration file
- Pipeline must load necessary modules based on configuration file
- Pipeline must run methods in order specified in configuration file
- Pipeline must accommodate a variety of parameter inputs in configuration file
- Must pass inputs and outputs of methods between the pipeline and other methods
- Each method must take in a dictionary as a parameter
- Each method must return a dictionary as an output
- Each method must write any data that will be necessary to other methods into the dictionary
- Each method must be able to obtain necessary input data from the dictionary
- Error checking
- Must check configuration file inputs for basic errors
- Must generate error logs for all methods/modules that were run
- Time summary
- Must measure the time profile of each method run
- Must optionally output time summary for each run
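The dictionary-passing and timing requirements above can be sketched together. The names used here (`state`, `time_summary`, `smooth_window`, `smooth_data`) are illustrative, not part of the specification:

```python
import time

def timed(method):
    """Record a method's wall-clock run time into the shared dictionary
    under a 'time_summary' key (key name is an assumption)."""
    def wrapper(state):
        start = time.perf_counter()
        state = method(state)
        state.setdefault("time_summary", {})[method.__name__] = (
            time.perf_counter() - start)
        return state
    return wrapper

@timed
def smooth_data(state):
    """Example module: takes the shared dict, reads its inputs from it,
    writes its output back, and returns the dict for the next method."""
    data = state["time_series"]
    window = state.get("smooth_window", 2)
    # Simple boxcar (moving-average) smoothing as a stand-in pre-processing step.
    state["time_series"] = [
        sum(data[max(0, i - window + 1): i + 1])
        / len(data[max(0, i - window + 1): i + 1])
        for i in range(len(data))
    ]
    return state
```

A run is then just folding the dictionary through the configured methods in order: `for method in methods: state = method(state)`.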
Find and Remove RFI
- Must optionally use the `rfifind` method from PRESTO
- Must optionally generate a mask and/or masked dynamic spectra
- Must optionally employ user-defined RFI search and removal
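Since this step wraps a PRESTO tool, the module will likely shell out to the `rfifind` executable. A minimal sketch that builds (but does not run) the command line; the flag values shown are examples, with `-time` setting the integration time per statistics block and `-o` the output basename:

```python
def rfifind_command(infile, outbase, time_sec=2.0):
    """Build (but do not execute) a PRESTO rfifind command line.
    The resulting list can be passed to subprocess.run()."""
    return ["rfifind", "-time", str(time_sec), "-o", outbase, infile]
```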
Find Candidates
- Must optionally use the `prepsubband` and `single_pulse_search` methods from PRESTO
- Must optionally use the flood fill algorithm
- Must optionally use the friend-of-friends algorithm
- Must be extensible to other candidate search algorithms
- Must optionally employ user-defined candidate search(es)
- Must produce and store search results (including candidate metrics) in a candidate list
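The flood fill requirement refers to growing a cluster of connected above-threshold pixels in the time-frequency plane. A minimal 4-connected sketch over a boolean mask (the row/column layout is an assumption):

```python
def flood_fill(mask, start):
    """Collect the 4-connected island of True pixels containing `start`
    in a 2-D boolean mask (rows = time samples, cols = frequency channels)."""
    stack, island = [start], set()
    while stack:
        t, f = stack.pop()
        if (t, f) in island:
            continue
        if not (0 <= t < len(mask) and 0 <= f < len(mask[0])):
            continue
        if not mask[t][f]:
            continue
        island.add((t, f))
        stack.extend([(t + 1, f), (t - 1, f), (t, f + 1), (t, f - 1)])
    return island
```

Each island found this way becomes one candidate, with its extent in time and frequency available as candidate metrics.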
Classify/Score Candidates
- Must include one or more classifications which identify candidates as caused by RFI
- Must optionally perform the modulation index calculation
- Must be able to write masked dynamic spectra
- Must be able to combine `.singlepulse` files
- Must be extensible to other candidate scoring algorithms as specified in the config file
- Must reduce selected candidates list based on method-defined or user-defined criteria
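The modulation index follows the definition given earlier: the standard deviation of intensity across frequency channels, normalized by the mean intensity. A minimal sketch (using the population standard deviation; normalization conventions vary):

```python
import numpy as np

def modulation_index(spectrum):
    """Modulation index of a candidate's spectrum: standard deviation of
    intensity across frequency channels divided by the mean intensity."""
    spectrum = np.asarray(spectrum, dtype=float)
    return float(spectrum.std() / spectrum.mean())
```

A broadband astrophysical pulse spreads power across channels and yields a low value, while narrowband RFI concentrates power in few channels and yields a high value, which is why thresholding on this statistic helps discard RFI candidates.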
Display Candidates
- Must optionally generate output for user inspection of candidates:
- Candidate metrics
- Candidate score (modulation index, etc.)
- Dynamic Spectra
- Plots
Logging
- An activity log for a given run of the code shall be recorded
- The activity log shall minimally include all the parameters needed to re-run the analysis, and a full list of all data products produced by that run
- The activity log shall be uploaded to a tracking database
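The activity log could be one structured record per run; a sketch (the schema and key names are assumptions, not specified by the requirements) capturing the re-run parameters and the product list:

```python
import json
import time

def make_log_entry(config_params, input_files, data_products):
    """Assemble one activity-log entry containing everything needed
    to re-run the analysis plus the products the run produced."""
    return {
        "timestamp": time.strftime("%Y-%m-%dT%H:%M:%S"),
        "parameters": config_params,
        "input_files": input_files,
        "data_products": data_products,
    }

def write_activity_log(path, entry):
    """Serialize the entry as JSON, ready for upload to the tracking database."""
    with open(path, "w") as f:
        json.dump(entry, f, indent=2)
```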
Clean-Up
- All intermediary data products, defined as all data products not specified as output, shall be removed from disk at the end of a production run of the code