grievous realign: Usage and Arguments Explained - jvtalwar/GRIEVOUS GitHub Wiki

grievous realign --file GENOMIC_FILE --assembly GENOME_ASSEMBLY [--database DATABASE_ALIAS --file_type {pvar, ssf} 
--write_path DIRECTORY_TO_WRITE_FILES --mapping MAPPING_FILE --comment_characters "##" ! --verbose --return_all
--shutoff_db_update]
  • The --file, -f flag is the first required parameter when running grievous realign. This flag should be used to specify the chromosomal-level file for your dataset of interest.
  • --assembly, -a is the second required parameter for grievous realign. This flag should be used to specify the current dataset's genome assembly (e.g., hg19).
    • If you are using an existing grievous database for realignment, this flag ensures that the specified dataset assembly matches the genome assembly of the grievous database.
    • If you are creating a new grievous database, this flag builds a new empty database with the specified genome assembly, ensuring consistency with your dataset.
  • --database, -d is the first optional flag, but it is essential to the correctness of your grievous realign run. --database takes a user-specified string and first checks whether a GRIEVOUS database with that name (or alias) already exists. If so, that database is used. If not, GRIEVOUS creates a new empty database (recall that databases are initialized empty) and then proceeds with realignment. For our purposes, we want all of our datasets oriented and aligned with one another (you can think of this as one "project"), so all datasets should use the same --database.
    • In the event only one GRIEVOUS database exists and you fail to provide a --database DATABASE_ALIAS, GRIEVOUS will default to utilizing your single database. In all other instances, the user is required to provide a --database DATABASE_ALIAS.
    • NOTE: GRIEVOUS allows you to create multiple databases with grievous realign. This provides you with the organizational and parallelization advantage (if desired) of having a separate database for each project (or genome assembly) for which you are trying to homogenize datasets.
  • --file_type, -t specifies the type of genomic file you are attempting to align. If passed, valid inputs are pvar or ssf. In the absence of this flag, grievous realign will attempt to automatically resolve the file type for you.
  • --write_path, -w specifies the directory to which the results of grievous realign are written. In its absence, GRIEVOUS will make a new folder named GrievousAlignedFiles in your --file input directory. It is critical that all chromosome-level files be written to the same --write_path, as grievous merge will look to merge chromosome-level reports and GRIEVOUS_Formatted_Files into dataset-level files from a single path.
    • If your chromosome-level files are all in their own unique directory and --write_path is not given, observe that this will create a new GrievousAlignedFiles directory for each file. In this case the user will be required to aggregate all grievous realign results to one location before calling grievous merge.
  • --mapping, -m allows you to run grievous realign without converting all files' column names to the GRIEVOUS standard. We will highlight its usage during tutorial realignment.
  • The --comment_characters, -c flag allows the user to define the characters that mark header lines in your genomic files. For example, many pvar files start with information about the sequencing and/or imputation methodology and descriptions of the INFO field (if present). This information is separate from the variant data in the file and is usually marked with a set of leading comment characters. The user can pass any number of space-delimited comment characters after --comment_characters, and GRIEVOUS will ignore all leading lines that begin with them.
  • The --verbose, -v flag provides a granular-level log of each step employed by grievous realign.
  • The --return_all, -r flag can be passed to grievous realign to return all columns of the original input file in the realign output files, specifically the GRIEVOUS_Formatted files. By default, only the GRIEVOUS dataset standard subset of columns is returned. This prevents confusion when columns such as INFO contain variant-orientation-specific information, such as MAF. If you use --return_all with grievous realign, please be aware that the information in these extra columns is not corrected by GRIEVOUS and will require user correction after the fact.
  • The final flag is --shutoff_db_update, -s. If you have any doubt about whether to use this flag, then DON'T USE IT! --shutoff_db_update prevents grievous realign from updating the database with the novel biallelic SNPs identified in the dataset. It provides a slight speed boost to grievous realign, but should only be used for the final dataset to be realigned (that is, after all other datasets have been realigned). If there is any uncertainty about whether a dataset is the final one, or about whether future datasets will be acquired and realigned, then DO NOT use this flag. Overall, we recommend not using this flag, to guarantee correctness for any future grievous realign runs.
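
As an illustration of the --database and --write_path advice above, the following is a minimal sketch of realigning every chromosome-level file of one dataset against the same database and into the same output directory. The input directory data/cohortA, the alias my_project, and the output directory results/cohortA are hypothetical names chosen for this example, and the DRY_RUN switch is a convenience of the sketch, not a GRIEVOUS feature. Only flags described above are used.

```shell
# Hypothetical settings for this sketch; adjust to your own dataset.
ASSEMBLY="hg19"
DATABASE="my_project"           # same alias for every dataset in the project
WRITE_PATH="results/cohortA"    # same directory for every chromosome-level file
DRY_RUN="${DRY_RUN:-1}"         # 1 = only print the commands; 0 = actually run them

realign_one() {
    # $1 = one chromosome-level file
    set -- grievous realign --file "$1" --assembly "$ASSEMBLY" \
        --database "$DATABASE" --write_path "$WRITE_PATH"
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

for chrom_file in data/cohortA/chr*.pvar; do
    [ -e "$chrom_file" ] || continue   # skip if the glob matched nothing
    realign_one "$chrom_file"
done
```

Because every call shares one --write_path, the outputs should already sit in a single directory when it is time to run grievous merge. With the default DRY_RUN=1 the loop only prints the commands, which makes it easy to check them before running with DRY_RUN=0.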

A comprehensive description of each flag can be found on your command-line as well with grievous realign --help.
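
One practical note on --comment_characters: in most shells, a # at the start of an unquoted word begins a comment, so comment characters such as ## should be quoted when passed on the command line. The sketch below uses a small helper, count_args, which is defined here only for illustration and is not part of GRIEVOUS, to show how many arguments the shell actually passes along in each case.

```shell
# Illustration-only helper: prints how many arguments it received.
count_args() {
    echo "$#"
}

# Quoted: the flag and both comment characters arrive as three arguments.
count_args --comment_characters '##' '!'

# Unquoted: everything from ## onward is treated as a shell comment,
# so only the flag itself is passed (one argument).
count_args --comment_characters ## !

# A quoted invocation would therefore look like this (chr1.pvar is a
# hypothetical file name):
# grievous realign --file chr1.pvar --assembly hg19 --comment_characters '##' '!'
```

Single quotes are used here so that the shell passes each character through unchanged.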