grievous realign: Usage and Arguments Explained - jvtalwar/GRIEVOUS GitHub Wiki
grievous realign --file GENOMIC_FILE --assembly GENOME_ASSEMBLY [--database DATABASE_ALIAS --file_type {pvar, ssf}
--write_path DIRECTORY_TO_WRITE_FILES --mapping MAPPING_FILE --comment_characters "##" ! --verbose --return_all
--shutoff_db_update]
- The `--file`, `-f` flag is the first required parameter when running `grievous realign`. Use it to specify the chromosome-level file for your dataset of interest.
- `--assembly`, `-a` is the second required parameter for `grievous realign`. Use it to specify the current dataset's genome assembly (e.g., hg19).
  - If you are using an existing `grievous` database for realignment, this flag ensures that the specified dataset assembly matches the genome assembly of the `grievous` database.
  - If you are creating a new `grievous` database, this flag builds a new empty database with the specified genome assembly, ensuring consistency with your dataset.
- `--database`, `-d` is the first optional flag, but it is essential to the correctness of your `grievous realign` run. `--database` takes a user-specified string and first checks whether a GRIEVOUS database with that name (or alias) already exists. If so, that database is used. If not, GRIEVOUS creates a new empty database (recall that databases are initialized empty) and then proceeds with realignment. For our purposes, we want all our datasets oriented and aligned with one another (you can think of this as one "project"), so all datasets should use the same `--database`.
  - If only one GRIEVOUS database exists and you do not provide `--database DATABASE_ALIAS`, GRIEVOUS defaults to that single database. In all other cases, you must provide `--database DATABASE_ALIAS`.
    - If you ever forget what you named your GRIEVOUS databases, `grievous list_dbs` will be helpful.
    - To identify which datasets have been aligned to which GRIEVOUS database, `grievous records [--database DATABASE_ALIAS --chr CHR_SUBSET --out OUT_PATH]` is invaluable (specific flag descriptions can be found with `grievous records --help`).
  - NOTE: GRIEVOUS allows you to create multiple databases with `grievous realign`. This gives you the organizational and parallelization advantage (if desired) of having a separate database for each project (or genome assembly) whose datasets you are trying to homogenize.
- `--file_type`, `-t` corresponds to the type of genomic file you are attempting to align. Valid inputs are `pvar` or `ssf`. In the absence of this flag, `grievous realign` will attempt to resolve your file type automatically.
- `--write_path`, `-w` corresponds to the directory in which to write the results of `grievous realign`. In its absence, GRIEVOUS creates a new folder named GrievousAlignedFiles in the directory of your `--file` input. It is critical that all chromosome-level files be written to the same `--write_path`, as `grievous merge` merges chromosome-level reports and GRIEVOUS_Formatted_Files into dataset-level files from a single path.
  - If your chromosome-level files are each in their own directory and `--write_path` is not given, this will create a new GrievousAlignedFiles directory for each file. In that case you will need to gather all `grievous realign` results in one location before calling `grievous merge`.
- `--mapping`, `-m` allows you to run `grievous realign` without converting all files' column names to the GRIEVOUS standard. We will highlight its usage during tutorial realignment.
- The `--comment_characters`, `-c` flag allows the user to define the characters that mark header lines in your genomic files. For example, many pvar files start with information about the sequencing and/or imputation methodology and descriptions of the INFO field (if present). This information is separate from the data of interest in the file and is often indicated with a set of leading comment characters. The user can pass any number of space-delimited comment characters after `--comment_characters`, and GRIEVOUS will ignore all leading lines that begin with them.
- The `--verbose`, `-v` flag provides a granular log of each step performed by `grievous realign`.
- The `--return_all`, `-r` flag can be passed to `grievous realign` to return all columns of the original input file in the realign output files, specifically the GRIEVOUS_Formatted files. By default only the GRIEVOUS dataset standard subset is returned, to prevent confusion when columns such as INFO contain variant-orientation-specific information, such as MAF. If you use `--return_all` with `grievous realign`, please be aware that the information in these columns is not corrected by GRIEVOUS and will require user correction afterward.
- The final flag is `--shutoff_db_update`, `-s`. If you are at all unsure whether to use this flag, then DON'T USE IT! `--shutoff_db_update` prevents `grievous realign` from updating the database with the novel biallelic SNPs identified in the dataset. It provides a slight speed boost to `grievous realign`, but should only be used for the final dataset to be realigned (that is, after all other datasets have been realigned). If there is any doubt about whether a dataset is the final one, or about acquiring and realigning future datasets, then DO NOT use this flag. Overall, we recommend not using this flag, to guarantee correctness for any future dataset realignments.
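Before choosing values for `--comment_characters`, it can help to check how many leading lines a given set of prefixes would cover. The sketch below is not part of GRIEVOUS: `count_leading_comments` is a hypothetical helper, and it assumes that "leading lines" means the consecutive lines at the top of the file that begin with one of the given prefixes. How GRIEVOUS itself matches these lines may differ.

```shell
# Hypothetical helper (not part of GRIEVOUS): count the consecutive leading
# lines of a file that start with any of the given comment prefixes.
# Usage: count_leading_comments FILE PREFIX [PREFIX ...]
count_leading_comments() {
  local file="$1"
  shift
  # "$*" joins the remaining arguments with spaces; awk splits them again.
  awk -v prefixes="$*" '
    BEGIN { n = split(prefixes, p, " ") }
    {
      matched = 0
      for (i = 1; i <= n; i++) {
        # index() == 1 means the line begins with this prefix (literal match).
        if (index($0, p[i]) == 1) { matched = 1; break }
      }
      if (!matched) exit   # stop at the first line that is not a comment
      count++
    }
    END { print count + 0 }
  ' "$file"
}
```

For instance, `count_leading_comments chr1.pvar "##" "!"` (with `chr1.pvar` as a placeholder file name) prints the number of top-of-file lines beginning with `##` or `!`. Under this prefix rule, passing `#` alone would also match a column header line that begins with `#`, so it is worth checking the count before settling on a set of characters.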
A comprehensive description of each flag is also available on the command line with `grievous realign --help`.
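Because every chromosome-level file should share the same `--database` and `--write_path`, it can be convenient to generate the commands in one place. The following is a minimal dry-run sketch that only prints the commands rather than running them. The function name `print_realign_cmds` and all argument values shown afterward are placeholders, not part of GRIEVOUS; only the flags documented above are used.

```shell
# Dry-run sketch: print (do not execute) one `grievous realign` command per
# chromosome-level file, all sharing the same --database and --write_path.
# Paths are printed unquoted, so this assumes file names without spaces.
# Usage: print_realign_cmds ASSEMBLY DATABASE_ALIAS WRITE_PATH FILE [FILE ...]
print_realign_cmds() {
  local assembly="$1"
  local database="$2"
  local write_path="$3"
  shift 3
  for file in "$@"; do
    printf 'grievous realign --file %s --assembly %s --database %s --write_path %s\n' \
      "$file" "$assembly" "$database" "$write_path"
  done
}
```

Calling `print_realign_cmds hg19 my_project aligned chr1.pvar chr2.pvar` prints two commands that differ only in `--file`. Reviewing the printed commands first, and running them only once they look right, is one way to make sure no file is accidentally written to a different directory or aligned against a different database.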