0.2 Filter SRRs - labbces/SpliceScape GitHub Wiki

The Python script filter_metadata.py provides a user-friendly command-line interface to filter the database based on various technical and biological criteria. While it is always possible to query the SQLite database directly using SQL, this script simplifies the most common filtering operations. It dynamically constructs and executes a SQL query based on your command-line arguments, making it easy to narrow down thousands of samples to a relevant subset.
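
The idea of building the query dynamically can be sketched as follows. This is a minimal illustration, not the script's actual code: the function name `build_query` and the column names `run_accession`, `avg_read_length`, and `library_layout` are assumptions made for the example, while the `sra_metadata` table and the other column names come from this page. Values are passed as bound parameters; column names cannot be bound, so they are checked before being placed in the SQL text.

```python
import sqlite3

# Columns this page lists as valid for the non-empty filter (-f).
NON_EMPTY_COLUMNS = {
    "pmid", "species_cultivar", "species_genotype", "treatment",
    "dev_stage", "tissue", "age", "source_name",
}


def build_query(min_read_length=100, non_empty=(), layouts=(), exact=None):
    """Return (sql, params) for a filtered SELECT on sra_metadata."""
    # 'avg_read_length' is a hypothetical column name for the mean read length.
    clauses = ["avg_read_length >= ?"]
    params = [min_read_length]

    # Non-empty filter: the column must be neither NULL nor an empty string.
    for column in non_empty:
        if column not in NON_EMPTY_COLUMNS:
            raise ValueError(f"column not allowed: {column}")
        clauses.append(f"({column} IS NOT NULL AND {column} != '')")

    # Layout filter: PAIRED/SINGLE are matched as values, NULL as a missing value.
    # 'library_layout' is a hypothetical column name.
    if layouts:
        named = [layout for layout in layouts if layout != "NULL"]
        parts = []
        if named:
            placeholders = ", ".join("?" for _ in named)
            parts.append(f"library_layout IN ({placeholders})")
            params.extend(named)
        if "NULL" in layouts:
            parts.append("library_layout IS NULL")
        clauses.append("(" + " OR ".join(parts) + ")")

    # Exact filter: each column maps to a list of accepted values.
    for column, values in (exact or {}).items():
        if not column.isidentifier():
            raise ValueError(f"invalid column name: {column}")
        placeholders = ", ".join("?" for _ in values)
        clauses.append(f"{column} IN ({placeholders})")
        params.extend(values)

    # 'run_accession' is a hypothetical name for the column holding SRR IDs.
    sql = "SELECT run_accession FROM sra_metadata WHERE " + " AND ".join(clauses)
    return sql, params


sql, params = build_query(
    min_read_length=150,
    non_empty=["pmid", "tissue"],
    layouts=["PAIRED"],
    exact={"species_name": ["Arabidopsis thaliana"], "tissue": ["leaf"]},
)
print(sql)
print(params)
```

The resulting pair can be passed to `sqlite3.Connection.execute(sql, params)`.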

Inputs and Outputs:

Input:

SQLite Database: The .db file generated by get_metadata.py containing the sra_metadata table.

See the "0. Pre‐processing: SRR metadata DB construction" page for more information.

Outputs:

The script generates two types of output:

  • SRR List (Required): A plain text (.txt) file specified by --output_file. This file contains the final list of filtered SRA accessions, with one identifier per line, ready to be used as input for the main SpliceScape pipeline.
  • Filtered Database Table (Optional): If the --create_table flag is used, the script will create a new table named filtered_sra_metadata inside the input database. This table will contain all the metadata for the samples that passed the filters.
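
A rough sketch of how the two outputs could be produced with the standard `sqlite3` module is shown below. It is an illustration under assumptions, not the script's implementation: the helper name `write_outputs`, the `run_accession` column, and the choice to drop and recreate an existing `filtered_sra_metadata` table are all hypothetical.

```python
import sqlite3


def write_outputs(conn, where_sql, params, output_file, create_table=False):
    """Write matching SRR IDs to a text file and optionally copy the rows."""
    # 'run_accession' is a hypothetical name for the column holding SRR IDs.
    cursor = conn.execute(
        f"SELECT run_accession FROM sra_metadata WHERE {where_sql}", params
    )
    srr_ids = [row[0] for row in cursor]

    # SRR list: one identifier per line.
    with open(output_file, "w") as handle:
        for srr_id in srr_ids:
            handle.write(f"{srr_id}\n")

    if create_table:
        # Assumption: an older filtered table is replaced on each run.
        conn.execute("DROP TABLE IF EXISTS filtered_sra_metadata")
        # Copy the column layout without rows, then fill it with the matches.
        conn.execute(
            "CREATE TABLE filtered_sra_metadata AS "
            "SELECT * FROM sra_metadata WHERE 0"
        )
        conn.execute(
            "INSERT INTO filtered_sra_metadata "
            f"SELECT * FROM sra_metadata WHERE {where_sql}",
            params,
        )
        conn.commit()

    return srr_ids
```

Here `where_sql` is expected to contain `?` placeholders only for values, with `params` supplying them in order.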

Requirements:

| Category | Requirements |
|---|---|
| Python | Python 3 |
| Standard Libraries | argparse, sqlite3, sys |
| External Libraries | None |

Arguments:

| Argument | Function | Required |
|---|---|---|
| -db, --database | Path to the input SQLite database file. | Yes |
| --output_file | Path for the output .txt file that will store the filtered SRR IDs. | Yes |
| -l, --read_length | Keeps samples whose mean read length is greater than or equal to the specified value. Default is 100. | No |
| -f, --filters | Keeps samples where the specified columns are not empty. You can list multiple columns. Allowed columns: pmid, species_cultivar, species_genotype, treatment, dev_stage, tissue, age, source_name. | |
| -s, --strand | Filters by library layout. You can specify one or more of PAIRED, SINGLE, or NULL. | |
| -e, --exact_filter | Keeps samples with an exact value in a specific column. Provide the argument as <column_name> followed by the value to match, as in the run example. It can be used multiple times for different filters. To match multiple values in the same column, separate them with a comma. | |
| --create_table | If used, creates a new table filtered_sra_metadata in the database with the full data for the filtered results. | No |
| --verbose | Enables detailed console messages about the filtering process. | No |
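
For readers who want to see how such an interface maps onto `argparse`, here is a hedged sketch. The flag names, the default of 100, and the allowed values are taken from the list above; everything else (the function names `build_parser` and `group_exact_filters`, the use of `nargs`, `choices`, and `action="append"`) is an assumption about how one might implement it, not a copy of filter_metadata.py.

```python
import argparse

ALLOWED_FILTER_COLUMNS = [
    "pmid", "species_cultivar", "species_genotype", "treatment",
    "dev_stage", "tissue", "age", "source_name",
]


def build_parser():
    """Build a parser that mirrors the documented command-line flags."""
    parser = argparse.ArgumentParser(description="Filter SRA metadata.")
    parser.add_argument("-db", "--database", required=True)
    parser.add_argument("--output_file", required=True)
    parser.add_argument("-l", "--read_length", type=int, default=100)
    parser.add_argument("-f", "--filters", nargs="+",
                        choices=ALLOWED_FILTER_COLUMNS, default=[])
    parser.add_argument("-s", "--strand", nargs="+",
                        choices=["PAIRED", "SINGLE", "NULL"], default=[])
    # Each -e takes a column name and a value; repeating -e appends a new pair.
    parser.add_argument("-e", "--exact_filter", nargs=2, action="append",
                        metavar=("COLUMN", "VALUE"), default=[])
    parser.add_argument("--create_table", action="store_true")
    parser.add_argument("--verbose", action="store_true")
    return parser


def group_exact_filters(pairs):
    """Turn [[column, 'a,b'], ...] into {column: ['a', 'b'], ...}."""
    grouped = {}
    for column, raw in pairs or []:
        values = [value.strip() for value in raw.split(",") if value.strip()]
        grouped.setdefault(column, []).extend(values)
    return grouped


# Parse an explicit argument list instead of sys.argv for demonstration.
args = build_parser().parse_args([
    "--database", "arabidopsis.db",
    "--output_file", "filtered_ath_leaf_samples.txt",
    "-l", "150",
    "-e", "species_name", "Arabidopsis thaliana",
    "-e", "tissue", "leaf",
])
print(group_exact_filters(args.exact_filter))
```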

Run example:

The following command filters the arabidopsis.db database for samples that meet several criteria:

  • Have a read length of at least 150 bp.
  • Have non-empty values in the pmid and tissue columns.
  • Are from a PAIRED library.
  • Are from the Arabidopsis thaliana species and the 'leaf' tissue.

It will create the filtered_sra_metadata table in the database and save the resulting SRR IDs to filtered_ath_leaf_samples.txt.

```bash
python3 filter_metadata.py --database "arabidopsis.db" \
                           --output_file "filtered_ath_leaf_samples.txt" \
                           -l 150 \
                           -f pmid tissue \
                           -s PAIRED \
                           -e species_name "Arabidopsis thaliana" \
                           -e tissue "leaf" \
                           --create_table \
                           --verbose
```
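
After a run like this, the results can be checked from Python. The snippet below is a small sketch under assumptions: the helper names `load_srr_ids` and `count_filtered_rows` are hypothetical, and only the file and table names come from this page. If the script worked as described, the two counts should match.

```python
import sqlite3


def load_srr_ids(path):
    """Read one accession per line, skipping blank lines."""
    with open(path) as handle:
        return [line.strip() for line in handle if line.strip()]


def count_filtered_rows(db_path):
    """Count rows in filtered_sra_metadata (present only after --create_table)."""
    conn = sqlite3.connect(db_path)
    try:
        query = "SELECT COUNT(*) FROM filtered_sra_metadata"
        return conn.execute(query).fetchone()[0]
    finally:
        conn.close()


# Example usage with the files from the command above:
# srr_ids = load_srr_ids("filtered_ath_leaf_samples.txt")
# print(len(srr_ids), count_filtered_rows("arabidopsis.db"))
```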