0.2 Filter SRRs - labbces/SpliceScape GitHub Wiki
The Python script filter_metadata.py provides a user-friendly command-line interface to filter the database based on various technical and biological criteria. While it is always possible to query the SQLite database directly using SQL, this script simplifies the most common filtering operations. It dynamically constructs and executes a SQL query based on your command-line arguments, making it easy to narrow down thousands of samples to a relevant subset.
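
The dynamic query construction described above can be sketched roughly as follows. This is a minimal illustration, not the code of filter_metadata.py: the function name `build_query` and the column names `srr`, `read_length`, and `library_layout` are assumptions, and the real schema of sra_metadata may differ. Values are bound as `?` parameters, while column names, which cannot be bound, are validated before being placed in the SQL text.

```python
import sqlite3

# Columns accepted by -f/--filters, as listed in the arguments table.
NOT_EMPTY_COLUMNS = {
    "pmid", "species_cultivar", "species_genotype", "treatment",
    "dev_stage", "tissue", "age", "source_name",
}


def build_query(min_read_length=100, not_empty=(), layouts=(), exact=None):
    """Return (sql, params) selecting accessions from sra_metadata.

    The column names srr, read_length and library_layout are assumptions
    made for this sketch; check them against the real table schema.
    """
    clauses = ["read_length >= ?"]
    params = [min_read_length]

    # Non-empty filters: only whitelisted column names reach the SQL text.
    for col in not_empty:
        if col not in NOT_EMPTY_COLUMNS:
            raise ValueError(f"column not allowed: {col}")
        clauses.append(f"({col} IS NOT NULL AND {col} != '')")

    # Layout filter: the literal NULL means "layout is missing".
    if layouts:
        named = [v for v in layouts if v != "NULL"]
        parts = []
        if named:
            marks = ", ".join("?" * len(named))
            parts.append(f"library_layout IN ({marks})")
            params.extend(named)
        if "NULL" in layouts:
            parts.append("library_layout IS NULL")
        clauses.append("(" + " OR ".join(parts) + ")")

    # Exact filters: one IN (...) clause per column.
    for col, values in (exact or {}).items():
        if not col.isidentifier():
            raise ValueError(f"invalid column name: {col}")
        marks = ", ".join("?" * len(values))
        clauses.append(f"{col} IN ({marks})")
        params.extend(values)

    sql = "SELECT srr FROM sra_metadata WHERE " + " AND ".join(clauses)
    return sql, params


# Small in-memory demonstration with made-up rows.
demo = sqlite3.connect(":memory:")
demo.execute(
    "CREATE TABLE sra_metadata "
    "(srr TEXT, read_length REAL, library_layout TEXT, tissue TEXT)"
)
demo.executemany(
    "INSERT INTO sra_metadata VALUES (?, ?, ?, ?)",
    [("SRR1", 151, "PAIRED", "leaf"), ("SRR2", 100, "PAIRED", "leaf"),
     ("SRR3", 151, "SINGLE", "leaf"), ("SRR4", 151, "PAIRED", "")],
)
demo_sql, demo_params = build_query(150, ["tissue"], ["PAIRED"], {"tissue": ["leaf"]})
print(demo_sql)
print(demo.execute(demo_sql, demo_params).fetchall())
```

Binding values instead of formatting them into the string keeps quotes in values such as species names from breaking the query.
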
Input:
SQLite Database: The .db file generated by get_metadata.py containing the sra_metadata table.
See the "0. Pre‐processing: SRR metadata DB construction" page for more information.
Outputs:
The script generates two types of output:
- SRR List (Required): A plain text (.txt) file specified by --output_file. This file contains the final list of filtered SRA accessions, one identifier per line, ready to be used as input for the main SpliceScape pipeline.
- Filtered Database Table (Optional): If the --create_table flag is used, the script creates a new table named filtered_sra_metadata inside the input database. This table contains all the metadata for the samples that passed the filters.
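
The two outputs could be produced along these lines. This is a hedged sketch, not the script's own code: the helper name `write_outputs`, the accession column name `srr`, and the choice to replace an existing filtered_sra_metadata table are all assumptions.

```python
import sqlite3
from pathlib import Path


def write_outputs(conn, where_sql, params, output_file, create_table=False):
    """Write matching accessions to a text file; optionally copy the rows.

    where_sql is a WHERE clause with ? placeholders and params holds its
    values. The accession column name srr is an assumption of this sketch.
    """
    rows = conn.execute(
        f"SELECT srr FROM sra_metadata WHERE {where_sql}", params
    ).fetchall()

    # SRR list: one accession per line.
    Path(output_file).write_text("".join(f"{srr}\n" for (srr,) in rows))

    if create_table:
        # Replacing a table left by an earlier run is an assumption; the
        # script may handle an existing table differently.
        conn.execute("DROP TABLE IF EXISTS filtered_sra_metadata")
        # Copy the column layout only (WHERE 0 matches no rows) ...
        conn.execute(
            "CREATE TABLE filtered_sra_metadata AS "
            "SELECT * FROM sra_metadata WHERE 0"
        )
        # ... then fill it with the full metadata of the matching rows.
        conn.execute(
            "INSERT INTO filtered_sra_metadata "
            f"SELECT * FROM sra_metadata WHERE {where_sql}",
            params,
        )
        conn.commit()
    return len(rows)
```

A call such as `write_outputs(conn, "tissue = ?", ["leaf"], "srr.txt", create_table=True)` would write the list and create the table in one pass.
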
| Category | Requirements |
|---|---|
| Python | Python 3 |
| Standard Libraries | argparse, sqlite3, sys |
| External Libraries | None |
Arguments:
| Argument | Function | Required |
|---|---|---|
| -db, --database | Path to the input SQLite database file. | Yes |
| --output_file | Path for the output .txt file that will store the filtered SRR IDs. | Yes |
| -l, --read_length | Keeps samples whose mean read length is greater than or equal to the specified value. Default is 100. | No |
| -f, --filters | Keeps samples where the specified columns are not empty. You can list multiple columns. Allowed columns: pmid, species_cultivar, species_genotype, treatment, dev_stage, tissue, age, source_name. | No |
| -s, --strand | Filters by library layout. You can specify one or more of PAIRED, SINGLE, or NULL. | No |
| -e, --exact_filter | Applies a filter for an exact value in a specific column. Provide the argument as <column_name> followed by the value to match. It can be used multiple times for different filters. To match multiple values in the same column, separate them with a comma. | No |
| --create_table | If used, creates a new table filtered_sra_metadata in the database with the full data for the filtered results. | No |
| --verbose | Enables detailed console messages about the filtering process. | No |
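
For orientation, an argparse parser matching the table above might look like the sketch below. The flag names come from the table; everything else (types, `nargs` settings, the helper `exact_filters_to_dict`, and the use of `int` for the read length) is an assumption and may not match how filter_metadata.py defines its parser.

```python
import argparse

FILTER_COLUMNS = [
    "pmid", "species_cultivar", "species_genotype", "treatment",
    "dev_stage", "tissue", "age", "source_name",
]


def build_parser():
    """Build a parser mirroring the documented arguments (a sketch)."""
    p = argparse.ArgumentParser(description="Filter SRR metadata.")
    p.add_argument("-db", "--database", required=True,
                   help="input SQLite database file")
    p.add_argument("--output_file", required=True,
                   help="output .txt file for the filtered SRR IDs")
    p.add_argument("-l", "--read_length", type=int, default=100,
                   help="minimum mean read length")
    p.add_argument("-f", "--filters", nargs="+", choices=FILTER_COLUMNS,
                   default=[], help="columns that must not be empty")
    p.add_argument("-s", "--strand", nargs="+",
                   choices=["PAIRED", "SINGLE", "NULL"], default=[],
                   help="library layouts to keep")
    # Each -e takes two tokens; repeated uses accumulate in a list.
    p.add_argument("-e", "--exact_filter", nargs=2, action="append",
                   metavar=("COLUMN", "VALUE"), default=[],
                   help="exact match; comma-separate multiple values")
    p.add_argument("--create_table", action="store_true")
    p.add_argument("--verbose", action="store_true")
    return p


def exact_filters_to_dict(pairs):
    """Turn [[column, 'a,b'], ...] into {column: ['a', 'b'], ...}."""
    result = {}
    for column, value in pairs:
        result.setdefault(column, []).extend(value.split(","))
    return result
```

Using `action="append"` together with `nargs=2` is one way to let -e appear several times, each time with a column and a value.
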
The following command filters the arabidopsis.db database for samples that meet several criteria:
- Have a mean read length of at least 150 bp.
- Have non-empty values in the pmid and tissue columns.
- Are from a PAIRED library.
- Have species_name equal to "Arabidopsis thaliana" and tissue equal to "leaf".
It will create the filtered_sra_metadata table in the database and save the resulting SRR IDs to filtered_ath_leaf_samples.txt.
python3 filter_metadata.py --database "arabidopsis.db" \
--output_file "filtered_ath_leaf_samples.txt" \
-l 150 \
-f pmid tissue \
-s PAIRED \
-e species_name "Arabidopsis thaliana" \
-e tissue "leaf" \
--create_table \
--verbose
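
Before passing the resulting file to the main pipeline, it can be useful to load and sanity-check it. The sketch below is an illustration only: `read_srr_list` is a hypothetical helper, and the check assumes the list contains SRA run accessions (SRR, ERR, or DRR followed by digits).

```python
import re
from pathlib import Path

# SRA run accessions start with SRR, ERR or DRR, followed by digits.
RUN_ACCESSION = re.compile(r"^[SED]RR\d+$")


def read_srr_list(path):
    """Return the accessions in an SRR list file, skipping blank lines."""
    accessions = []
    for line in Path(path).read_text().splitlines():
        acc = line.strip()
        if not acc:
            continue
        if not RUN_ACCESSION.match(acc):
            raise ValueError(f"unexpected accession: {acc!r}")
        accessions.append(acc)
    return accessions
```

For example, `len(read_srr_list("filtered_ath_leaf_samples.txt"))` gives the number of samples that passed the filters.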