07 data formats - the-omics-os/lobster-local GitHub Wiki
Lobster AI supports a wide range of biological data formats for different omics types. This guide provides detailed specifications for supported input and output formats, including format conversion capabilities and best practices.
Description: Standard format for single-cell data, used by scanpy and other Python tools.
File Extension: .h5ad
Structure:
AnnData object with:
- X: Expression matrix (cells × genes)
- obs: Cell metadata (cell barcodes, QC metrics, clusters)
- var: Gene metadata (gene symbols, chromosome, biotype)
- obsm: Multi-dimensional cell annotations (PCA, UMAP coordinates)
- varm: Multi-dimensional gene annotations
- layers: Additional expression matrices (raw counts, normalized)
- uns: Unstructured annotations (parameters, plots)
Example Loading:
/read single_cell_data.h5ad
Advantages:
- Efficient storage with compression
- Preserves all analysis metadata
- Native format for scanpy workflows
- Supports both sparse and dense matrices
10X HDF5 Format
- File Extension: .h5
- Structure: HDF5 file with matrix, features, and barcodes
- Loading:
/read filtered_feature_bc_matrix.h5
10X MTX Format
- Files: matrix.mtx.gz, features.tsv.gz, barcodes.tsv.gz
- Structure: Matrix Market format with separate metadata files
- Loading:
/read /path/to/filtered_feature_bc_matrix/
10X CSV Format
- Files: CSV/TSV files with gene expression matrix
- Structure: Genes as rows, cells as columns (or transposed)
Structure Options:
- Genes as rows, cells as columns:
gene_id,cell_1,cell_2,cell_3,...
ENSG00000001,10,5,0,...
ENSG00000002,0,15,3,...
- Cells as rows, genes as columns:
cell_id,ENSG00000001,ENSG00000002,...
cell_1,10,0,...
cell_2,5,15,...
Loading:
/read expression_matrix.csv
/read expression_matrix.tsv
Auto-detection: Lobster AI automatically detects orientation and format.
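To illustrate what orientation detection can look like, here is a minimal sketch using only the Python standard library. It is not Lobster AI's actual detection logic. The function name `guess_orientation` and the use of Ensembl-style prefixes (`ENSG`, `ENST`) as the signal are assumptions made for this example.

```python
import csv
import io

def guess_orientation(text, gene_prefixes=("ENSG", "ENST")):
    """Guess whether genes are rows or columns in a delimited expression matrix."""
    # Pick the delimiter that occurs most often in the header line.
    first_line = text.splitlines()[0]
    delimiter = "\t" if first_line.count("\t") > first_line.count(",") else ","
    rows = list(csv.reader(io.StringIO(text), delimiter=delimiter))
    header, body = rows[0], rows[1:]
    # Count gene-like identifiers in the header versus the first column.
    in_header = sum(cell.startswith(gene_prefixes) for cell in header[1:])
    in_first_col = sum(row[0].startswith(gene_prefixes) for row in body if row)
    if in_first_col > in_header:
        return "genes_as_rows"
    if in_header > in_first_col:
        return "cells_as_rows"
    return "unknown"
```

A matrix that uses gene symbols instead of Ensembl IDs would return "unknown" here, so a real detector would need additional signals such as matrix shape.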
File Extensions: .xlsx, .xls
Structure: Expression matrix with optional metadata sheets
Example Loading:
/read single_cell_data.xlsx
Kallisto abundance.tsv Format
target_id length eff_length est_counts tpm
ENST00000456328 1657 1497 0 0
ENST00000450305 632 472 10.5 3.2
ENST00000488147 1351 1191 125.8 15.4
Directory Structure:
quantification_directory/
├── sample1/
│ └── abundance.tsv
├── sample2/
│ └── abundance.tsv
└── sample3/
└── abundance.tsv
Loading Kallisto Data:
Use the /read command directly for quantification files.
/read /path/to/kallisto_output
Salmon quant.sf Format
Name Length EffectiveLength TPM NumReads
ENST00000456328 1657 1497.000 0.000 0.000
ENST00000450305 632 472.000 3.215 10.500
ENST00000488147 1351 1191.000 15.432 125.800
Loading Salmon Data:
Use the /read command directly for quantification files.
/read /path/to/salmon_output
Key Features:
- Automatic Tool Detection: System detects Kallisto vs Salmon from file patterns
- Per-Sample Merging: Automatically merges quantification from multiple samples
- Correct Orientation: Transposes to samples × genes (bulk RNA-seq standard)
- Metadata Preservation: Extracts sample names from directory structure
- Quality Validation: Verifies quantification file integrity and consistency
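The per-sample merging and samples × features orientation described above can be sketched as follows. This is an illustrative standard-library example, not Lobster AI's loader. The function names (`parse_abundance`, `merge_samples`, `load_quant_dir`) are hypothetical, and filling absent transcripts with 0.0 is an assumption.

```python
import csv
import io
from pathlib import Path

def parse_abundance(text, value_column="est_counts"):
    """Parse the text of a tab-separated abundance.tsv into {target_id: value}."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    return {row["target_id"]: float(row[value_column]) for row in reader}

def merge_samples(per_sample):
    """Build a samples x transcripts matrix from {sample: {target_id: value}}."""
    samples = sorted(per_sample)
    transcripts = sorted({t for values in per_sample.values() for t in values})
    # Transcripts absent from a sample are filled with 0.0 (an assumption).
    matrix = [[per_sample[s].get(t, 0.0) for t in transcripts] for s in samples]
    return samples, transcripts, matrix

def load_quant_dir(quant_dir):
    """Read <quant_dir>/<sample>/abundance.tsv, naming samples by directory."""
    return {
        path.parent.name: parse_abundance(path.read_text())
        for path in sorted(Path(quant_dir).glob("*/abundance.tsv"))
    }
```

Calling `merge_samples(load_quant_dir("quantification_directory"))` on the directory layout shown earlier would return the sample names, the transcript IDs, and one row of values per sample.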
Supported File Names:
- Kallisto: abundance.tsv, abundance.h5, abundance.txt
- Salmon: quant.sf, quant.genes.sf
CSV/TSV Count Matrix
gene_id,sample_1,sample_2,sample_3,sample_4
ENSG00000001,150,200,175,220
ENSG00000002,0,5,2,8
ENSG00000003,1200,1500,1300,1800
Requirements:
- Raw or normalized counts
- Gene identifiers (Ensembl, Symbol, etc.)
- Sample identifiers as column headers
Structure: Compatible with DESeq2 input requirements
- Integer count values (for raw counts)
- Gene metadata optional
- Sample metadata in separate file
Sample Metadata:
sample_id,condition,batch,replicate
sample_1,control,batch1,1
sample_2,control,batch1,2
sample_3,treatment,batch2,1
sample_4,treatment,batch2,2
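Before loading, it can help to confirm that the count matrix and the sample metadata agree. The sketch below is a hypothetical pre-flight check written with the standard library. The function name and the messages are illustrative, and it assumes raw counts with a `sample_id` column in the metadata, as in the examples above.

```python
import csv
import io

def check_counts_against_metadata(counts_csv, metadata_csv):
    """Return a list of problems; an empty list means the inputs look consistent."""
    problems = []
    count_rows = list(csv.reader(io.StringIO(counts_csv)))
    matrix_samples = count_rows[0][1:]  # header without the gene_id column
    meta_samples = [r["sample_id"] for r in csv.DictReader(io.StringIO(metadata_csv))]
    missing = sorted(set(matrix_samples) - set(meta_samples))
    if missing:
        problems.append(f"samples without metadata: {missing}")
    for row in count_rows[1:]:
        gene, values = row[0], row[1:]
        # Raw counts should be non-negative integers.
        if not all(v.isdigit() for v in values):
            problems.append(f"non-integer count in {gene}")
    return problems
```

The integer check only applies to raw counts; normalized matrices would need a different rule.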
Gene Metadata:
gene_id,gene_symbol,biotype,chromosome
ENSG00000001,DDX11L1,processed_transcript,chr1
ENSG00000002,WASH7P,unprocessed_pseudogene,chr1
proteinGroups.txt
- Description: Main MaxQuant output file with protein quantification
- Key Columns:
  - Protein IDs: UniProt identifiers
  - Gene names: Gene symbols
  - Intensity <sample>: Raw protein intensities
  - LFQ intensity <sample>: Label-free quantified intensities
  - Razor + unique peptides: Peptide counts
- Loading:
/read proteinGroups.txt
peptides.txt
- Description: Peptide-level quantification
- Usage: For peptide-level analysis or filtering
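As an illustration of how the `LFQ intensity <sample>` columns of a proteinGroups.txt file can be pulled into a plain protein × sample table, here is a minimal sketch. It assumes the file is tab-separated and contains the column names listed above. The function name `extract_lfq` is hypothetical, and this is not Lobster AI's parser.

```python
import csv
import io

def extract_lfq(protein_groups_text):
    """Return (sample_names, {protein_ids: [LFQ values]}) from proteinGroups.txt text."""
    reader = csv.DictReader(io.StringIO(protein_groups_text), delimiter="\t")
    prefix = "LFQ intensity "
    # Keep only the label-free quantification columns, one per sample.
    lfq_columns = [c for c in reader.fieldnames if c.startswith(prefix)]
    samples = [c[len(prefix):] for c in lfq_columns]
    table = {
        row["Protein IDs"]: [float(row[c]) for c in lfq_columns] for row in reader
    }
    return samples, table
```

The raw `Intensity <sample>` columns are ignored because their prefix differs from the LFQ prefix.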
CSV/Excel Format
- Structure: Protein or peptide quantification matrix
- Key Columns:
  - Protein/peptide identifiers
  - Sample quantifications
  - Quality metrics (CV, detection frequency)
Loading:
/read spectronaut_results.csv
Intensity Matrix:
protein_id,sample_1,sample_2,sample_3,sample_4
P12345,1200.5,1500.2,1300.8,1800.1
Q67890,800.3,950.7,750.2,1100.4
Requirements:
- Protein identifiers (UniProt, gene symbols)
- Quantitative values (intensities, ratios)
- Missing values as NA, NaN, or empty
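The missing-value requirement can be made concrete with a small parsing sketch. It maps the three accepted markers (NA, NaN, empty) to `float('nan')`. The helper names are hypothetical and the comma delimiter is an assumption taken from the example matrix above.

```python
import math

MISSING_TOKENS = {"", "NA", "NaN"}

def parse_intensity(value):
    """Convert one matrix cell to float, mapping NA/NaN/empty to float('nan')."""
    text = value.strip()
    return math.nan if text in MISSING_TOKENS else float(text)

def parse_row(line, delimiter=","):
    """Split a data row into (protein_id, [values])."""
    protein_id, *cells = line.rstrip("\n").split(delimiter)
    return protein_id, [parse_intensity(c) for c in cells]
```

Mapping every marker to a single representation keeps downstream missing-value analysis independent of how the input file encoded the gaps.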
CSV Format:
SampleID,UniProt,Assay,NPX,Panel
Sample_1,P12345,IL6,5.2,Inflammation
Sample_1,Q67890,TNF,4.8,Inflammation
Sample_2,P12345,IL6,5.5,Inflammation
Structure:
- NPX Values: Normalized protein expression
- Panel Information: Olink panel designation
- UniProt IDs: Protein identifiers
- Assay Names: Protein assay identifiers
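Olink exports are in long format (one row per sample and assay), while most analysis steps expect a sample × protein matrix. The sketch below shows one way to pivot the rows using the column names from the example above. The function name `npx_long_to_wide` is hypothetical.

```python
import csv
import io

def npx_long_to_wide(olink_csv):
    """Pivot long-format Olink rows into {sample: {assay: NPX}}."""
    wide = {}
    for row in csv.DictReader(io.StringIO(olink_csv)):
        # Later rows for the same sample/assay pair overwrite earlier ones.
        wide.setdefault(row["SampleID"], {})[row["Assay"]] = float(row["NPX"])
    return wide
```

Assays that were not measured for a sample are simply absent from that sample's dictionary, which makes missing measurements easy to spot.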
Intensity Matrix:
sample_id,protein_1,protein_2,protein_3,...
control_1,1500,2200,800,...
control_2,1600,2100,750,...
treatment_1,2200,3500,1200,...
Metadata Requirements:
- Antibody validation information
- Protein identifiers
- Sample annotations
File Extension: .h5mu
Description: Stores multiple omics modalities in single file
Structure:
MuData object with:
- mod['rna']: Transcriptomics AnnData
- mod['protein']: Proteomics AnnData
- mod['atac']: Chromatin accessibility AnnData
- obs: Shared sample metadata
- var: Combined feature metadata
Loading:
/read multiomics_data.h5mu
Separate Files for Each Modality:
- transcriptomics.csv
- proteomics.csv
- metadata.csv
Sample Matching: Common sample identifiers across files
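Sample matching amounts to intersecting the identifier lists of each file. A minimal sketch, with a hypothetical helper name:

```python
def common_samples(*id_lists):
    """Return sample IDs present in every list, in the order of the first list."""
    if not id_lists:
        return []
    shared = set(id_lists[0]).intersection(*id_lists[1:])
    return [s for s in id_lists[0] if s in shared]
```

For example, passing the sample columns of transcriptomics.csv and proteomics.csv together with the sample_id column of metadata.csv returns only the samples that can be analyzed across all modalities.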
Standard Format:
sample_id,condition,batch,age,gender,replicate
sample_1,control,batch1,25,female,1
sample_2,control,batch1,27,male,2
sample_3,treatment,batch2,24,female,1
Required Columns:
- sample_id: Unique sample identifier
- Additional columns as needed for experimental design
Supported Data Types:
- Categorical: condition, batch, gender
- Numerical: age, dose, time
- Date/time: collection_date, processing_time
Gene Metadata:
gene_id,gene_symbol,biotype,chromosome,start,end
ENSG00000001,DDX11L1,processed_transcript,chr1,11869,14409
Protein Metadata:
protein_id,gene_symbol,protein_name,molecular_weight
P12345,IL6,Interleukin-6,23.7
Usage:
"Download GSE12345 from GEO database"
Supported GEO Formats:
- Series Matrix Files (GSE*_series_matrix.txt.gz)
- Supplementary Files (various formats)
- Platform Annotations (GPL*)
Processing:
- Automatic format detection
- Metadata extraction
- Sample annotation processing
- Expression matrix reconstruction
Loading Downloaded Files:
/read GSE12345_series_matrix.txt.gz
/archive GSE12345_RAW.tar # Extract and process archived samples
TAR Archives
- File Extensions: .tar, .tar.gz, .tar.bz2
- Compression: Supports gzip and bzip2 compression
- Usage: Common for GEO RAW files and multi-sample datasets
ZIP Archives
- File Extension: .zip
- Compression: Standard ZIP compression
- Usage: Alternative archive format for data distribution
Lobster AI automatically detects and processes multiple bioinformatics formats within archives:
10X Genomics Archives
- V3 Chemistry: matrix.mtx, features.tsv, barcodes.tsv
- V2 Chemistry: matrix.mtx, genes.tsv, barcodes.tsv
- Compression: Handles both .gz-compressed and uncompressed files
- Structure: Automatically detects nested sample directories
Example Structure:
GSE155698_RAW.tar
├── GSM4701116_PDAC_PBMC_01/
│ ├── matrix.mtx.gz
│ ├── features.tsv.gz # V3 chemistry
│ └── barcodes.tsv.gz
├── GSM4701131_PDAC_PBMC_16/
│ ├── matrix.mtx.gz
│ ├── genes.tsv.gz # V2 chemistry
│ └── barcodes.tsv.gz
└── ... (additional samples)
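The V2/V3 distinction above comes down to which feature file sits next to matrix.mtx. The following sketch classifies sample directories from a list of archive member names. It is an illustration only: `classify_10x_members` is a hypothetical name, and the "incomplete" label is an assumption about how missing files might be reported.

```python
from collections import defaultdict
from pathlib import PurePosixPath

def classify_10x_members(member_names):
    """Map each sample directory to 'V2', 'V3', or 'incomplete'."""
    files_by_dir = defaultdict(set)
    for name in member_names:
        path = PurePosixPath(name)
        # Treat matrix.mtx.gz and matrix.mtx as the same logical file.
        base = path.name[:-3] if path.name.endswith(".gz") else path.name
        files_by_dir[str(path.parent)].add(base)
    result = {}
    for directory, names in files_by_dir.items():
        if "matrix.mtx" not in names:
            continue  # not a 10X sample directory
        if "barcodes.tsv" not in names:
            result[directory] = "incomplete"
        elif "features.tsv" in names:
            result[directory] = "V3"
        elif "genes.tsv" in names:
            result[directory] = "V2"
        else:
            result[directory] = "incomplete"
    return result
```

The member names can come from the standard library, for example `tarfile.open(path).getnames()`, which lists archive contents without writing any files to disk.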
Kallisto/Salmon Quantification Archives
- Kallisto Files: abundance.tsv, abundance.h5, abundance.txt
- Salmon Files: quant.sf, quant.genes.sf
- Requirements: Multiple sample subdirectories (≥2 samples)
- Auto-Merge: Automatically combines all samples into unified dataset
Example Structure:
kallisto_results.tar.gz
├── sample_1/
│ └── abundance.tsv
├── sample_2/
│ └── abundance.tsv
└── sample_3/
└── abundance.tsv
GEO RAW Expression Files
- Pattern: GSM<digits>_*.txt, GSM<digits>_*.txt.gz
- Format: Expression matrices or quantification files
- Metadata: Automatically extracts sample information from filenames
Basic Usage:
/archive /path/to/archive.tar
/archive /path/to/archive.tar.gz
/archive /path/to/archive.zip
Features:
- Smart Content Detection: Identifies data format without full extraction
- Memory Efficiency: Streaming extraction for large archives
- Multi-Sample Processing: Automatic sample concatenation
- Format Mixing: Handles archives with V2 and V3 chemistry mixed
- Progress Tracking: Real-time status updates during processing
Example Workflow:
# Load GEO archive with multiple 10X samples
/archive GSE155698_RAW.tar
# System automatically:
# 1. Inspects archive contents (no extraction yet)
# 2. Detects 17 10X Genomics samples (V2 and V3 mixed)
# 3. Extracts each sample efficiently
# 4. Loads and concatenates all samples
# 5. Result: 94,371 cells × 32,738 genes
# Load Kallisto quantification archive
/archive kallisto_batch_results.tar.gz
# System automatically:
# 1. Detects multiple abundance.tsv files
# 2. Identifies Kallisto format
# 3. Merges samples with proper orientation
# 4. Result: samples × genes count matrix
Step 1: Manifest Inspection
- Fast archive contents scan
- File pattern matching
- Format identification
Step 2: Content Type Detection
- 10X Genomics (V2/V3 detection)
- Kallisto/Salmon quantification
- GEO RAW expression files
- Generic expression matrices
Step 3: Extraction Strategy
- Memory-efficient streaming
- Selective extraction (only needed files)
- Nested archive handling
Step 4: Data Loading
- Format-specific loaders
- Sample concatenation
- Metadata preservation
- Quality validation
Use /archive for:
- TAR/ZIP compressed archives
- Multi-sample datasets (10X, Kallisto, Salmon)
- GEO RAW downloads
- Nested directory structures
Use /read for:
- Individual H5AD files
- Single CSV/Excel files
- Uncompressed directories
- Pre-extracted data
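The choice between the two commands can be summarized as a suffix check on the archive extensions listed earlier (.tar, .tar.gz, .tar.bz2, .zip). The helper below is a hypothetical convenience for scripting and is not part of Lobster AI.

```python
ARCHIVE_SUFFIXES = (".tar", ".tar.gz", ".tar.bz2", ".zip")

def suggest_command(path):
    """Suggest '/archive' for archive files and '/read' for everything else."""
    command = "/archive" if path.lower().endswith(ARCHIVE_SUFFIXES) else "/read"
    return f"{command} {path}"
```

A single gzip-compressed file such as GSE12345_series_matrix.txt.gz is not an archive, so it maps to /read, matching the GEO example earlier in this guide.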
Quality Checks:
- File completeness (all required files present)
- Format consistency (matching structures across samples)
- Compression integrity
- Data type validation
Error Handling:
- Missing required files (e.g., missing barcodes.tsv)
- Corrupted archives
- Unsupported nested structures
- Mixed incompatible formats
Generated Data:
- Processed expression matrices
- Quality control metrics
- Clustering results
- Dimensionality reduction coordinates
- Differential expression results
Professional Naming Convention:
geo_gse12345_quality_assessed.h5ad
geo_gse12345_filtered_normalized.h5ad
geo_gse12345_clustered.h5ad
geo_gse12345_annotated.h5ad
Differential Expression Results:
gene_id,gene_symbol,log2FoldChange,pvalue,padj,baseMean
ENSG00000001,DDX11L1,2.5,0.001,0.05,150.2
ENSG00000002,WASH7P,-1.8,0.002,0.06,89.7
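A results table in this layout is straightforward to post-process outside Lobster AI. The sketch below filters for significant genes. The function name and the default cutoffs (padj ≤ 0.05, |log2FoldChange| ≥ 1) are assumptions chosen for illustration, not recommendations from this guide.

```python
import csv
import io

def significant_genes(de_csv, padj_cutoff=0.05, lfc_cutoff=1.0):
    """Return gene symbols passing both the padj and |log2FoldChange| cutoffs."""
    hits = []
    for row in csv.DictReader(io.StringIO(de_csv)):
        if row["padj"] in ("", "NA"):
            continue  # no adjusted p-value available for this gene
        passes_padj = float(row["padj"]) <= padj_cutoff
        passes_lfc = abs(float(row["log2FoldChange"])) >= lfc_cutoff
        if passes_padj and passes_lfc:
            hits.append(row["gene_symbol"])
    return hits
```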
Cluster Annotations:
cell_id,cluster,cell_type,confidence
cell_1,0,Hepatocyte,0.95
cell_2,1,Stellate_Cell,0.87
Format: Plotly HTML files
Features:
- Zoom, pan, hover information
- Publication-quality rendering
- Embedded metadata
Example Files:
- plot_1_UMAP_clusters.html
- plot_2_volcano_plot.html
- plot_3_quality_metrics.html
Formats: PNG, PDF, SVG
Usage: Publications and presentations
Resolution: High-resolution (300+ DPI)
Format: ZIP archive
Contents:
- All processed data files (H5AD format)
- Generated plots (HTML and PNG)
- Analysis metadata and parameters
- Technical summary report
- Provenance information
Structure:
lobster_analysis_package_20240115_143022.zip
├── modalities/
│ ├── dataset_processed.h5ad
│ ├── dataset_processed.csv
│ └── dataset_metadata.json
├── plots/
│ ├── plot_1_clusters.html
│ ├── plot_1_clusters.png
│ └── index.json
├── technical_summary.md
├── workspace_status.json
└── provenance.json
Format: JSON metadata
Content: Analysis parameters, tool usage history, session information
Lobster AI automatically handles format conversion during loading:
- CSV/Excel → AnnData: Matrix orientation detection and conversion
- 10X → AnnData: MTX format to AnnData with metadata
- H5 → AnnData: 10X HDF5 to AnnData format
- CSV → Count Matrix: Proper gene/sample orientation
- Excel → Multiple Sheets: Extract expression and metadata
- MaxQuant → Standard Matrix: Extract relevant columns
- Wide → Long Format: Reshape for analysis tools
- Missing Value Handling: Consistent NA representation
# Convert Excel to CSV
"Convert this Excel file to CSV format for analysis"
# Reshape data matrix
"Transpose this matrix so genes are rows and samples are columns"
# Extract specific columns
"Extract only the LFQ intensity columns from this MaxQuant file"
# Merge files
"Combine the expression data with the sample metadata file"
- Matrix Dimensions: Consistent row/column counts
- Data Types: Numeric values in expression matrices
- Identifiers: Valid gene/protein/sample IDs
- Missing Values: Appropriate handling of NA values
- Expression Ranges: Biologically reasonable values
- Count Data: Non-negative values for count matrices
- Metadata Consistency: Matching sample identifiers
- Format Compliance: Standard field requirements
- Gene Detection: Minimum genes per cell
- Cell Quality: Mitochondrial content, doublet detection
- Library Complexity: UMI and gene count distributions
- Library Sizes: Total count distributions
- Gene Detection: Expressed genes per sample
- Batch Effects: PCA-based assessment
- Missing Value Patterns: MNAR vs MCAR assessment
- Coefficient of Variation: Technical reproducibility
- Dynamic Range: Protein intensity distributions
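For reference, the coefficient of variation used as a reproducibility measure is the standard deviation divided by the mean. The sketch below uses the sample standard deviation and reports a percentage. Both choices, and the function name, are assumptions, since this guide does not state how Lobster AI computes CV.

```python
import statistics

def coefficient_of_variation(values):
    """Return CV as a percentage: sample standard deviation / mean * 100."""
    mean = statistics.fmean(values)
    if mean == 0:
        raise ValueError("CV is undefined when the mean is zero")
    return 100.0 * statistics.stdev(values) / mean
```

At least two values are required, because `statistics.stdev` raises an error for a single data point.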
project/
├── raw_data/
│ ├── expression_matrix.csv
│ ├── sample_metadata.csv
│ └── gene_annotations.csv
├── processed_data/
└── results/
- Descriptive Names: Include data type, condition, date
- No Spaces: Use underscores instead of spaces
- Version Control: Include version numbers for iterations
- Complete Annotations: All relevant experimental factors
- Consistent Identifiers: Use standard gene/protein IDs
- Missing Data: Explicit NA values, never empty strings
- H5AD: Single-cell analysis workflows
- CSV: Simple bulk RNA-seq experiments
- Excel: Small datasets with multiple annotation sheets
- HDF5: Large datasets requiring compression
- scanpy: H5AD format preferred
- DESeq2: Count matrices with integer values
- Custom Analysis: CSV for maximum compatibility
- Compression: Use compressed formats (H5AD, HDF5)
- Sparse Matrices: Appropriate for single-cell data
- Chunked Loading: For very large files
- Memory Management: Monitor memory usage during loading
- Compressed Files: Reduce transfer time
- Batch Loading: Multiple small files vs. single large file
- Cloud Storage: Consider cloud-native formats
Cause: Unsupported or malformed file format
Solution:
# Check file structure
"What format is this file and how can I load it?"
# Manual format specification
"Load this file treating it as a CSV with genes as rows"
Cause: Matrix dimensions don't match metadata
Solution:
# Validate data structure
"Check if my expression matrix matches the sample metadata"
# Fix dimension mismatch
"Transpose this matrix to match the metadata"
Cause: Poor data quality or incorrect format interpretation
Solution:
# Assess missing value patterns
"Analyze the missing value patterns in this proteomics data"
# Apply appropriate handling
"Handle missing values using MNAR imputation for this MS data"
Cause: Data may be pre-normalized or log-transformed
Solution:
# Check data distribution
"Examine the distribution of expression values"
# Apply appropriate preprocessing
"Skip normalization since this data appears pre-normalized"
Cause: Incompatible data structures or missing information
Solution:
# Identify conversion requirements
"What information do I need to convert this data to AnnData format?"
# Provide missing metadata
"Use default gene symbols for missing gene annotations"
This guide covers the major biological data formats supported by Lobster AI, with format specifications and best practices for loading, converting, and validating data.