Utilities - skelviper/CHARMtools GitHub Wiki
Utilities
The utils
module provides essential utility functions and helper tools that support the core functionality of CHARMtools. These utilities handle file I/O, data manipulation, and common operations used throughout the package.
Overview
The utilities module includes:
- File I/O operations for various genomic data formats
- Data format conversions and manipulations
- Helper functions for common tasks
- Color generation and visualization support
Core Modules
1. CHARMio.py - Input/Output Operations
Purpose: Comprehensive I/O functions for reading and writing genomic data formats used in 3D chromatin analysis.
Key Functions
File Parsing Functions
parse_pairs(filename: str) -> pd.DataFrame
- Input: Path to .pairs file (4DN standard format, gzipped or plain text)
- Output: pandas DataFrame with contact data
- Features:
- Automatic comment line handling
- Flexible column assignment
- Memory-efficient reading
- Preserves metadata in DataFrame attributes
from CHARMtools.utils.CHARMio import parse_pairs
# Load pairs file
pairs_data = parse_pairs("contacts.pairs.gz")
print(f"Loaded {len(pairs_data)} contacts")
print(f"Sample name: {pairs_data.attrs['name']}")
parse_3dg(filename: str, s2m=False, sorting=False) -> pd.DataFrame
- Input: Path to .3dg file containing 3D coordinates
- Output: pandas DataFrame with 3D structure data
- Parameters:
s2m
: Convert single-cell to multi-cell formatsorting
: Sort coordinates by genomic position
from CHARMtools.utils.CHARMio import parse_3dg
# Load 3D structure
structure = parse_3dg("cell_structure.3dg", sorting=True)
print(f"Structure contains {len(structure)} beads")
parse_gtf(filename: str) -> pd.DataFrame
- Input: Path to GTF annotation file
- Output: pandas DataFrame with gene annotations
- Features: Handles GENCODE/ENSEMBL GTF formats
File Writing Functions
write_pairs(pairs: pd.DataFrame, out_name: str)
- Input: pandas DataFrame with pairs data, output filename
- Output: Writes .pairs file in 4DN standard format
- Features: Preserves comments and metadata
write_3dg(_3dg: pd.DataFrame, outname: str, m2s=False)
- Input: pandas DataFrame with 3D coordinates, output filename
- Output: Writes .3dg file
- Parameters:
m2s
for multi-cell to single-cell conversion
Matrix Operations
getMatrixFromMCOOLs(filepath: str, genome_coord1: str, genome_coord2=None, resolution=40000, balance=False) -> np.ndarray
- Input: Path to .mcool file, genomic coordinates, resolution
- Output: Contact matrix as numpy array
- Features: Supports balanced and raw matrices
getMatrixFromCooler(filepath: str, genome_coord1: str, genome_coord2=None, resolution=40000, balance=False) -> np.ndarray
- Input: Path to .cool file, genomic coordinates, resolution
- Output: Contact matrix as numpy array
Utility Functions
check_index_binsize(df: pd.DataFrame) -> int
- Input: DataFrame with MultiIndex (chr, pos)
- Output: Detected resolution/bin size
- Purpose: Automatically determine data resolution
divide_name(filename: str) -> tuple
- Input: Filename with extensions
- Output: Tuple of (basename, extensions)
- Purpose: Robust filename parsing
2. helper.py - General Helper Functions
Purpose: Collection of utility functions for common data processing tasks.
Key Functions
- Data validation and type checking
- Coordinate system conversions
- Statistical helper functions
- Memory management utilities
3. pairs_manipulations.py - Pairs Data Processing
Purpose: Specialized functions for manipulating and processing contact pairs data.
Key Functions
- Pairs filtering and subsetting
- Distance calculations
- Contact frequency analysis
- Data quality metrics
Example Usage:
from CHARMtools.utils.pairs_manipulations import filter_pairs_by_distance
# Filter pairs by genomic distance
filtered_pairs = filter_pairs_by_distance(pairs_data, min_dist=1000, max_dist=1000000)
4. generateColor2.py - Color Generation
Purpose: Generate color schemes and palettes for visualization.
Key Functions
Color Palette Generation
- Categorical color schemes
- Continuous color maps
- Custom color interpolation
- Accessibility-friendly palettes
Example Usage:
from CHARMtools.utils.generateColor2 import generate_palette
# Generate colors for chromosomes
chrom_colors = generate_palette(n_colors=23, palette_type='categorical')
Data Format Support
Supported Input Formats
- .pairs - 4DN standard contact pairs format
- .3dg - 3D genome structure coordinates
- .gtf/.gff - Gene annotation files
- .cool/.mcool - Cooler contact matrices
- .h5 - HDF5 data files
- .parquet - Parquet data files
Supported Output Formats
- .pairs - Contact pairs
- .3dg - 3D coordinates
- .csv/.tsv - Tabular data
- .h5 - HDF5 format
- .cif - Crystallographic Information File
Common Usage Patterns
1. Loading and Converting Data
from CHARMtools.utils.CHARMio import parse_pairs, parse_3dg, write_pairs
# Load different data types
pairs = parse_pairs("contacts.pairs.gz")
structure = parse_3dg("structure.3dg")
# Process data
filtered_pairs = pairs[pairs['chr1'] == pairs['chr2']] # cis contacts only
# Save processed data
write_pairs(filtered_pairs, "cis_contacts.pairs")
2. Matrix Operations
from CHARMtools.utils.CHARMio import getMatrixFromCooler
# Load contact matrix
matrix = getMatrixFromCooler(
"contacts.cool",
"chr1:1000000-2000000",
resolution=10000,
balance=True
)
print(f"Matrix shape: {matrix.shape}")
3. File Format Detection
from CHARMtools.utils.CHARMio import divide_name, check_index_binsize
# Parse filename
basename, extension = divide_name("sample.pairs.gz")
print(f"Base: {basename}, Extension: {extension}")
# Detect resolution
resolution = check_index_binsize(data_with_coords)
print(f"Detected resolution: {resolution} bp")
Performance Considerations
Memory Management
- Use chunked reading for large files
- Leverage pandas' memory-efficient data types
- Consider using Parquet format for intermediate storage
I/O Optimization
- Prefer compressed formats (gzip) for storage
- Use appropriate data types to reduce memory usage
- Cache frequently accessed data
Best Practices
-
File Handling:
- Always check file existence before processing
- Use context managers for file operations
- Handle compressed and uncompressed files uniformly
-
Data Validation:
- Validate input formats before processing
- Check for required columns in DataFrames
- Handle missing or malformed data gracefully
-
Error Handling:
- Provide informative error messages
- Log processing steps for debugging
- Implement fallback strategies for common issues
Integration with CHARMtools
The utilities module is designed to work seamlessly with other CHARMtools components:
- Cell3D Objects: Utilities provide the I/O backend for Cell3D data loading
- Preprocessing: File parsing functions support preprocessing workflows
- Analysis: Matrix operations enable contact analysis
- Visualization: Color generation supports plotting functions
Troubleshooting
Common Issues
-
File Format Errors
- Verify file format compatibility
- Check for proper compression
- Validate column structures
-
Memory Issues
- Use chunked processing for large files
- Consider data type optimization
- Monitor memory usage during operations
-
Performance Problems
- Profile I/O operations
- Use appropriate file formats
- Consider parallel processing where applicable
Getting Help
For utility-specific issues:
- Check function docstrings for detailed parameter information
- Review example usage in other CHARMtools modules
- Consult the Troubleshooting Guide
Next Steps
After familiarizing yourself with the utilities:
- Explore Data Preprocessing workflows
- Learn about Cell3D Objects for data analysis
- Check out Visualization tools for data exploration
For more information about CHARMtools, visit the Home page or check the Installation Guide.