Additional I/O Routines for Error Catching Implementation

Related to PR #673: Add capability to catch bort errors

This document lists additional I/O routines in NCEPLIBS-bufr that are candidates for error catching, following the pattern established in PR #673. The routines are grouped into seven levels of increasing complexity to support phased implementation.

Level 1: Simple I/O Query Routines (Low Complexity)

These routines primarily query state or perform simple operations with minimal error paths:

1. ufbcnt (src/openclosebf.F90)
   - Query message/subset counts for a logical unit
   - Returns current message number and subset number
2. ufbqcd (src/cftbvs.F90)
   - Query code/flag information for a mnemonic
   - Returns descriptor and code/flag table information
3. ufbqcp (src/cftbvs.F90)
   - Query code/flag pair information
   - Returns mnemonic for given code/flag descriptor
4. status (src/openclosebf.F90)
   - Query file connection status
   - Returns logical unit status information
5. ufbpos (src/readwritesb.F90)
   - Position to specific message/subset within file
   - Allows random access to specific locations in BUFR file

Level 2: Message-Level I/O Routines (Low-Medium Complexity)

These routines handle complete BUFR messages but have straightforward error handling:

6. readerme (src/readwritemg.F90)
   - Read BUFR message from memory array
   - Processes message from byte array instead of file
7. rdmsgw (src/readwritemg.F90)
   - Read BUFR message wrapper function
   - Low-level message reading utility
8. openmg (src/readwritemg.F90)
   - Open BUFR message for output without subset initialization
   - Alternative to openmb for specific use cases
9. closmg (src/readwritemg.F90)
   - Close current BUFR message for output
   - Finalizes message and prepares for next
10. msgwrt (src/readwritemg.F90)
   - Write BUFR message to file
   - Low-level message writing utility
11. copymg (src/copydata.F90)
   - Copy complete BUFR message from one file to another
   - Transfers entire message including all subsets

Level 3: Subset-Level I/O Routines (Medium Complexity)

These routines handle individual data subsets within messages:

12. writsb (src/readwritesb.F90)
   - Write data subset to output message
   - Complements readsb (already protected)
13. copysb (src/copydata.F90)
   - Copy data subset from input to output file
   - Transfers single subset between files
14. ufbmms (src/memmsgs.F90)
   - Access specific message/subset from memory
   - Direct access to memory-resident data
15. ufbmns (src/memmsgs.F90)
   - Read next subset from memory
   - Sequential access through memory-resident messages
16. readlc (src/readwriteval.F90)
   - Read long character strings from data subset
   - Handles strings longer than standard BUFR descriptors
17. writlc (src/readwriteval.F90)
   - Write long character strings to data subset
   - Stores extended character data in subsets

Level 4: Value-Reading/Writing Routines (Medium-High Complexity)

These are the core UFB* routines that read and write data values. They are similar to ufbint, which already has error catching:

18. ufbrep (src/readwriteval.F90)
   - Read/write replicated sequences
   - Handles repeated data structures within subset
19. ufbstp (src/readwriteval.F90)
   - Read/write stacked replicated sequences
   - Processes nested/stacked replication patterns
20. ufbseq (src/readwriteval.F90)
   - Read/write entire sequences
   - Operates on complete sequence descriptors
21. ufbovr (src/readwriteval.F90)
   - Overwrite existing values in subset
   - Updates values without full subset reconstruction
22. ufbevn (src/readwriteval.F90)
   - Read/write event stacks
   - Handles specialized event sequence structures
23. ufbinx (src/readwriteval.F90)
   - Read values from specific message/subset without positioning
   - Random access to data values
24. ufbget (src/readwriteval.F90)
   - Retrieve complete table of values
   - Bulk extraction of data elements

Level 5: Advanced Memory Operations (High Complexity)

These routines manage BUFR messages in internal memory arrays:

25. ufbmem (src/memmsgs.F90)
   - Read entire BUFR file into internal memory
   - Enables random access to all messages in file
26. ufbmex (src/memmsgs.F90)
   - Read and expand BUFR file into memory
   - Similar to ufbmem with additional processing
27. readmm (src/memmsgs.F90)
   - Read message from internal memory arrays
   - Memory-based alternative to readmg
28. ufbrms (src/memmsgs.F90)
   - Read values from specific message/subset in memory
   - Combines memory access with value extraction
29. ufbtam (src/memmsgs.F90)
   - Build table of all messages in memory
   - Creates index/inventory of memory-resident data

Level 6: Bulk Copy/Transfer Operations (High Complexity)

These routines perform complex multi-step operations:

30. copybf (src/copydata.F90)
   - Copy entire BUFR file
   - Complete file-to-file transfer operation
31. cpymem (src/copydata.F90)
   - Copy messages from internal memory to output file
   - Transfers data from memory arrays to file
32. ufbcpy (src/copydata.F90)
   - Copy all data from input to output subset
   - Complete subset-level data transfer
33. ufbcup (src/copydata.F90)
   - Copy and update subset data
   - Combines copy with value modifications

Level 7: Specialized I/O Operations (Highest Complexity)

These routines have special requirements or complex internal state management:

34. ufbtab (src/openclosebf.F90)
   - Build table of data from multiple messages
   - Aggregates data across message boundaries
   - Note: Already has some error handling
35. ufbdmp (src/dumpdata.F90)
   - Dump formatted contents of data subset
   - Human-readable output of subset structure
36. ufdump (src/dumpdata.F90)
   - Dump formatted contents of entire message
   - Complete message structure visualization
37. dxdump (src/dxtable.F90)
   - Dump dictionary tables to output file
   - Outputs internal table definitions
38. openbt (src/openbt.F90)
   - Open DX BUFR table file (user override point)
   - Special case: stub routine meant for user customization

Implementation Priority Recommendations

Phase 1 (Quick Wins): Level 1-2 Routines (Items 1-11)

Timeline: 1-2 months
Rationale: Low complexity, high user visibility
Benefit: Immediate improvement in error handling for common operations
Testing: Straightforward unit tests for each routine

Phase 2 (Core Protection): Level 3-4 Routines (Items 12-24)

Timeline: 2-4 months
Rationale: Most frequently used by applications
Benefit: Greatest impact on operational reliability
Pattern: Similar to already-implemented ufbint in PR #673
Testing: Extensive testing with real-world data scenarios

Phase 3 (Memory Safety): Level 5 Routines (Items 25-29)

Timeline: 2-3 months
Rationale: Critical for applications using memory-mode operations
Benefit: Prevents memory corruption from cascading errors
Testing: Focus on memory leak and corruption detection

Phase 4 (Comprehensive Coverage): Level 6-7 Routines (Items 30-38)

Timeline: 2-3 months
Rationale: Complete the safety net
Benefit: Handle edge cases and specialized workflows
Testing: Integration tests with complex operational scenarios

Key Implementation Considerations

1. Pattern Consistency

Follow the same catch_bort_*_c() wrapper pattern established in PR #673
Maintain consistent error message format and return codes
Use the bort_target_is_unset flag to prevent nested error catching

2. Backward Compatibility

Maintain existing function signatures and behavior
Ensure applications without error catching still work unchanged
Return codes should match existing conventions (0 = success, -1 = error)

3. Test Coverage

Each addition should include test cases similar to intest14.F90
Test both successful operations and intentional error conditions
Verify error messages are captured correctly
Test nested call scenarios
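
A table-driven layout is one way to cover both the successful and the failing path for each routine. The sketch below is written in C purely for illustration (the library's own tests, such as intest14.F90, are Fortran). fake_protected_call is a hypothetical stub standing in for a protected routine, and the two globals only mimic the caught_str and caught_str_len fields; none of these names come from the library.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-ins for the error state a real test would inspect. */
static char caught_str[128];
static int  caught_str_len = 0;

/* Hypothetical stub for a protected routine: "fails" for negative units. */
static int fake_protected_call(int lunit)
{
    caught_str_len = 0;                     /* clear state from earlier calls */
    if (lunit < 0) {
        snprintf(caught_str, sizeof caught_str,
                 "fake_protected_call: bad unit %d", lunit);
        caught_str_len = (int)strlen(caught_str);
        return -1;
    }
    return 0;
}

struct test_case {
    const char *name;
    int         lunit;
    int         expect_rc;    /* 0 = success, -1 = error                      */
    const char *expect_text;  /* substring expected in caught_str, or NULL    */
};

/* Run every case and return the number of cases that failed. */
static int run_cases(const struct test_case *cases, size_t n)
{
    int failures = 0;
    assert(cases != NULL);
    for (size_t i = 0; i < n; i++) {
        int rc = fake_protected_call(cases[i].lunit);
        int ok = (rc == cases[i].expect_rc);
        if (cases[i].expect_text != NULL)
            ok = ok && caught_str_len > 0 &&
                 strstr(caught_str, cases[i].expect_text) != NULL;
        else
            ok = ok && caught_str_len == 0;
        if (!ok) {
            printf("FAIL: %s\n", cases[i].name);
            failures++;
        }
    }
    return failures;
}

static const struct test_case demo_cases[] = {
    { "valid unit succeeds",        11,  0, NULL       },
    { "negative unit is caught",    -1, -1, "bad unit" },
    { "success after caught error", 12,  0, NULL       },
};
```

The third case checks that error state left by a caught failure does not leak into the next successful call, which is easy to miss when only isolated calls are tested.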

4. Documentation Updates

Update user guide (docs/user_guide.md) with list of protected routines
Add examples of error catching usage for each routine type
Document error message formats and return codes
Update API documentation in source code headers

5. Performance Impact

Overhead on the success path should be small: one setjmp call per outermost protected call, with longjmp executed only when an error is caught
No performance degradation for applications not using error catching
Consider profiling critical paths in operational systems

6. C Interface Layer

Each protected Fortran routine needs a corresponding C wrapper function
The C wrapper uses setjmp to establish the error return point
The Fortran routine checks the bort_target_is_unset flag before proceeding
The error string is captured in the moda_borts module arrays
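
One way to read the four points above is as a re-entrant call: the public routine hands control to its C wrapper, and the wrapper calls the same routine again, this time with the flag cleared so that the real work runs under a setjmp target. The sketch below models that flow entirely in C so that it can be compiled on its own. In the library the routine side is Fortran, and every name here (demo_readmg, demo_status, enter_protected, catch_bort_demo_c, targets_established) is hypothetical. It also assumes catching is always enabled.

```c
#include <assert.h>
#include <setjmp.h>
#include <string.h>

static jmp_buf bort_target;
static int  bort_target_is_unset = 1;   /* 1 = no wrapper is active          */
static char caught_str[128];
static int  caught_str_len = 0;
static int  targets_established = 0;    /* bookkeeping for this sketch only  */

void demo_status(int lunit);
void demo_readmg(int lunit);

/* Stand-in for bort() while a wrapper is active. */
static void demo_bort(const char *msg)
{
    assert(!bort_target_is_unset);
    strncpy(caught_str, msg, sizeof caught_str - 1);
    caught_str[sizeof caught_str - 1] = '\0';
    caught_str_len = (int)strlen(caught_str);
    longjmp(bort_target, 1);
}

/* Wrapper: set the return point, then call back into the routine. */
static void catch_bort_demo_c(void (*routine)(int), int lunit)
{
    targets_established++;
    if (setjmp(bort_target) == 0)
        routine(lunit);
}

/* Entry guard shared by the protected routines. Returns 1 if the call was
 * handled through the wrapper, 0 if the caller should do the real work. */
static int enter_protected(void (*routine)(int), int lunit)
{
    if (!bort_target_is_unset)
        return 0;                       /* already inside a wrapper */
    bort_target_is_unset = 0;
    caught_str_len = 0;
    catch_bort_demo_c(routine, lunit);
    bort_target_is_unset = 1;
    return 1;
}

void demo_status(int lunit)
{
    if (enter_protected(demo_status, lunit)) return;
    if (lunit < 0)
        demo_bort("demo_status: invalid logical unit");
}

void demo_readmg(int lunit)
{
    if (enter_protected(demo_readmg, lunit)) return;
    demo_status(lunit);                 /* nested protected call */
}
```

Because demo_status sees the flag already cleared when it is called from demo_readmg, it does not establish a second return point. An error raised inside it unwinds to the target set for the outermost call.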

Technical Background

Error Catching Mechanism (from PR #673)

The error catching system implemented in PR #673 uses the following approach:

C Language Layer (borts.c):
- Uses setjmp/longjmp for non-local jumps
- Provides catch_bort_*_c() wrapper functions for each protected routine
- Catches errors from bort() calls via longjmp
Fortran Interface (borts.F90, bufr_c2f_interface.F90):
- Each protected routine checks bort_target_is_unset flag
- If catching is active, calls corresponding C wrapper
- C wrapper establishes return point and calls back to Fortran
- Prevents nested error catching with flag check
Error Storage (modules_arrs.F90):
- moda_borts module stores error state
- caught_str array holds error message text
- caught_str_len indicates error presence (0 = none, >0 = error occurred)
- bort_target_is_unset flag prevents recursive catching
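
The core of this mechanism can be sketched in a few lines of plain C. This is a minimal illustration under stated assumptions, not the code in borts.c: demo_bort, demo_routine, and catch_bort_demo_c are hypothetical names, and the globals only mimic the moda_borts fields listed above, which in the library live in a Fortran module.

```c
#include <assert.h>
#include <setjmp.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical C stand-ins for the moda_borts state described above. */
#define CAUGHT_STR_MAX 256
static jmp_buf bort_target;                 /* return point for longjmp      */
static int     bort_target_is_unset = 1;    /* 1 = no wrapper is active      */
static char    caught_str[CAUGHT_STR_MAX];  /* text of the caught error      */
static int     caught_str_len = 0;          /* 0 = none, >0 = error occurred */

/* Stand-in for bort(): abort when nobody is catching, otherwise store the
 * message and jump back to the wrapper that called setjmp. */
static void demo_bort(const char *msg)
{
    assert(msg != NULL);
    if (bort_target_is_unset) {
        fprintf(stderr, "%s\n", msg);
        abort();
    }
    strncpy(caught_str, msg, CAUGHT_STR_MAX - 1);
    caught_str[CAUGHT_STR_MAX - 1] = '\0';
    caught_str_len = (int)strlen(caught_str);
    longjmp(bort_target, 1);
}

/* Stand-in for a library routine with a single error path. */
static void demo_routine(int lunit)
{
    if (lunit < 0)
        demo_bort("demo_routine: invalid logical unit");
}

/* Hypothetical wrapper in the style of catch_bort_*_c().
 * Returns 0 on success and -1 if an error was caught. */
int catch_bort_demo_c(int lunit)
{
    caught_str_len = 0;
    bort_target_is_unset = 0;
    if (setjmp(bort_target) == 0)
        demo_routine(lunit);            /* normal path */
    /* Reached after a normal return and after a longjmp alike. */
    bort_target_is_unset = 1;
    return caught_str_len > 0 ? -1 : 0;
}
```

A caller would check the return value (or caught_str_len) after each call and read caught_str for the message, rather than having the process terminate inside the library.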

Currently Protected Routines (PR #673)

The following routines already have error catching implemented:

openbf - Open BUFR file
closbf - Close BUFR file
readmg - Read BUFR message
readns - Read next subset
readsb - Read data subset
ufbint - Read/write data values

Testing Strategy

Unit Tests

Individual routine error conditions
Boundary cases and invalid inputs
Error message content verification

Integration Tests

Multi-routine workflows with error recovery
Real-world data processing scenarios
Performance benchmarking

Regression Tests

Ensure existing applications continue to work
Verify backward compatibility
Test with operational data streams

Platform Coverage

Linux (GNU, Intel compilers)
macOS
Windows (if applicable)
Various Fortran compiler versions

Success Metrics

Coverage: Percentage of public API routines with error catching
Reliability: Reduction in unexpected program terminations
Usability: Developer feedback on error handling improvements
Performance: Negligible overhead in production systems
Adoption: Number of applications utilizing error catching

References

PR #673: https://github.com/NOAA-EMC/NCEPLIBS-bufr/pull/673
Issue #671: Original proposal for setjmp/longjmp implementation
Issue #675: Python interface abort problem that motivated this work
Test Program: test/intest14.F90 - Example usage of error catching
Documentation: docs/user_guide.md - NCEPLIBS-bufr user guide

Contact and Collaboration

For questions or contributions related to error catching implementation:

Submit issues on GitHub: https://github.com/NOAA-EMC/NCEPLIBS-bufr/issues
Discuss on pull requests related to error catching
Contact NCEPLIBS-bufr maintainers for coordination

Document Version: 1.0
Date: October 14, 2025
Author: Technical Analysis of PR #673
Status: Planning Document