Merge Methods - Enferlain/sd-optim GitHub Wiki
Draft documentation for the merge methods in sd-optim.
multi_domain_alignment:
Merges tensors A and B using multi-domain alignment with anchor guidance.
This method combines spatial and frequency domain information to create more robust merges, using the anchor tensor C as a reference point.
Key Steps:
- Frequency-Selective Alignment: Aligns the frequency distributions of A and B, guided by the anchor C, to preserve global features.
- Cross-Attention (Optional): Calculates feature importance weights using cross-attention between A, B, and C to emphasize consistently important features.
- Dissimilarity Calculation: Measures the spatial dissimilarity between A and B, guided by the anchor C.
- Adaptive Interpolation: Merges A and B using slerp interpolation, with alpha values adaptively adjusted based on feature importance and dissimilarity.
- Anchor Adjustment: Fine-tunes the merged tensor towards the anchor C to enhance consistency and preserve anchor characteristics.
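A minimal sketch of the last two steps (adaptive slerp plus anchor pull), assuming plain slerp on flattened tensors with a lerp fallback for near-parallel inputs; the helper names are illustrative, not the actual implementation:

```python
import torch

def slerp(a: torch.Tensor, b: torch.Tensor, alpha: float, eps: float = 1e-8) -> torch.Tensor:
    # Spherical linear interpolation on the flattened tensors
    a_flat, b_flat = a.flatten().float(), b.flatten().float()
    a_unit = a_flat / (a_flat.norm() + eps)
    b_unit = b_flat / (b_flat.norm() + eps)
    dot = torch.clamp(a_unit @ b_unit, -1.0, 1.0)
    theta = torch.acos(dot)
    if theta < eps:  # near-parallel: slerp degenerates, fall back to lerp
        return torch.lerp(a, b, alpha)
    out = (torch.sin((1 - alpha) * theta) * a_flat
           + torch.sin(alpha * theta) * b_flat) / torch.sin(theta)
    return out.reshape(a.shape).to(a.dtype)

def anchor_adjust(merged: torch.Tensor, c: torch.Tensor, beta: float) -> torch.Tensor:
    # Step 5: pull the merged tensor toward the anchor C with strength beta
    return torch.lerp(merged, c, beta)
```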
Args:
a (Tensor): The first tensor to merge.
b (Tensor): The second tensor to merge.
c (Tensor): The anchor tensor, used as a reference for alignment and adjustment.
alpha (float): The base interpolation factor for slerp (0 <= alpha <= 1).
beta (float): The strength of the anchor adjustment (0 <= beta <= 1).
kernel_size (int): Size of the Gaussian kernel for smoothing the dissimilarity map (must be odd).
centroid_margin_factor (float): Controls the width of the transition zone between frequency bands during alignment.
frequency_weight (float): Weight given to the frequency-domain contribution when combining aligned and original tensors (0 <= weight <= 1).
use_cross_attention (bool): Whether to use cross-attention to calculate feature importance weights.
Returns:
Tensor: The merged tensor, combining the characteristics of A and B while guided by the anchor C.
Recommendations:
For stable, conservative merges:
kernel_size=3
centroid_margin_factor=0.1
frequency_weight=0.3
For detail preservation:
kernel_size=3
centroid_margin_factor=0.08
frequency_weight=0.4
For smoother blending:
kernel_size=5
centroid_margin_factor=0.15
frequency_weight=0.25
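A hypothetical call using the stable, conservative preset (the function name and keywords mirror the parameter docs above; alpha and beta are arbitrary example values, and the exact signature in sd-optim may differ):

```python
merged = multi_domain_alignment(
    a, b, c,
    alpha=0.5,                   # base slerp factor (example value)
    beta=0.3,                    # anchor adjustment strength (example value)
    kernel_size=3,
    centroid_margin_factor=0.1,
    frequency_weight=0.3,
    use_cross_attention=False,
)
```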
Basically: project A onto B, optionally using C as the foundational base to create deltas from.
- Build a low-rank version of model A, then take the residual of the original against it: A_diff = A - A_lowrank
- Project A_diff onto the orthogonal basis of model B
- Add the projected residual onto model B
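A rough sketch of these steps for a single 2-D weight matrix, assuming "orthogonal basis" means removing the part of A_diff that already lies in B's column space; the function name and rank are illustrative:

```python
import torch

def lowrank_orthogonal_merge(a: torch.Tensor, b: torch.Tensor, rank: int = 16) -> torch.Tensor:
    # 1. Low-rank approximation of A via truncated SVD
    u, s, vh = torch.linalg.svd(a, full_matrices=False)
    a_lowrank = u[:, :rank] @ torch.diag(s[:rank]) @ vh[:rank, :]
    # 2. Residual detail that the low-rank version misses
    a_diff = a - a_lowrank
    # 3. Orthonormal basis for B's column space
    q, _ = torch.linalg.qr(b)
    # 4. Keep only the part of the residual orthogonal to that basis
    a_diff_orth = a_diff - q @ (q.T @ a_diff)
    # 5. Graft the orthogonal component onto B
    return b + a_diff_orth
```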
ties_sum_with_dropout
For Similar Models (e.g., models from the same family/fine-tuning):
probability: [0.85, 0.95] # High keep rate since weights are likely meaningful
della_eps: [0.05, 0.15] # Smaller adjustments needed since models are similar
rescale: [0.9, 1.1] # Gentle rescaling since we expect coherent results
k: 0.2 # Standard TIES threshold
vote_sgn: 0.0 # Use actual values for voting
apply_stock: 0.0 # Stock averaging not critical for similar models
apply_median: 1.0 # Geometric median helps with outliers
eps: 1e-6 # Standard numerical stability
For Different Models (e.g., different architectures/training):
probability: [0.7, 0.85] # Lower keep rate to be more selective
della_eps: [0.15, 0.25] # Larger magnitude-based adjustments to prefer strong weights
rescale: [0.8, 1.2] # Wider rescaling range to handle varying distributions
k: 0.25 # Slightly higher threshold to be more selective
vote_sgn: 0.0 # Still use actual values
apply_stock: 1.0 # Enable stock averaging to handle different distributions
apply_median: 1.0 # Definitely use geometric median for different models
eps: 1e-5 # Slightly more permissive for numerical stability
Regarding della_eps:
Positive values (recommended in most cases):
- Increases keep probability for high-magnitude weights
- Decreases keep probability for low-magnitude weights
- Good for preserving model structure and important features
- Example: della_eps = 0.1 means up to ±10% adjustment to keep probabilities
Negative values (specialized cases):
- Reverses the magnitude-based selection
- Could be useful when you want to:
- Encourage exploration of alternative pathways in the network
- Reduce dominance of very strong weights that might be overfitted
- Balance models where one has systematically larger magnitudes
- I'd suggest small negative values if used: [-0.1, -0.05]
- Use with higher base probability to maintain stability
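A toy sketch of the magnitude-rank adjustment described above; the exact rank-to-probability mapping in the real implementation may differ:

```python
import torch

def della_dropout(delta: torch.Tensor, probability: float = 0.9,
                  della_eps: float = 0.1, rescale: float = 1.0) -> torch.Tensor:
    flat = delta.flatten()
    n = flat.numel()
    # Rank each parameter by magnitude, normalized to [0, 1] (largest -> 1)
    ranks = torch.argsort(torch.argsort(flat.abs())).float() / max(n - 1, 1)
    # Shift the keep probability by up to +/- della_eps based on rank
    p_keep = (probability + della_eps * (ranks - 0.5) * 2).clamp(0.0, 1.0)
    mask = torch.bernoulli(p_keep)
    # Inverse-probability rescaling (as in dropout), plus the rescale factor
    kept = flat * mask / p_keep.clamp_min(1e-8) * rescale
    return kept.reshape(delta.shape)
```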
For practical use:
- Start with similar model settings
- If results are too noisy: increase probability, decrease della_eps
- If results are too conservative: decrease probability, increase della_eps
ties_sum_extended
Memory-efficient TIES (TrIm, Elect Sign & Merge) implementation with optimized chunking.
Parameters:
models : Tensor | LiftFlag[MergeSpace.DELTA]
Input model tensors or delta flags to be merged. Must provide at least one model.
k : Hyper, default=0.218
The proportion of parameters to keep (1-k is the actual filter threshold).
Range: [0, 1]. Higher values retain more parameters.
vote_sgn : Hyper, default=0.0
Controls voting mechanism:
- If <= 0.0: Uses actual parameter values for voting
- If > 0.0: Uses only parameter signs for voting
Best used when parameter magnitudes vary significantly between models.
apply_stock : Hyper, default=0.0
Controls model stock computation:
- If <= 0.0: Disabled
- If > 0.0: Enables model stock calculation using cosine similarity
Useful when dealing with potentially contradictory updates.
cos_eps : Hyper, default=1e-6
Epsilon value for cosine similarity calculation stability.
Only relevant when apply_stock > 0.0.
apply_median : Hyper, default=1.0
Controls merge strategy:
- If <= 0.0: Uses weighted average
- If > 0.0: Uses geometric median
Geometric median is more robust to outliers but computationally intensive.
eps : Hyper, default=1e-6
Small constant for numerical stability in various calculations.
maxiter : Hyper, default=150
Maximum number of iterations for geometric median computation.
Only relevant when apply_median > 0.0.
ftol : Hyper, default=1e-22
Convergence tolerance for geometric median iteration.
Only relevant when apply_median > 0.0.
weight_decay : Hyper, default=0.0218
Parameter decay factor applied to filtered values.
Range: [0, 1]. Higher values cause stronger decay.
min_agreement : Hyper, default=0.3
Minimum proportion of models that must agree on parameter sign.
Range: [0, 1]. Higher values require stronger consensus.
chunk_size : int, default=16
Number of models to process simultaneously for memory efficiency.
Adjust based on available GPU memory.
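An illustrative reduction of the core trim / elect-sign / merge pipeline over stacked deltas; it omits dropout, model stock, the geometric median, and chunking, and the masking details are assumptions:

```python
import torch

def ties_core(deltas: torch.Tensor, k: float = 0.218, vote_sgn: float = 0.0,
              min_agreement: float = 0.3) -> torch.Tensor:
    # deltas: (n_models, n_params) stacked task vectors (deltas from a base)
    n_models, n_params = deltas.shape
    # Trim: zero everything below each model's top-k magnitude threshold
    n_keep = max(int(k * n_params), 1)
    threshold = deltas.abs().topk(n_keep, dim=1).values[:, -1:]
    trimmed = torch.where(deltas.abs() >= threshold, deltas, torch.zeros_like(deltas))
    # Elect sign: vote with raw values (vote_sgn <= 0) or signs only (vote_sgn > 0)
    votes = trimmed.sign() if vote_sgn > 0 else trimmed
    elected = votes.sum(dim=0).sign()
    # Require a minimum fraction of models to back the elected sign
    agrees = (trimmed.sign() == elected) & (trimmed != 0)
    agreement = agrees.float().mean(dim=0)
    mask = agrees & (agreement >= min_agreement)
    # Merge: average only the entries whose sign matches the election
    counts = mask.float().sum(dim=0).clamp_min(1.0)
    return (trimmed * mask).sum(dim=0) / counts
```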
Regarding key parameter usage:
vote_sgn:
- Use > 0.0 when:
- Models come from different training conditions/architectures
- Parameter magnitudes vary significantly
- You want to focus on directional agreement rather than magnitude
- Keep at 0.0 when:
- Models are from similar conditions
- Parameter magnitudes are comparable
- You want to preserve magnitude information
apply_stock:
- Use > 0.0 when:
- Dealing with potentially contradictory updates
- Models might have conflicting parameter changes
- You want to detect and handle model disagreement
- Keep at 0.0 when:
- Models are expected to be mostly aligned
- You want faster computation
- Memory is constrained
apply_median:
- Use 1.0 (default) when:
- There might be outlier models
- Models could be corrupted or adversarial
- Robustness is more important than speed
- Set to 0.0 when:
- All models are trusted
- Speed is crucial
- Models are known to be well-behaved
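For reference, a compact Weiszfeld-style iteration for the geometric median that apply_median > 0.0 selects; this is a sketch, and the actual implementation's initialization and convergence test may differ:

```python
import torch

def geometric_median(points: torch.Tensor, maxiter: int = 150,
                     ftol: float = 1e-22, eps: float = 1e-6) -> torch.Tensor:
    # points: (n_models, n_params); each row is one model's parameter vector
    median = points.mean(dim=0)  # initialize at the arithmetic mean
    prev_obj = float("inf")
    for _ in range(maxiter):
        dists = (points - median).norm(dim=1).clamp_min(eps)
        # Weiszfeld update: inverse-distance weighted average of the points
        weights = 1.0 / dists
        median = (weights[:, None] * points).sum(dim=0) / weights.sum()
        obj = dists.sum().item()
        if abs(prev_obj - obj) <= ftol * max(obj, 1.0):  # converged
            break
        prev_obj = obj
    return median
```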