Known projects in ML for MIDI and audio - shepherdvovkes/idmlatentspace GitHub Wiki
This project, detailed in the 2025 paper by Ovcharov Vladimir, introduces a generative framework for creating complex electronic music, with a focus on genres like IDM and Dubstep. The core of the research is the hypothesis that high-dimensional latent spaces are essential for capturing the detailed timbral performances that define these styles.
The work provides a comprehensive methodology for building and evaluating models capable of generating not just notes, but the intricate synthesizer parameter automation (MIDI CC messages) that constitutes a primary compositional element in modern electronic music.
A central argument of this research is that conventional generative models, which often use low-dimensional latent spaces (e.g., ≤256D), are fundamentally insufficient for genres rich in timbral detail. This "information bottleneck" discards the high-frequency timbral details necessary for creating authentic-sounding IDM or Dubstep basslines.
The project proposes and outlines a plan to test the efficacy of higher-dimensional latent spaces (384-512D). The rationale is that a larger dimensional space has the capacity to encode the thousands of unique parameter values that create complex textures like "wobbles" and "glitches," allowing the model to generate performances with significant timbral complexity and rhythmic nuance.
- 384 Dimensions: Serves as a robust baseline to test the hypothesis, offering a significant capacity increase over standard models.
- 512 Dimensions: Represents a high-fidelity option intended to capture the most intricate and layered textures, with the research aiming to determine if the performance gains justify the increased complexity.
The framework is built upon a β-Variational Autoencoder (β-VAE) with a Transformer backbone.
To model the tight coupling between notes and timbre, the system uses a unified, time-ordered sequence of events. This allows the Transformer's self-attention mechanism to learn the crucial relationships between a note and its subsequent timbral evolution. The vocabulary includes the following event types (a tokenization sketch follows this list):
- Note Events: NOTE_ON and NOTE_OFF for all 128 pitches, with velocity quantized into 32 bins and represented by a preceding token.
- Time Events: TIME_SHIFT tokens representing the passage of time, quantized to a 1/32nd note grid.
- CC Events: Specific event tokens for 10 monitored MIDI CC controllers, with each controller's value quantized into 128 bins.
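To make the interleaved vocabulary concrete, here is a minimal Python sketch of how such a token set could be enumerated and how a note-plus-CC fragment might be serialized. The token names, the set of monitored controllers, the time-shift range, and the helper functions (`build_vocab`, `quantize`) are illustrative assumptions based on the description above, not the project's actual implementation.

```python
# Illustrative sketch of the interleaved note/CC event vocabulary described above.
# Names, controller numbers, and the time-shift range are assumptions, not project code.

MONITORED_CCS = [1, 2, 5, 7, 10, 11, 71, 74, 91, 93]  # assumed set of 10 controllers
MAX_TIME_SHIFT = 32   # assumption: shifts of 1..32 thirty-second-note steps (one bar)

def build_vocab():
    """Enumerate the token vocabulary: notes, velocities, time shifts, CC events."""
    tokens = []
    tokens += [f"NOTE_ON_{p}" for p in range(128)]          # 128 pitches
    tokens += [f"NOTE_OFF_{p}" for p in range(128)]
    tokens += [f"VELOCITY_{b}" for b in range(32)]           # 32 velocity bins
    tokens += [f"TIME_SHIFT_{s}" for s in range(1, MAX_TIME_SHIFT + 1)]
    for cc in MONITORED_CCS:                                  # 10 controllers x 128 value bins
        tokens += [f"CC_{cc}_{v}" for v in range(128)]
    return {tok: idx for idx, tok in enumerate(tokens)}

VOCAB = build_vocab()

def quantize(value, src_max, n_bins):
    """Map a value in 0..src_max onto n_bins bins."""
    return min(int(value / (src_max + 1) * n_bins), n_bins - 1)

# Example: one bass note whose filter cutoff (CC 74) sweeps while the note sounds.
example_sequence = [
    f"VELOCITY_{quantize(100, 127, 32)}",  # velocity token precedes the note
    "NOTE_ON_36",                          # C1 bass note
    "TIME_SHIFT_4", "CC_74_32",            # cutoff starts low...
    "TIME_SHIFT_4", "CC_74_96",            # ...then opens up (a "wobble"-like movement)
    "TIME_SHIFT_8", "NOTE_OFF_36",
]
example_ids = [VOCAB[t] for t in example_sequence]
print(len(VOCAB), example_ids)
```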
- Core Model: A 6-layer Transformer with 8 attention heads and a model dimension of 512 is used for both the encoder and decoder. The encoder maps the input sequence to a context vector, and the decoder autoregressively generates a new sequence conditioned on a latent vector z.
- Regularization Strategy: To manage the high-dimensional latent space and prevent poor generative quality, a two-pronged approach is used (a minimal sketch of the latent bottleneck, loss, and β schedule follows this list).
  - β-VAE Framework: Enforces a smooth, continuous latent space by penalizing the KL divergence between the learned posterior and a standard normal prior.
  - Cyclical Annealing of β: To prevent the KL-divergence penalty from stifling learning too early, the β coefficient is cyclically increased during training. This encourages the model to first prioritize accurate reconstruction before imposing latent structure.
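The interaction between the latent bottleneck, the β-weighted KL term, and the cyclical schedule can be summarized in a short PyTorch-style sketch. Only the reported hyperparameters (model dimension 512, latent sizes of 384 or 512) come from the description above; the module names, schedule parameters, and the decision to abstract away the 6-layer Transformer backbone are assumptions made for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LatentBottleneck(nn.Module):
    """Maps the encoder's context vector to a Gaussian latent z (384-512D)."""
    def __init__(self, d_model=512, latent_dim=384):
        super().__init__()
        self.to_mu = nn.Linear(d_model, latent_dim)
        self.to_logvar = nn.Linear(d_model, latent_dim)

    def forward(self, context):
        mu, logvar = self.to_mu(context), self.to_logvar(context)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterization trick
        # KL divergence to a standard normal prior, averaged over the batch
        kl = -0.5 * torch.mean(torch.sum(1 + logvar - mu.pow(2) - logvar.exp(), dim=-1))
        return z, kl

def cyclical_beta(step, cycle_len=10_000, beta_max=0.2, ramp_fraction=0.5):
    """Cyclical annealing: beta ramps from 0 to beta_max over the first half of each
    cycle, then stays flat, so reconstruction dominates early in every cycle.
    (cycle_len, beta_max, and ramp_fraction are illustrative, not reported values.)"""
    pos = (step % cycle_len) / cycle_len
    return beta_max * min(pos / ramp_fraction, 1.0)

def vae_loss(decoder_logits, target_tokens, kl, step):
    """Training objective: token-level cross-entropy plus the beta-weighted KL term."""
    recon = F.cross_entropy(decoder_logits.transpose(1, 2), target_tokens)
    return recon + cyclical_beta(step) * kl
```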
A key innovation of this project is its set of genre-specific evaluation metrics, designed to measure what matters in timbre-focused music.
- Objective Metrics (a computation sketch follows the metric lists below):
  - CC Modulation Error (CC-ME): Mean Squared Error between original and reconstructed CC value sequences, directly measuring timbral fidelity.
  - Multi-Resolution STFT Loss (MR-STFT): An audio-based metric that renders the MIDI outputs and compares the perceptual similarity of the original and reconstructed audio waveforms.
- Subjective Metrics:
  - Expert Listening Tests: A panel of producers familiar with IDM and Dubstep will rate generated samples on stylistic authenticity, complexity, and overall quality.
  - Visual Analysis: Plotting the generated CC automation curves to provide visual evidence of the model's ability to create novel and complex patterns.
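A hedged sketch of how the two objective metrics could be computed is shown below. CC-ME follows directly from its definition as an MSE over CC value sequences; the MR-STFT sketch assumes the generated MIDI has already been rendered to audio, and the FFT/hop settings are commonly used defaults rather than the project's reported configuration.

```python
import torch

def cc_modulation_error(cc_original, cc_reconstructed):
    """CC-ME: mean squared error between original and reconstructed CC value
    sequences (tensors of shape [num_controllers, num_steps], values 0-127)."""
    return torch.mean((cc_original.float() - cc_reconstructed.float()) ** 2)

def mr_stft_loss(audio_ref, audio_gen, resolutions=((512, 128), (1024, 256), (2048, 512))):
    """Multi-resolution STFT loss on audio rendered from the original and generated
    MIDI+CC data. Resolutions are (fft_size, hop_length) pairs, assumed defaults."""
    loss = 0.0
    for fft_size, hop in resolutions:
        window = torch.hann_window(fft_size)
        spec_ref = torch.stft(audio_ref, fft_size, hop, window=window, return_complex=True).abs()
        spec_gen = torch.stft(audio_gen, fft_size, hop, window=window, return_complex=True).abs()
        sc = torch.norm(spec_ref - spec_gen) / (torch.norm(spec_ref) + 1e-8)  # spectral convergence
        log_mag = torch.mean(torch.abs(torch.log(spec_ref + 1e-7) - torch.log(spec_gen + 1e-7)))
        loss += sc + log_mag
    return loss / len(resolutions)
```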
This research strategically builds upon a foundation of established work in machine learning and music generation. The project categorizes its relationship with these prior works as Adopted, Improved, or Sufficient.
| Reference | Category | Original Contribution | Project Adaptation / Usage (Pros & Cons) |
|---|---|---|---|
| Kingma & Welling (2013) | Adopted | The core mathematical framework for Variational Autoencoders (VAEs). | Pro: The framework is directly implemented, as it is mathematically sound and well suited to learning continuous latent representations for creative generation. Con: In its original form, it does not specify how to handle the high-information demands of complex audio synthesis. |
| Vaswani et al. (2017) | Adopted | The self-attention mechanism and the Transformer architecture. | Pro: The architecture is adopted directly for its proven strength in modeling long-range dependencies, ideal for linking notes to their timbral evolution. Con: The base architecture is not inherently designed for the specific event types of musical performance. |
| Higgins et al. (2017) | Improved | The β-VAE, which introduces the β parameter to control the trade-off between reconstruction and latent-space regularity. | Pro: Provides a crucial tool for regularization. Con/Enhancement: The original work focused on low-dimensional visual concepts; this project adapts it to high-dimensional (384-512D) musical data, requiring specialized β values and integration with cyclical annealing to preserve critical timbral details. |
| Fu et al. (2019) | Improved | A cyclical annealing schedule for the KL term in VAEs to prevent posterior collapse. | Pro: Provides a sophisticated method to balance reconstruction and regularization during training. Con/Enhancement: This project adapts the schedule specifically for musical data, considering the distinct temporal structures of the note and CC automation streams to maintain coherence. |
| Huang et al. (2018) - Music Transformer | Improved | A MIDI event representation for music generation. | Pro: Established a strong baseline for representing symbolic music. Con/Enhancement: The original representation focused on melody and rhythm; this project significantly extends it by creating an interleaved event stream that integrates high-resolution CC events as first-class citizens, a key innovation for modeling timbre. |
| Roberts et al. (2018) - MusicVAE | Improved | A hierarchical VAE for learning long-term structure and enabling musical interpolation. | Pro: Demonstrated the value of hierarchical structures in music modeling. Con/Enhancement: MusicVAE focused primarily on melodic content; this project adapts the hierarchical concept to a new domain, the relationship between notes and their timbral envelopes, demonstrating coherent interpolation in high-dimensional, CC-aware latent spaces. |
| Engel et al. (2017) - NSynth | Improved | Neural audio synthesis of individual musical notes. | Pro: Pioneered high-quality neural audio synthesis. Con/Enhancement: NSynth's note-level focus cannot capture the continuous, evolving performances that define electronic music; this project extends the goal from single-note synthesis to complete, performance-level synthesis driven by CC automation. |
| Yamamoto et al. (2020) | Improved | A Multi-Resolution STFT (MR-STFT) loss for perceptually aware audio generation. | Pro: A powerful, perceptually relevant audio loss function. Con/Enhancement: The original loss was designed for raw audio generation; this project adapts it as an evaluation metric for synthesizer performance, computing the loss on audio rendered from the generated MIDI+CC data to focus on timbral fidelity. |
| Loshchilov & Hutter (2017) | Sufficient | The AdamW optimizer, which decouples weight decay from the gradient update. | Pro: AdamW is a robust, state-of-the-art optimizer for Transformer and VAE models. Its effectiveness makes direct adoption appropriate without modification. |
| Dhariwal et al. (2020) - Jukebox | Sufficient | A large-scale, hierarchical model for raw-audio music generation. | Pro: Its conceptual paradigm of multi-level generation serves as a solid baseline for comparison. Con: Its discrete latent spaces (VQ-VAE) differ architecturally from this project's continuous latent spaces, which are required for fine-grained timbral control. |
| Agostinelli et al. (2023) - MusicLM | Sufficient | Text-conditional music generation and associated evaluation methodologies. | Pro: Provides a comprehensive, well-validated framework for the subjective assessment of generated music quality. Its protocols for expert listening tests are directly applicable and are adopted for this project's qualitative analysis. |
In summary, the project claims three primary contributions:

1. High-Dimensional Latent Spaces for Music: The project directly challenges the paradigm of using low-dimensional latent spaces for music, arguing, and aiming to demonstrate, that 384-512 dimensions are required to capture the timbral richness of complex electronic genres.
2. Unified Note-CC Representation: It introduces a novel interleaved event stream that treats MIDI notes and high-resolution CC automation as a single, coherent sequence. This is a significant departure from previous works that either ignored CC data or treated it as a separate modality.
3. Genre-Specific Evaluation Metrics: It proposes new metrics, such as CC Modulation Error (CC-ME) and an adapted MR-STFT loss, to directly measure the fidelity of timbral generation, addressing a critical gap where evaluation has previously focused on melodic and harmonic accuracy.