Generative machine learning (ML) models can create text, images, videos, audio, and music. These models are valuable in tools for music co-composition and co-production, as well as for enabling non-musicians to take part in the music-making process. My research focuses on understanding ML approaches to music generation and centers on two key questions: What are the limitations of current generative music models? Can these limitations be addressed through incremental improvements, without introducing a new class of generative models?
My dissertation examines techniques, challenges, and applications of ML algorithms for music generation. The research includes projects on musical interfaces for generative models, symbolic music generation, and music generation with a focus on form. Three additional chapters report on my research into dimensionality reduction, performative controllers, and human voice activity detection, offering further insight into technologies relevant to these areas. Given the rapid pace of algorithmic development, many of the specific methods proposed here may soon become outdated. I therefore believe the most significant outcome of my research is identifying the challenges faced by autonomous ML-based music generation models and developing general solutions to them.
In particular, my dissertation presents a novel denoising diffusion probabilistic model (DDPM) for symbolic music generation that offers improved computational efficiency compared to other diffusion methods in the literature. However, ML-based music generation techniques, including my diffusion-based approach, are limited by their small context windows, typically capped at one minute of music. The primary challenge in this field is generating long-form music that rivals human-composed music in form and structure. After hypothesizing why extending the context window alone cannot solve the problem of structural diversity in generated long-form music, owing to the combinatorial variability of large-scale musical structure, I introduce my generalized approach. Subjective and objective evaluations demonstrate a meaningful improvement in musical form over the current generation of generative music models. While this work shows promise, building models that match the quality of great composers remains an exciting and open challenge.
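To make the diffusion terminology concrete, the following is the standard DDPM formulation on which such symbolic-music approaches can build; it shows only the generic forward-noising process and training objective, not necessarily the exact variant or conditioning scheme developed in the dissertation. Here \(\mathbf{x}_0\) denotes a symbolic music representation (for example, a piano-roll or note-embedding matrix), an assumption made purely for illustration:
\[
q(\mathbf{x}_t \mid \mathbf{x}_{t-1}) = \mathcal{N}\!\left(\mathbf{x}_t;\ \sqrt{1-\beta_t}\,\mathbf{x}_{t-1},\ \beta_t \mathbf{I}\right),
\qquad
q(\mathbf{x}_t \mid \mathbf{x}_0) = \mathcal{N}\!\left(\mathbf{x}_t;\ \sqrt{\bar{\alpha}_t}\,\mathbf{x}_0,\ (1-\bar{\alpha}_t)\,\mathbf{I}\right),
\]
where \(\alpha_t = 1 - \beta_t\) and \(\bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s\). A denoising network \(\epsilon_\theta\) is trained with the simplified objective
\[
L_{\text{simple}} = \mathbb{E}_{t,\,\mathbf{x}_0,\,\boldsymbol{\epsilon}}\!\left[\left\lVert \boldsymbol{\epsilon} - \epsilon_\theta\!\left(\sqrt{\bar{\alpha}_t}\,\mathbf{x}_0 + \sqrt{1-\bar{\alpha}_t}\,\boldsymbol{\epsilon},\ t\right)\right\rVert^2\right],
\qquad \boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \mathbf{I}).
\]
Sampling then reverses the noising process step by step from pure noise, which is what makes diffusion models comparatively expensive to run and motivates the computational-efficiency improvements mentioned above.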