Artificial Intelligence (AI) in Audio Production

Report on a panel discussion of the Berlin district group of the VDT on June 18, 2024


Text: Jens Blauert. Images: Georg Fett


In November 2022, the company OpenAI (in which Microsoft is a major investor) published an online chat system, ChatGPT, that generates texts based on instructions, so-called prompts. There is now a wealth of such systems, for which the not entirely accurate term "artificial intelligence" (AI) has been popularized. With regard to speech and music, AI systems have the potential to automate routine processes in the production and editing of audio material, thus making them very cost-efficient. This development has a radical impact on the fields of activity of sound engineers and other sound creators (see [1]). The Berlin district group therefore decided to organize a panel discussion to introduce those affected to the topic of AI.


The aim was to promote the exchange of experiences with audio professionals who already use AI in their professional work. The event was opened with the following statements by Jens Blauert [1]:

  • Artificial intelligence (AI) is a useful tool for audio production.
  • It poses risks, but it can no longer be stopped.
  • In order to use it effectively, you have to master it.
  • If you do not learn how to do that, you will fall by the wayside.


The panel discussion took place on June 18, 2024 at the University of the Arts (UdK) with four invited panelists [3 to 6] and more than 50 attentive participants.


The discussions began by considering what the essence of a tool actually is. The prototype used for this was the axe, a tool that has served humans for thousands of years. Its uses vary: chopping wood, cutting up bones, gaining access, attack and defense. However, it can also be used to injure oneself, for example by chopping off fingers. So, as with any tool: you have to master it to use it productively!

In the field of audio production, the following tools are particularly worth mentioning:

  1. Digital audio signal processing. Activities started in the early 1970s but only became relevant to the market with the availability of microelectronic chips. This led, among other things, to the development of the CD.
  2. Perceptual coding (e.g., MP3, AAC). It enabled streaming (e.g., YouTube, Spotify) and thus changed the music market substantially.
  3. And now: generative AI systems. Within two years, they have fundamentally revolutionized not just audio production but the entire audio world (see, e.g., [15]).

The Technology of AI

As a basis for further discussion, an introduction to the technical principles of AI systems was provided by Fabian Seipel [6]. The current rapid breakthrough is due to the availability of massive data corpora in the digital space (in the case of audio-related AI systems, speech and music examples) as well as the computing power available for their rapid processing. AI systems are based on machine learning, which benefits in particular from the use of artificial neural networks.



In machine learning, a learning algorithm maps an existing data collection onto a neural network; this process of transferring the data into the model is called training. To this end, the elements of the data collection are first labelled, i.e., given descriptions that symbolize their respective contents. Depending on the size of the data corpus, the training process can be very lengthy. Note, however, that for humans, learning from experience may take much longer!
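
To make the training step concrete, here is a minimal, purely illustrative Python sketch: synthetic feature vectors stand in for audio examples, string labels stand in for the content descriptions, and scikit-learn's MLPClassifier serves as a small artificial neural network. None of this reflects the specific systems discussed on the panel.

```python
# Minimal sketch of supervised training on labelled data (illustrative only).
# The "audio features" and genre labels are synthetic stand-ins.
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)

# Each row is a (hypothetical) feature vector extracted from one audio example.
X = rng.normal(size=(200, 8))
# Labels symbolize the content of each example, as described in the text.
y = rng.choice(["rock", "jazz", "classical"], size=200)

# A small artificial neural network; training = transferring data into the model.
model = MLPClassifier(hidden_layer_sizes=(32,), max_iter=500, random_state=0)
model.fit(X, y)

print(model.predict(X[:3]))  # the trained model now maps features to labels
```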


The function of the trained model can be described analytically by a complex mathematical function, the Machine Learning Function (MLF). This MLF can then be used to make decisions or trigger actions, such as classifying music genres, detecting errors in monitored systems, or grading the quality of products.
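
Continuing the illustrative sketch above: once trained, the model behaves like a single function from input features to a decision, which is what the text calls the MLF. The wrapper below is only a stand-in to make that idea tangible.

```python
# The trained model now acts as one function: features in, decision out.
def mlf(features):
    """Illustrative stand-in for the Machine Learning Function (MLF)."""
    return model.predict(features.reshape(1, -1))[0]

new_example = rng.normal(size=8)   # features of an unseen audio example
print("decided genre:", mlf(new_example))
```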



However, generative AI systems require additional processing steps (see, e.g., [17]). The generation of the output of such a system begins with an input text or speech command that verbally describes the request to the system, the so-called prompt. Prompts can be very detailed, and writing them is a skill in itself; prompts can even give rise to copyright. The exact processing of prompts is system-specific, but in general the following applies: the system processes the prompt step by step. For each step, it generates an output section based on its learned data. The chain of all these sections results in the overall output.
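
The step-by-step processing described here is commonly realized as an autoregressive loop: each step appends one output section conditioned on the prompt and on everything generated so far. The toy sketch below (invented vocabulary, a random choice instead of a learned model) illustrates only the control flow, not any particular product.

```python
# Toy autoregressive loop: each step appends one output section.
# Vocabulary and "scoring" are invented; no real language model is involved.
import random

random.seed(42)
VOCABULARY = ["intro", "verse", "chorus", "bridge", "outro"]

def next_section(prompt, generated):
    """Stand-in for the learned model choosing the next section."""
    # A real system would score continuations of (prompt + generated so far).
    return random.choice(VOCABULARY)

def generate(prompt, steps=6):
    generated = []
    for _ in range(steps):
        generated.append(next_section(prompt, generated))
    return " ".join(generated)   # the chain of sections is the overall output

print(generate("an upbeat pop song about summer"))
```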



This output is then optimized by making the original prompt more targeted, either by an internal system module (unsupervised learning) or by the users themselves (supervised learning). The output is then regenerated, and this loop can be repeated several times.
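
A hedged sketch of this refinement loop, with both helper functions as hypothetical placeholders for the generative system and for the internal module or user:

```python
# Sketch of the prompt-refinement loop; both helpers are hypothetical placeholders.
def generate_output(prompt):
    return f"output for: {prompt}"          # stand-in for the generative system

def refine_prompt(prompt, output):
    return prompt + " (more targeted)"      # stand-in for internal module or user

prompt = "mix a warm-sounding jazz ballad"
for _ in range(3):                          # the loop can be repeated several times
    output = generate_output(prompt)
    prompt = refine_prompt(prompt, output)

print(output)
```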



The presentation by Fabian Seipel [6] was followed by a lively and at times skeptical discussion. A contribution by Nirto Karsten Fischer [7] deserves highlighting: human/AI interaction carries the risk of adapting to the way machines "think", i.e., of limiting oneself to a strictly positivist view. Human-to-human communication, however, is more complex: it involves a much more extensive repertoire of meanings (keyword: semiotics).


The panelists

Fabian Seipel
Johannes Imort
Martin Rieger
Peter Weinsheimer
Peter Hirscher

Mixing and Mastering

The primary goal of the panel discussion was to have colleagues who already use AI for their sound-engineering tasks report on their experiences. Johannes Imort [3], Peter Hirscher [4], Martin Rieger [5], Marian Boldt [8], and several other attendees took part in the discussion. A show of hands revealed that a majority of the audience had already tried out such tools. In this context, references to relevant AI products were given [18]. It was noted that these systems require data on the original audio tracks and on mixing results achieved by experts; however, the scope and origin of such data are usually not communicated.



The consensus was that AI mixing systems, with their built-in know-how, are useful when you are not acquainted with compressors, limiters, and EQs. Their main advantage, however, is the time they save when mixing. This becomes problematic when a non-transparent algorithm is used: it is then difficult to assess the quality of the output, let alone make changes to it.




Systems that do not perform the mixing themselves but instead suggest parameters for the mixing process are also attractive. The sound engineers then carry out the actual mixing themselves, which expands their creative scope. It is also possible, for example, to reduce background noise such as breathing, reverberation, traffic noise, ventilation noise, or crackling (a minimal sketch follows below).
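
To give a flavor of the simplest form of such background-noise reduction, here is a hedged spectral-subtraction sketch in Python with a synthetic signal and a crude noise estimate; commercial AI denoisers are learned models and considerably more capable.

```python
# Minimal spectral-subtraction sketch for broadband noise reduction
# (illustrative only; real AI denoisers are far more sophisticated).
import numpy as np
from scipy.signal import stft, istft

fs = 16000
t = np.arange(fs) / fs
clean = np.sin(2 * np.pi * 440 * t)                        # stand-in "signal"
noisy = clean + 0.3 * np.random.default_rng(0).normal(size=fs)

f, times, Z = stft(noisy, fs=fs, nperseg=512)
noise_floor = np.median(np.abs(Z), axis=1, keepdims=True)  # crude noise estimate
mag = np.maximum(np.abs(Z) - 1.5 * noise_floor, 0.0)       # subtract the floor
Z_denoised = mag * np.exp(1j * np.angle(Z))                # keep original phase

_, denoised = istft(Z_denoised, fs=fs, nperseg=512)
```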

AI systems for source separation have recently been developed to a commercial level. Separated sources are a prerequisite for object-oriented mixing, e.g., when creating immersive audio and/or constructing virtual auditory spaces.
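
At heart, many AI separators follow a masking idea: estimate, per time-frequency bin, how much of the mixture belongs to each source. The sketch below fakes the learned part with a fixed frequency split, which is purely illustrative; real systems estimate such masks with trained neural networks.

```python
# Toy time-frequency masking "separation": low band vs. high band.
# Real AI separators predict these masks with trained neural networks.
import numpy as np
from scipy.signal import stft, istft

fs = 16000
t = np.arange(fs) / fs
mixture = np.sin(2 * np.pi * 110 * t) + np.sin(2 * np.pi * 2000 * t)

f, times, Z = stft(mixture, fs=fs, nperseg=512)
mask_low = (f < 500)[:, None].astype(float)   # crude stand-in for a learned mask
_, bass_like = istft(Z * mask_low, fs=fs, nperseg=512)
_, rest = istft(Z * (1.0 - mask_low), fs=fs, nperseg=512)
```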



Finally, the issue was raised by Mr. Decker [9] and Georg Fett [10] as to whether AI-assisted mixing and mastering systems will have an impact on the training of sound professionals. Indeed, a trained ear may turn out to be less essential than the ability to operate the AI systems efficiently. Question: why, then, should one study to become a Tonmeister at all?

Inserts handed in after the panel session

Martin Rieger [5] points out the following advantages of AI-supported upmixing:

  • Increased efficiency: AI can analyze large quantities of stereo recordings in a very short time and convert them into 3D mixes. This not only saves time, but also resources that would otherwise be spent on lengthy manual mixing processes.

  • Accessibility: Thanks to AI-supported upmixing, the production of immersive audio becomes accessible to a wider target group. Even smaller studios or, e.g., indie artists can now create high-quality 3D mixes without expensive equipment.

  • Future-proof possibilities: People like to say that Dolby Atmos mixes are future-proof, but no technology is. It would be preferable to have an AI that conjures up a state-of-the-art mix and keeps improving it in the future, once we know what is well received, rather than a mix that was made at some point and, a few years later, turns out to have needed different decisions.

  • Personalization: In the end, a lot is a matter of taste. Apple Music currently allows you to turn down the vocals on stereo tracks as a karaoke feature, so that you can sing along yourself. A guitarist would perhaps prefer louder guitars, but everyone is different. With AI stem separation, something like this can already be realized today; it is the approach of object-based audio, only applied backwards to already-produced stereo tracks (see the sketch after this list). If an AI then knows what to do with the individual objects in 3D space, one can imagine more useful results than what we hear today.
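
A hedged sketch of the personalization idea from the last bullet point: once stems have been separated (by whatever AI separator), a personalized remix is just a per-stem gain and a sum. The stems here are synthetic placeholders.

```python
# Per-stem remixing after (hypothetical) AI stem separation.
import numpy as np

fs = 16000
rng = np.random.default_rng(1)
# Placeholder stems; a real system would obtain these from an AI separator.
stems = {
    "vocals": rng.normal(size=fs),
    "guitar": rng.normal(size=fs),
    "drums": rng.normal(size=fs),
}

def remix(stems, gains):
    """Sum the stems with per-listener gains (object-based audio, backwards)."""
    return sum(gains.get(name, 1.0) * signal for name, signal in stems.items())

karaoke = remix(stems, {"vocals": 0.0})      # vocals down for singing along
guitar_fan = remix(stems, {"guitar": 1.5})   # louder guitars, to taste
```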

But there are also disadvantages to consider:



  • Loss of authenticity: One of the biggest concerns is the risk that AI-driven mixes could lose the emotional depth and individuality that manually created 3D mixes offer. The precision of AI could lead to a homogenized sound that dilutes human intuition and creativity.

  • Quality risks: While AI tools for upmixing are getting better, there is a risk of low-quality content flooding the market. This could dilute the experience of immersive audio and diminish the value of carefully crafted, authentic mixes.

  • Technical limitations: Despite advances in AI, the technology is not without its limitations. Complex, multi-layered stereo recordings can pose challenges to algorithms that an experienced sound engineer might be better equipped to handle.


John Mourjopoulos [14] writes that automatic mixing is indeed a very critical application, for several reasons:



  • In multichannel mixing, there are potentially hundreds of parameters that tonmeisters can freely adjust.

  • There is no linear, predictable interaction between these parameters and the audible result, not even within a single audio channel. Rather, it is highly unpredictable how changes to the parameters of a single channel will affect the overall mix and whether they would require adjustments to the parameters in the other channels.

  • As a rule, tonmeisters continuously evaluate the audio quality of the mix and adaptively adjust the relevant parameters. 

  • There is no objective target for the mixed output. Different mixes can be technically equivalent. The choice of the optimal mix is often based on aesthetic considerations, possibly by the artist or producer.

  • In the mixing process, each step is auditively monitored and empirically evaluated under calibrated playback settings.

  • An ideal, fully automated AI mixing system would require a perception/cognition model of an experienced virtual tonmeister.

Generative Composition Systems

Such systems (e.g., Suno, Udio) are very powerful. They set lyrics (your own or ones generated by ChatGPT) to music and create polyphonic soundtracks in radio quality (192 kbit/s MP3).


Our colleague Forster Thomas [16] commented succinctly: “Now I'm unemployed!”

The training material is usually not disclosed. It is suspected that commercial repertoires of the major music publishers have been tapped for this purpose.

In the prompt, the user gives a few instructions on the genre and emotional expression of the desired composition. The text is then performed by a male or female singer, a male or female choir, or a musical instrument of choice, accompanied by a band if desired. With special cloning software (e.g., [19]), any singing voice can be used, even your own.




Compared to the challenges of AI-based mixing, AI-based composition and music generation for machine-based applications is relatively straightforward [14]. For one thing, there is a large amount of training material, classified in a well-defined hierarchical structure (genre, tempo, orchestration, harmonic and rhythmic properties, mood, style, etc.). There is also a large body of work in MIR (Music Information Retrieval) and SA (Semantic Audio) that provides relevant methods, tools, and datasets.

If the authors of the training material are not disclosed, the sound material produced is currently royalty-free [5]. However, this could change due to EU legislation that will require disclosure of sources; the owners of the sources could then contractually restrict their use.



Incidentally, it may happen that the AI creates a composition so close to an existing original that it can be classified as plagiarism. The burden of proof lies with the victims; they can complain to Suno. The system then does not output the AI version but rather a link to the original on YouTube. This occurs, e.g., with songs by well-known pop stars.

The topic of generative composition systems sparked a lively debate. Peter Hirscher [4] started with the frequently asked question of whether AI will ultimately replace humans and suggested considering the following: AI creates many use cases that people would never have thought of. For example, it allows the creation of purpose-built songs (Scalable Audio) that humans cannot create quickly and inexpensively. This opens up the opportunity to create songs that no one else would produce, but which serve an entertaining purpose.


Understandably, the biggest concern for audio professionals is how they will be able to earn their money in the future. Peter Hirscher [4] emphatically points out that the audio industry is undergoing rapid change. Soon, for example, there will no longer be studios in the traditional sense. Furthermore, the entire exploitation chain is affected: composers, musicians, and recording engineers, through mixing and mastering engineers, to distribution and marketing experts. Thus, many sound professionals will have to rethink their business model; incidentally, it does not necessarily have to be that of a Tonmeister!

On the role of creativity in the production process

A thesis which is often put forward is the following:

"Only people are capable of creativity, machines are not."

From this statement, one could deduce that man-made products are more valuable than machine-made ones, which would be a market advantage for their producers. However, is individual artistry really a powerful market factor in sound production?

Simon Hestermann [17] writes: “The economic advantage of good AI cannot be denied. Why should a label pay artists to produce music for workout playlists on Spotify when AI delivers the right music en masse? Why should a music audience look for artists for a candlelight dinner when AI composes personalized music for the evening? Thus, the question is no longer whether, but when technology will present the music industry with a fait accompli. We should not rely on the morality and appreciation of the masses, as the streaming age has already taught us. Ultimately, consumer convenience wins!”

This quote was followed by the question of whether only humans are capable of creativity, or whether machines can also be creative. Jens Blauert [1] offered the following thoughts: AI is known to hallucinate, i.e., to imagine things that do not stand up to current fact-checking. AI can therefore imagine. Cannot such imagined ideas contain useful suggestions for the future? And is it not true that humans do the same when being creative, namely, start by using their imagination?



AI systems can even be programmed with the intention of producing mistakes, i.e., of coming up with something that did not exist before, and can thus become creativity generators [13]. Indeed, AI can generate such creative ideas very quickly and in great variety.

Evidently, young people in particular are increasingly drawn to live events, presumably to escape the omnipresent artificial creativity that noticeably lacks empathy [7]. However, live events are often not primarily about the content but about the community experience, i.e., the social event of togetherness. AI music, i.e., consumer art, is often sufficient for this purpose; after all, it is cheaper than the exorbitant ticket prices of pop-star concerts.


However, the change triggered by AI has far-reaching social consequences. In short: the massive use of capital by big investors leads to the “expropriation” of individual sound creators [7]. It is likely that a new social contract will be required to tackle this problem in the future [4].

The final contributions to this set of questions were the following: Theodor Przybilla [11] does not rule out the benefit of an updated social contract but takes a more optimistic view of the problem. He trusts in the human ability to use new technologies for our benefit and to create adequate conditions.
Pilos Kostas [12] reports that the reception of AI audio products depends on the audience's level of experience. For example, children are often better able to identify AI songs as such than adults (see [20]). Might it be that our society permanently and involuntarily drills us to become accustomed to consumer art?

Super-Intelligent Autonomous AI Systems

Current AI development is leading to AI systems becoming increasingly autonomous (e.g., GPT-4o from OpenAI). This is referred to as a step toward Artificial General Intelligence (AGI).

Jens Blauert [1] explained the following as an introduction: “These highly autonomous systems can perceive and interpret their environment via cameras and microphones, communicate with other AI systems (e.g., talk to them), and carry out actions independently. Current areas of application include autonomous driving and drones that recognize patterns in infrared images, identify potential enemies, and then fight them autonomously. In the audio realm, AGI systems can observe Spotify or YouTube channels, or simply a park landscape, and create music to match.”

While Microsoft is promoting its new product, some of the previous developers have distanced themselves from it. One security expert, quoted on the forum LessWrong, said words to this effect:

“I left the development team because I lost confidence that Microsoft will behave responsibly when the time comes for super-intelligent autonomous systems”.

Although these comments aroused the interest of the panel participants, they also left a sense of pensiveness and concern; a discussion therefore did not take place.


Conclusion

Audio production is already in a transformation phase in which the role of AI is rapidly increasing. As emphasized in the opening statements, AI is undoubtedly a useful tool for audio production; however, only those who have mastered its use will be able to use this tool productively. The participants of the panel therefore parted in the certainty that this topic needs to be discussed in detail, and that this will remain so. The closing statement thus was:

Good that we talked about it!



Jens Blauert

Jens Blauert is professor emeritus at Ruhr-University Bochum, Germany, where he founded the Institute for Communication Acoustics (IKA) and headed it for 29 years. He was a visiting professor in various countries worldwide. For example, he lectured on architectural acoustics for 10 years at the Rensselaer Polytechnic Institute (RPI) in Troy, N.Y. He was a self-employed acoustical consultant for more than 40 years. Jens is an honorary member of the VDT.