Artificial Intelligence (AI) in Audio Production
Report on a panel discussion of the Berlin district group of the VDT on June 18, 2024
Text: Jens Blauert Images: Georg Fett
In November 2022, the company OpenAI (in which Microsoft is a major investor) published an online chat system, ChatGPT, that generates texts based on instructions (so-called prompts). There is now a wealth of such systems, for which the not entirely accurate term “artificial intelligence” (AI) has been popularized. With regard to speech and music, AI systems have the potential to automate routine processes in the production and editing of audio material and thus to make them very cost-efficient. This development has a radical impact on the fields of activity of sound engineers and other sound creators (see 1). The Berlin district group therefore decided to organize a panel discussion to introduce those affected to the topic of AI.
The panel discussion took place on June 18, 2024 at the University of the Arts (UdK) with four invited panelists [3 to 6] and more than 50 attentive participants. The aim was to promote an exchange of experiences with audio professionals who already use AI in their professional work. The event was opened with the following statements by Jens Blauert1:
The discussions began by considering what the essence of a tool actually is. The prototype chosen for this was the axe, a tool that humans have used for thousands of years. Its uses vary, e.g., chopping wood, cutting up bones, gaining access, attack and defense. However, it can also be used to injure oneself, for example by chopping off a finger. So, as with any tool: you have to master it to use it productively!
In the field of audio production, several types of AI tools are particularly worth mentioning; they are discussed in the course of this report.
As a basis for further discussion, Fabian Seipel6 provided an introduction to the technical principles of AI systems. The current rapid breakthrough is due to the availability of massive data corpora in the digital space − in the case of audio-related AI systems, speech and music examples − as well as the computing power available for their rapid processing. AI systems are based on machine learning, which benefits in particular from the use of artificial neural networks.
In machine learning, a learning algorithm maps an existing data collection onto a neural network. This process of transferring the data into the model is called training. For this, the elements of the data collection are first labelled, i.e. given descriptions that symbolize their respective contents. Depending on the size of the data corpus, the training process can be very lengthy. Note, however, that for humans, learning from experience may take much longer!
The behavior of the trained model can be described analytically by a complex mathematical function, the Machine Learning Function (MLF). This MLF can then be used to take decisions or trigger actions, for example, classifying music genres, detecting errors in monitored systems, or grading the quality of products.
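As a much-simplified illustration of the training-and-use cycle described above, the following Python sketch trains a small neural network on labelled (entirely synthetic) feature vectors and then uses the resulting model (the “MLF” in the above terminology) to classify a new excerpt. The feature values and genre labels are invented for this example; real systems work on far larger corpora and far richer features.

```python
# Toy illustration: training maps labelled data onto a neural network;
# the trained model then acts as the Machine Learning Function (MLF).
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)

# Hypothetical labelled corpus: each row is a feature vector extracted from
# an audio excerpt (tempo, spectral centroid, ...), each label a genre.
features = np.vstack([
    rng.normal(loc=0.0, scale=1.0, size=(100, 8)),   # excerpts labelled "ambient"
    rng.normal(loc=3.0, scale=1.0, size=(100, 8)),   # excerpts labelled "techno"
])
labels = ["ambient"] * 100 + ["techno"] * 100

# Training: the learning algorithm adjusts the network weights so that the
# model reproduces the labels of the training data.
model = MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000, random_state=0)
model.fit(features, labels)

# Use of the trained model (the "MLF"): classify a previously unseen excerpt.
new_excerpt = rng.normal(loc=2.8, scale=1.0, size=(1, 8))
print(model.predict(new_excerpt))   # -> most likely "techno"
```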
Generative AI systems, however, require additional processing steps (see e.g. 17). Generating the output of such a system begins with an input text or speech command that verbally describes the request to the system, the so-called prompt. Prompts can be very detailed; writing them is a skill in itself, and prompts can even give rise to copyright. The exact processing of prompts is system-specific, but in general the following applies: the system processes the prompt step by step. For each step, it generates an output section based on its learned data. The chain of all these sections forms the overall output.
This output is then optimized by making the original prompt more targeted – either by an internal system module (unsupervised learning) or by the users themselves (supervised learning). The output is then regenerated. This loop can be repeated several times.
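To make the step-by-step generation and the subsequent optimization loop more concrete, here is a schematic Python sketch. The functions generate_step and score_output are stand-ins invented for this illustration; in a real system they correspond to the learned model and to an internal quality module or a human judgement, respectively.

```python
# Schematic sketch of prompt-driven, step-by-step generation followed by a
# refinement loop. generate_step and score_output are invented stand-ins for
# the learned model and the quality check of a real system.

def generate_step(prompt, step):
    # Stand-in: the model produces one output section from its learned data.
    return f"[section {step + 1} generated for prompt: {prompt!r}]"

def generate(prompt, n_steps=3):
    # The chain of all sections forms the overall output.
    return [generate_step(prompt, step) for step in range(n_steps)]

def score_output(sections, prompt):
    # Stand-in for an internal quality module or a human judgement (0..1):
    # here, longer (more targeted) prompts simply score higher.
    return min(1.0, len(prompt) / 80)

prompt = "calm piano piece, 60 bpm, about two minutes"
for attempt in range(5):                      # the optimization loop
    output = generate(prompt)
    if score_output(output, prompt) >= 0.8:   # good enough: stop refining
        break
    prompt += ", with more detail on mood"    # make the prompt more targeted

print("\n".join(output))
```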
The presentation by Fabian Seipel6 was followed by a lively and at times skeptical discussion. A contribution by Nirto Karsten Fischer7 deserves to be highlighted: human/AI interaction carries the risk that humans adapt to the way machines “think”, i.e. limit themselves to a strictly positivist view. Human-to-human communication, however, is more complex: it involves a much more extensive repertoire of meanings (keyword: semiotics).
The primary goal of the panel discussion was to have colleagues who already use AI for their sound engineering tasks report on their experiences. Johannes Imort3, Peter Hirscher4, Martin Rieger5, Marian Boldt8, and several other attendees took part in the discussion. A show of hands revealed that a majority of the audience had already tried out such tools. In this context, references to relevant AI products were given 18. It was noted that these systems require data on the original audio tracks and on mixing results achieved by experts. However, the scope and origin of such data is usually not communicated.
The consensus is that AI mixing systems, with their built-in know-how, are useful when one is not acquainted with compressors, limiters, and EQs. Their main advantage, however, is the time they save when mixing. This becomes problematic when a nontransparent algorithm is used: it is then difficult to assess the quality of the output − let alone make changes to it.
Systems that do not perform the mixing themselves but instead suggest parameters for the mixing process are also attractive. The sound engineers then carry out the actual mixing themselves, which expands their creative scope. It is also possible, for example, to reduce background noise such as breathing, reverberation, traffic noise, ventilation noise, or crackling.
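As an illustration of the kind of background-noise reduction mentioned above, here is a much-simplified Python sketch that applies plain spectral gating to a synthetic signal. Commercial AI denoisers rely on trained models and are far more sophisticated; this only shows the underlying principle of attenuating time-frequency bins that do not rise above an estimated noise floor.

```python
# Much-simplified spectral-gating denoiser on a synthetic signal.
# Real AI denoisers use trained models; this only illustrates the principle.
import numpy as np
from scipy.signal import stft, istft

fs = 48_000
t = np.arange(2 * fs) / fs
clean = 0.5 * np.sin(2 * np.pi * 440 * t)        # stand-in for the wanted signal
noise = 0.05 * np.random.default_rng(0).standard_normal(t.size)
noisy = clean + noise

# Short-time Fourier transform of the noisy signal
f, frames, Z = stft(noisy, fs=fs, nperseg=1024)

# Estimate the noise floor per frequency bin from the quietest frames
noise_floor = np.quantile(np.abs(Z), 0.10, axis=1, keepdims=True)

# Gate: strongly attenuate bins that do not rise clearly above the noise floor
gain = np.where(np.abs(Z) > 3.0 * noise_floor, 1.0, 0.1)
_, denoised = istft(Z * gain, fs=fs, nperseg=1024)
```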
AI systems for source separation have recently been developed to a commercial level. Separated sources are a prerequisite for object-oriented mixing, e.g., when creating immersive audio and/or constructing virtual auditory spaces.
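The report does not tie itself to specific products; purely as an illustration, the following sketch shows how a freely available source-separation tool, the open-source Spleeter package, can be called from Python (assuming the package and its pretrained models are installed). The separated stems can then be placed individually in an object-oriented or immersive mix.

```python
# Illustrative only: split a finished mix into stems with the open-source
# Spleeter package (not a product named or endorsed by the panel).
from spleeter.separator import Separator

# '4stems' separates the mix into vocals, drums, bass, and other.
separator = Separator('spleeter:4stems')

# Writes one audio file per stem into the output directory; the stems can
# then be positioned individually, e.g., for an immersive-audio production.
separator.separate_to_file('mixed_song.wav', 'stems_out/')
```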
Finally, Mr Decker9 and Georg Fett10 raised the issue of whether AI-assisted mixing and mastering systems will have an impact on the training of sound professionals. Indeed, a trained ear may turn out to be less essential than the ability to operate the AI systems efficiently. Question: why, then, should one study to become a Tonmeister at all?
Martin Rieger5 points out the following advantages of AI-supported upmixing:
But there are also disadvantages to consider:
John Mourjopoulos14 writes that automatic mixing is indeed a very critical application for several reasons:
Generative music systems (e.g., Suno, Udio) are very powerful. They set lyrics (your own or ones generated by ChatGPT) to music and create polyphonic soundtracks in radio quality (192 kbit/s MP3).
Our colleague Forster Thomas16 commented succinctly: “Now I'm unemployed!”
The training material is usually not disclosed; it is suspected that the commercial repertoires of the major music publishers have been tapped for this purpose. In the prompt, the user gives a few instructions on the genre and emotional expression of the desired composition. The text is then performed by a male or female singer, a male or female choir, or a musical instrument of your choice, and accompanied by a band if desired. With special cloning software (e.g. 19) any singing voice can be used – even your own.
Compared to the challenges of AI-based mixing, AI-based composition and music generation for machine-based applications is relatively straightforward14, because there is a large amount of training material, classified in a well-defined hierarchical structure (genre, tempo, orchestration, harmonic and rhythmic properties, mood, style, etc.). There is also a large body of work in MIR (Music Information Retrieval) and SA (Semantic Audio) that provides relevant methods, tools, and datasets.
If the authors of the training material are not disclosed, the sound material produced is currently royalty-free5. However, this could change due to EU legislation that will require disclosure of sources; the owners of the sources could then contractually restrict their use.
By the way, it may happen that the AI creates a composition that is so close to a natural original that it can be classified as plagiarism. The burden of proof lies with the victims. They can complain to Suno. The system then does not output the AI version, but rather a link to the original on YouTube. This occurs, e.g., with songs by well-known pop stars.
The topic of generative composition systems sparked a lively debate. Peter Hirscher4 started with the frequently asked question of whether AI will ultimately replace humans and suggested considering the following: AI creates many use cases that people would never have thought of. For example, it allows purpose-built songs (Scalable Audio) to be created that humans could not produce as quickly and inexpensively. This opens the opportunity to create songs that no one else would produce, but which serve an entertaining purpose.
Understandably, the biggest concern for audio professionals is how they will be able to earn their money in the future. Peter Hirscher4 emphatically points out that the audio industry is undergoing rapid change. Soon, for example, there will no longer be studios in the traditional sense. Furthermore, the entire exploitation chain is affected, from composers, musicians, and recording engineers, through mixing and mastering engineers, to distribution and marketing experts. Many sound professionals will therefore have to rethink their business model. Incidentally, it does not necessarily have to be that of a Tonmeister!
A thesis which is often put forward is the following:
"Only people are capable of creativity, machines are not."
From this statement, one could deduce that man-made products are more valuable than machine-made ones, which would be a market advantage for their producers. However, is individual art really a market-powerful factor in sound production? Simon Hestermann17 writes: “The economic advantage of good AI cannot be denied. Why should a label pay artists to produce music for workout playlists on Spotify when AI delivers the right music en masse? Why should a music audience look for artists for a candlelight dinner when AI composes personalized music for the evening? Thus, the question is no longer whether, but when technology will present the music industry with a fait accompli. We should not rely on the morality and appreciation of the masses, as the streaming age has already taught us. Ultimately, consumer convenience wins!”
This quote was followed by the question of whether only humans are capable of creativity, or whether machines can also be creative. The following thoughts were put forward by Jens Blauert1: AI is known to be able to hallucinate, i.e. to imagine something that does not stand up to current fact-checking. AI can therefore imagine. Could such imagined ideas not contain useful suggestions for the future? Isn’t it true that humans do the same thing when being creative, namely, they start using their imagination?
AI systems can even be programmed with the intention of producing mistakes, i.e. of coming up with something that did not exist before, and can thus become creativity generators 13. Moreover, AI can generate such creative ideas very quickly and in great variety.
Evidently, young people in particular are increasingly attracted to live events, presumably to escape from the omnipresent artificial creativity that noticeably lacks empathy 7. However, live events are often not primarily about the content, but about the community experience, i.e. the social event of togetherness. AI music, i.e. consumer art, is often sufficient for this purpose. After all, it is cheaper than paying the exorbitant ticket prices for pop-star concerts.
However, the change triggered by AI has far-reaching social consequences. In short: the massive use of capital by big investors leads to the “expropriation” of individual sound creators7. It is likely that a new social contract is required to tackle this problem in the future4.
The final contributions to this set of questions were the following: Theodor Przybilla11 does not rule out the need for an updated social contract but takes a more optimistic view of the problem. He trusts in the human ability to use new technologies for our benefit and to create adequate conditions for doing so. Pilos Kostas12 reports that the reception of AI audio products depends on the audience's level of experience. For example, children are often better able to identify AI songs as such than adults (see 20). Might it be that, in our society, we are permanently and involuntarily being conditioned to become accustomed to consumer art?
Current AI development is leading to AI systems becoming increasingly autonomous (e.g., GPT-4o from OpenAI, in which Microsoft holds a major stake). This is referred to as Artificial General Intelligence (AGI). Jens Blauert1 explained the following as an introduction: “
These highly autonomous systems can perceive and interpret their environment via cameras and microphones, communicate with other AI systems (e.g., talk to them), and carry out actions independently. Current areas of application include autonomous driving and drones that recognize patterns in infrared images, identify potential enemies, and then engage them autonomously. In the audio realm, AGI systems can observe Spotify or YouTube channels, or simply a park landscape, and create music to match.”
While Microsoft is promoting its new product, some of the previous developers have distanced themselves from it. Security expert Less Wrong said something like this:
“I left the development team because I lost confidence that Microsoft will behave responsibly when the time comes for super-intelligent autonomous systems”.
Although these comments aroused the interest of the panel participants, they also left behind a feeling of thoughtfulness and concern, so no further discussion took place.
Audio production is already in a transformation phase in which the role of AI is rapidly increasing. As emphasized in the introductory statement, AI is undoubtedly a useful tool for audio production. However, only those who have mastered its use will be able to use this tool productively. The participants of this panel therefore parted in the certainty that this topic needs to be discussed in detail, and that this will remain so. The closing statement thus was:
Good that we talked about it!
Jens Blauert is professor emeritus at Ruhr-University Bochum, Germany, where he founded the Institute for Communication Acoustics (IKA) and headed it for 29 years. He was a visiting professor in various countries worldwide. For example, he lectured on architectural acoustics for 10 years at the Rensselaer Polytechnic Institute (RPI) in Troy, N.Y. He was a self-employed acoustical consultant for more than 40 years. Jens is an honorary member of the VDT.