While trying to organize the sound files that I produce in my newest hobby, field recording, I came across the problem of combining the outputs from several multimodal models.
Basically, the problem was this: just asking the model to describe the sound usually made it miss the forest for the trees - it would describe the little bits and pieces but not the whole. A recording from a construction site, for example, would come back as a description of an engine revving, missing the fact that it was a recording of a construction site.
Since I take photos of the recording sites and use them as thumbnails, the obvious next question was whether the photo could be used to guide the generation of the sound description. I didn't find a model that could accept both images and sounds - the only "multimodal" models I could find were text-and-something-else, with "something else" being either images or sound, but never both. Somehow I had to plug the output of the image description model into the input of the sound description model. Since the models' internal embeddings aren't readily accessible, and would most likely be incompatible anyway, I ended up trying the naive approach: the image description model produces English text, so English text is what I'll feed the sound description model.
This actually worked. As an example, I have this sound, recorded at the construction site of Gasklocka 1 at the old Stockholm gas works:

For reference, just feeding the sound to voxtral-mini-latest[a] with the prompt "Describe the sound content in two to three sentences." results in the following, which I think is an example of missing the forest for the trees:
The audio begins with a series of engine revs, followed by a continuous humming or whirring sound, which could be from a fan or a motor. There's also a faint, distant buzzing sound in the background. The engine revs are then repeated, and the humming sound continues, with occasional interruptions.
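That baseline call looks roughly like this - a minimal sketch using the mistralai Python client, where the file name is a placeholder and the exact parameter names may differ from what I actually ran:

import base64
import os

from mistralai import Mistral

client = Mistral(api_key=os.environ["MISTRAL_API_KEY"])

# Base64-encode the recording so it can be sent inline with the request.
with open("recording.mp3", "rb") as f:
    sound_data = base64.b64encode(f.read()).decode()

response = client.chat.complete(
    model="voxtral-mini-latest",
    messages=[{
        "role": "user",
        "content": [
            {"type": "input_audio", "input_audio": sound_data},
            {"type": "text", "text": "Describe the sound content in two to three sentences."},
        ],
    }],
)
print(response.choices[0].message.content)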
Asking Pixtral 12B[b] to describe the image resulted in:
Image of a construction site with a building covered in scaffolding. The building is made of brick and is being renovated. There are trees, a fence, and a gravel area visible. A red sign and barriers are present.
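That description came from a call along these lines - again a sketch, where the model name and prompt are placeholders rather than the exact ones I used, and the photo goes in as a base64 data URL:

# Reuses the client from the sketch above.
with open("site_photo.jpg", "rb") as f:
    photo_data = base64.b64encode(f.read()).decode()

response = client.chat.complete(
    model="pixtral-12b-2409",
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": f"data:image/jpeg;base64,{photo_data}"},
            {"type": "text", "text": "Describe the image in two to three sentences."},
        ],
    }],
)
image_description = response.choices[0].message.content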
Then the following prompt template was used:
{
    "role": "user",
    "content": [
        {
            "type": "input_audio",
            "input_audio": sound_data,
        },
        {
            "type": "text",
            "text": f"A photo taken while recording the sound was described as: {image_description}. \n-----\n Using that as a context, describe the sound content in two to three sentences."
        }
    ]
}
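Assuming that dict is bound to a variable - guided_message is just my name for it here - the guided call is then the same sketch as before, with the new message:

# guided_message is the template above, with sound_data and image_description
# filled in; client is the same Mistral client as in the earlier sketches.
response = client.chat.complete(
    model="voxtral-mini-latest",
    messages=[guided_message],
)
print(response.choices[0].message.content)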
This resulted in:
The sound content is a mix of construction noises, including the hum of machinery, the clanking of metal, and the occasional thud of falling debris. There's also the distant sound of a vehicle driving by, possibly a truck or a car, and the occasional distant siren. The overall atmosphere is one of a bustling construction site, with a variety of sounds contributing to the background noise.
Which I think is a much better description of the sound.
I've only just started playing around with this, so the above isn't anything to write home about - but one interesting result of these models being trained on huge corpora of English is that the universal model-to-model interchange format appears to be, not JSON, not some obscure AI-only binary format, but actually plain English.