While trying to organize the sound files that I produce in my newest hobby, field recording, I came across the problem of combining the outputs from several multimodal models.
Basically, the problem was this: just asking the model to describe the sound usually made it miss the forest for the trees - it would describe the little bits and pieces but not the whole. A recording from a construction site, for example, would come back as a description of an engine revving, missing the fact that it was a recording of a construction site.
Since I take photos of the recording sites and use them as thumbnails, the obvious next question was whether the photo could be used to guide the generation of the sound description. I didn't find a model that could accept both images and sounds - the only "multimodal" models I could find were text-and-something-else, with "something else" being either images or sound, but never both. Somehow I had to plug the output of the image description model into the input of the sound description model. Since the models' internal embeddings aren't readily accessible, and would most likely be incompatible anyway, I ended up trying the naive approach: the image description model produces English text, so English text is what I'll feed the sound description model.
This actually worked. As an example, I have this sound, recorded at the construction site of Gasklocka 1 at the old Stockholm gas works:

For reference, just feeding the sound to voxtral-mini-latest[a] with the prompt "Describe the sound content in two to three sentences." results in the following, which I think is an example of missing the forest for the trees:
The audio begins with a series of engine revs, followed by a continuous humming or whirring sound, which could be from a fan or a motor. There's also a faint, distant buzzing sound in the background. The engine revs are then repeated, and the humming sound continues, with occasional interruptions.
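That baseline call looks roughly like this - a minimal sketch using the mistralai Python client, where the file name is a placeholder and the exact parameter names may differ from what I actually ran:

import base64
import os

from mistralai import Mistral

client = Mistral(api_key=os.environ["MISTRAL_API_KEY"])

# Base64-encode the recording so it can be sent inline with the request.
with open("recording.mp3", "rb") as f:
    sound_data = base64.b64encode(f.read()).decode()

response = client.chat.complete(
    model="voxtral-mini-latest",
    messages=[{
        "role": "user",
        "content": [
            {"type": "input_audio", "input_audio": sound_data},
            {"type": "text", "text": "Describe the sound content in two to three sentences."},
        ],
    }],
)
print(response.choices[0].message.content)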
Asking Pixtral 12B[b] to describe the image resulted in:
Image of a construction site with a building covered in scaffolding. The building is made of brick and is being renovated. There are trees, a fence, and a gravel area visible. A red sign and barriers are present.
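That description came from a call along these lines - again a sketch, where the model name and prompt are placeholders rather than the exact ones I used, and the photo goes in as a base64 data URL:

# Reuses the client from the sketch above.
with open("site_photo.jpg", "rb") as f:
    photo_data = base64.b64encode(f.read()).decode()

response = client.chat.complete(
    model="pixtral-12b-2409",
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": f"data:image/jpeg;base64,{photo_data}"},
            {"type": "text", "text": "Describe the image in two to three sentences."},
        ],
    }],
)
image_description = response.choices[0].message.content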
Then the following prompt template was used:
{
    "role": "user",
    "content": [
        {
            "type": "input_audio",
            "input_audio": sound_data,
        },
        {
            "type": "text",
            "text": f"A photo taken while recording the sound was described as: {image_description}. \n-----\n Using that as a context, describe the sound content in two to three sentences."
        }
    ]
}
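Assuming that dict is bound to a variable - guided_message is just my name for it here - the guided call is then the same sketch as before, with the new message:

# guided_message is the template above, with sound_data and image_description
# filled in; client is the same Mistral client as in the earlier sketches.
response = client.chat.complete(
    model="voxtral-mini-latest",
    messages=[guided_message],
)
print(response.choices[0].message.content)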
This resulted in:
The sound content is a mix of construction noises, including the hum of machinery, the clanking of metal, and the occasional thud of falling debris. There's also the distant sound of a vehicle driving by, possibly a truck or a car, and the occasional distant siren. The overall atmosphere is one of a bustling construction site, with a variety of sounds contributing to the background noise.
Which I think is a much better description of the sound.
I've only just started playing around with this, so the above isn't anything to write home about - but one interesting result of these models being trained on huge corpora of English is that the universal model-to-model interchange format appears to be, not JSON, not some obscure AI-only binary format, but actually plain English.