Inconsistent instrument separation in long tracks with local inference #94

@Cactooz

I have two problems when running the sam-audio-large model locally for instrument segmentation in music tracks, using only text prompts and audio input. I'm running SAM Audio on songs between 2 and 5 minutes long, splitting each song into 10-second crops.

  1. Because the song is split into crops, an instrument may be absent from some of them. When the prompted instrument is absent, the model can output random audio from the track instead of silence. By contrast, when segmenting music on the website (https://aidemos.meta.com/segment-anything/editor/segment-audio), if the prompted instrument isn't present in the input audio, the target is silent or contains only quiet noise.

  2. When the instrument is present in the crop, roughly 10-20% of the crop segmentations still fail in one of the following ways:
    a. the model fails to segment (misses the instrument entirely)
    b. it segments the wrong instrument
    c. it switches between segmenting a single instrument and multiple instruments
    No number of reranking candidates or choice of crop duration has fully fixed these issues. Using 2-3 reranking candidates is better than 1, but I don't see much improvement beyond that.

  • Are there any parameters or alternative prompts I can use to mitigate these issues?
  • Is there a recommended strategy for maintaining consistency across longer songs? (I'm now also using multi-diffusion on the chunks, without much difference.)
  • How is the website demo configured? Does it use specific thresholds or post-processing? (It doesn't seem to have the same issues.)
  • Does the model understand plural inputs like "guitars"? Or is it more sensitive to some instruments? (E.g. in my current attempts with the prompt "guitars", some crops segment only the acoustic guitar and others both the acoustic and electric guitar.)

Model and settings I've tried:

  • sam-audio-large and sam-audio-large-tv
  • Text prompt only for audio files (visual model disabled)
  • Songs split into crops of between 10 and 80 seconds
  • Batch sizes between 1 and 8 crops
  • predict_spans=False
  • reranking_candidates between 1 and 12
    • reranking with only CLAP
    • reranking with both CLAP and SAJ
    • changed the CLAP vs SAJ ratios

Prompts I've tried:

  • guitar
  • guitars
  • electric guitar
  • piano
  • synth
  • synthesizer
  • saxophone
  • drums

Hardware & OS:

  • NVIDIA RTX 6000 Ada (48GB)
  • Ubuntu 24.04

Inference code:

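import torch
from tqdm import tqdm

# `source`, `num_samples`, `BATCH_SIZE`, `DESCRIPTION`, `processor`, `model`
# and `device` are assumed to be defined earlier in the script.

# Split the track into fixed-length crops and zero-pad the last one to full length.
inputs = list(torch.split(source, num_samples, dim=-1))
if inputs[-1].shape[-1] < num_samples:
    inputs[-1] = torch.nn.functional.pad(inputs[-1], (0, num_samples - inputs[-1].shape[-1]))

# Group the crops into batches of at most BATCH_SIZE.
batches = [inputs[i:i + BATCH_SIZE] for i in range(0, len(inputs), BATCH_SIZE)]

targets = []
residuals = []
with torch.inference_mode():
    for batch_inputs in tqdm(batches, desc="Processing batches", unit="batch"):
        # Every crop in the batch gets the same text prompt.
        batch = processor(
            audios=batch_inputs,
            descriptions=[DESCRIPTION] * len(batch_inputs),
        ).to(device)

        result = model.separate(batch, predict_spans=False, reranking_candidates=5)

        targets.extend(result.target)
        residuals.extend(result.residual)

# Concatenate the per-crop outputs in time and trim the padding added to the last crop.
full_length = source.shape[-1]
target = torch.cat(targets, dim=-1).clamp(-1, 1).cpu()[..., :full_length]
residual = torch.cat(residuals, dim=-1).clamp(-1, 1).cpu()[..., :full_length]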
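
The code above cuts the track into non-overlapping crops and concatenates the outputs with hard boundaries. For comparison, here is a minimal pure-Python sketch of the arithmetic for overlapping crops stitched with a linear crossfade. The helper names `plan_crops`, `crossfade_weights`, and `stitch` are hypothetical and not part of SAM Audio, and the sample counts in the usage note are illustrative only. It shows the chunking and overlap-add bookkeeping, not a confirmed fix for the inconsistencies described above.

```python
def plan_crops(total_samples, crop_samples, overlap_samples):
    """Return (start, end) sample indices of overlapping crops covering the track."""
    if not 0 <= overlap_samples < crop_samples:
        raise ValueError("overlap must be non-negative and smaller than the crop")
    hop = crop_samples - overlap_samples  # distance between consecutive crop starts
    spans = []
    start = 0
    while True:
        end = min(start + crop_samples, total_samples)
        spans.append((start, end))
        if end >= total_samples:
            break
        start += hop
    return spans


def crossfade_weights(n):
    """Linear fade-in weights for an n-sample overlap; the fade-out is 1 - w."""
    return [(i + 1) / (n + 1) for i in range(n)]


def stitch(chunks, overlap_samples):
    """Overlap-add a list of per-crop sample lists using a linear crossfade."""
    out = list(chunks[0])
    for chunk in chunks[1:]:
        # Never blend more samples than either side actually has.
        n = min(overlap_samples, len(chunk), len(out))
        fade_in = crossfade_weights(n)
        tail_start = len(out) - n
        for i in range(n):
            w = fade_in[i]
            out[tail_start + i] = (1 - w) * out[tail_start + i] + w * chunk[i]
        out.extend(chunk[n:])
    return out
```

With illustrative numbers, `plan_crops(25, 10, 2)` yields the spans `(0, 10)`, `(8, 18)`, and `(16, 25)`. Slicing a track at those spans and passing the pieces to `stitch(..., 2)` returns a signal of the original length. In a real pipeline each slice would be separated by the model before stitching.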
