Inconsistent instrument separation in long tracks with local inference

I have two problems when running the `sam-audio-large` model locally for instrument segmentation in music tracks, using just text prompts and audio input. I'm running SAM Audio with songs between 2 and 5 minutes, splitting each song into 10-second crops.

1. Because each song is split into crops, the prompted instrument can be absent from some of them. When it is absent, the model can output random audio from the track instead of silence. By contrast, when segmenting music on the website (https://aidemos.meta.com/segment-anything/editor/segment-audio), if the prompted instrument isn't present in the input audio, the target is silent or quiet noise.

2. When the instrument is present in the crop, ~10-20% of the crop segmentations still fail in one of the following ways:
  a. fails to segment (fully misses the instrument)
  b. segments the wrong instrument
  c. switches between segmenting a single instrument and multiple instruments
No number of reranking candidates or crop duration has fully fixed these issues. Using 2-3 reranking candidates works better than 1, but I don't see much improvement beyond that.
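
To put a number on problem 1, a per-crop energy ratio between the separated target and the input mixture can help. This is only a minimal sketch: `target_to_mix_db` is a hypothetical helper name (not part of SAM Audio), and it assumes the targets and mixtures are tensors with time on the last axis.

```python
import torch


def target_to_mix_db(target: torch.Tensor, mixture: torch.Tensor, eps: float = 1e-10) -> torch.Tensor:
    """Ratio of target energy to mixture energy in dB, one value per crop.

    Hypothetical helper for illustration; time is assumed to be the last axis.
    """
    target_power = target.pow(2).mean(dim=-1)
    mixture_power = mixture.pow(2).mean(dim=-1)
    # eps keeps the ratio finite when a crop is fully silent.
    return 10 * torch.log10((target_power + eps) / (mixture_power + eps))
```

For a crop where the prompted instrument is known to be absent, a strongly negative value would mean a near-silent target, while a value close to 0 dB would mean most of the mixture leaked into the target.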

- Are there any parameters or different prompts I can use to mitigate these issues?
- Is there a specific recommended strategy for maintaining consistency for longer songs? (I'm now using multi-diffusion on the chunks as well, without much difference)
- How is the website demo version configured? Does it use specific thresholds or post-processing? (It doesn't seem to have the same issues.)
- Is the model able to understand plural inputs like "guitars"? Or is it more sensitive to some instruments? (E.g. in my current tries with the prompt "guitars", some crops segment only the acoustic guitar, and others both the acoustic and electric guitar.)

Model and settings I've tried:
- `sam-audio-large` and `sam-audio-large-tv`
- Text prompt only for audio files (visual model disabled)
- Songs split into crops of between 10 and 80 seconds
- Batch sizes between 1 and 8 crops
- `predict_spans=False`
- `reranking_candidates` between 1 and 12
  - reranking with only CLAP
  - reranking with both CLAP and SAJ
  - changed the CLAP vs SAJ ratios

Prompts I've tried:
- guitar
- guitars
- electric guitar
- piano
- synth
- synthesizer
- saxophone
- drums

Hardware & OS:
- NVIDIA RTX 6000 Ada (48GB)
- Ubuntu 24.04

Inference code:
```python
import torch
from tqdm import tqdm

# `source`, `num_samples`, `BATCH_SIZE`, `DESCRIPTION`, `processor`, `model`
# and `device` are assumed to be defined earlier in the script.

# Split the track into fixed-length crops and zero-pad the last one.
inputs = list(torch.split(source, num_samples, dim=-1))
if inputs[-1].shape[-1] < num_samples:
    inputs[-1] = torch.nn.functional.pad(inputs[-1], (0, num_samples - inputs[-1].shape[-1]))

# Group the crops into batches.
batches = [inputs[i:i + BATCH_SIZE] for i in range(0, len(inputs), BATCH_SIZE)]

targets = []
residuals = []
with torch.inference_mode():
    for batch_inputs in tqdm(batches, desc="Processing batches", unit="batch"):
        # Same text prompt for every crop in the batch.
        batch = processor(
            audios=batch_inputs,
            descriptions=[DESCRIPTION] * len(batch_inputs),
        ).to(device)

        result = model.separate(batch, predict_spans=False, reranking_candidates=5)

        targets.extend(result.target)
        residuals.extend(result.residual)

# Concatenate the crops along time and trim the padding from the last crop.
full_length = source.shape[-1]
target = torch.cat(targets, dim=-1).clamp(-1, 1).cpu()[..., :full_length]
residual = torch.cat(residuals, dim=-1).clamp(-1, 1).cpu()[..., :full_length]
```
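
The code above concatenates non-overlapping crops. For comparison, an overlapping variant with a linear crossfade could look roughly like the sketch below. It is not my multi-diffusion setup and not a SAM Audio API: `overlap_add_separate` and `separate_fn` are hypothetical names, and `separate_fn` stands in for a model call that returns a tensor of the same shape as its input chunk.

```python
import torch


def overlap_add_separate(source, separate_fn, chunk_len, overlap):
    """Run separate_fn on overlapping chunks and crossfade the results.

    source: tensor of shape (..., T).
    separate_fn: hypothetical callable mapping a (..., chunk_len) chunk to a
        tensor of the same shape (a stand-in for the model call).
    """
    assert 0 < overlap <= chunk_len // 2
    hop = chunk_len - overlap
    total = source.shape[-1]

    # Zero-pad so that the chunks tile the signal exactly.
    n_chunks = max(1, -(-(total - overlap) // hop))  # ceil division
    padded_len = n_chunks * hop + overlap
    padded = torch.nn.functional.pad(source, (0, padded_len - total))

    # Linear fade-in and fade-out over the overlap, flat in between.
    # The ramp excludes 0 so the accumulated weight is never zero.
    ramp = torch.linspace(0, 1, overlap + 2, dtype=source.dtype, device=source.device)[1:-1]
    window = torch.ones(chunk_len, dtype=source.dtype, device=source.device)
    window[:overlap] = ramp
    window[-overlap:] = ramp.flip(0)

    out = torch.zeros_like(padded)
    weight = torch.zeros(padded_len, dtype=source.dtype, device=source.device)
    for i in range(n_chunks):
        start = i * hop
        chunk = padded[..., start:start + chunk_len]
        out[..., start:start + chunk_len] += separate_fn(chunk) * window
        weight[start:start + chunk_len] += window

    # Normalise by the summed window weights and trim the padding.
    return (out / weight)[..., :total]
```

With an identity `separate_fn` this reconstructs the input, which is a quick sanity check for the windowing. It smooths chunk boundaries, but it would not by itself stop the model from picking a different instrument in neighbouring chunks.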
