Canonical pitch-shifting algorithms in functional JavaScript.
Frequency-domain: vocoder, phaseLock, transient, formant, sms, hpss.
Time-domain: ola, wsola, psola, granular.
Consistent unified API: batch, stream, multi-channel.
Part of the audiojs ecosystem.
npm install pitch-shift

import transient from 'pitch-shift/transient.js'
// Batch
let pitched = transient(audio, { semitones: 5 })
// Stream
let write = transient({ ratio: 1.5 })
let output = write(inputBlock)
let tail = write() // flush
// Stereo
let [L, R] = transient([left, right], { ratio: 1.5 })

| Algorithm | Domain | Best for | shift |
|---|---|---|---|
| pitchShift | auto | content-aware default | 1.781 |
| transient | STFT | music with percussion ★ | 1.781 |
| phaseLock | STFT | general music | 1.775 |
| vocoder | STFT | simple tonal | 1.491 |
| formant | STFT | voice (no chipmunk) | 1.593 |
| hpss | STFT | mixed music (drums+tonal) | 1.464 |
| sms | sinusoidal | harmonic/tonal | 1.761 |
| paulstretch | STFT | ambient, extreme shifts | 2.339 |
| wsola | time | speech, low-latency | 1.672 |
| psola | time | speech, mono voice | 1.767 |
| ola | time | baseline | 2.050 |
| granular | time | creative textures | 1.905 |
| sample | time | sampler/tracker playback | 1.655 |
| hybrid | hybrid | mixed dynamic material | 1.925 |
Frequency-domain algorithms shift bins natively; time-domain algorithms use their namesake stretcher from time-stretch, followed by a sinc resample. The shift column is the log-magnitude distance to a canonical reference (lower is better). Run npm run quality for all metrics.
All algorithms accept ratio (1.5 ≈ +7 semitones, 2 = one octave up), semitones, frameSize (default 2048), and hopSize (default frameSize/4).
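The ratio and semitones options describe the same shift in two units. As a quick reference, the standard equal-temperament conversion can be sketched as below; the helper names are illustrative and are not exports of this package.

```javascript
// Equal-temperament conversion between semitones and a frequency ratio.
// Helper names are hypothetical, not part of the pitch-shift API.
const semitonesToRatio = (semitones) => 2 ** (semitones / 12)
const ratioToSemitones = (ratio) => 12 * Math.log2(ratio)

console.log(semitonesToRatio(12))             // 2 (one octave up)
console.log(ratioToSemitones(1.5).toFixed(2)) // 7.02 (a fifth, roughly +7 semitones)
```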
Content-aware auto-selector. Picks: voice/speech → psola, tonal → sms, else → transient.
import pitchShift from 'pitch-shift'
pitchShift(audio, { semitones: 5 })
pitchShift(audio, { ratio: 1.5, content: 'voice' })
pitchShift(audio, { ratio: 2, method: 'formant' })

| Param | Default | Description |
|---|---|---|
| content | music | music, voice/speech, tonal |
| method | auto | Force a specific algorithm by name |
| formant | false | Wrap in formant preservation |
Peak-locked phase vocoder with spectral-flux transient detection. On transient frames, synthesis phase resets to analysis phase, preserving attacks. Between transients, behaves like phaseLock.
import transient from 'pitch-shift/transient.js'
transient(audio, { ratio: 1.5 })
transient(audio, { semitones: 5, transientThreshold: 2.0 })

| Param | Default | Description |
|---|---|---|
| transientThreshold | 1.5 | z-score over log-flux EMA (higher = fewer resets) |
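To make the threshold concrete, here is a minimal sketch of a z-score detector over an exponential moving average of log spectral flux. It is not the package's implementation: the function name, the smoothing factor `alpha`, and the variance floor are assumptions chosen for illustration.

```javascript
// Flag transient frames from a per-frame spectral-flux sequence.
// `alpha` (EMA smoothing) and `varFloor` are illustrative assumptions.
function detectTransients(flux, threshold = 1.5, alpha = 0.1, varFloor = 0.01) {
  const flags = new Array(flux.length).fill(false)
  let mean = Math.log(flux[0] + 1e-9)
  let variance = 0
  for (let i = 1; i < flux.length; i++) {
    const x = Math.log(flux[i] + 1e-9)
    // Score the frame against the statistics of the frames before it
    const z = (x - mean) / Math.sqrt(variance + varFloor)
    flags[i] = z > threshold
    // Then fold the frame into the running mean and variance
    const d = x - mean
    mean += alpha * d
    variance = (1 - alpha) * (variance + alpha * d * d)
  }
  return flags
}
```

Raising the threshold makes the detector fire less often, which matches the "higher = fewer resets" note in the table.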
Preserves phase coherence, partial structure, attack localization on detected transients.
Destroys formants; misses quiet transients at too-high threshold.
| f0 err | THD% | alias | attack corr | formant dist | phase coh | shift |
|---|---|---|---|---|---|---|
| 0.00 | 0.0 | 0.000 | 0.988 | 1.619 | 0.991 | 1.781 |
Formant dist 1.619 because bin-shift moves the spectral envelope with the partials — use formant to preserve it.
Use when: Music with drums — the default choice.
Not for: Voice where formant preservation matters.
Laroche-Dolson peak-locked phase vocoder. Peaks scatter to shifted bins; non-peak bins lock their phase relative to the nearest peak, keeping the vertical phase relationship inside each sinusoidal lobe intact.
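The locking rule can be sketched as follows, assuming the synthesis phase of each peak bin has already been computed elsewhere. The function and parameter names are hypothetical, and the peak picker is a plain local-maximum test rather than whatever the package uses.

```javascript
// Identity phase locking sketch: every bin takes the synthesis phase of its
// nearest peak plus its own analysis-phase offset from that peak.
// `synthPhase` only needs valid values at peak bins (an assumption of this sketch).
function lockPhases(mag, analysisPhase, synthPhase) {
  const peaks = []
  for (let k = 1; k < mag.length - 1; k++) {
    if (mag[k] > mag[k - 1] && mag[k] >= mag[k + 1]) peaks.push(k)
  }
  const out = Float64Array.from(analysisPhase) // unchanged if no peaks are found
  if (!peaks.length) return out
  let p = 0
  for (let k = 0; k < mag.length; k++) {
    // Move on to the next peak once it becomes the closer one
    while (p + 1 < peaks.length && Math.abs(peaks[p + 1] - k) < Math.abs(peaks[p] - k)) p++
    const peak = peaks[p]
    out[k] = synthPhase[peak] + (analysisPhase[k] - analysisPhase[peak])
  }
  return out
}
```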
import phaseLock from 'pitch-shift/phase-lock.js'
phaseLock(audio, { ratio: 1.5 })

Preserves phase coherence around peaks, partial structure.
Destroys transients (still smeared, less than vocoder), formants.
| f0 err | THD% | alias | attack corr | formant dist | phase coh | shift |
|---|---|---|---|---|---|---|
| 0.00 | 0.0 | 0.000 | 0.988 | 1.623 | 0.991 | 1.775 |
Nearly identical to transient on non-percussive material. The 0.006 shift gap is the transient reset cost on synthetic fixtures that have no transients.
Use when: General music — the "try this first" phase vocoder.
Not for: Music with drums (use transient), voice (use formant).
SMB/Bernsee bin-shift. Computes true instantaneous frequency per bin from consecutive-frame phase advance, scatters peaks to shifted bins, accumulates synthesis phase at the shifted frequency.
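The "true instantaneous frequency" step is the textbook phase-vocoder estimate: compare the measured phase advance between two frames with the advance expected at the bin centre, wrap the difference, and convert it to a frequency offset. A minimal sketch, with a hypothetical function name and options object:

```javascript
// Estimate the true frequency (Hz) in bin k from the phase advance between
// two consecutive analysis frames.
function instantaneousFrequency(k, phase, prevPhase, { frameSize, hopSize, sampleRate }) {
  const twoPi = 2 * Math.PI
  const expected = twoPi * hopSize * k / frameSize // advance of a bin-centred sinusoid
  let delta = phase - prevPhase - expected
  delta -= twoPi * Math.round(delta / twoPi)       // wrap to [-π, π]
  const binOffset = delta * frameSize / (twoPi * hopSize)
  return (k + binOffset) * sampleRate / frameSize
}
```

Accumulating synthesis phase at the shifted frequency then amounts to adding `2π · hopSize · shiftedFreq / sampleRate` per frame.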
import vocoder from 'pitch-shift/vocoder.js'
vocoder(audio, { ratio: 1.5 })

Preserves dominant-partial pitch, long-horizon phase per bin.
Destroys transients, vertical phase coherence ("phasiness"), formants.
| f0 err | THD% | alias | attack corr | formant dist | phase coh | shift |
|---|---|---|---|---|---|---|
| 0.00 | 0.0 | 0.000 | 0.983 | 1.343 | 0.922 | 1.491 |
Phase coh 0.922 from independent per-bin phase accumulation — no inter-bin locking. Lower shift score than phaseLock because the simpler scatter avoids peak-assignment artifacts on pure tones.
Use when: Simple tonal material, educational baseline.
Not for: Music with percussion, voice.
Cepstral envelope preservation wrapping a peak-locked vocoder. Extracts spectral envelope via cepstral liftering from temporally-smoothed magnitude, flattens the spectrum, applies peak-locked pitch shift on the flat residual, re-imposes the original envelope.
import formant from 'pitch-shift/formant.js'
formant(audio, { semitones: 5 })
formant(audio, { ratio: 0.75, envelopeWidth: 16 })

| Param | Default | Description |
|---|---|---|
| envelopeWidth | max(8, N/64) | Cepstrum lifter cutoff (quefrency bins) |
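The role of the lifter cutoff is easiest to see in isolation. The sketch below smooths a log-magnitude half-spectrum by keeping only the low-quefrency cepstral coefficients. It uses naive cosine sums instead of an FFT to stay dependency-free, omits the temporal smoothing mentioned above, and assumes `cutoff` is at most N/2; the function name is hypothetical.

```javascript
// Smooth a log-magnitude half-spectrum (bins 0..N/2) by low-quefrency liftering.
// Naive cosine transforms for clarity; assumes cutoff <= N/2.
function cepstralEnvelope(logMag, cutoff) {
  const H = logMag.length - 1 // N/2
  const N = 2 * H
  // Real cepstrum of the even-symmetric log spectrum, low quefrencies only
  const c = new Float64Array(cutoff)
  for (let n = 0; n < cutoff; n++) {
    let sum = logMag[0] + (n % 2 ? -1 : 1) * logMag[H]
    for (let k = 1; k < H; k++) sum += 2 * logMag[k] * Math.cos(2 * Math.PI * k * n / N)
    c[n] = sum / N
  }
  // Transform back to the frequency axis using only the kept coefficients
  const env = new Float64Array(H + 1)
  for (let k = 0; k <= H; k++) {
    let sum = c[0]
    for (let n = 1; n < cutoff; n++) sum += 2 * c[n] * Math.cos(2 * Math.PI * k * n / N)
    env[k] = sum
  }
  return env
}
```

Subtracting this envelope from the log spectrum gives the flat residual that is pitch-shifted; adding it back afterwards re-imposes the original formants.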
Preserves formant envelope (absolute Hz), vocal-tract character.
Destroys transients (same as vocoder); risks cepstral ringing on sparse spectra.
| f0 err | THD% | alias | attack corr | formant dist | phase coh | shift |
|---|---|---|---|---|---|---|
| 0.00 | 0.0 | 0.000 | 0.988 | 0.921 | 0.980 | 1.593 |
Best formant dist (0.921) by construction — the envelope is explicitly separated and re-applied. Slightly worse shift score than vocoder because the lifter→flatten→re-impose chain introduces spectral rounding.
Use when: Voice shifting without chipmunk / giant artifact.
Not for: Percussion-heavy material (transients smear).
Fitzgerald median-filter harmonic/percussive separation. Time-axis and frequency-axis medians produce soft Wiener masks splitting the spectrogram. Harmonic component is vocoder-shifted; percussive component passes through with original phase.
import hpss from 'pitch-shift/hpss.js'
hpss(audio, { ratio: 1.5 })
hpss(audio, { ratio: 1.5, hpssTimeWidth: 31, hpssFreqWidth: 31 })

| Param | Default | Description |
|---|---|---|
| hpssTimeWidth | 17 | Median window width (frames) |
| hpssFreqWidth | 17 | Median window width (bins) |
| hpssPower | 2 | Soft-mask exponent |
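How the three parameters interact can be sketched with a small, unoptimized mask builder. This is an illustration of median-filter separation under simple assumptions (truncated windows at the edges, upper-middle median), not the package's code; the function names are hypothetical.

```javascript
// Median of a short list (upper middle for even lengths; fine for a sketch).
const median = (values) => {
  const sorted = Array.from(values).sort((a, b) => a - b)
  return sorted[sorted.length >> 1]
}

// spec[t][k] is the magnitude at frame t, bin k. Returns the harmonic soft
// mask per frame; the percussive mask is 1 minus this value.
function harmonicMask(spec, timeWidth = 17, freqWidth = 17, power = 2) {
  const T = spec.length, K = spec[0].length
  const ht = timeWidth >> 1, hf = freqWidth >> 1
  return spec.map((frame, t) => Float64Array.from(frame, (_, k) => {
    const alongTime = [], alongFreq = []
    for (let i = Math.max(0, t - ht); i <= Math.min(T - 1, t + ht); i++) alongTime.push(spec[i][k])
    for (let j = Math.max(0, k - hf); j <= Math.min(K - 1, k + hf); j++) alongFreq.push(frame[j])
    const h = median(alongTime) ** power // sustained across frames: harmonic
    const p = median(alongFreq) ** power // spread across bins: percussive
    return h + p > 0 ? h / (h + p) : 0.5
  }))
}
```

Wider windows make the split more decisive but blur fast changes; a larger exponent pushes the soft mask toward a hard binary one.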
Preserves percussive onset locations (unshifted) and harmonic pitch (shifted).
Destroys signal quality at ambiguous mask boundaries (leakage in both directions).
| f0 err | THD% | alias | attack corr | formant dist | phase coh | shift |
|---|---|---|---|---|---|---|
| 0.00 | 0.0 | 0.052 | 0.996 | 1.267 | 0.922 | 1.464 |
Best overall shift score — keeping percussion unshifted sidesteps most artifacts. Alias 0.052 from residual harmonic energy leaking through the percussive mask.
Use when: Mixed music where drums should stay stationary while melody shifts.
Not for: Solo tonal material (unnecessary separation overhead).
Spectral Modeling Synthesis. Parabolic-interpolated peak picking builds sinusoidal tracks (freq, mag, phase); each peak's lobe is copied intact to round(f·ratio). Stochastic residual shifts to ratio-scaled bins with analysis phase.
import sms from 'pitch-shift/sms.js'
sms(audio, { ratio: 2 })
sms(audio, { ratio: 1.5, maxTracks: 40 })

| Param | Default | Description |
|---|---|---|
| maxTracks | Infinity | Max simultaneous sinusoidal tracks |
| minMag | 1e-4 | Peak detection threshold (linear) |
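Parabolic interpolation is the standard way to recover a peak's fractional bin position from three neighbouring log-magnitude samples. A minimal sketch, with a hypothetical function name and assuming the caller has already found a local maximum at bin k:

```javascript
// Refine a spectral peak at bin k by fitting a parabola through the
// log-magnitudes at k - 1, k, and k + 1.
function refinePeak(logMag, k) {
  const a = logMag[k - 1], b = logMag[k], c = logMag[k + 1]
  const denom = a - 2 * b + c
  const offset = denom === 0 ? 0 : 0.5 * (a - c) / denom // in bins, within ±0.5 at a true maximum
  return { bin: k + offset, logMag: b - 0.25 * (a - c) * offset }
}
```

The refined bin, multiplied by sampleRate / frameSize, gives the track frequency that is then scaled by ratio.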
Preserves formant envelope (lobes scale freely with peaks), harmonic structure, tonal clarity.
Destroys transients, noise-like textures (absorbed into residual), polyphony beyond maxTracks.
| f0 err | THD% | alias | attack corr | formant dist | phase coh | shift |
|---|---|---|---|---|---|---|
| 0.00 | 0.0 | 0.002 | 0.953 | 2.028 | 0.922 | 1.761 |
Lower attack corr (0.953) because sinusoidal modeling smooths onset transients into the residual. Formant dist 2.028 despite natural lobe scaling — the residual component carries unshifted energy.
Use when: Sustained tonal / harmonic instruments, vowels.
Not for: Percussion, noise-heavy material.
Large-frame (16k) phase randomization. Magnitudes pulled from source bins at k/ratio; phases drawn uniformly from [0, 2π) every frame. Destroys temporal structure by design.
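One synthesis frame of this scheme can be sketched as follows. Linear interpolation between source bins is an assumption of the sketch (the description only says magnitudes are pulled from k/ratio), and the function name is hypothetical.

```javascript
// Build one synthesis frame: the magnitude for output bin k is read from
// source bin k / ratio, and the phase is drawn uniformly at random.
// Linear interpolation between source bins is an assumption here.
function randomPhaseFrame(mag, ratio, random = Math.random) {
  const n = mag.length
  const re = new Float64Array(n), im = new Float64Array(n)
  for (let k = 0; k < n; k++) {
    const src = k / ratio
    const i = Math.floor(src), frac = src - i
    if (i >= n) continue // source bin beyond the spectrum: leave silent
    const next = i + 1 < n ? mag[i + 1] : 0
    const m = mag[i] * (1 - frac) + next * frac
    const phase = 2 * Math.PI * random() // uniform in [0, 2π)
    re[k] = m * Math.cos(phase)
    im[k] = m * Math.sin(phase)
  }
  return { re, im }
}
```

Because the phases are redrawn every frame, two runs over the same input produce different waveforms with the same magnitude spectrum.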
import paulstretch from 'pitch-shift/paulstretch.js'
paulstretch(audio, { ratio: 1.5 })

Preserves long-term magnitude-spectrum statistics.
Destroys phase, transients, rhythm — by design.
| f0 err | THD% | alias | attack corr | formant dist | phase coh | shift |
|---|---|---|---|---|---|---|
| 0.00 | 0.3 | 0.232 | 0.954 | 7.113 | — | 2.339 |
Worst shift score (2.339) and formant dist (7.113) because random phases smear spectral energy across the frame — the smear is the aesthetic. Stream-vs-batch decorrelates (—) because random phase is non-deterministic.
Use when: Ambient/drone textures, extreme shift ratios.
Not for: Anything requiring temporal precision.
WSOLA time-stretch + sinc resample. Searches each grain position ±tolerance samples for maximum cross-correlation with the previous grain's tail, eliminating phase cancellation before resampling to the target pitch.
import wsola from 'pitch-shift/wsola.js'
wsola(audio, { ratio: 0.85 })
wsola(audio, { ratio: 1.5, tolerance: 512 })

| Param | Default | Description |
|---|---|---|
| tolerance | frameSize/4 | Similarity search radius (±samples) |
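The similarity search at the heart of WSOLA can be sketched as a brute-force scan over the tolerance window. The function name is hypothetical, and normalizing the correlation by the candidate's energy is a choice made for this sketch; the package may score candidates differently.

```javascript
// Find the offset within ±tolerance of `center` where `signal` best matches
// `reference` (for WSOLA, the natural continuation of the previous grain).
function bestOffset(signal, reference, center, tolerance) {
  const n = reference.length
  let best = 0, bestScore = -Infinity
  for (let d = -tolerance; d <= tolerance; d++) {
    const start = center + d
    if (start < 0 || start + n > signal.length) continue
    let dot = 0, energy = 0
    for (let i = 0; i < n; i++) {
      dot += signal[start + i] * reference[i]
      energy += signal[start + i] ** 2
    }
    // Cross-correlation normalized by the candidate's energy
    const score = energy > 0 ? dot / Math.sqrt(energy) : 0
    if (score > bestScore) { bestScore = score; best = d }
  }
  return best
}
```

A larger tolerance gives the search more candidates at a cost that grows linearly with the radius.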
Preserves local waveform shape, attack envelopes.
Destroys formants (shifted by resample), phase coherence across long spans.
| f0 err | THD% | alias | attack corr | formant dist | phase coh | shift |
|---|---|---|---|---|---|---|
| 1.00 | 0.2 | 0.005 | 0.995 | 2.345 | 0.866 | 1.672 |
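f0 err 1.00 Hz from sinc resample quantization (time-domain algorithms round the stretch ratio to grain boundaries). Attack corr 0.995 is the best among the time-domain algorithms (tied with granular; only hpss scores higher, at 0.996), because the similarity search preserves waveform continuity.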
Use when: Speech, low-latency, anywhere the phase vocoder's frame latency is unacceptable.
Not for: Polyphonic music with sustained tones.
PSOLA time-stretch + sinc resample. Autocorrelation detects pitch periods; two-period Hann grains are placed at pitch-synchronous intervals, preserving formants in the stretch stage.
import psola from 'pitch-shift/psola.js'
psola(audio, { ratio: 0.75, sampleRate: 48000 })
psola(audio, { ratio: 1.5, minFreq: 100, maxFreq: 400 })

| Param | Default | Description |
|---|---|---|
| sampleRate | 44100 | For pitch detection range |
| minFreq | 70 | Lowest expected pitch (Hz) |
| maxFreq | 600 | Highest expected pitch (Hz) |
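The three parameters bound the lag range of the autocorrelation search. A minimal sketch of that idea, using a plain (unnormalized) autocorrelation, which naturally favours shorter lags and so tends to avoid picking a multiple of the true period; the function name is hypothetical and the package's detector may differ:

```javascript
// Estimate the pitch period (in samples) of a frame by autocorrelation,
// searching only lags that correspond to minFreq..maxFreq.
function detectPeriod(frame, { sampleRate = 44100, minFreq = 70, maxFreq = 600 } = {}) {
  const minLag = Math.floor(sampleRate / maxFreq)
  const maxLag = Math.min(Math.ceil(sampleRate / minFreq), frame.length - 1)
  let bestLag = 0, bestScore = 0
  for (let lag = minLag; lag <= maxLag; lag++) {
    let sum = 0
    for (let i = 0; i + lag < frame.length; i++) sum += frame[i] * frame[i + lag]
    if (sum > bestScore) { bestScore = sum; bestLag = lag }
  }
  return bestLag // 0 means no positive correlation was found (treat as unvoiced)
}
```

Narrowing minFreq and maxFreq to the expected voice range shrinks the search and reduces octave errors.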
Preserves waveform-per-period shape, formants, voiced-speech naturalness.
Destroys polyphony (assumes single pitch contour), unvoiced regions (pitch-mark jitter).
| f0 err | THD% | alias | attack corr | formant dist | phase coh | shift |
|---|---|---|---|---|---|---|
| 0.66 | 0.2 | 0.005 | 0.941 | 2.340 | 0.998 | 1.767 |
Best phase coherence (0.998) — pitch-synchronous grains align perfectly with the waveform period. Lower attack corr (0.941) from pitch-mark jitter on non-periodic onsets.
Use when: Monophonic speech, solo voice, single melodic instrument.
Not for: Polyphonic material, chords.
Plain OLA time-stretch + sinc resample. Overlap-add without similarity search — the baseline the others improve on.
import ola from 'pitch-shift/ola.js'
ola(audio, { ratio: 1.5 })

Preserves amplitude envelope.
Destroys pitch accuracy, formants, transients, phase coherence.
| f0 err | THD% | alias | attack corr | formant dist | phase coh | shift |
|---|---|---|---|---|---|---|
| 39.59 | 0.1 | 0.005 | 0.977 | 2.360 | 0.992 | 2.050 |
f0 err 39.59 Hz is the worst by far. Without similarity search, grains land at arbitrary phase offsets, causing destructive interference that shifts the perceived pitch. Onset err 0.388 (see the full quality table) has the same cause.
Use when: Reference baseline, or the simplest possible shift for comparison.
Not for: Anything quality-sensitive.
Small-grain (1024) WSOLA time-stretch + sinc resample. Grain-rate artifacts are intentionally prominent — the texture is the point.
import granular from 'pitch-shift/granular.js'
granular(audio, { ratio: 1.3 })

Preserves grain-local timbre, characteristic textural quality.
Destroys pitch accuracy on complex tones, smooth envelopes.
| f0 err | THD% | alias | attack corr | formant dist | phase coh | shift |
|---|---|---|---|---|---|---|
| 0.95 | 0.2 | 0.005 | 0.995 | 2.796 | 0.945 | 1.905 |
Worst formant dist among time-domain algorithms (2.796) because the small grains create audible spectral ripples.
Use when: Creative/textural effects where grain character is desired.
Not for: Transparent pitch shifting.
Playback-rate pitch shift. Hann-windowed sinc interpolation at a fractional read-head stepped by ratio per output sample. No time preservation — higher pitch = shorter clip.
import sample from 'pitch-shift/sample.js'
sample(instrumentBuffer, { semitones: 7 })
sample(audio, { ratio: 2, sincRadius: 16 })

| Param | Default | Description |
|---|---|---|
| sincRadius | 8 | Windowed-sinc half-width (samples) |
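Playback-rate shifting with a Hann-windowed sinc kernel can be sketched in a few lines. This version is an illustration under stated assumptions: it returns the shorter clip rather than zero-padding it, normalizes by the sum of the kernel weights, and applies no extra low-pass when ratio is above 1. The function name is hypothetical.

```javascript
// Read `input` at a fractional position that advances by `ratio` per output
// sample, interpolating with a Hann-windowed sinc of half-width `sincRadius`.
function playAtRate(input, ratio, sincRadius = 8) {
  const out = new Float32Array(Math.floor(input.length / ratio))
  for (let n = 0; n < out.length; n++) {
    const pos = n * ratio // fractional read head
    const center = Math.floor(pos)
    let acc = 0, norm = 0
    for (let j = center - sincRadius + 1; j <= center + sincRadius; j++) {
      if (j < 0 || j >= input.length) continue
      const x = pos - j // distance from the read head to tap j
      const sinc = x === 0 ? 1 : Math.sin(Math.PI * x) / (Math.PI * x)
      const hann = 0.5 * (1 + Math.cos(Math.PI * x / sincRadius))
      acc += input[j] * sinc * hann
      norm += sinc * hann
    }
    out[n] = norm !== 0 ? acc / norm : 0
  }
  return out
}
```

A larger radius costs more multiplications per sample in exchange for a sharper interpolation filter.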
Preserves waveform identity (literally the same audio, faster/slower), formants — everything scales together.
Destroys time: output duration = input_length / ratio, zero-padded to match API.
| f0 err | THD% | alias | attack corr | formant dist | phase coh | shift |
|---|---|---|---|---|---|---|
| 2.50 | 0.1 | 0.007 | 0.951 | 2.245 | 0.170 | 1.655 |
Phase coh 0.170 because the modulation rate itself shifts with the pitch (a 5 Hz tremolo becomes 7.5 Hz at ratio 1.5). This is correct behavior for a sampler — not an artifact.
Use when: Instrument one-shots, ROM-sample playback, tracker-style.
Not for: Time-preserving pitch shift.
Runs phaseLock and wsola in parallel, crossfades sample-by-sample by spectral-flux transient confidence. Tonal regions resolve via the phase vocoder; attacks resolve via WSOLA similarity search.
import hybrid from 'pitch-shift/hybrid.js'
hybrid(audio, { ratio: 1.5 })
hybrid(audio, { ratio: 1.5, hybridThreshold: 0.6 })

| Param | Default | Description |
|---|---|---|
| hybridThreshold | 0.8 | Spectral-flux z-score for full WSOLA blend |
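The blend itself can be pictured as a per-sample crossfade driven by the detector's z-score. In the sketch below, the linear mapping from z-score to blend weight is an assumption (the table only states that the threshold corresponds to a full WSOLA blend), and the function and parameter names are hypothetical.

```javascript
// Crossfade two engine outputs sample by sample.
// Weight 0 keeps the phase-vocoder output, weight 1 keeps the WSOLA output.
// The linear z-score-to-weight mapping is an assumption of this sketch.
function blend(tonal, attack, zScore, threshold = 0.8) {
  const out = new Float32Array(tonal.length)
  for (let i = 0; i < out.length; i++) {
    const w = Math.min(1, Math.max(0, zScore[i] / threshold))
    out[i] = (1 - w) * tonal[i] + w * attack[i]
  }
  return out
}
```

Lowering the threshold hands control to the WSOLA path on weaker onsets.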
Preserves tonal phase coherence + attack shape — simultaneously.
Destroys CPU budget (≈2×), formants.
| f0 err | THD% | alias | attack corr | formant dist | phase coh | shift |
|---|---|---|---|---|---|---|
| 0.00 | 0.0 | 0.000 | 0.988 | 2.538 | 0.879 | 1.925 |
Phase coh 0.879 from crossfade blending — the detector's confidence curve creates micro-transitions between two engines with different phase trajectories. Worst on synthetic fixtures that have no transients to trigger the WSOLA path.
Use when: Mixed dynamic material where a single domain compromises the other.
Not for: Pure tonal (just use phaseLock) or pure percussive (just use transient).
Full quality table
| Algorithm | f0 err | THD% | alias | stream corr | cent err | onset err | attack corr | formant dist | phase coh | shift |
|---|---|---|---|---|---|---|---|---|---|---|
| hpss | 0.00 | 0.0 | 0.052 | 1.000 | 0.007 | 0.000 | 0.996 | 1.267 | 0.922 | 1.464 |
| vocoder | 0.00 | 0.0 | 0.000 | 1.000 | 0.006 | 0.000 | 0.983 | 1.343 | 0.922 | 1.491 |
| formant | 0.00 | 0.0 | 0.000 | 1.000 | 0.061 | 0.000 | 0.988 | 0.921 | 0.980 | 1.593 |
| sample | 2.50 | 0.1 | 0.007 | 1.000 | 0.003 | 0.000 | 0.951 | 2.245 | 0.170 | 1.655 |
| wsola | 1.00 | 0.2 | 0.005 | 1.000 | 0.003 | 0.000 | 0.995 | 2.345 | 0.866 | 1.672 |
| sms | 0.00 | 0.0 | 0.002 | 1.000 | 0.001 | 0.000 | 0.953 | 2.028 | 0.922 | 1.761 |
| psola | 0.66 | 0.2 | 0.005 | 1.000 | 0.003 | 0.000 | 0.941 | 2.340 | 0.998 | 1.767 |
| phaseLock | 0.00 | 0.0 | 0.000 | 1.000 | 0.012 | 0.000 | 0.988 | 1.623 | 0.991 | 1.775 |
| pitchShift | 0.00 | 0.0 | 0.000 | 1.000 | 0.012 | 0.000 | 0.988 | 1.619 | 0.991 | 1.781 |
| transient | 0.00 | 0.0 | 0.000 | 1.000 | 0.012 | 0.000 | 0.988 | 1.619 | 0.991 | 1.781 |
| granular | 0.95 | 0.2 | 0.005 | 1.000 | 0.019 | 0.000 | 0.995 | 2.796 | 0.945 | 1.905 |
| hybrid | 0.00 | 0.0 | 0.000 | 1.000 | 0.004 | 0.000 | 0.988 | 2.538 | 0.879 | 1.925 |
| ola | 39.59 | 0.1 | 0.005 | 1.000 | 0.042 | 0.388 | 0.977 | 2.360 | 0.992 | 2.050 |
| paulstretch | 0.00 | 0.3 | 0.232 | — | 0.005 | 0.000 | 0.954 | 7.113 | — | 2.339 |
Column definitions
- f0 err (Hz) — pitch accuracy shifting 440→660 Hz sine.
- THD% — harmonic distortion on shifted pure sine.
- alias — energy above Nyquist when shifting 14 kHz ×2.
- stream corr — streaming vs batch correlation. — = decorrelates by design.
- cent err — spectral centroid ratio error on a 3-partial chord.
- onset err — impulse-train period error after shift.
- attack corr — plucked-string attack envelope correlation.
- formant dist — cepstral envelope distance on synthetic vowel. Lower = formants preserved.
- phase coh — AM-envelope coherence on 5 Hz tremolo. — for paulstretch (non-deterministic).
- shift — log-magnitude distance to canonical shifted reference, averaged over four fixtures. Bold = leader.
Frequency-domain algorithms + sample accept time-varying ratio — a function (t) => ratio or Float32Array. Time-domain algorithms (ola, wsola, psola, granular, hybrid) apply a single global ratio.
// Vibrato: ±10% at 5 Hz
let vibrato = phaseLock(audio, {
ratio: (t) => 1 + 0.1 * Math.sin(2 * Math.PI * 5 * t),
sampleRate: 44100,
})

Combine with a pitch detector: detect per-frame f0, snap to target scale, pass as ratio function. Use formant for natural voice, phaseLock for hard-tune effect, sms for harmonic instruments.
import { yin } from 'pitch-detection'
import { formant } from 'pitch-shift'
let hop = 512, sr = 44100
let pitchFrames = []
for (let i = 0; i + 2048 <= audio.length; i += hop) {
let r = yin(audio.subarray(i, i + 2048), { fs: sr })
pitchFrames.push(r ? { freq: r.freq, clarity: r.clarity } : null)
}
let scale = [261.63, 293.66, 329.63, 349.23, 392.00, 440.00, 493.88]
let snap = (f) => scale.reduce((a, b) =>
Math.abs(Math.log2(b / f)) < Math.abs(Math.log2(a / f)) ? b : a
)
let corrected = formant(audio, {
ratio: (t) => {
let p = pitchFrames[Math.min(Math.round(t * sr / hop), pitchFrames.length - 1)]
return (!p || p.clarity < 0.5) ? 1 : snap(p.freq) / p.freq
},
sampleRate: sr,
})

npm test # correctness
npm run quality # measured metrics
npm run bench # performance

- time-stretch — Time-domain stretchers (WSOLA, PSOLA)
- fourier-transform — FFT
- window-function — Hann windowing
The pitch-shift package name was previously held by mikolalysenko/pitch-shift (2013, v0.0.0), a single WSOLA/TD-PSOLA implementation. That approach is available here as wsola or psola, with batch, streaming, and multi-channel support.
// v0.0.0 (old)
var shifter = require('pitch-shift')(onData, t => ratio, { frameSize: 2048 })
shifter.feed(float32Array)
// v1 (this package)
import { wsola } from 'pitch-shift'
let write = wsola({ ratio })
let out = write(float32Array)
let tail = write() // flush

- time-stretch — Time stretching
- audio-filter — Audio filters