How it works
diarization-js ports pyannote.audio's community-1 pipeline to TypeScript + ONNX so the entire diarization graph runs client-side. It uses the same three building blocks as the Python reference:
- Speaker segmentation — pyannote/segmentation-3.0 (5.6 MB ONNX). A 10 s window slides over the audio; the model emits per-frame powerset activations covering all subsets of up to 3 speakers (at most 2 active simultaneously). Argmax plus a lookup table flatten that to a (frame × speaker) multilabel grid.
- Speaker embedding — WeSpeaker ResNet34-LM (26.5 MB ONNX). For each (chunk, local-speaker) pair, the ResNet ingests Kaldi-compliant 80-mel fbank features (computed in pure TypeScript, validated to 1e-4 against torchaudio.compliance.kaldi.fbank) and emits a 256-d speaker fingerprint.
- VBx clustering — a direct port of pyannote.audio 4.0.4's utils/vbx.py (no HMM variant). Centroid-linkage AHC seeds Variational Bayes-x clustering with the pre-computed PLDA matrices shipped with the community-1 checkpoint. Convergence is judged on the ELBO; speakers whose prior vanishes are pruned.
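The powerset decoding in the segmentation step can be sketched as follows. This is an illustrative standalone snippet, not the library's actual code: the class ordering in the lookup table and the function name are assumptions.

```typescript
// 7 powerset classes for 3 speakers with at most 2 active at once:
// silence, three singletons, three two-speaker overlaps.
const POWERSET: number[][] = [
  [0, 0, 0],                        // silence
  [1, 0, 0], [0, 1, 0], [0, 0, 1],  // one active speaker
  [1, 1, 0], [1, 0, 1], [0, 1, 1],  // two overlapping speakers
];

/** Decode per-frame powerset activations (frames x 7) into a (frames x 3) 0/1 grid. */
function powersetToMultilabel(activations: number[][]): number[][] {
  return activations.map((frame) => {
    let best = 0;
    for (let k = 1; k < frame.length; k++) {
      if (frame[k] > frame[best]) best = k; // argmax over powerset classes
    }
    return POWERSET[best];                  // lookup table: class -> speaker set
  });
}
```

Because each frame is a single softmax over speaker subsets, overlap detection comes for free: picking class 4 above directly asserts that speakers 0 and 1 are both active.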
Reconstruction (sliding-window aggregation + top-K by instantaneous speaker count) merges per-chunk results into a single annotation. End-to-end DER against the Python reference on a 7 min mono recording: 1.73 % (Hungarian optimal mapping).
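The last step of reconstruction is turning a speaker's aggregated per-frame activity back into time segments. A minimal sketch, assuming a fixed frame step in seconds (the interface and function names here are illustrative, not diarization-js's API):

```typescript
interface Segment { start: number; end: number; speaker: string; }

/** Convert one speaker's per-frame 0/1 activity into { start, end } segments. */
function framesToSegments(active: number[], frameStep: number, speaker: string): Segment[] {
  const segments: Segment[] = [];
  let onset = -1;
  active.forEach((a, i) => {
    if (a === 1 && onset < 0) onset = i;   // segment opens
    if (a === 0 && onset >= 0) {           // segment closes
      segments.push({ start: onset * frameStep, end: i * frameStep, speaker });
      onset = -1;
    }
  });
  if (onset >= 0) {                        // activity runs to the end of the grid
    segments.push({ start: onset * frameStep, end: active.length * frameStep, speaker });
  }
  return segments;
}
```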
Use it as a developer
The library is browser- and Node-compatible. The same code runs in either runtime; you just supply the matching onnxruntime-* module:
npm install diarization-js onnxruntime-web
import * as ort from "onnxruntime-web/webgpu";
import { DiarizationPipeline } from "diarization-js";
const pipeline = await DiarizationPipeline.create({
ort,
segmentationModel: await fetch("/models/segmentation-3.0.onnx").then(r => r.arrayBuffer()),
embeddingModel: await fetch("/models/embedding-resnet34.onnx").then(r => r.arrayBuffer()),
pldaParamsJson: await fetch("/models/plda-params-vbx.json").then(r => r.json()),
executionProviders: ["webgpu", "wasm"],
});
const { result, metrics } = await pipeline.run(samples, sampleRate);
// result.segments: Array<{ start, end, speaker }>
// metrics.rtf, metrics.totalProcessingMs, etc.
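Downstream tooling in the pyannote ecosystem usually consumes RTTM, so a small helper for serializing the segments is handy. This helper is not part of diarization-js; it only assumes the `{ start, end, speaker }` segment shape shown above:

```typescript
interface Segment { start: number; end: number; speaker: string; }

/** Serialize segments to RTTM lines (one SPEAKER record per segment). */
function toRttm(segments: Segment[], fileId: string): string {
  return segments
    .map((s) =>
      `SPEAKER ${fileId} 1 ${s.start.toFixed(3)} ${(s.end - s.start).toFixed(3)} ` +
      `<NA> <NA> ${s.speaker} <NA> <NA>`
    )
    .join("\n");
}
```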
See the GitHub repo for the full source, the benchmark suite (apps/bench/) for batch evaluation, and the export scripts (scripts/export-models/) for rebuilding the ONNX artifacts from the upstream PyTorch checkpoints.
Licenses & credit