
feat: add LocalVQE backend and audio transformations UI #9640

Merged: mudler merged 1 commit into mudler:master from richiejp:feat/localvqe on May 4, 2026

richiejp (Collaborator) commented May 2, 2026:

Adds the LocalVQE backend, a new REST API and a UI for audio transformations.

Introduce a generic "audio transform" capability for any audio-in / audio-out
operation (echo cancellation, noise suppression, dereverberation, voice
conversion, etc.) and ship LocalVQE as the first backend implementation.

Backend protocol:

  • Two new gRPC RPCs in backend.proto: unary AudioTransform for batch and
    bidirectional AudioTransformStream for low-latency frame-by-frame use.
    This is the first bidi stream in the proto; per-frame unary at LocalVQE's
    16 ms hop would be RTT-bound. Wire it through pkg/grpc/{client,server,
    embed,interface,base} with paired-channel ergonomics.
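The paired-channel ergonomics can be pictured as a small adapter that hides Send/Recv behind two Go channels. This is a minimal sketch, not the code in pkg/grpc: `frameStream` is a hypothetical interface standing in for the generated client stream, frames are assumed to be `[]int16` PCM, and `echoStream` is a fake backend used only for the demo. gRPC-Go allows one goroutine to send while another receives on the same stream, which is what the two goroutines rely on.

```go
package main

import (
	"fmt"
	"io"
)

// frameStream is a hypothetical stand-in for the generated bidi stream client;
// only the three methods the adapter needs are modelled here.
type frameStream interface {
	Send(frame []int16) error
	Recv() ([]int16, error)
	CloseSend() error
}

// pairChannels turns a bidi stream into paired channels. Closing the returned
// send channel half-closes the stream; the receive channel is closed when the
// server ends the stream (io.EOF) or Recv fails. Errors go to errc, which is
// buffered so neither goroutine blocks while reporting.
func pairChannels(s frameStream) (chan<- []int16, <-chan []int16, <-chan error) {
	in := make(chan []int16)
	out := make(chan []int16)
	errc := make(chan error, 2)

	go func() {
		var sendErr error
		for f := range in {
			if sendErr != nil {
				continue // keep draining so callers never block after a failure
			}
			if err := s.Send(f); err != nil {
				sendErr = err
				errc <- err
			}
		}
		if sendErr == nil {
			if err := s.CloseSend(); err != nil {
				errc <- err
			}
		}
	}()

	go func() {
		defer close(out)
		for {
			f, err := s.Recv()
			if err == io.EOF {
				return
			}
			if err != nil {
				errc <- err
				return
			}
			out <- f
		}
	}()

	return in, out, errc
}

// echoStream is a fake in-process "backend" that returns every frame unchanged.
type echoStream struct{ q chan []int16 }

func (e *echoStream) Send(f []int16) error { e.q <- f; return nil }
func (e *echoStream) CloseSend() error     { close(e.q); return nil }
func (e *echoStream) Recv() ([]int16, error) {
	f, ok := <-e.q
	if !ok {
		return nil, io.EOF
	}
	return f, nil
}

func main() {
	in, out, errc := pairChannels(&echoStream{q: make(chan []int16, 4)})
	go func() {
		for i := 0; i < 3; i++ {
			in <- make([]int16, 256) // one 16 ms hop at 16 kHz
		}
		close(in)
	}()
	n := 0
	for range out {
		n++
	}
	fmt.Println("frames received:", n, "errors:", len(errc))
}
```

Closing the send channel maps to `CloseSend`, so the backend sees a clean end of input and can flush its remaining frames before the receive channel closes.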

LocalVQE backend (backend/go/localvqe/):

  • Go-Purego wrapper around upstream liblocalvqe.so. CMake builds the upstream
    shared lib + its libggml-cpu-*.so runtime variants directly — no MODULE
    wrapper needed because LocalVQE handles CPU feature selection internally
    via GGML_BACKEND_DL.
  • Sets GGML_NTHREADS from opts.Threads (or runtime.NumCPU()-1) — without it
    LocalVQE runs single-threaded at ~1× realtime instead of the documented
    ~9.6×.
  • Reference-length policy: zero-pad short refs, truncate long ones (the
    trailing portion can't have leaked into a mic that wasn't recording).
  • Ginkgo test suite (9 always-on specs + 2 model-gated).
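The thread default and the reference-length policy above fit in a few lines. This is a minimal sketch, assuming int16 PCM and the mic length as the target; the function names are hypothetical, and clamping to one thread on single-core machines is an added assumption rather than something the description states.

```go
package main

import (
	"fmt"
	"os"
	"runtime"
	"strconv"
)

// defaultThreads honours the configured count, otherwise uses
// runtime.NumCPU()-1. The clamp to 1 for single-core hosts is an assumption.
func defaultThreads(configured int) int {
	if configured > 0 {
		return configured
	}
	if n := runtime.NumCPU() - 1; n >= 1 {
		return n
	}
	return 1
}

// alignReference makes the far-end reference exactly micLen samples long:
// a short reference is zero-padded (treated as silence), a long one is
// truncated, since audio played after the mic stopped cannot be in the mic
// signal. Truncation returns a sub-slice that shares ref's backing array.
func alignReference(ref []int16, micLen int) []int16 {
	if len(ref) >= micLen {
		return ref[:micLen]
	}
	out := make([]int16, micLen) // make zero-initialises the padding
	copy(out, ref)
	return out
}

func main() {
	// Set before the shared library is loaded, in case it reads the variable
	// only at initialisation (an assumption about ggml's behaviour).
	os.Setenv("GGML_NTHREADS", strconv.Itoa(defaultThreads(0)))
	fmt.Println(alignReference([]int16{1, 2}, 4))          // [1 2 0 0]
	fmt.Println(alignReference([]int16{1, 2, 3, 4, 5}, 3)) // [1 2 3]
}
```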

HTTP layer:

  • POST /v1/audio/transformations (alias /audio/transform): multipart batch
    endpoint, accepts audio + optional reference + params[*]=v form fields.
    Persists inputs alongside the output in GeneratedContentDir/audio so the
    React UI history can replay past (audio, reference, output) triples.
  • GET /v1/audio/transformations/stream: WebSocket bidi, 16 ms PCM frames
    (interleaved stereo mic+ref in, mono out). JSON session.update envelope
    for config; constants hoisted in core/schema/audio_transform.go.
  • ffmpeg-based input normalisation to 16 kHz mono s16 WAV via the existing
    utils.AudioToWav (with passthrough fast-path), so the user can upload any
    format / rate without seeing the model's strict 16 kHz constraint.
  • BackendTraceAudioTransform integration so /api/backend-traces and the
    Traces UI light up with audio_snippet base64 and timing.

Auth + capability + importer:

  • FLAG_AUDIO_TRANSFORM (model_config.go), FeatureAudioTransform (default-on,
    in APIFeatures), three RouteFeatureRegistry rows.
  • localvqe added to knownPrefOnlyBackends with modality "audio-transform".
  • Gallery entry localvqe-v1-1.3m (sha256-pinned, hosted on
    huggingface.co/LocalAI-io/LocalVQE).

React UI:

  • New Studio "Transform" tab and /app/transform page with two AudioInput
    components (Upload + Record tabs, drag-drop, mic capture).
  • Echo-test button: records mic while playing the loaded reference through
    the speakers — the mic naturally picks up speaker bleed, giving a real
    (mic, ref) pair for AEC testing without leaving the UI.
  • Reusable WaveformPlayer (canvas peaks + click-to-seek + audio controls)
    and useAudioPeaks hook (shared module-scoped AudioContext to avoid
    hitting browser context limits with three players on one page); migrated
    TTS, Sound, Traces audio blocks to use it.
  • Past runs saved in localStorage via useMediaHistory('audio-transform') —
    the history entry stores all three URLs so clicking re-renders the full
    triple, not just the output.
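Waveform peaks are typically computed by splitting the decoded samples into one bucket per drawn bar and keeping the largest absolute amplitude in each bucket. The hook itself is React code; the Go sketch below only illustrates that reduction, and whether useAudioPeaks uses exactly this max-abs bucketing is an assumption. The shared AudioContext is a browser concern and is not modelled here.

```go
package main

import "fmt"

// peaks reduces a signal to `buckets` values, each the largest absolute
// amplitude in its slice of the input. With Web Audio-style float samples in
// [-1, 1] the result is already in [0, 1] and can be drawn as bar heights.
func peaks(samples []float32, buckets int) []float32 {
	if buckets <= 0 || len(samples) == 0 {
		return nil
	}
	out := make([]float32, buckets)
	for b := range out {
		start := b * len(samples) / buckets
		end := (b + 1) * len(samples) / buckets
		var m float32
		for _, s := range samples[start:end] {
			if s < 0 {
				s = -s
			}
			if s > m {
				m = s
			}
		}
		out[b] = m
	}
	return out
}

func main() {
	fmt.Println(peaks([]float32{0.1, -0.5, 0.2, 0.9, -0.3, 0}, 3)) // [0.5 0.9 0.3]
}
```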

Build + e2e:

  • 11 matrix entries removed from .github/workflows/backend.yml (CUDA, ROCm,
    SYCL, Metal, L4T): upstream supports only CPU + Vulkan, so we ship those
    two and let GPU-class hardware route through Vulkan in the gallery
    capabilities map.
  • tests-localvqe-grpc-transform job in test-extra.yml (gated on
    detect-changes.outputs.localvqe).
  • New audio_transform capability + 4 specs in tests/e2e-backends.
  • Playwright spec suite in core/http/react-ui/e2e/audio-transform.spec.js
    (8 specs covering tabs, file upload, multipart shape, history, errors).

Docs:

  • New docs/content/features/audio-transform.md covering the (audio,
    reference) mental model, batch + WebSocket wire formats, LocalVQE param
    keys, and a YAML config example. Cross-links from text-to-audio and
    audio-to-text feature pages.
richiejp force-pushed the feat/localvqe branch 2 times, most recently from ff7ff39 to 31a2460, on May 3, 2026 12:30
richiejp marked this pull request as ready for review on May 3, 2026 13:06
richiejp force-pushed the feat/localvqe branch 5 times, most recently from 6620618 to 6930b22, on May 4, 2026 11:21
Introduce a generic "audio transform" capability for any audio-in / audio-out
operation (echo cancellation, noise suppression, dereverberation, voice
conversion, etc.) and ship LocalVQE as the first backend implementation.

Backend protocol:
- Two new gRPC RPCs in backend.proto: unary AudioTransform for batch and
  bidirectional AudioTransformStream for low-latency frame-by-frame use.
  This is the first bidi stream in the proto; per-frame unary at LocalVQE's
  16 ms hop would be RTT-bound. Wire it through pkg/grpc/{client,server,
  embed,interface,base} with paired-channel ergonomics.

LocalVQE backend (backend/go/localvqe/):
- Go-Purego wrapper around upstream liblocalvqe.so. CMake builds the upstream
  shared lib + its libggml-cpu-*.so runtime variants directly — no MODULE
  wrapper needed because LocalVQE handles CPU feature selection internally
  via GGML_BACKEND_DL.
- Sets GGML_NTHREADS from opts.Threads (or runtime.NumCPU()-1) — without it
  LocalVQE runs single-threaded at ~1× realtime instead of the documented
  ~9.6×.
- Reference-length policy: zero-pad short refs, truncate long ones (the
  trailing portion can't have leaked into a mic that wasn't recording).
- Ginkgo test suite (9 always-on specs + 2 model-gated).

HTTP layer:
- POST /audio/transformations (alias /audio/transform): multipart batch
  endpoint, accepts audio + optional reference + params[*]=v form fields.
  Persists inputs alongside the output in GeneratedContentDir/audio so the
  React UI history can replay past (audio, reference, output) triples.
- GET /audio/transformations/stream: WebSocket bidi, 16 ms PCM frames
  (interleaved stereo mic+ref in, mono out). JSON session.update envelope
  for config; constants hoisted in core/schema/audio_transform.go.
- ffmpeg-based input normalisation to 16 kHz mono s16 WAV via the existing
  utils.AudioToWav (with passthrough fast-path), so the user can upload any
  format / rate without seeing the model's strict 16 kHz constraint.
- BackendTraceAudioTransform integration so /api/backend-traces and the
  Traces UI light up with audio_snippet base64 and timing.
- Routes registered under routes/localai.go (LocalAI extension; OpenAI has
  no /audio/transformations endpoint), traced via TraceMiddleware.

Auth + capability + importer:
- FLAG_AUDIO_TRANSFORM (model_config.go), FeatureAudioTransform (default-on,
  in APIFeatures), three RouteFeatureRegistry rows.
- localvqe added to knownPrefOnlyBackends with modality "audio-transform".
- Gallery entry localvqe-v1-1.3m (sha256-pinned, hosted on
  huggingface.co/LocalAI-io/LocalVQE).

React UI:
- New /app/transform page surfaced via a dedicated "Enhance" sidebar
  section (sibling of Tools / Biometrics) — the page is enhancement, not
  generation, so it lives outside Studio. Two AudioInput components
  (Upload + Record tabs, drag-drop, mic capture).
- Echo-test button: records mic while playing the loaded reference through
  the speakers — the mic naturally picks up speaker bleed, giving a real
  (mic, ref) pair for AEC testing without leaving the UI.
- Reusable WaveformPlayer (canvas peaks + click-to-seek + audio controls)
  and useAudioPeaks hook (shared module-scoped AudioContext to avoid
  hitting browser context limits with three players on one page); migrated
  TTS, Sound, Traces audio blocks to use it.
- Past runs saved in localStorage via useMediaHistory('audio-transform') —
  the history entry stores all three URLs so clicking re-renders the full
  triple, not just the output.

Build + e2e:
- 11 matrix entries removed from .github/workflows/backend.yml (CUDA, ROCm,
  SYCL, Metal, L4T): upstream supports only CPU + Vulkan, so we ship those
  two and let GPU-class hardware route through Vulkan in the gallery
  capabilities map.
- tests-localvqe-grpc-transform job in test-extra.yml (gated on
  detect-changes.outputs.localvqe).
- New audio_transform capability + 4 specs in tests/e2e-backends.
- Playwright spec suite in core/http/react-ui/e2e/audio-transform.spec.js
  (8 specs covering tabs, file upload, multipart shape, history, errors).

Docs:
- New docs/content/features/audio-transform.md covering the (audio,
  reference) mental model, batch + WebSocket wire formats, LocalVQE param
  keys, and a YAML config example. Cross-links from text-to-audio and
  audio-to-text feature pages.

Assisted-by: Claude:claude-opus-4-7 [Bash Read Edit Write Agent TaskCreate]
Signed-off-by: Richard Palethorpe <io@richiejp.com>
mudler merged commit bb033b1 into mudler:master on May 4, 2026 (50 checks passed)
localai-bot added the enhancement (New feature or request) label on May 9, 2026