    wandler

    transformers.js inference server

    OpenAI-compatible API · Mac, Linux & Windows


    setup

    wandler is an OpenAI-compatible inference server powered by transformers.js

    install it globally and run it directly:
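a sketch of the global install, assuming the package is published under the same name as the CLI:

```shell
npm install -g wandler
wandler --llm LiquidAI/LFM2.5-1.2B-Instruct-ONNX
```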

    or use npx to skip the install:
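again assuming the npm package name matches the CLI name:

```shell
npx wandler --llm LiquidAI/LFM2.5-1.2B-Instruct-ONNX
```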

    run the server

    pick a setup, run the command, and point any OpenAI client at the server

    wandler --llm LiquidAI/LFM2.5-1.2B-Instruct-ONNX

    the server listens on http://127.0.0.1:8000 and speaks the OpenAI API, so any OpenAI client works out of the box.
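a quick smoke test with curl against the chat completions endpoint documented below:

```shell
curl http://127.0.0.1:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "LiquidAI/LFM2.5-1.2B-Instruct-ONNX",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
```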

    here is every flag wandler accepts:

    --llm <id>
    LLM model.
    format: org/repo[:precision]
    --embedding <id>
    Embedding model.
    --stt <id>
    Speech-to-text model.
    --device <type>
    Inference device.
    default: auto · options: auto, webgpu, cpu, wasm
    --port <n>
    Server port.
    default: 8000
    --host <addr>
    Bind address.
    default: 127.0.0.1
    --api-key <key>
    Bearer auth token.
    reads env WANDLER_API_KEY
    --hf-token <token>
    HuggingFace token for gated models.
    --cors-origin <origin>
    Allowed CORS origin.
    default: *
    --max-tokens <n>
    Max tokens per request.
    default: 2048
    --max-concurrent <n>
    Concurrent requests.
    default: 1
    --timeout <ms>
    Request timeout in milliseconds.
    default: 120000
    --log-level <level>
    Log verbosity.
    default: info · options: debug, info, warn, error
    --cache-dir <path>
    Model cache directory.
    default: .cache/ inside the @huggingface/transformers package (i.e. node_modules/@huggingface/transformers/.cache/)
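the flags compose; a hypothetical invocation combining several of them (all values illustrative):

```shell
wandler --llm LiquidAI/LFM2.5-1.2B-Instruct-ONNX \
  --host 0.0.0.0 \
  --port 9000 \
  --api-key my-secret \
  --max-concurrent 2 \
  --log-level debug
```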

    precision suffixes: q4 (default), q8, fp16, fp32.
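for example, to load the same model at q8 instead of the default q4:

```shell
wandler --llm LiquidAI/LFM2.5-1.2B-Instruct-ONNX:q8
```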

    discover models

    list every model in the wandler registry with type, size, precision and capabilities

    filter by type with --type llm, --type embedding, or --type stt.

    benchmarks

    WebGPU · q4 quantization · 10 runs per scenario

    | Model | Params | Weights | Context | tok/s | TTFT | Load | Capabilities |
    |---|---|---|---|---|---|---|---|
    | LiquidAI/LFM2.5-350M-ONNX | 350M | ~200 MB | 128K | 248 | 16ms | 0.5s | text |
    | LiquidAI/LFM2.5-1.2B-Instruct-ONNX | 1.2B | ~700 MB | 128K | 118 | 34ms | 1.7s | text, tools |
    | onnx-community/Qwen3.5-0.8B-Text-ONNX | 0.8B | ~500 MB | 256K | 372 | 76ms | 1.8s | text, tools |
    | onnx-community/gemma-4-E4B-it-ONNX | 4B | ~2.5 GB | 128K | 206 | 36ms | 13.4s | text, tools, vision |
    | onnx-community/gemma-4-E2B-it-ONNX | 2B | ~1.2 GB | 128K | 128 | 90ms | 7.0s | text, tools, vision |

    these are the ones we tested. any transformers.js-compatible model on Hugging Face works.

    find more on Hugging Face

    use it in your app

    drop-in replacement for any OpenAI-compatible SDK

    import OpenAI from "openai";
    
    const client = new OpenAI({
      baseURL: "http://localhost:8000/v1",
      apiKey: "-",
    });
    
    const res = await client.chat.completions.create({
      model: "LiquidAI/LFM2.5-1.2B-Instruct-ONNX",
      messages: [{ role: "user", content: "Hello!" }],
      stream: true,
    });
    
    for await (const chunk of res) {
      process.stdout.write(chunk.choices[0]?.delta?.content ?? "");
    }

    use it with your agent

    point your agent to wandler. works with any agent that supports custom OpenAI endpoints

    set the base URL in ~/.hermes/config.yaml

    model:
      default: "LiquidAI/LFM2.5-1.2B-Instruct-ONNX"
      provider: "custom"
      base_url: "http://localhost:8000/v1"
      api_key: "-"

    or configure it via the CLI

    API reference

    POST /v1/chat/completions · Chat completion with streaming and tool calling

    Body
    messages · array · Required

    Input messages with role and content

    temperature · float

    Sampling temperature, 0-2. Default 0.7

    top_p · float

    Nucleus sampling threshold. Default 0.95

    max_tokens · int

    Maximum tokens to generate

    stream · boolean

    Enable SSE streaming. Default false

    stop · string | string[]

    Stop sequences

    Only the final token of each stop string triggers stopping. Multi-token sequences are not matched exactly.

    tools · array

    Function calling tool definitions

    When set, streaming is emulated. The full response is generated first, then re-chunked as SSE.

    response_format · object

    {"type": "json_object"} for JSON mode

    top_k · int

    Top-k sampling

    min_p · float

    Minimum probability threshold

    repetition_penalty · float

    Repetition penalty; values > 1.0 penalize repetition

    stream_options · object

    {"include_usage": true} for usage stats
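a hedged sketch of a tool-calling request with the OpenAI SDK; the get_weather tool and its schema are made up for illustration, and any JSON Schema parameters object works:

```typescript
import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "http://localhost:8000/v1",
  apiKey: "-",
});

const res = await client.chat.completions.create({
  model: "LiquidAI/LFM2.5-1.2B-Instruct-ONNX",
  messages: [{ role: "user", content: "What's the weather in Berlin?" }],
  // hypothetical tool definition for illustration
  tools: [
    {
      type: "function",
      function: {
        name: "get_weather",
        description: "Look up current weather for a city",
        parameters: {
          type: "object",
          properties: { city: { type: "string" } },
          required: ["city"],
        },
      },
    },
  ],
});

// if the model decided to call the tool, the calls appear here
console.log(res.choices[0].message.tool_calls);
```

note that with tools set, streaming is emulated: the full response is generated first, then re-chunked as SSE.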

    POST /v1/completions · Text completion (legacy) with echo and suffix

    Body
    prompt · string · Required

    Input text prompt

    temperature · float

    Sampling temperature, 0-2. Default 0.7

    max_tokens · int

    Maximum tokens to generate

    stream · boolean

    Enable SSE streaming. Default false

    stop · string | string[]

    Stop sequences

    Only the final token of each stop string triggers stopping. Multi-token sequences are not matched exactly.

    echo · boolean

    Echo the prompt in the response

    suffix · string

    Text to append after completion

    POST /v1/embeddings · Text embeddings for RAG and semantic search

    Body
    input · string | string[] · Required

    Text to embed

    encoding_format · string

    "float" or "base64". Default "float"
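embedding vectors are typically compared with cosine similarity for search and RAG; a minimal self-contained sketch, with hardcoded vectors standing in for real embeddings returned by this endpoint:

```typescript
// cosine similarity between two equal-length vectors
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// stand-ins for vectors returned under data[i].embedding
const query = [0.1, 0.9, 0.2];
const doc = [0.1, 0.8, 0.3];
console.log(cosineSimilarity(query, doc).toFixed(3));
```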

    GET /v1/models · List and inspect loaded models

    POST /v1/audio/transcriptions · Speech-to-text via Whisper

    Body
    file · binary · Required

    Audio file to transcribe

    language · string

    Language code (e.g. en, de)
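a sketch of a multipart upload with curl, assuming the server was started with an --stt model; recording.wav is a placeholder filename:

```shell
curl http://127.0.0.1:8000/v1/audio/transcriptions \
  -F file=@recording.wav \
  -F language=en
```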

    POST /tokenize · Convert between text and token IDs

    Body
    text · string · Required

    Text to tokenize
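a minimal request against the tokenize endpoint:

```shell
curl http://127.0.0.1:8000/tokenize \
  -H "Content-Type: application/json" \
  -d '{"text": "hello world"}'
```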

    provided by RunPodLabs