For the complete documentation index, see llms.txt. This page is also available as Markdown.

Connect Python SDK to Unsloth

Guide to calling Unsloth's local API from Python using the official OpenAI or Anthropic SDKs including streaming, vision, function calling, and Unsloth's built-in server-side tools.

Unsloth serves three OpenAI-compatible dialects at the same base URL. Chat Completions, Responses, and Anthropic Messages, so every mainstream Python SDK works against it. You change only the base_url and api_key on the client; everything else (streaming, tool calling, vision, structured output) behaves the way the SDK documents. This page covers the two SDKs most developers reach for first: the official OpenAI Python SDK and the official Anthropic Python SDK.

If you're not sure what URL / key / model name to use, read the API overview first. It walks you through starting, loading a model, and creating an sk-unsloth-… key.

🔑 Prerequisites

Before you run any of the snippets below you'll need:

  • Unsloth running locally with a model loaded (note the port: typically 8000 or 8888).

  • An sk-unsloth-… API key created from Settings → API.

  • A model name. The name of the GGUF model inside Unsloth (e.g. qwen-local, unsloth/Qwen3.6-27B-GGUF). If you forget it, run:

    curl http://localhost:8888/v1/models -H "Authorization: Bearer sk-unsloth-…"

    and copy the id field.

Set the key as an env var so you never paste it into code:

export UNSLOTH_STUDIO_AUTH_TOKEN=sk-unsloth-xxxxxxxxxxxx

🤖 OpenAI SDK

Unsloth's /v1/chat/completions endpoint is a drop-in for the OpenAI Python SDK. The client treats Unsloth like any other OpenAI-compatible provider.

1. Install the SDK:

pip install openai

2. Create a client pointed at Unsloth:

import os
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8888/v1",              # your unsloth port + /v1
    api_key=os.environ["UNSLOTH_STUDIO_AUTH_TOKEN"],     # your sk-unsloth-… key
)

Basic chat completion

Streaming

Set stream=True and iterate over the returned generator:

Images (vision)

Attach an image as an image_url content part. Unsloth accepts either an HTTP(S) URL or a data: base64 URI:

The loaded model must be multimodal. If you load a text-only model, the vision request will succeed structurally but the model won't be able to "see" the image.

Function calling (OpenAI tools)

Pass OpenAI-style tools and (optionally) tool_choice and Unsloth forwards them to the backend. Your client is responsible for executing each tool call and returning the result on the next turn:

Unsloth server-side tools (shorthand)

In addition to OpenAI-style client-side tools, Unsloth can execute Python, bash, and web search server-side and stream the results back automatically. Opt in via the extra_body parameter so the fields pass straight through to Unsloth:

The session_id is optional. Use it to persist tool state (e.g. a Python kernel) across calls.

enabled_tools currently supports "python", "bash", and "web_search". Tool results are streamed back as tool_result events so the model can see them on its next turn.

Listing models

🧠 Anthropic SDK

Unsloth's /v1/messages endpoint is a drop-in for the Anthropic Python SDK.

1. Install the SDK:

2. Create a client pointed at Unsloth:

Basic message

Streaming

The SDK exposes a context manager that yields text deltas:

Images (vision)

Anthropic-style image content uses a source block with base64 data:

Tool calling (Anthropic tools)

Pass Anthropic-style tools with an input_schema and Unsloth forwards them natively:

Unsloth server-side tools (shorthand)

The same enable_tools / enabled_tools / session_id shorthand works against /v1/messages pass it through extra_body:

Unsloth emits custom tool_result SSE events for the model's view of each tool call's output. The Anthropic SDK passes these through its event stream unchanged.

JSON decoding (response_format)

Unsloth supports OpenAI-style structured outputs via response_format. Pass a JSON Schema and the model is constrained to produce JSON matching it.

The strict: True flag tells Unsloth to enforce the schema during decoding rather than relying on the model to comply on its own. additionalProperties: False and required work as in standard JSON Schema.

The terminal output should look roughly like this:

🧪 Choosing an SDK

Both SDKs work against Unsloth. The right choice depends on the rest of your stack:

  • Use the OpenAI SDK if your code already depends on the OpenAI Python package, you want OpenAI-style tools / tool_choice, or you plan to call the Responses API.

  • Use the Anthropic SDK if your code already depends on the Anthropic package, you prefer Anthropic's input_schema tool format, or you want the Anthropic-native streaming event types.

You can use both in the same project. Unsloth serves them on the same port, so a single sk-unsloth-… key authenticates both.

❔ Troubleshooting

401 Unauthorized The UNSLOTH_STUDIO_AUTH_TOKEN env var isn't set, or the key is wrong. Re-export and confirm with echo $UNSLOTH_STUDIO_AUTH_TOKEN.

404 Not Found from the OpenAI SDK Check that base_url ends in /v1. The OpenAI SDK appends endpoint paths to the base URL as-is.

404 Not Found from the Anthropic SDK Check that base_url does not end in /v1. The Anthropic SDK adds /v1/messages itself.

extra_body fields aren't reaching Unsloth Make sure you're on a recent openai / anthropic SDK. Older versions silently drop unknown fields. Upgrade with pip install -U openai anthropic.

Streaming "hangs" then dumps everything at once Whatever is wrapping your output is buffering. In a script, print(..., flush=True); in a notebook, it's usually fine; behind a proxy, disable response buffering on the proxy.

For endpoint-level issues (wrong port, model not loading, lost connection, etc.) see the API overview page.

Optional: set server defaults

You can configure default server behavior before connecting with the Python SDK, when using the unsloth run command.

Use --reasoning off to turn thinking off, or --reasoning on to turn it on for models that support reasoning.

This starts the server on 0.0.0.0:8888, allowing other devices on your local network to connect.

These settings become the server defaults when requests do not specify their own generation parameters.

Request-level values like temperature, top_p, max_tokens, and stream can still override the defaults for that request.

Last updated

Was this helpful?