LLM Completions

Chat completion with a model catalog, per-model pricing, streaming, and structured outputs.

Send messages (role + content) and get a completion. You choose a model from the catalog or use the default. The service returns message, model, token_usage (prompt/completion/total tokens), finish_reason, and an optional parsed object when json_schema is requested.

Pricing in the model catalog is what you actually pay: the gateway has already applied its markup to the raw OpenAI rate, so the list and comparison endpoints expose post-markup rates directly. Each request is billed at the model's input/output rates from the catalog.

Pass stream:true for OpenAI-compatible SSE streaming. Pass json_schema for OpenAI Structured Outputs; the response then includes a parsed field with the validated object (sync), or an extra SSE chunk carrying parsed before the usage chunk and the [DONE] marker (stream).

llm, text, generation, streaming, structured-output

Overview

Features

Per-model pricing (already includes gateway markup)

Each model in core-llm-models exposes input_per_million / output_per_million / estimated_per_message. These are the customer-facing rates — the OpenAI raw cost is multiplied by the gateway markup before exposure. There is no separate markup field; the listed price is the price.

Model catalog and tiers

Models are grouped by tier (budget, standard, premium). Each entry has model_id, display_name, description, pricing, and limits (max_tokens, context_window). default_model and aliases let you refer to models by short name.

Chat completion

Send a messages array. Optional parameters: model, temperature, max_tokens, top_p, frequency_penalty, presence_penalty. Response: message (assistant content), model, finish_reason, token_usage.

Streaming (SSE)

Pass stream:true to receive an OpenAI-compatible Server-Sent Events stream of chat.completion.chunk events. A final usage chunk is sent before [DONE].

Structured outputs (json_schema → parsed)

Pass json_schema {name, schema, strict} for OpenAI Structured Outputs. Sync response gains a data.parsed field (parsed JSON object). Streaming emits an extra chunk with parsed before the usage chunk.

Model list and comparison

GET core-llm-models returns models, default_model, aliases, tiers. GET core-llm-models-comparison returns a side-by-side table (model, model_id, tier, input_price, output_price, cost_per_message, is_default) plus a top-level note describing the assumed token mix.

Use Cases

Chatbots & assistants

Send conversation history as messages; get the next reply with content and token_usage. Pick a budget or premium model from the catalog.

Streaming UIs

Use stream:true to render the assistant's reply token-by-token; track final usage from the last chunk before [DONE].

Structured extraction / function-calling-style outputs

Pass json_schema with strict:true to force the model to emit JSON conforming to your schema. Use data.parsed (sync) or the parsed SSE chunk (stream) instead of re-parsing message yourself.

Cost-aware model choice

Use core-llm-models or core-llm-models-comparison to pick by tier and estimated_per_message; aliases (e.g. default, gpt-4.1-mini, gpt4) simplify requests.

Input / Output

Input

messages (role, content), optional model (id or alias), temperature, max_tokens, stream, json_schema, response_format

JSON body

Output

message, model, finish_reason, token_usage; optional parsed when json_schema is set. Stream variant: SSE chunks (chat.completion.chunk) ending with usage chunk and [DONE].

JSON; text/event-stream (when stream:true)

Specs

Latency: ~1–5s for sync; first token ~200–500ms when streaming
Async: false
Rate Limit: 60 req/min per API key
Max Input: Per-model context_window (see catalog); max_tokens caps completion length
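
The 60 req/min limit can be respected on the client side so that requests are paced before the gateway has to refuse any. The sketch below is one way to do that, assuming a sliding 60-second window. The source does not say how the gateway counts requests or what it returns when the limit is exceeded, so `SlidingWindowLimiter` and its parameters are illustrative names, not part of the service.

```python
import time
from collections import deque


class SlidingWindowLimiter:
    """Client-side throttle: at most `limit` calls per `window` seconds."""

    def __init__(self, limit=60, window=60.0, clock=time.monotonic):
        self.limit = limit
        self.window = window
        self.clock = clock  # injectable so the limiter can be tested without sleeping
        self.calls = deque()  # timestamps of recent calls, oldest first

    def wait_time(self):
        """Seconds to wait before the next call is allowed (0.0 if allowed now)."""
        now = self.clock()
        # Drop timestamps that have left the window.
        while self.calls and now - self.calls[0] >= self.window:
            self.calls.popleft()
        if len(self.calls) < self.limit:
            return 0.0
        return self.window - (now - self.calls[0])

    def record(self):
        """Register that a request was just sent."""
        self.calls.append(self.clock())
```

A caller would sleep for `wait_time()` seconds when it is non-zero, send the request, then call `record()`.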

Quickstart

Prerequisites

  • A CN8 Gateway API key with core-llm-completions in allowed_services

1. List available models

core-llm-models

Get the model catalog. The pricing values shown are what you actually pay (post-markup).

GET /v1/proxy/core-llm-models

Response

{
  "status": "success",
  "data": {
    "models": [
      {
        "model_id": "gpt-5-mini-2025-08-07",
        "display_name": "GPT-4.1 Mini",
        "description": "Best balance of cost and capability. Recommended for most use cases.",
        "tier": "standard",
        "is_default": true,
        "is_available": true,
        "pricing": {
          "input_per_million": 0.8,
          "output_per_million": 3.2,
          "input_per_1k": 0.0008,
          "output_per_1k": 0.0032,
          "blended_per_1k": 0.00176,
          "estimated_per_message": 0.0064
        },
        "limits": { "max_tokens": 16384, "context_window": 128000 }
      }
    ],
    "default_model": "gpt-5-mini-2025-08-07",
    "aliases": {
      "default": "gpt-5-mini-2025-08-07",
      "gpt-4.1-mini": "gpt-5-mini-2025-08-07",
      "gpt-4.1-nano": "gpt-4.1-nano-2025-04-14",
      "gpt-4.1": "gpt-4.1-2025-04-14",
      "gpt4": "gpt-4o",
      "gpt4-mini": "gpt-4o-mini"
    },
    "tiers": {
      "budget": ["gpt-4.1-nano-2025-04-14", "gpt-4o-mini", "gpt-3.5-turbo"],
      "standard": ["gpt-5-mini-2025-08-07"],
      "premium": ["gpt-4.1-2025-04-14", "gpt-4o"]
    }
  }
}

Use model_id or an alias (e.g. default, gpt-4.1-mini) in the completion request. Pricing values here are the rates billed to your account.
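
The gateway resolves aliases itself, but a client may want to know in advance which model an alias points to, for example to show the tier in a UI. The following is a minimal sketch that works on the catalog response shown above; `resolve_model` and `tier_of` are hypothetical helper names.

```python
def resolve_model(catalog, name=None):
    """Return the full model_id for a model_id, an alias, or None (the default)."""
    data = catalog["data"]
    if name is None:
        return data["default_model"]
    # Collect every model_id the catalog mentions, from both models and tiers.
    known_ids = {m["model_id"] for m in data["models"]}
    for ids in data["tiers"].values():
        known_ids.update(ids)
    if name in known_ids:
        return name
    if name in data["aliases"]:
        return data["aliases"][name]
    raise KeyError(f"unknown model or alias: {name}")


def tier_of(catalog, model_id):
    """Return the tier name that lists model_id, or None if no tier does."""
    for tier, ids in catalog["data"]["tiers"].items():
        if model_id in ids:
            return tier
    return None
```

With the sample catalog, `resolve_model(catalog, "gpt4")` gives `gpt-4o`, which `tier_of` places in `premium`.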

2. Send a chat completion (sync)

core-llm-completions

POST messages, optional model. Response includes message, model, finish_reason, token_usage. The cost object reflects the actual amount deducted from your balance.

POST /v1/proxy/core-llm-completions
{
  "model": "default",
  "messages": [
    { "role": "system", "content": "You are a helpful assistant." },
    { "role": "user", "content": "What is the capital of France?" }
  ],
  "temperature": 0.7,
  "max_tokens": 256
}

Response

{
  "status": "success",
  "data": {
    "message": "The capital of France is Paris.",
    "model": "gpt-5-mini-2025-08-07",
    "finish_reason": "stop",
    "token_usage": { "prompt_tokens": 28, "completion_tokens": 9, "total_tokens": 37 }
  },
  "gateway": {
    "request_id": "req_abc123",
    "service": "core-llm-completions"
  },
  "cost": {
    "units": 37.0,
    "unit_price": 0.0000011,
    "tokens": 0.0000406,
    "balance": 4579.78
  }
}

  • units = total tokens consumed (prompt + completion).
  • tokens = the amount actually deducted from your balance for this call.
  • unit_price = tokens / units → the blended per-token rate for this specific call (varies by input/output ratio and model). This is informational; do not use it to predict future calls.
  • balance = your remaining balance after this deduction.
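
A sync call can be made with nothing beyond the Python standard library. The sketch below builds the request and reads the response fields described above. Two things are assumptions, because the source does not specify them: the host in `BASE_URL` is a placeholder, and sending the API key as a Bearer token is a guess at the auth scheme. The function names are hypothetical.

```python
import json
import urllib.request

BASE_URL = "https://gateway.example.com"  # placeholder host (assumption)


def build_completion_request(api_key, messages, model="default", **options):
    """Build (but do not send) a POST to core-llm-completions."""
    body = {"model": model, "messages": messages, **options}
    return urllib.request.Request(
        url=f"{BASE_URL}/v1/proxy/core-llm-completions",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            # Assumed auth scheme; the source does not say how the key is sent.
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


def read_completion(raw_bytes):
    """Return (reply text, token_usage, amount deducted) from a sync response body."""
    payload = json.loads(raw_bytes)
    if payload.get("status") != "success":
        raise RuntimeError(f"gateway returned status {payload.get('status')!r}")
    data = payload["data"]
    return data["message"], data["token_usage"], payload["cost"]["tokens"]
```

To send it, pass the request to `urllib.request.urlopen(req)` and hand `resp.read()` to `read_completion`. Error responses are not documented in the source, so the status check is only a guard.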

3. Streaming (SSE)

core-llm-completions

Pass stream:true. The response is text/event-stream with OpenAI-compatible chat.completion.chunk events. A final usage chunk is sent before the [DONE] marker.

POST /v1/proxy/core-llm-completions
{
  "model": "default",
  "messages": [{ "role": "user", "content": "Write a haiku about the sea." }],
  "stream": true,
  "max_tokens": 256
}

Response

data: {"id":"chatcmpl-abc","object":"chat.completion.chunk","model":"gpt-5-mini-2025-08-07","choices":[{"index":0,"delta":{"role":"assistant"},"finish_reason":null}]}

data: {"id":"chatcmpl-abc","object":"chat.completion.chunk","model":"gpt-5-mini-2025-08-07","choices":[{"index":0,"delta":{"content":"Salt"},"finish_reason":null}]}

data: {"id":"chatcmpl-abc","object":"chat.completion.chunk","model":"gpt-5-mini-2025-08-07","choices":[{"index":0,"delta":{"content":" wind"},"finish_reason":null}]}

data: {"id":"chatcmpl-abc","object":"chat.completion.chunk","model":"gpt-5-mini-2025-08-07","choices":[{"index":0,"delta":{},"finish_reason":"stop"}]}

data: {"id":"chatcmpl-abc","object":"chat.completion.chunk","model":"gpt-5-mini-2025-08-07","choices":[],"usage":{"prompt_tokens":12,"completion_tokens":18,"total_tokens":30}}

data: [DONE]

  • Each event is one line prefixed with `data: ` and terminated by a blank line.
  • Concatenate `choices[0].delta.content` from every chunk to reconstruct the full message.
  • The last chunk before `data: [DONE]` has empty `choices` and carries the final `usage` object (OpenAI convention).
  • `data: [DONE]` is the stream terminator. After it, no more chunks are sent.
  • Billing happens once after the stream ends, using the same per-model rates as sync.
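
The reassembly rules above can be sketched in a few lines of Python. `collect_stream` is a hypothetical helper that takes the response as an iterable of already-decoded text lines; how those lines are read from the network is left to the HTTP client.

```python
import json


def collect_stream(lines):
    """Reassemble the reply text and the final usage object from SSE lines."""
    parts = []
    usage = None
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data: "):
            continue  # blank separators and anything that is not a data line
        payload = line[len("data: "):]
        if payload == "[DONE]":
            break  # stream terminator; nothing after it is read
        chunk = json.loads(payload)
        if chunk.get("usage"):
            usage = chunk["usage"]
        for choice in chunk.get("choices", []):
            text = choice.get("delta", {}).get("content")
            if text:
                parts.append(text)
    return "".join(parts), usage
```

Fed the sample stream above, this returns the text `Salt wind` together with the usage object whose `total_tokens` is 30.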

4. Structured outputs (json_schema → parsed)

core-llm-completions

Pass json_schema with name, schema, strict. The model is forced to return JSON conforming to your schema (OpenAI Structured Outputs). The gateway parses it for you and exposes data.parsed.

POST /v1/proxy/core-llm-completions
{
  "model": "default",
  "messages": [
    { "role": "user", "content": "Extract the title and a 1-sentence summary from: 'OpenAI launches Sora 2 today, focusing on cinematic video.'" }
  ],
  "json_schema": {
    "name": "extraction_result",
    "schema": {
      "type": "object",
      "properties": {
        "title": { "type": "string" },
        "summary": { "type": "string" }
      },
      "required": ["title", "summary"],
      "additionalProperties": false
    },
    "strict": true
  }
}

Response

{
  "status": "success",
  "data": {
    "message": "{\"title\":\"OpenAI launches Sora 2\",\"summary\":\"OpenAI released Sora 2 with a focus on cinematic video generation.\"}",
    "model": "gpt-5-mini-2025-08-07",
    "finish_reason": "stop",
    "token_usage": { "prompt_tokens": 56, "completion_tokens": 24, "total_tokens": 80 },
    "parsed": {
      "title": "OpenAI launches Sora 2",
      "summary": "OpenAI released Sora 2 with a focus on cinematic video generation."
    }
  },
  "cost": { "units": 80.0, "unit_price": 0.0000010, "tokens": 0.0000832, "balance": 4579.78 }
}

  • `json_schema.strict: true` requires `additionalProperties: false` on every object level (OpenAI requirement).
  • When both `json_schema` and `response_format` are sent, `json_schema` wins.
  • If the model fails to produce parseable JSON, `data.parsed` is `null` (rare with strict mode).
  • In streaming mode, after the content chunks the gateway emits one extra chunk with empty `choices` and a top-level `parsed` object, then the usage chunk, then `data: [DONE]`.
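
Because a schema that breaks the strict-mode rules is rejected upstream, it can be worth checking it locally first. The sketch below walks a schema and reports object levels that lack `additionalProperties: false`; it also reports properties missing from `required`, which OpenAI's strict mode expects as well. `strict_schema_problems` is a hypothetical helper and covers only plain `object` and `array` nesting, not `anyOf`, `$defs`, or list-valued `type`.

```python
def strict_schema_problems(schema, path="$"):
    """Check the additionalProperties and required rules at every object level."""
    problems = []
    if not isinstance(schema, dict):
        return problems
    if schema.get("type") == "object":
        if schema.get("additionalProperties") is not False:
            problems.append(f"{path}: additionalProperties must be false")
        props = schema.get("properties", {})
        missing = sorted(set(props) - set(schema.get("required", [])))
        if missing:
            problems.append(f"{path}: not listed in required: {missing}")
        # Recurse into each property so nested objects are checked too.
        for name, sub in props.items():
            problems.extend(strict_schema_problems(sub, f"{path}.{name}"))
    elif schema.get("type") == "array":
        problems.extend(strict_schema_problems(schema.get("items"), f"{path}[]"))
    return problems
```

The `extraction_result` schema from the request above passes with an empty list; pass `json_schema["schema"]`, not the whole `json_schema` wrapper.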

LLM Chat Completion

POST (sync)

Chat completion with messages array. Optional model (id or alias), temperature, max_tokens, top_p, frequency_penalty, presence_penalty, stream, json_schema, response_format. Returns message (assistant content), model (resolved id), finish_reason, token_usage. Optional parsed when json_schema is set. With stream:true the response is OpenAI-compatible SSE.

/v1/proxy/core-llm-completions

List LLM Models

GET (sync)

Get the model catalog: models (model_id, display_name, description, tier, is_default, is_available, pricing, limits), default_model, aliases, tiers (budget, standard, premium). Pricing values are post-markup — the rates the customer pays.

/v1/proxy/core-llm-models

LLM Model Comparison

GET (sync)

Side-by-side comparison: model, model_id, tier, input_price, output_price (per 1M, post-markup), cost_per_message (~2K input + 1.5K output), is_default. Plus a top-level note describing the assumed token mix.

/v1/proxy/core-llm-models-comparison

Pricing

Per-model pricing — see core-llm-models for each model's input_per_million / output_per_million. Listed values are the rates billed to your account (markup is already applied at exposure time). List and comparison endpoints are free.

Service | Unit | Price
Chat Completion | token | Per model; see core-llm-models[].pricing
List Models / Comparison | item | Free
  • cost.tokens in the completion response is the amount actually deducted from your balance for that call.
  • cost.unit_price is the blended per-token rate for that specific call (= cost.tokens / cost.units). It varies with input/output ratio and model and is informational only.
  • estimated_per_message in the model catalog assumes ~2K input + 1.5K output; useful for comparing models before committing.
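
As a worked example of the per-million rates, the sketch below estimates the charge for a given token mix before a call is made. `estimate_cost` is a hypothetical helper; the result is in whatever unit the catalog prices use. The source does not describe rounding or minimum charges, so treat the output as an estimate and `cost.tokens` in the response as the authoritative amount.

```python
def estimate_cost(pricing, prompt_tokens, completion_tokens):
    """Estimated charge for one call, in the catalog's price unit."""
    return (
        prompt_tokens * pricing["input_per_million"]
        + completion_tokens * pricing["output_per_million"]
    ) / 1_000_000


# Catalog rates of the sample default model: 0.8 input, 3.2 output per million.
pricing = {"input_per_million": 0.8, "output_per_million": 3.2}

# ~2K input + 1.5K output: 2000 * 0.8 / 1e6 + 1500 * 3.2 / 1e6 = 0.0016 + 0.0048
per_message = estimate_cost(pricing, 2000, 1500)
```

For this token mix the result is approximately 0.0064, which agrees with the `estimated_per_message` value listed for that model in the sample catalog.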

Guides & Tips

Messages format

  • `messages` is an array of `{ role, content }`. Roles: `system`, `user`, `assistant`.
  • The service returns `data.message` (assistant reply text), `data.model`, `data.finish_reason`, `data.token_usage`.
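
For multi-turn conversations the client keeps the history and resends it on each call. A minimal sketch of that bookkeeping, with `add_turn` as a hypothetical helper name:

```python
VALID_ROLES = {"system", "user", "assistant"}


def add_turn(history, role, content):
    """Return a new messages list with one more {role, content} entry."""
    if role not in VALID_ROLES:
        raise ValueError(f"unsupported role: {role!r}")
    return history + [{"role": role, "content": content}]


history = add_turn([], "system", "You are a helpful assistant.")
history = add_turn(history, "user", "What is the capital of France?")
# After the call returns, store the reply from data.message as an assistant turn.
history = add_turn(history, "assistant", "The capital of France is Paris.")
```

The next request then sends `history` plus the new user message as `messages`.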

Choosing a model

  • Call `core-llm-models` for `model_id`, `tier`, and `pricing` (the rate you actually pay).
  • Use `core-llm-models-comparison` for a side-by-side table including `cost_per_message`.
  • In requests, pass `model_id` or any alias (e.g. `default`, `gpt-4.1-mini`, `gpt4`).
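
One way to turn the catalog into a cost-aware choice is to pick the available model with the lowest `estimated_per_message`, optionally within a tier. The sketch below works on the `data.models` array of the `core-llm-models` response; `cheapest_model` is a hypothetical helper name.

```python
def cheapest_model(models, tier=None):
    """Return the model_id of the cheapest available model, or None if none match."""
    candidates = [
        m for m in models
        if m.get("is_available", True) and (tier is None or m["tier"] == tier)
    ]
    if not candidates:
        return None
    best = min(candidates, key=lambda m: m["pricing"]["estimated_per_message"])
    return best["model_id"]
```

Calling `cheapest_model(catalog["data"]["models"], tier="budget")` would return a `model_id` that can be passed straight into the completion request.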

Streaming with SSE

  • Pass `stream: true` in the request body. Response is `text/event-stream`.
  • Each line: `data: <json-chunk>\n\n`. The chunk follows OpenAI's `chat.completion.chunk` shape.
  • Concatenate `choices[0].delta.content` over all chunks to reconstruct the message.
  • The chunk before `data: [DONE]` carries `usage` (token counts), with empty `choices`.
  • When `json_schema` is set with streaming, an additional chunk with `parsed: {...}` (and empty `choices`) is emitted right before the usage chunk.
  • Billing happens once after the stream completes; per-model rates apply just like sync.
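
When `json_schema` is combined with streaming, a client has to tell content, `parsed`, and `usage` chunks apart. The generator below is a sketch of that dispatch based on the chunk shapes described in this guide; `iter_events` and the event kind labels are hypothetical, and the input is assumed to be an iterable of decoded text lines.

```python
import json


def iter_events(lines):
    """Yield (kind, value) pairs: "content", "finish", "parsed", or "usage"."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data: "):
            continue  # skip blank separators
        payload = line[len("data: "):]
        if payload == "[DONE]":
            return  # stream terminator
        chunk = json.loads(payload)
        if "parsed" in chunk:
            yield "parsed", chunk["parsed"]
        if chunk.get("usage"):
            yield "usage", chunk["usage"]
        for choice in chunk.get("choices", []):
            delta = choice.get("delta", {})
            if delta.get("content"):
                yield "content", delta["content"]
            if choice.get("finish_reason"):
                yield "finish", choice["finish_reason"]
```

A UI can append each `content` value as it arrives, then swap in the `parsed` object once that event is seen.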

Structured outputs (json_schema)

  • Pass `json_schema: { name, schema, strict: true }` to force the model to emit JSON matching your schema (OpenAI Structured Outputs).
  • With `strict: true`, every object level must declare `additionalProperties: false` and list every one of its properties in `required`; otherwise OpenAI returns 400.
  • The gateway adds `data.parsed` (the parsed object) on success, or `null` on parse failure.
  • `json_schema` takes priority over `response_format` when both are sent.

Cost in the response

  • `cost.units` = total tokens (prompt + completion).
  • `cost.tokens` = amount deducted from your balance for this call (in the same unit as your balance).
  • `cost.unit_price` = the blended per-token rate for this call (`cost.tokens / cost.units`). Use it for at-a-glance comparison; it changes per call.
  • `cost.balance` = your balance after the deduction.
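
Since `cost.unit_price` changes per call, spend is easier to track by summing `cost.tokens` across responses. A minimal sketch, with `summarize_costs` as a hypothetical helper that takes the `cost` objects of several responses:

```python
def summarize_costs(costs):
    """Total deducted amount, total units, and the overall blended rate."""
    spent = sum(c["tokens"] for c in costs)
    units = sum(c["units"] for c in costs)
    rate = spent / units if units else 0.0
    return {"spent": spent, "units": units, "blended_rate": rate}


# The cost objects from the two sync examples in the Quickstart.
summary = summarize_costs([
    {"units": 37.0, "unit_price": 0.0000011, "tokens": 0.0000406, "balance": 4579.78},
    {"units": 80.0, "unit_price": 0.0000010, "tokens": 0.0000832, "balance": 4579.78},
])
```

Here `summary["units"]` is 117.0 and `summary["spent"]` is about 0.0001238, in the same unit as the balance.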

FAQ

Q: What is the response shape?

A: data.message (assistant text), data.model (resolved id), data.finish_reason, data.token_usage. Optional data.parsed when json_schema is sent. cost contains units / unit_price / tokens / balance.

Q: Can I pass a short model name?

A: Yes. Aliases: default, gpt-4.1-mini, gpt-4.1-nano, gpt-4.1, gpt4, gpt4-mini. Or use the full model_id (e.g. gpt-5-mini-2025-08-07).

Q: How is cost calculated?

A: Per-model. The gateway uses the model's input/output rates as exposed in core-llm-models (those rates already include the gateway markup). The deducted amount is shown as cost.tokens in the response.

Q: How does streaming differ from sync?

A: Set stream:true. Response is text/event-stream with OpenAI-compatible chat.completion.chunk events. Final usage chunk arrives before 'data: [DONE]'. Billing is identical and happens once when the stream ends.

Q: How do I get a typed/JSON response?

A: Pass json_schema:{ name, schema, strict:true }. The gateway will add data.parsed (sync) or emit a parsed SSE chunk (stream). Use that instead of re-parsing data.message.

Q: What is the maximum context length?

A: Per model. Check core-llm-models for each model's limits.context_window. The max_tokens request parameter caps completion length; if it exceeds the model's limits.max_tokens, the gateway lowers it to that limit automatically.

Changelog

1.2 (2026-04-27)

  • BREAKING (docs): response field names corrected: data.message (was data.content), data.token_usage (was data.usage). cost is now an object at root level, not a number inside data.
  • Pricing: core-llm-models / core-llm-models-comparison now expose post-markup rates directly in the existing pricing fields (no separate field). The gateway charges per-model based on those same rates.
  • Added: documentation for stream:true (SSE) including chunk shape and final usage convention.
  • Added: documentation for json_schema → data.parsed (sync) and parsed SSE chunk (stream).
  • Updated: default_model is gpt-5-mini-2025-08-07; aliases now include gpt-4.1-nano, gpt-4.1, gpt4, gpt4-mini.

1.1 (2026-01-26)

  • Catalog aligned with llm_service: model catalog (model_id, display_name, tier, pricing input/output per million and per 1K, limits), default_model, aliases, tiers. Completion returns content, model, usage, cost, finish_reason. core-llm-models and core-llm-models-comparison response shapes match backend. Removed streaming and provider-specific claims; only document what the service implements.

1.0 (2026-01-26)

  • Initial catalog: core-llm-completions, core-llm-models, core-llm-models-comparison.