
API Access to Local LLMs

This guide explains how to call the Jefferson Lab Local LLM Inference Gateway using an API key.
All programmatic access (scripts, notebooks, services, automations) goes through the LiteLLM Gateway, which routes requests to GPU-backed vLLM workers.

â„šī¸ The API is only reachable from the internal network.
Please follow all JLab data policies when using local LLMs.


1. API Overview

All endpoints use the same base URL:


```
https://delphi.jlab.org/api
```

Available Endpoints

| Service | Description | Endpoint |
| --- | --- | --- |
| Model List | Returns all available models | `/v1/models` |
| Chat Completions | OpenAI-style chat API | `/v1/chat/completions` |
| Completions | Text completion (legacy) | `/v1/completions` |
| Embeddings | Vector embeddings | `/v1/embeddings` |
| Metrics | Prometheus metrics | `/metrics` |

Authentication Header

All requests must include:


```
Authorization: Bearer <YOUR_API_KEY>
```

The API is OpenAI-compatible: SDKs and existing scripts written for OpenAI often work after changing only the base URL.
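For instance, a request can be built with nothing but the Python standard library; only the base URL and the `Authorization` header differ from a stock OpenAI call. A minimal sketch (nothing is sent until `urlopen()` is called, so you can inspect the request offline):

```python
import json
import os
import urllib.request

# Gateway base URL from the section above.
BASE_URL = "https://delphi.jlab.org/api"

def build_chat_request(api_key: str, model: str, messages: list) -> urllib.request.Request:
    """Build an OpenAI-style chat request against the gateway."""
    body = json.dumps({"model": model, "messages": messages}).encode("utf-8")
    return urllib.request.Request(
        url=f"{BASE_URL}/v1/chat/completions",
        data=body,
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )

req = build_chat_request(
    api_key=os.environ.get("LLM_KEY", "<YOUR_API_KEY>"),
    model="openai/gpt-oss-120b",
    messages=[{"role": "user", "content": "Say hello!"}],
)
# urllib.request.urlopen(req) would actually send it.
print(req.full_url)
```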


2. Example: Listing Available Models

This is the simplest way to confirm your API key works.

```bash
#!/usr/bin/env bash
# File: list_models.sh
# Usage: chmod +x list_models.sh && LLM_KEY=your_key ./list_models.sh

curl https://delphi.jlab.org/api/v1/models \
  -H "Authorization: Bearer $LLM_KEY"
```
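If the key is valid, the gateway returns an OpenAI-style model list. A small Python sketch of pulling out the model IDs, assuming the standard `{"object": "list", "data": [...]}` shape (the sample IDs below are illustrative; check the live response for the real names):

```python
import json

# Sample /v1/models payload in the OpenAI list format.
sample = json.loads("""
{
  "object": "list",
  "data": [
    {"id": "openai/gpt-oss-120b", "object": "model"},
    {"id": "openai/gpt-oss-20b",  "object": "model"}
  ]
}
""")

# Each entry's "id" is the string to pass as "model" in later requests.
model_ids = [m["id"] for m in sample["data"]]
print(model_ids)
```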

3. Basic Chat Completion (Low Reasoning)

```bash
#!/usr/bin/env bash
# File: chat_completion.sh
# Usage: chmod +x chat_completion.sh && LLM_KEY=your_key ./chat_completion.sh

curl https://delphi.jlab.org/api/v1/chat/completions \
  -H "Authorization: Bearer $LLM_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "openai/gpt-oss-120b",
    "messages": [{"role": "user", "content": "Say hello from Bash!"}]
  }'
```
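The reply comes back as an OpenAI chat-completion object; the assistant's text lives at `choices[0].message.content`. A minimal parsing sketch with an illustrative (fabricated) response body:

```python
import json

# Illustrative non-streaming response in the OpenAI chat-completion shape.
response = json.loads("""
{
  "id": "chatcmpl-123",
  "object": "chat.completion",
  "model": "openai/gpt-oss-120b",
  "choices": [
    {"index": 0,
     "message": {"role": "assistant", "content": "Hello from Bash!"},
     "finish_reason": "stop"}
  ],
  "usage": {"prompt_tokens": 9, "completion_tokens": 4, "total_tokens": 13}
}
""")

reply = response["choices"][0]["message"]["content"]
print(reply)
```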

4. Streaming Chat Completion (Higher Reasoning)

Streaming provides partial tokens as the model generates them — this improves perceived speed and helps with longer reasoning steps.

```bash
#!/usr/bin/env bash
# File: stream_chat.sh
# Usage: chmod +x stream_chat.sh && LLM_KEY=your_key ./stream_chat.sh

curl https://delphi.jlab.org/api/v1/chat/completions \
  -H "Authorization: Bearer $LLM_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "openai/gpt-oss-120b",
    "stream": true,
    "messages": [{"role": "user", "content": "Stream a poem about CEBAF."}]
  }'
```
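The streamed body is Server-Sent Events: each line starts with `data: ` and carries a JSON chunk whose `choices[0].delta` holds the incremental text, and the stream ends with `data: [DONE]`. A Python sketch of reassembling the text, using fabricated sample chunks in the standard OpenAI streaming shape:

```python
import json

# Sample SSE lines mimicking what curl prints with "stream": true.
sample_stream = [
    'data: {"choices":[{"delta":{"role":"assistant"}}]}',
    'data: {"choices":[{"delta":{"content":"Hello"}}]}',
    'data: {"choices":[{"delta":{"content":" CEBAF"}}]}',
    "data: [DONE]",
]

pieces = []
for line in sample_stream:
    payload = line[len("data: "):]
    if payload == "[DONE]":          # end-of-stream sentinel
        break
    delta = json.loads(payload)["choices"][0]["delta"]
    if "content" in delta:           # the first chunk may carry only the role
        pieces.append(delta["content"])

print("".join(pieces))
```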

5. Response Format Documentation

The models follow the OpenAI Chat Completions Response Spec: 🔗 https://platform.openai.com/docs/api-reference/chat/object


6. Using the API in n8n (Visual Workflows)

You can create full agentic workflows without writing code.

Basic n8n Setup

1. Add an HTTP Request node.

2. Set:

   - Method: `POST`
   - URL: `https://delphi.jlab.org/api/v1/chat/completions`

3. Headers:

   ```json
   {
     "Authorization": "Bearer {{ $json.api_key }}",
     "Content-Type": "application/json"
   }
   ```

4. JSON Body:

   ```json
   {
     "model": "openai/gpt-oss-120b",
     "messages": [{"role": "user", "content": "Hello from n8n!"}]
   }
   ```

🔗 Example n8n templates: https://n8n.io/workflows


7. Latency Expectations

| Model Class | Typical Latency Per Token |
| --- | --- |
| Small models (1–7B) | 20–60 ms/token |
| 20B models | 40–150 ms/token |
| 120B models | 120–300 ms/token |
| Streaming | Fastest perceived response |

Latency depends on queue depth, GPU load, and context length.
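These per-token figures give a back-of-envelope estimate of total generation time: multiply the expected output length by the per-token latency. A small sketch using the midpoints of the table above (illustrative numbers, not measurements):

```python
# Rough end-to-end estimate: output_tokens x per-token latency (midpoints
# of the ranges in the table above).
per_token_ms = {"small (1-7B)": 40, "20B": 95, "120B": 210}

output_tokens = 500
for model_class, ms in per_token_ms.items():
    seconds = output_tokens * ms / 1000
    print(f"{model_class}: ~{seconds:.0f} s for {output_tokens} tokens")
```

With streaming enabled, the first tokens arrive long before this total elapses, which is why streaming feels fastest even though the full generation takes the same time.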


8. Troubleshooting

401 Unauthorized

Your API key is missing, incorrect, or expired.

404 Model Not Found

Double-check the model name against the list returned by /v1/models.

429 Rate Limit

You have reached a concurrency or token-budget limit.
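A common client-side response to 429s is to retry with exponential backoff and jitter. A sketch of the delay schedule (wire it around your actual HTTP call, and honor any Retry-After header the gateway may return):

```python
import random

def backoff_delays(max_retries: int = 5, base: float = 1.0, cap: float = 30.0):
    """Yield the wait (in seconds) before each retry attempt."""
    for attempt in range(max_retries):
        delay = min(cap, base * (2 ** attempt))   # 1, 2, 4, 8, 16, ... capped
        yield delay + random.uniform(0, delay * 0.1)  # up to +10% jitter

delays = list(backoff_delays())
print([round(d, 1) for d in delays])
```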

500 Worker Offline

One of the vLLM workers may be unavailable — contact maintainers.

Connection Timeout

Large context windows or long outputs can take longer to complete; try streaming mode.