
API Access to Local LLMs

This guide explains how to call the Jefferson Lab Local LLM Inference Gateway using an API key.
All programmatic access (scripts, notebooks, services, automations) goes through the LiteLLM Gateway, which routes requests to GPU-backed vLLM workers.

â„šī¸ The API is only reachable from the internal network.
Please follow all JLab data policies when using local LLMs.


1. API Overview

All endpoints use the same base URL:


```
https://delphi.jlab.org/api
```

Available Endpoints

| Service | Description | Endpoint |
| --- | --- | --- |
| Model List | Returns all available models | `/v1/models` |
| Chat Completions | OpenAI-style chat API | `/v1/chat/completions` |
| Completions | Text completion (legacy) | `/v1/completions` |
| Embeddings | Vector embeddings | `/v1/embeddings` |
| Metrics | Prometheus metrics | `/metrics` |

Authentication Header

All requests must include:


```
Authorization: Bearer <YOUR_API_KEY>
```

The API is OpenAI-compatible: SDKs and existing scripts written for OpenAI often work after changing only the base URL.
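For instance, a request can be built with nothing but the Python standard library; only the base URL and the `Authorization` header differ from a stock OpenAI call. A minimal sketch (nothing is sent until `urlopen()` is called, so you can inspect the request offline):

```python
import json
import os
import urllib.request

# Gateway base URL from the section above.
BASE_URL = "https://delphi.jlab.org/api"

def build_chat_request(api_key: str, model: str, messages: list) -> urllib.request.Request:
    """Build an OpenAI-style chat request against the gateway."""
    body = json.dumps({"model": model, "messages": messages}).encode("utf-8")
    return urllib.request.Request(
        url=f"{BASE_URL}/v1/chat/completions",
        data=body,
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )

req = build_chat_request(
    api_key=os.environ.get("LLM_KEY", "<YOUR_API_KEY>"),
    model="openai/gpt-oss-120b",
    messages=[{"role": "user", "content": "Say hello!"}],
)
# urllib.request.urlopen(req) would actually send it.
print(req.full_url)
```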


2. Example: Listing Available Models

This is the simplest way to confirm your API key works.

```bash
#!/usr/bin/env bash
# File: list_models.sh
# Usage: chmod +x list_models.sh && LLM_KEY=your_key ./list_models.sh

curl https://delphi.jlab.org/api/v1/models \
  -H "Authorization: Bearer $LLM_KEY"
```
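If the key is valid, the gateway returns an OpenAI-style model list. A small Python sketch of pulling out the model IDs, assuming the standard `{"object": "list", "data": [...]}` shape (the sample IDs below are illustrative; check the live response for the real names):

```python
import json

# Sample /v1/models payload in the OpenAI list format.
sample = json.loads("""
{
  "object": "list",
  "data": [
    {"id": "openai/gpt-oss-120b", "object": "model"},
    {"id": "openai/gpt-oss-20b",  "object": "model"}
  ]
}
""")

# Each entry's "id" is the string to pass as "model" in later requests.
model_ids = [m["id"] for m in sample["data"]]
print(model_ids)
```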

3. Basic Chat Completion (Low Reasoning)

```bash
#!/usr/bin/env bash
# File: chat_completion.sh
# Usage: chmod +x chat_completion.sh && LLM_KEY=your_key ./chat_completion.sh

curl https://delphi.jlab.org/api/v1/chat/completions \
  -H "Authorization: Bearer $LLM_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "openai/gpt-oss-120b",
    "messages": [{"role": "user", "content": "Say hello from Bash!"}]
  }'
```
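The reply comes back as an OpenAI chat-completion object; the assistant's text lives at `choices[0].message.content`. A minimal parsing sketch with an illustrative (fabricated) response body:

```python
import json

# Illustrative non-streaming response in the OpenAI chat-completion shape.
response = json.loads("""
{
  "id": "chatcmpl-123",
  "object": "chat.completion",
  "model": "openai/gpt-oss-120b",
  "choices": [
    {"index": 0,
     "message": {"role": "assistant", "content": "Hello from Bash!"},
     "finish_reason": "stop"}
  ],
  "usage": {"prompt_tokens": 9, "completion_tokens": 4, "total_tokens": 13}
}
""")

reply = response["choices"][0]["message"]["content"]
print(reply)
```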

4. Streaming Chat Completion (Higher Reasoning)

Streaming provides partial tokens as the model generates them — this improves perceived speed and helps with longer reasoning steps.

```bash
#!/usr/bin/env bash
# File: stream_chat.sh
# Usage: chmod +x stream_chat.sh && LLM_KEY=your_key ./stream_chat.sh

curl https://delphi.jlab.org/api/v1/chat/completions \
  -H "Authorization: Bearer $LLM_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "openai/gpt-oss-120b",
    "stream": true,
    "messages": [{"role": "user", "content": "Stream a poem about CEBAF."}]
  }'
```
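The streamed body is Server-Sent Events: each line starts with `data: ` and carries a JSON chunk whose `choices[0].delta` holds the incremental text, and the stream ends with `data: [DONE]`. A Python sketch of reassembling the text, using fabricated sample chunks in the standard OpenAI streaming shape:

```python
import json

# Sample SSE lines mimicking what curl prints with "stream": true.
sample_stream = [
    'data: {"choices":[{"delta":{"role":"assistant"}}]}',
    'data: {"choices":[{"delta":{"content":"Hello"}}]}',
    'data: {"choices":[{"delta":{"content":" CEBAF"}}]}',
    "data: [DONE]",
]

pieces = []
for line in sample_stream:
    payload = line[len("data: "):]
    if payload == "[DONE]":          # end-of-stream sentinel
        break
    delta = json.loads(payload)["choices"][0]["delta"]
    if "content" in delta:           # the first chunk may carry only the role
        pieces.append(delta["content"])

print("".join(pieces))
```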

5. Response Format Documentation

The models follow the OpenAI Chat Completions Response Spec: 🔗 https://platform.openai.com/docs/api-reference/chat/object


6. Using the API in n8n (Visual Workflows)

You can create full agentic workflows without writing code.

Basic n8n Setup

1. Add an HTTP Request node.

2. Set:

   - Method: `POST`
   - URL: `https://delphi.jlab.org/api/v1/chat/completions`

3. Headers:

   ```json
   {
     "Authorization": "Bearer {{ $json.api_key }}",
     "Content-Type": "application/json"
   }
   ```

4. JSON Body:

   ```json
   {
     "model": "openai/gpt-oss-120b",
     "messages": [{"role": "user", "content": "Hello from n8n!"}]
   }
   ```

🔗 Example n8n templates: https://n8n.io/workflows


7. Latency Expectations

| Model Class | Typical Latency Per Token |
| --- | --- |
| Small models (1–7B) | 20–60 ms/token |
| 20B models | 40–150 ms/token |
| 120B models | 120–300 ms/token |
| Streaming | Fastest perceived response |

Latency depends on queue depth, GPU load, and context length.
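These per-token figures give a back-of-envelope estimate of total generation time: multiply the expected output length by the per-token latency. A small sketch using the midpoints of the table above (illustrative numbers, not measurements):

```python
# Rough end-to-end estimate: output_tokens x per-token latency (midpoints
# of the ranges in the table above).
per_token_ms = {"small (1-7B)": 40, "20B": 95, "120B": 210}

output_tokens = 500
for model_class, ms in per_token_ms.items():
    seconds = output_tokens * ms / 1000
    print(f"{model_class}: ~{seconds:.0f} s for {output_tokens} tokens")
```

With streaming enabled, the first tokens arrive long before this total elapses, which is why streaming feels fastest even though the full generation takes the same time.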


8. Troubleshooting

401 Unauthorized

Your API key is missing, incorrect, or expired.

404 Model Not Found

Double-check the model name against the list returned by /v1/models.

429 Rate Limit

You have reached a concurrency or token-budget limit.
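A common client-side response to 429s is to retry with exponential backoff and jitter. A sketch of the delay schedule (wire it around your actual HTTP call, and honor any Retry-After header the gateway may return):

```python
import random

def backoff_delays(max_retries: int = 5, base: float = 1.0, cap: float = 30.0):
    """Yield the wait (in seconds) before each retry attempt."""
    for attempt in range(max_retries):
        delay = min(cap, base * (2 ** attempt))   # 1, 2, 4, 8, 16, ... capped
        yield delay + random.uniform(0, delay * 0.1)  # up to +10% jitter

delays = list(backoff_delays())
print([round(d, 1) for d in delays])
```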

500 Worker Offline

One of the vLLM workers may be unavailable — contact maintainers.

Connection Timeout

Large context windows or long outputs can take longer to complete; try streaming mode.