API Access to Local LLMs
This guide explains how to call the Jefferson Lab Local LLM Inference Gateway using an API key.
All programmatic access (scripts, notebooks, services, automations) goes through the LiteLLM Gateway, which routes requests to GPU-backed vLLM workers.
ℹ️ The API is only reachable from the internal network.
Please follow all JLab data policies when using local LLMs.
1. API Overview
All endpoints use the same base URL:
https://delphi.jlab.org/api
Available Endpoints
| Service | Description | Endpoint |
|---|---|---|
| Model List | Returns all available models | /v1/models |
| Chat Completions | OpenAI-style chat API | /v1/chat/completions |
| Completions | Text completion (legacy) | /v1/completions |
| Embeddings | Vector embeddings | /v1/embeddings |
| Metrics | Prometheus metrics | /metrics |
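The embeddings and metrics endpoints have no dedicated examples below, so here is a minimal sketch of both. The embedding model name "nomic-embed-text" is a placeholder, not a confirmed deployment; list /v1/models to find the real names. Whether /metrics requires the bearer token is also an assumption.
# File: other_endpoints.py
# Run: LLM_KEY=your_key python3 other_endpoints.py
import os
import requests

API_KEY = os.getenv("LLM_KEY")
if not API_KEY:
    raise SystemExit("Error: Please set LLM_KEY environment variable.")

BASE = "https://delphi.jlab.org/api"
headers = {"Authorization": f"Bearer {API_KEY}"}

# Embeddings: OpenAI-style body {"model": ..., "input": ...}.
# NOTE: "nomic-embed-text" is a hypothetical model name; check /v1/models.
resp = requests.post(
    f"{BASE}/v1/embeddings",
    headers=headers,
    json={"model": "nomic-embed-text", "input": "Hall A beamline status"},
    timeout=30,
)
resp.raise_for_status()
vector = resp.json()["data"][0]["embedding"]
print(f"Embedding dimension: {len(vector)}")

# Metrics: plain Prometheus text exposition (auth requirement is an assumption).
metrics = requests.get(f"{BASE}/metrics", headers=headers, timeout=15)
print(metrics.text[:200])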
Authentication Header
All requests must include:
Authorization: Bearer <YOUR_API_KEY>
The API is OpenAI-compatible, meaning SDKs and existing scripts designed for OpenAI can often work by simply changing the base URL.
2. Example: Listing Available Models
This is the simplest way to confirm your API key works.
- Bash (curl)
- Python
- Java
#!/usr/bin/env bash
# File: list_models.sh
# Usage: chmod +x list_models.sh && LLM_KEY=your_key ./list_models.sh
curl https://delphi.jlab.org/api/v1/models \
-H "Authorization: Bearer $LLM_KEY"
# File: list_models.py
# Run: LLM_KEY=your_key python3 list_models.py
import requests
import os
API_KEY = os.getenv("LLM_KEY")
if not API_KEY:
raise SystemExit("Error: Please set LLM_KEY environment variable.")
resp = requests.get(
    "https://delphi.jlab.org/api/v1/models",
    headers={"Authorization": f"Bearer {API_KEY}"},
    timeout=15,
)
resp.raise_for_status()  # surface 401/404 as errors instead of printing an error body
print(resp.json())
// File: ListModels.java
// Compile: javac ListModels.java
// Run: LLM_KEY=your_key java ListModels
import java.net.http.*;
import java.net.URI;
public class ListModels {
public static void main(String[] args) throws Exception {
String apiKey = System.getenv("LLM_KEY");
HttpClient client = HttpClient.newHttpClient();
HttpRequest req = HttpRequest.newBuilder()
.uri(URI.create("https://delphi.jlab.org/api/v1/models"))
.header("Authorization", "Bearer " + apiKey)
.build();
HttpResponse<String> resp =
client.send(req, HttpResponse.BodyHandlers.ofString());
System.out.println(resp.body());
}
}
3. Basic Chat Completion (Low Reasoning)
- Bash (curl)
- Python
- Java
#!/usr/bin/env bash
# File: chat_completion.sh
# Usage: chmod +x chat_completion.sh && LLM_KEY=your_key ./chat_completion.sh
curl https://delphi.jlab.org/api/v1/chat/completions \
-H "Authorization: Bearer $LLM_KEY" \
-H "Content-Type: application/json" \
-d '{
"model": "openai/gpt-oss-120b",
"messages": [{"role": "user", "content": "Say hello from Bash!"}]
}'
# File: chat_completion.py
# Run: LLM_KEY=your_key python3 chat_completion.py
from openai import OpenAI
import os
API_KEY = os.getenv("LLM_KEY")
if not API_KEY:
raise SystemExit("Error: Please set LLM_KEY environment variable.")
client = OpenAI(
api_key=API_KEY,
base_url="https://delphi.jlab.org/api/v1/",
)
resp = client.chat.completions.create(
model="openai/gpt-oss-120b",
messages=[{"role": "user", "content": "Say hello from Python!"}],
)
print(resp.choices[0].message.content)
// File: ChatCompletion.java
// Compile: javac ChatCompletion.java
// Run: LLM_KEY=your_key java ChatCompletion
import java.net.http.*;
import java.net.URI;
public class ChatCompletion {
public static void main(String[] args) throws Exception {
String apiKey = System.getenv("LLM_KEY");
String body =
"{\n" +
" \"model\": \"openai/gpt-oss-120b\",\n" +
" \"messages\": [\n" +
" {\"role\": \"user\", \"content\": \"Say hello from Java!\"}\n" +
" ]\n" +
"}";
HttpClient client = HttpClient.newHttpClient();
HttpRequest req = HttpRequest.newBuilder()
.uri(URI.create("https://delphi.jlab.org/api/v1/chat/completions"))
.header("Authorization", "Bearer " + apiKey)
.header("Content-Type", "application/json")
.POST(HttpRequest.BodyPublishers.ofString(body))
.build();
HttpResponse<String> resp =
client.send(req, HttpResponse.BodyHandlers.ofString());
System.out.println(resp.body());
}
}
4. Streaming Chat Completion (Higher Reasoning)
Streaming provides partial tokens as the model generates them, which improves perceived speed and helps with longer reasoning steps.
- Bash (curl)
- Python
- Java
#!/usr/bin/env bash
# File: stream_chat.sh
# Usage: chmod +x stream_chat.sh && LLM_KEY=your_key ./stream_chat.sh
curl https://delphi.jlab.org/api/v1/chat/completions \
-H "Authorization: Bearer $LLM_KEY" \
-H "Content-Type: application/json" \
-d '{
"model": "openai/gpt-oss-120b",
"stream": true,
"messages": [{"role": "user", "content": "Stream a poem about CEBAF."}]
}'
# File: stream_chat.py
# Run: LLM_KEY=your_key python3 stream_chat.py
from openai import OpenAI
import os
import sys
API_KEY = os.getenv("LLM_KEY")
if not API_KEY:
raise SystemExit("Error: Please set LLM_KEY environment variable.")
client = OpenAI(
api_key=API_KEY,
base_url="https://delphi.jlab.org/api/v1/",
)
stream = client.chat.completions.create(
model="openai/gpt-oss-120b",
stream=True,
messages=[{"role": "user", "content": "Write a CEBAF poem."}],
)
for chunk in stream:
delta = chunk.choices[0].delta
if delta and delta.content:
sys.stdout.write(delta.content)
sys.stdout.flush()
print()
// File: StreamCompletion.java
// Compile: javac StreamCompletion.java
// Run: LLM_KEY=your_key java StreamCompletion
import java.net.http.*;
import java.net.URI;
public class StreamCompletion {
public static void main(String[] args) throws Exception {
String apiKey = System.getenv("LLM_KEY");
String body =
"{\n" +
" \"model\": \"openai/gpt-oss-120b\",\n" +
" \"stream\": true,\n" +
" \"messages\": [\n" +
" {\"role\": \"user\", \"content\": \"Stream a poem about CEBAF.\"}\n" +
" ]\n" +
"}";
HttpClient client = HttpClient.newHttpClient();
HttpRequest req = HttpRequest.newBuilder()
.uri(URI.create("https://delphi.jlab.org/api/v1/chat/completions"))
.header("Authorization", "Bearer " + apiKey)
.header("Content-Type", "application/json")
.POST(HttpRequest.BodyPublishers.ofString(body))
.build();
// The stream arrives as Server-Sent Events: each non-empty line is
// "data: {json chunk}", and the stream ends with "data: [DONE]".
client.sendAsync(req, HttpResponse.BodyHandlers.ofLines())
    .thenAccept(response ->
        response.body().forEach(System.out::println)
    )
    .join();
}
}
5. Response Format Documentation
Responses follow the OpenAI Chat Completions response object spec: https://platform.openai.com/docs/api-reference/chat/object
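For quick reference, here is a minimal sketch that pulls out the fields most scripts need. The field names follow the OpenAI spec linked above; exact values will differ.
# File: inspect_response.py
# Run: LLM_KEY=your_key python3 inspect_response.py
import os
import requests

resp = requests.post(
    "https://delphi.jlab.org/api/v1/chat/completions",
    headers={"Authorization": f"Bearer {os.getenv('LLM_KEY')}"},
    json={
        "model": "openai/gpt-oss-120b",
        "messages": [{"role": "user", "content": "One-line greeting."}],
    },
    timeout=60,
).json()

choice = resp["choices"][0]
print(choice["message"]["content"])   # the generated text
print(choice["finish_reason"])        # "stop" when complete, "length" if truncated
print(resp["usage"])                  # prompt_tokens / completion_tokens / total_tokens
print(resp["model"], resp["id"])      # which model served the request, request id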
6. Using the API in n8n (Visual Workflows)
You can create full agentic workflows without writing code.
Basic n8n Setup
1. Add an HTTP Request node.
2. Set:
   - Method: POST
   - URL: https://delphi.jlab.org/api/v1/chat/completions
3. Headers:
   {
     "Authorization": "Bearer {{ $json.api_key }}",
     "Content-Type": "application/json"
   }
4. JSON Body:
   {
     "model": "openai/gpt-oss-120b",
     "messages": [{"role": "user", "content": "Hello from n8n!"}]
   }
Example n8n templates: https://n8n.io/workflows
7. Latency Expectations
| Model Class | Typical Latency Per Token |
|---|---|
| Small models (1–7B) | 20–60 ms/token |
| 20B models | 40–150 ms/token |
| 120B models | 120–300 ms/token |
| Streaming (any model) | First tokens arrive immediately; fastest perceived response |
Latency depends on queue depth, GPU load, and context length.
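A rough way to check where your workload falls in this table is to time a streamed response yourself. This is a sketch, not a benchmark: stream chunks only approximate token boundaries, and a single run reflects the queue at that moment.
# File: measure_latency.py
# Run: LLM_KEY=your_key python3 measure_latency.py
import os
import time
from openai import OpenAI

client = OpenAI(
    api_key=os.getenv("LLM_KEY"),
    base_url="https://delphi.jlab.org/api/v1/",
)

start = time.perf_counter()
chunks = 0
stream = client.chat.completions.create(
    model="openai/gpt-oss-120b",
    stream=True,
    messages=[{"role": "user", "content": "Count to twenty."}],
)
for chunk in stream:
    # Some chunks (e.g., the final one) carry no content delta; skip them.
    if chunk.choices and chunk.choices[0].delta.content:
        chunks += 1
elapsed = time.perf_counter() - start
print(f"{chunks} chunks in {elapsed:.1f}s ≈ {1000 * elapsed / max(chunks, 1):.0f} ms/chunk")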
8. Troubleshooting
401 Unauthorized
API key missing, incorrect, or expired.
404 Model Not Found
Double-check model spelling from /v1/models.
429 Rate Limit
You reached concurrency or token budget limits.
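A minimal retry sketch for 429 responses is shown below. It assumes the gateway sets no Retry-After header; if it does, prefer that value over the fixed backoff.
# File: retry_on_429.py
# Run: LLM_KEY=your_key python3 retry_on_429.py
import os
import time
import requests

API_KEY = os.getenv("LLM_KEY")
URL = "https://delphi.jlab.org/api/v1/chat/completions"
payload = {
    "model": "openai/gpt-oss-120b",
    "messages": [{"role": "user", "content": "Hello"}],
}

for attempt in range(5):
    resp = requests.post(
        URL,
        headers={"Authorization": f"Bearer {API_KEY}"},
        json=payload,
        timeout=60,
    )
    if resp.status_code != 429:
        break
    time.sleep(2 ** attempt)  # exponential backoff: 1, 2, 4, 8 seconds
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])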
500 Worker Offline
One of the vLLM workers may be unavailable; contact the maintainers.
Connection Timeout
Large context windows or large outputs may take longer; try streaming mode.
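If streaming is not an option, raising the client-side timeout can also help. This sketch uses the timeout argument of the openai-python 1.x client (passed through to the underlying HTTP client); the 300-second value is an arbitrary example, not a recommended setting.
# File: long_timeout.py
# Run: LLM_KEY=your_key python3 long_timeout.py
import os
from openai import OpenAI

client = OpenAI(
    api_key=os.getenv("LLM_KEY"),
    base_url="https://delphi.jlab.org/api/v1/",
    timeout=300,  # seconds; chosen arbitrarily for illustration
)

resp = client.chat.completions.create(
    model="openai/gpt-oss-120b",
    messages=[{"role": "user", "content": "Summarize the OpenAI chat API in one paragraph."}],
)
print(resp.choices[0].message.content)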