Overview

This document introduces the open-weight large language models (LLMs) deployed at Jefferson Lab. These services expose high-performance, GPU-accelerated LLMs to internal users through authenticated and auditable interfaces.

The local LLM system consists of four major components:

  1. LibreChat – The user-facing chat frontend, authenticated through CILogon
  2. LiteLLM API Gateway – A consolidated, OpenAI-compatible inference gateway
  3. Key Manager – A secure service that issues and manages API keys for users and bots
  4. vLLM Workers – GPU-backed inference servers running on dedicated NVIDIA A100 nodes

Together, these components allow users to interact with foundation models while maintaining strong identity guarantees, traceability, and fine-grained usage control.


High-Level Architecture


Component Responsibilities

LibreChat

  • Authenticates users via CILogon OIDC
  • Stores chat histories and model endpoints
  • Provides a clean chat UI for local models
  • Planned frontend for internal agentic tools and retrieval-augmented generation (RAG) endpoints

Key Manager

  • Secure issuance and management of API keys
  • Enforces per-key budgets and expiration (a request sketch follows below)
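
As a rough sketch of programmatic key issuance: the endpoint path, request fields, hostname, and credential below are hypothetical placeholders, not the Key Manager's actual interface, which is documented separately.

```python
import requests

# Entirely hypothetical request shape; consult the Key Manager docs
# for the real endpoint, fields, and authentication mechanism.
resp = requests.post(
    "https://keymanager.example.jlab.org/api/keys",  # placeholder URL
    json={
        "owner": "my-pipeline-bot",  # hypothetical: principal the key is issued to
        "budget_usd": 10.0,          # hypothetical: spend cap enforced downstream
        "expires_in_days": 30,       # hypothetical: key lifetime
    },
    headers={"Authorization": "Bearer <your-admin-token>"},  # placeholder credential
    timeout=10,
)
resp.raise_for_status()
print(resp.json())  # hypothetical response containing the new key
```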

LiteLLM Gateway

  • A high-performance, OpenAI-compatible API façade (see the client example after this list)
  • Handles routing to GPU-backed model workers
  • Provides observability and metrics (future)
  • Supports fallbacks, model selection, and server fan-out (future)
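
Because the gateway presents a standard OpenAI-compatible surface, any OpenAI client can target it directly. A minimal sketch using the official `openai` Python package; the base URL and model alias are assumptions, not the deployed values:

```python
from openai import OpenAI

# Point the standard OpenAI client at the internal gateway.
# The base URL and model alias are placeholders for the real deployment values.
client = OpenAI(
    base_url="https://litellm.example.jlab.org/v1",  # assumed gateway URL
    api_key="sk-...your-issued-key...",              # key issued by the Key Manager
)

resp = client.chat.completions.create(
    model="llama-3.1-8b-instruct",  # assumed model alias configured in LiteLLM
    messages=[{"role": "user", "content": "Summarize what vLLM does."}],
)
print(resp.choices[0].message.content)
```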

vLLM Workers

  • Each worker hosts one or more models
  • Runs with GPU isolation across 4× A100 GPUs
  • Exposes the /v1/chat/completions and /v1/completions endpoints (see the request sketch after this list)
  • Designed for low-latency inference serving many users simultaneously
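
These endpoints follow the OpenAI wire format, so they can also be exercised with plain HTTP. A minimal sketch using `requests`; the hostname, port, and model name are illustrative assumptions:

```python
import requests

# Assumed worker address and model identifier; substitute the real values.
url = "http://gpu-node.example.jlab.org:8000/v1/chat/completions"

payload = {
    "model": "llama-3.1-8b-instruct",  # assumed model identifier
    "messages": [{"role": "user", "content": "Hello from JLab!"}],
    "max_tokens": 128,
}
headers = {"Authorization": "Bearer sk-...your-issued-key..."}

resp = requests.post(url, json=payload, headers=headers, timeout=60)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```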

Security & Identity

We use CILogon → LibreChat → API Key → LiteLLM as a layered identity model:

  • CILogon provides federated authentication
  • LibreChat provides per-user sessions
  • Key Manager issues scoped API keys
  • LiteLLM validates keys and provides the control surface

This avoids per-user OAuth flows for backend services and keeps the LLM infrastructure scalable, isolated, and maintainable.


Model Hosting Strategy

The production inference node consists of:

  • 4× NVIDIA A100 80 GB GPUs
  • Running one or more models via vLLM
  • Dedicated reserved node
  • Local model storage for fast loading
  • Prometheus metrics exported to a monitoring VM

The system currently supports models in the GPT-OSS, Llama 3.x, Mistral, and Nemotron families. Several embedding models are also supported through the API.
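
Since the gateway is OpenAI-compatible, the currently exposed chat and embedding models can be discovered through the standard /v1/models endpoint, and embeddings requested through /v1/embeddings. The base URL and the embedding model alias below are assumptions:

```python
from openai import OpenAI

client = OpenAI(
    base_url="https://litellm.example.jlab.org/v1",  # assumed gateway URL
    api_key="sk-...your-issued-key...",
)

# List the model aliases the gateway currently routes to.
for model in client.models.list().data:
    print(model.id)

# Request an embedding; the model alias is a hypothetical example.
emb = client.embeddings.create(
    model="nomic-embed-text",  # hypothetical embedding model alias
    input="Hall D beamline metadata",
)
print(len(emb.data[0].embedding))  # dimensionality of the returned vector
```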


User Access Modes

Users may interact with the deployed LLMs through:

  1. LibreChat Web UI (recommended)
  2. Programmatic API usage via the internal LiteLLM endpoint
  3. Service-to-service calls (n8n, pipelines) using API keys (see the sketch after this list)
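
For unattended service-to-service use, the issued key is best supplied through the environment rather than hardcoded. A minimal sketch, assuming a JLAB_LLM_API_KEY variable you define yourself, along with placeholder gateway and model names:

```python
import os

from openai import OpenAI

# Read the issued key from the environment so it never lands in source control.
client = OpenAI(
    base_url="https://litellm.example.jlab.org/v1",  # assumed gateway URL
    api_key=os.environ["JLAB_LLM_API_KEY"],          # hypothetical variable name
)

resp = client.chat.completions.create(
    model="mistral-7b-instruct",  # assumed model alias
    messages=[{"role": "user", "content": "Extract the run metadata from this log: ..."}],
)
print(resp.choices[0].message.content)
```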

Goals of the Local LLM System

  • Provide fast, secure, on-prem inference
  • Allow controlled use of large foundation models
  • Support experiment workflows (metadata extraction, analysis assistants)
  • Enable internal research into model routing, prompt engineering, and agentic workflows
  • Maintain strict isolation from external cloud services
  • Provide a foundation for future RAG and embedding-based workflows

Find additional details on LibreChat, the Key Manager, and API usage at: