Overview
This document introduces the open-weight large language models (LLMs) deployed at Jefferson Lab. These services expose high-performance, GPU-accelerated models to internal users through authenticated and auditable interfaces.
The local LLM system consists of four major components:
- LibreChat – The user-facing chat frontend, authenticated through CILogon
- LiteLLM API Gateway – A consolidated OpenAI-compatible inference gateway
- Key Manager – A service that issues and manages scoped API keys for users and bots
- vLLM Workers – GPU-backed inference servers deployed on dedicated NVIDIA A100 nodes
Together, these components allow users to interact with foundation models while maintaining strong identity guarantees, traceability, and fine-grained usage control.
High-Level Architecture
Component Responsibilities
LibreChat
- Authenticates users via CILogon OIDC
- Stores chat histories and model endpoints
- Provides a clean chat UI for local models
- Future frontend for internal agentic tools and retrieval-augmented generation (RAG) endpoints
Key Manager
- Secure issuance and management of API keys
- Enforces budgets and key expiration
LiteLLM Gateway
- A high-performance, OpenAI-compatible API façade (see the example after this list)
- Handles routing to GPU-backed model workers
- Provides observability and metrics (future)
- Supports fallbacks, model selection, and server fan-out (future)
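Because the gateway is OpenAI-compatible, the standard OpenAI Python client can talk to it directly. A minimal sketch follows; the base URL, model alias, and key shown are placeholders, not the deployed values:

```python
# Minimal sketch of a chat request through the LiteLLM gateway.
# The gateway URL and model alias below are hypothetical; substitute
# the values published for the internal deployment.
from openai import OpenAI

client = OpenAI(
    base_url="https://llm-gateway.example.org/v1",  # hypothetical internal gateway URL
    api_key="sk-...",                               # key issued by the Key Manager
)

response = client.chat.completions.create(
    model="llama-3.1-8b-instruct",  # example alias; check the gateway's model list
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What does the LiteLLM gateway do?"},
    ],
)
print(response.choices[0].message.content)
```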
vLLM Workers
- Each worker hosts one or more models
- Runs with GPU isolation on 4× A100 GPUs
- Exposes the /v1/chat/completions and /v1/completions endpoints (see the sketch after this list)
- Designed for low-latency inference serving many users simultaneously
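At the HTTP level these endpoints accept plain JSON POSTs, which is useful for clients that do not use the OpenAI SDK. The sketch below shows the request shape; in practice traffic is routed through the LiteLLM gateway rather than sent to a worker directly, and the host, port, and model id are illustrative:

```python
# Sketch of the OpenAI-compatible /v1/completions request a vLLM worker serves.
# The address and model id are placeholders; real traffic goes via the gateway.
import requests

payload = {
    "model": "mistral-7b-instruct",  # example model id
    "prompt": "Explain GPU-accelerated inference in one sentence.",
    "max_tokens": 64,
    "temperature": 0.2,
}

resp = requests.post(
    "http://gpu-node.example:8000/v1/completions",  # hypothetical worker address
    headers={"Authorization": "Bearer sk-..."},     # API key from the Key Manager
    json=payload,
    timeout=60,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["text"])
```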
Security & Identity
We use CILogon → LibreChat → API Key → LiteLLM as a layered identity model:
- CILogon provides federated authentication
- LibreChat provides per-user sessions
- Key Manager issues scoped API keys
- LiteLLM enforces key validation and provides the control surface
This avoids per-user OAuth flows for backend services and keeps the LLM infrastructure scalable, isolated, and maintainable.
Model Hosting Strategy
The production inference node consists of:
- 4× NVIDIA A100 80 GB GPUs
- Running one or more models via vLLM
- Dedicated reserved node
- Local model storage for fast loading
- Prometheus metrics exported to a monitoring VM
The system currently supports models in the GPT-OSS, Llama 3.x, Mistral, and Nemotron families. Several embedding models are also supported through the API.
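Embedding requests go through the same OpenAI-compatible interface. A minimal sketch, assuming a hypothetical gateway URL and embedding-model alias:

```python
# Minimal sketch of an embeddings request; URL and model alias are placeholders.
from openai import OpenAI

client = OpenAI(
    base_url="https://llm-gateway.example.org/v1",
    api_key="sk-...",
)

result = client.embeddings.create(
    model="nomic-embed-text",  # example embedding model alias
    input=["superconducting RF cavity", "beamline diagnostics log"],
)

for item in result.data:
    print(len(item.embedding))  # dimensionality of each returned vector
```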
User Access Modes
Users may interact with the deployed LLMs through:
- LibreChat Web UI (recommended)
- Programmatic API usage via the internal LiteLLM endpoint
- Service-to-service calls (n8n, pipelines) using API keys (see the sketch after this list)
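For the service-to-service case, credentials should come from the environment rather than source code. A sketch under that assumption; the environment variable names and model alias are illustrative:

```python
# Sketch of a pipeline-style call using an API key from the environment.
# LLM_GATEWAY_URL and LLM_API_KEY are hypothetical variable names.
import os
from openai import OpenAI

client = OpenAI(
    base_url=os.environ["LLM_GATEWAY_URL"],  # e.g. the internal LiteLLM endpoint
    api_key=os.environ["LLM_API_KEY"],       # key issued by the Key Manager
)

reply = client.chat.completions.create(
    model="gpt-oss-120b",  # example alias from the GPT-OSS family
    messages=[{"role": "user", "content": "Extract the run number from: 'run 021345 ended.'"}],
)
print(reply.choices[0].message.content)
```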
Goals of the Local LLM System
- Provide fast, secure, on-prem inference
- Allow controlled use of large foundation models
- Support experiment workflows (metadata extraction, analysis assistants)
- Enable internal research into model routing, prompt engineering, and agentic workflows
- Maintain strict isolation from external cloud services
- Provide a foundation for future RAG and embedding-based workflows
Find additional details on LibreChat, the Key Manager, and API usage at: