Overview

This document introduces the open-weight large language models (LLMs) deployed at Jefferson Lab. These services expose high-performance, GPU-accelerated LLMs to internal users through authenticated and auditable interfaces.

The local LLM system consists of four major components:

  1. LibreChat – The user-facing chat frontend, authenticated through CILogon
  2. LiteLLM API Gateway – A consolidated, OpenAI-compatible inference gateway
  3. Key Manager – A secure service that issues and manages API keys for users and bots
  4. vLLM Workers – GPU-backed inference servers running on dedicated NVIDIA A100 nodes

Together, these components allow users to interact with foundation models while maintaining strong identity guarantees, traceability, and fine-grained usage control.


High-Level Architecture


Component Responsibilities

LibreChat

  • Authenticates users via CILogon OIDC
  • Stores chat histories and model endpoints
  • Provides a clean chat UI for local models
  • Planned frontend for internal agentic tools and retrieval-augmented generation (RAG) endpoints

Key Manager

  • Secure issuance and management of API keys
  • Enforces per-key budgets and expiration (a request sketch follows below)
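
As a rough sketch of programmatic key issuance: the endpoint path, request fields, hostname, and credential below are hypothetical placeholders, not the Key Manager's actual interface, which is documented separately.

```python
import requests

# Entirely hypothetical request shape; consult the Key Manager docs
# for the real endpoint, fields, and authentication mechanism.
resp = requests.post(
    "https://keymanager.example.jlab.org/api/keys",  # placeholder URL
    json={
        "owner": "my-pipeline-bot",  # hypothetical: principal the key is issued to
        "budget_usd": 10.0,          # hypothetical: spend cap enforced downstream
        "expires_in_days": 30,       # hypothetical: key lifetime
    },
    headers={"Authorization": "Bearer <your-admin-token>"},  # placeholder credential
    timeout=10,
)
resp.raise_for_status()
print(resp.json())  # hypothetical response containing the new key
```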

LiteLLM Gateway

  • A high-performance, OpenAI-compatible API façade (see the client example after this list)
  • Handles routing to GPU-backed model workers
  • Provides observability and metrics (future)
  • Supports fallbacks, model selection, and server fan-out (future)
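
Because the gateway presents a standard OpenAI-compatible surface, any OpenAI client can target it directly. A minimal sketch using the official `openai` Python package; the base URL and model alias are assumptions, not the deployed values:

```python
from openai import OpenAI

# Point the standard OpenAI client at the internal gateway.
# The base URL and model alias are placeholders for the real deployment values.
client = OpenAI(
    base_url="https://litellm.example.jlab.org/v1",  # assumed gateway URL
    api_key="sk-...your-issued-key...",              # key issued by the Key Manager
)

resp = client.chat.completions.create(
    model="llama-3.1-8b-instruct",  # assumed model alias configured in LiteLLM
    messages=[{"role": "user", "content": "Summarize what vLLM does."}],
)
print(resp.choices[0].message.content)
```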

vLLM Workers

  • Each worker hosts one or more models
  • Runs with GPU isolation across 4× A100 GPUs
  • Exposes the /v1/chat/completions and /v1/completions endpoints (see the request sketch after this list)
  • Designed for low-latency inference serving many users simultaneously
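
These endpoints follow the OpenAI wire format, so they can also be exercised with plain HTTP. A minimal sketch using `requests`; the hostname, port, and model name are illustrative assumptions:

```python
import requests

# Assumed worker address and model identifier; substitute the real values.
url = "http://gpu-node.example.jlab.org:8000/v1/chat/completions"

payload = {
    "model": "llama-3.1-8b-instruct",  # assumed model identifier
    "messages": [{"role": "user", "content": "Hello from JLab!"}],
    "max_tokens": 128,
}
headers = {"Authorization": "Bearer sk-...your-issued-key..."}

resp = requests.post(url, json=payload, headers=headers, timeout=60)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```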

Security & Identity

We use CILogon → LibreChat → API Key → LiteLLM as a layered identity model:

  • CILogon provides federated authentication
  • LibreChat provides per-user sessions
  • Key Manager issues scoped API keys
  • LiteLLM validates keys and provides the control surface

This avoids per-user OAuth flows for backend services and keeps the LLM infrastructure scalable, isolated, and maintainable.


Model Hosting Strategy

The production inference node consists of:

  • 4× NVIDIA A100 80 GB GPUs
  • Running one or more models via vLLM
  • Dedicated reserved node
  • Local model storage for fast loading
  • Prometheus metrics exported to a monitoring VM

The system currently supports models in the GPT-OSS, Llama 3.x, Mistral, and Nemotron families. Several embedding models are also supported through the API.
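
Since the gateway is OpenAI-compatible, the currently exposed chat and embedding models can be discovered through the standard /v1/models endpoint, and embeddings requested through /v1/embeddings. The base URL and the embedding model alias below are assumptions:

```python
from openai import OpenAI

client = OpenAI(
    base_url="https://litellm.example.jlab.org/v1",  # assumed gateway URL
    api_key="sk-...your-issued-key...",
)

# List the model aliases the gateway currently routes to.
for model in client.models.list().data:
    print(model.id)

# Request an embedding; the model alias is a hypothetical example.
emb = client.embeddings.create(
    model="nomic-embed-text",  # hypothetical embedding model alias
    input="Hall D beamline metadata",
)
print(len(emb.data[0].embedding))  # dimensionality of the returned vector
```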


User Access Modes

Users may interact with the deployed LLMs through:

  1. LibreChat Web UI (recommended)
  2. Programmatic API usage via the internal LiteLLM endpoint
  3. Service-to-service calls (n8n, pipelines) using API keys (see the sketch after this list)
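
For unattended service-to-service use, the issued key is best supplied through the environment rather than hardcoded. A minimal sketch, assuming a JLAB_LLM_API_KEY variable you define yourself, along with placeholder gateway and model names:

```python
import os

from openai import OpenAI

# Read the issued key from the environment so it never lands in source control.
client = OpenAI(
    base_url="https://litellm.example.jlab.org/v1",  # assumed gateway URL
    api_key=os.environ["JLAB_LLM_API_KEY"],          # hypothetical variable name
)

resp = client.chat.completions.create(
    model="mistral-7b-instruct",  # assumed model alias
    messages=[{"role": "user", "content": "Extract the run metadata from this log: ..."}],
)
print(resp.choices[0].message.content)
```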

Goals of the Local LLM System

  • Provide fast, secure, on-prem inference
  • Allow controlled use of large foundation models
  • Support experiment workflows (metadata extraction, analysis assistants)
  • Enable internal research into model routing, prompt engineering, and agentic workflows
  • Maintain strict isolation from external cloud services
  • Provide a foundation for future RAG and embedding-based workflows

Find additional details on LibreChat, the Key Manager, and API usage at: