
Interactive LLM Jobs using Slurm

Here is a simple introduction to using Jefferson Lab computing resources to run LLMs. I've broken it into sections and tried to keep things step-by-step and user-friendly, while including technical notes throughout. This guide walks you through running a large language model on an A800 with 40 GB of VRAM using vLLM on a Slurm-managed GPU node from start to finish. You will be able to interact with the model using the OpenAI-compatible API. By the end of this introduction, you should be able to reverse-tunnel to your machine to expose the LLM to your local tools, such as n8n or other interfaces.


1. Log into the ifarm

To run jobs in Slurm, you must be on an interactive farm (ifarm) node. Follow the Connecting to Farm and QCD Interactive article.

There are basically two steps: connect to the login machine using your MFA token, then hop from there to ifarm.

ssh <username>@login.jlab.org  # Replace with actual CUE username

Once connected to the login machine, connect to the ifarm node. You may want to add the -Y flag to enable X11 forwarding to your local machine.

ssh <username>@ifarm

ℹ️ It is highly recommended to use the virtual desktop interface (VDI) when connecting from off-site to use graphical user interfaces. X11 forwarding is slower to render GUIs than the virtual desktop.
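ℹ️ Optionally, a ProxyJump entry in the ~/.ssh/config on your local machine lets a single ssh ifarm handle both hops. This is a standard OpenSSH feature; the host aliases below are just examples:

Host jlab-login
    HostName login.jlab.org
    User <username>

Host ifarm
    HostName ifarm  # Resolved by the jump host
    User <username>
    ProxyJump jlab-login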


2. Allocate an Interactive Slurm Session

Jefferson Lab utilizes Slurm to allocate compute resources to users and groups from across the lab using a fairshare model. A series of Slurm knowledge base articles is available to guide users on Slurm. Before continuing with this tutorial, it is recommended that you read the article on specifying GPUs in Slurm jobs and the Slurm Quickstart Guide.

Common Issues

ℹ️ Make sure your user account has been registered to one or more Slurm batch accounts by following this article.
ℹ️ Make sure you have created your jcert as well.

Example salloc command with GPU allocation

For interactive jobs, it is best practice to allocate the resources you intend to use via salloc, so that if you disconnect you still maintain the allocation. You could use the srun command directly, but once you leave the pseudo-terminal (PTY), your allocation will be terminated and any data on the node's local file system will be wiped for the next user.

Below is an example of requesting a single A800 GPU in the gpu partition for 30 minutes. Make sure to update the account field with your account.

salloc --partition gpu --account <project_account> --gres=gpu:A800:1 --mem=128G --time=00:30:00
  • --partition gpu – Request the GPU partition
  • --account <account> – Your Slurm account (check with sacctmgr, as shown below)
  • --gres=gpu:A800:1 – Request 1 A800 GPU
  • --mem=128G – Reserve enough RAM for large models
  • --time=00:30:00 – Set the job time limit (30 minutes here)
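If you are unsure which accounts you belong to, sacctmgr can list your associations; the format fields shown here are just one reasonable choice of columns:

sacctmgr show associations user=$USER format=Account,Partition,QOS  # Lists your Slurm accounts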

ℹ️ You must be on an interactive farm node (ifarm) to reach the Slurm Daemon.

Additional information on the Jefferson Lab Slurm cluster can be found on the SciComp website.

3. Connect with srun

Once you have an allocation, you will be given a Slurm Job ID. While it is shown in your terminal, you can also see all your existing Slurm jobs using squeue.

squeue -u $USER

Now we can connect a pseudo-terminal via --pty and run bash in it.

srun --jobid=<JOB_ID> --pty bash

⚠️ If your terminal appears unresponsive, do not close it just yet. Closing the terminal will end your Slurm allocation and remove everything you've added to the local disk.

  • Open a second terminal and SSH back into ifarm.
  • Attach to the session using tmux or run another srun --jobid=<slurm_job_id> --overlap --pty bash.

ℹ️ If you only use srun without first running salloc, you will NOT be able to reconnect using --overlap.


To prevent catastrophic loss of progress in an interactive session over a PTY, I recommend allocating with salloc and using tmux to keep your pseudo-terminal open.

What is a PTY?

  • A pseudo-terminal (PTY) is an abstraction used by the kernel to simulate a terminal.
  • srun --pty allocates a PTY slave device to mimic an interactive session (like ssh or bash).
  • If the PTY master (the command srun --pty bash) exits, the kernel closes the slave side.
  • This triggers job termination, as Slurm sees the task exit.

Why salloc is Better for Interactive Sessions

  • salloc reserves resources without running a command.
  • You can then srun into the allocation without losing the whole session if something fails.

Best Practice: Use tmux for Interactive Work

Instead of relying on a fragile PTY from srun --pty bash, use tmux.

  • Survives SSH disconnects.
  • You can detach/reconnect at will.

Where to Start tmux?

Run tmux on the Slurm compute node, after salloc and srun, NOT on your local machine or the ifarm node. We are going to run multiple things on the Slurm compute node, including vLLM. Please note the commands listed at each step in the workflow example below.

Workflow Example

Please refer to the tmux cheat sheet for additional information on tmux commands.
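A minimal sketch of the workflow (the session name llm is arbitrary):

tmux new -s llm  # Start a named tmux session on the compute node
# ... run vLLM and anything else inside this session ...
# Detach with Ctrl-b d; your processes keep running.
srun --jobid=<JOB_ID> --overlap --pty bash  # After a disconnect, get back onto the node
tmux attach -t llm  # Reattach to your session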


4. Verify GPUs on the Slurm Node

With a proper interactive connection out of the way, we can get to the fun part: running vLLM. First, we should ensure we are attached to the targeted host and that the required GPU resources are available.

hostname -f # Should be a sciml node
nvidia-smi # Lists nvidia GPUs accessible on the host

ℹ️ You should see at least one A800 GPU listed with no (or minimal) memory usage. If no GPUs show up when running nvidia-smi, make sure you are on the sciml node and that you included the --gres=gpu:A800:1 allocation requirement in salloc.
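For a compact, scriptable check, nvidia-smi also supports a query mode:

nvidia-smi --query-gpu=name,memory.total,memory.used --format=csv  # One line per GPU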

Recap: You are now connected to a Slurm GPU node with access to A800 GPUs.


5. Set Up the Python Environment

For this initial setup, we will make a local Python virtual environment on the sciml node. It will be removed when we cancel the allocation. We install the packages on the sciml node so that the proper NVIDIA GPU and drivers are targeted by pip install "vllm[triton]".

Create and activate a Python virtual environment

cd /scratch/slurm/$SLURM_JOB_ID # Navigate to the slurm scratch directory
python -m venv vllm-env
source vllm-env/bin/activate

Install vLLM

python -m pip install --upgrade pip
python -m pip install "vllm[triton]"  # Quote the extras so the shell does not glob the brackets

ℹ️ /scratch is a local NVMe drive in the sciml nodes, which leads to much faster model loading times for files placed in this area.

ℹ️ Using python -m pip install rather than a bare pip install ensures you are targeting the interpreter of your virtual environment.
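A quick sanity check that the interpreter and pip belong to your virtual environment:

which python  # Should point into .../vllm-env/bin
python -m pip --version  # Shows the site-packages path being targeted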

Verify installation

Start Python:

python

Then check:

import torch
import vllm
print(torch.cuda.is_available()) # Should be True

Save your Virtual Environment for Later Use

To save yourself time in the future, copy your virtual environment to a persistent storage location such as /work. Additional information on the /work file system can be found in the Getting Started - Filesystems article.
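A minimal sketch of saving it; the destination path is a placeholder for your own /work area:

cp -r /scratch/slurm/$SLURM_JOB_ID/vllm-env /work/<project>/venvs/vllm-env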

Reusing Your Python Environment

When you reconnect to a Slurm node, you can reuse your Python virtual environment by sourcing it:

source /work/path/to/vllm-env/bin/activate

Recap: A Python virtual environment was set up with all the packages needed to run vLLM with Triton kernel support for the targeted CUDA version. You can optionally save your virtual environment to a persistent file system such as /work for future use.


6. Run a Moderate-Sized Model with vLLM Locally

We will use a 14B reasoning model as our first example with vLLM. nvidia/OpenReasoning-Nemotron-14B was chosen since it does not require acceptance of a license:

python -m vllm.entrypoints.openai.api_server \
--model nvidia/OpenReasoning-Nemotron-14B \
--gpu-memory-utilization 0.90 \
--port 8000 \
--max-model-len 4096 \
--download-dir /scratch/slurm/$SLURM_JOB_ID/hf_models

In this command, a vLLM OpenAI-compatible API server is started on port 8000. Up to 90% of the GPU memory may be used, and the context length for an individual prompt is capped at 4096 tokens. The model is downloaded automatically from Hugging Face to the --download-dir.

  • Downloads the model directly to the node's scratch space.
  • If using a persistent location, adjust --download-dir accordingly, but realize that the local NVMe drive will be much faster.

Let the model finish downloading from Hugging Face, loading into GPU VRAM, and compiling. This will take several minutes. You will see the following once the model is downloaded, loaded into GPU memory, and the OpenAI-compatible API is up and running:

INFO 07-29 15:46:25 [launcher.py:37] Route: /metrics, Methods: GET
INFO: Started server process [1481582]
INFO: Waiting for application startup.
INFO: Application startup complete.

7. Interact with the Model

This is where running in tmux will dramatically help you. If you are in tmux, you can use Ctrl-b " to split the pane horizontally. Please refer to the tmux cheat sheet for additional commands. In a new terminal or window, you can now query vLLM to see what model is available on port 8000.

curl http://localhost:8000/v1/models | python3 -m json.tool

You should see the loaded model listed.

Test a completion

curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "nvidia/OpenReasoning-Nemotron-14B",
"messages": [{"role": "user", "content": "Hello, what is vLLM?"}]
}' | python3 -m json.tool

✅ Recap: You successfully downloaded and ran an LLM on the farm node using an A800 GPU with 40 GB of memory. It is accessible via the live OpenAI-compatible API at http://localhost:8000/v1/chat/completions.

Notes on Larger Models

  • Large models (13B–70B+) can take 20–60+ GB of VRAM, and longer context windows take additional VRAM to run. It is common for models to have large default context lengths, but your GPU may not fit everything.

Model Memory Heuristics (simplified):

Total VRAM ≈ model size + α × (context_length × batch_size) × num_layers × num_heads × head_dim × 2 bytes

Where:

  • α ≈ 2.0–2.5 (safety factor: memory fragmentation, scheduler slack)
  • context_length = input + output tokens
  • num_heads = number of attention heads; head_dim ≈ 128 (for LLaMA-style models)
  • 2 bytes per element for an fp16 KV cache (a worked example follows the list)
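As a worked example, take a hypothetical 14B LLaMA-style model with 48 layers, 40 attention heads, and head_dim 128 at a 4096-token context with batch size 1; these dimensions are illustrative, not the actual OpenReasoning-Nemotron-14B configuration:

CTX=4096; BATCH=1; LAYERS=48; HEADS=40; HEAD_DIM=128; BYTES=2
KV=$(( CTX * BATCH * LAYERS * HEADS * HEAD_DIM * BYTES ))  # KV-cache size in bytes
echo "KV cache: $(( KV / 1024 / 1024 )) MiB"               # ~1920 MiB
echo "With alpha=2: $(( 2 * KV / 1024 / 1024 )) MiB"       # ~3840 MiB

Adding roughly 28 GB of fp16 weights for a 14B model gives about 32 GB total, which fits on a 40 GB A800.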

Download Models Locally Beforehand

  • Save the model to /work/... if you'll reuse it.
  • Copy the model to the local /scratch/slurm/$SLURM_JOB_ID/ before loading; see the sketch below.
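One way to pre-download is with huggingface-cli, which is installed alongside vLLM via huggingface_hub; the /work paths are placeholders:

huggingface-cli download nvidia/OpenReasoning-Nemotron-14B --local-dir /work/<project>/hf_models/OpenReasoning-Nemotron-14B
cp -r /work/<project>/hf_models/OpenReasoning-Nemotron-14B /scratch/slurm/$SLURM_JOB_ID/hf_models/  # Copy to the fast local NVMe before serving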

8. (Optional) Set Up a Reverse Tunnel

If you want to access this model externally (e.g., from a VM or another service like n8n or LM Studio):

  • From the compute node, use ssh -J login -R 8000:localhost:8000 username@target.jlab.org to forward the port.
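To confirm the tunnel works, with the command above running from the sciml node:

curl http://localhost:8000/v1/models  # Run on the target host; should list the loaded model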

9. Run a Local Web GUI / Chat Interface

A small webui from llama.cpp has been set up in a container to serve as a rough web-based frontend. It is available on the internal JLab registry as llama-webui. Anyone with a Jefferson Lab account can access the container registry and pull the container to run on the sciml node alongside vLLM.

podman login codecr.jlab.org  # Use your CUE username and a GitLab access token
podman pull codecr.jlab.org/cmorean/llm/llama-webui:latest

Now, on the sciml node, run:

podman run --rm --network host codecr.jlab.org/cmorean/llm/llama-webui:latest

The webui should now be running alongside vLLM on the sciml node on port 8080. To access the frontend, reverse-tunnel port 8080 along with port 8000:

ssh -J login -R 8000:localhost:8000 -R 8080:localhost:8080 username@target.jlab.org
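Once the tunnel is up, the webui should answer on the forwarded port:

curl -I http://localhost:8080  # Run on the target host, or open http://localhost:8080 in a browser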

ℹ️ The connection may be quite slow depending on the computer the ports are forwarded to. Ideally, this would be forwarded as standard port-443 traffic through a secure gateway (planned for the future).


Advanced Setup (To Be Continued...)

  • Persistent model cache using /work.
  • Running batch inference on a JSONL input.
  • Serve an embedding model alongside an LLM.
  • Prometheus/Grafana metrics endpoint from vLLM.