Interactive LLM Jobs using Slurm
Here is a simple introduction to using Jefferson Lab computing resources to run LLMs. I've broken it into sections and tried to keep things step-by-step and user-friendly, while including technical notes throughout.
This guide walks you through running a large language model on an A800 GPU with 40 GB of VRAM using vLLM on a Slurm-managed GPU node, from start to finish. You will interact with the model through its OpenAI-compatible API. By the end of this introduction, you should be able to reverse tunnel to your machine to expose the LLM to local tools such as n8n or other interfaces.
1. Log into the ifarm
To run jobs in Slurm, you must be on an interactive farm (ifarm) node. Follow the Connecting to Farm and QCD Interactive article.
In short, there are two steps: connect to the login machine using your MFA token, then hop to ifarm.
ssh <username>@login.jlab.org # Replace with actual CUE username
Once connected to the login machine, connect to the ifarm node. You may want to add the -Y specification to use X11 forwarding to your local machine.
ssh <username>@ifarm
ℹ️ It is highly recommended to use the virtual desktop interface (VDI) for connecting from off-site when using graphical user interfaces. X11 port forwarding is slower to render GUIs than the virtual desktop.
2. Allocate an Interactive Slurm Session
Jefferson Lab uses Slurm to allocate compute resources to users and groups across the lab with a fairshare model. A series of Slurm knowledge base articles is available to guide you. Before continuing with this tutorial, it is recommended you read the article on specifying GPUs in Slurm jobs and the Slurm Quickstart Guide.
Common Issues
ℹ️ Make sure your user account has been registered to one or more Slurm batch accounts by following this article.
ℹ️ Make sure you have created your jcert as well.
Example salloc command with GPU allocation
For interactive jobs, it is best practice to allocate the resources you intend to use via salloc, so that you keep the allocation even if you disconnect. You could use the srun command directly, but once you leave the pseudo-terminal (PTY), your request will be terminated and any data on the node's local file system will be wiped for the next user.
Below is an example of reserving a single A800 GPU on the gpu partition for 30 minutes. Make sure to update the account field with your account.
salloc --partition gpu --account <project_account> --gres=gpu:A800:1 --mem=128G --time=00:30:00
- --partition gpu — Request the GPU partition
- --account <account> — Your Slurm account (check with sacctmgr list user <your_username>)
- --gres=gpu:A800:1 — Request 1 A800 GPU
- --mem=128G — Reserve enough RAM for large models
- --time=00:30:00 — Set the job time limit (30 minutes here)
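If you find yourself tweaking these resources often, the command is easy to compose programmatically. The helper below is purely illustrative (not a JLab tool); its defaults mirror the example above:

```python
import shlex

def salloc_command(account: str, gpu_type: str = "A800", n_gpus: int = 1,
                   mem: str = "128G", time: str = "00:30:00",
                   partition: str = "gpu") -> str:
    """Compose the salloc command line from its resource parameters."""
    args = [
        "salloc",
        "--partition", partition,
        "--account", account,          # your Slurm batch account
        f"--gres=gpu:{gpu_type}:{n_gpus}",
        f"--mem={mem}",
        f"--time={time}",
    ]
    return shlex.join(args)            # shell-safe quoting

print(salloc_command("<project_account>"))
```

Adjust the arguments (e.g., `n_gpus=2` or a longer `time`) and paste the printed command into your ifarm shell.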
ℹ️ You must be on an interactive farm node (ifarm) to reach the Slurm daemon.
Additional information on the Jefferson Lab Slurm cluster can be found on the scicomp website.
Connect with srun
Once you have an allocation, you will be given a Slurm Job ID. While it is shown in your terminal, you can also see all your existing Slurm jobs using squeue.
squeue -u $USER
Now we can connect a pseudo-terminal via --pty and run bash in it.
srun --jobid=<JOB_ID> --pty bash
⚠️ If your terminal appears unresponsive, do not close it just yet. Closing the terminal will end your Slurm allocation and remove everything you've added to the local disk. Instead:
- Open a second terminal and SSH back into ifarm.
- Attach to the session using tmux, or run another srun --jobid=<slurm_job_id> --overlap --pty bash.
ℹ️ If you only use srun without first running salloc, you will NOT be able to reconnect using --overlap.
3. Use tmux Within Interactive Jobs: Recommended
To prevent catastrophic loss of progress in an interactive session over a PTY, I recommend allocating with salloc and using tmux to keep your pseudo-terminal open.
What is a PTY?
- A pseudo-terminal (PTY) is an abstraction used by the kernel to simulate a terminal.
- srun --pty allocates a PTY slave device to mimic an interactive session (like ssh or bash).
- If the PTY master (the srun --pty bash command) exits, the kernel closes the slave side.
- This triggers job termination, as Slurm sees the task exit.
Why salloc is Better for Interactive Sessions
- salloc reserves resources without running a command.
- You can then srun into the allocation without losing the whole session if something fails.
Best Practice: Use tmux for Interactive Work
Instead of relying on a fragile PTY from srun --pty bash, use tmux.
- Survives SSH disconnects.
- You can detach/reconnect at will.
Where to Start tmux?
Run tmux on the Slurm compute node, after salloc or srun, NOT on your local machine or on the ifarm node. We are going to run multiple things on the Slurm compute node, including vLLM. Please note the commands listed at each step within the figure below.
Workflow Example
Please refer to the tmux cheat sheet for additional information on tmux commands.
4. Verify GPUs on the Slurm Node
With the interactive connection out of the way, we can get to the fun part: running vLLM. First, we should ensure we're attached to the targeted host and that the required GPU resources are available.
hostname -f # Should be a sciml node
nvidia-smi # Lists nvidia GPUs accessible on the host
ℹ️ You should see at least one A800 GPU listed with no (or minimal) memory usage. If no GPUs show up when running nvidia-smi, make sure you're on the sciml node and that you included the --gres=gpu:A800:1 allocation requirement in salloc.
Recap: You are now connected to a Slurm GPU node with access to A800 GPUs.
5. Set Up the Python Environment
For this initial setup, we will create a local Python virtual environment on the sciml node. This will be removed when we cancel the allocation. We install the packages on the sciml node so that the proper NVIDIA GPU and drivers are targeted by pip install vllm[triton].
Create and activate a python virtual environment
cd /scratch/slurm/$SLURM_JOB_ID # Navigate to the slurm scratch directory
python -m venv vllm-env
source vllm-env/bin/activate
Install vLLM
python -m pip install --upgrade pip
python -m pip install vllm[triton]
ℹ️ /scratch is a local NVMe drive on the sciml nodes, which leads to much faster model loading times for files placed in this area.
ℹ️ Using python -m pip install instead of a bare pip install ensures you are targeting the Python interpreter of your active virtual environment.
Verify installation
Start Python:
python
Then check:
import torch
import vllm
print(torch.cuda.is_available()) # Should be True
Save your Virtual Environment for Later Use
To save yourself time in the future, copy your virtual environment to a storage location such as /work. Additional information on the /work file system can be found in the Getting Started - Filesystems article.
Reusing Your Python Environment
When you reconnect to a Slurm node, you can reuse your Python virtual environment by sourcing it:
source /work/path/to/vllm-env/bin/activate
Recap: A Python virtual environment was set up with all the packages needed to run vLLM, with Triton kernel support built for the targeted CUDA version. You can optionally save your virtual environment to a persistent file system such as /work for future use.
6. Run a Moderate-Sized Model with vLLM Locally
We will use a 14B reasoning model as our starting example for vLLM. The nvidia/OpenReasoning-Nemotron-14B model was chosen since it does not require acceptance of a license:
python -m vllm.entrypoints.openai.api_server \
--model nvidia/OpenReasoning-Nemotron-14B \
--gpu-memory-utilization 0.90 \
--port 8000 \
--max-model-len 4096 \
--download-dir /scratch/slurm/$SLURM_JOB_ID/hf_models
This command starts a vLLM OpenAI-compatible API server on port 8000. Up to 90% of the GPU memory is used, and the context length for an individual prompt is limited to 4096 tokens. The model is downloaded automatically from Hugging Face to the --download-dir.
- Downloads the model directly to the node's scratch space.
- If using a persistent location, adjust --download-dir accordingly, but note that the local NVMe drive will be much faster.
Let the model finish downloading from Hugging Face, loading into GPU VRAM, and compiling. This will take several minutes. You will see the following once the model is loaded into GPU memory and the OpenAI-compliant API is up and running:
INFO 07-29 15:46:25 [launcher.py:37] Route: /metrics, Methods: GET
INFO: Started server process [1481582]
INFO: Waiting for application startup.
INFO: Application startup complete.
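Rather than eyeballing the logs, you can poll the /v1/models endpoint until the server answers. This is a minimal sketch using only the Python standard library; the port, timeout, and poll interval are assumptions you may want to adjust:

```python
import json
import time
import urllib.request

API_BASE = "http://localhost:8000/v1"  # vLLM port from this guide

def served_model_ids(models_json: str) -> list:
    """Extract model IDs from an OpenAI-style /v1/models response body."""
    payload = json.loads(models_json)
    return [entry["id"] for entry in payload.get("data", [])]

def wait_for_server(timeout_s: int = 900, poll_s: int = 10) -> list:
    """Poll /v1/models until the server responds or the timeout expires."""
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        try:
            with urllib.request.urlopen(f"{API_BASE}/models", timeout=5) as resp:
                return served_model_ids(resp.read().decode())
        except OSError:
            time.sleep(poll_s)  # server is still downloading/loading the model
    raise TimeoutError("vLLM server did not come up in time")

if __name__ == "__main__":
    print(wait_for_server())
```

Run this from a second tmux pane on the same node; it returns the list of served model IDs once startup completes.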
7. Interact with the Model
This is where running inside tmux dramatically helps. If you are in tmux, you can use Ctrl-b " to split the window into two panes. Please refer to the tmux cheat sheet for additional commands. In a new terminal or pane, you can now query vLLM to see what model is available on port 8000.
curl http://localhost:8000/v1/models | python3 -m json.tool
You should see the loaded model listed.
Test a completion
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "nvidia/OpenReasoning-Nemotron-14B",
"messages": [{"role": "user", "content": "Hello, what is vLLM?"}]
}' | python3 -m json.tool
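The same request can be made from Python using only the standard library, which is handy once you script against the model. This is a sketch: the model name and port mirror the example above, so adjust them if your server differs:

```python
import json
import urllib.request

def build_chat_request(model: str, prompt: str, max_tokens: int = 512) -> dict:
    """Build an OpenAI-style chat completion payload."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }

def chat(prompt: str, base: str = "http://localhost:8000/v1") -> str:
    """POST a chat completion to the local vLLM server and return the reply text."""
    body = json.dumps(
        build_chat_request("nvidia/OpenReasoning-Nemotron-14B", prompt)
    ).encode()
    req = urllib.request.Request(
        f"{base}/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        reply = json.load(resp)
    return reply["choices"][0]["message"]["content"]

if __name__ == "__main__":
    print(chat("Hello, what is vLLM?"))
```

Because the endpoint is OpenAI-compatible, the official openai Python client pointed at base_url="http://localhost:8000/v1" works as well.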
✅ Recap: You successfully downloaded and ran an LLM on a farm node using an A800 GPU with 40 GB of memory. It was accessible via the live OpenAI-compliant API at http://localhost:8000/v1/chat/completions.
Notes on Larger Models
- Large models (13B–70B+) can take 20–60+ GB of VRAM. Longer context windows take additional VRAM to run. It is common for models to have large default context lengths, but the default may not fit in your GPU's memory.
Model Memory Heuristics (simplified)
Total VRAM ≈ model_size + α × (context_length × batch_size) × num_layers × num_attention_heads × head_dim × 2 bytes
Where:
- α ≈ 2.0–2.5 (safety factor: memory fragmentation, scheduler slack)
- context_length = input + output tokens
- head_dim ≈ 128 (for LLaMA-style models)
- 2 bytes per element for an fp16 KV cache
- num_layers and num_attention_heads come from the model's configuration
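The heuristic above can be turned into a small calculator. The layer and head counts below are hypothetical placeholders for a 14B-class model; read the real values from the model's config.json on Hugging Face:

```python
def kv_cache_bytes(context_length: int, batch_size: int,
                   num_layers: int, num_heads: int,
                   head_dim: int = 128, bytes_per_elem: int = 2,
                   alpha: float = 2.0) -> float:
    """Estimate KV-cache VRAM per the heuristic above.

    alpha folds in the safety factor (fragmentation, scheduler slack);
    bytes_per_elem=2 corresponds to an fp16 cache.
    """
    return (alpha * context_length * batch_size
            * num_layers * num_heads * head_dim * bytes_per_elem)

# Hypothetical 14B-class model: 48 layers, 40 attention heads.
weights_gb = 14e9 * 2 / 1e9                       # fp16 weights: ~28 GB
kv_gb = kv_cache_bytes(4096, 1, 48, 40) / 1e9     # KV cache at 4096-token context
print(f"weights ~{weights_gb:.0f} GB, KV cache ~{kv_gb:.1f} GB")
```

This also shows why trimming --max-model-len helps: the KV-cache term scales linearly with context length, so halving the context roughly halves that part of the budget.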
Download Models Locally Beforehand
- Save to /work/... if you'll reuse the model.
- Copy the model to the local /scratch/slurm/$SLURM_JOB_ID/ before loading.
8. (Optional) Set Up a Reverse Tunnel
If you want to access this model externally (e.g., from a VM or another service like n8n or LM Studio):
- Use ssh -J login -R 8000:localhost:8000 username@target.jlab.org to forward ports.
9. Run a Local Web GUI / Chat Interface
A small webui from llama.cpp has been set up in a container to serve as a rough web-based frontend. It is available on the internal JLab registry as llama-webui. Anyone with a Jefferson Lab account can access the container registry and pull the container to run on the sciml node alongside vLLM.
- To pull the container, follow along with the JLab-container-docs.
podman login # Use your CUE username and GitLab access token
podman pull codecr.jlab.org/cmorean/llm/llama-webui:latest
Now, on the sciml node, run:
podman run --rm --network host llama-webui
The webui should now be running alongside vLLM on the sciml node on port 8080. To access the frontend, reverse tunnel port 8080 along with port 8000:
ssh -J login -R 8000:localhost:8000 -R 8080:localhost:8080 username@target.jlab.org
ℹ️ The connection may be quite slow depending on the computer the port is forwarded to. Ideally, this would be forwarded as standard 443 traffic through a secure gateway (planned for the future).
Advanced Setup (To Be Continued...)
- Persistent model cache using /work.
- Running batch inference on a jsonl input.
- Serve an embedding model alongside an LLM.
- Prometheus/Grafana metrics endpoint from vLLM.