Local LLMs software development setup
A simple way to set up a dev env using reasoning + code models
Created: 2025-04-04
Guide for llama-swap & llama-server based development setups, focused on QwQ, Qwen Coder, Aider, and VSCode.
This builds on a Debian, CUDA, and PyTorch setup; everything below is executed on a fresh install of Debian 12 (stable), using CUDA and an RTX 3090 (24 GB VRAM).
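Before building anything, it is worth confirming that the GPU driver and CUDA toolkit are actually visible. A quick sanity check, assuming both were installed as part of that base setup:
# Driver and GPU visible?
nvidia-smi
# CUDA compiler on PATH? (needed for the CUDA build below)
nvcc --version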
First install llama.cpp - we will be using llama-server.
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
# Install missing curl lib
sudo apt update && sudo apt upgrade
sudo apt install libcurl4-openssl-dev
# Manual build using CUDA
cmake -B build -DGGML_CUDA=ON -DGGML_CUDA_ENABLE_UNIFIED_MEMORY=1
cmake --build build --config Release
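The build drops its binaries under build/bin. A quick check that the server compiled (--version prints the build info and exits):
./build/bin/llama-server --version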
The GGML_CUDA_ENABLE_UNIFIED_MEMORY=1 flag enables unified memory on Linux; this allows swapping to system RAM instead of crashing when the GPU VRAM is exhausted.
This leaves us with llama-server, an OpenAI-compatible HTTP server. Start it once to verify the build:
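This is only a sketch: the model path is a placeholder, and the flags match the ones used in the llama-swap config below (llama-swap will issue the real commands once it is set up).
./build/bin/llama-server \
  --host 127.0.0.1 --port 8999 \
  -ngl 99 --flash-attn \
  --model /path/to/Qwen2.5-Coder-32B-Instruct-Q4_K_M.gguf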
Next, we need to install llama-swap for automatic model swapping. I built mine from source:
git clone git@github.com:mostlygeek/llama-swap.git
cd llama-swap
# Install go if you don't have it already (see below)
make clean all
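If Go is missing, the packaged toolchain is the quickest route; this assumes Debian 12's golang package is recent enough for llama-swap, otherwise install a newer Go from go.dev.
sudo apt install golang-go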
Create llama-swap.config.yml:
models:
  "qwen-coder-32B":
    proxy: "http://127.0.0.1:8999"
    cmd: >
      /path/to/llama-server
      --host 127.0.0.1 --port 8999 --flash-attn --slots
      --ctx-size 16000
      --cache-type-k q8_0 --cache-type-v q8_0
      -ngl 99
      --model /path/to/Qwen2.5-Coder-32B-Instruct-Q4_K_M.gguf

  "QwQ":
    proxy: "http://127.0.0.1:9503"
    cmd: >
      /path/to/llama-server
      --host 127.0.0.1 --port 9503 --flash-attn --metrics --slots
      --cache-type-k q8_0 --cache-type-v q8_0
      --ctx-size 32000
      --samplers "top_k;top_p;min_p;temperature;dry;typ_p;xtc"
      --temp 0.6 --repeat-penalty 1.1 --dry-multiplier 0.5
      --min-p 0.01 --top-k 40 --top-p 0.95
      -ngl 99 --model /mnt/nvme/models/bartowski/Qwen_QwQ-32B-Q4_K_M.gguf
Start it with:
llama-swap --config llama-swap.config.yml
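llama-swap now answers on its own listen port (8080 here, which is what aider is pointed at below; adjust if you changed the listen address) and starts the matching llama-server on demand. A quick check that swapping works; the model field must match a name from the config:
curl http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "qwen-coder-32B", "messages": [{"role": "user", "content": "Say hello"}]}'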
Then create aider.model.settings.yml
# important: model names must match llama-swap configuration names !!!
- name: "openai/QwQ"
  edit_format: diff
  extra_params:
    max_tokens: 16384
    top_p: 0.95
    top_k: 40
    presence_penalty: 0.1
    repetition_penalty: 1
    num_ctx: 16384
  use_temperature: 0.6
  reasoning_tag: think
  weak_model_name: "openai/qwen-coder-32B"
  editor_model_name: "openai/qwen-coder-32B"

- name: "openai/qwen-coder-32B"
  edit_format: diff
  extra_params:
    max_tokens: 16384
    top_p: 0.8
    top_k: 20
    repetition_penalty: 1.05
  use_temperature: 0.6
  reasoning_tag: think
  editor_edit_format: editor-diff
  editor_model_name: "openai/qwen-coder-32B"
Invoke aider with:
aider --architect \
  --no-show-model-warnings \
  --model openai/QwQ \
  --editor-model openai/qwen-coder-32B \
  --model-settings-file aider.model.settings.yml \
  --openai-api-key "sk-na" \
  --openai-api-base "http://127.0.0.1:8080/v1"
If you find yourself frequently switching between two models in your workflow, profiles can help you avoid the swapping overhead: llama-swap keeps every model in a profile loaded at the same time, which greatly reduces the time to first token.
llama-swap config:
profiles:
  aider:
    - qwen-coder-32B
    - QwQ

models:
  "qwen-coder-32B":
    # manually set the GPU to run on
    env:
      - "CUDA_VISIBLE_DEVICES=0"
    proxy: "http://127.0.0.1:8999"
    cmd: /path/to/llama-server ...

  "QwQ":
    # manually set the GPU to run on
    env:
      - "CUDA_VISIBLE_DEVICES=1"
    proxy: "http://127.0.0.1:9503"
    cmd: /path/to/llama-server ...
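Within a profile, models are addressed as profile:model, so requests to either one hit an already-loaded server instead of triggering a swap. Same curl check as before, just with the prefixed name:
curl http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "aider:QwQ", "messages": [{"role": "user", "content": "ping"}]}'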
aider.model.settings.yml
- name: "openai/aider:QwQ"
weak_model_name: "openai/aider:qwen-coder-32B-aider"
editor_model_name: "openai/aider:qwen-coder-32B-aider"
- name: "openai/aider:qwen-coder-32B"
editor_model_name: "openai/aider:qwen-coder-32B-aider"
Finally run aider with:
$ aider --architect \
  --no-show-model-warnings \
  --model openai/aider:QwQ \
  --editor-model openai/aider:qwen-coder-32B \
  --config aider.conf.yml \
  --model-settings-file aider.model.settings.yml \
  --openai-api-key "sk-na" \
  --openai-api-base "http://127.0.0.1:8080/v1"
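The command references an aider.conf.yml that isn't shown in this guide. As a hedged sketch, the static flags could live there instead of on the command line; keys in aider.conf.yml mirror aider's CLI option names, and the values below are simply the ones already used above:
# aider.conf.yml (hypothetical contents)
openai-api-base: "http://127.0.0.1:8080/v1"
openai-api-key: "sk-na"
model-settings-file: aider.model.settings.yml
show-model-warnings: false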
Finally, a llama-swap config for editor (VSCode) completions: a coding profile pairs a small Qwen Coder 7B for FIM (fill-in-the-middle) requests with the 32B model, which runs speculative decoding against a 1.5B draft model:
profiles:
  coding:
    - qwen-coder-32B
    - qwen-coder-3090-FIM

models:
  # ~123 tok/sec
  "qwen-coder-3090-FIM":
    env:
      - "CUDA_VISIBLE_DEVICES=GPU-6f0"
    unlisted: true
    proxy: "http://127.0.0.1:9510"
    cmd: >
      /mnt/nvme/llama-server/llama-server-latest
      --host 127.0.0.1 --port 9510
      -ngl 99 --ctx-size 8096
      -ub 1024 -b 1024
      --model /mnt/nvme/models/Qwen2.5-Coder-7B-Instruct-Q4_K_M.gguf
      --cache-reuse 256

  # on the 3090s
  # 80 tok/sec - write snake game
  # ~43 tok/sec normally
  "qwen-coder-32B":
    env:
      - "CUDA_VISIBLE_DEVICES=GPU-f10,GPU-6f0"
    proxy: "http://127.0.0.1:8999"
    cmd: >
      /mnt/nvme/llama-server/llama-server-latest
      --host 127.0.0.1 --port 8999 --flash-attn --metrics --slots
      --parallel 2
      --ctx-size 32000
      --ctx-size-draft 32000
      --model /mnt/nvme/models/Qwen2.5-Coder-32B-Instruct-Q5_K_L.gguf
      --model-draft /mnt/nvme/models/Qwen2.5-Coder-1.5B-Instruct-Q8_0.gguf
      --device-draft CUDA1
      -ngl 99 -ngld 99
      --draft-max 16 --draft-min 4 --draft-p-min 0.4
      --cache-type-k q8_0 --cache-type-v q8_0
    # 23.91 GB of CUDA0 ... think this is close enough --tensor-split 90,10
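The FIM model is marked unlisted, so it doesn't show up in /v1/models, but it can still be requested by name. A sketch of exercising it through the proxy with Qwen 2.5 Coder's FIM prompt format on the completions endpoint; the prefix/suffix text here is purely illustrative, and an editor plugin would normally send these requests for you:
curl http://127.0.0.1:8080/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "coding:qwen-coder-3090-FIM",
        "prompt": "<|fim_prefix|>def add(a, b):\n    <|fim_suffix|>\n<|fim_middle|>",
        "max_tokens": 32
      }'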