Local LLMs software development setup
A simple way to set up a dev env using reasoning + code models
Created: 2025-04-04
Guide for llama-swap & llama-server based development setups, focused on QwQ, Qwen Coder, Aider, and VSCode.
This builds on a Debian, CUDA, and PyTorch setup; everything below is executed on a fresh install of Debian 12 (stable), using CUDA and an RTX 3090 (24 GB VRAM).
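Before building anything, it is worth confirming that the GPU driver and CUDA toolkit are actually visible. A quick sanity check, assuming both were installed as part of that base setup:
# Driver and GPU visible?
nvidia-smi
# CUDA compiler on PATH? (needed for the CUDA build below)
nvcc --version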
First install llama.cpp - we will be using llama-server.
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
# Install missing curl lib
sudo apt update && sudo apt upgrade
sudo apt install libcurl4-openssl-dev
# Manual build using CUDA
cmake -B build -DGGML_CUDA=ON -DGGML_CUDA_ENABLE_UNIFIED_MEMORY=1
cmake --build build --config Release
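The build drops its binaries under build/bin. A quick check that the server compiled (--version prints the build info and exits):
./build/bin/llama-server --version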
The GGML_CUDA_ENABLE_UNIFIED_MEMORY=1 flag enables unified memory on Linux; this allows swapping to system RAM instead of crashing when the GPU VRAM is exhausted.
This leaves us with llama-server, an OpenAI-compatible HTTP server. Start it once to verify the build:
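This is only a sketch: the model path is a placeholder, and the flags match the ones used in the llama-swap config below (llama-swap will issue the real commands once it is set up).
./build/bin/llama-server \
  --host 127.0.0.1 --port 8999 \
  -ngl 99 --flash-attn \
  --model /path/to/Qwen2.5-Coder-32B-Instruct-Q4_K_M.gguf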
Next, we need to install llama-swap for automatic model swapping. I built mine from source:
git clone git@github.com:mostlygeek/llama-swap.git
cd llama-swap
# Install go if you don't have it already (see below)
make clean all
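If Go is missing, the packaged toolchain is the quickest route; this assumes Debian 12's golang package is recent enough for llama-swap, otherwise install a newer Go from go.dev.
sudo apt install golang-go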
Create llama-swap.config.yml:
models:
  "qwen-coder-32B":
    proxy: "http://127.0.0.1:8999"
    cmd: >
      /path/to/llama-server
      --host 127.0.0.1 --port 8999 --flash-attn --slots
      --ctx-size 16000
      --cache-type-k q8_0 --cache-type-v q8_0
      -ngl 99
      --model /path/to/Qwen2.5-Coder-32B-Instruct-Q4_K_M.gguf

  "QwQ":
    proxy: "http://127.0.0.1:9503"
    cmd: >
      /path/to/llama-server
      --host 127.0.0.1 --port 9503 --flash-attn --metrics --slots
      --cache-type-k q8_0 --cache-type-v q8_0
      --ctx-size 32000
      --samplers "top_k;top_p;min_p;temperature;dry;typ_p;xtc"
      --temp 0.6 --repeat-penalty 1.1 --dry-multiplier 0.5
      --min-p 0.01 --top-k 40 --top-p 0.95
      -ngl 99 --model /mnt/nvme/models/bartowski/Qwen_QwQ-32B-Q4_K_M.gguf
Start it with:
llama-swap --config llama-swap.config.yml
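llama-swap now answers on its own listen port (8080 here, which is what aider is pointed at below; adjust if you changed the listen address) and starts the matching llama-server on demand. A quick check that swapping works; the model field must match a name from the config:
curl http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "qwen-coder-32B", "messages": [{"role": "user", "content": "Say hello"}]}'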
Then create aider.model.settings.yml
# important: model names must match llama-swap configuration names !!!
- name: "openai/QwQ"
  edit_format: diff
  extra_params:
    max_tokens: 16384
    top_p: 0.95
    top_k: 40
    presence_penalty: 0.1
    repetition_penalty: 1
    num_ctx: 16384
  use_temperature: 0.6
  reasoning_tag: think
  weak_model_name: "openai/qwen-coder-32B"
  editor_model_name: "openai/qwen-coder-32B"

- name: "openai/qwen-coder-32B"
  edit_format: diff
  extra_params:
    max_tokens: 16384
    top_p: 0.8
    top_k: 20
    repetition_penalty: 1.05
  use_temperature: 0.6
  reasoning_tag: think
  editor_edit_format: editor-diff
  editor_model_name: "openai/qwen-coder-32B"
Invoke aider with:
aider --architect \
  --no-show-model-warnings \
  --model openai/QwQ \
  --editor-model openai/qwen-coder-32B \
  --model-settings-file aider.model.settings.yml \
  --openai-api-key "sk-na" \
  --openai-api-base "http://127.0.0.1:8080/v1"
If you find yourself frequently switching between two models in your workflow, profiles can help you avoid the swapping overhead: llama-swap keeps every model in a profile loaded at the same time, which greatly reduces the time to first token.
llama-swap config:
profiles:
  aider:
    - qwen-coder-32B
    - QwQ

models:
  "qwen-coder-32B":
    # manually set the GPU to run on
    env:
      - "CUDA_VISIBLE_DEVICES=0"
    proxy: "http://127.0.0.1:8999"
    cmd: /path/to/llama-server ...

  "QwQ":
    # manually set the GPU to run on
    env:
      - "CUDA_VISIBLE_DEVICES=1"
    proxy: "http://127.0.0.1:9503"
    cmd: /path/to/llama-server ...
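Within a profile, models are addressed as profile:model, so requests to either one hit an already-loaded server instead of triggering a swap. Same curl check as before, just with the prefixed name:
curl http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "aider:QwQ", "messages": [{"role": "user", "content": "ping"}]}'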
aider.model.settings.yml
- name: "openai/aider:QwQ"
weak_model_name: "openai/aider:qwen-coder-32B-aider"
editor_model_name: "openai/aider:qwen-coder-32B-aider"
- name: "openai/aider:qwen-coder-32B"
editor_model_name: "openai/aider:qwen-coder-32B-aider"
Finally run aider with:
$ aider --architect \
  --no-show-model-warnings \
  --model openai/aider:QwQ \
  --editor-model openai/aider:qwen-coder-32B \
  --config aider.conf.yml \
  --model-settings-file aider.model.settings.yml \
  --openai-api-key "sk-na" \
  --openai-api-base "http://127.0.0.1:8080/v1"
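The command references an aider.conf.yml that isn't shown in this guide. As a hedged sketch, the static flags could live there instead of on the command line; keys in aider.conf.yml mirror aider's CLI option names, and the values below are simply the ones already used above:
# aider.conf.yml (hypothetical contents)
openai-api-base: "http://127.0.0.1:8080/v1"
openai-api-key: "sk-na"
model-settings-file: aider.model.settings.yml
show-model-warnings: false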
Finally, a llama-swap config for editor (VSCode) completions: a coding profile pairs a small Qwen Coder 7B for FIM (fill-in-the-middle) requests with the 32B model, which runs speculative decoding against a 1.5B draft model:
profiles:
  coding:
    - qwen-coder-32B
    - qwen-coder-3090-FIM

models:
  # ~123 tok/sec
  "qwen-coder-3090-FIM":
    env:
      - "CUDA_VISIBLE_DEVICES=GPU-6f0"
    unlisted: true
    proxy: "http://127.0.0.1:9510"
    cmd: >
      /mnt/nvme/llama-server/llama-server-latest
      --host 127.0.0.1 --port 9510
      -ngl 99 --ctx-size 8096
      -ub 1024 -b 1024
      --model /mnt/nvme/models/Qwen2.5-Coder-7B-Instruct-Q4_K_M.gguf
      --cache-reuse 256

  # on the 3090s
  # 80 tok/sec - write snake game
  # ~43 tok/sec normally
  "qwen-coder-32B":
    env:
      - "CUDA_VISIBLE_DEVICES=GPU-f10,GPU-6f0"
    proxy: "http://127.0.0.1:8999"
    cmd: >
      /mnt/nvme/llama-server/llama-server-latest
      --host 127.0.0.1 --port 8999 --flash-attn --metrics --slots
      --parallel 2
      --ctx-size 32000
      --ctx-size-draft 32000
      --model /mnt/nvme/models/Qwen2.5-Coder-32B-Instruct-Q5_K_L.gguf
      --model-draft /mnt/nvme/models/Qwen2.5-Coder-1.5B-Instruct-Q8_0.gguf
      --device-draft CUDA1
      -ngl 99 -ngld 99
      --draft-max 16 --draft-min 4 --draft-p-min 0.4
      --cache-type-k q8_0 --cache-type-v q8_0
    # 23.91 GB of CUDA0 ... think this is close enough --tensor-split 90,10
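The FIM model is marked unlisted, so it doesn't show up in /v1/models, but it can still be requested by name. A sketch of exercising it through the proxy with Qwen 2.5 Coder's FIM prompt format on the completions endpoint; the prefix/suffix text here is purely illustrative, and an editor plugin would normally send these requests for you:
curl http://127.0.0.1:8080/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "coding:qwen-coder-3090-FIM",
        "prompt": "<|fim_prefix|>def add(a, b):\n    <|fim_suffix|>\n<|fim_middle|>",
        "max_tokens": 32
      }'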