Local LLMs software development setup

A simple way to set up a dev environment using reasoning + code models
Created 2025-04-04

A guide to llama-swap & llama-server based development setups, focused on QwQ, Qwen Coder, Aider, and VSCode.

This builds on a Debian + CUDA + PyTorch setup; everything here is executed on a fresh install of Debian 12 stable, using CUDA and an RTX 3090 (24GB VRAM).

llama.cpp setup

First, install llama.cpp; we will be using llama-server.

git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp

# Install missing curl lib
sudo apt update && sudo apt upgrade
sudo apt install libcurl4-openssl-dev 

# Manual build using CUDA
cmake -B build -DGGML_CUDA=ON -DGGML_CUDA_ENABLE_UNIFIED_MEMORY=1
cmake --build build --config Release

The environment variable GGML_CUDA_ENABLE_UNIFIED_MEMORY=1 enables unified memory on Linux. This allows swapping to system RAM instead of crashing when GPU VRAM is exhausted.
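Since it is an environment variable, it can also be set when launching llama-server, e.g. in the shell or later via the env: section of a llama-swap model entry (the model path below is a placeholder):

# enable unified memory for this llama-server instance
export GGML_CUDA_ENABLE_UNIFIED_MEMORY=1
./build/bin/llama-server --model /path/to/model.gguf -ngl 99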

This leaves us with an OpenAI-compatible HTTP server, llama-server, under build/bin.

Start it to verify the build:
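For example (model path and port are placeholders; any local GGUF will do):

./build/bin/llama-server \
    --host 127.0.0.1 --port 8999 \
    -ngl 99 --flash-attn \
    --model /path/to/Qwen2.5-Coder-32B-Instruct-Q4_K_M.gguf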

llama-swap setup

Next, we need to install llama-swap for automatic model swapping. I built mine from source:

git clone git@github.com:mostlygeek/llama-swap.git
cd llama-swap

# Install go if you don't have it already
make clean all
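The exact output path and binary name depend on the Makefile and your platform (the filename below is an assumption), so check what was actually produced and put that binary on your PATH:

# adjust the path/filename to whatever the Makefile actually produced
ls build/
sudo cp build/llama-swap-linux-amd64 /usr/local/bin/llama-swap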

Example 0: Aider - QwQ architect, Qwen Coder 32B editor, single 3090

llama-swap.config.yml

models:
  "qwen-coder-32B":
    proxy: "http://127.0.0.1:8999"
    cmd: >
      /path/to/llama-server
      --host 127.0.0.1 --port 8999 --flash-attn --slots
      --ctx-size 16000
      --cache-type-k q8_0 --cache-type-v q8_0
      -ngl 99
      --model /path/to/Qwen2.5-Coder-32B-Instruct-Q4_K_M.gguf

  "QwQ":
    proxy: "http://127.0.0.1:9503"
    cmd: >
      /path/to/llama-server
      --host 127.0.0.1 --port 9503 --flash-attn --metrics --slots
      --cache-type-k q8_0 --cache-type-v q8_0
      --ctx-size 32000
      --samplers "top_k;top_p;min_p;temperature;dry;typ_p;xtc"
      --temp 0.6 --repeat-penalty 1.1 --dry-multiplier 0.5
      --min-p 0.01 --top-k 40 --top-p 0.95
      -ngl 99
      --model /mnt/nvme/models/bartowski/Qwen_QwQ-32B-Q4_K_M.gguf

Start it with:

llama-swap --config llama-swap.config.yml
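Before wiring up aider, it is worth a quick sanity check that llama-swap is proxying requests and loading models on demand. This assumes llama-swap is listening on the same 127.0.0.1:8080 the aider config below points at:

# list the models llama-swap knows about
curl http://127.0.0.1:8080/v1/models

# request a model by its llama-swap name; the first request triggers a model load
curl http://127.0.0.1:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "qwen-coder-32B", "messages": [{"role": "user", "content": "hello"}]}'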

Then create aider.model.settings.yml

# important: model names must match llama-swap configuration names !!!
- name: "openai/QwQ"
  edit_format: diff
  extra_params:
    max_tokens: 16384
    top_p: 0.95
    top_k: 40
    presence_penalty: 0.1
    repetition_penalty: 1
    num_ctx: 16384
  use_temperature: 0.6
  reasoning_tag: think
  weak_model_name: "openai/qwen-coder-32B"
  editor_model_name: "openai/qwen-coder-32B"

- name: "openai/qwen-coder-32B"
  edit_format: diff
  extra_params:
    max_tokens: 16384
    top_p: 0.8
    top_k: 20
    repetition_penalty: 1.05
  use_temperature: 0.6
  reasoning_tag: think
  editor_edit_format: editor-diff
  editor_model_name: "openai/qwen-coder-32B"

Invoke aider with:

aider --architect \
    --no-show-model-warnings \
    --model openai/QwQ \
    --editor-model openai/qwen-coder-32B \
    --model-settings-file aider.model.settings.yml \
    --openai-api-key "sk-na" \
    --openai-api-base "http://127.0.0.1:8080/v1"

llama-swap profiles

If you find yourself frequently switching between two models in your workflow, profiles let you keep them loaded side by side and avoid the swapping overhead, which can greatly reduce the time to first token.

Example 1: Aider with QwQ architect and Qwen Coder 32B editor on dual RTX 3090

llama-swap config:

profiles:
  aider:
    - qwen-coder-32B
    - QwQ

models:
  "qwen-coder-32B":
    # manually set the GPU to run on
    env:
      - "CUDA_VISIBLE_DEVICES=0"
    proxy: "http://127.0.0.1:8999"
    cmd: /path/to/llama-server ...

  "QwQ":
    # manually set the GPU to run on
    env:
      - "CUDA_VISIBLE_DEVICES=1"
    proxy: "http://127.0.0.1:9503"
    cmd: /path/to/llama-server ...
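With a profile, a model is requested as profile:model (matching the names used in the aider settings below), and the members of the active profile stay resident instead of being swapped in and out. A quick check, again assuming llama-swap is on 127.0.0.1:8080:

# profile-qualified name: both "aider" profile members are kept loaded
curl http://127.0.0.1:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "aider:QwQ", "messages": [{"role": "user", "content": "hello"}]}'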

aider.model.settings.yml

- name: "openai/aider:QwQ"
  weak_model_name: "openai/aider:qwen-coder-32B-aider"
  editor_model_name: "openai/aider:qwen-coder-32B-aider"

- name: "openai/aider:qwen-coder-32B"
  editor_model_name: "openai/aider:qwen-coder-32B-aider"

Finally run aider with:

$ aider --architect \
    --no-show-model-warnings \
    --model openai/aider:QwQ \
    --editor-model openai/aider:qwen-coder-32B \
    --config aider.conf.yml \
    --model-settings-file aider.model.settings.yml \
    --openai-api-key "sk-na" \
    --openai-api-base "http://127.0.0.1:8080/v1"

Example 2: FIM (Qwen Coder 7B) and regular Qwen Coder 32B on dual RTX 3090

llama-swap config:

profiles:
    coding:
      - qwen-coder-32B
      - qwen-coder-3090-FIM

models:

  # ~123tok/sec
  "qwen-coder-3090-FIM":
    env:
      - "CUDA_VISIBLE_DEVICES=GPU-6f0"
    unlisted: true
    proxy: "http://127.0.0.1:9510"
    cmd: >
      /mnt/nvme/llama-server/llama-server-latest
      --host 127.0.0.1 --port 9510
      -ngl 99 --ctx-size 8096
      -ub 1024 -b 1024
      --model /mnt/nvme/models/Qwen2.5-Coder-7B-Instruct-Q4_K_M.gguf
      --cache-reuse 256

  # on the 3090s
  # 80tok/sec - write snake game
  # ~43tok/sec normally
  "qwen-coder-32B":
    env:
      - "CUDA_VISIBLE_DEVICES=GPU-f10,GPU-6f0"
    proxy: "http://127.0.0.1:8999"
    # --tensor-split 90,10 puts ~23.91 GB on CUDA0 ... think this is close enough
    cmd: >
      /mnt/nvme/llama-server/llama-server-latest
      --host 127.0.0.1 --port 8999 --flash-attn --metrics --slots
      --parallel 2
      --ctx-size 32000
      --ctx-size-draft 32000
      --model /mnt/nvme/models/Qwen2.5-Coder-32B-Instruct-Q5_K_L.gguf
      --model-draft /mnt/nvme/models/Qwen2.5-Coder-1.5B-Instruct-Q8_0.gguf
      --device-draft CUDA1
      -ngl 99 -ngld 99
      --draft-max 16 --draft-min 4 --draft-p-min 0.4
      --cache-type-k q8_0 --cache-type-v q8_0
      --tensor-split 90,10
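To sanity-check the FIM model, start the qwen-coder-3090-FIM instance (either by requesting it through llama-swap or by running the same command by hand) and hit llama-server's /infill endpoint, which is the fill-in-the-middle endpoint llama.vscode talks to. This goes straight at the port from the config above:

# fill-in-the-middle completion against the FIM instance
curl http://127.0.0.1:9510/infill \
    -H "Content-Type: application/json" \
    -d '{"input_prefix": "def add(a, b):\n    ", "input_suffix": "\n\nprint(add(1, 2))\n", "n_predict": 32}'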

llama.vscode
