I’m continuing to put my RTX 5090 through its paces. A few months ago I tried running Qwen3-Coder-30B on it and hooking it into VS Code and Copilot. This unfortunately produced limited results. The code it generated was alright, but it was very inconsistent with tool use even after some troubleshooting. It would just spit the code into the chat window rather than actually updating the files, and sometimes couldn’t even read files. Apparently the root cause of all this is that the Quen model was not configured to produce tool output in a way compatible with Copilot.
So I restarted from scratch. No more Ollama on Windows and clunky Powershell scripts. I installed a fresh Ubuntu 26 image in WSL, installed Python, CUDA, etc from scratch, and then compiled llama.cpp directly. I then installed Claude Code and configured it to use my llama.cpp server as the source of the LLM. I had to do some tweaking of the flags but I actually got it working and able to call tools properly.
The next real test was to have it actually generate a significant amount of code. Luckily, I had created a spec markdown for a MVP version of the SelfDrop app I made last week. So I pointed it to that file and said to get cracking on implementing it. It took about 12 minutes to read the spec and come up with a plan.
qwen3-coder-30b-a3b-instruct: 28.3k input, 5.1k output, 444.7k cache read, 0 cache write ($0.4921)
claude-haiku-4-5: 23.7k input, 4.8k output, 344.4k cache read, 0 cache write ($0.0822)
And then it was off to the races. The video below shows me kicking off the code implementation. It was pretty satisfying seeing my GPU usage crank up as it started generating a bunch of tokens. The usage still seemed a little low though and I wonder if I could squeeze more usage out of the card.
Implementing went pretty quick and I saw it spitting out several hundred tokens per second when actually generating code, which was pretty good. But there were also long stretches where it was seemingly doing nothing. Apparently this is usually it having to juggle context and re-process old tokens before generating new ones, so the true average tokens per second was a lot lower. The final runtime results were this:
Total cost: $1.20 (costs may be inaccurate due to usage of unknown models)
Total duration (API): 15m 39s
Total duration (wall): 30m 8s
Total code changes: 1820 lines added, 0 lines removed
Usage by model:
qwen3-coder-30b-a3b-instruct: 37.7k input, 13.5k output, 1.2m cache read, 0 cache write ($1.12)
claude-haiku-4-5: 23.7k input, 4.8k output, 344.4k cache read, 0 cache write ($0.0822)
But about 10 of those minutes were spent with it just sitting there supposedly reprocessing old tokens. This is definitely slower than when I used Grok Build to implement the same spec. That took maybe 10-15 minutes. It’s not that surprising that a frontier model would be better, but I’m still a little disappointed. I’ll keep tweaking it and maybe try some other models to see if I can improve things.