How to Setup a Local Coding Agent on macOS

(ikyle.me)

45 points | by kkm an hour ago

10 comments

reddit_clone 2 minutes ago
>64 GB
That's the rub. I have an M4 with 48 GB. I wonder if it is worth testing this out.
My past attempts (with Ollama and various LLMs) were too slow to use.
dofm 25 minutes ago
Useful stuff in here that I wish I'd seen a few days ago :-)
I am not convinced that the MTP setup for the QAT model adds very much in terms of speed on my M1 Max, but it is definitely worth experimenting with.
Fiddling about with local models has done so much for my conceptual understanding of what is going on.
FWIW and YMMV, but I also found the Gemma 4 MTP head was occasionally breaking markup in Opencode, causing the thinking to display untidily and, in some cases, ultimately missing the stop token. So I've stopped using MTP there for now.
Recent Qwen 3.6 models have developer role support, so they will occasionally surprise you with a structured multiple choice questionnaire.
- mft_ 6 minutes ago
  I found a marginal downside to Qwen3.6-35B-A3B-MTP vs. the non-MTP equivalent on an M1 Max. I’ll maybe experiment with settings further though.
ig0r0 26 minutes ago
I wrote a similar post some time ago; it just used Ollama and opencode: https://blog.kulman.sk/running-local-llm-coding-server/
c-hendricks 33 minutes ago
Not sure you really need huggingface-cli to download anything if you're just using llama.cpp. You can pass `-hf ...` and it will download the models for you. Set `LLAMA_CACHE` to change where the downloads go:
```
  LLAMA_CACHE="models" ./llama-server \
    -hf unsloth/gemma-4-31B-it-GGUF:UD-Q4_K_XL \
    ...
```
- dofm 29 minutes ago
  Yes.
  `-hfd` for the draft model.
  - c-hendricks 14 minutes ago
    Nice, was wondering if there was a flag for the draft as well.
    Not knocking huggingface-cli, I just find it's much easier for people to try this stuff out when they can just run:
```
  mise use --global github:ggml-org/llama.cpp
  LLAMA_CACHE="models" llama-server \
    -hf unsloth/gemma-4-26B-A4B-it-qat-GGUF:UD-Q4_K_XL \
    --host 0.0.0.0 \
    --port 11434 \
    ...
```
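Putting the flags from this subthread together, here is a minimal dry-run sketch that only prints a command line rather than launching anything. The `-hf`, `-hfd`, `--host`, and `--port` flags and the `LLAMA_CACHE` variable are taken from the comments above; the `build_cmd` helper and the draft repo name `example-org/example-draft-GGUF` are hypothetical placeholders, not real artifacts, so substitute a draft model that actually matches your main model.

```shell
#!/bin/sh
# Dry run: assembles and prints a llama-server command, never executes it.
set -eu

# Main model repo, copied from the comment above.
MAIN_REPO="unsloth/gemma-4-26B-A4B-it-qat-GGUF:UD-Q4_K_XL"
# HYPOTHETICAL placeholder: replace with a real draft-model repo.
DRAFT_REPO="${DRAFT_REPO:-example-org/example-draft-GGUF}"

# build_cmd MAIN [DRAFT]
# Prints the command line; adds -hfd only when a draft repo is given.
build_cmd() {
  cmd="LLAMA_CACHE=models llama-server -hf $1 --host 0.0.0.0 --port 11434"
  if [ -n "${2:-}" ]; then
    cmd="$cmd -hfd $2"
  fi
  printf '%s\n' "$cmd"
}

build_cmd "$MAIN_REPO" "$DRAFT_REPO"
```

Printing first makes it easy to compare the with-draft and without-draft variants before committing to a multi-gigabyte download; once the line looks right, it can be pasted into a terminal as-is.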
namnnumbr 24 minutes ago
oMLX (https://github.com/jundot/omlx) makes running the MLX inference server quite easy for those interested in UI-based hosting. oMLX also supports MTP or dflash drafting.
cdolan an hour ago
Is there a link to the video? It did not render when I went to the page. I'm curious about the real-time feel of this.
- dewey 19 minutes ago
  That's the direct link: https://ikyle.me/blog/2026/how-to-setup-a-local-coding-agent...