Local LLMs: Are They Actually Useful?
This morning I saw a headline for a new open-source LLM: QwQ-32B: Embracing the Power of Reinforcement Learning. I got excited: a 32B model on par with GPT-4o could potentially run locally on my 32GB M3 Mac.
The promise of running large language models (LLMs) locally, on consumer hardware like a MacBook, is appealing to many developers. Benefits include potential cost savings, increased privacy, and offline access. However, the practical usability of local LLMs for coding tasks, compared to cloud-based alternatives, remains a key question. This post details my recent experience testing local LLMs, specifically the Qwen 32B model, on a 32GB M3 MacBook Pro.
Model Distillation and Local LLMs
The ability to run increasingly complex LLMs locally is largely due to model distillation. This technique involves training a smaller "student" model using the outputs of a larger "teacher" model (e.g., GPT-4 or LLaMA 70B). This transfers knowledge, reducing the smaller model's size and computational requirements without a drastic performance loss. For a more technical explanation, see this overview on LLM Distillation.
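To make the idea concrete, here is a minimal sketch of the classic soft-label distillation loss, where the student is trained to match the teacher's softened output distribution. It uses toy logits rather than a real model pair, and it is not the exact recipe behind any particular released checkpoint:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Soft-label loss: push the student's distribution toward the teacher's."""
    # Soften both distributions with a temperature, then compare them with KL divergence.
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    log_soft_student = F.log_softmax(student_logits / temperature, dim=-1)
    # Scale by T^2 so gradient magnitudes stay comparable across temperatures.
    return F.kl_div(log_soft_student, soft_teacher, reduction="batchmean") * temperature**2

# Toy example: a batch of 4 "tokens" over a 10-word vocabulary.
teacher_logits = torch.randn(4, 10)                       # would come from the large, frozen teacher
student_logits = torch.randn(4, 10, requires_grad=True)   # stands in for the small student being trained
loss = distillation_loss(student_logits, teacher_logits)
loss.backward()
print(loss.item())
```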
The advantages of local LLMs are:
- Cost: Eliminates recurring subscription fees for services like ChatGPT Plus and Cursor (I'm paying $20 for ChatGPT, $20 for Claude, $20 for Gemini, and $40 for Cursor, ugh)
- Privacy: Code and prompts remain on the local machine, avoiding data transmission to third-party servers.
- Offline Access: Functionality is independent of internet connectivity.
Similar to the shift of speech-to-text processing from servers to mobile devices, local LLMs aim to bring AI processing power directly to the user's device.
My Experience with Local LLMs
My past attempts to use local LLMs for coding have consistently been met with limited success. While capable of basic text generation, they have generally fallen short of cloud-based alternatives (ChatGPT, Claude, Cursor) in terms of speed, context window size, and overall coding utility.
First, there are two main tools for running LLMs locally: Ollama and LM Studio.
- Ollama is geared more towards developers. My first impression was that the UI was slick and minimalistic, which was pretty cool. But once I actually got down to using it, it wasn't that convenient, since everything is done in the terminal.
- LM Studio, on the other hand, is more utilitarian: less slick, but quite useful. It has a chat interface similar to ChatGPT's, lets you select which model to use, and lets you search for and download new models.
These tools are great for testing local models. However, my real intended use case is to use a local model through Cursor. Unfortunately, Cursor makes this difficult: you have to set up a local server, expose it with ngrok, and override the OpenAI API base URL in Cursor with the ngrok-forwarded address. After some time playing around with this, I was not able to get it to work and gave up.
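For reference, both LM Studio and Ollama expose an OpenAI-compatible HTTP server, which is what a Cursor override would ultimately be talking to. Here is a minimal sketch of querying one directly from Python; the ports (1234 for LM Studio, 11434 for Ollama) are the usual defaults but worth confirming in your setup, and the model id is a placeholder for whatever your server actually lists:

```python
from openai import OpenAI

# Point the standard OpenAI client at the local server instead of api.openai.com.
# LM Studio's server usually listens on port 1234, Ollama's on 11434 (check your settings).
# The API key is ignored by local servers but the client still requires a value.
client = OpenAI(base_url="http://localhost:1234/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="qwen2.5-coder-32b-instruct",  # placeholder id: use whatever name your local server reports
    messages=[{"role": "user", "content": "Generate a single-file HTML page for a tic-tac-toe game."}],
)
print(response.choices[0].message.content)
```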
Key Challenges for Coding with Local LLMs
Many benchmarks exist for assessing LLM quality, but I think the focus needs to be different when assessing local models.
Three primary factors consistently hinder the usability of local LLMs for my coding workflow:
- Generation Speed: Slow token generation rates (often 1-4 tokens/second) significantly impact productivity, in sharp contrast with the near-instantaneous responses from cloud-based services. A rough way to measure this is sketched after this list.
- Context Window Size: While improving, context window limitations, particularly in earlier LLaMA versions, restrict the ability to work effectively with larger codebases. Models like DeepSeek R1 (see the DeepSeek R1 Model Card) are promising in this regard, but practical coding performance needs verification.
- Output Quality: Beyond speed and context, the generated code must be accurate, semantically sound, and helpful for debugging. While local models are improving, cloud-based alternatives currently offer greater consistency and reliability.
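Since the standard leaderboards don't really report it, here is the rough throughput check I have in mind: time a single chat completion against the local server and divide the completion tokens by the wall-clock time. The base URL and model id are placeholders, and the measurement includes prompt-processing time, so treat it as a ballpark figure rather than a proper benchmark:

```python
import time
from openai import OpenAI

# Placeholder endpoint: LM Studio's default local server address.
client = OpenAI(base_url="http://localhost:1234/v1", api_key="not-needed")

def tokens_per_second(model, prompt):
    """Rough throughput estimate: completion tokens divided by wall-clock time."""
    start = time.perf_counter()
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    elapsed = time.perf_counter() - start
    # OpenAI-compatible servers usually report usage; fall back to a crude word count if not.
    usage = response.usage
    n_tokens = usage.completion_tokens if usage else len(response.choices[0].message.content.split())
    return n_tokens / elapsed

# Example run against a placeholder model id.
print(tokens_per_second("qwen2.5-coder-32b-instruct", "Write a Python function that reverses a string."))
```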
The Qwen 32B Experiment
I recently tested the Qwen 32B model from Alibaba Cloud (see the Qwen Official Documentation). Qwen models offer a range of sizes and capabilities, including a Qwen-Coder variant specifically for programming tasks. The 32B model seemed suitable for my 32GB M3 MacBook Pro.
My test involved a simple task: generating an HTML page for a tic-tac-toe game. As mentioned earlier, I was not able to use the model through Cursor. Instead, I prompted it through LM Studio, simply asking it to generate the HTML in the chat.
While the setup was straightforward, the generation speed was extremely slow (approximately 1-4 tokens/second): it took roughly five minutes to complete this basic task. This level of latency makes the model impractical for interactive coding within my usual workflow. I ended up leaving the chat window, browsing Hacker News for five minutes, and coming back to see the result.
The result was decent: it produced working HTML with a playable tic-tac-toe game. But it bears repeating that the generation speed made the process far from useful.
Conclusion: Latency Remains a Major Bottleneck
Based on my experience, current local LLMs, including Qwen 32B, are not yet a viable replacement for cloud-based coding assistants for my specific needs. While improvements in model quality are evident, the latency of token generation remains a significant bottleneck, hindering productivity.
The focus in the local LLM space should arguably shift from increasing model size to drastically improving generation speed. As model output quality converges (becoming "good enough" for many tasks), latency will become the primary differentiator and the key factor determining practical usability.
To emphasize this point, imagine using an LLM to design a landing page for your website. Right now, this might take hours with a local LLM, waiting five minutes to see each result and iterating on it. Soon it might take minutes, and then seconds. If the LLM could generate the HTML in milliseconds, you could imagine generating hundreds or thousands of different options, using another LLM call to evaluate the results, and presenting only the best options to you. That would be great!
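Here is a rough sketch of what that generate-and-rank loop could look like against a local OpenAI-compatible server. The model id and prompts are placeholders, and the "judge" is just a second call to the same model asked for a 1-10 score; the point is only that this loop becomes attractive once each call costs milliseconds instead of minutes:

```python
import re
from openai import OpenAI

# Placeholder endpoint and model id, as in the earlier snippets.
client = OpenAI(base_url="http://localhost:1234/v1", api_key="not-needed")
MODEL = "qwen2.5-coder-32b-instruct"

def generate_candidates(prompt, n=5):
    """Generate n independent drafts of the same page (only sensible if tokens are fast and cheap)."""
    return [
        client.chat.completions.create(
            model=MODEL,
            messages=[{"role": "user", "content": prompt}],
            temperature=1.0,  # higher temperature for variety across drafts
        ).choices[0].message.content
        for _ in range(n)
    ]

def score(candidate):
    """Second LLM call acting as a judge; returns the first integer found in its reply."""
    reply = client.chat.completions.create(
        model=MODEL,
        messages=[{
            "role": "user",
            "content": f"Rate this landing page HTML from 1 to 10. Reply with only the number.\n\n{candidate}",
        }],
    ).choices[0].message.content
    match = re.search(r"\d+", reply)
    return int(match.group()) if match else 0

candidates = generate_candidates("Generate a single-file HTML landing page for a note-taking app.")
best = max(candidates, key=score)
print(best)
```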
To conclude, I think further investigation and benchmarking of local LLM performance, particularly focused on generation speed and context-window handling in real-world coding scenarios, are warranted. I'm interested in hearing about other developers' experiences with local LLMs for coding, and whether any latency-focused benchmarks exist at the moment.