What are LLMs?

If you’ve missed the recent AI boom, LLMs (Large Language Models) are powerful neural networks capable of writing essays, explaining code, summarizing text, and even assisting with planning and reasoning. At their core, they are highly advanced text predictors that often seem to understand context and intent.

However, not all LLMs are the same. They range from small, fast models to massive ones that run much slower. Choosing the right one to run on your personal computer is a significant challenge.

Why Run Them Locally?

While cloud-based LLMs like GPT-4 and Gemini are powerful, I grew tired of a few key issues:

  • Sending my data to third-party servers.
  • The rapidly accumulating cost of API credits.
  • The inability to tweak or customize their behavior.

Running an LLM locally ensures data privacy, eliminates per-token fees, and gives you full control. The main drawback is the hardware challenge—it often feels like a gamble whether a model will run on a given GPU, and I faced many failures before finding success.

How to Pick a Model That Works

This is where I spent a lot of time researching. My setup is a modest 8GB RTX 3050 with 32GB of RAM. I found that advice on Reddit, Hugging Face, and GitHub was often contradictory.

One of the most important things I learned is that the right model depends on the task. Some LLMs reason well through complex problems, others excel at writing or understanding code, and some are tuned to follow instructions precisely. There is no one-size-fits-all choice: you have to match a model's strengths to your specific use case, whether that's complex reasoning, coding assistance, summarization, or something else.

Here are some key lessons I learned:

  • Model size vs. VRAM: Models are often categorized by their number of parameters—like 3B, 7B, 12B, and up to 30B. More parameters usually mean a more capable model, but they also consume a lot of GPU memory.
  • Quantization: Compressing a model by storing its weights at lower numerical precision (for example 4-bit instead of 16-bit). If you want to run a 30B model on an 8GB GPU, quantization is essential: it lets larger models fit into limited VRAM at the cost of a small reduction in accuracy.
  • CPU offloading: When VRAM runs out, you can keep some of the model's layers in system RAM and run them on the CPU. This makes larger models usable, but at a significant cost in speed (see the sketch after this list).
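
To make the memory math concrete, here is a minimal sketch of the back-of-the-envelope VRAM estimate I mean, plus how layer offloading is typically expressed with the llama-cpp-python bindings. The model path, layer count, and 20% overhead factor are illustrative assumptions, not recommendations.

```python
# Rough VRAM estimate: parameters * bytes per weight, plus ~20% headroom
# for the KV cache and activations (the 20% figure is a crude assumption).
from llama_cpp import Llama


def estimate_vram_gb(params_billions: float, bits_per_weight: float) -> float:
    weights_gb = params_billions * bits_per_weight / 8  # 1B params at 8 bits ~ 1 GB
    return weights_gb * 1.2                             # headroom for cache/activations


print(estimate_vram_gb(7, 16))  # ~16.8 GB: a 7B model in fp16 won't fit in 8 GB of VRAM
print(estimate_vram_gb(7, 4))   # ~4.2 GB:  the same model 4-bit quantized does

# CPU offloading with llama-cpp-python: n_gpu_layers decides how many
# transformer layers live in VRAM; the rest stay in system RAM on the CPU.
llm = Llama(
    model_path="models/example-7b-q4_k_m.gguf",  # hypothetical local GGUF file
    n_gpu_layers=24,                             # as many layers as your VRAM allows
    n_ctx=4096,                                  # context window
)
print(llm("Q: What is quantization?\nA:", max_tokens=64)["choices"][0]["text"])
```

Lowering n_gpu_layers trades speed for memory: each layer moved to the CPU frees VRAM but slows generation.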

The Role of Unsloth

A major asset in this journey has been Unsloth, an open-source library (with a catalog of optimized models published on Hugging Face) that is transforming how people fine-tune and run LLMs locally. Unsloth's innovation lies in its engineering: it replaces standard model training components with high-performance custom GPU kernels, which makes fine-tuning faster and much less demanding on GPU memory.

Unsloth supports many popular models, including Qwen3, Llama 4, Mistral, and Gemma, and has even contributed fixes to critical bugs in these models to improve their accuracy. Its tools integrate seamlessly with Hugging Face Transformers and other popular frameworks, simplifying the process of loading, quantizing, fine-tuning, saving, and exporting models for various inference engines.
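
As a rough illustration of that workflow, here is a minimal sketch based on Unsloth's documented FastLanguageModel interface: load a pre-quantized 4-bit model, attach LoRA adapters, and export a GGUF file for llama.cpp-style engines. The model name, LoRA settings, and quantization method below are placeholder choices; check Unsloth's documentation for current options.

```python
# Minimal Unsloth workflow sketch: load in 4-bit, add LoRA adapters, export GGUF.
# Model name and hyperparameters are illustrative, not a recommendation.
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/mistral-7b-instruct-v0.3-bnb-4bit",  # pre-quantized Unsloth upload
    max_seq_length=2048,
    load_in_4bit=True,  # 4-bit loading keeps a 7B-class model within consumer VRAM
)

# Attach LoRA adapters so only a small fraction of the weights are trained.
model = FastLanguageModel.get_peft_model(
    model,
    r=16,            # LoRA rank
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)

# ... fine-tune here with your trainer of choice (e.g. TRL's SFTTrainer) ...

# Export a quantized GGUF for llama.cpp-style inference engines.
model.save_pretrained_gguf("outputs/gguf", tokenizer, quantization_method="q4_k_m")
```

Loading in 4-bit and training only the LoRA adapters is what keeps the memory footprint small enough for consumer GPUs.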

What’s particularly impressive is how Unsloth makes fine-tuning more accessible. Training times that once took over 12 hours can now be completed in just a few hours, using up to 70% less memory. The library also supports vision, speech, reinforcement learning, and multimodal models, making it a versatile tool for a wide range of AI applications.

For someone with a modest GPU, the optimized models and tools from Unsloth have been a game-changer. It’s clear why the community supports it, and if you’re serious about running or fine-tuning local LLMs, Unsloth is worth exploring.

My Model Testing Journey

I tested a variety of models until my GPU was pushed to its limits:

  • Phi-4 Mini 3B: Very fast, but too basic for complex tasks.
  • Qwen3 3B: Similar to Phi-4, good for simple text generation but lacked strong reasoning.
  • Gemma-12B: A solid all-around model, but it struggled with strict task adherence in ways that the Qwen3 30B model did not.
  • DeepSeek R1 0528 Qwen3 8B: Showed decent reasoning but was prone to hallucinations and didn’t always follow input requirements.
  • Most 7B+ models: Handled basic queries well but fell short on deep reasoning and precise instruction-following compared to my final choice.
  • Mistral-7B Instruct v0.3: A good conversationalist but not quite up to par for strict requirement adherence.
  • Qwen2.5-VL-7B: Impressive multimodal capabilities, but very resource-intensive.
  • Phi-4 Mini Reasoning: A useful tool for reasoning tasks, but limited in scope and accuracy.
  • DeepSeek R1 Distill Qwen 14B: Required heavy CPU offloading, making it painfully slow, and it struggled with consistent instruction-following.
  • Mistral-Nemo-Instruct-2407: Sharper than earlier versions but still demanding on VRAM.
  • GPT-OSS 20B: Ran without crashing but often deviated from the input requirements, leaning on its own background knowledge rather than the data I supplied.
  • Qwen3 30B A3B Instruct (2507): The clear winner. It was the heaviest model I tested, but the speed was acceptable because I prefer waiting for a complete answer over watching a slow stream. Its superior accuracy and strict adherence to requirements made it worth the resource cost. Quantized to Q3_L, it runs on my 8GB GPU (a minimal run sketch follows this list), and that balance of accuracy and performance was the deciding factor. I'm still keeping an eye on new models and plan to keep testing as the field evolves.
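
For completeness, here is roughly what running a setup like that looks like with llama-cpp-python. The GGUF filename, layer split, and thread count are placeholders: the right values depend on which quant you download and how much VRAM and CPU you have free.

```python
# Sketch: running a heavily quantized GGUF with partial GPU offload.
# Filename, n_gpu_layers, and n_threads are placeholders for your own setup.
from llama_cpp import Llama

llm = Llama(
    model_path="models/Qwen3-30B-A3B-Instruct-2507-Q3_K_L.gguf",  # hypothetical local file
    n_gpu_layers=20,   # as many layers as fit in 8 GB; the rest spill to system RAM
    n_ctx=8192,
    n_threads=8,       # CPU threads for the offloaded layers
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Summarize quantization in two sentences."}],
    max_tokens=128,
)
print(out["choices"][0]["message"]["content"])
```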

Key Takeaways

  • For consumer GPUs, quantization is practically a necessity.
  • CPU offloading is a viable option, but it significantly impacts speed.
  • Don’t just choose the largest model; test it with your specific tasks.
  • Finding reliable information requires sifting through a lot of community discussions.
  • The “best model” is subjective and depends on your needs.

The Bottom Line

Running LLMs locally is a mix of frustration and reward. At first, the jargon (GGUF, Q4_K_M, AWQ) was overwhelming, and I failed to get many models running. But once I had Qwen3 30B A3B Instruct working smoothly, it all felt worthwhile.

It’s not perfect—it’s slower than GPT-4 and memory-intensive—but it’s entirely under my control. Having a powerful AI assistant running on my own machine is incredibly satisfying.

If you’re considering trying local LLMs: start small, be prepared for some trial and error, and don’t get discouraged if your first few attempts fail. It’s part of the learning process.

What’s Next?

This experience has motivated me to continue exploring this space. I plan to build a more powerful setup to handle larger, faster models without hitting hardware limitations. I’m also eager to dive deeper into testing different LLMs and fine-tuning them for specific tasks to unlock a new level of AI customization.

Thanks to the innovations from the Unsloth team, I feel more confident about pushing the boundaries of what’s possible with local LLMs. Their work on making fine-tuning faster and more accessible is a significant contribution to the community.

I expect to be immersed in the world of local LLMs for a while—tinkering, tuning, and sharing what I learn along the way.