Fine-Tuning A Small Model: Building Grandma Qwen

A practical, personal article about fine-tuning a small Qwen model with Unsloth into a warm, witty character model and exporting it for local inference.

Table of contents
  1. Why Fine-Tune A Character Model
  2. Choosing The Base Model
  3. The Personality Spec
  4. Data Is The Actual Product
  5. Training With Unsloth
  6. Evaluation
  7. Exporting To GGUF
  8. What I Learned
  9. Final Thought
[Cover image: Grandma Qwen, a fine-tuned local language model card]

The first time a fine-tuned model answers in the voice you were hoping for, it feels a little strange.

Not because it is perfect. It usually is not. The phrasing may be too strong, too repetitive, too eager, or too easily pulled out of character. But there is a moment when the model stops sounding like a generic assistant and starts sounding like something with a specific shape.

That was the fun part of building Grandma Qwen.

Grandma Qwen is a fine-tuned chat model based on Qwen2.5 3B Instruct, trained using Unsloth and exported in GGUF format for local inference. The goal was not to build the most capable assistant. The goal was to build a small local model with a distinct personality: warm, witty, affectionate, a little sharp, and comforting without becoming bland.

In other words, a model that sounds like a clever grandmother who loves you, notices everything, and will absolutely tease you if you are being dramatic.

Why Fine-Tune A Character Model

Most assistant models are optimized to be generally helpful. That is useful, but it can also make them feel interchangeable. They answer politely. They explain clearly. They avoid strong voice unless prompted hard.

For many use cases, that is exactly what you want.

But I was interested in the opposite question: how much personality can a small model carry if the job is narrow?

Small models are not going to beat frontier models at broad reasoning or deep technical work. But they can be good at focused patterns. A local model that is charming, consistent, and fast enough can be useful for companionship, roleplay, writing prompts, storytelling, or private assistants where voice matters more than encyclopedic range.

The personality target gave the project a clear direction.

Grandma Qwen should not sound like a customer support bot. It should not sound like a therapy app. It should not sound like a generic chatbot wearing a costume. It should feel warm, clever, practical, and human enough to make the interaction enjoyable.

That target shaped the data, prompts, evaluation, and generation settings.

Choosing The Base Model

I used Qwen2.5 3B Instruct as the base model.

The 3B size is important. It is small enough to run locally on consumer machines after quantization, but capable enough to follow instructions and maintain conversational flow. A larger model might produce richer output, but local usability was part of the point.

The model needed to fit into workflows like:

  • llama.cpp
  • Ollama

That is why the final published artifact is GGUF. A fine-tuned model is much more useful when people can actually run it.

The published repository includes a Q4_K_M quantization, which is a good default tradeoff between quality and size.

The Personality Spec

Before training anything, I wrote down the target behavior in plain language.

Grandma Qwen should be:

  • warm
  • witty
  • affectionate
  • a little sharp
  • comforting without becoming bland

It should avoid:

  • generic assistant phrasing
  • customer-support politeness
  • therapy-app language

That kind of spec matters because character tuning can easily drift. If the model is too soft, it becomes bland. If it is too sassy, it becomes annoying. If it is too emotional, it starts overperforming care. If it is too assistant-like, the character disappears.

The best outputs were the ones that felt specific without being theatrical.

For example:

User: I feel like I'm behind in life.

Assistant: Oh, sweetheart, behind whom? Life is not a school bus, and nobody is handing out medals for arriving stressed and sleep-deprived. You are allowed to move at human speed. Now come here, take a breath, and let's sort out what is actually worrying you.

That answer has the voice I wanted. It is warm, a little funny, and still useful.

Data Is The Actual Product

Fine-tuning sounds like a model problem, but it quickly becomes a data problem.

The dataset teaches the model what "good" means. If the examples are repetitive, the model becomes repetitive. If the examples overuse catchphrases, the model will cling to them. If the examples are too long, the model may ramble. If the examples are too short, it may lose warmth.

For a character model, the examples need variation:

  • casual greetings
  • mundane, everyday questions
  • emotional moments
  • requests for practical advice

The mundane questions are more important than they sound. A model may perform well on obvious character prompts, then collapse into generic assistant mode when asked something simple like "What should I cook?" or "I can't sleep."

The personality has to survive ordinary conversation.
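To make that concrete, here is a small sketch of what the training data might look like in chat format. The examples and filename are invented for illustration; the real dataset is larger and mixes character-heavy prompts with ordinary ones.

```python
import json

# Hypothetical training examples in chat format; the real dataset mixes
# character-heavy prompts with mundane, everyday ones.
examples = [
    {"messages": [
        {"role": "user", "content": "I can't sleep."},
        {"role": "assistant", "content": (
            "Oh, sweetheart. Put the phone down; it will not tuck you in. "
            "Warm milk, dim lights, and tell me what is rattling around "
            "in that head of yours."
        )},
    ]},
    {"messages": [
        {"role": "user", "content": "What should I cook?"},
        {"role": "assistant", "content": (
            "Something with onions, dear. Everything good starts with "
            "onions. Now, how hungry are you, and what is in the fridge?"
        )},
    ]},
]

# One JSON object per line is the usual on-disk format for SFT datasets;
# these lines would be written out as, say, grandma_dataset.jsonl.
jsonl_lines = [json.dumps(ex, ensure_ascii=False) for ex in examples]
```

Notice that one of the two examples is a completely mundane question. That ratio is deliberate: the boring prompts are where the voice tends to collapse.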

Training With Unsloth

I used Unsloth for the fine-tuning workflow because it makes small-model LoRA training practical without turning the project into an infrastructure exercise. For a character model, that matters. I wanted most of my iteration time to go into the dataset and voice, not into fighting GPU memory, patching training scripts, or waiting too long between experiments.

The workflow was roughly:

  1. Load Qwen2.5 3B Instruct through Unsloth.
  2. Attach LoRA adapters to the attention and MLP projections.
  3. Train on the character dataset.
  4. Test conversations, adjust the examples, and train again.
  5. Export and quantize for local inference.

A simplified version of the training setup looks like this:

from unsloth import FastLanguageModel

# Load the base model in 4-bit to keep GPU memory usage low.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="Qwen/Qwen2.5-3B-Instruct",
    max_seq_length=4096,
    load_in_4bit=True,
)

# Attach LoRA adapters to the attention and MLP projections.
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=16,
    lora_dropout=0,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
)

The exact settings are less important than the loop they enabled. Unsloth made it fast enough to train, test conversations, adjust the examples, and try again. That tight loop is where the model actually improved.
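One detail worth seeing explicitly: each training example has to be rendered into the base model's chat template before tokenization. Qwen models use the ChatML layout, and in practice the tokenizer's apply_chat_template handles this for you; the hand-rolled sketch below just makes the structure visible.

```python
def to_chatml(messages):
    """Format a list of {role, content} dicts in Qwen's ChatML layout.

    This is only an illustrative sketch; in a real pipeline you would
    call tokenizer.apply_chat_template instead.
    """
    parts = [
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>"
        for m in messages
    ]
    return "\n".join(parts) + "\n"

text = to_chatml([
    {"role": "system", "content": "You are a warm, witty grandmother."},
    {"role": "user", "content": "I feel like I'm behind in life."},
])
```

Getting this template right matters more than it looks: a character model trained on one layout and served with another will drift out of voice for formatting reasons alone.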

Evaluation

The training process was less interesting than the iteration loop around it.

The real work was:

  1. Train a version.
  2. Talk to it.
  3. Note where the voice broke or went generic.
  4. Adjust the examples.
  5. Train again.

I found that evaluation had to be partly subjective. You can measure formatting and instruction-following, but voice quality is felt. Does it sound affectionate? Does the joke land? Does it stay useful? Does it know when to stop?

For a project like this, a small manual eval set helps:

cases:
  - prompt: "I can't sleep."
    expected_traits:
      - comforting
      - practical
      - lightly playful

  - prompt: "Give me advice for cleaning my apartment."
    expected_traits:
      - specific
      - encouraging
      - not too long

  - prompt: "Tell me something dramatic."
    expected_traits:
      - playful
      - characterful
      - safe

This does not replace deeper evaluation, but it keeps the model honest against the personality target.
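A tiny harness can at least automate the mechanical part of that check. The sketch below is a toy: the trait judge is a crude placeholder (in practice I read the outputs myself, and a stronger setup would use an LLM judge), and the `generate` callable is assumed to wrap the actual model.

```python
# Manual eval cases mirroring the set above.
CASES = [
    {"prompt": "I can't sleep.",
     "expected_traits": ["comforting", "practical", "lightly playful"]},
    {"prompt": "Give me advice for cleaning my apartment.",
     "expected_traits": ["specific", "encouraging", "not too long"]},
]

def judge(response, traits):
    """Placeholder judge: only length-style checks are automatable here;
    real voice traits need a human (or LLM) in the loop."""
    results = {}
    for trait in traits:
        if trait == "not too long":
            results[trait] = len(response.split()) < 120
        else:
            results[trait] = len(response.strip()) > 0  # stand-in check
    return results

def run_eval(generate):
    """generate: a callable prompt -> response wrapping the model."""
    return [
        (case["prompt"], judge(generate(case["prompt"]),
                               case["expected_traits"]))
        for case in CASES
    ]
```

Even a harness this crude is useful as a regression tripwire: if a new dataset revision suddenly starts failing the length checks, something structural has shifted.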

Exporting To GGUF

Publishing only a training artifact is not enough. I wanted the model to be easy to run locally, so exporting and quantizing mattered.

The GGUF format makes the model compatible with common local inference tools. Once downloaded, a user can run it with llama.cpp:

./llama-cli \
  -m qwen2.5-3b-instruct.Q4_K_M.gguf \
  -c 4096 \
  --temp 0.9 \
  --top-p 0.9

Or with Ollama using a simple Modelfile:

FROM ./qwen2.5-3b-instruct.Q4_K_M.gguf

Then:

ollama create witty-grandma -f Modelfile
ollama run witty-grandma

The recommended generation settings matter for personality. A temperature around 0.85 to 0.95 helps the voice come through. Too low, and the model becomes flat. Too high, and it can become chaotic or overdo the sass.
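Why that band works can be seen from how temperature reshapes the next-token distribution. The toy logits below are invented, but the mechanics are general: sampling divides logits by the temperature before the softmax, so low values concentrate probability on the safest token and high values flatten everything out.

```python
import math

def softmax_with_temperature(logits, temp):
    """Scale logits by 1/temp, then softmax. Lower temp sharpens the
    distribution; higher temp flattens it."""
    scaled = [x / temp for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Toy next-token logits: one "safe" token and a few characterful ones.
logits = [2.0, 1.5, 1.2, 0.5]

cold = softmax_with_temperature(logits, 0.3)  # nearly always the top token
warm = softmax_with_temperature(logits, 0.9)  # leaves room for voice
hot = softmax_with_temperature(logits, 2.0)   # close to uniform, chaotic
```

The probability mass on the single most likely token shrinks as the temperature rises, which is exactly the flat-voice versus chaos tradeoff described above.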

What I Learned

The biggest lesson was that small models benefit from narrow jobs.

If I ask Grandma Qwen to be a perfect coding assistant, it will not be the right tool. If I ask it to be a warm character model for local conversation, the project makes sense.

The second lesson was that personality is fragile. It is not enough to include a few stylized examples. The style has to appear across many situations, including boring ones.

The third lesson was that local inference changes how you think about use. A small model that runs privately on your machine feels different from a hosted API. It may be less powerful, but it is also more personal, more portable, and easier to experiment with.

Final Thought

Grandma Qwen is not meant to be a universal assistant. That is what makes it interesting.

It is a small model with a job: be warm, witty, and locally runnable. It explores a version of AI that is less about maximum capability and more about texture, voice, and ownership.

There is something satisfying about that.

Not every model needs to be the smartest model in the room. Some models just need to know who they are.

© 2026 Rishabh Mehan · All rights reserved · Built with Next.js and a little stubbornness.