AI/LLMs 3 Jan 2026 16 min read

Fine-Tuning LLMs on a Budget: LoRA, QLoRA and Modal

A cost-effective path to custom language models using parameter-efficient fine-tuning techniques on cloud GPUs.

Suboor Khan

Full-Stack Developer & Technical Writer

Fully fine-tuning a 7B-parameter LLM costs thousands of dollars in A100 GPU time. Most teams hear that and give up. They shouldn't — because LoRA and QLoRA have changed the economics entirely. You can now fine-tune a Llama 3.1 8B model for under $8 on cloud GPUs.

We'll cover the theory behind LoRA and QLoRA, set up a training run on Modal (a serverless GPU cloud), evaluate the result, and merge the adapter back into the base model for deployment.

Why Fine-Tune vs. Prompting?

Prompting (including few-shot and RAG) is always the right first move. But fine-tuning wins when:

  • You need a specific tone or style reliably enforced without a long system prompt
  • The task is domain-specific enough that base model quality is poor (legal, medical, code in a niche DSL)
  • Latency is critical — a small fine-tuned 7B is faster and cheaper per token than GPT-4o
  • You need the model deployed on-premise (data residency requirements)

LoRA: Low-Rank Adaptation

Instead of updating all 7 billion parameters, LoRA freezes the base model and injects pairs of tiny trainable matrices (typically rank 4 to 64) into the attention layers. The maths: a large weight matrix W ∈ ℝ^(d×k) is updated as W + (α/r)BA, where B ∈ ℝ^(d×r) and A ∈ ℝ^(r×k) with rank r ≪ min(d, k), and α is a scaling factor.
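Before reaching for PEFT, the update itself can be sketched in a few lines of plain numpy. The dimensions below are loosely modelled on a single Llama-style attention projection and are purely illustrative; note that B starts at zero, so the adapted weight equals the frozen weight at step 0 — exactly how LoRA initialises.

```python
import numpy as np

# Illustrative dimensions for one projection matrix; not the full model.
d, k, r = 4096, 4096, 16
alpha = 32

rng = np.random.default_rng(0)
W = rng.standard_normal((d, k))          # frozen base weight
B = np.zeros((d, r))                     # initialised to zero, so BA = 0 at step 0
A = rng.standard_normal((r, k)) * 0.01   # small random init

# Effective weight seen at forward time: W + (alpha / r) * B @ A
W_eff = W + (alpha / r) * (B @ A)

full_params = d * k                      # 16,777,216 trainable under full fine-tuning
lora_params = d * r + r * k              # 131,072 trainable under LoRA
print(full_params, lora_params, lora_params / full_params)  # ratio ≈ 0.0078
```

Even for this single matrix, LoRA trains under 1% of the weights; across a whole model where most layers get no adapter at all, the fraction drops far lower.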

from peft import LoraConfig, get_peft_model, TaskType

lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=16,                            # rank — higher = more capacity, more memory
    lora_alpha=32,                   # scaling factor — usually 2x rank
    target_modules=['q_proj', 'v_proj'],  # inject into attention Q and V
    lora_dropout=0.05,
    bias='none',
)

model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()
# trainable params: 4,194,304 || all params: 8,030,261,248 || trainable%: 0.0522

We train only 0.05% of parameters. The resulting adapter is ~16MB instead of 16GB. During inference, you load the frozen base model once and apply the adapter on top — multiple adapters, one base model.
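The ~16MB vs. 16GB claim is easy to sanity-check with back-of-envelope arithmetic, assuming the adapter is saved in fp32 (4 bytes/param, a common default) and the base weights in BFloat16 (2 bytes/param):

```python
trainable = 4_194_304          # adapter params reported by print_trainable_parameters()
total     = 8_030_261_248      # all params in Llama 3.1 8B

adapter_mb = trainable * 4 / 1024**2   # adapter stored in fp32 (assumption)
base_gb    = total * 2 / 1024**3       # base weights in bf16

print(f'{adapter_mb:.0f} MB adapter vs {base_gb:.0f} GB base')  # → 16 MB vs 15 GB
print(f'{trainable / total:.2%} trainable')                     # → 0.05%
```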

QLoRA: 4-bit Quantisation + LoRA

QLoRA loads the base model in 4-bit NF4 quantisation (~4GB of weights for an 8B model vs. ~16GB in BFloat16) using bitsandbytes. The LoRA adapters stay in BFloat16 — they're tiny, so the overhead is negligible. The result: you can fine-tune an 8B model on a single 24GB GPU.

from transformers import AutoModelForCausalLM, BitsAndBytesConfig
import torch

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type='nf4',         # NormalFloat4 — best for LLM weights
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    'meta-llama/Meta-Llama-3.1-8B-Instruct',
    quantization_config=bnb_config,
    device_map='auto',
)

# Apply LoRA on top
model = get_peft_model(model, lora_config)
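With the QLoRA model assembled, the training run itself lives in a Modal function. The sketch below is hedged throughout: the app name, GPU type, volume name, dataset path, and hyperparameters are illustrative, and `build_qlora_model()` is a hypothetical stand-in for the 4-bit loading and `get_peft_model` steps shown above. It uses TRL's `SFTTrainer` with BFloat16 and a cosine LR schedule.

```python
import modal

# Names, GPU type, and paths below are illustrative; pin package versions in a real run.
image = modal.Image.debian_slim().pip_install(
    'transformers', 'peft', 'trl', 'bitsandbytes', 'datasets', 'accelerate'
)
app = modal.App('lora-finetune')
weights = modal.Volume.from_name('lora-weights', create_if_missing=True)

@app.function(image=image, gpu='A10G', timeout=2 * 60 * 60, volumes={'/weights': weights})
def train():
    from datasets import load_dataset
    from trl import SFTConfig, SFTTrainer

    # Hypothetical helper wrapping the 4-bit load + get_peft_model steps above.
    model = build_qlora_model()
    dataset = load_dataset('json', data_files='/weights/train.jsonl', split='train')

    cfg = SFTConfig(
        output_dir='/weights/run',
        num_train_epochs=1,
        per_device_train_batch_size=2,
        gradient_accumulation_steps=8,
        learning_rate=2e-4,
        lr_scheduler_type='cosine',   # cosine schedule for stable convergence
        bf16=True,
    )
    trainer = SFTTrainer(model=model, train_dataset=dataset, args=cfg)
    trainer.train()
    trainer.save_model('/weights/adapter')   # writes only the small adapter files
    weights.commit()                         # persist results to the Modal volume
```

Kick it off with `modal run train.py::train`; Modal provisions the GPU for the duration of the call and you pay only for that time.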

Evaluation & Merging the Adapter

Before merging, evaluate on a held-out test set. We use sentence-transformers cosine similarity between model output and reference answers as a quick proxy metric, plus human eval on 50 samples.
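The similarity metric itself is a one-liner. In the real pipeline the vectors come from sentence-transformers (e.g. `SentenceTransformer('all-MiniLM-L6-v2').encode(text)`); the toy vectors here are just stand-ins so the maths is visible:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy embeddings; real ones come from a sentence-transformers model.
reference = np.array([0.2, 0.9, 0.1])
output    = np.array([0.25, 0.85, 0.05])

score = cosine_similarity(reference, output)
print(round(score, 3))  # → 0.996
```

A high average score is a cheap signal, not proof of quality — hence the human eval on 50 samples alongside it.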

# Merge LoRA adapter into base weights for efficient inference
from peft import PeftModel

base_id = 'meta-llama/Meta-Llama-3.1-8B-Instruct'
base    = AutoModelForCausalLM.from_pretrained(base_id, torch_dtype=torch.bfloat16)
model   = PeftModel.from_pretrained(base, '/weights/adapter')
merged  = model.merge_and_unload()

# Save merged model — now a standard HuggingFace model, no PEFT needed at inference
merged.save_pretrained('./merged-model', safe_serialization=True)

The merged model runs on any inference framework (vLLM, Ollama, llama.cpp) without the overhead of loading adapters separately. Deploy to a Modal inference endpoint or Replicate for production.

Summary

  • LoRA updates only 0.05% of parameters — adapter is ~16MB vs. 16GB full model
  • QLoRA runs the base model in 4-bit NF4 — fine-tune 8B on a single 24GB GPU
  • Modal spins up GPUs on-demand — pay only for training time (~$1–8 per run)
  • Use SFTTrainer + BFloat16 + cosine LR schedule for stable convergence
  • Merge the adapter post-training for maximum inference performance
