Fully fine-tuning a 7B-parameter LLM costs thousands of dollars in A100 GPU time. Most teams hear that and give up. They shouldn't — because LoRA and QLoRA have changed the economics entirely. You can now fine-tune a Llama 3.1 8B model for under $8 on cloud GPUs.
We'll cover the theory behind LoRA and QLoRA, set up a training run on Modal (serverless GPU cloud), evaluate the result, and merge the adapter back into the base model for deployment.
Why Fine-Tune vs. Prompting?
Prompting (including few-shot and RAG) is always the right first move. But fine-tuning wins when:
- You need a specific tone or style reliably enforced without a long system prompt
- The task is domain-specific enough that base model quality is poor (legal, medical, code in a niche DSL)
- Latency is critical — a small fine-tuned 7B is faster and cheaper per token than GPT-4o
- You need the model deployed on-premise (data residency requirements)
LoRA: Low-Rank Adaptation
Instead of updating all 7 billion parameters, LoRA freezes the base model and injects tiny trainable matrices (rank-4 to rank-64) into the attention layers. The maths: a large weight matrix W ∈ ℝ^(d×k) is updated by W + BA where B ∈ ℝ^(d×r) and A ∈ ℝ^(r×k) with rank r ≪ d.
from peft import LoraConfig, get_peft_model, TaskType
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=16,                                  # rank — higher = more capacity, more memory
    lora_alpha=32,                         # scaling factor — usually 2x rank
    target_modules=['q_proj', 'v_proj'],   # inject into attention Q and V
    lora_dropout=0.05,
    bias='none',
)
model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()
# trainable params: 4,194,304 || all params: 8,030,261,248 || 0.05%
We train only 0.05% of parameters. The resulting adapter is ~16MB instead of 16GB. During inference, you load the frozen base model once and apply the adapter on top — multiple adapters, one base model.
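The "0.05%" figure is simple arithmetic. A quick back-of-envelope sketch, using the standard Llama 3.1 8B attention shapes (hidden size 4096, 32 layers, grouped-query attention) — the exact total depends on which modules you target and their projection shapes, so it won't match any particular printout exactly:

```python
# Llama 3.1 8B attention shapes: q_proj is 4096x4096, v_proj is 4096x1024 (GQA)
n_layers, r = 32, 16

def lora_params(d, k, r):
    # B is d x r, A is r x k — trainable params per adapted weight matrix
    return r * (d + k)

per_layer = lora_params(4096, 4096, r) + lora_params(4096, 1024, r)
total = per_layer * n_layers

print(f"trainable LoRA params: {total:,}")           # a few million
print(f"adapter size @ bf16:  ~{total * 2 / 1e6:.0f} MB")
print(f"fraction of 8B base:  {total / 8.03e9:.4%}")
```

Either way, the trainable fraction lands well under a tenth of a percent, and the adapter weighs megabytes rather than gigabytes.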
QLoRA: 4-bit Quantisation + LoRA
QLoRA loads the base model in 4-bit NF4 quantisation (~4GB for 8B model vs. 16GB in BFloat16) using bitsandbytes. LoRA adapters stay in BFloat16 — they're tiny, so it's fine. The result: fine-tune an 8B model on a single 24GB GPU.
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
import torch
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type='nf4',             # NormalFloat4 — best for LLM weights
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    'meta-llama/Meta-Llama-3.1-8B-Instruct',
    quantization_config=bnb_config,
    device_map='auto',
)
# Apply LoRA on top
model = get_peft_model(model, lora_config)
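The memory saving is just bytes-per-parameter times parameter count. A rough sketch (it ignores quantisation constants, the KV cache, and activation memory, so real usage is somewhat higher):

```python
params = 8.03e9  # Llama 3.1 8B, per the print_trainable_parameters output

bf16_gb = params * 2 / 1e9    # BFloat16: 2 bytes per parameter
nf4_gb  = params * 0.5 / 1e9  # NF4: 4 bits = 0.5 bytes per parameter

print(f"BF16 base model: ~{bf16_gb:.0f} GB")  # ~16 GB
print(f"NF4  base model: ~{nf4_gb:.0f} GB")   # ~4 GB
```

That ~12 GB difference is exactly what lets the whole training run fit on a single 24 GB card with room left for optimizer state and activations.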
Training on Modal (Serverless GPUs)
Modal spins up an A10G GPU (24 GB VRAM) on-demand. You pay only for the time your training script runs — typically $1–3/hour for an A10G. A 500-step fine-tune takes ~25 minutes = ~$1.25.
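Sanity-checking that estimate — taking the assumed upper end of the quoted hourly range:

```python
rate_per_hour = 3.00   # assumed upper-end A10G rate, $/hr
minutes = 25           # ~500-step fine-tune

cost = rate_per_hour * minutes / 60
print(f"~${cost:.2f} per run")  # ~$1.25
```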
import modal
app = modal.App('llm-finetune')
volume = modal.Volume.from_name('model-weights', create_if_missing=True)
image = modal.Image.debian_slim().pip_install(
    'transformers', 'peft', 'trl', 'bitsandbytes', 'datasets', 'accelerate'
)

@app.function(
    gpu=modal.gpu.A10G(),
    image=image,
    volumes={'/weights': volume},
    timeout=3600,
)
def train():
    from datasets import load_dataset
    from trl import SFTTrainer
    from transformers import TrainingArguments

    # The quantised base model + LoRA setup from the sections above must run
    # here too — Modal executes this function remotely, inside the container.
    # train.jsonl likewise needs to be available in the container (baked into
    # the image or stored on the volume).
    dataset = load_dataset('json', data_files='train.jsonl')['train']

    trainer = SFTTrainer(
        model=model,
        train_dataset=dataset,
        args=TrainingArguments(
            output_dir='/weights/adapter',
            num_train_epochs=3,
            per_device_train_batch_size=4,
            gradient_accumulation_steps=4,
            learning_rate=2e-4,
            warmup_ratio=0.05,
            lr_scheduler_type='cosine',
            fp16=False, bf16=True,
            save_steps=100,
            logging_steps=10,
        ),
        dataset_text_field='text',
        max_seq_length=2048,
    )
    trainer.train()
    model.save_pretrained('/weights/adapter')
    volume.commit()  # persist the adapter to the Modal Volume
At a glance: ~$8 full training cost · ~25 min training time · ~16 MB adapter size.
Evaluation & Merging the Adapter
Before merging, evaluate on a held-out test set. We use sentence-transformers cosine similarity between model output and reference answers as a quick proxy metric, plus human eval on 50 samples.
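A minimal sketch of that proxy metric, with cosine similarity written out explicitly. In practice the vectors would come from a sentence-transformers encoder (e.g. `all-MiniLM-L6-v2` — an assumed model choice), but the scoring logic is the same:

```python
import math

def cosine(u, v):
    # Cosine similarity between two embedding vectors
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def proxy_score(output_embs, reference_embs):
    # Mean cosine similarity between model outputs and reference answers
    sims = [cosine(o, r) for o, r in zip(output_embs, reference_embs)]
    return sum(sims) / len(sims)

# Toy vectors standing in for real sentence embeddings
outputs = [[1.0, 0.0], [0.6, 0.8]]
references = [[1.0, 0.0], [0.0, 1.0]]
print(proxy_score(outputs, references))  # 0.9
```

Treat this strictly as a regression signal between checkpoints, not an absolute quality measure — embedding similarity can score a fluent-but-wrong answer highly, which is why the human eval on 50 samples matters.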
# Merge LoRA adapter into base weights for efficient inference.
# Reload the base in BFloat16 — you can't merge into 4-bit quantised weights.
from peft import PeftModel

base_id = 'meta-llama/Meta-Llama-3.1-8B-Instruct'
base = AutoModelForCausalLM.from_pretrained(base_id, torch_dtype=torch.bfloat16)
model = PeftModel.from_pretrained(base, '/weights/adapter')
merged = model.merge_and_unload()

# Save merged model — now a standard HuggingFace model, no PEFT needed at inference
merged.save_pretrained('./merged-model', safe_serialization=True)
The merged model runs on any inference framework (vLLM, Ollama, llama.cpp) without the overhead of loading adapters separately. Deploy to a Modal inference endpoint or Replicate for production.
Summary
- LoRA updates only 0.05% of parameters — adapter is ~16MB vs. 16GB full model
- QLoRA runs the base model in 4-bit NF4 — fine-tune 8B on a single 24GB GPU
- Modal spins up GPUs on-demand — pay only for training time (~$1–8 per run)
- Use SFTTrainer + BFloat16 + cosine LR schedule for stable convergence
- Merge the adapter post-training for maximum inference performance