Fully fine-tuning a 7B-parameter LLM costs thousands of dollars in A100 GPU time. Most teams hear that and give up. They shouldn't — because LoRA and QLoRA have changed the economics entirely. You can now fine-tune a Llama 3.1 8B model for under $8 on cloud GPUs.
We'll cover the theory behind LoRA and QLoRA, set up a training run on Modal (serverless GPU cloud), evaluate the result, and merge the adapter back into the base model for deployment.
Why Fine-Tune vs. Prompting?
Prompting (including few-shot and RAG) is always the right first move. But fine-tuning wins when:
- You need a specific tone or style reliably enforced without a long system prompt
- The task is domain-specific enough that base model quality is poor (legal, medical, code in a niche DSL)
- Latency is critical — a small fine-tuned 7B is faster and cheaper per token than GPT-4o
- You need the model deployed on-premise (data residency requirements)
LoRA: Low-Rank Adaptation
Instead of updating all 7 billion parameters, LoRA freezes the base model and injects tiny trainable matrices (rank-4 to rank-64) into the attention layers. The maths: a large weight matrix W ∈ ℝ^(d×k) is updated by W + BA where B ∈ ℝ^(d×r) and A ∈ ℝ^(r×k) with rank r ≪ d.
from peft import LoraConfig, get_peft_model, TaskType
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=16,                                  # rank — higher = more capacity, more memory
    lora_alpha=32,                         # scaling factor — usually 2x rank
    target_modules=['q_proj', 'v_proj'],   # inject into attention Q and V
    lora_dropout=0.05,
    bias='none',
)
model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()
# trainable params: 4,194,304 || all params: 8,030,261,248 || 0.05%
We train only 0.05% of parameters. The resulting adapter is ~16MB instead of 16GB. During inference, you load the frozen base model once and apply the adapter on top — multiple adapters, one base model.
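The "0.05%" figure is simple arithmetic. A quick back-of-envelope sketch, using the standard Llama 3.1 8B attention shapes (hidden size 4096, 32 layers, grouped-query attention) — the exact total depends on which modules you target and their projection shapes, so it won't match any particular printout exactly:

```python
# Llama 3.1 8B attention shapes: q_proj is 4096x4096, v_proj is 4096x1024 (GQA)
n_layers, r = 32, 16

def lora_params(d, k, r):
    # B is d x r, A is r x k — trainable params per adapted weight matrix
    return r * (d + k)

per_layer = lora_params(4096, 4096, r) + lora_params(4096, 1024, r)
total = per_layer * n_layers

print(f"trainable LoRA params: {total:,}")           # a few million
print(f"adapter size @ bf16:  ~{total * 2 / 1e6:.0f} MB")
print(f"fraction of 8B base:  {total / 8.03e9:.4%}")
```

Either way, the trainable fraction lands well under a tenth of a percent, and the adapter weighs megabytes rather than gigabytes.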
QLoRA: 4-bit Quantisation + LoRA
QLoRA loads the base model in 4-bit NF4 quantisation (~4GB for 8B model vs. 16GB in BFloat16) using bitsandbytes. LoRA adapters stay in BFloat16 — they're tiny, so it's fine. The result: fine-tune an 8B model on a single 24GB GPU.
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
import torch
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type='nf4',             # NormalFloat4 — best for LLM weights
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    'meta-llama/Meta-Llama-3.1-8B-Instruct',
    quantization_config=bnb_config,
    device_map='auto',
)
# Apply LoRA on top
model = get_peft_model(model, lora_config)
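The memory saving is just bytes-per-parameter times parameter count. A rough sketch (it ignores quantisation constants, the KV cache, and activation memory, so real usage is somewhat higher):

```python
params = 8.03e9  # Llama 3.1 8B, per the print_trainable_parameters output

bf16_gb = params * 2 / 1e9    # BFloat16: 2 bytes per parameter
nf4_gb  = params * 0.5 / 1e9  # NF4: 4 bits = 0.5 bytes per parameter

print(f"BF16 base model: ~{bf16_gb:.0f} GB")  # ~16 GB
print(f"NF4  base model: ~{nf4_gb:.0f} GB")   # ~4 GB
```

That ~12 GB difference is exactly what lets the whole training run fit on a single 24 GB card with room left for optimizer state and activations.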
Training on Modal (Serverless GPUs)
Modal spins up an A10G GPU (24 GB VRAM) on-demand. You pay only for the time your training script runs — typically $1–3/hour for an A10G. A 500-step fine-tune takes ~25 minutes = ~$1.25.
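Sanity-checking that estimate — taking the assumed upper end of the quoted hourly range:

```python
rate_per_hour = 3.00   # assumed upper-end A10G rate, $/hr
minutes = 25           # ~500-step fine-tune

cost = rate_per_hour * minutes / 60
print(f"~${cost:.2f} per run")  # ~$1.25
```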
import modal
app = modal.App('llm-finetune')
volume = modal.Volume.from_name('model-weights', create_if_missing=True)
image = modal.Image.debian_slim().pip_install(
    'transformers', 'peft', 'trl', 'bitsandbytes', 'datasets', 'accelerate'
)

@app.function(
    gpu=modal.gpu.A10G(),
    image=image,
    volumes={'/weights': volume},
    timeout=3600,
)
def train():
    from datasets import load_dataset
    from trl import SFTTrainer
    from transformers import TrainingArguments

    # The quantised base model + LoRA setup from the sections above must run
    # here too — Modal executes this function remotely, inside the container.
    # train.jsonl likewise needs to be available in the container (baked into
    # the image or stored on the volume).
    dataset = load_dataset('json', data_files='train.jsonl')['train']

    trainer = SFTTrainer(
        model=model,
        train_dataset=dataset,
        args=TrainingArguments(
            output_dir='/weights/adapter',
            num_train_epochs=3,
            per_device_train_batch_size=4,
            gradient_accumulation_steps=4,
            learning_rate=2e-4,
            warmup_ratio=0.05,
            lr_scheduler_type='cosine',
            fp16=False, bf16=True,
            save_steps=100,
            logging_steps=10,
        ),
        dataset_text_field='text',
        max_seq_length=2048,
    )
    trainer.train()
    model.save_pretrained('/weights/adapter')
    volume.commit()  # persist the adapter to the Modal Volume
At a glance: ~$8 full training cost · ~25 min training time · ~16 MB adapter size.
Evaluation & Merging the Adapter
Before merging, evaluate on a held-out test set. We use sentence-transformers cosine similarity between model output and reference answers as a quick proxy metric, plus human eval on 50 samples.
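A minimal sketch of that proxy metric, with cosine similarity written out explicitly. In practice the vectors would come from a sentence-transformers encoder (e.g. `all-MiniLM-L6-v2` — an assumed model choice), but the scoring logic is the same:

```python
import math

def cosine(u, v):
    # Cosine similarity between two embedding vectors
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def proxy_score(output_embs, reference_embs):
    # Mean cosine similarity between model outputs and reference answers
    sims = [cosine(o, r) for o, r in zip(output_embs, reference_embs)]
    return sum(sims) / len(sims)

# Toy vectors standing in for real sentence embeddings
outputs = [[1.0, 0.0], [0.6, 0.8]]
references = [[1.0, 0.0], [0.0, 1.0]]
print(proxy_score(outputs, references))  # 0.9
```

Treat this strictly as a regression signal between checkpoints, not an absolute quality measure — embedding similarity can score a fluent-but-wrong answer highly, which is why the human eval on 50 samples matters.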
# Merge LoRA adapter into base weights for efficient inference.
# Reload the base in BFloat16 — you can't merge into 4-bit quantised weights.
from peft import PeftModel

base_id = 'meta-llama/Meta-Llama-3.1-8B-Instruct'
base = AutoModelForCausalLM.from_pretrained(base_id, torch_dtype=torch.bfloat16)
model = PeftModel.from_pretrained(base, '/weights/adapter')
merged = model.merge_and_unload()

# Save merged model — now a standard HuggingFace model, no PEFT needed at inference
merged.save_pretrained('./merged-model', safe_serialization=True)
The merged model runs on any inference framework (vLLM, Ollama, llama.cpp) without the overhead of loading adapters separately. Deploy to a Modal inference endpoint or Replicate for production.
Summary
- LoRA updates only 0.05% of parameters — adapter is ~16MB vs. 16GB full model
- QLoRA runs the base model in 4-bit NF4 — fine-tune 8B on a single 24GB GPU
- Modal spins up GPUs on-demand — pay only for training time (~$1–8 per run)
- Use SFTTrainer + BFloat16 + cosine LR schedule for stable convergence
- Merge the adapter post-training for maximum inference performance