Most "prompt engineering" guides teach you to say "think step by step" and call it a day. Production systems need more: deterministic output formats, accuracy guarantees, cost-controlled evaluation pipelines, and a way to iterate on prompts without breaking things.
This article covers the techniques that survive contact with real users and real scale.
Chain-of-Thought (CoT) Prompting
CoT prompting asks the model to reason through a problem before answering. Two variants:
Zero-Shot CoT
const prompt = `You are a financial analyst.
Question: Should a startup with $200K runway burn $30K/month on marketing?
Think through this step-by-step before giving your recommendation.`;
// Simple and effective for single-turn reasoning tasks
Few-Shot CoT (More Reliable)
const prompt = `Analyse business decisions step by step.
Example:
Question: A B2B SaaS company has 80% gross margin — should they raise prices?
Thinking:
1. High gross margin means pricing has room — cost isn't the constraint
2. B2B buyers are rational; price increases require value justification
3. Churn risk depends on switching costs and competitor pricing
4. A 10% price increase on 80% margin adds ~10% to bottom line
Recommendation: Yes — implement a 10-15% increase for new customers first, monitor churn.
Now answer this:
Question: A consumer app has 500K MAU but only 1% convert to paid. Should they invest in new features or conversion optimisation?
Thinking:`;
Self-Consistency for High-Stakes Decisions
Run the same prompt N times at a raised temperature, then take the majority answer. Because each sample's mistakes are largely independent, the vote washes out one-off errors — a meaningful accuracy boost on classification and extraction tasks.
async function selfConsistencyClassify(text: string, n = 5): Promise<string> {
  const results = await Promise.all(
    Array.from({ length: n }, () =>
      openai.chat.completions.create({
        model: 'gpt-4o-mini', // cheap model, run multiple times
        messages: [
          { role: 'system', content: 'Classify the sentiment. Reply with exactly one word: Positive, Negative, or Neutral.' },
          { role: 'user', content: text }
        ],
        temperature: 0.7, // high temperature for diversity
        max_tokens: 5,
      }).then(r => r.choices[0].message.content?.trim())
    )
  );
  // Majority vote
  const counts = results.reduce((acc, r) => {
    if (r) acc[r] = (acc[r] || 0) + 1;
    return acc;
  }, {} as Record<string, number>);
  return Object.entries(counts).sort((a, b) => b[1] - a[1])[0][0];
}
// 5 × gpt-4o-mini << 1 × gpt-4o in cost, but comparable accuracy
const sentiment = await selfConsistencyClassify("The product is okay but delivery was late");
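For genuinely high-stakes decisions it also helps to know how decisive the vote was. A sketch of a vote helper (the `majorityVote` name and the escalation threshold are illustrative choices, not from any library) that reports the agreement fraction so marginal calls can be routed to a stronger model or a human:

```typescript
// Sketch: majority vote that also reports agreement, so low-confidence
// classifications can be escalated instead of silently accepted.
function majorityVote(labels: (string | undefined)[]): { label: string; agreement: number } {
  const counts: Record<string, number> = {};
  let total = 0;
  for (const l of labels) {
    if (!l) continue; // skip failed or empty samples
    counts[l] = (counts[l] ?? 0) + 1;
    total++;
  }
  const [label, count] = Object.entries(counts).sort((a, b) => b[1] - a[1])[0];
  return { label, agreement: count / total };
}

// const { label, agreement } = majorityVote(results);
// if (agreement < 0.6) { /* escalate to gpt-4o or human review */ }
```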
Structured Extraction with XML Tags
When you don't have JSON mode available (or need intermediate reasoning), XML tags reliably demarcate structured sections in LLM output.
const prompt = `Extract the key information from this job posting.
Return your response in exactly this format:
<extraction>
<title>[job title]</title>
<company>[company name]</company>
<salary_min>[number or null]</salary_min>
<salary_max>[number or null]</salary_max>
<remote>[true/false/hybrid]</remote>
<required_skills>[comma-separated list]</required_skills>
</extraction>
Job posting:
${jobPostingText}`;
// Parse with regex (reliable for simple schemas)
function parseXmlExtraction(text: string) {
  const get = (tag: string) => text.match(new RegExp(`<${tag}>(.+?)<\\/${tag}>`, 's'))?.[1]?.trim();
  return {
    title: get('title'),
    company: get('company'),
    salaryMin: Number(get('salary_min')) || null,
    salaryMax: Number(get('salary_max')) || null,
    remote: get('remote'),
    requiredSkills: get('required_skills')?.split(',').map(s => s.trim()),
  };
}
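In production you may want the parse to fail loudly rather than return partial data. A stricter variant (a sketch — the validation rules shown are illustrative) that rejects responses missing required tags and normalises literal `null` salaries:

```typescript
// Sketch: stricter parse that throws when required tags are missing,
// so the caller can retry the prompt instead of propagating partial data.
function parseXmlExtractionStrict(text: string) {
  const get = (tag: string) =>
    text.match(new RegExp(`<${tag}>(.+?)<\\/${tag}>`, 's'))?.[1]?.trim();
  const title = get('title');
  const company = get('company');
  if (!title || !company) throw new Error('extraction missing required tags');
  // Treat the literal string "null" (and anything non-numeric) as null
  const num = (tag: string) => {
    const raw = get(tag);
    const n = Number(raw);
    return raw && raw !== 'null' && Number.isFinite(n) ? n : null;
  };
  return {
    title,
    company,
    salaryMin: num('salary_min'),
    salaryMax: num('salary_max'),
    remote: get('remote'),
    requiredSkills: get('required_skills')?.split(',').map(s => s.trim()) ?? [],
  };
}
```

Throwing here pairs naturally with a retry wrapper: re-prompt once on parse failure before surfacing an error.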
LLM-as-Judge Evaluation Loops
Manual evaluation doesn't scale. Use a capable model (GPT-4o or Claude) as an automated judge to evaluate outputs — ideal for catching regressions when you change prompts.
async function evaluateOutput(
  input: string,
  output: string,
  criteria: string[]
): Promise<{ scores: Record<string, number>; feedback: string }> {
  const criteriaList = criteria.map((c, i) => `${i + 1}. ${c}`).join('\n');
  const judge = await openai.chat.completions.create({
    model: 'gpt-4o',
    messages: [{
      role: 'user',
      content: `You are an impartial evaluator. Score the AI output on each criterion from 1-5.
INPUT: ${input}
AI OUTPUT: ${output}
CRITERIA:
${criteriaList}
Respond as JSON: {"scores": {"criterion_name": score}, "feedback": "brief explanation"}`
    }],
    response_format: { type: 'json_object' },
    temperature: 0,
  });
  return JSON.parse(judge.choices[0].message.content!);
}
// Run eval suite before deploying prompt changes
const results = await Promise.all(
  testCases.map(async tc => evaluateOutput(tc.input, await runPrompt(tc.input), ['accuracy', 'helpfulness', 'conciseness']))
);
// Average each criterion's score across all test cases
const avgScores: Record<string, number> = {};
for (const { scores } of results) {
  for (const [criterion, score] of Object.entries(scores)) {
    avgScores[criterion] = (avgScores[criterion] ?? 0) + score / results.length;
  }
}
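With per-criterion averages in hand, a small gate can turn the eval suite into a deploy check. A sketch (the function name, baseline values, and 0.25 tolerance are illustrative, on the judge's 1-5 scale):

```typescript
// Sketch: block a prompt deploy if any criterion's average score drops
// more than `tolerance` below the recorded baseline (1-5 scale).
function passesEvalGate(
  avgScores: Record<string, number>,
  baseline: Record<string, number>,
  tolerance = 0.25
): boolean {
  return Object.entries(baseline).every(
    ([criterion, base]) => (avgScores[criterion] ?? 0) >= base - tolerance
  );
}

// if (!passesEvalGate(avgScores, { accuracy: 4.5, helpfulness: 4.2 })) {
//   throw new Error('Prompt change regressed eval scores — aborting deploy');
// }
```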
Prompt Versioning & A/B Testing
Treat prompts like code. They should live in version control, have changelogs, and be deployed through a proper release process — not edited directly in a `.env` file.
// prompts/summarise.ts — versioned prompt templates
export const SUMMARISE_V3 = {
  version: '3.0.0',
  model: 'gpt-4o-mini',
  system: `You are a concise technical writer. Summarise content in plain English.
Rules:
- Maximum 3 sentences
- No jargon unless essential (define it if so)
- Focus on what the reader gains, not what the text contains`,
  user: (content: string) => `Summarise this:\n\n${content}`,
};
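The A/B test that follows assumes a `SUMMARISE_V4` challenger exists alongside V3. What it contains depends on the experiment you're running; a hypothetical example that trades prose for bullets:

```typescript
// Hypothetical v4 challenger — the actual change would be whatever
// you're experimenting with (here: bullet output instead of prose).
export const SUMMARISE_V4 = {
  version: '4.0.0',
  model: 'gpt-4o-mini',
  system: `You are a concise technical writer. Summarise content as bullet points.
Rules:
- Maximum 3 bullets, one sentence each
- No jargon unless essential (define it if so)
- Lead each bullet with the concrete takeaway`,
  user: (content: string) => `Summarise this:\n\n${content}`,
};
```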
// Feature flag A/B test
async function summarise(content: string, userId: string): Promise<string> {
  const template = isInExperiment(userId, 'summarise-v4') ? SUMMARISE_V4 : SUMMARISE_V3;
  const response = await openai.chat.completions.create({
    model: template.model,
    messages: [
      { role: 'system', content: template.system },
      { role: 'user', content: template.user(content) },
    ],
  });
  // Log for analysis
  await logPromptResult({ userId, templateVersion: template.version, input: content, output: response.choices[0].message.content });
  return response.choices[0].message.content!;
}
Temperature & Parameter Tuning
Low Temperature (0–0.3)
- Extraction tasks
- Classification
- Code generation
- Factual Q&A
High Temperature (0.7–1.0)
- Creative writing
- Brainstorming
- Self-consistency sampling
- Diverse suggestions
// Production defaults by task type
const TASK_CONFIGS = {
  extraction: { temperature: 0, max_tokens: 500 },
  classification: { temperature: 0, max_tokens: 10 },
  summarisation: { temperature: 0.2, max_tokens: 300 },
  generation: { temperature: 0.7, max_tokens: 1500 },
  creative: { temperature: 0.95, max_tokens: 2000 },
} as const;
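Rather than spreading these literals at every call site, a small helper can merge the task defaults with per-call overrides. A sketch (`paramsFor` is a name introduced here; it works over any config map shaped like the one above):

```typescript
// Sketch: merge a task's default parameters with per-call overrides,
// so call sites never hard-code temperature or token limits.
type GenParams = { temperature: number; max_tokens: number };

function paramsFor<T extends string>(
  configs: Record<T, GenParams>,
  task: T,
  overrides: Partial<GenParams> = {}
): GenParams {
  return { ...configs[task], ...overrides };
}

// const params = paramsFor(TASK_CONFIGS, 'summarisation', { max_tokens: 150 });
// openai.chat.completions.create({ model: 'gpt-4o-mini', messages, ...params });
```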
Summary
- CoT: zero-shot for single tasks, few-shot for high-accuracy requirements
- Self-consistency: N cheap model calls + majority vote beats 1 expensive call
- XML tags: reliable structured output when JSON mode isn't available
- LLM-as-judge: automated evaluation enables confident prompt iteration
- Version prompts in code; A/B test with feature flags; log everything
- Match temperature to task type — not one-size-fits-all