Most "prompt engineering" guides teach you to say "think step by step" and call it a day. Production systems need more: deterministic output formats, accuracy guarantees, cost-controlled evaluation pipelines, and a way to iterate on prompts without breaking things.
This article covers the techniques that survive contact with real users and real scale.
Chain-of-Thought (CoT) Prompting
CoT prompting asks the model to reason through a problem before answering. Two variants:
Zero-Shot CoT
const prompt = `You are a financial analyst.
Question: Should a startup with $200K runway burn $30K/month on marketing?
Think through this step-by-step before giving your recommendation.`;
// Simple and effective for single-turn reasoning tasks
Few-Shot CoT (More Reliable)
const prompt = `Analyse business decisions step by step.
Example:
Question: A B2B SaaS company has 80% gross margin — should they raise prices?
Thinking:
1. High gross margin means pricing has room — cost isn't the constraint
2. B2B buyers are rational; price increases require value justification
3. Churn risk depends on switching costs and competitor pricing
4. A 10% price increase on 80% margin adds ~10% to bottom line
Recommendation: Yes — implement a 10-15% increase for new customers first, monitor churn.
Now answer this:
Question: A consumer app has 500K MAU but only 1% convert to paid. Should they invest in new features or conversion optimisation?
Thinking:`;
Self-Consistency for High-Stakes Decisions
Run the same prompt N times at a raised temperature, then take the majority answer. Because each sample's mistakes are largely independent, the vote washes out one-off errors — a meaningful accuracy boost on classification and extraction tasks.
async function selfConsistencyClassify(text: string, n = 5): Promise<string> {
  const results = await Promise.all(
    Array.from({ length: n }, () =>
      openai.chat.completions.create({
        model: 'gpt-4o-mini', // cheap model, run multiple times
        messages: [
          { role: 'system', content: 'Classify the sentiment. Reply with exactly one word: Positive, Negative, or Neutral.' },
          { role: 'user', content: text }
        ],
        temperature: 0.7, // high temperature for diversity
        max_tokens: 5,
      }).then(r => r.choices[0].message.content?.trim())
    )
  );
  // Majority vote
  const counts = results.reduce((acc, r) => {
    if (r) acc[r] = (acc[r] || 0) + 1;
    return acc;
  }, {} as Record<string, number>);
  return Object.entries(counts).sort((a, b) => b[1] - a[1])[0][0];
}
// 5 × gpt-4o-mini << 1 × gpt-4o in cost, but comparable accuracy
const sentiment = await selfConsistencyClassify("The product is okay but delivery was late");
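For genuinely high-stakes decisions it also helps to know how decisive the vote was. A sketch of a vote helper (the `majorityVote` name and the escalation threshold are illustrative choices, not from any library) that reports the agreement fraction so marginal calls can be routed to a stronger model or a human:

```typescript
// Sketch: majority vote that also reports agreement, so low-confidence
// classifications can be escalated instead of silently accepted.
function majorityVote(labels: (string | undefined)[]): { label: string; agreement: number } {
  const counts: Record<string, number> = {};
  let total = 0;
  for (const l of labels) {
    if (!l) continue; // skip failed or empty samples
    counts[l] = (counts[l] ?? 0) + 1;
    total++;
  }
  const [label, count] = Object.entries(counts).sort((a, b) => b[1] - a[1])[0];
  return { label, agreement: count / total };
}

// const { label, agreement } = majorityVote(results);
// if (agreement < 0.6) { /* escalate to gpt-4o or human review */ }
```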
Structured Extraction with XML Tags
When you don't have JSON mode available (or need intermediate reasoning), XML tags reliably demarcate structured sections in LLM output.
const prompt = `Extract the key information from this job posting.
Return your response in exactly this format:
<extraction>
<title>[job title]</title>
<company>[company name]</company>
<salary_min>[number or null]</salary_min>
<salary_max>[number or null]</salary_max>
<remote>[true/false/hybrid]</remote>
<required_skills>[comma-separated list]</required_skills>
</extraction>
Job posting:
${jobPostingText}`;
// Parse with regex (reliable for simple schemas)
function parseXmlExtraction(text: string) {
  const get = (tag: string) => text.match(new RegExp(`<${tag}>(.+?)<\\/${tag}>`, 's'))?.[1]?.trim();
  return {
    title: get('title'),
    company: get('company'),
    salaryMin: Number(get('salary_min')) || null,
    salaryMax: Number(get('salary_max')) || null,
    remote: get('remote'),
    requiredSkills: get('required_skills')?.split(',').map(s => s.trim()),
  };
}
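In production you may want the parse to fail loudly rather than return partial data. A stricter variant (a sketch — the validation rules shown are illustrative) that rejects responses missing required tags and normalises literal `null` salaries:

```typescript
// Sketch: stricter parse that throws when required tags are missing,
// so the caller can retry the prompt instead of propagating partial data.
function parseXmlExtractionStrict(text: string) {
  const get = (tag: string) =>
    text.match(new RegExp(`<${tag}>(.+?)<\\/${tag}>`, 's'))?.[1]?.trim();
  const title = get('title');
  const company = get('company');
  if (!title || !company) throw new Error('extraction missing required tags');
  // Treat the literal string "null" (and anything non-numeric) as null
  const num = (tag: string) => {
    const raw = get(tag);
    const n = Number(raw);
    return raw && raw !== 'null' && Number.isFinite(n) ? n : null;
  };
  return {
    title,
    company,
    salaryMin: num('salary_min'),
    salaryMax: num('salary_max'),
    remote: get('remote'),
    requiredSkills: get('required_skills')?.split(',').map(s => s.trim()) ?? [],
  };
}
```

Throwing here pairs naturally with a retry wrapper: re-prompt once on parse failure before surfacing an error.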
LLM-as-Judge Evaluation Loops
Manual evaluation doesn't scale. Use a capable model (GPT-4o or Claude) as an automated judge to evaluate outputs — ideal for catching regressions when you change prompts.
async function evaluateOutput(
  input: string,
  output: string,
  criteria: string[]
): Promise<{ scores: Record<string, number>; feedback: string }> {
  const criteriaList = criteria.map((c, i) => `${i + 1}. ${c}`).join('\n');
  const judge = await openai.chat.completions.create({
    model: 'gpt-4o',
    messages: [{
      role: 'user',
      content: `You are an impartial evaluator. Score the AI output on each criterion from 1-5.
INPUT: ${input}
AI OUTPUT: ${output}
CRITERIA:
${criteriaList}
Respond as JSON: {"scores": {"criterion_name": score}, "feedback": "brief explanation"}`
    }],
    response_format: { type: 'json_object' },
    temperature: 0,
  });
  return JSON.parse(judge.choices[0].message.content!);
}
// Run eval suite before deploying prompt changes
const results = await Promise.all(
  testCases.map(async tc => evaluateOutput(tc.input, await runPrompt(tc.input), ['accuracy', 'helpfulness', 'conciseness']))
);
// Average each criterion's score across all test cases
const avgScores: Record<string, number> = {};
for (const { scores } of results) {
  for (const [criterion, score] of Object.entries(scores)) {
    avgScores[criterion] = (avgScores[criterion] ?? 0) + score / results.length;
  }
}
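With per-criterion averages in hand, a small gate can turn the eval suite into a deploy check. A sketch (the function name, baseline values, and 0.25 tolerance are illustrative, on the judge's 1-5 scale):

```typescript
// Sketch: block a prompt deploy if any criterion's average score drops
// more than `tolerance` below the recorded baseline (1-5 scale).
function passesEvalGate(
  avgScores: Record<string, number>,
  baseline: Record<string, number>,
  tolerance = 0.25
): boolean {
  return Object.entries(baseline).every(
    ([criterion, base]) => (avgScores[criterion] ?? 0) >= base - tolerance
  );
}

// if (!passesEvalGate(avgScores, { accuracy: 4.5, helpfulness: 4.2 })) {
//   throw new Error('Prompt change regressed eval scores — aborting deploy');
// }
```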
Prompt Versioning & A/B Testing
Treat prompts like code. They should live in version control, have changelogs, and be deployed through a proper release process — not edited directly in a `.env` file.
// prompts/summarise.ts — versioned prompt templates
export const SUMMARISE_V3 = {
  version: '3.0.0',
  model: 'gpt-4o-mini',
  system: `You are a concise technical writer. Summarise content in plain English.
Rules:
- Maximum 3 sentences
- No jargon unless essential (define it if so)
- Focus on what the reader gains, not what the text contains`,
  user: (content: string) => `Summarise this:\n\n${content}`,
};
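The A/B test that follows assumes a `SUMMARISE_V4` challenger exists alongside V3. What it contains depends on the experiment you're running; a hypothetical example that trades prose for bullets:

```typescript
// Hypothetical v4 challenger — the actual change would be whatever
// you're experimenting with (here: bullet output instead of prose).
export const SUMMARISE_V4 = {
  version: '4.0.0',
  model: 'gpt-4o-mini',
  system: `You are a concise technical writer. Summarise content as bullet points.
Rules:
- Maximum 3 bullets, one sentence each
- No jargon unless essential (define it if so)
- Lead each bullet with the concrete takeaway`,
  user: (content: string) => `Summarise this:\n\n${content}`,
};
```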
// Feature flag A/B test
async function summarise(content: string, userId: string): Promise<string> {
  const template = isInExperiment(userId, 'summarise-v4') ? SUMMARISE_V4 : SUMMARISE_V3;
  const response = await openai.chat.completions.create({
    model: template.model,
    messages: [
      { role: 'system', content: template.system },
      { role: 'user', content: template.user(content) },
    ],
  });
  // Log for analysis
  await logPromptResult({ userId, templateVersion: template.version, input: content, output: response.choices[0].message.content });
  return response.choices[0].message.content!;
}
Temperature & Parameter Tuning
Low Temperature (0–0.3)
- Extraction tasks
- Classification
- Code generation
- Factual Q&A
High Temperature (0.7–1.0)
- Creative writing
- Brainstorming
- Self-consistency sampling
- Diverse suggestions
// Production defaults by task type
const TASK_CONFIGS = {
  extraction: { temperature: 0, max_tokens: 500 },
  classification: { temperature: 0, max_tokens: 10 },
  summarisation: { temperature: 0.2, max_tokens: 300 },
  generation: { temperature: 0.7, max_tokens: 1500 },
  creative: { temperature: 0.95, max_tokens: 2000 },
} as const;
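Rather than spreading these literals at every call site, a small helper can merge the task defaults with per-call overrides. A sketch (`paramsFor` is a name introduced here; it works over any config map shaped like the one above):

```typescript
// Sketch: merge a task's default parameters with per-call overrides,
// so call sites never hard-code temperature or token limits.
type GenParams = { temperature: number; max_tokens: number };

function paramsFor<T extends string>(
  configs: Record<T, GenParams>,
  task: T,
  overrides: Partial<GenParams> = {}
): GenParams {
  return { ...configs[task], ...overrides };
}

// const params = paramsFor(TASK_CONFIGS, 'summarisation', { max_tokens: 150 });
// openai.chat.completions.create({ model: 'gpt-4o-mini', messages, ...params });
```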
Summary
- CoT: zero-shot for single tasks, few-shot for high-accuracy requirements
- Self-consistency: N cheap model calls + majority vote beats 1 expensive call
- XML tags: reliable structured output when JSON mode isn't available
- LLM-as-judge: automated evaluation enables confident prompt iteration
- Version prompts in code; A/B test with feature flags; log everything
- Match temperature to task type — not one-size-fits-all