Skip to content
Playbooks

How we'd scope it.

Each playbook below describes how we would scope an engagement we know well: the data we would insist on, the success criterion we would write down, and the things we would not promise without doing the work.

These are not case studies. We're a young firm and we'd rather show our scoping thinking than fabricate proof. Each playbook below describes how we'd approach a problem of a familiar kind, with the engineering discipline we'd actually apply. Real case studies will appear here once our first delivered engagements wrap and the clients have given us permission to write about them, anonymised.

Before any playbook

Is your problem pilotable yet?

Start from an example, then tweak

Solid foundation: clean data, named sponsor, clear baseline. Mostly ready to scope.

Nine criteria · click any answer to change it
  1. 01 · Data

    Do you have at least 6 months of relevant historical data?

  2. 02 · Data

    Is the data clean, accessible, and labelled where needed?

  3. 03 · Data

    Can you verify whether a model output is correct?

  4. 04 · Measurement

    Do you currently measure the metric you would want the model to improve?

  5. 05 · Measurement

    Is there a documented threshold for what 'good' looks like?

  6. 06 · Stakeholders

    Is there an executive sponsor accountable for the outcome?

  7. 07 · Stakeholders

    Are stakeholders aligned on what success looks like?

  8. 08 · Integration

    Is there an existing system to integrate the model into?

  9. 09 · Integration

    Will an internal team support the model after handover?

Pick an example scenario or score your own. The same nine questions we ask in a discovery call: do you have data, a baseline, a documented success criterion, an executive sponsor, and a system to integrate into. The score, the surfaced risk flags, and the next-step list are exactly what we would say back to you on a call.

Note · this is a simplified demo

In a real discovery call we would also probe regulatory constraints, blast radius (customer-facing vs internal), labelling cost, ground-truth latency, integration depth, and who actually owns the metric the model is trying to move. The scorecard above gives you the rough shape; the call gets to the things that only show up by asking the second question.

A mid-sized Australian retailer processes 30,000+ supplier invoices a year, all manually keyed into the ERP. The finance team spends sixty-plus hours a week on retyping, and ledger reconciliation surfaces errors that arrived through the keying process.

The invoices are semi-structured, the volume justifies the investment, and the errors are recurring and well-understood. This is the kind of problem IDP was built for. It doesn't need novel model research.

// How we'd scope it
  1. 01

    Pull a representative sample of 200 invoices from the last 90 days. Anything below 300 DPI or rotated incorrectly gets rejected. If the scan process is the bottleneck, that's where the first week of work goes.

  2. 02

    Run a one-day workshop with finance, AP, and procurement to write down what each field means in plain language. "Invoice number" is whichever value flows into column X of the ERP. If a supplier prints two of them, the workshop decides which one wins and writes the decision down before any modelling starts.

  3. 03

    Document the target schema in the ERP and run one record end-to-end (extracted → ERP write → reconciled) before training anything. About half of failed IDP projects fail in the integration layer, not the extraction layer, so it's the first place to look.

  4. 04

    Set a per-field confidence threshold (the model's self-reported confidence in each extraction, distinct from the per-field accuracy measured against ground truth in the success criterion below); the default starting point is 0.85, calibrated downstream as the reviewer queue settles. Fields above the threshold pass straight through. Fields below it queue for human review until the model has learned enough to clear them on its own.

  5. 05

    Write the success criterion in the SOW before any code: ≥90% straight-through accuracy on a 200-document holdout, measured by week 3. Miss the target and the production phase doesn't happen.

Pilot target

≥90% per-field accuracy on a 200-document holdout, with low-confidence routing in place. Measured at week 3.

Realistic timeline

Pilot: 3 weeks. Production handover: 4–6 weeks.

// What we won't promise without doing the work

Not every document category clears 90%. Handwritten remittance advice, fax-quality scans, and multi-page composite documents tend to land below threshold and stay there. They need human-in-the-loop review indefinitely. We'll flag those categories by week 2, not at the end.

Playbook · 02Generative AI

A government agency or professional services firm holds 5,000+ pages of internal policy and regulation. Staff lose hours each week searching for the right document and the right clause; wrong answers create compliance exposure.

The corpus is structured (versioned policies, named document owners), the answer to most questions lives somewhere in the corpus, and the stakes are high enough that a confident wrong answer is worse than an honest refusal.

// How we'd scope it
  1. 01

    Inventory the corpus before any modelling work. Confirm what's authoritative, what's been superseded, and who owns updates. A RAG system over a stale corpus is worse than no RAG system at all. It makes wrong answers look sourced.

  2. 02

    Define the refusal behaviour before defining the answering behaviour. When the grounding score falls below threshold, the system says "I don't have a confident source for this" instead of improvising. The refusal path is the first thing built; happy-path answers come after it works.

  3. 03

    Build the eval set first. Fifty questions with known correct answers and known correct source citations, written by your subject-matter experts. Without it there's no contract; with it, every modelling decision has something to be measured against.

  4. 04

    Per question we measure two separate things. Did it cite the right source? And does the answer match what the source says? Hallucination rate (cited source doesn't support the claim) is the kill metric. Anything else can be tuned.

  5. 05

    Citations appear inline in the interface, on every answer. Users build appropriate trust over time because they can verify what the system told them, on the spot, without a separate process.

Pilot target

≥85% answer correctness AND ≤2% hallucination rate on a 50-question eval set written by your SMEs. Measured at week 3.

Realistic timeline

Pilot: 3 weeks. Production handover: 6–10 weeks.

// What we won't promise without doing the work

100% correctness isn't on the table. That's a property of the humans-plus-LLM system, not the LLM in isolation. What we'll do is quantify the residual error rate and write a human-review process for the high-stakes queries where the residual matters. The aim is calibrated trust: users knowing what the system is and isn't reliable on.

A model has been live for eight months. Performance has visibly degraded; nobody is sure when it started, why, or by how much. Logs aren't structured, the data scientist who trained it has left, and retraining requires a Jupyter notebook that currently can't be located.

This is forensics first and remediation second. The instinct in the room is usually to retrain straight away. The right move is to stop the bleeding, build the infrastructure that should have been there originally, and only then think about a new model.

// How we'd scope it
  1. 01

    Audit in week one. What's deployed, what data the model is consuming, when the inputs and outputs last drifted, and from what baseline. The deliverable is a written audit document. Verbal answers get remembered selectively when the pressure's on.

  2. 02

    Stop the bleeding before retraining anything. Usually a rollback to a known-good version, or a heuristic safety net (model output capped to a sane range, manual review on the worst-affected segment) that can ship within days. The retrain can wait; the customer-visible failure usually can't.

  3. 03

    Build the missing infrastructure in parallel with the immediate fix: feature and prediction logging on a stable schema, weekly drift reports, a manual rollback procedure anyone on call can execute without a deploy. This is what should have been there in the original build.

  4. 04

    Only after the safety net is in place do we retrain on current data, with an explicit evaluation against the original training set's distribution. Sometimes the baseline you're trying to recover doesn't exist anymore. The world has shifted and the model's job has effectively changed. We'll quantify the new realistic baseline and rebuild against that.

  5. 05

    Final deliverable is a written runbook for what to do when this happens again, with named owners on your team. The aim is for you not to need us next time the same thing happens.

Pilot target

Week 1: written audit, prioritised remediation plan, rollback procedure in place. Week 4: first drift report running on a weekly cadence, operated by your team, not ours.

Realistic timeline

Audit: 1–2 weeks. Implementation: 4–10 weeks.

// What we won't promise without doing the work

The original accuracy number may not come back. Sometimes the underlying world has moved on and the metric you remember from launch reflects a problem the model isn't really solving anymore. When that happens we'll say so explicitly and rebuild against a new realistic baseline.

A different problem

Tell us what you're facing.

A 30-minute discovery call. We'll tell you whether we're the right shop for it, and if we are, we'll show you how we would scope the pilot.

Book a discovery call →