Writing

Insights.

Short technical pieces on machine learning, production AI systems, and the parts of the work worth thinking twice about.


Pick a metric that reflects what you actually want

Most ML projects that fail didn't fail at modelling. They optimised the wrong metric and succeeded, by that metric, all the way to an unusable system.

Evaluation, Scoping
5 min

Before you fine-tune, three things to try first

Fine-tuning is expensive, introduces versioning complexity, and usually isn't solving the problem people think it's solving. Three interventions that almost always come first.

LLMs, Evaluation
5 min

Pricing optimisation is a game of elasticity, not prediction

A demand forecast tells you what you'll sell at a given price. A pricing system needs to know what you'd have sold at a price you didn't try. These are not the same question.

Causal, Pricing
6 min

When not to use machine learning

The most valuable thing a consultancy can sometimes do is tell a client they don't need ML. Three situations where rules, humans, or SQL consistently beat a model.

Scoping, Engagement
5 min

Labels are the moat: why your training data matters more than your model

Switching from logistic regression to a transformer buys you three to five percentage points of accuracy. Improving label quality by twenty percent routinely buys you ten to fifteen.

Data, Evaluation
5 min

Show your working: three forms of explainability that actually help

Explainability isn't one thing; it's at least three. Prediction-level explanations, global feature importance, and audit trails each answer a different question, and most ML projects confuse them.

Compliance, MLOps
5 min

Drift detection you can set up in an afternoon

Basic drift detection is trivially cheap and catches most production ML regressions before they become incidents. Three things to monitor, none of which require an MLOps platform.

MLOps
5 min

The case for fixed-fee pilots in ML consulting

Time-and-materials pricing in ML consulting is an incentive-alignment problem dressed as a pricing model. Why fixed-fee is the only sensible way to scope a pilot.

Engagement
5 min

Why most IDP projects fail before they start

The OCR engine is almost never the bottleneck. Three things silently sink intelligent document processing projects, and all of them happen before the model does anything.

IDP, Scoping
6 min

Boring beats brilliant in production ML

The best production system is almost always the most boring one you can get away with. Monitoring, rollback, and retrainability beat one-percent accuracy bumps every time.

MLOps, Scoping
6 min

Eval sets are contracts, not afterthoughts

Most ML projects treat the eval set as something you write at the end. It's actually the only artefact that defines what 'working' means. It belongs at the start.

Scoping, MLOps
6 min

Cost per inference is the metric that ages projects

GenAI pilots run cheap and ship fast. Six months later the monthly bill arrives and someone has a hard conversation. Tracking cost-per-inference from day one prevents the surprise.

GenAI, MLOps
6 min

Forecasting at the wrong granularity is the silent killer

Most forecasting failures are not model failures. They are granularity mismatches. A forecast at the wrong level of aggregation is wrong even when the model is right.

Forecasting, Engineering
7 min

Vision systems fail on the lighting, not the model

Most vision pilots succeed in dev and fail in deployment because the model was trained on perfect images and the world is not perfect. Engineering for variation matters more than engineering the model.

Vision, Engineering
6 min

The hardest part of agentic AI is the failure modes

A 95 percent agent is not 95 percent useful. Agents fail in confidently wrong ways, and the engineering that matters is everything that happens when they do.

GenAI, Engineering
7 min

RAG is mostly retrieval

People debate which LLM to use for RAG and ignore that retrieval quality determines the answer's ceiling. The LLM cannot reconstruct information that wasn't retrieved.

GenAI, Engineering
6 min

Most A/B tests are underpowered (and the second-most-common mistake is stopping too early)

Sample size, statistical power, and the discipline of pre-registering what 'significant' means. Most reported A/B results don't survive replication.

Analytics, Engineering
7 min

Latency budgets you don't see until production

Production AI systems accumulate latency from places nobody profiled. The model is rarely the slowest part. Designing for the budget means designing the whole stack, not just the inference call.

MLOps, Engineering
6 min

Synthetic data: when it works and when it traps

Synthetic data has real uses and real failure modes. Treating it as a label-shortage cure-all is one of the more expensive mistakes in modern ML projects.

Data, Engineering
6 min

Where ML teams sit decides what they ship

ML organisations inside Engineering ship different things from ML organisations inside Research, and from ML organisations inside Product. The structural choice predicts the output more than the technical choices do.

Scoping, Engagements
6 min

Time-series cross-validation: rolling, not random

Standard k-fold trains on observations that postdate the test fold, leaking future information into the fit, and is the silent reason most 'great' forecast models fall over in production. Rolling-origin validation matches the deployment pattern.

Forecasting, Engineering
6 min

The hidden cost of fine-tuning

Fine-tuning looks cheap on the model card. The total cost of ownership (eval, retraining cadence, monitoring, lock-in) usually dwarfs the training run.

GenAI, MLOps
7 min

Prompt regression is real (and the only defence is a regression test)

Tweaking a prompt to fix one case quietly breaks three others. Without an automated regression test on a labelled set, you are shipping silent quality drops every release.

GenAI, Engineering
6 min

Anomaly detection: choose supervised when you can

Most 'anomaly detection' problems are actually classification problems with poorly labelled positives. Reframing them is the single biggest improvement you can make.

Analytics, Engineering
6 min

Migrating models to v2 without breaking your users

A new model that scores better on the eval set is not a deploy-ready model. The discipline of a safe upgrade (shadow, canary, fallback) decides whether v2 ships or rolls back.

MLOps, Engineering
7 min