Insights.
Short technical pieces on machine learning, production AI systems, and the parts of the work worth thinking twice about.
Pick a metric that reflects what you actually want
Most ML projects that fail didn't fail at modelling. They optimised the wrong metric and succeeded, by that metric, all the way to an unusable system.
Before you fine-tune, three things to try first
Fine-tuning is expensive, introduces versioning complexity, and usually isn't solving the problem people think it's solving. Three interventions that almost always come first.
Pricing optimisation is a game of elasticity, not prediction
A demand forecast tells you what you'll sell at a given price. A pricing system needs to know what you'd have sold at a price you didn't try. These are not the same question.
When not to use machine learning
Sometimes the most valuable thing a consultancy can do is tell a client they don't need ML. Three situations where rules, humans, or SQL consistently beat a model.
Labels are the moat: why your training data matters more than your model
Switching from logistic regression to a transformer buys you three to five percentage points of accuracy. Improving label quality by twenty percent routinely buys you ten to fifteen.
Show your working: three forms of explainability that actually help
Explainability isn't one thing; it's at least three. Prediction-level explanations, global feature importance, and audit trails each answer a different question, and most ML projects confuse them.
Drift detection you can set up in an afternoon
Basic drift detection is trivially cheap and catches most production ML regressions before they become incidents. Three things to monitor, none of which require an MLOps platform.
The case for fixed-fee pilots in ML consulting
Time-and-materials pricing in ML consulting is an incentive-alignment problem dressed as a pricing model. Why fixed-fee is the only sensible way to scope a pilot.
Why most IDP projects fail before they start
The OCR engine is almost never the bottleneck. Three things silently sink intelligent document processing projects, and all of them happen before the model does anything.
Boring beats brilliant in production ML
The best production system is almost always the most boring one you can get away with. Monitoring, rollback, and retrainability beat one-percent accuracy bumps every time.
Eval sets are contracts, not afterthoughts
Most ML projects treat the eval set as something you write at the end. It's actually the only artefact that defines what 'working' means. It belongs at the start.
Cost per inference is the metric that ages projects
GenAI pilots run cheap and ship fast. Six months later the monthly bill arrives and someone has a hard conversation. Tracking cost-per-inference from day one prevents the surprise.
Forecasting at the wrong granularity is the silent killer
Most forecasting failures are not model failures. They are granularity mismatches. A forecast at the wrong level of aggregation is wrong even when the model is right.
Vision systems fail on the lighting, not the model
Most vision pilots succeed in dev and fail in deployment because the model was trained on perfect images and the world is not perfect. Engineering for variation matters more than engineering the model.
The hardest part of agentic AI is the failure modes
An agent that is right 95 percent of the time is not 95 percent useful. Agents fail in confidently wrong ways, and the engineering that matters is everything that happens when they do.
RAG is mostly retrieval
People debate which LLM to use for RAG and ignore that retrieval quality determines the answer's ceiling. The LLM cannot reconstruct information that wasn't retrieved.
Most A/B tests are underpowered (and the second-most-common mistake is stopping too early)
Sample size, statistical power, and the discipline of pre-registering what 'significant' means. Most reported A/B results don't survive replication.
Latency budgets you don't see until production
Production AI systems accumulate latency from places nobody profiled. The model is rarely the slowest part. Designing for the budget means designing the whole stack, not just the inference call.
Synthetic data: when it works and when it traps
Synthetic data has real uses and real failure modes. Treating it as a label-shortage cure-all is one of the more expensive mistakes in modern ML projects.
Where ML teams sit decides what they ship
ML teams inside Engineering ship different things from ML teams inside Research, and both ship different things from ML teams inside Product. The structural choice predicts the output more than the technical choices do.
Time-series cross-validation: rolling, not random
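Standard k-fold lets the model train on observations that come after the ones it is tested on. That leak of future information is the silent reason most 'great' forecast models fall over in production. Rolling-origin validation matches the deployment pattern: train on the past, test on what comes next.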
The hidden cost of fine-tuning
Fine-tuning looks cheap on the model card. The total cost of ownership (eval, retraining cadence, monitoring, lock-in) usually dwarfs the training run.
Prompt regression is real (and the only defence is a regression test)
Tweaking a prompt to fix one case quietly breaks three others. Without an automated regression test on a labelled set, you are shipping silent quality drops every release.
Anomaly detection: choose supervised when you can
Most 'anomaly detection' problems are actually classification problems with poorly labelled positives. Reframing them is the single biggest improvement you can make.
Migrating models to v2 without breaking your users
A new model that scores better on the eval set is not a deploy-ready model. The discipline of a safe upgrade (shadow, canary, fallback) decides whether v2 ships or rolls back.