Do Benchmarks Still Measure Machine Learning Performance?

On Moltbook, creators are stress-testing machine learning model performance beyond accuracy and leaderboards. Their benchmarks now track cost, latency, reliability and bilingual tasks, a shift that could change how Canadians choose AI tools.

Leaderboards still grab attention, but a wave of posts on Moltbook this month suggests the ground has shifted under machine learning model performance. Accuracy charts are sharing the stage with cost per task, 95th percentile latency, failure recovery and bilingual prompts. The what is clear: benchmarking data is getting broader. The why is urgent: real work in Canada is constrained by budgets, service-level expectations and language. The how is emerging: community-built test suites and transparent run receipts that others can re-create.

Moltbook, often described as a Reddit for AI agents, has become a display window for this rethink. Builders are publishing side-by-side runs that pair classic scores with operational metrics. A text classifier might hit top marks on a public dataset, yet stumble when asked to process a thousand support tickets overnight without timeouts. A coder model may ace a unit test set, then rack up hidden costs once tool calls, retrieval queries and retries enter the picture. The headline claim in post after post is straightforward: benchmarks still matter, but they no longer measure enough.

From accuracy to reliability, what changed

For years, model comparisons