How well an AI model works for you is the best benchmark there is
I honestly believe that AI benchmarks are mostly useless, given that modern LLMs are nondeterministic black boxes that live in unapproachable light. In the end, we try to measure AI in the same way we measure people: by giving them an IQ test. Although IQ tests are the best we have, they still don’t directly translate to “day-to-day intelligence”: you can study for an IQ test and get a better score (if you know its format and the kind of questions it’ll have) and still not be a real genius. Likewise, LLMs can be trained to score high on benchmarks and still be suboptimal or downright bad at day-to-day usage.
Because of this reality, the best and most reliable metric available is just feelings and anecdotal experience, compiled across many different userbases and fields of expertise and finally distilled into a generally accepted consensus. Whichever model best fits your workflow and delivers sufficiently good work is the best model there is. That doesn’t mean, of course, that benchmarks and standardised tests are useless, especially when they try to establish how much better one generation of models is than another (say, how Sonnet 4.5 compares to 3.5 or 3.7). But comparing Sonnet 4.5 to GPT 5.1 is way less fruitful.
I don’t discredit the independent research firms that publish the results of many top-of-the-line models across the various existing benchmarks, but if I had to trust someone with my money on whether I should use X or Y for code dev, I would trust the person who’s using the thing every single day.
My suggestion for you, reader, is to take the time to throw everything you have at every model in existence at the same time and select the best results, iterating whenever possible and as much as your credits & subscriptions allow. If you do that for a couple of months (a quarter, maybe), I promise you’ll get a much more “intimate” sense of who can give you what, almost like getting to know your office peers: their personalities, the skills and traits of each person and so on. After a while you’ll instinctively know to whom you should listen and to whom you should delegate.
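If you’d like something a bit more tangible than memory while you build that instinct, one option is to jot down, for each task, which model’s answer you actually kept. The snippet below is only a minimal sketch of that idea: the model names, the task categories and the `tally_wins` / `favourite` helpers are all made up for illustration, and the “winner” is whatever you personally preferred, not a score.

```python
from collections import Counter, defaultdict


def tally_wins(log):
    """Count, per task category, how often each model's answer was the one you kept."""
    wins = defaultdict(Counter)
    for category, winner in log:
        wins[category][winner] += 1
    return wins


def favourite(wins, category):
    """Return the model kept most often for a category, or None if nothing was logged."""
    if category not in wins:
        return None
    return wins[category].most_common(1)[0][0]


# Hypothetical entries: (kind of task, model whose answer you ended up using).
log = [
    ("code", "model-a"),
    ("code", "model-a"),
    ("code", "model-b"),
    ("writing", "model-b"),
]

wins = tally_wins(log)
print(favourite(wins, "code"))
```

After a quarter of entries, a tally like this mostly confirms what your gut already tells you, which is rather the point.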