I have been building fairly complex, architecture-heavy applications for some time now, so I think I know the answer. For pure coding, no one comes close to Claude. No matter what the benchmarks say, nothing beats Claude in sheer coding skill. That said, Claude is weak at architecture and design decision-making; I find ChatGPT smarter when it comes to system design and architecture, and I have experienced this not once but four times now. For mathematical reasoning and formula design, Gemini is better than both Claude and ChatGPT. I experienced this once, when I had to design a formula for calculating scores for the different files and functions in a codebase.
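For a sense of what that kind of formula looks like, here's a toy sketch. The signal names (`size_loc`, `churn`, `coverage`) and the weights are made up for illustration, not the formula from the actual project:

```python
import math

# Toy scoring sketch: rank files/functions by a weighted combination of
# signals. All signals and weights here are illustrative placeholders.

def score(size_loc: int, churn: int, coverage: float) -> float:
    """Higher score = more attention needed.

    size_loc: lines of code; churn: recent commits touching it;
    coverage: test coverage in [0, 1].
    """
    # Log-damp raw counts so huge files don't dominate linearly.
    size_term = math.log1p(size_loc)
    churn_term = math.log1p(churn)
    risk_term = 1.0 - coverage  # less coverage -> more risk
    # Weights are arbitrary; in practice you'd tune them on your codebase.
    return 0.4 * size_term + 0.4 * churn_term + 0.2 * risk_term

files = {
    "utils.py": score(120, 3, 0.9),
    "core.py": score(2400, 25, 0.4),
}
ranked = sorted(files, key=files.get, reverse=True)  # riskiest first
```

The interesting part was choosing the damping and normalization so one signal can't swamp the rest, which is exactly where Gemini helped most.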
We've been testing it extensively and its performance is no better than prior versions, and in many cases worse than open-weight models like GLM. Gemini 3.1 Pro is significantly better.
To me, the play is: run open-weight models on a provider like BaseTen (solid performance, low price point), or pay up for Gemini 3.1 Pro if you need it.
But at their high price and low-ish quality, OpenAI models just aren't in the conversation right now without heavy incentives, e.g. via Azure.
Crazy, TBH. Curious if others find the same thing?
I recently tested Claude Code, opencode, and Codex on the same frontend feature; Codex with 5.4 at high effort was the best, but also the most expensive. For me in Europe, Claude Code with the $90 Max subscription is the best value for money.
My thinking is:
codex - best harness
opencode - best ux/dx
claude - best value for money
I can’t use it after engaging with Claude. Even simply having a conversation about some design decisions seems annoying.
So I would agree with you, it is not great.
skill issues. in practice, these models are all great. no matter how great the hammer or saw is, you need skills to be a great carpenter.
I think the other side of that coin is how much effort it takes to get a model to do what you need. Our pipeline is a sequence of very precise tasks where subtle contextual cues matter a lot, and there are large classes of related error modes.
So yes, while we can work with any of these models to get them to do what we need eventually -- e.g. with prompt tuning to their particular style, adding more examples, or breaking tasks into smaller steps, etc. -- their instruction following has a huge impact on how quickly we can move as a team.
When I say "stinks": for me, if we do three rounds of optimization and testing and a model is still performing inconsistently across a class of related traps, then using that model is going to slow us down, and I think it stinks.
In my experience, Gemini 3.1 Pro tends to work very consistently with light nudging, GLM needs 2-ish rounds of optimization, and GPT-5.4, well, it provided no improvement over prior models, would slow us down meaningfully compared to the others ... and costs too much for the effort.
So, meh, I still think it stinks, skill level considered.
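To make "a class of related traps" concrete, the check looks roughly like this. `call_model` is a hypothetical stand-in for whatever API client you use, stubbed here so the sketch runs; the trap variants are phrasings of one task that should all get the same answer:

```python
# Minimal sketch of a consistency check: run the model over variants of
# the same task and measure how often it agrees with its majority answer.

def call_model(prompt: str) -> str:
    # Stub standing in for a real API call. For illustration, pretend the
    # model mishandles the double-negative phrasing.
    return "no" if "double negative" in prompt else "yes"

TRAP_VARIANTS = [
    "Plain phrasing of the task",
    "Same task with a double negative",
    "Same task with an irrelevant distractor sentence",
]

def consistency(prompts: list[str]) -> float:
    """Fraction of variants agreeing with the majority answer."""
    answers = [call_model(p) for p in prompts]
    majority = max(set(answers), key=answers.count)
    return answers.count(majority) / len(answers)

rate = consistency(TRAP_VARIANTS)  # one variant flips, so 2/3 here
```

A "round of optimization" is then: tweak the prompt or examples, rerun this over every trap class, and see whether the rate actually moves.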
Using GPT to review work produced by Claude works extremely well.
Try OpenCode or Kimi; they're mostly all the same thing
We still have to see what Anthropic has cooked though
Yeah, I've had good experience with Kimi too. Good for the price point, for sure.
Anthropic models are still the best for me -- as long as you don't ask them to do something they don't want to do -- but they're also way too expensive for bulk pipeline processing. So I keep them to coding and coworking...