It's also absolute highway robbery (or at least overly aggressive price discrimination) to charge 6x for speculative decoding, by the way. It is not that expensive and (under certain conditions, usually a very cheap drafter and a high acceptance rate) can actually decrease total cost. In any case, it's unlikely to be even a 2x cost increase, let alone 6x.
Where on earth are you getting these numbers?
Why would a SaaS company that is fighting for market dominance withhold 10x performance if they had it? Where are you getting 2.5x?
This is such bizarre magical thinking, borderline conspiratorial.
There is no reason to believe any of the big AI players are serving anything less than the best trade off of stability and speed that they can possibly muster, especially when their cost ratios are so bad.
That's also called slowing down the default experience so users have to pay more for fast mode. I think it's the first time we're seeing blatant speed ransoms in LLMs.
That's not how this works. LLM serving at scale processes multiple requests in parallel for efficiency. Reduce the parallelism and you can process individual requests faster, but the overall number of tokens processed is lower.
They have contracts with companies, and those companies won't be able to change quickly. By the time those contracts come up for renewal it will already be too late, their code having become completely unreadable by humans. Individual devs can move quickly, but companies don't.
Are you at all familiar with the architecture of systems like theirs?
The reason people don't jump to your conclusion here (and why you get downvoted) is that for anyone familiar with how this is orchestrated on the backend it's obvious that they don't need to do artificial slowdowns.
Seriously, looking at the price structure of this (6x the price for 2.5x the speed, if that's correct), it seems to target something like real-time applications with very small context. Maybe voice assistants? I guess that if you're doing development it makes more sense to parallelize over more agents than to pay that much for a modest increase in speed.
Just when you thought it was safe to use Opus 4.5 at 1/3 the cost, they go and add a 6x 'bank-breaking mode' - So now accidental bankruptcy is just one toggle away.
I’d love to hear from engineers who find that faster speed is a big unlock for them.
The deadline piece is really interesting. I suppose there’s a lot of people now who are basically limited by how fast their agents can run and on very aggressive timelines with funders breathing down their necks?
> I’d love to hear from engineers who find that faster speed is a big unlock for them.
How would it not be a big unlock? If the answers were instant I could stay focused and iterate even faster instead of having a back-and-forth.
Right now even medium requests can take 1-2 minutes and significant work can take even longer. I can usually make some progress on a code review, read more docs, or do a tiny chunk of productive work but the constant context switching back and forth every 60s is draining.
I won't be paying extra to use this, but Claude Code's feature-dev plugin is so slow that even when running two concurrent Claudes on two different tasks, I'm twiddling my thumbs some of the time. I'm not fast and I don't have tight deadlines, but nonetheless feature-dev is really slow. It would be better if it were fast enough that I wouldn't have time to switch off to a second task and could stick with the one until completion. The mental cost of juggling two tasks is high; humans aren't designed for multitasking.
3-4 parallel projects is the norm now, though I find task-parallelism still makes overlap reduction bothersome, even with worktrees. How did you work around that?
The only time I find faster speed to be a big unlock is when iterating on UI stuff. If you're talking to your agent, with hot reload and such the model can often be the bottleneck in a style tuning workflow by a lot.
it's simpler than that - making it faster means it becomes less of an asynchronous task.
current speeds are "ask it to do a thing and then you, the human, need to find something else to do for minutes (or more!) while it works". at a certain point of it being faster, you just sit there, tell it to do a thing, it does it, and you keep working on the one thing.
cerebras is just about fast enough for that already, with the downside of being more expensive and worse at coding than claude code.
it feels like absolute magic to use though.
so, depends how you price your own context switches, really.
Yes, but GPT-5.2 and Codex were widely considered slower than Opus before that. They still feel very slow, at least on high. I should give medium a try more often.
If models really do continue to get more expensive, then it's not going to make sense to let everyone at your org have an equal budget for spend. We're on track for a world where there are the equivalent of F1 drivers for AI.
One person's output doesn't scale for an entire org, no matter how expensive and good the AI is. It's either good enough that anyone can do it, or bad enough that a human needs to be in control, which caps the output at human understanding. It will always be more efficient to have every engineer boosted a little bit than a single one boosted a lot.
Given how little most of us can know about the true cost of inference for these providers (and thus the financial sustainability of their services), this is an interesting signal. Not sure how to interpret it, but it doesn’t feel like it bodes well.
Given that providers of open source models can offer Kimi K2.5 at input $0.60 and output $2.50 per million tokens, I think the cost of inference must be around that. We would still need to compare the tokens per second.
A developer can blast millions of tokens in minutes. When you have a context size of 250k that’s just 4 queries. But with tool usage and subsequent calls etc it can easily just do many millions in one request
But if you just ask a question or something it’ll take a while to spend a million tokens…
Yeah, that's what they try to do with the latest coding agents' sub-agents, which only get the context they need, etc. But at the moment it's too much work to manage contexts at that level.
I use one Claude instance at a time, roughly full-time (it writes 90% of my code). Generally making small changes, nothing weird. According to ccusage, I spend about $20 of tokens a day, a bit less than 1 MTok of output tokens a day. So the exact same workflow would be about $120 a day at the higher speed.
Could be a use for the $50 extra usage credit. It requires extra usage to be enabled.
> Fast mode usage is billed directly to extra usage, even if you have remaining usage on your plan. This means fast mode tokens do not count against your plan’s included usage and are charged at the fast mode rate from the first token.
After exceeding the increasingly shrinking session limit with Opus 4.6, I continued with the extra usage only for a few minutes and it consumed about $10 of the credit.
I can't imagine how quickly this Fast Mode goes through credit.
While it's an excellent way to make more money in the moment, I think this might become a standard no-extra-cost feature in several months (see Opus becoming way cheaper and a default model within months). Mental load management while using agents will become even more important it seems.
Why would they cut a money-making feature? In fact I am already imagining them asking for a speed ransom every time you are in a pinch; some extra context space will also become buyable. Anthropic is in a penny-pincher phase right now and they will try to milk everything. Watch them add microtransactions too.
The API price is 6x that of normal Opus, so look forward to a new $1200/mo subscription that gives you the same amount of usage if you need the extra speed.
The writing has been on the wall since day 1. They wouldn't be marketing a subscription being sold at a loss as hard as they are if the intention wasn't to lock you in and then increase the price later.
What I expect to happen is that they'll slowly decrease the usage limits on the existing subscriptions over time, and introduce new, more expensive subscription tiers with more usage. There's a reason why AI subscriptions generally don't tell you exactly what the limits are, they're intended to be "flexible" to allow for this.
It's explicitly called out as excluded in the blue info bubble they have there.
> Fast mode usage is billed directly to extra usage, even if you have remaining usage on your plan. This means fast mode tokens do not count against your plan’s included usage and are charged at the fast mode rate from the first token.
Mintlify is the best example of a product that is just nice. They don't claim to have a moat, or weird AGI vibes, or whatever. It just works and it's pretty. $10M ARR right there.
I don't think this is the case, according to the docs, right? The effort level will use fewer tokens, but the independent fast mode just somehow seems to use some higher priority infrastructure to serve your requests.
You're comparing two different things. It's not useless knowledge, it's something you need to understand.
Opus fast mode is routed to different servers with different tuning that prioritizes individual response throughput. Same model served differently. Same response, just delivered faster.
The Gemini fast mode is a different model (most likely) with different levels of thinking applied. Very different response.
Inference is run on shared hardware already, so they're not giving you the full bandwidth of the system by default. This most likely just allocates more resources to your request.
AI data centers are a whole lot of pipelines pumping data around using queues. They want those expensive, power-hungry cards near 100% utilized at all times. So they have a queue of jobs on each system ready to run, feeding into GPU memory as fast as completed jobs are read out of memory (and passed into the next stage), and they aim to have enough backlog in these queues to keep the pipeline full. You see responses in seconds, but at the data center your request was broken into jobs, passed around into queues, processed in an orderly manner and pieced back together.
With fast mode you're literally skipping the queue. An outcome of all of this is that for the rest of us the responses will become slower the more people use this 'fast' option.
I do suspect they'll also soon have a slow option for those that have Claude doing things overnight with no real care for latency of the responses. The ultimate goal is pipelines of data hitting 100% hardware utilization at all times.
Hmm, not sure I agree with you there entirely. You're right that there are queues to ensure you max out the hardware with concurrent batches to _start_ inference, but I doubt you'd want to split up the same job into multiple bits and move them around servers if you could at all avoid it.
It requires a lot of bandwidth to do that and even at 400gbit/sec it would take a good second to move even a smaller KV cache between racks even in the same DC.
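Rough numbers behind that, for anyone who wants to check (the KV-cache size below is an assumption; real sizes vary a lot with context length and model):

    # Back-of-envelope: moving one request's KV cache over a 400 Gbit/s link.
    link_gbps = 400                  # line rate in gigabits per second
    link_gb_per_sec = link_gbps / 8  # = 50 GB/s, ignoring protocol overhead
    kv_cache_gb = 40                 # assumed KV cache for a long-context request
    print(f"~{kv_cache_gb / link_gb_per_sec:.1f} s to move it")  # ~0.8 s

So even a "smaller" cache eats most of a second of wall-clock time just in transit.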
Will this mean that when cost is more important than latency, replies will now take longer?
I'm not in favor of the ad model ChatGPT proposes. But business models like these suffer from similar traps.
If it works for them, then the logical next step is to convert more to use fast mode. Which naturally means to slow things down for those that didn’t pick/pay for fast mode.
We’ve seen it with iPhones being slowed down to make the newer model seem faster.
Not saying it’ll happen. I love Claude. But these business models almost always invite dark patterns in order to move the bottom line.
This pricing is pathetic. I've been using it for two hours at what I consider "normal" interactive speed and it burned $100. Normally the $200 subscription is enough for an entire month. I guess if you are rich, you can pay 40 times as much for roughly double speed (assuming 8 hours usage a day, 5 days a week)?
Edit: I just realized that's with the currently 50% discounted price! So you can pay 80 times as much!
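The arithmetic behind the "40 times" figure, for anyone checking (the $100 over 2 hours is from above; the working hours are my assumption):

    burn_per_hour = 100 / 2          # observed fast-mode spend: $100 over 2 hours
    hours_per_month = 8 * 5 * 4.33   # assumed 8 h/day, 5 days/week
    monthly = burn_per_hour * hours_per_month
    print(f"${monthly:,.0f}/month vs a $200 sub -> {monthly / 200:.0f}x")
    # ~$8,660/month -> ~43x at the discounted rate, roughly double at full price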
Smart business model decision, since most people and organizations prefer regular progress.
In the future this might be the reason enterprise software companies win - because they can use their customer funds to pay for faster tokens and adaptations.
AI trained on real-time data will always and only get dumber over time. Reversion to the mean of human IQ, just like every web forum ever. Eternal September.
That's why gen 1-3 AI felt so smart. It was trained on the best curated human knowledge available. Now that that's done, it's just humanity's brain dumps left to learn from.
Two ways out: self-referential learning from gen 1-3 AIs, or paying experts to improve datasets instead of training on general human data. Inputs and outputs.
I redeemed my 50 USD credit to give it a go. In literally less than 10 minutes I spent 10 USD. Insane. I love Claude Code, but this pricing is madness.
LLM programming is very easy. First you have to prompt it to not make mistakes. Then you have to tell it to go fast. Software engineering is over bro, all humans will be replaced in 6 days bro
This is gold for Anthropic's profitability. The Claude Code addicts can double their spend to plow through tokens because they need to finish something by a deadline. OpenAI will have a similar product within a week but will only charge 3x the normal rate.
This angle might also be Nvidia's reason for buying Groq. People will pay a premium for faster tokens.
I switched back to 4.5 Sonnet or Opus yesterday since 4.6 was so slow and often “over thinking” or “over analyzing” the problem space. Tasks which actually took under a minute in Sonnet 4.5 were still running after 5 minutes in 4.6 (yeah, I had them race for a few tasks).
Some of this could be system overload, I suppose.
Edit ~/.claude/settings.json and add "effortLevel": "medium". Alternatively, you can put it in .claude/settings.json in a project if you want to try it out first.
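If you'd rather script it than hand-edit JSON, a minimal sketch (it assumes the "effortLevel" key and paths from the comment above are correct, and that the file, if present, is plain JSON):

    import json, os

    # Merge "effortLevel": "medium" into the user-level Claude Code settings.
    path = os.path.expanduser("~/.claude/settings.json")  # or .claude/settings.json in a project
    os.makedirs(os.path.dirname(path), exist_ok=True)
    settings = {}
    if os.path.exists(path):
        with open(path) as f:
            settings = json.load(f)
    settings["effortLevel"] = "medium"
    with open(path, "w") as f:
        json.dump(settings, f, indent=2)

Back the file up first if you already have other settings in it.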
They recommend this in the announcement[1], but the way they suggest doing it is via a bogus /effort command that doesn't exist. See [2] for full details about thinking effort. It also recommends a bogus way to change effort by using the arrow keys when selecting a model, so don't use that either.
[1]: https://www.anthropic.com/news/claude-opus-4-6
[2]: https://code.claude.com/docs/en/model-config#adjust-effort-l...
You can do it via /model and pressing left and right though
That's not a thing, at least not in my installation of Claude Code.
It works for me! (Edited link since the original had my laptop's serial number in it: https://screen.studio/share/3CEvdyji)
Claude Code v2.1.37
EU region, Claude Max 20x plan
Mac -- Tahoe 26.2
Good to know it works for some people! I think it's another issue where they focus too much on MacOS and neglect the Windows and Linux releases. I use WSL for Claude Code since the Windows release is far worse and currently unusable due to several neglected issues.
Hoping to see several missing features land in the Linux release soon.
I'm also feeling weak and the pull of getting a Mac is stronger. But I also really don't like the neglect around being cross-platform. It's "cross-platform" except a bunch of crap doesn't work outside MacOS. This applies to Claude Code, Claude Desktop (MacOS and Windows only - no Linux or WSL support), Claude Cowork (MacOS only). OpenAI does the same crap - the new Codex desktop app is MacOS only. And now I'm ranting.
what? Their documentation is hallucinated?
Yep, and their documentation AI assistant will egregiously hallucinate whatever it thinks you want to hear, then repeat itself in a loop when you tell it that it's wrong.
Yesterday I asked a question about a Claude Code setting inside Claude Code, don't recall which, and their builtin documentation skill—something like that—ended up doing a web search and found a wrong answer on a third party site. Later I went to their documentation site and it was right there in the docs. Wonder why they can't bundle an AI-friendly version of their own docs (can't be more than a few hundred KBs compressed?) inside their 174MB executable.
It's insane that they concluded the builtin introspection skill for Claude documentation should do a web search instead of simply packing the correct documentation in local files. I had the same experience as you, wasting tokens and my time because their architecture decision doesn't work in practice.
They used to? I have a distinct memory of it doing exactly that a few months ago. Maybe it got dropped in the mad dash that passes for CC sprint cycles
Pathetic how they have no support for modifying sampling settings, or even a "logit_bias" so I can ban my claude from using the EM dash (and regular dash), semicolons, or "not". Also will upweight things like exclamation points
Clearly those whose job it is to "monitor" folks use this as their "tell" if someone AI generated something. That's why every major LLM has this particular slop profile. It's infuriating.
I wrote a long winded rant about this bullshit
https://gist.github.com/Hellisotherpeople/71ba712f9f899adcb0...
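For contrast, this is roughly the knob being asked for, in the shape the OpenAI-style Chat Completions API exposes it (Anthropic's Messages API has no equivalent parameter today; token IDs are tokenizer-specific, and a banned string only stays banned if every tokenization of it is covered, so treat this as a best-effort sketch):

    import tiktoken
    from openai import OpenAI

    enc = tiktoken.get_encoding("o200k_base")   # tokenizer family used by recent OpenAI models
    banned = {}
    for s in ["\u2014", ";", " not"]:           # em dash, semicolon, " not"
        for tok in enc.encode(s):
            banned[str(tok)] = -100             # -100 effectively forbids a token

    client = OpenAI()
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": "Explain the tradeoff in one paragraph."}],
        logit_bias=banned,
    )
    print(resp.choices[0].message.content)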
They mentioned in the release notes that if it's over-thinking you should decrease the reasoning effort.
Yeah, nothing is sped up; their initial deployment of 4.6 is so unbearably slow that they are just now offering you the opportunity to pay more for the same experience as 4.5. What's the word for that?
Enslopification.
"Back in my day you had to wait 3 minutes to generate 10k lines of code."
Honestly, OpenAI isn't worth it. I cancelled my OpenAI plan (and hopefully will delete my account soon once I export all my data) because of philosophical differences. They shared they are evaluating a model where they can get a % of your business in exchange for letting you use code generated by their AI models. That, and the possible advertising angle. But that's not even the worst of it: I asked ChatGPT to fairly evaluate the risks of a model where one for-profit corporation holds your entire intimate personal details and uses them for advertising, and it staunchly defended OpenAI. That was the nail in the coffin for me.
Contrast this with Anthropic, which actually asks if you want their AI to remember details about you and has a lot of toggles around privacy. I don't care if they make money from extra tokens as long as they don't go the OpenAI route.
> They shared they are evaluating a model where they can get a % of your business in exchange for letting you use code generated by their AI models.
That's a gross mischaracterization of what the CFO said. She basically just said the pricing space is huge, and they've even explored things like royalty models.
I'm guessing you just saw a headline and read nothing into it.
Isn't that precisely what "evaluating a model where they can get a % of your business in exchange for letting you use code generated by their AI models" means?
If they find that this business model is the most profitable for OpenAI, and that they can somehow release models better than any competitor's, wouldn't they say they want royalties? That's what Unity (the game engine) does, so it wouldn't be unprecedented.
Gold for Anthropic but kinda shit for everyone else no? Now they have a profit motive for slowing down the normal service.
This is the Deliveroo playbook of offering a ‘premium’ service that is really just the original service, with the standard tier slowed down.
Same with speedy boarding for airlines. Now almost everyone pays for it so you don’t even get a benefit.
> Now they have a profit motive for slowing down the normal service.
Sure. But for now, this is a competitive space. The competitors offer models at a decent quality*speed/price ratio and prevent Anthropic from going too far downhill.
Actually, as I think about it... I don't enjoy any other model as much as Opus 4.5 and 4.6. For me, this is no longer a competitive space. Anthropic has every right to charge premium prices for their premium product.
The difference being that airlines and food delivery did make a profit; they just figured they had to do these tricks to earn some more. Mature businesses resort to lowering quality and fake scarcity.
Here the scarcity is real, and profits are nowhere to be seen
These schemes will soon fall apart entirely when an open weight model can run on Groq/Cerebras/SambaNova at even higher speeds and be just fine for all tasks. Arguably already the case, but not many know yet.
In Cursor, GPT models already have +Fast options that work faster at 2x the price.
Does that just use 2 agents at the same time or something like that?
So 2.5x the speed at 6x the price [1].
Quite a premium for speed. Especially when Gemini 3 Pro is 1.8x the tokens/sec speed (of regular-speed Opus 4.6) at 0.45x the price [2]. Though it's worse at coding, and Gemini CLI doesn't have the agentic strength of Claude Code, yet.
[1] - https://x.com/claudeai/status/2020207322124132504 [2] - https://artificialanalysis.ai/leaderboards/models
6x price/token, so 15x price/second, and only at the API pricing level, not the far cheaper (per token) subscription pricing.
Definitely an interesting way to encourage whales to spend a lot of money quickly.
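Spelling out that multiplication (the 2.5x and 6x figures come from the announcement linked above; the rest is arithmetic):

    price_multiplier = 6.0   # fast-mode price per token vs. regular Opus
    speed_multiplier = 2.5   # tokens per second vs. regular Opus
    # Dollars per wall-clock second of streaming scale with both factors:
    print(price_multiplier * speed_multiplier)   # 15.0x the spend rate while it runs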
I didn’t quite understand why they were randomly giving people $50 in credits. But I think this is why?
no, it’s for Max subscribers to enable “use API when running out of session limit”. the assumption (probably) being that many will forget to turn it off, and they’ll earn it back that way.
This was my first thought, but by default, you have no automatic reload of your prepaid account. Which I think is for once user friendly. They could have applied a dark pattern here.
Gemini is pretty good for frontend tasks
> Though it's worse at coding, and Gemini CLI doesn't have the agentic strength of Claude Code, yet.
You can use OpenCode instead of Gemini CLI.
or you can proxy Gemini through Claude Code
That sounds pretty nice. How are you achieving that?
Litellm makes it easy
A useful feature would be a slow mode which gets low-cost compute on spot pricing.
I'll often kick off a process at the end of my day, or over lunch. I don't need it to run immediately. I'd be fine if it just ran on their next otherwise-idle GPU at a much lower cost than the standard offering.
https://platform.claude.com/docs/en/build-with-claude/batch-...
> The Batches API offers significant cost savings. All usage is charged at 50% of the standard API prices.
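For anyone who hasn't used it, a minimal sketch with the Python SDK (the model id and prompt are placeholders; check the current docs for exact parameter names before relying on this):

    import anthropic

    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

    # Submit work now, collect results later (batches usually finish well under 24 h).
    batch = client.messages.batches.create(
        requests=[
            {
                "custom_id": "overnight-task-1",
                "params": {
                    "model": "claude-opus-4-5",  # placeholder model id
                    "max_tokens": 1024,
                    "messages": [{"role": "user", "content": "Summarize this changelog: ..."}],
                },
            },
        ],
    )
    print(batch.id, batch.processing_status)  # poll until it reports "ended", then fetch results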
Can this work for Claude? I think it might be raw API only.
I'm not sure I understand the question? Are you perhaps asking if messages can be batched via Claude Code and/or the Claude web UI?
Yes, Claude code.
No
OpenAI offers that, or at least used to. You can batch all your inference and get much lower prices.
Still do. Great for workloads where it's okay to bundle a bunch of requests and wait some hours (up to 24h, usually done faster) for all of them to complete.
Yep, same, I often wonder why this isn't a thing yet. Running some tasks overnight at, e.g., 50% of the cost - there's the Batch API, but that isn't integrated into e.g. Claude Code.
The discount MAX plans are already on slow-mode.
> I'll often kick off a process at the end of my day, or over lunch. I don't need it to run immediately. I'd be fine if it just ran on their next otherwise-idle GPU at a much lower cost than the standard offering.
If it's not time sensitive, why not just run it on CPU/RAM rather than GPU?
Yeah, just run an LLM with over 100 billion parameters on a CPU.
200 GB is an unfathomable amount of main memory for a CPU
(with apologies for snark,) give gpt-oss-120b a try. It’s not fast at all, but it can generate on CPU.
But it's incredibly incapable compared to SOTA models. OP wants high quality output but doesn't need it fast. Your suggestion would mean slow AND low quality output.
Run what exactly?
I'm assuming GP means 'run inference locally on GPU or RAM'. You can run really big LLMs on local infra, they just do a fraction of a token per second, so it might take all night to get a paragraph or two of text. Mix in things like thinking and tool calls, and it will take a long, long time to get anything useful out of it.
I’ve been experimenting with this today. I still don’t think AI is a very good use of my programming time… but it’s a pretty good use of my non-programming time.
I ran OpenCode with some 30B local models today and it got some useful stuff done while I was doing my budget, folding laundry, etc.
It’s less likely to “one shot” apples to apples compared to the big cloud models; Gemini 3 Pro can one shot reasonably complex coding problems through the chat interface. But through the agent interface where it can run tests, linters, etc. it does a pretty good job for the size of task I find reasonable to outsource to AI.
This is with a high end but not specifically AI-focused desktop that I mostly built with VMs, code compilation tasks, and gaming in mind some three years ago.
Yes, this is what I meant. People are running huge models at home now, I assumed people could do it on premises or in a data center if you're a business, presumably faster... but yeah it definitely depends on what time scales we're talking.
Huge models? First you have to spend $5k-$10k or more on hardware. Maybe $3k for something extremely slow (<1 tok/sec) that is disk-bound. So that's not a great deal over batch API pricing for a long, long time.
Also you still wouldn't be able to run "huge" models at a decent quantization and token speed. Kimi K2.5 (1T params) with a very aggressive quantization level might run on one Mac Studio with 512GB RAM at a few tokens per second.
To run Kimi K2.5 at an acceptable quantization and speed, you'd need to spend $15k+ on 2 Mac Studios with 512GB RAM and cluster them. Then you'll maybe get 10-15 tok/sec.
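The memory math driving those hardware numbers (the parameter count is from the comment above; the quantization levels are illustrative, and KV cache plus runtime overhead come on top of the weights):

    params_billion = 1000                      # ~1T parameters (Kimi K2.5)
    for bits in (4, 6, 8):
        weight_gb = params_billion * bits / 8  # GB just for the weights
        print(f"{bits}-bit: ~{weight_gb:.0f} GB of weights")
    # 4-bit: ~500 GB (barely inside one 512 GB box), 8-bit: ~1000 GB (needs two)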
I'd love to know what kind of hardware would it take to do inference at the speed provided by the frontier model providers (assuming their models were available for local use).
10k worth of hardware? 50k? 100k?
Assuming a single user.
Does that even work out to be cheaper, once you factor in how much extra power you'd need?
How much extra power do you think you would need to run an LLM on a CPU (that will fit in RAM and be useful still)? I have a beefy CPU and if I ran it 24/7 for a month it would only cost about $30 in electricity.
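For what it's worth, the $30 figure checks out under fairly ordinary assumptions (both numbers below are assumptions, not measurements):

    watts = 300            # assumed sustained package power under inference load
    hours = 24 * 30        # a month of continuous running
    usd_per_kwh = 0.14     # assumed residential electricity rate
    kwh = watts * hours / 1000
    print(f"{kwh:.0f} kWh -> ${kwh * usd_per_kwh:.2f}/month")  # ~216 kWh -> ~$30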
I wonder if a /slow mode at one-sixth the price could have some uses.
Note that you can't use this mode to get the most out of a subscription - they say it's always charged as extra usage:
> Fast mode usage is billed directly to extra usage, even if you have remaining usage on your plan. This means fast mode tokens do not count against your plan’s included usage and are charged at the fast mode rate from the first token.
Although if you visit the Usage screen right now, there's a deal you can claim for $50 free extra usage this month.
So it's basically useless then. Even with Claude Max I have to manage my usage when doing TDD, and using ccusage tool I've seen that I'd frequently hit $200 per day if I was on the API. At 6x cost you'll burn through $50 in about 20 minutes. I wish that was hyperbole.
I tried casually using it for two hours and it burned $100 at the current 50% discounted rate, so your guess is pretty accurate...
I still don't get why Claude is so expensive.
Because we all prefer it over Gemini and Codex. Anthropic knows that and needs to get as much out of it as possible while they can. Not saying the others will catch up soon. But at some point other models will be as capable as Opus and Sonnet are now, and then it's easier to let price guide the choice of provider.
My (and many others) normal workflow includes a planning phase, followed by an implementation phase. For me the most useful time for fast mode would be during that planning phase.
The current "clear context and execute plan" would be great to be joined by a, "clear context, switch to regular speed mode, and execute plan".
I even think I would not require fast mode for the explore agents etc. - they have so much to do that I accept it takes a while. Being able to rapidly iterate on the plan before setting it going would make it easier.
Please and thank you, Boris.
Counterintuitively, I feel like this will not be super useful, at least for me. My bottleneck is MY ability to parse and understand LLM-generated code. The agent can code a lot faster than I can read and understand its output.
If it was fast I'd ask questions more than read the code in detail. This isn't viable for that approach yet though.
Spend time building a test harness and evaluations of whether the solution meets the requirements. Then you don't need to look at the code because those other pieces will bring the necessary guarantees and trust.
People who vibe code don't really care about silly things such as understanding the code though.
There's a lot of people out there that aren't paying as close attention to the actual code. Wild times!
Smart if you can get away with it. For non-trivial things you can't quite get away with it.
I've started asking Claude to make me a high-level implementation plan and basically prompt ME to write it. For the most part I walk through it and ask Claude to just do it. But then 10% of the time there is a pretty major issue that I investigate, weigh pros/cons, and then decide to change course on.
Those things accumulate. Maybe 5-10 things over the course of an MVP that I wouldn't really have a clue about if I let Claude just dutifully implement its own plan.
Oh, I don’t like it.
Not including broken things, you end up with random errors or things that just don’t work right.
Looking at the "Decide when to use fast mode", it seems the future they want is:
- Long running autonomous agents and background tasks use regular processing.
- "Human in the loop" scenarios use fast mode.
Which makes perfect sense, but the question is - does the billing also make sense?
The billing doesn't even make sense for Opus at the API prices; the sub is the killer deal.
It'll be a Cadillac offering for whales. People who care about value will just run stuff in parallel.
> It'll be a Cadillac offering for whales.
We need to update terminology. Cadillac hasn't been the Cadillac of anything for decades now.
Cadillac margarita would like a word. ;)
The async AI + sync approval bottleneck is real. One thing that helped me: stop being physically desk-bound during those wait times.
I built ForkOff to solve this - when Claude needs approval, push notification to your phone, one-tap approve from anywhere. Turns out you don't actually need to be at your desk for most approvals.
The fast mode helps with speed, but even faster is letting the AI work while you're literally anywhere else. Early access: forkoff.app
(And yes, the pricing for fast mode is wild - $100 burned in 2 hours per some comments here!)
> ...stop being physically desk-bound during those wait times.
> I built ForkOff to solve this
This does sound useful, but I have to laugh because I just work out or play with my children. If you enjoy calisthenics and stretching, it's great to use Claude while not feeling chair-bound. Programming becomes more physical!
So we normal Pro accounts have slow mode. Thanks, Anthropic.
I'm currently testing Kimi K2.5 with its CLI; it works great and fast. It even comes with a web interface so you can communicate with your kimi-cli instance (remotely, even if you use a VPN).
I’m curious what’s behind the speed improvements. It seems unlikely it’s just prioritization, so what else is changing? Is it new hardware (à la Groq or Cerebras)? That seems plausible, especially since it isn’t available on some cloud providers.
Also wondering whether we’ll soon see separate “speed” vs “cleverness” pricing on other LLM providers too.
It comes from batching and multiple streams on a GPU. More people sharing 1 GPU makes everyone run slower but increases overall token throughput.
Mathematically it comes from the fact that this transformer block is this parallel algorithm. If you batch harder, increase parallelism, you can get higher tokens/s. But you get less throughput. Simultaneously there is also this dial that you can speculatively decode harder with fewer users.
It's true for basically all hardware and most models. You can draw this Pareto curve of throughput per GPU vs. tokens per second per stream. More tokens/s, less total throughput.
See this graph for actual numbers:
Token Throughput per GPU vs. Interactivity (gpt-oss 120B, FP4, 1K/8K; source: SemiAnalysis InferenceMAX)
https://inferencemax.semianalysis.com/
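A toy model of that dial, purely to show the shape of the curve (all three constants are made up, and it assumes decoding is memory-bandwidth-bound: every step streams the full weights once, plus each sequence's own KV cache):

    WEIGHTS_GB = 200      # assumed GB of weights streamed per decode step
    KV_GB_PER_SEQ = 2     # assumed GB of KV cache streamed per sequence per step
    MEM_BW_GBS = 3350     # HBM bandwidth in GB/s (H100-class figure)

    def step_time(batch_size):
        # One decode step: weights once, plus one KV-cache read per sequence in the batch.
        return (WEIGHTS_GB + batch_size * KV_GB_PER_SEQ) / MEM_BW_GBS

    for b in (1, 8, 32, 128):
        t = step_time(b)
        print(f"batch {b:4d}: {1 / t:6.1f} tok/s per user, {b / t:8.1f} tok/s total per GPU")

Per-user speed falls as the batch grows while total throughput rises, which is exactly the Pareto trade-off in the linked chart.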
> If you batch harder, increase parallelism, you can get higher tokens/s. But you get less throughput. Simultaneously there is also this dial that you can speculatively decode harder with fewer users.
I think you skipped the word “total throughput” there, right? Because tok/s is a measure of throughput, so it's clearer to say you increase throughput/user at the expense of throughput/GPU.
I’m not sure about the comment about speculative decode though. I haven’t served a frontier model but generally speculative decode I believe doesn’t help beyond a few tokens, so I’m not sure you can “speculatively decode harder” with fewer users.
There are a lot of knobs they could tweak. Newer hardware and traffic prioritisation would both make a lot of sense. But they could also lower batching windows to decrease queueing time at the cost of lower throughput, or keep the KV cache in GPU memory at the expense of reducing the number of users they can serve from each GPU node.
I think it's just routing to faster hardware:
H100 SXM: 3.35 TB/s HBM3
GB200: 8 TB/s HBM3e
2.4x faster memory - which is exactly what they are saying the speedup is. I suspect they are just routing to GB200 (or TPU etc equivalents).
FWIW I did notice _sometimes_ recently Opus was very fast. I put it down to a bug in Claude Code's token counting, but perhaps it was actually just occasionally getting routed to GB200s.
Dylan Patel did analysis that suggests lower batch size and more speculative decoding leads to 2.5x more per-user throughput for 6x the cost for open models [0]. Seems plausible this could be what they are doing. We probably won't get to know for sure any time soon.
Regardless, they don't need to be using new hardware to get speedups like this. It's possible you just hit A/B testing and not newer hardware. I'd be surprised if they were using their latest hardware for inference tbh.
[0] https://nitter.net/dylan522p/status/2020302299827171430
> It seems unlikely it’s just prioritization
Why does this seem unlikely? I have no doubt they are optimizing all the time, including inference speed, but why could this particular lever not entirely be driven by skipping the queue? It's an easy way to generate more money.
Yes, it's 100% prioritization. Through that it's also likely running on more GPUs at once, but that's an artifact of prioritization at the datacenter level. Any task coming into an AI datacenter at the moment is split into fairly fine-grained chunks of work and added to queues to be processed.
When you add a job with high priority all those chunks will be processed off the queue first by each and every GPU that frees up. It probably leads to more parallelism but... it's the prioritization that led to this happening. It's better to think of this as prioritization of your job leading to the perf improvement.
Here's a good blog for anyone interested which talks about prioritization and job scheduling. It's not quite at the datacenter level but the concepts are the same. Basically everything is thought of as a pipeline. All training jobs are low-pri (they take months to complete in any case), customer requests are mid-pri, and then there are options for high-pri. Everything in an AI datacenter is thought of in terms of 'flow'. Are there any bottlenecks? Are the pipelines always full and the expensive hardware always 100% utilized? Are the queue backlogs big enough to ensure full utilization at every stage?
https://www.aleksagordic.com/blog/vllm
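A minimal sketch of the idea, not any vendor's actual scheduler and far cruder than what the vLLM post describes; the priority levels and job names are invented for the example.

```python
import heapq
import itertools

# Chunks of work land in one queue; lower priority number = served first.
TRAINING, STANDARD, FAST = 2, 1, 0
tiebreak = itertools.count()          # FIFO ordering within a priority level
queue = []

def submit(priority, job):
    heapq.heappush(queue, (priority, next(tiebreak), job))

submit(STANDARD, "decode chunk: user A, step 17")
submit(TRAINING, "training microbatch 9812")
submit(FAST,     "decode chunk: fast-mode user B, step 3")

while queue:
    _, _, job = heapq.heappop(queue)
    print("free GPU picks up:", job)  # the fast-mode chunk jumps the queue
```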
>Yes it's 100% prioritization
Amazon Bedrock has a similar feature called "priority tier": you get faster responses at 1.75x the price. And they explicitly say in the docs "priority requests receive preferential treatment in the processing queue, moving ahead of standard requests for faster responses".
Until everyone buys it. Like a fast pass at an amusement park where the fast line is still two hours long.
At 6x the cost, and given that it requires paying full API pricing, I don’t think this is going to be a concern.
It's a good way to squeeze extra out of a bunch of people without actually raising prices.
I wonder if they might have mostly implemented this for themselves to use internally, and it is just prioritization but they don't expect too many others to pay the high cost.
Roon said as much here [0]:
> codex-5.2 is really amazing but using it from my personal and not work account over the weekend taught me some user empathy lol it’s a bit slow
[0] https://nitter.net/tszzl/status/2016338961040548123
I see Anthropic says so here as well: https://x.com/claudeai/status/2020207322124132504
Nvidia GB300 i.e. Blackwell.
> so what else is changing?
Let me guess. Quantization?
I was thinking about in-house model inference speeds at frontier labs like Anthropic and OpenAI after reading the "Claude built a C compiler" article.
Having higher inference speed would be an advantage, especially if you're trying to eat all the software and services.
Anthropic offering 2.5x makes me assume they have 5x or 10x themselves.
In the predicted nightmare future where everything happens via agents negotiating with agents, the side with the most compute, and the fastest compute, is going to steamroll everyone.
> Anthropic offering 2.5x makes me assume they have 5x or 10x themselves.
They said the 2.5X offering is what they've been using internally. Now they're offering via the API: https://x.com/claudeai/status/2020207322124132504
LLM APIs are tuned to handle a lot of parallel requests. In short, the overall token throughput is higher, but the individual requests are processed more slowly.
The scaling curves aren't that extreme, though. I doubt they could tune the knobs to get individual requests coming through at 10X the normal rate.
This likely comes from having some servers tuned for higher individual request throughput, at the expense of overall token throughput. It's possible that it's on some newer generation serving hardware, too.
This makes no sense. It's not like they have a "slow it down" knob, they're probably parallelizing your request so you get a 2.5x speedup at 10x the price.
All of these systems use massive pools of GPUs, and allocate many requests to each node. The “slow it down” knob is to steer a request to nodes with more concurrent requests; “speed it up” is to route to less-loaded nodes.
Right, but that's still not Anthropic adding an intentional delay for the sole purpose of having you pay more to remove it.
But it’s actually not so difficult, is it? The simplest way to make a slow pool is to have fewer GPUs and queue requests from the non-premium users. Dead simple engineering.
No, the simplest way is `sleep(10)`.
Oh, of course. That’s just conspiratorial thinking. Paying to be in a premium pool makes sense, all of this “they probably serve rotten food to make people pay for quality food” nonsense is just silly.
What they are probably doing is speculative decoding, given they've mentioned identical distribution at 2.5x speed. That's roughly in the range you'd expect for that to achieve; 10x is not.
It's also absolute highway robbery (or at least overly aggressive price discrimination) to charge 6x for speculative decoding, by the way. It is not that expensive and (under certain conditions, usually a very cheap drafter and a high acceptance rate) can actually decrease total cost. In any case, it's unlikely to be even a 2x cost increase, let alone 6x.
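For a rough sense of the numbers, the standard first-order estimate from the speculative decoding literature is speedup ≈ (1 − α^(γ+1)) / ((1 − α)(γc + 1)) for acceptance rate α, draft length γ, and a drafter that costs a fraction c of a target-model step. The values below are made-up illustrations, but they show that ~2.5x is comfortably in range while 10x is not.

```python
# Expected wall-clock speedup from speculative decoding under the usual
# i.i.d. acceptance assumption. Acceptance rates and drafter cost are
# illustrative guesses, not measurements of any production system.
def spec_decode_speedup(alpha: float, gamma: int, c: float) -> float:
    expected_tokens = (1 - alpha ** (gamma + 1)) / (1 - alpha)  # per verify step
    return expected_tokens / (gamma * c + 1)                    # vs. plain decode

for alpha in (0.6, 0.8, 0.9):
    for gamma in (2, 4, 8):
        print(f"alpha={alpha}  gamma={gamma}  "
              f"speedup={spec_decode_speedup(alpha, gamma, c=0.05):.2f}x")
```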
Where on earth are you getting these numbers? Why would a SaaS company that is fighting for market dominance withhold 10x performance if they had it? Where are you getting 2.5x?
This is such bizarre magical thinking, borderline conspiratorial.
There is no reason to believe any of the big AI players are serving anything less than the best trade-off of stability and speed that they can possibly muster, especially when their cost ratios are so bad.
Not magical thinking, not conspiratorial, just hypothetical.
Just because you can't afford to 10x all your customers' inference doesn't mean you can't afford to 10x your inhouse inference.
And 2.5x is from Anthropic's latest offering. But it costs you 6x normal API pricing.
Also, from a comment in another thread, from roon, who works at OpenAI:
> codex-5.2 is really amazing but using it from my personal and not work account over the weekend taught me some user empathy lol it’s a bit slow
[0] https://nitter.net/tszzl/status/2016338961040548123
That's also called slowing down the default experience so users have to pay more for fast mode. I think it's the first time we're seeing blatant speed ransoms in LLMs.
That's not how this works. LLM serving at scale processes multiple requests in parallel for efficiency. Reduce the parallelism and you can process individual requests faster, but the overall number of tokens processed is lower.
They can now easily decrease the speed for the normal mode, and then users will have to pay more for fast mode.
Do you have any evidence that this is happening? Or is it just a hypothetical threat you're proposing?
These companies aren't operating in a vacuum. Most of their users could change providers quickly if they started degrading their service.
They have contracts with companies, and those companies won't be able to change quickly. By the time those contracts come up for renewal it will already be too late, their code having become completely unreadable by humans. Individual devs can move quickly, but companies don't.
Are you at all familiar with the architecture of systems like theirs?
The reason people don't jump to your conclusion here (and why you get downvoted) is that for anyone familiar with how this is orchestrated on the backend it's obvious that they don't need to do artificial slowdowns.
I am familiar with the business model. This is a clear indication of what their future plan is.
Also, I was just pointing out the business issue, raising a point that hadn't been raised here. I just want people to be more cautious.
So you are not familiar with the system architecture. Okay.
Slowing down respect to what?
Slowing down with respect to the original speed of response. Basically what we used to get a few months back, which was the best possible experience.
There is no "original speed of response". The more resources you pour in, the faster it goes.
Watch them decrease resources for the normal mode so people are penny pinched into using fast mode.
Seriously, looking at the price structure of this (6x the price for 2.5x the speed, if that's correct), it seems to target something like real-time applications with very small context. Maybe voice assistants? I guess if you're doing development it makes more sense to parallelize over more agents rather than paying that much for a modest increase in speed.
At this point why don't we just CNAME HN to the Claude marketing blog?
Because we would miss simonw’s self-promotion blog posts.
jealous much?
It gives the same space, if not a lot more, to OpenAI.
It should definitely be renamed to AINews instead of HackerNews, but Claude posts are a lot less frequent than OpenAI's.
It's the big thing right now. Have a look at the old HN front pages and you'll notice other topics/technologies dominated in the past.
Just when you thought it was safe to use Opus 4.5 at 1/3 the cost, they go and add a 6x 'bank-breaking mode'. So now accidental bankruptcy is just one toggle away.
it states how much it costs but not how much faster it is
I’d love to hear from engineers who find that faster speed is a big unlock for them.
The deadline piece is really interesting. I suppose there’s a lot of people now who are basically limited by how fast their agents can run and on very aggressive timelines with funders breathing down their necks?
> I’d love to hear from engineers who find that faster speed is a big unlock for them.
How would it not be a big unlock? If the answers were instant I could stay focused and iterate even faster instead of having a back-and-forth.
Right now even medium requests can take 1-2 minutes and significant work can take even longer. I can usually make some progress on a code review, read more docs, or do a tiny chunk of productive work but the constant context switching back and forth every 60s is draining.
I won't be paying extra to use this, but Claude Code's feature-dev plugin is so slow that even when running two concurrent Claudes on two different tasks, I'm twiddling my thumbs some of the time. I'm not fast and I don't have tight deadlines, but nonetheless feature-dev is really slow. It would be better if it were fast enough that I wouldn't have time to switch off to a second task and could stick with the one until completion. The mental cost of juggling two tasks is high; humans aren't designed for multitasking.
Two? I'd estimate twelve (three projects x four tasks) going at peak.
3-4 parallel projects is the norm now, though I find task-parallelism still makes overlap reduction bothersome, even with worktrees. How did you work around that?
The only time I find faster speed to be a big unlock is when iterating on UI stuff. If you're talking to your agent, with hot reload and such the model can often be the bottleneck in a style tuning workflow by a lot.
If it could help avoid you needing to context switch between multiple agents, that could be a big mental load win.
The idea of development teams bottlenecked by agent speed rather than people, ideas, strategy, etc. gives me some strange vibes.
it's simpler than that - making it faster means it becomes less of an asynchronous task.
current speeds are "ask it to do a thing, and then you, the human, need to find something else to do for minutes (or more!) while it works". at a certain point of it being faster you just sit there, tell it to do a thing, it does it, and you keep working on that one thing.
cerebras is just about fast enough for that already, with the downside of being more expensive and worse at coding than claude code.
it feels like absolute magic to use though.
so, depends how you price your own context switches, really.
I see no mention of that, but OpenAI already has a "service tier" API option[0] that improves the speed of a request by about 40% according to my tests.
[0]: https://openai.com/api-priority-processing/
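For reference, a minimal sketch of what that looks like with the OpenAI Python SDK, assuming the service_tier parameter described on that page; the model name is a placeholder and priority access depends on whether your account is eligible for it.

```python
# Hedged sketch: request priority processing via service_tier.
# "gpt-5.2" is a placeholder model name, not a recommendation.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

resp = client.chat.completions.create(
    model="gpt-5.2",
    messages=[{"role": "user", "content": "Summarize this diff."}],
    service_tier="priority",   # pay the priority rate, skip ahead in the queue
)
print(resp.choices[0].message.content)
print(resp.service_tier)       # tier that actually served the request
```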
The one question I have that isn't answered by the page is how much faster?
Obviously they can't make promises but I'd still like a rough indication of how much this might improve the speed of responses.
Yeah, is this Cerebras/Groq speed, or do I just skip the queue?
2.5x faster or so (https://x.com/claudeai/status/2020207322124132504).
6x more expensive
What's crazy is the pricing difference given that OpenAI recently reduced latency on some models with no price change - https://x.com/OpenAIDevs/status/2018838297221726482
Yes, but GPT-5.2 and Codex were widely considered slower than Opus before that. They still feel very slow, at least on high. I should give medium a try more often.
If models really do continue to get more expensive, then it's not going to make sense to let everyone at your org have an equal spend budget. We're on track for a world where there's the equivalent of F1 drivers for AI.
One person's output doesn't scale for an entire org, no matter how expensive and good the AI is. Either it's good enough that anyone can do it, or it's bad enough that a human needs to be in control, which caps the output at human understanding. It will always be more efficient to boost every engineer a little than a single one a lot.
Given how little most of us can know about the true cost of inference for these providers (and thus the financial sustainability of their services), this is an interesting signal. Not sure how to interpret it, but it doesn’t feel like it bodes well.
Given that providers of open source models can offer Kimi K2.5 at input $0.60 and output $2.50 per million tokens, I think the cost of inference must be around that. We would still need to compare the tokens per second.
Fair but we technically do not know the parameter count
This seems like an incredibly bad deal, but maybe they're probing to see if people will pay more
You know if people pay for this en masse it'll be the new default pricing, with fast being another step above
Not at all. Some work is the combination of short duration, bottlenecking in nature, and serial.
Example: You're merging 3 branches, and there's a minor merge conflict.
You only need 15k tokens to fix it, so it's short duration. And it's bottlenecking. And it's a serial task.
This belongs on Cerebras or whatever.
Once fixed, go back to slower compute.
The pricing on this is absolutely nuts.
For us mere mortals, how fast does a normal developer go through an MTok? How about a good power user?
A developer can blast through millions of tokens in minutes. When you have a context size of 250k, that's just 4 queries. But with tool usage and subsequent calls etc. it can easily burn many millions in one request.
But if you just ask a question or something it’ll take a while to spend a million tokens…
It's worth noting those 250k tokens will be cached for repeat queries.
Seems like an opportunity to condense the context into 'documentation' level and only load the full text/code for files that expect to be edited?
Yeah, that's what they try to do with the latest coding agents' subagents, which only have the context they need, etc., but at the moment it's too much work to manage context at that level.
I use one Claude instance at a time, roughly full time (it writes 90% of my code). Generally making small changes, nothing weird. According to ccusage, I spend about $20 of tokens a day, a bit less than 1 MTok of output tokens a day. So the exact same workflow would be about $120 for higher speed.
I tried using it for two hours and it burned $100 at the 50% discounted pricing.
It doesn't say how much faster it is, but my experience with OpenAI's "service_tier=priority" option on SQLAI.ai is that it's about twice as fast.
Could be a use for the $50 extra usage credit. It requires extra usage to be enabled.
> Fast mode usage is billed directly to extra usage, even if you have remaining usage on your plan. This means fast mode tokens do not count against your plan’s included usage and are charged at the fast mode rate from the first token.
After exceeding the ever-shrinking session limit with Opus 4.6, I continued with extra usage for only a few minutes and it consumed about $10 of the credit.
I can't imagine how quickly this Fast Mode goes through credit.
It has to be. The timing is just too close.
While it's an excellent way to make more money in the moment, I think this might become a standard no-extra-cost feature in several months (see Opus becoming way cheaper and a default model within months). Mental load management while using agents will become even more important it seems.
Why would they cut a moneymaking feature? In fact, I'm already imagining them asking for a speed ransom every time you're in a pinch; some extra context space will also become buyable. Anthropic is in a penny-pinching phase right now and they will try to milk everything. Watch them add microtransactions too.
Yeah especially once they make an even faster fast mode.
I wouldn't be surprised if the implementation is
- Turn down the thinking token budget to one half
- Multiply the thinking tokens by 2 on the usage stats returned
- Phew! Twice the speed
IMO, charging for thinking tokens that you can't see is a scam.
I pay $200 a month and don't get any included access to this? Ridiculous
That is the point. They raised prices and want you to pay more for quicker answers.
No different from paying a knowledge worker, but this time you're paying more to get them to respond to your questions more quickly.
Well, you can burn your $50 bonus on it
The API price is 6x that of normal Opus, so look forward to a new $1200/mo subscription that gives you the same amount of usage if you need the extra speed.
I always wondered this: is it true? Does the math really come out that bad, 6x?
Is the writing on the wall for $100-$200/mo users that it's basically known to be subsidized for now, and $400/mo+ is coming sooner than we think?
Are they getting us all hooked and then going to raise it in the future, or will inference prices go down to offset?
The writing has been on the wall since day 1. They wouldn't be marketing a subscription being sold at a loss as hard as they are if the intention wasn't to lock you in and then increase the price later.
What I expect to happen is that they'll slowly decrease the usage limits on the existing subscriptions over time, and introduce new, more expensive subscription tiers with more usage. There's a reason why AI subscriptions generally don't tell you exactly what the limits are, they're intended to be "flexible" to allow for this.
But it says "Available to all Claude Code users on subscription plans (Pro/Max/Team/Enterprise) and Claude Console."
Is this wrong?
It's explicitly called out as excluded in the blue info bubble they have there.
> Fast mode usage is billed directly to extra usage, even if you have remaining usage on your plan. This means fast mode tokens do not count against your plan’s included usage and are charged at the fast mode rate from the first token.
https://code.claude.com/docs/en/fast-mode#requirements
I think this is just worded in a misleading way. It’s available to all users, but it’s not included as part of the plan.
I really like Anthropic's web design. This doc site looks like it's using gitbook (or a clone of gitbook) but they make it look so nice.
It's just https://www.mintlify.com/ with a barely customized theme.
Ah fair enough. Their webdesign on the homepage and other stuff is still great. And the font/colour choice on the Mintlify theme is nice.
Mintlify is the best example of a product that is just nice. They don't claim to have a moat, or weird AGI vibes, or whatever. It just works and it's pretty. $10M ARR right there.
Looks like mintlify to me. Especially the copy page button.
So fast mode uses more tokens, in direct opposition to Gemini where fast 'mode' means less. One more piece of useless knowledge to remember.
I don't think this is the case, according to the docs, right? The effort level will use fewer tokens, but the independent fast mode just somehow seems to use some higher priority infrastructure to serve your requests.
You're comparing two different things. It's not useless knowledge, it's something you need to understand.
Opus fast mode is routed to different servers with different tuning that prioritizes individual response throughput. Same model served differently. Same response, just delivered faster.
The Gemini fast mode is a different model (most likely) with different levels of thinking applied. Very different response.
AFAIK, they don't have any deals or partnerships with Groq or Cerebras or any of those kinds of companies.. so how did they do this?
Inference is run on shared hardware already, so they're not giving you the full bandwidth of the system by default. This most likely just allocates more resources to your request.
The models are running on Google TPUs.
Could well be running on Google TPUs.
Where is this perf gain coming from? Running on TPUs?
AI data centers are a whole lot of pipelines pumping data around utilizing queues. They want those expensive, power-hungry cards near 100% utilized at all times. So they have a queue of jobs on each system ready to run, feeding into GPU memory as fast as completed jobs are read out of memory (and passed on to the next stage), and they aim to have enough backlog in these queues to keep the pipeline full. You see responses in seconds, but at the data center your request was broken into jobs, passed around into queues, processed in an orderly manner, and pieced back together.
With fast mode you're literally skipping the queue. An outcome of all of this is that for the rest of us the responses will become slower the more people use this 'fast' option.
I do suspect they'll also soon have a slow option for those that have Claude doing things overnight with no real care for latency of the responses. The ultimate goal is pipelines of data hitting 100% hardware utilization at all times.
Hmm, not sure I agree with you there entirely. You're right that there are queues to ensure you max out the hardware with concurrent batches to _start_ inference, but I doubt you'd want to split the same job into multiple bits and move them around servers if you could at all avoid it.
It requires a lot of bandwidth to do that, and even at 400 Gbit/s it would take a good second to move even a smallish KV cache between racks, even in the same DC.
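Back-of-the-envelope check on that claim; the KV-cache size here is an assumed example figure, not a measurement of any particular model.

```python
# 400 Gbit/s link vs. an assumed ~60 GB KV cache.
link_bytes_per_s = 400 / 8 * 1e9          # 400 Gbit/s = 50 GB/s
kv_cache_bytes = 60e9                     # assumption for the example
print(f"transfer time ~ {kv_cache_bytes / link_bytes_per_s:.1f} s")  # ~1.2 s
```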
Will this mean that when cost is more important than latency that replies will now take longer?
I’m not in favor of the ad model chatgpt proposes. But business models like these suffer from similar traps.
If it works for them, then the logical next step is to convert more to use fast mode. Which naturally means to slow things down for those that didn’t pick/pay for fast mode.
We’ve seen it with iPhones being slowed down to make the newer model seem faster.
Not saying it’ll happen. I love Claude. But these business models almost always invite dark patterns in order to move the bottom line.
No we’ve actually never seen that in iPhones. There is zero proof of this.
It's a good way to address the price insensitive segment. As long as they don't slow down the rest, good move.
This sounds like one of those theme park "skip the queue" tickets. It will absolutely slow down the rest.
Whatever optimisation is going on is at the hardware level since the fast option persists in a session.
This pricing is pathetic. I've been using it for two hours at what I consider "normal" interactive speed and it burned $100. Normally the $200 subscription is enough for an entire month. I guess if you are rich, you can pay 40 times as much for roughly double speed (assuming 8 hours usage a day, 5 days a week)?
Edit: I just realized that's with the currently 50% discounted price! So you can pay 80 times as much!
Fast. Cheap. Quality. Pick 2.
Smart business model decision, since most people and organizations prefer regular progress.
In the future this might be the reason enterprise software companies win: because they can use their customer funds to pay for faster tokens and adaptations.
let's see where it goes.
Interesting, the output price per MTok is insane.
Is this the beginning of the 'speedy boarding' / 'fastest delivery' enshittification?
Where everyone is forced to pay for a speed-up because the 'normal' service just gets slower and slower.
I hope not. But I fear.
This is to test the room before the real enshittification happens. Companies who bought from Anthropic are really in for a ride.
Give me a slow mode that’s cheaper instead lol
If anyone from Anthropic is here…….
I need a way to put stuff in code with 150% certainty that no LLM will remove it.
Specifically so I can link requirements identifiers to locations in code, but there must be other uses too.
Instead of better/cheaper/faster you just get the last one?
Back to Gemini.
AI trained on real-time data will always and only get dumber over time. Reversion to the mean of human IQ, just like every web forum ever. Eternal September.
That's why gen 1-3 AI felt so smart. It was trained on the best curated human knowledge available. Now that that's done, it's just humanity's brain dumps left to learn from.
Two ways out: self-referential learning from gen 1-3 AIs, or pay experts to improve datasets instead of training on general human data. Inputs and outputs.
If this pricing ratio holds it is going to mint money for Cerebras.
Many suspected a 2x premium for 10x faster. It looks like they may have been incorrect.
I redeemed my 50 USD credit to give it a go. In literally less than 10 minutes I spent 10 USD. Insane. I love Claude Code, but this pricing is madness.
What would have been the human labor cost equivalent?
But waiting for the agent to finish is my 2026 equivalent of "compiling!"
https://xkcd.com/303/
Personally, I’d prefer a slow mode that’s a ton cheaper for a lot of things.
> $30/150 MTok
Umm, no thank you.
LLM programming is very easy. First you have to prompt it to not make mistakes. Then you have to tell it to go fast. Software engineering is over bro, all humans will be replaced in 6 days bro.
What is “$30/150MTok”? Claude Opus 4.6 is normally priced at “$25/MTok”. Am I just reading it wrong or is this a typo?
EDIT: I understand now. $30 for input, $150 for output. Very confusing wording. That’s insanely expensive!
Yeah I don't understand. Is it actually saying that fast mode is ten times more expensive than normal mode? I cannot be reading this right.
Yes.
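For a rough sense of why commenters upthread report burning $50-$100 in a couple of hours, here's a back-of-the-envelope per-request cost, taking the fast-mode pricing above and the 6x multiplier quoted elsewhere in the thread (which would put standard Opus at $5/$25 per MTok); the token counts are invented and prompt caching is ignored.

```python
# Back-of-the-envelope per-request cost; token counts are made-up examples.
def request_cost(input_toks, output_toks, in_per_mtok, out_per_mtok):
    return input_toks / 1e6 * in_per_mtok + output_toks / 1e6 * out_per_mtok

inp, out = 200_000, 20_000                      # one large agentic turn
standard = request_cost(inp, out, 5, 25)        # assumed standard Opus pricing
fast     = request_cost(inp, out, 30, 150)      # fast-mode pricing from the docs

print(f"standard: ${standard:.2f}   fast: ${fast:.2f}")   # ~$1.50 vs ~$9.00
```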