Claude Sonnet 4.5

anthropic.com

1553 points by adocomplete 2 days ago


System card: https://assets.anthropic.com/m/12f214efcc2f457a/original/Cla...

simonw - 2 days ago

I had access to a preview over the weekend, I published some notes here: https://simonwillison.net/2025/Sep/29/claude-sonnet-4-5/

It's very good - I think probably a tiny bit better than GPT-5-Codex, based on vibes more than a comprehensive comparison (there are plenty of benchmarks out there that attempt to be more methodical than vibes).

It particularly shines when you try it on https://claude.ai/ using its brand new Python/Node.js code interpreter mode. Try this prompt and see what happens:

  Checkout https://github.com/simonw/llm and run the tests with
  
  pip install -e '.[test]'
  pytest
I then had it iterate on a pretty complex database refactoring task, described in my post.

iagooar - 2 days ago

Anecdotal evidence.

I have a fairly large web application with ~200k LoC.

Gave the same prompt to Sonnet 4.5 (Claude Code) and GPT-5-Codex (Codex CLI).

"implement a fuzzy search for conversations and reports either when selecting "Go to Conversation" or "Go to Report" and typing the title or when the user types in the title in the main input field, and none of the standard elements match, a search starts with a 2s delay"

Sonnet 4.5 went really fast, at ~3 min. But what it built was broken and superficial. The code didn't even reuse the existing auth; it started rebuilding auth server-side instead of looking at how other API endpoints do it. Even re-prompting and telling it where it went wrong did not help much. No tests were written (despite the project rules requiring it).

GPT-5-Codex needed MUCH longer, ~20 min. The changes it made were much more profound: it implemented proper error handling and lots of edge cases, and wrote tests without me prompting it to do so (the project rules already require it). API calls ran smoothly. The entire feature worked perfectly.

My conclusion is clear: GPT-5-Codex is the clear winner, not even close.

I will take the 20 minutes every single time, knowing the work that has been done feels like work done by a senior dev.

The 3 minutes surprised me a lot, and I was hoping to see great results in such a short period of time. But of course, a quick & dirty, buggy implementation with no tests is not what I wanted.

manofmanysmiles - 2 days ago

I haven't shouted into the void for a while. Today is as good a day as any other to do so.

I feel extremely disempowered that these coding sessions are effectively black box, and non-reproducible. It feels like I am coding with nothing but hopes and dreams, and the connection between my will and the patterns of energy is so tenuous I almost don't feel like touching a computer again.

A lack of determinism comes from many places, but primarily: 1) the models change, 2) the models are not deterministic, 3) the history of tool use and chat input is not available as a first-class artifact.

I would love to see a tool that logs the full history of all agents that sculpt a codebase, including the inputs to tools, tool versions, and any other sources of entropy. Logging the seeds fed into the RNGs that drive LLM output would be the final piece that would give me confidence to consider using these tools seriously.
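To make that concrete, here is roughly the record I'd want per agent step (a minimal sketch in Python; every field name is mine, not from any existing tool):

  from dataclasses import dataclass, field
  from typing import Any

  @dataclass
  class AgentStepRecord:
      """One entry in an append-only log of everything that shaped the codebase."""
      model_id: str                  # exact model/version string, e.g. a dated snapshot
      rng_seed: int | None           # sampling seed, if the provider ever exposes one
      prompt: str                    # full prompt / chat input sent to the model
      tool_name: str | None = None   # tool that was invoked, if any
      tool_version: str | None = None
      tool_input: dict[str, Any] = field(default_factory=dict)
      tool_output: str | None = None
      diff: str | None = None        # patch the step applied to the repo

Replaying that log (same model snapshot, same seeds, same tool versions) is what would make a session reproducible rather than a one-off.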

I write this now after what I am calling "AI disillusionment": a state where I feel so disconnected from my codebase that I'd rather just delete it than continue.

Having a set of breadcrumbs would give me at least a modicum of confidence that the work was reproducible and not the product of some modern ghost, completely detached from my will.

Of course this would require actually owning the full LLM.

Bjorkbat - 2 days ago

> Practically speaking, we’ve observed it maintaining focus for more than 30 hours on complex, multi-step tasks.

Really curious about this, since people keep bringing it up on Twitter. They mention it pretty much off-handedly in their press release, and it doesn't show up at all in their system card. It's only through an article on The Verge that we get more context: apparently they told it to build a Slack clone and left it unattended for 30 hours, and it built one using 11,000 lines of code (https://www.theverge.com/ai-artificial-intelligence/787524/a...)

I have very low expectations for what would happen if you took an LLM and let it run unattended for 30 hours on a task, so I have a lot of questions about the quality of the output.

rudedogg - 2 days ago

I just ran this on a simple change I've asked Sonnet 4 and Opus 4.1 to make, and it fails too.

It's a simple substitution request where I provide a lint error that suggests the correct change. All the models fail. I could ask someone with no development experience to make this change and they could.

I worry everyone is chasing benchmarks to the detriment of general performance. Or the next-token weights for the incorrect change outweigh my simple but precise instructions. Either way, it's no good.

Edit: With a follow-up "please do what I asked" sort of prompt it came through, while Opus just loops. So there's that, at least.

yewenjie - 2 days ago

Looking at the chart here, it seems like Sonnet 4 was already better than GPT-5-Codex on the SWE-bench Verified benchmark.

However, my subjective personal experience was that GPT-5-Codex was far better at complex problems than Claude Code.

ojosilva - 2 days ago

To @simonw and all the coding agent and LLM benchmarkers out there: please, always publish the elapsed time for the task to complete successfully! I know this was just a "it works straight in claude.ai" post, but still, nowhere in the transcript is there a timestamp of any kind. Durations seem to be COMPLETELY missing from the LLM coding leaderboards everywhere [1] [2] [3].

There's a huge difference in time-to-completion from model to model and platform to platform, and if, like me, you are into trial and error, rebooting the session over and over to get the prompt right or "one-shot" it, it matters how reasoning effort, the provider's tokens/s, coding-agent tooling efficiency, cost, and overall model intelligence play together to get the task done. The same applies to the coding agent, where applicable.

Grok Code Fast and Cerebras Code (Qwen) are two examples of how models can be very competitive without being top-notch intelligence. Running inference at 10x speed really allows for a leaner AI-assisted coding experience and more task completion per day than a sluggish but more correct AI. Darn, I feel like a corporate butt-head right now.
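Even something as crude as this, recorded next to the pass/fail result for every leaderboard row, would already help (a sketch; the command is a stand-in for whatever agent invocation the harness uses):

  import subprocess
  import time

  def run_task(cmd: list[str]) -> tuple[bool, float]:
      """Run one benchmark task, returning (passed, wall-clock seconds)."""
      start = time.monotonic()
      result = subprocess.run(cmd, capture_output=True)
      return result.returncode == 0, time.monotonic() - start

  # hypothetical invocation; substitute the real agent/CLI command
  passed, seconds = run_task(["some-agent-cli", "run", "task.md"])
  print(f"passed={passed} elapsed={seconds:.1f}s")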

1. https://www.swebench.com/

2. https://www.tbench.ai/leaderboard

3. https://gosuevals.com/agents.html

lexarflash8g - 2 days ago

Just tested this on a rather simple issue. Basically it falls into rabbit holes just like the other models and tries to brute-force fixes, overengineering through trial and error. It also says "your job should now pass" after maybe 10 prompts of roughly the same thing, stuck in a thought loop.

A GH Actions pipeline was failing because a CI job had no source code files -- the error was "No build system detected". Using Cursor agent with Sonnet 4.5, it would try adding dummy .json files, setting parameters in the workflow YAML to false, and even setting parameters that don't exist. The simple solution was to just override the step's logic with "Hello world" to get the job to pass.

I don't understand why the models are so bad at simple, outside-the-box solutions. It's like a 170-IQ savant who can't even ride public transportation.

bradley13 - 2 days ago

I need to try Claude - haven't gotten to it.

I use AI for different things, though, including proofreading posts on political topics. I have run into situations where ChatGPT just freezes and refuses. Example: discussing the recent rape case involving a 12-year-old in Austria. I assume its guardrails detect "sex + kid" and give a hard "no" regardless of the actual context or content.

That is unacceptable.

That's like your word processor refusing to let you write about sensitive topics. It's a tool, it doesn't get to make that choice.

peterdstallion - 2 days ago

I am a paying subscriber to Gemini, Claude and OpenAI.

I don't know if it's just me, but over the last few weeks I've come to the conclusion that ChatGPT is very strongly leading the race. Every answer it gives me is better: more concise and more informative.

I look forward to testing this further, but out of the few runs I just did after reading about this, it isn't looking much better.

trevin - 2 days ago

I’m always fascinated by the fine-tuning of LLM personalities. Might we finally get less of the reflexive “You’re absolutely right” with this one?

Maybe we’re entering the Emo Claude era.

Per the system card: In 250k real conversations, Claude Sonnet 4.5 expressed happiness about half as often as Claude 4, though distress remained steady.

nickstinemates - 2 days ago

I gave it a quick spin with System Initiative[1]. The combination solved, in 15 minutes, a 503 error in our infrastructure that had taken over 2 hours to debug manually.

It's pretty good! I wrote about a few other use cases on my blog[2]

1: https://systeminit.com 2: https://keeb.dev/2025/09/29/claude-sonnet-4.5-system-initiat...

baobabKoodaa - 2 days ago

Here's some anecdata. I have a real-world financial dataset for which I have created benchmarks. Sonnet 4.5 provides no measurable improvement on these benchmarks over Sonnet 4. This is a bit surprising to me, especially considering that the benchmark results published by Anthropic indicate that Sonnet 4.5 should be better than Sonnet 4 specifically at financial data analysis.

zurfer - 2 days ago

Same price and a 4.5-point jump from 72.7 to 77.2 on SWE-bench.

Pretty solid progress for roughly 4 months.

pembrook - 2 days ago

If they stopped the automatic "You're absolutely right!" responses after the model fails to fix something 20 times in a row, then that alone will be worth the upgrade.

Me: "You just burned my house down"

Claude: "You're absolutely right! I burned your house down, I need to revert the previous change and..."

Me: "Now you rebuilt my house with a toilet in the living room"

Claude: "You're absolutely right! I put a toilet in your living room..."

Etc.

0xbadcafebee - 2 days ago

Claude doesn't know how to calculate realistic minimum voltages for solar arrays w/MPPT chargers. ChatGPT does.

Prompt: "Can I use two strings of four Phono Solar PS440M8GFH solar panels with a EG4 12kPV Hybrid Inverter? I want to make sure that there will not be an issue any time of year. New York upstate."

Claude 4.5: Returns within a few seconds. Does not find the PV panel specs, so it asks me if I want it to search for them. I say yes. Then it finally comes up with: "YES, your configuration is SAFE [...] MPPT range check: Your operating voltage of 131.16V fits comfortably in the 120-500V MPPT operating range".

ChatGPT 5: Returns after 78 seconds. Says: "Hot-weather Vmpp check: Vmpp_string @ STC = 4 × 32.79 = 131 V (inside 120–500 V). Using the panel’s NOCT point (31.17 V each), a typical summer operating point is ~125 V — still OK. But at very hot cell temps (≈70 °C is possible), Vmpp can drop roughly ~13% from STC → ~114 V, which is below the EG4’s 120 V MPPT lower limit. That can cause the tracker to fall out of its optimal range and reduce harvest during peak heat."

ChatGPT used deeper thinking to determine that the lowest possible voltage in the heat would be below the MPPT's minimum operating voltage. It doesn't indicate that in reality it might not charge at all at that point... but it does point out the risk, whereas Claude says everything is fine. I need about 5 back-and-forths with Claude to get it to finally realize its mistake.
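For reference, the arithmetic ChatGPT is doing is roughly this (a sketch using the numbers from its answer; the ~13% hot-cell derate is its estimate, not a datasheet value):

  # String-voltage sanity check: 4x Phono Solar PS440M8GFH into an EG4 12kPV
  panels_per_string = 4
  vmpp_stc = 32.79                  # per-panel Vmpp at STC, volts
  mppt_min, mppt_max = 120, 500     # EG4 12kPV MPPT operating window, volts

  v_string_stc = panels_per_string * vmpp_stc     # ~131 V, inside the window
  hot_derate = 0.13                               # ~13% Vmpp drop at ~70 degC cell temp
  v_string_hot = v_string_stc * (1 - hot_derate)  # ~114 V

  print(f"STC: {v_string_stc:.0f} V, hot: {v_string_hot:.0f} V, "
        f"below MPPT minimum: {v_string_hot < mppt_min}")

That below-the-window case in summer heat is exactly the risk Claude glossed over.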

schmorptron - 2 days ago

Oh wow, a lot of focus on code from the big labs recently. In hindsight it makes sense that the domain the people building these models know best is the one getting the most attention, and it's also the one where the models have shown the most undeniable usefulness so far. Personally, though, the unpredictability of where all of this goes is a bit unsettling.

MichealCodes - 2 days ago

I really hope benchmarking improves soon to monitor the model in the weeks following the announcement. It really seems like these companies introduce a new "buffed" model and then slowly nerf the intelligence through optimizations.

If we saw task performance week 1 vs week 8 on benchmarks, this would at least give us more insight into the loop here. In an environment lacking true progress a company could surely "show" it with this strategy.

siva7 - 2 days ago

Does 4.5 still answer everything with "You're absolutely right!" or is it now able to communicate like a real programmer?

techpression - 2 days ago

It took me one question to have it spit out a completely dreamt-up codebase, complete with emojis and promises of solutions that would fix all my problems, and of course none of it worked. It was a very simple question about something very well documented (Oban timeouts).

I doubt LLM benchmarks more and more, what are they even testing?

greenfish6 - 2 days ago

As the rate of model improvement appears to slow, the first reactions seem to be getting worse and worse, as it takes more time to assess a model's quality and understand the nuances and subtler improvements.

unshavedyak - 2 days ago

Interesting: in the new 2.0.0 Claude Code they got rid of the "Plan with Opus, then switch to Sonnet" feature. I hope they're right that Sonnet is good enough to plan too, because I quite preferred Opus planning. It wasn't necessarily "better", just more predictable in my experience.

Also, as a Max $200 user, it feels weird to be paying for an Opus-tailored sub when the standard Max $100 would now be preferred, since they claim Sonnet is better than Opus.

Hope they have Opus 4.5 coming out soon, or next month I'm downgrading.

Galaco - 2 days ago

If you pause your subscription, Claude.ai breaks. I paused my subscription, and my account immediately transitioned to free. It removed my invoice history, and attempts to upgrade again fail with an internal error. Their chatbot is telling me to navigate to UI elements that don't exist, and free users do not have the option of human support.

So I'm stuck: my sub is paused, I can neither cancel nor unpause, and I cannot speak to a human to solve this, because the pause process took away all possibility of human interaction.

This is the future we live in.

- 2 days ago
[deleted]
cryptoz - 2 days ago

I've really got to refactor my side project, which I tailored to just use OpenAI API calls. I think the Anthropic APIs are a bit different, so I never put in the energy to support the changes. I remember reading that there are tools to simplify this kind of work and support multiple LLM APIs? I'm sure I could do it manually, but how do you all support multiple API providers that have some differences in API design?
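The kind of thing I'm imagining is a thin wrapper like this (a rough sketch; the model names are placeholders, and I'm only using the documented call shapes of the two official Python SDKs):

  # pip install openai anthropic
  from openai import OpenAI
  from anthropic import Anthropic

  def ask(provider: str, prompt: str) -> str:
      """Send one user message and return the text reply, hiding provider differences."""
      if provider == "openai":
          client = OpenAI()  # reads OPENAI_API_KEY from the environment
          resp = client.chat.completions.create(
              model="gpt-5",  # placeholder model name
              messages=[{"role": "user", "content": prompt}],
          )
          return resp.choices[0].message.content
      if provider == "anthropic":
          client = Anthropic()  # reads ANTHROPIC_API_KEY from the environment
          resp = client.messages.create(
              model="claude-sonnet-4-5",  # placeholder model name
              max_tokens=1024,            # required by the Anthropic messages API
              messages=[{"role": "user", "content": prompt}],
          )
          return resp.content[0].text
      raise ValueError(f"unknown provider: {provider}")

Streaming and tool calling are where the two APIs diverge more, which I assume is where gateway-style libraries earn their keep.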

mohsen1 - 2 days ago

Price is playing a big role in my AI usage for coding. I am using Grok Code Fast as it's super cheap, and next to it GPT-5 Codex. If you are paying for model use out of pocket, Claude's prices are super expensive. With a better tooling setup, those less smart (and often faster) models can give you better results.

I am going to give this another shot but it will cost me $50 just to try it on a real project :(

devinprater - 2 days ago

I hope that one day Anthropic works on making Claude more accessible to screen reader users. ChatGPT is currently the only AI I know of that, when it's thinking, sends that status to the screen reader, and then sends the response to the screen reader to be spoken as well, like any other good chat app does.

chipgap98 - 2 days ago

Interesting that this is better than Opus 4.1. I want to see how this holds up under real-world use, but if that's the case it's very impressive.

I wonder how long it will be before we get Opus 4.5

Aflynn50 - 2 days ago

When I see how much the latest models are capable of it makes me feel depressed.

As well as potentially ruining my career in the next few years, it's turning all the minutiae and specifics of writing clean code, which I've worked hard to learn over the past years, into irrelevant details. All the specifics I thought were so important are just implementation details of the prompt.

Maybe I've got a fairly backwards view of it, but I don't like the feeling that all that time and learning has gone to waste, and that my skillset of automating things is becoming itself more and more automated.

alach11 - 2 days ago

I'm really interested in the progress on computer use. These are the benchmarks to watch if you want to forecast economic disruption, IMO. Mastery of computer use takes us out of the paradigm of task-specific integrations with AI to a more generic interface that's way more scalable.

Kim_Bruning - 10 hours ago

I think it's to do with system prompting too, but Sonnet 4.5 actually pushes back at times. And it tries to keep me on topic. It's refreshing!

cmrdporcupine - 2 days ago

So far I'm liking that it seems to follow my CLAUDE.md instructions better, doing more frequent check-ins to ask me to review what it's done, and taking my advice more.

What I'm not liking is that it seems even... lazier... than before. By which I mean the classic "This is getting complicated so..." (followed by a cop-out, dropping the original task and motivation).

There's also a bug where compaction becomes impossible ("conversation too long", and its advice on how to fix it doesn't work).

jatins - 2 days ago

I tested this on some day-to-day pattern-matching kinds of tasks and it didn't do well. Still the same over-eagerness to make wild code changes instead of "reasoning" about the error.

_joel - 2 days ago

`claude --model claude-sonnet-4-5-20250929` for CLI users

seaal - 2 days ago

They really had to release an updated model; I can only imagine how many people cancelled their plans and switched over to Codex over the past month.

I'm glad they at least gave me the full $100 refund.

vbtechguy - 2 days ago

Claude Sonnet 4.5 is definitely the best model I've tried to date - my evaluation rankings against 23 AI models are at https://github.com/centminmod/claude-sonnet-4.5-evaluation :)

catigula - 2 days ago

I'm still absolutely right constantly, I'm a genius. I also make various excellent points.

andrewstuart - 2 days ago

Still waiting to be able to upload zip files to Claude, which Gemini and ChatGPT have had for ages.

ChatGPT even does zip file downloads, packaging up all your files.

rtp4me - 2 days ago

Just updated to Sonnet 4.5 and Claude Code 2.0 this afternoon. I worked on a quick project (creating PXE bootable files) using the updates and have to say, this new version seems much faster and more accurate than before. I did not go round-and-round trying to get good output and Claude did not go down rabbit holes like before. So far, so good.

user1999919 - 2 days ago

It's time to start benchmarking the benchmarks. I'm pretty sure they're doping the game at BMW levels here.

mohsen1 - 2 days ago

That's a pretty pelican on a bicycle!

https://jsbin.com/hiruvubona/edit?html,output

https://claude.ai/share/618abbbf-6a41-45c0-bdc0-28794baa1b6c

aliljet - 2 days ago

These benchmarks remain remarkably weak proxies for real-world work. If you're using this for day-to-day work, the eval that really matters is how the model handles a ten-step action. Context and focus are absolutely king in real-world work. To be fair, Sonnet has tended to be very good at that...

I wonder if the 1M token context length is coming along for the ride too?

StarterPro - 2 days ago

Once the bottom falls out of AI, will programming be seen as a marketable skill again?

cube2222 - 2 days ago

So… seems like we’re back to Sonnet being better than Opus? At least based on their benchmarks.

Curious to see that in practice, but great if true!

usr19021ag - 2 days ago

Their benchmark chart doesn't match what's published on https://www.swebench.com/.

I understand that they may not have published the results for Sonnet 4.5 yet, but I would expect the other models to match...

mattlangston - 2 days ago

It does well with screenshot-calculus for me. For example, I pasted a screenshot of the Layer Norm equation into Claude Code 2 and asked:

"Differentiate y(x) w.r.t x, gamma and beta."

It not only produced the correct result, but it understood the context - I didn't tell it the context was layer norm, back-propagation and matrices.
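For reference, the partials it had to reproduce, written out in LaTeX (my own derivation, assuming a single length-N vector with elementwise affine parameters):

  y_i = \gamma_i\,\hat{x}_i + \beta_i, \qquad
  \hat{x}_i = \frac{x_i - \mu}{\sqrt{\sigma^2 + \epsilon}}

  \frac{\partial y_i}{\partial \beta_i} = 1, \qquad
  \frac{\partial y_i}{\partial \gamma_i} = \hat{x}_i, \qquad
  \frac{\partial y_i}{\partial x_j} =
    \frac{\gamma_i}{\sqrt{\sigma^2 + \epsilon}}
    \left(\delta_{ij} - \frac{1}{N} - \frac{\hat{x}_i\hat{x}_j}{N}\right)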

This release is a step function for my use cases.

My screenshot came from here: https://docs.pytorch.org/docs/stable/generated/torch.nn.Laye...

mchusma - 2 days ago

For me, Opus 4.1 was so much better than Sonnet 4.0 that I used it exclusively in Claude Code and cancelled Cursor. I'm a bit skeptical that Sonnet 4.5 will be in practice better, but will test with it and see! Hopefully we get Opus 4.5 soon.

n8m8 - 2 days ago

So far the only thing I've noticed is that it made me confirm that it should do a 10-minute task "manually" because it "would take 2 or 3 hours".

It was a context-merging task for my unorganized collection of agents… it sort of made sense, but that was the exact reason I was asking it to do it… like, you're the bot, lol.

ChaoPrayaWave - 2 days ago

What impressed me most about Claude Sonnet 4.5 is that its output structure is more stable than that of many other models and less prone to crashes. I ran some real-world scripts from my own projects, and it exhibited fewer hallucinations than GPT-4 and performed more faithfully on code interpretation tasks. However, it can be a bit slow to warm up, and sometimes I needed more prompts in the first few rounds.

oscord - 2 days ago

Sonnet 4 had turned to shit recently (over about the past 2.5 months, by my observation). It hallucinated on 3 questions in a row while looking at a simple bash script. That was enough for me to cancel. Claude biz is killing Claude dev. It was good while they were not so stingy with GPUs.

jdlyga - 2 days ago

Anecdotal evidence, but compared to Claude Sonnet 4 I'm noticing very little difference.

wohoef - 2 days ago

And Sonnet is again better than Opus. I’d love to see simultaneous release dates for Sonnet and Opus one day. Just so that Opus is always better than Sonnet

system2 - 2 days ago

I didn't try the checkpoints; I use local git plus /resume from a chat I pick that's closest to the git version I restore if Claude screws up.

Will this checkpoint help with chat memory and disregard the latest chat's info?

I use WSL under Windows, VSCode with the WSL plugin, and Claude-Code installed on Ubuntu 24. It is generally solid and has no issue with this setup.

sberens - 2 days ago

Is "parallel test time compute" available in claude code or the api? Or is it something they built internally for benchmark scores?

marginalia_nu - 2 days ago

Is there some accessible explainer for what these numbers that keep going up actually mean? What happens at 100% accuracy or win rate?

scosman - 2 days ago

Interesting quirk on first use: "`temperature` and `top_p` cannot both be specified for this model. Please use only one."
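i.e. you now have to pick a single sampling knob. Something like this goes through, while adding top_p back triggers the error (a sketch; the model id is a placeholder):

  import anthropic

  client = anthropic.Anthropic()
  resp = client.messages.create(
      model="claude-sonnet-4-5",   # placeholder model id
      max_tokens=256,
      temperature=0.7,             # fine on its own
      # top_p=0.9,                 # un-commenting this reproduces the error above
      messages=[{"role": "user", "content": "Hello"}],
  )
  print(resp.content[0].text)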

virtualritz - 16 hours ago

So I was using Opus exclusively (Max plan) to write Rust since June.

CC switched to Sonnet 4.5 by default yesterday, and I'm just very unimpressed. It seems like a considerable regression. Maybe this is related to me using it to write Rust and not Python or JS/TS?

Example: I asked it to refactor a for loop into functional code with rayon; the compiler barfs about mutation (it used par_iter()). It rolls back the code to what it was before. Then this happens:

Me: There is par_iter_mut().

Sonnet: Ah yes, but we can't use par_iter_mut() on self.vertices because we're calling self.set_vertex_position() which needs mutable access to the entire mesh (not just the vertices map).

However, self is &mut. Wtf?

This would have never happened with Opus, i.e. Opus would have used par_iter_mut() to start with (or applied the right fix w/o hand-holding after the compile failed with par_iter()).

I've had a bunch of those wtfs since yesterday, up from more or less zero before, i.e. it doesn't feel like a coincidence.

j45 - 2 days ago

A question I have for anyone: has Claude Max recovered from its response-quality and service issues, both the usage limits and the model's performance on coding and non-coding tasks?

Anecdata is welcome as it seems like it's the only thing available sometimes.

AbuAssar - 2 days ago

I used to treat writing code as a form of art, with attention to details and best practices, and using design patterns whenever possible.

But it seems this will come to an end eventually, as these agents become stronger and more capable each day and will be better and faster than human coders.

ionwake - 2 days ago

Do we have a pelican for it yet?

miletus - 2 days ago

We at agentsea.com have been playing with it for a while.

here's what we think about it:

- still sucks at generating pretty UI
- great for creative writing and long-form planning
- it's really fast but not smarter than GPT-5
- pairs well with external tools/agents for research and automation
- comes with a 1M token context window, so you can feed it monstrous codebases or giant docs
- still hallucinates or stumbles on complex requests

drbojingle - 2 days ago

Imo we're going to start needing more examples of where the successor is better than what came before, and not just benchmarks.

- 2 days ago
[deleted]
Attummm - 2 days ago

Anthropic really nailed this release.

There had been a trend where each new model release from OpenAI, Anthropic, etc. felt like a letdown or, worse, a downgrade.

But the release of 4.5 breaks that trend and is a pleasant surprise on day one.

Well done! :)

meetpateltech - 2 days ago

Seeing the progress of the Claude models is really cool!

Charting Claude's progress with Sonnet 4.5: https://youtu.be/cu1iRoc1wBo

ancorevard - 2 days ago

Can't use Anthropic models in Cursor. Completely cost-prohibitive compared to GPT-5 and Grok models.

Why is this? Does Anthropic just have higher infrastructure costs than OpenAI/xAI?

jonathanstrange - 2 days ago

I would like to see completely independent test results of these companies' products. I'm skeptical because every AI company claims their new product is the best.

croemer - 2 days ago

It's not yet on LMarena: https://lmarena.ai/leaderboard/text

niyazpk - 2 days ago

Does anyone know whatever happened to the Haiku family of models? They've not been updated since 3.5! Did Anthropic give up on them?

fibers - 2 days ago

This looks exciting. I hope they add this to Windsurf soon.

chrisford - 2 days ago

The vision model has consistently been degraded since 3.5, specifically around OCR, so I hope it has improved with Claude Sonnet 4.5!

mutant - 2 days ago

Didn't they promise a 1M token input? I don't see that here.

edude03 - 2 days ago

Ah, I figured something was up - I had Sonnet 4 selected, but it changed to "Legacy Model" while I was using the app.

cwoolfe - 2 days ago

I've been really impressed with how good Cursor is at coding. I threw it a standard backend API endpoint and database task yesterday, and it generated 4 hours' worth of code in 2 minutes. It was set to Auto, which I think uses some Claude model.

jdthedisciple - 2 days ago

Why the focus on the "alignment" aspect of safety?

Surely there are more pressing issues with LLMs currently...

mccoyb - 2 days ago

Congratulations: it's faster, but worse, with a larger context window.

i-chuks - 2 days ago

AI companies really need to consider regional pricing. Huuuuge barrier!

hu3 - 2 days ago

I wonder if/when this will be available to GitHub Copilot in VSCode.

dr_dshiv - 2 days ago

Anyone try the Imagine with Claude yet? How does it work?

vinhnx - 2 days ago

Claude Sonnet 4.5 support has landed in my CLI coding agent, VT Code, combining a SOTA language model with agentic, semantic code understanding: github.com/vinhnx/vtcode

vb-8448 - 2 days ago

The claims against GPT-5 are huge!

I used to use CC, but I switched to Codex (and it was much better)... now I guess I have to switch back to CC, at least to test it.

tresil - 2 days ago

I'll add another really positive review here. Sonnet 4.0 had been really struggling to implement an OTel monitoring solution using Grafana's LGTM stack. Sonnet 4.0 made 4 or 5 different attempts - some of them longer than 10 min - troubleshooting why metrics were supposedly being emitted from the API but not showing up in Prometheus. Sonnet 4.5 correctly diagnosed and fixed the real issue within about 5 min. Not sure if that's the model being smarter, but I definitely saw the agent using some new approaches and seemingly managing its context better.

asdev - 2 days ago

How do Claude/OpenAI get around rate limiting/CAPTCHAs with their computer-use functionality?

smakosh - 2 days ago

Available on llmgateway.io already

nickphx - 2 days ago

It will be great when the VC cash runs out, the screws tighten, and the incessant misleading marketing claims finally come to an end.

deviation - 2 days ago

Interesting. In a thought process while editing a PDF, Claude disclosed the folder hierarchy for its "skills". I didn't know this was available to us:

> Reading the PDF skill documentation to create the resume PDF

> Here are the files and directories up to 2 levels deep in /mnt/skills/public/pdf, excluding hidden items and node_modules:

iFire - 2 days ago

Is it 15x cheaper like Grok?

AtNightWeCode - 2 days ago

Sonnet is just so expensive compared to its competitors. Have they fixed this?

pants2 - 2 days ago

Unfortunately, I'm also disappointed with it in Cursor vs GPT-5-Codex. I asked it to add a test for a specific edge case; it hallucinated some parameters and didn't use the existing harnesses. GPT-5-Codex with the same prompt got everything right.

dbbk - 2 days ago

So Opus isn't recommended anymore? Bit confusing

MarcelOlsz - 2 days ago

Terrible. It can't even do basic scaffolding, which is all it was good for; now it can't even do that. You can wrangle it with taskmaster or bmadcode or whatever, but at that point I'd rather just write it myself. Writing English to build things is goofy. Unsubscribed.

catigula - 2 days ago

I happened to be in the middle of a task in a production codebase that the various models struggled on so I can give a quick vibe benchmark:

opus 4.1: made weird choices, eventually got to a meh solution I just rolled back.

codex: took a disgusting amount of time, but the result was vastly superior to opus. Night and day superiority. Output was still not what I wanted.

sonnet 4.5: not clearly better than opus. categorically worse decision-making than codex. very fast.

Codex was night and day the best. Codex scares me, Claude feels like a useful tool.

- 2 days ago
[deleted]
rishabhaiover - 2 days ago

HN displays a religious hatred towards AI progress.

cloverich - 2 days ago

Please, y'all: when you list supportive or critical complaints based on your actual work, include some specifics of the task and prompt. The actual prompt, actual bugs, actual feature, etc.

I've had great success with both ChatGPT and Claude for years, I'm at around a 3x sustained output increase in my professional work, and I'm kicking off and finishing new side projects and features that I used to simply never finish. BUT there are some tasks I run into where it's god awful. Because I have enough good experience, I know how to work around it, when to give up, when to move on, etc. I am still surprised at things it cannot do; for example, Claude Code could not seem to stitch together three screens in an iOS app using the latest SwiftUI (I am not an iOS dev).

IMHO, for people using it off and on or sparingly, it's going to seem either incredible or worthless depending on the project and prompt. Share details - it's so helpful for meaningful conversation!

lihaciudanieljr - 2 days ago

[dead]

kixiQu - 2 days ago

Lots of feature dev here – anyone have color on the behavior of the model yet? Mouthfeel, as it were.

idkmanidk - 2 days ago

Page cannot be found
Empty screen mocks my searching
Only void responds

but: https://imgur.com/a/462T4Fu

atemerev - 2 days ago

Ah, the company where the models are unusable even with a Pro subscription (you start to hit the limit after 20 minutes of talking), and the free models are not usable at all (currently I can't even send a single message to Sonnet 4.5)...

hsn915 - 2 days ago

It is time to acknowledge that AI coding does not actually work.

ok, you think it's a promising field and you want to explore it, fine. Go for it.

Just stop pretending that what these models are currently doing is good enough to replace programmers.

I use LLMs a lot, even for explaining documentation.

I used to use them for writing _some_ code, but I have never ever gotten a code sample over 10 lines that was not in need of heavy modifications to make it work correctly.

Some people are pretending to write hundreds of lines of code with LLMs, even entire applications. All I have to say is "lol".