sjmog 8 hours ago

Yesterday I posted a video testing claude-4 sonnet solving a long-horizon swe challenge from https://simstack.io unaided (https://news.ycombinator.com/item?id=44424468). For comparison, here's gemini 2.5-pro.

I noticed that 2.5-pro is way more cavalier than claude: it skips backups and tries "more stuff more quickly", where claude takes a more cautious approach.

incomingpain 7 hours ago

I've been using gemini cli.

A couple of times it got into a loop: "I need to take X out. OK, now that's gone, I'll add Y. OK, now that I added Y, I need to add X. But now that I did that, I'll remove Y."

To be fair, I hit the usage limits on claude and chatgpt trying to make the same changes. It was a curious one; the code was a bit above my head, but I eventually spent about 45 minutes tweaking it on my own to get it working. I was shocked when I accidentally got it fixed.