Debugging face-off: Claude, ChatGPT, and Gemini each tackled a sabotaged Pygame project containing three hidden logic errors, using zero-shot prompts. Claude's clean sweep: Claude identified and fixed all bugs ...
Claude Opus 4.1 scores 74.5% on SWE-bench Verified, indicating major improvements in real-world programming, bug detection, and agentic problem solving. Anthropic has just rolled out ...