GPT-5 vs Claude 4 for code generation: real benchmarks?
Has anyone run systematic benchmarks comparing GPT-5 and Claude 4 on code generation tasks? I'm specifically interested in complex refactoring, test generation, and multi-file context handling. If you have, please share your results!