Anthropic Opus 4.6 is less good at finding vulns than you might think
We benchmarked Opus 4.6's ability to find simple C vulns and found that the model flags about 1 in 4 flaws -- with a very high false positive rate and lots of inconsistency from run to run. Techniques like judge agents and requiring the model to justify its results improve the results to some extent, but they're still not great.
[link] [comments]