For one month beginning on October 5, I ran an experiment: Every day, I asked ChatGPT 5 (more precisely, its “Extended Thinking” version) to find an error in “Today’s featured article”. In 28 of these 31 featured articles (90%), ChatGPT identified what I considered a valid error, often several. I have so far corrected 35 such errors.
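For anyone who wants to reproduce the daily check programmatically, here is a minimal sketch. It assumes the openai Python SDK and Wikipedia's featured-content feed endpoint; I ran the experiment by hand in the ChatGPT interface rather than via the API, the model name below is a placeholder, and the feed only returns the article's lead extract rather than the full text.

```python
# Sketch only: the experiment was done manually in the ChatGPT UI.
from datetime import date

import requests
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def todays_featured_article() -> dict:
    """Fetch Today's featured article (TFA) from Wikipedia's featured-content feed."""
    d = date.today()
    url = f"https://en.wikipedia.org/api/rest_v1/feed/featured/{d:%Y/%m/%d}"
    resp = requests.get(url, timeout=30)
    resp.raise_for_status()
    return resp.json()["tfa"]  # includes title and a lead "extract", not the full article


def find_errors(article_text: str) -> str:
    """Ask the model to point out likely factual errors, for a human to verify."""
    response = client.chat.completions.create(
        model="gpt-5",  # placeholder model name
        messages=[
            {"role": "system",
             "content": "You are a meticulous fact-checker for Wikipedia articles."},
            {"role": "user",
             "content": "Find factual errors in the following article text. "
                        "Quote the problematic sentence and explain why it is wrong:\n\n"
                        + article_text},
        ],
    )
    return response.choices[0].message.content


if __name__ == "__main__":
    tfa = todays_featured_article()
    print(tfa["title"])
    print(find_errors(tfa["extract"]))
```

Whatever the model reports is only a lead: every claimed error still has to be checked against sources by a human before anything is corrected.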


But we don’t know what the false positive rate is either. How many submissions were blocked that shouldn’t have been? It seems like you don’t even have a way to find that metric out unless somebody complained about it.
It isn’t doing anything automatically; it isn’t moderating for me. It’s just flagging submissions for human review: “Hey, maybe have a look at this one.” So if it falsely flags something it shouldn’t, which does happen, I simply ignore it. And as I said, that false-positive rate is moderate; I haven’t measured it precisely, but the tool is still successful enough to be quite useful.
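To make the “flag, don’t block” point concrete, here is a minimal sketch of the pattern. Every name in it (Submission, ReviewQueue, looks_suspicious) is hypothetical, and the suspicion check is a stand-in for whatever model call actually does the flagging; the important part is that a flag only adds an item to a queue a human looks at, and never blocks anything on its own.

```python
# Hypothetical sketch of an advisory flagging pipeline; nothing is blocked automatically.
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Submission:
    author: str
    text: str


@dataclass
class ReviewQueue:
    """Submissions the checker found suspicious, awaiting a human look."""
    items: List[Submission] = field(default_factory=list)

    def add(self, submission: Submission, reason: str) -> None:
        print(f"Flagged for human review ({reason}): {submission.author}")
        self.items.append(submission)


def looks_suspicious(submission: Submission) -> Tuple[bool, str]:
    """Stand-in for the real (LLM-based) check; returns (flag, reason)."""
    if "http://" in submission.text or "https://" in submission.text:
        return True, "contains links"
    return False, ""


def handle(submission: Submission, queue: ReviewQueue) -> None:
    flagged, reason = looks_suspicious(submission)
    if flagged:
        # False positives are cheap here: a human glances at the item and ignores bad flags.
        queue.add(submission, reason)
    # The submission itself is processed normally either way; the flag is advisory only.
```

The design choice is that a false positive costs one human glance, while a false negative just means the submission goes through as it would have without the tool, which is why a moderate error rate is acceptable.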