We tested a new ChatGPT-detector for teachers. It flagged an innocent student.
Five high school students helped our tech columnist test a ChatGPT detector coming from Turnitin to 2.1 million teachers. It missed enough to get someone in trouble.
High school senior Lucy Goetz got the highest possible grade on an original essay she wrote about socialism. So imagine her surprise when I told her that a new kind of educational software I’ve been testing claimed she got help from artificial intelligence.
A new AI-writing detector from Turnitin — whose software is already used by 2.1 million teachers to spot plagiarism — flagged the end of her essay as likely being generated by ChatGPT.
“Say what?” says Goetz, who swears she didn’t use the AI writing tool to cheat. “I’m glad I have good relationships with my teachers.”
After months of sounding the alarm about students using AI apps that can churn out essays and assignments, teachers are getting AI technology of their own. On April 4, Turnitin is activating the software I tested for some 10,700 secondary and higher-educational institutions, assigning “generated by AI” scores and sentence-by-sentence analysis to student work. It joins a handful of other free detectors already online. For many teachers I’ve been hearing from, AI detection offers a weapon to deter a 21st-century form of cheating.
But AI alone won’t solve the problem AI created. The flag on a portion of Goetz’s essay was an outlier, but shows detectors can sometimes get it wrong — with potentially disastrous consequences for students. Detectors are being introduced before they’ve been widely vetted, yet AI tech is moving so fast, any tool is likely already out of date.
It’s a pivotal moment for educators: Ignore AI and cheating could go rampant. Yet even Turnitin’s executives tell me that treating AI purely as the enemy of education makes about as much sense in the long run as trying to ban calculators.
Ahead of Turnitin’s launch this week, the company says 2 percent of customers have asked it not to display the AI writing score on student work. That includes a “significant majority” of universities in the United Kingdom, according to UCISA, a professional body for digital educators.
To see what’s at stake, I asked Turnitin for early access to its software. Five high school students, including Goetz, volunteered to help me test it by creating 16 samples of real, AI-fabricated and mixed-source essays to run past Turnitin’s detector.
The result? It got over half of them at least partly wrong. Turnitin accurately identified six of the 16 — but failed on three, including a flag on 8 percent of Goetz’s original essay. And I’d give it only partial credit on the remaining seven, where it was directionally correct but misidentified some portion of ChatGPT-generated or mixed-source writing.
Turnitin claims its detector is 98 percent accurate overall. And it says situations such as what happened with Goetz’s essay, known as a false positive, happen less than 1 percent of the time, according to its own tests.
Turnitin also says its scores should be treated as an indication, not an accusation. Still, will millions of teachers understand they should treat AI scores as anything other than fact? After my conversations with the company, it added a caution flag to its score that reads, “Percentage may not indicate cheating. Review required.”
“Our job is to create directionally correct information for the teacher to prompt a conversation,” Turnitin chief product officer Annie Chechitelli tells me. “I’m confident enough to put it out in the market, as long as we’re continuing to educate educators on how to use the data.” She says the company will keep adjusting its software based on feedback and new AI advancements.
The question is whether that will be enough. “The fact that the Turnitin system for flagging AI text doesn’t work all the time is concerning,” says Rebecca Dell, who teaches Goetz’s AP English class in Concord, Calif. “I’m not sure how schools will be able to definitively use the checker as ‘evidence’ of students using unoriginal work.”
Unlike accusations of plagiarism, AI cheating has no source document to reference as proof. “This leaves the door open for teacher bias to creep in,” says Dell.
For students, that makes the prospect of being accused of AI cheating especially scary. “There is no way to prove that you didn’t cheat unless your teacher knows your writing style, or trusts you as a student,” says Goetz.
Why detecting AI is so hard
Spotting AI writing sounds deceptively simple. When a colleague recently asked me if I could detect the difference between real and ChatGPT-generated emails, I didn’t perform very well.
Detecting AI writing with software involves statistics. And statistically speaking, the thing that makes AI distinct from humans is that it’s “extremely consistently average,” says Eric Wang, Turnitin’s vice president of AI.
Systems such as ChatGPT work like a sophisticated version of auto-complete, looking for the most probable word to write next. “That’s actually the reason why it reads so naturally: AI writing is the most probable subset of human writing,” he says.
Turnitin’s detector “identifies when writing is too consistently average,” Wang says.
The challenge is that sometimes a human writer may actually look consistently average.
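To make the idea of “consistently average” writing concrete, here is a toy sketch in Python. It is not Turnitin’s method, which the company has not described in this detail. It assumes we already have, for each word in a passage, the probability that some language model assigned to that word; the function names, the probabilities and the thresholds are all invented for illustration.

```python
import math
from statistics import mean, pstdev


def averageness(token_probs):
    """Return (mean, spread) of per-word log-probabilities.

    token_probs: hypothetical probabilities a language model assigned to each
    word the writer actually chose. A high mean with a low spread is what
    "consistently average" would look like in this toy model.
    """
    log_probs = [math.log(p) for p in token_probs]
    return mean(log_probs), pstdev(log_probs)


def looks_too_average(token_probs, min_mean=-1.0, max_spread=0.5):
    # Thresholds are arbitrary illustration values, not tuned on real data.
    avg, spread = averageness(token_probs)
    return avg >= min_mean and spread <= max_spread


# Made-up inputs for illustration only.
predictable = [0.6, 0.7, 0.5, 0.65, 0.55, 0.6]  # every word is a likely choice
surprising = [0.6, 0.05, 0.7, 0.01, 0.4, 0.2]   # some unexpected word choices

print(looks_too_average(predictable))
print(looks_too_average(surprising))
```

In this sketch the predictable passage is flagged and the surprising one is not. The check never asks who wrote the text, so a human who writes in a very predictable style would be flagged just the same.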
On economics, math and lab reports, students tend to hew to set styles, meaning they’re more likely to be misidentified as AI writing, says Wang. That’s likely why Turnitin erroneously flagged Goetz’s essay, which veered into economics. (“My teachers have always been fairly impressed with my writing,” says Goetz.)
Wang says Turnitin worked to tune its systems to err on the side of requiring higher confidence before flagging a sentence as AI. I saw that develop in real time: I first tested Goetz’s essay in late January, and the software identified much more of it — about 50 percent — as being AI generated. Turnitin ran my samples through its system again in late March, and that time only flagged 8 percent of Goetz’s essay as AI-generated.
But tightening up the software’s tolerance came with a cost: Across the second test of my samples, Turnitin missed more actual AI writing. “We’re really emphasizing student safety,” says Chechitelli.
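The trade-off can be sketched with a few lines of Python. The scores and labels below are made up for illustration and do not come from Turnitin or from my test samples; the only assumption is that a detector produces a confidence score per sentence and flags anything at or above a cutoff.

```python
# Each tuple: (detector's confidence that a sentence is AI-written, truly AI-written?)
# Made-up values for illustration only.
sentences = [
    (0.95, True), (0.80, True), (0.65, True), (0.55, True),
    (0.70, False), (0.40, False), (0.20, False), (0.10, False),
]


def count_errors(scored, threshold):
    """Return (false_positives, misses) when flagging scores >= threshold."""
    false_positives = sum(
        1 for score, is_ai in scored if score >= threshold and not is_ai
    )
    misses = sum(1 for score, is_ai in scored if score < threshold and is_ai)
    return false_positives, misses


for threshold in (0.5, 0.75):
    fp, missed = count_errors(sentences, threshold)
    print(f"threshold={threshold}: false positives={fp}, missed AI sentences={missed}")
```

With these toy numbers, raising the cutoff from 0.5 to 0.75 removes the one false positive but lets two AI-written sentences slip through.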
Turnitin does perform better than other public AI detectors I tested. One introduced in February by OpenAI, the company that invented ChatGPT, got eight of our 16 test samples wrong. (Independent tests of other detectors have declared they “fail spectacularly.”)
Turnitin’s detector faces other important technical limitations, too. The six samples it got completely right were all clearly either 100 percent student work or entirely produced by ChatGPT. But when I tested it with essays from mixed AI and human sources, it often misidentified the individual sentences or missed the human part entirely. And it couldn’t spot the ChatGPT in papers we ran through Quillbot, a paraphrasing program that remixes sentences.
What’s more, Turnitin’s detector may already be behind the state of the AI art. My student helpers created samples with ChatGPT, but since they did the writing, the app has gotten a software update called GPT-4 with more creative and stylistic capabilities. Google also introduced a new AI bot called Bard. Wang says addressing them is on his road map.
Some AI experts say any detection efforts are at best setting up an arms race between cheaters and detectors. “I don’t think a detector is long-term reliable,” says Jim Fan, an AI scientist at Nvidia who used to work at OpenAI and Google.
“The AI will get better, and will write in ways more and more like humans. It is pretty safe to say that all of these little quirks of language models will be reduced over time,” he says.
Is detecting AI a good idea?
Given the potential — even at 1 percent — of being wrong, why release an AI detector into software that will touch so many students?
“Teachers want deterrence,” says Chechitelli. They’re extremely worried about AI, and helping them see the scale of the actual problem will “bring down the temperature.”
Some educators worry it will actually raise the temperature.
Mitchel Sollenberger, the associate provost for digital education at the University of Michigan-Dearborn, is among the officials who asked Turnitin not to activate AI detection for his campus at its initial launch.
He has specific concerns about how false positives on the roughly 20,000 student papers his faculty run through Turnitin each semester could lead to baseless academic-integrity investigations. “Faculty shouldn’t have to be expert in a third-party software system — they shouldn’t necessarily have to understand every nuance,” he says.
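For a sense of scale, here is a back-of-the-envelope calculation in Python. It rests on two simplifying assumptions that neither Turnitin nor Sollenberger has stated: that the company’s “less than 1 percent” false positive figure applies per paper, and that every one of the roughly 20,000 papers is honest student work. The function name is hypothetical.

```python
def expected_false_flags(papers: int, false_positive_rate: float) -> float:
    """Expected number of honest papers wrongly flagged as AI-written."""
    return papers * false_positive_rate


papers_per_semester = 20_000  # figure cited for the Dearborn campus
upper_bound_rate = 0.01       # assumption: the 1 percent ceiling, applied per paper

estimate = expected_false_flags(papers_per_semester, upper_bound_rate)
print(f"Up to about {estimate:.0f} papers wrongly flagged per semester")
```

Under those assumptions, the ceiling works out to about 200 wrongly flagged papers a semester on one campus. The real number could be lower if the true rate is well under 1 percent.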
Ian Linkletter, who serves as emerging technology and open-education librarian at the British Columbia Institute of Technology, says the push for AI detectors reminds him of the debate about AI exam proctoring during pandemic virtual learning.
“I am worried they’re marketing it as a precision product, but they’re using dodgy language about how it shouldn’t be used to make decisions,” he says. “They’re working at an accelerated pace not because there is any desperation to get the product out but because they’re terrified their existing product is becoming obsolete.”
Said Chechitelli: “We are committed to transparency with the community and have been clear about the need to continue iterating on the user experience as we learn more from students and educators.”
Deborah Green, CEO of UCISA in the U.K., tells me she understands and appreciates Turnitin’s motives for the detector. “What we need is time to satisfy ourselves as to the accuracy, the reliability and particularly the suitability of any tool of this nature.”
It’s not clear how the idea of an AI detector fits into where AI is headed in education. “In some academic disciplines, AI tools are already being used in the classroom and in assessment,” says Green. “The emerging view in many U.K. universities is that with AI already being used in many professions and areas of business, students actually need to develop the critical thinking skills and competencies to use and apply AI well.”
There’s a lot more subtlety to how students might use AI than a detector can flag today.
My student tests included a sample of an original student essay written in Spanish, then translated into English with ChatGPT. In that case, what should count: the ideas or the words? What if the student was struggling with English as a second language? (In our test, Turnitin’s detector appeared to miss the AI writing, and flagged none of it.)
Would it be more or less acceptable if a student asked ChatGPT to outline all the ideas for an assignment, and then wrote the actual words themselves?
“That’s the most interesting and most important conversation to be having in the next six months to a year — and one we’ve been having with instructors ourselves,” says Chechitelli.
“We really feel strongly that visibility, transparency and integrity are the foundations of the conversations we want to have next around how this technology is going to be used,” says Wang.
For Dell, the California teacher, the foundation of AI in the classroom is an open conversation with her students.
When ChatGPT first started making headlines in December, Dell focused an entire lesson with Goetz’s English class on what ChatGPT is, and isn’t, good for. She asked it to write an essay for an English prompt her students had already completed themselves, and then the class analyzed the AI’s performance.
“Part of convincing kids not to cheat is making them understand what we ask them to do is important for them,” said Dell.