I’ve had a few responses to my blog post from 30 January 2024 about AI-generated articles making up definitions for nonsense words.
For the term “lrtsjerk”, for example, I found 2460 articles; a few definitions were “a group of online jerks”, “a leading technology solution”, “an out-of-the-box way of thinking”, “a magical land”, “a full-body workout with ancient roots”, “an app” and “a linguistic trend”. You can find the original post here.
“lrtsjerk” is a particular type of nonce word: a ghost word
I posited that these AI-generated definitions for typos described real “non-words”: strings of letters that truly are not words, because they are not being used by, and not being defined by, humans.
I have since been schooled in the categories of nonsense words, or “nonce words” in linguistics. There are nonce words in poetry, such as in “Jabberwocky” by Lewis Carroll, in teaching children to read (“bic”, “zop”, “wap”), in linguistic studies (“wug”), in famous tweets (“covfefe”) and more.
One subcategory of nonce words is ghost words: a nonce word authoritatively described in a reference work that turns out to have originated from a typo or other simple error. The term was invented long before AI, but I think it fits rather well.
Some other questions that came my way:
What is the first citation of “lrtsjerk”?
I have had a look on https://www.oldestsearch.com/. This tool isn’t very dependable: it gives three hits, for 2006, 2012 and 2019, but when I click on these I find no mention of “lrtsjerk”. Even the early 2023 entries seem to be a product of the page in question linking to a later article about “lrtsjerk” (an “if you liked this, you might also like” situation).
The oldest post I can find that looks legit is this one from 8 September 2023. Blog posts can be set to an earlier date by their creators, of course, so this isn’t wholly reliable. However, looking on Oldest Search this is when the “real” blog posts start cropping up, one every few days, so I think early September is right.
The term is not in the Google Ngram corpus. Google Books gives seven results, but when I then search inside those books, “lrtsjerk” isn’t there.
Also, interestingly, ChatGPT tells me it is “not a recognised term”.
You suggest these ghost words come from human typos; why do you think they are not made up by the AI itself?
I believe these articles were meant to generate clicks in Google. Nobody is going to search for a nonsense word that is completely new, so there is no point in making one up. But a bot that has been given a list of Google search entries does not distinguish between typos and proper searches. It just generates articles for the lot.
Secondly, generative AI wouldn’t come up with a word that is unpronounceable in English, such as “lrtsjerk”. Here are the kinds of words it comes up with:
How can we know for sure that these articles are computer generated?
This is the huge problem with AI-generated material. Recognition tools are famously unreliable, which is why this is such a problem for schools. (I no longer let my students write essays at home. We flip the classroom, do the instruction at home, and write the essay in class, on paper, no phones or smartwatches allowed.)
When you monitor the web for specific content, as I do, a certain type of AI-generated content stands out. I think almost anyone would recognise it with a bit of practice. It is long-winded. It doesn’t cite its sources. It often has a table of contents at the top and a Q&A at the bottom. It repeats itself unnecessarily. It contradicts itself. There is a certain AI quality to the style of writing.
Just google “lrtsjerk” and read some of the articles. After a few of them I think you will see what I mean.
Is there a role for dictionaries, here?
As I ended my other article: Google and the other search engines really have their work cut out for them. People looking to make money will not think twice about flooding the internet with millions of AI-generated nonsense articles.
I think reputable pages like Wikipedia, the New York Times and, yes, dictionaries, have a big role to play. I know as a mother, when I google stuff about my kids (they each just had a bout of fifth disease – cute red cheeks, less cute irritability), I no longer trust any sources I don’t recognise. I do use Google, but when I get the results, I only choose sites like the NHS or Wikipedia.
For these AI ghost words, I think it is good news for dictionaries, but it does mean they have to keep up. Words are being added all the time (I should know!). People will trust dictionary definitions more than random blogs, but when there are no dictionary definitions, they might trust that random blog after all.
Here’s an idea: should dictionaries keep running lists of AI ghost words, just to let people know that that is what they are?
Heddwen Newton is an English teacher and translator. She is fascinated by contemporary English and the way English changes. Her newsletter is English in Progress. 1100 subscribers and growing every day!