Shortly after OpenAI released o1, its first "reasoning" AI model, people began noticing a curious phenomenon. The model would sometimes start "thinking" in Chinese, Farsi, or another language even when asked a question in English.
Given a problem to reason through, such as "How many R's are in the word 'strawberry'?", o1 begins a "thinking" process, performing a series of reasoning steps to arrive at an answer. If the question is written in English, o1's final response is in English. But the model performs some of those steps in another language before reaching its conclusion.
"[o1] randomly started thinking in Chinese halfway through," one Reddit user said.
"Why did [o1] randomly start thinking in Chinese?" another user asked in a post on X. "No part of the conversation (over 5 messages) was in Chinese."
Why did o1 randomly start thinking in Chinese? No part of the conversation (over 5 messages) was in Chinese… very interesting… training data influence pic.twitter.com/yZWCzoaiit
— Rishab Jain (@RishabJainK) January 9, 2025
OpenAI has not explained or acknowledged o1's strange behavior. So what's going on?
Well, AI experts aren't so sure. But they have some theories.
Several people on X, including Hugging Face CEO Clément Delangue, alluded to the fact that reasoning models like o1 are trained on datasets containing large amounts of Chinese characters. Ted Xiao, a researcher at Google DeepMind, claimed that companies including OpenAI use third-party Chinese data labeling services, and that o1's switch to Chinese is an example of "Chinese linguistic influence on reasoning."
"[Labs like] OpenAI and Anthropic utilize [third-party] data labeling services for doctoral-level reasoning data in science, mathematics, and coding," Xiao wrote in a post on X. "[F]or expert labor availability and cost reasons, many of these data providers are based in China."
Labels, also known as tags or annotations, help models understand and interpret data during the training process. For example, labels for training image recognition models may take the form of markings around objects or captions that refer to each person, place, or object depicted in the image.
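For a concrete sense of what those labels can look like, here is a minimal, hypothetical sketch of one annotated record in an object-detection dataset; the field names and values are illustrative, not drawn from any particular labeling tool or service.

```python
# A hypothetical annotation record for a single image in an object-detection dataset.
# The schema is illustrative only; real labeling services define their own formats.
annotation = {
    "image": "street_scene_0042.jpg",
    "labels": [
        {"category": "person",  "bbox": [34, 80, 120, 260]},   # [x, y, width, height]
        {"category": "bicycle", "bbox": [150, 140, 210, 200]},
    ],
    "caption": "A person walking a bicycle across a crosswalk.",
}

# During training, the model sees the image together with these labels and learns
# to associate pixel regions with the annotated categories and caption.
print(annotation["labels"][0]["category"])  # "person"
```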
Research has shown that biased labels can produce biased models. For example, the average annotator is more likely to label phrases in African-American Vernacular English (AAVE), the informal grammar used by some Black Americans, as toxic, which leads AI toxicity detectors trained on those labels to see AAVE as disproportionately toxic.
However, other experts don't buy the o1 Chinese data labeling hypothesis. They point out that o1 is just as likely to switch to Hindi, Thai, or a language other than Chinese while working toward a solution.
Rather, these experts say, o1 and other reasoning models may simply be using the languages they find most efficient to achieve an objective (or hallucinating).
"The model doesn't know what language is, or that languages are different," Matthew Guzdial, an AI researcher and assistant professor at the University of Alberta, told TechCrunch. "It's all just text."
Indeed, models don't process words directly. They use tokens instead. A token can be a whole word, such as "fantastic." Or it can be a syllable, such as "fan," "tas," or "tic." Or it can even be an individual character in a word, such as "f," "a," "n," "t," "a," "s," "t," "i," "c."
As with labeling, tokens can introduce bias. For example, many word-to-token translators assume that spaces in a sentence represent new words, even though not all languages use spaces to separate words.
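To illustrate that assumption, here is a toy, deliberately naive space-based splitter; it is not how any real tokenizer works (production systems typically use learned subword schemes such as byte-pair encoding), but it shows how a whitespace rule that looks reasonable for English breaks down for a language written without spaces.

```python
# A toy illustration of the whitespace assumption described above.
# This is NOT a real tokenizer; it exists only to show the bias.

def naive_whitespace_tokenize(text: str) -> list[str]:
    """Split text on spaces, treating each run of characters as one 'word'."""
    return text.split(" ")

english = "How many R's are in the word strawberry"
chinese = "草莓这个词里有几个R"  # roughly the same question, written without spaces

print(naive_whitespace_tokenize(english))  # one piece per English word
print(naive_whitespace_tokenize(chinese))  # the entire sentence collapses into a single piece
```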
Tiezhen Wang, a software engineer at AI startup Hugging Face, agrees with Guzdial that the language inconsistencies in reasoning models may be explained by the associations the models made during training.
"By embracing all linguistic nuances, we expand the model's worldview and allow it to learn from the full range of human knowledge," Wang wrote in a post on X. "For example, I prefer doing math in Chinese because each digit is just one syllable, which makes calculations crisp and efficient. But when it comes to topics like unconscious bias, I automatically switch to English, mainly because that's where I first learned and absorbed those ideas."
Wang's theory is plausible. After all, models are probabilistic machines. Trained on many examples, they learn patterns to make predictions, such as how "to whom" in an email typically precedes "it may concern."
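As a toy sketch of that idea (the contexts and probabilities below are made up purely for illustration), a language model can be thought of as a table that scores possible next tokens given the text so far and continues with a likely one:

```python
# A made-up, hand-written table of next-token probabilities, standing in for a
# trained language model. Real models compute these scores with a neural network.
next_token_probs = {
    "to whom": {"it": 0.92, "the": 0.03, "you": 0.02, "we": 0.01},
    "to whom it may": {"concern": 0.97, "apply": 0.02, "matter": 0.01},
}

def most_likely_next(context: str) -> str:
    """Return the highest-probability continuation for a known context."""
    candidates = next_token_probs[context]
    return max(candidates, key=candidates.get)

print(most_likely_next("to whom"))         # "it"
print(most_likely_next("to whom it may"))  # "concern"
```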
But Luca Soldaini, a researcher at the nonprofit Allen Institute for AI, cautions that we don't know for sure. “These models are so opaque that it's impossible to back up these kinds of observations about deployed AI systems,” he told TechCrunch. “This is one of many examples of why transparency in how AI systems are built is important.”
The lack of answers from OpenAI leaves us wondering why o1 thinks of songs in French but synthetic biology in Chinese.