Testing the Robustness of Machine-Generated Text Detectors Against Cyclic Machine Translation
#method, #application, #analysis
Edmund Hogan
Abstract
Accurate detection of machine-generated text is increasingly vital in academic and professional environments, especially given the rapid adoption of large language models across a wide range of tasks. My project investigates whether cyclic translation, translating machine-generated text from English into Spanish, Russian, or Chinese and back again, can evade AI text detectors. Initial results indicate that these translations actually increase detection scores, suggesting that fine-tuned detectors are more robust to such perturbations than initially expected. Notably, applying the same translations to human-generated text also raises its likelihood of being flagged as AI-generated, an important consequence of machine manipulation of human-written text. These findings highlight both the strengths and potential biases of current detection methods and motivate further exploration into more robust detection systems.
Background
Machine-generated text detection is a crucial task in the new and expanding world of NLP, particularly in academic and educational settings where authorship and plagiarism are serious concerns. This project examines the effectiveness of a machine-generated text detector on naive machine-generated responses compared to responses that have been translated into another language and back again. Specifically, we begin with responses in English, translate each into another language (Spanish, Chinese, or Russian), and then immediately translate the result back into English, producing a slight perturbation of the original text. This translation cycle is repeated for n iterations before the resulting text is passed to the detector. We then compute the detector's TPR@FPR = 25% for each language's re-translations and compare it against the detector's score on the original, unmodified text. Finally, we compare these deltas to see how the change depends on the language of translation and the number of cycles (a sketch of the metric computation follows below).
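To make the metric concrete, here is a minimal sketch of how TPR@FPR = 25% and the per-language deltas could be computed, assuming the detector returns a score in [0, 1] where higher means more likely machine-generated; the function names are illustrative and not taken from the project codebase.

```python
import numpy as np

def tpr_at_fpr(human_scores, machine_scores, target_fpr=0.25):
    """Choose the score threshold that flags roughly `target_fpr` of the human
    texts, then report the fraction of machine texts flagged at that threshold."""
    threshold = np.quantile(np.asarray(human_scores), 1.0 - target_fpr)
    tpr = float((np.asarray(machine_scores) >= threshold).mean())
    return tpr, threshold

def detection_delta(baseline_tpr, translated_tpr):
    """Change in TPR@FPR for one language / cycle count vs. the unmodified text."""
    return translated_tpr - baseline_tpr
```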
This seems like a plausible way to evade detection for a few reasons. According to Wang et al. (2024), there are two main types of AI text detectors on the market today: metric-based detectors, which rely on directly predicting the next token output from the generator LLM, and model-based (fine-tuned) detectors, which are trained on model outputs via supervised learning. Given this, it is reasonable to think that the word-choice substitutions and word-order perturbations common in these translation cycles would fool a detector that has only seen direct model outputs or that attempts to predict the exact tokens the generator LLM produced. The same paper notes that "lower-level perturbations show greater attack success than higher-level perturbations," which supports the idea that cyclic translations would be hard for a machine-generated text detector to handle. It is also interesting to test whether these translation cycles strengthen the relationships between word vectors, making them easier to predict, or instead scramble those relationships, making them harder for the detector to pick up on.
Experimental Design
In this experiment, we begin with a dataset of short, human-generated responses and use the first 20 tokens of each response as prompts for a language model to produce machine-generated text. We then apply multiple cyclic translations (into Spanish, Russian, or Chinese and back to English) to introduce varying degrees of linguistic perturbations. Finally, we evaluate how effectively the Sapling AI Text Detector can distinguish between these perturbed, machine-generated texts and the original human-generated texts.
Dataset
For the data in our experiment we used a public dataset on Hugging Face. This dataset contains short, 100-200 token, human-generated responses to a given prompt. The machine-generated responses were then produced by passing the first 20 tokens of each human-generated text to the model with instructions to output a completed response.
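As a rough illustration of the prompt construction (the tokenizer and function name here are assumptions for the sketch, not necessarily what the project used):

```python
from transformers import AutoTokenizer

# Any tokenizer works for slicing off a 20-token prefix; GPT-2's is used here
# purely as an example and may differ from the tokenizer used in the project.
tokenizer = AutoTokenizer.from_pretrained("gpt2")

def make_prompt(human_response: str, n_tokens: int = 20) -> str:
    """Decode the first 20 tokens of a human-written response back to text,
    to be used as the prompt for the generation model to complete."""
    token_ids = tokenizer.encode(human_response)[:n_tokens]
    return tokenizer.decode(token_ids)
```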
Translation
For my translation design I decided to use three different languages. The first language I chose was Spanish, because it is a Romance language that shares many cognates with English as well as a relatively similar sentence structure. This is the planned lower-perturbation translation, where we hope to see a small number of perturbations with less of a focus on sentence-structure changes. The second language I selected was Russian, because it is of Slavic origin and has a significantly weaker historical relationship with English. Since the two languages developed with little interaction, I expect these translation cycles to produce greater variability in both sentence structure and word choice. Finally, the last language I decided to translate into was Chinese; again, this is a language that developed apart from English and the Romance languages and is likely to show significant differences in sentence structure and word choice, even relative to the Slavic-based Russian translations, which will hopefully be revealed by the cyclic translation.

The actual translation process consisted of, for each cycle, instructing the model to directly translate the text into the chosen target language; this output was then passed to another model instructed to directly translate it from the target language back into English. This cycle was repeated up to 5 times for each language, with the results of the translations being passed to the detector (a sketch of this loop is shown below).
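Here is a minimal sketch of that translation loop; `chat` stands in for whatever single-turn call is made to the translation model, and the prompt wording is illustrative rather than the exact instruction used in the project.

```python
def translate(text, source_lang, target_lang, chat):
    """Ask the translation model for a direct translation of `text`.
    `chat` is any function that sends one instruction to the model and returns its reply."""
    instruction = (f"Directly translate the following text from {source_lang} "
                   f"to {target_lang}. Output only the translation.\n\n{text}")
    return chat(instruction)

def cyclic_translate(text, target_lang, n_cycles, chat):
    """One cycle = English -> target language -> English.
    Returns the text after each cycle so every cycle count (1..n) can be scored."""
    outputs = []
    current = text
    for _ in range(n_cycles):
        foreign = translate(current, "English", target_lang, chat)
        current = translate(foreign, target_lang, "English", chat)
        outputs.append(current)
    return outputs
```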
Detection
For this experiment we will be using the Sapling AI Text Detector. This is a publicly available, primarily fine-tuned (model-based) detector, trained on a dataset of samples generated by various AI language models, including GPT-4, Gemini, and Claude, as well as some open-source models like Llama and Mistral. Sapling provides API documentation on their website for running their detector on AI- or human-generated text. For our chosen dataset we had a detection rate of 87% on machine-generated texts with an FPR of 25% on human-generated texts, giving us a baseline TPR@FPR = 25% of 87%. While the detection rate seems satisfactory, the FPR of 25% is higher than expected; however, this is likely due to the short length of the responses, which is known to significantly impact detection accuracy (Wang et al., 2024).
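A minimal sketch of scoring a text with Sapling's AI-detection endpoint is below; the URL and response field follow Sapling's public API documentation at the time of writing and should be verified against the current docs, and the key handling is illustrative.

```python
import requests

SAPLING_URL = "https://api.sapling.ai/api/v1/aidetect"  # check against Sapling's current docs

def sapling_score(text: str, api_key: str) -> float:
    """Return Sapling's overall AI-likelihood score for `text` (closer to 1 = more likely AI)."""
    resp = requests.post(SAPLING_URL, json={"key": api_key, "text": text})
    resp.raise_for_status()
    return resp.json()["score"]
```

These per-text scores are what feed into the TPR@FPR = 25% computation sketched in the Background section.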
Experimental Results
Translation Results
Looking at the results of our translation process, we can see a few interesting things. Focusing first on the first translation cycle, we can see perturbations throughout all of the translations, both in word order or paraphrasing and in word substitution or word choice. In this first cycle, the languages seem to follow a trend based on how close they are in sentence structure to English, with the languages that are farther away having a greater number of perturbations. We can also see some themes begin to arise across the translations; in fact, in the given example we can see both inter- and intra-translational themes appearing in just this first cycle. As an inter-translational theme, the phrase "they've made" in the original machine-generated text has been translated to "they have created" in all three languages, which suggests a cross-language effect produced by cyclic translation of any kind. Focusing now on the Russian translation for a language-specific, intra-translational theme, we can see that every instance of the token "grandparents" in the original text has been replaced with "grandmother and grandfather" in the Russian translation. This suggests that there may be no direct translation for the word "grandparents" in Russian, an idea that can be confirmed with a quick Google search.
Moving on to the fifth cycle, we can see that some of these themes persist: both types of perturbations still appear in every translation, with their frequency again tracking each language's relative similarity to English. We can still see that the phrase "they've made" in the original text has been translated to "they have created" in all three languages; however, something interesting has happened to our intra-translational Russian theme. Instead of "grandmother and grandfather" replacing all instances of "grandparents," the token "grandparents" actually reappears in the fifth cycle of the Russian translation. This suggests that even though there is no single Russian word for "grandparents," the cyclic nature of the translation allows for a blend between the original machine-generated text and the translational outputs from the target language.
Detection Results
Now let's look at the results when we pass these translations to the machine-generated text detector. Immediately we can see that these translation cycles actually increase the detection score for all languages. In fact, almost all of the impact of the translation on the detection scores is captured within the first translation cycle, where the detection score of the machine-generated texts in all three languages jumps to approximately 100%. One hypothesis for why this might be the case is the fine-tuned nature of the detector combined with our choice of generation and translation models. Since the detector has been trained on outputs from both the original generation model (GPT-4) and the translation model (Llama 3.3 70B), it is possible that it is able to generalize between the outputs of these models and capture the translational perturbations with higher accuracy. We can also see that this detection score persists throughout the translation cycles, which seems to indicate that, despite the blending of original machine-generated content with translational perturbations, the fine-tuned detector is able to abstract these changes away and remain confident that the text is not human-written.
Curiously, when these same translation cycles are applied to human-generated content, we also see an increase in the detection score across the board. This is consistent with our previous theory, as it would make sense that the machine-generated text detector is picking up on the translational perturbations created by the translation model (Llama 3.3 70B), whose outputs the detector was trained on. It stands to reason that when the detector sees these translational changes, which it classifies as AI-written, it becomes more likely to classify the entire text as AI-written, leading to an increase in detection score for the translated human-generated texts.
Conclusion
In conclusion, cyclic translation actually increases the detection score for this specific detector across all three translation languages; in fact, translation even increased the detection score of human-generated text to approximately 50%. I am curious whether a different type of detector, perhaps a more metric-based model that cares more about individual token perturbations than about response-level patterns, would be more susceptible to this cyclic type of attack. Given more time and resources, I would also like to vary the translation hyperparameters, perhaps increasing the temperature of the translation model or the number of translation cycles, to achieve more variation in the perturbations. Along the same vein, it would be interesting to see whether using a translation model that the machine-generated text detector was not trained on would affect the detection scores after translation.
References
Wang, Yichen, Shangbin Feng, Abe Bohan Hou, Xiao Pu, Chao Shen, Xiaoming Liu, Yulia Tsvetkov, and Tianxing He. Stumbling Blocks: Stress Testing the Robustness of Machine-Generated Text Detectors Under Attacks. 2024.
Edmund Hogan GitHub Repository of the Data and Codebase
(The content is based on Stanford CS224N’s Custom Final Project.)