Graduation Thesis

HANOI UNIVERSITY OF SCIENCE AND TECHNOLOGY

GRADUATION THESIS

ScholarSleuth: A System for Fine-grained

AI-Generated Content Detection

TẠ NGỌC MINH

minh.tn214918@sis.hust.edu.vn

Program: Data Science and Artificial Intelligence

Supervisor: Dr. Đinh Viết Sang

Signature

Department: Computer Science

School: School of Information and Communications Technology

HANOI, 06/2025

ACKNOWLEDGMENT

"If you pass by quiet Bách khoa street,

Bring late crêpe myrtles in summer heat.

Where friends once rushed night and day,

Our final exams take youth away..."

"Nếu em về qua phố nhỏ Bách khoa

Nhớ mang cho tôi chùm bằng lăng cuối hạ,

Nơi bè bạn đêm ngày hối hả

Mùa thi cuối cùng, mai hết tuổi sinh viên..."

This is my graduation thesis, marking the end of my four years at Hanoi University

of Science and Technology.

I want to express my gratitude to Dr. Đinh Viết Sang, my beloved supervisor,

who inspired me to select and complete the thesis. His mentorship extended far

beyond academic matters, influencing my career path and personal development.

Besides, I also extend my thanks to Professor Preslav Nakov, Dr. Yuxia Wang and

the research team at Natural Language Processing department, Mohamed bin Za-

yed University of Artificial Intelligence – where I will pursue a PhD degree after

finishing my bachelor – for their invaluable supports during entire this time and

related research in this topic.

A hear tfelt thanks goes to my family, who are my "platinum sponsors" during

the past 22 years. Mom and Dad, thank you for your unconditional love and encour-

agement. Your patience, trust, and constant belief in me have been the foundation

of all my efforts and successes.

I also extend my thank to my old colleagues at Viettel IT Center (formerly IT

Division, Viettel Group HQ) for creating favorable environment during my studies.

Particularly, I want to show my appreciation to Mr. Nguyễn Chí Thanh and Mr. Võ

Đức Quân for their valuable insights in my career path.

My sincere appreciation goes to Ho Chi Minh’s Youth Union – SOICT and

Google Developer Group on Campus – HUST. These organizations enriched my

university life, helping me grow personally and professionally while connecting

me with inspiring peers and friends.

To my close fr iends, Bình, Trường, Việt Anh and Dũng – thank you for your

companionship throughout the four years. Thanks my teammates at Foundation

Model Lab, BKAI Research Center, Mỹ Anh, Đông, Trường, Anh Minh and Đức

Anh for their valuable feedbacks and contributions that helped shape this thesis.

Finally, thanks HUST for being an unforgettable part of my life.

"When I have journeyed all life through,

My soul remains Bách khoa and true."

"Mai đây đi trọn đường đời,

Hồn tôi vẫn mãi là người Bách khoa"

LỜI CẢM ƠN

"Nếu em về qua phố nhỏ Bách khoa

Nhớ mang cho tôi chùm bằng lăng cuối hạ,

Nơi bè bạn đêm ngày hối hả

Mùa thi cuối cùng, mai hết tuổi sinh viên..."

Bốn năm tại Bách khoa của tôi đang dần khép lại cùng với quyển ĐATN này.

Em xin chân thành cảm ơn thầy Đinh Viết Sang, người thầy hướng dẫn đã luôn

đồng hành, truyền cảm hứng, giúp em lựa chọn và hoàn thiện đề tài này. Sự dìu

dắt của thầy không chỉ giới hạn trong phạm vi học thuật mà còn ảnh hưởng sâu sắc

đến định hướng nghề nghiệp và quá trình trưởng thành của em. Em cũng xin gửi

lời cảm ơn đến GS. Preslav Nakov, TS. Yuxia Wang và nhóm nghiên cứu tại khoa

Xử lý ngôn ngữ tự nhiên, Đại học Mohamed bin Zayed về Trí tuệ Nhân tạo – nơi

em sẽ tiếp tục theo học chương trình tiến sĩ – vì những hỗ trợ quý báu trong suốt

thời gian qua và các nghiên cứu liên quan đến đề tài này.

Con xin bày tỏ lòng biết ơn sâu sắc đến gia đình – những "nhà tài trợ kim cương"

của con trong suốt 22 năm qua. Cảm ơn bố mẹ vì luôn yêu thương, động viên và

đặt niềm tin vào con. Sự kiên nhẫn và tin tưởng của bố mẹ chính là nền tảng vững

chắc cho mọi nỗ lực và thành công của con.

Em cũng xin gửi lời cảm ơn đến các anh chị đồng nghiệp cũ tại Trung tâm CNTT

Viettel (trước đây là Ban CNTT Tập đoàn) đã tạo điều kiện thuận lợi cho em trong

suốt quá trình học tập và công tác tại đây. Đặc biệt, em cảm ơn anh Nguyễn Chí

Thanh và anh Võ Đức Quân vì những chia sẻ cho em về định hướng tương lai.

Chặng đường này của tôi được đồng hành bởi Đoàn Thanh niên - Hội Sinh viên

trường CNTT&TT và Google Developer Group on Campus – HUST. Những tổ

chức này đã tô thêm màu sắc cho 4 năm đại học, cũng như giúp tôi trưởng thành

hơn trong cả chuyên môn lẫn kỹ năng xã hội.

Gửi đến những người bạn, Bình, Trường, Việt Anh và Dũng – cảm ơn các cậu

đã đồng hành cùng tớ suốt bốn năm qua. Cảm ơn các thành viên của Foundation

Model Lab, Tr ung tâm BKAI – Mỹ Anh, Đông, Trường, Anh Minh và Đức Anh –

vì những góp ý quý giá đã góp phần hoàn thiện đồ án này.

Cuối cùng, cảm ơn Bách khoa vì đã trở thành một phần của thanh xuân.

"Mai đây đi trọn đường đời,

Hồn tôi vẫn mãi là người Bách khoa"

ABSTRACT

The growing use of large language models in writing has made it increasingly

difficult to distinguish between human-written and AI-generated content, espe-

cially in academic contexts where authorship integrity is vital. As AI-assisted texts

become commonplace, there is a pressing need for reliable methods to assess the

extent of AI involvement, even when experts struggle to distinguish them.

While existing detection systems focus on binary classification, they fall short

in real-world scenarios. These approaches often fail to capture human-AI collab-

oration, struggle to generalize across domains and languages, and perform poorly

on unseen AI models.

This thesis introduces ScholarSleuth, a fine-grained detection framework for

nuanced authorship analysis. It comprises: (1) FAIDSet, a multilingual, multi-

domain dataset; (2) FAID, a novel model combining contrastive learning and multi-

task classification to detect human-written, AI-generated, and collaborative texts;

and (3) an interactive web application featur ing an AI Factor Impact Score.

ScholarSleuth achieves up to 95.58% accuracy on in-domain, known genera-

tors settings and 93.31% on unseen generators, outperforming existing methods in

accuracy, interpretability, demonstrating strong generalizability across diverse do-

mains. This work provides a robust, practical and adaptable solution for transparent

and responsible AI use in writing, especially in academic scenarios.

iii

Student

Tạ Ngọc Minh

TABLE OF CONTENTS

ACKNOWLEDGEMENTS.................... .................................................... i

ABSTRACTS ......................... ........................................................... ....... iii

TABLE OF CONTENTS .......................... ................................................. iv

LIST OF FIGURES......................................................... .......................... viii

LIST OF TABLES .. ............................................... ................................... ix

LIST OF EQUATIONS ....................................... ...................................... x

LIST OF ABBREVIATIONS.................... ................................................. xi

CHAPTER 1. INTRODUCTION......................................... .................... 1

1.1 Problem Statement........................... ..................................................... 1

1.2 Foundational Background and Problem of Research................................. 2

1.3 Scope of Research .. ..................................................... ......................... 3

1.4 Contributions .......... ..................................................... ........................ 5

1.5 Organization of Thesis ......... ..................................................... ............ 6

CHAPTER 2. BACKGROUND AND RELATED WORKS ...................... 8

2.1 Overview .......................................................... ................................... 8

2.2 Advancements in Large Language Models and Their Impacts on Writing

Practices................................... ..................................................... ............ 8

2.2.1 The Advancements of Large Language Models............................. 8

2.2.2 Tendency of Using AI Assistant in Writing................................... 10

2.2.3 Human vs. AI: The Limits of Expert Judgment ............................. 12

2.3 Fine-grained AI-generated Text Datasets........................................ ......... 12

2.4 Generalization of AI-generated Text Detection ........................................ 16

2.5 Contrastive Learning for AI-generated Text Detection.............................. 18

2.6 Summary ......................................... .................................................... 19

CHAPTER 3. METHODOLOGY..................................................... ....... 20

3.1 Overview .......................................................... ................................... 20

3.2 Overview of FAID Architecture .................................................... ......... 20

3.3 Training Architecture .................................................................... ........ 21

3.3.1 Architecture Design ............................................................. ...... 21

3.3.2 What is LLM family? ................................................................. 22

3.3.3 Multi-level Contrastive Learning ......................................... ........ 23

3.3.4 Multitask Auxiliary Lear ning .......... ............................................ 25

3.4 Inference Architecture.................................................... ....................... 25

3.4.1 Architecture Design ............................................................. ...... 25

3.4.2 Handling Unseen Data without Retraining.................................... 27

3.4.3 Fuzzy k-Nearest Neighbors with Softmax Weighting .............. ....... 27

3.5 Summary ......................................... .................................................... 28

CHAPTER 4. EXPERIMENTS AND NUMERICAL RESULTS.............. 30

4.1 Overview .......................................................... ................................... 30

4.2 FAIDSet .. ....................................................................... ..................... 30

4.2.1 Overview ....................................... ........................................... 30

4.2.2 The Formulation of Human-written Dataset............................ ...... 30

4.2.3 Dataset Statistics................................................... ..................... 31

4.2.4 Diverse Prompt Strategies................ ........................................... 31

4.3 Other Datasets................................... ................................................... 34

4.4 Encoder Selection for FAID Architecture........... ..................................... 35

4.5 Analysis on Text generated by Different AI Families................................ 36

4.5.1 Text Distribution between AI Families ........... .............................. 36

4.5.2 Text Distribution between AI Models within the Same Family ....... 38

4.5.3 Embedding Visualization and Semantic Cohesion......................... 38

4.5.4 Key Findings ................. ............................................................ 38

4.6 How can We deal with Unseen Data?................ ...................................... 41

4.6.1 The Need to Use Vector Database............................. ................... 41

4.6.2 Clustering Algorithm Selection .................................. ................. 41

4.7 Baseline Models for FAID Evaluation ............................................. ....... 42

4.8 Experimental Settings to evaluate FAID Architecture performance............ 43

4.9 Evaluation Metrics for Evaluation Process ....................................... ....... 43

4.10 Whether the text is Human, AI, or Human-AI? ...................................... 45

4.11 Identifying Different Generators.......... ................................................. 46

4.12 Discussion.............. ..................................................... ....................... 46

4.13 Summary ...................................................................... ..................... 48

CHAPTER 5. SCHOLARSLEUTH WEB APPLICATION...................... 49

5.1 Overview .......................................................... ................................... 49

5.2 AI Factor Impact Score ..................... .................................................... 49

5.2.1 AFI Score Calculation and Threshold .......... ................................ 49

5.2.2 Why AFI = 0.7 for Human-AI Collaborative Texts?..................... .. 50

5.3 System Overview. ........................................................... ...................... 50

5.4 User Interface Walkthrough..... ..................................................... ......... 51

5.4.1 Analysis Tools .............. ..................................................... ........ 51

5.4.2 Playgrounds.... ........................................................... ................ 56

5.5 Summary ......................................... .................................................... 60

CHAPTER 6. CONCLUSIONS AND BROADER IMPACTS... ................ 62

6.1 Conclusion......................... ..................................................... ............. 62

6.2 Suggestion for Future Works....................................................... ........... 63

6.3 Ethics and Broader Impacts ....................................................... ............ 63

PUBLICATIONS AND AWARDS GIVEN TO THIS THESIS .......... ........ 65

REFERENCES ..................................................... ................................... 66

vii

LIST OF FIGURES

Figure 2.1 Survey on student’s tendency of using AI . . . . . . . . . . 10

Figure 2.2 Analysis from University of Illinois Chicago on student’s

AI usage purpose. . . . . . . . . . . . . . . . . . . . . . . . . . . 11

Figure 3.1 Training architecture . . . . . . . . . . . . . . . . . . . . . 22

Figure 3.2 Inference architecture . . . . . . . . . . . . . . . . . . . . 26

Figure 4.1 Text length distributions in words and characters across gen-

erators . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37

Figure 4.2 Top 20 most common 3-gram from generators using 500

sample prompts . . . . . . . . . . . . . . . . . . . . . . . . . . . 39

Figure 4.3 Visualizations showing clustering behavior using 2D and

3D embeddings . . . . . . . . . . . . . . . . . . . . . . . . . . . 40

Figure 5.1 Homepage of ScholarSleuth system . . . . . . . . . . . . . 52

Figure 5.2 Text Analyzer interface . . . . . . . . . . . . . . . . . . . 53

Figure 5.3 Document Analyzer interface . . . . . . . . . . . . . . . . 55

Figure 5.4 A sample of generated report . . . . . . . . . . . . . . . . 57

Figure 5.5 Prompt Crafting Tool interface . . . . . . . . . . . . . . . 59

Figure 5.6 Rewrite Challenge interface . . . . . . . . . . . . . . . . . 61

viii

LIST OF TABLES

Table 2.1 English Fine-grained AI-generated Text Detection Datasets

Overview. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15

Table 4.1 Statistics of human-written text’s origins in FAIDSet. . . . . 31

Table 4.2 Number of examples per label in subsets in FAIDSet. . . . . 32

Table 4.3 Samples of diverse prompt templates used to generate FAIDSet 34

Table 4.4 Encoder selection on known vs. unseen generators. . . . . . 35

Table 4.5 Comparison of clustering algorithms on known vs. unseen

generators. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41

Table 4.6 Performances of detectors with three labels . . . . . . . . . 45

Table 4.7 Accuracy of detectors in identifying generators . . . . . . . 46

LIST OF EQUATIONS

Equation 3.1 Relations between text distributions . . . . . . . . . . . . 21

Equation 3.2 Relations between classes in case x = 1 . . . . . . . . . 23

Equation 3.3 Relations between classes in case x = 0 . . . . . . . . . 23

Equation 3.4 Relations between classes in case y = 1 . . . . . . . . . 23

Equation 3.5 Relations explained by parameter z for human-AI collab-

orative texts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24

Equation 3.6 Overall constraints for text representation . . . . . . . . . 24

Equation 3.7 Loss function for each case based on SimCLR . . . . . . 24

Equation 3.8 Multi-level contrastive loss . . . . . . . . . . . . . . . . 24

Equation 3.9 Multitask Auxiliary loss computed using cross entropy . 25

Equation 3.10 Overall loss formula . . . . . . . . . . . . . . . . . . . . 25

Equation 3.11 Softmax similarity weighting for fuzzy kNN. . . . . . . . 27

Equation 3.12 Boost Factor for Fuzzy Voting. . . . . . . . . . . . . . . 28

Equation 4.1 Accuracy formula . . . . . . . . . . . . . . . . . . . . . 43

Equation 4.2 Precision formula . . . . . . . . . . . . . . . . . . . . . 44

Equation 4.3 Recall formula . . . . . . . . . . . . . . . . . . . . . . . 44

Equation 4.4 F1-macro score formula . . . . . . . . . . . . . . . . . . 44

Equation 4.5 MSE formula . . . . . . . . . . . . . . . . . . . . . . . 44

Equation 4.6 MAE formula . . . . . . . . . . . . . . . . . . . . . . . 44

LIST OF ABBREVIATIONS

Abbreviation Definition

AFI AI Factor Impact score

AI Artificial Intelligence

AIGT AI-Generated Text Detection

FAID Fine-grained AI-generated text

Detection (Name of the proposed

framework)

FAIDSet Fine-grained AI-generated text

Detection dataSet (Name of the

proposed dataset)

GPT Generative Pretrained Transformer

IELTS International English Language Testing

System

KNN k-Nearest Neighbors

LLM Large Language Model

MAE Mean Absolute Error

MSE Mean Squared Er ror

NLP Natural Language Processing

PDF Por table Document Format

SVM Support Vector Machine

VJOL Vietnam Online Journals

XLM Cross-lingual Language Model

CHAPTER 1. INTRODUCTION

1.1 Problem Statement

In recent years, LLMs such as ChatGPT

, Gemini

, Llama

, and DeepSeek

have emerged as powerful tools capable of generating high-quality natural language

text. These models are trained on massive corpora of web and human-authored

data, enabling them to produce fluent, contextually relevant, and stylistically var-

ied output in response to simple prompts. Their capabilities are being rapidly in-

tegrated into both professional and educational workflows, profoundly reshaping

how people write, learn, and communicate.

While these models offer considerable benefits, such as aiding in idea genera-

tion, automating routine writing tasks, or providing assistance to non-native speak-

ers, they have also introduced a number of serious challenges. The alarming con-

cern among these is their potential for misuse in contexts where originality, au-

thorship, and intellectual contribution are essential. In academic environments, for

instance, students increasingly use LLMs not just as writing aids, but as substitutes

for substantial parts of their assignments, including essays, code submissions, and

even research papers. In some cases, entire documents are generated with minimal

to no human refinement, yet submitted under the guise of original student work.

This shift has introduced a new layer of complexity to the evaluation of aca-

demic submissions and the enforcement of academic integrity. Unlike traditional

forms of plagiarism – which rely on the reuse of existing human-authored text –

AI-generated content is often novel in form and phrasing, making it undetectable

to conventional plagiarism detection tools. Consequently, instructors, editors, and

evaluators are increasingly faced with the question: Was this written by a human,

or by a machine?

The difficulty is not merely technical, it also strikes at the core of educational

values. Academic institutions rely on the assumption that submitted work reflects

the student’s own understanding, reasoning, and expression. When LLMs are used

to bypass this process, not only is the assessment compromised, but the student’s

learning process is also diminished. Furthermore, in broader domains such as jour-

nalism and digital media, AI-generated content can obscure the origin of informa-

https://chatgpt.com

https://gemini.google.com

https://www.perplexity.ai

https://chat.deepseek.com

CHAPTER 1. INTRODUCTION

tion, making it harder to verify sources and reducing trust in what is published.

A further complication arises from the increasingly blurred boundaries between

human and AI contributions. Many individuals use LLMs to draft content that

they later revise or personalize – a form of human-AI collaboration. This gray area

presents a significant challenge for detection: existing tools often reduce the prob-

lem to a binary classification (human vs. AI), which fails to capture the nuanced

reality of mixed-authorship text.

As the use of LLMs becomes more pervasive and the models themselves grow

more sophisticated, the need for accurate, reliable, and fine-grained methods to dis-

tinguish between human-generated and AI-generated content becomes urgent. This

is not merely a technical requirement, but a foundational necessity for maintaining

integrity, accountability, and trust across academic, professional, and digital plat-

forms.

1.2 Foundational Background and Problem of Research

Recent studies have highlighted the growing presence of AI-assisted writing in

education and research. Tools like ChatGPT are increasingly used to generate es-

says, solve prog ramming tasks, or even draft entire research papers. While these ap-

plications offer undeniable productivity benef its, they also introduce cr itical chal-

lenges related to academic integrity. Notably, the line between AI assistance and

academic misconduct has become increasingly difficult to draw.

A key issue lies in the detection of AI-generated content. Traditional plagiarism

detection tools, which are designed to identify copied material through textual over-

lap, are inadequate for this task. AI-generated texts are often or iginal in phrasing

and structure, making them elusive to existing detection systems. Prior approaches

to detecting such content have largely focused on binary classification (i.e., human

vs. AI), but these models tend to suffer from limited adaptability. They often fail

when applied to different academic domains, languages, or newer AI models, and

they struggle to detect more nuanced cases such as human-AI collaboration.

Moreover, many existing methods rely on surface-level features, such as word

frequency or sentence patterns, which do not capture the deeper stylistic or seman-

tic traits of AI-generated texts. The lack of sophisticated, fine-grained detection

tools introduces several significant problems:

• Erosion of Trust: The inability to reliably detect AI-generated content can

under mine trust in digital platforms, as users may struggle to verify the au-

thenticity of information.

CHAPTER 1. INTRODUCTION

• Challenges in Content Moderation: Platforms that rely on user-generated

content face difficulties in policing AI-generated texts, leading to the prolifer-

ation of misleading or harmful content.

• Ethical Concerns: AI-generated content can be used for malicious purposes,

such as disinformation campaigns or academic dishonesty, making it vital to

develop robust detection mechanisms.

• Lack of Generalization: Existing detection systems often fail to generalize

across different AI models, languages, and domains, limiting their practical

effectiveness as AI technology continues to evolve.

To address these limitations, this thesis proposes a new approach aimed at im-

proving the detection of AI-generated content, especially in academic settings. The

proposed system, ScholarSleuth, is designed to go beyond binary classification

and support a more nuanced analysis of authorship. By focusing on generalizabil-

ity and adaptability, the framework seeks to offer a more robust solution for distin-

guishing human, AI, and hybrid-authored content across varied contexts.

In addition to safeguarding academic integrity, the broader implications of this

work extend to other fields such as content moderation, journalism, and disinfor-

mation detection – where verifying the authenticity and origin of text is equally

cr itical. As AI technologies continue to evolve rapidly, there is a growing need for

detection systems that can evolve in parallel and maintain their effectiveness across

different domains.

1.3 Scope of Research

The scope of this research is centered on the detection of AI-generated text, with

a particular emphasis on academic content. This research aims to develop a com-

prehensive framework to identify AI-generated text in academic theses, research

papers, and other scholarly reports, ensuring academic integrity and transparency

in the evolving landscape of AI-assisted writing.

The ScholarSleuth system, incorporating the FAID architecture, will be the core

tool developed to address this problem. The FAID framework will focus on de-

tecting fine-grained AI-generated text, distinguishing between three categories of

authorship: fully AI-generated, human-written, and human-AI collaborative text.

The ability to identify the specific AI model family responsible for generating the

content (such as GPT-4, Gemini, or Llama) will also be an essential feature of the

proposed detection system.

This study primarily concentrates on the following areas:

CHAPTER 1. INTRODUCTION

1. Multilingual, Multi-Domain, Multi-Generator Dat aset Construction: To

support robust training and evaluation, this research includes the creation

of FAIDSet – a fine-grained dataset encompassing texts from multiple lan-

guages, academic domains, and AI generators. The dataset includes human-

wr itten, AI-generated, and hybrid texts sourced from diverse educational and

disciplinary contexts. This diversity allows the system to learn generalized pat-

terns and improves its adaptability to real-world scenarios involving various

types of AI involvement and linguistic nuances.

2. Detection of AI-Influenced Academic Texts: This research addresses the

challenge of detecting both fully AI-generated and human-AI collaborative

academic texts, including research papers, thesis drafts, and scholarly re-

ports. I investigates the linguistic and stylistic characteristics common to AI-

influenced academic wr iting, such as excessive fluency, generic phrasing, lack

of nuanced argumentation, or uniform structure that differentiate it from fully

human writing. The ScholarSleuth system is designed to identify these fea-

tures across a spectrum of AI involvement, from minor assistance to complete

generation.

3. Generalization Across AI Models and Domains: A core objective of this re-

search is to ensure the detection system’s robustness across various AI model

families (e.g., GPT, Llama, Gemini, DeepSeek) and academic subject areas.

Because each model exhibits distinct stylistic tendencies, the FAID architec-

ture will be developed with the capacity to generalize across these differences.

This generalization ensures the system remains effective even as new models

and applications emerge in different academic domains.

4. Multilingual Detection: Recognizing the global nature of academic publish-

ing, this study emphasizes the importance of detecting AI-generated content in

multiple languages. The ScholarSleuth system is evaluated for multilingual ca-

pability, extending detection suppor t beyond English to include high-resource

and low-resource languages alike. This ensures broader applicability and rel-

evance in international academic environments.

5. Scalability and Real-World Applicability: Finally, the ScholarSleuth frame-

work is designed for practical deployment in real-world academic settings.

The system will be evaluated for scalability, efficiency, and ease of integra-

tion within workflows at universities, publishers, or research institutions. Its

ability to handle large volumes of submissions and deliver reliable classifica-

tion results in operational environments will be a critical metric of success.

CHAPTER 1. INTRODUCTION

In summary, this research is aimed at creating a robust and adaptable solution

for detecting AI-generated text in academic writing, addressing the critical need

for reliable methods to maintain academic integrity in an era of AI-assisted con-

tent creation. By focusing on the detection of fine-grained AI-generated and col-

laborative content, the ScholarSleuth system offers a comprehensive approach that

combines multi-level contrastive learning with real-world applicability across lan-

guages, domains, and evolving AI models.

1.4 Contributions

This research makes several significant contributions to the field of AI-

generated content detection, with a particular focus on academic texts. Three pri-

mary contributions of this thesis – also three main components of ScholarSleuth –

are outlined below:

1. Creation of FAIDSet – a Multilingual, Multi-Domain, Multi-Generator

Dataset: I introduce a new benchmark dataset comprising approximately

84,000 examples across multiple domains (e.g., student theses, research ab-

stracts), languages (English and Vietnamese), and AI generators (e.g., GPT,

Llama, Gemini). This fine-grained dataset includes three categories of au-

thorship: human-written, AI-generated, and human-AI collaborative texts. It

addresses the scarcity of publicly available resources for comprehensive eval-

uation of AI-generated content detection systems in multilingual and mixed-

authorship settings.

2. Design of the FAID Detection Framework: I propose FAID (Fine-grained

AI-authorship Identification Detector), a novel detection framework designed

to improve generalization in identifying AI-generated and collaborative con-

tent. FAID leverages a combination of contrastive learning and auxiliary clas-

sification to capture subtle stylistic and semantic cues in latent representa-

tions. This enables the model to distinguish between closely related authorship

classes with higher robustness, especially in complex or deeply mixed inputs.

FAID also aims to deal with these two specific problems:

• Improved Generalization to Unseen Domains and Generators: FAID

demonstrates superior performance over strong baseline detectors in both

in-domain and out-of-domain scenarios. In particular, its ability to gener-

alize to unseen AI models and academic domains makes it a practical and

forward-compatible solution. This advantage stems from its architecture’s

focus on learning model-agnostic stylistic features, rather than overfitting

to known generators.

CHAPTER 1. INTRODUCTION

• Enhanced Interpretability via Stylistic Proximity: Beyond raw classi-

fication, FAID provides a mechanism to interpret its predictions by mea-

sur ing the stylistic proximity between the input text and reference exam-

ples with known authorship in a learned embedding space. This capability

offers insights into why a given text is attributed to a particular class, sup-

porting greater transparency and trust in AI-authorship judgments.

3. Web Application for AI-Generated Text Detection: In addition to the core

detection framework, this research presents a web application that enables

users to upload and generate reports on the AI-generated content within the

document. It also introduce AI Factor Impact score – a benchmark to identify

the AI involvement level. After processing, the application generates a detailed

report, categorizing sections of the text as human-written, AI-generated, or

collaborative, and providing insights into the specific AI model used for gen-

eration. This tool makes it easy for researchers, educators, and content moder-

ators to assess the authenticity of texts in an accessible, user-friendly manner.

Overall, this thesis contributes to the growing field of AI content detection by

providing an advanced, flexible, and scalable solution for identifying AI-generated

text in academic writing. The ScholarSleuth framework, combined with its web

application, offers a practical tool for improving transparency, accountability, and

academic integrity in an era where AI-assisted content creation is becoming in-

creasingly prevalent.

1.5 Organization of Thesis

The remaining parts of this thesis is organized as follow:

Chapter 2 provides the necessary backgrounds and related works. This chapter

explores the increasing integration of AI assistance in wr iting processes, the need

for detection methods and the limitations of existing detection techniques. It also

discusses recent developments in dataset construction, generalization challenges in

detection systems, and the use of contrastive learning to improve model robustness.

Chapter 3 defines the core task addressed in this thesis: the fine-grained clas-

sification of text authorship into three categories. It introduces the FAID frame-

work, including its task formulation and the architectural design of both the train-

ing and inference components. This chapter also extends the methodology by defin-

ing "LLM families" and optimizing sentence encoders for better generalization. It

introduces the full training pipeline, including a multitask auxiliary classifier and

a contrastive learning approach that incorporates both authorship types and LLM

families. The chapter also describes an adaptive inference strategy for handling

CHAPTER 1. INTRODUCTION

unseen texts without retraining.

Chapter 4 details the selection and evaluation process in designing FAID archi-

tecture, the experimental evaluation of the architecture and presents the construc-

tion of FAIDSet, a new multilingual, multi-domain, and multi-generator dataset

designed to support research on AI-generated text detection with details about the

data sources, collection methodology, and the rationale behind the dataset design.

It covers the setup, baseline comparisons, and results from tasks involving both

authorship classification and generator attribution. The experiments are conducted

across multiple conditions, including unseen domains and generators, to test the

system’s generalization performance.

Chapter 5 describes the ScholarSleuth web application, which implements the

FAID framework in an interactive system for real-time AI-generated content detec-

tion. This chapter showcases the system’s practical deployment and demonstrates

its usability for academic and educational settings.

Finally, Chapter 6 summarizes and draws a conclusions about the key findings

of this thesis, highlighting the importance and effects of applying this system into

real-world scenario. This chapter also makes some suggestions about potential di-

rections for future works and improvements in the field of AI-generated content

detection.

CHAPTER 2. BACKGROUND AND RELATED WORKS

2.1 Overview

Advanced LLMs have changed how we create and refine text. Instead of or igi-

nating every word, humans now more often act as post-editors or reviewers, step-

ping in after AI drafts the initial content. This collaborative workflow spans fields

from academic publishing and journalism to education and social media, making

hybrid human-AI texts the new norm [12, 30]. As AI-generated content becomes

increasingly sophisticated, detecting the origin of written text – whether human or

AI-generated – has become a critical challenge.

This chapter provides an overview of the key concepts and research areas rel-

evant to this thesis. First, it examines the evolution and capabilities of large lan-

guage models, discussing their development and how they have been leveraged in

wr iting assistance. It then explores the growing tendency to use AI as an assis-

tant in writing, the struggle of experts in distinguishing authorship, followed by an

overview of the fine-g rained AI-generated text datasets and the need for accurate

detection methods. I also discuss the generalization challenges in detecting AI-

generated content across domains and languages. The chapter also discuss about

contrastive learning techniques, which have proven effective in improving the ac-

curacy of AI content detection. They are also the foundational knowledge required

to understand the contributions of this thesis will be established.

2.2 Advancements in Large Language Models and Their Impacts on Writing

Practices

Advancements in LLMs have transformed how text is generated, refined, and

evaluated. As models grow more sophisticated, they are increasingly embedded in

wr iting workflows, from academic research to everyday communication. At the

same time, their human-like output challenges traditional notions of authorship

and authenticity, raising complex questions about trust, literacy, and detection in

the age of AI-assisted writing.

2.2.1 The Advancements of Large Language Models

The inception of LLMs can be traced back to the introduction of the transformer

architecture [46] with self-attention mechanisms to process and generate sequences

of text, which replaced traditional recurrent neural networks and long short-term

memory networks with self-attention mechanisms [34, 53]. This architecture al-

lows for more efficient processing of sequences and better handling of long-range

CHAPTER 2. BACKGROUND AND RELATED WORKS

dependencies in text.

Building upon this foundation, OpenAI introduced the Generative Pre-trained

Transformer (GPT) series, starting with GPT-2 [41], which demonstrated the abil-

ity to generate coherent and contextually relevant text. GPT-3, released in 2020,

further scaled up the model to 175 billion parameters [7], achieving state-of-the-

art performance on a wide range of NLP tasks without task-specific fine-tuning.

GPT-4 [37], released in 2023, continued this trend, offering improved reasoning

capabilities and performance across various benchmarks.

Meta’s Llama series provides an open-source alternative to proprietary mod-

els. Llama-2 [22], released in 2023, and Llama-3 [32], released in 2024, are auto-

regressive transformer models trained on publicly available datasets. These models

emphasize efficiency and accessibility, aiming to democratize access to powerful

language models.

Google’s Gemini series [20], introduced in 2023, represents a significant ad-

vancement in LLMs. Gemini models are multimodal, capable of processing and

generating not only text but also images and other data types. This multimodal ca-

pability allows Gemini models to perform complex tasks that require understanding

and generating multiple forms of data simultaneously.

The release of DeepSeek R1 [15] marks a pivotal moment in the evolution of AI

models, offering an innovative approach to AI’s "thought process" or reasoning ca-

pabilities. Unlike traditional LLMs that focus primarily on generating contextually

appropr iate text, DeepSeek R1 introduces a model trained to simulate reasoning

in a manner more akin to human cognition. It is designed to replicate the way AI

"thinks" by mimicking how humans process information, make decisions, and ana-

lyze situations. The model incorporates a more structured decision-making process

that not only generates text but also considers multiple layers of reasoning before

arriving at conclusions. This is achieved through a unique architecture that inte-

grates multilevel contrastive learning and decision layers to model the reasoning

paths an AI might take in generating text.

This model’s release has significant implications for academic, professional,

and creative writing applications, as it offers a level of sophistication that is closer

to human thought processes. For content detection systems, however, the chal-

lenge becomes more complex, as distinguishing between machine-generated text

and human-generated content becomes less about surface-level patterns and more

about deeper, reasoning-driven structures in the text. The development of such AI

models emphasizes the need for detection methods that can account for more than

CHAPTER 2. BACKGROUND AND RELATED WORKS

Figure 2.1: Survey on student’s tendency of using AI conducted by Digital Education

Council [16].

just linguistic features, incorporating mechanisms that can assess the logic and rea-

soning behind content generation.

2.2.2 Tendency of Using AI Assistant in Writing

The integration of AI assistants into writing processes has seen a significant

uptick across educational and professional domains. A 2024 survey of Digital Ed-

ucation Council revealed that 86% of students reported using AI tools in their stud-

ies, 54% of them are using in weekly or daily basis, with 58% feeling inadequately

prepared for an AI-enabled workforce, highlighting the rapid adoption and the ac-

companying need for AI literacy [16].

In academic settings, AI tools are employed for various purposes, including

literature reviews, drafting, and editing. A survey of research scholars indicated

that 73.6% use AI in education, with 51% utilizing it for literature reviews and

CHAPTER 2. BACKGROUND AND RELATED WORKS

Figure 2.2: Analysis from University of Illinois Chicago on student’s AI usage purpose.

46.3% for writing and editing tasks [54]. Furthermore, 72.04% of participants in a

study on AI tools in academic writing reported that these tools had improved the

overall academic rigor of their work [35].

However, the widespread use of AI assistants has raised concerns about aca-

demic integrity. A survey by the Tertiary Education Quality and Standards Agency

found that over a third of students use chatbots like ChatGPT for assessments with-

out viewing it as cheating [5]. This has prompted educators to reconsider assess-

ment strategies, with some advocating for oral assessments and requiring students

to "show their working" to mitigate potential misuse.

Despite these concerns, many students view AI tools as valuable educational

aids. A report from the University of Illinois Chicago, which is shown in Figure 2.2

noted that around 56% of students acknowledged using AI for research and and

75% of them use AI for writing assistance, highlighting its utility in their academic

endeavors [26]. Moreover, 75% of educational leaders believe that generative AI

tools will improve student research skills, and 69% think these tools will enhance

students’ ability to write clearly and persuasively [44].

In summary, the adoption of AI assistants in writing is widespread, with stu-

dents and researchers leveraging these tools to enhance productivity and writing

quality. While concerns about academic integrity persist, the general consensus

CHAPTER 2. BACKGROUND AND RELATED WORKS

acknowledges the potential benefits of AI in education, underscoring the need for

balanced and ethical integration of these technologies.

2.2.3 Human vs. AI: The Limits of Expert Judgment

Our recent findings [52] show that even expert annotators – linguists, NLP re-

searchers, and practitioners – struggle to reliably distinguish between human and

AI-generated text. Across 16 datasets in 9 languages, expert accuracy averaged just

87.6%, suggesting that even seasoned evaluators are often misled by AI outputs

designed to mimic human writing.

Detection success varies by context. In paired comparisons, where human and

AI texts are judged side-by-side, cues like concreteness and linguistic quirks are

more detectable. But in single-text evaluations, accuracy drops signif icantly, some-

times nearing chance levels – highlighting the influence of framing and context on

expert judgment.

Prompt engineering further blurs the line. When models are guided to avoid

templates, add cultural nuance, and adopt varied styles, expert detection perfor-

mance can fall to 72.5%. These refined outputs increasingly mirror the richness

and spontaneity of human language, making even trained readers uncertain.

Ultimately, these findings challenge the reliability of conventional detection cri-

ter ia and signal the need for a reevaluation of what it means to produce "authentic"

text in an era increasingly influenced by advanced language models. If even the

experts, armed with extensive domain knowledge and contextual acuity, struggle

to distinguish between human and machine prose, then the implications extend be-

yond academic exercises to broader concerns over trust, authenticity, and account-

ability in our digital communications.

2.3 Fine-grained AI-generated Text Datasets

The detection of AI-generated text, particularly in academic contexts, requires

fine-grained datasets that can distinguish between human-authored, AI-generated,

and human-AI collaborative content. Fine-grained detection goes beyond tradi-

tional plagiarism detection by focusing on the subtle stylistic differences between

these types of content. As AI models like GPT, Llama, Gemini, and DeepSeek

become more sophisticated, the need for datasets that can capture these nuanced

distinctions grows.

Fine-grained datasets typically categorize AI-generated content into several

types [51]:

• AI-polished: Texts in this category are originally human-written and then re-

CHAPTER 2. BACKGROUND AND RELATED WORKS

fined or polished by AI models. These refinements typically involve enhanc-

ing grammar, clarity, or fluency without altering the original meaning or struc-

ture significantly. Detecting AI-polished text requires models that can identify

minor, yet characteristic stylistic changes introduced by the AI.

• AI-continued: In this case, a human starts the text, and an AI model generates

the continuation. This category often presents the most challenging detection

scenar ios because the AI’s writing must maintain coherence with the original

human-wr itten part while adhering to its own stylistic patterns.

• AI-paraphrased: Texts that were originally written by humans and reworded

by AI to express the same meaning using different phrasing, sentence struc-

ture, or vocabulary. Paraphrased texts may retain the original ideas but lack

the human-like identity that distinguish genuine human writing.

• Purely AI-generated: These texts are entirely generated by AI, without any

human input. The challenge here lies in detecting the characteristics that

differentiate machine-generated content from human-authored text, such as

overly structured language, formulaic sentence construction, and lack of deep

analysis.

Var ious datasets have been developed to support the fine-grained detection of

AI-generated text. These datasets typically contain a mix of human-written and

AI-generated content, spanning different domains, languages, and levels of collab-

oration.

• LLM-DetectAIve [3]: This dataset includes a variety of domains, such as

academic papers, student essays, and articles, with labels for both human and

AI-generated texts. The dataset also includes machine-polished and machine-

humanized texts, providing valuable data for training and evaluating AI con-

tent detection models.

• HART [6]: This dataset spans domains such as student essays, research pa-

per abstracts, and creative writing. It includes four categories: human-written

texts, LLM-refined texts, AI-generated texts, and humanized AI-generated

texts. The inclusion of hybrid texts where human input is mixed with AI-

generated content makes it especially useful for fine-grained detection sys-

tems.

• M4GT [50]: This dataset focuses on the generalization of AI-generated con-

tent detection across different domains and AI models. It contains a wide va-

r iety of texts, including both human-wr itten and machine-generated content,

CHAPTER 2. BACKGROUND AND RELATED WORKS

and is designed to test the performance of detection models in real-world, di-

verse scenarios.

• Beemo [4]: The Beemo dataset includes texts that are AI-generated, AI-

humanized, and deeply mixed, along with human-written texts. It is partic-

ularly useful for understanding how AI models blend with human writing and

for detecting subtle differences in mixed content.

These datasets play a critical role in developing detection systems capable of

handling the increasing variety of AI-generated text. They offer valuable annotated

examples that allow models to learn the distinguishing features between human and

AI-generated content. I do analysis in many cur rent dataset on AI-generated text

and summarized them in Table 2.1. Many dataset contains fine-grained labels and

multi-generator, but all of analyzed datasets are monolingual – only in English.

This problem makes me to collect and formulate a new dataset for multilingual,

multi-domain and multilingual AI-generated text dataset to adapt the approach to

real-world scenarios.

While these datasets provide the necessary foundation for fine-grained detec-

tion, several challenges remain:

• Multilinguality: Many existing datasets are limited to English or a few other

languages, making it difficult to apply detection methods in multilingual con-

texts. There is a pressing need for datasets that cover a wider array of lan-

guages to ensure that detection models can generalize across linguistic bound-

ar ies.

• Domain Diversity: AI models like GPT and Llama are trained on diverse data

sources, which means their generated content can span many domains. Fine-

grained detection systems must be trained on datasets that represent a broad

range of topics, from technical papers and research articles to creative writing

and casual dialogue.

• Evolving AI Models: The rapid development of new AI models means that

detection systems need to be adaptable. Models like Gemini and DeepSeek

represent new generations of AI tools, each with its own distinctive features.

Fine-grained detection must be able to handle new model architectures and

the content they generate, which requires continuously updated datasets and

adaptable detection techniques.

• Hybrid Texts: One of the most challenging aspects of AI-generated text detec-

tion is identifying hybrid texts – those created through collaboration between

CHAPTER 2. BACKGROUND AND RELATED WORKS

Dataset Languages Label Space Domains Generators Size

MixSet

[57]

English

Human-written, AI-polished

Human-initiated, AI-continued

AI-generated, human-edited

Deeply-mixed text

News

Game reviews,

Paper abstracts,

Speech,

Blog

GPT-4

Llama 2

Dolly

3,600

LLM-DetectiAIve

[3]

English

Human-written

AI-generated

Human-written, AI-polished

AI-generated, AI-humanized

arXiv abstracts,

Reddit posts,

Wikihow,

Wikipedia articles,

OUTFOX essays,

Peer reviews

GPT-4o

Mistral 7B

Llama 3.1 8B

Llama 3.1 70B

Gemini

Cohere

487,996

Beemo

[4]

English

AI-generated, AI-humanized

Human-written

AI-generated

AI-generated, human-edited

Generation,

Rewrite,

Open QA,

Summarize,

Closed QA

Llama 2

Llama 3.1

GPT-4o

Zephyr

Mixtral

Tulu

Gemma

Mistral

19,256

M4GT

[50]

English

Human-written, AI-continued

Human-written

AI-generated

Peer review,

OUTFOX

Llama 2

GPT-4

GPT-3.5

33,912

Real or Fake

[17]

English

Human-written

Human-initiated, AI-continued

Recipes,

Presidential Speeches,

Short Stories,

New York Times

GPT-2,

GPT-2 XL

CTRL

9,148

RoFT-chatgpt

[29]

English Human-initiated, AI-continued

Short Stories,

Recipes,

New York Times,

Presidential Speeches

GPT-3.5-turbo 6,940

Co-author

[55]

English Deeply-mixed text

Creative writing,

New York Times

GPT-3 1,447

TriBERT

[56]

English

Human-initiated, AI-continued

Deeply-mixed text

Human-written

Essays ChatGPT 34,272

LAMP

[9]

English AI-generated, human-edited Creative writing

GPT-4o

Claude 3.5 Sonnet

Llama 3.1 70B

1,282

APT-Eval

[42]

English Human-written, AI-polished Based on MixSet

GPT-4o

Llama 3.1 70B

Llama 3.1 8B

Llama 2 7B

11,700

HART

[6]

English

Human-written

Human-written, AI-polished

AI-generated, AI-humanized

AI-generated text

AI-generated, human-edted

GPT-3.5-turbo

GPT-4o

Claude 3.5 Sonnet

Gemini 1.5 Pro

Llama 3.3 70B

Qwen 2.5 72B

16,000

LLMDetect

[12]

English

Human-written

Human-written, AI-polished

Human-written, AI-extended

AI-generated

DeepSeek v2

Llama 3 70B

Claude 3.5 Sonnet

GPT-4o

64,304

ICNALE corpus

[33]

English Human-written Essays

Qwen 2.5

Llama 3.1 8B/70B

Llama 3.2 1B/3B

Mistral Small

67,000

CyberHumanAI

[36]

English Human-written Wikipedia 500

Table 2.1: English Fine-grained AI-generated Text Detection Datasets Overview.

CHAPTER 2. BACKGROUND AND RELATED WORKS

humans and AI. These texts often blend human creativity with AI-generated

content, making it harder to attribute authorship accurately. The datasets used

for training detection models must therefore include a significant amount of

hybrid content to allow the model to learn how to identify these complex cases.

2.4 Generalization of AI-generated Text Detection

Recent studies such as M4GT [50] and LLM-DetectAIve [3] have highlighted a

persistent challenge for both binary and fine-grained AI-generated text detection:

poor generalization to unseen domains, languages, and generators. Many detection

methods show a signif icant drop in performance on out-of-distribution data, under-

scor ing the difficulty of building robust detectors to deal with real-world scenarios

and evolving LLMs outputs.

Several key challenges must be overcome to improve the generalization of AI-

generated text detection:

• Domain Shifts: One of the most significant hurdles in AI-generated text de-

tection is the variability of content across different domains. AI models like

GPT and Llama are often trained on diverse datasets, enabling them to gener-

ate content in a wide variety of domains, including technical writing, creative

content, academic research, and casual conversations. A model trained to de-

tect AI-generated content in one domain may not perform well in another,

especially if the writing styles or terminology used in the domain differ sig-

nificantly. This issue, known as domain shift, makes it necessary to develop

detection models that can adapt to different topics and content types without

significant performance degradation.

• Language Diversity: Most AI-generated text detection models are initially

trained on English-language datasets, which limits their applicability in mul-

tilingual settings. As AI models like Gemini and DeepSeek become increas-

ingly global, generating text in multiple languages, detection models must be

capable of handling non-English content. This multilingual challenge requires

the development of language-agnostic features or the use of multilingual train-

ing datasets to ensure that models can detect AI-generated text in diverse lin-

guistic contexts.

• Emerging AI Models: The rapid pace of innovation in AI technology means

that new language models are continually being introduced. For example,

GPT-4 introduced new capabilities compared to its predecessors, and mod-

els like DeepSeek R1 offer unique features that distinguish them from other

AI systems. Detection models that are trained on one specific AI model may

CHAPTER 2. BACKGROUND AND RELATED WORKS

not generalize well to newer models with different architectural or training

methodologies. Furthermore, generative models are becoming more sophis-

ticated in their ability to mimic human writing styles, making it increasingly

difficult for detection systems to identify subtle AI-generated characteristics

in new content.

• Hybrid Texts: The increasing prevalence of human-AI collaboration in con-

tent generation creates another challenge for generalization. Hybrid texts –

those in which humans and AI collaborate – are not easily categorized as

either fully human or fully AI-generated. These texts often exhibit character-

istics of both human creativity and AI-generated content, complicating the

detection process. Models trained on purely human or purely AI texts may

struggle to identify the nuanced features of hybrid texts, requiring more ad-

vanced detection strategies that can assess the degree of AI involvement in the

wr iting process.

To address these challenges and improve the generalization of AI-generated text

detection, several strategies have been proposed:

• Domain-Adversarial Training: One approach to improving generalization

across domains is domain-adversarial training [18]. This method involves

training the detection model to be domain-agnostic by using adversarial tech-

niques that encourage the model to learn features that are not specific to any

particular domain. By learning to ignore domain-specific information, the

model can become more adaptable to new topics and content types, improving

its ability to detect AI-generated text across a wide range of domains.

• Adversarial In-Context Learning: OUTFOX [28] proposes dynamically

generating challenging examples within the in-context learning paradigm. By

exposing the model to adversarially difficult instances, it enhances robust-

ness and forces the model to better distinguish subtle signals of AI generation.

However, this method can be computationally expensive and does not fully

resolve transfer issues to unseen domains.

• Sentence-Level Detection with Style-Content Interaction: SeqXGPT [49]

adopts a sentence-level detection strategy by leveraging a combination of log-

probability features and deep neural architectures, including convolutional

layers and self-attention. This allows the model to better capture local mixed-

content patterns and stylistic inconsistencies, contributing to improved gen-

eralization across text structures. Nevertheless, its reliance on model-specific

features may hinder scalability to unseen generators.

CHAPTER 2. BACKGROUND AND RELATED WORKS

• Fine-Grained Multi-Class Classification: LLM-DetectAIve [3] takes a fine-

grained classification approach, categorizing texts according to specific gener-

ator types while also using domain-adversarial training to mitigate overfitting.

This dual mechanism improves detection within known domains but faces

challenges in generalizing to completely novel generators and lacks multilin-

gual support.

The ability to generalize AI-generated text detection methods across domains,

languages, and models is essential for deploying these systems in real-world set-

tings. For example, in academic publishing, where papers from a wide range of

disciplines and authors are submitted, a detection model must be capable of iden-

tifying AI-generated content regardless of the subject matter or language used.

Similarly, in content moderation for social media platforms or news outlets, AI-

generated posts must be identified and flagged for review, even when they come

from different sources or involve different AI models.

Furthermore, as AI tools continue to evolve, detection systems must be able to

keep pace with advancements in generative models. This requires a flexible and

adaptable approach to model training and evaluation, ensuring that AI-generated

content is reliably detected across a wide range of scenarios.

Generalizing AI-generated text detection is a complex and ongoing challenge,

dr iven by factors such as domain diversity, language variability, the rapid develop-

ment of new AI models, and the rise of hybrid texts. To ensure the robustness of

detection systems, it is essential to employ strategies such as domain-adversarial

training, multilingual training, and incremental learning. By addressing these chal-

lenges, detection systems can be developed that are not only accurate but also

adaptable to the ever-changing landscape of AI-generated content.

2.5 Contrastive Learning for AI-generated Text Detection

Contrastive learning has been extensively used to improve the sentence repre-

sentation by pulling semantically similar sentences closer and pushing dissimilar

ones apart, reorganizing the hidden space of sentence representation by seman-

tics. For example, a representative work SimCSE regarded an input sentence un-

der dropout noise as its semantically similar counterpart (i.e., positive pairs), and

trained the encoder to close their distance. Additionally, their natural language in-

ference pairs, treats entailment pairs as positives and contradiction pairs as hard

negatives [19]. For negative pairs with contradictory meanings, the embeddings

were trained to be distant in the latent space. Similarly, DeCLUTR [23] constr ucted

positive pairs by extracting different spans from the same texts, and negative pairs

CHAPTER 2. BACKGROUND AND RELATED WORKS

are sampled from different texts.

We adopt this philosophy to reorganize the latent space based on writing styles,

pulling human-written texts to cluster together while remaining distant from AI-

generated ones. Analogous to semantic textual similarity tasks, where sentence

similar ity ranges from 0 to 5 to reflect varying degrees of semantic overlap, we

incorporate ordinal regression into our framework to model the degree of human

involvement in collaborative texts, ranging from 0 (fully AI-generated) to 1 (fully

human-wr itten).

DeTeCtive also took advantage of contrastive learning and was used in binary

AI-generated text detection [24]. Based on a multi-task framework, DeTeCtive was

trained to learn writing-style diversity using a multi-level contrastive loss, and an

auxiliary task of classifying the source of a given text (human vs. AI) to capture dis-

tinguishable signals. For inference, the pipeline encoded the input text as a hidden

vector and applied dense retrieval to match the cluster, depending on the stylistic

similar ity with previously indexed training feature database. Additionally, instead

of retraining the model when faced with new data, they encoded new data with the

trained encoder to obtain embeddings, and then added them to the feature database

to augment. This largely improved the generalizability in unseen domains and new

generators.

However, this approach only considers two types of text: human-written and

AI-generated, without further consideration of human-AI collaborative texts. Our

work fills this gap with the goal of improving generalization performance for fine-

grained AI-generated text detection.

2.6 Summary

In this chapter, I have reviewed the foundational concepts and recent advance-

ments in the detection of AI-generated text. I began by examining the backgrounds

of large language models and their increasing use as writing assistants in both aca-

demic and professional contexts. I then discussed the importance of fine-grained

datasets for training detection systems and the challenges associated with gener-

alizing detection methods across domains, languages, and evolving AI models.

Finally, I explored the role of contrastive learning in improving the detection of

AI-generated content. This background provides the basis for the development of

the ScholarSleuth framework, which leverages these techniques to offer a compre-

hensive solution for detecting AI-generated content in academic writing.

CHAPTER 3. METHODOLOGY

3.1 Overview

This chapter presents the overall FAID architecture and the methodology for de-

tecting and analyzing AI-generated texts with an emphasis on generalization and

robustness to unseen data. I begin by exploring stylistic and semantic consistencies

among outputs of different LLM families, establishing the foundation for author-

ship attribution and classification. Through empirical analysis, I identify distinct

family-specific traits, justifying the use of family-level labels.

Next, I evaluate a variety of sentence encoders to select the optimal model for

my task. Building on this, I introduce a multi-level contrastive learning framework

that leverages hierarchical relationships between text sources and model families

to enhance representation learning. To further improve performance and encourage

broader generalization, I incorporate a multitask auxiliary classifier trained with

cross-entropy loss alongside the contrastive objective.

Finally, to handle texts from unseen models or domains without requiring re-

training, I propose an adaptive inference pipeline that combines the trained encoder

with a vector database and Fuzzy k-Nearest Neighbors for robust similar ity-based

classification. This strategy significantly improves resilience to distributional shifts

in real-world deployment scenarios.

3.2 Overview of FAID Architecture

In this work, I formulate the task as a three-class classification problem: human-

written, AI-generated, and human-AI collaborative texts. The human-AI collabora-

tive categor y involves a range of interaction between humans and AI systems, such

as human-written then AI-polished, human-initiated then AI-continued, human-

written then AI-paraphrased, AI-generated then AI-humanized, etc.

Given the growing variety and complexity (e.g., deeply mixed text) of collab-

orative patterns, I do not exhaust all patterns. Instead, I consolidate all forms of

human-AI collaboration into a single label to maintain practical simplicity and

model generalizability. This approach reflects real-world usage, where using AI

tools to enhance clarity or expression is increasingly common and often ethically

acceptable.

I also investigate the dataset and find that the AI models within the same family

tend to have similar writing styles and text distributions, due to their shared training

data and architecture, which is described more detailed in Section 3.3.2. Therefore,

CHAPTER 3. METHODOLOGY

I consider each model’s family to be an author with unique writing styles.

I aim to enable the encoder to capture multi-level similarities between authors

(AI, human and human-AI) as Equation 3.6, forcing model to learn features that can

differentiate distinct authors in a high-dimensional hidden vector space. I denote

as cosine similarity, ϕ(·) is the encoder function and P

, P

, (1 ≤ i ≤ j < 5) are

distributions of different text subsets. I have:

x∈P

,y∈P

(ϕ(x), ϕ(y))] ≥ E

x∈P

,y∈P

j+1

(ϕ(x), ϕ(y))] (3.1)

where P

corresponds to the distribution generated by a particular LLM family,

to the distribution generated by any LLMs, P

to the distribution of a collabo-

rative text generated by humans and the LLM family of P

, P

to the distr ibution

of collaborative text generated by humans and any LLM families, and P

to the

distribution of human-written text.

To clarify the rationale for configuring FAID to expect that the similarity of a

text x (from lower-level distributions P

or P

) with samples from P

is generally

greater than or equal to its similarity with samples from P

, consider the following.

• If x ∈ P

(x is generated by a particular LLM): Let y

LHS

be drawn from P

and y

RHS

be drawn from P

. Naturally, the similarity S

(x, y

LHS

) should be

greater than S

(x, y

RHS

). This is because P

contains texts that share a direct

LLM family origin with x.

• If x ∈ P

(x is generated by any LLM): In this scenario, with y

LHS

from P

and

RHS

from P

as defined above, the similarity S

(x, y

LHS

) is generally expected

to be equal to S

(x, y

RHS

). Since x can originate from any LLM, it does not

inherently possess a stronger connection to the specific LLM family in P

than

to the broader human-LLM collaborations represented in P

This configuration aims to ensure that closeness in distribution corresponds to

heightened similarity after encoding, encouraging the model to discern fine-

grained multi-level relations.

3.3 Training Architecture

3.3.1 Architecture Design

Leveraging multi-level contrastive learning loss, I fine-tune a language model

based on the human, human-AI, and AI texts to reorganize the embedding space.

The goal is to pull embeddings of the same authorship category closer and push

apart embeddings of different origins. This method allows the encoder to capture

CHAPTER 3. METHODOLOGY

Encoder

(XLM-RoBERTa)

AI-generated

Human

GPT

LLaMA

Human +

Deepseek

Human-Written

Human-AI collaboration

Push

Text Embeddings

Differentiating Authors Via Contrastive Learning

Pull

Push

Figure 3.1: Training architecture. Leveraging multi-level contrastive learning loss, we fine-

tune a language model based on the human, human-AI and AI texts, to force the model to

reorganize the hidden space, pulling the embeddings within the same author families closer,

and pushing the embeddings from different authors farther. We train an encoder that can

represent text with distinguishable signals to discern authorship of text.

nuanced differences across the three major classes: human-written, human-AI col-

laborative, and AI-generated.

As illustrated in Figure 3.1, three categories of texts are first passed through a

shared encoder, specifically XLM-RoBERTa [45], which is selected for its multi-

lingual capability and robust per formance on diverse datasets. The encoder trans-

forms the input into dense vector representations. These embeddings are then used

to compute contrastive loss by comparing intra-class and inter-class distances.

On the right side of the figure, the visualized embedding space demonstrates

the desired outcome: embeddings from the same author group (e.g., Human, GPT,

Llama, or Human + DeepSeek) are pulled together, while embeddings from dif-

ferent groups are pushed away. For example, human-only embeddings are grouped

separately from those of GPT or Llama. Meanwhile, collaborative texts such as

Human + DeepSeek are positioned between AI-only and Human-only clusters, re-

flecting their hybrid nature.

This structure not only supports three-way classification, but also enables deeper

analysis by identifying subtle signals of AI involvement, supporting downstream

tasks like generator attribution and authorship clustering.

More detailed information about the model selection process is provided in Sec-

tion 4.4.

3.3.2 What is LLM family?

Here, I define a LLM family is all the models from a company, but with different

model generations. For example: GPT-4 and GPT-4o both come from OpenAI, but

GPT-4o is newer than GPT-4, so they are called same family. This assumption was

made because different generations in a same family have shared training data,

CHAPTER 3. METHODOLOGY

shared architectures, guiding styles, etc.

To confirm the assumption, assess the stylistic and semantic consistency of AI-

generated texts, I conducted a comprehensive analysis using various perspectives:

N-grams distribution, text length patterns, and semantic embedding visualization in

Section 4.5. These analyses allow me to examine similarities both within the same

model family and across different model families, leading to a robust understanding

of LLM "authorship" characteristics.

3.3.3 Multi-level Contrastive Learning

Given a dataset with N samples, the i

sample is denoted as T

and assigned

with three-level labels indicating its source.

• x: if T

is fully AI-generated, x

= 0, otherwise x

= 1;

• y: if T

is fully human-written, y

= 0, otherwise y

= 1;

• z: an indicator of a specific LLM family.

The encoder ϕ(·) represents the sample T

in a d-dimensional vector space R

Then I calculate the cosine similarity between two samples T

and T

, denoted by:

σ(i, j) = S

(ϕ(T

), ϕ(T

)).

For AI-generated text (x = 0), the similarity between T

and another AI-

generated text T

is greater than that with a human-written or collaborative text

(x = 1):

σ(i, j) > σ(i, k), ∀x

= 0, x

= x

, x

= 1 (3.2)

If x = 0, then the text is fully AI-generated. For this case, I do not consider

y since AI-generated text is considered as non-AI-collaboration. I can imply that

the similarity between two texts written by the same AI is higher than that of two

different AI families. Hence:

σ(i, j) > σ(i, k), ∀x

= 0, z

= z

, z

= z

(3.3)

The reverse condition (with y = 1) is also true. I can conclude that:

σ(i, j) > σ(i, k), ∀x

= 1, y

= y

, y

= y

(3.4)

For cases where all samples are human-AI collaborative texts (x, y = 1), two

texts with involvements of the same AI family tend to be more similar than the ones

CHAPTER 3. METHODOLOGY

that involve contributions from different families. That is:

σ(i, j) > σ(i, k), ∀x

= 1, y

i,j,k

= 1, z

= z

= z

(3.5)

Combining all, the text representation is learned with the following constraints:











σ(i, j) > σ(i, k), ∀x

= 0, x

= x

= x

;

σ(i, j) > σ(i, k), ∀x

= 0, z

= z

= z

;

σ(i, j) > σ(i, k), ∀x

= 1, x

= x

= x

;

σ(i, j) > σ(i, k), ∀x

= 1, y

= y

= y

;

σ(i, j) > σ(i, k), ∀x

= 1, y

= y

= 1, z

= z

, z

= z

(3.6)

To enforce the similarity constraints outlined in Equation 3.6, I build upon the

SimCLR framework [10] and introduce a strategy for defining both positive and

negative sample pairs, which forms the basis of my contrastive learning loss. De-

parting from traditional contrastive losses that rely on a single positive sample,

my approach considers a group of positive instances that satisfy specif ic criteria.

The similarity between the anchor and positive samples is computed as the average

similar ity across this entire positive set. For negative samples, I follow the method-

ology used in SimCLR. The resulting contrastive loss, expressed in Equation 3.7,

involves q as the anchor sample, K

as the positive sample set, K

−

as the negative

sample set, τ as the temperature parameter, and N

as the number of positive

samples.

= − log

exp

k∈K

S(q,k)

exp

k∈K

S(q,k)

k∈K

−

exp



S(q,k)



(3.7)

Different constraints result in varied sets of positive and negative samples.

Based on these sets, contrastive losses are computed at multiple levels. As I de-

clared in Equation 3.7, each inequality in Equation 3.6 is denoted as L

,ε

where

ε =

1, 5 respectively. To form Equation 3.8, I need to add the coefficients α, β, γ, δ,

and ζ to maintain the balance of multi-level relations.

mcl

i=1

(αL

+ βL

) + (1 − x

) (γL

+ δL

+ ζL

)] (3.8)

CHAPTER 3. METHODOLOGY

Therefore, due to the last inequality in Equation 3.6 only specifying a case (y =

1), and the other cases considering both values for y, so I have ζ = 2γ = 2δ. Also,

to maintain the equilibrium, I need to keep α + β = γ + δ + ζ. I set γ = δ = 1,

then ζ = 2, α = β = 2.

This approach encourages the model to capture subtle and detailed features from

different sources. As a result, the model becomes more adept at recognizing vari-

ations in writing styles. This capability enhances its accuracy and strengthens its

ability to generalize when detecting AI-generated text.

3.3.4 Multitask Auxiliary Learning

Multi-task learning [8] allows a model to learn several tasks concurrently by

shar ing relevant information across them. This joint learning process helps the

model develop more general and distinctive features. As a result, it improves the

model’s ability to generalize to new data. Building on the previously described con-

trastive learning framework, I extend the encoder by attaching an MLP classifier at

its output layer. This classifier is responsible for binary classification, determining

whether a given quer y text was written by a human or generated by a LLM. Denote

that the probability of i

sample with label x

= 0 is p

. To train this component, I

apply a cross-entropy loss function, denoted as L

, defined as follows:

= −

i=1

log p

+ (1 − x

) log (1 − p

) (3.9)

Therefore, the overall loss is computed as:

L = L

+ L

mcl

(3.10)

3.4 Inference Architecture

3.4.1 Architecture Design

The inference architecture as shown in Figure 3.2 is composed of three sequen-

tial stages that leverage a frozen encoder and a vector-based retrieval system to

generalize beyond training distributions:

(a) Embed the input text into embedding vectors using the fine-tuned en-

coder: All incoming texts – including human-written, human-AI collabora-

tive, AI-generated, and unseen data – are first processed by the previously

fine-tuned XLM-RoBERTa encoder. This encoder remains frozen during in-

ference to ensure stable representation learning.

CHAPTER 3. METHODOLOGY

Encoder

(XLM-RoBERTa)

Encoder

(XLM-RoBERTa)

Human

Human-written

Human-AI Collab.

Unseen Data

LLaMA

Human

Deepseek

GPT

AI-generated

Human+GPT

LLaMA

Human+GPT

GPT

Deepseek

Unseen data

(a) Encode Input Texts to Embedding Vectors

Dataset

Vector Database with Fuzzy K-Nearest Neighbor

Vector

Database

Figure 3.2: Inference architecture. (a) embed the input text into embedding vector using

the fine-tuned encoder; (b) use Fuzzy KNN algorithm to cluster, retrieving which cluster

the input text belongs to. (c) The stored vector database VD was created by saving all

embeddings of texts in training and validation sets using the fine-tuned encoder. If the

input text is unseen, we embed the unseen text and save into a temporary vector database

′

, enhancing the generalization of the detector.

(b) Cluster and classify using a fuzzy k-Nearest Neighbors retrieval mecha-

nism: The encoded vectors are compared against those stored in a pre-built

vector database VD, using a fuzzy KNN algorithm. This step helps identify

the closest author-style cluster for the input, even if it comes from an unseen

combination or slightly novel stylistic pattern. The fuzzy clustering allows

some flexibility in ambiguous cases, especially for hybrid or borderline texts

such as human+GPT or human+DeepSeek.

vector database VD is built from embeddings of all labeled training and vali-

dation data, including diverse author types (Human, GPT, Llama, DeepSeek,

Human+GPT, etc.). For unseen inputs during evaluation, an auxiliary tem-

porary database VD

′

is instantiated. This buffer enables dynamic general-

ization by maintaining embeddings of new, previously unobserved samples,

thereby improving clustering performance and robustness on domain-shifted

or author-shifted examples.

This architecture supports not only reliable authorship classification but also

fine-grained cluster ing across AI model families and collaborative writing types.

The structured use of vector search and embedding consistency enables better in-

terpretability and scalability for real-world deployment scenarios.

More detailed information about the selection and justification of the fuzzy K-

Nearest Neighbor algorithm can be found in Section 3.4.2.

CHAPTER 3. METHODOLOGY

3.4.2 Handling Unseen Data without Retraining

Unseen data, whether from an unseen domain or generated by an unfamiliar

model, remains a significant challenge even for state-of-the-art AI-generated text

detection methods, due to the increasing capabilities of large language models.

I tried to use the model classifier on its own, but I ended up by using a vector

database along with Fuzzy k-Nearest Neighbors, which is illustrated in Figure 3.2

along with detailed experiments in Section 4.6. Specifically, when dealing with

unseen data, I use my trained model to embed these samples and add them to my

existing vector database. Through careful parameter tuning, this approach enables

my system to effectively handle newly encountered unseen data without the need

retraining.

3.4.3 Fuzzy k-Nearest Neighbors with Softmax Weighting

To improve the robustness and accuracy of nearest neighbor-based classifica-

tion, we adopt a fuzzy k-Nearest Neighbors algorithm that incorporates two key

mechanisms: (i) softmax-based similarity weighting, and (ii) fuzzy voting based

on type c ompatibility. This approach allows the model to account for both seman-

tic similarity and categorical proximity, addressing limitations of simple majority

voting in traditional kNN.

Softmax Similarity Weighting. Given a query instance, we first retrieve its k

nearest neighbors based on a distance or similarity score s

(e.g., cosine similarity).

Rather than treating each neighbor equally, we apply a softmax transformation to

assign a normalized weight to each neighbor based on its relative similarity:

Softmax(s

, T ) =

−max(s)

j=1

−max(s)

(3.11)

where s

is the similarity score of the i

neighbor, T is a temperature parameter

controlling the sharpness of the distribution, and n is the number of neighbors. Sub-

tracting max(s) improves numerical stability. A lower T results in a more peaked

distribution, amplifying the influence of the most similar neighbors, while a higher

T produces a more uniform weighting.

Fuzzy Voting via Boost Factors. Once the initial prediction q is derived from the

nor mal voting of neighbors (voted by majority), we refine the voting process by in-

troducing a boost factor that adjusts the contribution of each neighbor based on the

compatibility between its true label c and the predicted label q. The intuition is to

CHAPTER 3. METHODOLOGY

give more influence to neighbors whose labels are either exactly or approximately

compatible with the initial prediction:

Boost(q, c) =











1.3 if q = c,

1.1 if |q − c| = 1,

0.9 if |q − c| = 2,

0.8 otherwise.

(3.12)

Here, q is the initial predicted class, and c is the ground-truth label of the neigh-

bor. The boost factor reflects the degree of semantic or categorical proximity be-

tween the predicted and actual labels. For instance, if the classes represent ordinal

categor ies or semantic types, a small numerical difference between q and c implies

partial compatibility.

Final Weighted Contribution. Each neighbor’s final influence on the prediction

is computed as the product of its softmax-der ived weight and its boost factor:

Contr ibution = Softmax(s

, T ) × Boost(q, c). (3.13)

This mechanism ensures that neighbors who are both highly similar to the query

(high weight) and type-compatible (high boost) exert stronger influence on the final

decision. It mitigates the risk of over-relying on semantically distant but technically

close neighbors.

This hybrid approach offers several advantages over traditional kNN:

• It emphasizes semantically similar neighbors through the softmax function.

• It incorporates domain knowledge of class relationships via the boost factor.

• It allows smooth generalization in the presence of noisy or ambiguous class

boundar ies.

Overall, fuzzy kNN with softmax weighting provides a more nuanced and

context-aware classification mechanism, particularly useful in tasks involving hi-

erarchical labels, ordinal types, or semantically correlated categories.

3.5 Summary

In this chapter, I presented a comprehensive methodology and framework for

fine-grained AI-generated text detection, focusing on the identification of stylistic

and semantic traits among model families. My analysis established the presence of

CHAPTER 3. METHODOLOGY

family-level consistency in text length, lexical usage, and embedding structure –

motivating my treatment of each LLM family as a unified authoring entity.

I identified the unsupervised SimCSE XLM-RoBERTa-base encoder as the op-

timal backbone for my system, offering superior performance on both known and

unseen LLM outputs. I then proposed a multi-level contrastive learning framework

that enforces nuanced relational constraints among text sources and model families,

effectively enhancing feature representation.

To further support classification, I added a multitask auxiliary loss and intro-

duced a robust inference mechanism using a vector database and Fuzzy KNN, en-

abling the model to generalize to texts from previously unseen generators. These

innovations collectively contribute to a flexible and scalable framework capable

of reliable authorship detection in diverse and evolving LLM-generated content

environments.

CHAPTER 4. EXPERIMENTS AND NUMERICAL RESULTS

4.1 Overview

This chapter presents the experimental validation of the FAID model for fine-

grained AI-generated text detection. I begin by detailing the datasets used in the ex-

per iments, including both in-domain and unseen-domain sources, along with syn-

thetic extensions using a variety of large language models. I then introduce three

baseline methods for comparison: LLM-DetectAIve, SVM with N-gram features,

and T5-Sentinel.

Two primary tasks are evaluated: (1) classifying text as human-written, AI-

generated, or human-AI collaborative; and (2) identifying the specific AI generator

responsible for a text. The experiments are conducted under four evaluation condi-

tions: in-domain with known generators, unseen domains, unseen generators, and

a combination of unseen domains and unseen generators.

4.2 FAIDSet

4.2.1 Overview

To fill the gap of fewer publicly-available multilingual fine-grained AI-

generated text datasets, I collected a multilingual, multi-domain and multi-

generator AI-generated text dataset FAIDSet. It contains texts generated by LLMs,

wr itten by humans and collaborated by both humans and LLMs, resulting in a total

of ∼84,000 examples.

FAIDSet covers two domains: student theses and paper abstracts, where identi-

fying authorship is critical, across two languages including Vietnamese and En-

glish. I collected student theses from the database of Hanoi University of Sci-

ence and Technology, and paper abstracts from arXiv and Vietnam Journals Online

(VJOL).

FAIDSet contributes not only a large-scale resource for training and evaluation

but also a methodological foundation for future research into multilingual, fine-

grained authorship attribution in the age of large language models.

4.2.2 The Formulation of Human-written Dataset

To build a reliable benchmark for AI-generated text detection, I first curated a

high-quality set of human-written texts across two languages and domains. The

goal was to ensure that these texts reflect genuine human authorship without AI

assistance.

CHAPTER 4. EXPERIMENTS AND NUMERICAL RESULTS

Source No. Human texts

arXiv abstracts 2,000

VJOL abstracts 2,195

HUST theses (English) 4,898

HUST theses (Vietnamese) 11,164

Table 4.1: Statistics of human-written text’s origins in FAIDSet.

I collected undergraduate theses written in Vietnamese and English from the

Hanoi University of Science and Technology (HUST). These documents span a

diverse range of disciplines including computer science, mechanical engineering,

economics, and more. To ensure quality and consistency, I selected theses that

passed internal university review processes and excluded documents with notice-

able signs of AI assistance.

For paper writing domain, I sourced human-written English abstracts from

arXiv, a well-known open-access repository of scientific papers. Only papers pub-

lished prior to 2022 were selected to minimize the chance of LLM involvement

dur ing writing. For Vietnamese abstracts, I collected abstracts from Vietnam Jour-

nals Online

, which indexes peer-reviewed academic journals across various scien-

tific disciplines. All abstracts were manually verified to ensure they do not include

AI-generated segments.

The resulted dataset for human-written labels is described in Table 4.1. These

texts serve as critical anchors in my dataset to help train and evaluate AI-generated

text detection models with greater reliability and linguistic diversity.

4.2.3 Dataset Statistics

I used the following multilingual LLMs families to produce both AI and human-

AI collaborative texts: GPT-4/4o, Llama-3.x, Gemini 2.x, and DeepSeek V3/R1.

Regarding the human-AI collaborative text, I include AI-polished, AI-continued,

and AI-paraphrased where the models are requested to polish or paraphrase inputs

while ensuring the accuracy of any figures and statistics.

My FAIDSet includes 83,350 examples, which are separated into three subsets:

train, validation, and test with the ratio shown in Table 4.2. The data set also com-

pr ises various sources of human-written texts, which are descr ibed in Table 4.1.

4.2.4 Diverse Prompt Strategies

To avoid biasing my generated corpora toward any single style or topic, I employ

a broad set of prompt templates when synthesizing AI-generated and human-AI

https://vjol.info.vn

CHAPTER 4. EXPERIMENTS AND NUMERICAL RESULTS

Subset Human AI Human-AI collab.

Train 14,176 12,076 32,091

Validation 3,038 2,588 6,876

Test 6,879 2,599 3,038

Table 4.2: Number of examples per label in subsets in FAIDSet.

collaborative texts. By varying prompt structure, content domain, and complexity,

I ensure that the resulting outputs cover a wide distribution of writing patter ns, vo-

cabulary, and rhetorical devices. This diversity helps my detector generalize more

effectively to real-world inputs [52]. Depending on the data source and context, I

craft prompts to create varied outputs suitable for different real-world application

scenar ios. I generate responses with different tones using prompts such as "You are

an IT student..." and "...who are very familiar with abstract writing...".

Concretely, I sample two out of five prompts for AI-generated texts and several

categor ies of human-AI collaborative texts in Table 4.3:

• AI-polished: Texts that were originally written by a human and then lightly

refined by an AI system to improve grammar, clarity, or fluency without alter-

ing the core content or intent.

• AI-continued: Texts where a human wrote an initial portion (e.g., a sentence

or paragraph), and an AI generated a continuation that attempts to follow the

or iginal style, tone, and intent.

• AI-paraphrased: Texts that were originally written by a human and then re-

worded by an AI system to express the same meaning using different phrasing,

possibly altering sentence structure or word choice while preserving the orig-

inal message.

By mixing prompts across these categories, I generate a balanced corpus that

mitigates over-fitting to any one prompt pattern and better reflects the diversity of

real user quer ies.

CHAPTER 4. EXPERIMENTS AND NUMERICAL RESULTS

Category Student Thesis Paper Abstract

AI-generated

• You are an IT student who are

writing the graduation thesis. In

a formal academic tone, write

this section of a thesis on ,

clearly outlining the structure,

then write the passage con-

cisely. The original text: .

• In clear, structured prose, draft

the section for a thesis titled

, cite some related works

you mentioned in the passage

and highlighting the contribu-

tion. The original text: .

• You are a computer scientist

who are very familiar with ab-

stract writing for your works

based on the title. Craft a con-

cise word_count-word ab-

stract for a paper titled ,

summarizing the problem state-

ment, methodology, key find-

ings, and contributions. The

original text: .

• Compose a word_count-

word abstract for the paper ,

ensuring it includes motivation,

approach, results, and implica-

tions for future research. The

original text: .

AI-polished

• You are an IT student who are

refining your work to make

it more complete. Refine the

following section excerpt for

grammar, clarity, and academic

style while preserving its orig-

inal meaning and terminology.

The original text: .

• You are an IT student who are

refining your work to make it

more complete. Enhance the

academic tone, coherence, and

logical flow of this thesis sec-

tion without altering technical

content. The original text: .

• You are a computer scientist

who are very familiar with ab-

stract writing and refining the

written abstract. Improve the

coherence, precision, and for-

mal tone of this draft abstract

without introducing new con-

tent. The original text: .

• Improve the clarity and con-

ciseness of this abstract para-

graph while maintaining all

original findings and terminol-

ogy. The original text: .

Table continued on next page

CHAPTER 4. EXPERIMENTS AND NUMERICAL RESULTS

Table continued from previous page

Category Student Thesis Paper Abstract

AI-continued

• You are an IT student who are

writing your graduation thesis.

Continue to write the section

from this thesis excerpt for ap-

proximately word_count

words, maintaining formal aca-

demic structure and style. The

original text: .

• You are an IT student who are

writing your graduation thesis.

Extend the section by adding

supporting detailed information

for a thesis on . The origi-

nal text: .

• You are a computer scientist

who are very familiar with ab-

stract writing. Add some con-

cise concluding sentences to

this partial abstract that high-

lights implications for future re-

search. The original text: .

• Continue the abstract by writing

a closing statement that under-

scores the study’s contributions

and potential applications. The

original text: .

AI-paraphrased

• Paraphrase the following thesis

section in a clear academic tone,

preserving citations and techni-

cal terms exactly. The original

text: .

• Reword this thesis excerpt to im-

prove readability and maintain

its scholarly voice, keeping all

references unchanged. The origi-

nal text: .

• Rephrase this abstract in formal

academic English, maintaining

all original citations and techni-

cal accuracy. The original text:

• Paraphrase this abstract para-

graph to enhance clarity and

flow, ensuring all technical terms

and citations remain intact. The

original text: .

Table 4.3: Samples of diverse prompt templates used to generate FAIDSet

4.3 Other Datasets

In addition to FAIDSet, for in-domain evaluation, I used another two datasets:

LLM-DetectAIve [3] encompasses various domains including arXiv, Wikihow,

Wikipedia, Reddit, student essays, and peer reviews. The original human-written

and machine-generated texts were augmented using a variety of LLMs (e.g.,

Llama 3, Mixtral, Gemma) to create 236,000-example dataset with two new la-

bels: machine-written then machine-humanized, and human-written then machine-

CHAPTER 4. EXPERIMENTS AND NUMERICAL RESULTS

Model #. Params

Known Generators Unseen Generators

Acc ↑ F1-macro ↑ MSE ↓ MAE ↓ Acc ↑ F1-macro ↑ MSE ↓ MAE ↓

RoBERTa-base 125M 80.09 76.22 0.7328 0.3778 73.45 69.10 0.8901 0.4320

FLAN-T5-base 248M 80.19 75.77 0.7783 0.3947 72.80 68.55 0.9123 0.4467

e5-base-v2 109M 81.53 77.90 0.8023 0.4086 74.21 70.15 0.8804 0.4392

Multilingual-e5-base 278M 91.41 90.82 0.3436 0.1732 85.32 84.50 0.5102 0.2543

XLM-RoBERTa-base 279M 91.90 90.63 0.2345 0.1190 86.75 85.20 0.4125 0.2104

Sup-SimCSE-RoBERTa-base 279M 81.22 78.88 0.7102 0.3619 74.00 71.30 0.8420 0.4251

UnSup-SimCSE-RoBERTa-base 279M 82.19 79.38 0.7156 0.3637 75.10 72.40 0.8305 0.4207

UnSup-SimCSE-XLM-RoBERTa-base 279M 92.12 91.75 0.1904 0.0958 87.45 86.90 0.3507 0.1802

Table 4.4: Encoder selection on known vs. unseen generators. The best results in each

column are in bold.

polished.

HART [6] is a dataset spans domains of student essays, arXiv abstracts, story

wr iting, and news articles. It has four categor ies: human-written texts, LLMs-

refined texts, AI-generated texts, and humanized AI-generated texts. They further

generated 21,500 cases based on this dataset for unbalanced labels.

To evaluate generalization ability of FAID on unseen scenarios, I collected the

following data:

Unseen domain: I create a dataset consisting of 150 IELTS Writing essays from

Kaggle [25], where all original texts are human-written. These essays were then

used to generate human-AI collaborative and AI-generated texts, using the same

models with FAIDSet.

Unseen generators: I selected 150 human-written abstracts from the FAIDSet

test set and generated data for remaining labels using the three new LLM families:

Qwen 2, Mistral, and Gemma 2.

Unseen domain & generators: Based on the human-written IELTS essays

above, instead of the original four LLMs, I used the same LLM families as Un-

seen generator test set to generate data for the AI and human-AI labels, following

the same process as with the unseen domains.

4.4 Encoder Selection for FAID Architecture

To identify the best encoder for my classification task, I evaluated each candi-

date model on both known generators (FAIDSet testing data) and the new unseen-

generator test set introduced in Section 4.3. Each transformer-based encoder was

fine-tuned on FAIDSet training data and then used to predict labels on the two eval-

uation splits. Table 4.4 summarizes the accuracy, F1-macro, Mean Squared Error

(MSE), and Mean Absolute Error (MAE) for each model under both conditions.

Base model comparison. I first evaluated three popular monolingual models:

RoBERTa-base [31], Flan-T5-base [13], and e5-base-v2 [47] using the same train-

CHAPTER 4. EXPERIMENTS AND NUMERICAL RESULTS

ing and evaluation splits. RoBERTa-base and e5-base-v2 achieved the most bal-

anced trade-off between classification accuracy and regression error (MSE, MAE),

while Flan-T5-base lagged slightly in F1-macro. These results indicated that a

stronger encoder backbone yields more robust performance and motivated explo-

ration of multilingual variants for further gains.

Multilingual variants. Next, I experimented with XLM-RoBERTa-base [14]

and Multilingual-e5-base [48]. Both models leverage cross-lingual pretraining,

which in my scenario appears to improve representation of diverse linguistic pat-

terns present in FAIDSet. In particular, XLM-RoBERTa-base delivered a substan-

tial boost in all metrics, suggesting that its multilingual training enhances general-

ization even on primarily monolingual inputs.

Contrastive learning with SimCSE. Finally, I integrated contrastive learning

via SimCSE [19] to further refine the sentence embeddings. I tested both super-

vised (trained on NLI data) and unsupervised (trained on Wikipedia corpus) Sim-

CSE variants applied to RoBERTa-base. The unsupervised version outperformed

its supervised counterpart, conf irming prior findings that unsupervised SimCSE

often yields stronger semantic encoders for downstream tasks. Encouraged by

this, I adopted the unsupervised SimCSE approach on XLM-RoBERTa-base [45],

which achieved the highest accuracy and lowest error rates across all configura-

tions.

Based on these experiments, I selected the unsupervised SimCSE

XLM-RoBERTa-base model for my final system.

4.5 Analysis on Text generated by Different AI Families

4.5.1 Text Distribution between AI Families

I first examined the distribution of text length, measured in both word and

character counts across outputs from five AI models: Llama-3.3-70B-Instruct-

Turbo [32], GPT-4o-mini [38], Gemini 2.0, Gemini 2.0 Flash-Lite, and Gemini 1.5

Flash [20]. Using 2000 arXiv prompt seeds, each model generated a corresponding

output, and their length distributions were plotted.

From Figure 4.1, distinct patterns emerged between families. Gemini models

consistently produced shorter and more compact outputs, while Llama and GPT

models demonstrated more variance and a tendency toward longer completions.

Despite using different versions (e.g., Gemini 2.0 vs. Gemini 1.5 Flash), the Gem-

ini outputs remained closely grouped in terms of both word and character counts,

suggesting a shared output generation strategy and stylistic consistency across the

family. In contrast, Llama and GPT distributions showed greater separation and

CHAPTER 4. EXPERIMENTS AND NUMERICAL RESULTS

(a) Word length distr ibution across five AI models.

(b) Character length distribution across five AI models.

Figure 4.1: Text length distributions in words and characters across Llama-3.3, GPT-

4o/4o-mini, Gemini 2.0, Gemini 2.0 Flash-Lite, and Gemini 1.5 Flash.

CHAPTER 4. EXPERIMENTS AND NUMERICAL RESULTS

variability.

This reinforces the hypothesis that text length behavior is not only model-

dependent but also family-coherent, with Gemini models forming a distinct cluster.

4.5.2 Text Distribution between AI Models within the Same Family

I performed N-g rams frequency analysis on responses generated by three mod-

els within the Gemini family: Gemini 2.0, Gemini 2.0 Flash, and Gemini 1.5 Flash

using 500 texts from the arXiv abstract dataset. Figure 4.2 highlights overlapping

high-frequency tokens and similar patterns in word usage and phrase structure

among the three models.

Despite minor differences in architectural speed (e.g., Flash vs. regular) or re-

lease chronology, the N-grams distributions show minimal divergence. Frequently

used tokens, such as domain-specific terms and transitional phrases, appeared with

nearly identical frequencies. This suggests that these models share similar decod-

ing strategies and training biases, likely due to shared pretraining corpora and op-

timization techniques resulting in highly consistent stylistic patterns.

These intra-family similarities support treating model variants within a family

as a unified authoring entity when performing analysis or authorship attribution.

4.5.3 Embedding Visualization and Semantic Cohesion

To delve into deeper semantic alignment, I visualized embeddings generated by

unsuper vised SimCSE XLM-RoBERTa-base model of generated texts from two

model families: Gemini family (texts generated by Gemini 2.0, Gemini 2.0 Flash-

Lite, and Gemini 1.5 Flash), and GPT-4o/4o-mini using PCA to reduce the high-

dimensional embeddings into a space.

As shown in Figure 4.3, Gemini model embeddings form tight, overlapping

clusters, indicating a high degree of semantic cohesion among their outputs. This

clustering remains consistent across both sample sizes of 2000 texts. In contrast,

GPT-4o/4o-mini embeddings occupy a separate region of the space, with greater

dispersion and less overlap with Gemini clusters.

This visualization confirms that the Gemini family not only shares stylistic fea-

tures but also maintains semantic coherence, distinguishing it from models of other

families at a conceptual level.

4.5.4 Key Findings

The consistency observed across N-grams distributions, text length patterns, and

semantic embeddings among Gemini models substantiates my decision to treat

each AI family as an author. These models demonstrate coherent writing styles,

CHAPTER 4. EXPERIMENTS AND NUMERICAL RESULTS

(a) Top 3-grams of Gemini 2.0 (500 samples).

(b) Top 3-grams of Gemini 2.0 Flash (500 samples).

Figure 4.2: Top 20 most common 3-gram from Gemini 2.0, Gemini 2.0 Flash-Lite, Gemini

1.5 Flash using 500 sample prompts.

CHAPTER 4. EXPERIMENTS AND NUMERICAL RESULTS

(a) 2D embedding space.

(b) 3D embedding space.

Figure 4.3: Visualizations showing clustering behavior of Gemini model family (Gemini

2.0, Gemini 2.0 Flash-Lite, Gemini 1.5 Flash) and GPT-4o/4o-mini using 2D and 3D em-

beddings with sample size of 2000 texts.

CHAPTER 4. EXPERIMENTS AND NUMERICAL RESULTS

Algorithm

Known generators Unseen generators

Accuracy ↑ F1-macro ↑ Accuracy ↑ F1-macro ↑

k-Nearest Neighbors 90.52 90.21 85.37 84.95

k-Means 88.13 87.48 80.22 79.81

Fuzzy k-Nearest Neighbors 95.18 95.05 93.31 93.25

Fuzzy C-Means 92.67 92.31 90.04 89.53

Table 4.5: Comparison of clustering algorithms on known vs. unseen generators. The best

results are shown in bold.

shared lexical preferences, and tightly clustered semantic representations - hall-

marks of unified authorship. Conversely, inter-family comparisons show clear sep-

arability, emphasizing the distinctiveness of each AI model family’s writing behav-

ior.

4.6 How can We deal with Unseen Data?

4.6.1 The Need to Use Vector Database

I first used the trained model to classify the text and observed a performance

decline from 92.12% to 87.45% when evaluating on unseen models, which is re-

flected in Table 4.4. This drop reflects the model’s limited ability to generalize

beyond the distribution of its original training data. To address this, I propose inte-

grating a vector database that stores dense embeddings of all collected examples,

both training and incoming unlabeled data.

By indexing and retrieving semantically similar examples at inference time, the

vector database serves as a flexible, scalable memory that bridges gaps between

training and test distributions, enhancing the classifier’s resilience to domain shifts

and unseen generators. Specifically:

• Robust Domain Adaptation: New inputs are matched against a broad, con-

tinuously growing repository of embeddings, allowing the classifier to lever-

age analogous instances from related domains without full retraining.

• Generator-Independent Coverage: As novel text generators emerge, their

embeddings populate the database; retr ieval naturally adapts to new styles or

patterns by finding the closest existing vectors.

4.6.2 Clustering Algorithm Selection

To improve my detector’s robustness against unseen domains and generators,

I evaluated four clustering strategies on the encoded vectors stored in my vec-

tor database. Each algorithm was tasked with grouping text samples into human-

wr itten, AI-generated, and human-AI collaborative categories, using both known-

CHAPTER 4. EXPERIMENTS AND NUMERICAL RESULTS

generator data (held-out from training) and entirely unseen generator data. Perfor-

mance was measured in terms of accuracy and F1-macro score.

I encoded each text sample using my pretrained sequence-classification model’s

penultimate layer, then applied clustering within the vector database to assign soft

or hard cluster labels corresponding to the three text-origin classes. From the re-

sults in Table 4.5, I draw some key findings:

• Traditional algorithms show reasonable performance on held-out known

generators but deg rade notably on unseen generators.

• Fuzzy C-Means leverages membership degrees to handle overlapping distri-

butions, improving both accuracy and F1-macro by about 4% over k-Means,

with smaller degradation on unseen data.

• Fuzzy KNN combines local neighbor information with fuzzy membership,

achieving the best overall performance: 95.2% accuracy and 95.1% F1-macro

on known generators, and 93.3% accuracy and 93.3% F1-macro on unseen

generators.

Given its super ior ability to adapt to novel domains and generators through

weighted neighbor voting and soft cluster assignments, I adopt Fuzzy k-Nearest

Neighbors as the clustering component in my full architecture.

4.7 Baseline Models for FAID Evaluation

LLM-DetectAIve [3]: Fine-tuned a RoBERTa [31] on fine-grained English AI-

generated texts with learning rate of 2 × 10

−5

for 10 epochs.

Support Vector Machine with N-grams [39]: Utilized SVM with N-gram

features for binary AI text detection (Human vs. AI). I adapted this approach to

a three-class detection. I first generated word-level trigram features using Count

Vectorizer, and then used a Linear SVM to train with the gold labels to guide the

classification.

T5-Sentinel [11]: Utilized T5-Sentinel to classify text into two labels (Hu-

man vs. AI). This approach leverages the T5 model to predict the conditional

probability of the next token, integrating the classification task into a sequence-

to-sequence completion framework. This unique method directly leverages the

strength of encoder-decoder model, which are trained to understand context and

generate coherent text.

CHAPTER 4. EXPERIMENTS AND NUMERICAL RESULTS

4.8 Experimental Settings to evaluate FAID Architecture performance

I evaluate the performance of my proposed method FAID and baseline models

under two main tasks: (1) fine-grained classification of text into three categories

– human-written, AI-generated, and human-AI collaborative, and (2) identification

of the specific AI generator responsible for the text.

For in-domain evaluation, models are trained and tested on splits of the same

dataset, using FAIDSet, LLM-DetectAIve, and HART. These datasets include ex-

amples generated by known LLMs that are also available during training.

For out-of-domain generalization, I assess the detectors trained on FAIDSet

and test them on unseen datasets constructed to simulate three settings:

• Unseen domain: where text originates from a new domain (IELTS essays).

• Unseen generators: where texts are generated by previously unseen LLMs

(Qwen 2 [40], Mistral [27], Gemma 2 [21]).

• Unseen domain & generators: combining both shifts.

Each model is evaluated on both classification and generator identification

tasks. For the three-class classification task, I report Accuracy, Precision, Recall,

F1-macro, MSE, and MAE). For the generator identification task, I classify each

text into one of five categories (Human, GPT, Gemini, DeepSeek, Llama) and re-

port Accuracy, Precision, Recall, and F1-macro.

This experimental setup enables a comprehensive assessment of model perfor-

mance in both familiar and challenging generalization scenarios.

4.9 Evaluation Metrics for Evaluation Process

To evaluate the performance of text classification in this thesis, we adopt several

standard metrics, including Accuracy, Precision, Recall, F1 Score, Mean Squared

Error and Mean Absolute Error. Below are the definitions and formulas for each

metr ic:

Accuracy: Accuracy measures the proportion of correct predictions among the

total number of cases examined. It is defined as:

Accuracy =

T P + T N

T P + T N + F P + F N

(4.1)

where T P is true positives, T N is true negatives, F P is false positives, and F N

is false negatives.

Precision: Precision quantifies the number of true positive predictions made out

CHAPTER 4. EXPERIMENTS AND NUMERICAL RESULTS

of all positive predictions (both tr ue and false). It is defined as:

Precision =

T P

T P + F P

(4.2)

Recall: Recall (also known as sensitivity or true positive rate) measures the

number of true positives identified out of all actual positives. It is defined as:

Recall =

T P

T P + F N

(4.3)

F1 Score: The F1 Score is the harmonic mean of precision and recall. It bal-

ances the trade-off between the two, especially in imbalanced datasets. It is calcu-

lated as:

F1 Score = 2 ×

Precision × Recall

Precision + Recall

(4.4)

Because the text distribution in each class is put in a natural order – with human-

wr itten texts being the most sparse, followed by human-AI collaborative, and AI-

generated being the most frequent – we denote the labels as: human-written = 0,

human-AI collaborative = 1, and AI-generated = 2. This numerical encoding re-

flects the semantic distance among the categories and allows the use of regression-

based error metrics (MSE and MAE) to not only capture misclassifications but

also to penalize confusion between more distant classes (e.g., mistaking human for

AI) more heavily than between adjacent classes (e.g., human-AI vs. AI). This setup

aims to minimize the severity of such confusion, encouraging more context-aware

classification.

Mean Squared Error: MSE measures the average squared difference between

predicted and actual values. It penalizes larger errors more heavily and is defined

as:

MSE =

i=1

− ˆy

)

(4.5)

where y

is the true value, ˆy

is the predicted value, and n is the number of data

points.

Mean Absolute Error: MAE measures the average absolute difference between

predicted and actual values, providing a more interpretable metric in terms of units.

It is defined as:

MAE =

i=1

− ˆy

| (4.6)

These metrics together provide a comprehensive evaluation of the models used

CHAPTER 4. EXPERIMENTS AND NUMERICAL RESULTS

Scenario Dataset Detector Accuracy ↑ Precision ↑ Recall ↑ F1-macro ↑ MSE ↓ MAE ↓

In-domain,

Known generator

FAIDSet

LLM-DetectAIve 94.34 94.45 93.79 94.10 0.1888 0.1107

SVM (N-grams) 84.41 84.53 82.93 83.62 0.5630 0.2916

T5-Sentinel 93.31 94.92 93.10 93.15 0.2104 0.1101

FAID 95.58 95.78 95.33 95.54 0.1719 0.0875

LLM-DetectAIve

LLM-DetectAIve 95.71 95.78 95.72 95.71 0.1606 0.1314

SVM (N-grams) 83.43 76.09 71.84 73.60 0.2878 0.2064

T5-Sentinel 94.77 94.70 92.60 93.60 0.1663 0.1503

FAID 96.99 95.29 88.14 91.58 0.1561 0.0754

HART

LLM-DetectAIve 94.39 94.25 94.33 94.29 0.3244 0.1789

SVM (N-grams) 58.93 59.75 61.07 60.30 1.2334 0.6849

T5-Sentinel 86.68 87.25 87.69 87.38 0.4339 0.2334

FAID 96.73 97.61 98.05 97.80 0.4631 0.1806

Unseen domain,

Unseen generator

Unseen domain

LLM-DetectAIve 52.83 47.31 64.62 53.28 0.4733 0.4722

SVM (N-grams) 39.28 43.13 29.41 22.77 0.7689 0.6611

T5-Sentinel 55.56 49.54 66.67 55.34 0.4444 0.4444

FAID 62.78 70.73 71.77 69.46 0.4514 0.4486

Unseen

generators

LLM-DetectAIve 75.71 73.25 75.63 74.30 0.3714 0.2957

SVM (N-grams) 74.19 61.88 50.17 55.17 0.4324 0.3162

T5-Sentinel 85.95 85.77 84.59 85.16 0.3648 0.2419

FAID 93.31 92.40 94.44 93.25 0.1691 0.1167

Unseen domain

and Unseen

generators

LLM-DetectAIve 62.93 66.74 71.17 61.97 0.4479 0.3964

SVM (N-grams) 34.50 39.38 26.67 21.70 0.9186 0.7429

T5-Sentinel 57.07 49.82 66.61 55.45 0.4314 0.4300

FAID 66.55 74.44 73.57 72.58 0.3939 0.3167

Table 4.6: Perfor mances of detectors with three labels. The best results are in bold and the

second best are underlined.

in this work, covering both classification effectiveness and regression prediction

accuracy.

4.10 Whether the text is Human, AI, or Human-AI?

Table 4.6 shows the performance of FAID and three baselines in four evaluation

settings. FAID consistently achieves the best accuracy in (i) in-domain and known

generators, (ii) unseen domain, (iii) unseen generators, and (iv) unseen domain &

generators settings, followed by LLM-DetectAIve in (i) and (iv), and T5-Sentinel

in (ii) and (iii). Taking advantage of pretrained language models (i.e., RoBERTa,

T5), they outperform SVM in most cases, particularly on HART dataset.

FAID’s approach enhances the generalization performance over unseen domains

and generators, compared with other methods. From the absolute accuracy, I can

find that generalizing to (ii) unseen domains and (iv) unseen domain & generator

remains challenging, with accuracy of 62.78% and 66.55% respectively. The result

suggests that FAID is an effective method to address the multilingual fine-grained

AI-generated text detection task, improving the performance by leveraging multi-

level contrastive learning to capture generalizable stylistic differences tied to AI

families, rather than overfitting to surface-level ar tifacts.

CHAPTER 4. EXPERIMENTS AND NUMERICAL RESULTS

Dataset Detector Accuracy ↑ Precision ↑ Recall ↑ F1-macro ↑

FAIDSet

LLM-DetectAIve 0.7596 0.7697 0.7690 0.7653

SVM (N-grams) 0.6764 0.6515 0.6126 0.6312

T5-Sentinel 0.7568 0.7985 0.7840 0.7837

FAID 0.7964 0.8328 0.8352 0.8327

LLM-DetectAIve

LLM-DetectAIve 0.9049 0.9064 0.8352 0.8693

SVM (N-grams) 0.8567 0.6516 0.6127 0.6313

T5-Sentinel 0.8154 0.8137 0.8009 0.8105

FAID 0.9089 0.8817 0.8672 0.8737

HART

LLM-DetectAIve 0.8900 0.8787 0.8674 0.8715

SVM (N-grams) 0.6347 0.5454 0.4385 0.4806

T5-Sentinel 0.7852 0.7713 0.7834 0.7759

FAID 0.8996 0.9157 0.8648 0.8667

Table 4.7: Accuracy of detectors in identifying generators: human, GPT, Gemini,

DeepSeek, and Llama. The best is in bold.

4.11 Identifying Different Generators

I further examine the detector to identify specific AI generators which were

seen during training. The goal of FAID is not only to detect AI involvement but

also to identify the specific AI model or family applied, treating them as dis-

tinct authors. As shown in Table 4.7, across the three datasets, FAID consistently

achieves higher performance compared to other baselines in almost all metrics.

LLM-DetectAIve achieves comparable results with my detector FAID on the test

set of LLM-DetectAIve dataset, except that its precision is slightly lower. In the

generator identification task, FAID also shows high discriminative power across

AI families (GPT, Gemini, DeepSeek, Llama), achieving over 90% accuracy in

in-domain settings (FAIDSet, LLM-DetectAIve). This suggests that AI model out-

puts still retain distinctive patterns that can be captured with a sufficiently nuanced

detection framework.

FAID’s high performance when dealing with text originating from diverse

known generators within these datasets indicates that FAID learned unique writ-

ing patterns and features of different AI generators by leveraging multi-level con-

trastive learning.

4.12 Discussion

The results across both experiments demonstrate that FAID is a robust and gen-

eralizable detector for fine-grained AI-generated content detection. It consistently

outperforms strong baselines in both three-way classification (human, AI, human-

AI) and generator identification tasks, across a wide range of domains and genera-

tors (Table 4.6, Table 4.7).

CHAPTER 4. EXPERIMENTS AND NUMERICAL RESULTS

In particular, FAID achieves strong generalization to unseen generators and

domains – a key requirement for real-world deployment where new models and

wr iting styles continuously emerge. This demonstrates that multi-level contrastive

learning, by encouraging the model to learn deeper stylistic and structural signals

rather than superficial artifacts, is an effective strategy for building generalizable

detectors.

Failure case discussion. While FAID performs strongly overall, some failure

cases reveal interesting challenges. One notable example comes from an unseen do-

main sample classified as human-AI collaborative, when it was in fact fully human-

wr itten. Upon inspection, this is a sample human-AI collaborative text where the

detector misclassified as human-written:

Sample case of detector’s failure

Lately, many researchers are using neural topic models to automatically find

topics in text because they are simpler than older, math-heavy methods like

LDA. However, these newer models have two main weaknesses: they often

make poor assumptions about how topics are structured, and they sometimes

can’t figure out the specific blend of topics within a document. To solve this,

we created a new method called the Bidirectional Adversarial Topic (BAT)

model. It’s the first of its kind to use a special bidirectional adversarial train-

ing technique. This technique creates a two-way street between the words

in a document and the topics it represents, helping the model learn more

effectively. We also developed an enhanced version, Gaussian-BAT, which

understands how different words relate to each other.

In this case, the sample was primarily written by a human author, but was later

polished using light AI-assisted editing for g rammar and flow improvements. The

core ideas, sentence structures, and phrasing largely remained human-authored.

However, subtle surface-level regularities introduced by the AI polishing – such

as smoother connective phrases and more uniform sentence rhythm – likely con-

tr ibuted to the detector assigning a human-AI collaborative label.

Importantly, this illustrates a key nuance: even when minor AI involvement is

present, its impact on authorship can remain small, and the text should still be

considered predominantly human-written. The detector, while highly effective at

identifying strong AI signatures, may at times be overly sensitive to light editing

effects. This highlights the need for future calibration work to better reflect the true

degree of AI contribution in borderline cases.

CHAPTER 4. EXPERIMENTS AND NUMERICAL RESULTS

4.13 Summary

The experiments demonstrate that the designed FAID framework significantly

outperforms existing baselines in both classification and generator identification

tasks. FAID achieves state-of-the-art performance across all in-domain datasets

and shows robust generalization to unseen domains and generators, outperforming

LLM-DetectAIve, T5-Sentinel, and SVM-based methods.

In the multi-class classification task, FAID achieves the highest accuracy, pre-

cision, and recall in most scenarios, especially when generalization is required –

highlighting its robustness. In the generator identification task, FAID not only cor-

rectly identifies AI involvement but also attributes authorship to the correct model

family with higher precision and F1 scores than the baselines.

These results confirm the effectiveness of FAID’s multi-level contrastive learn-

ing approach, which captures semantic and stylistic cues specific to different AI

model families. This establishes FAID as a strong and generalizable solution for

fine-grained AI-generated text detection and attribution across domains, languages,

and model types.

CHAPTER 5. SCHOLARSLEUTH WEB APPLICATION

5.1 Overview

This chapter introduces the concept of the AI Factor Impact Score – a novel

metr ic designed to measure the degree of AI influence within a document. I

also present the public-facing implementation of this methodology via Schol-

arSleuth [1], a web-based application developed to help users assess and visual-

ize AI authorship in texts. The platform suppor ts both short text classification and

document-level analysis, generating comprehensive reports that include paragraph-

wise classifications and an overall AFI score.

5.2 AI Factor Impact Score

To provide a quantifiable measure of AI involvement in a given text, I introduce

the AI Factor Impact Score (AFI). This score represents the overall degree of AI

influence across a document, based on paragraph-level classifications.

5.2.1 AFI Score Calculation and Threshold

I first apply the fuzzy k-Nearest Neighbors classifier to each paragraph in the

text. The classifier outputs a probabilistic classification, from which I determine

the dominant label: human-written, AI-generated, or human-AI collaborative.

Let the document be composed of n paragraphs. For paragraph i

, let:

• w

denote the number of words in paragraph i

• L

denote the label assigned to paragraph i

• f(L

) denote the AFI weight for the label L

f(L

) =











0 if L

= human-written

1 if L

= AI-generated

0.7 if L

= human-AI collaborative

(5.1)

The overall AI Factor Impact Score for the document is then calculated as a

weighted average:

AFI =

i=1

× f(L

)

i=1

(5.2)

This formulation ensures that longer paragraphs contribute proportionally more

to the overall score, yielding a representative measure of AI influence relative to

CHAPTER 5. SCHOLARSLEUTH WEB APPLICATION

the actual volume of content.

I interpret the AFI score using the following scale:

• ■ 0 – 39: Primarily human-written content – Acceptable.

• ■ 40 – 69: Mixed human-AI collaboration – Marginally Acceptable.

• ■ 70 – 100: Primarily AI-generated content – Unacceptable.

5.2.2 Why AFI = 0.7 for Human-AI Collaborative Texts?

While a naive assumption might assign a value of 0.5 to the human-AI collabo-

rative class – implying equal contribution – my decision to set it at 0.7 is based on

several empirical and practical considerations:

• In many collaborative cases, the AI’s influence often extends beyond a simple

50-50 split. For example, a paragraph initially written by a human may be

heavily rewritten or paraphrased by an AI, resulting in text that reflects more

of the AI’s style and structure than the original human input.

• I observed in user studies and qualitative analyses that collaborative texts

tend to exhibit more stylistic traits associated with AI-generated content than

human-authored writing. These include consistent tone, fluent but generic

phrasing, and minimal semantic drift, all characteristics amplified by AI edit-

ing.

• The 0.7 value maintains sensitivity to the elevated influence of AI while pre-

ser ving a clear distinction from fully AI-generated text (AFI = 1). It thus

ser ves as a balanced representation of "par tial but substantial" AI involve-

ment.

This approach allows us to flag documents where AI may have had a substantial

impact while avoiding overly penalizing legitimate use of AI as a writing assistant.

5.3 System Overview

To demonstrate the practical utility of the proposed AI-generated content detec-

tion system, I developed a publicly accessible web-based platform named Schol-

arSleuth, available at: https://scholarsleuth.tnminh.com.

The system is deployed on Google Cloud Platform and comprises two main

components:

• API Serving Endpoint: The inference service is built with FastAPI, con-

tainer ized, and deployed using Cloud Run. It operates with a configuration of

1 vCPU and 16 GB RAM to efficiently host the model.

CHAPTER 5. SCHOLARSLEUTH WEB APPLICATION

• Web Hosting: The frontend is developed with Next.js and also deployed via

Cloud Run, with a lightweight configuration of 0.5 vCPU and 1 GB RAM.

This deployment setup is lightweight, scalable, and platform-agnostic, making

it suitable for integration into a wide range of systems.

The web application allows users to interact with the underlying detection model

and obtain meaningful insights into the authorship composition of texts. Schol-

arSleuth supports two main functionalities:

• Analysis Tools: These features allows user to detect the level of AI contri-

butions in their documents and get a detailed repor t to raise the awareness of

user’s in writing. We will have deeper discussion with this function in Sub-

section 5.4.1.

• Playground: While I offer the detection method and tools, I do not stay of-

fensive to this new norm. I offer two features in the playground, for user to use

AI generator wisely and help them to improve their writing skills in adapting

with AI asssitants. The more detailed information about this function will be

descr ibed in Sub-section 5.4.2.

These functionalities make ScholarSleuth suitable for academic authors, instruc-

tors, reviewers, or anyone interested in their adaptation to the the extent of AI in-

volvement in written content.

5.4 User Interface Walkthrough

Figure 5.1 show the homepage of the system, which introduces the system with

a clear call to action and highlights its two main functionalities: Analysis Tools

and Playground. This entry point allows users to begin interacting with the system

with a single click and learn about the features through concise descriptions.

5.4.1 Analysis Tools

ScholarSleuth offers a suite of core analysis tools that help users assess and inter-

pret the degree of AI involvement in a given text. These tools serve as the backbone

of the system’s capability to support fine-grained content attribution. They allow

users to analyze text for transparency, self-evaluation, academic integrity checks,

or curiosity about AI-generated content patterns. The primary tool in this suite is

the Text Analyzer, described below.

a, Text Analyzer

The Text Analyzer feature (Figure 5.2) allows users to submit a text passage

and obtain a detailed analysis of its likely composition in terms of human and

AI involvement. Users can input text through a simple web form and trigger the

CHAPTER 5. SCHOLARSLEUTH WEB APPLICATION

Figure 5.1: Homepage of ScholarSleuth system. Users can select between single text or

Document Analyzer.

CHAPTER 5. SCHOLARSLEUTH WEB APPLICATION

Figure 5.2: Text Analyzer interface. The input text is analyzed into three categories with

visual feedback.

CHAPTER 5. SCHOLARSLEUTH WEB APPLICATION

analysis by clicking the Analyze button.

The system processes the input and outputs a distribution of label scores across

three categories:

• Human-written

• AI-generated

• Human-AI collaboration

Results are presented visually with a horizontal bar chart and an overall AI Fac-

tor Impact Score, which summarizes the weighted contribution of AI-generated

content and collaboration in the analyzed text. This visualization helps users

quickly interpret how much AI influence is present in their text.

In the example shown in Figure 5.2, the analysis indicates a dominant proportion

of human-AI collaborative writing (67.15%), a substantial human-wr itten portion

(32.85%), and no fully AI-generated segments. This fine-grained feedback allows

users to better understand their own writing processes, assess the degree of AI use,

and make informed decisions about transparency and attribution.

b, Document Analyzer

For more comprehensive review, users can upload entire documents through the

Document Analyzer interface (Figure 5.3). The platfor m supports PDF f iles and

provides a detailed multi-level analysis report, including a calculated AI Factor

Impact (AFI) score and a fine-grained paragraph-level classification.

After uploading a document and clicking the Analyze PDF button, the sys-

tem processes the file, chunk file into paragraphs and returns several key outputs:

• A visual AI Factor Impact Score, summarizing the overall AI-related influence

across the document.

• A classification breakdown across three categories – Human-written, AI-

generated, and Human-AI collaboration.

• Selected Top Examples by Category, where representative text excerpts are

shown for each class to support interpretability and transparency.

An example analysis result is shown in Figure 5.3, based on a random gradua-

tion thesis from a SOICT student in the Fall 2024 semester. In this example, the

document displays a high level of Human-AI collaboration (70.32%), with some

fully AI-generated segments (23.77%) and a small proportion of Human-written

content. Such insights help users and evaluators assess whether the degree of AI

CHAPTER 5. SCHOLARSLEUTH WEB APPLICATION

Figure 5.3: Document Analyzer interface. A document is uploaded and evaluated with a

detailed breakdown and categorized text examples.

CHAPTER 5. SCHOLARSLEUTH WEB APPLICATION

involvement aligns with the expected standards for the document.

In addition to the interactive on-screen analysis, users can generate a polished

visual report by clicking the Export PDF button. The generated report (Fig-

ure 5.4) summarizes the analysis results in an easy-to-read format that includes:

• A visual AI Factor Impact Score bar.

• A classification results pie chart.

• Full analyzed pieces of text for each category.

The report enhances transparency in content attribution and is suitable for shar-

ing with educators, reviewers, or institutional evaluators. In the illustrated report,

the document received an AFI score of 73/100, which exceeds the critical thresh-

old, indicating a high level of AI influence in the document.

With its clean design, responsive layout, and intuitive navigation, ScholarSleuth

lowers the barrier to understanding AI-generated content detection. The platform

not only supports educational applications, but also facilitates transparency in au-

thorship evaluation, thereby promoting ethical writing practices in academic and

professional contexts.

5.4.2 Playgrounds

The Playgrounds serves as an interactive space where users can explore ad-

ditional features beyond AI-generated content detection. One of the guiding prin-

ciples behind the design of ScholarSleuth is not to discourage the use of AI in

wr iting, but to promote transparency and raise user awareness regarding its appli-

cation. To support this goal, we developed two new features that encourage mindful

and responsible use of AI writing tools.

a, Prompt Crafting Tool

The Prompt Crafting Tool, as shown in Figure 5.5 assists users in designing

prompts that produce AI-generated text matching specific writing styles. This tool

accepts an input text sample and analyzes its style using the underlying LLM-based

detection model. It then suggests optimized prompts that can guide users to gener-

ate new content with similar stylistic characteristics.

An additional unique feature of this tool is its ability to recommend which AI

model is most suitable for producing the desired style of text. This helps users make

informed choices rather than relying blindly on AI tools. The tool also provides

an AI Factor Impact Score and a breakdown of the writing style (Human-written,

AI-generated, Human-AI collaboration), helping users understand the current style

CHAPTER 5. SCHOLARSLEUTH WEB APPLICATION

Figure 5.4: Generated report summarizing AI Factor Impact (AFI) score and classification

results.

CHAPTER 5. SCHOLARSLEUTH WEB APPLICATION

profile of their input.

To further encourage conscious use of AI, the tool suggests different types of

prompt strategies – such as starting with a human-written draft and letting the

AI polish it, or having the AI generate a draft that the user then refines. These

strategies promote human-in-the-loop wr iting practices instead of fully automated

content generation. This aligns with our system’s goal to foster user awareness and

responsibility when employing AI in academic and professional writing.

This feature assists users in designing prompts that produce AI-generated text

matching specific writing styles. This tool accepts an input text sample and ana-

lyzes its style using the underlying LLM-based detection model. It then suggests

optimized prompts that can guide users to generate new content with similar stylis-

tic characteristics.

An additional unique feature of this tool is its ability to recommend which AI

model is most suitable for producing the desired style of text. This helps users make

informed choices rather than relying blindly on AI tools. The tool also provides

an AI Factor Impact Score and a breakdown of the writing style (Human-written,

AI-generated, Human-AI collaboration), helping users understand the current style

profile of their input.

To further encourage conscious use of AI, the tool suggests different types of

prompt strategies – such as starting with a human-written draft and letting the

AI polish it, or having the AI generate a draft that the user then refines. These

strategies promote human-in-the-loop wr iting practices instead of fully automated

content generation. This aligns with our system’s goal to foster user awareness and

responsibility when employing AI in academic and professional writing.

b, Rewrite Challenge

Figure 5.6 illustrates the Rewrite Challenge. This is an interactive gamified fea-

ture designed to help users improve their ability to identify and adapt writing styles

between human-like and AI-like text. In each challenge, users are presented with a

text passage and asked to rewrite it according to one of two modes: Humanize It or

Make It More AI-like.

The tool then evaluates the rewritten text using an AI-based scoring system that

assesses its stylistic alignment with the target mode. Users receive a correspond-

ing score and feedback to guide their learning. A leaderboard encourages friendly

competition and motivates users to improve their skills through repeated practice.

This feature serves a dual purpose. First, it raises user awareness by training

CHAPTER 5. SCHOLARSLEUTH WEB APPLICATION

Figure 5.5: Prompt Crafting Tool interface. This features provides user a range of prompts

that helps user to produce texts with AI assistant wisely.

CHAPTER 5. SCHOLARSLEUTH WEB APPLICATION

them to better distinguish stylistic markers of human and AI-generated writing.

Second, it fosters more mindful use of AI writing tools by encouraging deliberate

style choices rather than passive acceptance of AI output. Through this engaging

format, the Rewrite Challenge supports our broader goal of promoting ethical and

transparent AI use in writing.

5.5 Summary

In this chapter, I proposed and formalized the AI Factor Impact score as a prac-

tical and interpretable metric for assessing the level of AI contribution in textual

content. By assigning scalar weights to each classification label, the AFI score

captures nuanced authorship dynamics at scale. The paragraph-level classification

approach, combined with word count-based weighting, ensures a representative

and document-sensitive evaluation.

I implemented this methodology into ScholarSleuth, a fully functional web

platform that enables users to analyze text passages or upload full documents for

comprehensive authorship analysis. The system generates AFI scores, label distri-

butions, and downloadable repor ts to support ethical decision-making and promote

transparency in content creation.

Importantly, ScholarSleuth is designed not to discourage the use of AI in writ-

ing, but to raise user awareness about its influence. To support this goal, the system

includes an interactive Playground with features such as the Prompt Crafting Tool

and the Rewrite Challenge. These tools encourage users to experiment with writing

styles and reflect on the role of AI in their writing processes. Through this balanced

approach, ScholarSleuth fosters responsible and informed use of AI-assisted wr it-

ing technologies.

CHAPTER 5. SCHOLARSLEUTH WEB APPLICATION

Figure 5.6: Rewrite Challenge interface. User can use it to improve the writing skills and

raise awareness of using AI-generated contents.

CHAPTER 6. CONCLUSIONS AND BROADER IMPACTS

6.1 Conclusion

In this work, I introduced the ScholarSleuth system, which includes three main

components.

First, FAIDSet is a multilingual and fine-grained benchmark for AI-generated

text detection, comprising approximately 84,000 samples across human-written,

AI-generated, and human-AI collaborative categories.

Building on this dataset, I proposed FAID, a novel detection framework that

combines multi-level contrastive learning and multi-task auxiliary objectives. By

treating AI model families as stylistic "authors," FAID captures nuanced writing

patterns that persist across different domains and models. My approach integrates

a fuzzy KNN-based inference mechanism and a training-free incremental adapta-

tion strategy to enhance performance and robustness in both in-domain and out-of-

distribution settings. Experimental results demonstrate that FAID achieves up to

95.58% accuracy on in-domain, known generators settings and 93.31% on unseen

generators, significantly outperforms strong baselines across a variety of datasets

and configurations, especially in detecting collaborative texts and adapting to un-

seen generative models – key challenges in the evolving landscape of AI-assisted

wr iting.

To support practical deployment, I also developed ScholarSleuth – a web-based

application where users can analyze documents and receive detailed reports, in-

cluding an AI Factor Impact score – a reference metric estimating the degree of AI

contr ibution in a given text. The web application also provides some features that

help users to raise their awareness in writing, promoting transparency and respon-

sibility in AI-assisted content creation.

While ScholarSleuth demonstrates strong performance and generalization

across various domains and AI models, several limitations remain. First, although

my dataset is multilingual and multi-domain, it is still limited in terms of low-

resource languages and niche writing domains, which may affect performance in

those contexts. Second, my framework is based on the fact that texts produced by

AI models from the same family share similar stylistic features. However, this as-

sumption may break down when a single text is influenced by multiple AI systems

– such as when a human uses different models for drafting, rewriting, and pol-

ishing. In such cases, the resulting style may blend traits from multiple sources,

CHAPTER 6. CONCLUSIONS AND BROADER IMPACTS

making it more difficult to attribute authorship to a single AI family or clearly

distinguish collaboration boundaries.

6.2 Suggestion for Future Works

In future work, I aim to further enrich FAIDSet by incorporating additional lan-

guages, including low-resource ones, more generative models, and a broader range

of domains such as informal texts, social media posts, and educational writing.

This expansion will support better generalization in multilingual and low-resource

scenar ios.

I also plan to extend FAID’s capabilities to handle increasingly complex human-

AI collaborative scenarios. These include texts generated by multiple AI families

and subsequently edited by humans – reflecting real-world behaviors and work-

flows. Additionally, I will explore integrating more f ine-grained attr ibution tech-

niques to distinguish not only between human and AI contributions but also to

localize segments within collaborative texts.

On the application side, I intend to enhance the ScholarSleuth web platform to

support batch analysis, educational integration, and real-time document feedback.

These improvements aim to further the impact of FAID in real-world use cases,

particularly in academic and editorial settings where authorship transparency is

essential.

6.3 Ethics and Broader Impacts

Data Collection and Licenses A primary ethical consideration is the data li-

cense. I reused existing dataset for my research: LLM-DetectAIve, HART and

IELTS Writing, which have been publicly released under clear licensing agree-

ments. I adhere to the intended usage of these dataset licenses.

Security Implications. FAIDSet streamlines both the creation and rigorous test-

ing of FAID. By spotting AI-generated material, FAID helps preserve academic in-

tegrity, flag potential misconduct, and protect the genuine contributions of authors.

More broadly, it supports efforts to prevent the misuse of generative technologies in

areas such as credential falsification. Detecting AI-crafted content across different

languages can be tricky, due to language’s unique grammar and style. By enabling

robust, multilingual and multi-generator detection with accurate results, FAIDSet

empowers people everywhere, especially in academic scenarios, to deploy AI re-

sponsibly. At the same time, it fosters critical digital literacy, giving everyone a

clear understanding of both the strengths and limits of generative AI.

CHAPTER 6. CONCLUSIONS AND BROADER IMPACTS

Responsible Use of AI-generated Text Detection. FAID is designed to enhance

transparency in AI-assisted writing by enabling the fine-grained detection of AI

involvement in text generation. While this has clear benefits for academic integrity,

and content provenance, I acknowledge the potential for misuse. For instance, such

tools could be used to unfairly penalize individuals in educational or professional

settings based on incorrect or biased predictions. To mitigate this, I stress that FAID

is not intended for high-stakes decision-making without human oversight.

Bias and Fairness. AI-generated text detection systems may inadvertently en-

code or amplify biases present in training data. FAIDSet has been carefully con-

structed to include diverse domains and languages to reduce such biases. Nonethe-

less, I encourage ongoing auditing and benchmarking of fairness across popula-

tions and writing styles, and welcome community feedback for further improve-

ments.

Transparency and Reproducibility. In this work, to promote open research and

community contributions, I published my code and data to the public.

PUBLICATIONS AND AWARDS GIVEN TO THIS THESIS

1. Publications

The whole research of this research thesis is published into two publications,

which are under-review in The 2025 Conference on Empirical Methods in Nat-

ural Language Processing. I first authored both of them:

[1] FAID: Fine-grained AI-generated Text Detection using Multi-task Auxiliary

and Multi-level Contrastive Learning [2].

[2] ScholarSleuth: A System for Fine-grained AI-generated Text Detection for

Academic Integrity [1].

Some supports components for this paper are published in three publications,

which are:

[3] LLM-DetectAIve: a Tool for Fine-Grained Machine-Generated Text Detec-

tion, In Proceedings of The 2024 Conference on Empirical Methods in Natu-

ral Language Processing: System Demonstration (I co-first authored this pa-

per) [3].

[4] GenAI Content Detection Task 1: English and Multilingual Machine-

Generated Text Detection: AI vs. Human, In Proceedings of the 1st Workshop

on GenAI Content Detection, The 2025 International Conference on Compu-

tational Linguistics (I co-authored this paper) [51].

[5] Is Human-Written Text Liked by Humans? Multilingual Human Detection and

Preference Against AI, Under-review in Proceedings of The 2025 Conference

on Empirical Methods in Natural Language Processing (I co-authored this

paper) [52].

2. Awards

This thesis has won the Second Prize in The 42

Student Scientific Research

Conference at Hanoi University of Science and Technology, in the area of Applied

AI, Blockchain and Big Data [43].

REFERENCES

[1] Minh Ngoc Ta, Duc-Anh Hoang, Dong Cao Van, Minh Le-Anh, Truong

Nguyen, My Anh Tran Nguyen, Yuxia Wang, Preslav Nakov, and Sang Dinh.

ScholarSleuth: A System for Fine-grained AI-generated Text Detection for

Academic Integrity. 2025. OpenReview: UnderReview.

[2] Minh Ngoc Ta, Dong Cao Van, Duc-Anh Hoang, Minh Le-Anh, Truong

Nguyen, My Anh Tran Nguyen, Yuxia Wang, Preslav Nakov, and Sang

Dinh. FAID: Fine-grained AI-generated Text Detection using Multi-task Aux-

iliary and Multi-level Contrastive Learning. 2025. arXiv: 2505 . 14271

[cs.CL]. URL: https://arxiv.org/abs/2505.14271.

[3] Mervat Abassy, Kareem Elozeiri, Alexander Aziz, Minh Ngoc Ta, Raj

Vardhan Tomar, Bimarsha Adhikari, Saad El Dine Ahmed, et al. “LLM-

DetectAIve: a Tool for Fine-Grained Machine-Generated Text Detection”.

In: Proceedings of the 2024 Conference on Empirical Methods in Natural

Language Processing: System Demonstrations. Miami, Florida, USA: Asso-

ciation for Computational Linguistics, Nov. 2024, pp. 336–343. DOI: 10 .

18653/v1/2024.emnlp-demo.35.

[4] Ekaterina Artemova, Jason S Lucas, Saranya Venkatraman, Jooyoung Lee,

Sergei Tilga, Adaku Uchendu, and Vladislav Mikhailov. “Beemo: Bench-

mark of Expert-edited Machine-generated Outputs”. In: Proceedings of the

2025 Conference of the Nations of the Americas Chapter of the Associa-

tion for Computational Linguistics: Human Language Technologies (Vol-

ume 1: Long Papers). Ed. by Luis Chir uzzo, Alan Ritter, and Lu Wang.

Albuquerque, New Mexico: Association for Computational Linguistics,

Apr. 2025, pp. 6992–7018. ISBN: 979-8-89176-189-6. URL: https : / /

aclanthology.org/2025.naacl-long.357/.

[5] The Australian. “TEQSA survey: Over a third of students use chatbots for

assessments without viewing it as cheating”. In: (2025). URL: https :

/ / www . theaustralian . com . au / higher - education /

unis - warned - chatbots - taking - over - and - return -

to - pen - and - paper - is - futile / news - story /

0a3e668fe18e4f8c7a65365d2de2e9eb.

[6] Guangsheng Bao, Lihua Rong, Yanbin Zhao, Qiji Zhou, and Yue Zhang.

Decoupling Content and Expression: Two-Dimensional Detection of AI-

REFERENCES

Generated Text. 2025. arXiv: 2503 . 00258 [cs.CL]. URL: https :

//arxiv.org/abs/2503.00258.

[7] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Ka-

plan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry,

Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger,

Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu,

Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler, Mateusz Litwin,

Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCan-

dlish, Alec Radford, Ilya Sutskever, and Dar io Amodei. “Language Models

are Few-Shot Learners”. In: Advances in Neural Information Processing Sys-

tems. Ed. by H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H.

Lin. Vol. 33. Curran Associates, Inc., 2020, pp. 1877–1901. URL: https:

/ / proceedings. neurips. cc / paper _files / paper / 2020 /

file/1457c0d6bfcb4967418bfb8ac142f64a-Paper.pdf.

[8] Rich Caruana. “Multitask learning”. In: Machine learning 28 (1997), pp. 41–

75.

[9] Tuhin Chakrabarty, Philippe Laban, and Chien-Sheng Wu. “Can AI writ-

ing be salvaged? Mitigating Idiosyncrasies and Improving Human-AI Align-

ment in the Writing Process through Edits”. In: Proceedings of the 2025 CHI

Conference on Human Factors in Computing Systems, CHI 2025, Yokohama-

Japan, 26 April 2025- 1 May 2025 . Ed. by Naomi Yamashita, Vanessa Ev-

ers, Koji Yatani, Sharon Xianghua Ding, Bongshin Lee, Marshini Chetty,

and Phoebe O. Toups Dugas. ACM, 2025, 1210:1–1210:33. DOI: 10 .

1145/3706598.3713559. URL: https://doi.org/10.1145/

3706598.3713559.

[10] Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton.

“A Simple Framework for Contrastive Learning of Visual Representations”.

In: Proceedings of the 37th International Conference on Machine Learning.

Ed. by Hal Daumé III and Aarti Singh. Vol. 119. Proceedings of Machine

Learning Research. PMLR, July 2020, pp. 1597–1607. URL: https : / /

proceedings.mlr.press/v119/chen20j.html.

[11] Yutian Chen, Hao Kang, Vivian Zhai, Liangze Li, Rita Singh, and Bhiksha

Raj. “Token Prediction as Implicit Classification to Identify LLM-Generated

Text”. In: Proceedings of the 2023 Conference on Empirical Methods in

Natural Language Processing. Ed. by Houda Bouamor, Juan Pino, and Ka-

lika Bali. Singapore: Association for Computational Linguistics, Dec. 2023,

REFERENCES

pp. 13112–13120. DOI: 10 . 18653 / v1 / 2023 . emnlp - main . 810.

URL: https://aclanthology.org/2023.emnlp- main.810/.

[12] Zihao Cheng, Li Zhou, Feng Jiang, Benyou Wang, and Haizhou Li. “Be-

yond Binary: Towards Fine-Grained LLM-Generated Text Detection via Role

Recognition and Involvement Measurement”. In: Proceedings of the ACM on

Web Conference 2025. WWW ’25. ACM, Apr. 2025, pp. 2677–2688. DOI:

10.1145/3696410.3714770. URL: http://dx.doi.org/10.

1145/3696410.3714770.

[13] Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William

Fedus, Yunxuan Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, et

al. “Scaling instruction-finetuned language models”. In: Journal of Machine

Learning Research 25.70 (2024), pp. 1–53. URL : http://jmlr.org/

papers/v25/23-0870.html.

[14] Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary,

Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke

Zettlemoyer, and Veselin Stoyanov. “Unsupervised Cross-lingual Represen-

tation Learning at Scale”. In: Proceedings of the 58th Annual Meeting of the

Association for Computational Linguistics. Ed. by Dan Jurafsky, Joyce Chai,

Natalie Schluter, and Joel Tetreault. Online: Association for Computational

Linguistics, July 2020, pp. 8440–8451. DOI: 10.18653/v1/2020.acl-

main . 747. URL: https : / / aclanthology. org / 2020 . acl -

main.747/.

[15] DeepSeek-AI. “DeepSeek-V3 Technical Report”. In: CoRR abs/2412.19437

(2024). DOI: 10.48550/ARXIV.2412.19437. arXiv: 2412.19437.

URL: https://doi.org/10.48550/arXiv.2412.19437.

[16] Digital Education Council. Digital Education Council Global AI Student

Survey 2024. https :/ /www. digitaleducationcouncil.com /

post/digital-education-council-global-ai- student-

survey-2024. 2025.

[17] Liam Dugan, Daphne Ippolito, Arun Kirubarajan, Sherry Shi, and Chris

Callison-Burch. “Real or fake text? investigating human ability to detect

boundar ies between human-written and machine-generated text”. In: Pro-

ceedings of the Thirty-Seventh AAAI Conference on Artificial Intelligence

and Thirty-Fifth Conference on Innovative Applications of Artificial Intel-

ligence and Thirteenth Symposium on Educational Advances in Artificial

Intelligence. AAAI’23/IAAI’23/EAAI’23. AAAI Press, 2023. ISBN: 978-1-

REFERENCES

57735-880-0. DOI: 10 .1609 / aaai. v37i11 .26501. URL: https :

//doi.org/10.1609/aaai.v37i11.26501.

[18] Yaroslav Ganin, Evgeniya Ustinova, Hana Ajakan, Pascal Germain, Hugo

Larochelle, Franc¸ois Laviolette, Mario March, and Victor Lempitsky.

“Domain-Adversarial Training of Neural Networks”. In: Journal of Machine

Learning Research 17.59 (2016), pp. 1–35. URL : http://jmlr.org/

papers/v17/15-239.html.

[19] Tianyu Gao, Xingcheng Yao, and Danqi Chen. “SimCSE: Simple Con-

trastive Learning of Sentence Embeddings”. In: Proceedings of the 2021

Conference on Empirical Methods in Natural Language Processing. Ed. by

Mar ie-Francine Moens, Xuanjing Huang, Lucia Specia, and Scott Wen-tau

Yih. Online and Punta Cana, Dominican Republic: Association for Com-

putational Linguistics, Nov. 2021, pp. 6894–6910. DOI: 10.18653/v1/

2021.emnlp- main.552. URL: https://aclanthology.org/

2021.emnlp-main.552/.

[20] Gemini Team, Google. Gemini: A Family of Highly Capable Multimodal

Models. 2024. arXiv: 2312.11805 [cs.CL]. URL: https://arxiv.

org/abs/2312.11805.

[21] Gemma Team, Google DeepMind. Gemma 2: Improving Open Language

Models at a Practical Size. 2024. arXiv: 2408 .00118 [cs.CL]. URL:

https://arxiv.org/abs/2408.00118.

[22] GenAI, Meta. Llama 2: Open Foundation and Fine-Tuned Chat Models.

2023. arXiv: 2307 .09288 [cs.CL]. URL: https://arxiv.org/

abs/2307.09288.

[23] John Giorgi, Osvald Nitski, Bo Wang, and Gary Bader. “DeCLUTR: Deep

Contrastive Learning for Unsupervised Textual Representations”. In: Pro-

ceedings of the 59th Annual Meeting of the Association for Computational

Linguistics and the 11th International Joint Conference on Natural Lan-

guage Processing (Volume 1: Long Papers). Ed. by Chengqing Zong, Fei

Xia, Wenjie Li, and Roberto Navigli. Online: Association for Computational

Linguistics, Aug. 2021, pp. 879–895. DOI: 10.18653/v1/2021.acl-

long.72. URL: https://aclanthology.org/2021.acl-long.

72/.

[24] Xun Guo, Shan Zhang, Yongxin He, Ting Zhang, Wanquan Feng, Haibin

Huang, and Chongyang Ma. “DeTeCtive: Detecting AI-generated Text

via Multi-Level Contrastive Learning”. In: Advances in Neural Informa-

REFERENCES

tion Processing Systems. Ed. by A. Globerson, L. Mackey, D. Belgrave,

A. Fan, U. Paquet, J. Tomczak, and C. Zhang. Vol. 37. Curran Asso-

ciates, Inc., 2024, pp. 88320–88347. URL: https : / / proceedings .

neurips . cc / paper _ files / paper / 2024 / file /

a117a3cd54b7affad04618c77c2fb18b - Paper- Conference .

pdf.

[25] ibrahimmazlum. IELTS Writing Scored Essays Dataset — kaggle.com.

https : / / www. kaggle . com / datasets / mazlumi / ielts -

writing-scored-essays-dataset. [Accessed 29-04-2025]. 2023.

[26] University of Illinois Chicago. “AI tools in education: A report on students’

attitudes and usage”. In: (2025). URL: https:// learning.uic.edu/

news- stories/report- on- student- attitudes- towards-

ai-in-academia/.

[27] Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford,

Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna

Lengyel, Guillaume Lample, Lucile Saulnier, Lélio Renard Lavaud, Marie-

Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang,

Timothée Lacroix, and William El Sayed. Mistral 7B. 2023. arXiv: 2310.

06825 [cs.CL]. URL: https://arxiv.org/abs/2310.06825.

[28] Ryuto Koike, Masahiro Kaneko, and Naoaki Okazaki. “OUTFOX: LLM-

Generated Essay Detection Through In-Context Learning with Adversarially

Generated Examples”. In: Proceedings of the AAAI Conference on Artificial

Intelligence 38.19 (Mar. 2024), pp. 21258–21266. DOI: 10.1609/aaai.

v38i19 . 30120. URL: https : / / ojs . aaai . org / index . php /

AAAI/article/view/30120.

[29] Laida Kushnareva, Tatiana Gaintseva, German Magai, Serguei Barannikov,

Dmitry Abulkhanov, Kristian Kuznetsov, Eduard Tulchinskii, Irina Pio-

ntkovskaya, and Sergey Nikolenko. AI-generated text boundary detection

with RoFT. 2024. arXiv: 2311 . 08349 [cs.CL]. URL: https : / /

arxiv.org/abs/2311.08349.

[30] Mina Lee, Percy Liang, and Qian Yang. “CoAuthor: Designing a Human-AI

Collaborative Writing Dataset for Exploring Language Model Capabilities”.

In: CHI ’22: CHI Conference on Human Factors in Computing Systems, New

Orleans, LA, USA, 29 April 2022 - 5 May 2022. Ed. by Simone D. J. Barbosa,

Cliff Lampe, Caroline Appert, David A. Shamma, Steven Mark Drucker,

Julie R. Williamson, and Koji Yatani. ACM, 2022, 388:1–388:19. DOI: 10.

REFERENCES

1145/3491102.3502030. URL: https://doi.org/10.1145/

3491102.3502030.

[31] Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi

Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov.

“RoBERTa: A Robustly Optimized BERT Pretraining Approach”. In: CoRR

abs/1907.11692 (2019). arXiv: 1907 .11692. URL: http: / / arxiv.

org/abs/1907.11692.

[32] Llama Team, AI @ Meta. “The Llama 3 Herd of Models”. In: (2024). arXiv:

2407.21783 [cs.AI]. URL: https://arxiv.org/abs/2407.

21783.

[33] Dominik Macko, Robert Moro, and Ivan Srba. Increasing the Robustness of

the Fine-tuned Multilingual Machine-Generated Text Detectors. 2025. arXiv:

2503.15128 [cs.CL]. URL: https://arxiv.org/abs/2503.

15128.

[34] Larry R Medsker, Lakhmi Jain, et al. “Recurrent neural networks”. In: De-

sign and Applications 5.64-67 (2001), p. 2.

[35] M Monika, V Divyavarsini, and C Suganthan. “A survey on analyzing the

effectiveness of AI tools among research scholars in academic writing and

publishing”. In: International Journal of Advance Research and Innovative

Ideas in Education 9.6 (2023), pp. 1293–1305.

[36] Ayat A. Najjar, Huthaifa I. Ashqar, Omar A. Darwish, and Eman Hammad.

Detecting AI-Generated Text in Educational Content: Leveraging Machine

Learning and Explainable AI for Academic Integrity. 2025. arXiv: 2501.

03203 [cs.CL]. URL: https://arxiv.org/abs/2501.03203.

[37] OpenAI. “GPT-4 Technical Report”. In: ArXiv abs/2303.08774 (2023). URL:

https://api.semanticscholar.org/CorpusID:257532815.

[38] OpenAI. GPT-4o System Card. 2024. arXiv: 2410 . 21276 [cs.CL].

URL: https://arxiv.org/abs/2410.21276.

[39] Nuzhat Prova. “Detecting AI Generated Text Based on NLP and Machine

Learning Approaches”. In: arXiv preprint arXiv:2404.10032 (2024). URL:

https://arxiv.org/abs/2404.10032.

[40] Qwen Team, Alibaba Group. Qwen2 Technical Report. 2024. arXiv: 2407.

10671 [cs.CL]. URL: https://arxiv.org/abs/2407.10671.

REFERENCES

[41] Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya

Sutskever. “Language Models are Unsupervised Multitask Learners”. In:

(2019).

[42] Shoumik Saha and Soheil Feizi. Almost AI, Almost Human: The Challenge

of Detecting AI-Polished Writing. 2025. arXiv: 2502 .15666 [cs.CL].

URL: https://arxiv.org/abs/2502.15666.

[43] SoICT. Tổng kết Hội nghị Sinh viên nghiên cứu khoa học 2025 - SoICT —

soict.hust.edu.vn. https : / / soict . hust . edu . vn / tong - ket -

hoi-nghi-sinh-vien-nghien- cuu-khoa-hoc-2025.html.

2025.

[44] Elon University. “Generative AI tools in education: Improving research skills

and writing ability”. In: (2025). URL: https: //www.elon. edu/u/

news/2025/01/23/elon- aacu- survey- focuses- on- ais-

impact-on-teaching-and-learning/.

[45] Jannis Vamvas and Rico Sennrich. “Towards Unsupervised Recognition of

Token-level Semantic Differences in Related Documents”. In: Proceedings

of the 2023 Conference on Empirical Methods in Natural Language Pro-

cessing. Ed. by Houda Bouamor, Juan Pino, and Kalika Bali. Singapore:

Association for Computational Linguistics, Dec. 2023, pp. 13543–13552.

DOI: 10 . 18653 / v1/ 2023 . emnlp - main . 835. URL : https : / /

aclanthology.org/2023.emnlp-main.835/.

[46] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion

Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. “Atten-

tion is All you Need”. In: Advances in Neural Information Processing

Systems 30: Annual Conference on Neural Information Processing Sys-

tems 2017, December 4-9, 2017, Long Beach, CA, USA. Ed. by Isabelle

Guyon, Ulrike von Luxburg, Samy Bengio, Hanna M. Wallach, Rob Fer-

gus, S. V. N. Vishwanathan, and Roman Gar nett. 2017, pp. 5998–6008. URL:

https : / / proceedings . neurips . cc / paper / 2017 / hash /

3f5ee243547dee91fbd053c1c4a845aa-Abstract.html.

[47] Liang Wang, Nan Yang, Xiaolong Huang, Binxing Jiao, Linjun Yang,

Daxin Jiang, Rangan Majumder, and Furu Wei. Text Embeddings by

Weakly-Supervised Contrastive Pre-training. 2024. arXiv: 2212 . 03533

[cs.CL]. URL: https://arxiv.org/abs/2212.03533.

REFERENCES

[48] Liang Wang, Nan Yang, Xiaolong Huang, Linjun Yang, Rangan Majumder,

and Furu Wei. Multilingual E5 Text Embeddings: A Technical Report. 2024.

arXiv: 2402.05672 [cs.CL]. URL: https:/ /arxiv. org/abs/

2402.05672.

[49] Pengyu Wang, Linyang Li, Ke Ren, Botian Jiang, Dong Zhang, and Xipeng

Qiu. “SeqXGPT: Sentence-Level AI-Generated Text Detection”. In: Pro-

ceedings of the 2023 Conference on Empirical Methods in Natural Language

Processing. Ed. by Houda Bouamor, Juan Pino, and Kalika Bali. Singapore:

Association for Computational Linguistics, Dec. 2023, pp. 1144–1156. URL:

https://aclanthology.org/2023.emnlp-main.73/.

[50] Yuxia Wang, Jonibek Mansurov, Petar Ivanov, Jinyan Su, Artem Shel-

manov, Akim Tsvigun, Osama Mohammed Afzal, Tarek Mahmoud, Gio-

vanni Puccetti, Thomas Arnold, Alham Aji, Nizar Habash, Iryna Gurevych,

and Preslav Nakov. “M4GT-Bench: Evaluation Benchmark for Black-Box

Machine-Generated Text Detection”. In: Proceedings of the 62nd Annual

Meeting of the Association for Computational Linguistics (Volume 1: Long

Papers). Ed. by Lun-Wei Ku, Andre Martins, and Vivek Srikumar. Bangkok,

Thailand: Association for Computational Linguistics, Aug. 2024, pp. 3964–

3992. DOI: 10 .18653 / v1/ 2024 .acl - long . 218. URL: https :

//aclanthology.org/2024.acl-long.218/.

[51] Yuxia Wang, Artem Shelmanov, Jonibek Mansurov, Akim Tsvigun,

Vladislav Mikhailov, Rui Xing, Zhuohan Xie, Jiahui Geng, Giovanni Puc-

cetti, Ekaterina Artemova, Jinyan Su, Minh Ngoc Ta, Mervat Abassy, et

al. “GenAI Content Detection Task 1: English and Multilingual Machine-

Generated Text Detection: AI vs. Human”. In: Proceedings of the 1st Work-

shop on GenAI Content Detection (GenAIDetect). Ed. by Firoj Alam, Preslav

Nakov, Nizar Habash, Iryna Gurevych, Shammur Chowdhury, Artem Shel-

manov, Yuxia Wang, Ekaterina Artemova, Mucahid Kutlu, and George

Mikros. Abu Dhabi, UAE: International Conference on Computational Lin-

guistics, Jan. 2025, pp. 244–261. URL: https://aclanthology.org/

2025.genaidetect-1.27/.

[52] Yuxia Wang, Rui Xing, Jonibek Mansurov, Giovanni Puccetti, Zhuohan Xie,

Minh Ngoc Ta, Jiahui Geng, et al. Is Human-Like Text Liked by Humans?

Multilingual Human Detection and Preference Against AI. 2025. arXiv:

2502.11614 [cs.CL]. URL: https://arxiv.org/abs/2502.

11614.

REFERENCES

[53] Chengfeng Xu, Jian Feng, Pengpeng Zhao, Fuzhen Zhuang, Deqing Wang,

Yanchi Liu, and Victor S. Sheng. “Long- and short-term self-attention

network for sequential recommendation”. In: Neurocomputing 423 (2021),

pp. 580–589. ISSN: 0925-2312. DOI: https://doi.org/10.1016/

j.neucom.2020.10.066. URL: https://www.sciencedirect.

com/science/article/pii/S0925231220316441.

[54] Zendy.io. “AI in research for students and researchers: 2025 trends and

statistics”. In: (2025). URL: https : / / zendy. io / blog / ai - in -

research - for - students - researchers - 2025 - trends -

statistics.

[55] Zijie Zeng, Shiqi Liu, Lele Sha, Zhuang Li, Kaixun Yang, Sannyuya Liu,

Dragan Ga

sevi

c, and Guanliang Chen. “Detecting AI-generated sentences in

human-AI collaborative hybrid texts: challenges, strategies, and insights”. In:

Proceedings of the Thirty-Third International Joint Conference on Artificial

Intelligence. IJCAI ’24. Jeju, Korea, 2024. ISBN: 978-1-956792-04-1. DOI:

10 . 24963 / ijcai . 2024 / 835. URL: https : / / doi . org / 10 .

24963/ijcai.2024/835.

[56] Zijie Zeng, Lele Sha, Yuheng Li, Kaixun Yang, Dragan Ga

sevi

c, and Guan-

liang Chen. “Towards automatic boundary detection for human-AI collabo-

rative hybrid essay in education”. In: Proceedings of the Thirty-Eighth AAAI

Conference on Artificial Intelligence and Thirty-Sixth Conference on Innova-

tive Applications of Artificial Intelligence and Fourteenth Symposium on Ed-

ucational Advances in Artificial Intelligence. AAAI’24/IAAI’24/EAAI’24.

AAAI Press, 2024. ISBN: 978-1-57735-887-9. DOI: 10 . 1609 / aaai .

v38i20 . 30258. URL: https : / / doi . org / 10 . 1609 / aaai .

v38i20.30258.

[57] Qihui Zhang, Chujie Gao, Dongping Chen, Yue Huang, Yixin Huang,

Zhenyang Sun, Shilin Zhang, Weiye Li, Zhengyan Fu, Yao Wan, and Lichao

Sun. “LLM-as-a-Coauthor: Can Mixed Human-Written and Machine-

Generated Text Be Detected?” In: Findings of the Association for Com-

putational Linguistics: NAACL 2024. Ed. by Kevin Duh, Helena Gomez,

and Steven Bethard. Mexico City, Mexico: Association for Computational

Linguistics, June 2024, pp. 409–436. DOI: 10 . 18653 / v1 / 2024 .

findings - naacl . 29. URL: https : / / aclanthology. org /

2024.findings-naacl.29/.