Measuring Persuasiveness: A Comparative Analysis of AI and Human Arguments

#task, #application, #DebateAI, #AIvsHuman

Nicholas Simonian

Abstract

Artificial intelligence is increasingly used for argumentation, whether in debate, essay writing, or persuasive reasoning. How does AI compare to human argumentation in terms of persuasiveness, logical structure, and effectiveness? This project explores the strengths and limitations of AI-generated arguments relative to human-generated ones across four key topics: environment, economy, foreign policy, and politics. To do this, I analyze a dataset of human arguments sourced from high school and college debate formats and compare them to arguments generated by two AI models, DeepSeek and ChatGPT, with ChatGPT serving as a baseline. By evaluating AI’s ability to mimic human argumentation, this project aims to provide insight into AI’s logical coherence, biases, and potential role in shaping discourse. The study surveyed college students on a randomized set of human and AI-generated arguments to assess their persuasiveness, detectability, and logical structure. The quantitative findings reveal that AI-generated arguments are often indistinguishable from human ones, while the qualitative analysis shows that AI arguments are structured and balanced but lack the rhetorical and emotional weight found in human debate.

What this project is about

With AI being widely used in writing, debating, and persuasive reasoning, an important question arises: can AI generate arguments that are as logically sound and persuasive as those written by humans? AI’s ability to construct arguments has major implications, influencing everything from academic integrity to online discourse and policy debates. However, AI models are trained on vast amounts of text data, and their arguments may reflect underlying biases, logical inconsistencies, or structural flaws that distinguish them from human reasoning.

This project seeks to systematically compare AI-generated and human-generated arguments across four contentious topics: environment, economy, foreign policy, and politics. My approach involves selecting a balanced dataset of arguments written by humans at both high school and college debate levels and comparing them with AI-generated arguments from DeepSeek and ChatGPT, which serves as a baseline. This comparison will highlight potential differences in reasoning, rhetorical strategies, and persuasiveness.

The core goal is to evaluate AI’s performance relative to human debate standards and determine whether AI-generated arguments sound professional or align more closely with high school or college-level arguments. By analyzing argument structures, coherence, and survey-based persuasiveness ratings, I aim to uncover whether AI can effectively mimic or outperform human reasoning or if it falls short in key areas.

Progress made so far

Since the initial proposal, I have made considerable progress and completed both the survey and the qualitative analysis. First, I organized and cleaned a dataset of over 250,000 human-generated arguments sourced from high school and college debate formats. To ensure a fair comparison, I selected a balanced set of high school and college-level arguments for each of four key topics: environment, economy, foreign policy, and politics. This structure makes it possible to identify which level of human argumentation AI-generated arguments align with most closely, while also providing a set of contentious, widely understood topics with active debate behind them. For each sample argument, I highlighted persuasive excerpts and aligned them with the debate questions they directly respond to.
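As a rough illustration of this selection step, the sketch below shows one way the topic labeling and balanced sampling could be implemented. The column names, keyword lists, and sample counts are illustrative assumptions, not the exact pipeline used for the project.

```python
import pandas as pd

# Minimal sketch of topic labeling and balanced sampling.
# The "summary" column, keyword lists, and counts are illustrative assumptions.
TOPIC_KEYWORDS = {
    "environment": ["climate", "carbon", "emissions", "energy"],
    "economy": ["tax", "inflation", "jobs", "trade"],
    "foreign policy": ["treaty", "sanctions", "alliance", "nuclear"],
    "politics": ["election", "congress", "voting", "partisan"],
}

def assign_topic(summary: str) -> str | None:
    """Label an argument with the first topic whose keywords appear in its summary."""
    text = summary.lower()
    for topic, keywords in TOPIC_KEYWORDS.items():
        if any(kw in text for kw in keywords):
            return topic
    return None  # uncategorized arguments are dropped below

def sample_balanced(arguments: pd.DataFrame, per_topic: int, seed: int = 0) -> pd.DataFrame:
    """Randomly draw an equal number of labeled arguments from each topic."""
    labeled = arguments.assign(topic=arguments["summary"].map(assign_topic))
    labeled = labeled.dropna(subset=["topic"])
    return (
        labeled.groupby("topic", group_keys=False)
               .apply(lambda g: g.sample(n=min(per_topic, len(g)), random_state=seed))
    )
```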

To pair with the human-written debate text, I generated AI argument responses using DeepSeek and ChatGPT, with the latter serving as a baseline model. Each AI model was prompted with the same debate questions as the human arguments I collected and asked to construct persuasive arguments, ensuring consistency across responses. I wrote a function that prompted DeepSeek with each question and requested a structured response similar in length to the corresponding human argument, and I prompted ChatGPT manually for each question. The AI arguments were then structured and labeled alongside the human arguments for direct comparison.
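A minimal sketch of such a prompting function is below, assuming the OpenAI-compatible DeepSeek chat endpoint. The model name, system prompt, temperature, and environment variable are assumptions rather than the exact settings used.

```python
import os
from openai import OpenAI

# Minimal sketch of the DeepSeek prompting step. DeepSeek exposes an
# OpenAI-compatible chat API; the prompt wording, temperature, and
# word-count constraint below are illustrative assumptions.
client = OpenAI(api_key=os.environ["DEEPSEEK_API_KEY"],
                base_url="https://api.deepseek.com")

def generate_argument(question: str, target_words: int) -> str:
    """Request a debate-style argument roughly matching the human argument's length."""
    response = client.chat.completions.create(
        model="deepseek-chat",
        messages=[
            {"role": "system",
             "content": ("You are a competitive debater. Write a persuasive, "
                         "structured argument in the style of a debate speech.")},
            {"role": "user",
             "content": f"{question}\n\nRespond in roughly {target_words} words."},
        ],
        temperature=0.7,
    )
    return response.choices[0].message.content
```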

I conducted both a qualitative and quantitative review of the arguments, analyzing their logical structure and strength. My qualitative observations suggest that AI-generated arguments are structurally competent and persuasive at a surface level but often lack the depth, adaptability, and strategic framing found in human arguments, particularly at the college level. AI arguments tend to resemble high school-level debate structures more closely than college-level ones, and they also lack the emotional power of human arguments. To complement this, I created a three-part survey that asked participants to choose the strongest argument among the human, DeepSeek, and ChatGPT responses, to judge whether argument excerpts were written by a human or by AI, and to grade question responses on persuasiveness, insightfulness, and logical cohesion. I received about 100 responses to this survey and summarized the results.

Approach

This project follows a structured comparative methodology, focusing on argument persuasiveness, logical consistency, and rhetorical structure.

Main Approach

Human Argument Selection
- I compiled arguments from high school and college debate formats to provide an authentic human dataset.
- Arguments were sampled across four topics to maintain balance.

AI Argument Generation
- AI arguments were generated from DeepSeek and ChatGPT for the same topics.
- AI models were given the same debate questions and instructed to construct persuasive arguments of about the same length as the human arguments for consistency.

Comparative Analysis
- AI-generated and human-generated arguments were analyzed for logical structure, clarity, and persuasiveness by human survey respondents.
- Participants judged argument strength, attempted AI detection, and graded arguments on a number of factors.
- Based on the experiment results, I determine whether AI arguments align more closely with high school or college-level human arguments.
- A qualitative analysis compares human and AI arguments on a number of factors.

Baselines

ChatGPT serves as a baseline AI model, allowing me to compare DeepSeek’s argumentation style against a widely used model. Human arguments provide a reference point for logical and persuasive argumentation.

Novelty

Unlike prior studies focusing on AI-generated text quality, this project explicitly examines AI’s ability to replicate structured argumentation and determines where AI arguments fit within human debate levels (high school vs. college).

Experiments

Data

The dataset consists of 39 question-response sets, with each set containing a human-written, a DeepSeek-generated, and a ChatGPT-generated response to the question. The 39 human arguments consist of 14 high school-level and 25 college-level arguments. The AI-generated arguments come from two different models, DeepSeek and ChatGPT, allowing for an analysis of how different AI systems structure arguments. Each argument is labeled by topic (environment, economy, foreign policy, or politics) and level (high school or college), enabling structured comparisons between AI and human arguments at different skill levels and across mainstream topics.

Here is a sample data triplet:

Question: How have U.S. nuclear reactors contributed to both reducing the risk of nuclear war and addressing climate change?

Human Argument (Environment 1 - High School): In addition to reducing the risk of nuclear war, U.S. reactors have also been staving off another global challenge: climate change. The low-carbon electricity produced by such reactors provides 20 percent of the nation's power and, by the estimates of climate scientist James Hansen of Columbia University, avoided 64 billion metric tons of greenhouse gas pollution.

DeepSeek Argument: The U.S. nuclear reactors have significantly reduced the risk of nuclear war by providing a reliable energy source that does not depend on fossil fuels. This stability reduces the need for nuclear weapons as energy demands are met through peaceful, non-military means. Additionally, the low carbon emissions of nuclear energy help address climate change by lowering overall greenhouse gas levels. As the world transitions towards renewable energy sources, nuclear reactors play a crucial role in maintaining a stable energy grid while transitioning towards sustainable energy solutions.

ChatGPT Argument: U.S. nuclear reactors have played a dual role in both security and environmental policy. Civilian nuclear programs helped demonstrate that nuclear technology could be harnessed for peaceful purposes, promoting non-proliferation by encouraging international cooperation through agreements like the Nuclear Non-Proliferation Treaty. Simultaneously, nuclear reactors provide a major low-carbon energy source, replacing fossil fuels and reducing greenhouse gas emissions. This dual-purpose legacy highlights how peaceful nuclear development can mitigate climate change while promoting stability by reducing incentives for states to pursue weapons programs under the guise of energy development.
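For reference, a triplet like the one above could be stored as a single labeled record. The field names below are an illustrative assumption about the data layout, not the project's actual schema.

```python
# Illustrative record layout for one question-response triplet.
# Field names are assumptions; argument texts are abbreviated.
sample_entry = {
    "question": ("How have U.S. nuclear reactors contributed to both reducing "
                 "the risk of nuclear war and addressing climate change?"),
    "topic": "environment",
    "level": "high school",        # level of the human argument
    "human_argument": "...",       # excerpt from the debate dataset
    "deepseek_argument": "...",    # generated via the prompting function above
    "chatgpt_argument": "...",     # collected manually from ChatGPT
}
```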

Evaluation Method

To properly evaluate my data, I conducted a three-section survey with about 100 students, mainly from UChicago and some with a debate background. The three sections are as follows:

- Choosing the strongest argument (9 questions): participants are given a prompt and asked to select the most persuasive response from a set of three arguments (1 human, 1 DeepSeek, 1 ChatGPT).
- AI vs. human detection (15 questions): participants are asked to select whether a provided argument was written by a human or by AI.
- Rating section (12 questions): participants are given a prompt and a randomly selected argument responding to it, and asked to grade it on a 1-5 scale for persuasiveness, logical cohesion, and insightfulness. This was done for four arguments: one human college, one human high school, one DeepSeek, and one ChatGPT.

The prompts were randomly organized and participants answered the survey blindly.
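To make the blinding and randomization concrete, here is a small sketch of how a section 1 item could be assembled so that the three responses appear in random order with their sources hidden. It assumes the illustrative record layout shown earlier and is not the exact survey-building code used.

```python
import random

# Sketch of assembling one "strongest argument" survey item: the three responses
# are shuffled and shown without source labels; the answer key is kept separately
# for later scoring. Field names follow the illustrative record above.
def build_strength_item(entry: dict, seed: int | None = None) -> dict:
    rng = random.Random(seed)
    candidates = [
        ("human", entry["human_argument"]),
        ("deepseek", entry["deepseek_argument"]),
        ("chatgpt", entry["chatgpt_argument"]),
    ]
    rng.shuffle(candidates)
    return {
        "prompt": entry["question"],
        "options": [text for _, text in candidates],          # shown to respondents
        "answer_key": [source for source, _ in candidates],   # used only for analysis
    }
```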

The qualitative analysis placed human and AI arguments side by side to examine structural, tonal, stylistic, rhetorical, content, and persuasive tactic differences. Another key component was determining whether AI arguments align more with high school or college writing styles.

Experimental Details

I took numerous measures with my data and evaluation structure to ensure fair and accurate analyses. The data was cleaned and formatted extensively so that strong human arguments were presented on popular, contentious topics that are well represented in the dataset. I assigned a topic to each entry dynamically based on its provided summary and then selected data entries at random within each topic category. The AI models were given standardized instructions across topics, with prompts and constraints designed to mirror human-level arguments and debate-style speeches. The arguments were presented in random order in the survey to eliminate order biases. Participants ranked argument strength, attempted AI detection, and rated each argument on a number of factors.

Results

Qualitative Results:

AI and human arguments differ significantly in structure, tone, rhetoric, and persuasive techniques. Human arguments are shorter, direct, and focused on one or two key ideas, often diving straight into the argument with minimal background context. In contrast, AI-generated arguments are longer and more structured, frequently starting with definitions and background explanations before addressing the core issue. This tendency to over-explain makes AI responses less engaging and urgent compared to human arguments.

In terms of tone and style, humans employ rhetorical flair, emotional appeals, and strong declarative statements, often using memorable phrases, humor, and confident assertions like “this is essential.” AI, on the other hand, remains formal, detached, and overly cautious, frequently hedging its claims with phrases like “some argue that” or “there are multiple perspectives.”

The persuasive tactics of AI and human debaters also diverge. Humans frame arguments as moral crises or urgent injustices, using punchlines, rhetorical questions, and direct emotional engagement (“How is this fair?”). AI, in contrast, presents issues as complex policy problems, focusing on balance, logical thoroughness, and structured reasoning rather than moral urgency. While humans seek to persuade, AI seeks to explain—which often diminishes its rhetorical impact.

Compared to high school human debaters, AI demonstrates stronger organization and logical cohesion but lacks emotional conviction and personal stakes. Against college-level debaters, AI sounds polished but cautious, often avoiding risk-taking, deep theoretical engagement, and creative framing.

Ultimately, AI can generate coherent and persuasive arguments, but it struggles with emotional resonance, strategic depth, and adaptability in dynamic debate settings. AI arguments explain, while human arguments convince.

Quantitative Results: Here is a graph showing the distribution of answers to section 1 of the survey on which response was the strongest for a given question. As you can see, all three sources of arguments came out fairly evenly, with DeepSeek having the slight edge over the other two. When sorted by topic, ChatGPT was consistently the ‘strongest’ response to economy-focused questions, while humans excelled at questions revolving around the environment. DeepSeek performed consistently across the board.

For section 2, respondents correctly identified whether a sample was written by a human or by AI only 48.9% of the time. This leads me to conclude that people have very little ability to tell whether short, persuasive phrases are AI-generated, since this accuracy is just below what I would expect from pure guessing between the two options. The phrase “A carbon tax would need to be set at an optimal level that accounts for the economy and climate science. This is an impossible task.” was the one respondents answered correctly at the lowest rate, with only 34.4% correctly stating that it was written by a human.
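The "just below chance" reading can be checked with a simple binomial test of the observed detection accuracy against 50% guessing. The sketch below shows that check; the counts in the usage comment are placeholders, not the survey's actual totals.

```python
from scipy.stats import binomtest

# Sketch of the chance-level check: is the observed detection accuracy
# distinguishable from coin-flip guessing? Counts are placeholders.
def detection_vs_chance(num_correct: int, num_answers: int) -> tuple[float, float]:
    """Return (accuracy, two-sided p-value) against the 50% guessing baseline."""
    result = binomtest(num_correct, num_answers, p=0.5, alternative="two-sided")
    return num_correct / num_answers, result.pvalue

# Example with placeholder counts: 48.9% accuracy over 1,000 answered items.
# acc, p = detection_vs_chance(num_correct=489, num_answers=1000)
```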

The graph below shows the distribution of average scores for section 3, the grading section. From this, you can see that college-level human-written arguments were rated slightly more persuasive and logically cohesive than the rest, while high school-level human arguments were the least persuasive of the group yet the most insightful. This could suggest that high school arguments convey important information but are less effective at persuading. The ChatGPT and DeepSeek arguments were the least logically cohesive and insightful, yet fell in the middle for persuasiveness.
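These averages could be computed by grouping the section 3 responses by argument source and rating criterion; a minimal sketch follows, assuming a long-format survey export with "source", "criterion", and "rating" columns (the column names are assumptions, not the actual export format).

```python
import pandas as pd

# Sketch of the section 3 aggregation: mean 1-5 rating per argument source
# (human college, human high school, DeepSeek, ChatGPT) and criterion
# (persuasiveness, logical cohesion, insightfulness). Column names are assumed.
def average_ratings(responses: pd.DataFrame) -> pd.DataFrame:
    return (
        responses.groupby(["source", "criterion"])["rating"]
                 .mean()
                 .unstack("criterion")   # rows: sources, columns: criteria
                 .round(2)
    )
```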

The quantitative findings suggest that AI-generated arguments are competitive with human-written ones in terms of perceived strength and persuasiveness. Furthermore, the difficulty in distinguishing AI from human arguments—especially in shorter, structured statements—demonstrates that AI can effectively mimic persuasive human writing. Additionally, college-level human arguments consistently scored the highest in persuasiveness and logical cohesion, while high school-level human arguments were the most insightful but the least persuasive. AI arguments were often perceived as persuasive but lacked insight and depth, reinforcing the idea that while AI-generated responses sound compelling, they struggle to provide novel or strategically framed arguments. These results highlight AI’s growing ability to generate structured and persuasive text but also its limitations in high-level debate settings.

Remaining tasks

There are many areas to explore beyond the work I did in my project. It would be interesting to have a much larger dataset than the 39 prompts I chose and incorporate more topics into the experiment. Additionally, it would be valuable to compare AI-generated rebuttals against human counterarguments, assessing AI’s ability to engage dynamically rather than just constructing standalone persuasive responses. I am also interested in examining how sample length affects argument strength by incorporating long-form persuasive writing such as essays or extended debate speeches. This would allow for analysis of how AI sustains logical flow, adapts to counterarguments, and maintains persuasiveness over multiple paragraphs. AI may perform well in isolated debate-style responses but struggle when required to develop a nuanced, evolving position over extended writing. Broadening the reach of my survey is another potential next step. I was only able to survey about 100 people, primarily UChicago students and a few friends from high school (who were in a debate club with me) at other universities around the country. I would like to survey current high school students, graduates of college, and so on to see if there are differences in response based on education level and experience.

File Upload Link

Simonian_FinalProject.zip