Measuring the Helpfulness of AI-Generated Text
#helpfulness, #AI-generated-text, #CoAuthor
Authors: Esther Goldberg, Yuzhou Wang, Sophia North
Repository: Helpfulness Repo
Our project aims to explore whether it is possible to develop automatic metrics for helpfulness in AI-generated writing contexts. Existing methods rely heavily on human annotations, which are subjective, resource-intensive, and difficult to scale. We aim to design metrics that capture linguistic features of helpfulness.
For creative writing, AI models can assist with generating characters, world-building, brainstorming, and providing stylistic suggestions. In this context, helpfulness may involve coherence, creativity (AI coverage and syntactic similarity), and stylistic fit (tone and part-of-speech analysis). We aim to use pre-existing models built around these characteristics to develop metrics that capture helpfulness in creative writing. Our goal is to create a standardized metric or score that can measure helpfulness without relying solely on human evaluations.
In our project, helpfulness is defined by whether users accept or reject AI-generated suggestions, where a suggestion is considered helpful if the user accepts it. While some metrics, such as POS (part-of-speech) similarity and coherence, show significant differences between accepted and rejected suggestions, other helpfulness metrics are less predictive. Overall, our findings suggest that users’ likelihood of accepting AI-generated text varies based on factors such as lexical structure, emotional similarity, and complexity.
In this project, we aim to measure the “helpfulness” of AI-generated text in collaborative writing scenarios, focusing on both creative and argumentative writing. As AI language models like GPT-4 play a larger part in our lives, it is crucial to evaluate how helpful their suggestions are to human writers. However, assessing helpfulness is challenging because it is a subjective and multifaceted concept.
Our goal is to develop automatic, scalable metrics for evaluating helpfulness that align with human judgements. Current methods rely heavily on human annotations, which are time-consuming and difficult to scale. By designing metrics that capture key linguistic features of helpfulness, we aim to reduce that reliance on human evaluators and make AI writing assistants more practical for real-world applications. Automating the evaluation of helpfulness will help to improve the accuracy and consistency of AI-generated suggestions.
The task involves measuring the helpfulness of AI-generated suggestions during writing sessions. We will assess key aspects of helpfulness, such as:
- Coherence: Measuring how well the suggestion fits within the broader context (Using a sentence embedding model and cosine similarity)
- Creativity: Evaluating the originality of the suggestion (Using metrics such as Exact-Match Analysis and Syntactical Complexity)
- Stylistic fit: Assessing whether the suggestion aligns with the tone and style of the existing text (Using Tone Analysis and POS Analysis)
We will develop a composite helpfulness score that combines these factors to provide a standardized metric for evaluating AI-generated text.
Our main source of data will be the CoAuthor dataset. This is a dataset of interactions between 63 writers and GPT-3, containing both creative and argumentative writing tasks. We will segment the interleaved human and AI-generated text to compare AI suggestions with final human revisions.
Our methodology involves extracting linguistic features and using embedding-based metrics in order to score the key aspects of helpfulness. We will train a regression model to predict helpfulness scores based on these features.
Main approach:
- Our approach focuses on creating metrics for the three key aspects of helpfulness (coherence, creativity, and stylistic fit) and combining them into a weighted average that serves as our overall helpfulness score. We use the CoAuthor dataset because it captures interactions between human writers and GPT-3-generated suggestions, letting us examine which kinds of suggestions are deemed “more helpful”.
Baseline:
- We will compare our scores of helpfulness to the human judgements in the CoAuthor dataset by assessing whether the helpfulness score aligns with suggestions that were accepted.
Novelty:
- Unlike metrics that focus on surface-level features, our weighted helpfulness metric leverages semantic embeddings to capture deeper relationships between a suggestion and its surrounding context. This allows us to evaluate helpfulness in a more nuanced and context-sensitive manner.
- While metrics like tone and coherence have been studied in NLP, their application to human-AI collaborative writing is less explored. Our work builds an overall metric that combines multiple different features to evaluate AI-generated suggestions in real-world writing scenarios.
For this experiment, we use the CoAuthor: Human-AI Collaborative Writing Dataset, which consists of 1,447 writing sessions. The analysis focuses specifically on the `currentDoc` and `currentSuggestions` fields, which contain the human-authored prompts and AI-generated responses.
This dataset is well suited to evaluating helpfulness metrics, including coherence, creativity, and stylistic fit, as it provides AI-generated text aligned with human-authored prompts and enables a structured comparison between human and AI text.
- Reference: CoAuthor Dataset
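To ground the analysis, each session can be read as a stream of logged events. Below is a minimal sketch of how one might iterate over a session file and pull out the document state and suggestion list at each suggestion-opening event; the JSONL layout and the "suggestion-open" event name are assumptions about the dataset's schema rather than confirmed details.

```python
# Minimal sketch (assumed schema): read one CoAuthor session stored as JSONL, one event per
# line, and yield the document text and suggestion list whenever the suggestion box is opened.
import json

def suggestion_events(session_path: str):
    with open(session_path) as f:
        for line in f:
            event = json.loads(line)
            # "suggestion-open" is an assumed eventName value for the moment the user opens
            # the suggestion box; currentDoc / currentSuggestions are the fields described above.
            if event.get("eventName") == "suggestion-open":
                yield event.get("currentDoc", ""), event.get("currentSuggestions", [])
```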
- We used the all-MiniLM-L6-v2 model, a sentence embedding model that maps sentences and paragraphs to a 384-dimensional dense vector space and is intended to be used as a sentence and short-paragraph encoder. Given an input text, it outputs a vector that captures its semantic information.
- We then used cosine similarity of the embeddings to calculate a coherence score. A low score means worse coherence while a high score means better coherence.
- Once we had the scores for all suggestions (both accepted suggestions and rejected suggestions), we analyzed the scores for both groups to see if there were correlations between the score and acceptance status.
- Reference: all-MiniLM-L6-v2 model
- Embedding model: all-MiniLM-L6-v2 from the Sentence Transformers Hugging Face library.
- Similarity metric: cosine similarity between embeddings.
- No training was required as the all-MiniLM-L6-v2 model is pre-trained.
- Inputs:
- Context: The current document as it is written up to the point where a suggestion is opened.
- Suggestion: All suggestions generated by the AI are evaluated for coherence.
- We looped through every writing session, stopped at every point where the user opened the suggestion box, and calculated coherence scores for all suggestions shown (a minimal sketch of this computation is given below).
- We found that the average coherence score was about 0.392, with scores ranging from -0.089 to 0.998.
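A minimal sketch of this coherence computation, assuming the context and suggestion strings have already been extracted from a session (the toy strings in the example are ours):

```python
# Embed the current document and a suggestion with all-MiniLM-L6-v2,
# then score coherence as the cosine similarity of the two embeddings.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

def coherence_score(context: str, suggestion: str) -> float:
    embeddings = model.encode([context, suggestion], convert_to_tensor=True)
    return util.cos_sim(embeddings[0], embeddings[1]).item()

# Toy example:
print(coherence_score("The ship drifted silently through the fog.",
                      "A lighthouse blinked somewhere ahead."))
```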
Coherence Score Statistics
count | mean | std | min | 25% | 50% | 75% | max |
---|---|---|---|---|---|---|---|
3895.0 | 0.392417 | 0.214796 | -0.088527 | 0.229315 | 0.368402 | 0.534344 | 0.998474 |
Table 1: The basic statistics of all coherence scores.
Figure 1: The distribution of coherence scores for both accepted AI suggestions (green) and rejected AI suggestions (red).
- The average coherence score of accepted suggestions was about 0.402, while that of rejected suggestions was about 0.354, suggesting that higher coherence is associated with acceptance.
Coherence Score by Acceptance Status
acceptance_status | count | mean | std | min | 25% | 50% | 75% | max |
---|---|---|---|---|---|---|---|---|
accepted | 3122.0 | 0.401851 | 0.216599 | -0.088527 | 0.235949 | 0.374242 | 0.546223 | 0.998474 |
rejected | 773.0 | 0.354315 | 0.203078 | -0.065787 | 0.201588 | 0.338742 | 0.480832 | 0.981411 |
Table 2: The basic statistics of coherence scores split into accepted AI suggestions and rejected AI suggestions.
To measure creativity, our evaluation framework integrates two complementary components:
We used the `DJ_search_exact` algorithm (Lu et al. 2025) to identify spans in AI-generated text that exactly match segments in a large reference corpus. This approach yields a score between 0 and 1, where:
- 0 indicates high novelty (no matching spans found in the corpus).
- 1 indicates low novelty (text completely matches corpus segments).
The script was customized for our specific use case.
Instead of measuring the raw syntactic complexity of AI-generated suggestions alone, we calculated how closely the syntactic complexity of AI-generated suggestions aligns with the user's own writing. The syntactic complexity was measured based on three linguistic features:
- Average parse tree depth: normalized by a maximum depth of d_max = 10
- Average number of subordinate clauses per sentence: normalized by a maximum of 2
- Proportion of sentences using passive voice
These metrics were individually normalized to a [0, 1] scale, and the final syntactic complexity score for each piece of text was computed as the average of these normalized values.
The similarity in syntactic complexity between the user's writing and each AI-generated suggestion was then computed from the two complexity scores, producing a similarity metric ranging from 0 (highly dissimilar complexity) to 1 (highly similar complexity); a minimal sketch of this computation is given below.
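The sketch below shows one way to implement this scoring with spaCy. The dependency labels used to count subordinate clauses and detect passive voice, and the use of 1 − |difference| as the similarity, are our assumptions about a reasonable implementation of the description above.

```python
# Syntactic complexity as the average of three normalized features, and similarity
# as 1 - |difference| between the user's and the AI's complexity scores (assumed formula).
import spacy

nlp = spacy.load("en_core_web_sm")

SUBORDINATE_DEPS = {"advcl", "ccomp", "xcomp", "acl", "relcl"}  # assumed label set

def tree_depth(token):
    depth = 0
    while token.head is not token:
        token = token.head
        depth += 1
    return depth

def syntactic_complexity(text: str) -> float:
    doc = nlp(text)
    sents = list(doc.sents)
    if not sents:
        return 0.0
    avg_depth = sum(max(tree_depth(t) for t in s) for s in sents) / len(sents)
    avg_subord = sum(sum(t.dep_ in SUBORDINATE_DEPS for t in s) for s in sents) / len(sents)
    passive_rate = sum(any(t.dep_ == "nsubjpass" for t in s) for s in sents) / len(sents)
    # Normalize each feature to [0, 1] and average them.
    return (min(avg_depth / 10, 1.0) + min(avg_subord / 2, 1.0) + passive_rate) / 3

def complexity_similarity(user_text: str, ai_text: str) -> float:
    return 1.0 - abs(syntactic_complexity(user_text) - syntactic_complexity(ai_text))
```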
- Approach: For each AI-generated suggestion, we computed its exact-match score and syntactic similarity relative to human-written text, analyzing the relationship between these scores and whether suggestions were accepted or rejected by users.
- Setup: All parsing and linguistic feature extraction were conducted using Python and the spaCy library.
Overall syntactic similarity between AI-generated text and human-authored text was high (mean ≈ 0.802), indicating strong alignment across all suggestions (n=3895):
Acceptance Status | Count | Mean Similarity | Std. Dev. | Min | 25% | Median | 75% | Max |
---|---|---|---|---|---|---|---|---|
Accepted | 3122 | 0.803 | 0.164 | 0.100 | 0.700 | 0.833 | 0.933 | 1.0 |
Rejected | 773 | 0.796 | 0.170 | 0.133 | 0.700 | 0.833 | 0.933 | 1.0 |
Table 3: Syntactic similarity statistics by acceptance status.
The difference in average syntactic similarity scores between accepted (0.803) and rejected suggestions (0.796) suggests minimal distinction based solely on syntactic complexity alignment. This highlights the need to integrate multiple creativity metrics to more accurately predict user acceptance.
Figure 2: The distribution of syntactical similarity scores for both accepted AI suggestions (green) and rejected AI suggestions (red).
We analyzed coverage using a 6-gram approach, examining the degree to which AI-generated text matched segments in a large reference corpus. Coverage scores range from 0 (no matches in the corpus, indicating high novelty) to 1 (complete matches, indicating lower novelty).
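As a simplified illustration of this idea (not the `DJ_search_exact` implementation from Lu et al. 2025, which matches variable-length spans against a web-scale corpus), coverage can be approximated as the fraction of a suggestion's 6-grams that appear in the reference text:

```python
# Toy 6-gram coverage: the share of a suggestion's 6-grams found in a reference corpus.
def ngrams(tokens, n=6):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def coverage(suggestion: str, corpus_text: str, n: int = 6) -> float:
    corpus_grams = set(ngrams(corpus_text.lower().split(), n))
    sugg_grams = ngrams(suggestion.lower().split(), n)
    if not sugg_grams:
        return 0.0  # suggestion shorter than n tokens
    return sum(g in corpus_grams for g in sugg_grams) / len(sugg_grams)
```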
Across all AI-generated suggestions (n=3895), the average coverage was 0.573, with a broad distribution suggesting variability in novelty across suggestions:
Acceptance Status | Count | Mean | Std. Dev. | Min | 25% | Median | 75% | Max |
---|---|---|---|---|---|---|---|---|
Accepted | 3122 | 0.569 | 0.318 | 0.0 | 0.400 | 0.635 | 0.818 | 1.0 |
Rejected | 773 | 0.589 | 0.320 | 0.0 | 0.429 | 0.667 | 0.833 | 1.0 |
Table 4: Coverage scores by acceptance status (unfiltered).
Figure 3: The distribution of coverage scores for both accepted AI suggestions (green) and rejected AI suggestions (red).
Interestingly, rejected suggestions had slightly higher coverage scores (mean=0.589) compared to accepted suggestions (mean=0.569). However, the differences were small, suggesting no strong overall correlation between coverage and acceptance status.
When excluding suggestions with negligible coverage (scores below 0.01), the mean coverage increased to approximately 0.690, indicating these remaining suggestions had greater overlap with the corpus:
Acceptance Status | Count | Mean | Std. Dev. | Min | 25% | Median | 75% | Max |
---|---|---|---|---|---|---|---|---|
Accepted | 2588 | 0.686 | 0.204 | 0.200 | 0.538 | 0.698 | 0.857 | 1.0 |
Rejected | 646 | 0.705 | 0.201 | 0.207 | 0.547 | 0.727 | 0.867 | 1.0 |
Table 5: Coverage scores by acceptance status (filtered: coverage ≥ 0.01).
Figure 4: The distribution of coverage scores after filtering out suggestions with coverage below 0.01.
After filtering, rejected suggestions showed slightly higher mean coverage (0.705) than accepted suggestions (0.686). This subtle difference suggests a minor tendency for users to prefer slightly more novel AI-generated suggestions.
Overall, these findings suggest that while coverage alone does not strongly predict acceptance, there may be a slight preference for suggestions with lower coverage—indicating higher novelty—in collaborative writing contexts.
- RoBERTa GoEmotions is used to analyze the emotional tone of human-authored prompts and AI-generated responses.
- This model detects 27 distinct emotions and was selected due to its status as the best-performing open-source deep learning model for emotion classification.
- Reference: Go Emotions Dataset
- JSD (Jensen-Shannon divergence) measures how far apart the tone distributions of human and AI-generated text are; we use it to derive a tone similarity score (a minimal sketch is given below).
- A higher tone similarity (i.e., lower divergence) indicates a closer alignment in tone between human-written prompts and AI-generated suggestions.
- Reference: Jensen-Shannon Divergence Overview
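A minimal sketch of the tone-similarity step is given below. The specific GoEmotions checkpoint name and the conversion of Jensen-Shannon distance into a similarity (1 − distance) are our assumptions.

```python
# Classify each text's emotions with a GoEmotions model, turn the label scores into a
# probability distribution, and compare the two distributions with Jensen-Shannon distance.
import numpy as np
from scipy.spatial.distance import jensenshannon
from transformers import pipeline

# Checkpoint name is an assumption; any RoBERTa model fine-tuned on GoEmotions would do.
emotion_clf = pipeline("text-classification",
                       model="SamLowe/roberta-base-go_emotions", top_k=None)

def tone_distribution(text: str) -> np.ndarray:
    scores = emotion_clf([text])[0]  # list of {label, score} dicts
    probs = np.array([s["score"] for s in sorted(scores, key=lambda s: s["label"])])
    return probs / probs.sum()

def tone_similarity(human_text: str, ai_text: str) -> float:
    # 1 - Jensen-Shannon distance, so higher means closer tonal alignment (our convention).
    return 1.0 - float(jensenshannon(tone_distribution(human_text),
                                     tone_distribution(ai_text)))
```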
- POS tagging is performed using spaCy to analyze the syntactic structure of text.
- This helps determine whether the AI-generated text follows a lexical structure similar to that of human-written prompts.
- Cosine similarity is used to compare the frequency distributions of POS tags between human and AI-generated text (a minimal sketch is given below).
- A higher cosine similarity score suggests that the AI-generated text maintains a syntactic structure similar to human writing.
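A minimal sketch of the POS-similarity computation, using spaCy's coarse POS tags and a fixed tag inventory of our choosing:

```python
# Count coarse POS tags in each text and compare the count vectors with cosine similarity.
import numpy as np
import spacy

nlp = spacy.load("en_core_web_sm")
POS_TAGS = ["ADJ", "ADP", "ADV", "AUX", "CCONJ", "DET", "INTJ", "NOUN", "NUM",
            "PART", "PRON", "PROPN", "PUNCT", "SCONJ", "SYM", "VERB", "X"]

def pos_vector(text: str) -> np.ndarray:
    doc = nlp(text)
    return np.array([sum(t.pos_ == tag for t in doc) for tag in POS_TAGS], dtype=float)

def pos_similarity(human_text: str, ai_text: str) -> float:
    a, b = pos_vector(human_text), pos_vector(ai_text)
    if not a.any() or not b.any():
        return 0.0
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
```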
The primary objective of this experiment is to assess whether AI-generated text aligns with human-authored prompts in terms of tone and lexical style. Stylistic fit is an important factor when quantifying helpfulness, and these metrics help measure the stylistic fit of AI-generated text in creative writing.
- RoBERTa GoEmotions extracts the six most prominent emotional tones from each text segment, assigning each detected emotion a score between 0 and 1.
- The resulting tone distributions are then compared between human and AI text using JSD. The JSD scores are further analyzed based on whether AI-generated suggestions were accepted or rejected by users.
- POS tagging is performed using spaCy, and cosine similarity is computed to measure the structural alignment between human and AI text.
- A t-test is conducted to determine whether the tone similarity of accepted AI-generated suggestions is significantly different from that of rejected suggestions.
- The distribution of tone similarity scores between human and AI-generated text is plotted separately for accepted and rejected AI suggestions.
- Both distributions appear roughly uniform, indicating that the scores are not normally distributed.
Figure 5: The distribution of tone similarity scores for both accepted AI suggestions (green) and rejected AI suggestions (red).
- The distribution of POS similarity scores between human and AI-generated text is plotted separately for accepted and rejected AI suggestions.
- Both distributions appear left-skewed, indicating that the model is generally good at matching the lexical structure of human writing.
Figure 6: The distribution of POS similarity scores for both accepted AI suggestions (green) and rejected AI suggestions (red).
- A scatter plot visualizes the relationship between tone similarity (JSD) and POS similarity (cosine similarity).
- Green points represent accepted AI-generated suggestions, while red points represent rejected suggestions.
- The distribution suggests that AI-generated text generally aligns well with human lexical structures, regardless of whether the suggestions are accepted.
- There is no clear relationship between the two similarity scores.
Figure 7: The scatter plot of tone similarity scores vs. POS similarity scores for both accepted AI suggestions (green) and rejected AI suggestions (red).
Metric | t-Statistic | p-Value | Interpretation |
---|---|---|---|
Tone Similarity | 2.1420 | 0.0324 | Tone similarity significantly influences acceptance. |
POS Similarity | 4.7531 | 0.0000 | POS similarity significantly influences acceptance. |
Table 6: The t-test statistics of the tone similarity and POS similarity scores.
- Tone similarity significantly impacts whether a suggestion is accepted or rejected. This suggests that users might be considering emotional alignment when accepting AI-generated suggestions.
- POS similarity significantly affects acceptance, indicating that users prefer AI-generated suggestions that structurally align with their own writing style.
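For reference, a minimal sketch of how such an accepted-vs-rejected comparison can be run with SciPy, assuming a pandas DataFrame of per-suggestion scores with an acceptance_status column (the DataFrame and metric column names are hypothetical):

```python
# Welch's t-test comparing a metric's scores for accepted vs. rejected suggestions.
import pandas as pd
from scipy.stats import ttest_ind

def acceptance_ttest(df: pd.DataFrame, metric: str):
    accepted = df.loc[df["acceptance_status"] == "accepted", metric]
    rejected = df.loc[df["acceptance_status"] == "rejected", metric]
    return ttest_ind(accepted, rejected, equal_var=False)

# e.g. acceptance_ttest(scores_df, "tone_similarity")
```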
We implemented a weighted sum approach to combine our helpfulness metrics (coherence, creativity, and stylistic fit) into a single composite measure. This approach allowed us to determine the importance of each metric and provide insights by calculating weights specific to each writer.
We trained a logistic regression model using the following helpfulness metrics as input features:
- Tone similarity
- Part-of-speech (POS) similarity
- Coherence score
- Syntactic complexity similarity
- AI coverage
The logistic regression model was trained to predict whether an AI-generated suggestion would be accepted or rejected by the user. All features were standardized before training to ensure comparability of their coefficients.
We developed a Python function, `calculate_metric_weights`, that normalizes the logistic regression coefficients into weights reflecting the relative importance of each metric, enabling us to conduct analyses both globally and per writer (a minimal sketch is given below).
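A minimal sketch of how `calculate_metric_weights` could be implemented is shown below; standardizing the features, fitting the classifier, and normalizing the absolute coefficients so they sum to 1 are our assumptions about the details.

```python
# Fit a logistic regression on accept/reject labels and convert its coefficients
# into normalized importance weights for the five helpfulness metrics.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

FEATURES = ["tone_similarity", "pos_similarity", "coherence",
            "syntactic_similarity", "ai_coverage"]

def calculate_metric_weights(X: np.ndarray, accepted: np.ndarray) -> dict:
    """X: (n_suggestions, 5) feature matrix; accepted: 0/1 labels."""
    X_std = StandardScaler().fit_transform(X)
    clf = LogisticRegression(max_iter=1000).fit(X_std, accepted)
    coefs = np.abs(clf.coef_[0])
    return dict(zip(FEATURES, coefs / coefs.sum()))
```

Applying the same function to the events of a single writer yields the per-writer weights discussed further below.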
The weights derived from the logistic regression model for all AI-suggestions are summarized below:
Metric | Weight |
---|---|
Tone similarity | 0.1109 |
POS similarity | 0.3968 |
Coherence score | 0.2811 |
Syntactical similarity | 0.0293 |
AI coverage | 0.1871 |
Table 7: The weights associated for each helpfulness metric derived from the logistic regression model.
The results indicate that POS similarity and coherence score are the strongest predictors of whether a user accepts AI-generated suggestions.
The following histograms show the distribution of weights for the five helpfulness metrics we defined. They suggest that users value POS similarity and coherence the most, followed by AI coverage, tone similarity, and syntactical similarity, when deciding whether to accept a suggestion.
Figure 8: Histogram of Helpfulness Metrics
We further calculated weights individually for selected writers to observe variations in metric importance:
- Worker A3DUPRZSMU9W5R: prioritized syntactic complexity similarity (55.5%), with very low importance on coherence (1.7%).
- Worker A2WGW5Y3ZFBDEC: prioritized tone similarity (29%) and syntactical similarity (50%).
- Worker A23EWFNNOUS10B: prioritized POS similarity (33%), tone similarity (29%), and AI coverage (23%), but placed little weight on coherence (8%) and syntactical similarity (7.2%).
These individual variations indicate significant differences in how users perceive the helpfulness of AI-generated suggestions, suggesting that personalized metrics could further improve the effectiveness of AI writing support systems.
Figure 9: Heatmap of Helpfulness Metrics vs. Individual workers (deeper color represents more important)
Our project developed automatic metrics for assessing the “helpfulness” of AI-generated suggestions in collaborative writing settings, measured by whether users accept or reject these suggestions. By assessing the CoAuthor dataset, we identified coherence and POS similarity as the most informative predictors of acceptance of AI-generated suggestions.
While other factors such as tone similarity and coverage also contribute to users' decisions, they carry less weight in determining acceptance.
Considering all of these factors, a single one-size-fits-all benchmark for helpfulness in AI suggestions does not appear feasible. For future work, we plan to adopt adaptive metrics that dynamically measure helpfulness by continually analyzing a user's style, voice, and creativity.
Our experiments show that combining coherence, lexical structure, coverage, and tone can yield a comprehensive view of helpfulness. Future work could use individualized weightings that emphasize whichever features a particular writer values most, ensuring more precise, context-aware suggestions.
Lee, Mina, Percy Liang, and Qian Yang. “CoAuthor: Designing a Human-AI Collaborative Writing Dataset for Exploring Language Model Capabilities.” CHI Conference on Human Factors in Computing Systems (CHI ’22), ACM, 2022, pp. 1–19, doi:10.1145/3491102.3502030.
Lu, Ximing, et al. “AI as Humanity’s Salieri: Quantifying Linguistic Creativity of Language Models via Systematic Attribution of Machine Text against Web Text.” arXiv, 2025, https://arxiv.org/abs/2410.04265.