# NLP Performance Tests
## How summaries are built
The `NLP_Interface` class includes two methods for summarizing text: `summarization` and `spacy_summarization`. Here's an explanation of how each of these functions works to obtain a summary of the text:
### The `summarization` method

This method leverages a pre-trained transformer model for text summarization provided by the Hugging Face `transformers` library. Here's a step-by-step breakdown of how it works:
- **Tokenization:** The input text is tokenized using the SpaCy model loaded in `self.nlp`.
- **Determine Summary Length:** Calculate the minimum and maximum length for the summary based on the input text length and the `percentage` parameter. The `percentage` parameter determines the proportion of the text to keep in the summary.
- **Load Summarization Pipeline:** Use the `pipeline` function from the `transformers` library to create a summarization pipeline with the model specified in `SUMMARIZATION_MODEL`, in this case `facebook/bart-large-cnn`.
- **Generate Summary:** Pass the input text to the summarization pipeline along with the calculated `max_length` and `min_length` parameters. The pipeline processes the text and generates a summary.
- **Return Summary:** Extract and return the summary text from the pipeline's output.
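The wiki does not include the method's source, so the following is only a minimal sketch of how these steps fit together. The SpaCy model name, the default `percentage` value, and the exact min/max length heuristic are illustrative assumptions:

```python
import spacy
from transformers import pipeline

SUMMARIZATION_MODEL = "facebook/bart-large-cnn"

class NLP_Interface:
    def __init__(self):
        # Assumed model; the wiki does not say which SpaCy model is loaded.
        self.nlp = spacy.load("en_core_web_sm")

    def summarization(self, text: str, percentage: float = 0.2) -> str:
        # 1. Tokenization: measure the input length via the SpaCy pipeline.
        doc = self.nlp(text)
        # 2. Determine summary length from the percentage parameter
        #    (this particular heuristic is an assumption).
        max_length = max(int(len(doc) * percentage), 10)
        min_length = max(max_length // 2, 5)
        # 3. Load the summarization pipeline with the configured model.
        summarizer = pipeline("summarization", model=SUMMARIZATION_MODEL)
        # 4. Generate the summary within the computed length bounds.
        output = summarizer(text, max_length=max_length, min_length=min_length)
        # 5. Return the summary text from the pipeline's output.
        return output[0]["summary_text"]
```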
### The `spacy_summarization` method

This method implements a custom extractive summarization technique using SpaCy for natural language processing. Here's how it works:
- **Tokenization:** Tokenize the input text using the SpaCy model loaded in `self.nlp`.
- **Text Cleaning and Vectorization:** Create a frequency dictionary `freq_of_word` for the words in the text, ignoring stop words and punctuation. Normalize the word frequencies by dividing each frequency by the maximum frequency in the dictionary.
- **Sentence Scoring:** Identify the sentences in the text using `doc.sents`. Score each sentence based on the sum of the normalized frequencies of the words it contains.
- **Select Top Sentences:** Calculate the number of sentences to include in the summary based on the `percentage` parameter. Use the `nlargest` function to select the top-scoring sentences.
- **Sort and Prepare Final Summary:** Sort the selected sentences based on their original order in the text, then join them to form the final summary.
- **Return Summary:** Convert the list of selected sentences into a string and return it as the summary.
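Again, the implementation is not shown in the wiki; the sketch below follows the steps above as a standalone function. The SpaCy model name and the handling of empty input are assumptions:

```python
from heapq import nlargest
from string import punctuation

import spacy
from spacy.lang.en.stop_words import STOP_WORDS

def spacy_summarization(text: str, percentage: float = 0.2) -> str:
    nlp = spacy.load("en_core_web_sm")  # assumed model
    doc = nlp(text)
    # Build freq_of_word, ignoring stop words and punctuation.
    freq_of_word = {}
    for token in doc:
        word = token.text.lower()
        if word not in STOP_WORDS and word not in punctuation:
            freq_of_word[word] = freq_of_word.get(word, 0) + 1
    if not freq_of_word:  # assumed guard for texts with no content words
        return text
    # Normalize frequencies by the maximum frequency in the dictionary.
    max_freq = max(freq_of_word.values())
    freq_of_word = {w: f / max_freq for w, f in freq_of_word.items()}
    # Score each sentence by the sum of its words' normalized frequencies.
    sent_scores = {}
    for sent in doc.sents:
        sent_scores[sent] = sum(
            freq_of_word.get(t.text.lower(), 0.0) for t in sent
        )
    # Select the top-scoring sentences for the requested proportion.
    n = max(int(len(sent_scores) * percentage), 1)
    selected = nlargest(n, sent_scores, key=sent_scores.get)
    # Restore original order and join into the final summary string.
    selected.sort(key=lambda s: s.start)
    return " ".join(s.text for s in selected)
```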
## Comparison and Use Cases

- **`summarization`:**
  - Pros: Utilizes a pre-trained transformer model, which can generate high-quality summaries by understanding the context and semantics of the text.
  - Cons: Requires more computational resources and may be slower, especially for longer texts.
- **`spacy_summarization`:**
  - Pros: Custom implementation that is fast and lightweight, since it uses frequency-based extractive summarization.
  - Cons: May not capture context and nuance as effectively as a transformer-based model, since it relies on word frequency and sentence scoring.
Both methods are useful depending on the requirements and constraints of the task at hand. The `summarization` method is suitable for generating more coherent and contextually accurate summaries, while the `spacy_summarization` method is suitable for quick and resource-efficient summarization tasks.
## The test flow

The test flow is presented below.

```mermaid
stateDiagram-v2
    start --> Initialize_text_samples
    Initialize_text_samples --> Perform_SpaCy_summarization
    Perform_SpaCy_summarization --> Record_SpaCy_results
    Record_SpaCy_results --> Perform_Hugging_Face_summarization
    Perform_Hugging_Face_summarization --> Record_Hugging_Face_results
    Record_Hugging_Face_results --> Evaluate_summary_quality
    Evaluate_summary_quality --> Combine_results_for_comparison
    Combine_results_for_comparison --> Save_combined_results
    Save_combined_results --> End
```
## Flow states
### Initialize Text Samples

We used 7 sample texts for testing: three were random texts on various subjects, and the other four were real-life GitHub issues. The lengths of these texts ranged from 42 to 1518 characters.

For each sample, we created 6 reference summaries with approximate lengths of 50, 100, 200, 300, 500, and 1000 characters. These reference summaries were designed for performance evaluation: in the final stage, the summaries generated by each method are compared against them for scoring.

**Note:** If the reference length is greater than the text length, the reference summary is the same as the original text.
### Perform SpaCy Summarization

For each text sample and reference length, a summary was created using the `spacy_summarization` method.
### Record SpaCy Results

The generated summaries were then stored in a CSV file.
### Perform Hugging Face Summarization

For each text sample and reference length, a summary was created using the `summarization` method.
### Record Hugging Face Results

The generated summaries were then stored in a CSV file.
### Evaluate Summary Quality

#### Understanding ROUGE Scores

ROUGE (Recall-Oriented Understudy for Gisting Evaluation) is a set of metrics used to evaluate the quality of a summary by comparing it to a reference summary. Here's how it works:

- **ROUGE-1 (unigram):** Looks at single words (unigrams) in the summary and the reference. It counts how many words in the summary match words in the reference.
- **ROUGE-2 (bigram):** Looks at pairs of consecutive words (bigrams) in the summary and the reference. It counts how many word pairs in the summary match word pairs in the reference.
- **ROUGE-L (longest common subsequence):** Looks at the longest sequence of words that appears in both the summary and the reference in the same order. It finds the longest matching word sequence between the two texts, even if the words are not adjacent.

#### How We Use These Scores

For each row in our data tables, we have a reference summary and a generated summary. We calculate ROUGE-1, ROUGE-2, and ROUGE-L scores for these texts and take their average to get a single quality score. This average gives us a balanced measure of how close the summary is to the reference.
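A minimal sketch of this scoring step, assuming the `rouge_score` package (the wiki does not name the scoring library) and averaging the three F1 measures:

```python
from rouge_score import rouge_scorer

# Score against ROUGE-1, ROUGE-2, and ROUGE-L simultaneously.
scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"],
                                  use_stemmer=True)

def quality(reference: str, summary: str) -> float:
    scores = scorer.score(reference, summary)
    # Average the three F1 scores into a single quality measure.
    return sum(s.fmeasure for s in scores.values()) / 3
```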
### Combine Results for Comparison

After scoring every summary for both methods, the results are put together in a single dataframe.
### Save Combined Results

Finally, the combined results are stored in a CSV file.
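Combining and saving amounts to a few lines of pandas; the file names below are illustrative assumptions, since the wiki does not show the CSV schema:

```python
import pandas as pd

# Assumed file names for the per-method CSVs produced in the recording steps.
spacy_df = pd.read_csv("spacy_results.csv").assign(method="spacy")
hf_df = pd.read_csv("huggingface_results.csv").assign(method="huggingface")

# Put both methods' scored summaries into one dataframe and save it.
combined = pd.concat([spacy_df, hf_df], ignore_index=True)
combined.to_csv("combined_results.csv", index=False)
```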
## Results

### Execution Time

For each method, we generated 42 summaries (7 sample texts × 6 reference lengths). The execution times were measured.

| | SpaCy | Hugging Face |
|---|---|---|
| Total execution time (seconds) | 15.29 | 496.37 |
| Average execution time (seconds/summary) | 0.36 | 11.82 |
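Timings of this kind can be collected with a simple wrapper; a minimal sketch, assuming each method is exposed as a callable taking a text and a `percentage` (how the six reference lengths map onto `percentage` is not specified in the wiki):

```python
import time

def time_method(summarize, samples, percentages):
    """Run one summary per (sample, percentage) pair and report timings.

    `summarize` is assumed to be one of the two methods above, bound to
    an NLP_Interface instance.
    """
    start = time.perf_counter()
    summaries = [summarize(text, p) for text in samples for p in percentages]
    total = time.perf_counter() - start
    return summaries, total, total / len(summaries)
```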
### Conclusion

The execution times indicate a significant difference in time performance between the two summarization methods. SpaCy is much faster, generating summaries in an average of 0.36 seconds per summary, compared to Hugging Face, which takes an average of 11.82 seconds per summary. This suggests that while Hugging Face may offer more advanced summarization capabilities, SpaCy is considerably more efficient for generating summaries quickly. The choice between these methods should consider both the quality of the summaries and the available computational resources.
## Raw Quality Results

### Comparison of Text Quality Ranges

The table below presents a comparative analysis of the text quality distribution for the two summarization methods, SpaCy and Hugging Face. Quality is categorized into five distinct ranges based on percentage scores, with each range representing a different level of summary quality.

| Quality Range | SpaCy Count | Hugging Face Count |
|---|---|---|
| Higher than 80% | 10 | 4 |
| Between 60% and 80% | 1 | 0 |
| Between 40% and 60% | 7 | 4 |
| Between 20% and 40% | 12 | 25 |
| Lower than 20% | 12 | 9 |

As we can see, the SpaCy summarization method takes the lead, with the highest count of top-quality cases.
### Comparative Analysis of Text Quality Metrics Across Different Reference Lengths

The table below provides a detailed comparison of text quality metrics for the two summarization methods across the reference lengths. The metrics analyzed are average quality, standard deviation, and normalized standard deviation, which are combined into the final score.

| Ref Length | Method | Average Quality | Standard Deviation | Normalized Std Deviation | Score |
|---|---|---|---|---|---|
| 1000 | spacy | 73.09% | 0.27 | 0.64 | 0.58 |
| 500 | spacy | 56.54% | 0.31 | 0.76 | 0.42 |
| 1000 | huggingface | 43.57% | 0.26 | 0.58 | 0.35 |
| 500 | huggingface | 40.13% | 0.25 | 0.57 | 0.33 |
| 300 | huggingface | 39.33% | 0.26 | 0.59 | 0.32 |
| 300 | spacy | 46.75% | 0.37 | 1.00 | 0.31 |
| 200 | huggingface | 30.22% | 0.10 | 0.00 | 0.30 |
| 100 | huggingface | 27.18% | 0.12 | 0.11 | 0.26 |
| 200 | spacy | 34.44% | 0.31 | 0.79 | 0.25 |
| 100 | spacy | 31.50% | 0.31 | 0.79 | 0.23 |
| 50 | huggingface | 22.25% | 0.29 | 0.69 | 0.17 |
| 50 | spacy | 23.79% | 0.34 | 0.89 | 0.17 |

The results show that SpaCy generally produces higher average quality scores than Hugging Face across most reference lengths, but with larger standard deviations and normalized deviations, indicating less consistency in quality than Hugging Face.
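The wiki does not state how the Score column is derived. One reconstruction that approximately reproduces the table values (an assumption, not confirmed by the source) min-max normalizes the standard deviations within each table and then penalizes the average quality by a third of the normalized deviation:

```python
def table_scores(rows):
    """Reconstructed (assumed) scoring: rows are (avg_quality, std_dev) pairs.

    Min-max normalize the standard deviations across the table, then
    penalize each average quality by a third of its normalized deviation.
    """
    stds = [std for _, std in rows]
    lo, hi = min(stds), max(stds)
    results = []
    for avg, std in rows:
        norm_std = (std - lo) / (hi - lo) if hi > lo else 0.0
        results.append((norm_std, avg * (1 - norm_std / 3)))
    return results

# Example: the first raw-table row (avg 73.09%, std 0.27) yields
# norm_std ≈ 0.63 and score ≈ 0.58, matching the table up to rounding.
```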
## Processed Quality Results

At the beginning of the flow-state explanation, we pointed out one important fact: "If the reference length is greater than the text length, the reference summary is the same as the original text." This means that, for each reference length greater than the sample text length, the SpaCy method scores 100% and the Hugging Face method repeats its last score, producing repeated summary scores. For that reason, those repeated scores were removed, in order to better reflect the true performance of each method.
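One way to drop those repeated rows with pandas; the file name and column names are illustrative assumptions, since the wiki does not show the CSV schema:

```python
import pandas as pd

results = pd.read_csv("combined_results.csv")  # assumed file name

# A row is a repeat when the same method produced the same summary for
# the same sample at a larger reference length; keep the first occurrence.
processed = results.drop_duplicates(subset=["sample_id", "method", "summary"])
processed.to_csv("processed_results.csv", index=False)
```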
### Comparison of Text Quality Ranges

| Quality Range | SpaCy Count | Hugging Face Count |
|---|---|---|
| Higher than 80% | 0 | 2 |
| Between 60% and 80% | 1 | 0 |
| Between 40% and 60% | 7 | 4 |
| Between 20% and 40% | 11 | 20 |
| Lower than 20% | 12 | 9 |
The most significant change is in the "Higher than 80%" quality range, where the SpaCy count decreases by 10 and the Hugging Face count by 2. In the "Between 20% and 40%" range, both counts decrease slightly: SpaCy by 1 and Hugging Face by 5. The remaining ranges are unchanged for both methods. These changes make Hugging Face the top performer in terms of highest-quality counts.
### Comparative Analysis of Text Quality Metrics Across Different Reference Lengths

| Ref Length | Method | Average Quality | Standard Deviation | Normalized Std Deviation | Score |
|---|---|---|---|---|---|
| 1000 | spacy | 52.91% | 0.14 | 0.35 | 0.47 |
| 500 | spacy | 39.16% | 0.09 | 0.12 | 0.38 |
| 1000 | huggingface | 35.56% | 0.14 | 0.35 | 0.31 |
| 500 | huggingface | 30.74% | 0.08 | 0.07 | 0.30 |
| 200 | huggingface | 29.97% | 0.10 | 0.20 | 0.28 |
| 300 | huggingface | 40.60% | 0.28 | 0.97 | 0.28 |
| 300 | spacy | 25.46% | 0.08 | 0.11 | 0.25 |
| 100 | huggingface | 26.43% | 0.13 | 0.33 | 0.24 |
| 200 | spacy | 23.51% | 0.13 | 0.30 | 0.21 |
| 100 | spacy | 20.09% | 0.09 | 0.12 | 0.19 |
| 50 | huggingface | 22.25% | 0.29 | 1.00 | 0.15 |
| 50 | spacy | 11.09% | 0.06 | 0.00 | 0.11 |

Across most reference lengths and methods, the average quality and score decreased from the raw table to the processed table. Standard deviation and normalized standard deviation also generally decreased, indicating less variability in the second set of results. Despite these differences, the overall ranking remained largely unchanged.
## Final conclusion

- **Execution time:** SpaCy significantly outperforms Hugging Face in execution speed, taking an average of 0.36 seconds per summary versus Hugging Face's 11.82 seconds. SpaCy is much faster at generating summaries.
- **Raw quality results:** Comparing text quality across metrics and reference lengths, SpaCy generally outperforms Hugging Face in average quality scores across most reference lengths.
- **Processed quality results:** After adjusting for cases where reference lengths exceeded sample text lengths, the relative performance between SpaCy and Hugging Face saw some changes in the quality ranges, but the general ranking stays the same.
- **Conclusion:** Despite fluctuations in specific quality metrics, SpaCy emerges as the preferred choice for summarization tasks that prioritize speed and overall quality consistency. Hugging Face, while potentially offering more advanced summarization capabilities, lags significantly behind in speed and overall efficiency. The choice between these methods should therefore weigh summarization quality against computational efficiency based on specific project requirements and available resources.