NLP Performance Tests

How summaries are built

The NLP_Interface class includes two methods for summarizing text: summarization and spacy_summarization. Here's an explanation of how each of these functions works to obtain a summary of the text:

1. summarization Method

This method leverages a pre-trained transformer model for text summarization provided by the Hugging Face transformers library. Here's a step-by-step breakdown of how it works (a short code sketch follows the list):

  1. Tokenization:

    • The input text is tokenized using the SpaCy model loaded in self.nlp.
  2. Determine Summary Length:

    • Calculate the minimum and maximum length for the summary based on the input text length and the percentage parameter. The percentage parameter determines the proportion of the text to keep in the summary.
  3. Load Summarization Pipeline:

    • Use the pipeline function from the transformers library to create a summarization pipeline with the model specified in SUMMARIZATION_MODEL, in this case the model facebook/bart-large-cnn.
  4. Generate Summary:

    • Pass the input text to the summarization pipeline along with the calculated max_length and min_length parameters. The pipeline processes the text and generates a summary.
  5. Return Summary:

    • Extract and return the summary text from the pipeline's output.
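
A minimal sketch of this flow, assuming a percentage parameter and the SpaCy model held in self.nlp; the exact length heuristic, the SpaCy model name, and the class layout are illustrative, not the repository's actual implementation:

```python
from transformers import pipeline
import spacy

SUMMARIZATION_MODEL = "facebook/bart-large-cnn"

class NLP_Interface:
    def __init__(self):
        # SpaCy model used for tokenization (model name assumed)
        self.nlp = spacy.load("en_core_web_sm")

    def summarization(self, text: str, percentage: float = 0.3) -> str:
        # 1. Tokenize the input text with SpaCy
        doc = self.nlp(text)
        n_tokens = len(doc)

        # 2. Derive summary length bounds from the percentage parameter (heuristic assumed)
        max_length = max(int(n_tokens * percentage), 10)
        min_length = max(max_length // 2, 5)

        # 3. Load the Hugging Face summarization pipeline
        summarizer = pipeline("summarization", model=SUMMARIZATION_MODEL)

        # 4. Generate the summary within the calculated bounds
        result = summarizer(text, max_length=max_length, min_length=min_length, do_sample=False)

        # 5. Extract and return the summary text
        return result[0]["summary_text"]
```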

2. spacy_summarization Method

This method implements a custom extractive summarization technique using SpaCy for natural language processing. Here's how it works (see the sketch after this list):

  1. Tokenization:

    • Tokenize the input text using the SpaCy model loaded in self.nlp.
  2. Text Cleaning and Vectorization:

    • Create a frequency dictionary freq_of_word for the words in the text, ignoring stop words and punctuation.
    • Normalize the word frequencies by dividing each frequency by the maximum frequency in the dictionary.
  3. Sentence Scoring:

    • Identify the sentences in the text using doc.sents.
    • Score each sentence based on the sum of normalized frequencies of the words it contains.
  4. Select Top Sentences:

    • Calculate the number of sentences to include in the summary based on the percentage parameter.
    • Use the nlargest function to select the top-scoring sentences.
  5. Sort and Prepare Final Summary:

    • Sort the selected sentences based on their original order in the text.
    • Join the sorted sentences to form the final summary.
  6. Return Summary:

    • Convert the list of selected sentences into a string and return it as the summary.
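
A sketch of the extractive approach along the same lines, again with illustrative names and defaults rather than the repository's exact code:

```python
from collections import Counter
from heapq import nlargest
from string import punctuation

import spacy

class NLP_Interface:
    def __init__(self):
        self.nlp = spacy.load("en_core_web_sm")  # model name assumed

    def spacy_summarization(self, text: str, percentage: float = 0.3) -> str:
        # 1. Tokenize the input text
        doc = self.nlp(text)

        # 2. Frequency dictionary over non-stop, non-punctuation words, normalized by the max frequency
        freq_of_word = Counter(
            token.text.lower()
            for token in doc
            if not token.is_stop and token.text not in punctuation
        )
        max_freq = max(freq_of_word.values(), default=1)
        for word in freq_of_word:
            freq_of_word[word] /= max_freq

        # 3. Score each sentence by the sum of the normalized frequencies of its words
        sent_scores = {}
        for sent in doc.sents:
            for token in sent:
                word = token.text.lower()
                if word in freq_of_word:
                    sent_scores[sent] = sent_scores.get(sent, 0.0) + freq_of_word[word]

        # 4. Select the top-scoring sentences, proportional to the percentage parameter
        n_select = max(int(len(list(doc.sents)) * percentage), 1)
        top_sentences = nlargest(n_select, sent_scores, key=sent_scores.get)

        # 5. Restore original order and join into the final summary
        top_sentences.sort(key=lambda sent: sent.start)
        return " ".join(sent.text for sent in top_sentences)
```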

Comparison and Use Cases

  • summarization:

    • Pros: Utilizes a pre-trained transformer model, which can generate high-quality summaries by understanding the context and semantics of the text.
    • Cons: Requires more computational resources and may be slower, especially for longer texts.
  • spacy_summarization:

    • Pros: Custom implementation that is faster and more lightweight, since it relies on frequency-based extractive summarization.
    • Cons: May not capture the context and nuances as effectively as a transformer-based model, since it relies on word frequency and sentence scoring.

Both methods are useful depending on the requirements and constraints of the task at hand. The summarization method is suitable for generating more coherent and contextually accurate summaries, while the spacy_summarization method is suitable for quick and resource-efficient summarization tasks.

The test flow

The test flow is presented below.

stateDiagram-v2
    start --> Initialize_text_samples
    Initialize_text_samples --> Perform_SpaCy_summarization
    Perform_SpaCy_summarization --> Record_SpaCy_results
    Record_SpaCy_results --> Perform_Hugging_Face_summarization
    Perform_Hugging_Face_summarization --> Record_Hugging_Face_results
    Record_Hugging_Face_results --> Evaluate_summary_quality
    Evaluate_summary_quality --> Combine_results_for_comparison
    Combine_results_for_comparison --> Save_combined_results
    Save_combined_results --> End

Flow states

Initialize Text Samples

We used 7 sample texts for testing. Three were random texts on various subjects, and the other four were real-life GitHub issues. The lengths of these texts ranged from 42 to 1518 characters.

For each sample, we created 6 reference summaries with approximate lengths of 50, 100, 200, 300, 500, and 1000 characters. These reference summaries were designed for performance evaluation: in the final stage, the summaries generated by each method are compared against them for scoring.

Note: If the reference length is greater than the text length, the reference summary is the same as the original text.

Perform SpaCy Summarization

For each text sample and reference length, a summary was created using the spacy_summarization method.

Record SpaCy Results

The generated summaries were then stored in a CSV file.

Perform Hugging Face Summarization

For each text sample and reference length, a summary was created using the summarization method.

Record Hugging Face Results

The generated summaries were then stored in a CSV file.
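
The SpaCy and Hugging Face runs follow the same pattern, so a single hedged sketch covers both steps. The reference lengths come from the setup above, while the CSV layout, the conversion from reference length to a percentage, and the file names are assumptions:

```python
import pandas as pd

REFERENCE_LENGTHS = [50, 100, 200, 300, 500, 1000]  # approximate reference lengths in characters

def run_method(nlp_interface, texts, method_name, output_csv):
    """Generate one summary per (text, reference length) pair and store the results as CSV."""
    summarize = getattr(nlp_interface, method_name)
    rows = []
    for text in texts:
        for ref_len in REFERENCE_LENGTHS:
            # If the reference length exceeds the text length, the reference
            # summary is simply the original text (see the note above).
            percentage = min(ref_len / len(text), 1.0)
            rows.append({
                "text": text,
                "ref_length": ref_len,
                "summary": summarize(text, percentage=percentage),
            })
    pd.DataFrame(rows).to_csv(output_csv, index=False)
    return rows

# Illustrative usage:
# run_method(nlp, sample_texts, "spacy_summarization", "spacy_results.csv")
# run_method(nlp, sample_texts, "summarization", "huggingface_results.csv")
```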

Evaluate Summary Quality

Understanding ROUGE Scores

ROUGE (Recall-Oriented Understudy for Gisting Evaluation) is a set of metrics used to evaluate the quality of a summary by comparing it to a reference summary. Here's how it works:

ROUGE-1 (Unigram):

What it Measures: ROUGE-1 looks at single words (unigrams) in the summary and the reference.
How it Works: It counts how many words in the summary match words in the reference.

ROUGE-2 (Bigram):

What it Measures: ROUGE-2 looks at pairs of consecutive words (bigrams) in the summary and the reference.
How it Works: It counts how many word pairs in the summary match word pairs in the reference.

ROUGE-L (Longest Common Subsequence):

What it Measures: ROUGE-L looks at the longest sequence of words that appears in both the summary and the reference in the same order.
How it Works: It finds the longest matching sequence of words between the summary and the reference, even if the words are not next to each other.

How We Use These Scores

For each row in our data tables, we have a reference summary and a generated summary. We calculate ROUGE-1, ROUGE-2, and ROUGE-L for the pair and take the average of the three scores to get a single quality score. This average gives us a balanced measure of how close the summary is to the reference.
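
The page does not state which ROUGE implementation or variant (precision, recall, or F1) was used; a sketch using the rouge-score package and F1 values could look like this:

```python
from rouge_score import rouge_scorer  # pip install rouge-score

def summary_quality(reference: str, summary: str) -> float:
    """Average the ROUGE-1, ROUGE-2 and ROUGE-L F1 scores into a single quality score."""
    scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
    scores = scorer.score(reference, summary)
    return sum(score.fmeasure for score in scores.values()) / len(scores)
```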

Combine Results for Comparison

After every summary from both methods has been scored, the results are combined into a single dataframe.

Save Combined Results

Finally, the results are stored in a CSV file.

Results

Execution Time

For each method, we generated 42 summaries (7 sample texts * 6 reference lengths). The execution times were measured.

| | SpaCy | Hugging Face |
| --- | --- | --- |
| Total execution time (seconds) | 15.29 | 496.37 |
| Average execution time (seconds/summary) | 0.36 | 11.82 |
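
For reference, a minimal way to collect such timings; the exact measurement code used in the tests is not shown on this page, so this is only a sketch:

```python
import time

def timed_run(summarize, texts, ref_lengths):
    """Return total seconds and seconds per summary for one summarization method."""
    start = time.perf_counter()
    count = 0
    for text in texts:
        for ref_len in ref_lengths:
            summarize(text, percentage=min(ref_len / len(text), 1.0))
            count += 1
    total = time.perf_counter() - start
    return total, total / count
```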

Conclusion

The execution times indicate a significant difference in time performance between the two summarization methods. SpaCy is much faster, generating summaries in an average of 0.36 seconds per summary, compared to Hugging Face, which takes an average of 11.82 seconds per summary. This suggests that while Hugging Face may offer more advanced summarization capabilities, SpaCy is considerably more efficient for generating summaries quickly. The choice between these methods should consider both the quality of the summaries and the available computational resources.

Raw Quality Results

Comparison of Text Quality Ranges

The table below compares the distribution of summary quality for the two summarization methods, SpaCy and Hugging Face. Quality is grouped into five ranges based on the percentage scores, with each range representing a different level of summary quality.

| Quality Range | SpaCy Count | Hugging Face Count |
| --- | --- | --- |
| Higher than 80% | 10 | 4 |
| Between 60% and 80% | 1 | 0 |
| Between 40% and 60% | 7 | 4 |
| Between 20% and 40% | 12 | 25 |
| Lower than 20% | 12 | 9 |

As the table shows, the SpaCy summarization method takes the lead, with the highest count of top-quality summaries.

Comparative Analysis of Text Quality Metrics Across Different Reference Lengths

The table below provides a detailed comparison of text quality metrics for the two summarization methods across the various reference lengths. For each combination we report the Average Quality, the Standard Deviation, and the Normalized Standard Deviation, which are combined into the Score.

| Ref Length | Method | Average Quality | Standard Deviation | Normalized Std Deviation | Score |
| --- | --- | --- | --- | --- | --- |
| 1000 | spacy | 73.09% | 0.27 | 0.64 | 0.58 |
| 500 | spacy | 56.54% | 0.31 | 0.76 | 0.42 |
| 1000 | huggingface | 43.57% | 0.26 | 0.58 | 0.35 |
| 500 | huggingface | 40.13% | 0.25 | 0.57 | 0.33 |
| 300 | huggingface | 39.33% | 0.26 | 0.59 | 0.32 |
| 300 | spacy | 46.75% | 0.37 | 1.00 | 0.31 |
| 200 | huggingface | 30.22% | 0.10 | 0.00 | 0.30 |
| 100 | huggingface | 27.18% | 0.12 | 0.11 | 0.26 |
| 200 | spacy | 34.44% | 0.31 | 0.79 | 0.25 |
| 100 | spacy | 31.50% | 0.31 | 0.79 | 0.23 |
| 50 | huggingface | 22.25% | 0.29 | 0.69 | 0.17 |
| 50 | spacy | 23.79% | 0.34 | 0.89 | 0.17 |

The results show that SpaCy generally produces higher average quality scores than Hugging Face across most reference lengths, with notable differences in standard deviations and normalized deviations, indicating that the two methods differ in how consistent their quality is.
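
The page does not spell out how the Score column is derived from the other metrics. A sketch that is consistent with the tables, assuming a per-summary quality column, min-max normalization of the standard deviation, and a penalty of up to one third of the average quality (the column names and the 1/3 weighting are inferred, not documented), could look like this:

```python
import pandas as pd

def add_score_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Per (ref_length, method) group: average quality, standard deviation,
    min-max normalized standard deviation, and a variability-penalized score."""
    grouped = df.groupby(["ref_length", "method"])["quality"].agg(["mean", "std"]).reset_index()
    std_min, std_max = grouped["std"].min(), grouped["std"].max()
    grouped["norm_std"] = (grouped["std"] - std_min) / (std_max - std_min)
    # Assumed weighting: higher variability lowers the score by up to one third.
    grouped["score"] = grouped["mean"] * (1 - grouped["norm_std"] / 3)
    return grouped.sort_values("score", ascending=False)
```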

Processed Quality Results

At the beginning of the flow state descriptions, we pointed out one important fact: "If the reference length is greater than the text length, the reference summary is the same as the original text." This means that, for every reference length greater than the sample text length, the SpaCy method scores 100% and Hugging Face repeats its last score, producing repeated summary scores. For that reason, those repeated scores were removed so that the tables better reflect the true performance of each method.
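
A sketch of this filtering step, assuming the combined dataframe keeps the original text alongside each row (column names are illustrative):

```python
import pandas as pd

def drop_repeated_scores(df: pd.DataFrame) -> pd.DataFrame:
    """Keep only rows whose reference length does not exceed the original text length,
    removing the scores that were repeated for overlong reference lengths."""
    return df[df["ref_length"] <= df["text"].str.len()]
```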

Comparison of Text Quality Ranges

| Quality Range | SpaCy Count | Hugging Face Count |
| --- | --- | --- |
| Higher than 80% | 0 | 2 |
| Between 60% and 80% | 1 | 0 |
| Between 40% and 60% | 7 | 4 |
| Between 20% and 40% | 11 | 20 |
| Lower than 20% | 12 | 9 |

The most significant changes are observed in the "Higher than 80%" quality range, where the SpaCy count decreases by 10 and the Hugging Face count decreases by 2. In the "Between 20% and 40%" quality range, both counts also drop, with SpaCy decreasing by 1 and Hugging Face by 5. The remaining quality ranges are unchanged for both methods. These changes make Hugging Face the highest performer in terms of top-quality counts.

Comparative Analysis of Text Quality Metrics Across Different Reference Lengths

| Ref Length | Method | Average Quality | Standard Deviation | Normalized Std Deviation | Score |
| --- | --- | --- | --- | --- | --- |
| 1000 | spacy | 52.91% | 0.14 | 0.35 | 0.47 |
| 500 | spacy | 39.16% | 0.09 | 0.12 | 0.38 |
| 1000 | huggingface | 35.56% | 0.14 | 0.35 | 0.31 |
| 500 | huggingface | 30.74% | 0.08 | 0.07 | 0.30 |
| 200 | huggingface | 29.97% | 0.10 | 0.20 | 0.28 |
| 300 | huggingface | 40.60% | 0.28 | 0.97 | 0.28 |
| 300 | spacy | 25.46% | 0.08 | 0.11 | 0.25 |
| 100 | huggingface | 26.43% | 0.13 | 0.33 | 0.24 |
| 200 | spacy | 23.51% | 0.13 | 0.30 | 0.21 |
| 100 | spacy | 20.09% | 0.09 | 0.12 | 0.19 |
| 50 | huggingface | 22.25% | 0.29 | 1.00 | 0.15 |
| 50 | spacy | 11.09% | 0.06 | 0.00 | 0.11 |

Across most reference lengths and methods, the average quality and the score decreased from the raw table to the processed table. The standard deviation and normalized standard deviation also showed a general decrease, indicating less variability in the second set of results. Despite these differences, the overall ranking stayed essentially unaltered.

Final conclusion

Execution Time: SpaCy significantly outperforms Hugging Face in terms of execution speed. SpaCy takes an average of 0.36 seconds per summary compared to Hugging Face's 11.82 seconds per summary. This indicates SpaCy is much faster at generating summaries.

Raw Quality Results: When comparing text quality across the different reference lengths, SpaCy generally outperforms Hugging Face, showing higher average quality scores for most reference lengths.

Processed Quality Results: After adjusting for cases where reference lengths exceeded sample text lengths, the relative performance between SpaCy and Hugging Face saw some changes in quality ranges. However, the general rank stays the same.

Conclusion: Despite fluctuations in specific quality metrics, SpaCy emerges as the preferred choice for summarization tasks that prioritize speed and overall quality consistency. Hugging Face, while potentially offering more advanced summarization capabilities, lags significantly behind in terms of speed and overall efficiency. Therefore, the choice between these methods should consider the trade-off between summarization quality and computational efficiency based on specific project requirements and available resources.