Welcome to the world of Large Language Models (LLMs), where natural language processing has been revolutionized and LLMs have become a cornerstone across various applications. With this innovation, however, comes the challenge of evaluating the performance and capabilities of LLMs, or of any text generation model, because language-based outputs are open-ended and hard to predict.
Traditional machine learning evaluation methods fall short because they assume a single, deterministic prediction that can be checked exactly, while NLP models generate open-ended text. For instance, comparing two sentences like “John likes ice cream” and “John enjoys ice cream” can be tricky, since they convey similar meanings but use different word choices.
Evaluating such details requires an automated and structured approach, especially when dealing with large language models trained on vast amounts of text. That’s where ROUGE metrics come in handy!
Meeting ROUGE and BLEU Measures
ROUGE (Recall-Oriented Understudy for Gisting Evaluation) scores computer-generated summaries with a recall-oriented assessment that focuses on how much of the important content from the human-written summary is captured in the machine-generated summary. In other words, it measures how well the machine-generated summaries match up with human reference summaries.
ROUGE works by examining n-grams, which are simply contiguous groups of words. For example, ROUGE-1 looks at individual words (unigrams), while ROUGE-2 considers pairs of words (bigrams), and so on. Additionally, ROUGE-L examines the longest common subsequence between the machine-generated and human reference summaries.
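To make the idea concrete, here is a tiny Python sketch (the `ngrams` helper and example sentence are purely illustrative) that extracts the unigrams and bigrams from a tokenized sentence:

```python
def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) in a sequence of tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "the car is fast".split()

print(ngrams(tokens, 1))  # unigrams: [('the',), ('car',), ('is',), ('fast',)]
print(ngrams(tokens, 2))  # bigrams:  [('the', 'car'), ('car', 'is'), ('is', 'fast')]
```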
ROUGE-1: Capturing Unigrams
Let’s get right into it with an example to understand ROUGE metrics better. Imagine a reference sentence created by a person: “The car is fast.” Now, let’s say the computer-generated summary reads: “The new red car is extremely incredibly fast.” We want to assess how well the computer’s output matches the reference.
Recall:
Measures how many of the words in the reference summary also appear in the machine-generated summary:
Common terms in reference and machine-generated summaries are: “The” (1), “car” (1), “is” (1), “fast” (1) leading to 4 matched words. The total words in the reference summary are: “The” (1), “car” (1), “is” (1), “fast” (1), totaling 4 words. Therefore, the recall is 4/4 = 1 because the machine-generated summary has correctly included all the words in the reference summary.
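Here is a rough Python sketch of that recall calculation, assuming the simplest possible tokenization (lowercasing and splitting on spaces), which is enough for this toy example:

```python
from collections import Counter

reference = "The car is fast".lower().split()
candidate = "The new red car is extremely incredibly fast".lower().split()

# Clipped overlap: a word only counts as often as it appears in both texts.
overlap = sum((Counter(reference) & Counter(candidate)).values())

recall = overlap / len(reference)
print(overlap, recall)  # 4 matched words, recall = 4 / 4 = 1.0
```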
Precision:
Reflects the ratio of words in the machine-generated summary that match words in the reference summary to the total number of words in the machine-generated summary. In our example, precision is 4/8 = 0.5 because only four of the eight words in the machine-generated summary also appear in the reference summary.
F1 Score:
A harmonic mean of recall and precision that provides a more balanced evaluation than either metric alone. The F1 score for this example is (2 * Recall * Precision) / (Recall + Precision) = (2 * 1 * 0.5) / (1 + 0.5) ≈ 0.67.
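Continuing the same sketch, precision and the F1 score fall out of the same unigram overlap count; the snippet below repeats the setup so it runs on its own:

```python
from collections import Counter

reference = "The car is fast".lower().split()
candidate = "The new red car is extremely incredibly fast".lower().split()

overlap = sum((Counter(reference) & Counter(candidate)).values())  # 4 shared unigrams

recall = overlap / len(reference)        # 4 / 4 = 1.0
precision = overlap / len(candidate)     # 4 / 8 = 0.5
f1 = 2 * recall * precision / (recall + precision)

print(round(precision, 2), round(f1, 2))  # 0.5 0.67
```

The gap between perfect recall and middling precision is exactly the kind of imbalance the F1 score is meant to expose.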
ROUGE-L: Longest Common Subsequence
ROUGE-L measures the longest common subsequence (LCS) shared by the machine-generated summary and the human reference summary: the longest sequence of words that appears in both texts in the same order, though not necessarily side by side. This metric can be useful for evaluating text generation models that produce longer, more complex sentences or paragraphs. However, it may be less informative for shorter texts because there are fewer opportunities to find a long common subsequence.
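A minimal sketch of the computation behind ROUGE-L, using the classic dynamic-programming recurrence for the longest common subsequence over word tokens (the function name and tokenization are illustrative):

```python
def lcs_length(x, y):
    """Length of the longest common subsequence of two token lists."""
    dp = [[0] * (len(y) + 1) for _ in range(len(x) + 1)]
    for i, xi in enumerate(x, 1):
        for j, yj in enumerate(y, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if xi == yj else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(x)][len(y)]

reference = "The car is fast".lower().split()
candidate = "The new red car is extremely incredibly fast".lower().split()

lcs = lcs_length(reference, candidate)   # "the car is fast" appears in order in both -> 4
print(lcs / len(reference))              # LCS-based recall:    4 / 4 = 1.0
print(lcs / len(candidate))              # LCS-based precision: 4 / 8 = 0.5
```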
ROUGE-W: Weighted Longest Common Subsequence
ROUGE-W is similar to ROUGE-L but weights the longest common subsequence so that consecutive matches count for more than the same matches scattered across the text. This rewards machine-generated summaries that preserve contiguous runs of words from the reference rather than reproducing them in fragments.
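For intuition, here is a sketch of a weighted LCS along the lines described in the original ROUGE paper, using the common weighting choice f(k) = k², so that a run of k consecutive matches is worth k² rather than k (the function name and example sentences are illustrative):

```python
def weighted_lcs(x, y, f=lambda k: k ** 2):
    """Weighted LCS: extending a run of k consecutive matches adds f(k + 1) - f(k)."""
    c = [[0.0] * (len(y) + 1) for _ in range(len(x) + 1)]  # weighted LCS scores
    w = [[0] * (len(y) + 1) for _ in range(len(x) + 1)]    # length of the current run
    for i, xi in enumerate(x, 1):
        for j, yj in enumerate(y, 1):
            if xi == yj:
                k = w[i - 1][j - 1]
                c[i][j] = c[i - 1][j - 1] + f(k + 1) - f(k)
                w[i][j] = k + 1
            else:
                c[i][j] = max(c[i - 1][j], c[i][j - 1])
    return c[len(x)][len(y)]

reference = "the car is fast".split()
print(weighted_lcs(reference, "the car is fast".split()))         # 16.0: one run of 4 -> 4**2
print(weighted_lcs(reference, "the red car is so fast".split()))  # 6.0:  runs of 1, 2, 1
```

Both candidates contain all four reference words in order, but the unbroken run scores far higher, which is exactly the behavior ROUGE-W is after.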
In general, ROUGE metrics are fast, automatic, and work well for ranking system summaries. However, they don’t measure coherence or meaning. The related BLEU score, a precision-oriented n-gram metric most often used for machine translation, shares the same limitation, so to evaluate the coherence of a text generation model’s output, human-in-the-loop evaluations or other semantics-aware methods are more appropriate.
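In practice you rarely compute these metrics by hand. One widely used implementation is Google’s rouge-score package (assumed here to be installed with pip install rouge-score); the sketch below scores our toy example with ROUGE-1, ROUGE-2, and ROUGE-L in one call:

```python
from rouge_score import rouge_scorer  # pip install rouge-score

# Score ROUGE-1, ROUGE-2, and ROUGE-L in one pass; stemming makes matching more forgiving.
scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
scores = scorer.score(
    "The car is fast.",                               # human reference (target)
    "The new red car is extremely incredibly fast.",  # model output (prediction)
)

for name, score in scores.items():
    print(f"{name}: recall={score.recall:.2f} precision={score.precision:.2f} f1={score.fmeasure:.2f}")
```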