In this tutorial, we’ll compare ROUGE-L with ROUGE-Lsum, its sentence-level variant for text summarization. These metrics are used in automatic summarization to evaluate how well a computer-generated summary matches a human-written reference. Let’s jump right into it!
ROUGE (Recall-Oriented Understudy for Gisting Evaluation) is a family of metrics used to assess the performance of automatic summarization and machine translation systems. ROUGE-L is based on the longest common subsequence (LCS) between the computer-generated summary and the human-written reference: the LCS length divided by the candidate’s length gives a precision, the LCS length divided by the reference’s length gives a recall, and the two are combined into an F-score. Because the whole summary is treated as a single token sequence, ROUGE-L considers only the single longest common subsequence, so it can under-credit overlap that is spread across several sentences.
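To make the computation concrete, here is a minimal sketch of ROUGE-L in plain Python. The function names `lcs_len` and `rouge_l` are ours, not from any library; real implementations such as Google’s `rouge_score` package also apply proper tokenization and optional stemming, which this sketch skips.

```python
def lcs_len(a, b):
    """Length of the longest common subsequence of two sequences (classic DP)."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            dp[i + 1][j + 1] = dp[i][j] + 1 if x == y else max(dp[i][j + 1], dp[i + 1][j])
    return dp[-1][-1]

def rouge_l(candidate, reference):
    """Token-level ROUGE-L: precision, recall, and F1 derived from the LCS length."""
    c = candidate.lower().split()
    r = reference.lower().split()
    lcs = lcs_len(c, r)
    precision = lcs / len(c) if c else 0.0
    recall = lcs / len(r) if r else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

p, r, f = rouge_l("A fox is jumping over a dog.",
                  "The quick brown fox hops over the sluggish canine.")
# The LCS is "fox ... over" (2 tokens), giving precision 2/7, recall 2/9, F1 = 0.25.
```

Note that the comparison is over word tokens, not characters: the 2-token LCS is divided by the candidate length (7 tokens) and the reference length (9 tokens).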
To address this issue, ROUGE-Lsum was introduced as a sentence-level variant of ROUGE-L. The main difference between the two metrics is that ROUGE-Lsum splits both summaries into sentences, computes an LCS for each reference sentence against the candidate’s sentences, and sums these per-sentence matches before computing precision and recall. This rewards summaries that overlap well sentence by sentence, even when no single long subsequence spans the entire text.
Let’s take a look at an example to better understand how these metrics work. Suppose we have an original text and two summaries of it:
Original Text: The quick brown fox jumps over the lazy dog.
Computer-generated Summary: A fox is jumping over a dog.
Human-written Summary: The quick brown fox hops over the sluggish canine.
Using ROUGE-L, we compare the computer-generated summary against the human-written one at the token (word) level, not the character level. Here the longest common subsequence is “fox … over”, which is 2 tokens long. The reference has 9 tokens and the candidate has 7, so recall is 2/9 ≈ 0.22, precision is 2/7 ≈ 0.29, and the resulting F1 score is 0.25.
Using ROUGE-Lsum on this pair gives exactly the same 0.25, because each text consists of a single sentence. The two metrics diverge on multi-sentence summaries: ROUGE-Lsum computes an LCS for each reference sentence against the candidate’s sentences and sums the matches, while ROUGE-L searches for one subsequence running through the entire text. As a result, ROUGE-Lsum gives credit for sentence-by-sentence overlap regardless of the order in which the sentences appear, which is often a better reflection of how well a multi-sentence summary matches its reference.
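For illustration, here is a simplified sentence-level sketch. It assumes, like Google’s `rouge_score` package, that sentences are separated by newlines; the real implementation additionally clips repeated tokens across sentences, which this sketch omits, and the function names are ours.

```python
def lcs_indices(ref, cand):
    """Reference-token positions that participate in one LCS of ref and cand."""
    m, n = len(ref), len(cand)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            dp[i + 1][j + 1] = dp[i][j] + 1 if ref[i] == cand[j] else max(dp[i][j + 1], dp[i + 1][j])
    hits, i, j = set(), m, n
    while i and j:  # backtrack through the DP table to recover matched positions
        if ref[i - 1] == cand[j - 1]:
            hits.add(i - 1)
            i -= 1
            j -= 1
        elif dp[i - 1][j] >= dp[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return hits

def rouge_lsum(candidate, reference):
    """Sentence-level ROUGE F1: per-reference-sentence union LCS, summed."""
    cand_sents = [s.lower().split() for s in candidate.splitlines() if s.strip()]
    ref_sents = [s.lower().split() for s in reference.splitlines() if s.strip()]
    matches = 0
    for ref in ref_sents:
        union = set()
        for cand in cand_sents:  # union of LCS matches over all candidate sentences
            union |= lcs_indices(ref, cand)
        matches += len(union)
    c_len = sum(len(s) for s in cand_sents)
    r_len = sum(len(s) for s in ref_sents)
    p = matches / c_len if c_len else 0.0
    r = matches / r_len if r_len else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0

# Single-sentence inputs: reduces to ROUGE-L (F1 = 0.25, as computed above).
single = rouge_lsum("A fox is jumping over a dog.",
                    "The quick brown fox hops over the sluggish canine.")
# Multi-sentence inputs: each reference sentence is matched against every
# candidate sentence, so swapping whole sentences does not lower the score.
multi = rouge_lsum("The dog sat.\nThe cat ran.", "The cat sat.\nThe dog ran.")
```

In this sketch, the multi-sentence pair scores higher under ROUGE-Lsum than under whole-text ROUGE-L, because the single LCS of ROUGE-L is penalized by the reordered sentences while the per-sentence matching is not.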