July 13, 2022

Hallucination Detection

Contributors
Matthias Cami
Machine Learning Engineer

This blogpost is derived from its interactive version on Hugging Face Spaces. You can continue reading there if you want to play around with multiple examples or provide your own input.

Introduction

Transformer models pre-trained on large text corpora have shown great success when fine-tuned on a variety of downstream NLP tasks. One such task is text summarization: generating concise and accurate summaries from one or more input documents. There are two types of summarization:

  • Extractive summarization merely copies informative fragments from the input.
  • Abstractive summarization may generate novel words. A good abstractive summary covers the principal information of the input and is linguistically fluent. This blogpost focuses on this more difficult task of abstractive summary generation, and mainly on hallucination errors rather than sentence fluency.

Why is this important? Let’s say we want to summarize news articles for a popular newspaper. If an article tells the story of Elon Musk buying Twitter, we don’t want our summarization model to say that he bought Facebook instead. Summarization could also be applied to financial reports, for example. In such settings these errors can be critical, so we want a way to detect them.

To generate summaries we will use the PEGASUS model, which produces abstractive summaries from long articles. These summaries often contain sentences with different kinds of errors. Rather than improving the core model, we will look into possible post-processing steps that detect errors in the generated summaries.
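As a rough illustration, a summary can be generated with the Hugging Face transformers library along the following lines. The checkpoint name and decoding parameters are assumptions for the sketch; the post does not specify which PEGASUS variant or settings were used.

```python
# Minimal sketch of abstractive summarization with PEGASUS via transformers.
from transformers import PegasusForConditionalGeneration, PegasusTokenizer

model_name = "google/pegasus-cnn_dailymail"  # assumed checkpoint; the post does not name one
tokenizer = PegasusTokenizer.from_pretrained(model_name)
model = PegasusForConditionalGeneration.from_pretrained(model_name)

article = "..."  # the full article text goes here

inputs = tokenizer(article, truncation=True, return_tensors="pt")
summary_ids = model.generate(**inputs, num_beams=4, max_length=128)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))
```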

Generating summaries

Below you can find the generated summary for an example article. For the full article, more examples, or to test things out yourself, we also made an interactive Hugging Face space. We will discuss two approaches that we found are able to detect some common errors. Based on these errors, one could then score different summaries, indicating how factual a summary is for a given article. The idea is that in production, you could generate a set of summaries for the same article with different parameters (or even different models). By using post-processing error detection, we can then select the best possible summary.

Example summary:

“The OnePlus 10 Pro is the company’s first flagship phone. It’s the result of a merger between OnePlus and Oppo, which will be called “SuperVOOC” The phone is launching in China first on January 11. There’s also no word on a US release date yet. The 10 Pro will have a 6.7-inch display and three cameras on the back. We don’t have a price yet, but OnePlus’ flagship prices have gone up every year so far, and the 9 Pro was $969. The phone will go on sale January 11 in China and January 18 in the U.S.”

Entity matching

The first method we will discuss builds on Named Entity Recognition (NER). NER is the task of identifying and categorising key information (entities) in text. An entity can be a single word or a series of words that consistently refers to the same thing. Common entity classes are person names, organisations, locations and so on. By applying NER to both the article and its summary, we can spot possible hallucinations.

Hallucinations are words generated by the model that are not supported by the source input. Deep-learning-based generation is prone to hallucinating unintended text. These hallucinations degrade system performance and fail to meet user expectations in many real-world scenarios. By applying entity matching, we can mitigate this problem for the downstream task of summary generation.

In theory, all entities in the summary (such as dates, locations and so on) should also be present in the article. Thus we can extract all entities from the summary and compare them to the entities of the original article, spotting potential hallucinations. The more unmatched entities we find, the lower the factualness score of the summary.
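A minimal sketch of this idea with spaCy is shown below. The exact matching rules and the scoring formula are assumptions; the original implementation may differ.

```python
# Sketch of entity matching: flag summary entities that never appear in the
# article and turn the match rate into a rough factualness score.
import spacy

nlp = spacy.load("en_core_web_sm")  # any spaCy pipeline with an NER component

def entity_matching(article: str, summary: str):
    """Return the unmatched summary entities and a simple factualness score."""
    article_entities = {ent.text.lower() for ent in nlp(article).ents}
    summary_entities = [ent.text for ent in nlp(summary).ents]

    unmatched = [ent for ent in summary_entities if ent.lower() not in article_entities]
    score = 1.0 if not summary_entities else 1 - len(unmatched) / len(summary_entities)
    return unmatched, score
```

Exact string matching is deliberately naive here; the “US” versus “U.S.” example below shows one of its limitations.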

Entity matching applied to the example summary

We call this technique entity matching, and here you can see what it looks like when we apply this method to the summary. Entities in the summary are marked green when the entity also exists in the article, while unmatched entities are marked red.

As you can see, we have two unmatched entities: “January 18” and “U.S.”. The first is a hallucinated entity that does not exist in the article. The second does occur in the article, but as “US” rather than “U.S.”. This could be solved by comparing against a list of abbreviations or by using an embedder tailored to abbreviations, but that is currently not implemented.
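One simple, hypothetical way to handle that particular case would be to normalise entities before comparing them, for instance by stripping periods and case; this is not part of the implementation described here, just a sketch of the idea.

```python
def normalize_entity(text: str) -> str:
    # naive normalisation: drop periods and case so "U.S." matches "US"
    return text.replace(".", "").lower().strip()

assert normalize_entity("U.S.") == normalize_entity("US")
```

A curated list of abbreviations or an abbreviation-aware embedder, as suggested above, would of course generalise better.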

Dependency comparison

The second post-processing method builds on dependency parsing: analysing the grammatical structure of a sentence to find related words and the type of relationship between them. For the sentence “Jan’s wife is called Sarah” you would get the following dependency graph:
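The same structure can also be inspected programmatically, for example with spaCy (a rough sketch; the post does not state which parser was used):

```python
# Print each word, its dependency label and its head word.
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Jan's wife is called Sarah")

for token in doc:
    # e.g. "Jan" appears as the "poss" (possession modifier) of "wife"
    print(f"{token.text:<7} --{token.dep_}--> {token.head.text}")
```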

Here, “Jan” is the “poss” (possession modifier) of “wife”. If the summary were suddenly to read “Jan’s husband…”, there would be a dependency in the summary that does not exist in the article itself (namely “Jan” as the “poss” of “husband”). However, the summary often introduces new dependencies that are still correct, as can be seen in the example below.

“The borders of Ukraine” has a different dependency between “borders” and “Ukraine” than “Ukraine’s borders”, even though both phrases have the same meaning. So simply matching all dependencies between article and summary (as we did with entity matching) would not be a robust method. More on the different kinds of dependencies and their descriptions can be found here.

However, we have found that specific dependencies are often an indication of a wrongly constructed sentence when they have no match in the article. We (currently) use two common dependencies which, when present in the summary but not in the article, are highly indicative of factualness errors. Furthermore, we only check dependencies between an existing entity and its direct connections, as in the sketch below. After that we highlight all unmatched dependencies that satisfy these constraints for the current example. For more interactive examples, we again refer to the interactive space.
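A sketch of this check with spaCy could look as follows. It represents dependencies as (head, relation, child) triples and only compares amod and pobj relations (the latter restricted to the preposition “in”) that touch a named entity; the exact rules of the original implementation may differ.

```python
import spacy

nlp = spacy.load("en_core_web_sm")
CHECKED_DEPS = {"amod", "pobj"}  # the two dependencies discussed below

def dependency_triples(text: str) -> set:
    """Collect (head, relation, child) triples for the checked dependencies,
    keeping only those that touch a named entity."""
    doc = nlp(text)
    entity_token_ids = {tok.i for ent in doc.ents for tok in ent}
    triples = set()
    for tok in doc:
        if tok.dep_ not in CHECKED_DEPS:
            continue
        # pobj is only considered when the preposition is "in"
        if tok.dep_ == "pobj" and tok.head.text.lower() != "in":
            continue
        if tok.i in entity_token_ids or tok.head.i in entity_token_ids:
            triples.add((tok.head.text.lower(), tok.dep_, tok.text.lower()))
    return triples

def unmatched_dependencies(article: str, summary: str) -> set:
    # triples present in the summary but absent from the article
    return dependency_triples(summary) - dependency_triples(article)
```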

First unmatched dependency from the summary

One of the dependencies that, when found in the summary but not in the article, indicates a possible error is the “amod” (adjectival modifier) dependency. Applied to this summary, “First” is the entity, and it is the adjectival modifier of the word “phone”. And indeed, this unmatched dependency indicates an actual error here: the sentence is not factual, since the article talks about a new type of flagship phone, not the first flagship phone. This error was found by filtering on this specific kind of dependency; empirical results showed that unmatched amod dependencies often suggest that the summary sentence contains an error.

Second unmatched dependency from the summary

Another dependency that we use is the “pobj” (object of preposition) dependency. Furthermore, we only check pobj dependencies when the preposition is “in”, as in this example. In this case the sentence itself contains a factual error (because the article states “there’s no word on a US release date yet”). However, this could already have been found with entity matching (as “January 18” is unmatched), so the unmatched dependency cannot take full credit for catching this error.

Bringing it together

We have presented two methods that try to detect errors in summaries via post-processing. Entity matching can be used to detect hallucinations, while dependency comparison can be used to filter out some badly constructed sentences (and thus worse summaries). These methods highlight the possibilities of post-processing AI-generated summaries, but they are only a first introduction. Since the methods were only validated empirically, they are certainly not robust enough for general use cases. For more examples where you can play around with the presented methods, we refer to the interactive Hugging Face space.

Below we generate three different kinds of summaries for the example article and, based on the two methods discussed, detect their errors to estimate a summary score. With this basic approach, the best summary (read: the one that a human would prefer or indicate as the best) will hopefully end up at the top. We also highlight the entities as before, but note that the ranking is based on a combination of unmatched entities and dependencies (with the latter not shown here).
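Assuming the helper functions from the sketches above, the ranking step could be approximated as follows; the equal weighting of entity and dependency errors is an assumption, as the post does not specify how the two signals are combined.

```python
def summary_score(article: str, summary: str) -> int:
    unmatched_entities, _ = entity_matching(article, summary)
    unmatched_deps = unmatched_dependencies(article, summary)
    # fewer unmatched entities and dependencies means a higher (less negative) score
    return -(len(unmatched_entities) + len(unmatched_deps))

def rank_summaries(article: str, candidates: list) -> list:
    # best candidate (highest score) first
    return sorted(candidates, key=lambda s: summary_score(article, s), reverse=True)
```

A higher score simply means fewer detected issues; more refined weightings are of course possible.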
