Rethinking LLMs: Beyond hallucination
Why context and genre matter more than factual consistency
Jorge Luis Borges is one of my heroes: a literary genius whose Ficciones changed my life, and a philosophical thinker who also wrote some oft-overlooked poetry. A favorite of mine is the “Parable of Cervantes and the Quixote,” a reflection on the nature of fiction and reality. Borges returned again and again to the blurred line between fact and fiction: “Pierre Menard, Author of the Quixote” explores how context changes meaning in both fiction and nonfiction, and “The Analytical Language of John Wilkins” questions the ability of language to represent factual reality.
With the discourse these days so heavy on large language models (LLMs) and hallucination, let’s consider these Borgesian themes in order to redefine “hallucination” in LLMs.
Current problems with hallucination discourse
Building on Borges' exploration of the blurred line between fact and fiction, a more nuanced understanding of “hallucination” in LLMs must account for the diverse types of claims language models are asked to generate.
Claims are “assertions open to challenge.” There are different types of claims, each serving a distinct purpose. Factual claims state objective information, such as "The Eiffel Tower is in Paris." Evaluative claims express judgments or comparisons, like "Solar energy is better for the environment than fossil fuels." Causal claims identify relationships between events or phenomena, for example, "Regular exercise reduces the risk of heart disease." Prescriptive claims argue for specific actions or policies, such as "Schools should teach financial literacy to all students."
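To make the distinction concrete, here is a minimal sketch in Python of how such a taxonomy might be represented. The ClaimType and Claim names are hypothetical, invented for illustration rather than taken from any existing library.

```python
from dataclasses import dataclass
from enum import Enum, auto


class ClaimType(Enum):
    """The four claim types discussed above (hypothetical labels)."""
    FACTUAL = auto()       # states objective information
    EVALUATIVE = auto()    # expresses a judgment or comparison
    CAUSAL = auto()        # links events or phenomena
    PRESCRIPTIVE = auto()  # argues for an action or policy


@dataclass
class Claim:
    """An assertion open to challenge, tagged with its type."""
    text: str
    claim_type: ClaimType


examples = [
    Claim("The Eiffel Tower is in Paris.", ClaimType.FACTUAL),
    Claim("Solar energy is better for the environment than fossil fuels.", ClaimType.EVALUATIVE),
    Claim("Regular exercise reduces the risk of heart disease.", ClaimType.CAUSAL),
    Claim("Schools should teach financial literacy to all students.", ClaimType.PRESCRIPTIVE),
]
```

Only the first of these can be settled by checking a fact; the others invite argument, which is exactly why a single “hallucination” label fits them so poorly.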
Understanding these claim types is crucial for bringing nuance to the hallucination discourse. Current approaches to quantifying “hallucination” in language models, however, are too coarse to account for these realities of claims in discourse.
Limitations of quantitative approaches
Current methods of quantifying “hallucination rates” or “factual consistency” in LLMs are flawed and potentially misleading. Evaluation efforts such as the hallucination leaderboard offer insight into how often LLMs introduce unsupported information when summarizing texts, but they oversimplify the complex nature of language and claims.
The limitations of quantitative measures become apparent when considering this diversity of claim types. A binary classification of statements as either factually consistent or inconsistent doesn’t adequately capture the nuances of evaluative or prescriptive claims, which often involve subjective elements. These metrics struggle to account for the contextual nature of causal claims or the cultural dependencies that can influence the interpretation of certain statements.
For instance, a language model writing “it's raining cats and dogs” in a casual conversation about the weather would not be considered a “hallucination,” despite its lack of literal accuracy, because it aligns with the expected genre and communicative intent. The same phrase in a scientific meteorology report would likely be flagged as an error.
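As a toy illustration of that contrast, the sketch below compares a caricature of a binary consistency check with a hypothetical genre-aware one. Every function name and genre label here is invented for this example; real systems use NLI or QA models rather than string matching, but the output of the common metrics is still essentially binary.

```python
def naive_consistency_label(claim: str, source: str) -> bool:
    """Caricature of a binary 'factual consistency' check: a claim is either
    supported by the source text or it counts as a hallucination."""
    return claim.lower() in source.lower()


def genre_aware_label(claim: str, source: str, genre: str) -> str:
    """Hypothetical genre-aware evaluation: the same unsupported phrase is
    acceptable in casual talk but an error in a technical report."""
    if naive_consistency_label(claim, source):
        return "supported"
    if genre == "casual conversation":
        return "figurative, acceptable"
    return "flagged as error"


source = "Heavy rain is expected this afternoon."
claim = "It's raining cats and dogs."

print(genre_aware_label(claim, source, "casual conversation"))  # figurative, acceptable
print(genre_aware_label(claim, source, "meteorology report"))   # flagged as error
```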
As a result, while these leaderboards provide a starting point for assessing AI reliability, they don’t fully represent the models' ability to handle the full spectrum of claim types.
It's worth noting that nobody tracks “human hallucination rates,” a lack of concern that stands in stark contrast to the intense scrutiny applied to language models. And while quantifying a person's “factual consistency rate” might seem intriguing, such a metric would clearly be an oversimplification.
The importance of genre and context
The focus on “hallucination rates” also overlooks the diverse nature of texts and their purposes. As Borges reminds us, texts come in various genres, each with its own goals and standards. Not all texts aim for strict factual consistency, and applying such a metric universally is reductive. The value and meaning of a text often extend far beyond its adherence to facts, encompassing subjective, evaluative, and culturally dependent elements that quantitative measures can't capture.
A more nuanced understanding of "hallucination" in LLMs should account for genre, context, intent, and interpretation. Russian scholar Mikhail Bakhtin's idea of "speech genres" adds another layer to our discussion about language models and "hallucination." Bakhtin suggested that we communicate in different ways depending on the situation—think about how you'd talk differently in a job interview versus chatting with friends. These various communication styles, or "genres," aren't strict rules but flexible patterns that change over time.
This concept is important when we think about language models because it reminds us that judging LLM outputs isn't just about checking facts. We need to consider the context and purpose of the communication too.
By understanding this, we can see that current methods of measuring "hallucinations" are too simplistic. They often don't account for these different communication styles and contexts, which are crucial in human language use. This connects back to Borges' ideas about blurring fact and fiction, suggesting that assessing LLM-generated text isn't as straightforward as just counting factual errors.
Borges' insights about the blurred boundaries between fact and fiction, as well as Bakhtin's concepts of 'speech genres,' point to the necessity of developing evaluation approaches that consider the purpose, context and interpretation of language model outputs, rather than just their factual consistency. This more nuanced framework could better capture the complex ways in which LLMs engage with and generate diverse forms of claims.
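One way to imagine such a framework, offered purely as a sketch with hypothetical field names rather than a concrete proposal, is to record several dimensions per output instead of a single hallucination flag:

```python
from dataclasses import dataclass


@dataclass
class OutputEvaluation:
    """Hypothetical multi-dimensional judgment of a single LLM output."""
    factual_support: float   # how well its factual claims are grounded in the source
    genre_fit: float         # does the register match the expected speech genre?
    intent_alignment: float  # does it serve the communicative purpose of the exchange?
    notes: str = ""          # room for interpretation that numbers cannot capture


weather_chat = OutputEvaluation(
    factual_support=0.4,
    genre_fit=0.9,
    intent_alignment=0.9,
    notes="Figurative language, appropriate for casual conversation.",
)
```

None of these numbers is meaningful on its own; the point is that a record like this keeps purpose, context, and interpretation in view instead of collapsing them into one score.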