Chain-of-Verification: This Novel Prompting Technique Fights Hallucinations in LLMs


Large language models (LLMs) often hallucinate—generating plausible yet incorrect information. Recent research by Meta AI researchers explores a promising technique to address this issue, termed Chain-of-Verification (CoVe).

Quick Overview of Chain-of-Verification (CoVe)

CoVe takes a systematic approach to enhance the veracity of the responses generated by large language models. It’s a four-step dance:

  1. Drafting an Initial Response: The language model drafts an initial response based on the input query.
  2. Planning Verification Questions: The language model then devises verification questions to fact-check its draft.
  3. Answering Verification Questions: These questions are answered independently to avoid bias.
  4. Generating a Final Verified Response: Utilizing the answers, the model refines its initial draft and generates a final, more accurate response.
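The four steps above can be sketched as a single pipeline. The snippet below is a minimal, illustrative sketch, not the paper's implementation: `llm` is a hypothetical placeholder for whatever chat-completion call you use, stubbed out here so the example runs offline, and the prompt wording is my own.

```python
def llm(prompt: str) -> str:
    """Placeholder for a real LLM API call; stubbed so the sketch runs offline."""
    return f"<model response to: {prompt[:40]}...>"

def chain_of_verification(query: str) -> str:
    # Step 1: draft an initial response to the query.
    draft = llm(f"Answer the question:\n{query}")

    # Step 2: plan verification questions that fact-check the draft.
    questions = llm(
        "List fact-checking questions for this draft answer, one per line.\n"
        f"Question: {query}\nDraft: {draft}"
    ).splitlines()

    # Step 3: answer each verification question independently -- the draft
    # is deliberately left out of the context to avoid repeating its errors.
    answers = [llm(f"Answer concisely: {q}") for q in questions if q.strip()]

    # Step 4: revise the draft in light of the verification answers.
    evidence = "\n".join(f"Q: {q}\nA: {a}" for q, a in zip(questions, answers))
    return llm(
        f"Original question: {query}\nDraft answer: {draft}\n"
        f"Verification results:\n{evidence}\n"
        "Write a final, corrected answer."
    )
```

In a real system, you would swap the `llm` stub for an actual API call; the key design choice is step 3, where each question is answered in a fresh context.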

This technique pushes the language model to deliberate on its responses, embarking on a self-imposed fact-checking mission before delivering the final answer.

💡 Recommended: Hallucinations in AI – with ChatGPT Examples


Probing the Efficacy of CoVe

The researchers evaluated CoVe with the FACTSCORE metric (among others), which is well-established in LLM research.
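In rough terms, FACTSCORE decomposes a long generation into atomic facts and reports the fraction judged as supported by a knowledge source. Here is a toy illustration of that computation; the facts and support labels are invented for the example, and real FACTSCORE uses a model or human to do the labeling:

```python
def factscore(fact_labels: dict[str, bool]) -> float:
    """Fraction of atomic facts in a generation that are supported."""
    supported = sum(fact_labels.values())
    return supported / len(fact_labels)

# Invented example: atomic facts extracted from a draft biography,
# each labeled as supported (True) or unsupported (False) by a source.
atomic_facts = {
    "Marie Curie was born in Warsaw.": True,
    "Marie Curie won two Nobel Prizes.": True,
    "Marie Curie discovered oxygen.": False,  # hallucinated fact
}
print(factscore(atomic_facts))  # → 0.666... (2 of 3 facts supported)
```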

💡 Results: The study shows that CoVe significantly reduces hallucinations across a variety of tasks including list-based questions from Wikidata, closed book MultiSpanQA, and longform text generation.

When the model uses CoVe, it tends to generate more accurate facts than the original longform answer, thereby improving the overall correctness of the responses.

In particular, here are some highlights from the study:

  1. CoVe significantly improves precision on list-based tasks, more than doubling the precision for the Wikidata task from a Llama 65B few-shot baseline.
  2. CoVe notably reduces the number of hallucinated answers while having a small reduction in the number of non-hallucinations.
  3. CoVe enhances performance on closed book QA, showing a 23% improvement in F1 on MultiSpanQA.
  4. In longform generation, CoVe demonstrates larger gains, with a 28% increase in FACTSCORE from the few-shot baseline.
  5. Further explicit reasoning within the CoVe “factor+revise” method brings large gains in FACTSCORE.
  6. CoVe-based Llama outperforms InstructGPT, ChatGPT, and PerplexityAI on the longform generation task, specifically excelling for more frequent facts.
  7. Shortform verification questions are more accurately answered than longform queries, with around 70% being correctly answered when queried individually in the Wikidata task.
  8. LLM-based verification questions perform better compared to heuristic, rule-based ones, particularly in longform generation.
  9. Open verification questions outperform yes/no-based questions, as seen in the factored version of CoVe.

My Thoughts on CoVe

First, note that the paper's experiments were run on Llama 65B, with GPT-3.5-era models such as InstructGPT and ChatGPT serving as comparison baselines. When I tried to reproduce some of the hallucinations in GPT-4, I failed: the model has already improved significantly, and trivial hallucinations are much less common.

Nevertheless, Chain-of-Verification is an exciting leap forward in making language models more reliable and factual. I think it’s still highly relevant for less performant open-source models.

This technique underscores the importance of self-evaluation and correction in AI, nudging us closer to the era where hallucinations in language models become a tale of the past.

I believe the future of high-performing LLMs lies in multi-model systems where multiple AIs or AI agents work together in a dynamic pipeline of self-prompting to generate highly enriched answers that outperform the results of any individual model.

That is — at least until the single AGI model that rules them all is developed. 🤖

As a prompt engineer or a tech enthusiast, delving into the CoVe technique could open up new vistas in language model accuracy and reliability. The evidence is solid, the concept is intriguing, and the door to exploring and verifying this technique in different contexts is wide open.


To keep learning, feel free to check out my recent tutorial on the Finxter blog as well: 👇

💡 Recommended: Diving Deep into ‘Deep Learning’ – An 18-Video Guide by Ian Goodfellow and Experts