The AI Hallucination Problem

Hallucination is a problem where generative AI models produce confident, plausible outputs that look like facts but are in fact completely made up by the model. The AI 'imagines' or 'hallucinates' information that is not present in the input or the training set. This is a particularly significant risk for models that output text, like OpenAI's GPT-4, Google Bard, or Anthropic's Claude 2, since a common usage pattern is for these sorts of models to begin to replace search engines as a source of information. Whilst search engines can (and do!) also surface plenty of false information, users' greater familiarity with that technology means that many of them have relatively mature intuitions about which links seem more or less trustworthy. Because generative AI models are still so new, many people have not yet developed the same sophistication in distinguishing factually grounded responses from hallucinations: the signals people previously relied on when evaluating search results (does this link point to a reputable website? Can I tell who the author of this information is?) are no longer available.

Concerns Surrounding the AI Hallucination Problem:

AI hallucination poses significant security, legal, and reputational concerns for businesses, as it can cause models to generate outputs that are false or misleading. In a recent high-profile incident, a passenger filed a personal injury lawsuit against Avianca Airlines. The passenger's lawyer cited several previous cases that supported the passenger's position, and wrote in an affidavit that they had used ChatGPT to help with legal research. Unfortunately, ChatGPT had completely invented the supporting cases[1]!

Something that can increase the risks associated with this sort of hallucination (including security and legal risks) is the apparent confidence with which AI can provide these responses, often asserting them as fact:

Take this example from Anthropic’s bot Claude, “summarizing” an article that doesn’t actually exist, but doing so with an assertive, confident voice:

This example might be harmless, but in other cases, the risks can be much more serious. Take the ongoing lawsuit between Mark Walters and OpenAI [2]: someone had provided ChatGPT with a link to a complaint on the Second Amendment Foundation's website, and asked it to summarize the accusations in the complaint. ChatGPT responded, describing it as:

“a legal complaint filed by Alan Gottlieb, the founder and executive vice president of the Second Amendment Foundation (SAF), against Mark Walters, who is accused of defrauding and embezzling funds from the SAF. The complaint alleges that Walters, who served as the organization's treasurer and chief financial officer, misappropriated funds for personal expenses without authorization or reimbursement, manipulated financial records and bank statements to conceal his activities, and failed to provide accurate and timely financial reports and disclosures to the SAF's leadership. The plaintiff seeks various forms of relief, including the recovery of misappropriated funds, damages for breach of fiduciary duty and fraud, and removal of Walters from his position as a member of the SAF's board of directors.”

Unfortunately for both ChatGPT and Mark Walters, this was entirely made up! Mark Walters had absolutely nothing to do with the linked complaint, and in fact the entire contents and nature of the complaint were completely false.

This sort of hallucination can be dangerous in all sorts of circumstances, not just legal ones: in medical or healthcare settings, a mistake like this could be a matter of life or death if the user of the technology does not know how to interpret it correctly. Hallucination can also open up new threat vectors for attackers: for example, the team at Vulcan has written about how ChatGPT can hallucinate the existence of packages or libraries that do not exist when asked for coding suggestions. By identifying cases where ChatGPT will hallucinate a package, attackers can publish a malicious package under the hallucinated name, creating an attack vector.

At Credal, we’ve encountered this risk as well. ChatGPT has directed users to email:

aiassistant569@gmail.com

and has also instructed users to share documents with

openai@gmail.com

Neither of these OpenAI accounts exists, and you should make sure your users are not tricked into sharing information with them!

Causes of Generative AI Hallucination: Why does ChatGPT Hallucinate?

AI hallucination occurs mainly because generative AI models, such as GPT-3, create outputs based on statistical patterns in their training data rather than on the factual accuracy of that data.

A generative AI model might look at the prompt "Once upon a time" and correctly infer, based on statistics and its training data, that the most probable continuation is "in a kingdom...". From there, it can actually make up a pretty good story:

Large Language Models are designed to take a given input string and predict the most likely sequence of 'tokens' (words, code, etc.) to follow that prompt. This makes them great storytellers, in ways that are both good and bad.

However, when ChatGPT looks at a prompt like "Summarize this article: nytimes.com/newly-founded-country-madeupland-declares-war-on-neighbour-fictiontown" it uses the exact same process to determine that, statistically, a statement like that is typically followed by a summary of an article about a newly founded country declaring war on its neighbor. Put another way, ChatGPT continues the statement "Summarize this article" in the same way it continued "Once upon a time" to create a children's story. Arguably, it's actually us as humans who are inconsistent: in the former case we expect truth, and in the latter we're looking for a fairy tale.
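To make this concrete, here is a minimal sketch, assuming the OpenAI Python SDK's chat completions interface and an illustrative model name, showing that both prompts go through exactly the same next-token prediction machinery; nothing in the call itself distinguishes "write me a story" from "summarize this article":

```python
# Minimal sketch: both prompts are handled by the same next-token prediction call.
# Assumes the OpenAI Python SDK and an OPENAI_API_KEY in the environment; the model
# name is illustrative.
from openai import OpenAI

client = OpenAI()

def complete(prompt: str) -> str:
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

# The model 'continues' both prompts the same way: by predicting likely tokens.
print(complete("Once upon a time"))  # here we actually want a fairy tale
print(complete(
    "Summarize this article: nytimes.com/newly-founded-country-"
    "madeupland-declares-war-on-neighbour-fictiontown"
))  # without retrieval, a plausible but invented summary is a likely continuation
```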

Mitigating AI Hallucination:

Several mitigation strategies can help to reduce hallucination in AI:

1. Retrieval Augmented Generation: Provide the AI with the facts up front.

One such technique is 'Retrieval-Augmented Generation' (RAG), which works by retrieving relevant documents from a vast corpus of data and conditioning the AI's text generation on the retrieved information. The RAG approach helps ground AI output in real-world data and reduces the likelihood of hallucination. For example, in the "Summarize this article" case: if you introduce a preliminary step before the LLM call that determines whether there is relevant information to fetch, and if so fetches it (in this case the article from the internet, but in other cases it might be business context from your Slack, Teams, Google Drive, SharePoint, OneDrive, Box, etc.), you can significantly reduce the possibility of hallucination.

Here's an example of this working in Credal: the user has provided a prompt very similar to one likely to cause a hallucination in ChatGPT. This time, however, Credal detects that the user is asking about a webpage and introduces a preliminary step of actually fetching the provided NYTimes page from the internet, which in this case returns a 404 Not Found error. With that context, the model immediately identifies that the article doesn't really exist, breaking it out of its default tendency to hallucinate and making it provide a factual answer instead, including a link under the "sources" section to the URL it referenced in generating its answer.
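To illustrate the idea, here is a minimal sketch of that preliminary retrieval step, using the Python `requests` library and the OpenAI chat completions API. This is a simplified illustration rather than Credal's implementation, and the model name, prompts, and character limit are assumptions:

```python
# Sketch of a RAG-style preliminary step: fetch the page first, then condition the
# model on whatever was actually retrieved. Not a production implementation.
import requests
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def summarize_with_retrieval(url: str) -> str:
    # Preliminary step: actually fetch the page before asking the model anything.
    page = requests.get(url, timeout=10)
    if page.status_code >= 400:
        return f"The page {url} could not be fetched (HTTP {page.status_code}), so there is nothing to summarize."

    # Ground the model in the retrieved text rather than letting it guess.
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": "Summarize only the provided article text. "
                                          "If the text does not support a claim, say so."},
            {"role": "user", "content": f"Article from {url}:\n\n{page.text[:8000]}\n\nSummarize this article."},
        ],
    )
    return response.choices[0].message.content
```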

2. Prompt Engineering: Ask for Sources, Remind ChatGPT to be honest, and ask it to be explicit about what it doesn’t know.

Another technique to combat AI hallucination is using smart prompting to encourage the system to cite its sources, be honest, and be upfront when it isn't sure. Having the AI model refer back to the data it was trained on can sometimes help ensure its responses are grounded in reality and factual information. If you are using OpenAI, a very common technique is to use the system prompt to tell the AI it is "a helpful, honest assistant". Similarly, a lot of Credal users have become accustomed to adding "when you are unsure about something, say you don't know".

In this example, we see the difference between ChatGPT's response with the default system prompt "You are a helpful assistant" (shown on the left-hand side of the images) vs a system prompt that says "You are a helpful, honest assistant":

playground-example.jpg
playground-example-2.jpg
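If you want to reproduce this comparison yourself, here is a minimal sketch using the OpenAI chat completions API, where the only difference between the two calls is the system prompt. The model name and prompt are illustrative assumptions:

```python
# Sketch: same question, two different system prompts. The honesty-focused prompt
# often (but not always) makes the model more willing to admit uncertainty.
from openai import OpenAI

client = OpenAI()

question = ("Summarize this article: nytimes.com/newly-founded-country-"
            "madeupland-declares-war-on-neighbour-fictiontown")

for system_prompt in [
    "You are a helpful assistant.",
    "You are a helpful, honest assistant. When you are unsure about something, say you don't know.",
]:
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": question},
        ],
    )
    print(f"--- {system_prompt}\n{response.choices[0].message.content}\n")
```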

However, this is very far from a foolproof method! ChatGPT and other generative models can often invent citations, referencing web links, legal precedents, or academic papers that don't really exist. So if you're hoping that encouraging a model to cite its sources will help with hallucination, ensure that you are actually checking those citations, and ideally performing audits of AI outputs more generally.
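One simple, partial check you can automate is verifying that any URLs the model cites actually resolve. Here is a sketch using the Python `requests` library; note that this only catches invented links, not links whose content has been misrepresented:

```python
# Sketch: flag cited URLs that don't resolve. A failing URL is a strong hint that the
# citation was hallucinated; a passing URL still needs a human to confirm the content.
import requests

def check_citation_urls(urls: list[str]) -> dict[str, bool]:
    results = {}
    for url in urls:
        try:
            resp = requests.head(url, allow_redirects=True, timeout=10)
            results[url] = resp.status_code < 400
        except requests.RequestException:
            results[url] = False
    return results

# Any URL that comes back False deserves manual review before the output is trusted.
print(check_citation_urls(["https://www.nytimes.com/", "https://example.com/made-up-citation"]))
```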

3. Decrease the “temperature” setting

Most Large Language Models, including Claude, OpenAI's GPT models, and several others, have what's called a 'temperature' setting that controls the extent to which the model sticks to the "most likely" output or explores more creative answers that might be a little more unusual according to its measure of statistical probability.

This setting is often really useful in creative contexts, where you don't want to just regurgitate the same wording or content that's already on the web, and picking more unusual paths to explore can be really valuable. But the same 'creativity' setting can also introduce hallucination: in the example below, with temperature (shown on the right-hand side of the image) set to 0 and a system prompt of "You are an honest assistant", the ChatGPT Turbo model is much more hesitant to hallucinate a NYTimes article. Once the temperature is turned up to 1, the model is very happy to hallucinate:

playground-example-temperature.jpg
playground-example-temperature-with-hallucination.jpg
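Here is a minimal sketch of that comparison via the OpenAI chat completions API: the prompt and system prompt stay fixed and only the temperature changes. The model name is an illustrative assumption:

```python
# Sketch: same prompt and system prompt at two different temperatures.
# Lower temperature keeps the model closer to its most likely tokens.
from openai import OpenAI

client = OpenAI()

prompt = ("Summarize this article: nytimes.com/newly-founded-country-"
          "madeupland-declares-war-on-neighbour-fictiontown")

for temperature in (0.0, 1.0):
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        temperature=temperature,  # 0 = stick to the most likely output; higher = more exploratory
        messages=[
            {"role": "system", "content": "You are an honest assistant."},
            {"role": "user", "content": prompt},
        ],
    )
    print(f"temperature={temperature}:\n{response.choices[0].message.content}\n")
```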

4. Training or Fine tuning models for honesty.

Strategies like "Retrieval Augmented Generation" and "Asking for Sources" represent changes to the way the model is prompted. Another strategy is to try to change the model itself. Different researchers have taken different approaches: for example, Anthropic, a provider founded by several ex-OpenAI researchers, uses an approach they call "Constitutional AI", which tries to establish certain "ground rules" for these generative models. They describe their offering, a very popular Large Language Model called "Claude", as a more "steerable" model that is a "helpful, honest assistant". Whilst Claude is a different model in itself, another related approach is called "fine tuning": once you've trained the main model, you can do a second, smaller 'retraining' that helps the model learn how to apply its knowledge in certain specific circumstances.

The basic idea is that, traditionally, a generative model is trained to take a prompt and guess the most likely sequence of words (or technically, 'tokens') that would follow that prompt, based on examples of writing collected from around the internet ("training data"). During training, the model is gradually exposed to more and more data about which tokens are likely to follow any given prompt, and by exposing it to all this data, it eventually learns statistical patterns that help it predict a likely continuation for any given input.

The idea with fine tuning is that once the model has learned the most likely continuation from a big corpus of text, you can give it a second training period where, instead of just rewarding it for the most likely continuation, you actually reward it for the best continuation, which might be quite different from the one that is most likely. For example, if you're training a model to be used by children and you collect your writing examples from the internet, you might find that the most likely continuation of "Oh my" could be profane or otherwise offensive. So instead of rewarding the model for choosing the most likely continuation, you can choose to reward it for the 'best' continuation, however you define that (in this case, if you're training an LLM for kids, it might be "Goodness!").
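As a rough illustration of what that looks like in practice, here is a sketch of fine-tuning examples in the chat-format JSONL that OpenAI's fine-tuning endpoint accepts, where each example records the 'best' continuation a labeler chose rather than the statistically most likely one. The content, system prompt, and file name are purely illustrative:

```python
# Sketch: fine-tuning data that rewards the labeled 'best' continuation rather than
# the most likely one. Uses the chat-format JSONL accepted by OpenAI's fine-tuning
# endpoint; the examples themselves are made up for illustration.
import json

examples = [
    {
        "messages": [
            {"role": "system", "content": "You are a kid-friendly storytelling assistant."},
            {"role": "user", "content": "Oh my"},
            # The continuation a human labeler chose as 'best' for this audience,
            # not necessarily the statistically most likely one on the internet.
            {"role": "assistant", "content": "Goodness! What a lovely surprise."},
        ]
    },
]

with open("fine_tune_data.jsonl", "w") as f:
    for example in examples:
        f.write(json.dumps(example) + "\n")
```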

Fine tuning is often done this way, with Large Language Model providers hiring many people specifically to 'teach' the model the difference between the 'best' and the 'most likely' continuation by providing it loads of examples: OpenAI, for instance, used an outsourcing firm in Kenya to bring in lots of humans to teach ChatGPT not to output certain types of illegal content[3], and hired programmers to teach ChatGPT to get better at coding[4].

Although it can often be very expensive to train or even fine tune your own Large Language Model, if you have plenty of training examples (ideally thousands at least), fine tuning can be a great option to help your model learn what types of outputs count as hallucinations in your use case. Unfortunately, many of the most powerful language models (such as GPT-4 or Claude) don't permit fine tuning, which means this approach often comes with a significant performance penalty (and cost increase!) for any use case that isn't extremely specialized. Moreover, fine tuning to avoid these sorts of problems can create other problems. Some people have speculated that more recent versions of GPT-4, which have been aggressively fine tuned to avoid certain hallucination problems, are now less helpful in other circumstances. Take this example:

This well-known essay was written well before GPT-4's training cutoff of September 2021. But because ChatGPT had such a big problem with hallucination when asked to summarize specific articles or websites, some people have speculated that OpenAI fine-tuned it in ways that cause it to refuse to answer these questions, even when it really ought to know the content of the article, thereby making it slightly less helpful in certain circumstances.

5. Provide Acceptable Use policies and teach employees how to recognize hallucinations.

Although retrieval augmented generation can help a lot, as can fine tuning models or choosing foundation models trained with honesty in mind, at the end of the day there is simply no surefire way to guarantee that a Large Language Model will never hallucinate. For that reason, in enterprise contexts where there may be legal, reputational, or commercial risks associated with AI hallucination, it's vital that organizations explain to their employees the ways and contexts in which Large Language Models can and should be used, and also emphasize when it is especially important that model outputs be verified for accuracy and honesty. The best tooling should provide automatic ways to recognize these sorts of high-risk use cases and give users an appropriate warning or caution before showing them the model response. Make sure your tooling provides a compliant audit trail of the decisions users made when shown these warnings, so that you can review the occasions where this happened and follow up to ensure that usage was in line with appropriate guidelines and applicable laws.

An example warning in Credal when a user tries to use a Large Language Model in a way that is not considered an appropriate use by the Enterprise IT administrators.

Ultimately, the right way to mitigate AI hallucinations will depend on your use case, risk levels, budget, and employee sophistication. Hallucination remains an area of intense interest and scrutiny from both practitioners and AI researchers. Enterprises considering deploying AI should evaluate approaches including Retrieval Augmented Generation, prompt engineering, training or fine tuning models for greater honesty, and teaching employees how to appropriately use large language models, ideally with automatic policy enforcement built into your tooling.
