Takeaways & lessons from 250k+ LLM calls on 100k corporate docs

When Jack and I set out to help people safely use their data with Generative AI, we thought we understood the problem space. We’d worked with governments and pharma companies on getting “classical” AI to be useful on extremely sensitive data. But that experience only partly applied: the basic lessons (like the importance of good data prep) were still relevant for LLMs, but we had to learn many new tactics in how to apply these lessons.

Since then we’ve been extremely lucky to work with some amazing Enterprises, with thousands of employees and tons of complex, real-world data and usecases for Generative AI. Thanks to their early trust, we’ve seen many subtleties about what’s needed to make these systems work. We thought it would be helpful to transparently share some of these learnings with developers who are starting out on their journey building AI applications, so that they don’t need to repeat the same mistakes we made!

TL;DR:

AI models have limited attention. The hardest prompts either require the LLM to perform multiple logical steps, or they have lots of potentially relevant context. In either case, the LLM’s attention is split, which hurts its performance. Trying to solve these problems, we learned:

  1. Questions on complex data sources require carefully formatted data
  2. Good prompts focus on the hardest part of the task

Contained within both of these ideas was the realization that we could not rely on libraries like LangChain, which only work reliably for simple problems. We still use some LangChain components, and we are rooting for it to succeed. But as of today, good solutions to the hard parts of what we do with LLMs cannot be fully accomplished in LangChain.

There’s also a separate category of learnings about the security, cost, and compliance details of getting companies to trust you with their data in the AI space, but I’m leaving that out for now and hopefully I can write it up later.

1. Questions on complex data sources require carefully formatted data

The Problem Statement:

LLMs struggle to provide high quality, accurate answers when reading complex data from third party systems. To solve that quality issue, we had to carefully format data before sending it from company systems to an LLM.

I want to illustrate the nuances required when handling real world documents using this very long document (which happens to be my undergraduate thesis)[1.].

Let’s keep the goals simple:

  1. Enable a user to get an accurate high level summary of this document when asking "What is the main thesis of the paper?"
  2. Enable a user to get accurate responses from this document when asking a specific or complex question, e.g.
    1. "Which author said Said Bin Taimur oversaw a ‘domestic tyranny’?"
    2. “Looking at table 11 which shows the timings of reform, how many monarchs started their reign between 1970 and 1980, and how many of those faced oppositions demanding reforms?”

Let’s start with Question 1: "What is the main thesis of the paper?"

Although this seems like a simple question, it’s tricky because the part of the paper that presents the main thesis never mentions the words “main thesis”. In fact, it doesn’t even mention anything similar to those words. This is a great example of the difference between the demo version - where you get to pick the exact wording of a question - and the real version, where you have thousands of users, so you see very diverse ways of asking the same question.

Some users are tempted to put the whole document into the AI and see what it does. However, this single document is already too long to fit in the context window of any available LLM except Anthropic’s Claude. If we use Claude on the entire document, each question could cost more than $1 to answer. With thousands of users, that can get very expensive.

Instead we want to first retrieve the part of the document relevant to answer this question, and then have the LLM use that part to answer the question. This approach is called “RAG”, or Retrieval Augmented Generation. There are alternative strategies to RAG we could use here - like chunking up the document into smaller pieces and asking the same question about each one - but since Credal is the market’s leading developer RAG platform, I’ll focus on RAG to start with and we can come back to those options in a moment.

The naive, unsuccessful attempt:

The naive thing to do here is to save the document down, then take the user query, search for relevant context inside the document, and then send the relevant context to the LLM for summary.

We need to save the document in a way that lets the LLM identify which part is relevant to a user question. This is typically done with a vector store. A vector store is a place to keep numerical representations of data like documents or videos. The numerical representation captures the meaning of each part of the document. This lets us easily check which part of the document is relevant to the topic of a question. (You can read more about vector stores in our specialist guide.) To recap, the naive approach is:

  1. Put the document into a Vector store
  2. Take the user query, search for relevant context inside the document.
  3. Return the relevant context to the LLM for use in its answer.

If you’re interested, here’s the code I used to create this naive RAG system (for this example I’m using Pinecone as the vector store but there are many other options):

# puts_docs_in_vector_store is our helper: it uses the LangChain Google Drive loader
# to pull the document and index it in a Pinecone vector store. The document is my
# undergrad thesis, which you can read (but I'm *DEFINITELY NOT ADVISING YOU TO READ!!!*)
# here: https://drive.google.com/file/d/1_Js3Kf_6KLumtmwGDXCf3ZrvbIxeG77QnFeeVqC7LHA/view?usp=sharing
from puts_docs_in_vector_store import puts_docs_in_vector_store
from langchain.chat_models import ChatOpenAI
from langchain.chains import LLMChain
from langchain.prompts import PromptTemplate

docsearch = puts_docs_in_vector_store()

chatModel = ChatOpenAI(temperature=1, model="gpt-3.5-turbo")
prompt = PromptTemplate(
    input_variables=["retrieved_data", "user_query"],
    template="You are a company AI assistant that helps employees answer research questions based on internal documents.\
        Use the context provided in the <CONTEXT></CONTEXT> tags to answer a user's question.\
        <CONTEXT> {retrieved_data} </CONTEXT>{user_query}",
)
chain = LLMChain(llm=chatModel, prompt=prompt)

# Question 1: a high level summary question.
medium_query = "What is the main thesis of the paper?"
medium_result = docsearch.similarity_search(medium_query)
print(chain.run({"retrieved_data": medium_result, "user_query": medium_query}))

# Question 2a: a specific question whose answer lives in a footnote.
hard_query = "Which author said Said Bin Taimur oversaw a domestic tyranny?"
hard_result = docsearch.similarity_search(hard_query)
print(hard_result)
print(chain.run({"retrieved_data": hard_result, "user_query": hard_query}))

Why it didn’t work:

The problem is that the document contains no structured indicator of which parts are high level summaries of the overall argument. As a result, the context sent to the LLM is not that helpful for answering the question.

The context retrieved for “What is the main thesis of the paper?” is only somewhat relevant[5.], so the answer the LLM gives using this excerpt is mediocre as well:

“The main thesis of the paper is to develop a model that seeks to account for the observed pattern of the distribution and timings of reforms in the monarchic world, incorporating insights from Joseph Nye and Samuel Huntington.”

— GPT-4

That’s not wrong exactly, but it doesn’t give me much information about what the thesis is. The right answer is approximately, “The main thesis is that soft power seems to influence whether countries implement reforms, since monarchs who were educated as children in western countries were more likely to implement liberal reforms, and they did this either when faced with public opposition or after they turned about 65.”

Of course there’s no way the LLM could have come up with this answer, since the relevant information wasn’t in the context returned from the search step.

One alternative approach would be to look at the whole document: either chunk it up into smaller pieces and run the question on each one, or use a model with a bigger context window (which obviously runs out eventually, but for a document like this could work). A chunking approach with GPT-3.5 gets OK results, and Claude with its massive context window does even better.[2.]

So if we have the luxury of using the full context window, then a summary approach can work much better than RAG. But that can get very expensive: up to $2 per chunk, depending on the model and how long the summary is!
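For illustration, here’s a minimal sketch of the chunk-and-ask approach mentioned above, reusing the same older-style LangChain imports as the code earlier (full_document_text is an assumed variable holding the raw thesis text; this is not our production code):

from langchain.chat_models import ChatOpenAI
from langchain.chains import LLMChain
from langchain.prompts import PromptTemplate
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Split the full document into chunks that fit comfortably in the model's context window.
splitter = RecursiveCharacterTextSplitter(chunk_size=8000, chunk_overlap=200)
chunks = splitter.split_text(full_document_text)  # full_document_text: raw thesis text (assumed)

chat_model = ChatOpenAI(temperature=0, model="gpt-3.5-turbo-16k")
per_chunk_chain = LLMChain(
    llm=chat_model,
    prompt=PromptTemplate(
        input_variables=["chunk", "user_query"],
        template="Here is one section of a longer document:\n<SECTION>{chunk}</SECTION>\n"
        "Based only on this section, answer: {user_query}",
    ),
)

# Ask the same question of every chunk, then combine the partial answers with one
# final call (the "chaining" step mentioned in footnote 2).
question = "What is the main thesis of the paper?"
partial_answers = [per_chunk_chain.run({"chunk": c, "user_query": question}) for c in chunks]

combine_chain = LLMChain(
    llm=chat_model,
    prompt=PromptTemplate(
        input_variables=["answers"],
        template="Combine these partial answers into one concise summary:\n{answers}",
    ),
)
print(combine_chain.run({"answers": "\n\n".join(partial_answers)}))

The cost problem is easy to see here: every chunk is a separate LLM call, plus one more to combine the partial answers.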

Identifying on the fly whether the user’s question is a summary question or a detail question isn’t trivial, and if we’re dealing with 100 papers, one long book, or (in one real-world case) 4,000 written letters, reading the entire text isn’t an option. So for many situations we need to find the right sub-sections of the documents, and send just those to the LLM.

How to make it work:

The document does contain this section, which would be a much better excerpt to use:

“In brief, this model has as its dependent variable, political liberalisation. The independent variables influencing the outcomes are; 1) the presence of mass public opposition to the monarchy and 2) the King’s education, (whether it was foreign/liberal or domestic). The two key actors are the ruling monarch and political opposition, which may consist of social groups and or political parties. The argument forms three stages, which are summarized as propositions above and in graphical form in figure 1 below.”

— The section of the thesis that summarizes its own argument

It is immediately followed by Figure 1, which in the document is a Google Docs table, not an image. (It amuses me a little that my college-age self had the audacity to call this 2x2 grid a ‘model’, but I was a kid so don’t judge me too harshly!)

This section, with the relevant excerpt and table, didn’t appear in any of the top 20 results returned by Pinecone when we did a search based on the original question.

Adding a keyword component to the search won’t improve results either - the words “main” and “thesis” aren’t included in the most relevant section. So we need a better way to represent the document to the LLM, to help it understand that this is a key passage to refer to when writing a summary.

To achieve this, we used LLMs to calculate metadata tags for each document or even each section of each document. Those can then be used in the search. For example, my thesis is in the style of an academic paper on monarchies, so there might be a few concepts we want to use to tag each section: is this section a high level summary, or exposition? Which specific countries are mentioned? Which monarchs? The LLM generates the tags when we ingest the documents, and humans can curate them as much or as little as needed.

If we can save these tags as structured data, then when a user question comes in, we can use those tags to pre-filter the text to the sections actually relevant to the user. When the user asks a high level summary question, we’ll return just the high level summary sections. Within those sections, we can do our search, and that outperforms search on the whole document by a wide margin.
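As a rough sketch of what that tagging-plus-filtering flow could look like, reusing the docsearch vector store from the naive example (document_chunks is assumed to be the chunked thesis, and the metadata filter syntax varies by vector store):

import json
from langchain.chat_models import ChatOpenAI
from langchain.chains import LLMChain
from langchain.prompts import PromptTemplate

# At ingestion time: ask an LLM to tag each section, and attach the tags as metadata.
tagger = LLMChain(
    llm=ChatOpenAI(temperature=0, model="gpt-3.5-turbo"),
    prompt=PromptTemplate(
        input_variables=["section"],
        template="Classify this section of an academic paper. Reply with JSON like "
        '{{"section_type": "summary"}} or {{"section_type": "exposition", "countries": ["Oman"]}}.\n'
        "<SECTION>{section}</SECTION>",
    ),
)
for doc in document_chunks:  # document_chunks: the chunked thesis (assumed to exist)
    doc.metadata.update(json.loads(tagger.run({"section": doc.page_content})))

# At query time: pre-filter to the high level summary sections, then search within them.
summary_hits = docsearch.similarity_search(
    "What is the main thesis of the paper?",
    filter={"section_type": "summary"},  # exact filter syntax depends on the vector store
)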

Of course, figuring out what all the possible tags should be upfront is impossible. Instead, we let users define which tags should apply to what types of documents.

For example, a common pattern is creating tags for customers and product features when ingesting call transcripts. This lets users easily get the status of a particular customer account, see the feedback across customers on a particular product feature, or identify which customers have mentioned a specific pain point.

If the human expert can tell us the important concepts to look out for, the LLM can identify which parts of each document (or which documents as a whole) are relevant. That’s how we get human-computer symbiosis, with the human directing the AI’s attention, and the computer doing the reading, summarization and fact retrieval, then if needed coming back to the human for verification.

Question 2a: "Which author said Said Bin Taimur oversaw a ‘domestic tyranny’?"

If the first question didn’t seem like the ideal question for a RAG approach, then this should! Just look up that exact quote, pull in the right section of the document, and then look at the footnotes on that page, right? In theory, yes. But in practice it turns out to be more nuanced, because how the data is integrated and presented to the LLM really matters.

The naive, unsuccessful attempt:

Using the simple code referenced above (but this time looking at the “hard_query” branch), we take the user’s question - "Which author said Said Bin Taimur oversaw ‘domestic tyranny’?" - and search for documents with similar meaning to it. In this case, here’s what Pinecone returned in its raw form:

[Document(page_content="________________\r\nOman\r\nKING\r\n\tLIBERAL EDUCATION?\r\n\tREFORM?\r\n\tSAID BIN TAIMUR\r\n\tNO\r\n\tNO\r\n\tQABOOS BIN SAID\r\n\tYES\r\n\tYES\r\n\tTable 7\r\nOman emerged as an independent state in 1951. Since then, it has been led by Said bin Taimur and Qaboos Bin Said. Bin Taimur ruled from 1932 until 1970, when his son, overthrew him in a palace coup. \r\nSaid Bin Taimur oversaw a “domestic tyranny”[75]. He had been unpopular, but powerful. “Until 1970, the Sultanate of Oman was one of the world's most reclusive and backward states.”[76] Moreover, “The Sultan kept several thousand slaves and presided over a barbaric\r\njustice system with torture endemic.”[77] His reign ended with no representative institutions intact.\r\nBy contrast, Qaboos has installed an elected national parliamentary body. The appointed shura consultative council was followed by the establishment of the indirectly elected Majlis Al-Shura, inaugurated on 21st December 1991. This was “revolutionary by Arab Gulf Standards.”[78]", metadata={'source': 'https://docs.google.com/document/d/1_Js3Kf_6KLumtmwGDXCf3ZrvbIxeG77QnFeeVqC7LHA/edit', 'title': 'Monarchic Liberalisation', 'when': datetime.datetime(2023, 10, 12, 18, 10, 19, 730000, tzinfo=tzutc())})]

Now that is pretty nice: it found the exact right section of the document that referenced a domestic tyranny, and even correctly pulled out that the 75th footnote contains the answer to our question. But unfortunately, the answer still isn’t useful:

“The author who stated that Said Bin Taimur oversaw a “domestic tyranny” is mentioned in the document titled “Monarchic Liberalisation”. Unfortunately, the name of the author is not provided in the given context”

— GPT-4

Why it didn’t work:

Our problem is that in the Google Docs API, the contents of the critical footnote are provided in a completely separate section of the API response from the main text.

The LangChain Document parser puts the footnotes all together at the end, which naively seems like a reasonable thing to do. But if you’re building a retrieval based system, that behavior means your user will probably never get the contents for a citation in the data they see - because the author’s name and the book they wrote aren’t semantically similar to the quotation itself, so the two will almost never show up together in search results. There’s no realistic way for the embeddings (embeddings are the technical term for the numerical representation of text data stored in a vector store) to know that “Calvin Allen Jr.” is somehow related to “domestic tyranny”. That knowledge lives not in the similarity of the text, but in the data structure that tells us the author Calvin Allen Jr. is the source of the 75th citation.

How to make it work:

Instead of putting the footnotes at the end, when representing a footnoted document to an LLM in a RAG system, you want to insert the details of the footnote in the same place as the quote. Now all of a sudden instead of seeing:

Said Bin Taimur oversaw a “domestic tyranny”[75]. He had been unpopular, but powerful.

The LLM sees:

Said Bin Taimur oversaw a “domestic tyranny” [Calvin Allen Jr, W. Lynn Rigsbee II, Oman under Qaboos : from coup to constitution, 1970-1996, (Frank Cass Publishers 2000).]. He had been unpopular, but powerful.

By taking the underlying data and deliberately restructuring it in a way that makes it easier for an LLM to understand, we turned a question that is impossible for the LLM to answer, into one that is dead simple!
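How you pull the footnote bodies out depends on the source system, but once you have them as a simple number-to-text mapping, the inlining itself can be as small as this sketch (the footnotes dict and the [75]-style marker format are illustrative assumptions):

import re

def inline_footnotes(text: str, footnotes: dict) -> str:
    # Replace numeric markers like [75] with the footnote's text, so the citation
    # travels with the sentence it supports.
    def replace(match):
        body = footnotes.get(int(match.group(1)))
        return f" [{body}]" if body else match.group(0)
    return re.sub(r"\[(\d+)\]", replace, text)

footnotes = {
    75: "Calvin Allen Jr, W. Lynn Rigsbee II, Oman under Qaboos: from coup to "
        "constitution, 1970-1996, (Frank Cass Publishers 2000)",
}
print(inline_footnotes('Said Bin Taimur oversaw a "domestic tyranny"[75]. He had been unpopular, but powerful.', footnotes))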

If that seems like a contrived example based on footnotes, you can actually see a second, similar version of this problem in this exact same excerpt, which begins with a lot of strange special characters:

________________\r\nOman\r\nKING\r\n\tLIBERAL EDUCATION?\r\n\tREFORM?\r\n\tSAID BIN TAIMUR\r\n\tNO\r\n\tNO\r\n\tQABOOS BIN SAID\r\n\tYES\r\n\tYES\r\n\tTable 7\r\n

That’s actually just a table in Google Docs. Will the model understand it properly? Well, sometimes it gets really confused reasoning about the data in these tables, but occasionally, for simpler tables, it does do a really good job. So if you want your end user’s experience to be consistently great, you almost certainly need to make this much easier for the LLM to understand.

Let’s take a look at this in more depth, using question 2b:

Question 2b: “Looking at table 11, which shows the timings of reform, how many monarchs started their reign between 1970 and 1980, and how many of those faced oppositions demanding reforms?”

A very specific question, this time relying not on footnotes but on tables.

The naive, unsuccessful attempt:

To answer this question, we obviously need to know what table 11 is. A human looking at it in Google Docs sees an ordinary table, with one row per monarch and columns for their education, reign dates, and whether the opposition demanded reforms.

The LangChain Google Doc parser transforms it into this:

\r\n________________\r\nTimings of Reforms\r\nKING\r\n\tLIBERAL EDUCATION?\r\n\tSTART OF REIGN\r\n\tDATE OF MOST SIGNIFICANT REFORMS\r\n\tEND OF REIGN\r\n\tREFORMS DEMANDED BY OPPOSITION?\r\n\tDorji Wangchuk\r\n\tYes\r\n\t1952\r\n\t1967-1969\r\n\t1972\r\n\tNo\r\n\tHussein II\r\n\tYes\r\n\t1952\r\n\t1989-1993\r\n\t1999\r\n\tNo\r\n\tHassan II\r\n\tYes\r\n\t1961\r\n\t1996-1998\r\n\t1999\r\n\tNo\r\n\tQaboos\r\n\tYes\r\n\t1970\r\n\t2003-2007\r\n\tACTIVE MONARCH\r\n\tNo\r\n\tJigme Wangchuk\r\n\tYes\r\n\t1972\r\n\t2005-2007\r\n\t2007\r\n\tNo\r\n\tJaber Al-Ahmad Al-Jaber \r\n\tNo\r\n\t1977\r\n\t1992\r\n\t2005\r\n\tYes\r\n\tMswati III\r\n\tYes\r\n\t1986\r\n\t2005\r\n\tACTIVE MONARCH\r\n\tYes\r\n\tHamad Ibn Isa\r\n\tYes\r\n\t1999\r\n\t1999-2002\r\n\tACTIVE MONARCH\r\n\tYes\r\n\tTable 11\r\n

For this question, we’re going to move away from a retrieval approach, and just send the entire document, along with this table, to the LLM. This kind of request is only possible with the most powerful models, with their massive context windows - but even these models could not get good answers.

Anthropic’s response here is almost right, but it’s clearly confused (it missed one of the kings):

“There are 8 monarchs listed in the table.

- Of those 8 monarchs, 2 started their reign between 1970 and 1980:

- Qaboos (Oman) - Began reign in 1970
- Jigme Wangchuk (Bhutan) - Began reign in 1972

- Of those 2 monarchs that began their reign between 1970-1980, neither faced opposition demanding reforms:

- For Qaboos, the table shows “No” under “Reforms demanded by opposition?”
- For Jigme Wangchuk, the table shows “No” under “Reforms demanded by opposition?”

Therefore, of the monarchs that began their reign between 1970-1980, 0 faced opposition demanding reforms.”

— Claude 2

It seems to have approximately understood the table, but unfortunately, not well enough to answer correctly.

And GPT-4-32k also gets it wrong, this time in a different way:

“According to table 11, there were three monarchs who began their reign between 1970 and 1980. These monarchs are Qaboos bin Said of Oman, who began his reign in 1970, Jaber Al-Ahmad Al-Jaber Al-Sabah of Kuwait who started in 1977, and Mswati III of Swaziland, who assumed power in 1986. Out of these three, only two faced significant opposition demanding reforms: Qaboos bin Said and Mswati III.”

— GPT-4

Despite being able to identify that Mswati assumed power in 1986, GPT-4 gets confused and thinks 1986 is somehow between 1970 and 1980.

Why it didn’t work:

There are two different problems here that interact to confuse even the most powerful models. The first is specifically about how the data is represented; the second is a more fundamental issue with today’s LLMs.

Firstly, we aren’t representing the data to the LLM in the clearest or most efficient way. Taking LangChain’s version of the table and putting it into the OpenAI tokenizer, we can see that it is 36% more expensive than representing the table as a CSV. And it’s not just the literal financial cost of the additional tokens that matters: it’s also the performance cost of dissipating the model’s attention whenever that particular table isn’t relevant to the answer.
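If you want to reproduce that kind of comparison yourself, it’s just a token count over the two representations (a sketch; the strings here are truncated stand-ins for the full parser output shown above and its CSV equivalent):

import tiktoken

encoding = tiktoken.encoding_for_model("gpt-4")

raw_parser_output = "Timings of Reforms\r\nKING\r\n\tLIBERAL EDUCATION?\r\n\tSTART OF REIGN\r\n\t..."  # truncated
csv_version = "KING,LIBERAL EDUCATION?,START OF REIGN,DATE OF MOST SIGNIFICANT REFORMS,..."  # truncated

print(len(encoding.encode(raw_parser_output)), "tokens for the raw parser output")
print(len(encoding.encode(csv_version)), "tokens for the CSV version")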

The second factor is something separate that we learned: GPT-4 and all other current LLMs cannot consistently reason correctly about dates. Looking back at GPT-4’s answer in detail, you see that it thinks that Mswati III of Swaziland, who assumed power in 1986, somehow began his reign between 1970 and 1980. We’ve observed this in many different situations: given a lot of context and asked to determine which parts fall in a particular date range, the model gets the answer wrong very often!

This gets harder if users ask things like “the month before last”. Even if you give it the current date, with large amounts of context it will frequently trip up and tell you that stuff that happened a year ago happened just ‘last month’. To be clear, this isn’t ‘hallucination’ - it’s a subtler reasoning problem that just seems to be especially acute with dates. The LLM’s existing trouble reasoning about dates, combined with the confusing representation of the table, makes the answer bad across every model.

How to make it work:

Accounting for all that, if we instead represent the table more clearly (by turning the table itself into a little CSV inside the document), we can finally get an accurate answer from GPT-4-32k (but not from the other models).
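As a rough sketch of that kind of restructuring (the "\r\n\t" cell separator matches the raw parser output shown above, and the column count is an assumption you would need to know or detect per table):

import csv
import io

def docs_table_to_csv(raw: str, num_columns: int) -> str:
    # The flattened table uses "\r\n\t" between cells with no explicit row breaks,
    # so we split into cells and regroup them into rows of num_columns.
    cells = [cell.strip() for cell in raw.split("\r\n\t") if cell.strip()]
    rows = [cells[i : i + num_columns] for i in range(0, len(cells), num_columns)]
    output = io.StringIO()
    csv.writer(output).writerows(rows)
    return output.getvalue()

raw_table = (
    "KING\r\n\tLIBERAL EDUCATION?\r\n\tSTART OF REIGN\r\n\tDATE OF MOST SIGNIFICANT REFORMS"
    "\r\n\tEND OF REIGN\r\n\tREFORMS DEMANDED BY OPPOSITION?"
    "\r\n\tDorji Wangchuk\r\n\tYes\r\n\t1952\r\n\t1967-1969\r\n\t1972\r\n\tNo\r\n\t"
    # ...remaining rows of Table 11 elided for brevity...
)
print(docs_table_to_csv(raw_table, num_columns=6))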

Now all of this nuance is captured in just one single excerpt of one single document. It is a long document, but it’s not absurdly long and is much shorter than many company documents we’ve seen in the real world.

2. Good prompts focus on the hardest part of the task

The Problem Statement:

Our users often use Credal to create custom, domain-specific “AI experts” that live in a Slack channel or exist as an API endpoint (for customers who want to embed an AI expert in a product).

When a user asks a question in Slack, the question is routed to the most relevant AI expert, which then answers using only relevant data[3.].  Our challenge was to use an LLM to:

  1. Accurately match a user question to an AI expert’s description. [This is the hard and important part].
  2. If a match exists, tell us in a format we can consistently parse in code.
  3. Do this as quickly & cheaply as possible.

Each AI expert has a description that is checked against an incoming message for relevance. But there can be up to three different AI experts in a single Slack channel, so we needed an initial step that could triage the user question to the right AI expert with 95%+ accuracy[4.] - an “LLM triager”. This was a language problem, so we tried to solve it using LLMs.

We use the response from the LLM triager to determine which AI expert to call in code. So we need the triager’s response in a structured form, which means the initial LLM response should be structured data like JSON. This isn’t the hard part of the prompt, but it is important for the overall flow to work.

Also, we don’t want to use up all our GPT-4 tokens doing this (making a GPT-4 call on every message posted to a 5000 person company Slack channel is... painful), so we deliberately use GPT-3.5 to avoid massive costs and latency.

The naive, unsuccessful attempt:

We needed a structured format, and LangChain has a specific module for getting structured output from an LLM. We tried using this LangChain module to get the data out as JSON, and also used some standard prompting tricks, like providing a few examples to the LLM (‘few shot learning’), to get the matching - the hardest part of the LLM call - to work.

Below is roughly the version we started out trying to use, with the StructuredOutputParser in LangChain asking for the output to be in JSON.

import { z } from "zod";

import { ChatAnthropic } from "langchain/chat_models/anthropic";
import { PromptTemplate } from "langchain/prompts";
import { StructuredOutputParser, OutputFixingParser } from "langchain/output_parsers";

const parser = StructuredOutputParser.fromZodSchema(
 z.object({
   expertId: z
     .string()
     .uuid()
     .nullish()
     .transform(val => (val === "undefined" || val === "null" ? null : val))
     .describe(
       "The optional expert id of the expert that we think will be able to best answer the user's question. If none of the experts seem relevant, then return undefined for the expertId.",
     ),
   reasoning: z.string().describe("The reasoning for why we picked a particular expertId or returned undefined."),
 }),
);

const formatInstructions = parser.getFormatInstructions();

const prompt = new PromptTemplate({
 template: `You are tasked with monitoring messages that you receive and determining whether the message should be ignored or routed to a specific expert.
 You will be provided with names and descriptions for each expert. The description will describe what the expert does and potentially include specific instructions for when the expert should be used such as "Only reply to questions and not statements" or "Only reply to questions about our customers".
 Your job is to pick the single best expert for the user's message and return the id of the expert or return undefined if no expert should be used and the message should just be ignored.
 Only return an expert id if you are highly confident that the expert is appropriate for the user's message and that you are following all the instructions specified in the expert description.
 If you are not confident about both of these things, then return undefined.

 Let's say you are given the following expert metadata to pick from:
 Name: "Internal data expert", Description/Routing instructions: "This expert picks the best datas to refer to for internal company data requests. Only respond to questions about which data to look at for analytical questions and ignore any statements.", expertId: "25dc3fds6d-f9d2-42a5-ab3f-709442920313"
 Name: "CRM expert", Description/Routing instructions: "This expert answers general questions about our companies customers.", expertId: "96afdsadf1-ee42-4c35-bb23-9d5599615287"
 Name: "HR expert", Description/Routing instructions: "This expert answers generic employee HR questions. Only respond to questions and not statements.", expertId: "fasdasdf-4ca2-42a3-89a1-1c948723894"

 Here are some user queries followed by the expertId you would pick followed by the reasoning for why:
 User message: "What is going on with the Coca Cola account?", expertId: "96f52391-ee42-4c35-bb23-9d5599615287", reasoning: "This is a question regarding a customer account and the CRM expert mentions that it answers questions about our companies customers."
 User message: "Where can I find data about our usage metrics?", expertId: "25dc316d-f9d2-42a5-ab3f-709442920313", reasoning: "This is a question regarding where to look for specific types of data and the internal data expert says that it answers questions about this. The expert routing instructions say that it only responds to questions and not statements, but this is a question."
 User message: "Where can I find information about our health benefits?", expertId: "f1b469ab-4ca2-42a3-89a1-1c038d253373", reasoning: "This is an HR related question and the HR expert says that it answers generic HR related questions. The expert routing instructions say that it only responds to questions and not statements, but this is a question."
 User message: "I love our health benefits!", expertId: undefined, reasoning: "This looks related to HR benefits, but the HR expert routing instructions say that it only responds to questions and not statements, so we should ignore this."
 User message: "Our usage metrics seem to be going up now finally", expertId: undefined, reasoning: "This looks related to usage metrics, but the Internal data expert routing instructions say that it only responds to questions and not statements, so we should ignore this."
 User message: "The PMF data seems to be broken", expertId: undefined, reasoning: "This looks related to datas, but the Internal data expert routing instructions say that it only responds to questions and not statements, so we should ignore this."
 User message: "I just sent Jim his benefits packet", expertId: undefined, reasoning: "This looks related to HR benefits, but the HR expert routing instructions say that it only responds to questions and not statements, so we should ignore this."
 User message: "What is the capital of France?", expertId: undefined, reasoning: "This is not related to any of the experts, so we should ignore this."

 {format_instructions} ## This would end up becoming about 10-20 lines of instructions about how to output the data as JSON - taking almost as much of the prompt as the example cases.

 Here are the available experts:
 {experts}
 Here is the user message:
 {user_message}
 `,
 inputVariables: ["experts", "user_message"],
 partialVariables: { format_instructions: formatInstructions },
});

Why it didn’t succeed:

We ended up finding that the structured formatting trick was really great for getting the triager to respond in structured data, but it also distracted GPT 3.5 from the hard part: actually doing the matching correctly. The ‘few shot’ examples weren’t respected. GPT-4 performed well, but we needed a cheap, fast model and GPT-3.5 picked the right choice only 50% of the time, even with just one expert in the channel.

How to make it work:

So in the end, we had to remove the LangChain tool for getting the output as structured data, so that we could concentrate the prompt on the important part: picking the right expert. To get this to work, we hand-rolled our own version of JSON formatting, using JSON in the few shot examples with a very short explanation in the prompt. This helped concentrate the model’s attention where we needed it: on the example cases.

In short, we found we needed to focus the prompt on the hard part of the task: matching the expert description to the user’s question. In practice, that meant devoting roughly half of the entire prompt to the example cases and the other half to the core instructions about matching, which kept the model focused on what mattered most.

In the final code below, you can see we stripped out all of the LangChain output formatting instructions:

const promptTemplate = `You are tasked with monitoring messages that you receive and determining whether the message should be ignored or routed to a specific AI expert.
You will be provided with names and descriptions for each agent. The description will describe what the AI expert does and potentially include specific instructions for when the expert should be used such as "Only reply to questions and not statements" or "Only reply to questions about our customers".

Your job is to pick the single best AI expert for the user's message and return in JSON formatting the id of the expert, or return null if no expert should be used and the message should just be ignored.

Only return an expert id if you are highly confident that the expert is appropriate for the user's message and that you are following all the instructions specified in the agent description.
If you are not confident about both of these things, then return null.

Let's say you are given the following agent metadata to pick from:
Name: "Internal data agent", Description/Routing instructions: "This expert tells employees where to find the relevant data for a particular type of analytical question.", expertId: "25dc316d-f9d2-42a5-ab3f-709442920313"
Name: "CRM expert", Description/Routing instructions: "This expert answers general questions about our companies customers.", expertId: "96f52391-ee42-4c35-bb23-9d5599615287"
Name: "HR expert", Description/Routing instructions: "This expert answers generic employee HR questions", expertId: "f1b469ab-4ca2-42a3-89a1-1c038d253373"

Here are some user queries followed by the correct answer which is a JSON object with the expertId or null and a reasoning string for why that agent was picked or why the message should be ignored.
User message: "What is going on with the Coca Cola account?", Correct answer: {"expertId": "96f52391-ee42-4c35-bb23-9d5599615287", "reasoning": "This is a question regarding a customer account and the CRM agent mentions that it answers questions about our companies customers."}
User message: "Where can I find data about our usage metrics?", Correct answer: {"expertId": "25dc316d-f9d2-42a5-ab3f-709442920313", "reasoning": "This is a question regarding where to look for specific types of data and the internal dashboard agent says that it answers questions about this. The agent routing instructions say that it only responds to questions and not statements, but this is a question."}
User message: "Where can I find information about our health benefits?", Correct answer: {"expertId": "f1b469ab-4ca2-42a3-89a1-1c038d253373", "reasoning": "This is an HR related question and the HR agent says that it answers generic HR related questions. The expert routing instructions say that it only responds to questions and not statements, but this is a question."}
User message: "I love our health benefits!", Correct answer: {"agentId": null, "reasoning": "This looks related to HR benefits, but the HR agent routing instructions say that it only responds to questions and not statements, so we should ignore this."}
User message: "Our usage metrics seem to be going up now finally", Correct answer: {"agentId": null, "reasoning": "This looks related to usage metrics, but the Internal data expert routing instructions say that it only responds to questions and not statements, so we should ignore this."}
User message: "The PMF dashboard seems to be broken", Correct answer: {"agentId": null, "reasoning": "This looks related to dashboards, but the Internal data expert routing instructions say that it only responds to questions and not statements, so we should ignore this."}
User message: "I just sent Jim his benefits packet", Correct answer: {"expertId": null, "reasoning": "This looks related to HR benefits, but the HR expert routing instructions say that it only responds to questions and not statements, so we should ignore this."}
User message: "What is the capital of France?", Correct answer: {"agentId": null, "reasoning": "This is not related to any of the agents, so we should ignore this."}

Here are the available agents:
{{ agents }}
Here is the user message:
{{ user_message }}`;

Occasionally the response was basically right but not properly formatted JSON. To handle that, when the first call failed a JSON parsing check, we’d make a second GPT-3.5 call that just focused on JSON formatting. Even when we did need two calls, two GPT-3.5 calls still worked out both quicker and much cheaper than one call to GPT-4.
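A minimal sketch of that fallback pattern, in Python for brevity (TRIAGE_PROMPT stands in for the routing prompt above, call_gpt35 is a hypothetical helper, and the calls use the pre-1.0 OpenAI client):

import json
import openai  # assumes the API key is already configured

def call_gpt35(prompt: str) -> str:
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response.choices[0].message.content

def triage(user_message: str) -> dict:
    # First call: focus entirely on picking the right expert.
    raw = call_gpt35(TRIAGE_PROMPT.replace("{{ user_message }}", user_message))
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        # Second call: focus entirely on reformatting the previous answer as JSON.
        fixed = call_gpt35(
            'Reformat this answer as a JSON object with keys "expertId" and "reasoning", '
            "and return only the JSON:\n" + raw
        )
        return json.loads(fixed)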

Lessons Learned:

Model attention is limited, so good prompts are focused on the hardest part of the task

  1. Sequential prompts work better than single prompts doing multiple things, especially if you can have some kind of accuracy check in between
  2. Use as much of the prompt as you physically can to target the hardest and most important part of the task

LLMs need very cleanly formatted data to answer questions on complex data sources

  1. Real world data contains a lot of nuanced structure that isn’t neatly captured by the raw text or by out-of-the-box data loaders from e.g. LangChain. Taking time to format your data in a way that the LLM will understand is essential for high quality performance on the last 30-40% of questions.
  2. Tagging documents or document sections with the key concepts or ideas they contain lets you pre-filter retrieval to just the relevant sections.
  3. LLMs struggle to reason properly about dates.

Wrapping up

In short, we found that when you’re building a demo that only needs to work on small amounts of highly controlled data, you can be pretty naive in your approach. Use the default, out of the box LangChain solution, choose a few good demo questions and a simple enough data source and you’re good to go!

But the real world is not so easy. In the real world, documents are long and strangely formatted. Costs and latency matter. Queries are phrased differently by each user. To manage all of this, you have to resort to first principles thinking about your usecase, and how best to present the data to the LLM in a way that will actually work.

Throwing money at these problems helps only a little: using GPT-4-32k or even Claude 2’s massive 100k window can help with long documents. But even those context windows are very finite relative to some of the legal contracts or spreadsheets out there, and forget about the entire corpus of a 1,000+ person Enterprise’s GDrive, Confluence, Slack, Sharepoint and more.

When you’re building a general system which many users will use to do a wide variety of tasks, you need to find a way to have the right prompting strategy and data representation per usecase: and this very quickly becomes a really hard problem!

Appendix

[1.] I obviously can’t share any customer data on this post, so I’ll be using my own undergraduate thesis to explain this, because I need a data source that is:

  1. A real data source which I would actually have used AI on if I had access to one when I was writing this
  2. A source I understood well enough to evaluate the model’s answers for the questions I tested, and
  3. A source I can share publicly, so that you guys can follow along with my code and reasoning if you desire.

[2.]

If I were to add a second call here (“chaining”) to create a summary based on these three results, we do get a much better response. But again, this approach becomes very inefficient if I’m trying to do this on the largest documents we’ve seen (up to 1,000 pages of legalese), or worse, a corpus of such documents.

Using Claude with its huge context window definitely works the best if your document fits into Claude’s window:

Aside: reading this, Claude’s summary is legitimately better than my summary of my own thesis. It’s pretty amazing that we have software that can read so much text and summarize it in seconds, to a higher standard than the actual human author, but maybe that’s just a reflection on the author!



[3.] We use this feature ourselves for an internal ‘customer genie’, which uses our meeting notes and CRM to answer questions about our customers: if someone on our team asks “what usecases is Customer X using Credal for?” - they get an immediate answer in Slack, citing our notes about Customer X. Additionally, a human expert can 👍 or 👎 the response publicly, so if the bot is wrong or the answer isn’t verified - that’s also clear to everybody. We also have a security genie (that answers common security questions our customers have), and a product genie (that explains how different aspects of Credal’s product work).

The problem is nobody can remember to tag the Security bot for security questions, and the CustomerBot for customer questions, and the KitchenBot for today’s lunch menu etc: users can’t and shouldn’t have to remember all these names. The experience for the end user should be really simple: ask a question in a Slack channel and if any AI expert knows the answer, just that expert should respond. If not, there should be no AI response.

[4.] For nerds - we were aiming for 80% precision and 95% recall

[5.] “This study is divided into five parts. In part 1, the extant literature on soft power is critically reviewed in order to clarify the definition of soft power and to emphasize the importance of specifying a model of the causal mechanisms through which it operates. Part 2 then looks directly at the liberalisation literature. I examine three prominent approaches to monarchic liberalisation, including Samuel Huntington’s formulation of the King’s Dilemma, Rentier State Theory and Michael Herb’s rational choice model. I then discuss the problems that these approaches face in accounting for the observed pattern of reforms in the monarchic world. In Part 3, a model is developed that seeks to account for the observed pattern of the distribution and timings of reforms. This model adds to some of the key insights of Joseph Nye and Samuel Huntington, in order to provide a new framework to account for the observed pattern of reforms.”
