Robin Platte
Personified Performance Benchmarks
Summary
We built an objective benchmark to assess the answer quality of individualised AI chatbots that extract knowledge from third-party sources such as PDFs and websites to answer questions.
Across 1,593 questions, Personified produced a more accurate answer than the competing solution in 15.76% of cases, while in 5.52% of cases the competing alternative produced a more accurate answer than Personified.
This benchmark was conducted against Chatbase, My AskAi and Langchain.
Table of contents
Introduction
Goal
Systems tested
Results
Methodology
Limitations
Appendix
1 Introduction
The rapid advancement of artificial intelligence and natural language processing technologies has led to the emergence of AI-powered chatbots as a ubiquitous tool for communication, information retrieval, and customer service. These chatbots have demonstrated their potential in various domains, ranging from e-commerce and healthcare to education and entertainment. However, most existing chatbot systems are designed to cater to a broad audience, often lacking the ability to provide personalised and context-aware responses tailored to individual users.
In this paper, we present a benchmark for systems that create individual AI chatbots with their own knowledge bases, aiming to facilitate the development and evaluation of personalised chatbot systems. The benchmark was conducted on the 14th of April 2023.
2 Goal
The benchmark addresses one primary question: which competing system produces the most accurate answers to file-specific inquiries? In this context, accuracy pertains not only to the correctness of the response but also to the overall quality of the answer. High-quality answers contain reliable and relevant details.
3 Systems tested
The systems compared are Personified, Chatbase, My AskAi and Langchain as standalone software. As all systems offer different options for adjustment, we briefly describe the setup for each (an illustrative sketch of these parameters follows below).
Personified
GPT-3.5
Temperature 0
Base prompt internal
Chatbase
GPT-3.5
No option for adjusting Temperature was found
Base prompt used was default of the system: “I want you to act as a document that I am having a conversation with. Your name is "AI Assistant". You will provide me with answers from the given info. If the answer is not included, say exactly "Hmm, I am not sure." and stop after that. Refuse to answer any question not about the info. Never break character.”
AskAI
General System used: “Chat”
GPT-3.5
No option for adjusting Temperature was found
No option for adjusting base prompt was found
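To make these parameters concrete, the sketch below shows roughly how a model choice, temperature and base prompt map onto an OpenAI chat completion call. It is purely illustrative and not the internal implementation of Personified, Chatbase or My AskAi; the function name and example base prompt are our own.

import openai  # OpenAI Python SDK (v0.x), illustrative only

EXAMPLE_BASE_PROMPT = "I want you to act as a document that I am having a conversation with."

def ask(question, context):
    # context: text retrieved from the uploaded PDF (retrieval differs per system)
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",   # "GPT-3.5" in the setups above
        temperature=0,           # deterministic answers where the option exists
        messages=[
            {"role": "system", "content": EXAMPLE_BASE_PROMPT},
            {"role": "user", "content": "Info:\n" + context + "\n\nQuestion: " + question},
        ],
    )
    return response['choices'][0]['message']['content']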
Langchain
As Langchain is a library with a wide range of options, there is no single standard setup. We therefore present the code used below.
from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import CharacterTextSplitter
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma
from langchain.chains import RetrievalQA
from langchain.llms import OpenAI

def langchainpdfloader(pdfstring, chunksize=800, chunkoverlap=300, searchinput=3):
    # Load the PDF, split it into overlapping chunks and index them in Chroma.
    loader = PyPDFLoader(pdfstring)
    documents = loader.load()
    text_splitter = CharacterTextSplitter(chunk_size=chunksize, chunk_overlap=chunkoverlap)
    texts = text_splitter.split_documents(documents)
    embeddings = OpenAIEmbeddings()
    db = Chroma.from_documents(texts, embeddings)
    # Retrieve the k most similar chunks for each query.
    retriever = db.as_retriever(search_type="similarity", search_kwargs={"k": searchinput})
    querymaster = RetrievalQA.from_chain_type(
        llm=OpenAI(), chain_type="map_reduce", retriever=retriever, return_source_documents=True)
    return querymaster

def langchainanswer(query, querymaster):
    # Run the retrieval QA chain and return only the answer text.
    result = querymaster({"query": query})
    return result['result']
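For completeness, a typical call sequence for these two helpers looks as follows (the file name and question are placeholders):

querymaster = langchainpdfloader("example.pdf")  # placeholder path to one benchmark PDF
print(langchainanswer("What is the main topic of the document?", querymaster))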
4 Results
Legend (a short sketch of how the percentages are computed follows the definitions):
unique_questions: Number of individual questions asked.
unique_files: Number of unique files tested.
perc_p_wins: Percentage of questions where Personified gave a more favourable answer than the competitor.
perc_c_wins: Percentage of questions where the competitor gave a more favourable answer than Personified.
perc_both_win: Percentage of questions where neither answer had a clear advantage over the other.
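As a minimal illustration (our own sketch, not the benchmark's reporting code), these percentages can be derived from a list of per-question outcomes as follows:

from collections import Counter

def summarise(outcomes):
    # outcomes: one label per question, e.g. "p" (Personified better),
    # "c" (competitor better) or "both" (no clear advantage)
    counts = Counter(outcomes)
    total = len(outcomes)
    return {
        "unique_questions": total,
        "perc_p_wins": round(100 * counts["p"] / total, 2),
        "perc_c_wins": round(100 * counts["c"] / total, 2),
        "perc_both_win": round(100 * counts["both"] / total, 2),
    }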
Overall Results:

Figure 1

Table 1
Figure 1 and Table 1 show the overall result of the benchmark. In total, 1,593 unique questions were answered. In 15.76% of questions Personified generated a better answer than the competing system, while in 5.52% of questions the competitor generated the better answer. In 78.72% of cases both systems gave a high-quality answer.
Results per Alternative:

Figure 2

Table 2
Figure 2 and Table 2 show the results split by competitor. Personified generated more high-quality answers than any individual competitor.
Results per PDF:

Figure 3

Table 3
Figure 3 and Table 3 display the results categorised by PDF. In all but two PDFs, Personified outperformed the competitor. This indicates that the benchmark's specific outcome depends on the PDFs used, as different systems process files and utilise information in different ways when answering questions. It is worth mentioning that the PDF selection was random, with the sole criterion being language: since the authors of this benchmark are proficient in German and English, only English and German PDFs were used to enable double-checking of the results. Further details can be found in the "Methodology" and "Limitations" sections.
5 Methodology
When assessing the quality of two systems, one needs to make sure that objective measurements are used. In the following we describe in detail how the benchmark was created. Figure 4 summarises the steps, and a rough code sketch of the per-PDF loop follows the figure.
Preparation: We looked for random PDFs on the internet covering a wide range of topics, with varying numbers of pages. Among these PDFs are studies in the fields of biology, engineering and artificial intelligence, as well as novels, short guides, annual reports and political statements. The number of pages per PDF ranges from 10 to more than 150.
The following steps were then carried out per PDF; no system had multiple PDFs uploaded simultaneously.
1. The PDF was uploaded to two different systems to generate an individual knowledge base per system.
2. Afterwards, small chunks of this PDF - around 1,500 words per chunk - were fed to GPT-4 with the goal of generating 5-10 questions and answers per chunk. The prompt used for this can be seen in appendix 1. For a given PDF we then had two systems ready to receive questions and a set of questions and answers per file.
3. The questions generated by GPT-4 were then fed to both systems - prepared in step 1 - to generate answers.
4. In the last step, the answers generated by both systems were compared with the “benchmark answer” generated by GPT-4 in step 2. Using GPT-3 Davinci we assessed which answer was closest to the benchmark answer. The prompt for doing so can be seen in appendix 2.
Note: GPT-4 was used for question and answer generation as we found it to be the most capable model for this task. After various tests we also found that, for the evaluation of answers, GPT-3 Davinci performed better than GPT-4, so we used it for evaluation.

Figure 4
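In code, the per-PDF loop can be sketched roughly as follows, using the helper functions from the appendix. Here ask_system_a and ask_system_b are hypothetical callables standing in for the two systems under test (question in, answer string out), and the PDF text is assumed to be extracted already; this is an illustration of the procedure, not the exact harness we ran.

def run_benchmark_for_pdf(pdf_text, ask_system_a, ask_system_b):
    # ask_system_a / ask_system_b: hypothetical callables, question -> answer string
    results = []
    words = pdf_text.split()
    # Step 2: split the PDF text into chunks of roughly 1,500 words
    chunks = [" ".join(words[i:i + 1500]) for i in range(0, len(words), 1500)]
    for chunk in chunks:
        # Step 2: generate 5-10 benchmark questions and answers per chunk (appendix 1)
        for qa in getquestions(chunk):
            # Step 3: let both prepared systems answer the same question
            answer_a = ask_system_a(qa["q"])
            answer_b = ask_system_b(qa["q"])
            # Step 4: judge which answer is closer to the benchmark answer (appendix 2)
            results.append({"question": qa["q"],
                            "winner": evaluateanswerandresponse(qa["a"], answer_a, answer_b)})
    return results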
6 Limitations
There are two main limitations of this benchmark:
PDFs: As explained in the “Results” section, the outcome of the benchmark depends on which PDFs are fed into the systems. Although we chose random PDFs, the results could have been different with other files.
Number of questions: In general, the more questions asked, the more robust the outcome of the benchmark. While 1,593 questions seems a reasonable sample size, one could argue that it is not enough to draw firm conclusions.
7 Appendix
Appendix 1: Generate questions and answers from a text chunk
import ast
import openai  # assumes the OpenAI Python SDK (v0.x ChatCompletion API)

# Expected output format:
# [{"q":"question1","a":"answer1"},{"q":"question2","a":"answer2"}]
def getquestions(textchunk):
    # Ask GPT-4 to generate 5-10 question/answer pairs for one text chunk.
    returns = openai.ChatCompletion.create(
        model="gpt-4",
        temperature=0.2,
        messages=[
            {"role": "user", "content": """Generate 5 to 10 useful questions and answers from the following text.
Questions should emphasize text and not numbers.
Use format: [{"q":"question1","a":"answer1"},{"q":"question2","a":"answer2"}]
Do not use numbers.
Text:
""" + textchunk}
        ]
    )
    # Parse the model output into a Python list of {"q": ..., "a": ...} dicts.
    content = returns['choices'][0]['message']['content']
    content = content.replace('\n', '')
    content = ast.literal_eval(content)
    return content
Appendix 2: Evaluate which answer is better
It is important to explain our rationale for using a lengthy prompt. We discovered that priming the model with examples enhanced its ability to evaluate answers. Moreover, we observed that providing more examples further improved the evaluation quality, which led us to use an extended prompt.
import openai  # assumes the OpenAI Python SDK (v0.x Completion API)

def evaluateanswerandresponse(realanswer, v1, v2):
    # Ask text-davinci-003 which candidate answer (V1, V2, both, or neither)
    # is closest to the benchmark answer, primed with the few-shot examples below.
    response = openai.Completion.create(
        model='text-davinci-003',
        prompt="""A: The tree is green.
V1: Green is the tree.
V2: A banana is yellow
W: V1
A: Sundar Pichai is CEO.
V1: I dont know.
V2: Sundar Pichai is CEO.
W: V2
A: Mark went to the mall.
V1: In the afternoon mark went to the mall
V2: Mark went to the mall with Amy.
W: V1,V2
A: The company is profitable.
V1: The company makes revenue.
V2: The company has cashflow.
W: V2
A: Deepl can translate all languages.
V1: All languages can be translated by deepl.
V2: Deepl is an orange farm
W: V1
A: Mark has blue hair
V1: A banana is yellow
V2: I like icecream
W: None
A: Mark has blue hair
V1: Mark has blue hair
V2: Blue Hair are on Mark
W: V1,V2
A: The pile was created in 1990
V1: 1990 the pile was invented
V2: I like icecream
W: None
A: At the Yale-New Haven Psychiatric Hospital.
V1: Regarding your question about where Owen is when he is sedated, I do not have any information from the sources you provided that indicates where Owen is when he is sedated. Could you please provide me with more context or information so that I can better understand your question?
V2: Owen is in an ambulance when he is sedated.
W: V2
A: Food choices, particularly the western diet of added oils, dairy, meat, fowl, fish, sugary foods, refined carbohydrates, fruit juices, syrups, and molasses, are a major cause of CAD.
V1: To answer your question, the major cause of CAD is the development of plaques in the coronary arteries that supply blood to the heart muscle. These plaques are formed by a combination of factors, including high blood pressure, high cholesterol, smoking, diabetes, obesity, and a sedentary lifestyle. The western diet, which is high in added oils, dairy, meat, fowl, fish, and sugary foods, can also exacerbate these factors and contribute to the development of CAD. The process of CAD begins with progressive endothelial injury, inflammatory oxidative stress, diminution of nitric oxide production, foam cell formation, and development of plaques that may rupture to cause a myocardial infarction (MI) or stroke. Therefore, addressing the root cause of CAD requires a different approach, such as a plant-based nutrition intervention, which has been shown to prevent, halt, and reverse CAD by improving endothelial function and reducing the risk of major cardiovascular events.
V2: The major cause of CAD is a western diet of added oils, dairy, meat, fowl, fish, sugary foods, refined carbohydrates, fruit juices, syrups, and molasses.
W: V1,V2
A: There are 547 dots in the scatterplot.
V1: Regarding your question about the number of dots in the scatterplot of Figure S7 in the Tempo and Mode of Genome Evolution in the Budding Yeast Subphylum paper, I apologize for my previous response as I misunderstood your question. However, there are no dots in the scatterplot of Figure S7. The figure is a time-calibrated phylogenetic tree that shows the estimated divergence times for major lineages within the budding yeast subphylum. The estimated divergence times for each lineage are shown at the tips of the branches and are represented as horizontal bars with error bars indicating the 95% highest posterior density (HPD) intervals. The figure provides insights into the evolutionary history and diversification of budding yeast, including the timing and patterns of lineage splitting and the relationships among major yeast groups.
V2: There are 547 dots in the scatterplot of Figure S7.
W: V2
A: """+realsnwer+"""
V1: """+v1+"""
V2: """+v2+"""
W:""",
max_tokens=100,
temperature=0,
n=1,
presence_penalty=0,
stop=['A:','V1:','V2:','W:']
)
response = response['choices'][0]['text']
response = response.replace('\n\n', '')
response = response.replace('\n', '')
response = response.replace(' ', '')
return response
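As a usage example (reusing the first few-shot example above as hypothetical inputs), the function returns "V1", "V2", "V1,V2" or "None", which we then count per question:

winner = evaluateanswerandresponse(
    "The tree is green.",   # benchmark answer generated by GPT-4
    "Green is the tree.",   # answer from system 1
    "A banana is yellow")   # answer from system 2
# expected: "V1"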