Tutor.AI Technology
Tutor.AI builds on current developments in large language models (LLMs) and reflects the current state of the technology. This documentation is therefore intended to help readers understand Tutor.AI: it should make transparent to users how good and bad answers come about, and shed light on how Tutor.AI and other chatbots such as ChatGPT, which otherwise appear as a black box, actually work.
The tutor is based on three core components:
- chat interface
- embeddings
- large language model
These components and their interaction are explained below.
Chat Interface
The chat interface is what users see and interact with. It is the interface for communicating with the tutor and enables a seamless dialog. As with other chatbots such as ChatGPT, the user's text entries are saved to enable a coherent conversation. Each request to an LLM is processed individually, without the model itself knowing the history of the conversation. Instead, the context of the conversation is passed to the LLM as additional information, which remains invisible to the user.
Such a request is called a prompt. A distinction must be made between
- the user prompt (i.e. the user's input),
- the system prompt,
- and the context and chat history.
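As an illustration of how these parts fit together, the following is a minimal sketch of a single request in the common chat-completion message format; the role names follow that convention, and the concrete wording and field contents are purely illustrative, not the actual Tutor.AI prompts.

```python
# Hypothetical example: assembling one request from system prompt, context,
# chat history and user prompt. The model never "remembers" earlier turns;
# everything it needs is passed again with every request.
system_prompt = "You are a friendly tutor. Answer only on the basis of the provided sources."
retrieved_context = "Source: Embeddings represent tokens as numerical vectors ..."

chat_history = [
    {"role": "user", "content": "What is a token?"},
    {"role": "assistant", "content": "A token is a word or word stem ..."},
]
user_prompt = "And what is an embedding?"

messages = (
    [{"role": "system", "content": system_prompt + "\n\n" + retrieved_context}]
    + chat_history
    + [{"role": "user", "content": user_prompt}]
)
# `messages` is what is sent to the language model; the user only ever sees
# their own input and the generated answer.
```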
User Prompt
The user prompt refers to the user's direct input or questions. When prompting, it is crucial to provide the LLM with clear instructions or precise questions. Hallucinations, i.e. incorrect or made-up answers, can occur if the language model does not have sufficient information or if the query is formulated too generally.
System Prompt
The system prompt is an often opaque layer between the user request and the language model. It contains instructions for the language model, such as generating friendly and helpful responses and avoiding security risks. The exact system prompts are usually not disclosed, but they can be extracted by clever queries; see, for example, the extracted system prompt of ChatGPT4. System prompts may also contain further instructions, such as complying with copyright and avoiding potentially dangerous content.
There are several aspects to consider when designing the system prompt for Tutor.AI, including the use of the sources provided in the context, knowledge of the topic of the conversation, and honesty when information is missing.
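To make these aspects more concrete, the following is a purely illustrative system prompt covering the three points above; the actual Tutor.AI system prompt is not published here and will differ in wording.

```python
# Hypothetical system prompt for illustration only.
SYSTEM_PROMPT = """You are Tutor.AI, a friendly tutor for this course.
- Base your answers on the source sections provided in the context and refer to them.
- The topic of the conversation is the course content; stay on topic.
- If the provided sources do not contain the answer, say so honestly instead of guessing."""
```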
Context
The context provides the language model with additional information that is relevant for generating responses. This can include the previous chat history, so that the conversation can be continued, as well as other sources and information. In the case of Tutor.AI, sources about the course are provided, but due to the limited capacity of the language model not all of them can be placed directly in the context. Instead, the sources are first searched for sections that match the user's query, and only the best matches are provided.
Embeddings
Embeddings are a method of representing words or sentences in a numerical vector space. This representation makes it possible to capture semantic similarities between words or sentences. This information is critical for the generation of answers by Tutor.AI and for determining the relevance of an answer to a user query.
The embedding of words is therefore an abstract representation of so-called tokens. Tokens can be whole words, but also individual word stems. During embedding, tokens are translated into a vector with the help of a vocabulary.
Text extraction
In order to generate the embeddings, the sources, which are usually available as PDF documents, must first be extracted and converted into text files. While traditional methods often require complex text cleaning, modern LLM embeddings can also deal with poorly cleaned text, as they have been trained with a large amount of text data.
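As a sketch of this extraction step: the documentation does not specify which tool Tutor.AI uses, so the pypdf package here is only an assumption for illustration, and the file names are placeholders.

```python
from pypdf import PdfReader

# Hypothetical extraction of a PDF source into a plain text file.
reader = PdfReader("lecture_notes.pdf")
text = "\n".join(page.extract_text() or "" for page in reader.pages)

with open("lecture_notes.txt", "w", encoding="utf-8") as f:
    f.write(text)
```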
Tokenizer
The tokenizer converts texts into individual tokens, e.g. words or word stems. It converts the tokens into numbers using a vocabulary and marks unknown words as well as the beginning and end of a sequence. In addition, so-called MASK tokens are used during training so that the model learns to predict missing words. Since an LLM can only process a certain number of tokens, sequences are padded to achieve the desired length.
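The following minimal sketch shows these steps with a standard Hugging Face tokenizer; the BERT tokenizer is only an example, and the tokenizer actually used by Tutor.AI may differ.

```python
from transformers import AutoTokenizer

# Illustrative tokenizer (bert-base-uncased is an assumption, not Tutor.AI's actual model).
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

encoded = tokenizer(
    "Embeddings represent tokens as vectors.",
    padding="max_length",  # pad the sequence to the fixed length expected by the model
    max_length=16,
    truncation=True,
)
# Tokens (words or word stems) plus special markers for sequence start/end and padding.
print(tokenizer.convert_ids_to_tokens(encoded["input_ids"]))
# The numerical ids looked up in the vocabulary; unknown words become the UNK token.
print(encoded["input_ids"])
# Special tokens used for padding, unknown words and masked-word training.
print(tokenizer.pad_token, tokenizer.unk_token, tokenizer.mask_token)
```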
Embeddings
Embeddings encode tokens into numerical vectors that represent the text. These vectors have a fixed length and are calculated from the text data by the embedding layers of the LLM. In addition to token embeddings, positional embeddings are also used to embed the position of each token in the text. Embeddings make it possible to convert categorical data into a continuous representation, taking into account similarities and relationships between the tokens.
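A toy example of this lookup may help; the matrices below are random placeholders, whereas in a real LLM they are learned during training.

```python
import numpy as np

# Toy dimensions: a vocabulary of 10 tokens, embedding length 4, sequences of up to 8 tokens.
vocab_size, max_len, dim = 10, 8, 4
token_embedding = np.random.rand(vocab_size, dim)    # one row per vocabulary entry
position_embedding = np.random.rand(max_len, dim)    # one row per position in the sequence

token_ids = np.array([3, 7, 1])          # output of the tokenizer
positions = np.arange(len(token_ids))    # positions 0, 1, 2

# The input representation of each token is its token embedding plus its positional embedding.
vectors = token_embedding[token_ids] + position_embedding[positions]
print(vectors.shape)  # (3, 4): one fixed-length vector per token
```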
To embed entire sections, the embeddings of the section’s tokens are then averaged. Naturally, information is lost in the process, but this allows entire sections to be embedded. The Sentence Transformers package was used for Tutor.AI to embed sections.
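A minimal sketch with the Sentence Transformers package could look as follows; the model name is an assumption for illustration and may differ from the one used for Tutor.AI.

```python
from sentence_transformers import SentenceTransformer

# Illustrative model choice; encode() tokenizes each section, embeds the tokens
# and pools (averages) them into one fixed-length vector per section.
model = SentenceTransformer("all-MiniLM-L6-v2")

sections = [
    "Embeddings represent words or sentences as numerical vectors.",
    "Kubernetes restarts failed pods automatically.",
]
section_vectors = model.encode(sections)
print(section_vectors.shape)  # one vector of fixed length per section, e.g. (2, 384)
```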
The following diagram shows how the embeddings are generated.
Similarity
The abstract representation by embeddings makes it possible to find similar entries in the database. The similarity between two vectors is often calculated using metrics such as cosine similarity, i.e. the angle between the vectors in the d-dimensional vector space of the embeddings. The most similar matches and the corresponding sections are then returned from the database and made available to the LLM in the context for answer generation.
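The following self-contained sketch illustrates the idea with toy vectors; the section texts and numbers are made up for illustration.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # Cosine of the angle between two vectors in the d-dimensional embedding space;
    # values close to 1 indicate high semantic similarity.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy embeddings standing in for the stored section vectors and an embedded user query.
section_vectors = {
    "Embeddings represent text as vectors.": np.array([0.9, 0.1, 0.0]),
    "Kubernetes restarts failed pods.": np.array([0.1, 0.8, 0.3]),
}
query_vector = np.array([0.8, 0.2, 0.1])

# Rank the stored sections by similarity to the query and return the best match,
# which would then be passed to the LLM as context.
best_section = max(section_vectors, key=lambda s: cosine_similarity(query_vector, section_vectors[s]))
print(best_section)
```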
Large Language Model
The large language model (LLM) generates answers using the information provided in the context. However, as language models require substantial hardware resources, an API is used. Tutor.AI uses Mistral, which is hosted in France in accordance with European data protection regulations and is also available as an open-source model. The University of Münster is currently working on hosting a language model, such as Mistral, on its own resources so that Tutor.AI can be provided independently of external providers in the future.
The language model formulates the answer based on the information provided in the context and then returns it to the user in the chat.
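As an illustration of this step, the following is a minimal sketch of a chat-completion request against the Mistral API over HTTP; the model name and message contents are placeholders, and the exact model and parameters used by Tutor.AI may differ.

```python
import os
import requests

# Hypothetical request: context and question are sent together, the model returns the answer.
response = requests.post(
    "https://api.mistral.ai/v1/chat/completions",
    headers={"Authorization": f"Bearer {os.environ['MISTRAL_API_KEY']}"},
    json={
        "model": "mistral-small-latest",  # placeholder model name
        "messages": [
            {"role": "system", "content": "Answer only on the basis of the provided sources."},
            {"role": "user", "content": "Context:\n<best matching sections>\n\nQuestion: What is an embedding?"},
        ],
    },
    timeout=60,
)
print(response.json()["choices"][0]["message"]["content"])
```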
Deployment
Except for the language model, Tutor.AI is located entirely in the cloud of the University of Münster. The following components are hosted as individual containers in the university's Kubernetes cluster and can communicate with each other internally:
- chat interface
- embedding
- embedding database
QDrant is used as the embedding database, while fastembed, also from QDrant, is used for embedding; it is specially optimized for running on smaller hardware. The embeddings of the sources and documents were previously generated with Sentence Transformers on the university's hardware, where graphics cards are available in the university's JupyterHub to accelerate the calculation of the embeddings.
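A minimal sketch of how these two components could interact is shown below, assuming the qdrant-client package with its built-in fastembed integration; the URL, collection name and documents are placeholders.

```python
from qdrant_client import QdrantClient

# Hypothetical connection to the Qdrant container inside the cluster.
client = QdrantClient(url="http://qdrant:6333")

# add() embeds the documents with fastembed and stores the vectors together with the text.
client.add(
    collection_name="course_sources",
    documents=["Section 1 of the lecture notes ...", "Section 2 of the lecture notes ..."],
)

# query() embeds the user question in the same way and returns the most similar sections.
hits = client.query(collection_name="course_sources", query_text="What is an embedding?", limit=3)
for hit in hits:
    print(hit.score, hit.document)
```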
The chat interface (1.) forwards the users' requests to the embedding service (2.), which returns the embedded query to the interface. The interface then compares the query with the embedding database (3.) and sends the best matches, together with the request, as context to Mistral.
The following diagram shows how a request is processed by Tutor.AI.
As the embedding is sometimes too slow on normal hardware without graphics card support, the Mistral embeddings API is implemented as an alternative solution.
The containers are exposed to different workloads, with the database and embedding requiring more resources than the chat interface. Kubernetes ensures that multiple instances of the containers (so-called pods) can run simultaneously and process the requests independently of each other. If a pod fails, it is automatically restarted and the remaining pods take over the requests.