This blog post examines the operational expenses of the AI-powered chatbot solution introduced in our previous blog posts, Custom AI chatbot using RAG and Deployment of AI-Powered Chatbot. How do you estimate the cost of a RAG chatbot correctly? Start by separating infrastructure, model usage, embedding usage, and expected traffic, then calculate each layer on its own. The solution runs on the AWS cloud and integrates OpenAI’s language model. We will discuss infrastructure costs and analyze how to estimate expenses for AI resources, such as a Language Model as a Service, in a production RAG chatbot.
Cost Structure of a Custom AI Chatbot Solution
Operating expenses for our AI chatbot solution fall into two primary categories: infrastructure and AI resources. These costs will vary depending on your system’s scale and usage frequency and on how much data your RAG chatbot retrieves per request.
Cost of Infrastructure in AWS Cloud
The basic configuration of backend application components we recommend for deployment on AWS is tailored to support up to 10,000 standard chatbot user sessions per day. Below, you will find the specifications of its components and their approximate costs at a relatively high load.
Basic Configuration
If you need support for a higher load, we are ready to help you choose the component configuration that best aligns with your needs and remains cost-effective. AWS and other cloud providers offer significant flexibility for this purpose, which is especially valuable when a chatbot must scale without driving costs too high.
For example, an infrastructure configuration designed to support up to 60,000 users per day is presented below. This setup utilizes higher-performance Amazon EC2 and AWS RDS instances, expands the database deployment across multiple Availability Zones rather than just one as in the basic configuration, and includes increased storage capacity and greater outbound traffic capabilities for a larger RAG chatbot environment.
Higher-Performance Configuration
These are rough estimates, subject to change depending on the amount of resources used. Choosing Neo4j as your main database, with PostgreSQL performing supplementary roles under light load, can notably decrease the operational expenses for both standard and advanced setups. For both scenarios, the expense associated with AWS RDS for PostgreSQL constitutes a significant part of the total budget.
AI Costs: LLM as a Service Calculations
Let’s peel back the layers and explore the core of our AI expenditure, exemplified by our AI Chatbot project leveraging GPT-4 Turbo. In any RAG chatbot, this is typically where the main variable cost appears.
At the heart of our cost analysis is the AI’s token economy. Tokens are the currency of AI’s linguistic capabilities; each token represents a piece of the puzzle in understanding or generating human language. The OpenAI tokenizer effectively illustrates the concept of tokens. The price of a Language Model as a Service (LMaaS) depends on the volume of input tokens it processes and the quantity of output tokens it generates. Generally, there are fewer output tokens than input tokens, but output tokens tend to be costlier. For instance, under the GPT-4 Turbo pricing model, every 1000 input tokens are billed at $0.01, whereas 1000 output tokens carry a price tag of $0.03. For a RAG-based chatbot, these totals grow not only from the user prompt but also from the retrieved context sent to the model.
The number of tokens transmitted is influenced by several factors: the daily user count, the number of requests from each user, and the number of tokens expended per request. The latter factor warrants special attention for a more in-depth examination. Upon receiving a user query, the backend system starts a sequence of interactions with the Large Language Model (LLM) to jointly craft a response. The size of these messages, measured in tokens, determines the LLM’s operational costs. Let’s examine a typical cycle of processing a single user request from this perspective. In this example, we’ve taken the token counts from a real-life case and rounded them for simplicity and clarity. This is also the core cost pattern behind a typical RAG chatbot request.
1. User Question: As an example, let’s take a short question of 10 tokens, such as “Do you have experience with AI or ML projects?”.
2. Backend Instructions to LLM: Upon receiving the user’s question, the backend formulates a query to the LLM consisting of the user’s question and instructions to guide the LLM, ensuring precision and clarity in the task at hand. This request adds 1000 input tokens to our pool.
3. LLM Calls a Specific Tool: The LLM returns a 100-token response to the backend, which includes a call to a specific tool. Think of tools as a set of commands that the backend application can execute upon request from the LLM. In our case, such a command would be a search for portfolio projects relevant to the AI and ML topics the user is asking about. In a RAG flow, this is the step that triggers retrieval from the knowledge base.
4. Backend Retrieves Data: Executing the LLM’s request, the backend retrieves data about the matching portfolio projects from the knowledge base and assembles them into an augmented context for the LLM, 9000 tokens in length. That retrieved context is what makes this a RAG solution rather than a standalone chatbot responding without retrieved business context.
5. LLM Generates Output: Finally, the LLM synthesizes the information and produces a 700-token response, encapsulating the information sought by the user. It should be noted that at this stage, the LLM may return to step 3 if it deems it necessary to request additional context. That extra loop can further increase the cost of a RAG chatbot session.
Let’s now calculate the total number and cost of the input tokens and the output tokens separately, keeping in mind that they are priced differently. We will then add up the two amounts to determine the final cost of processing this user request inside the chatbot workflow.
- Input tokens: 1000 + 9000 = 10,000 tokens = 0.10 USD (at a rate of 0.01 USD per 1000 tokens)
- Output tokens: 100 + 700 = 800 tokens = 0.024 USD (at a rate of 0.03 USD per 1000 tokens)
Total: 0.10 + 0.024 = 0.124 USD
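The arithmetic above can be captured in a small helper. This is a minimal sketch: the token counts are the rounded real-life figures from the example cycle, and the rates are the GPT-4 Turbo prices quoted earlier, which may change over time.

```python
# Per-1000-token GPT-4 Turbo rates quoted above (check current pricing).
INPUT_RATE_PER_1K = 0.01   # USD per 1000 input tokens
OUTPUT_RATE_PER_1K = 0.03  # USD per 1000 output tokens

def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Cost in USD of processing one user request through the LLM."""
    return (input_tokens / 1000) * INPUT_RATE_PER_1K + \
           (output_tokens / 1000) * OUTPUT_RATE_PER_1K

# Token counts from the cycle above: 1000 + 9000 input, 100 + 700 output.
cost = request_cost(1000 + 9000, 100 + 700)
print(f"{cost:.3f} USD")  # 0.124 USD
```

Swapping in the rates of another model is a one-line change, which makes this a convenient way to compare providers.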
Taking this processing cycle’s token counts as average, let’s use them to estimate the cost of using certain Language Model as a Service offerings from OpenAI and AWS Bedrock. Assuming an average of 5 requests per user session and 10 users per day (totaling 50 requests daily), we find the following cost scenario based on prices* and LLM versions at the time of writing this blog post.
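Scaling the per-request figure to the traffic assumptions above can be sketched as follows; the 30-day month and the per-request cost carried over from the example are illustrative, not measured production figures.

```python
# Illustrative scaling of the per-request cost from the example cycle.
COST_PER_REQUEST = 0.124   # USD, from the GPT-4 Turbo calculation above

users_per_day = 10
requests_per_session = 5

requests_per_day = users_per_day * requests_per_session   # 50 requests
daily_cost = requests_per_day * COST_PER_REQUEST          # 6.2 USD/day
monthly_cost = daily_cost * 30                            # ~186 USD/month

print(f"{daily_cost:.2f} USD/day, {monthly_cost:.0f} USD/month")
```

Running the same numbers against each candidate model's rates produces the kind of comparison table referenced in this section.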
*Prices may change over time. Please check the current pricing for OpenAI and Amazon Bedrock.
In our website’s chatbot, we use GPT-4 Turbo because it surpasses other LLMs in intellectual performance. However, if more cost-effective models can effectively manage tasks, there’s no need to pay extra for a more advanced model. Reach out to us, and we’ll find the LLM that best fits your requirements for your RAG chatbot use case.
AI Costs: Embedding Model as a Service Calculations
Embedding models are significantly less expensive than LLMs and usually represent a minor fraction of the overall budget for AI resources. For those who are not yet familiar, embeddings convert text and images into a unique numerical format, capturing their essence. This transformation lets computer algorithms efficiently identify text elements with similar meanings, regardless of word differences. When processing a search query, we convert it into this numerical representation using an embedding service, searching for the closest semantic matches in our knowledge base, also stored numerically. We incur embedding costs whenever we add or update content in the knowledge base and for every user query. For a RAG chatbot, embeddings are essential because they power the retrieval layer.
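Embedding spend follows the same per-token logic. The sketch below uses a hypothetical rate and hypothetical volumes purely for illustration, since the post does not quote a specific embedding price; check your provider's current pricing before relying on the numbers.

```python
# Hypothetical embedding rate for illustration only -- the post quotes no
# specific embedding price, so treat this constant as a placeholder.
EMBEDDING_RATE_PER_1K = 0.0001  # USD per 1000 tokens (assumed)

def embedding_cost(tokens: int, rate_per_1k: float = EMBEDDING_RATE_PER_1K) -> float:
    """Cost in USD of embedding the given number of tokens."""
    return (tokens / 1000) * rate_per_1k

# Illustrative volumes: a 500,000-token knowledge base ingested once,
# plus 50 daily queries of roughly 10 tokens each.
kb_ingestion_cost = embedding_cost(500_000)
daily_query_cost = embedding_cost(50 * 10)
print(kb_ingestion_cost, daily_query_cost)
```

Even with generous volumes, the result is typically orders of magnitude below the LLM bill, which is why embeddings rarely dominate the AI budget.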
How to estimate RAG chatbot costs more accurately
- Start with real chatbot usage assumptions, including users per day, requests per session, and peak chatbot traffic.
- Measure how much context your RAG workflow retrieves on a typical request, because retrieval size directly affects token cost.
- Separate fixed infrastructure costs from variable RAG usage costs, so you can see what scales with demand.
- Compare several models, because the best LLM for one chatbot may be too expensive for another.
- Include embeddings, storage, and database expenses, since a RAG chatbot depends on more than the model alone.
- Review costs again as your knowledge base grows, because a larger RAG system can change retrieval behavior and chatbot spend.
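The checklist above can be folded into a rough end-to-end estimator. This is a sketch under stated assumptions: every parameter value in the example call is an illustrative placeholder (the example cycle's token counts, the GPT-4 Turbo rates quoted earlier, and a hypothetical infrastructure budget), not a measured figure.

```python
def estimate_monthly_cost(
    users_per_day: float,
    requests_per_session: float,
    input_tokens_per_request: float,
    output_tokens_per_request: float,
    input_rate_per_1k: float,
    output_rate_per_1k: float,
    fixed_infra_monthly: float,
    days: int = 30,
) -> float:
    """Rough monthly total: fixed infrastructure plus variable LLM usage."""
    requests = users_per_day * requests_per_session * days
    per_request = (
        input_tokens_per_request / 1000 * input_rate_per_1k
        + output_tokens_per_request / 1000 * output_rate_per_1k
    )
    return fixed_infra_monthly + requests * per_request

# All values below are illustrative placeholders.
total = estimate_monthly_cost(
    users_per_day=10,
    requests_per_session=5,
    input_tokens_per_request=10_000,
    output_tokens_per_request=800,
    input_rate_per_1k=0.01,
    output_rate_per_1k=0.03,
    fixed_infra_monthly=500.0,  # hypothetical AWS bill
)
print(f"{total:.2f} USD/month")
```

Separating the fixed and variable terms this way makes it easy to see which lever (traffic, retrieval size, model choice, or infrastructure) dominates your spend.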
Unlock the Power of Your Business Knowledge
We help you find the perfect Language Model as a Service (LMaaS) – one that’s both smart and affordable. Our app lets you choose between top-notch performance or maximum cost savings, or even find the sweet spot in between, depending on your workload and the goals of your RAG chatbot.
Let’s use your knowledge base to drive real business results. We’ll work with you to design a custom setup that tackles your specific challenges and goals with the right RAG architecture and user experience.
Ready to see how it works? Let’s kick things off with a personalized demo to answer your questions and show you the power of our solution. Contact us today to schedule a meeting and discuss the right RAG chatbot setup for your business!