How to select LLM for SQL generation

04 Dec 2024

Mariia Horova (Full stack developer), Oleksii Svitlychnyi (Technical writer)

Imagine a scenario where a data analyst needs to generate SQL queries to extract specific information from a complex data warehouse, but without deep knowledge of SQL. Our goal was to create an AI assistant that translates natural language to SQL, and we wanted to find the optimal large language model (LLM) for this task. We needed a solution that balances performance and cost while handling the limitations of context windows. A context window refers to the amount of text or information the model can remember and process at one time.

In this blog post, we’ll explore the research our team conducted as the initial step in developing our Natural Language Interface for Data Platform.This solution is built on the Retrieval-Augmented Generation (RAG) architecture, which enriches the LLM’s context by integrating external knowledge bases. In RAG, relevant data is retrieved from these knowledge bases to provide the LLM with additional information, improving the accuracy and relevance of its responses. We assessed different LLMs — Llama, Titan, Claude through Amazon Bedrock, and ChatGPT from OpenAI — to determine which would perform best in generating SQL queries from natural language inputs. This research aimed to highlight how prompt chaining can enable smaller, more affordable models to tackle sophisticated tasks with limited context windows.

We will also discuss when it becomes more efficient to upgrade to a more advanced model, like Claude 2.1, especially when tool integration becomes viable. In this context, a tool refers to an external function or utility that the LLM can call to perform specific actions, such as retrieving data from a database or validating an SQL query. These external functions are created by the development team of the application or system that uses the LLM as an API, allowing the LLM to extend its capabilities beyond language understanding.

The problem of limited context windows in LLMs

The primary challenge we faced was working with LLMs that have limited context windows, particularly when generating SQL queries for tasks that require retaining context over multiple steps. Smaller models, such as Llama and Titan, have significant restrictions on the number of tokens they can process, making it difficult to maintain a coherent conversation that includes follow-up questions or clarifications.

While the specific task we focused on was generating SQL queries, the underlying challenge of limited context windows applies broadly to any complex task involving multi-step reasoning or dynamic conversation management. These limitations necessitate creative solutions like prompt chaining to extend a model’s capability.

Prompt chaining to extend the capability of smaller LLMs

To overcome the limitations of smaller models, we used a technique called prompt chaining. This approach involves breaking down a complex task into smaller, manageable prompts, each feeding into the next to maintain the conversation flow and ensure continuity.

In our case, the task of generating SQL queries was split into stages:

Identifying user intent: The initial prompt focused on understanding what the user wanted to achieve (e.g., finding students enrolled in a computer science course).
Retrieving relevant data: The next prompt retrieved necessary metadata from the knowledge base, such as tables, views, and field descriptions.
Generating the SQL: Using the gathered metadata, the LLM generated the SQL query.
Validation: Finally, a separate prompt was used to validate the generated SQL and make adjustments if needed.

This method allowed models like Llama and Titan to handle more complex queries by breaking down the task into smaller components that fit within their token limits.

From prompt chaining to tool integration with advanced LLMs

With newer versions of Claude 2.1 and GPT-4 from OpenAI, tool integration became a game changer for handling complex tasks. This capability allowed these models to independently fetch necessary data and execute tasks, effectively reducing the need for prompt chaining.

For example, Claude and GPT-4 could access a knowledge base to retrieve specific database information or execute an SQL validation without manual prompts. By allowing the models to call functions (tools) on their own, we streamlined the workflow and significantly enhanced efficiency.

The diagram below shows how tool integration enabled Claude and GPT-4 to manage SQL generation and validation autonomously. Tool calls made by the LLMs are highlighted in green on the diagram.

Click or tap on the picture to open it in full size

Comparing GPT, Claude, Llama, and Titan for complex task handling

We tested four popular LLMs — Meta Llama-3, Amazon Titan Text G1 – Premier, Anthropic Claude 3, and GPT-4 Turbo — to evaluate their performance, efficiency, and ability to handle SQL generation tasks under different conditions. Our demo video below provides an in-depth comparison of how each LLM performs in generating SQL queries, managing follow-up questions, and showcases the strengths and limitations of each model in practical, real-world scenarios.

<span style="display: inline-block; width: 0px; overflow: hidden; line-height: 0;" data-mce-type="bookmark" class="mce_SELRES_start"></span>

Meta Llama-3: Llama, the most affordable model, struggled with retaining context across multiple steps. Even with prompt chaining, it lacked the ability to handle complex conversational memory, often treating follow-up questions as entirely new requests. As a result, it was more prone to generating inaccurate SQL queries.
Amazon Titan Text G1 – Premier: Titan performed better than Llama, particularly in managing tasks that involved smaller data sets. However, it still relied heavily on prompt chaining, and the limited token capacity made it difficult to provide adequate context for more sophisticated queries.
Anthropic Claude 3: Claude outperformed both Llama and Titan due to its larger context window, which allowed it to retain a more comprehensive understanding of the task. This significantly reduced the need for prompt chaining, enabling it to generate more accurate SQL queries with less manual intervention.
GPT-4 Turbo: ChatGPT, like Claude, demonstrated the ability to use conversation history effectively. It could independently decide which data it needed at each stage of the task, and it integrated tools to streamline the query generation process. However, both Claude and ChatGPT came with higher operational costs compared to Llama and Titan.

While Llama and Titan were able to generate SQL queries, they struggled to handle follow-up questions due to their limited context windows. Claude and GPT, on the other hand, utilized the conversation history, making them more suitable for tasks that require dynamic interactions.

Conclusion on choosing LLMs for business needs

Our research reveals that the choice between smaller, cost-effective models and advanced, feature-rich LLMs depends heavily on the specific business use case, expected query volume, and the importance of response quality. Here’s how you can make an informed decision:

When smaller models are optimal

For tasks where the business objective allows a certain level of approximation in outputs, or where the use case involves a high volume of relatively simple user requests, smaller models like Llama and Titan are highly effective. By leveraging prompt chaining to segment complex tasks into smaller, context-fitting chunks, these models can perform beyond their inherent limitations. This approach not only maximizes their utility but also significantly reduces operational costs, making them an excellent choice for applications prioritizing affordability over high precision.

For example, if the task involves generating basic SQL queries or performing straightforward data lookups where minor inaccuracies are tolerable, smaller models can deliver acceptable results with well-designed prompt sequences. By carefully orchestrating the prompts, businesses can avoid unnecessary upgrades while still achieving reliable outputs.

When advanced models are necessary

For more complex tasks requiring nuanced understanding, dynamic interactions, or the ability to retain extensive context across multiple steps, advanced models like Claude or GPT-4 Turbo become indispensable. These models, especially when integrated with tools, excel in automating intricate workflows by autonomously fetching and validating data or dynamically adapting to user interactions.

However, even with advanced models, cost optimization is critical. Here, the intelligent use of prompt chaining plays a pivotal role. By minimizing the token usage for both inputs and outputs, and ensuring that each interaction is strategically structured, businesses can control costs while harnessing the full potential of these models. This strategy is particularly beneficial in high-stakes scenarios, such as generating SQL queries for large-scale, complex databases where accuracy is paramount.

Selecting the best fit for our use case

Based on the results of this research, we selected Anthropic’s Claude 3 as the optimal LLM for our use case. While GPT-4 Turbo offered comparable accuracy in SQL generation tasks, Claude delivered the same level of precision at half the cost, making it the more economical choice without compromising quality. However, if Claude had not provided sufficient accuracy for our requirements, we would have chosen GPT-4 Turbo despite its higher cost, as ensuring reliable and precise results takes precedence over cost considerations in critical business applications.

Our expertise in both scenarios

At ABCloudz, we have extensive experience implementing tailored solutions for both scenarios:

We excel at designing efficient prompt-chaining frameworks for smaller models, allowing businesses to overcome context window limitations and achieve high performance at a fraction of the cost.
For advanced models, we specialize in creating data exchange strategies that minimize token usage without compromising response quality. By integrating tools and leveraging prompt chaining effectively, we help businesses unlock the full power of advanced LLMs while controlling operational expenses.

Whether you aim to optimize costs with smaller models or leverage the sophisticated capabilities of advanced LLMs, our expertise ensures that your AI-driven solutions are both cost-effective and high-performing.

If you’d like to explore how we can help you implement the most suitable LLM architecture for your business needs, feel free to reach out.

Contact us

AWS cost optimization
cutting costs
by 38%

Generative AI solutions
certified by AWS

Categories:

Share this article:

Facebook Twitter Linkedin

Hi there! How can I help you?