Imagine a data analyst who needs to generate SQL queries to extract specific information from a complex data warehouse but lacks deep knowledge of SQL. Our goal was to create an AI assistant that translates natural language to SQL, and we wanted to find the optimal large language model (LLM) for this task. How do you choose the best LLM for SQL generation? Start by matching the model's context capacity, tool-use ability, and cost profile to the complexity of the task. We needed a solution that balances performance and cost while handling the limitations of context windows. A context window is the amount of text the model can remember and process at one time.
In this blog post, we’ll explore the research our team conducted as the initial step in developing our Natural Language Interface for Data Platform.
This solution is built on the Retrieval-Augmented Generation (RAG) architecture, which enriches the LLM’s context by integrating external knowledge bases. In RAG, relevant data is retrieved from these knowledge bases to provide the LLM with additional information, improving the accuracy and relevance of its responses. We assessed different LLMs — Llama, Titan, Claude through Amazon Bedrock, and ChatGPT from OpenAI — to determine which would perform best in generating SQL queries from natural language inputs. This research aimed to highlight how prompt chaining can enable smaller, more affordable models to tackle sophisticated tasks with limited context windows. It also helped us evaluate what actually makes the best LLM for a production workflow, not just for a lab demo.
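As a rough illustration of the retrieval step in RAG, the sketch below enriches a prompt with the most relevant schema entries before SQL generation. The schema entries and the keyword-overlap scoring are invented for this example; a production pipeline would use vector embeddings and a real metadata catalog rather than naive word matching.

```python
# Minimal RAG-style retrieval sketch. The knowledge base and scoring
# are illustrative assumptions, not our production implementation.

KNOWLEDGE_BASE = [
    {"table": "students", "description": "student id, name, enrollment year"},
    {"table": "courses", "description": "course id, title, department"},
    {"table": "enrollments", "description": "links students to courses with grades"},
]

def retrieve(question: str, top_k: int = 2) -> list[dict]:
    """Rank knowledge-base entries by naive keyword overlap with the question."""
    words = set(question.lower().split())
    scored = sorted(
        KNOWLEDGE_BASE,
        key=lambda e: len(words & set(e["description"].lower().split())),
        reverse=True,
    )
    return scored[:top_k]

def build_prompt(question: str) -> str:
    """Inject the retrieved schema context into the prompt before SQL generation."""
    context = "\n".join(f"- {e['table']}: {e['description']}" for e in retrieve(question))
    return f"Schema context:\n{context}\n\nWrite SQL for: {question}"
```

The point is the shape of the flow: retrieve first, then generate with the retrieved context in the prompt, so the model never has to hold the entire warehouse schema in its context window.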
We will also discuss when it becomes more efficient to upgrade to a more advanced model, like Claude 2.1, especially once tool integration becomes viable. In this context, a tool is an external function or utility that the LLM can call to perform specific actions, such as retrieving data from a database or validating an SQL query. These functions are created by the development team of the application that calls the LLM through an API, letting the model extend its capabilities beyond language understanding. That matters because the best LLM is not automatically the biggest model, but the one that fits the business workflow most effectively.
The problem of limited context windows in LLMs
The primary challenge we faced was working with LLMs that have limited context windows, particularly when generating SQL queries for tasks that require retaining context over multiple steps. Smaller models, such as Llama and Titan, have significant restrictions on the number of tokens they can process, making it difficult to maintain a coherent conversation that includes follow-up questions or clarifications.
While the specific task we focused on was generating SQL queries, the underlying challenge of limited context windows applies broadly to any complex task involving multi-step reasoning or dynamic conversation management. These limitations necessitate creative solutions like prompt chaining to extend a model’s capability. In practice, any team searching for the best LLM must examine how well a model preserves context when the interaction unfolds across several steps.
Prompt chaining to extend the capability of smaller LLMs
To overcome the limitations of smaller models, we used a technique called prompt chaining. This approach involves breaking down a complex task into smaller, manageable prompts, each feeding into the next to maintain the conversation flow and ensure continuity.
In our case, the task of generating SQL queries was split into stages:
- Identifying user intent: The initial prompt focused on understanding what the user wanted to achieve (e.g., finding students enrolled in a computer science course).
- Retrieving relevant data: The next prompt retrieved necessary metadata from the knowledge base, such as tables, views, and field descriptions.
- Generating the SQL: Using the gathered metadata, the LLM generated the SQL query.
- Validation: Finally, a separate prompt was used to validate the generated SQL and make adjustments if needed.
This method allowed models like Llama and Titan to handle more complex queries by breaking down the task into smaller components that fit within their token limits. For cost-sensitive workloads, this can still be the best LLM strategy even when the underlying model is relatively small.
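A minimal sketch of such a chain might look like the following, with `call_llm` as a stand-in for whatever completion API you use (Amazon Bedrock, OpenAI, or another provider); it is not our production code. Each of the four stages is a small, focused prompt whose output feeds the next, so no single prompt has to exceed the model's token limit:

```python
# Sketch of the four-stage prompt chain described above.

def call_llm(prompt: str) -> str:
    # Placeholder for a real model call (e.g., a Bedrock or OpenAI request).
    return f"<response to: {prompt[:40]}...>"

def generate_sql(question: str) -> str:
    # Stage 1: identify user intent in a small, focused prompt.
    intent = call_llm(f"State the user's goal in one sentence: {question}")
    # Stage 2: retrieve relevant metadata (tables, views, field descriptions).
    metadata = call_llm(f"List the tables and fields needed for: {intent}")
    # Stage 3: generate the SQL using only the gathered metadata.
    sql = call_llm(f"Write SQL using this schema:\n{metadata}\nGoal: {intent}")
    # Stage 4: validate the generated query and fix errors if needed.
    return call_llm(f"Check this SQL and fix any errors:\n{sql}")
```

Because each stage passes forward only its distilled output rather than the full conversation, the chain stays within the token budget of smaller models like Llama and Titan.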
From prompt chaining to tool integration with advanced LLMs
With newer versions of Claude 2.1 and GPT-4 from OpenAI, tool integration became a game changer for handling complex tasks. This capability allowed these models to independently fetch necessary data and execute tasks, effectively reducing the need for prompt chaining.
For example, Claude and GPT-4 could access a knowledge base to retrieve specific database information or execute an SQL validation without manual prompts. By allowing the models to call functions (tools) on their own, we streamlined the workflow and significantly enhanced efficiency.
The diagram below shows how tool integration enabled Claude and GPT-4 to manage SQL generation and validation autonomously. Tool calls made by the LLMs are highlighted in green on the diagram. For more dynamic workflows, this can become a deciding factor in choosing the best LLM.
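On the application side, tool integration boils down to exposing functions the model may request and dispatching those requests. The sketch below is illustrative only: the tool names and the JSON shape of the request are assumptions for this example, not any vendor's actual wire format, and the tool bodies are stand-ins for real catalog lookups and validators.

```python
# Hypothetical tool registry and dispatch loop for an LLM with tool use.
import json

def get_table_metadata(table: str) -> str:
    # Stand-in for a real metadata-catalog lookup.
    return f"columns of {table}: id, name"

def validate_sql(query: str) -> str:
    # Stand-in for a real SQL validator.
    return "ok" if query.strip().lower().startswith("select") else "invalid"

TOOLS = {"get_table_metadata": get_table_metadata, "validate_sql": validate_sql}

def handle_tool_call(raw: str) -> str:
    """Dispatch a model-emitted tool request such as
    {"tool": "validate_sql", "args": {"query": "SELECT ..."}}."""
    request = json.loads(raw)
    return TOOLS[request["tool"]](**request["args"])
```

With a loop like this, the model decides which tool to invoke and when; the application merely executes the request and returns the result, which is what removes the need for manually sequenced prompts.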
Comparing GPT, Claude, Llama, and Titan for complex task handling
We tested four popular LLMs (Meta Llama-3, Amazon Titan Text G1 – Premier, Anthropic Claude 3, and GPT-4 Turbo) to evaluate their performance, efficiency, and ability to handle SQL generation tasks under different conditions. Our demo video below provides an in-depth comparison of how each LLM performs in generating SQL queries and managing follow-up questions, showcasing the strengths and limitations of each model in practical, real-world scenarios. This comparison was essential because the best LLM for real use is the one that balances accuracy, continuity, speed, and cost.
- Meta Llama-3: Llama, the most affordable model, struggled with retaining context across multiple steps. Even with prompt chaining, it lacked the ability to handle complex conversational memory, often treating follow-up questions as entirely new requests. As a result, it was more prone to generating inaccurate SQL queries. That made it difficult to treat Llama as the best LLM for workflows where users refine requests over several turns.
- Amazon Titan Text G1 – Premier: Titan performed better than Llama, particularly in managing tasks that involved smaller data sets. However, it still relied heavily on prompt chaining, and the limited token capacity made it difficult to provide adequate context for more sophisticated queries. Titan can still be a strong choice in narrower scenarios where cost matters more than rich conversational handling.
- Anthropic Claude 3: Claude outperformed both Llama and Titan due to its larger context window, which allowed it to retain a more comprehensive understanding of the task. This significantly reduced the need for prompt chaining, enabling it to generate more accurate SQL queries with less manual intervention. For our scenario, Claude quickly emerged as a strong candidate because it combined quality with practical efficiency.
- GPT-4 Turbo: Like Claude, GPT-4 Turbo demonstrated the ability to use conversation history effectively. It could independently decide which data it needed at each stage of the task, and it integrated tools to streamline query generation. However, both Claude and GPT-4 Turbo came with higher operational costs than Llama and Titan. GPT-4 Turbo remained a strong alternative for teams that prioritize maximum adaptability.
While Llama and Titan were able to generate SQL queries, they struggled to handle follow-up questions due to their limited context windows. Claude and GPT, on the other hand, utilized the conversation history, making them more suitable for tasks that require dynamic interactions. That difference is central when selecting the right model for an interface that must support clarifications and iterative refinement.
Conclusion on choosing LLMs for business needs
Our research reveals that the choice between smaller, cost-effective models and advanced, feature-rich LLMs depends heavily on the specific business use case, expected query volume, and the importance of response quality. Here’s how you can make an informed decision:
When smaller models are optimal
For tasks where the business objective allows a certain level of approximation in outputs, or where the use case involves a high volume of relatively simple user requests, smaller models like Llama and Titan are highly effective. By leveraging prompt chaining to segment complex tasks into smaller, context-fitting chunks, these models can perform beyond their inherent limitations. This approach not only maximizes their utility but also significantly reduces operational costs, making them an excellent choice for applications prioritizing affordability over high precision. In these conditions, the most practical choice is often the model that delivers acceptable results at sustainable scale.
For example, if the task involves generating basic SQL queries or performing straightforward data lookups where minor inaccuracies are tolerable, smaller models can deliver acceptable results with well-designed prompt sequences. By carefully orchestrating the prompts, businesses can avoid unnecessary upgrades while still achieving reliable outputs. That can make a smaller model the most rational option from an ROI perspective.
When advanced models are necessary
For more complex tasks requiring nuanced understanding, dynamic interactions, or the ability to retain extensive context across multiple steps, advanced models like Claude or GPT-4 Turbo become indispensable. These models, especially when integrated with tools, excel in automating intricate workflows by autonomously fetching and validating data or dynamically adapting to user interactions.
However, even with advanced models, cost optimization is critical. Here, the intelligent use of prompt chaining plays a pivotal role. By minimizing the token usage for both inputs and outputs, and ensuring that each interaction is strategically structured, businesses can control costs while harnessing the full potential of these models. This strategy is particularly beneficial in high-stakes scenarios, such as generating SQL queries for large-scale, complex databases where accuracy is paramount. In these scenarios, quality under real operational pressure matters more than raw model size alone.
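One simple cost-control tactic in this spirit is trimming conversation history to an approximate token budget before each call. The sketch below uses a crude four-characters-per-token estimate, which is only a heuristic of our own choosing; real billing depends on the model's actual tokenizer, so treat this as an illustration of the idea rather than a precise accounting.

```python
# Illustrative cost-control helper: keep only the most recent conversation
# turns that fit within an approximate token budget.

def trim_history(turns: list[str], max_tokens: int = 500) -> list[str]:
    kept, used = [], 0
    for turn in reversed(turns):       # walk from the newest turn backward
        cost = max(1, len(turn) // 4)  # crude estimate: ~4 characters per token
        if used + cost > max_tokens:
            break                      # older turns no longer fit the budget
        kept.append(turn)
        used += cost
    return list(reversed(kept))        # restore chronological order
```

Capping input size this way keeps per-request costs predictable even when users hold long, exploratory conversations with an advanced model.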
Selecting the best fit for our use case
Based on the results of this research, we selected Anthropic’s Claude 3 as the optimal LLM for our use case. While GPT-4 Turbo offered comparable accuracy in SQL generation tasks, Claude delivered the same level of precision at half the cost, making it the more economical choice without compromising quality. However, if Claude had not provided sufficient accuracy for our requirements, we would have chosen GPT-4 Turbo despite its higher cost, as ensuring reliable and precise results takes precedence over cost considerations in critical business applications. For us, Claude was the best LLM because it reached the required quality level without forcing unnecessary spend.
Our expertise in both scenarios
At ABCloudz, we have extensive experience implementing tailored solutions for both scenarios:
- We excel at designing efficient prompt-chaining frameworks for smaller models, allowing businesses to overcome context window limitations and achieve high performance at a fraction of the cost.
- For advanced models, we specialize in creating data exchange strategies that minimize token usage without compromising response quality. By integrating tools and leveraging prompt chaining effectively, we help businesses unlock the full power of advanced LLMs while controlling operational expenses.
Whether you aim to optimize costs with smaller models or leverage the sophisticated capabilities of advanced LLMs, our expertise ensures that your AI-driven solutions are both cost-effective and high-performing. We help clients identify the best LLM for their architecture, workload profile, and quality requirements.
How to choose the best LLM for SQL generation
- Start with the business goal: the best LLM for exploratory analytics may differ from the right model for a production assistant.
- Measure how many follow-up questions users ask, since frequent clarification quickly changes model requirements.
- Check context-window limits early, because long schemas and conversation history can break query quality in smaller models.
- Use prompt chaining when budget matters, especially if you need smaller models to produce reliable SQL queries within tight token limits.
- Adopt tool integration when workflows become more dynamic; much depends on whether the model can retrieve metadata and validate an SQL query on its own.
- Compare cost and accuracy together: the right model meets quality targets without overspending.
- Test models on real business prompts, not generic benchmarks, because the winner must succeed on your actual use cases and follow-up patterns.
Contact us
At ABCloudz, we have the experience to guide you through both approaches. If you'd like to explore how we can help you implement the most suitable LLM architecture for your business needs, feel free to reach out.