The Problem: modern LLMs write syntactically correct SQL, but they struggle with real-world relational data. This is because real world data and schema is messy, natural language can often be ambiguous and LLMs are not trained on your specific dataset.
Solution: The core NL-to-SQL engine in Dataherald is an LLM based agent which uses Chain of Thought (CoT) reasoning and a number of different tools to generate high accuracy SQL from a given user prompt. The engine achieves this by:
- Collecting context at configuration from the database and sources such as data dictionaries and unstructured documents which are stored in a data store or a vector DB and injected if relevant
- Allowing users to upload sample NL <> SQL pairs (golden SQL) which can be used in few shot prompting or to fine-tune an NL-to-SQL LLM for that specific dataset
- Executing the SQL against the DB to get a few sample rows and recover from errors
- Using an evaluator to assign a confidence score to the generated SQL
The repo includes four services https://github.com/Dataherald/dataherald/tree/main/services:
1- Engine: The core service which includes the LLM agent, vector stores and DB connectors.
2- Admin Console: a NextJS front-end for configuring the engine and observability.
3- Enterprise Backend: Wraps the core engine, adding authentication, caching, and APIs for the frontend.
4- Slackbot: Integrate Dataherald directly into your Slack workflow for on-the-fly data exploration.
Would love to hear from the community on building natural language interfaces to relational data. Anyone live in production without a human in the loop? Thoughts on how to improve performance without spending weeks on model training?
You still need someone who understands why you should use which approach to get the data you need without getting completely wrong numbers back that _look_ perfectly fine but reflect fantasy, not reality.
Who is asking?
Don't know if this is that at all, but it's always worth considering.
It would not be easy, by any means, but I believe it is theoretically possible.
Much more time for us goes to things like:
* Getting the right table, column name spelling
* Disambiguating typos when users define names, and deciding whether they mean a specific name or are using a shorthand
* Disambiguating selection when there are multiple for the same thing: hint - this needs to be learned from usage, not by static schema analysis
* Guard rails, such as on perf
* Translation from non-technical user concepts to analyst concepts
* Enterprise DB schemas are generally large and often blow out the LLM context window, or make things slow, expensive, and lossy if you rely on giant context windows
* Learning and team modes so the model improves over time. User teaching interfaces are especially tricky once you expose them - learning fuzzy vs explicit modes, avoid data leakage, ... .
* A lot of power comes from being part of an agentic loop with other tools like Python and charting, which creates a 'composition' problem that requires AI optimization across any sub-AIs
We have been considering OSS this layer of louie.ai, but it hasn't been a priority for our customers, who are the analyst orgs using our UIs on top (Splunk, OpenSearch, Neo4j, Databricks, ...), and occasionally building their own internal tools in top of our API. Our focus has been building a sustainable and high quality project, and these OSS projects seem to be very different to sustain without also solving that, which is hard enough as-is..
I thought it was fairly obvious that we were talking about a local LLM agent... if DataHerald is a wrapper around only OpenAI, and no other options, then that seems unfortunate.
You can read more on how to do that on Postgres here https://www.2ndquadrant.com/en/blog/application-users-vs-row...
"what are my latest orders" -> select * from orders where user_id=x order by created_date
But really struggle when you have a complex schema that requires joins, and basically has no support when you are describing something that needs outer joins or the like. Would be great to hear if DataHerald has cracked that nut or if it's still a challenge for you as well (no judgement if it is, it seems like a hard problem).
this is precisely why we created Dataherald-- to make it much easier to add that business context so that NL to SQL could actually be good enough to get into production
discord invite in case anything comes up
https://github.com/ryanstout/question_to_sql
These sorts of businesses are really hard to build: incumbents have such an advantage. Makes so much more sense for this to be (a) open source (b) tied to snowflake / powerbi that have free distribution and a good security story.
The agent also can be finetuned on sample NL <> SQL pairs or they can be used in few shot prompting.
https://github.com/Dataherald/dataherald/blob/main/services/...
They are then retrieved and injected based on relevance to the query.
I produce a free sql editor that allows users to plugin openai to perform text to sql: https://www.timestored.com/qstudio/help/ai-text2sql so far uptake is slow and the only good benefit is to spit out a few queries as a starting point. The accuracy went up significantly by sending schema and sample data but it sounds like you've done a good job at going beyond that. I wouldn't say my users or I am convinced it's the future but I'll certainly look at your product tomorrow. Good work and congratulations.
At least with that you get 100% accuracy at the expense of having to use a fixed syntax.
How ould the NLP+ORM system do this?
- the engine keeps an updated catalog of the data (low cardinality columns, their values etc) - taps into query history and finetunes the model to the schema - allows uploading context from unstructured sources like docs and data dictionaries - has an agent which collects all relevant info, generate the SQL, tries to retrieve a few rows to recover from errors and provides an confidence score to the generated SQL
Communication always has been, and continues to be, the hardest part of software development.
any way to follow these developments vs pure LLM research?
Kudos for open sourcing. I think it's really difficult to build a business around that, but there are some successful examples in the space: metabase, airbyte, dbt, (maybe databricks?)
(1) https://github.com/Snowboard-Software/awesome-ai-analytics
Is retraining being done automatically after each PR that modifies the DB? Is there a way to inject the DB structure in the context?
You’re ahead of us in terms of interface but we’re ahead of you in terms of adoption (because of specific choices we’ve made and partnerships we’ve done).
Basically all the enterprises with a lot of data need to "chat with their data" right now.
I can't imagine how many teams are doing similar stuff right now.
are any startups applying LLMs profitable at all ? or is it just a mirage - ie, in the real world, startups are not able to solve users problems well using LLMs.
Of course, there will be some ambiguities, but maybe over time you can somehow constrain the input language a bit, adding some structure to it, such that you can query a database in English-like syntax without any ambiguities.
That would be nice!
- awesome-Text2SQL: https://github.com/eosphoros-ai/Awesome-Text2SQL :
> Curated tutorials and resources for Large Language Models, Text2SQL, Text2DSL、Text2API、Text2Vis and more.
- Awesome-code-llm > Benchmarks > Text to SQL: https://github.com/codefuse-ai/Awesome-Code-LLM#text-to-sql
- underlines/awesome-ml//llm-tools.md > RAG > OpenAI > dataherald,: https://github.com/underlines/awesome-ml/blob/master/llm-too...
- underlines/awesome-ml//llm-tools.md > Benchmarking > Benchmark Suites, Leaderboards: https://github.com/underlines/awesome-ml/blob/master/llm-too... :
- sql-eval: https://github.com/defog-ai/sql-eval :
> This repository contains the code that Defog uses for the evaluation of generated SQL. It's based off the schema from the Spider, but with a new set of hand-selected questions and queries grouped by query category. For an in-depth look into our process of creating this evaluation approach, see this.
> Our testing procedure comprises the following steps. For each question/query pair: 1. We generate a SQL query (possibly from an LLM). 2. We run both the "gold" query and the generated query on their respective database to obtain 2 dataframes with the results. 3. We compare the 2 dataframes using an "exact" and a "subset" match. TODO add link to blogpost. 4. We log these alongside other metrics of interest (e.g. tokens used, latency) and aggregate the results for reporting
- dataherald/services/engine/dataherald/tests/sql_generator/test_generator.py: https://github.com/Dataherald/dataherald/blob/main/services/...