Instead of going for generative, few-shot models, I am starting to look at the other end of the spectrum: Binary classification into deterministic query building.
With ChatGPT, you cannot realistically explain why you got some output in a way that anyone other than an AI/ML expert would find satisfying. With binary classifiers, you can precisely explain how some input resulted in some output in terms that a business person could easily digest - "You mentioned this table so it assumed you wanted to constrain on XYZ. Here's the trace from the classifiers...".
I've proposed a few schemes where you define groups of classifiers for each SQL building concern - Which tables are involved, which columns, is a join or aggregate implied, general context of business use, etc. Clearly, there are holes with this scheme, but in our domain we could plausibly fine-tune our humans to be a little bit more verbose in their use of the automagic SQL vending machine. Several hours spent training humans is probably a lot cheaper & easier than getting ChatGPT, et. al. to consistently play by our rules.
Working with these models every day, it's clear that they can certainly interpolate between points in latent space and generate sensible answers to unseen questions, but it's pretty clear that they don't generalize. I've seen far to many examples of models failing to display any sense of generalization to believe otherwise.
That's not to say that interpolation in a rich latent space of text isn't very useful. But it's not the same level of abstraction that comes from true generalization in this space.
-ChatGPT initially gives the same diagnosis the vets did, Babesiosis
-It notes that Babesiosis may have either been a misdiagnosis or there may be a secondary condition/infection causing the remaining symptoms after the Babesiosis treatment didn't resolve all of them
-It suggests such a hypothetical secondary condition could be IMHA, which the article notes is an extremely common complication of Babesiosis with this specific dog breed
-A quick Google search brings up a fair amount of literature about the association between Babesiosis and IMHA
So in fact this is the opposite of a never before seen situation, ChatGPT was just regurgitating common comorbidities of Babesiosis and the vets in question are terrible at their job.
IMHO OP is talking about "explainability" of the results, which is notoriously bad for current AI. For certain applications (idk if SQL would be one but mortgage application might be one) it is required to be able to explain how the computer got to the decision.
Binary classifiers don't generalize?
Just because my output is not generative does not mean we are cannot learn / generalize elsewhere. Think of it as a 2-stage process.
At the same time, we're also moving to make a lot more of that process AI controlled, just across LLM calls (e.g., LLM-dictated ones), so it's a funny maturity process.
I have no clue what this means.
paying engineer for the same job will be many factors more expensive.
Sure, some use cases might work but it’s not going to be a thing that Just Works™ for products even accuracy issues aside. There’s just so much data to feed into each and every prompt, schemas and all. Many of them too if you want to enable joins.
It's hard for something you interact with manually to provide positive value less than 0.2c.
Finetuned is 1.2cents/1k in and 1.6cents/1k out. So it'll likely be closer to 2cents depending on what you're doing.
I'm not saying it's not useful, at 2c per query you have to be more "purposeful" as they could certainly add up depending on how you use it compared to 0.2c.
In Zillion* I added an experimental feature that uses OpenAI to form a report API call from natural language. Fine tuning (vs prompt tuning) this on the semantic model with gpt-3.5 would probably yield some notable improvement in abilities, as I view it as more of a toy feature at the moment.
A semantic data layer is an abstraction layer that sits between the raw data stored in databases and the applications or analytics tools that use this data. The purpose of this layer is to provide a consistent, business-friendly interface to data that might be stored in a variety of formats, tables, or systems. By doing so, the semantic data layer helps in simplifying complex data structures into more meaningful, understandable models. Key Features:
Unified View: It provides a unified view of data from multiple sources, making it easier for users to access and understand the data without having to know the underlying structure or complexity.
Data Governance: It can enforce business rules, security policies, and data quality measures, ensuring that data is consistent, compliant, and accessible only by authorized users.
Flexibility: A semantic layer is often designed to be flexible, allowing business users to adapt to changes in the data model without requiring changes to the applications themselves.
Query Simplification: The layer simplifies the process of querying data by providing a more user-friendly way of accessing and manipulating data, often through a drag-and-drop interface or other graphical tools.
Decoupling: It decouples application development from data source changes. When underlying data changes, you don't necessarily have to update all the applications that use it; you might only need to update the semantic layer.
Data Integration: It can integrate data from multiple sources, providing a single "source of truth" for business users and applications.
Purpose: Simplification: Make data more accessible and easier to understand for non-technical users.
Consistency: Ensure that everyone is working from the same definitions and business rules.
Efficiency: Reduce the time and complexity involved in generating reports, analytics, and other data-driven functions.
Data Quality: Help to enforce data quality and governance policies.
Security: Provide a mechanism for enforcing security rules on who can see or modify data.
Adaptability: Enable quicker adaptation to changes in business needs or data structures.
By providing a semantic data layer, organizations can ensure that their data is not only high-quality and secure but also that it can be easily used for making informed decisions. This is particularly valuable in today's data-driven business environment.Also, wondering if anyone has found research on the inverse of this approach to the problem, i.e., instead of training the model to understand the data, you improve the data to be more understandable to a model? This seems more promising when you are looking at enterprise use cases without much training data. Spider seems like quite a simple dataset compared to the ones I encounter on the job, and LLMs struggle even with those.
Since the issue is often the context, plugging in data dictionaries (and passing those to the LLM) can help
I see in their training set they've got comments about columns too. e.g.
"Country" TEXT NOT NULL, - country where the singer born
But thats still not enough.You also need a bunch of information about the real business that the data is describing. And you also need to analyse the whole database - is that field actually used? What are the common values for this picklist? What does that status actually mean in terms of business? If there are two of those rows that match the join but I want to avoid duplicates, which one do I take? - the newest, or the one with a certain status? etc.
While the article focuses on finetuning GPT-3.5-turbo, how you use the text-to-SQL engine within the architecture of your overall solution is for you to decide. Providing this business context from vectorized context stores in the actual prompt would be a step in the right direction.
GPT is a tool for learners, and they will keep shooting them feet until they learn, just the weapon is different now.
Look at the law system, written in natural language. It is misleading and doesn't work well, so we have to gather thousands of people around courts doing non-deterministic work in order to process them. Natural language is a tool for learning, not making systems. You have to shrink your vocabulary down to code at some point in order to make systems, and you can do it much faster and better than GPT.
The training format is a series of question/answer pairs: i.e, what we might think of as "supervised" learning. It can be challenging to build good data sets for this scenario.
But transformer-based models actually don't require supervised learning: the bulk training of the GPT family is to just throw masses of text at it and have it do next-token prediction.
What's going on with the difference? When I "fine-tune" GPT-3.5 turbo, am I actually training the transformer, or the RLHF model that sits on top? Or both?
More to the point, is there any way to fine-tune OpenAIs models in a n "unsupervised" fashion? I.e, if I want to teach a model SQL, do I need to get a curated data set of question/answer pairs, or I can I just dump in a bunch of schema and SQL queries that will in theory make the model "better at SQL" in a generic way?
What I don’t know is how they’re applying RLHF after user fine-tuning. Are they redoing the PPO with the original reward model after tuning on your input? Are they just letting it slide, and hoping that the fine-tuning doesn’t cause the model to forget the RLHF? It’s unclear from what I’ve read.
What would be more useful IMO is natural language to OData, GraphQL, and OpenAPI/Swagger. Then you could let users do ad-hoc query but only against data they are allowed. I did a PoC using GPT3 to query OData and it was pretty fun, but did occasionally get the query wrong. I also ran into the context window issue. It would get lost when fed larger OData schema.
See the following for more info:
https://yale-lily.github.io/spider https://github.com/taoyds/spider/tree/master/evaluation_exam...