My team created an identical hypothesis to this doc ~2 years ago and built a proof of concept. It was pretty magical: we had Fortune 500 execs asking for reports on internal metrics, and they'd generate in a couple of minutes. The first week we got rave reviews - followed by an immediate round of negative feedback as we realized that ~90% of the reports were deeply wrong.
Why were they wrong? It had nothing to do with the LLMs per se; o3-mini doesn't do much better on our suite than GPT-3.5. The problem was that knowing which data to use for which query was deeply contextual.
Digging into use cases, you'd find that for a particular question you couldn't just grab all the rows from a column; you needed to do some obscure JOIN operation, and that fact was known only by the 2 data scientists in charge of writing the report. This flavor of problem - data being messy, with the messiness documented only in a few people's brains - repeated over and over.
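A toy illustration of that kind of tribal-knowledge join (the table names and the revision rule here are hypothetical, not from the original team's system): a schema where the naive aggregate looks perfectly reasonable but is wrong unless you also join against a corrections table that only the report owners know about.

```python
# Hypothetical example: revenue lives in `orders`, but corrected orders are
# superseded by rows in `order_revisions`. An LLM given only the schema will
# almost always write the naive SUM; the correct report needs the LEFT JOIN.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (id INTEGER PRIMARY KEY, amount REAL);
CREATE TABLE order_revisions (order_id INTEGER, amount REAL);
INSERT INTO orders VALUES (1, 100.0), (2, 250.0), (3, 80.0);
-- order 2 was later corrected down to 25.0
INSERT INTO order_revisions VALUES (2, 25.0);
""")

# What a model typically writes from the schema alone:
naive = conn.execute("SELECT SUM(amount) FROM orders").fetchone()[0]

# What the report actually requires: prefer the revised amount when present.
correct = conn.execute("""
    SELECT SUM(COALESCE(r.amount, o.amount))
    FROM orders o LEFT JOIN order_revisions r ON r.order_id = o.id
""").fetchone()[0]

print(naive, correct)
```

Both queries are syntactically fine and "ontologically sensible", which is exactly why the wrong one survives a surface-level review.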
I still work on AI powered products and I don’t see even a little line of sight on this problem. Everyone’s data is immensely messy and likely to remain so. AI has introduced a number of tools to manage that mess, but so far it appears they’ll need to be exposed via fairly traditional UIs.
I can't get anyone to listen to this point. I'm seeing plans going full steam ahead on deploying AI when they don't even have a good definition of the PROBLEM, much less how to train the AI to do things well and correctly. I was in a 90-minute meeting with some execs who were all high on ChatGPT Operator. One of them said we could replace 80 people at this company RIGHT NOW with the tool. I asked the presenter to type one simple request into the AI; the entire demo went wildly off the rails from then on, and the presenter wasn't even remotely bothered by that. People are either completely taken in by the marketing and believe in it like a religion, or they have solid, sensible concerns about reliability. And the second group is far smaller than the true believers.
The other issue is that the first group are labelled as innovative go-getters, while the second group are labelled as negative crusty curmudgeons and this has an impact on the careers of both groups.
First time? I did AI work years before the current generative AI boom and it was the same then too, managers wanted to stick AI into everything without even knowing what the hell they actually wanted in the end.
I'm currently working as a scientist. I wonder if researchers will be willing to annotate their papers, data, reasoning, and arguments well enough that AI agents can make good use of it all.
If you write your papers in an AI-friendly way, maybe that means more citations? Does this mean switching to new publishing formats? PDFs are certainly limiting.
We've looked at using agents at my current job but most of the time, once the data is properly structured, a more traditional approach is faster and less expensive.
We can't let a LLM loose on a database and expect it to figure out everything.
Sure, you can do some basic filtering (though even there it would fail by making bad assumptions), and getting any correct joins was a crap-shoot. I was including the schema and sample rows from all my tables, and I wrote tens of lines of instructions explaining the logic of the tables, and that still didn't begin to cover all the cases.
Prompt-engineering tons of business logic is a horrible job. It's hard to test, and it feels so "squishy" and unreliable. Even with all of my rules, it would write queries that didn't work and/or broke a rule or concept I had laid out.
In my experience, you're much better off using AI to help you write some queries that you add to the codebase (after tweaking/checking) than you are having AI come up with queries at run time.
This is why I'm building a federated query optimizer: we want to let the LLM reason and formulate queries at the ontological level, with query execution operating behind a layer of abstraction.
My team had these ontologies available to the LLM and provided them in the context window. The queries were ontologically sensible at a surface level, but still wrong.
The problem is that your ontology is rapidly changing in non-obvious and hard-to-document ways, e.g. "this report is only valid if it was generated on a Tuesday or Thursday after 1pm, because that's when the ETL runs; at any other time the data will be incorrect."
>I still work on AI powered products and I don’t see even a little line of sight on this problem. Everyone’s data is immensely messy and likely to remain so.
I've worked in the space as well, and completely unstructured data is better than whatever you call a database with a dozen ad hoc tables, each storing information slightly differently from the others, for reports written by a dozen different people over a decade.
I have a benchmark for agentic systems that measures how many joins between tables the system can do before it goes off the rails. There's nothing off the shelf that passes it, and for whatever reason no one is talking about the problem in the open. But there are companies working to solve it in the background; I've worked with three so far.
Without documentation giving some grounding about what the table is doing, you're left with hoping the database is self documenting enough for the agent to figure out what the column names mean and if joining on them makes sense - good luck doing it on id1, id2, idCustomerLocal, id_customer_foreign though.
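The join-depth benchmark idea above can be sketched as a tiny harness (this is my reconstruction of the concept, not the commenter's actual benchmark; `ask_agent` is a stand-in for whatever text-to-SQL agent is under test):

```python
# Measure how many joins deep an agent can go before its answers degrade.
# Each table t0..tN holds one row pointing at the next table; the gold answer
# requires chaining all N joins.
import sqlite3

def build_chain_db(depth: int) -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    for i in range(depth + 1):
        conn.execute(f"CREATE TABLE t{i} (id INTEGER, next_id INTEGER, val INTEGER)")
        conn.execute(f"INSERT INTO t{i} VALUES (1, 1, {i * 10})")
    return conn

def gold_answer(conn, depth):
    """Correct answer: join the whole chain and read the last table's value."""
    joins = " ".join(
        f"JOIN t{i} ON t{i}.id = t{i-1}.next_id" for i in range(1, depth + 1)
    )
    return conn.execute(f"SELECT t{depth}.val FROM t0 {joins}").fetchone()[0]

def score(ask_agent, max_depth=5):
    """Return the deepest join chain the agent still answers correctly."""
    deepest = 0
    for d in range(1, max_depth + 1):
        conn = build_chain_db(d)
        if ask_agent(conn, d) == gold_answer(conn, d):
            deepest = d
        else:
            break
    return deepest
```

A real benchmark would use ambiguous column names (`id1`, `id_customer_foreign`, ...) rather than a clean chain, since that's exactly what trips agents up.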
My favorite example was a report that was only accurate if generated on a Tuesday or Thursday due to when the ETL pipeline ran. A small config change on the opposite side of a code base completely altered the semantics of the data!
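One way to stop that class of failure is to turn the tribal knowledge into an explicit guard. This is a sketch assuming the Tuesday/Thursday-after-1pm schedule from the anecdote; a real pipeline should expose a last-successful-ETL-run timestamp rather than inferring freshness from the clock.

```python
# Encode the "only valid after the Tue/Thu 1pm ETL" rule as a hard check
# instead of something that lives in two people's heads.
from datetime import datetime

VALID_WEEKDAYS = {1, 3}  # Monday=0, so Tuesday=1 and Thursday=3
ETL_DONE_HOUR = 13       # data is fresh only after the 1pm ETL run

def report_data_is_fresh(now: datetime) -> bool:
    return now.weekday() in VALID_WEEKDAYS and now.hour >= ETL_DONE_HOUR

def generate_report(now: datetime) -> str:
    if not report_data_is_fresh(now):
        raise RuntimeError("refusing to build report: ETL has not run yet")
    return "report built from fresh data"
```

The point isn't this particular rule; it's that a guard fails loudly when the config change on the far side of the codebase breaks the assumption, instead of silently producing a wrong report.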
This reminds me of one of the key plot points in "The Sparrow" by Mary Doria Russell. Small spoiler ahead so if you haven't read it and want to be surprised, stop reading.
...
...
Basically, one of the characters works as an AI implementer, replacing humans in their jobs by learning deeply about how they do their work and coding up an AI replacement. She runs across a SETI researcher and works on replacing him, but he has a human intuition when matching signals that she would never have discovered because it was so random.
Great book if you haven't read it.
In reality, as always, I suspect the truth will be somewhere in between. SaaS products that succeed will be those that have a good UI _and_ a good API that LLMs can use.
An LLM is not always the best interface, particularly for data access. For most people, clicking a few times in the right places is preferable to having to type out (or even speak aloud) "Show me all the calls I did today", waiting for the result, having to follow up with "include the time per call and the expected deal value", etc etc.
There is undoubtedly an opportunity for disruption here, but I think an LLM only SaaS platform is going to be a very tough sell for at least the next decade.
I agree that the amount of bespoke UI that needs to exist probably won't stagnate. Humans need about the same amount of visual information to verify a task was done correctly as they need to do the task.
LLM generated UI is an interesting field. Sure, you can get ChatGPT to generate schema to lay out some buttons. But it seems harder to identify the context and relevant information that must be displayed for the human to be a valuable/necessary asset in the process.
As an industry, we have already been through a textual user interface: the terminal, and we moved away from it.
And voice UIs are not new either: we've had voice assistants for quite some time now, and they didn't see the success Apple, Google, or Amazon were expecting (it recently came out that most Echo usage was just setting timers).
How do LLM SaaS replacements solve that?
> The underlying SaaS platform is reduced to a “database” or “utility” that an agent can switch out if needed.
I agree that UI isn’t going away completely. Language is a slow and imprecise tool. A well developed UI can be much more efficient. I think it will be much more like the Star Trek universe, where we use a blend of the two.
In any case, if the AI agent can generate UI on the fly, it seems their point still stands?
It never panned out, arguably because the technology wasn't quite there yet (this was well before ChatGPT came out), but I thought the bigger problem was that people thought that a chat UI was the ultimate user interface. Just didn't feel right to me. For simple tasks, sure, but otherwise it felt like for "exploratory" tasks it made more sense to have a graphical user interface of some kind.
Same sentiments apply to the hype around agents. Even in a hypothetical world where agents work as well as any human I don't think an agent/chatbot UI is necessarily the ultimate user interface. If I'm asking an agent questions, it makes sense for it to show rather than tell in many contexts. Even in a world where agents capture much of the way we interact with computers, it might make more sense for them to show us using 3rd party SaaS apps.
This writeup seems to be authored by a senior designer at Salesforce, and I can see the motivation from their perspective. Their challenges are different from what a new SaaS product will encounter.
Like all the incumbents of their time, they are a core-ish database that depends on a plethora of point solutions from vendors and partners to fill the gaps their product leaves in constructing workflows. If they don't take an approach like the one being discussed here – or in the linked OpenAI/SoftBank video – they risk alienating their vendors/partners, or worse, seeing them become competitors in their own right.
Disclaimer – I'm biased too, I'm building one of the upstarts that aims to compete with Salesforce.
You Will.
Except this time with full admin access to everything.
Many SaaS products (especially the complex ones, which are also the most important ones) have a tonne of UI, often imposing a huge amount of non-work work onto users - all the clicking you have to do as part of entering or retrieving data, especially if the UI flow doesn't fit exactly what you're trying to do at that moment. An example might be quickly creating an epic and a bunch of related tickets in Jira, and having them all share some common components.
A generative UI would be able to construct a custom UI for the particular thing the user is trying to do at any point in time. I think it's a really powerful idea, and it could probably be done today by smartly using eg Jira's APIs.
The ability to span applications would be even more powerful. Done well it might even kill the need to maintain complex integrations between related Saas (eg how some product development application might need to sync data to/from Jira or ADO) by having the AI just keep track of changes and move them from one system to another.
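Sticking with the Jira example above: a generative UI layer would ultimately bottom out in calls like Jira Cloud's `POST /rest/api/2/issue` endpoint. Here's a sketch of just the payload-building half (field shapes vary per Jira instance, and the issue-type names here are assumptions, so treat this as illustrative):

```python
# Build Jira-style issue payloads for "create an epic plus related tickets
# sharing common fields". Actually creating them means POSTing each payload
# to /rest/api/2/issue with authentication, which is omitted here.

def build_issue_payload(project_key, summary, issue_type, extra_fields=None):
    fields = {
        "project": {"key": project_key},
        "summary": summary,
        "issuetype": {"name": issue_type},
    }
    fields.update(extra_fields or {})
    return {"fields": fields}

def build_epic_with_tickets(project_key, epic_summary, ticket_summaries,
                            shared_fields=None):
    """One epic payload plus one payload per ticket, all sharing common fields."""
    epic = build_issue_payload(project_key, epic_summary, "Epic", shared_fields)
    tickets = [
        build_issue_payload(project_key, s, "Task", shared_fields)
        for s in ticket_summaries
    ]
    return epic, tickets
```

The generative-UI part is then "just" rendering a form over these payloads for the user to confirm before anything is sent, rather than making the user click through Jira's own flow N times.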
Once it gets to the point where the generative UI is the go-to system for interactions, you have to wonder what all the designers and UI builders at the myriad SaaS companies will be doing...
Pro: The GUI dynamically adjusts.
Con: There's no consistent mental model for you to learn, when you need to use something it's not there, and the stuff which is there might not do what you expect.
Who's going to bet millions of dollars that these agents are going to get it right? Based on what evidence?
I'm sure some of those adjustments are reasonable, but I'm also sure this gets used to create a stack of lies to please upper management.
There's some obvious issues with some sort of AI in such an environment. Do you train the AI to tell the right sorts of lies?
You can have Agents run behaviors async by attaching triggers to them, for example when you get a specific email or something gets updated in a CRM. You can also give the agent access to basically any third-party action you can think of.
Like others in this thread have pointed out, there's a nice middle-ground here between an LLM-only interface and some nice UI around it, as well as ways to introduce determinism where it makes sense.
The product is still in its early days and we're iterating rapidly, but feel free to check it out and give us some feedback. There's a decent free plan.
Agents have a pre-iPhone feel to them (when everyone was making phones with keyboards). What do you think the ultimate agent looks like?
I definitely agree that agents are very early days - as for the "ultimate agent" it feels like everything is moving towards them being a sort of co-worker, if that makes sense. I think handling human-in-the-loop scenarios nicely is going to be vital to an agent actually being useful. i.e. you clock in in the morning and check out all the stuff your agent did overnight and can approve/change/reject tasks. There's a ton of healthy competition in the space, so in a year or two we'll all have a much better idea of where the tech is going.
I wouldn't call it a moat, but it's definitely a giant head start.
There's a reason we're still using apps instead of talking to Siri…for a huge number of tasks, visual UIs are so much more efficient than long-form text.
It's gonna be: reusable saas components + ai orchestrator + specialized UI
On a related note, there's probably gonna be an extinction level event in the software industry as there's no software moat anymore.
When every application, every feature, every function can be replicated/reproduced by another company in a matter of minutes / hours using AI tools, you don't have a moat anymore.
Why will businesses trust a black box that claims to make good decisions (most of the time) when they have existing human relationships they have vetted, measured, and know the ongoing costs and benefits of?
If the reason is humans are expensive, I have news for you. We've had robotics for around 100 years and the humans are still much cheaper than the robots. Adding a bunch of graphics cards and power plants to the mix doesn't seem to change that equation in a positive direction.
So let me get this straight: we're going to train AI models to perform some kind of screen recognition (so they can ascertain layout and detect the "important" UI elements), additionally ask that AI to OCR all the text on the screen so it has some hope of following natural-language instructions (OCR being a task which, as an HN thread a day or two ago pointed out, AI is exceedingly bad at), and then we're going to tell this non-deterministic prediction engine what we want done with our software, and it's just going to do it?
Like Homer Simpson's button pressing birdie toy? :smackshead:
Why do I have reservations about letting a non-deterministic AI agent run my software?
Why not expose hooks in some common format for our software to perform common tasks? We could call it an "application programming interface". We might even insist on some kind of common data interchange format. I hear all the cool people are into EBCDIC nowadays.
Then we could build a robust and deterministic tool to automate our workflows. It could even pass structured data between unrelated applications in a secure manner. Then we could be sure that the AI Agent will hit the "save the world" button instead of the "kill all humans" button 100% of the time.
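Taking the sarcasm above at face value for a moment, the deterministic alternative is trivially small: expose named, typed actions that an automation layer invokes by name rather than by guessing at pixels. The action names and handlers here are hypothetical, just to show the shape:

```python
# A minimal action registry: the "application programming interface" the
# comment is asking for. Unknown action names fail loudly instead of the
# automation layer guessing which button it found on screen.

REGISTRY = {}

def action(name):
    """Register a handler under a stable, documented name."""
    def wrap(fn):
        REGISTRY[name] = fn
        return fn
    return wrap

@action("save_the_world")
def save_the_world():
    return "world saved"

@action("generate_report")
def generate_report(quarter: str):
    return f"report for {quarter}"

def invoke(name, **kwargs):
    if name not in REGISTRY:
        raise KeyError(f"no such action: {name}")
    return REGISTRY[name](**kwargs)
```

With this, "hit the save-the-world button" is a deterministic call 100% of the time, and structured data passes between applications as arguments and return values rather than as OCR'd screenshots.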
On a serious note, we should study various macro recording implementations, to at least have a baseline of what people have been successfully doing for 40+ odd years to automate their workflows, and then come up with an idea that doesn't involve investing in a new computer, gpu, and slowly boiling the oceans.
This reeks of a solution in search of a problem. And the solution has the added benefit of being inefficient and unreliable. But, people don't get billion dollar valuations for macro recorders.
Is this what they meant by "worse is better"?
Edit: and for the love of FSM, please do not expose any new automation APIs to the network.
The scariest part is, as this advances, the level of disasters we're likely to see will at best be bankrupt corporations, and at worst, people being hurt/killed (depending on how carelessly these tools are integrated into mission critical systems).
We can call it OpenAI.
I'll see myself out.
This comparison is especially apt, given that one of the main use-cases for LLMs is the same kind of... well, fraud: To give the illusion that you did the work of understanding or reviewing something, but actually just (smart-)phoning it in.
In one Apple iPhone advertisement, a famous actor is asked by their agent what they think of a script. They haven't read it, so they ask the LLM assistant to sum it up in a couple of sentences... and then they tell their agent it sounds good.
The reality is that most applications and websites don't expose enough context about what you're actually doing for AIs to be able to meaningfully infer from natural language the steps required to complete a given task.
We humans are very good at filling in the blanks based on if we’re working in Photoshop or VS Code or Excel. We infer a lot of context from the specific files we’re working on or the particular client or even the files’ organization within the file system, or even what month or day it is.
I am skeptical that models will be able to replicate a complex workflow when there’s very little in the way of labels and UI controls even visible.
I know a weekly spreadsheet from a monthly and quarterly, etc. I know the minutiae about which options to use to generate the specific source reports, etc.
Workflows can be quite complex, no matter your role.
I mean, I can just see it now: gift receipts sent to recipients before their birthdays, internal draft proposals prematurely sent to clients, clients mixed up or their data commingled, data overwritten or lost; this whole thing just screams disaster. And I'm not even thinking about people involved with safety, or finance, or legal/regulatory, or medical. Law enforcement?
This kind of thing can be done properly with well defined interfaces, common standards, and reasonable and prudent guardrails.
But it won’t be. It’ll be YOLOed on a paper thin training budget and it’ll be like your own little personal chaos monkey on ketamine.
AI is amazing at OCR? We've had Tesseract OCR for ~40 years, and if you read the fine manual it has essentially a 0% per-character error rate on clean input.
OCR on VLMs is terrible.
For some reason consistent x-heights between 10 to 30 pixels with guaranteed mono-column layout is not something venture capitalists get excited about, and as a result I'm not the founder of a unicorn.
That being said, I thought the purpose of OCR was to take text from a non-digital source and make it digital.
Why should we have to OCR something that exists already in a perfectly interchangeable digital format already?
Autonomy is just more sexy, but in my opinion, it’s a poor design direction for a lot of applications.
Especially since most attempts will have a "under no circumstances should you voluntarily involve a human" in the prompt.
2 bags of peanuts says the actual product isn't an OS and barely passes as AI.
I fundamentally believe that human-oriented web apps are not the answer, and neither is REST. We need something purpose-built.
The challenge is, it has to be SIMPLE enough for people to easily implement in one day. And it needs to be open source to avoid the obvious problems with it being a for-profit enterprise.
There are existing solutions, but everything is its own special snowflake. OAuth is a lie; SSO sometimes works. But SSO doesn't differentiate between my employee and their broken script.