Building an email-to-calendar LLM

(ngacho.com)

128 points · bookmark99 · 2y ago · 43 comments

43 comments

robertclaus · 2y ago

"Setting up LLMs to output structured data is incredibly hard." resonated strongly with my experience working on similar one-off projects. I've almost always implemented some level of fuzzy-matching to validate and convert the LLM output back into my expected structured format.

I've also noticed that the LLMs are much better at writing code than structured JSON (no real surprise given the popularity of code assistants). If it makes sense in the specific situation, I now have the LLM generate code and parse it into the right structure rather than requesting structured data directly:

`generate_event("I need to do X", new Date("1-1-2025"))` seems to be more reliable to generate than `{ "description": "I need to do X", "when": "1-1-2025" }`
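The fuzzy-matching step mentioned above can be quite small. A minimal Python sketch, assuming the expected keys are `description` and `when` (borrowed from the example above); the helper name `normalize_keys` and the 0.6 cutoff are hypothetical choices for illustration, not something from the article:

```python
import difflib

# Hypothetical schema: the keys we expect the model to produce.
EXPECTED_KEYS = ["description", "when"]


def normalize_keys(raw: dict, expected=EXPECTED_KEYS, cutoff: float = 0.6) -> dict:
    """Map near-miss keys from the model onto the expected keys.

    Keys with no sufficiently close match are dropped.
    """
    fixed = {}
    for key, value in raw.items():
        # get_close_matches returns the best candidates at or above the cutoff.
        match = difflib.get_close_matches(key.lower(), expected, n=1, cutoff=cutoff)
        if match:
            fixed[match[0]] = value
    return fixed


cleaned = normalize_keys({"Description": "I need to do X", "whenn": "1-1-2025", "junk": 1})
```

This only repairs key names; values such as dates would still need their own validation.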

BoorishBears · 1y ago

I really don't get what people are doing wrong here.

I have a 7,000-token prompt that generates JSON chugging away in production, and at scale I'm seeing ~1 in 4,000 generations require a re-generation; even that could probably be killed with some basic "healing" code.

OSS models are prone to outputting garbage in my experience, but OP mentions ChatGPT:

How are you running into issues if you simply prefill the response with ```json and set ``` as your stop token?

Also, are people just not trying to find the opening and closing brackets, and treating the output as broken if there's a preamble? The prefill gets rid of the preamble, but if you're not willing/able to prefill, how hard is getting JSON out of a string?
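As a rough illustration of "getting JSON out of a string", here is a minimal Python sketch. It assumes the reply contains exactly one JSON object and that any preamble or code-fence markers around it contain no braces. `extract_json`, `generate_with_retry`, and the `generate` callable are hypothetical names; `generate` stands in for whatever function calls your model.

```python
import json


def extract_json(text: str) -> dict:
    """Parse the span from the first '{' to the last '}' in a model reply."""
    start = text.find("{")
    end = text.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object found in model output")
    # json.JSONDecodeError is a subclass of ValueError.
    return json.loads(text[start:end + 1])


def generate_with_retry(generate, prompt: str, attempts: int = 3) -> dict:
    """Call the (hypothetical) generate(prompt) -> str until the reply parses."""
    for _ in range(attempts):
        try:
            return extract_json(generate(prompt))
        except ValueError:
            continue  # re-generate on missing or malformed JSON
    raise RuntimeError(f"no valid JSON after {attempts} attempts")
```

A reply with braces in the preamble would defeat this; a streaming parser or a prefill, as described above, is more robust.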

refulgentis · 2y ago

If you're doing it locally, it's likely got llama.cpp underneath it somewhere. Ask the dev to allow specifying a JSON schema via its grammar feature.

el_nahual · 2y ago

As long as you can sanitize the LLM output somehow. You should never `eval` LLM code straight from the tap!

dns_snek · 1y ago

You shouldn't sanitize. If you're taking the approach described above, you should run it inside a minimal interpreter that doesn't implement any potentially dangerous APIs.
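One way to read "minimal interpreter" is a parser that accepts a single whitelisted call with literal arguments and nothing else. The sketch below is Python and makes several assumptions: the model is asked to emit a Python-style call with an ISO date string (rather than the JavaScript `new Date(...)` form shown earlier in the thread), and `run_call`, `ALLOWED`, and this `generate_event` body are hypothetical.

```python
import ast
from datetime import date


def generate_event(description: str, when: str) -> dict:
    """Hypothetical handler: build an event from a description and ISO date."""
    return {"description": description, "when": date.fromisoformat(when)}


# The only functions the model's output is allowed to invoke.
ALLOWED = {"generate_event": generate_event}


def run_call(source: str):
    """Interpret one whitelisted call with literal arguments; never eval()."""
    tree = ast.parse(source.strip(), mode="eval")  # may raise SyntaxError
    call = tree.body
    if not (isinstance(call, ast.Call) and isinstance(call.func, ast.Name)):
        raise ValueError("expected a single plain function call")
    if call.func.id not in ALLOWED:
        raise ValueError(f"function not allowed: {call.func.id}")
    if call.keywords:
        raise ValueError("keyword arguments are not supported")
    args = []
    for node in call.args:
        # Only string and number literals pass; names, calls, etc. are rejected.
        if not (isinstance(node, ast.Constant) and isinstance(node.value, (str, int, float))):
            raise ValueError("only literal arguments are allowed")
        args.append(node.value)
    return ALLOWED[call.func.id](*args)


event = run_call('generate_event("I need to do X", "2025-01-01")')
```

Because the source is only parsed, never executed, something like `__import__("os").system("ls")` is rejected at the "plain function call" check.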

Mathnerd314 · 2y ago

I think it's the training data: there is not a lot of JSON in it. It's much easier to get it to generate list-style data, like "foo:\n* prop1 - val1\n* prop2 - val2", or similar formats, as the models seem to have seen a lot of that sort of data.
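Parsing that list-style format back into a structure takes only a few lines. A minimal Python sketch, assuming every property sits on its own `* key - value` line exactly as in the example above; `parse_bullets` is a hypothetical helper name:

```python
def parse_bullets(text: str) -> dict:
    """Turn lines of the form '* key - value' into a dict; skip everything else."""
    result = {}
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("* "):
            continue  # heading lines such as 'foo:' are ignored
        # Split on the first ' - ' so values may themselves contain dashes.
        key, sep, value = line[2:].partition(" - ")
        if sep:
            result[key.strip()] = value.strip()
    return result


props = parse_bullets("foo:\n* prop1 - val1\n* prop2 - val2")
```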

bboygravity · 2y ago

You're aware ChatGPT4 has a JSON-only mode?

nosefurhairdo · 2y ago

Some ideas in this forum thread on the same topic: https://genai.stackexchange.com/questions/202/how-to-generat...

Zetobal · 1y ago

OpenAI, Claude, Mistral Large, and all models that you can run with Ollama have JSON-only modes!?

jakecodes · 2y ago

Hey! It's awesome to read other people's solutions to this.

I've been working on solving this for the past 2 years or so, and I went through many of the same struggles in the beginning until we came up with a fairly complex solution for getting LLMs to output data in a way we can use.

The big problem is that 95% accuracy is not good enough for calendars. People lose confidence after 1 failed attempt. Trying to get LLMs to output JSON can produce invalid JSON about 1 time in 1,000, which is unrecoverable. What I wound up doing is training models for the tasks with tremendous amounts of data. I did not use OpenAI's models as they were not right for the job. Would love feedback.

convoke.ai

ljlolel · 2y ago

You can just force it to output according to a specified BNF grammar; this is quite easy:

https://www.imaurer.com/llama-cpp-grammars/

mofosyne · 2y ago

Due to the vagueness of human language, could we also output a degree of confidence in the translation?

politelemon · 2y ago

I'm looking at the Apps Script site.

Is this like Power Automate, but for G Suite products?

Also, I don't know why, but looking at the site makes me think it's a candidate for Google to kill off without warning: https://www.google.com/script/start/

simonw · 2y ago

Apps Script is positively ancient at this point, first released in 2009: https://en.wikipedia.org/wiki/Google_Apps_Script

It's one of the Google products I worry least about, mainly because there are 15+ years of existing Google Sheets documents that people have built using it at this point. I don't think even Google would lightly break THAT many of their existing (often paid) users.

htrp · 2y ago

> I don't think even Google would lightly break THAT many of their existing (often paid) users.

Says everyone using GCP services that get deprecated.

tmpz22 · 2y ago

It stands out that it doesn't follow the Design System of many other Google products - wonder if that means it has its own fiefdom.

dudus · 1y ago

It's definitely quite popular inside Google itself, where all sorts of small and one-off systems are built and deployed entirely in Apps Script.

And it has had significant updates as well. The newer runtime is based on V8, using some clever way to isolate the code. Previously they used Mozilla Rhino as the runtime because it was easier to sandbox, but it was also very frustrating to work with.

Now the DX is much better, with a more modern and performant runtime and a better UI.

If I sound like a sucker, it's because I am, and building with Apps Script is fun even with its limitations.

Power Platform, in comparison, might be 100x more powerful for all I care, and I still wouldn't touch it with a 10 ft pole. Most frustrating experience I've ever had.

denton-scratch · 2y ago

Just use CALDAV; it's designed for making calendar entries automatically via email. I'm not hip with the fashion for putting an LLM into everything. I think it's lazy.

IanCal · 2y ago

That's a great solution to a different problem. Unless caldav has a process for extracting dates and actions from unstructured emails? But that doesn't seem related to caldav.

denton-scratch · 1y ago

Point taken; you do need a CALDAV client to source or consume CALDAV messages, which are indeed very much structured.

darby_eight · 2y ago

IDK about "lazy", but it's certainly an extremely expensive solution.

dugite-code · 2y ago

Isn't the expensive part of LLMs the training? My understanding is that once they are trained, they can often be optimized to run quite cheaply. Not as cheaply as a well-designed program, but cheaply enough that it shouldn't be too prohibitive to run.

refulgentis · 2y ago

A 3B model runs on Android phones from 2 years ago at 6 tokens/s.

swsieber · 2y ago

If I'm generating the data, sure, I'll use CALDAV.

But for ingesting unstructured data I didn't generate, I'm going to reach for an LLM now (provided latency isn't an issue).

iot_devs · 2y ago

I am working on GabrielAI, which is a tool to filter email and auto-draft replies for Gmail and Outlook, and of course it uses an LLM under the hood.

https://getgabrielai.com

In cases like this the author could just set up a filter like:

"If the email contains a task for me" (or some variation)

Then add a Gmail label to it.

This way the author will immediately find all of his actionable emails in a specific folder, which is much faster to skim and makes it easier to keep track of all of them.

Another option would be to have GabrielAI generate a draft like "reply acknowledging the task and put a to-do date in the email in 1 week"

This would allow Google to track the email and the deadline.

bookmark99 (OP) · 2y ago

Author here. This definitely sounds all good. I'll be trying out your app. Funnily enough, when we were building this, a friend pitched the idea you're pursuing with GabrielAI (seems fantastic), so we will be signing up for the beta.

A bit classless, but do you mind if I reach out to you about your experience building GabrielAI?

iot_devs · 2y ago

I definitely don't mind!

Feel free to use the email address on the GabrielAI website!

I'd be happy to chat about it!

maliker · 2y ago

I used to use calendar integrations for these kinds of things. Then I realized I'd prefer to have the low-priority stuff disappear until I have to deal with it, so I switched to followupthen.com and have been happy with it. A nice effect is that it creates a paper trail, so I know how many times I've put things off.

rockwotj · 2y ago

You can also do this with https://shortwave.com (and get iOS/Android/Web)

https://twitter.com/Shortwave/status/1760723475923390598

bookmark99 (OP) · 2y ago

My friend and I built a Gmail add-on to easily parse tasks from an email and add them to your calendar.

phillipcarter · 1y ago

The fact that GPT could reliably produce the right JSON structure but an open model couldn't is fascinating to me. It's impressive how far ahead OpenAI is.

kkzz99 · 1y ago

How is this fascinating? One is a 175-1000+B parameter model; the other is a 3-70B parameter model.

phillipcarter · 1y ago

I’m allowed to find it fascinating, that’s why.

jerrygenser · 2y ago

I have a probably janky habit of creating Slack reminders and then rolling them over by some mix of 3 hours, next day, or next Monday.

ac50hz · 1y ago

I simplify my prioritization strategies to use only 2 priorities: High (do it now) and Low (do it later). There should only be 1 high-priority item at any time, and when it's completed, the (next) low-priority item becomes the single high-priority item.

Maintaining multi-level priorities requires more decisions to evaluate relative priorities of different tasks and possible priority re-evaluation when new tasks arrive. Throw some colleagues, friends or others into the mix and agreements on the decisions become more distant.

Within the low-priority list these are sorted on the date and time required. If you then choose to ignore the priorities or sorting, the deviation will take you down the priority re-evaluation rabbit hole again. It's then your choice to follow the process, or not. Avoiding adding complexity to task scheduling and processes ensures I have focus.

Of course, this will not be for everyone. Good luck with that LLM!
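The two-priority scheme described above maps neatly onto a small data structure: one "high" slot plus a low-priority queue ordered by due date. A minimal Python sketch, where the class and method names are hypothetical and a heap is just one possible choice for keeping the low-priority list sorted:

```python
import heapq
from datetime import datetime


class TwoPriorityList:
    """One high-priority task at a time; everything else waits, sorted by due date."""

    def __init__(self):
        self.high = None   # the single "do it now" task
        self._low = []     # heap of (due, insertion order, task)
        self._seq = 0      # tie-breaker so equal due dates keep insertion order

    def add(self, task: str, due: datetime) -> None:
        heapq.heappush(self._low, (due, self._seq, task))
        self._seq += 1
        if self.high is None:
            self._promote()

    def complete(self):
        """Finish the current high-priority task and promote the next one."""
        done = self.high
        self.high = None
        self._promote()
        return done

    def _promote(self) -> None:
        if self._low:
            self.high = heapq.heappop(self._low)[2]


tasks = TwoPriorityList()
tasks.add("write report", datetime(2025, 1, 3))
tasks.add("call bank", datetime(2025, 1, 2))
tasks.add("book flights", datetime(2025, 1, 1))
```

Note that a newly added task never displaces the current high-priority item, which matches the "no re-evaluation" rule in the comment.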

ekianjo · 2y ago

Structured output from an LLM can be achieved by using the Python Outlines library.
