This is great - we capture all logs for each run, including any retries, so you can see both errors and successes. We have all of these other metrics internally, but still need to expose them to our users!
Observability is key for background work, even more so because it's not always tied to a specific user action, so you need a trail to understand issues.
> One thing that is easy to overlook is giving users the ability to define a specific “urgency” for their jobs which would allow for different alerting thresholds on things like running time or waiting.
We are adding prioritization for functions soon, so this is helpful for thinking about how to handle telemetry for jobs with different priority or urgency.
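To make the urgency idea concrete, here's a rough sketch of per-urgency alerting thresholds. This is not Inngest's API: `Urgency`, `Thresholds`, `alerts_for`, and all of the numbers are made up for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class Urgency(Enum):
    # Hypothetical urgency levels a user could attach to a job.
    LOW = "low"
    NORMAL = "normal"
    HIGH = "high"


@dataclass(frozen=True)
class Thresholds:
    max_wait_s: float  # longest acceptable time spent queued
    max_run_s: float   # longest acceptable execution time


# Illustrative values only: more urgent jobs alert sooner.
THRESHOLDS = {
    Urgency.HIGH: Thresholds(max_wait_s=5, max_run_s=30),
    Urgency.NORMAL: Thresholds(max_wait_s=60, max_run_s=300),
    Urgency.LOW: Thresholds(max_wait_s=900, max_run_s=3600),
}


def alerts_for(urgency: Urgency, waited_s: float, ran_s: float) -> list[str]:
    """Return which thresholds a job exceeded for its urgency level."""
    limits = THRESHOLDS[urgency]
    alerts = []
    if waited_s > limits.max_wait_s:
        alerts.append("waiting")
    if ran_s > limits.max_run_s:
        alerts.append("running")
    return alerts
```

The same 10 seconds in the queue would trigger a "waiting" alert for a high-urgency job and nothing for a low-urgency one, which is the whole point of keying the thresholds on urgency.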
re: timeouts - managing timeouts usually means managing dead-letter queues (DLQs). Our goal is to remove the need to think about DLQs at all, and instead build metrics and smarter retry/replay logic right into the Inngest platform.