There’s a demo video at https://www.youtube.com/watch?v=ib3mRh2tnSo and a sandbox to try out (no sign-in required!) at https://demo.runtrellis.com/. One historical archive of unstructured data we thought would be interesting to run Trellis on is the old Enron emails, which famously took months to review. We’ve created a showcase demo here: https://demo.runtrellis.com/showcase/enron-email-analysis, with some documentation here: https://docs.runtrellis.com/docs/example-email-analytics.
Why we built this: At the Stanford AI lab where we met, we collaborated with many F500 data teams (including Amazon, Meta, and Standard Chartered), and repeatedly saw the same problem: 80% of enterprise data is unstructured, and traditional platforms can’t handle it. For example, a major commercial bank I work with couldn’t improve credit risk models because critical data was stuck in PDFs and emails.
We realized that our research from the AI lab could be turned into a solution with an abstraction layer that works as well for financial underwriting as it does for analysis of call center transcripts: an AI-powered ETL that takes in any unstructured data source and turns it into a schematically correct table.
Some interesting technical challenges we had to tackle along the way: (1) Supporting complex documents out of the box: We use LLM-based map-reduce to handle long documents and vision models for table and layout extraction. (2) Model Routing: We select the best model for each transformation to optimize cost and speed. For instance, in data extraction tasks, we could leverage simpler fine-tuned models that are specialized in returning structured JSONs of financial tables. (3) Data Validation and Schema Guarantees: We ensure accuracy with reference links and anomaly detection.
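To make the map-reduce idea in (1) concrete, here is a minimal sketch and not Trellis's actual implementation. It splits a long document into chunks, runs an extractor on each chunk (the map step), and merges the per-chunk results into one record (the reduce step). The names `split_into_chunks`, `map_reduce_extract`, and `fake_extract` are hypothetical, and `fake_extract` is a plain-Python stand-in for an LLM call.

```python
from typing import Callable


def split_into_chunks(text: str, max_chars: int) -> list[str]:
    """Group paragraphs into chunks of roughly max_chars characters."""
    chunks: list[str] = []
    current = ""
    for para in text.split("\n\n"):
        if current and len(current) + len(para) + 2 > max_chars:
            chunks.append(current)
            current = para
        else:
            current = f"{current}\n\n{para}" if current else para
    if current:
        chunks.append(current)
    return chunks


def map_reduce_extract(
    text: str,
    extract: Callable[[str], dict[str, str]],
    max_chars: int = 2000,
) -> dict[str, str]:
    """Map: extract fields per chunk. Reduce: keep the first non-empty value per field."""
    merged: dict[str, str] = {}
    for chunk in split_into_chunks(text, max_chars):
        for field, value in extract(chunk).items():
            if value and field not in merged:
                merged[field] = value
    return merged


def fake_extract(chunk: str) -> dict[str, str]:
    """Hypothetical stand-in for an LLM call: reads 'Key: value' lines."""
    out: dict[str, str] = {}
    for line in chunk.splitlines():
        if ":" in line:
            key, _, value = line.partition(":")
            out[key.strip().lower()] = value.strip()
    return out


doc = "Issuer: Acme Corp\n\nfiller paragraph\n\nCoupon: 5.25%\n\nIssuer: Other"
result = map_reduce_extract(doc, fake_extract, max_chars=30)
```

The "first non-empty value wins" reduce rule is only one possible choice. A production system would more likely reconcile conflicting values across chunks or flag them for review.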
After launching Trellis, we’ve seen diverse use cases, especially in legacy industries where PDFs are treated as APIs. For example, financial services companies need to process complex documents like bonds and credit ratings into a structured format, and need to speed up underwriting and enable pass-through loan processing. Customer support and back-office operations need to accelerate onboarding by mapping documents across different schema and ERP systems, and ensure support agents follow SOPs (security questions, compliance disclosures, etc.). And many companies today want data preprocessing in ETL pipelines and data ingestion for RAG.
We’d love your feedback! Try it out at https://demo.runtrellis.com/. To save and track your large data transformations, you can visit our dashboard and create an account at https://dashboard.runtrellis.com/. If you’re interested in integrating with our APIs, our quick start docs are here: https://docs.runtrellis.com/docs/getting-started. If you have any specific use cases in mind, we’d be happy to do a custom integration and onboarding—anything for HN. :)
Excited to hear about your experience wrangling with unstructured data in the past, workflows you want to automate, and what data integration you would like to see.
Great use case! Worked on exactly this a decade ago. It was Hard™ then. Could only make so much progress. Getting this right is a huge value unlock. Congrats!
Even though it's 2024, banks and financial institutions like insurance companies tend to be _very_ cautious with valuable documents involving customers. There are also regional regulations that prevent things like patient data being shared with _any_ 3rd parties. Even one of the big 4 oil companies that I've dealt with as a prospective customer had very strict rules requiring on-premise solutions.
The good news is many are using things like Kubernetes and OpenShift internally, so it should be possible to port what you do on AWS to on-premise.
Better still if it can then become a source of truth for further departures from reality.
I used OpenAI's function calling (via Langchain's https://python.langchain.com/v0.1/docs/modules/model_io/chat... API).
Some of the challenges I had:
1. poor recall for some fields, even with a wide variety of input document formats
2. needing to experiment with the json schema (particularly field descriptions) to get the best info out, and ignore superfluous information
3. for each long document, deciding whether to send the whole document in the context, or only the most relevant chunks (using traditional text search and semantic vector search)
4. poor quality OCR
From the demo video, it seems like your main innovation is allowing a non-technical user to do #2 in an iterative fashion. Have I understood correctly?
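For challenge #3, here is a toy sketch of the "only the most relevant chunks" option. It scores chunks by word overlap with a query, which is a deliberately simple stand-in for the traditional text search and semantic vector search mentioned above. The function names and sample chunks are made up for illustration.

```python
import re


def tokenize(text: str) -> set[str]:
    """Lowercase word/number tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def top_chunks(chunks: list[str], query: str, k: int = 2) -> list[str]:
    """Return up to k chunks sharing the most tokens with the query, in document order."""
    q = tokenize(query)
    scored = [(len(q & tokenize(chunk)), i) for i, chunk in enumerate(chunks)]
    # Highest overlap first; ties broken by position in the document.
    best = sorted(scored, key=lambda s: (-s[0], s[1]))[:k]
    # Drop chunks with no overlap and restore document order.
    return [chunks[i] for score, i in sorted(best, key=lambda s: s[1]) if score > 0]


chunks = [
    "The weather was mild.",
    "Total invoice amount due: 1,200 USD",
    "Invoice number 42 issued in May",
    "Thanks for your business",
]
selected = top_chunks(chunks, "invoice amount due", k=2)
```

Whether this beats sending the whole document depends on the context window and on how often the relevant field sits in a chunk that scores poorly, which is one plausible source of the recall problem in #1.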
OOC which openai model were you using? Would recommend trying 4o as well as Anthropic claude 3.5 sonnet if ya haven't played around with those yet
I was using gpt-3.5-turbo-0125. It was before the recent pricing change.
But I have a bunch of updates to make to the json schema, so will re-run everything with gpt-4o-mini.
Sonnet seems a lot more expensive, but I'll 'upgrade' if the schema changes don't get sufficiently good results.
I'm curious about what types of source documents you tried, and whether you ever suffer from hallucinations?
You guys came out of an academic lab, so you must know that hypothesis fishing expeditions are not viable.
> ... a major commercial bank... couldn’t improve credit risk models because critical data was stuck in PDFs and emails.
In this example there will be no improvement to the risk model or whatever, because 19/20 times there will be no improvement. In an academic setting this is seen as normal, but in a business setting with no executive champions, only product managers, this will be seen as a failure, and it will be associated with you and your technology, which is bad.
Unfortunately these people are not willing to pay more money for less risk. What they want is a base consulting cost (i.e., a non-venture business) to identify the lowest risk, promotion worthy endeavor, and then they want to pay as little as possible to achieve that. In a sense, the kind of customers who need unstructured data ETLs are poorly positioned to use such a technology, because they don't value technology generally, they aren't forward looking.
Assembling attractive websites that are really features on top of Dagster? There's a lot of value in that. Question is, are people willing to pay for that? Anyone can make attractive Dagster UIs, anyone can do Python glue. It's very challenging to differentiate yourselves, even when you feel like you have some customers, because eventually, one of those middlemen at BankCo is going to punch your USP into Google, and find the pre-existing services with huge account management teams (i.e., the hand holding consulting business people really pay for) that outpace you.
I've seen quotes like this many times. It's silly. I worked at a big bank for over a decade. 95% of the data we cared about was already in a SQL database. Maybe ~80% of our data was "unstructured", but it wasn't stuff we cared about for risk management or other critical functions.
> people are not willing to pay more money for less risk
I'd disagree here. Banks are willing to pay money to reduce risk, it's just unlikely to come from scraping data out of PDFs with an LLM because they've already done this if it's worth it.
Our customers are asking for integration with a lot of their systems (say HR / patrolling), but never ever offer to hook up their accounting system. If we want financial data, we either get a PDF with their audited financial statement or in exceptional cases a custom audited statement (you know, the one where a print of a part of the ledger gets a signature from the CPA for a not insignificant bill).
So I am enthusiastic from a data science point of view. Financial data processing of customer data is / was scarce since it was limited to what was feasible to process manually. That is nearly in the past.
It makes sense on paper from a VC perspective as a big bet.. but good luck to smaller VC-funded founders competing with massive BD teams fronting top AI dev teams. We compete in adjacent spaces where we can differentiate, and intentionally decided against going in head-on. For those who can, again, good luck!
There are some elements that might resemble Dagster, but I believe the challenging part is constructing validation systems that ensure high accuracy and correct schemas while processing all kinds of complex PDFs and document edge cases. Over the past few weeks, our engineering team has spent a lot of time developing a vision model robust enough to extract nested tables from documents.
In critical scenarios companies won't risk using 100% automation, the human is still in the loop, so the cost doesn't go down much.
I work on LLM based information extraction and use my own evaluation sets. That's how I obtained the 90% score. I tested on many document types. It looks like it's magic when you try an invoice in GPT-4o and skim the outputs, but if you spend 15 minutes you find issues.
Can you risk an OCR error that confuses a dot for a comma and sends 1000x more money in a bank transfer? Or a medical data extraction error that harms someone because there was no human in the document ingestion pipeline to see what was happening?
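To illustrate the dot/comma risk, here is a rough sketch of a check that lists every plausible reading of an amount string and sends it to a human when more than one reading exists. The rules are my own simplifying assumptions (for example, a single separator followed by exactly three digits is treated as ambiguous), not a standard.

```python
import re
from decimal import Decimal


def candidate_values(raw: str) -> set[Decimal]:
    """Return every plausible numeric reading of an amount like '1.000' or '1,000.50'."""
    s = raw.strip().replace(" ", "")
    if not re.fullmatch(r"\d+(?:[.,]\d+)*", s):
        raise ValueError(f"not a plain amount: {raw!r}")
    kinds = {c for c in s if c in ".,"}
    if not kinds:
        return {Decimal(s)}
    if len(kinds) == 2:
        # Both separators present: assume the last one is the decimal mark.
        decimal_mark = s[max(s.rfind("."), s.rfind(","))]
        grouping = "," if decimal_mark == "." else "."
        return {Decimal(s.replace(grouping, "").replace(decimal_mark, "."))}
    sep = kinds.pop()
    if s.count(sep) > 1:
        # A repeated separator can only be a thousands grouping.
        return {Decimal(s.replace(sep, ""))}
    whole, frac = s.split(sep)
    as_decimal = Decimal(f"{whole}.{frac}")
    if len(frac) == 3:
        # '1.000' could mean one or one thousand: keep both readings.
        return {as_decimal, Decimal(whole + frac)}
    return {as_decimal}


def needs_review(raw: str) -> bool:
    """Flag amounts that have more than one plausible value."""
    return len(candidate_values(raw)) > 1
```

With these rules, `needs_review("1.000")` is true while `needs_review("12.50")` is false. The check only catches ambiguity in the text it is given, not a misread digit, so it narrows the human-in-the-loop workload without removing it.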
Launch HN: Synnax (YC S24) – Unified hardware control and sensor data streaming - https://news.ycombinator.com/item?id=41227369 - Aug 2024 (23 comments)
also recent:
Launch HN: Stack Auth (YC S24) – An Open-Source Auth0/Clerk Alternative - https://news.ycombinator.com/item?id=41194673 - Aug 2024 (140 comments)
Launch HN: Firezone (YC W22) – Zero-trust access platform built on WireGuard - https://news.ycombinator.com/item?id=41173330 - Aug 2024 (88 comments)
Launch HN: Airhart Aeronautics (YC S22) – A modern personal airplane - https://news.ycombinator.com/item?id=41163382 - Aug 2024 (618 comments)
That's 4 of the 8 most recent Launch HNs btw. But it's true that there are reams of AI startups nowadays.
I'm developing an "AI wrapper" myself and I know how difficult it is to create a reliable system using LLM integration. I guess these many similar projects are competing to be the one that creates something that won't risk ruining their customers' reputation. But I see no differentiation, no eye-catching tech, algorithm, invention.
YC and HN used to be the bastion of innovation in tech.
> This year, we’ll fund more than 500 companies out of 50,000 applications, and almost all of them are related to AI in some way.
Source: https://www.ycombinator.com/blog/why-yc-went-to-dc/
(Edited to be more precise.)
We believe that AI is only one part of our product. A significant amount of value comes from building robust integrations with different data sources and managing the business logic that operates on top of this unstructured data.
Here's an example of the Enron email demo using the edsl syntax/package & a few different LLMs: https://www.expectedparrot.com/content/6607caa1-efc5-439f-85...
Everyone here knows that it's a really big problem that no one has nailed yet.
My 2 cents:
1. It took us (newscatcherapi.com) three years to realize that customers with the biggest problems and with the biggest budgets are the most underserved. The reason is that everyone is building an infinitely scalable AI/LLM/whatever to gain insights from news.
In reality, this NLP/AI works quite OK out of the box but is not ideal for everyone at the same time. So we decided to do Palantir-like onboarding/integration for each customer. We charge 25x more, but customers have a perfect tailor-made solution and a high ROI.
I see you already do the same! "99%+ accuracy with fine-tuning and human-in-the-loop" is what worked great for us. This way, your competitor is a human on payroll (very expensive) and not AWS Tesseract.
Going from 95% to 99% is just a fractional improvement, but it can be the change from "not good enough" to "great solution", and that can be priced differently.
2. "AI-powered workflow for unstructured data" what does it even mean? Why don't you say "99%+ accuracy extraction"? It's 2024, everyone is using AI, and everyone knows you need 2 hours to start applying AI from 0. So don't lower my expectations.
1. I completely agree. Last-mile accuracy is crucial for enterprise buyers, and the challenge isn't just the AI. It's about mapping their business logic and workflows to the product in a way that demonstrates fast time to value.
2. Thanks for the feedback. We're still refining the messaging and don't want to be overly focused on just the extraction aspect. Do you think positioning it as ETL for unstructured data or high-accuracy extraction for enterprises might work better?
I'd be mindblown if you said, "We turn PDFs into structured data with 99.99% accuracy. Here is how:"
And then tell me about fine-tuning human-in-the-loop stuff.
For instance, I have 100 pdfs, each with 10-100 individual products listed (in different formats).
I want to create a single table with one row per product appearing in any of the PDFs, with various details like price, product description, etc.,
From what I can tell from the demo, it seems like 1 file = 1 row in Trellis?
Trellis looks amazing... but only if it works well enough, i.e., if the rate of edge cases that trip up the service consistently remains close to 0%.
Every organization in the world needs and wants this, like, right now.
If you make it work well enough, you'll have customers knocking on your door around the clock.
I'm going to take a look. Like others here, I'm rooting for you guys to succeed.
The biggest challenge I see for you guys is that your best customer prospects, i.e., those organizations which need this most urgently and are willing to pay the most for it, are the ones already spending gobs of money to do it with human labor because mistakes are too costly, so they need at least human-level performance.
As you know, current-generation LLMs/LMMs are not yet reliable enough to do it on their own. They need all the help they can get -- sanity data checks, post-processing logic, ensembles of models, organization into teams of agents, etc., etc. -- I'm sure you're looking at all options.
Absent human beings in the loop, you're at the frontier of LLM/LMM research.
If you pull it off, you'll make megabucks.
I assume that in the validation step if you don't get all those data points, then that routes to an error state for further review or something.
Our thesis is that foundational models will become good and affordable enough to be used in almost all data processing pipelines. We build systems on top of that to manage workflows, integrations, and data applications that people may want to develop.
I'd need to actually dig into your product to make an informed statement but my guess is that if you build your business around AI secret sauce you're going to get your business eaten and pivot or fail, and if you build your business around a UI and specific integrations/tools real customers you're already in contact with want right now, you'll be ok.
non-snarky genuine question: is "generate structured data from unstructured data using AI" intended to be a moat or differentiator?
catalyst for my question: I just read about this capability becoming available from other AI vendors, e.g.
https://openai.com/index/introducing-structured-outputs-in-t...
1. writing connectors for various sources
2. writing connectors for destination
3. supporting multiple models, embeddings, vector database, text extractors
4. workflow automation engine (cron jobs)
5. performance tuning for speed and costs
6. security and compliance
Model routing architecture has been quite interesting to explore.
(congrats on the launch!)
NER is good for really simple things (like getting names, addresses, etc.).
A lot of the use cases that we see, like extracting data from nested tables in 100-page-long private credit documents or flagging transactions and emails that contain a specific compliance violation, are impossible to do with NER.
With Trellis, the idea is that you can write any mappings and transformations (no matter how complex the tasks or the source data are).
This is why logistics companies, financial services, and insurance firms have in-house teams to process these documents (e.g., insurance claims adjusters) or outsource them to BPOs. These documents can vary significantly from one to another.
With LLMs fine-tuned on your data, the accuracy is much higher, more robust, and more generalizable. We have a human in the loop for flagged results to ensure we maintain the highest accuracy standards in critical settings like financial services.
Because browsers have an autocomplete feature.
How many upvotes does your comment have?
For data ingestion and mapping, I agree that in an ideal world, we would all have first-party API integrations. However, many industries still rely on PDFs and CSV files to transfer data.
isn't it obvious that this would be a problem that will eventually be solved by the LLM providers themselves including the ability to flag and apply business logic on top of the structured outputs?
Like I'm not sure if this is well known but LLM providers have huge pressure to turn a profit and will not hesitate to copy any downstream wrappers out of existence rather than acquiring them outright.
It's like selling wrapping tape around the shovel handle for better grip and expecting the shovel makers to not release their new shovels with it in the near future.
The shovel makers don't even need to do any market research or product development and the buyers don't have any incentive to seek or pay a dedicated third party for what their vendors will release for free and at lower costs if that makes sense.
Basically when we onboard a new client they dump all their audiograms on us as PDFs.
The data extraction needs to be perfect because the table values are used to detect hearing loss over time.
We settled on a pipeline that looks roughly like
PDF -> gpto pre filter phase -> OCR to extract text tables and forms -> things branch out here
We do a direct parse of forms and text through an LLM
Extract audiogram graphs and send them to a foundation convnet
Attempt to parse tables programmatically
-> an audiogram might have 3 separate places where the values are so we pass the results of all three of these routes through Claude sonnet and if they match they get auto approved. If they don’t, they get flagged for manual review.
All in all it’s been a journey but the accuracy is near 100 percent. These tools are incredible
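The "three routes must agree" step above can be sketched roughly like this. In the pipeline described, the comparison is done by passing the results through Claude Sonnet; here plain equality stands in for that, and the route names, the `reconcile` function, and the rule that at least two routes must return values are all my assumptions.

```python
def reconcile(route_results: dict[str, dict[str, float]]) -> dict:
    """Auto-approve only when every route that returned values agrees exactly."""
    readings = [values for values in route_results.values() if values]
    # Assumption: a single route on its own is never enough to auto-approve.
    if len(readings) >= 2 and all(r == readings[0] for r in readings[1:]):
        return {"status": "auto_approved", "values": readings[0]}
    return {"status": "manual_review", "values": None, "candidates": route_results}


thresholds = {"500Hz": 20.0, "1000Hz": 25.0}

agreeing = reconcile({
    "forms_llm": dict(thresholds),
    "graph_convnet": dict(thresholds),
    "table_parser": dict(thresholds),
})

disagreeing = reconcile({
    "forms_llm": dict(thresholds),
    "graph_convnet": {"500Hz": 20.0, "1000Hz": 52.0},  # transposed digits
    "table_parser": {},  # this route found nothing
})
```

Requiring independent routes to agree is what makes the auto-approval trustworthy: one extractor's mistake only slips through if another route makes the same mistake.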
At Trellis, we're focused on building the AI tool that supports document-heavy workflows (this includes building the dashboard for teams to review, update, and approve results that were flagged, reading and writing directly to your system of record like Salesforce, and allowing customers to create their own validations around the documents).
Rooting for you guys!
Filters are a really important feature downstream of that which this system can provide.
We have also worked with the Enron corpus for demos and fast, reliable ETL for a set of documents that large is more difficult than it seems and a commendable problem to solve.
Exciting stuff!
How do your capabilities compare to Google Document AI or Watson SDU? Also what about standalone competitors such as Indico Data or DocuPanda?
Google Document AI and Watson SDU seem to be an afterthought for IBM/Google. The accuracy and configurability often fall short when you want to use them in a production setting.
Comparing to other legacy document processing companies, I think there are a few areas where we differentiate:
1. We handle end-to-end workflows from integrating with data sources, defining the transformation, and automatically triggering new runs when there’s an update to the data.
2. We built our entire stack on LLM and Vision transformers and use OCR/parser to check the results. This allows the mapping and tasks to be a lot more robust and flexible.
3. We have validations, reference checking, and confidence score metrics that enable fast human-in-the-loop iteration.
Unstructured's parsing and extraction are mostly based on traditional OCR and rule-based extraction. We built our entire preprocessing pipeline in an LLM- and vision-model-first way, which allows us to be flexible when the data is quite complex (like tables and images within documents).
I'm curious, have you (or your customers) deployed this in a RAG use case already, and what have the results been like?