platypii on Hacker News

Ask HN: Where are you keeping your LLM logs?

LLM logs are crushing my application logging system. We recently launched AI features on our app and went from ~100mb/month of normal website logs to 3gb/month of llm conversation logs and growing. Our existing logging system was overwhelmed (queries timing out, etc), and costs started increasing. We’re considering how to re-architect our llm logs specifically so we can handle more users plus the increasing token use from things like reasoning models, tool calling, and multi-agent systems. I’m not selling any solutions here, genuinely curious what others are doing. Do you store them alongside APM logs? Dedicated LLM logging service? Build it yourself with open source tools?

Best way to annotate large parquet LLM logs without full rewrites?

I asked this on the Apache mailing list but haven’t found a good solution yet. Wondering if anyone has some ideas for how to engineer this?

Here’s my problem: I have gigabytes of LLM conversation logs in parquet in S3. I want to add per-row annotations (llm-as-a-judge scores), ideally without touching the original text data.

So for a given dataset, I want to add a new column. This seemed like a perfect use case for Iceberg. Iceberg does let you evolve the table schema, including adding a column. BUT you can only add a column with a default value. If I want to fill in that column with annotations, ICEBERG MAKES ME REWRITE EVERY ROW. So despite being based on parquet, a column-oriented format, I need to re-write the entire source text data (gigabytes of data) just to add ~1mb of annotations. This feels wildly inefficient.

I considered just storing the column in its own table and then joining them. This does work but the joins are annoying to work with, and I suspect query engines do not optimize well a "join on row_number" operation.

I've been exploring using little-known features of parquet like the file_path field to store column data in external files. But literally zero parquet clients support this.

I'm running out of ideas for how to work with this data efficiently. It's bad enough that I am considering building my own table format if I can’t find a solution. Anyone have suggestions?

2platypii4mo ago2

Ask HN: Local tools for working with LLM datasets?

I’ve been doing data science for years, and am very familiar with jupyter notebooks and more recently been using duckdb a lot. But now I have this huge pile of output tokens from my 4090s, and it feels characteristically different from data I’ve worked with in the past. Notebooks and duckdb on the CLI don’t feel like they’re built for working with huge volumes of text data like my training set and llm output traces.

What have you found work well for this? I’m trying to fine-tune on a text dataset and be able to inspect the output from eval runs. I would prefer local and open source tools to a paid service.

What UI do you use on top of data engineering tools to look at data?

Tools like DuckDB Wasm and data engineering platforms like Iceberg leverage Parquet’s built-in indexing to very efficiently query files over the network. But as I’ve been building data tools myself, the stack gets complicated fast, especially once you try to visualize or explore the data instead of just querying it. I’m intrigued by some of the modern tricks people are using to do more data engineering client-side.

With OPFS + Parquet + Wasm, the browser already has everything it needs to handle multi-GB LLM datasets client-side.

Is the world of data UIs evolving? Are there new data tools and best practices beyond notebooks and DuckDB?

Ask HN: How far can we push the browser for large-scale data parsing?

How far can we push the browser as a data engine — not just for visualizations, but for curating and querying large datasets? Do we need traditional backend architectures?

I wanted to see what happens when we treat the browser like part of the data stack, using pure JavaScript to load, slice, and explore datasets interactively. That experiment led to a small set of open-source tools — Hyparquet and HighTable. They’re designed to test the limits of browser-native data processing to see where the browser stops being a thin client and starts acting like a real data engine.

Curious what others think about the future of browser-first data tools:

- Where do you see the practical limits for client-side data processing? - What would make browser-based architectures a viable alternative to traditional data stacks?

Ask HN: Where are you keeping your LLM logs?

Best way to annotate large parquet LLM logs without full rewrites?

I asked this on the Apache mailing list but haven’t found a good solution yet. Wondering if anyone has some ideas for how to engineer this?

Here’s my problem: I have gigabytes of LLM conversation logs in parquet in S3. I want to add per-row annotations (llm-as-a-judge scores), ideally without touching the original text data.

I've been exploring using little-known features of parquet like the file_path field to store column data in external files. But literally zero parquet clients support this.

I'm running out of ideas for how to work with this data efficiently. It's bad enough that I am considering building my own table format if I can’t find a solution. Anyone have suggestions?

Ask HN: Local tools for working with LLM datasets?

What have you found work well for this? I’m trying to fine-tune on a text dataset and be able to inspect the output from eval runs. I would prefer local and open source tools to a paid service.

What UI do you use on top of data engineering tools to look at data?

With OPFS + Parquet + Wasm, the browser already has everything it needs to handle multi-GB LLM datasets client-side.

Is the world of data UIs evolving? Are there new data tools and best practices beyond notebooks and DuckDB?

Ask HN: How far can we push the browser for large-scale data parsing?

How far can we push the browser as a data engine — not just for visualizations, but for curating and querying large datasets? Do we need traditional backend architectures?

Curious what others think about the future of browser-first data tools:

- Where do you see the practical limits for client-side data processing? - What would make browser-based architectures a viable alternative to traditional data stacks?

platypii

Recent submissions

A visual explainer of how to scroll through billions of rows in the browser (opens in new tab)

Ask HN: Where are you keeping your LLM logs?

Show HN: Squirreling: a browser-native SQL engine (opens in new tab)

Best way to annotate large parquet LLM logs without full rewrites?

Ask HN: Local tools for working with LLM datasets?

What UI do you use on top of data engineering tools to look at data?

Show HN: We built an AI tool for working with massive LLM chat log datasets (opens in new tab)

Lessons from Hyperparam's year of open source data transformation (opens in new tab)

Ask HN: How far can we push the browser for large-scale data parsing?

The Quest for Instant Data (opens in new tab)

Show HN: Hyperparam: OSS tools for exploring datasets locally in the browser (opens in new tab)

Recent submissions

A visual explainer of how to scroll through billions of rows in the browser (opens in new tab)

Ask HN: Where are you keeping your LLM logs?

Show HN: Squirreling: a browser-native SQL engine (opens in new tab)

Best way to annotate large parquet LLM logs without full rewrites?

Ask HN: Local tools for working with LLM datasets?

What UI do you use on top of data engineering tools to look at data?

Show HN: We built an AI tool for working with massive LLM chat log datasets (opens in new tab)

Lessons from Hyperparam's year of open source data transformation (opens in new tab)

Ask HN: How far can we push the browser for large-scale data parsing?

The Quest for Instant Data (opens in new tab)

Show HN: Hyperparam: OSS tools for exploring datasets locally in the browser (opens in new tab)