0 points · iknownothow · 1y ago · 0 comments

Thanks for the reply and apologies for the general cynicism. It's not lost on me that it's people like you that build tools that make the work tick. I'm just a loud potential customer and I'm just forwarding the frustration that I have with my own customers onto you :)

Your customers are software devs like me. When we're in control of generating timestamps, we know we must use standard ISO formatting.

However, what do I do when my customers give me access to an S3 bucket with 1 billion timestamps in an arbitrary (yet decipherable) format?

In the GitHub issue you seem to have undergone an evolution from purity to pragmatism. I support this 100%.

What I've also noticed is that you seem to look for grounding or motivation for "where to draw the line" in what's already been done in Temporal, the Python stdlib, etc. This is where I'd like to challenge your intuitions and ask you instead to open the floodgates and accept any format that is theoretically sensible under ISO format.

Why? The damage has already been done. Any format you can think of, already exists out there. You just haven't realized it yet.

You know who has accepted this? Pandas devs (I assume, I don't know them). The following are legitimate timestamps under Pandas (22.2.x):

* 2025-03-30T (nope, not a typo)

* 2025-03-30T01 (HH)

* 2025-03-30 01 (same as above)

* 2025-03-30  01 (two or more spaces are also acceptable)

In my opinion Pandas doesn't go far enough. Here's an example from real customer data I've seen in the past that Pandas doesn't parse.

* 2025-03-30+00:00 (this is very sensible in my opinion, unless there's a deeper theoretical regex pattern that conflicts with other parts of the ISO format)

Here's an example that isn't decipherable under a flexible ISO interpretation and shouldn't be supported.

* 2025-30-03 (theoretically you can infer that 30 is a day, and 03 is month. BUT you shouldn't accept this. Pandas used to allow such things. I believe they no longer do)

I understand writing these flexible regexes or if-else statements will hurt your benchmarks and will be painful to maintain. Maybe release them under a new call like `parse_best_effort` (or even `youre_welcome`) and document the pitfalls and performance degradation. Trust me, I'd rather use a reliable, generic but slow parser than spend hours writing a god-awful regex that I will only use once (I've spent literal weeks writing regexes and fixes in the last decade).
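To make that concrete, here is a rough sketch of what such a call could look like on top of the Python stdlib alone. This is not Whenever's API or Pandas' behavior; the name `parse_best_effort` is just the one floated above, and the pattern `_LENIENT` is my own assumption about what "theoretically sensible under ISO" might mean for the examples in this comment.

```python
import re
from datetime import datetime, timedelta, timezone

# Hypothetical lenient pattern (an assumption, not any library's grammar):
# a date, then optionally a separator ('T' or one or more spaces) with an
# optional HH[:MM[:SS]], then optionally a UTC offset ('Z' or +HH:MM / -HH:MM).
_LENIENT = re.compile(
    r"(?P<y>\d{4})-(?P<m>\d{2})-(?P<d>\d{2})"
    r"(?:(?:T|\s+)(?:(?P<H>\d{2})(?::(?P<M>\d{2})(?::(?P<S>\d{2}))?)?)?)?"
    r"(?P<off>Z|[+-]\d{2}:\d{2})?"
)


def parse_best_effort(text: str) -> datetime:
    """Parse an ISO-like timestamp leniently; raise ValueError otherwise."""
    match = _LENIENT.fullmatch(text)
    if match is None:
        raise ValueError(f"not a recognizable ISO-like timestamp: {text!r}")
    g = match.groupdict()

    tz = None
    off = g["off"]
    if off == "Z":
        tz = timezone.utc
    elif off:
        sign = -1 if off[0] == "-" else 1
        tz = timezone(sign * timedelta(hours=int(off[1:3]), minutes=int(off[4:6])))

    # The datetime constructor range-checks every field, so a swapped
    # day/month such as 2025-30-03 is rejected here rather than guessed at.
    return datetime(
        int(g["y"]), int(g["m"]), int(g["d"]),
        int(g["H"] or 0), int(g["M"] or 0), int(g["S"] or 0),
        tzinfo=tz,
    )
```

With this sketch, `2025-03-30T`, `2025-03-30T01`, `2025-03-30  01` and `2025-03-30+00:00` all parse, while `2025-30-03` raises `ValueError` because 30 is not a valid month. Missing time fields default to zero, which is itself a design choice that would need documenting.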

Pandas has been around since 2012 dealing with customer data. They have seen it all and you can learn a lot from them. ISOs and RFCs when it comes to timestamps don't mean squat. If possible try to make Whenever useful rather than fast or pure. I'd rather use a slimmer faster alternative to pandas for parsing Timestamps if one is available but there aren't any at the moment.

If time permits, I'll try to compile a non-exhaustive list of real-world timestamp formats and post it in the issue.

Thank you for your work!

P.S. seeing BurntSushi in the GitHub issue gives me imposter syndrome :)

1 comments · 1 top-level

burntsushi1y ago

Because you pinged me... Jiff also generally follows in Temporal's footsteps here. Your broader point of supporting things beyond the specs (ISO 8601, RFC 3339, RFC 9557, RFC 2822 and so on) has already been absorbed into the Temporal ISO 8601 extensions. And that's what Jiff supports (and presumably, whenever, although I don't know enough about whenever to be absolutely precise in what it supports). So I think the philosophical point has already been conceded by the Temporal project itself. What's left, it seems, is a measure of degree. How far do you go in supporting oddball formats?

I honestly do not know the answer to that question myself. But I wouldn't necessarily look to Pandas as the shining beacon on a hill here. Not because Pandas is doing anything wrong per se, but because it's a totally different domain and use case. On the one hand, you have a general purpose library that needs to consider all of its users for all general purpose datetime use cases. On the other hand, you have a data scienc-y library designed for trying to slurp up and make sense of messy data at scale. There may be things that make sense in the latter that don't in the former.

In particular, a major gap in your reasoning, from what I can see, is that constraints beget better error reporting. I don't know how to precisely weigh error reporting versus flexible parsing, but there ought to be some deliberation there. The more flexible your format, the harder it is to give good error messages when you get invalid data.
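A small illustration of that trade-off, as a sketch using only the Python stdlib. Neither function is Jiff's or whenever's actual behavior; `parse_strict`, `parse_flexible` and the `FORMATS` list are hypothetical names made up for this example.

```python
from datetime import datetime


def parse_strict(text: str) -> datetime:
    """Accept exactly YYYY-MM-DDTHH:MM:SS and report where the input deviates."""
    template = "dddd-dd-ddTdd:dd:dd"  # 'd' stands for any ASCII digit
    if len(text) != len(template):
        raise ValueError(f"expected {len(template)} characters, got {len(text)}")
    for pos, (want, got) in enumerate(zip(template, text)):
        if want == "d":
            ok = got.isascii() and got.isdigit()
        else:
            ok = got == want
        if not ok:
            expected = "a digit" if want == "d" else repr(want)
            raise ValueError(f"expected {expected} at position {pos}, got {got!r}")
    return datetime.strptime(text, "%Y-%m-%dT%H:%M:%S")


# Hypothetical list of shapes a lenient parser might try in order.
FORMATS = (
    "%Y-%m-%dT%H:%M:%S",
    "%Y-%m-%d %H:%M:%S",
    "%Y-%m-%dT%H",
    "%Y-%m-%d %H",
    "%Y-%m-%d",
)


def parse_flexible(text: str) -> datetime:
    """Try each known shape; on failure there is no single 'expected' to report."""
    for fmt in FORMATS:
        try:
            return datetime.strptime(text, fmt)
        except ValueError:
            continue
    raise ValueError(f"{text!r} matched none of {len(FORMATS)} known formats")
```

Given `2025-03-30 01:02:03`, the strict version can say `expected 'T' at position 10, got ' '`, while the flexible one simply accepts it. Given something like `2025-03-30T0x`, the flexible version can only report that nothing matched, because it no longer knows which shape the caller intended.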

Moreover, "flexible parsing" doesn't actually have to be in the datetime library. The task of flexible parsing is not, in and of itself, overtly challenging. It's a tedious task that can be built on top of the foundation of a good datetime library. I grant that this is a bit of a cop-out, but it's part of the calculus when designing ecosystem libraries like this.

Speaking for me personally (in the context of Jiff), something I wouldn't mind so much is adding a dedicated "flexible" parsing mode that one can opt into. But I don't think I'd want to make it the default.
