Common failure modes I'm thinking about:
- 429 rate limits → agents retry → hammer the API even harder
- Partial outages → synchronized retries across customers
- LangGraph workflows fail mid-execution → how to resume?
For those running agent systems at scale:
- How do you handle Layer 7 failures?
- Retry coordination? Circuit breakers? (see the sketch below)
- How do you prevent retry storms to downstream dependencies?
- Do LangGraph workflows gracefully handle API failures?
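To make the circuit-breaker question concrete, here's the rough shape I have in mind: a minimal, framework-agnostic sketch, with arbitrary thresholds, not any particular library's implementation.

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker: open after N consecutive failures,
    half-open after a cooldown, close again on the first success."""

    def __init__(self, max_failures: int = 5, reset_after: float = 30.0):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                raise RuntimeError("circuit open; failing fast")
            self.opened_at = None  # half-open: let one probe through
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0
        return result

# usage: breaker = CircuitBreaker(); breaker.call(some_api_request)
```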
Curious what the production reality looks like.
How do you handle rate limiting across multiple workers? Do you use circuit breakers, retry libraries, or something custom? How do you prevent retry storms when 100 workers all hit the same rate limit?
Curious what's working at scale.
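For context on what "something custom" can look like: the simplest version I can picture is a fixed-window counter in shared storage, so the limit is enforced fleet-wide rather than per worker. A sketch, assuming all workers can reach one Redis instance (the key name and limits here are made up):

```python
import random
import time

import redis  # assumes a Redis instance every worker can reach

r = redis.Redis()

def acquire(key: str, limit: int, window_s: int = 1) -> bool:
    """Fixed-window counter shared by the whole fleet: at most `limit`
    requests per window in total, not per worker."""
    bucket = f"ratelimit:{key}:{int(time.time()) // window_s}"
    count = r.incr(bucket)
    if count == 1:
        r.expire(bucket, window_s * 2)  # window cleans itself up
    return count <= limit

# In each worker: wait with jitter instead of retrying immediately,
# so 100 workers don't all re-attempt in the same instant.
while not acquire("shared-api", limit=50):  # key/limit are illustrative
    time.sleep(0.1 + random.random() * 0.4)
```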
Retry storms - the API fails, your entire fleet retries independently, and the thundering herd makes things worse (the usual mitigation is sketched below).
Partial outages - the API is “up” but degraded (slow, intermittent 500s). Health checks pass, requests suffer.
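For the retry-storm case, the standard first defense is "full jitter" backoff (popularized by the AWS Architecture Blog): instead of every worker sleeping the same exponential delay, each sleeps a random fraction of it. A minimal sketch:

```python
import random
import time

def backoff_delay(attempt: int, base: float = 0.5, cap: float = 30.0) -> float:
    """'Full jitter' backoff: sleep a random amount up to an exponential cap,
    so workers that failed together do not retry together."""
    return random.uniform(0, min(cap, base * (2 ** attempt)))

# Spread of retry times across a 100-worker fleet on the 3rd attempt:
delays = [backoff_delay(3) for _ in range(100)]
print(min(delays), max(delays))  # scattered over ~0-4s instead of one spike
```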
What I’m curious about:
∙ What’s your current solution? (circuit breakers, queues, custom coordination, service mesh, something else?)
∙ How well does it work? What are the gaps?
∙ What scale are you at? (company size, # of instances, requests/sec)
I’d love to hear what’s working, what isn’t, and what you wish existed.
When services hit 429s or timeouts, the standard response is almost always the same: retries with backoff, sleep loops, jitter, etc. This is treated as a best practice across languages and platforms.
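For concreteness, this is the pattern I mean; nearly every language has an identical idiom. Here it is via Python's tenacity library (openai_call is just a placeholder for whatever hits the rate-limited API):

```python
from tenacity import retry, stop_after_attempt, wait_random_exponential

@retry(wait=wait_random_exponential(multiplier=1, max=60),
       stop=stop_after_attempt(6))
def openai_call():
    ...  # the request that may raise on a 429 or timeout
```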
But in systems with high concurrency, fan-out, or shared downstream dependencies, retries often seem to amplify load instead of smoothing it. What starts as localized failure can turn into retry storms, thundering herds, and cascading outages.
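To put rough numbers on the amplification: with 3 retries per call, each failing request becomes up to 4 attempts; if two services in a chain each retry 3 times, one user request can fan out into 4 × 4 = 16 attempts against the bottom dependency, exactly when it is least able to absorb them.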
It’s made me wonder whether retries are solving the wrong problem at the wrong layer — treating a coordination issue as an application-level error-handling concern.
I wrote up a longer piece exploring this idea and arguing for making failure boring again by handling it at a different layer: https://www.ezthrottle.network/blog/making-failure-boring-again
Curious how this matches others’ experience:
Have retries actually improved stability for you under sustained rate limiting?
Have you seen cases where they clearly made things worse?
If retries aren’t the right abstraction, what is?
Interested in war stories, counterexamples, and alternative approaches.
I’d love to hear from anyone who has transitioned from FAANG to contracting—how did you find your first clients, and what advice do you have for making the switch?
Currently, I’m building a startup and learning how to create content in the style of Fireship. My goal is to work around 40–60 hours per month at $200/hour, using the rest of my time to grow my business until it becomes profitable. At that rate, I estimate needing 2–3 clients, each requiring 15–20 hours per month. Given my experience in scalable software engineering and ability to create valuable content, I assume there are startups or larger companies that would find my skill set valuable.
I’m based in Seattle but currently living off my savings while I focus on my startup and content creation. I’ve considered moving back to my hometown, which has a lower cost of living, but I’m unsure if that would impact my ability to find high-paying remote contracts.
A few specific questions for those with experience in high-paying contract work:
- What are the best ways to find $200/hour contracts? Do companies typically post these, or is it more about networking?
- How important is location? Would moving to a lower-cost city hurt my chances of finding premium contracts?
- Do you recommend going through agencies, or is it better to source clients directly?
- How do you structure contracts to minimize risk (e.g., payment terms, project scope, client expectations)?
- Are there any common pitfalls to avoid when negotiating high hourly rates?
- If you were starting over today, what would you do differently?

I’d really appreciate any insights from those who have made this transition successfully!