How I hunt down and fix errors in production

(mattsegal.dev)

64 points · The_Amp_Walrus · 4y ago · 25 comments

aaronbwebber · 4y ago · 5 in thread

An important step that is missing here is evaluating if your fix is going to cause other, potentially worse problems. I suspect that in this case, it's fairly unlikely that increasing the maximum POST body size to 60 MB is going to cause problems - eyeballing that Sendgrid chart, it looks like we are not dealing with very high throughput here. But it's not hard to imagine a situation where tripling the max POST body size would result in a large increase in server memory usage, which could result in things like OOM kills, which could result in a lot of people not getting their reply emails or whatever.

So don't just rush a fix out. Think about what the effects of a configuration change like this might be, and whether you are just making more problems for yourself down the line trying to fix something quickly.
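A back-of-envelope sketch of that worry, assuming the old limit was 20 MB (inferred from "tripling" to 60 MB) and a made-up figure of 8 concurrent uploads. It is only an upper bound for the case where every in-flight body is held fully in memory; whether a given server or app actually does that depends on its configuration.

```python
def worst_case_body_memory_mb(max_body_mb: int, concurrent_requests: int) -> int:
    """Upper bound on memory held by request bodies, in MB.

    Assumes every in-flight request sends a body of the maximum allowed
    size and that each body is held entirely in memory at the same time.
    """
    return max_body_mb * concurrent_requests


# Hypothetical numbers: 8 concurrent uploads, limit raised from 20 MB to 60 MB.
CONCURRENT_UPLOADS = 8
before = worst_case_body_memory_mb(20, CONCURRENT_UPLOADS)
after = worst_case_body_memory_mb(60, CONCURRENT_UPLOADS)
print(f"worst case before: {before} MB, after: {after} MB")
```

Comparing the "after" figure against the memory actually available on the box is a quick sanity check before shipping the config change.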

iforgotpassword · 4y ago

The bad part is that this only made the problem more unlikely, but didn't fix it.

Ideally you want to inform the client that their mail was discarded due to size. But you cannot make the mail bounce at that point because it magically turned into http already. The actual delivery is already done. You also cannot trigger an automated reply in your Django app, because it was nginx who dropped it and your app never saw it.

SOLAR_FIELDS · 4y ago

It looks like SendGrid SMTP servers will probably bounce anything over 30 MB, since their v3 API says it only supports attachments up to 30 MB

https://docs.sendgrid.com/api-reference/mail-send/mail-send

So OP should have actually only increased the size limit to 30 MB since that’s all that SendGrid supports anyway. Then OP can simply rely on SendGrid’s SMTP server to respond to the client appropriately when they attach files that are too big. Before completely relying on SendGrid, OP would of course want to verify with docs and/or SendGrid support that they actually do that, and that they won’t arbitrarily increase the limit without releasing a new API version. But presumably as a paying user of the product he could do that.

teddyh · 4y ago

Attachment size is not mail size. Attachments usually get base64 encoded, i.e. they become 33% larger. So a maximum mail size of 30 MB is actually a maximum attachment size of 22.5 MB.
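The arithmetic above can be checked with a minimal sketch. Base64 turns every 3 raw bytes into 4 encoded characters; the helper names here are made up for illustration, and the calculation ignores MIME line breaks and mail headers, which make the real usable size slightly smaller still.

```python
import base64
import math

MB = 1_000_000  # decimal megabytes, an assumption about what "30 MB" means


def base64_encoded_size(raw_bytes: int) -> int:
    """Length of the base64 encoding of raw_bytes bytes, without line breaks."""
    return 4 * math.ceil(raw_bytes / 3)


def max_raw_size(encoded_limit: int) -> int:
    """Largest raw payload whose base64 encoding fits in encoded_limit bytes."""
    return (encoded_limit // 4) * 3


# Cross-check the formula against the standard library on a small input.
assert base64_encoded_size(10) == len(base64.b64encode(b"x" * 10))

print(max_raw_size(30 * MB) / MB)  # 22.5
```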

philliphaydon · 4y ago

Can’t he pull the file out if it’s larger than 30 MB and provide a link to the file? I think Google does something similar where it will put the file into Google Drive if it’s too large.

The_Amp_Walrus (OP) · 4y ago

good advice. it was like a bit of a YOLO fix, but as you noticed pretty low volume/small scale stuff

invalidname · 4y ago · 5 in thread

I'm very much in favor of this but his observability stack is seriously lacking. With Developer Observability tools this is much easier and more powerful: https://www.youtube.com/watch?v=k0DPO5jlZtU

sa46 · 4y ago

How would Lightrun have helped here? Skimming their docs, it looks like a fancy debugger that can log stuff on prod and display it in your IDE.

A debugger is most useful when you know where the problem is and you can either reproduce it or it occurs often enough to observe. Neither was true for OP. Additionally, the problem was in nginx and it's not clear if the error was visible from a part of the stack that Lightrun supports.

That said, I'm always game to hear about alternative debugging techniques.

invalidname · 4y ago

This is classic for a tool like that... You know an email isn't sent to some users. You start placing conditional snapshots (like conditional breakpoints but they don't stop) in key locations.

You can use conditions like "currentUser == 'user experiencing problem'". Or you can place the snapshots in places where code "should" reach in case of error.

Go make coffee. Come back and you'll see stack traces and variable values you can analyze to see the problem. The nice thing is that it will even work if you have a cluster. Just place the snapshot in a tag which applies to all the servers. You don't need to update the code, there's no risk involved etc.

The_Amp_Walrus (OP) · 4y ago

Would you add any categories to this list, or did you think the tools recommended within the categories were lacking?

- Uptime monitoring

- Error reporting

- Log aggregation

- Performance monitoring

invalidname · 4y ago

A developer observability tool like Lightrun would make a huge difference.

trog · 4y ago

Never heard of Lightrun before, but it looks very neat - although I'm immediately turned off by the pricing page just saying "contact us".

chaps · 4y ago · 4 in thread

Here's how I do it:

  xargs -I'hostname' -a hosts.txt -P128 bash -c "ssh 'hostname' find / -type f -mmin -20 | xargs -P128 -Ifilename grep -cHia error filename 2>/dev/null | sed 's/^/hostname:/' ; :" | sort -nrk3 -t':'

flukus · 4y ago

Personally I like to rsync the logs locally so I can have the context of the errors and also historical data; the latter can be useful for blame avoidance. I also prefer awk's pattern matching for the filtering, counting, etc. Especially on legacy code where some errors aren't really errors, or errors that aren't for me, etc.

The hardest/funnest bug I ever fixed was from grepping several years' worth of log files and noticing that errors occasionally happened within 5 seconds of each other. From that realization I just had to search the code base for "sleep(5000)" to find the problem.
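Spotting that kind of timing pattern can be roughly sketched like this, assuming the error timestamps have already been pulled out of the logs. The function name and the sample timestamps are made up for illustration.

```python
from datetime import datetime, timedelta


def pairs_within(timestamps, window=timedelta(seconds=5)):
    """Return consecutive error timestamps that are at most `window` apart."""
    ordered = sorted(timestamps)
    return [(a, b) for a, b in zip(ordered, ordered[1:]) if b - a <= window]


# Hypothetical timestamps extracted from error lines in the logs.
error_times = [
    datetime.fromisoformat(s)
    for s in (
        "2020-01-01 10:00:00",
        "2020-01-01 10:00:05",
        "2020-01-01 12:30:00",
    )
]

for first, second in pairs_within(error_times):
    print(first, second, (second - first).total_seconds())
```

A cluster of pairs with the same gap is the hint that something like a fixed sleep or timeout is involved.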

If you have a lot of environments, servers or apps I highly recommend recutils over a hosts file too.

The_Amp_Walrus (OP) · 4y ago

nice one-liner. is that any file modified in the last 20 minutes on any machine in hosts.txt containing the word error?

chaps · 4y ago

It is, just... don't run it in prod :)

aaronbwebber · 4y ago

This is probably the best advertisement for loki I've ever seen.

notaspecialist · 4y ago · 1 in thread

When a user comes over and says "this isn't happening" I write a test and sure enough, the test fails. I fix the case, re-run all the tests, push to UAT, and ask the user to verify it works in the UAT system. It's pushed into production after hours.

Prior to TDD I would spend hours stepping through code, setting variables to replicate the scenario, scratching my head, and usually fix it after a week or so. Then I would get a bug report of something else weird happening. And repeat that process.

vitus · 4y ago

This is great, if your user report has enough details for you to replicate the problem in a test case.

More often than not, I spend a fair amount of time looking for error logs, tracing through the code, and generally getting a good sense of the exact parameters of the underlying issue.

But yes, if you identify a defect that you can replicate, write a test for it so you can confirm that a) your fix actually works and b) you don't backslide in the future.
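A minimal sketch of what such a regression test might look like, pytest-style. Everything here is hypothetical: `parse_size` is an invented helper, and the "bug report" is an imagined case where an uppercase suffix like "60M" used to be rejected.

```python
def parse_size(text: str) -> int:
    """Hypothetical helper: convert strings like '60m' or '512k' to bytes."""
    units = {"k": 1024, "m": 1024 ** 2, "g": 1024 ** 3}
    text = text.strip().lower()  # the fix: accept '60M' as well as '60m'
    if text and text[-1] in units:
        return int(text[:-1]) * units[text[-1]]
    return int(text)


def test_uppercase_suffix_from_bug_report():
    # a) Reproduces the reported input, so it fails before the fix and passes after.
    assert parse_size("60M") == 60 * 1024 ** 2


def test_existing_behaviour_still_works():
    # b) Guards against backsliding on inputs that already worked.
    assert parse_size("60m") == 60 * 1024 ** 2
    assert parse_size("1024") == 1024
```

Writing the failing test first, then the fix, gives you both confirmations from the same file.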

ricardobayes · 4y ago · 1 in thread

Lately the only technical question we ask when hiring is to debug an issue. Experience in this is really difficult to fake unlike memorizing leetcode issues etc.

The_Amp_Walrus (OP) · 4y ago

I did a code test for my 2nd job where they had a simple Django project setup with, iirc, a bug and a performance issue and you would sit with a dev and work through it to fix them both. They'd answer framework specific questions (I didn't know Django at the time). Best interview experience I've ever had.

mtippett · 4y ago

I agree with most of what is suggested in the article.

However, a big part that is missing is the reality that there is a set of hypotheses (is that right?) in play at any point in time. A lot of debugging is the cycle of:

1. Think about the system and gather any available data - you can't boil the ocean.

2. Consider a set of hypotheses about the possible cause (even if it is a partial cause).

3. Seek any method to either refute or confirm the possible cause, which gives more data.

Wash, rinse, repeat. Each cycle will likely get closer to the problem.

Each cycle also is likely to find other tech debt that needs to be solved.

Rarely is there a single hypothesis that is right the first time, although an experienced person will prune out a lot of poor ideas automatically, and likely subconsciously.

Observability goes a long way to getting the data needed to confirm or refute.

rmbyrro · 4y ago

If there's an issue receiving emails, and there's an endpoint /email/receive/ and nginx log files, I would have promptly searched these logs for "[error] * /email/receive/"
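A rough sketch of that search in Python rather than grep. The two sample lines are invented for illustration and only approximate the shape of an nginx error log; the function name is made up too.

```python
def matching_errors(lines, path="/email/receive/"):
    """Keep only log lines that are errors and mention the given endpoint."""
    return [line for line in lines if "[error]" in line and path in line]


# Invented sample lines, not real log output.
sample_log = [
    '2021/05/01 12:00:00 [error] 1234#1234: *5678 client intended to send '
    'too large body: 62914560 bytes, request: "POST /email/receive/ HTTP/1.1"',
    '2021/05/01 12:00:03 [warn] 1234#1234: *5679 an upstream response is '
    'buffered to a temporary file, request: "GET /dashboard/ HTTP/1.1"',
]

for line in matching_errors(sample_log):
    print(line)
```

Against a real file you would pass the open file object instead of the list, since iterating over it yields lines in the same way.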

ge96 · 4y ago

random thoughts about this subject

- sucks when your bug completely blows your project up (type error blank page)

- I'm tempted to track every click/event and log it for reproducibility

- sucks when your product fails not because of a bug but just people not knowing how to use it (training issue I guess) eg. permissions not accepted, why isn't it working?
