So don't just rush a fix out. Think about what the effects of a configuration change like this might be, and whether you are just making more problems for yourself down the line trying to fix something quickly.
Ideally you want to inform the client that their mail was discarded due to size. But you cannot make the mail bounce at that point, because it has already magically turned into HTTP. The actual delivery is already done. You also cannot trigger an automated reply in your Django app, because it was nginx that dropped it and your app never saw it.
https://docs.sendgrid.com/api-reference/mail-send/mail-send
So OP should have actually only increased the size limit to 30 MB, since that’s all that SendGrid supports anyway. Then OP can simply rely on SendGrid’s SMTP server to respond to the client appropriately when the attachments are too big. Before relying on that completely, OP would of course want to verify with the docs and/or SendGrid support that they actually do that, and that they won’t arbitrarily increase the limit without releasing a new API version. But presumably, as a paying user of the product, OP could do that.
A debugger is most useful when you know where the problem is and you can either reproduce it or it occurs often enough to observe. Neither was true for OP. Additionally, the problem was in nginx and it's not clear if the error was visible from a part of the stack that Lightrun supports.
That said, I'm always game to hear about alternative debugging techniques.
You can use conditions like "currentUser == 'user experiencing problem'". Or you can place the snapshots in places where code "should" reach in case of error.
Go make coffee. Come back and you'll see stack traces and variable values you can analyze to see the problem. The nice thing is that it will even work if you have a cluster. Just place the snapshot in a tag which applies to all the servers. You don't need to update the code, there's no risk involved etc.
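To make the idea of a conditional snapshot concrete, here is a minimal sketch in plain Python. It is not Lightrun's API, and unlike the tool described above it does require a code change; `snapshot_if` and `handle_upload` are hypothetical names invented for illustration. It only shows the principle: capture a stack trace and a few variables when a condition such as the current user matches.

```python
import logging
import traceback

logger = logging.getLogger("snapshots")


def snapshot_if(condition, **variables):
    """Capture a stack trace and the given variables when condition is true.

    Returns the snapshot as a dict, or None when nothing was captured.
    """
    if not condition:
        return None
    snapshot = {
        # Drop the last entry, which is this helper's own frame.
        "stack": traceback.format_stack()[:-1],
        "variables": {name: repr(value) for name, value in variables.items()},
    }
    logger.warning("snapshot captured: %s", snapshot["variables"])
    return snapshot


def handle_upload(current_user, size_bytes):
    # Hypothetical handler: only the flagged user triggers a capture.
    return snapshot_if(
        current_user == "user experiencing problem",
        current_user=current_user,
        size_bytes=size_bytes,
    )
```

Calling `handle_upload("user experiencing problem", 31_000_000)` returns a dict with the call stack and the two variables; any other user returns `None` and costs almost nothing.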
- Uptime monitoring
- Error reporting
- Log aggregation
- Performance monitoring
xargs -I'hostname' -a hosts.txt -P128 bash -c "ssh 'hostname' find / -type f -mmin -20 | xargs -P128 -Ifilename grep -cHia error filename 2>/dev/null | sed 's/^/hostname:/' ; :" | sort -nrk3 -t':'
The hardest/funnest bug I ever fixed was from grepping several years' worth of log files and noticing that two errors occasionally happened within 5 seconds of each other. From that realization I just had to search the code base for "sleep(5000)" to find the problem.
If you have a lot of environments, servers or apps, I highly recommend recutils over a hosts file too.
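The "errors 5 seconds apart" observation can also be checked mechanically instead of by eye. The following is a rough sketch, assuming each log line starts with a `YYYY-MM-DD HH:MM:SS` timestamp and contains the word "error"; the log format, the sample lines and the `error_gaps` helper are all made up for illustration.

```python
from collections import Counter
from datetime import datetime


def error_gaps(lines, marker="error", fmt="%Y-%m-%d %H:%M:%S"):
    """Count whole-second gaps between consecutive error lines."""
    stamps = []
    for line in lines:
        if marker in line.lower():
            # Assumption: the first 19 characters are the timestamp.
            stamps.append(datetime.strptime(line[:19], fmt))
    stamps.sort()
    gaps = Counter()
    for earlier, later in zip(stamps, stamps[1:]):
        gaps[int((later - earlier).total_seconds())] += 1
    return gaps


# Invented sample data, not real logs.
sample = [
    "2021-03-01 10:00:00 ERROR upstream timed out",
    "2021-03-01 10:00:05 ERROR upstream timed out",
    "2021-03-01 10:07:12 INFO request ok",
    "2021-03-02 14:30:10 ERROR upstream timed out",
    "2021-03-02 14:30:15 ERROR upstream timed out",
]
gaps = error_gaps(sample)
print(gaps.most_common(3))
```

A gap that shows up far more often than chance would suggest, like 5 seconds here, is a hint to go looking for a matching timeout or sleep value in the code.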
Prior to TDD I would spend hours stepping through code, setting variables to replicate the scenario, scratching my head, and usually fix it after a week or so. Then I would get a bug report of something else weird happening. And repeat that process.
More often than not, I spend a fair amount of time looking for error logs, tracing through the code, and generally getting a good sense of the exact parameters of the underlying issue.
But yes, if you identify a defect that you can replicate, write a test for it so you can confirm that a) your fix actually works and b) you don't backslide in the future.
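As a sketch of what such a test might look like, assume a hypothetical defect where a size check rejected messages that were exactly at the limit. The function `fits_limit`, the constant `MAX_BYTES` and the bug itself are invented for illustration; the point is one test that reproduces the report and one that guards against over-correcting.

```python
import unittest

MAX_BYTES = 30 * 1024 * 1024  # assumed limit, for illustration only


def fits_limit(size_bytes, limit=MAX_BYTES):
    """Return True when a message of size_bytes is allowed.

    The hypothetical bug used `<` here, rejecting messages exactly at the limit.
    """
    return size_bytes <= limit


class FitsLimitRegressionTest(unittest.TestCase):
    def test_message_exactly_at_limit_is_accepted(self):
        # Reproduces the reported defect; fails against the old `<` comparison.
        self.assertTrue(fits_limit(MAX_BYTES))

    def test_message_over_limit_is_rejected(self):
        # Guards against "fixing" the bug by dropping the check altogether.
        self.assertFalse(fits_limit(MAX_BYTES + 1))
```

Run it with `python -m unittest` before and after the fix: the first test should fail on the broken code and pass on the fixed code, which covers point a), and keeping it in the suite covers point b).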
However, a big part that is missing is the reality that there is a set of hypotheses (is that right?) in play at any point in time. A lot of debugging is the cycle of
1. Think about the system, gather any available data - you can't boil the ocean
2. Consider a set of hypotheses for the possible cause (even if it is a partial cause)
3. Seek any method to either refute or confirm the possible cause, which gives more data.
Wash, rinse, repeat. Each cycle will likely get closer to the problem.
Each cycle also is likely to find other tech debt that needs to be solved.
Rarely is there a single hypothesis that is right first time. Although an experienced person will prune out a lot of poor ideas automatically, and likely subconsciously.
Observability goes a long way to getting the data needed to confirm or refute.
- sucks when your bug completely blows your project up (type error blank page)
- I'm tempted to track every click/event and log it for reproducibility
- sucks when your product fails not because of a bug but just because people don't know how to use it (training issue I guess), e.g. permissions not accepted, why isn't it working?