To me, that seems too important to have zero visibility into. Just eyeballing the graphs, your queue size grew at 750m/s from 17:49 to 17:50. You then started shedding at 17:50 for 40s. Assuming the ingress rate was roughly linear (which it looks like it was), you shed ~30,000 requests out of 3-4M. Does that not seem high to you?
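For anyone checking the back-of-the-envelope numbers, here is a rough sketch of the arithmetic. It assumes "750m/s" means the queue was growing by about 750 requests per second, and that the same excess had to be dropped each second once shedding kicked in. The variable names are mine, not anything from the system being discussed.

```python
# Assumption: queue growth of ~750 requests/s equals the excess that must be
# shed per second once the queue is full. Both figures are read off the graphs.
excess_rate_per_s = 750
shed_duration_s = 40

shed_requests = excess_rate_per_s * shed_duration_s

# The total request count is only known as a range (3-4M), so show both ends.
for total in (3_000_000, 4_000_000):
    fraction = shed_requests / total
    print(f"{shed_requests:,} shed out of {total:,} = {fraction:.2%}")
```

That works out to somewhere between 0.75% and 1% of requests dropped.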
This system seems great for at-most-once delivery. I wish I had more problems to solve with that constraint.