On the Perils of Failing Silently

Last week we discovered a bug in production where a certain user flow would crash. Naturally, since this was a high-profile flow, the bug was discovered on Thursday evening (the end of the work week in Israel). And naturally, we could quickly reproduce the error in our live site, but it would not reproduce in any non-production environment. Par for the course.

Diving into the issue, we saw that under this set of circumstances, the client was receiving a(n inappropriate) 403, with no further information. Since the flow did not require authentication, this was particularly suspect. We started iterating through the various usual suspects – could our nginx be at fault? Is our load-balancer denying this particular flow? A broken SSL certificate perhaps? Or was something in the code broken, or just deep-six’ing the request with no further info?

Our usual log systems did not show any errors, and upon further inspection we realized that the breaking flow did not result in any applicative logging at all, which was especially odd since we make heavy use of logs and error-log anything that breaks. The lack of applicative logs along with the silent 403 led to the expected back-and-forth between the application code developers and the dev-ops infra people, where each side assumed it’s the other’s fault. Hours spent, dollars lost.

Eventually we traced it down to a particular legacy before_filter (we ride on Rails) which was affected by recent site configuration changes. This before_filter denied the request, but did so extremely silently:

render :nothing => true, :status => 403 and return

Once found, we could quickly verify this was indeed the issue and fix the before_filter to match the configuration changes, making everything go back to normal. However, it had taken us precious time to debug this error – hours with the flow broken, affecting users.

This could have been solved much, much quicker if the before_filter would not be so silent. The act of denial could be appropriately logged, but that’s not so straightforward – we wouldn’t want to log it as an error (since a truthful denial/403 is not an error), and any non-error logging would not be immediately obvious (what we would be looking for, anyway?).

A good way to solve this would be to simply return a non-empty error message to the client. Say:

render :text => "Action denied, Sorry-O.", :status => 403 and return

With a non-empty error message as above, a simple grep would quickly locate the root problem, enabling near-immmediate response.

$ grep "Action denied, Sorry-O." -ir .

Don’t give out compromising information, but *say something*, so your error is identifiable. Rendering ‘nothing’ is the equivalent of saying “Why? Because Fuck You, that’s why.”

Moral of the story: Never fail silently. Just like kids are taught: if you see something wrong, SAY SOMETHING.