Friday, August 10, 2012

The Grey Screen of Death

In the previous post I mentioned that our team is focused on fixing a long-standing, elusive bug that has been very disruptive to end users, who see only the message "It's not your fault".  At one of our implementations this message keeps appearing more and more often, and the only remedy is to stop the web application and restart it.  Because this bug can disrupt an entire lab, and because until now we had been unable to work out what was causing it, we named it "the grey screen of death", alluding to the "blue screen of death" that computer users see when their system crashes and has to be restarted.

[Screenshot: current Grey Screen of Death message (in French)]

The grey screen of death has been so elusive that, until now, our troubleshooting could not determine what actually causes it.  Asking users what steps they took, looking at the logs, attempting to replicate the bug: none of it made sense or helped us solve it.  All we knew was that this error was supposed to appear when there was an uncaught exception in the system (for the technical folks: we put this in place so stack traces never printed to the browser).  In theory, the user should never have seen it.  Unfortunately, given our development priorities, we never found time to improve our error pages to give users more information that would help us help them.

This week we finally found some clues and have put in fixes for what we think is causing the problem. First, we figured out that this error appears for both uncaught exceptions (system errors) and 404 page errors.  So we've separated these into two different error screens.
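
For the technical folks, here is a minimal sketch of the idea of routing the two cases to different screens. It is not our actual code: the class name, the enum and the `select` method are hypothetical names made up for illustration, and a real web application would more likely wire this up through its error-page configuration.

```java
// Illustrative sketch only; all names here are hypothetical.
final class ErrorScreenSelector {

    // The two screens that used to be a single "It's not your fault" page.
    enum ErrorScreen { PAGE_NOT_FOUND, SYSTEM_ERROR }

    // Picks a screen from the HTTP status and the uncaught exception (if any).
    static ErrorScreen select(int httpStatus, Throwable uncaught) {
        if (uncaught != null) {
            // An uncaught exception is always a system error.
            return ErrorScreen.SYSTEM_ERROR;
        }
        if (httpStatus == 404) {
            // A missing page is not a system failure.
            return ErrorScreen.PAGE_NOT_FOUND;
        }
        // Anything else falls back to the system error screen.
        return ErrorScreen.SYSTEM_ERROR;
    }

    public static void main(String[] args) {
        System.out.println(select(404, null));
        System.out.println(select(500, new IllegalStateException("boom")));
    }
}
```

Keeping the decision in one place means a mistyped URL no longer looks like a system failure to the user.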

For the uncaught exceptions, the problem appears to come from two different workflows:

  1. Lab test results are automatically imported from an analyzer after a sample was marked non-conforming.  In other words, no sample entry was performed in the system but a non-conformity event was entered for the sample.  When the results are saved, the user is told that the save was successful, but it was not, and the analyzer result reappears on the screen.
  2. Prior to sample entry, analyzer imports of two different sample types are accepted. When the second import is saved, it causes the grey screen of death.  This particular workflow puts the application into an unstable state in which every user of the system gets the grey screen of death until the system is restarted.  For the technical folks: Hibernate does not know what to do with a transient object that has been created.  When it tries to flush the cache, an exception is thrown and the object remains in the cache.  The system is unable to remove it from the cache, so every subsequent Hibernate action continues to cause exceptions.

It has been so difficult to troubleshoot because, once the whole application became unstable, the logs showed us events from well after the original cause.
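
For the technical folks, the second workflow can be pictured with a toy model. This is not Hibernate's API and not our code: `ToySession`, `queue`, `flush` and `clear` are hypothetical stand-ins, and the "transient:" prefix is just a marker for an object that cannot be written. The sketch only shows why one bad object left in a shared cache makes every later flush fail, and why the failures seen later have nothing to do with what the user was doing at the time.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model only: NOT Hibernate's API. All names are hypothetical.
final class StuckSessionDemo {

    static final class ToySession {
        private final List<String> pending = new ArrayList<>();

        void queue(String entity) {
            pending.add(entity);
        }

        // Tries to write everything that is pending.
        void flush() {
            for (String entity : pending) {
                if (entity.startsWith("transient:")) {
                    // The bad entity is not removed, so it is still pending next time.
                    throw new IllegalStateException("cannot flush " + entity);
                }
            }
            pending.clear();
        }

        // Discards all pending state (in this model, the analogue of a restart).
        void clear() {
            pending.clear();
        }
    }

    // Returns true if the flush succeeded, false if it threw.
    static boolean tryFlush(ToySession session) {
        try {
            session.flush();
            return true;
        } catch (IllegalStateException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        ToySession shared = new ToySession();
        shared.queue("transient:analyzer-result");
        System.out.println(tryFlush(shared)); // false: the bad object fails

        shared.queue("sample-42");
        System.out.println(tryFlush(shared)); // false: unrelated work fails too

        shared.clear();
        shared.queue("sample-42");
        System.out.println(tryFlush(shared)); // true: works once the state is discarded
    }
}
```

In the model, the second failure is triggered by a perfectly valid save, which matches what we saw: the errors in the logs came from innocent actions well after the real cause.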

While we were tracking down the cause of the bug, our users told us adamantly that they needed better feedback on the screen whenever they receive any type of error.  Simply stating "It's not your fault" only frustrated them, because they wanted to know what was happening to the system and how to get help to solve it.  So we spent time improving this.  I will talk about those improvements in the next post.

