Timeline of Events
Yesterday afternoon at around 3:30 pm, the Help Desk reported that people were calling in to say that Blackboard was running extremely slowly. Roger found one BB server that appeared to be the problem and rebooted it. The symptoms did not improve, so Roger rebooted all of the BB servers, one at a time. After the reboots, all Blackboard servers were unresponsive, returning a “503 Service Temporarily Unavailable” error message. It then became obvious that the problem had an unidentified root cause that was common to all of the servers. About this time, the ITS Services Status Blog was updated and a campus-wide e-mail was sent out letting everyone know we were aware of the problem and working on it.
Roger and Tom worked together to find commonalities across the servers and tried several different ideas for resolving the issue. Not finding a quick solution, they contacted Blackboard Technical Support, who joined in the hunt for a resolution. This process continued for several hours. Shortly after 7:00 pm, one of the five BB web servers began responding normally again. Within 30 minutes or so, all five web servers were functioning as expected.
While we are happy to see the servers working again, we still don’t know the root cause of the problem. What we do know is that the servers, which normally take about 2-3 minutes to perform a full reboot, are taking about 30 minutes to complete the boot cycle. BB TS has collected some data about the boot processes and is studying it to see if they can identify why BB is taking so long to come up.
Speaking for myself, I think the response by everyone involved yesterday was top notch: the Help Desk communicated the issue promptly; the development team took immediate steps to resolve the issue as quickly as possible; BB TS was engaged at the right time; other ITS staff provided all the help they could; and communications with the campus were timely and effective. Sometimes, the best thing people can do is to protect those who are working on resolving these issues from unneeded disruptions, and that seemed (to me, at least) to be what happened yesterday.
Some things that we’ve discovered through this experience include:
- The monitors we have in place could be tuned to better identify problems from the user perspective in real time. For example, in this case the HTTP web service running on the server was functioning, but not returning the expected login page. Our monitors should be tuned to ensure that HTTP not only responds, but responds as expected.
- The existing BB load balancers also do not check for expected data in the HTTP/HTTPS response, so they did not mark the failed servers out of service. For a time, this allowed some people to log in normally while others kept getting the error pages. It would be beneficial to implement an enhanced load balancing solution for Blackboard that will allow such fine tuning of our tests to ensure the best possible user experience.
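To make the idea behind both points concrete, here is a minimal sketch in Python of a check that treats a server as healthy only when it returns the expected page, not merely when HTTP answers. It is an illustration under assumptions, not a description of our current monitors or load balancers: the URL, the marker text, and the function names are all hypothetical placeholders.

```python
import urllib.error
import urllib.request

# Hypothetical values for illustration only; the real login URL and the
# text that identifies a good login page are not given in this report.
LOGIN_URL = "https://blackboard.example.edu/"
EXPECTED_MARKER = "Login"


def evaluate(status, body, marker):
    """Return (healthy, reason) for a single HTTP response."""
    if status != 200:
        return False, f"unexpected status {status}"
    if marker not in body:
        # The web service answered, but not with the page users expect.
        return False, "expected page content missing"
    return True, "ok"


def probe(url, marker, timeout=10.0):
    """Fetch url and judge it on its content, not just on connectivity."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            body = resp.read().decode("utf-8", errors="replace")
            return evaluate(resp.status, body, marker)
    except urllib.error.HTTPError as err:
        # urllib raises HTTPError for 4xx/5xx responses such as a 503.
        return False, f"unexpected status {err.code}"
    except OSError as err:
        # Covers URLError, connection failures, and timeouts.
        return False, f"request failed: {err}"
```

Calling `probe(LOGIN_URL, EXPECTED_MARKER)` against each web server individually, rather than only against the load-balanced address, would report a server that answers with a 503 or an error page as down even though its HTTP service is still responding.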
My sincerest thanks to everyone who participated in any way, but especially to Roger, Tom, and the Help Desk staff for sticking with it into the late evening hours until Blackboard was fully functional once again. I invite others to add their comments about this incident, how it was managed, and ways that we could improve our response in the future.