Begin forwarded message:
Yesterday’s UIUC Shibboleth outage was caused by a problem with the terracotta cluster engine, which is used by Shibboleth to share live session data across our Shibboleth IDP nodes. From Keith Wessel:
The standby node lost contact with the active node a couple of times during the DC network work yesterday morning. The last time it did, it was halfway through a re-initialization of its database. This left it in a hung and uninitialized state. We’ve only seen this once before.
By early afternoon, the unstable terracotta process had consumed enough memory to disrupt Tomcat service on the lead Shibboleth node. The problem was reported to the ID Management group at 2:30pm, and service was restored at about 2:50pm. (Unfortunately, the response from service management was somewhat delayed by a fire drill going on in DCL.)
In the short term, Keith will set up a monitor script to watch the terracotta logs and alert on this problem, so that next time it can be fixed before it causes a service disruption. In the longer term, the plan is to reconfigure Shibboleth so that terracotta is not required.
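For anyone curious what such a log monitor might look like, here is a minimal sketch in Python. It is not Keith's actual script. The patterns in ALERT_PATTERNS are hypothetical placeholders, since this report does not quote the real terracotta log messages, and the function names are made up for illustration.

```python
import re
from typing import Iterable, List

# Hypothetical patterns: the real terracotta log messages are not given
# in this report, so these would need to be replaced with actual ones.
ALERT_PATTERNS = [r"lost contact", r"uninitialized"]


def find_alert_lines(lines: Iterable[str],
                     patterns: List[str] = ALERT_PATTERNS) -> List[str]:
    """Return the log lines that match any of the alert patterns."""
    compiled = [re.compile(p, re.IGNORECASE) for p in patterns]
    return [
        line.rstrip("\n")
        for line in lines
        if any(c.search(line) for c in compiled)
    ]


def check_log(path: str) -> List[str]:
    """Scan a log file once and return any matching lines."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        return find_alert_lines(handle)
```

A scheduled job could call check_log on the terracotta log and send a notification whenever the returned list is not empty. How the alert is delivered is left open here.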
Mark Nye (email@example.com)
Collaborative Services Team, CITES AS
University of Illinois at Urbana-Champaign