Last week, one of our Glassfish instances stopped responding. The process was running, but it was no longer handling requests. The good news is that the load balancer automatically failed over, so there was no downtime to the site. The bad news is that we didn’t receive any direct notification of the failure. We have monitoring on the box, but it is primarily at the system level. In this case, everything was fine with the system; it was just the JVM that was having issues. And the problem wasn’t load per se, but rather the lack of it. Looking at the Ganglia graphs, the only suspicious thing was the absence of activity.
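In hindsight, an application-level probe would have caught this: something that actually makes a request and treats "no answer within N seconds" as a failure, rather than just checking that the process exists. Here is a minimal sketch of that idea in plain Java. The class name `AppHealthProbe`, the URL, and the timeout values are all made up for illustration; this is not what we run in production, just the shape of the check.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URI;

// Hypothetical application-level probe: "healthy" means the server actually
// answered an HTTP request in time, not merely that the JVM process is alive.
final class AppHealthProbe {

    /** Returns true only if the URL answers with a 2xx status within the timeout. */
    static boolean isResponsive(String url, int timeoutMillis) {
        HttpURLConnection conn = null;
        try {
            conn = (HttpURLConnection) URI.create(url).toURL().openConnection();
            conn.setRequestMethod("GET");
            conn.setConnectTimeout(timeoutMillis); // server not accepting connections
            conn.setReadTimeout(timeoutMillis);    // server accepted but never replied (a hung JVM)
            int status = conn.getResponseCode();
            return status >= 200 && status < 300;
        } catch (IOException | IllegalArgumentException | ClassCastException e) {
            // Timeouts, refused connections, and unusable URLs all count as "not responsive".
            return false;
        } finally {
            if (conn != null) {
                conn.disconnect();
            }
        }
    }

    public static void main(String[] args) {
        // Assumed default URL; pass the real application URL as the first argument.
        String url = args.length > 0 ? args[0] : "http://localhost:8080/";
        System.out.println(isResponsive(url, 5000) ? "OK" : "UNRESPONSIVE");
    }
}
```

The read timeout is the important part. A locked-up JVM will often still accept the TCP connection and then sit on it forever, which is exactly the case that system-level monitoring misses. Hooking the result into whatever alerting you already have is left out here.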

To rectify the situation, we brought the application server up and down a few times and tried redeploying the application, but still no dice. We had previously seen occasions where Glassfish had become corrupted, so the next action was to rebuild the instance. One nice feature of Glassfish is that it is quite scriptable, and we fleshed out our script for rebuilding a production instance. Strangely, rebuilding the application server didn’t seem to help. The clean instance would run for a while and then just lock up, with no discernible pattern to when it happened.

We were feeling really stumped. As a last resort, we decided to reboot the server. This is something that I would have considered earlier had it been a Windows box, but this was a Linux server that had been up and running reliably since we first commissioned it seven months earlier. Also, this seemed to be a JVM issue, and the JVM process was being brought up and down with each application server restart. Fortunately, rebooting seemed to do the trick. There must have been some malignant process or OS lock that was interfering with the JVM, but it wasn’t clear what the cause was.

Unfortunately, this wasn’t the end of our problems. When we decided to rebuild Glassfish, we had opted to upgrade from v2ur2 to v2.1. Many of us had been running Glassfish v2.1 in development and it seemed more reliable than the v2ur2 release. Besides, it was just a minor point upgrade. When we went to reconnect our remote clients to the rebuilt server, they started throwing SerializationExceptions on a Sun library OrderedSet class. The IIOP/CORBA communication protocol uses binary serialization to transmit objects for remote JNDI lookups as part of the JMS handshake. Some genius on the project had decided to upgrade a key library as part of a point release, breaking backwards compatibility for standard JMS clients. Nice.

Buried in the Glassfish 2.1 upgrade guide, the Application Client Interoperability section states:

You cannot run application clients with one version of the application server runtime with a server that has a different version. Most often, this would happen if you upgraded the server but had not upgraded all the application client installations. You can use the Java Web Start support to distribute and launch the application client. If the runtime on the server has changed since the end-user last used the application client, Java Web Start automatically retrieves the updated runtime. Java Web Start enables you to keep the clients and servers synchronized and using the same runtime.

WTF!?! What kind of an upgrade process is this? Upgrading the application server requires simultaneously upgrading all clients? Ain’t gonna happen. It’s essentially guaranteeing version lock-in. And recommending Java Web Start is fine for distributed client applications, but not for long-running autonomous processes.
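For anyone wondering why a library bump on one side is enough to break the conversation, here is a rough sketch of the general mechanism in plain Java serialization. To be clear about what is assumed: I don't know that a serialVersionUID mismatch is exactly what bit us with OrderedSet, and `OrderedItems` below is a made-up stand-in, not the Sun class. The demo writes an object, then flips one bit of the version ID in the byte stream to mimic a peer running a different build of the same class.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InvalidClassException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;

final class SerialVersionDemo {

    // Hypothetical stand-in for a library class that travels over the wire.
    static class OrderedItems implements Serializable {
        private static final long serialVersionUID = 1L;
        final String[] items;
        OrderedItems(String... items) { this.items = items; }
    }

    static byte[] serialize(Object value) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(value);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    /** Simulates a sender on another library version by altering the UID in the stream. */
    static byte[] asIfFromOtherVersion(byte[] data, Class<?> type) {
        byte[] copy = data.clone();
        // Stream layout: 4-byte header, TC_OBJECT, TC_CLASSDESC,
        // 2-byte name length, class name (ASCII here), then the 8-byte serialVersionUID.
        int uidStart = 4 + 1 + 1 + 2 + type.getName().length();
        copy[uidStart + 7] ^= 0x01; // flip the lowest bit of the UID
        return copy;
    }

    /** True if the local class accepts the stream, false on a version mismatch. */
    static boolean canRead(byte[] data) {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(data))) {
            in.readObject();
            return true;
        } catch (InvalidClassException e) {
            return false; // "local class incompatible": stream UID differs from local UID
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        byte[] wire = serialize(new OrderedItems("a", "b"));
        System.out.println("same version readable:  " + canRead(wire));
        System.out.println("other version readable: "
                + canRead(asIfFromOtherVersion(wire, OrderedItems.class)));
    }
}
```

The point is that the receiver rejects the object outright instead of degrading gracefully, so every party on the wire has to carry a compatible build of every class that crosses it.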

Anyway, downgrading Glassfish back to v2ur2 resolved the connectivity problem. The v2.1 compatibility problem exposed the deeper issue: JMS, at least in the default CORBA implementation, is a tightly coupled train wreck waiting to happen, especially given Sun’s cavalier attitude toward upgrades. It’s time to pursue alternate communication protocols built on open standards like, say, XMPP.