At my company, we use a form of Continuous Monitoring: every time our system logs a warning or an error we immediately receive an email identifying the source and nature of the problem. This allows us to respond rapidly to problems as they arise and gives us good visibility into the health of our system. Following the mantra of “do in test as is done in prod”, we have the same monitoring system set up in both environments to help us find issues in test before they find their way into production.
The downside to this level of monitoring is that it can amount to a lot of messages. Our challenge is to manage the signal-to-noise ratio so that the alerts that matter get noticed and acted on, rather than drowned out by noise.
As part of our 5 Whys activity for production issues, we have found that most production issues actually occurred first in test, but went unnoticed. This provides a compelling reason to keep the signal-to-noise ratio high in all environments. Any time we find ourselves automatically archiving or filtering an alert, it indicates an opportunity for improvement.
We have found that refining and tuning these alert messages is an ongoing maintenance activity. As part of our weekly meeting, we try to select one message to clarify or dispatch each week. We have a script that trawls the support emails received in the past week and builds a Pareto distribution of message counts by logger. This helps us decide where to focus our efforts and quantify the impact of our actions on the volume of messages we receive.
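The script itself is nothing more than a frequency count. Below is a stripped-down, hypothetical sketch of the idea in Java – the input file, its format and the class name are invented for illustration; our real script trawls the support mailbox rather than a text file:

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.*;

/**
 * Toy sketch of the Pareto report: count alert messages per logger and list
 * loggers from noisiest to quietest with a cumulative percentage. Assumes one
 * alert per line in "alerts.txt" with the logger name as the first
 * whitespace-separated token.
 */
public class AlertPareto {
    public static void main(String[] args) throws IOException {
        Map<String, Integer> countsByLogger = new HashMap<String, Integer>();
        int total = 0;

        BufferedReader in = new BufferedReader(new FileReader("alerts.txt"));
        String line;
        while ((line = in.readLine()) != null) {
            if (line.trim().isEmpty()) {
                continue;
            }
            String logger = line.trim().split("\\s+")[0];
            Integer count = countsByLogger.get(logger);
            countsByLogger.put(logger, count == null ? 1 : count + 1);
            total++;
        }
        in.close();

        // Sort loggers by descending message count.
        List<Map.Entry<String, Integer>> entries =
                new ArrayList<Map.Entry<String, Integer>>(countsByLogger.entrySet());
        Collections.sort(entries, new Comparator<Map.Entry<String, Integer>>() {
            public int compare(Map.Entry<String, Integer> a, Map.Entry<String, Integer> b) {
                return b.getValue() - a.getValue();
            }
        });

        // Print each logger with its count and a running cumulative percentage.
        int runningTotal = 0;
        for (Map.Entry<String, Integer> e : entries) {
            runningTotal += e.getValue();
            System.out.printf("%-50s %6d  %5.1f%%%n",
                    e.getKey(), e.getValue(), 100.0 * runningTotal / total);
        }
    }
}
```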
What kinds of things we need to be alerted about is difficult to assess in advance. Often, things that we are concerned about when building a feature turn out to be less important in production, and conversely, we miss things in development that turn out to be very important once real customers start using them. Fortunately, deploying every week gives us plenty of opportunity for improvement. Also, if a message is logged more frequently than intended, we only have to put up with it for a week before it can be rectified.
I should mention that we have a circuit breaker in place in the log monitor. We do not allow duplicate messages to be sent any more frequently than once per hour. (Relatively early on we managed to get temporarily blacklisted by a mail provider when an errant message was generated much too frequently).
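The throttle itself is simple. Here is a minimal sketch of the idea in Java – not our actual monitor code, and the class and method names are made up – keyed on the logger and message text:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/**
 * Sketch of the once-per-hour circuit breaker: a duplicate alert is only
 * allowed through if an hour has passed since the last time the same
 * logger/message combination was sent. Illustrative only.
 */
public class AlertThrottle {
    private static final long ONE_HOUR_MS = 60 * 60 * 1000L;
    private final Map<String, Long> lastSentByKey = new ConcurrentHashMap<String, Long>();

    /** Returns true if this alert should be emailed, false if it is a recent duplicate. */
    public boolean shouldSend(String logger, String message) {
        String key = logger + "|" + message;
        long now = System.currentTimeMillis();
        Long lastSent = lastSentByKey.get(key);
        if (lastSent != null && now - lastSent < ONE_HOUR_MS) {
            return false; // duplicate within the hour - suppress it
        }
        lastSentByKey.put(key, now);
        return true;
    }
}
```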
In terms of managing the signal-to-noise ratio, I’ve found that the messages tend to fall into a few broad categories.
In the (enterprise) environments that I worked in previously, there was very little interaction between development and operations. Logs were used only for analyzing severe production problems – generally after a crash, or after a user had reported an issue. The log files were poorly tuned for diagnosing problems and they tended to be full of junk – problems that no one had noticed or reported and that may have been going on for months (or longer).
In contrast, the approach that we follow at my current company means we are able to use logs to proactively find and remedy problems. It requires effort to maintain a high signal-to-noise ratio, but it is very worthwhile.
After our recruitment manager departed on bereavement leave, the task of bringing candidates through the recruiting pipeline fell on the shoulders of the team. One of the developers who took up the recruitment mantle came up with the idea of using Jira to track candidates through the interview process, and I have to say that it’s working pretty well.
A candidate’s application is treated as an issue in Jira, and the pipeline is loosely modeled as a workflow. We track phone screens and interview notes as comments against the candidate/issue. The issue can be assigned to whoever is responsible for interviewing or following up with the candidate. We receive emails about the status of the candidate through Jira’s notification scheme. The main debate was whether to treat candidates as “Bugs”, “Tasks”, “Improvements” or “New Features”. We settled on “New Features” :).
Using Jira certainly beats managing this information through Word docs, spreadsheets or closed HR systems. The main weakness is that scheduling is not that well supported – you can kind of give a candidate a due date, but it’s not that natural. But for now, it is just light enough for our needs.
Last week, one of our Glassfish instances stopped responding. The process was running, but no longer handling requests. The good news is that the load balancer automatically failed over so there was no downtime to the site. The bad news is that we didn’t receive any direct notification of the failure. We have monitoring on the box, but it is primarily at a system-level. In this case, everything was fine with the system, it was just the JVM that was having issues. And the problem wasn’t load per se, more lack thereof. Looking at the Ganglia graphs, the only thing suspicious was the absence of activity.
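In hindsight, what was missing was a check that exercises the application itself rather than just the box. Something along the following lines would have flagged the unresponsive instance – this is only an illustrative sketch, and the URL, timeouts and alerting are placeholders rather than anything we actually run:

```java
import java.net.HttpURLConnection;
import java.net.URL;

/**
 * Sketch of an application-level health check: hit a known page on each
 * instance and complain if it doesn't answer within a few seconds.
 */
public class HttpHealthCheck {
    public static boolean isHealthy(String instanceUrl) {
        try {
            HttpURLConnection conn = (HttpURLConnection) new URL(instanceUrl).openConnection();
            conn.setConnectTimeout(5000);  // 5s to connect
            conn.setReadTimeout(5000);     // 5s to respond
            conn.setRequestMethod("GET");
            int status = conn.getResponseCode();
            conn.disconnect();
            return status == 200;
        } catch (Exception e) {
            return false; // refused, timed out, etc. - the instance is not serving requests
        }
    }

    public static void main(String[] args) {
        String url = "http://app-server-1:8080/healthcheck"; // hypothetical instance URL
        if (!isHealthy(url)) {
            System.err.println("ALERT: " + url + " is not responding");
            // in practice this would feed the same email alerting as the log monitor
        }
    }
}
```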
To rectify the situation, we brought the application server up and down a few times and tried redeploying the application, but still no dice. We had previously seen occasions where Glassfish had become corrupted, so the next action was to rebuild the instance. One nice feature of Glassfish is that it is quite scriptable and we fleshed out our script for rebuilding a production instance. Strangely, rebuilding the application server didn’t seem to help. The clean instance would run for a while and then just lock up. It seemed to do this non-deterministically.
We were feeling really stumped. As a last resort, we decided to reboot the server. This is something that I would have considered earlier if it had been a Windows box, but this was a Linux server that had been up and running reliably since we first commissioned it 7 months earlier. Also, this seemed to be a JVM issue, and the JVM process was being brought up and down with each application server restart. Fortunately, rebooting seemed to do the trick. There must have been some malignant process or OS lock that was interfering with the JVM, but it wasn’t clear what the cause was.
Unfortunately this wasn’t the end of our problems. When we decided to rebuild Glassfish, we had opted to upgrade from v2ur2 to v2.1. Many of us had been running Glassfish 2.1 in development and it seemed more reliable than the v2ur2 release. Besides, it was just a minor point upgrade. When we went to reconnect our remote clients with the rebuilt server, they started throwing SerializationExceptions on a Sun library OrderedSet class. The IIOP/CORBA communication protocol uses binary serialization to transmit objects for remote JNDI lookups as part of the JMS handshake. Some genius on the project had decided to upgrade a key library as part of a point release, breaking backwards compatibility for standard JMS clients. Nice.
Buried in the Glassfish 2.1 upgrade guide, the Application Client Interoperability section states:
You cannot run application clients with one version of the application server runtime with a server that has a different version. Most often, this would happen if you upgraded the server but had not upgraded all the application client installations. You can use the Java Web Start support to distribute and launch the application client. If the runtime on the server has changed since the end-user last used the application client, Java Web Start automatically retrieves the updated runtime. Java Web Start enables you to keep the clients and servers synchronized and using the same runtime.
WTF!?! What kind of an upgrade process is this? Upgrading the application server requires simultaneously upgrading all clients? Ain’t gonna happen. It essentially guarantees version lock-down. And recommending Java Web Start is fine for distributed client applications, but not for long-running autonomous processes.
Anyway, downgrading Glassfish back to v2ur2 resolved the connectivity problem. The v2.1 compatibility problem exposed the deeper issue: that JMS, at least the default CORBA implementation, is a tightly coupled train-wreck waiting to happen, especially with Sun’s cavalier attitude toward upgrades. It’s time to pursue alternate communication protocols built on open standards like, say, XMPP.
Last week, we were fortunate to have Eric Ries come out and spend some time talking with our team while he was here for the Agile Vancouver event. We had the chance to talk about 5 whys, split testing and other topics. I would have liked to spend a bit more time discussing continuous deployment, but I did get some more insight into how they got started with CD at IMVU.
One thing that I was surprised to learn was that IMVU started out with continuous deployment. They were deploying to production with every commit before they had an automated build server or extensive automated test coverage in place. Intuitively this seemed completely backwards to me – surely it would be better to start with CI, build up the test coverage until it reached an acceptable level and then work on deploying continuously. In retrospect and with a better understanding of their context, their approach makes perfect sense. Moreover, approaching the problem from the direction that I had intuitively is a recipe for never reaching a point where continuous deployment is feasible.
Initially, IMVU sought to quickly build a product that would prove out the soundness of their ideas and test the validity of their business model. Their initial users were super early adopters who were willing to trade quality for access to new features. Getting features and fixes into the hands of users was the greatest priority – a test environment would just get in the way and slow down the validation coming from having code running in production. As the product matured, they were able to ratchet up the quality to prevent regression on features that had been truly embraced by their customers.
Second, leveraging a dynamic scripting language (like PHP) for building web applications made it easy to quickly set up a simple, non-disruptive deployment process. There are no compilation or packaging steps that would normally be performed by an automated build server – just copy the files and change the symlink.
Third, they evolved ways to selectively expose functionality to sets of users. As Eric said, “at IMVU, ‘release’ is a marketing term”. New functionality could be living in production for days or weeks before being released to the majority of users. They could test, get feedback and refine a new feature with a subset of users until it was ready for wider consumption. Users were not just an extension of the testing team – they were an extension of the product design team.
Understanding these three factors makes it clear why continuous deployment was a starting point for IMVU. In contrast, at most organizations – especially those with mature products – high quality is the starting point. It is assumed that users will not tolerate any decrease in quality. Users should only see new functionality once it is ready, fully implemented and thoroughly tested, lest they get a bad impression of the product that could adversely affect the company’s brand. They would rather build the wrong product well than risk this kind of exposure. In this context, the automated test coverage would need to be so comprehensive that continuous deployment becomes infeasible for most systems. Starting instead from a position where feedback cycle time is the priority and allowing quality to ratchet up as the product matures provides a more natural lead-in to continuous deployment.
For my company, even though we do weekly deployments, we’re still a fair way off from being able to deploy continuously. As we are operating in a new and rapidly evolving market, we focus on building and releasing a simple initial version of new features to demonstrate the potential of the software. We can then receive feedback and invest more effort in expanding the features that resonate with our clients. While we do routinely expose new functionality selectively to a subset of users (generally internal users) to solicit feedback, we still need to create more sophisticated ways to do user segmentation. Aside from the obvious bugbear of automated test coverage (we use JUnit and Selenium, but our coverage isn’t nearly good enough), our main blocking issue from a technology perspective is the deployment process itself.
To deploy continuously, the deployment has to be quick and it has to be transparent to end users (i.e. there should be no visible downtime). Performing a rollback should have the same characteristics. Our deployment process is automated, but in the world of Java application servers (even lightweight ones like Glassfish) deployment is anything but fast. Deployment entails all kinds of work that the app server needs to do (parsing configuration files, generating WSDLs, starting thread pools, etc.) during which the application is unresponsive. Also, because of memory leak issues in the container, we always restart the application server with each deployment anyway. All in all, the only way to avoid downtime is to pull the application server out of the load balancer pool until the deployment completes. Rollback is the same process in reverse.
A bit of an aside, but I know of some teams that package Glassfish with their app, inverting the container metaphor and simply treating it as another library/dependency. This makes it easier to just flip the symlink on deployment and rollback. It’s an interesting idea, as long as you don’t mind copying a massive WAR to production with each deploy (which for us would just shift the deployment bottleneck to the network).
We have made a fair bit of headway on streamlining our deployment process, and while we’re not ready to do continuous deployments into production, I am trying to get us into a position where we can do continuous deployment to test. I used to be of the opinion that deployment to test was something that should be controlled by testers (via a deploy button on the automated build server). Most testers want to work against a stable baseline, limiting the number of variables that they are dealing with when testing the app. But this is a fallacy, because a batch of changes simply piles up behind whatever version is deployed into test. It’s classic batch-and-queue thinking.
What if deployments happened without downtime in a way that was invisible to the tester or the end user? What if test coverage was sufficient to ensure that there would be no regression on major areas of functionality? I think that the fears of continuous deployment into test and the need for a stable baseline would evaporate. Moreover, this is something that we would want to test because it would mirror the experience of users using the site when a new version goes into production. In our office, every time we do a deployment to test, someone needs to call out “deploying to test”. This too would go away.
That’s the plan anyway. Over the next couple of weeks, I’ll see if we can move closer to achieving it. I’ll let you know how it goes.
If you’re looking for some quick ways to improve the performance of your site, Steve Souders’ High Performance Web Sites is packed with great advice. You don’t even need to buy the book, as most of the information is available through links from the Firefox YSlow plugin. We have been picking one rule every couple of weeks to focus on, and this past week we spent a bit of time adding far future expires headers for the Flex SWFs on our site.
Far future expires means setting the expires HTTP header for static content to some date far in the future. Effectively, this means that static content within a web site will always be loaded from the browser cache after it is first requested. This has the impact of greatly improving the load time for your site as well as reducing the number of requests sent to your web servers. The flip-side, however, is that because the cached content never expires, if you do need to change an image or a stylesheet then the user will need to clear their browser cache before they see it.
Hence, taking advantage of far future expires means taking responsibility for versioning static content on the server. Any time static content changes, it needs to be served up from a different URL. In his book, Steve Souders alludes to the approach that they follow at Yahoo! to achieve this, but he doesn’t give enough detail to just go ahead and implement it. So here is how we’re solving the problem.
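In outline: pages reference every static resource through a URL that embeds a version token (we key ours off the build number), and anything served from that versioned path gets a far-future Expires header, so a new build naturally points browsers at fresh URLs. The sketch below is a simplified illustration rather than our production code – the path layout, class names and hard-coded version are placeholders:

```java
import java.io.IOException;
import javax.servlet.*;
import javax.servlet.http.HttpServletResponse;

/**
 * Sketch of a filter mapped to /static/* that marks responses as cacheable
 * for roughly ten years. Because every build links to its static content
 * through a versioned path such as /static/1234/charts.swf, changing the
 * content changes the URL, so the stale cached copy is never requested.
 */
public class FarFutureExpiresFilter implements Filter {
    private static final long TEN_YEARS_MS = 10L * 365 * 24 * 60 * 60 * 1000;

    public void doFilter(ServletRequest request, ServletResponse response, FilterChain chain)
            throws IOException, ServletException {
        HttpServletResponse resp = (HttpServletResponse) response;

        // Far-future Expires plus an equivalent Cache-Control max-age (in seconds).
        resp.setDateHeader("Expires", System.currentTimeMillis() + TEN_YEARS_MS);
        resp.setHeader("Cache-Control", "public, max-age=" + (TEN_YEARS_MS / 1000));

        chain.doFilter(request, response);
    }

    public void init(FilterConfig config) {}
    public void destroy() {}
}

/** Builds the versioned URLs that pages reference, e.g. StaticUrls.staticUrl("charts.swf"). */
class StaticUrls {
    // In practice this would come from the build; hard-coded here for illustration.
    private static final String BUILD_VERSION = "1234";

    public static String staticUrl(String resource) {
        return "/static/" + BUILD_VERSION + "/" + resource;
    }
}
```

On the server side, the version segment either maps to a real directory laid down at deploy time or gets stripped by a rewrite rule before the file is resolved – either way, the content behind a given URL never changes.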
If you have suggestions for a better way to version static content or if I can provide more clarity on the approach that we’re using, please let me know.
If you missed the chance to register for the Agile Vancouver Lean Event, Eric Ries has agreed to give a second talk while he’s in Vancouver. Eric will also be speaking at the Vancouver Ruby/Rails/Merb event at Workspace at 7pm on Monday April 20th.
Last week, we spent some time adding better timezone handling to the application – specifically, the ability to view data in the data source’s timezone rather than the user’s local timezone. Our application leverages Adobe Flex for charting and data visualization, and suffice it to say that Flex’s timezone support is frankly lacking. Flex supports determining the UTC offset, which is fine when displaying data in a user’s local timezone, but it’s insufficient for working with alternate timezones.
The recommended advice is to keep dates on the Flex side strictly in UTC and leave all date and timezone manipulation to the server. The server returns UTC dates (epochs) shifted relative to the timezone that the data should be displayed in.
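On the server side (Java, in our case), that shift is just the display timezone’s offset at the instant in question, added to the real UTC value, so that a client rendering the result “as UTC” shows the wall-clock time of the data source. A minimal sketch, with illustrative names:

```java
import java.util.TimeZone;

/**
 * Sketch of the server-side shift: take a real instant (epoch millis, UTC)
 * and add the display timezone's offset at that instant. Illustrative only.
 */
public class DisplayTimeShifter {
    public static long shiftForDisplay(long utcMillis, TimeZone displayZone) {
        // getOffset() accounts for DST at the instant in question.
        return utcMillis + displayZone.getOffset(utcMillis);
    }

    public static void main(String[] args) {
        long now = System.currentTimeMillis();
        long shifted = shiftForDisplay(now, TimeZone.getTimeZone("America/Vancouver"));
        System.out.println("raw UTC millis:      " + now);
        System.out.println("shifted for display: " + shifted);
    }
}
```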
One challenge is that not all Flex controls work with UTC dates directly, meaning that there is inevitably some back and forth between local dates and UTC on the client side. Also, as we discovered, the UTC setters on a date tend to have unexpected side effects. It is generally better to create a new UTC date from a local date rather than invoke its UTC setters directly. Testing can also be tricky, as bugs may only be visible at certain times of day (i.e. when it’s tomorrow in GMT but still today locally) or of the month (at month boundaries).
One last little bit of date fun with Flex: the Flex data visualization/charting package has a tendency to crash your browser when viewing data at DST rollovers. The bug was raised over a year ago and ostensibly fixed before this year’s DST rollover at the start of March, but the fix still hasn’t found its way into a release (at least as of this post).
With all of the time invested into supporting timezones, I can’t help but feel we’d be better off adopting some standard global metric time.
On April 21st, Agile Vancouver is hosting a mini-conference on Lean Software Development. We have organized a great trio of speakers:
It’s shaping up to be a really great event. We only opened registration last week and we’re already booked out.
Inspired by Eric Ries’ SEM on five dollars a day article, I decided to set up a little AdWords campaign for our Earth Hour event. The campaign was a bit of an experiment as we hadn’t advertised using AdWords before and weren’t too sure what to expect.
Overall, I was hugely impressed – the ad was able to drive fairly significant traffic to the site at a very reasonable cost. We were able to get great ad placement (often ours was the only ad showing) with a very low cost-per-click. Certainly in comparison with more conventional forms of advertisement, AdWords – especially with Google Analytics integration – offers more control and much better feedback and information.
One thing that struck me going through the process of getting set up with AdWords is that there is significant depth to the application and much to be learned about how to build an optimal campaign with it. One thing that I struggled a bit with was how to determine the estimated cost-per-click for newly added keywords. It turned out to be available after some digging around, but it strikes me that this is something that should be directly visible on the ad group page. Another thing I found was that while a phrase keyword match was generally cheaper than a broad keyword match, this was not always the case. Some justification of the cost for a keyword would be useful (e.g. “20 clients are already using these keywords” or something).
If you have any useful tips, advice or articles to read on getting the best results out of AdWords, please let me know.
Earth Hour is a big event for my company, Pulse Energy. We are monitoring the energy savings for a variety of sites around Vancouver. This is our second year participating in Earth Hour and our first year as a WWF partner for the Canadian Earth Hour campaign. Last year was pretty hectic, but it was a very successful event for us and our partners. This year, we’re much better organized, we have a real product, we’re monitoring more sites and we’re seeing considerably more traffic.
If you’re interested in seeing in real-time the energy savings realized by the actions taken by some local sites for Earth Hour, check out the Pulse Dashboards.