<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>exortech.com &#187; release blog</title>
	<atom:link href="http://exortech.com/blog/category/release-blog/feed/" rel="self" type="application/rss+xml" />
	<link>http://exortech.com/blog</link>
	<description>Peripatetic thinking</description>
	<lastBuildDate>Tue, 01 Dec 2009 05:56:13 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3.1</generator>
		<item>
		<title>Weekly Release #53 &#8211; Environment bug bites</title>
		<link>http://exortech.com/blog/2009/11/30/weekly-release-53-environment-bug-bites/</link>
		<comments>http://exortech.com/blog/2009/11/30/weekly-release-53-environment-bug-bites/#comments</comments>
		<pubDate>Tue, 01 Dec 2009 05:47:33 +0000</pubDate>
		<dc:creator>exortech</dc:creator>
				<category><![CDATA[release blog]]></category>
		<category><![CDATA[technology]]></category>
		<category><![CDATA[weekly release]]></category>

		<guid isPermaLink="false">http://exortech.com/blog/?p=239</guid>
		<description><![CDATA[Last week I encountered one of the more bizarre bugs of my career. Following the release, gaps started appearing on charts. The strange thing was that the data was all in the database; it just wasn&#8217;t coming through the user interface. Unfortunately, for this release, rolling back wasn&#8217;t really an option, so we needed to [...]]]></description>
			<content:encoded><![CDATA[<p>Last week I encountered one of the more bizarre bugs of my career. Following the release, gaps started appearing on charts. The strange thing was that the data was all in the database; it just wasn&#8217;t coming through the user interface. Unfortunately, for this release, rolling back wasn&#8217;t really an option, so we needed to quickly identify and correct the source of the problem.</p>
<p>After spending the better part of a day banging our heads against the problem, we had isolated it to a few statements. The code had changed in this area in the last release (one of the advantages of weekly releases is that it is easier to pinpoint the source of a problem as each release contains only one week&#8217;s worth of changes) but not in a way that should have caused the problem we were seeing. And everything ran correctly in the test environment. It seemed like the problem could be environment-related.</p>
<p>We didn&#8217;t have the necessary hooks and logging to properly exercise the problem area in isolation, but after a quick patch release, we did (another benefit of zero-downtime deployment is that it allows greater flexibility with the time and frequency of deployment). After the patch deploy, we were able to see that the production system returned one fewer result than the corresponding functionality run against the test environment. If the system was supposed to return only one result, the production system was returning none &#8211; which explained the gaps on the chart. The problem was clearly at a SQL driver or database level.</p>
<p>Just prior to the release, the MySQL driver had been upgraded to the latest version (5.1.10). We had been running this version in the test environment for several weeks without issue, so it seemed unlikely that it was the source of the problem. The version of the database, however, was inconsistent between the two environments (MySQL 5.1.39 in production vs 5.1.36 in test). The newer database version had been running fine in the production environment for several days and hadn&#8217;t been touched by the release, so it seemed equally unlikely to be the culprit. This was enough to go on, however, and we reverted the version of the MySQL driver on the web servers, which ended up fixing the problem.</p>
<p>Later, while trawling the MySQL release logs, I <a href="http://bugs.mysql.com/bug.php?id=47963">came across the nasty bug that bit us</a>. Evidently, the 5.1.10 driver had changed the format of dates in a way that triggered this bug in 5.1.39. So it was the combination of these two versions of the software that caused the problem &#8211; each in isolation worked fine. The issue has been fixed in MySQL 5.1.41, but it&#8217;s a pretty serious bug in core SQL functionality to come out of a sanctioned release.</p>
<p>Coming out of our 5 Whys analysis, the experience points to several ways that we need to tighten up our process as well.</p>
<ol>
<li>We need to separate environment changes from software releases. Even seemingly innocuous changes can have repercussions when combined with other systems. Keeping environment changes separate will make it easier and faster to pinpoint the source of problems (i.e. is it an environment problem or a software problem?).</li>
<li>We need to narrow the discrepancies between the test and production environments. Environment discrepancies are a common source of unexpected risk. While it&#8217;s not feasible to keep the environments perfectly in sync, this could be better. Ironically, it was an attempt to make the environments more consistent that caused this problem.</li>
<li>We need a more comprehensive set of automated validation tests that we can run against the system after each deployment to verify the integrity of the release. This is one of the tasks for this week.</li>
</ol>
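<p>As a rough illustration of the third point, here is a minimal Python sketch of a post-deployment validation run. It is not our actual test suite: the <code>fetch</code> function, the <code>/chart-data</code> paths and the seeded reference series with a known number of readings are all assumptions made for the example. The idea is simply to ask the deployed system for data whose size is known in advance and to fail loudly if a point goes missing, which is exactly the symptom we saw.</p>

```python
from typing import Callable, List, Optional

# Hypothetical: takes a request path and returns the parsed list of data points
# that the deployed system serves for it (for example over HTTP).
Fetch = Callable[[str], list]


def check_single_point_range(fetch: Fetch) -> Optional[str]:
    """A range known to hold exactly one reading must return exactly one point."""
    points = fetch("/chart-data?series=reference&range=single")  # assumed path
    if len(points) != 1:
        return "expected 1 point, got %d" % len(points)
    return None


def check_week_range(fetch: Fetch) -> Optional[str]:
    """A seeded week of daily readings must come back complete."""
    expected = 7  # assumed size of the seeded reference data
    points = fetch("/chart-data?series=reference&range=week")  # assumed path
    if len(points) != expected:
        return "expected %d points, got %d" % (expected, len(points))
    return None


CHECKS = [check_single_point_range, check_week_range]


def run_checks(fetch: Fetch, checks=CHECKS) -> List[str]:
    """Run every check and return one failure description per failed check."""
    failures = []
    for check in checks:
        try:
            error = check(fetch)
        except Exception as exc:  # a crashing check counts as a failure
            error = "raised %r" % (exc,)
        if error is not None:
            failures.append("%s: %s" % (check.__name__, error))
    return failures
```

<p>Passing <code>fetch</code> in as a parameter means the same checks can be pointed at test and at production, and can be exercised against a fake when developing the checks themselves. An empty list from <code>run_checks</code> means the release passed.</p>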
<p>While these actions may seem fairly obvious, it often (unfortunately) takes getting bitten by these types of bugs to illuminate areas where we need to improve. Have you faced something similar? What additional actions do you take to prevent these types of environment problems?</p>
]]></content:encoded>
			<wfw:commentRss>http://exortech.com/blog/2009/11/30/weekly-release-53-environment-bug-bites/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Videos from weekly deployment talk at Agile Vancouver</title>
		<link>http://exortech.com/blog/2009/10/30/videos-from-weekly-deployment-talk-at-agile-vancouver/</link>
		<comments>http://exortech.com/blog/2009/10/30/videos-from-weekly-deployment-talk-at-agile-vancouver/#comments</comments>
		<pubDate>Sat, 31 Oct 2009 05:53:28 +0000</pubDate>
		<dc:creator>exortech</dc:creator>
				<category><![CDATA[agile]]></category>
		<category><![CDATA[release blog]]></category>
		<category><![CDATA[speaking]]></category>

		<guid isPermaLink="false">http://exortech.com/blog/?p=243</guid>
		<description><![CDATA[Last week, I spoke at the Agile Vancouver monthly meeting on the subject of *surprise* Deploying to Production Every Week. Eugene was kind enough to video the talk; so if you&#8217;re interested in catching a replay, it&#8217;s available on Vimeo in five 20 minute segments. Here&#8217;s the first video as a teaser: Deploying to Production [...]]]></description>
			<content:encoded><![CDATA[<p>Last week, I spoke at the <a href="http://agilevancouver.ca">Agile Vancouver</a> monthly meeting on the subject of *surprise* <a href="http://agilevancouver.ca/?p2=modules/blog/viewcomments.jsp&#038;bid=48">Deploying to Production Every Week</a>. <a href="http://advice.cio.com/blog/eugene_nizker">Eugene</a> was kind enough to video the talk; so if you&#8217;re interested in catching a replay, it&#8217;s available on Vimeo in five 20 minute segments. Here&#8217;s the first video as a teaser:<br />
<object width="400" height="225"><param name="allowfullscreen" value="true" /><param name="allowscriptaccess" value="always" /><param name="movie" value="http://vimeo.com/moogaloop.swf?clip_id=7228943&amp;server=vimeo.com&amp;show_title=1&amp;show_byline=1&amp;show_portrait=0&amp;color=&amp;fullscreen=1" /><embed src="http://vimeo.com/moogaloop.swf?clip_id=7228943&amp;server=vimeo.com&amp;show_title=1&amp;show_byline=1&amp;show_portrait=0&amp;color=&amp;fullscreen=1" type="application/x-shockwave-flash" allowfullscreen="true" allowscriptaccess="always" width="400" height="225"></embed></object>
<p><a href="http://vimeo.com/7228943">Deploying to Production Every Week (Owen Rogers), part 1 of 5</a> from <a href="http://vimeo.com/user994644">Agile Vancouver</a> on <a href="http://vimeo.com">Vimeo</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://exortech.com/blog/2009/10/30/videos-from-weekly-deployment-talk-at-agile-vancouver/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Weekly Release Blog #47: Zero-downtime deployment to test</title>
		<link>http://exortech.com/blog/2009/10/21/weekly-release-blog-47-zero-downtime-deployment-to-test/</link>
		<comments>http://exortech.com/blog/2009/10/21/weekly-release-blog-47-zero-downtime-deployment-to-test/#comments</comments>
		<pubDate>Thu, 22 Oct 2009 05:52:22 +0000</pubDate>
		<dc:creator>exortech</dc:creator>
				<category><![CDATA[agile]]></category>
		<category><![CDATA[release blog]]></category>
		<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://exortech.com/blog/?p=233</guid>
		<description><![CDATA[Yesterday we deployed 19 times to our test environment (about once for every 4 commits). We were busy making some final refinements to a new feature that we launched last night. We had most of the team working on this feature (Single Feature Release) and the rapid feedback of the regular deploys helped ensure that [...]]]></description>
			<content:encoded><![CDATA[<p>Yesterday we deployed 19 times to our test environment (about once for every 4 commits). We were busy making some final refinements to a new feature that we launched last night. We had most of the team working on this feature (Single Feature Release) and the rapid feedback of the regular deploys helped ensure that everything was coming together for the release. </p>
<p>There is no way that we could have maintained this pace if we were taking down the test system for 10 minutes with each deploy. We needed to be able to continuously test the system throughout. This to me is one of the unsung values of zero-downtime deployment. Everyone normally focuses on the benefit of your users not seeing a <a href="http://en.wikipedia.org/wiki/File:Failwhale.png">fail whale</a> page if they hit your site during a deploy. But let&#8217;s face it: unless you&#8217;re practicing <a href="http://timothyfitz.wordpress.com/2009/02/10/continuous-deployment-at-imvu-doing-the-impossible-fifty-times-a-day/">continuous deployment</a>, you&#8217;re deploying to test much more often than you are deploying to production. And the frequency with which we can cycle through and get feedback on changes in the test environment is one of the primary limiting factors determining how frequently we can release the software.</p>
<p>While many people that I&#8217;ve talked with like the idea of zero-downtime deployment, they have little access, control or influence over the production environment. So my advice is to start with the environments that you own, like test. Get it set up and running solidly there first.  This greatly lowers the barrier to then rolling it out to production. And, as I describe above, the benefits of having zero-downtime in test are substantial.</p>
<p>In my next post, I&#8217;ll try to provide some more technical details about our zero-downtime setup.</p>
]]></content:encoded>
			<wfw:commentRss>http://exortech.com/blog/2009/10/21/weekly-release-blog-47-zero-downtime-deployment-to-test/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Weekly Release Blog #40: Evolving Site Design Using CSS</title>
		<link>http://exortech.com/blog/2009/08/31/weekly-release-blog-40-evolving-site-design-using-css/</link>
		<comments>http://exortech.com/blog/2009/08/31/weekly-release-blog-40-evolving-site-design-using-css/#comments</comments>
		<pubDate>Tue, 01 Sep 2009 05:10:45 +0000</pubDate>
		<dc:creator>exortech</dc:creator>
				<category><![CDATA[agile]]></category>
		<category><![CDATA[release blog]]></category>
		<category><![CDATA[css]]></category>
		<category><![CDATA[weekly release]]></category>

		<guid isPermaLink="false">http://exortech.com/blog/?p=209</guid>
		<description><![CDATA[I&#8217;m a big fan of CSS &#8211; it keeps things looking consistent, it separates structure and design, and it keeps markup clean, simple and maintainable. The greatest strength and weakness of stylesheets is their scope. A large number of pages are typically styled by a single stylesheet. This is great for consistency and reuse, but [...]]]></description>
			<content:encoded><![CDATA[<p>I&#8217;m a big fan of <a href="http://en.wikipedia.org/wiki/Cascading_Style_Sheets">CSS</a> &#8211; it keeps things looking consistent, it separates structure and design, and it keeps markup clean, simple and maintainable. The greatest strength and weakness of stylesheets is their scope. A large number of pages are typically styled by a single stylesheet. This is great for consistency and reuse, but it means that it can be difficult to assess the impact of changing a style without verifying every page that uses it &#8211; in every supported browser! Changing a global style could break the layout somewhere in the site in ways that could easily go unnoticed.</p>
<p>Hence, when releasing software to production every week, the cost of making style changes can be prohibitive. It is difficult to regression test all impacted pages in all browser combinations within a reasonable amount of time. Another challenge is that in CSS there are many ways to achieve the same thing, though each approach could render differently in different browsers. And in a site that is changing rapidly, existing designs need to evolve to incorporate new features, usability improvements, user feedback and better ways of doing things. </p>
<p>Stylesheets can&#8217;t be seen as being too risky to change. Otherwise they will be subverted through local inline styles and other workarounds (the same goes for common JavaScript libraries or pretty much any shared component for that matter). While local styles may seem expedient for that particular page or feature, they only introduce inconsistencies and make the site more difficult to maintain in the long term. So what&#8217;s a web dev to do?</p>
<p>We&#8217;ve developed an approach to help deal with this problem. I accept that there are probably smarter ways to achieve this &#8211; if you have a better idea, please let me know. Here goes:</p>
<ol>
<li>All styles should be relative to some top-level class. For example:<br />
	<code>form.standard { margin: 0px 150px; }</code><br />
The class describes the type of component that is being styled &#8211; the type of table, form, or component we are designing here. In the example, we are defining a &#8220;standard&#8221; form, which is a particular type of form. Using a top-level class allows for other types of forms to be styled as required. </li>
<li>All sub-elements are defined relative to the top-level class:<br />
<code>form.standard label { margin-left: -150px; }</code><br />
All labels for a &#8220;standard&#8221; form have a negative left margin. It is acceptable (even preferable) to style bare tags as long as the style is relative to a top-level class. This keeps the markup simple and consistent. We only add CSS classes to child elements as required.</li>
<li>The corollary to the above is that it is not acceptable to style bare tags globally, outside the scope of a top-level class (unless we are doing a style reset). The problem with styling bare tags directly is that it locks us into one specific style, which makes it difficult to evolve the design of the site. For example:<br />
<code>label { font-weight: bold }</code><br />
means that all form labels will be in bold. If we introduce a form later that shouldn&#8217;t have bold labels, we will have to explicitly override this style (which tends to be brittle and limiting).</li>
<li>Now, if we need to evolve an existing style, we have a few options. Say we want to replace the &#8220;standard&#8221; form with a new design. Rather than change the styles for the &#8220;standard&#8221; form class directly, which would immediately impact every page where this style is used, we can incrementally and selectively roll out the new design on a page-by-page basis. If the new style is significantly different from the initial style, we can fork the original style by defining a new top-level class:<br />
<code>form.danger { background-color: red; }</code><br />
Simply by changing the class attribute for each form element from &#8220;standard&#8221; to &#8220;danger&#8221;, we can then roll out the new style to select pages, testing the design in all supported browsers as we go. Think of this as continuous integration for site design.</li>
<li>If the style changes are relatively minimal, we can override the style for specific elements. One way to achieve this is to use multiple top-level CSS classes. For example, we could incrementally apply both the &#8220;standard&#8221; and the &#8220;danger&#8221; classes to the form elements on each page, testing as we go. The &#8220;danger&#8221; class could override styles as required &#8211; though dealing with precedence can be tricky. Alternatively, the new class could be defined relative to a top-level identifier. This solves the precedence problem, as styles defined relative to an identifier take precedence over styles that are relative to a class. Another option is to define specific classes for the child elements to be restyled &#8211; but this means changing a lot more markup during the rollout.</li>
</ol>
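<p>The precedence remark in the last step can be made concrete with a small Python sketch that computes CSS specificity as an (ids, classes, elements) triple. It is deliberately simplified: it only understands type, <code>.class</code> and <code>#id</code> parts separated by spaces, and the <code>#signup</code> identifier is a made-up example rather than anything from our stylesheets.</p>

```python
import re
from typing import Tuple


def specificity(selector: str) -> Tuple[int, int, int]:
    """Return (ids, classes, elements) for a simple CSS selector.

    Simplified on purpose: handles only type, .class and #id parts joined by
    descendant spaces (no attributes, pseudo-classes or other combinators).
    """
    ids = len(re.findall(r"#[\w-]+", selector))
    classes = len(re.findall(r"\.[\w-]+", selector))
    # A type selector starts a compound, so it sits at the start or after a space.
    elements = len(re.findall(r"(?:^|\s)[a-zA-Z][\w-]*", selector))
    return (ids, classes, elements)


def wins(challenger: str, incumbent: str) -> bool:
    """True if challenger beats incumbent on specificity alone.

    Equal specificity returns False: in that case the rule that appears
    later in the stylesheet applies, which is what makes ties fragile.
    """
    return specificity(challenger) > specificity(incumbent)
```

<p>With this, <code>form.standard label</code> and <code>form.danger label</code> both score (0, 1, 2), so when a form carries both classes the winner depends on stylesheet order. <code>form.standard.danger label</code> scores (0, 2, 2) and <code>#signup label</code> scores (1, 0, 1); the identifier version wins because the triples are compared from left to right.</p>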
<p>That&#8217;s it. It seems pretty simple &#8211; intuitive even. But I haven&#8217;t found many references to how others tackle this problem. Again, if you have a better idea, please let me know.</p>
]]></content:encoded>
			<wfw:commentRss>http://exortech.com/blog/2009/08/31/weekly-release-blog-40-evolving-site-design-using-css/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Weekly Release Blog #32: When things go down that shouldn&#8217;t</title>
		<link>http://exortech.com/blog/2009/07/07/weekly-release-blog-32-when-things-go-down-that-shouldnt/</link>
		<comments>http://exortech.com/blog/2009/07/07/weekly-release-blog-32-when-things-go-down-that-shouldnt/#comments</comments>
		<pubDate>Tue, 07 Jul 2009 14:44:54 +0000</pubDate>
		<dc:creator>exortech</dc:creator>
				<category><![CDATA[release blog]]></category>
		<category><![CDATA[weekly release]]></category>

		<guid isPermaLink="false">http://exortech.com/blog/?p=186</guid>
		<description><![CDATA[Last week, our site sustained a prolonged outage during core business hours. While testing their backup power systems, our data centre provider tripped a breaker leading to a cascade of failures that, among other things, produced a power surge that fried our hardware firewall&#8217;s power supply. The hardware firewall is one of those standard pieces [...]]]></description>
			<content:encoded><![CDATA[<p>Last week, our site sustained a prolonged outage during core business hours. While testing their backup power systems, our data centre provider tripped a breaker leading to a cascade of failures that, among other things, produced a power surge that fried our hardware firewall&#8217;s power supply. The hardware firewall is one of those standard pieces of system hardware that are so simple that they are assumed to be failure resistant &#8211; one of the pieces of an infrastructure least likely to fail. The reality is that they are antiquated, commodity hardware that the host provider has long ago paid off and that have sustained the load of numerous sites before ours. The question is not <em>if</em> they&#8217;re going to fail, but <em>when</em>. And the implications of their failure are quite severe. </p>
<p>By design, the firewall serves as the single access point into and out of the site. Even though we had taken redundancy and failover measures in the web server, application server and database clusters behind the firewall, it doesn&#8217;t matter much if the traffic can&#8217;t get through. Essentially the hardware firewall is a big old <a href="http://en.wikipedia.org/wiki/Single_Point_of_Failure">SPOF</a>.</p>
<p>Normally a hardware power supply is one of those things that a data centre can very quickly replace. But when the data centre itself is in turmoil because of a significant outage, replacing a power supply for some small customer is the last thing on their mind. When it comes down to it, the only ones who care about your site are you and your customers. We, of course, knew about the failure immediately because of the <a href="http://exortech.com/blog/2009/01/18/weekly-release-blog-9-production-monitoring/">monitoring we have in place</a>. But there wasn&#8217;t much we could do. When managed hardware fails, there&#8217;s not much you can do except log a ticket (assuming that the ticketing system is up &#8211; which in this case it wasn&#8217;t), sit back and wait. Of course there are SLAs in place (there&#8217;s a one-hour replacement window on these types of things), but they don&#8217;t keep your site up, and going through the negotiations to sort out the ramifications of a failure is a big waste of everyone&#8217;s time. The bottom line is that we need to eliminate this SPOF from our infrastructure by obtaining a secondary firewall that we can fail over to.</p>
]]></content:encoded>
			<wfw:commentRss>http://exortech.com/blog/2009/07/07/weekly-release-blog-32-when-things-go-down-that-shouldnt/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Weekly Release Blog #25 &#8211; Improving the signal-to-noise ratio</title>
		<link>http://exortech.com/blog/2009/05/13/weekly-release-blog-25-improving-the-signal-to-noise-ratio/</link>
		<comments>http://exortech.com/blog/2009/05/13/weekly-release-blog-25-improving-the-signal-to-noise-ratio/#comments</comments>
		<pubDate>Thu, 14 May 2009 06:21:05 +0000</pubDate>
		<dc:creator>exortech</dc:creator>
				<category><![CDATA[agile]]></category>
		<category><![CDATA[release blog]]></category>
		<category><![CDATA[continuous monitoring]]></category>
		<category><![CDATA[weekly release]]></category>

		<guid isPermaLink="false">http://exortech.com/blog/?p=161</guid>
		<description><![CDATA[At my company, we use a form of Continuous Monitoring: every time our system logs a warning or an error we immediately receive an email identifying the source and nature of the problem. This allows us to respond rapidly to problems as they arise and gives us good visibility into the health of our system. [...]]]></description>
			<content:encoded><![CDATA[<p>At my company, we use a form of <a href="http://exortech.com/blog/2008/08/14/continuous-monitoring-tutorial-at-agile-2008/">Continuous Monitoring</a>: every time our system logs a warning or an error we immediately receive an email identifying the source and nature of the problem. This allows us to respond rapidly to problems as they arise and gives us good visibility into the health of our system. Following the mantra of &#8220;do in test as is done in prod&#8221;, we have the same monitoring system set up in both environments to help us find issues in test before they find their way into production.</p>
<p>The downside to this level of monitoring is that it can amount to <strong>a lot</strong> of messages. Our challenge is to manage the signal-to-noise ratio so that:</p>
<ul>
<li>we are only notified about things that require immediate action,</li>
<li>we don&#8217;t suffer from information overload; and</li>
<li>emails that matter aren&#8217;t buried under a bunch of emails that don&#8217;t.</li>
</ul>
<p>As part of our <a href="http://startuplessonslearned.blogspot.com/2008/11/five-whys.html">5 Whys</a> activity for production issues, we have found that most production issues actually occurred first in test, but just went unnoticed. This provides a compelling reason to keep the signal ratio high in all environments. Any time that we find ourselves automatically archiving or filtering an alert indicates an opportunity for improvement. </p>
<p>We have found that refining and tuning these alert messages is an ongoing maintenance activity. As part of our weekly meeting, we try to select one message to clarify or dispatch each week. We have a script that trawls the support emails received in the past week and builds a Pareto distribution of the number of messages by logger. This helps us decide where to focus our efforts and to quantify the impact of our actions on the volume of messages we receive.</p>
<p>Determining what kinds of things we need to be alerted about is difficult to assess in advance. Often things that we are concerned about when building a feature turn out to be less important in production, and conversely, we miss things in development that turn out to be very important once real customers start using them. Fortunately, deploying every week gives plenty of opportunity for improvement. Also, if a message is logged more frequently than intended, we only have to put up with it for a week before it can be rectified. </p>
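<p>A minimal sketch of the counting step, in Python. This is not our actual script: it assumes the logger name has already been extracted from each alert email (how that is done depends on the email format), and the function name is made up for the example.</p>

```python
from collections import Counter
from typing import Iterable, List, Tuple


def pareto_by_logger(loggers: Iterable[str]) -> List[Tuple[str, int, float]]:
    """Return (logger, count, cumulative share) rows, noisiest logger first.

    `loggers` holds one logger name per alert email received in the period.
    """
    counts = Counter(loggers)
    total = sum(counts.values())
    rows = []
    running = 0
    for logger, count in counts.most_common():
        running += count
        # Cumulative share shows how few loggers account for most of the noise.
        rows.append((logger, count, running / total))
    return rows


if __name__ == "__main__":
    # Hypothetical week of alerts, one entry per email.
    week = ["com.example.billing"] * 6 + ["com.example.charts"] * 3 + ["com.example.auth"]
    for logger, count, share in pareto_by_logger(week):
        print("%-22s %3d  %5.1f%%" % (logger, count, share * 100))
```

<p>Comparing the table from one week to the next is what lets us see whether fixing or silencing a particular logger actually reduced the volume.</p>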
<p>I should mention that we have a <a href="http://www.amazon.com/Release-Production-Ready-Software-Pragmatic-Programmers/dp/0978739213">circuit breaker</a> in place in the log monitor. We do not allow duplicate messages to be sent any more frequently than once per hour. (Relatively early on we managed to get temporarily blacklisted by a mail provider when an errant message was generated much too frequently.)</p>
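<p>The once-per-hour rule can be sketched in a few lines of Python. This is an illustration under assumptions, not our log monitor: the class name is invented, what counts as a duplicate (the <code>message_key</code>) is left to the caller, and it is a simple per-message throttle rather than a full circuit breaker with open and closed states.</p>

```python
from typing import Dict


class DuplicateSuppressor:
    """Let a given message through at most once per time window."""

    def __init__(self, window_seconds: float = 3600.0) -> None:
        self.window_seconds = window_seconds
        self._last_sent: Dict[str, float] = {}  # message key -> time last sent

    def should_send(self, message_key: str, now: float) -> bool:
        """Return True and record the send if the key is outside its window.

        `now` is a timestamp in seconds (for example time.time()); passing it
        in keeps the class easy to test. Suppressed attempts do not extend
        the window, so a constantly repeating error still gets through
        once per window.
        """
        last = self._last_sent.get(message_key)
        if last is None or now - last >= self.window_seconds:
            self._last_sent[message_key] = now
            return True
        return False
```

<p>A reasonable key is something like the logger name plus the message template, so that the same error with different timestamps or ids in the text still counts as a duplicate.</p>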
<p>In terms of managing the signal-to-noise ratio, I&#8217;ve found that there are a few broad categories of messages to deal with:</p>
<ul>
<li>Message source: did the message originate in our code or in one of the libraries that we depend on? Clearly, warnings coming from our code are easier to deal with than those from outside. I&#8217;ve been frustrated by the laissez-faire attitude that various open source Java frameworks take to logging errors and warnings. We use <a href="http://cxf.apache.org/">Apache CXF</a>, and it generates over 10 severe messages with lengthy stacktraces every time the application starts up to inform us that JMS integration through JNDI is not enabled. WTF?!? Sometimes these messages can be controlled by setting custom log levels for specific loggers, but not always. And it typically feels a bit disconcerting to shut down logging just in case something important is missed.</li>
<li>System conditions: was the message generated during normal operations, during a shut down or a crash? I&#8217;ve found that systems tend to be very noisy during shutdown, but (perversely) pretty quiet during a crash. In the world of Java app servers where memory leaks across deployments are common, trying to quietly quiesce a server is a real challenge.</li>
</ul>
<p>In the (enterprise) environments where I&#8217;ve worked in the past, there was very little interaction between development and operations. Logs were used only for analyzing severe production problems &#8211; generally after a severe system problem (a crash) or a user had reported a problem. The log files were poorly tuned for diagnosing problems and they tended to be full of junk &#8211; problems that no one had noticed or reported that may have been going on for months (or longer).</p>
<p>In contrast, the approach that we follow at my current company means we are able to use logs to proactively find and remedy problems. It requires effort to maintain a high signal-to-noise ratio, but it is very worthwhile.</p>
]]></content:encoded>
			<wfw:commentRss>http://exortech.com/blog/2009/05/13/weekly-release-blog-25-improving-the-signal-to-noise-ratio/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Weekly Release Blog #24 &#8211; Downgrading from Glassfish 2.1</title>
		<link>http://exortech.com/blog/2009/05/06/weekly-release-blog-downgrading-from-glassfish-21/</link>
		<comments>http://exortech.com/blog/2009/05/06/weekly-release-blog-downgrading-from-glassfish-21/#comments</comments>
		<pubDate>Thu, 07 May 2009 06:22:33 +0000</pubDate>
		<dc:creator>exortech</dc:creator>
				<category><![CDATA[release blog]]></category>
		<category><![CDATA[Glassfish]]></category>
		<category><![CDATA[weekly release]]></category>

		<guid isPermaLink="false">http://exortech.com/blog/?p=156</guid>
		<description><![CDATA[Last week, one of our Glassfish instances stopped responding. The process was running, but no longer handling requests. The good news is that the load balancer automatically failed over so there was no downtime to the site. The bad news is that we didn&#8217;t receive any direct notification of the failure. We have monitoring on [...]]]></description>
			<content:encoded><![CDATA[<p>Last week, one of our <a href="https://glassfish.dev.java.net/">Glassfish</a> instances stopped responding. The process was running, but no longer handling requests. The good news is that the load balancer automatically failed over so there was no downtime to the site. The bad news is that we didn&#8217;t receive any direct notification of the failure. We have monitoring on the box, but it is primarily at the system level. In this case, everything was fine with the system; it was just the JVM that was having issues. And the problem wasn&#8217;t load per se, more lack thereof. Looking at the Ganglia graphs, the only thing suspicious was the absence of activity.</p>
<p>To rectify the situation, we brought the application server up and down a few times and tried redeploying the application, but still no dice. We had previously seen occasions where Glassfish had become corrupted, so the next action was to rebuild the instance. One nice feature of Glassfish is that it is quite scriptable and we fleshed out our script for rebuilding a production instance. Strangely, rebuilding the application server didn&#8217;t seem to help. The clean instance would run for a while and then just lock up. It seemed to do this non-deterministically.</p>
<p>We were feeling really stumped. As a last resort, we decided to reboot the server. This is something that I would have considered earlier if it was a Windows box, but this was a Linux server that had been up and running reliably since we first commissioned it 7 months earlier. Also this seemed to be a JVM issue and the JVM process was being brought up and down with each application server restart. Fortunately, rebooting seemed to do the trick. There must have been some malignant process or OS lock that was interfering with the JVM, but it wasn&#8217;t clear what was the cause.</p>
<p>Unfortunately this wasn&#8217;t the end of our problems. When we decided to rebuild Glassfish, we had opted to upgrade from V2ur2 to 2.1. Many of us had been running Glassfish 2.1 in development and it seemed more reliable than the V2ur2 release. Besides, it was just a minor point upgrade. When we went to reconnect our remote clients with the rebuilt server, they started throwing SerializationExceptions on a Sun library OrderedSet class. The IIOP/CORBA communication protocol uses binary serialization to transmit objects for remote JNDI lookups as part of the JMS handshake. Some genius on the project had decided to upgrade a key library as part of a point release that broke backwards compatibility for standard JMS clients. Nice.</p>
<p>Buried in the Glassfish 2.1 upgrade guide, the <a href="http://docs.sun.com/app/docs/doc/820-4331/geyyk?a=view">Application Client Interoperability section</a> states:</p>
<blockquote><p>You cannot run application clients with one version of the application server runtime with a server that has a different version. Most often, this would happen if you upgraded the server but had not upgraded all the application client installations. You can use the Java Web Start support to distribute and launch the application client. If the runtime on the server has changed since the end-user last used the application client, Java Web Start automatically retrieves the updated runtime. Java Web Start enables you to keep the clients and servers synchronized and using the same runtime.</p></blockquote>
<p>WTF!?! What kind of an upgrade process is this? Upgrading the application server requires simultaneously upgrading all clients? Ain&#8217;t gonna happen. It&#8217;s essentially guaranteeing version lockdown. And recommending Java Web Start is fine for distributed client applications, but not for long-running autonomous processes.</p>
<p>Anyway, downgrading Glassfish back to v2ur2 resolved the connectivity problem. The v2.1 compatibility problem exposed the deeper issue: that JMS, at least the default CORBA implementation, is a tightly coupled train wreck waiting to happen, especially with Sun&#8217;s cavalier attitude toward upgrades. It&#8217;s time to pursue alternative communication protocols built on open standards like, say, XMPP.</p>
]]></content:encoded>
			<wfw:commentRss>http://exortech.com/blog/2009/05/06/weekly-release-blog-downgrading-from-glassfish-21/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Weekly Release Blog #23 &#8211; Continuous Deployment&#8230; to test?</title>
		<link>http://exortech.com/blog/2009/04/28/release-blog-23-continuous-deployment-to-test/</link>
		<comments>http://exortech.com/blog/2009/04/28/release-blog-23-continuous-deployment-to-test/#comments</comments>
		<pubDate>Tue, 28 Apr 2009 17:49:39 +0000</pubDate>
		<dc:creator>exortech</dc:creator>
				<category><![CDATA[agile]]></category>
		<category><![CDATA[release blog]]></category>
		<category><![CDATA[continuous deployment]]></category>
		<category><![CDATA[Glassfish]]></category>
		<category><![CDATA[weekly release]]></category>

		<guid isPermaLink="false">http://exortech.com/blog/?p=150</guid>
		<description><![CDATA[Last week, we were fortunate to have Eric Ries come out and spend some time talking with our team while he was here for the Agile Vancouver event. We had the chance to talk about 5 whys, split testing and other topics. I would have liked to spend a bit more time discussing continuous deployment, [...]]]></description>
			<content:encoded><![CDATA[<p>Last week, we were fortunate to have <a href="http://startuplessonslearned.blogspot.com/">Eric Ries</a> come out and spend some time talking with our team while he was here for the <a href="http://agilevancouver.ca">Agile Vancouver</a> event. We had the chance to talk about <a href="http://startuplessonslearned.blogspot.com/2008/11/five-whys.html">5 whys</a>, <a href="http://startuplessonslearned.blogspot.com/2008/09/one-line-split-test-or-how-to-ab-all.html">split testing</a> and other topics. I would have liked to spend a bit more time discussing <a href="http://timothyfitz.wordpress.com/2009/02/10/continuous-deployment-at-imvu-doing-the-impossible-fifty-times-a-day/">continuous deployment</a>, but I did get some more insight into how they got started with CD at <a href="http://imvu.com/">IMVU</a>.</p>
<p>One thing that I was surprised to learn was that IMVU started out with continuous deployment. They were deploying to production with every commit before they had an automated build server or extensive automated test coverage in place. Intuitively this seemed completely backwards to me &#8211; surely it would be better to start with CI, build up the test coverage until it reached an acceptable level and then work on deploying continuously. In retrospect and with a better understanding of their context, their approach makes perfect sense. Moreover, approaching the problem from the direction that I had intuitively assumed is a recipe for never reaching a point where continuous deployment is feasible.</p>
<p>First, IMVU initially sought to quickly build a product that would prove out the soundness of their ideas and test the validity of their business model. Their initial users were super early adopters who were willing to trade quality for access to new features. Getting features and fixes into the hands of users was the greatest priority &#8211; a test environment would just get in the way and slow down the validation coming from having code running in production. As the product matured, they were able to <a href="http://skizz.biz/blog/2008/03/11/fixing-broken-windows-with-ratcheting/">ratchet up the quality</a> to prevent regression on features that had been truly embraced by their customers.</p>
<p>Second, leveraging a dynamic scripting language (like PHP) for building web applications made it easy to quickly set up a <a href="http://radar.oreilly.com/2009/03/continuous-deployment-5-eas.html">simple, non-disruptive deployment process</a>. There are no compilation or packaging steps that would generally be performed by an automated build server &#8211; just copy and change the symlink.</p>
<p>Third, they evolved ways to selectively expose functionality to sets of users. As Eric said, &#8220;at IMVU, &#8216;release&#8217; is a marketing term&#8221;. New functionality could be living in production for days or weeks before being released to the majority of users. They could test, get feedback and refine a new feature with a subset of users until it was ready for wider consumption. Users were not just an extension of the testing team &#8211; they were an extension of the product design team.</p>
<p>Understanding these three factors makes it clear why continuous deployment was a starting point for IMVU. In contrast, at most organizations &#8211; especially those with mature products &#8211; high quality is the starting point. It is assumed that users will not tolerate any decrease in quality. Users should only see new functionality once it is ready, fully implemented and thoroughly tested, lest they get a bad impression of the product that could adversely affect the company&#8217;s brand. They would rather build the wrong product well than risk this kind of exposure. In this context, the automated test coverage would need to be so good as to render continuous deployment infeasible for most systems. Starting instead from a position where feedback cycle time is the priority and allowing quality to ratchet up as the product matures provides a more natural lead-in to continuous deployment.</p>
<p>For my company, even though we do weekly deployments, we&#8217;re still a fair way off from being able to deploy continuously. As we are operating in a new and rapidly evolving market, we focus on building and releasing a simple initial version of new features that demonstrates the potential of the software. We can then receive feedback and invest more effort in expanding those features that resonate with our clients. While we do routinely expose new functionality selectively to a subset of users (generally internal users) to solicit feedback, we still need to create more sophisticated ways to do user segmentation. Aside from the obvious bugbear of automated test coverage (we use JUnit and Selenium, but our coverage isn&#8217;t nearly good enough), our main blocking issue from a technology perspective is the deployment process itself.</p>
<p>To deploy continuously, the deployment has to be quick and it has to be transparent to end users (i.e., there should be no visible downtime). Performing a rollback should have the same characteristics. Our deployment process <em>is</em> automated, but in the world of Java application servers (even lightweight ones like Glassfish) deployment is anything but fast. Deployment entails all kinds of work that the app server needs to do (parsing configuration files, generating WSDLs, starting thread pools, etc.) during which the application is unresponsive. Also, because of memory leak issues in the container, we always restart the application server with each deployment anyway. All in all, the only way to avoid downtime is to pull the application server out of the load balancer pool until the deployment completes. Rollback is the same process in reverse.</p>
<p>A bit of an aside, but I know of some teams that package Glassfish with their app, inverting the container metaphor and simply treating it as another library/dependency. This makes it easier to just flip the symlink on deployment and rollback. It&#8217;s an interesting idea, as long as you don&#8217;t mind copying a massive WAR to production with each deploy (which for us would just shift the deployment bottleneck to the network).</p>
<p>We have made a fair bit of headway on streamlining our deployment process, and while we&#8217;re not ready to do continuous deployments into production, I am trying to get us into a position where we can do continuous deployment to test. I used to be of the opinion that deployment to test was something that should be controlled by testers (via a deploy button on the automated build server). Most testers want to work against a stable baseline, limiting the number of variables that they are dealing with when testing the app. But this is a fallacy because a batch of changes is simply piling up behind whatever version is deployed into test. It&#8217;s classic batch-and-queue thinking.</p>
<p>What if deployments happened without downtime in a way that was invisible to the tester or the end user? What if test coverage was sufficient to ensure that there would be no regression on major areas of functionality? I think that the fears of continuous deployment into test and the need for a stable baseline would evaporate. Moreover, this is something that we would want to test because it would mirror the experience of users using the site when a new version goes into production. In our office, every time we do a deployment to test, someone needs to call out &#8220;deploying to test&#8221;. This too would go away.</p>
<p>That&#8217;s the plan anyway. Over the next couple of weeks, I&#8217;ll see if we can move closer to achieving it. I&#8217;ll let you know how it goes.</p>
]]></content:encoded>
			<wfw:commentRss>http://exortech.com/blog/2009/04/28/release-blog-23-continuous-deployment-to-test/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Weekly Release Blog #20 &#8211; Far Future Expires</title>
		<link>http://exortech.com/blog/2009/04/07/weekly-release-blog-20-far-future-expires/</link>
		<comments>http://exortech.com/blog/2009/04/07/weekly-release-blog-20-far-future-expires/#comments</comments>
		<pubDate>Wed, 08 Apr 2009 06:42:05 +0000</pubDate>
		<dc:creator>exortech</dc:creator>
				<category><![CDATA[agile]]></category>
		<category><![CDATA[release blog]]></category>
		<category><![CDATA[weekly release]]></category>

		<guid isPermaLink="false">http://exortech.com/blog/?p=140</guid>
		<description><![CDATA[If you&#8217;re looking for some quick ways to improve the performance of your site, Steve Souders&#8217; High Performance Web Sites is packed with great advice. You don&#8217;t even need to buy the book as most of the information is available through links from the Firefox YSlow plugin. We have been picking one rule every couple [...]]]></description>
			<content:encoded><![CDATA[<p>If you&#8217;re looking for some quick ways to improve the performance of your site, Steve Souders&#8217; <a href="http://www.amazon.ca/High-Performance-Web-Sites-Essential/dp/0596529309">High Performance Web Sites</a> is packed with great advice. You don&#8217;t even need to buy the book as most of the <a href="http://developer.yahoo.com/performance/rules.html">information is available</a> through links from the <a href="http://developer.yahoo.com/yslow/">Firefox YSlow plugin</a>. We have been picking one rule every couple of weeks to focus on and this past week we spent a bit of time adding <a href="http://developer.yahoo.net/blog/archives/2007/05/high_performanc_2.html">far future expires headers</a> for the Flex SWFs on our site.</p>
<p><strong>Far future expires</strong> means setting the Expires HTTP header for static content to some date far in the future. Effectively, this means that static content within a web site will always be loaded from the browser cache after it is first requested. This greatly improves the load time for your site and reduces the number of requests sent to your web servers. The flip-side, however, is that because the cached content effectively never expires, if you do need to change an image or a stylesheet then the user will need to clear their browser cache before they see it.</p>
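<p>As a rough illustration of what those headers look like, here is a minimal Java sketch that computes the two values involved. The class name, the ten-year lifetime and the choice to send Cache-Control alongside Expires are assumptions made for the example, not something taken from the setup described in this post.</p>

```java
import java.time.Duration;
import java.time.ZoneOffset;
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;
import java.util.Locale;

// Hypothetical helper: computes the header values that a far future
// expires policy would attach to a static resource.
final class FarFutureHeaders {

    // Assumed lifetime of roughly ten years; any "far enough" value works.
    static final Duration LIFETIME = Duration.ofDays(3650);

    // HTTP dates use a fixed English format and are always expressed in GMT.
    private static final DateTimeFormatter HTTP_DATE =
            DateTimeFormatter.ofPattern("EEE, dd MMM yyyy HH:mm:ss 'GMT'", Locale.ENGLISH);

    // Expires carries an absolute date: the current time plus the lifetime.
    static String expires(ZonedDateTime now) {
        ZonedDateTime expiry = now.withZoneSameInstant(ZoneOffset.UTC).plus(LIFETIME);
        return HTTP_DATE.format(expiry);
    }

    // Cache-Control: max-age carries the same lifetime as relative seconds.
    static String cacheControl() {
        return "public, max-age=" + LIFETIME.getSeconds();
    }

    public static void main(String[] args) {
        ZonedDateTime now = ZonedDateTime.now(ZoneOffset.UTC);
        System.out.println("Expires: " + expires(now));
        System.out.println("Cache-Control: " + cacheControl());
    }
}
```

<p>In practice these values are usually set by the web server or proxy rather than in application code; the sketch only shows what ends up on the wire.</p>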
<p>Hence, taking advantage of far future expires means taking responsibility for versioning static content on the server. Any time static content changes, it needs to be served up from a different URL. In his book, Steve Souders alludes to the approach that they follow at Yahoo! to achieve this, but he doesn&#8217;t give enough detail to just go ahead and implement it. So here is how we&#8217;re solving the problem.</p>
<p>We&#8217;re currently using two approaches to versioning static content: one for images and one for SWFs, CSS, and JavaScript:</p>
<ul>
<li>Every time an image is added to our site, we place it in a folder named after the current release (e.g. <em>/images/1.21/header.png</em>). If we need to update an image then we move it from the folder it&#8217;s in to the folder for the current release and then update all links to the image accordingly. While this approach does require some manual effort, it has the advantage of being incredibly simple and easy to get going immediately. Because images change relatively infrequently within the site, this approach creates minimal overhead. It also means that only images that have changed from release to release get reloaded. The majority of the images will stay cached because they haven&#8217;t changed.</li>
<li>Other static content like stylesheets, scripts and Flex applications change more frequently. They are versioned automatically with every build by getting copied to a folder named after the current build number and then bundled into the deployment package. We then dynamically build the path/URL to these resources using the current build number loaded from a bundled text file resource. This approach has the benefit of being completely automated. The only (small) disadvantage is that the version of the content changes with every release (as we&#8217;re releasing every week, this is quite often) regardless of whether the content has changed or not. However, given that this content generally changes weekly, it isn&#8217;t a problem.</li>
</ul>
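<p>The build-number approach for stylesheets, scripts and SWFs can be sketched roughly as follows. This is only an illustration: the class name, the <em>/static/</em> URL prefix and the <em>build.number</em> property key are hypothetical, and the properties text stands in for the bundled text file resource.</p>

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

// Hypothetical sketch: resolves static resources under a folder named after
// the build number, so every build serves them from a fresh URL.
final class VersionedResources {

    private final String buildNumber;

    VersionedResources(String buildNumber) {
        this.buildNumber = buildNumber;
    }

    // Parses "build.number=..." from properties text. In a real application
    // the text would be read from a file bundled into the deployment package.
    static VersionedResources fromProperties(String text) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        String build = props.getProperty("build.number");
        if (build == null || build.trim().isEmpty()) {
            throw new IllegalStateException("build.number is missing");
        }
        return new VersionedResources(build.trim());
    }

    // Turns "css/site.css" into "/static/1234/css/site.css" for build 1234.
    String url(String relativePath) {
        return "/static/" + buildNumber + "/" + relativePath;
    }

    public static void main(String[] args) {
        VersionedResources resources = fromProperties("build.number=1234\n");
        System.out.println(resources.url("css/site.css"));
        System.out.println(resources.url("flex/charts.swf"));
    }
}
```

<p>Because every page builds its links through one helper like this, a new build number changes every versioned URL at once, and the browser treats the resources as new content.</p>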
<p>As for setting the Expires and Cache-Control headers, we&#8217;re using <a href="http://nginx.net/">nginx</a> as a reverse proxy server, which makes it trivial to <a href="http://wiki.nginx.org/NginxHttpHeadersModule">set HTTP headers</a> based on the file extension of each requested URI.</p>
<p>If you have suggestions for a better way to version static content or if I can provide more clarity on the approach that we&#8217;re using, please let me know.</p>
]]></content:encoded>
			<wfw:commentRss>http://exortech.com/blog/2009/04/07/weekly-release-blog-20-far-future-expires/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Weekly Release Blog #19 &#8211; Timezones!?!</title>
		<link>http://exortech.com/blog/2009/04/01/weekly-release-blog-19-timezones/</link>
		<comments>http://exortech.com/blog/2009/04/01/weekly-release-blog-19-timezones/#comments</comments>
		<pubDate>Thu, 02 Apr 2009 04:59:54 +0000</pubDate>
		<dc:creator>exortech</dc:creator>
				<category><![CDATA[agile]]></category>
		<category><![CDATA[release blog]]></category>
		<category><![CDATA[weekly release]]></category>

		<guid isPermaLink="false">http://exortech.com/blog/?p=137</guid>
		<description><![CDATA[Last week, we spent some time adding better timezone handling to the application &#8211; specifically, the ability to view data in the data source&#8217;s timezone rather than the user&#8217;s local timezone. Our application leverages Adobe Flex for charting and data visualization, and it&#8217;s sufficient to say that Flex&#8217;s timezone support is frankly lacking. Flex supports [...]]]></description>
			<content:encoded><![CDATA[<p>Last week, we spent some time adding better timezone handling to the application &#8211; specifically, the ability to view data in the data source&#8217;s timezone rather than the user&#8217;s local timezone. Our application leverages Adobe Flex for charting and data visualization, and suffice it to say that Flex&#8217;s timezone support is frankly lacking. Flex supports determining the UTC offset, which is fine when displaying data in a user&#8217;s local timezone, but it&#8217;s insufficient for working with alternate timezones.</p>
<p>The recommended approach is to keep dates on the Flex side strictly in UTC and leave the server to handle all date and timezone manipulation. The server returns UTC dates (epochs) shifted relative to the timezone that the data should be displayed in.</p>
<p>One challenge is that not all Flex controls work with UTC dates directly, meaning that there is inevitably some back and forth between local dates and UTC on the client side. Also, as we discovered, the UTC setters on a date tend to have unexpected side effects. It is generally better to create a new UTC date from a local date rather than invoke its UTC setters directly. Testing can also be tricky as bugs may only be visible at certain times of day (e.g. when it&#8217;s tomorrow in GMT but still today locally) or of the month (at month boundaries).</p>
<p>One last little bit of date fun with Flex: the Flex data visualization/charting package has a tendency to crash your browser when viewing data at DST rollovers. The bug was raised over a year ago and ostensibly fixed before this year&#8217;s DST rollover at the start of March, but it still hasn&#8217;t found its way into a release (at least as of this post).</p>
<p>With all of the time invested in supporting timezones, I can&#8217;t help but feel we&#8217;d be better off adopting some standard global metric time.</p>
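<p>A minimal sketch of that server-side shift is below, assuming a JDK with the java.time API. The class and method names are illustrative and not taken from the application described here; the idea is simply to add the display timezone&#8217;s offset to the epoch so that a client rendering the value as UTC shows the data source&#8217;s wall-clock time.</p>

```java
import java.time.Instant;
import java.time.ZoneId;

// Hypothetical sketch of the server-side shift; names are illustrative.
final class DisplayEpoch {

    // Returns epoch milliseconds that, when rendered as UTC, show the wall
    // clock time of displayZone at the original instant. The offset is looked
    // up for that specific instant, so daylight saving is accounted for.
    static long shiftForDisplay(long epochMillis, ZoneId displayZone) {
        Instant instant = Instant.ofEpochMilli(epochMillis);
        int offsetSeconds = displayZone.getRules().getOffset(instant).getTotalSeconds();
        return epochMillis + offsetSeconds * 1000L;
    }

    public static void main(String[] args) {
        // Noon UTC on 1 April 2009; Vancouver is on daylight time (UTC-7) then.
        long noonUtc = Instant.parse("2009-04-01T12:00:00Z").toEpochMilli();
        long shifted = shiftForDisplay(noonUtc, ZoneId.of("America/Vancouver"));
        // Prints 2009-04-01T05:00:00Z, i.e. 05:00 Vancouver time shown as UTC.
        System.out.println(Instant.ofEpochMilli(shifted));
    }
}
```

<p>The shifted value is no longer a true instant, so it should only be used for display and never stored or compared against real epochs.</p>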
]]></content:encoded>
			<wfw:commentRss>http://exortech.com/blog/2009/04/01/weekly-release-blog-19-timezones/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>

