Implementing an OpenShift Enterprise routing layer for HA applications

My previous post described how to make an application HA and what exactly that means behind the scenes. This post is to augment the explanation in the HA PEP of how an administrator should expect to implement the routing layer for HA apps.

The routing layer implementation is currently left entirely up to the administrator. At some point OpenShift will likely ship a supported routing layer component, but the first priority was to provide an SPI (Service Provider Interface) so that administrators could reuse existing routing and load balancer infrastructure with OpenShift. Since most enterprises already have such infrastructure, we expected they would prefer to leverage that investment (both in equipment and experience) rather than be forced to use something OpenShift-specific.

Still, this leaves the administrator with the task of implementing the interface to the routing layer. Worldline published an nginx implementation, and we have some reference implementations in the works, but I thought I’d outline some of the details that might not be obvious in such an implementation.

The routing SPI

The first step in the journey is to understand the routing SPI events. The routing SPI itself is an interface on the OpenShift broker app that must be implemented via plugin. The example routing plugin that is packaged for Origin and Enterprise simply serializes the SPI events to YAML and puts them on an ActiveMQ message queue/topic. This is just one way to distribute the events, but it’s a pretty good way, at least in the abstract. For routing layer development and testing, you can just publish messages on a topic on the ActiveMQ instance OpenShift already uses (for Enterprise, the example plugin does this for you) and use the trivial “echo” listener to see exactly what comes through. For production, publish events to a queue (or several, if multiple instances need updating) on an HA ActiveMQ deployment that stores messages to disk when shutting down (you really don’t want to lose routing events). Note that the ActiveMQ deployment described in the OpenShift docs and deployed by the installer does not do this, being intended for much more ephemeral messages.

I’m not going to go into detail about the routing events. You’ll become plenty familiar if you implement this. You can see some good example events in this description, but always check what is actually coming out of the SPI as there may have been updates (generally additions) since. The general outline of the events can be seen in the Sample Routing Plug-in Notifications table from the Deployment Guide or in the example implementation of the SPI. Remember you can always write your own plugin to give you information in the desired format.

Consuming SPI events for app creation

The routing SPI publishes events for all apps, not just HA ones, and you might want to do something with other apps (e.g. implement blue/green deployments), but the main focus of a routing layer is to implement HA apps. So let’s look at how you do that. I’m assuming YAML entries from the sample ActiveMQ plugin below – if you use a different plugin, similar concepts should apply, just with different details.

First when an app is created you’re going to get an app creation event:

$ rhc app create phpha php-5.4 -s

:action: :create_application
:app_name: phpha
:namespace: demo
:scalable: true
:ha: false

This is pretty much just a placeholder for the application name. Note that it is not marked as HA. There is some work coming to make apps HA at creation, but currently you just get a scaled app and have to make it HA after it’s created. This plugin doesn’t publish the app UUID, which is what I would probably do if I were writing a plugin now. Instead, you’ll identify the application in any future events by the combination of app_name and namespace.
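Since the sample plugin’s events are flat YAML, any YAML library will parse them; for these simple examples even a dependency-free sketch works (a toy parser for illustration – a real listener should use a proper YAML parser):

```python
def parse_event(text):
    """Parse a flat, Ruby-symbol-keyed YAML event (as emitted by the
    sample ActiveMQ routing plugin) into a plain Python dict.
    Only handles the simple ':key: value' lines in these examples."""
    event = {}
    for line in text.strip().splitlines():
        key, _, value = line.partition(": ")
        key = key.strip().lstrip(":")
        value = value.strip()
        if value in ("true", "false"):    # YAML booleans
            value = (value == "true")
        elif value.startswith(":"):       # Ruby symbol, e.g. :create_application
            value = value.lstrip(":")
        event[key] = value
    return event

sample = """\
:action: :create_application
:app_name: phpha
:namespace: demo
:scalable: true
:ha: false
"""
event = parse_event(sample)
# The plugin publishes no app UUID, so identify the app by
# the combination of app_name and namespace.
app_key = (event["app_name"], event["namespace"])
```

The `(app_name, namespace)` tuple then serves as the lookup key for every subsequent event about the same application.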

Once an actual gear is deployed, you’ll get two (or more) :add_public_endpoint actions, one for haproxy’s load_balancer type and one for the cartridge web_framework type (and possibly other endpoints depending on cartridge).

:action: :add_public_endpoint
:app_name: phpha
:namespace: demo
:gear_id: 542b72abafec2de3aa000009
:public_port_name: haproxy-1.4
:public_port: 50847
:protocols:
- http
- ws
:types:
- load_balancer
:mappings:
- frontend: ''
  backend: ''
- frontend: /health
  backend: /configuration/health

You might expect that when you make the app HA, there is some kind of event specific to being made HA. There isn’t at this time. You just get another load_balancer endpoint creation event for the same app, and you can infer that it’s now HA. For simplicity of implementation, it’s probably just best to treat all scaled apps as if they were already HA and define routing configuration for them.
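Following that advice, a listener might track only the load_balancer endpoints per application and route to the pool as it grows – a sketch (the routing-table structure and the public_address field value here are illustrative; check your actual event contents):

```python
# Routing pool: (app_name, namespace) -> {gear_id: (address, port)}
# Treat every scaled app as potentially HA: start routing as soon as the
# first load_balancer endpoint appears, and simply add the second one
# when the app is made HA and its endpoint event arrives.
pools = {}

def handle_add_public_endpoint(event):
    if "load_balancer" not in event.get("types", []):
        return  # per decision point 1: route only to LB endpoints
    key = (event["app_name"], event["namespace"])
    pools.setdefault(key, {})[event["gear_id"]] = (
        event["public_address"], event["public_port"])

# Fields mirror the example event above; the address is hypothetical.
handle_add_public_endpoint({
    "app_name": "phpha", "namespace": "demo",
    "gear_id": "542b72abafec2de3aa000009",
    "types": ["load_balancer"],
    "public_address": "node1.example.com", "public_port": 50847,
})
```

When the second load_balancer endpoint event for the same `(app_name, namespace)` arrives, it simply lands in the same pool – no HA-specific branch needed.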

Decision point 1: The routing layer can either direct requests only to the load_balancer endpoints and let them forward traffic on to the other gears, or it can send traffic directly to all web_framework endpoints. The recommendation is to send traffic to the load_balancer endpoints, for a few reasons:

  1. This allows haproxy to monitor traffic in order to auto-scale.
  2. It will mean less frequent changes to your routing configuration (important when changes mean restarts).
  3. It will mean fewer entries in your routing configuration, which could grow quite large and become a performance concern.

However, direct routing is viable, and allows an implementation of HA without actually going to the trouble of making apps HA. You would just have to set up a DNS entry for the app that points at the routing layer and use that. You’d also have to handle scaling events manually or from the routing layer somehow (or even customize the HAproxy autoscale algorithm to use stats from the routing layer).

Decision point 2: The expectation communicated in the PEP (and how this was intended to be implemented) is that requests will be directed to the external proxy port on the node (in the example above, public port 50847 on the gear’s node). There is one problem with doing this – idling. Idler stats are gathered only on requests that go through the node frontend proxy, so if we direct all traffic to the port proxy, the haproxy gear(s) will eventually idle and the app will be unavailable even though it’s handling lots of traffic. (Fun fact: secondary gears are exempt from idling – which doesn’t help, unless the routing layer proxies directly to them.) So, how do we prevent idling? Here are a few options:

  1. Don’t enable the idler on nodes where you expect to have HA apps. This assumes you can set aside nodes for (presumably production) HA apps that you never want to idle. Definitely the simplest option.
  2. Implement health checks that actually go to the node frontend such that HA apps will never idle. You’ll need the gear name, which is slightly tricky – the endpoint above being on the first gear, it will be accessible by a request for http://phpha-demo.cloud_domain/health to the node hosting it. When the next gear comes in, you’ll have to recognize that it’s not the head gear and send the health check to e.g. http://542b72abafec2de3aa000009-demo.cloud_domain/health.
  3. Flout the PEP and send actual traffic to the node frontend. This would be the best of all worlds since the idler would work as intended without any special tricks, but there are some caveats I’ll discuss later.
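For option 2, the health-check URL can be derived from event data: the head gear answers on the app’s own DNS name, other gears on a gear-id-based name (as in the example URLs above). A sketch, with cloud_domain standing in for your configured domain:

```python
def health_check_url(app_name, namespace, gear_id, is_head_gear,
                     cloud_domain="cloud_domain", path="/health"):
    """Build the node-frontend URL that keeps an HA gear from idling.
    The head gear is reachable as <app>-<namespace>.<domain>;
    other gears as <gear_id>-<namespace>.<domain>."""
    host = (f"{app_name}-{namespace}.{cloud_domain}" if is_head_gear
            else f"{gear_id}-{namespace}.{cloud_domain}")
    return f"http://{host}{path}"
```

So for the example app, the head gear’s check goes to http://phpha-demo.cloud_domain/health and the second LB gear’s to http://542b72abafec2de3aa000009-demo.cloud_domain/health.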

Terminating SSL (TLS)

When hosting multiple applications behind a proxy, it is basically necessary to terminate SSL at the proxy. (Despite SSL having been essentially replaced by TLS at this point, we’re probably going to call it SSL for the lifetime of the internet.) This has to do with the way routing works under HTTPS: during the initialization of the TLS connection, the client has to indicate the name it wants (in our case the application’s DNS name) in the SNI extension to the TLS “hello”. The proxy can’t behave as a dumb layer 4 proxy (just forwarding packets unexamined to another TLS endpoint) because it has to examine the stream at the protocol level to determine where to send it. Since the SNI information is (from my reading of the RFC) volunteered by the client at the start of the connection, it does seem possible for a proxy to examine the protocol and then act like a layer 4 proxy based on that examination; indeed, I think F5 LBs have this capability. But it does not seem to be a standard proxy/LB capability, and certainly not for existing open source implementations (nginx, haproxy, httpd – someone correct me if I’m missing something here). So, to be inclusive, we are left with proxies that operate at layer 7, meaning they perform the TLS negotiation from the client’s perspective.

Edit 2014-10-08: layer 4 routing based on SNI is probably more available than I thought. I should have realized HAproxy 1.5 can do it, given OpenShift’s SNI proxy is using that capability. It’s hard to find details on though. If most candidate routing layer technologies have this ability, then it could simplify a number of the issues around TLS because terminating TLS could be deferred to the node.

If that was all Greek to you, the important point to extract is that a reverse proxy has to have all the information to handle TLS connections, meaning the appropriate key and certificate for any requested application name. This is the same information used at the node frontend proxy; indeed, the routing layer will likely need to reuse the same *.cloud_domain wildcard certificate and key that is shared on all OpenShift nodes, and it needs to be made aware of aliases and their custom certificates so that it can properly terminate requests for them. (If OpenShift supported specifying x509 authentication via client certificates [which BTW could be implemented without large structural changes], the necessary parameters would also need to be published to the routing layer in addition to the node frontend proxy.)

We assume that a wildcard certificate covers the standard HA DNS name created for HA apps (e.g. in this case ha-phpha-demo.cloud_domain, depending of course on configuration; notice that no event announces this name — it is implied when an app is HA). That leaves aliases which have their own custom certs needing to be understood at the routing layer:

$ rhc alias add -a phpha
:action: :add_alias
:app_name: phpha
:namespace: demo

$ rhc alias update-cert -a phpha --certificate certfile --private-key keyfile
:action: :add_ssl
:app_name: phpha
:namespace: demo
:ssl: [...]
:private_key: [...]

Aliases will of course need their own routing configuration entries regardless of HTTP/S, and something will have to create their DNS entries as CNAMEs to the ha- application DNS record.

A security-minded administrator would likely desire to encrypt connections from the routing layer back to the gears. Two methods of doing this present themselves:

  1. Send an HTTPS request back to the gear’s port proxy. This won’t work with any of the existing cartridges OpenShift provides (including the haproxy-1.4 LB cartridge), because none of them expose an HTTPS-aware endpoint. It may be possible to change this, but it would be a great deal of work and is not likely to happen in the lifetime of the current architecture.
  2. Send an HTTPS request back to the node frontend proxy, which does handle HTTPS. This actually works fine, if the app is being accessed via an alias – more about this caveat later.

Sending the right HTTP headers

It is critically important in any reverse-proxy situation to preserve the client’s HTTP request headers indicating the URL at which it is accessing an application. This allows the application to build self-referencing URLs accurately. This can be a little complicated in a reverse-proxy situation, because the same HTTP headers may be used to route requests to the right application. Let’s think a little bit about how this needs to work. Here’s an example HTTP request:

POST /app/login.php HTTP/1.1

If this request comes into the node frontend proxy, it looks at the Host header, and assuming that it’s a known application, forwards the request to the correct gear on that node. It’s also possible (although OpenShift doesn’t do this, but a routing layer might) to use the path (/app/login.php here) to route to different apps, e.g. requests for /app1/ might go to a different place than /app2/.

Now, when an application responds, it will often create response headers (e.g. a redirect with a Location: header) as well as content based on the request headers that are intended to link to itself relative to what the client requested. The client could be accessing the application by a number of paths – for instance, our HA app above should be reachable either as phpha-demo.cloud_domain or as ha-phpha-demo.cloud_domain (default HA config). We would not want a client that requests the ha- address to receive a link to the non-ha- address, which may not even resolve for it, and in any case would not be HA. The application, in order to be flexible, should not make a priori assumptions about how it will be addressed, so every application framework of any note provides methods for creating redirects and content links based on the request headers. Thus, as stated above, it’s critical for these headers to come in with an accurate representation of what the client requested, meaning:

  1. The same path (URI) the client requested
  2. The same host the client requested
  3. The same protocol the client requested

(The last is implemented via the “X-Forwarded-Proto: https” header for secure connections. Interestingly, a recent RFC (RFC 7239, which defines the Forwarded header) specifies a new header for communicating items 2 and 3, but not 1. This will be a useful alternative as it becomes adopted by proxies and web frameworks.)
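In practice, preserving these three items at the routing layer amounts to passing the Host header and request path through untouched and adding X-Forwarded-Proto. A minimal sketch (the function name and dict-based representation are illustrative, not any particular proxy’s API):

```python
def proxied_headers(client_headers, client_was_https):
    """Build the headers to send to the backend: the client's Host
    header (and the URI, carried separately in the request line) are
    passed through untouched; only the original protocol is recorded."""
    headers = dict(client_headers)   # preserves Host as the client sent it
    headers["X-Forwarded-Proto"] = "https" if client_was_https else "http"
    return headers

# A TLS request for the HA name keeps its Host header through the proxy:
h = proxied_headers({"Host": "ha-phpha-demo.cloud_domain"}, True)
```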

Most reverse proxy software should be well aware of this requirement and provide options such that when the request is proxied, the headers are preserved (for example, the ProxyPreserveHost directive in httpd). This works perfectly with the HA routing layer scheme proposed in the PEP, where the proxied request goes directly to an application gear. The haproxy cartridge does not need to route based on Host: header (although it does route requests based on a cookie it sets for sticky sessions), so the request can come in for any name at all and it’s simply forwarded as-is for the application to use.

The complication arises in situations where, for example, you would like the routing layer to forward requests to the node frontend proxy (in order to use HTTPS, or to prevent idling). The node frontend does care about the Host header because it’s used for routing, so the requested host name has to be one that the OpenShift node knows in relation to the desired gear. It might be tempting to think that you can just rewrite the request to use the gear’s “normal” name (e.g. phpha-demo.cloud_domain), but this would be a mistake, because the application would respond with headers and links based on this name. Reverse proxies often offer options for rewriting the headers and even contents of responses in an attempt to fix this, but they cannot do so accurately for all situations (example: links embedded in JavaScript properties), so this should not be attempted. (Side note: the same prohibition applies to rewriting the URI path while proxying. Rewriting one path to another is only safe for sites that serve static content and all-relative links.)

What was that caveat?

I mentioned a caveat both on defeating the idler and proxying HTTPS connections to the node frontend, and it’s related to the section above. You can absolutely forward an HA request to the node frontend if the request is for a configured alias of the application, because the node frontend knows how to route aliases (so you don’t have to rewrite the Host: header, which, as just discussed, is a terrible idea). The caveat is that, strangely, OpenShift does not create an alias for the ha- DNS entry automatically assigned to an HA app, so manual definition of an alias is currently required per-app for implementation of this scheme. I have created a feature request to instate the ha- DNS entry as an alias; since that is hopefully easy to implement, this caveat may soon disappear from this approach to routing layer implementation.

Things go away too

I probably shouldn’t even have to mention this, but: apps, endpoints, aliases, and certificates can all go away, too. Make sure that you process these events and don’t leave any debris lying around in your routing layer confs. Gears can also be moved from one host to another, which is an easy use case to forget about.
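As a sketch of what processing removal events might look like – continuing the dict-based routing table idea above, with action names assumed to mirror the add events (verify them against your plugin’s actual output):

```python
# Pre-populated routing state; the alias value is hypothetical.
pools = {("phpha", "demo"): {"542b72abafec2de3aa000009": ("node1.example.com", 50847)}}
aliases = {("phpha", "demo"): {"www.example.com"}}

def handle_event(event):
    key = (event["app_name"], event["namespace"])
    action = event["action"]
    if action == "remove_public_endpoint":
        # A moved gear shows up as a remove on one node, an add on another.
        pools.get(key, {}).pop(event["gear_id"], None)
    elif action == "remove_alias":
        aliases.get(key, set()).discard(event["alias"])
    elif action == "delete_application":
        pools.pop(key, None)      # drop every endpoint for the app
        aliases.pop(key, None)    # ...and all aliases/certs with it

handle_event({"action": "delete_application",
              "app_name": "phpha", "namespace": "demo"})
```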

And finally, speaking of going away, the example routing plugin initially provided :add_gear and :remove_gear events, and for backwards compatibility it still does (duplicating the endpoint events). These events are deprecated and should disappear soon.

HA applications on OpenShift

Beginning with version 2.0, OpenShift Enterprise supports making applications “highly available”. However, there is very little documentation on the details of implementing HA apps or exactly how they work. We surely have customers using these, but they’re mostly doing so with a lot of direct help from our consulting services. Nothing in the documentation brings it all together, so I thought I’d share what I know at this point. I think we can expect these implementation details to remain fairly stable for the lifetime of OSE 2.

This is kind of a brain dump. We are working to get most of this into the manuals if it isn’t already (at least what’s fit for it and not just me rambling). But I don’t know anywhere else to find all this in one place right now.

Why and how?

The motivation behind supplying this feature and the basic architectural points and terminology are well covered at the start of the HA PEP. Definitely read this for background if you haven’t already; I won’t repeat most of the points.

As a super quick overview: HA is an enhancement to scaled apps. Standard scaled apps allow easy horizontal scaling by deploying duplicate instances of the application code and framework cartridge in multiple gears of your app, accessed via a single load balancer (LB) proxy gear (LB is performed by the haproxy cartridge in OSE 2). For a scaled app to become HA, it should have two or more LB gears on different nodes that can each proxy to the gears of the app. That way, if a node containing one LB gear goes down, you can still reach the app via another. This only makes the web framework HA – shared storage and DB replication aren’t OpenShift features (yet), so, practically speaking, your app should be stateless or use external HA storage/DB if you actually want it to be HA.

So how do we direct HTTP requests to the multiple LB gears? Via the routing layer, which is described in the PEP, but for which OpenShift does not (yet) supply a component/configuration. If you are interested in HA apps, then chances are good you already have some kind of solution deployed for highly-available routing to multiple instances of an application (e.g. F5 boxes). All you need to do is hook OpenShift and your existing solution together via the routing SPI (Service Provider Interface). Once this is configured, HA apps are given an alias in DNS that resolves to the routing layer, which proxies requests to the LB gears on OpenShift nodes, which proxy requests to the application gears.

Making an app HA (the missing User Guide entry)

The OSE admin guide indicates how administrators can configure OSE to allow HA apps and enable specific users to make apps HA. I’ll assume you have a deployment and user with this capability.

To make an HA app, you first create a scaled app, then make it HA. The “make-ha” REST API call is implemented as an event (similar to “start” or “stop”) on an application. Direct REST API access is currently the only way to do this – rhc and other tools do not implement this call yet. So, currently the only mention is in the REST API guide. (Update 2014-12-19: rhc now has app-enable-ha and the docs have been updated. The preferred event name is now “enable-ha”.) So, for example:

$ rhc create-app testha ruby-1.9 -s

Your application 'testha' is now available.

$ curl -k -X POST --user demo:password --data-urlencode event=make-ha

long JSON response including: "messages":[{"exit_code":0, "field":null, "index":null, "severity":"info", "text":"Application testha is now ha"}], "status":"ok"

Upon success, the app will have scaled to two gears, and your second gear will also be a LB gear. You can confirm by sshing in to the second gear and looking around.

$ rhc app show --gears -a testha

ID State Cartridges Size SSH URL
53c6c6b2e659c57659000002 started haproxy-1.4 ruby-1.9 small
53c6c74fe659c57659000021 started haproxy-1.4 ruby-1.9 small

$ ssh

> ls

app-deployments app-root gear-registry git haproxy ruby

> ps -e

12607 ? 00:00:00 haproxy
13829 ? 00:00:00 httpd

A third gear will not look the same; it will only have the framework cartridge. I should mention that at this moment there’s a bug in rhc such that it displays all framework gears as having the haproxy LB cartridge, when actually only the first two do (Update 2014-12-19: rhc output has since been corrected).

What does make-ha actually do?

Behind the scenes are a few critical changes from the make-ha event.

First, a new DNS entry has been created to resolve requests to the router. How exactly this happens depends on configuration. (Refer to the Admin Guide for details on how the HA DNS entry is configured.) In the simplest path (with MANAGE_HA_DNS=true), OpenShift itself creates the DNS entry directly; the default is just to prepend “ha-” to the app name and point that entry at ROUTER_HOSTNAME. Thus our app above would now have a DNS entry ha-testha-demo.cloud_domain. With MANAGE_HA_DNS=false, OpenShift counts on the routing layer to receive the routing SPI event and create this DNS entry itself.

In either case, this DNS entry is only useful if it points at the router, which serves as a proxy. A request directly to one of the nodes for the ha- name would not be proxied correctly – it’s supposed to be relayed by the routing layer as a request to either the head LB gear or a secondary LB gear. It’s also possible for the router to just proxy directly to the individual gears (endpoints for which are provided in the routing SPI); however, this may not be desirable for a variety of reasons.

A second change that occurs is that the parameters for the haproxy cartridge are modified. By default in a scaled application, the haproxy cartridge manifest has a minimum and maximum scale of 1 (see the manifest), so there will always be exactly one. But when you make an app HA, in the MongoDB record for the haproxy cartridge instance in your app, the minimum is changed to 2, and the maximum is changed to -1 (signifying no maximum). Also the HA multiplier is applied, which I’ll discuss later. As a result of raising the minimum to 2, the app scales up one gear to meet the minimum.

There are some interesting oddities here.

First, I should note that you can’t scale the haproxy cartridge directly via the REST API at all. You’ll just get an error. You only have the ability to scale the framework cartridge, and the haproxy cartridge may be deployed with it.

Also, once your app is HA, you can no longer scale below two gears:

$ rhc scale-cartridge -a testha ruby --min 1 --max 1
Setting scale range for ruby-1.9 ... Cannot set the max gear limit to '1' if the application is HA (highly available)

By the way, there is no event for making your app not HA. Just destroy and re-create it. (Update 2014-12-19: There is now a disable-ha event that may be invoked on the API just like enable-ha [no rhc support yet]. It just changes the minimum requirements for haproxy cartridges, doesn’t immediately remove any cartridges/gears.)

Also, if you already scaled the framework cartridge to multiple gears, then making it HA will neither scale it up another gear, nor deploy the haproxy cartridge on an existing gear (which is what I would have expected). So it will not actually be HA at that point. Instead, an haproxy cartridge will be deployed with the next gear created. If you then scale down, that gear will be destroyed, and your app again effectively ceases to be HA. (Edit: this was considered a bug; 2014-12-19: it’s now fixed.) So, make sure you make your app HA before scaling it, so you are assured of having at least two LB gears. A little more about this at the end.

How does auto-scaling work with HA?

If you’re familiar with OpenShift’s auto-scaling, you may be wondering how two or more LB gears coordinate traffic statistics in order to decide when to scale.

First, I’ll say that if you’re going to the trouble to front your OpenShift app with an expensive HA solution, you may want to disable auto-scaling and let your router track the load and decide when to scale. Practically speaking, that can be tricky to implement (there actually isn’t an administrative mechanism to scale a user-owned app, so workarounds are needed, such as having all HA apps owned by an “administrative” user), but it’s something to think about.

That said, if you’re using auto-scaling, perhaps with a customized algorithm, you’ll want to know that the standard haproxy scaling algorithm runs in a daemon on the first gear (“head” gear) of the app, in both regular scaled apps and HA apps. In either case, it bases scaling decisions on a moving average of the number of HTTP sessions in flight. The only difference with an HA app is that the daemon makes a request each sampling period to all of the other LB gears to gather the same statistic and add it into the average.
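That coordination can be pictured with a small sketch – not the cartridge’s actual code, just an illustration of summing in-flight sessions across all LB gears each sampling period and feeding the total into a moving average:

```python
from collections import deque

class SessionAverager:
    """Moving average of total in-flight HTTP sessions across LB gears,
    as the head gear's scaling daemon computes it (simplified)."""
    def __init__(self, window=3):
        self.samples = deque(maxlen=window)

    def sample(self, sessions_per_lb_gear):
        # In an HA app, the head-gear daemon polls every LB gear for
        # its current session count and adds them all into the average.
        self.samples.append(sum(sessions_per_lb_gear))
        return sum(self.samples) / len(self.samples)

avg = SessionAverager(window=3)
avg.sample([10, 14])            # head gear + second LB gear
avg.sample([12, 12])
current = avg.sample([16, 8])   # the value scaling decisions see
```

For a plain scaled app, the same loop runs with a single-element list: only the head gear’s own statistic.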

Also, auto-scaling does not occur if the head gear is lost. So, auto-scaling is not HA – another reason to consider manual scaling. Which brings me to…

What happens when a gear in an HA app is lost?

If a node crashes or for some other reason one or more gears of your HA app is out of commission (rack PDU cut out? Someone tripped over the network cable?), how does your app react? Well, this is no different from a regular scaled app, but it bears repeating: nothing happens. Remaining LB gears notice the missing gears are failing health checks and pull them out of rotation – and that’s it. There is no attempt to replace the missing gear(s) even if they are LB gears, or to augment the gear count to make up for them. No alerts are sent. As far as I’m aware, nothing even notifies the broker to indicate that a gear’s state has changed from “started” to “unresponsive” so it’s hard for the app owner to even tell.

The whole point of HA apps is to remove the single point of failure in standard scaled apps by providing more than one endpoint for requests, instead of the app being unreachable when the head LB gear is lost. However, there are still some functions performed only by the head gear:

  1. Auto-scaling
  2. Deployment of application code changes

These functions are still not HA in an HA app. They still rely on the head gear, and there’s no mechanism for another gear to take over the head gear designation. What’s more, if a gear is out of commission when an application update is deployed, there isn’t any mechanism currently for automatically updating the gear when it returns (the next deploy will bring it in sync of course). These are all known problems that are in our backlog of issues to address, but aren’t terribly obvious to those not intimately familiar with the architecture.

The general assumption in OpenShift when a node disappears is that it will probably come back later – either after some crisis has passed, or via resurrection from backups. So, there is no attempt at “healing” the app. If, however, a node goes away with no hope for recovery, there is a tool oo-admin-repair which can do a limited amount of cleanup to make the MongoDB records accurately reflect the fact that the gears on that node no longer exist. I’m not sure the extent to which scaled apps are repaired; I would expect that the list of gears to route to is accurately updated, but I would not expect the head gear to be replaced (or re-designated) if lost, nor any gears that aren’t scaled (databases or other services that don’t scale yet). If the head gear is intact and auto-scaling in use, then scaling should continue normally. If auto-scaling is disabled, I would not expect any scaling to occur (to replace missing gears) without the owner triggering an event. I haven’t rigorously researched these expectations, however. But let’s just say OpenShift nodes aren’t ready to be treated like cattle yet.

How do I know my gears are spread out enough?

Under OSE 2.0, gears gained anti-affinity, meaning that gears of a scaled app will avoid being placed together on the same node if there are other nodes with available capacity. OSE 2.1 provided more explicit mechanisms for controlling gear placement. The most relevant one here is putting nodes in availability zones — for example, you could use a different zone for each rack in your datacenter. If possible, OpenShift will place the gears of an HA app across multiple zones, ensuring availability if one entire zone of nodes is lost at once.

To have better assurance than “if possible” on your HA app spanning zones, take a look at the two settings in the broker mcollective plugin that force your app to do so.

There’s one major caveat here. Anti-affinity only applies when a gear is placed. The OpenShift administrators could move gears between nodes after the fact, and if they’re not being careful about it, they could move the LB gears of an HA app all to the same node, in which case it would effectively cease to be HA since one node outage would leave the app unavailable. There’s really nothing to prevent or even detect this circumstance at this time.

Verifying that your gears actually are spread across zones as expected is currently a laborious manual operation (list all the gears, see which node their addresses resolve to, check which zones the nodes are in), but the REST API and the tools that access it are being expanded to expose the zone and region information to make this easier.

Let me add one more bit of direction here regarding zones, regions, and districts. Although you need to create districts, and although you use the district tool to assign regions and zones to your nodes, there really is no relationship between districts and zones/regions. They exist for (almost) orthogonal purposes. The purpose of districts is to make moving gears between nodes easy. The purpose of zones and regions is to control where application gears land with affinity and anti-affinity. Districts can contain nodes of multiple zones, zones can span multiple districts. It actually makes a lot of sense to have multiple zones in a district and zones spanning districts, because that way if in the future you need to shut down an entire zone, it will be easy to move all of the gears in the zone elsewhere, because there will be other zones in the same district. OTOH, while you *can* have multiple regions in a district, it doesn’t make a lot of sense to do so, because you won’t typically want to move apps en masse between regions.

The scaling multiplier: more than 2 LB gears

Of course, with only two LB gears, it only takes two gear failures to make your app unreachable. For further redundancy as well as extra load balancing capacity, you might want more than the two standard LB gears in an HA app. This can be accomplished with the use of a (currently esoteric) setting called the HA multiplier. This value indicates how many of your framework gears should also be LB gears. A multiplier of 3 would indicate that every 3rd gear should be a LB gear (so, after 6 gears were created with the first two being LBs, the 9th, 12th, and so on would all be LBs). A multiplier of 10 would mean gears 30, 40, 50,… would be LB gears.

The default multiplier is 0, which indicates that after the first two LB gears are created to satisfy the minimum requirement for the HA app (remember, the haproxy minimum scale is changed to 2 for an HA app), no more will ever be created. The multiplier can be configured in two ways:

  1. Setting DEFAULT_HA_MULTIPLIER in broker.conf – currently undocumented, but it will be read and used if present. This sets the default, of course, so it is only relevant at the time an app is created. Changing it later doesn’t affect existing apps.
  2. Using oo-admin-ctl-app to change the multiplier for a specific app, e.g:

# oo-admin-ctl-app -l demo -a testha -c set-multiplier --cartridge haproxy-1.4 --multiplier 3

Note that both are administrative functions, not available to the user via the REST API.

The gear-to-LB ratio: rotating out the LB gears

At app creation, LB gears also directly (locally) serve application requests via the co-located framework cartridge. Once a scaled app is busy enough, gears serving as load balancers may consume significant resources just for the load balancing, leaving less capacity to serve requests. Thus, for performance reasons, after a certain limit of gears created in an app, LB gears remove themselves from the load balancing rotation (“rotate out”). Under this condition, the framework cartridge is left running on LB gears, but receives no requests.

The limit at which this occurs is governed by the environment variable OPENSHIFT_HAPROXY_GEAR_RATIO, which is read by the haproxy cartridge. When the ratio of total gears to LB gears (rounded) reaches this limit, the LB gears rotate themselves out.

The default value is 3. So in a plain scaled app, once the third gear is created, the first gear is removed from rotation. In an HA app, once the 5th gear is created (5 gears / 2 LB gears rounds to 3) the first two gears are removed from rotation (resulting in an actual reduction from 4 gears servicing requests to 3). In general, it would be unwise to set this value higher than the HA multiplier, as the LB gears would be rotated in and out unevenly as the app scaled.
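The rotation condition from the two paragraphs above can be sketched like this (a reading of the behavior, not the cartridge's actual code; note it assumes half-up rounding, which is what makes 5/2 round to 3 as in the example):

```python
def lb_gears_rotated_out(total_gears, lb_gears, haproxy_gear_ratio=3):
    """Sketch: LB gears rotate out of serving app requests once the
    rounded ratio of total gears to LB gears reaches the configured
    OPENSHIFT_HAPROXY_GEAR_RATIO (default 3)."""
    # Round half-up, so e.g. 5 gears / 2 LBs -> 2.5 -> 3.
    rounded_ratio = int(total_gears / lb_gears + 0.5)
    return rounded_ratio >= haproxy_gear_ratio
```

For a plain scaled app (one LB), the condition first triggers at 3 gears; for an HA app (two LBs), at 5 gears, matching the examples above.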

The most obvious way to change this value is to set it node-wide in /etc/openshift/env/OPENSHIFT_HAPROXY_GEAR_RATIO (an administrative action). However, being an environment variable, it’s also possible for the user to override it by setting an environment variable for an individual app (there’s no administrative equivalent).


The primary method of integrating a routing layer into OpenShift is via the routing SPI, a service that announces events relevant to a routing layer, such as gears being created. The Routing SPI is an interface in the broker application that may be implemented via a plugin of your choosing. The documentation describes a plugin implementation that relays the events to an ActiveMQ messaging broker, and then an example listener for consuming those messages from the ActiveMQ broker. It is important to note that this is only one implementation, and can be substituted with anything that makes sense in your environment. Publishing to a messaging bus is a good idea, but note that as typically deployed for OpenShift, ActiveMQ is not configured to store messages across restarts. So, definitely treat this only as informational, not a recommended implementation.
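Whatever transport you choose, the consuming side ends up as a dispatcher over deserialized events that maintains the routing layer's view of the world. Here is a minimal sketch; the action and field names (`add_gear`, `public_address`, etc.) are hypothetical stand-ins, since the real payload shapes are best discovered by pointing an “echo” listener at the topic:

```python
def handle_routing_event(event, pool):
    """Toy routing-event dispatcher: maintain a per-app list of backend
    endpoints. Event keys here are illustrative, not the real SPI schema."""
    action = event.get("action")
    app = event.get("app_name")
    endpoint = (event.get("public_address"), event.get("public_port"))
    if action == "add_gear":
        pool.setdefault(app, []).append(endpoint)
    elif action == "delete_gear" and endpoint in pool.get(app, []):
        pool[app].remove(endpoint)
    return pool
```

A real implementation would go on to regenerate and reload the load balancer configuration whenever the pool changes.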

Another method for integrating a routing layer is via the REST API. Because the REST API has no administrative concept and access is restricted per user, it is somewhat limited in an administrative capacity; for example, there is no way to list all applications belonging to all users. However, querying the REST API could be suitable when implementing an administrative interface to OpenShift that intermediates all requests to the broker, at least for apps that require the routing layer (i.e. HA apps). For example, there could be a single “administrative” user that owns all production applications that need to be HA, and some kind of system for regular users to request apps via this user. In such cases, it may make sense to retrieve application details synchronously while provisioning via the REST API, rather than asynchronously via the Routing SPI.

The most relevant REST API request here is for application endpoints. This includes the connection information for each exposed cartridge on each gear of a scaled app. For example:

$ curl -k -u demo:password | python -mjson.tool

... (example LB endpoint:)

    {
        "cartridge_name": "haproxy-1.4",
        "external_address": "",
        "external_port": "56432",
        "internal_address": "",
        "internal_port": "8080",
        "protocols": [ ... ],
        "types": [ ... ]
    }

Odds and ends

It’s perhaps worth talking a bit about the “gear groups” shown in the API response, as they’re not terribly intuitive. Gear groups are sets of gears in the app that replicate the same cartridge(s). For a scaled app, the framework cartridge would be in one gear group along with the LB cartridge that scales it. The haproxy LB cartridge is known as a “sparse” cartridge because it is not deployed on all of the gears in the group, just some. If a database cartridge is added to the app, it is placed in a separate gear group (OpenShift doesn’t have scaling/replicated DB cartridges yet, but when it does, these gear groups will become larger than one gear and will “scale” separately from the framework cartridge). Without parameters, the gear_groups REST API response doesn’t indicate which cartridge is on which gear, just the cartridges that are in a gear group and the gears in that group; this is why rhc currently indicates the haproxy is located on all framework gears. This will be fixed by specifying the inclusion of endpoints in the request.

Remember how making an app HA makes it scale up to two LB gears, but not if it’s already at two gears? I suspect the “minimum” scale from the cartridge is applied to the gear group without regard for the actual number of cartridge instances; so, while making an app HA sets the haproxy cartridge “minimum” to 2, the effect is that the gear group scale minimum is set to 2 (i.e. scale is a property of the gear group). If the gear group has already been scaled to 2 before being made HA, then it doesn’t need to scale up to meet the minimum (and in fact won’t allow scaling down, so you can’t ever get that second gear to run the haproxy LB).

One thing discussed in the PEP that has no implementation (aside from a routing layer component): there’s no mechanism for a single LB gear to proxy only to a subset of the other gears. If your app is scaled to 1000 gears, then each LB gear proxies to all 1000 gears (minus LB gears which we assume are rotated out). A few other pieces of the PEP are not complete as yet: removing or restarting specific gears, removing HA, and probably other bits planned but not delivered in the current implementation.

OpenShift logging and metrics

Server logs aren’t usually a very exciting topic. But if you’re a sysadmin of an OpenShift Enterprise deployment with hundreds of app servers coming and going unpredictably, managing logs can get… interesting. Tools for managing logs are essential for keeping audit trails, collecting metrics, and debugging.

What’s new

Prior to OpenShift Enterprise 2.1, gear logs were simply written to log files. Simple and effective. But this is not ideal for a number of reasons:

  1. Log files take up your gear storage capacity. It is not hard at all to fill up your gear with logs and DoS yourself.
  2. Log files go away when your gear does. Particularly for scaled applications, this is an unacceptable loss of auditability.
  3. Log file locations and rotation policies are at the whim of the particular cartridge, thus inconsistent.
  4. It’s a pain for administrators to gather app server logs for analysis, especially when they’re spread across several gears on several nodes.

OSE 2.1 introduced a method to redirect component and gear logs to syslogd, a standard Linux service for managing logs. In the simplest configuration, you could have syslog just combine all the logs it receives into a single log file (and define a rotation policy on that). But you can do much more. You can filter and send log entries to different destinations based on where they came from; you can send them to an external logging server, perhaps to be analyzed by tools like Splunk. Just by directing logs to syslog we get all this capability for free (we’re all about reusing existing tools in OpenShift).

Where did that come from?

Well, nothing is free. Once you’ve centralized all your logging to syslogd, then you have the problem of separating entries back out again according to source so your automation and log analysis tools can distinguish the logs of different gears from each other and from other components. This must be taken into account when directing logs to syslogd; the log entries must include enough identifying information to determine where they came from down to the level of granularity you care about.

We now give instructions for directing logs to syslog for OpenShift components too; see the relevant sections of the Administration Guide for all of this. Redirecting logs from OpenShift components is fairly straightforward. There are separate places to configure if you want to use syslog from the broker rails application, the management console rails application, and the node platform. We don’t describe how to do this with MongoDB, ActiveMQ, or httpd, but those are standard components and should also be straightforward to configure as needed. Notably absent at this point are instructions for syslogging the httpd servers hosting the broker and console rails apps; but the main items of interest in those logs are error messages from the actual loading of the rails apps, which (fingers crossed) shouldn’t happen.

Notice that when configuring the node platform logging, there is an option to add “context”, which is to say the request ID and app/gear UUIDs if relevant. Adding the request ID allows connecting what happened on the node back to the broker API request that spawned the action; previously this request ID was often shown in API error responses, but was only logged in the broker log. Logging the request ID alongside the resulting node actions in syslog now makes it a lot easier to get the whole picture of what happened with a problem request, even if the gear was destroyed after the request failed.

Distinguishing gear logs

There are gear logs from two sources to be handled. First, we would like to collect the httpd access logs for the gears, which are generated by the node host httpd proxy (the “frontend”). Second, we would like to collect logs from the actual application servers running in each gear, whether they be httpd, Tomcat, MongoDB, or something else entirely.

Frontend access logs

These logs were already centralized into /var/log/httpd/openshift_log and included the app hostname as well as which backend address the request was proxied to. A single httpd option “OpenShiftFrontendSyslogEnabled” adds logging via “logger” which is the standard way to write to the syslog. Every entry is tagged with “openshift-node-frontend” to distinguish frontend access logs from any other httpd logs you might write.
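That tag is also what you would key on to split frontend access logs into their own destination. A minimal rsyslog rule to that effect might look like the following (a sketch; the file paths are examples, not from the product docs):

```
# /etc/rsyslog.d/openshift-frontend.conf (example)
# Route entries tagged "openshift-node-frontend" to their own file
# and stop processing them further.
if $programname == 'openshift-node-frontend' then /var/log/openshift_frontend.log
& stop
```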

2.1 adds the ability to look up and log the app and gear UUIDs. A single application may have multiple aliases, so it is hard to automatically collate all log entries for a single application. An application could also be destroyed and re-created with the same address, though it is technically a different app from OpenShift’s viewpoint. Finally, the same application may have multiple gears, and those gears may come and go or be moved between hosts; the backend address for a gear could also be reused by a different gear after the first is destroyed.

In order to uniquely identify an application and its gears in the httpd logs for all time, OSE 2.1 introduces the “OpenShiftAnnotateFrontendAccessLog” option which adds the application and gear UUIDs as entries in the log messages. The application UUID is unique to an application for all time (another app created with exactly the same name will get a different UUID) and shared by all of its gears. The gear UUID is unique to each gear; note that the UUID (Universally Unique ID) is different from the gear UID (User ID) which is just a Linux user number and may be shared with many other gears. Scale an application down and back up, and even if the re-created gear has the same UID as a previous gear, it will have a different UUID. But note that if you move a gear between hosts, it retains its UUID.

If you want to automatically collect all of the frontend logs for an application from syslog, set the “OpenShiftAnnotateFrontendAccessLog” option and collate log entries by application UUID. Then your httpd log entries look like this:

Jun 10 14:43:59 vm openshift-node-frontend[6746]: - - [10/Jun/2014:14:43:59 -0400] "HEAD / HTTP/1.1" 200 - "-" "curl/7.19.7 (x86_64-redhat-linux-gnu) libcurl/7.19.7 NSS/3.15.3 zlib/1.2.3 libidn/1.18 libssh2/1.4.2" (3480us) - 53961099e659c55b08000102 53961099e659c55b08000102

The “openshift-node-frontend” tag is added to these syslog entries by logger (followed by the process ID which isn’t very useful here). The app and gear UUIDs are at the end there, after the backend address proxied to. The UUIDs will typically be equal in the frontend logs since the head gear in a scaled app gets the same UUID as the app; they would be different for secondary proxy gears in an HA app or if you directly requested any secondary gear by its DNS entry for some reason.
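Since the two UUIDs are the final fields of the annotated entry, extracting them for collation is a one-regex job. A sketch (the function name is mine; the format is inferred from the sample entry above):

```python
import re

def frontend_uuids(log_line):
    """Pull the trailing (app UUID, gear UUID) pair out of an
    OpenShiftAnnotateFrontendAccessLog entry; both are 24-char hex
    strings at the end of the line. Returns None if absent."""
    match = re.search(r'([0-9a-f]{24})\s+([0-9a-f]{24})\s*$', log_line)
    return match.groups() if match else None
```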

Gear application logs

In order to centralize application logs, it was necessary to standardize cartridge logging such that all logs go through a standard mechanism that can then be centrally configured. You might think this would just be syslog, but it was also a requirement that users should be able to keep their logs in their gear if so desired, and getting syslog to navigate all of the permissions necessary to lay down those log files with the right ownership proved difficult. So instead, all cartridges now must log via the new utility logshifter (our first released component written in “go” as far as I know). logshifter will just write logs to the gear app-root/logs directory by default, but it can also be configured (via /etc/openshift/logshifter.conf) to write to syslog. It can also be configured such that the end user can choose to override this and have logs written to gear files again (which may save them from having to navigate whatever logging service ends up handling syslogs when all they want to do is debug their app).

Here distinguishing between which gear is creating the log requires somewhat more drastic measures. We want to indicate which gear created each log entry, but we can’t trust each gear to self-report accurately (as opposed to spoofing the log traffic actually coming from another gear or something else entirely). So the context information is added by syslog itself via a custom rsyslog plugin, mmopenshift. Properly configuring this plugin requires an update to rsyslog version 7, which (to avoid conflicting with the version shipped in RHEL) is actually shipped in a separate RPM, rsyslog7. So to usefully consolidate gear logs into syslog really requires replacing your entire rsyslog with the newer one. This might seem extreme, but it’s actually not too bad.

Once this is done, any logs from an application can be directed to a central location and distinguished from other applications. This time the distinguishing characteristics are placed at the front of the log entry, e.g. for the app server entry corresponding to the frontend entry above:

2014-06-10T14:43:59.891285-04:00 vm php[2988]: app=php ns=demo appUuid=53961099e659c55b08000102 gearUuid=53961099e659c55b08000102 - - [10/Jun/2014:14:43:59 -0400] "HEAD / HTTP/1.1" 200 - "-" "curl/7.19.7 (x86_64-redhat-linux-gnu) libcurl/7.19.7 NSS/3.15.3 zlib/1.2.3 libidn/1.18 libssh2/1.4.2"

The example configuration in the manual directs these to a different log file, /var/log/openshift_gears. This log traffic could be directed to /var/log/messages like the default for everything else, or sent to a different destination entirely.

Gear metrics

Aside from just improving log administration capabilities, one of the motivations for these changes is to enable collection of arbitrary metrics from gears (see the metrics PEP for background). As of OSE 2.1, metrics are basically just implemented as log messages that begin with “type=metric”. These can be generated in a number of ways:

  • The application itself can actively generate log messages at any time; if your application framework provides a scheduler, just have it periodically output to stdout beginning with “type=metric” and logshifter will bundle these messages into the rest of your gear logs for analysis.
    • Edit 2014-06-25: Note that these have a different format and tag than the watchman-generated metrics described next, which appear under the “openshift-platform” tag and aren’t processed by the mmopenshift rsyslog plugin. So you may need to do some work to have your log analyzer consider these metrics.
  • Metrics can be generated passively by the openshift-watchman service in a periodic node-wide run, in several ways:
    • By default it generates standard metrics out of cgroups for every gear. These include RAM, CPU, and storage.
    • Each cartridge can indicate in its manifest that it supports metrics, in which case the bin/metrics script is executed and its output is logged as metrics. No standard cartridges shipped with OSE support metrics at this time, but custom cartridges could.
    • Each application can create a metrics action hook script in its git repo, which is executed with each watchman run and its output logged as metrics. This enables the application owner to add custom metrics per app.
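As a sketch of the last option, a metrics action hook is just an executable in the app's git repo whose stdout gets logged. The script below is hypothetical (the metric name and the use of OPENSHIFT_DATA_DIR are my own illustration); the only convention taken from the text is the leading "type=metric":

```python
#!/usr/bin/env python
# Hypothetical metrics action hook: watchman runs it periodically and
# logs whatever it prints as metrics.
import os

def metric_lines():
    # Example custom metric: count the files in the gear's data directory.
    data_dir = os.environ.get("OPENSHIFT_DATA_DIR", ".")
    count = len(os.listdir(data_dir))
    return ["type=metric data.files.count=%d" % count]

if __name__ == "__main__":
    for line in metric_lines():
        print(line)
```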

It should be noted that the cartridge and action hook metrics scripts have a limited time to run, so that they can’t indefinitely block the metrics run for the rest of the gears on the node. All of this is configurable with watchman settings in node.conf. Note also that watchman-generated logs are tagged with “openshift-platform”, e.g.:

Jun 10 16:25:39 vm openshift-platform[29398]: type=metric appName=php6 gear=53961099e659c55b08000102 app=53961099e659c55b08000102 ns=demo quota.blocks.used=988 quota.blocks.limit=1048576 quota.files.used=229 quota.files.limit=80000

The example rsyslog7 and openshift-watchman configuration will route watchman-generated entries differently from application-server entries since the app UUID parameter is specified differently (“app=” vs “appUuid=”). This is all very configurable.
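That "app=" vs. "appUuid=" distinction is easy to exploit in a log analyzer. A toy classifier (names are mine; the key convention is taken from the sample entries above):

```python
def entry_source(log_line):
    """Classify a consolidated gear log entry by which key carries the
    application UUID: "appUuid=" for application-server entries routed
    through mmopenshift, "app=" for watchman/platform entries."""
    fields = dict(part.split("=", 1)
                  for part in log_line.split()
                  if "=" in part)
    if "appUuid" in fields:
        return ("app-server", fields["appUuid"])
    if "app" in fields:
        return ("watchman", fields["app"])
    return ("unknown", None)
```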

I am currently working on installer options to enable these centralized logging options as sanely as possible.

OpenShift, Apache, and severe hair removal

I just solved a problem that stumped me for a week. It was what we in the business call “a doozy”. I’ll share here, mostly to vent, but also in case the process I went through might help someone else.

The problem: a non-starter

I’ve been working on packaging all of OpenShift Enterprise into a VM image with a desktop for convenient hacking on a personal/throwaway environment. It’s about 3 GB (compressed) and takes an hour or so to build. It has to be built on our official build servers using signed RPMs via an unattended kickstart. There have been a few unexpected challenges, and this was basically the cherry on top.

The problem was that the first time the VM booted… openshift-broker and openshift-console failed to start. Everything else worked, including the host-level httpd. Those two (which are httpd-based) didn’t start, and they didn’t leave any logs to indicate why. They didn’t even get to the point of making log files.

And the best part? It only happened the first time. If you started the services manually, they worked. If you simply rebooted after starting the first time… they worked. So basically, the customer’s first impression would be that it was hosed… even though it magically starts working after a reboot, the damage is done. I would look like an idiot trying to release with that little caveat in place.

Can you recreate it? Conveniently?

WTF causes that? And how the heck do I troubleshoot? For a while, the best I could think of was starting the VM up in runlevel 1 (catch GRUB at boot and add the “1” parameter), patching in diagnostics, and then proceeding to init. After one run, if I don’t have what I need… recreate the VM and try again. So painful I literally just avoided it and busied myself with other things.

The first breakthrough was when I tried to test the kinds of things that happen only at the first boot after install. There are a number, potentially – services coming up for the first time and so forth. Another big one is network initialization on a new network and device. I couldn’t see how that would affect these services (they are httpd binding only to localhost), but I did experiment with changing the VM’s network device after booting (remove the udev rule, shut down, remove the device, add another), and found that indeed, it caused the failure on the next boot reliably.

So it had to be something to do with network initialization.

What’s the actual error?

Being able to cause it at will on reboot meant much easier iteration of diagnostics. First I tried just adding a call to ifconfig in the openshift-broker init script. I couldn’t see anything in the console output, so I assumed it was just being suppressed somehow.

Next I tried to zero in on the actual failure. When invoked via init script, the “daemon” function apparently swallows console output from the httpd command, but it provides an opportunity to add options to the command invocation, so I looked up httpd command line options and found two that looked helpful: “-e debug -E /var/log/httpd_console”:

-e level
Sets the LogLevel to level during server startup. This is useful for temporarily increasing the verbosity of the error messages to find problems during startup.

-E file
Send error messages during server startup to file.

This let me bump up the logging level at startup and capture the server startup messages. (Actually probably only the latter matters. Another one to keep in mind is -X which starts it as a single process/thread only – helpful for using strace to follow it. Not helpful here though.)

This let me see the actual failure:

 [crit] (EAI 9)Address family for hostname not supported: alloc_listener: failed to set up sockaddr for

Apparently the httpd startup process tries to bind to network interfaces before it even opens logs, and this is what you get when binding to localhost fails.

What’s the fix?

An error is great, but searching the mighty Google for it was not very encouraging. There were a number of reports of the problem, but precious little about what actually caused it. The closest I could find was this httpd bug report:

Bug 52709 – Apache can’t bind to if eth0 has only IPv6


This bug also affects httpd function ap_get_local_host() as described in
Httpd will then fail to get the fully qualified domain name.

This occurs when apache starts before dhclient finished its job.

Here was something linking the bind failure to incomplete network initialization by dhclient. Suddenly the fact that my test “ifconfig” at service start had no output did not seem like a fluke. When I started the service manually, ifconfig certainly had output.

So, here’s the part I don’t really claim to understand. Apparently there’s a period after NetworkManager does its thing and other services are starting where, in some sense at least, the network isn’t really available. At least not in a way that ifconfig can detect, and not in a way that allows httpd to explicitly bind to localhost.

As a workaround, I added a shim service that would wait for network initialization to actually complete before trying to start the broker and console. I could add a wait into those service scripts directly, but I didn’t want to hack up the official files that way. So I created this very simple service (openshift-await-eth0) that literally just runs “ifconfig eth0” and waits up to 60 seconds for it to include a line beginning “inet addr:”. Notably, if I have it start right after NetworkManager, it finishes immediately, so it seems the network is up at that point, but goes away just in time for openshift-broker and openshift-console to trip over it. So I have the service run right before openshift-broker, and now my first boot successfully starts everything.
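The core of that shim is just a poll-until-ready loop over ifconfig output. A sketch in Python (the real service is an init script; function names and the sample parsing are mine, factored so the check is testable without a live interface):

```python
import subprocess
import time

def has_inet_addr(ifconfig_output):
    """True if `ifconfig <dev>` output shows an assigned IPv4 address,
    i.e. contains a line beginning "inet addr:"."""
    return any(line.strip().startswith("inet addr:")
               for line in ifconfig_output.splitlines())

def await_interface(device="eth0", timeout=60):
    """Poll ifconfig until the device has an address or timeout expires."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        out = subprocess.run(["ifconfig", device],
                             capture_output=True, text=True).stdout
        if has_inet_addr(out):
            return True
        time.sleep(1)
    return False
```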

Since this is probably the only place we’ll ever deploy OpenShift that has dhclient trying to process a new IP at just the time the broker and console are being started, I don’t know that it will ever be relevant to anyone else. But who knows. Maybe someone can further enlighten me on what’s going on here and a better way to avoid it.

OpenShift with dynamic host IPs?

From the time we began packaging OpenShift Enterprise, we made a decision not to support dynamic changes to host IP addresses. This might seem a little odd since we do demonstrate installation with the assumption that DHCP is in use; we just require it to be used with addresses pinned statically to host names. It’s not that it’s impossible to work with dynamic re-leasing; it’s just that it’s an unnecessary complication and potentially a source of tricky problems.

However, I’ve crawled all over OpenShift configuration for the last few months, and I can say with a fair amount of confidence that it’s certainly possible to handle dynamic changes to host IP, as long as those changes are tracked by DNS with static hostnames.

But there are, of course, a number of caveats.

First off, it should be obvious that DNS must be integrated with DHCP such that hostnames never change and always resolve correctly to the same actual host. Then, if configuration everywhere uses hostnames, it should in theory be able to survive IP changes.

The most obvious exception is the IP(s) of the nameserver(s) themselves. In /etc/resolv.conf clearly the IP must be used, as it’s the source for name resolution, so it can’t bootstrap itself. However, in the unlikely event that nameservers need to re-IP, DHCP could make the transition with a bit of work. You could not use our basic dhclient configuration that statically prepends the installation nameserver IP – instead the DHCP server would need to supply all nameserver definitions, and there would be some complications around the transition since not all hosts would renew leases at the same time. Really, this would probably be the province of a configuration management system. I trust that those who need to do such a thing have thought about it much more than I have.

Then there’s the concern of the dynamic DNS server that OpenShift publishes app hostnames to. Well, no reason that can’t be a hostname as well, as long as the nameserver supplied by DHCP/dhclient knows how to resolve it. Have I mentioned that you should probably implement your host DNS separately from the dynamic app DNS? No reason they need to use the same server, and probably lots of reasons not to.

OK, maybe you’ve looked through /etc/openshift/node.conf and noticed the PUBLIC_IP setting in there. What about that? Well, I’ve tracked that through the code base and as far as I can tell, the only thing it is ever used for is to create a log entry when gears are created. In other words, it has no functional significance. It may have in the past – as I understand it, apps used to be created with A records rather than as CNAMEs to the node hosts. But for now, it’s a red herring.

Something else to worry about are iptables filters. In our instructions we never demonstrate filters for specific addresses, but conscientious sysadmins would likely limit connections to the hosts that are expected to need them in many cases. And they would be unlikely to define them using hostnames. So either don’t do that… or have a plan for handling IP changes.

One more caveat: what do we mean by dynamic IP changes? How dynamic?

If we’re talking about the kind of IP change where you shut down the host (perhaps to migrate its storage) and when it is booted again, it has a new IP, then that use case should be handled pretty well (again, as long as all configured host references use the hostname). This is the sort of thing you would run into in Amazon EC2 where hosts keep their IP as long as they stay up, but when shut down generally get a new IP. All the services on the host are started with the proper IP in use.

It’s a little more tricky to support IP changes while the host is operating. Any services that bind specifically to the external IP address would need restarting. I’ve had a look while writing this, though, and this is a lot less common than I expected. As far as I can see, only one node host service does that: haproxy (which is used by openshift-port-proxy to proxy specific ports from the external interface back to gear ports). The httpd proxy listens to all interfaces so it’s exempt, and individual gears listen on internal interfaces only. On the broker and supporting hosts, ActiveMQ and MongoDB either listen locally or to all interfaces. The nameserver, despite being configured to listen to “all”, appears to bind to specific IPs, so it looks like it would need a restart. You could probably configure dhclient to do this when a lease changes (with appropriate SELinux policy changes to permit it). But you can see how easy this would be to get wrong.

Hopefully this brief exploration of the issues involved demonstrates why we’re going to stick with the current (non-)support policy for the time being. But I also expect some of you out there will try out OpenShift in a dynamic IP environment, and if so I hope you’ll let me know what you run into.

What does a tech support geek do all day anyway?

I hate to let a calendar month go by with no post.

Though hey, I’m so far behind on reading my web comics, I don’t even know how far behind I am anymore. That’s when you know it’s serious.

Being a parent and keeping things from falling apart has pretty much taken up all my energy outside work at this point. It takes up a lot of time, too, but the real constraint is energy and enthusiasm. It’s hard to really get into anything significant that you know will probably be interrupted soon. Thus I haven’t even gotten around to checking out the Android-friendly ORMlite updates. Someday!

At least some of the arcane stuff I’ve been working on for my day job is publicly visible! Too bad VMware doesn’t really have a way for KB authors to sign our work, but here are a few bits of interest in the web server / application server world…

I have some in the pipeline to attempt to finally explain Tomcat auto-deployment and logging in a way that a non-expert can follow and use (I dare say the engineers are too close to their work). I should work on the official docs, but that’s not officially part of my goals… if I could get someone to pay me for it, I probably would. I think I have something of a gift for technical docs. It just takes me such a long time, I doubt anyone would consider it worth funding. At the same time, I don’t want to quit doing actual technical work in order to document it.

Maybe soon I’ll document my journey of trying to use Fedora 16 for my workstation (VPNs… virtualization… and dual monitors, oh my!)

Clustering Tomcat (part III): the HTTPS connector

Refer to my earlier posts on the subject for background. Here are further explorations, not having much to do with clustering as it turns out, but everything to do with proxying.

In a number of situations you might want to set up encrypted communication between the proxy and backend. For this, Tomcat supplies an HTTPS connector (as far as I know, the only way to encrypt requests to Tomcat).

Connector setup

Setup is actually fairly simple with just a few surprises. Mainly, the protocol setting on the connector remains “HTTP/1.1” not some version of “HTTPS” – the protocol being the language spoken, and SSL encryption being a layer on top of that which you specify with SSLProtocol. Basically, HTTPS connectors look like HTTP connectors with some extra SSL properties specifying the encryption:

<Connector executor="tomcatThreadPool"
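A complete HTTPS connector along those lines might look like the following sketch; the keystore location, password, and port here are placeholders, not values from my actual setup:

```xml
<!-- HTTPS connector sketch: an HTTP connector plus SSL attributes.
     Keystore path, password, and port are placeholder values. -->
<Connector executor="tomcatThreadPool"
           port="8443" protocol="HTTP/1.1"
           SSLEnabled="true"
           scheme="https" secure="true"
           keystoreFile="conf/keystore.jks"
           keystorePass="changeit"
           sslProtocol="TLS"
           clientAuth="false" />
```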

If I really wanted to be thorough I would set clientAuth="true" and set up the proxy with a client certificate signed by a CA in this server’s truststore, thereby guaranteeing that only the proxy can even make a request. But not right now.

Note the “scheme” and “secure” properties here. These don’t actually affect the connection; instead, they specify what Tomcat should answer when a servlet asks questions about its request. Specifically, request.getScheme() is likely to be used to create self-referential URLs, while request.isSecure() is used to make several security-related decisions. Defaults are for a non-secure HTTP connector but they can be set to whatever makes sense in context – in fact, AFAICS the scheme can be set to anything and the connector will still serve data fine.
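As an example of setting them to fit the context (a sketch; hostnames and ports are placeholders): if a proxy terminates SSL and forwards plain HTTP to Tomcat, it can make sense to have the plain connector report the front-end scheme anyway, so self-referential URLs come out right:

```xml
<!-- Plain-HTTP connector behind an SSL-terminating proxy (sketch).
     scheme/secure describe the client-facing connection, not this one. -->
<Connector port="8080" protocol="HTTP/1.1"
           scheme="https" secure="true"
           proxyName="www.example.com" proxyPort="443" />
```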

Clustering Tomcat (part II)

Refer to my earlier post on the subject for background. Here are some more things I ran into.

Tricking mod_proxy_html

It didn’t take too long to come up against something that mod_proxy_html wouldn’t rewrite correctly. In the petcare sample app, there’s a line in a Javascript file (resources/jquery/openid-selector/js/openid-jquery.js) that looks like this:

	img_path: '/petcare/resources/jquery/openid-selector/images/',

There’s nothing to identify it as a URL; it’s just a regular old JS property, so mod_proxy_html does nothing with it. In fact, as far as I can see, there isn’t any way to tweak the module configuration to handle it, even knowing exactly what the problem is. In this case, having the path wrong meant that some vital icons on the OpenID sign-in page didn’t show up.

So I went ahead and broke everything out into vhosts like I should have all along. Look folks, if you’re reverse proxying, you really just need to have the same path on the front and backend unless what you’re doing is dead simple. If your front-end URL-space dictates a particular path, move your app to that path on the backend; don’t try to remap the path, it will almost certainly cause you headaches.
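The vhost approach amounts to something like this sketch (hostnames are placeholders), with the same path on both sides so no content rewriting is needed:

```apache
# One vhost per scenario; identical path on front end and back end (sketch)
<VirtualHost *:80>
    ServerName petcare.example.com
    ProxyPass        /petcare http://backend.example.com:8080/petcare
    ProxyPassReverse /petcare http://backend.example.com:8080/petcare
</VirtualHost>
```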

<Location> bleeds into VirtualHost

Having created vhosts, I noticed something interesting: <Location> sections that I defined in the main server config sometimes applied in the vhosts as well – sometimes not. In particular, applying a handler (like the server-status handler) also applied to vhosts, while a JkMount in the main section did not cause vhosts to also serve requests through mod_jk. I think there were some other oddities like this but I can’t recall them anymore.

It’s worth noting that RewriteRule directives in the main server config are explicitly not inherited by vhosts unless you specify that they should be.
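The opt-in is RewriteOptions Inherit, set inside each vhost that should pick up the main-server rules; a sketch (the rule and names are placeholders):

```apache
# Main server config
RewriteEngine On
RewriteRule ^/status$ /server-status [PT]

<VirtualHost *:80>
    ServerName app.example.com
    # Without these two lines, the main-server RewriteRule above is ignored here
    RewriteEngine On
    RewriteOptions Inherit
</VirtualHost>
```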

mod_jk cluster partitioning

mod_jk workers have this property “domain” that you can set (ref. the Tomcat Connector guide). It’s not exactly obvious what it does – it doesn’t make sense to use the domain as the route when you’ve already got a route for each instance. I also read somewhere that mod_jk can be used to partition a cluster. Reading between the lines a little bit and trying it out, here’s what I found:

If you specify a domain, the workers with the same domain will failover like a sub-cluster within the cluster; they all still have their own routes, but if one instance fails, mod_jk will try to route to another member of the same sub-cluster. This means that you only need to replicate sessions between the members of the sub-cluster (assuming you trust the sub-cluster not to fail entirely). This could significantly cut down on the amount of session replication network traffic in a large cluster.
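Based on that reading, a workers.properties for a four-node cluster split into two sub-clusters might look like this sketch (worker names, hostnames, and domain names are all my own placeholders):

```properties
# Four AJP workers behind one load balancer, partitioned into two
# failover domains (sketch; all names and hosts are placeholders)
worker.list=loadbalancer

worker.loadbalancer.type=lb
worker.loadbalancer.balance_workers=node1,node2,node3,node4

# node1 and node2 form one sub-cluster: sessions need only be
# replicated between them, since mod_jk fails over within the domain
worker.node1.type=ajp13
worker.node1.host=host1.example.com
worker.node1.port=8009
worker.node1.domain=domainA

worker.node2.type=ajp13
worker.node2.host=host2.example.com
worker.node2.port=8009
worker.node2.domain=domainA

worker.node3.type=ajp13
worker.node3.host=host3.example.com
worker.node3.port=8009
worker.node3.domain=domainB

worker.node4.type=ajp13
worker.node4.host=host4.example.com
worker.node4.port=8009
worker.node4.domain=domainB
```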

Clustering Tomcat

I’ve been meaning for some time to set up Tomcat clustering in several different ways just to see how it works. I finally got a chance today. There are guides and other bits of information elsewhere on how to do all this. I’m not gonna tell you how, sorry; the point of this post is to document my problems and how I overcame them.

A word about setup

My front end load balancer is Apache httpd 2.2.15 (from SpringSource ERS 4.0.2) using mod_proxy_http, mod_proxy_ajp, mod_jk, and (just to keep things interesting) mod_rewrite as connectors to the backend Tomcat 6.0.28 instances. I wanted to try out 2-node Tomcat clusters (one side of each cluster ERS Tomcat, the other side tc Server 2.0.3) without any session replication (so, sticky sessions, and you get a new session if there’s a failure) and with session replication via the standard Delta manager (which replicates all sessions to all nodes) and the Backup manager (which replicates all sessions to a single reference node, the “backup” for each app). Basically the idea is to test out every conceivable way to put together a cluster with standard SpringSource technologies.

The first trick was mapping all of these into the URL space for my httpd proxy. I wanted to put each setup behind its own URL /<connector>/<cluster>, so e.g. /http/backup and /ajp/delta. This is typically not done, and for good reason; mod_proxy will let you do the mapping and will even clean up redirects and cookie paths from the backend, but to take care of self-referential links in the backend apps you actually have to rewrite the content that comes back; for that I installed mod_proxy_html, a non-standard module for doing such things. The reason it’s a non-standard module is that this approach is fraught with danger. But given that I mostly don’t care about how well the apps work in my demo cluster, I thought it’d be a great time to put it through its paces.

For this reason, most people make sure the URLs on the front-end and back-end are the same; and in fact, as far as I could tell, there was no way at all to make mod_jk do any mapping, so I’m setting it up as a special case – more later if it’s relevant. The best way to do this demo would probably be to set up virtual hosts on the proxy for each scenario and not require any URI mapping; if I run into enough problems I’ll probably fall back to that.
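For illustration, the remapped-path setup for one scenario might look roughly like this sketch (hostnames are placeholders, and mod_proxy_html directive names vary somewhat between versions of the module):

```apache
# Map front-end /http/backup onto the backend app at /petcare (sketch)
ProxyPass        /http/backup http://backup-node.example.com:8080/petcare
ProxyPassReverse /http/backup http://backup-node.example.com:8080/petcare
ProxyPassReverseCookiePath /petcare /http/backup

<Location /http/backup>
    # Rewrite self-referential links in the HTML that comes back
    ProxyHTMLEnable On
    ProxyHTMLURLMap /petcare /http/backup
    SetOutputFilter proxy-html
</Location>
```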

Problem 1: the AJP connector

I got things running without session replication fairly easily. My first big error with session replication was actually something else, but at first I thought it might be related to this warning in the Tomcat log:

Aug 13, 2010 9:43:00 PM org.apache.catalina.core.AprLifecycleListener init

INFO: The APR based Apache Tomcat Native library which allows optimal performance in production environments was not found on the java.library.path: /usr/java/jdk1.6.0_21/jre/lib/i386/server:/usr/java/jdk1.6.0_21/jre/lib/i386:/usr/java/jdk1.6.0_21/jre/../lib/i386:/usr/java/packages/lib/i386:/lib:/usr/lib
Aug 13, 2010 9:43:00 PM org.apache.catalina.startup.ConnectorCreateRule _setExecutor
WARNING: Connector [org.apache.catalina.connector.Connector@1d6f122] does not support external executors. Method setExecutor(java.util.concurrent.Executor) not found.

These actually turn out to be related:

<Listener className="org.apache.catalina.core.AprLifecycleListener" SSLEngine="on" />

<Connector executor="tomcatThreadPool"

I don’t really know why the APR isn’t working properly, but a little searching turned up some obscure facts: if the APR isn’t loaded, then for the AJP connector Tomcat makes a default choice of implementation that doesn’t use the executor thread pool. So you have to explicitly set the class to use like this:

<Connector executor="tomcatThreadPool"
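Spelled out in full, that connector might look like the following sketch; as I understand it, the class to name for the Java (non-APR) AJP implementation on Tomcat 6 is org.apache.coyote.ajp.AjpProtocol:

```xml
<!-- Name the AJP implementation explicitly (instead of the AJP/1.3
     shorthand) so the executor thread pool is actually used -->
<Connector executor="tomcatThreadPool"
           port="8009"
           protocol="org.apache.coyote.ajp.AjpProtocol" />
```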

Nice, eh? OK. But that was just an annoyance.

Problem 2: MBeans blowup

The real thing holding me back was this error:

Aug 13, 2010 9:52:16 PM org.apache.catalina.mbeans.ServerLifecycleListener createMBeans
SEVERE: createMBeans: Throwable
at org.apache.catalina.mbeans.MBeanUtils.createObjectName(
at org.apache.catalina.mbeans.MBeanUtils.createMBean(

Now what the heck was that all about? Well, I found no love on Google. But I did eventually guess what the problem was. This probably underscores a good methodology: when working on configs, add one thing at a time and test it out before going on. If only Tomcat had a “configtest” like httpd – it takes forever to “try it out”.

In case anyone else runs into this, I’ll tell you what it was. The Cluster Howto made it pretty clear that you need to mark your webapp contexts distributable for clustering to work. It wasn’t clear to me where to put that, and I didn’t want to create a context descriptor for each app on each instance. I knew you could put a <Context> element in server.xml, so that’s right where I put it, right inside the <Host> element:

<Context distributable="true" />

Well, that turns out to be a bad idea. It causes the error above. So don’t do that. For the products I’m using, there’s a single context.xml that applies to all apps on the server; that’s where you want to put the distributable attribute.
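In other words, a sketch of conf/context.xml (the WatchedResource line is the stock content that ships in that file):

```xml
<!-- conf/context.xml applies to every webapp on this instance -->
<Context distributable="true">
    <WatchedResource>WEB-INF/web.xml</WatchedResource>
</Context>
```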

Cluster membership – static

My next task was to get the cluster members in each case to recognize each other and replicate sessions. Although all the examples use multicast to do this, I wanted to try setting up static membership, because I didn’t want to look up how to enable multicast just yet. And it should be simpler, right? Well, I had a heck of a time finding this, but it looks like the StaticMembershipInterceptor is the path.

My interceptor looks like this:

<Interceptor className="org.apache.catalina.tribes.group.interceptors.StaticMembershipInterceptor">
  <Member className="org.apache.catalina.tribes.membership.StaticMember"
          port="8210" securePort="-1"
          host="..."
          uniqueId="{1,2,3,4,5,6,7,8,9,0}" />
</Interceptor>

(with the "host" being the other host in the 2-node cluster on each side). Starting with this configuration brings an annoying warning message:

WARNING: [SetPropertiesRule]{Server/Service/Engine/Cluster/Channel/Interceptor/Member} Setting property ‘uniqueId’ to ‘{1,2,3,4,5,6,7,8,9,0}’ did not find a matching property.

So I guess that property has been removed and the docs not updated; didn’t seem necessary anyway given the host/port combo should be unique.

In any case, the log at first looks encouraging:

Aug 13, 2010 10:13:04 PM org.apache.catalina.ha.tcp.SimpleTcpCluster memberAdded
INFO: Replication member added:org.apache.catalina.tribes.membership.MemberImpl[tcp://vm-centos-cluster-tcs.sosiouxme.lan:8210,vm-centos-cluster-tcs.sosiouxme.lan,8210, alive=0,id={0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 }, payload={}, command={}, domain={100 101 108 116 97 45 99 108 117 …(13)}, ]
Aug 13, 2010 10:13:04 PM org.apache.catalina.tribes.membership.McastServiceImpl waitForMembers
INFO: Sleeping for 1000 milliseconds to establish cluster membership, start level:4
Aug 13, 2010 10:13:04 PM org.apache.catalina.ha.tcp.SimpleTcpCluster memberAdded
INFO: Replication member added:org.apache.catalina.tribes.membership.MemberImpl[tcp://{172, 31, 1, 108}:8210,{172, 31, 1, 108},8210, alive=25488670,id={66 108 -53 22 -38 -64 76 -110 -110 -54 28 -11 -126 -44 66 28 }, payload={}, command={}, domain={}, ]

Sounds like the other cluster member is found, right? I don’t know why it’s in there twice (once w/ hostname, once with IP) but I think the first is the configuration value, and the second is for when the member is actually contacted (alive=).

And indeed, for one node, later in the log as the applications are deployed, I can see that session replication appears to happen for each:

WARNING: Manager [localhost#/petclinic], requesting session state from org.apache.catalina.tribes.membership.MemberImpl[tcp://vm-centos-cluster-tcs.sosiouxme.lan:8210,vm-centos-cluster-tcs.sosiouxme.lan,8210, alive=0,id={0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 }, payload={}, command={}, domain={100 101 108 116 97 45 99 108 117 …(13)}, ]. This operation will timeout if no session state has been received within 60 seconds.
Aug 13, 2010 10:13:09 PM org.apache.catalina.ha.session.DeltaManager waitForSendAllSessions
INFO: Manager [localhost#/petclinic]; session state send at 8/13/10 10:13 PM received in 186 ms.

Um, why is that a WARNING? Looks like normal operation to me; surely it should be an INFO. Whatever. The real problem is on the other node:

13-Aug-2010 22:25:16.624 INFO org.apache.catalina.ha.session.DeltaManager.start Register manager /manager to cluster element Engine with name Catalina
13-Aug-2010 22:25:16.624 INFO org.apache.catalina.ha.session.DeltaManager.start Starting clustering manager at/manager
13-Aug-2010 22:25:16.627 WARNING org.apache.catalina.ha.session.DeltaManager.getAllClusterSessions Manager [localhost#/manager], requesting session state from org.apache.catalina.tribes.membership.MemberImpl[tcp://vm-centos-cluster-ers.sosiouxme.lan:8210,vm-centos-cluster-ers.sosiouxme.lan,8210, alive=0,id={0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 }, payload={}, command={}, domain={100 101 108 116 97 45 99 108 117 …(13)}, ]. This operation will timeout if no session state has been received within 60 seconds.
13-Aug-2010 22:26:16.635 SEVERE org.apache.catalina.ha.session.DeltaManager.waitForSendAllSessions Manager [localhost#/manager]: No session state send at 8/13/10 10:25 PM received, timing out after 60,009 ms.

And so on for the rest of my webapps too! And if I try to access my cluster, it seems to be hung! While I was writing this up I think I figured out the problem with that. On the first node, I had gone into the context files for the manager and host-manager apps and set distributable="false" (doesn’t really make sense to distribute manager apps). On the second node I had not done the same. My bad, BUT:

  1. Why did it take 60 seconds to figure this out for EACH app; and
  2. Why did EVERY app, not just the non-distributable ones, fail replication (at 60 seconds apiece)?

Well, having cleared that up, my cluster with DeltaManager session replication seems to be working great.

Cluster with BackupManager, multicast

OK, here’s the surprise denouement – this seems to have just worked out of the box. I didn’t even think multicast would work without me having to tweak something in my OS (CentOS 5.5) or my router. But it did, and failover worked flawlessly when I killed a node too. Sweet.
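For reference, the BackupManager setup amounts to roughly this server.xml fragment (a sketch; the multicast address and port shown are Tomcat’s documented defaults):

```xml
<!-- Cluster with BackupManager and default multicast membership (sketch) -->
<Cluster className="org.apache.catalina.ha.tcp.SimpleTcpCluster">
    <Manager className="org.apache.catalina.ha.session.BackupManager"/>
    <Channel className="org.apache.catalina.tribes.group.GroupChannel">
        <!-- Default multicast membership: 228.0.0.4:45564 -->
        <Membership className="org.apache.catalina.tribes.membership.McastService"
                    address="228.0.0.4" port="45564"
                    frequency="500" dropTime="3000"/>
    </Channel>
</Cluster>
```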