Implementing an OpenShift Enterprise routing layer for HA applications

My previous post described how to make an application HA and what exactly that means behind the scenes. This post augments the HA PEP's explanation of how an administrator should expect to implement the routing layer for HA apps.

The routing layer implementation is currently left entirely up to the administrator. At some point OpenShift will likely ship a supported routing layer component, but the first priority was to provide an SPI (Service Provider Interface) so that administrators could reuse existing routing and load balancer infrastructure with OpenShift. Since most enterprises already have such infrastructure, we expected they would prefer to leverage that investment (both in equipment and experience) rather than be forced to use something OpenShift-specific.

Still, this leaves the administrator with the task of implementing the interface to the routing layer. Worldline published an nginx implementation, and we have some reference implementations in the works, but I thought I’d outline some of the details that might not be obvious in such an implementation.

The routing SPI

The first step in the journey is to understand the routing SPI events. The routing SPI itself is an interface on the OpenShift broker app that must be implemented via plugin. The example routing plugin that is packaged for Origin and Enterprise simply serializes the SPI events to YAML and puts them on an ActiveMQ message queue/topic. This is just one way to distribute the events, but it’s a pretty good way, at least in the abstract. For routing layer development and testing, you can just publish messages on a topic on the ActiveMQ instance OpenShift already uses (for Enterprise, openshift.sh does this for you) and use the trivial “echo” listener to see exactly what comes through. For production, publish events to a queue (or several if multiple instances need updating) on an HA ActiveMQ deployment that stores messages to disk when shutting down (you really don’t want to lose routing events) – note that the ActiveMQ deployment described in OpenShift docs and deployed by the installer does not do this, being intended for much more ephemeral messages.
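If you want to roll your own listener for experimentation, a minimal sketch in Python using the stomp.py client might look like the following. The broker host, credentials, and the routinginfo topic name are assumptions that have to match your ActiveMQ and plugin configuration.

#!/usr/bin/env python
# Minimal sketch: dump routing SPI events published by the sample ActiveMQ plugin.
# Assumptions: stomp.py >= 7 (single-frame listener API), STOMP enabled on the
# broker's ActiveMQ (port 61613), events published as YAML to a 'routinginfo' topic.
import time
import stomp
import yaml

class RoutingEventListener(stomp.ConnectionListener):
    def on_message(self, frame):
        # The plugin serializes Ruby symbols, so keys arrive as strings like ':action'.
        event = yaml.safe_load(frame.body)
        print('%s for %s-%s' % (event.get(':action'),
                                event.get(':app_name'),
                                event.get(':namespace')))
        print(frame.body)

    def on_error(self, frame):
        print('broker error: %s' % frame.body)

conn = stomp.Connection([('activemq.example.com', 61613)])
conn.set_listener('', RoutingEventListener())
conn.connect('routinginfo', 'routinginfopasswd', wait=True)
conn.subscribe(destination='/topic/routinginfo', id=1, ack='auto')

while True:        # messages arrive on a listener thread; just keep the process alive
    time.sleep(60)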

I’m not going to go into detail about the routing events. You’ll become plenty familiar if you implement this. You can see some good example events in this description, but always check what is actually coming out of the SPI as there may have been updates (generally additions) since. The general outline of the events can be seen in the Sample Routing Plug-in Notifications table from the Deployment Guide or in the example implementation of the SPI. Remember you can always write your own plugin to give you information in the desired format.

Consuming SPI events for app creation

The routing SPI publishes events for all apps, not just HA ones, and you might want to do something with other apps (e.g. implement blue/green deployments), but the main focus of a routing layer is to implement HA apps. So let’s look at how you do that. I’m assuming YAML entries from the sample activemq plugin below — if you use a different plugin, similar concepts should apply just with different details.

First when an app is created you’re going to get an app creation event:

$ rhc app create phpha php-5.4 -s

:action: :create_application
:app_name: phpha
:namespace: demo
:scalable: true
:ha: false

This is pretty much just a placeholder for the application name. Note that it is not marked as HA. There is some work coming to make apps HA at creation, but currently you just get a scaled app and have to make it HA after it’s created. This plugin doesn’t publish the app UUID, which is probably what I would key on if I were writing a plugin now. Instead, you’ll identify the application in any future events by the combination of app_name and namespace.

Once an actual gear is deployed, you’ll get two (or more) :add_public_endpoint actions, one for haproxy’s load_balancer type and one for the cartridge web_framework type (and possibly other endpoints depending on cartridge).

:action: :add_public_endpoint
:app_name: phpha
:namespace: demo
:gear_id: 542b72abafec2de3aa000009
:public_port_name: haproxy-1.4
:public_address: 172.16.4.200
:public_port: 50847
:protocols:
- http
- ws
:types:
- load_balancer
:mappings:
- frontend: ''
  backend: ''
- frontend: /health
  backend: /configuration/health

You might expect that when you make the app HA, there is some kind of event specific to being made HA. There isn’t at this time. You just get another load_balancer endpoint creation event for the same app, and you can infer that it’s now HA. For simplicity of implementation, it’s probably just best to treat all scaled apps as if they were already HA and define routing configuration for them.

Decision point 1: The routing layer can either direct requests only to the load_balancer endpoints and let them forward traffic all to the other gears, or it can actually just send traffic directly to all web_framework endpoints. The recommendation is to send traffic to the load_balancer endpoints, for a few reasons:

  1. This allows haproxy to monitor traffic in order to auto-scale.
  2. It will mean less frequent changes to your routing configuration (important when changes mean restarts).
  3. It will mean fewer entries in your routing configuration, which could grow quite large and become a performance concern.

However, direct routing is viable, and allows an implementation of HA without actually going to the trouble of making apps HA. You would just have to set up a DNS entry for the app that points at the routing layer and use that. You’d also have to handle scaling events manually or from the routing layer somehow (or even customize the HAproxy autoscale algorithm to use stats from the routing layer).
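To make decision point 1 concrete, here is a sketch of what a routing-layer entry for this app could look like using httpd and mod_proxy_balancer, sending traffic only to the load_balancer endpoints reported by the SPI. The ha- DNS name, the second endpoint’s address, and its port are assumptions based on the events above; a real config would also need the health checks, TLS, and header handling discussed below.

# Routing-layer entry for the HA app (sketch). Assumes a second load_balancer
# endpoint was reported at 172.16.4.201:35871 when the app was made HA.
<Proxy balancer://phpha-demo>
    BalancerMember http://172.16.4.200:50847 route=gear1
    BalancerMember http://172.16.4.201:35871 route=gear2
</Proxy>

<VirtualHost *:80>
    ServerName ha-phpha-demo.openshift.example.com
    ProxyPreserveHost On
    ProxyPass / balancer://phpha-demo/
    ProxyPassReverse / balancer://phpha-demo/
</VirtualHost>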

Decision point 2: The expectation communicated in the PEP (and how this was intended to be implemented) is that requests will be directed to the external proxy port on the node (in the example above, that would be http://172.16.4.200:50847/). There is one problem with doing this – idling. Idler stats are gathered only on requests that go through the node frontend proxy, so if we direct all traffic to the port proxy, the haproxy gear(s) will eventually idle and the app will be unavailable even though it’s handling lots of traffic. (Fun fact: secondary gears are exempt from idling – doesn’t help, unless the routing layer proxies directly to them.) So, how do we prevent idling? Here are a few options:

  1. Don’t enable the idler on nodes where you expect to have HA apps. This assumes you can set aside nodes for (presumably production) HA apps that you never want to idle. Definitely the simplest option.
  2. Implement health checks that actually go to the node frontend such that HA apps will never idle. You’ll need the gear name, which is slightly tricky – since the above endpoint is on the first gear, it will be accessible by a request for http://phpha-demo.cloud_domain/health to the node at 172.16.4.200. When the next gear comes in, you’ll have to recognize that it’s not the head gear and send the health check to e.g. http://542b72abafec2de3aa000009-demo.cloud_domain/health (see the sketch after this list).
  3. Flout the PEP and send actual traffic to the node frontend. This would be the best of all worlds since the idler would work as intended without any special tricks, but there are some caveats I’ll discuss later.
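To illustrate option 2, the health checks boil down to requests like the following (a sketch using curl; the cloud domain, the second node’s address, and the use of curl itself are assumptions – a real routing layer would issue these from its own health-check facility):

# Head gear: the app's own DNS name routes through the node frontend on its node.
curl -s -o /dev/null -H 'Host: phpha-demo.openshift.example.com' http://172.16.4.200/health
# Secondary HA gear: use the gear-specific DNS name against that gear's node.
curl -s -o /dev/null -H 'Host: 542b72abafec2de3aa000009-demo.openshift.example.com' http://172.16.4.201/health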

Terminating SSL (TLS)

When hosting multiple applications behind a proxy, it is basically necessary to terminate SSL at the proxy. (Despite SSL having been essentially replaced by TLS at this point, we’re probably going to call it SSL for the lifetime of the internet.) This has to do with the way routing works under HTTPS: during the initialization of the TLS connection, the client has to indicate the name it wants (in our case the application’s DNS name) in the SNI extension to the TLS “hello”. The proxy can’t behave as a dumb layer 4 proxy (just forwarding packets unexamined to another TLS endpoint) because it has to examine the stream at the protocol level to determine where to send it. Since the SNI information is (from my reading of the RFC) volunteered by the client at the start of the connection, it does seem like it would be possible for a proxy to examine the protocol and then act like a layer 4 proxy based on that examination, and indeed I think F5 LBs have this capability. But it does not seem to be a standard proxy/LB capability, and certainly not for existing open source implementations (nginx, haproxy, httpd – someone correct me if I’m missing something here). So, to be inclusive, we are left with proxies that operate at layer 7, meaning they perform the TLS negotiation from the client’s perspective.

Edit 2014-10-08: layer 4 routing based on SNI is probably more available than I thought. I should have realized HAproxy 1.5 can do it, given OpenShift’s SNI proxy is using that capability. It’s hard to find details on though. If most candidate routing layer technologies have this ability, then it could simplify a number of the issues around TLS because terminating TLS could be deferred to the node.

If that was all Greek to you, the important point to extract is that a reverse proxy has to have all the information to handle TLS connections, meaning the appropriate key and certificate for any requested application name. This is the same information used at the node frontend proxy; indeed, the routing layer will likely need to reuse the same *.cloud_domain wildcard certificate and key that is shared on all OpenShift nodes, and it needs to be made aware of aliases and their custom certificates so that it can properly terminate requests for them. (If OpenShift supported specifying x509 authentication via client certificates [which BTW could be implemented without large structural changes], the necessary parameters would also need to be published to the routing layer in addition to the node frontend proxy.)

We assume that a wildcard certificate covers the standard HA DNS name created for HA apps (e.g. in this case ha-phpha-demo.cloud_domain, depending of course on configuration; notice that no event announces this name — it is implied when an app is HA). That leaves aliases which have their own custom certs needing to be understood at the routing layer:

$ rhc alias add foo.example.com -a phpha
:action: :add_alias
:app_name: phpha
:namespace: demo
:alias: foo.example.com

$ rhc alias update-cert foo.example.com -a phpha --certificate certfile --private-key keyfile
:action: :add_ssl
:app_name: phpha
:namespace: demo
:alias: foo.example.com
:ssl: [...]
:private_key: [...]
:pass_phrase:

Aliases will of course need their own routing configuration entries regardless of HTTP/S, and something will have to create their DNS entries as CNAMEs to the ha- application DNS record.
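To make the TLS handling concrete, a routing layer built on httpd/mod_ssl (2.2.12 or later for SNI support) might carry one vhost with the shared wildcard certificate and one vhost per alias using the certificate delivered in the :add_ssl event. This is only a sketch – the certificate paths are assumptions, and the balancer name reuses the earlier example.

# Wildcard vhost: terminates TLS for ha-phpha-demo.openshift.example.com and any
# other name under the cloud domain (certificate paths are assumptions).
<VirtualHost *:443>
    ServerName ha-phpha-demo.openshift.example.com
    ServerAlias *.openshift.example.com
    SSLEngine on
    SSLCertificateFile    /etc/pki/tls/certs/cloud-domain-wildcard.crt
    SSLCertificateKeyFile /etc/pki/tls/private/cloud-domain-wildcard.key
    ProxyPreserveHost On
    ProxyPass / balancer://phpha-demo/
    ProxyPassReverse / balancer://phpha-demo/
</VirtualHost>

# Per-alias vhost: uses the cert and key captured from the :add_ssl event.
<VirtualHost *:443>
    ServerName foo.example.com
    SSLEngine on
    SSLCertificateFile    /etc/httpd/routing-certs/foo.example.com.crt
    SSLCertificateKeyFile /etc/httpd/routing-certs/foo.example.com.key
    ProxyPreserveHost On
    ProxyPass / balancer://phpha-demo/
    ProxyPassReverse / balancer://phpha-demo/
</VirtualHost>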

A security-minded administrator would likely desire to encrypt connections from the routing layer back to the gears. Two methods of doing this present themselves:

  1. Send an HTTPS request back to the gear’s port proxy. This won’t work with any of the existing cartridges OpenShift provides (including the haproxy-1.4 LB cartridge), because none of them expose an HTTPS-aware endpoint. It may be possible to change this, but it would be a great deal of work and is not likely to happen in the lifetime of the current architecture.
  2. Send an HTTPS request back to the node frontend proxy, which does handle HTTPS. This actually works fine, if the app is being accessed via an alias – more about this caveat later.

Sending the right HTTP headers

It is critically important in any reverse-proxy situation to preserve the client’s HTTP request headers indicating the URL at which it is accessing an application. This allows the application to build self-referencing URLs accurately. This can be a little complicated in a reverse-proxy situation, because the same HTTP headers may be used to route requests to the right application. Let’s think a little bit about how this needs to work. Here’s an example HTTP request:

POST /app/login.php HTTP/1.1
Host: phpha-demo.openshift.example.com
[...]

If this request comes into the node frontend proxy, it looks at the Host header, and assuming that it’s a known application, forwards the request to the correct gear on that node. It’s also possible (although OpenShift doesn’t do this, but a routing layer might) to use the path (/app/login.php here) to route to different apps, e.g. requests for /app1/ might go to a different place than /app2/.

Now, when an application responds, it will often create response headers (e.g. a redirect with a Location: header) as well as content based on the request headers that are intended to link to itself relative to what the client requested. The client could be accessing the application by a number of paths – for instance, our HA app above should be reachable either as phpha-demo.openshift.example.com or as ha-phpha-demo.openshift.example.com (default HA config). We would not want a client that requests the ha- address to receive a link to the non-ha- address, which may not even resolve for it, and in any case would not be HA. The application, in order to be flexible, should not make a priori assumptions about how it will be addressed, so every application framework of any note provides methods for creating redirects and content links based on the request headers. Thus, as stated above, it’s critical for these headers to come in with an accurate representation of what the client requested, meaning:

  1. The same path (URI) the client requested
  2. The same host the client requested
  3. The same protocol the client requested

(The last is implemented via the “X-Forwarded-Proto: https” header for secure connections. Interestingly, a recent RFC (RFC 7239, which defines the Forwarded header) specifies a new header for communicating items 2 and 3, but not 1. This will be a useful alternative as it becomes adopted by proxies and web frameworks.)

Most reverse proxy software should be well aware of this requirement and provide options such that when the request is proxied, the headers are preserved (for example, the ProxyPreserveHost directive in httpd). This works perfectly with the HA routing layer scheme proposed in the PEP, where the proxied request goes directly to an application gear. The haproxy cartridge does not need to route based on Host: header (although it does route requests based on a cookie it sets for sticky sessions), so the request can come in for any name at all and it’s simply forwarded as-is for the application to use.
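In httpd terms, preserving all three items when proxying straight to a load_balancer endpoint comes down to a few directives (a sketch; mod_headers and mod_proxy are assumed to be loaded, and the endpoint address is the one from the event above):

# Inside the routing layer's HTTPS vhost for the app:
ProxyPreserveHost On                          # item 2: pass the client's Host: header through unchanged
RequestHeader set X-Forwarded-Proto "https"   # item 3: tell the app the client connected over TLS
ProxyPass        / http://172.16.4.200:50847/ # item 1: map the path 1:1, no URI rewriting
ProxyPassReverse / http://172.16.4.200:50847/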

The complication arises in situations where, for example, you would like the routing layer to forward requests to the node frontend proxy (in order to use HTTPS, or to prevent idling). The node frontend does care about the Host header because it’s used for routing, so the requested host name has to be one that the OpenShift node knows in relation to the desired gear. It might be tempting to think that you can just rewrite the request to use the gear’s “normal” name (e.g. phpha-demo.cloud_domain) but this would be a mistake because the application would respond with headers and links based on this name. Reverse proxies often offer options for rewriting the headers and even contents of responses in an attempt to fix this, but they cannot do so accurately for all situations (example: links embedded in JavaScript properties) so this should not be attempted. (Side note: the same prohibition applies to rewriting the URI path while proxying. Rewriting example.com/app/… to app.internal.example.com/… is only safe for sites that provide static content and all-relative links.)

What was that caveat?

I mentioned a caveat both on defeating the idler and proxying HTTPS connections to the node frontend, and it’s related to the section above. You can absolutely forward an HA request to the node frontend if the request is for a configured alias of the application, because the node frontend knows how to route aliases (so you don’t have to rewrite the Host: header which, as just discussed, is a terrible idea). The caveat is that, strangely, OpenShift does not create an alias for the ha- DNS entry automatically assigned to an HA app, so manual definition of an alias is currently required per-app for implementation of this scheme. I have created a feature request to instate the ha- DNS entry as an alias, and being hopefully easy to implement, this may soon remove the caveat behind this approach to routing layer implementation.

Things go away too

I probably shouldn’t even have to mention this, but: apps, endpoints, aliases, and certificates can all go away, too. Make sure that you process these events and don’t leave any debris lying around in your routing layer confs. Gears can also be moved from one host to another, which is an easy use case to forget about.

And finally, speaking of going away, the example routing plugin initially provided :add_gear and :remove_gear events, and for backwards compatibility it still does (duplicating the endpoint events). These events are deprecated and should disappear soon.


OpenShift, Apache, and severe hair removal

I just solved a problem that stumped me for a week. It was what we in the business call “a doozy”. I’ll share here, mostly to vent, but also in case the process I went through might help someone else.

The problem: a non-starter

I’ve been working on packaging all of OpenShift Enterprise into a VM image with a desktop for convenient hacking on a personal/throwaway environment. It’s about 3 GB (compressed) and takes an hour or so to build. It has to be built on our official build servers using signed RPMs via an unattended kickstart. There have been a few unexpected challenges, and this was basically the cherry on top.

The problem was that the first time the VM booted… openshift-broker and openshift-console failed to start. Everything else worked, including the host-level httpd. Those two (which are httpd-based) didn’t start, and they didn’t leave any logs to indicate why. They didn’t even get to the point of making log files.

And the best part? It only happened the first time. If you started the services manually, they worked. If you simply rebooted after starting the first time… they worked. So basically, the customer’s first impression would be that it was hosed… even though it magically starts working after a reboot, the damage is done. I would look like an idiot trying to release with that little caveat in place.

Can you recreate it? Conveniently?

WTF causes that? And how the heck do I troubleshoot? For a while, the best I could think of was starting the VM up in runlevel 1 (catch GRUB at boot and add the “1” parameter), patching in diagnostics, and then proceeding to init. After one run, if I don’t have what I need… recreate the VM and try again. So painful I literally just avoided it and busied myself with other things.

The first breakthrough was when I tried to test the kinds of things that happen only at the first boot after install. There are a number, potentially – services coming up for the first time and so forth. Another big one is network initialization on a new network and device. I couldn’t see how that would affect these services (they are httpd binding only to localhost), but I did experiment with changing the VM’s network device after booting (remove the udev rule, shut down, remove the device, add another), and found that indeed, it caused the failure on the next boot reliably.

So it had to be something to do with network initialization.

What’s the actual error?

Being able to cause it at will on reboot meant much easier iteration of diagnostics. First I tried just adding a call to ifconfig in the openshift-broker init script. I couldn’t see anything in the console output, so I assumed it was just being suppressed somehow.

Next I tried to zero in on the actual failure. When invoked via init script, the “daemon” function apparently swallows console output from the httpd command, but it provides an opportunity to add options to the command invocation, so I looked up httpd command line options and found two that looked helpful: “-e debug -E /var/log/httpd_console”:

-e level
Sets the LogLevel to level during server startup. This is useful for temporarily increasing the verbosity of the error messages to find problems during startup.

-E file
Send error messages during server startup to file.

This let me bump up the logging level at startup and capture the server startup messages. (Actually probably only the latter matters. Another one to keep in mind is -X which starts it as a single process/thread only – helpful for using strace to follow it. Not helpful here though.)
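Concretely, the start command in the init script ends up looking something like this (a sketch – $OPTIONS stands in for whatever flags the service already passes; only the two added flags matter here):

httpd $OPTIONS -e debug -E /var/log/httpd_console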

This let me see the actual failure:

 [crit] (EAI 9)Address family for hostname not supported: alloc_listener: failed to set up sockaddr for 127.0.0.1

Apparently the httpd startup process tries to bind to network interfaces before it even opens logs, and this is what you get when binding to localhost fails.

What’s the fix?

An error is great, but searching the mighty Google for it was not very encouraging. There were a number of reports of the problem, but precious little about what actually caused it. The closest I could find was this httpd bug report:

Bug 52709 – Apache can’t bind to 127.0.0.1 if eth0 has only IPv6

[…]

This bug also affects httpd function ap_get_local_host() as described in http://bugs.debian.org/629899
Httpd will then fail to get the fully qualified domain name.

This occurs when apache starts before dhclient finished its job.

Here was something linking the failure to bind to 127.0.0.1 to incomplete network initialization by dhclient. Suddenly the fact that my test “ifconfig” at service start had no output did not seem like a fluke. When I started the service manually, ifconfig certainly had output.

So, here’s the part I don’t really claim to understand. Apparently there’s a period after NetworkManager does its thing and other services are starting where, in some sense at least, the network isn’t really available. At least not in a way that ifconfig can detect, and not in a way that allows httpd to explicitly bind to localhost.

As a workaround, I added a shim service that would wait for network initialization to actually complete before trying to start the broker and console. I could add a wait into those service scripts directly, but I didn’t want to hack up the official files that way. So I created this very simple service (openshift-await-eth0) that literally just runs “ifconfig eth0” and waits up to 60 seconds for it to include a line beginning “inet addr:”. Notably, if I have it start right after NetworkManager, it finishes immediately, so it seems the network is up at that point, but goes away just in time for openshift-broker and openshift-console to trip over it. So I have the service run right before openshift-broker, and now my first boot successfully starts everything.
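For reference, here is a minimal sketch of what such a shim service might look like as a SysV init script (the service name and chkconfig priorities are assumptions; the real thing just needs to start after NetworkManager and before openshift-broker):

#!/bin/bash
# openshift-await-eth0: wait for eth0 to actually have an IPv4 address.
# chkconfig: 345 84 16
# description: Block until eth0 reports an inet addr, so httpd-based services can bind.
case "$1" in
  start)
    for i in $(seq 1 60); do
      ifconfig eth0 | grep -q 'inet addr:' && exit 0
      sleep 1
    done
    echo "eth0 did not get an IPv4 address within 60 seconds" >&2
    exit 1
    ;;
  stop|status|restart)
    exit 0
    ;;
esac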

Since this is probably the only place we’ll ever deploy OpenShift that has dhclient trying to process a new IP at just the time the broker and console are being started, I don’t know that it will ever be relevant to anyone else. But who knows. Maybe someone can further enlighten me on what’s going on here and a better way to avoid it.

OpenShift with dynamic host IPs?

From the time we began packaging OpenShift Enterprise, we made a decision not to support dynamic changes to host IP addresses. This might seem a little odd since we do demonstrate installation with the assumption that DHCP is in use; we just require it to be used with addresses pinned statically to host names. It’s not that it’s impossible to work with dynamic re-leasing; it’s just that it’s an unnecessary complication and potentially a source of tricky problems.

However, I’ve crawled all over OpenShift configuration for the last few months, and I can say with a fair amount of confidence that it’s certainly possible to handle dynamic changes to host IP, as long as those changes are tracked by DNS with static hostnames.

But there are, of course, a number of caveats.

First off, it should be obvious that DNS must be integrated with DHCP such that hostnames never change and always resolve correctly to the same actual host. Then, if configuration everywhere uses hostnames, it should in theory be able to survive IP changes.

The most obvious exception is the IP(s) of the nameserver(s) themselves. In /etc/resolv.conf clearly the IP must be used, as it’s the source for name resolution, so it can’t bootstrap itself. However, in the unlikely event that nameservers need to re-IP, DHCP could make the transition with a bit of work. You could not use our basic dhclient configuration that statically prepends the installation nameserver IP – instead the DHCP server would need to supply all nameserver definitions, and there would be some complications around the transition since not all hosts would renew leases at the same time. Really, this would probably be the province of a configuration management system. I trust that those who need to do such a thing have thought about it much more than I have.
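(For reference, the basic dhclient configuration mentioned above is just a prepend line like the following – the file name follows the RHEL 6 per-interface convention and the IP is an example value:)

# /etc/dhcp/dhclient-eth0.conf
prepend domain-name-servers 10.0.0.10;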

Then there’s the concern of the dynamic DNS server that OpenShift publishes app hostnames to. Well, no reason that can’t be a hostname as well, as long as the nameserver supplied by DHCP/dhclient knows how to resolve it. Have I mentioned that you should probably implement your host DNS separately from the dynamic app DNS? No reason they need to use the same server, and probably lots of reasons not to.

OK, maybe you’ve looked through /etc/openshift/node.conf and noticed the PUBLIC_IP setting in there. What about that? Well, I’ve tracked that through the code base and as far as I can tell, the only thing it is ever used for is to create a log entry when gears are created. In other words, it has no functional significance. It may have in the past – as I understand it, apps used to be created with A records rather than as CNAMEs to the node hosts. But for now, it’s a red herring.

Something else to worry about are iptables filters. In our instructions we never demonstrate filters for specific addresses, but conscientious sysadmins would likely limit connections to the hosts that are expected to need them in many cases. And they would be unlikely to define them using hostnames. So either don’t do that… or have a plan for handling IP changes.

One more caveat: what do we mean by dynamic IP changes? How dynamic?

If we’re talking about the kind of IP change where you shut down the host (perhaps to migrate its storage) and when it is booted again, it has a new IP, then that use case should be handled pretty well (again, as long as all configured host references use the hostname). This is the sort of thing you would run into in Amazon EC2 where hosts keep their IP as long as they stay up, but when shut down generally get a new IP. All the services on the host are started with the proper IP in use.

It’s a little more tricky to support IP changes while the host is operating. Any services that bind specifically to the external IP address would need restarting. I’ve had a look while writing this, though, and this is a lot less common than I expected. As far as I can see, only one node host service does that: haproxy (which is used by openshift-port-proxy to proxy specific ports from the external interface back to gear ports). The httpd proxy listens to all interfaces so it’s exempt, and individual gears listen on internal interfaces only. On the broker and supporting hosts, ActiveMQ and MongoDB either listen locally or to all interfaces. The nameserver, despite being configured to listen to “all”, appears to bind to specific IPs, so it looks like it would need a restart. You could probably configure dhclient to do this when a lease changes (with appropriate SELinux policy changes to permit it). But you can see how easy this would be to get wrong.
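As a rough sketch of that last idea, a dhclient hook could compare the old and new lease addresses and bounce the affected services. The hook path and variable names below follow the RHEL 6 dhclient-script conventions, which you would want to verify for your system, and as noted SELinux policy would need to allow the restarts.

# /etc/dhcp/dhclient-exit-hooks (sketch; sourced by dhclient-script after each lease event)
if [ "$reason" = "RENEW" ] || [ "$reason" = "REBIND" ] || [ "$reason" = "BOUND" ]; then
    if [ -n "$old_ip_address" ] && [ "$old_ip_address" != "$new_ip_address" ]; then
        service haproxy restart    # openshift-port-proxy backend on a node host
        service named restart      # nameserver binds to specific IPs
    fi
fi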

Hopefully this brief exploration of the issues involved demonstrates why we’re going to stick with the current (non-)support policy for the time being. But I also expect some of you out there will try out OpenShift in a dynamic IP environment, and if so I hope you’ll let me know what you run into.

Clustering Tomcat (part III): the HTTPS connector

Refer to my earlier posts on the subject for background. Here are further explorations, not having much to do with clustering as it turns out, but everything to do with proxying.

In a number of situations you might want to set up encrypted communication between the proxy and backend. For this, Tomcat supplies an HTTPS connector (as far as I know, the only way to encrypt requests to Tomcat).

Connector setup

Setup is actually fairly simple with just a few surprises. Mainly, the protocol setting on the connector remains “HTTP/1.1” not some version of “HTTPS” – the protocol being the language spoken, and SSL encryption being a layer on top of that which you specify with SSLProtocol. Basically, HTTPS connectors look like HTTP connectors with some extra SSL properties specifying the encryption:

<Connector executor="tomcatThreadPool"
 port="${https.port}"
 protocol="HTTP/1.1"
 connectionTimeout="20000"
 acceptCount="100"
 maxKeepAliveRequests="15"
 SSLEnabled="true"
 SSLProtocol="TLS"
 scheme="https"
 secure="true"
 keystorePass="changeit"
 clientAuth="false"
/>

If I really wanted to be thorough I would set clientAuth="true" and set up the proxy with a client certificate that this server’s truststore will accept, thereby guaranteeing only the proxy can even make a request. But not right now.
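For reference, the proxy side of such a connection is mostly a matter of turning on mod_ssl’s proxy support – a sketch, with a placeholder HTTPS port and frontend path, and the client-certificate line included only for the optional clientAuth case above:

# httpd side of an encrypted connection to the Tomcat HTTPS connector
SSLProxyEngine On
# Optional, for clientAuth="true": present a client certificate to Tomcat.
#SSLProxyMachineCertificateFile /etc/httpd/conf/proxy-client.pem
ProxyPass        /https/nocluster/ https://vm-centos-cluster-ers.sosiouxme.lan:8143/
ProxyPassReverse /https/nocluster/ https://vm-centos-cluster-ers.sosiouxme.lan:8143/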

Note the “scheme” and “secure” properties here. These don’t actually affect the connection; instead, they specify what Tomcat should answer when a servlet asks questions about its request. Specifically, request.getScheme() is likely to be used to create self-referential URLs, while request.isSecure() is used to make several security-related decisions. Defaults are for a non-secure HTTP connector but they can be set to whatever makes sense in context – in fact, AFAICS the scheme can be set to anything and the connector will still serve data fine. Read on for clarity on uses of these properties.

Fixing Apache httpd reverse proxy redirect rewrites

My ProxyPassReverse statement was adding an extra "/" in its Location: header rewrites. I noticed this when requesting a URL like "/petcare". Tomcat would redirect this appropriately with a 302 and this header:

Location: http://vm-centos-cluster-ers.sosiouxme.lan:8100/petcare/

But when it came back through the proxy as “/http/nocluster/petcare”, it ended up rewritten as:

Location: http://vm-centos-cluster-ers.sosiouxme.lan/http/nocluster//petcare/

It seemed like a small thing – and, after all, it still worked due to URL canonicalization – but I wanted to understand why this happened and make it right. Here’s a typical configuration section initially:

# use mod_proxy_http to connect to non-replicated tc instances
<Proxy balancer://http-nocluster/>
BalancerMember http://vm-centos-cluster-ers.sosiouxme.lan:8100 route=ers
BalancerMember http://vm-centos-cluster-tcs.sosiouxme.lan:8100 route=tcs
ProxySet stickysession=JSESSIONID nofailover=On
</Proxy>

<Location /http/nocluster/>
ProxyPass balancer://http-nocluster/
ProxyPassReverse balancer://http-nocluster/
ProxyPassReverseCookiePath / /http/nocluster/
ProxyHTMLURLMap / /http/nocluster/
</Location>

I figured it was just a matter of juggling where “/” appeared at the end of various things. I cranked up the logging to “debug” and tried a few changes one by one.

  • Remove “/” from end of ProxyPass. This gave me a lovely 500 error and log messages:

ProxyPass balancer://http-nocluster

[debug] mod_proxy_balancer.c(46): proxy: BALANCER: canonicalising URL //http-noclusterpetcare

[debug] proxy_util.c(1525): [client 172.31.1.52] proxy: *: found reverse proxy worker for balancer://http-noclusterpetcare/

[…]

[warn] proxy: No protocol handler was valid for the URL /http/nocluster/petcare. If you are using a DSO version of mod_proxy, make sure the proxy submodules are included in the configuration using LoadModule.

  • Remove “/” from end of ProxyPassReverse. No apparent effect.
  • Remove “/” from end of <Proxy balancer://http-nocluster/> – no effect.
  • Remove “/” from end of <Location /http/nocluster/> – now we were getting somewhere! The Location header was rewritten correctly; only problem is that after rewriting, it was passed through to Tomcat as //petcare/ and failing.
  • Remove “/” from the end of everything! This seems to be what works best – everything passes through correctly and Location is rewritten correctly. So the configuration I ended up with is:

# use mod_proxy_http to connect to non-replicated tc instances
<Proxy balancer://http-nocluster>
BalancerMember http://vm-centos-cluster-ers.sosiouxme.lan:8100 route=ers
BalancerMember http://vm-centos-cluster-tcs.sosiouxme.lan:8100 route=tcs
ProxySet stickysession=JSESSIONID nofailover=On
</Proxy>

<Location /http/nocluster>
ProxyPass balancer://http-nocluster
ProxyPassReverse balancer://http-nocluster
ProxyPassReverseCookiePath / /http/nocluster/
ProxyHTMLURLMap / /http/nocluster/
</Location>

This was pretty much the only thing I tried that worked properly. Now, with mod_proxy_ajp it was a different story. The configuration looked pretty similar (because I’d done a cut/paste/edit):

# use mod_proxy_ajp to connect to non-replicated tc instances
<Proxy balancer://ajp-nocluster>
BalancerMember ajp://vm-centos-cluster-tcs.sosiouxme.lan:8109 route=tcs
BalancerMember ajp://vm-centos-cluster-ers.sosiouxme.lan:8109 route=ers
ProxySet stickysession=JSESSIONID nofailover=On
</Proxy>

<Location /ajp/nocluster/>
ProxyPass balancer://ajp-nocluster/
ProxyPassReverse balancer://ajp-nocluster/
ProxyPassReverseCookiePath / /ajp/nocluster/
ProxyHTMLURLMap / /ajp/nocluster/
</Location>

Thing is, my ProxyPassReverse there wasn’t doing anything at all. This is a little-known fact about how mod_proxy_ajp and ProxyPassReverse interact: an AJP connection doesn’t get a new http request to the backend; rather the HTTP headers from the request to the proxy are passed to the backend, and typically presented by Tomcat to the app as the request headers. So when the app (or Tomcat) forms a redirect (Location: header), it is relative to the host and port on the proxy, not the backend.

Meanwhile, ProxyPassReverse is very literal-minded. It only matches exactly what you put in the statement. So it’s a common error to have config like this:

ProxyPass / ajp://backend.example.com:8009/

ProxyPassReverse / ajp://backend.example.com:8009/

The ProxyPassReverse there isn’t doing anything at all, because it’s never going to see a “Location: ajp://backend.example.com:8009/” header from the backend – instead it will see URLs based on the front end. Most people won’t notice this because most people are using the same paths on front and backend, so nothing needed to be rewritten anyway. I had to be different and remap paths so I noticed when they weren’t rewritten.

The exception to the literal-mindedness of PPR is the balancer:// faux protocol. When you have a bunch of http backends in a balancer, you would normally need to rewrite headers corresponding to any of them – so, a PPR directive for each. This is pretty tedious. Starting in (I think) httpd 2.2.9 you could do a single PPR directive with the balancer:// notation as above and get this for free. I was hoping it would be smarter about AJP, but it’s not. That’s not such a big deal, though – since the host and port are always that of the front-end, I only need a single PPR for the rewrite.

<Location /ajp/nocluster/>
ProxyPass balancer://ajp-nocluster/

# note: http! This is the proxy server URL
ProxyPassReverse http://vm-centos-cluster-ers.sosiouxme.lan/
ProxyPassReverseCookiePath / /ajp/nocluster/
ProxyHTMLURLMap / /ajp/nocluster/
</Location>

And I didn’t even have to futz with the slashes, it just worked with them in.

Clustering Tomcat

I’ve been meaning for some time to set up Tomcat clustering in several different ways just to see how it works. I finally got a chance today. There are guides and other bits of information elsewhere on how to do all this. I’m not gonna tell you how, sorry; the point of this post is to document my problems and how I overcame them.

A word about setup

My front end load balancer is Apache httpd 2.2.15 (from SpringSource ERS 4.0.2) using mod_proxy_http, mod_proxy_ajp, mod_jk, and (just to keep things interesting) mod_rewrite as connectors to the backend Tomcat 6.0.28 instances. I wanted to try out 2-node Tomcat clusters (one side of each cluster ERS Tomcat, the other side tc Server 2.0.3) without any session replication (so, sticky sessions and you get a new session if there’s a failure) and with session replication via the standard Delta manager (which replicates all sessions to all nodes) and the Backup manager (which replicates all sessions to a single reference node, the “backup” for each app).  Basically the idea is to test out every conceivable way to put together a cluster with standard SpringSource technologies.

The first trick was mapping all of these into the  URL space for my httpd proxy. I wanted to put each setup behind its own URL /<connector>/<cluster> so e.g /http/backup and /ajp/delta. This is typically not done and for good reason; mod_proxy will let you do the mapping and will even clean up redirects and cookie paths from the backend, but to take care of self-referential links in the backend apps you actually have to rewrite the content that comes back; for that I installed mod_proxy_html, a non-standard module for doing such things. The reason it’s a non-standard module is that this approach is fraught with danger. But given that I mostly don’t care about how well the apps work in my demo cluster, I thought it’d be a great time to put it through its paces.

For this reason, most people make sure the URLs on the front-end and back-end are the same; and in fact, as far as I could tell, there was no way at all to make mod_jk do any mapping, so I’m setting it up as a special case – more later if it’s relevant. The best way to do this demo would probably be to set up virtual hosts on the proxy for each scenario and not require any URI mapping; if I run into enough problems I’ll probably fall back to that.

Problem 1: the AJP connector

I got things running without session replication fairly easily. My first big error with session replication was actually something else, but at first I thought it might be related to this warning in the Tomcat log:

Aug 13, 2010 9:43:00 PM org.apache.catalina.core.AprLifecycleListener init

INFO: The APR based Apache Tomcat Native library which allows optimal performance in production environments was not found on the java.library.path: /usr/java/jdk1.6.0_21/jre/lib/i386/server:/usr/java/jdk1.6.0_21/jre/lib/i386:/usr/java/jdk1.6.0_21/jre/../lib/i386:/usr/java/packages/lib/i386:/lib:/usr/lib
Aug 13, 2010 9:43:00 PM org.apache.catalina.startup.ConnectorCreateRule _setExecutor
WARNING: Connector [org.apache.catalina.connector.Connector@1d6f122] does not support external executors. Method setExecutor(java.util.concurrent.Executor) not found.

These actually turn out to be related:

<Listener className="org.apache.catalina.core.AprLifecycleListener" SSLEngine="on" />

<Connector executor="tomcatThreadPool"
port="8209"
protocol="AJP/1.3"
emptySessionPath="true"
/>

I don’t really know why the APR isn’t working properly, but a little searching turned up some obscure facts: if the APR isn’t loaded, then for the AJP connector Tomcat makes a default choice of implementation that doesn’t use the executor thread pool. So you have to explicitly set the class to use like this:

<Connector executor="tomcatThreadPool"
port="8209"
protocol="org.apache.coyote.ajp.AjpProtocol"
emptySessionPath="true"
/>

Nice, eh? OK. But that was just an annoyance.

Problem 2: MBeans blowup

The real thing holding me back was this error:

Aug 13, 2010 9:52:16 PM org.apache.catalina.mbeans.ServerLifecycleListener createMBeans
SEVERE: createMBeans: Throwable
java.lang.NullPointerException
at org.apache.catalina.mbeans.MBeanUtils.createObjectName(MBeanUtils.java:1086)
at org.apache.catalina.mbeans.MBeanUtils.createMBean(MBeanUtils.java:504)

Now what the heck was that all about? Well, I found no love on Google. But I did eventually guess what the problem was. This probably underscores a good methodology: when working on configs, add one thing at a time and test it out before going on. If only Tomcat had a “configtest” like httpd – it takes forever to “try it out”.

In case anyone else runs into this, I’ll tell you what it was. The Cluster Howto made it pretty clear that you need to set your webapps context to be distributable for clustering to work. It wasn’t clear to me where to put that, but I didn’t want to create a context descriptor for each app on each instance. I knew you could put a <Context> element in server.xml so that’s right where I put it, right inside the <Host> element:

<Context distributable="true" />

Well, that turns out to be a bad idea. It causes the error above. So don’t do that. For the products I’m using, there’s a single context.xml that applies to all apps on the server; that’s where you want to put the distributable attribute.
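In other words, the distributable attribute belongs in the shared conf/context.xml, something like this:

<!-- conf/context.xml: applies to every webapp deployed on this instance -->
<Context distributable="true">
    <WatchedResource>WEB-INF/web.xml</WatchedResource>
</Context>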

Cluster membership – static

My next task was to get the cluster members in each case to recognize each other and replicate sessions. Although all the examples use multicast to do this, I wanted to try setting up static membership, because I didn’t want to look up how to enable multicast just yet. And it should be simpler, right? Well, I had a heck of a time finding this, but it looks like the StaticMembershipInterceptor is the path.

My interceptor looks like this:

<Interceptor className="org.apache.catalina.tribes.group.interceptors.StaticMembershipInterceptor">
<Member className="org.apache.catalina.tribes.membership.StaticMember"
port="8210" securePort="-1"
host="vm-centos-cluster-tcs.sosiouxme.lan"
domain="delta-cluster"
uniqueId="{0,1,2,3,4,5,6,7,8,9}"
/>
</Interceptor>

(with the “host” being the other host in the 2-node cluster on each side). Starting with this configuration brings an annoying warning message:

WARNING: [SetPropertiesRule]{Server/Service/Engine/Cluster/Channel/Interceptor/Member} Setting property ‘uniqueId’ to ‘{1,2,3,4,5,6,7,8,9,0}’ did not find a matching property.

So I guess that property has been removed and the docs not updated; didn’t seem necessary anyway given the host/port combo should be unique.

In any case, the log at first looks encouraging:

Aug 13, 2010 10:13:04 PM org.apache.catalina.ha.tcp.SimpleTcpCluster memberAdded
INFO: Replication member added:org.apache.catalina.tribes.membership.MemberImpl[tcp://vm-centos-cluster-tcs.sosiouxme.lan:8210,vm-centos-cluster-tcs.sosiouxme.lan,8210, alive=0,id={0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 }, payload={}, command={}, domain={100 101 108 116 97 45 99 108 117 …(13)}, ]
Aug 13, 2010 10:13:04 PM org.apache.catalina.tribes.membership.McastServiceImpl waitForMembers
INFO: Sleeping for 1000 milliseconds to establish cluster membership, start level:4
Aug 13, 2010 10:13:04 PM org.apache.catalina.ha.tcp.SimpleTcpCluster memberAdded
INFO: Replication member added:org.apache.catalina.tribes.membership.MemberImpl[tcp://{172, 31, 1, 108}:8210,{172, 31, 1, 108},8210, alive=25488670,id={66 108 -53 22 -38 -64 76 -110 -110 -54 28 -11 -126 -44 66 28 }, payload={}, command={}, domain={}, ]

Sounds like the other cluster member is found, right? I don’t know why it’s in there twice (once w/ hostname, once with IP) but I think the first is the configuration value, and the second is for when the member is actually contacted (alive=).

And indeed, for one node, later in the log as the applications are being deployed, I see the session replication appears to happen for each:

WARNING: Manager [localhost#/petclinic], requesting session state from org.apache.catalina.tribes.membership.MemberImpl[tcp://vm-centos-cluster-tcs.sosiouxme.lan:8210,vm-centos-cluster-tcs.sosiouxme.lan,8210, alive=0,id={0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 }, payload={}, command={}, domain={100 101 108 116 97 45 99 108 117 …(13)}, ]. This operation will timeout if no session state has been received within 60 seconds.
Aug 13, 2010 10:13:09 PM org.apache.catalina.ha.session.DeltaManager waitForSendAllSessions
INFO: Manager [localhost#/petclinic]; session state send at 8/13/10 10:13 PM received in 186 ms.

Um, why is that a WARNING? Looks like normal operation to me, surely it should be an INFO. Whatever. The real problem is on the other side of the node:

13-Aug-2010 22:25:16.624 INFO org.apache.catalina.ha.session.DeltaManager.start Register manager /manager to cluster element Engine with name Catalina
13-Aug-2010 22:25:16.624 INFO org.apache.catalina.ha.session.DeltaManager.start Starting clustering manager at/manager
13-Aug-2010 22:25:16.627 WARNING org.apache.catalina.ha.session.DeltaManager.getAllClusterSessions Manager [localhost#/manager], requesting session state from org.apache.catalina.tribes.membership.MemberImpl[tcp://vm-centos-cluster-ers.sosiouxme.lan:8210,vm-centos-cluster-ers.sosiouxme.lan,8210, alive=0,id={0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 }, payload={}, command={}, domain={100 101 108 116 97 45 99 108 117 …(13)}, ]. This operation will timeout if no session state has been received within 60 seconds.
13-Aug-2010 22:26:16.635 SEVERE org.apache.catalina.ha.session.DeltaManager.waitForSendAllSessions Manager [localhost#/manager]: No session state send at 8/13/10 10:25 PM received, timing out after 60,009 ms.

And so on for the rest of my webapps too! And if I try to access my cluster, it seems to be hung! While I was writing this up I think I figured out the problem with that. On the first node, I had gone into the context files for the manager and host-manager apps and set distributable=”false” (doesn’t really make sense to distribute manager apps). On the second node I had not done the same. My bad, BUT:

  1. Why did it take 60 seconds to figure this out for EACH app; and
  2. Why did EVERY app, not just the non-distributable ones, fail replication (at 60 seconds apiece)?

Well, having cleared that up my cluster with DeltaManager session replication seems to be working great.

Cluster with BackupManager, multicast

OK, here’s the surprise denouement – this seems to have just worked out of the box. I didn’t even think multicast would work without me having to tweak something in my OS (CentOS 5.5) or my router. But it did, and failover worked flawlessly when I killed a node too. Sweet.
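For the record, the BackupManager setup that worked out of the box needs little more than naming the manager inside the cluster element – a minimal sketch, with everything else (multicast membership included) left at the defaults:

<!-- inside <Engine> or <Host> in server.xml -->
<Cluster className="org.apache.catalina.ha.tcp.SimpleTcpCluster">
    <Manager className="org.apache.catalina.ha.session.BackupManager"/>
</Cluster>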