HA applications on OpenShift

Beginning with version 2.0, OpenShift Enterprise supports making applications “highly available”. However, there is very little documentation on the details of implementing HA apps or exactly how they work. We surely have customers using these, but they’re mostly doing so with a lot of direct help from our consulting services. Nothing in the documentation brings it all together, so I thought I’d share what I know at this point. I think we can expect these implementation details to remain fairly stable for the lifetime of OSE 2.

This is kind of a brain dump. We are working to get most of this into the manuals if it isn’t already (at least what’s fit for it and not just me rambling). But I don’t know anywhere else to find all this in one place right now.

Why and how?

The motivation behind supplying this feature and the basic architectural points and terminology are well covered at the start of the HA PEP. Definitely read this for background if you haven’t already; I won’t repeat most of the points.

As a super quick overview: HA is an enhancement to scaled apps. Standard scaled apps allow easy horizontal scaling by deploying duplicate instances of the application code and framework cartridge in multiple gears of your app, accessed via a single load balancer (LB) proxy gear (LB is performed by the haproxy cartridge in OSE 2). For a scaled app to become HA, it needs two or more LB gears on different nodes that can each proxy to the gears of the app. That way, if a node containing one LB gear goes down, you can still reach the app via another. This only makes the web framework HA — shared storage and DB replication aren’t OpenShift features (yet), so, practically speaking, your app should be stateless or use external HA storage/DB if you actually want it to be HA.

So how do we direct HTTP requests to the multiple LB gears? Via the routing layer, which is described in the PEP, but which OpenShift does not (yet) supply a component/configuration for. If you are interested in HA apps, then chances are good you already have some kind of solution deployed for highly-available routing to multiple instances of an application (e.g. F5 boxes). All you need to do is hook OpenShift and your existing solution together via the Routing System Programming Interface (SPI). Once this is configured, HA apps are given an alias in DNS that resolves to the routing layer, which proxies requests to the LB gears on OpenShift nodes, which proxy requests to the application gears.

Making an app HA (the missing User Guide entry)

The OSE admin guide indicates how administrators can configure OSE to allow HA apps and enable specific users to make apps HA. I’ll assume you have a deployment and user with this capability.

To make an HA app, you first create a scaled app, then make it HA. The “make-ha” REST API call is implemented as an event (similar to “start” or “stop”) on an application. Direct REST API access is currently the only way to do this – rhc and other tools do not implement this call yet – so currently the only mention is in the REST API guide. (Update 2014-12-19: rhc now has app-enable-ha and the docs have been updated. The preferred event name is now “enable-ha”.) For example:

$ rhc create-app testha ruby-1.9 -s

Your application 'testha' is now available.

$ curl -k -X POST https://broker.example.com/broker/rest/domains/test/applications/testha/events --user demo:password --data-urlencode event=make-ha

long JSON response including: "messages":[{"exit_code":0, "field":null, "index":null, "severity":"info", "text":"Application testha is now ha"}], "status":"ok"
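
If you have a recent rhc (per the 2014-12-19 update above), the same event can be triggered without curl; the exact invocation may vary slightly between rhc versions:

$ rhc app-enable-ha testha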

Upon success, the app will have scaled to two gears, and your second gear will also be an LB gear. You can confirm by sshing into the second gear and looking around.

$ rhc app show --gears -a testha

ID State Cartridges Size SSH URL
53c6c6b2e659c57659000002 started haproxy-1.4 ruby-1.9 small 53c6c6b2e659c57659000002@testha-test.example.com
53c6c74fe659c57659000021 started haproxy-1.4 ruby-1.9 small 53c6c74fe659c57659000021@53c6c74fe659c57659000021-test.example.com

$ ssh 53c6c74fe659c57659000021@53c6c74fe659c57659000021-test.example.com

> ls

app-deployments app-root gear-registry git haproxy ruby

> ps -e

12607 ? 00:00:00 haproxy
13829 ? 00:00:00 httpd

A third gear will not look the same; it will only have the framework cartridge. I should mention that at this moment there’s a bug in rhc such that it displays all framework gears as having the haproxy LB cartridge, when actually only the first two do (Update 2014-12-19: rhc output has since been corrected).

What does make-ha actually do?

Behind the scenes, the make-ha event makes a few critical changes.

First, a new DNS entry has been created to resolve requests to the router. How exactly this happens depends on configuration. (Refer to the Admin Guide for details on how the HA DNS entry is configured.) In the simplest path (with MANAGE_HA_DNS=true), OpenShift itself creates the DNS entry directly; the default is just to prepend “ha-” to the app name and point that entry at ROUTER_HOSTNAME. Thus our app above would now have a DNS entry ha-testha-test.example.com. With MANAGE_HA_DNS=false, OpenShift counts on the routing layer to receive the routing SPI event and create this DNS entry itself.
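
For reference, the knobs involved live in /etc/openshift/broker.conf. A minimal sketch follows; MANAGE_HA_DNS and ROUTER_HOSTNAME are the settings named above, while the HA_DNS_PREFIX name is from memory, so confirm all of them against the Admin Guide for your version:

# Excerpt from /etc/openshift/broker.conf (HA DNS behavior)
MANAGE_HA_DNS="true"              # broker creates the HA DNS entry itself
HA_DNS_PREFIX="ha-"               # prefix prepended to the app name for the HA alias
ROUTER_HOSTNAME="www.example.com" # where the HA alias points (your routing layer)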

In either case, this DNS entry is only useful if it points at the router, which serves as a proxy. A request to one of the nodes for “ha-testha-test.example.com” would not be proxied correctly – it’s supposed to be relayed by the routing layer as a request for “testha-test.example.com” (the head LB gear) or for one of the secondary LB gears. It’s also possible for the router to just proxy directly to the individual gears (endpoints for which are provided in the routing SPI); however, this may not be desirable for a variety of reasons.
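
A quick sanity check of that plumbing, using the hostnames from this example (substitute your own; the router address is whatever ROUTER_HOSTNAME resolves to):

$ dig +short ha-testha-test.example.com                    # should resolve to the routing layer, not to a node
$ curl -sI http://ha-testha-test.example.com/ | head -1    # request should be answered via the router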

A second change that occurs is that the parameters for the haproxy cartridge are modified. By default in a scaled application, the haproxy cartridge manifest has a minimum and maximum scale of 1 (see the manifest), so there will always be exactly one. But when you make an app HA, in the MongoDB record for the haproxy cartridge instance in your app, the minimum is changed to 2, and the maximum is changed to -1 (signifying no maximum). Also the HA multiplier is applied, which I’ll discuss later. As a result of raising the minimum to 2, the app scales up one gear to meet the minimum.
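
If you want to see that change from the outside, the app’s cartridge listing in the REST API reflects it. The request below is real; the exact field names (scales_from/scales_to) are my recollection of the OSE 2 REST API, so check the REST API guide:

$ curl -k -u demo:password https://broker.example.com/broker/rest/domains/test/applications/testha/cartridges | python -mjson.tool

... for haproxy-1.4, expect something like: "scales_from": 2, "scales_to": -1 ...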

There are some interesting oddities here.

First, I should note that you can’t scale the haproxy cartridge directly via the REST API at all. You’ll just get an error. You only have the ability to scale the framework cartridge, and the haproxy cartridge may be deployed with it.
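
For instance, a direct attempt like the following (shown for illustration) is simply rejected by the broker:

$ rhc scale-cartridge -a testha haproxy --min 2 --max 3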

Also, once your app is HA, you can no longer scale below two gears:

$ rhc scale-cartridge -a testha ruby --min 1 --max 1
Setting scale range for ruby-1.9 ... Cannot set the max gear limit to '1' if the application is HA (highly available)

By the way, there is no event for making your app not HA. Just destroy and re-create it. (Update 2014-12-19: There is now a disable-ha event that may be invoked on the API just like enable-ha [no rhc support yet]. It just changes the minimum requirements for haproxy cartridges, doesn’t immediately remove any cartridges/gears.)

Also, if you already scaled the framework cartridge to multiple gears, then making it HA will neither scale it up another gear nor deploy the haproxy cartridge on an existing gear (which is what I would have expected). So it will not actually be HA at that point. Instead, an haproxy cartridge will be deployed with the next gear created. If you then scale down, that gear will be destroyed, and your app again effectively ceases to be HA (Edit: this is considered a bug. Update 2014-12-19: it’s fixed). So, make sure you make your app HA before scaling it, so you are assured of having at least two LB gears. A little more about this at the end.

How does auto-scaling work with HA?

If you’re familiar with OpenShift’s auto-scaling, you may be wondering how two or more LB gears coordinate traffic statistics in order to decide when to scale.

First, I’ll say that if you’re going to the trouble to front your OpenShift app with an expensive HA solution, you may want to disable auto-scaling and let your router track the load and decide when to scale. Practically speaking, that can be tricky to implement (there actually isn’t an administrative mechanism to scale a user-owned app, so workarounds are needed, such as having all HA apps owned by an “administrative” user), but it’s something to think about.

That said, if you’re using auto-scaling, perhaps with a customized algorithm, you’ll want to know that the standard haproxy scaling algorithm runs in a daemon on the first gear (“head” gear) of the app, in both regular scaled apps and HA apps. In either case, it bases scaling decisions on a moving average of the number of HTTP sessions in flight. The only difference with an HA app is that the daemon makes a request each sampling period to all of the other LB gears to gather the same statistic and add it into the average.
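
If you want to poke at this yourself, the statistic in question is visible on haproxy’s stats page, which the haproxy cartridge exposes under the app’s URL (the /haproxy-status/ path and the ;csv suffix are standard haproxy stats behavior, but verify them on your deployment). The scur column is the sessions-in-flight figure the daemon averages:

$ curl -s "http://testha-test.example.com/haproxy-status/;csv" | head -3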

Also, auto-scaling does not occur if the head gear is lost. So, auto-scaling is not HA – another reason to consider manual scaling. Which brings me to…

What happens when a gear in an HA app is lost?

If a node crashes, or for some other reason one or more gears of your HA app are out of commission (rack PDU cut out? Someone tripped over the network cable?), how does your app react? Well, this is no different from a regular scaled app, but it bears repeating: nothing happens. The remaining LB gears notice the missing gears failing health checks and pull them out of rotation – and that’s it. There is no attempt to replace the missing gear(s), even if they are LB gears, or to augment the gear count to make up for them. No alerts are sent. As far as I’m aware, nothing even notifies the broker that a gear’s state has changed from “started” to “unresponsive”, so it’s hard for the app owner to even tell.

The whole point of HA apps is to remove the single point of failure in standard scaled apps by providing more than one endpoint for requests, instead of the app being unreachable when the head LB gear is lost. However, there are still some functions performed only by the head gear:

  1. Auto-scaling
  2. Deployment of application code changes

These functions are still not HA in an HA app. They still rely on the head gear, and there’s no mechanism for another gear to take over the head gear designation. What’s more, if a gear is out of commission when an application update is deployed, there isn’t any mechanism currently for automatically updating the gear when it returns (the next deploy will bring it in sync of course). These are all known problems that are in our backlog of issues to address, but aren’t terribly obvious to those not intimately familiar with the architecture.

The general assumption in OpenShift when a node disappears is that it will probably come back later – either after some crisis has passed, or via resurrection from backups. So, there is no attempt at “healing” the app. If, however, a node goes away with no hope for recovery, there is a tool oo-admin-repair which can do a limited amount of cleanup to make the MongoDB records accurately reflect the fact that the gears on that node no longer exist. I’m not sure the extent to which scaled apps are repaired; I would expect that the list of gears to route to is accurately updated, but I would not expect the head gear to be replaced (or re-designated) if lost, nor any gears that aren’t scaled (databases or other services that don’t scale yet). If the head gear is intact and auto-scaling in use, then scaling should continue normally. If auto-scaling is disabled, I would not expect any scaling to occur (to replace missing gears) without the owner triggering an event. I haven’t rigorously researched these expectations, however. But let’s just say OpenShift nodes aren’t ready to be treated like cattle yet.

How do I know my gears are spread out enough?

Under OSE 2.0, gears gained anti-affinity, meaning that gears of a scaled app will avoid being placed together on the same node if there are other nodes with available capacity. OSE 2.1 provided more explicit mechanisms for controlling gear placement. The most relevant one here is putting nodes in availability zones — for example, you could use a different zone for each rack in your datacenter. If possible, OpenShift will place the gears of an HA app across multiple zones, ensuring availability if one entire zone of nodes is lost at once.

To have better assurance than “if possible” on your HA app spanning zones, take a look at the two settings in the broker mcollective plugin that force your app to do so.
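
As a sketch, those two settings live in the broker’s mcollective plugin configuration; the names below are my best recollection, so verify them against the Admin Guide before relying on them:

# Excerpt from /etc/openshift/plugins.d/openshift-origin-msg-broker-mcollective.conf
ZONES_REQUIRE_FOR_APP_CREATE=true   # only place gears on nodes that have a zone assigned
ZONES_MIN_PER_GEAR_GROUP=2          # require each gear group to be spread across at least this many zones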

There’s one major caveat here. Anti-affinity only applies when a gear is placed. The OpenShift administrators could move gears between nodes after the fact, and if they’re not being careful about it, they could move the LB gears of an HA app all to the same node, in which case it would effectively cease to be HA since one node outage would leave the app unavailable. There’s really nothing to prevent or even detect this circumstance at this time.

Verifying that your gears actually are spread across zones as expected is currently a laborious manual operation (list all the gears, see which node their addresses resolve to, check which zones the nodes are in), but the REST API and the tools that access it are being expanded to expose the zone and region information to make this easier.
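
In the meantime, the manual check looks roughly like this (the first two commands can run anywhere with access to the broker and DNS; the last runs on the broker, and its exact invocation may differ by version):

$ rhc app show --gears -a testha                    # list the gears and their SSH hosts
$ host 53c6c74fe659c57659000021-test.example.com    # resolve a gear's DNS name to the node it lives on
# oo-admin-ctl-district                             # on the broker: dump districts, including each node's region and zone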

Let me add one more bit of direction here regarding zones, regions, and districts. Although you need to create districts, and although you use the district tool to assign regions and zones to your nodes, there really is no relationship between districts and zones/regions. They exist for (almost) orthogonal purposes. The purpose of districts is to make moving gears between nodes easy. The purpose of zones and regions is to control where application gears land with affinity and anti-affinity. Districts can contain nodes from multiple zones, and zones can span multiple districts. It actually makes a lot of sense to have multiple zones in a district and zones spanning districts, because that way if in the future you need to shut down an entire zone, it will be easy to move all of the gears in the zone elsewhere, because there will be other zones in the same district. OTOH, while you *can* have multiple regions in a district, it doesn’t make a lot of sense to do so, because you won’t typically want to move apps en masse between regions.

The scaling multiplier: more than 2 LB gears

Of course, with only two LB gears, it only takes two gear failures to make your app unreachable. For further redundancy, as well as extra load balancing capacity, you might want more than the two standard LB gears in an HA app. This can be accomplished with a (currently esoteric) setting called the HA multiplier, which indicates that one in every N framework gears should also be an LB gear. A multiplier of 3 means every third gear should be an LB gear: the first two LB gears satisfy that ratio up through 6 gears, and then the 9th, 12th, and so on would also be LBs. A multiplier of 10 would mean gears 30, 40, 50, … become LB gears.

The default multiplier is 0, which indicates that after the first two LB gears are created to satisfy the minimum requirement for the HA app (remember, the haproxy minimum scale is changed to 2 for an HA app), no more will ever be created. The multiplier can be configured in two ways:

  1. Setting DEFAULT_HA_MULTIPLIER in broker.conf – currently undocumented, but it will be read and used if present. This sets the default, of course, so it is only relevant at the time an app is created. Changing it later doesn’t affect existing apps.
  2. Using oo-admin-ctl-app to change the multiplier for a specific app, e.g.:

# oo-admin-ctl-app -l demo -a testha -c set-multiplier --cartridge haproxy-1.4 --multiplier 3

Note that both are administrative functions, not available to the user via the REST API.

The gear-to-LB ratio: rotating out the LB gears

At app creation, LB gears also directly (locally) serve application requests via the co-located framework cartridge. Once a scaled app is busy enough, gears serving as load balancers may consume significant resources just for the load balancing, leaving less capacity to serve requests. Thus, for performance reasons, after a certain limit of gears created in an app, LB gears remove themselves from the load balancing rotation (“rotate out”). Under this condition, the framework cartridge is left running on LB gears, but receives no requests.

The limit at which this occurs is governed by an environment variable, OPENSHIFT_HAPROXY_GEAR_RATIO, which is read by the haproxy cartridge. Once the (rounded) ratio of total gears to LB gears reaches this limit, the LB gears rotate themselves out.

The default value is 3. So in a plain scaled app, once the third gear is created, the first gear is removed from rotation. In an HA app, once the 5th gear is created (5 gears / 2 LB gears rounds to 3) the first two gears are removed from rotation (resulting in an actual reduction from 4 gears servicing requests to 3). In general, it would be unwise to set this value higher than the HA multiplier, as the LB gears would be rotated in and out unevenly as the app scaled.

The most obvious way to change this value is to set it node-wide in /etc/openshift/env/OPENSHIFT_HAPROXY_GEAR_RATIO (an administrative action). However, being an environment variable, it’s also possible for the user to override it by setting an environment variable for an individual app (there’s no administrative equivalent).
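
Concretely, that looks something like the following (rhc env requires a reasonably recent rhc; treat the per-app override as something to verify in your environment):

# echo 4 > /etc/openshift/env/OPENSHIFT_HAPROXY_GEAR_RATIO   # node-wide default, as root on each node

$ rhc env set OPENSHIFT_HAPROXY_GEAR_RATIO=4 -a testha       # per-app override, set by the app owner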

Integration

The primary method of integrating a routing layer into OpenShift is via the routing SPI, a service that announces events relevant to a routing layer, such as gears being created. The Routing SPI is an interface in the broker application that may be implemented via a plugin of your choosing. The documentation describes a plugin implementation that relays the events to an ActiveMQ messaging broker, and then an example listener for consuming those messages from the ActiveMQ broker. It is important to note that this is only one implementation, and can be substituted with anything that makes sense in your environment. Publishing to a messaging bus is a good idea, but note that as typically deployed for OpenShift, ActiveMQ is not configured to store messages across restarts. So, definitely treat this only as informational, not a recommended implementation.
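
For the sample ActiveMQ plugin, the broker-side wiring amounts to installing the plugin gem and giving it a small configuration file. The path and setting names below are approximations from memory, so defer to the plugin’s own documentation:

# Excerpt from /etc/openshift/plugins.d/openshift-origin-routing-activemq.conf (names approximate)
ACTIVEMQ_HOST="activemq.example.com"
ACTIVEMQ_USERNAME="routinginfo"
ACTIVEMQ_PASSWORD="changeme"
ACTIVEMQ_TOPIC="/topic/routinginfo"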

Another method for integrating a routing layer is via the REST API. Because access is per user and there is no administrative role, the REST API is somewhat limited in an administrative capacity; for example, there is no way to list all applications belonging to all users. However, querying the REST API could be suitable when implementing an administrative interface to OpenShift that intermediates all requests to the broker, at least for apps that require the routing layer (i.e. HA apps). For example, there could be a single “administrative” user that owns all production applications that need to be HA, and some kind of system for regular users to request apps via this user. In such cases, it may make sense to retrieve application details synchronously while provisioning via the REST API, rather than asynchronously via the Routing SPI.

The most relevant REST API request here is for application endpoints. This includes the connection information for each exposed cartridge on each gear of a scaled app. For example:

$ curl -k -u demo:password https://broker.example.com/broker/rest/domains/test/applications/testha/gear_groups?include=endpoints | python -mjson.tool

... ( example LB endpoint: )
 {
   "cartridge_name": "haproxy-1.4", 
   "external_address": "192.168.122.51", 
   "external_port": "56432", 
   "internal_address": "127.9.36.2", 
   "internal_port": "8080", 
...
   "protocols": [
     "http", 
     "ws"
   ], 
   "types": [
     "load_balancer"
   ]
 }
...

Odds and ends

It’s perhaps worth talking about the “gear groups” shown in the API response a little bit, as they’re not terribly intuitive. Gear groups are sets of gears in the app that replicate the same cartridge(s). For a scaled app, the framework cartridge would be in one gear group along with the LB cartridge that scales it. The haproxy LB cartridge is known as a “sparse” cartridge because it is not deployed in all of the gears in the group, just some. If a database cartridge is added to the app, it is placed in a separate gear group (OpenShift doesn’t have scaling/replicated DB cartridges yet, but when it does, these gear groups will grow larger than one gear and will “scale” separately from the framework cartridge). Without parameters, the gear_groups REST API response doesn’t indicate which cartridge is on which gear, just the cartridges that are in a gear group and the gears in that group; this is why rhc currently indicates the haproxy is located on all framework gears. Specifying the inclusion of endpoints in the request (as in the example above) remedies this.

Remember how making an app HA makes it scale up to two LB gears, but not if it’s already at two gears? I suspect the “minimum” scale from the cartridge is applied to the gear group, without regard for the actual number of cartridge instances; so, while making an app HA sets the haproxy cartridge “minimum” to 2, the effect is that the gear group scale minimum is set to 2 (i.e. scale is a property of the gear group). If the gear group has already been scaled to 2 before being made HA, then it doesn’t need to scale up to meet the minimum (and in fact, it won’t allow scaling down, so you can’t ever get that second gear to have the haproxy LB).

One thing discussed in the PEP that has no implementation (aside from a routing layer component): there’s no mechanism for a single LB gear to proxy only to a subset of the other gears. If your app is scaled to 1000 gears, then each LB gear proxies to all 1000 gears (minus LB gears which we assume are rotated out). A few other pieces of the PEP are not complete as yet: removing or restarting specific gears, removing HA, and probably other bits planned but not delivered in the current implementation.