OpenShift: profiles, districts, and nodes, oh my!

Someone in #openshift-dev recently pointed out that the relationship between OpenShift profiles, districts, and nodes isn’t laid out clearly anywhere. I had a look through the docs, and I have to admit, he has a point. You can kind of infer it from various parts of the documentation, but I couldn’t find anywhere that simply states what I’m about to here. I’d be happy to be shown wrong.

TL;DR: Profiles contain districts, which contain nodes, which contain gears. You can’t have districts or nodes with multiple profiles. You can’t control what district or node a gear is created in.

If you’re familiar with OpenShift at all, you probably have at least some grasp of what a gear is: the basic unit of compute in OpenShift. Practically speaking, a gear is actually a regular old user account on a Linux host (an OpenShift node host), with a specified allocation of resources (RAM, disk, network, etc.), locked down by various containment mechanisms to just those resources. Much of OpenShift revolves around managing gears.

Profiles

Gears have a profile, also known as a size. I don’t like calling it a size, because it need not have anything to do with size, but we’re stuck with the term in a few places (notably, the DB and API) so I can’t pretend it’s not there. And for many deployments, it probably will be about size. But I’ll call it a profile here.

Profiles are the most fundamental way in which gears are grouped. The original point of profiles was to provide some uniformity for capacity planning, but you can really use them to partition your gears in any way you want – departmental ownership, location, high availability separation, security clearance, etc. We will have better ways to implement separation for some of these concepts in the not-too-distant future, but right now, gear profiles are the only mechanism OpenShift provides for giving different users access to separate parts of a deployment.

For such a fundamental piece of the architecture, it may be surprising that there is no model object or real definition of a profile in the OpenShift broker database schema. It is literally just a string, which can be whatever you want.

Nodes

An OpenShift node is a Linux host with OpenShift services for containing gears. There might conceivably be additional implementations in the future – the point is that nodes are containers for gears. Nodes have a gear profile, which is defined in /etc/openshift/resource_limits.conf – this file contains the gear profile string as well as all of the resources that a gear gets when it is created on that node.
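
To make that concrete, here is a sketch of what a resource_limits.conf for a “small” profile might contain. Treat the key names and values as illustrative; they vary between OpenShift releases, so check the file shipped with your node packages:

    node_profile=small
    max_active_gears=100
    quota_files=80000
    quota_blocks=1048576
    memory_limit_in_bytes=536870912

The node_profile value is the profile string the broker sees; the other settings describe the node’s gear capacity and the per-gear limits applied when gears are created there.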

Technically, a bunch of nodes all claiming to have a particular gear profile could specify wildly different resource limits. There’s nothing at the broker that would even know they were different, for the most part. But if you are a sane system administrator, you would not do this, except by accident. See, you would probably like to have some idea of how many gears you can fit on the node, so that you know when to make new nodes. And you would probably not like your gears in a profile to have randomly different resource limits.

Most OpenShift admins (and salespeople) wonder at some point how you can specify multiple gear profiles for a node. You can’t. If nodes could host multiple sizes, how would you know how many gears you can fit on that node? It comes as a great surprise to people who want to create a monster node host and run their whole PaaS off of it when we tell them “just partition it into VMs and give them different profiles.” So perhaps someday we will enable multiple profiles per host, but don’t bet on it. Constraints that seem unintuitive up close often make sense when you look at the bigger picture.

So at this point, gear profiles are synonymous with node profiles. A node contains gears of a particular profile, and a gear profile is constrained in number of gears by the number of node hosts configured with that profile.
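
To put rough numbers on that (purely for illustration): if the “small” profile’s resource_limits.conf caps each node at max_active_gears=100, then ten nodes carrying the small profile give you room for roughly a thousand active small gears, and the only way to grow the profile is to add more nodes configured with it.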

Districts

Of all the things that are misunderstood in OpenShift, I’m pretty sure districts are number one. I think it’s because they kind of sound like what profiles actually are – a partitioning scheme. They’re really not at all.

Conceptually, profiles contain districts, and districts contain node hosts. A district has a gear profile and can only contain nodes with that profile. But districts are completely invisible to users, and there is no way to specify which one a gear lands in when you’re creating a gear. They have nothing to do with permissions or partitions. To understand what they are for, you have to understand a little about the technical details that motivated them.

Sometimes, for one reason or another, you want to move a gear from one node host to another. You would like it to function exactly the same way on the new host as the old one. The problem is, there are a few resources that must be exclusive to any single gear on a host: internal IPs and external ports being the main ones. If you move a gear from one host to another, there’s no guarantee that the same resources it was using will be available on the new host, so you would need to detect this situation and reconfigure the gear to use unique resources that don’t clash. This is indeed the approach that was taken before districts, but it proved to be rather brittle.

  1. It led to all kinds of edge case bugs where things would wind up broken only when certain cartridges (or combinations of them, perhaps when scaled…) were moved to nodes where they needed to be reconfigured. In short, it was a regression testing nightmare.
  2. It also made cartridges hard to write correctly to handle moving, and we wanted writing cartridges to be easy so that lots of developers could contribute them.
  3. Finally, since gears are configured via setting environment variables, reconfiguring for a move meant changing environment variables. The parts of the gear that relied on these would work fine after a move, but the places where the app developer had hard-coded the values instead of using environment variables… broke. Naturally, developers assumed it was the administrator’s fault, or a bug in the PaaS. So it was an administrative nightmare too.

To get around this, OpenShift introduced a simple allocation scheme: to ensure that you can move a gear off of a node, reserve the unique resources it will need on multiple nodes.

In practice, a district is nothing more than a pool of numeric user IDs that are reserved against a set of nodes. Every time a gear is created, the broker first reserves an ID from the district pool; then on a node host in that district, a user is created with that ID, and algorithms based on that ID determine the range of resources available to the gear. Since the UID is reserved across all the nodes in the district, it is guaranteed to be available if you move the gear to any node in that district, so all the resources derived from it will also be available, and the gear needs no reconfiguration. Problem solved.
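
If it helps to see the idea in code, here is a minimal sketch of that kind of UID-based derivation. The constants and formulas below are invented purely for illustration; they are not OpenShift’s actual port ranges or IP scheme, only a demonstration of why a reserved UID makes every derived resource portable within a district:

    # Illustrative only: invented constants, not OpenShift's real values.
    PORTS_PER_GEAR = 5        # pretend each gear gets a small block of external proxy ports
    PORT_RANGE_START = 35531  # pretend start of the proxy port range
    UID_POOL_START = 1000     # pretend first UID handed out by a district

    def external_ports(uid):
        """External proxy ports reserved for the gear with this UID."""
        base = PORT_RANGE_START + (uid - UID_POOL_START) * PORTS_PER_GEAR
        return list(range(base, base + PORTS_PER_GEAR))

    def internal_ip(uid):
        """A loopback-style internal IP derived from the UID."""
        offset = uid - UID_POOL_START
        return "127.%d.%d.1" % (offset // 256 + 1, offset % 256)

    # Because the UID is reserved on every node in the district, moving the gear
    # to another node in the same district yields the same UID, and therefore the
    # same derived IP and ports: no reconfiguration needed.
    uid = 1042
    print(uid, internal_ip(uid), external_ports(uid))

The particular formulas don’t matter; the point is that everything a gear needs to be unique on a host flows from one number, and the district guarantees that number is free on every node in it.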

Unfortunately, having to explain all this just to get across what districts are for… is pretty awkward. But it’s a necessary concept to understand if you’re an OpenShift administrator.