OpenShift, Apache, and severe hair removal

I just solved a problem that stumped me for a week. It was what we in the business call “a doozy”. I’ll share here, mostly to vent, but also in case the process I went through might help someone else.

The problem: a non-starter

I’ve been working on packaging all of OpenShift Enterprise into a VM image with a desktop for convient hacking on a personal/throwaway environment. It’s about 3 GB (compressed) and takes an hour or so to build. It has to be built on our official build servers using signed RPMs via an unattended kickstart. There have been a few unexpected challenges, and this was basically the cherry on top.

The problem was that the first time the VM booted… openshift-broker and openshift-console failed to start. Everything else worked, including the host-level httpd. Those two (which are httpd-based) didn’t start, and they didn’t leave any logs to indicate why. They didn’t even get to the point of making log files.

And the best part? It only happened the first time. If you started the services manually, they worked. If you simply rebooted after starting the first time… they worked. So basically, the customer’s first impression would be that it was hosed… even though it magically starts working after a reboot, the damage is done. I would look like an idiot trying to release with that little caveat in place.

Can you recreate it? Conveniently?

WTF causes that? And how the heck do I troubleshoot? For a while, the best I could think of was starting the VM up in runlevel 1 (catch GRUB at boot and add the “1” parameter), patching in diagnostics, and then proceeding to init. After one run, if I don’t have what I need… recreate the VM and try again. So painful I literally just avoided it and busied myself with other things.

The first breakthrough was when I tried to test the kinds of things that happen only at the first boot after install. There are a number, potentially – services coming up for the first time and so forth. Another big one is network initialization on a new network and device. I couldn’t see how that would affect these services (they are httpd binding only to localhost), but I did experiment with changing the VM’s network device after booting (remove the udev rule, shut down, remove the device, add another), and found that indeed, it caused the failure on the next boot reliably.

So it had to be something to do with network initialization.

What’s the actual error?

Being able to cause it at will on reboot meant much easier iteration of diagnostics. First I tried just adding a call to ifconfig in the openshift-broker init script. I couldn’t see anything in the console output, so I assumed it was just being suppressed somehow.

Next I tried to zero in on the actual failure. When invoked via init script, the “daemon” function apparently swallows console output from the httpd command, but it provides an opportunity to add options to the command invocation, so I looked up httpd command line options and found two that looked helpful: “-e debug -E /var/log/httpd_console”:

-e level
Sets the LogLevel to level during server startup. This is useful for temporarily increasing the verbosity of the error messages to find problems during startup.

-E file
Send error messages during server startup to file.

This let me bump up the logging level at startup and capture the server startup messages. (Actually probably only the latter matters. Another one to keep in mind is -X which starts it as a single process/thread only – helpful for using strace to follow it. Not helpful here though.)

This let me see the actual failure:

 [crit] (EAI 9)Address family for hostname not supported: alloc_listener: failed to set up sockaddr for 127.0.0.1

Apparently the httpd startup process tries to bind to network interfaces before it even opens logs, and this is what you get when binding to localhost fails.

What’s the fix?

An error is great, but searching the mighty Google for it was not very encouraging. There were a number of reports of the problem, but precious little about what actually caused it. The closest I could find was this httpd bug report:

Bug 52709 – Apache can’t bind to 127.0.0.1 if eth0 has only IPv6

[…]

This bug also affects httpd function ap_get_local_host() as described in http://bugs.debian.org/629899
Httpd will then fail to get the fully qualified domain name.

This occurs when apache starts before dhclient finished its job.

Here was something linking the failure to bind to 127.0.0.1 to incomplete network initialization by dhclient. Suddenly the fact that my test “ifconfig” at service start had no output did not seem like a fluke. When I started the service manually, ifconfig certainly had output.

So, here’s the part I don’t really claim to understand. Apparently there’s a period after NetworkManager does its thing and other services are starting where, in some sense at least, the network isn’t really available. At least not in a way that ifconfig can detect, and not in a way that allows httpd to explicitly bind to localhost.

As a workaround, I added a shim service that would wait for network initialization to actually complete before trying to start the broker and console. I could add a wait into those service scripts directly, but I didn’t want to hack up the official files that way. So I created this very simple service (openshift-await-eth0) that literally just runs “ifconfig eth0” and waits up to 60 seconds for it to include a line beginning “inet addr:”. Notably, if I have it start right after NetworkManager, it finishes immediately, so it seems the network is up at that point, but goes away just in time for openshift-broker and openshift-console to trip over it. So I have the service run right before openshift-broker, and now my first boot successfully starts everything.

Since this is probably the only place we’ll ever deploy OpenShift that has dhclient trying to process a new IP at just the time the broker and console are being started, I don’t know that it will ever be relevant to anyone else. But who knows. Maybe someone can further enlighten me on what’s going on here and a better way to avoid it.

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: