OpenShift, Apache, and severe hair removal

I just solved a problem that stumped me for a week. It was what we in the business call “a doozy”. I’ll share here, mostly to vent, but also in case the process I went through might help someone else.

The problem: a non-starter

I’ve been working on packaging all of OpenShift Enterprise into a VM image with a desktop for convient hacking on a personal/throwaway environment. It’s about 3 GB (compressed) and takes an hour or so to build. It has to be built on our official build servers using signed RPMs via an unattended kickstart. There have been a few unexpected challenges, and this was basically the cherry on top.

The problem was that the first time the VM booted… openshift-broker and openshift-console failed to start. Everything else worked, including the host-level httpd. Those two (which are httpd-based) didn’t start, and they didn’t leave any logs to indicate why. They didn’t even get to the point of making log files.

And the best part? It only happened the first time. If you started the services manually, they worked. If you simply rebooted after starting the first time… they worked. So basically, the customer’s first impression would be that it was hosed… even though it magically starts working after a reboot, the damage is done. I would look like an idiot trying to release with that little caveat in place.

Can you recreate it? Conveniently?

WTF causes that? And how the heck do I troubleshoot? For a while, the best I could think of was starting the VM up in runlevel 1 (catch GRUB at boot and add the “1” parameter), patching in diagnostics, and then proceeding to init. After one run, if I don’t have what I need… recreate the VM and try again. So painful I literally just avoided it and busied myself with other things.

The first breakthrough was when I tried to test the kinds of things that happen only at the first boot after install. There are a number, potentially – services coming up for the first time and so forth. Another big one is network initialization on a new network and device. I couldn’t see how that would affect these services (they are httpd binding only to localhost), but I did experiment with changing the VM’s network device after booting (remove the udev rule, shut down, remove the device, add another), and found that indeed, it caused the failure on the next boot reliably.

So it had to be something to do with network initialization.

What’s the actual error?

Being able to cause it at will on reboot meant much easier iteration of diagnostics. First I tried just adding a call to ifconfig in the openshift-broker init script. I couldn’t see anything in the console output, so I assumed it was just being suppressed somehow.

Next I tried to zero in on the actual failure. When invoked via init script, the “daemon” function apparently swallows console output from the httpd command, but it provides an opportunity to add options to the command invocation, so I looked up httpd command line options and found two that looked helpful: “-e debug -E /var/log/httpd_console”:

-e level
Sets the LogLevel to level during server startup. This is useful for temporarily increasing the verbosity of the error messages to find problems during startup.

-E file
Send error messages during server startup to file.

This let me bump up the logging level at startup and capture the server startup messages. (Actually probably only the latter matters. Another one to keep in mind is -X which starts it as a single process/thread only – helpful for using strace to follow it. Not helpful here though.)

This let me see the actual failure:

 [crit] (EAI 9)Address family for hostname not supported: alloc_listener: failed to set up sockaddr for 127.0.0.1

Apparently the httpd startup process tries to bind to network interfaces before it even opens logs, and this is what you get when binding to localhost fails.

What’s the fix?

An error is great, but searching the mighty Google for it was not very encouraging. There were a number of reports of the problem, but precious little about what actually caused it. The closest I could find was this httpd bug report:

Bug 52709 – Apache can’t bind to 127.0.0.1 if eth0 has only IPv6

[…]

This bug also affects httpd function ap_get_local_host() as described in http://bugs.debian.org/629899
Httpd will then fail to get the fully qualified domain name.

This occurs when apache starts before dhclient finished its job.

Here was something linking the failure to bind to 127.0.0.1 to incomplete network initialization by dhclient. Suddenly the fact that my test “ifconfig” at service start had no output did not seem like a fluke. When I started the service manually, ifconfig certainly had output.

So, here’s the part I don’t really claim to understand. Apparently there’s a period after NetworkManager does its thing and other services are starting where, in some sense at least, the network isn’t really available. At least not in a way that ifconfig can detect, and not in a way that allows httpd to explicitly bind to localhost.

As a workaround, I added a shim service that would wait for network initialization to actually complete before trying to start the broker and console. I could add a wait into those service scripts directly, but I didn’t want to hack up the official files that way. So I created this very simple service (openshift-await-eth0) that literally just runs “ifconfig eth0” and waits up to 60 seconds for it to include a line beginning “inet addr:”. Notably, if I have it start right after NetworkManager, it finishes immediately, so it seems the network is up at that point, but goes away just in time for openshift-broker and openshift-console to trip over it. So I have the service run right before openshift-broker, and now my first boot successfully starts everything.

Since this is probably the only place we’ll ever deploy OpenShift that has dhclient trying to process a new IP at just the time the broker and console are being started, I don’t know that it will ever be relevant to anyone else. But who knows. Maybe someone can further enlighten me on what’s going on here and a better way to avoid it.

Im-a build me some Spring samples (Part 1: petcare) #fail

Just more examples of commonplace Eclipse/Java things that drive me nuts. I should say up front that the version of Eclipse I’m using is actually SpringSource Tool Suite (STS) version 2.3.2.RELEASE, running on the 1.6.0_20 64-bit Sun HotSpot JVM on Fedora 13.

There’s a trove of Spring sample projects over at https://src.springframework.org/svn/spring-samples/ – very helpful. So I had Eclipse check some out and import them as projects. Funny thing, there seem to be a lot of complaints about XML. I’m getting a little more savvy, let’s see if I can resolve these.

Pet Care

spring-petcare-3.0.xsd

From petcare’s servlet-context.xml, we get the good old “Referenced file contains errors (http://www.springframework.org/schema/petcare/spring-petcare-3.0.xsd).” OK – I’m pretty familiar with that. If you try to go to the URL given, SpringSource’s server redirects you to the home page. Seems the “petcare” schema just didn’t quite make the list with the rest of the framework – I can understand that. But what URL should we use instead?

Well, not surprisingly, the petcare projects contains its own XSD; it’s in /src/main/resources/org/springframework/samples/petcare/util/config/ – so how do I use that as the URL? Well, one way is to use the svn tree from the website; then the schemaLocation is https://src.springframework.org/svn/spring-samples/petcare/trunk/src/main/resources/org/springframework/samples/petcare/util/config/spring-petcare-3.0.xsd. Would have been nice if the project just did that, wouldn’t it? I would think that, distrusting SpringSource not to move/change it again, you could refer to it relative to src/main/resources/META-INF/spring/appServlet/servlet-context.xml as ../../../org/springframework/samples/petcare/util/config/spring-petcare-3.0.xsd (feeling the pain yet?) or copy it to the same directory and just refer to it as spring-petcare-3.0.xsd – but neither of these seems to work on deploy, while the web URL does. There’s a file META-INF/spring/spring.schemas that appears to be intended to specify where to look in the project for the schema, but I’m not sure how it’s supposed to be referred to in Eclipse to make this work.

Here’s a really great error Eclipse has on one of the XML elements that refers to the schema: “- schema_reference.4: Failed to read schema document ‘spring-petcare-3.0.xsd’, because 1) could not find the document; 2) the document could not be read; 3) the root element of the document is not <xsd:schema>.” So which is it? Good grief. Would it really be hard to make messages like this actually useful? This is so frustrating.

Amazingly enough, if you try to deploy and you haven’t cleaned this up, Tomcat goes and tries to download the XSD as well and fails if it comes out wrong. So there you have it: your ability to deploy seems to depend on whether the right web sites are up at the right time. I imagine there’s a Tomcat or context.xml setting to disable this behavior.

Spring NamespaceHandler for petcare:resources

OK. So now I’m pointing at the right XSD, but I still have problems with servlet-context.xml. Actually it’s only a warning, but: “Unable to locate Spring NamespaceHandler for element ‘petcare:resources’ of schema namespace ‘http://www.springframework.org/schema/petcare'&#8221; – again with the “Unable to locate” business. That crap sure gets old. Why don’t you give me some kind of clue? Where did you look, where should I look? Actually I think I understand this and it’s trying to say there’s no “resources” element defined in the XSD. But there certainly is, so I don’t really know what this is on about. It seems to work anyway on deploy so I’ll ignore for now.

root-context.xml: Build path is incomplete

On to root-context.xml, another Spring bean file. This has three errors listed. The first two are really confusing: Next to line 22 which looks like this:

<!-- Embedded H2 Database -->

I have the error “Build path is incomplete. Cannot find class file for org.springframework.samples.petcare.users.PetcareAuthenticationFailureHandler”. Next to line 24 which looks like this:

<jdbc:script location="classpath:schema.sql" />

I have the error “Build path is incomplete. Cannot find class file for org.springframework.samples.petcare.users.PetcareUserService”. WTF? Actually Eclipse helpfully gave me a clue below on line 56, which looks like this:

<import resource="security.xml" />

This has the warning “Validation warning occured in imported configuration file ‘src/main/resources/META-INF/spring/security.xml'”. Yes, as it turns out that file has errors on lines 22 and 24, where  they make a lot more sense. But they’re not reported in security.xml; they’re reported against irrelevant lines in root-context.xml, which just includes it. Truly mind-boggling.

Now from this, plus the “Build path is incomplete. Cannot find class file for org.springframework.samples.petcare.util.templating.DefaultStringTemplateFactory” on line 29, I gather the build path is incomplete. But all those files look to be provided by this project and the build paths seem to be set up fine. So what the heck is going on?

At this point I checked my system and noticed that after installing some unrelated things, Fedora has chosen java-1.5.0-gcj as my compiler. I don’t think Eclipse is using this, and theoretically it shouldn’t matter anyway, but I shut down Eclipse, reconfigure alternatives so the Sun JDK is again providing the default javac, restart Eclipse, and clean the project. Somewhere in there, the complaints went away.

page.jsp Type mismatch

Now I’m left with just one actual error in src/main/webapp/WEB-INF/layouts/page.jsp. Only, when I open that file, Eclipse doesn’t list any errors. I have to go down to the “Markers” tab to find it:

page.jsp line 62 Type mismatch: cannot convert from boolean to String

Line 62 is a completely innocuous HTML div. Eclipse is just bonkers. I delete the marker and now my petcare project is free of red Xes. I deploy and it works. That might be enough for some people… but now, what about all those warnings?

Unresolvable warnings

Our old friend servlet-context.xml still has a warning next to <petcare:resources> that says “Unable to locate Spring NamespaceHandler for element ‘petcare:resources’ of schema namespace ‘http://www.springframework.org/schema/petcare'&#8221;. I feel I’ve been as clear as possible about the petcare schema so I really don’t know what more there is to say. So let’s leave that (or delete it).

Under src/main/resources, log4j.xml complains “The file cannot be validated as the XML Schema “/home/luke/Documents/workspace-sts-2.3.2.RELEASE/petcare/src/main/java/log4j.dtd (No such file or directory)” that is specified as describing the syntax of the file cannot be located.” This seems to be universally ignored – as I recall, I saw some bogus explanation as to why it was bad to specify a URL for this DTD. You would think Eclipse could just supply the DTD itself, or find it somewhere in the log4j JAR in my Maven dependencies. I don’t have a clue what to do about this so I leave it.

The final warnings are in src/main/webapp/WEB-INF/views/appointments/calendar.jsp, where Eclipse is complaining about data-* attributes on some HTML tags. As I recall these attribute “extensions” are part of the HTML standard so I don’t know why Eclipse is complaining. No way to shut it up without disabling validation entirely, so again let’s leave it. And that’s it for petcare.

Memory (classloader) leak

BTW, petcare has a big fat memory leak in it somewhere, because after you’ve re-deployed a few times, you get a PermGen error. Nice going. I can use this to hone my memory-leak-debugging skills soon.