Working around the mysterious yum multilib error on an OpenShift Enterprise install

This one took a little while to track down and understand so I offer it here in the spirit of helping someone else get around this without that discovery process.

TL;DR

If, when installing cartridges on an OpenShift Enterprise node host, you get a yum error with “Error: Multilib version problems found.” then make sure you have followed the steps in https://access.redhat.com/site/articles/316613 even if you don’t want the JBoss cartridges.

The problem

The example configure script we’ve provided for OpenShift Enterprise currently installs node cartridges as a single yum command which attempts to install all of them, but hedges its bets with the --skip-broken flag in case something doesn’t work out with one or more cartridges. This keeps you from having an install where no cartridges get installed just because some fiddling dependency problem blocked one of them and the others could have proceeded fine. This is very helpful during our development testing. For a production deployment it may or may not be, depending on how you define “helpful.”

So, I found that if I ran a production configuration and did not follow the directions at https://access.redhat.com/site/articles/316613 to work around the nature of our JBoss subscription dependencies, not only did JBoss cartridge installation fail, but I got this bizarre yum error message:

Error: Multilib version problems found. This often means that the root
cause is something else and multilib version checking is just
pointing out that there is a problem.

[…]

Protected multilib versions: zlib-1.2.3-27.el6.i686 != zlib-1.2.3-29.el6.x86_64

(The full message is similar to this post.) Reading between the lines there, what this error tells you is: There is some kind of dependency problem, we found it when trying to update zlib (and found that we were trying to install mismatched versions on the two architectures i686 and x86_64), but the real problem is probably somewhere way upstream. Which is about as useless an error message as you can get, but I understand where these things come from – it’s pretty hard to unravel the chain of consequences sometimes to give a helpful error message.
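If you are collecting these errors from several hosts, the one concrete piece of information in the message is the pair of mismatched packages. Here is a minimal sketch of pulling that pair apart in Python. The function name and the regular expression are my own illustration (not part of yum), and it assumes the package strings follow the name-version-release.arch layout seen in the line above.

```python
import re

PREFIX = "Protected multilib versions:"

# Assumed layout: name-version-release.arch, e.g. zlib-1.2.3-27.el6.i686
NEVRA = re.compile(
    r"^(?P<name>.+)-(?P<version>[^-]+)-(?P<release>[^-]+)\.(?P<arch>[^.]+)$"
)

def parse_multilib_line(line):
    """Return two dicts describing the mismatched packages, or None."""
    line = line.strip()
    if not line.startswith(PREFIX) or "!=" not in line:
        return None
    parts = [p.strip() for p in line[len(PREFIX):].split("!=")]
    matches = [NEVRA.match(p) for p in parts]
    if not all(matches):
        return None
    return [m.groupdict() for m in matches]

if __name__ == "__main__":
    msg = ("Protected multilib versions: "
           "zlib-1.2.3-27.el6.i686 != zlib-1.2.3-29.el6.x86_64")
    for pkg in parse_multilib_line(msg):
        print(pkg["name"], pkg["release"], pkg["arch"])
```

This only tells you which package yum tripped over, not why; the cause still has to be dug out of the verbose output.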

So, what the heck, right?

Also, if you try to install individual cartridges, they’ll work fine. If you remove the JBoss ones from the yum command, it works fine. It only fails on exactly the command that is run by default by the script.

Debugging

Running yum with the -v flag (verbose) gives a lot more info. I did not chase this down to the exact dependency chain but I gather the following describes what’s happening.

The first annoying question is, why is it even trying to install the i686 zlib in the first place? Everything I’m using is 64-bit. The next question is, why is it trying to install different versions? The -27 release is available for both architectures, why isn’t it just using that?

The clue comes in these lines of debugging output:

SKIPBROKEN: removing zlib-devel-1.2.3-29.el6.x86_64 from transaction 
SKIPBROKEN: removing zlib-devel-1.2.3-29.el6.x86_64 from pkgSack & updates
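The verbose output is long, so it can help to filter it down to just what --skip-broken threw away. The following is a rough sketch, assuming the output was saved to a file or string and that the removal lines look like the two quoted above. The helper name is hypothetical.

```python
def skipbroken_removals(log_text):
    """Return the packages named in SKIPBROKEN removal lines, in first-seen order."""
    marker = "SKIPBROKEN: removing "
    seen = []
    for line in log_text.splitlines():
        line = line.strip()
        if not line.startswith(marker):
            continue
        # The package is everything between the marker and " from ...".
        pkg = line[len(marker):].split(" from ")[0]
        if pkg not in seen:
            seen.append(pkg)
    return seen

if __name__ == "__main__":
    sample = (
        "SKIPBROKEN: removing zlib-devel-1.2.3-29.el6.x86_64 from transaction\n"
        "SKIPBROKEN: removing zlib-devel-1.2.3-29.el6.x86_64 from pkgSack & updates\n"
    )
    print(skipbroken_removals(sample))
```

Anything on that list that other packages in the transaction still need is a good candidate for the real source of trouble.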

zlib-devel requires a zlib with the same version, and itself is required by freetype-devel, a requirement of the python cartridge.  It is also a dependency for libxml2-devel (required from the ruby cartridge) and openssl-devel (distant requirement for several cartridges). So let’s just say it is in the thick of things.

Unraveling the chain

I gather that what happened is this: With the channels broken, yum got deep into resolution and realized it could not install the JBoss cartridges. With the --skip-broken flag, yum removed from the transaction those cartridges and all of the package versions it had included for their dependencies; then it tried to resolve dependencies for the rest of the transaction with ALL of those packages not allowed in the transaction (as any could be the source of the problem). Apparently zlib-devel-1.2.3-29.el6.x86_64 was part of what was excised. But zlib-devel was still needed by other cartridges, and this is where it gets really bizarre.

openssl-devel-1.0.0-27.el6_4.2.x86_64 requires: zlib-devel
--> Processing Dependency: zlib-devel for package: openssl-devel-1.0.0-27.el6_4.2.x86_64
Searching pkgSack for dep: zlib-devel
TSINFO: Marking zlib-devel-1.2.3-27.el6.x86_64 as install for openssl-devel-1.0.0-27.el6_4.2.x86_64
[…]
---> Package zlib-devel.x86_64 0:1.2.3-27.el6 will be installed

[…]

zlib-devel-1.2.3-27.el6.x86_64 requires: zlib = 1.2.3-27.el6
--> Processing Dependency: zlib = 1.2.3-27.el6 for package: zlib-devel-1.2.3-27.el6.x86_64

Searching pkgSack for dep: zlib
Potential resolving package zlib-1.2.3-27.el6.x86_64 has newer instance installed.
TSINFO: Marking zlib-1.2.3-27.el6.i686 as install for zlib-devel-1.2.3-27.el6.x86_64
--> Running transaction check
---> Package zlib.i686 0:1.2.3-27.el6 will be installed

Because zlib-devel -29 was verboten, it fell back to zlib-devel -27. But that requires zlib -27, and -29 was already installed. You can’t have both installed at the same time, but yum saw a way to resolve the request by pulling in the zlib package from the i686 arch (zlib-devel did not specify arch in its requirement). So now yum wanted to install zlib-1.2.3-27.i686 on a system where zlib-1.2.3-29.x86_64 was already installed. It wasn’t allowed to downgrade the existing installation and couldn’t find any other way to satisfy the requirements under the restrictions imposed by having removed part of the transaction. So finally yum gave up and I got this confusing error about the multilib versions not matching, far from the source of the problem (which is, arguably, yum’s process for dependency resolution following pruning by --skip-broken).
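To make that chain concrete, here is a toy model of it in Python. To be clear, this is not how yum is implemented; it is a deliberately tiny sketch under my own assumptions (releases reduced to the integers 27 and 29, a made-up six-package repo, invented function names) that only reproduces the sequence of decisions described above.

```python
from collections import namedtuple

Pkg = namedtuple("Pkg", "name release arch")

# Toy repository (an assumption for illustration, not real channel contents).
REPO = [
    Pkg("zlib", 27, "x86_64"), Pkg("zlib", 27, "i686"),
    Pkg("zlib", 29, "x86_64"), Pkg("zlib", 29, "i686"),
    Pkg("zlib-devel", 27, "x86_64"), Pkg("zlib-devel", 29, "x86_64"),
]
INSTALLED = [Pkg("zlib", 29, "x86_64")]

def pick(name, release=None, excluded=()):
    """Pick the newest allowed candidate, preferring x86_64 on ties."""
    candidates = []
    for p in REPO:
        if p.name != name or p in excluded:
            continue
        if release is not None and p.release != release:
            continue
        # Mimics "has newer instance installed": never downgrade a name.arch.
        if any(i.name == p.name and i.arch == p.arch and i.release > p.release
               for i in INSTALLED):
            continue
        candidates.append(p)
    return max(candidates, key=lambda p: (p.release, p.arch == "x86_64"),
               default=None)

def resolve_zlib_devel(excluded=()):
    """Resolve a request for zlib-devel; it requires zlib of the same release."""
    devel = pick("zlib-devel", excluded=excluded)
    zlib = pick("zlib", release=devel.release)
    return [devel, zlib]

def multilib_conflicts(txn):
    """Same-name packages on different arches whose releases differ."""
    present = INSTALLED + [p for p in txn if p not in INSTALLED]
    return [(a, b) for a in txn for b in present
            if a.name == b.name and a.arch != b.arch and a.release != b.release]

if __name__ == "__main__":
    pruned = [Pkg("zlib-devel", 29, "x86_64")]  # what skip-broken excised
    txn = resolve_zlib_devel(excluded=pruned)
    print(txn)
    print(multilib_conflicts(txn))
```

With nothing excluded, the model picks zlib-devel 29 and is satisfied by the installed zlib. With zlib-devel 29 excluded, it falls back to 27, cannot use the x86_64 zlib 27 because 29 is installed, grabs the i686 one instead, and the multilib check then flags exactly the mismatch from the error message.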

Fixing the repo configuration (https://access.redhat.com/site/articles/316613) so that JBoss cartridges install cleanly fixes the problem. Not attempting to install them also fixes the problem.

Hopefully this provides some useful insight into this kind of problem, even for those who arrive here by completely different means.


This sort of thing makes me so mad I could spit

Often projects have external dependencies. Sometimes their whole job is to resolve external dependencies. Maven, yum, rubygems, heck CPAN!

These projects should be built as if their dependencies are trying to deliberately sabotage them. Reminds me of the talk I saw at Uberconf that drove home that point: stuff goes wrong, so you gotta protect yourself at all of the integration points.

When it comes to external dependencies, if something goes wrong, you really need to point the user in the right direction. You need to say “here’s what I was looking for, and here’s why I was looking for it, and here’s where I looked, and here’s what I got, and here’s what’s in my cache, and here’s the problem, and here’s what you can do once you’ve figured out why I got the wrong answer.” Then the user has some idea what to pursue. When they get an error like “Could not resolve dependency xyz” then guess where they end up? Posting desperately on some forum somewhere. Or Googling for that forum post.

Which brings me to today’s offender: Eclipse. Oh, Eclipse. Despite copious (even overwhelming) feedback to the user, how rarely you succeed in producing useful diagnostics. It’s bad enough when you’re installing a plugin and one of the dependencies in some repo is missing. Then you at least have a fighting chance of realizing that it was a dependency and who might be to blame. But Eclipse also validates XML against the stated DTD, and guess what? That’s an external dependency. And guess what happens when something goes wrong with that dependency?

Evidently Eclipse caches the broken DTD and refers to that to declare your XML invalid with the useful error message: Referenced file contains errors (http://tuckey.org/res/dtds/urlrewrite3.0.dtd). For more information, right click on the message in the Problems View and select "Show Details..."

Now that I look at the error, it’s actually better than I first read. It does blame the DTD, and it does say where it got it. At first I read this more like “Your file has errors.” (Helpful! Also what someone would see if they didn’t know what a DTD was for.) OK. Once I downloaded that file myself, I could see that it was broken. Actually tuckey.org seems to be sporadically serving that file wrong. The second time I downloaded it, it was fine. That’s going to happen, though, see? That’s sabotage. The really broken part here was that Eclipse actually cached that broken DTD after downloading it. You’d think that having detected it was broken, it might retry the download each time a validation was needed. It might provide you a mechanism to request a new download. It might actually inform you that the file is cached, for those of us not familiar with Eclipse internals.

A helpful error message would have been something like: “Eclipse tried to validate this XML file’s schema with the DTD downloaded from http://tuckey.org/res/dtds/urlrewrite3.0.dtd (as specified in the file) and cached at the location /some/path/urlrewrite3.0.dtd. This DTD file has errors; please check the DTD URL specified and the cached file to determine the source of the errors. To clear the cache and try the same URL again, (follow these instructions).”
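The behavior I’m asking for is not complicated. As a rough sketch of the idea (in Python, and in no way a description of what Eclipse actually does internally), a loader could simply refuse to cache anything that fails a sanity check, and say so in the error. Every name here is made up for illustration; the download and validity-check functions are passed in so the example does not depend on any real network or parser API.

```python
class DtdError(Exception):
    """Raised when a downloaded DTD fails the validity check."""

def load_dtd(url, cache, download, looks_valid):
    """Return the DTD text for url, caching only copies that pass looks_valid."""
    if url in cache:
        return cache[url]
    text = download(url)
    if not looks_valid(text):
        # Deliberately not cached, so the next validation retries the download.
        raise DtdError(
            "The DTD downloaded from %s (as specified in the XML file) has "
            "errors. It was not cached; the next validation will download "
            "it again." % url
        )
    cache[url] = text
    return text

if __name__ == "__main__":
    # Simulate a server that serves a broken file once, then a good one.
    responses = iter(["<html>oops</html>", "<!ELEMENT urlrewrite (rule*)>"])
    cache = {}
    url = "http://tuckey.org/res/dtds/urlrewrite3.0.dtd"
    for attempt in range(2):
        try:
            load_dtd(url, cache, lambda u: next(responses),
                     lambda t: t.lstrip().startswith("<!"))
            print("attempt", attempt, "ok")
        except DtdError as err:
            print("attempt", attempt, "failed:", err)
```

A flaky server then costs you one failed validation instead of a permanently poisoned cache.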

still kicking around eclipse problems [resolved?]

I don’t know what I did to deserve this :-) I think installing jpackage RPMs on my system was a mistake.

Java people seem to be perfectly happy to download a zip for eclipse, tomcat, etc. and just use that. So every user on a system gets their own copy? That’s just not how I’m used to doing application installs. (Anyway I tried it and had even weirder problems.)

I like to do things “the right way” with Fedora. I try to use yum to resolve all of my dependencies rather than go around installing things from source. Because if you have a package manager, you kind of need to let it be in charge, or the dependencies will bite your ass later. Right?

Jpackage is a valiant effort to bundle java dependencies into something that yum can consume. In particular it provides a compatibility RPM that sets up alternatives properly for the sun jdk rpm. It also has tomcat and other java notables. I don’t know much about the project other than that. I don’t remember why I added the repo initially; perhaps I was just hoping to get the compat RPM that way, since I can’t seem to find one to match the current sun jdk on the site (still can’t – had to set up alternatives manually). But as jpackage seems to provide bleeding-edge stuff, it’s replaced standard fedora rpms and won’t budge without a whole slew of updates being performed. I turned off the repo entirely after some of the scary suggestions and conflicts yum made with it.

So, fast forward to my eclipse/android problems. I kept getting an error about dependencies when I tried to install the eclipse ADT plugin – and no matter what software sites I enabled or RPMs I installed, it persisted. It was complaining about lucene (why should this be a dependency for what I’m doing? Don’t know, but that’s the nature of these things). So eventually I thought maybe the real problem was a conflict: RPMs I’d installed for dependencies were blocking what eclipse was actually looking for. I found that I had lucene RPMs installed from jpackage which fedora itself also offered, so I wanted to see what happened when I used the fedora versions.

This is where yum gave me problems.

Setting up Yum Shell
> remove lucene
Setting up Remove Process
> run
--> Running transaction check
---> Package lucene.noarch 0:2.4.1-5.jpp6 set to be erased
--> Processing Dependency: lucene for package: 1:openoffice.org-core-3.1.1-19.14.fc12.i686
--> Processing Dependency: lucene >= 2.3.1-3.4 for package: 1:eclipse-platform-3.5.1-22.fc12.i686
--> Processing Dependency: lucene = 2.4.1-5.jpp6 for package: lucene-contrib-2.4.1-5.jpp6.noarch
--> Running transaction check
---> Package eclipse-platform.i686 1:3.5.1-22.fc12 set to be erased
[...]
--> Finished Dependency Resolution
==============================================================================================
 Package                           Arch     Version                         Repository   Size
==============================================================================================
Removing:
 lucene                            noarch   2.4.1-5.jpp6                    installed   1.7 M
Removing for dependencies:
 eclipse-dltk                      noarch   1.0.0-3.fc12                    installed   6.8 M
 eclipse-dltk-ruby                 noarch   1.0.0-3.fc12                    installed   1.7 M
 eclipse-gef                       noarch   3.5.1-2.fc12                    installed   1.7 M
 eclipse-jdt                       i686     1:3.5.1-22.fc12                 installed    25 M
 eclipse-jgit                      noarch   0.6.0-0.1.git20091029.fc12      installed   741 k
 eclipse-platform                  i686     1:3.5.1-22.fc12                 installed    31 M
 lucene-contrib                    noarch   2.4.1-5.jpp6                    installed   864 k
 openoffice.org-brand              i686     1:3.1.1-19.14.fc12              installed   1.3 M
 openoffice.org-calc               i686     1:3.1.1-19.14.fc12              installed    61 k
 openoffice.org-calc-core          i686     1:3.1.1-19.14.fc12              installed    22 M
 openoffice.org-core               i686     1:3.1.1-19.14.fc12              installed   220 M
 openoffice.org-impress            i686     1:3.1.1-19.14.fc12              installed    58 k
 openoffice.org-impress-core       i686     1:3.1.1-19.14.fc12              installed   3.8 M
 openoffice.org-presenter-screen   i686     1:3.1.1-19.14.fc12              installed   3.7 M
 openoffice.org-writer             i686     1:3.1.1-19.14.fc12              installed    73 k
 openoffice.org-writer-core        i686     1:3.1.1-19.14.fc12              installed    17 M
Transaction Summary
==============================================================================================
Remove       17 Package(s)
Reinstall     0 Package(s)
Downgrade     0 Package(s)
Is this ok [y/N]:

Evidently lucene is a dependency for a multitude of eclipse and openoffice RPMs, so I didn’t want to just remove them – it would take me a long time to re-download and re-install all the RPMs once I’d updated lucene. And it shouldn’t be necessary.

To do things “the yum way” I would need to remove and replace them in one transaction. The Fedora versions are lower than the JPP versions so I tried a downgrade (I think this is new in F12).

> downgrade lucene lucene-contrib
> run
[nothing happens]
> Leaving Shell

I tried creating a transaction with a remove and install.

> remove lucene lucene-contrib
Setting up Remove Process
> install lucene lucene-contrib
Loading mirror speeds from cached hostfile
 * fedora: mirrors.rit.edu
 * rpmfusion-free: mirrors.tummy.com
 * rpmfusion-free-updates: mirrors.tummy.com
 * rpmfusion-nonfree: mirrors.tummy.com
 * rpmfusion-nonfree-updates: mirrors.tummy.com
 * updates: mirrors.rit.edu
Setting up Install Process
> run
--> Running transaction check
---> Package lucene.i686 0:2.3.1-5.5.fc12 set to be updated
---> Package lucene.noarch 0:2.4.1-5.jpp6 set to be erased
---> Package lucene-contrib.i686 0:2.3.1-5.5.fc12 set to be updated
---> Package lucene-contrib.noarch 0:2.4.1-5.jpp6 set to be erased
--> Finished Dependency Resolution
==============================================================================================
 Package                  Arch             Version                  Repository           Size
==============================================================================================
Installing:
 lucene                   i686             2.3.1-5.5.fc12           fedora              1.7 M
 lucene-contrib           i686             2.3.1-5.5.fc12           fedora              763 k
Removing:
 lucene                   noarch           2.4.1-5.jpp6             installed           1.7 M
 lucene-contrib           noarch           2.4.1-5.jpp6             installed           864 k
Transaction Summary
==============================================================================================
Install       2 Package(s)
Upgrade       0 Package(s)
Remove        2 Package(s)
Reinstall     0 Package(s)
Downgrade     0 Package(s)
Total size: 2.4 M
Is this ok [y/N]: y
Downloading Packages:
Running rpm_check_debug
Running Transaction Test
Finished Transaction Test
Error: Transaction Check Error:
  package lucene-contrib-0:2.4.1-5.jpp6.noarch (which is newer than lucene-contrib-0:2.3.1-5.5.fc12.i686) is already installed
  package lucene-0:2.4.1-5.jpp6.noarch (which is newer than lucene-0:2.3.1-5.5.fc12.i686) is already installed

I think finally I tried “the RPM way” and removed them with --nodeps and reinstalled them with yum (which of course complained about the RPM DB changing underneath it). Yum is intended to take over all system package management and do everything that RPM could do before, so what’s the way to serve this case?

Anyway, I eventually gave up on that installation – something else was still wrong after that – and moved over to my current installation which has no jpackage whatsoever. It seems to be working fine. Sorry jpackage folks, I don’t know what you’re doing but it doesn’t seem to work for me and I never want to mix repos like that again. I got so frustrated with the kinds of problems I saw I must have taken a year off my life. Have you ever clicked on a button to proceed and had NO response? No progress, no freezing up, no error, just… nothing, as if the button had no listener? Well I don’t know how it’s possible for normally functional software to be broken like that because of dependencies but I’ve had enough of it.

back in the saddle… (?)

Hoo boy have I been busy, mostly too busy to track it here. Found some actual contract work – well, if I could just get a signed contract. And also have been exploring Spring for a potential job offer. But my tech-fu has been failing left and right, don’t know what happened!

My trouble started a week ago when I went to my Android hack night and Eclipse on my laptop wouldn’t even bring up any editors. At the time I thought it was somehow related to the STS (SpringSource Tool Suite) I had installed (it’s another eclipse distro) but now I doubt it was related; there had been a number of other system updates and I think some of them caused problems, though I’m still at this date unclear what happened. I installed an entire parallel Fedora 12 system reusing my home directory but not introducing all the rpmfusion/jpackage stuff and it seemed to do better with Android but I ended up monkeying around with .eclipse and eclipse-workspace/.metadata until I’m not sure what I did anymore. All the while switching back and forth between the two OS installs did wonders for my firefox config. I *think* it’ll play Flash again now.

In the midst of all that I was trying to get work done and try out Spring. STS gave me fits, and I’m really not sure how much of it is STS/Java and how much was related to other issues I was having. I stopped trying to figure things out on the notebook after a while, but I tried STS on Ubuntu and on WinXP and the best I could get was that it would build the examples and deploy them to tomcat, but this might not work after the first time, and half of my projects were riddled with all kinds of dependency problems. Dependencies in Java are a nightmare. I’m sure if I sat in a cube next to some gurus for a few days, they’d get my system in shape. But having to figure it out alone is infuriating.

But the fun did not end there. My laptop started acting really odd yesterday; stopped being able to open applications and such. I rebooted and the shutdown had lots of complaints about stuff that was refusing to unmount and such. After reboot I couldn’t log in graphically. I got to a place where it was trying to login and stuck, and I couldn’t switch to any consoles so had to do a hard stop, and of course when it rebooted I was nailed with fsck (for BOTH installs of Fedora now). Then I was fiddling with it and started getting messages about “can’t modify (foo) – read-only filesystem” – this was for BOTH the OS and home partitions. From what I could gather in /var/log/messages, the OS started having problems with the filesystems and just remounted everything to read-only – have never seen this before. I rebooted (more scary umount msgs) and applied some Fedora updates, hopefully those may improve things, or I may just have a hardware problem and need a new drive or something. That’s doable but what a pain. I also noticed on the latest reboot that firefox is spawning off numerous gtk-gnash instances that burn CPU for two flash movies in my tabs that I’m not even playing. Killed them all but that really should not be happening. So not too confident in laptop still :-(

Agenda for today: root my old G1, try out CyanogenMod on it, and see if I can get androidscreencast going.