In the past I’ve ended up installing a bunch of development tools and dependencies on my workstation OS. Then these have updates and the whole system has updates and I do `dnf update` and suddenly who knows what is broken. Meanwhile I want to tweak this, use a development version of that, etc. Before I know it my system is a giant ball of mud.

Well, I just got a new laptop and This Time It Will Be Different. This is a problem that containers were meant to solve. I will have only the necessary GUI packages installed on the host, and everything else will go into containers with a few tools for invoking them with all the necessary options (directories mounted in, user set, etc.) and updating them as needed. Wish me luck.

This basically means setting up each development environment from scratch in a container instead of on the host… teasing out which pieces of my old configuration were relevant to each environment. This will definitely create new management headaches… well, it will be a learning experience.

imagebuilder is a bare-bones “docker build” client with some defaults geared more toward automated builds: there’s no caching of layers and no squashing of layers at the end, but it can do multi-stage builds.

There’s a -from option for overriding FROM. What happens with that in a multi-stage build? Let’s find out.

First I ran into a problem. After I ran imagebuilder, the image wouldn’t run:

$ docker run -it 6eccbc13a8bb /bin/bash
/usr/bin/docker-current: Error response from daemon: oci runtime error: container_linux.go:247: starting container process caused "exec: \"/bin/bash\": stat /bin/bash: not a directory"

A quick internet search gave no clues. Trying a more minimal build, I was still seeing weird stuff: `exec: "sleep 86400": executable file not found in $PATH` (openshift/imagebuilder issue #68).

That seemed related but I’m trying to get at the inability to even run bash explicitly. Ah, finally figured it out:

COPY --from=0 foo /usr/bin

Without a trailing / this turns /usr/bin into a file. OK, my stupid mistake. At least I figured that out.

To answer the original question: if you specify -from on a multi-stage build, it overrides only the first FROM in the Dockerfile.
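To make that concrete, here’s a minimal multi-stage Dockerfile sketch (the image names and build step are placeholders, not from my actual setup):

```dockerfile
# Stage 0: the only FROM that imagebuilder's -from flag overrides.
FROM registry.example.com/builder
RUN make foo            # hypothetical build step producing ./foo

# Stage 1: this FROM is left alone by -from.
FROM registry.example.com/base
# The trailing slash matters: "/usr/bin/" copies foo into the directory,
# while "/usr/bin" (when that path doesn't already exist as a directory)
# creates a *file* named /usr/bin -- which is what bit me above.
COPY --from=0 foo /usr/bin/
```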



Thursday (after a lapse…)

Yesterday I finally got my diagnostics code in a good enough place to create a new pull request. Tests are flaking and there are some validation bits I missed but I think it’s in pretty good shape to address the feedback from the previous incarnation. So now I can move on to continue working on the AppCreate diagnostic knowing there’s a good chance it will be able to merge once this blocking design change is merged.

It looks like issues and bugs are piling up around the openshift-ansible docker_image_availability check. I’d really like to clear up a bunch of these relatively small things but I feel like I have to focus on the large things for extended periods and ignore everything else in order to make any progress. *sigh* perhaps I should designate one day a week just for powering through quick items. I suspect part of my resistance is the fear that small things will turn out not to be so small… that seems pretty common.



I’m trying out a new feature of Google Assistant and IFTTT, the email digest. This allows me to leave notes for myself, by calling out to my Google Home (or assistant on the phone) while I’m pacing around at the end of the hour because my Fitbit says I need to move. I’m trying to use it as a daily work journal. We’ll see. One shortcoming: if I get too loquacious then Google Home seems to get confused and say “I don’t understand” while Assistant just runs a useless query based on what I said. Another is that the transcription is pretty atrocious for most names – product names, infrastructure names, technical terms and such.

Aaaaand… I didn’t write down anything else for the day. *sigh*


My morning install ran into another new problem. Docker was running just fine at the beginning. The install reconfigures and restarts docker, and it fails to start, complaining about storage:

Error starting daemon: error initializing graphdriver: devicemapper:  Unable to take ownership of thin-pool (docker-docker--pool) that already has used data blocks

A Google search turns up some similar issues from a year or more ago, and it generally seems to relate to /var/lib/docker having been deleted after devicemapper storage was used. Nothing should have done that and I really don’t know why I’m seeing this while doing the same thing I’ve done before. Perhaps it’s a new version of docker in our internal repos. To get past it, I blew away /var/lib/docker and re-initialized storage.

Then things seemed to work until Docker actually needed to pull an image and run something. docker pull seemed completely broken:

$ sudo docker pull
Using default tag: latest
Trying to pull repository ... 
latest: Pulling from
00406150827c: Pulling fs layer 
00c572151848: Pulling fs layer 
dfcd8fbc5ec3: Pulling fs layer 
open /var/lib/containers/docker/tmp/GetImageBlob003587718: no such file or directory

This went away after restarting docker… again. What the heck?

And other stuff went wrong. I gave up on it. Then I went to work on some go code, and vim-go did its thing where it freezes for a while and spins up all my CPUs to run go oracle (or whatever) and I thought it might be a good time to find out about using evil mode in emacs. Or, I guess, spacemacs. Looks pretty cool (actually I learned some new vim sequences just by watching demo videos on youtube, so even if I don’t make the switch… cool).


I thought it would be nice to have aws-launcher be able to attach an extra volume to the nodes it creates, and at the same time I could stand to learn a little about the python boto3 module for manipulating AWS. As usual, navigating a new API takes a lot of fiddling around, and the docs just don’t really connect all the dots. For instance, apparently after creating a volume, I have to wait for it to become available before I can attach it to an instance. This is just not obvious until I actually try it and get a failure message. And why are there separate boto3.resource('ec2') and boto3.client('ec2') objects with different methods, and why do you need both to attach a volume? Why is there an instance.wait_until_running() method but no volume.wait_until_available() method? Why does the client doc not mention how to set the (required!) region on the client? Why are the examples and tutorials so limited?

Well, these are just the typical learning pains whenever I tackle a new API, and since it makes me want to avoid anything new, I need to get over it and just accept a certain amount of fumbling around until I get familiar enough.

Anyway, all that was fun, but it turns out all I really needed to do was specify the extra volume in the existing create_instances() call. That way I also don’t have to deal with state on the volume/instance (waiting until available, waiting until detached… why doesn’t EC2 have a fire-and-forget function on these?), the volume just lives while the instance does.
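The fire-and-forget behavior comes from DeleteOnTermination in the block device mapping passed to create_instances(). A sketch of the shape (the device name and size are just example values, not what aws-launcher actually uses):

```python
def extra_volume_mapping(device="/dev/xvdb", size_gb=100):
    """Build a BlockDeviceMappings entry for an extra EBS volume.

    DeleteOnTermination=True ties the volume's lifetime to the instance,
    so there's no separate create/wait/attach/detach state to manage.
    """
    return {
        "DeviceName": device,
        "Ebs": {
            "VolumeSize": size_gb,
            "VolumeType": "gp2",
            "DeleteOnTermination": True,
        },
    }

# With boto3 this goes straight into the existing create_instances() call:
# ec2 = boto3.resource("ec2", region_name="us-east-1")
# ec2.create_instances(..., BlockDeviceMappings=[extra_volume_mapping()])
```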

So that should make it easy to provide a storage volume for CNS.

Random little nugget. To run ansible repeatedly with the log going to a different file each time:

ANSIBLE_LOG_PATH=/tmp/ansible.log.$((count=count+1)) ansible-playbook -i ../hosts -vvv playbooks/byo/config.yml

Of course better solutions are things like ARA.

I ran into this little annoyance again while running a cluster install with Ansible:

2017-11-15 18:19:40,608 p=7182 u=ec2-user | Using module file /home/ec2-user/openshift-ansible/roles/openshift_facts/library/
2017-11-15 18:19:43,666 p=7182 u=ec2-user | failed: [] (item=prometheus) => {
 "changed": false, 
 "failed": true, 
 "item": "prometheus", 
 "module_stderr": "Shared connection to closed.\r\n", 
 "module_stdout": "Traceback (most recent call last):\r\n File \"/tmp/ansible_6AvhP1/\", line 2476, in <module>\r\n main()\r\n File \"/tmp/ansible_6AvhP1/ansible_module_o\", line 2463, in main\r\n protected_facts_to_overwrite)\r\n File \"/tmp/ansible_6AvhP1/\", line 1836, in __init__\r\n protected_facts_to_overwrite)\r\n 
File \"/tmp/ansible_6AvhP1/\", line 1885, in generate_facts\r\n facts = set_selectors(facts)\r\n File \"/tmp/ansible_6AvhP1/\", line 504, in 
set_selectors\r\n facts['prometheus']['selector'] = None\r\nTypeError: 'str' object does not support item assignment\r\n", 
 "msg": "MODULE FAILURE", 
 "rc": 0

The difference is that when I saw it previously, it had to do with logging, on hosts I had installed long before, so the “schema” of the facts file had changed in the meantime. This time it was about prometheus, and it happened on the initial run. So that’s interesting. It keeps anything else from running. I disabled the prometheus options and deleted /etc/ansible/facts.d/openshift.fact on all hosts to continue. Then I ran into yet more breakage: couldn’t pull images. I had to leave at that point, so I don’t know what went wrong; will try again tomorrow.
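The traceback boils down to a facts “schema” mismatch. A minimal Python reproduction of the failure mode (names and values are hypothetical, not the actual openshift_facts code):

```python
def set_selectors(facts):
    # What the role effectively does; blows up if a stale facts file
    # left facts['prometheus'] stored as a plain string instead of a dict.
    facts["prometheus"]["selector"] = None
    return facts

stale_facts = {"prometheus": "True"}  # old "schema": a string, not a dict
try:
    set_selectors(stale_facts)
except TypeError as e:
    print(e)  # 'str' object does not support item assignment
```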



Monday: not so successful.


I thought it would be cool to set up a cluster using CNS (Container Native Storage) to back everything that needs storage. Well, it’s a learning experience at least.

The first thing that happened was that the install broke trying to install iptables-services, because iptables needs to be updated to the matching version at the same time. Not sure if this is going to be a common problem for others, but I updated my tooling to fix it up.

Then I didn’t free up the devices that are needed on each node for GlusterFS to run. The CNS deploy failed. Once I fixed that up and ran it again I got a pretty mysterious error:

    "invocation": {
        "module_args": {
            "_raw_params": "oc rsh --namespace=glusterfs deploy-heketi-storage-1-fhh6f heketi-cli -s http://localhost:8080 --user admin --secret 'srtQRfJz4mh8PugHQjy3rgspHEfpumYC2dnBmQIoX9Y=' cluster list", 
    "stderr": "Error: signature is invalid\ncommand terminated with exit code 255", 

Turns out the problem was that the heketi pod was created with the secret in its env variables on the previous run, and then the secret was re-created for the second run. This playbook doesn’t handle consecutive runs too well yet. I added the openshift_storage_glusterfs_wipe=True option to inventory so it would start fresh and tried again. This time it failed differently:

 "invocation": {
 "module_args": {
 "_raw_params": "oc rsh --namespace=glusterfs deploy-heketi-storage-1-nww6c heketi-cli -s http://localhost:8080 --user admin --secret 'IzajucIIGPp0Tm3FyueSvxNs51YYjyTLGvWAqsvfolY=' topology load --jso
n=/tmp/openshift-glusterfs-ansible-dZSjA4/topology.json 2>&1", 
 "stdout": "Creating cluster ... ID: 00876e6ce506058e048c8d68500d194c\n\tAllowing file volumes on cluster.\n\tAllowing block volumes on cluster.\n\tCreating node ip-172-18-9-218.ec2.internal ... Unable to cre
ate node: New Node doesn't have glusterd running\n\tCreating node ip-172-18-8-20.ec2.internal ... Unable to create node: New Node doesn't have glusterd running\n\tCreating node ip-172-18-3-119.ec2.internal ... U
nable to create node: New Node doesn't have glusterd running",

But if I rsh into the pods directly and check if glusterd is running, it actually is. So I’m not sure what’s going on yet.

jarrpa set me straight on a number of things while trying to help me out here. For one thing I was thinking glusterfs could magically be used for backing everything once deployed. Not so; you have to define it as a storage class (there’s an option for that I hope works once I get the rest worked out: openshift_storage_glusterfs_storageclass=True). And then you have to make that storage class the default (apparently a manual step at this time), and have the other things you want to use it (logging, metrics, etc.) use dynamic provisioning for their storage. Something to look forward to.
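For my own reference, the relevant inventory bits for a fresh re-run plus the storage class look something like this (variable names as I understand them from openshift-ansible; placement in [OSEv3:vars] is my assumption):

```ini
[OSEv3:vars]
# start the GlusterFS deployment from scratch on consecutive runs
openshift_storage_glusterfs_wipe=True
# create a StorageClass so logging/metrics/etc. can dynamically provision from CNS
openshift_storage_glusterfs_storageclass=True
```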

I worked on a little PR to bring some sanity to the registry-console image.


Spent most of the morning checking email, reviewing PRs, checking bugs, and other such administrivia. Since we really actually branched 3.7 yesterday I updated the openshift-ansible image builder to start building 3.8 from master.

I decided to take a shot at an Ansible install of OpenShift v3.7 to take a look at some of the new features. The first thing I ran into is this lovely error:

 1. Hosts: localhost
 Play: Populate config host groups
 Task: Evaluate groups - Fail if no etcd hosts group is defined
 Message: Running etcd as an embedded service is no longer supported. If this is a new install please define an 'etcd' group with either one or three hosts. These hosts may be the same hosts as your masters. If this is an upgrade you may set openshift_master_unsupported_embedded_etcd=true until a migration playbook becomes available.

Ah, here’s the problem, in a warning that scrolled by quickly at the front of the Ansible run:

 [WARNING]: No inventory was parsed, only implicit localhost is available

So that’s what you get with the default inventory (simply including localhost) when your real inventory doesn’t parse. That could be more friendly. Also that seems like it should be more than a warning. Turns out with Ansible 2.4 there is an option to make it an error so I made a quick PR to turn that on.
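In the meantime, the behavior can be flipped in ansible.cfg (assuming Ansible >= 2.4, where the option was added):

```ini
[inventory]
# fail instead of warning when no inventory file can be parsed,
# rather than silently falling back to implicit localhost
unparsed_is_failed = True
```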

After that I ran into all kinds of fun stuff regarding internal registries and repos and kind of spun my wheels a lot.