Wednesday

I’m trying out a new feature of Google Assistant and IFTTT: the email digest. It lets me leave notes for myself by calling out to my Google Home (or the Assistant on my phone) while I’m pacing around at the end of the hour because my Fitbit says I need to move. I’m trying to use it as a daily work journal. We’ll see. One shortcoming: if I get too loquacious, Google Home seems to get confused and says “I don’t understand,” while the Assistant just runs a useless query based on what I said. Another: the transcription is pretty atrocious for most names – product names, infrastructure names, technical terms and such.

Aaaaand… I didn’t write down anything else for the day. *sigh*


Thursday

My morning install ran into another new problem. Docker was running just fine at the beginning. The install reconfigures and restarts docker, and it fails to start, complaining about storage:

Error starting daemon: error initializing graphdriver: devicemapper:  Unable to take ownership of thin-pool (docker-docker--pool) that already has used data blocks

A Google search turns up some similar issues from a year or more ago, and it generally seems to relate to /var/lib/docker having been deleted after devicemapper storage was used. Nothing should have done that and I really don’t know why I’m seeing this while doing the same thing I’ve done before. Perhaps it’s a new version of docker in our internal repos. To get past it, I blew away /var/lib/docker and re-initialized storage.

Then things seemed to work until Docker actually needed to pull an image and run something. docker pull seemed completely broken:

$ sudo docker pull registry.access.redhat.com/rhgs3/rhgs-server-rhel7
Using default tag: latest
Trying to pull repository registry.access.redhat.com/rhgs3/rhgs-server-rhel7 ... 
latest: Pulling from registry.access.redhat.com/rhgs3/rhgs-server-rhel7
00406150827c: Pulling fs layer 
00c572151848: Pulling fs layer 
dfcd8fbc5ec3: Pulling fs layer 
open /var/lib/containers/docker/tmp/GetImageBlob003587718: no such file or directory

This went away after restarting docker… again. What the heck?

And other stuff went wrong, so I gave up on it. Then I went to work on some Go code, and vim-go did its thing where it freezes for a while and spins up all my CPUs to run go oracle (or whatever). That seemed like a good time to find out about using evil mode in emacs. Or, I guess, spacemacs. Looks pretty cool (actually I learned some new vim sequences just by watching demo videos on YouTube, so even if I don’t make the switch… cool).

Wednesday

I thought it would be nice to have aws-launcher be able to attach an extra volume to the nodes it creates, and at the same time I could stand to learn a little about the python boto3 module for manipulating AWS. As usual, navigating a new API takes a lot of fiddling around, and the docs just don’t connect all the dots. For instance, apparently after creating a volume, I have to wait for it to be available before I can attach it to an instance. This is just not obvious until I actually try it and get a failure message. And why are there a separate boto3.resource('ec2') and boto3.client('ec2') with different methods, and why do you need both to attach a volume? Why is there an instance.wait_until_running() method but no volume.wait_until_available() method? Why does the client doc not mention how to set the (required!) region on the client? Why are the examples and tutorials so limited?

Well, these are just the typical learning pains of tackling a new API, and since they make me want to avoid anything new, I need to get over it and just accept a certain amount of fumbling around until I get familiar enough.

Anyway, all that was fun, but it turns out all I really needed to do was specify the extra volume in the existing create_instances() call. That way I also don’t have to deal with state on the volume/instance (waiting until available, waiting until detached… why doesn’t EC2 have a fire-and-forget function on these?), the volume just lives while the instance does.
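For reference, the shape of that call looks something like the sketch below. The device name, size, and volume type are placeholders, not necessarily what aws-launcher uses; the point is that the extra volume is just one more entry in BlockDeviceMappings on the existing create_instances() call, with DeleteOnTermination tying its lifetime to the instance.

```python
def extra_volume_mapping(device="/dev/xvdb", size_gib=100):
    """Build a BlockDeviceMappings entry for an extra EBS volume.

    DeleteOnTermination=True is what makes the volume live and die with
    the instance, so there is no volume state to babysit separately.
    """
    return {
        "DeviceName": device,
        "Ebs": {
            "VolumeSize": size_gib,  # GiB
            "VolumeType": "gp2",
            "DeleteOnTermination": True,
        },
    }

# Then it's one more argument to the existing call, e.g. (AMI/region made up):
#   ec2 = boto3.resource("ec2", region_name="us-east-1")
#   ec2.create_instances(ImageId="ami-...", InstanceType="m4.large",
#                        MinCount=1, MaxCount=1,
#                        BlockDeviceMappings=[extra_volume_mapping()])
```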

So that should make it easy to provide a storage volume for CNS.

Random little nugget: to run ansible repeatedly with the log going to a different file each time (the shell arithmetic bumps count on every invocation):

ANSIBLE_LOG_PATH=/tmp/ansible.log.$((count=count+1)) ansible-playbook -i ../hosts -vvv playbooks/byo/config.yml

Of course there are better solutions for this, like ARA.

I ran into this little annoyance again while running a cluster install with Ansible:

2017-11-15 18:19:40,608 p=7182 u=ec2-user | Using module file /home/ec2-user/openshift-ansible/roles/openshift_facts/library/openshift_facts.py
2017-11-15 18:19:43,666 p=7182 u=ec2-user | failed: [ec2-54-152-246-175.compute-1.amazonaws.com] (item=prometheus) => {
 "changed": false, 
 "failed": true, 
 "item": "prometheus", 
 "module_stderr": "Shared connection to ec2-54-152-246-175.compute-1.amazonaws.com closed.\r\n", 
 "module_stdout": "Traceback (most recent call last):\r\n File \"/tmp/ansible_6AvhP1/ansible_module_openshift_facts.py\", line 2476, in <module>\r\n main()\r\n File \"/tmp/ansible_6AvhP1/ansible_module_o
penshift_facts.py\", line 2463, in main\r\n protected_facts_to_overwrite)\r\n File \"/tmp/ansible_6AvhP1/ansible_module_openshift_facts.py\", line 1836, in __init__\r\n protected_facts_to_overwrite)\r\n 
File \"/tmp/ansible_6AvhP1/ansible_module_openshift_facts.py\", line 1885, in generate_facts\r\n facts = set_selectors(facts)\r\n File \"/tmp/ansible_6AvhP1/ansible_module_openshift_facts.py\", line 504, in 
set_selectors\r\n facts['prometheus']['selector'] = None\r\nTypeError: 'str' object does not support item assignment\r\n", 
 "msg": "MODULE FAILURE", 
 "rc": 0
}

The difference is that when I saw this previously, it involved logging, on hosts I had installed long before, so the “schema” of the facts file had changed in the meantime. This time it was about prometheus, and on the initial run. So that’s interesting. This failure keeps anything else from running. I disabled the prometheus options and deleted /etc/ansible/facts.d/openshift.fact on all hosts to continue. Then I ran into yet more breakage: couldn’t pull images. I had to leave at that point, so I don’t know what went wrong; will try again tomorrow.
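Back to that traceback, it’s easy enough to reproduce in isolation: the module expects the cached fact to be a dict of settings, but it comes back as a plain string, and item assignment on a string blows up. A contrived sketch (the fact value here is made up):

```python
# Contrived reproduction: a cached fact that should be a dict of settings
# was loaded as a plain string (stale/mismatched facts-file "schema"),
# so the selector assignment fails exactly like the Ansible module does.
facts = {"prometheus": "True"}

try:
    facts["prometheus"]["selector"] = None
except TypeError as e:
    print(e)  # 'str' object does not support item assignment
```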


Tuesday

Monday: not so successful.

Today:

I thought it would be cool to set up a cluster using CNS (Container Native Storage) to back everything that needs storage. Well, it’s a learning experience at least.

The first thing that happened: the install broke trying to install iptables-services, because iptables needs to be updated at the same time to match it. Not sure whether this will be a common problem for others, but I updated my tooling to fix it up.

Then I didn’t free up the devices that are needed on each node for GlusterFS to run. The CNS deploy failed. Once I fixed that up and ran it again I got a pretty mysterious error:

    "invocation": {
        "module_args": {
            "_raw_params": "oc rsh --namespace=glusterfs deploy-heketi-storage-1-fhh6f heketi-cli -s http://localhost:8080 --user admin --secret 'srtQRfJz4mh8PugHQjy3rgspHEfpumYC2dnBmQIoX9Y=' cluster list", 
   ...
    "stderr": "Error: signature is invalid\ncommand terminated with exit code 255", 

Turns out the problem was that the heketi pod was created with the secret in its env variables on the previous run, and then the secret was re-created for the second run. This playbook doesn’t handle consecutive runs too well yet. I added the openshift_storage_glusterfs_wipe=True option to inventory so it would start fresh and tried again. This time it failed differently:
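For the record, the knob goes in the usual place in the inventory (the group name here follows the openshift-ansible convention):

```ini
[OSEv3:vars]
# Tell the GlusterFS playbooks to wipe the previous deployment and start fresh
openshift_storage_glusterfs_wipe=True
```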

 "invocation": {
 "module_args": {
 "_raw_params": "oc rsh --namespace=glusterfs deploy-heketi-storage-1-nww6c heketi-cli -s http://localhost:8080 --user admin --secret 'IzajucIIGPp0Tm3FyueSvxNs51YYjyTLGvWAqsvfolY=' topology load --jso
n=/tmp/openshift-glusterfs-ansible-dZSjA4/topology.json 2>&1", 
...
 "stdout": "Creating cluster ... ID: 00876e6ce506058e048c8d68500d194c\n\tAllowing file volumes on cluster.\n\tAllowing block volumes on cluster.\n\tCreating node ip-172-18-9-218.ec2.internal ... Unable to cre
ate node: New Node doesn't have glusterd running\n\tCreating node ip-172-18-8-20.ec2.internal ... Unable to create node: New Node doesn't have glusterd running\n\tCreating node ip-172-18-3-119.ec2.internal ... U
nable to create node: New Node doesn't have glusterd running",

But if I rsh into the pods directly and check if glusterd is running, it actually is. So I’m not sure what’s going on yet.

jarrpa set me straight on a number of things while trying to help me out here. For one thing I was thinking glusterfs could magically be used for backing everything once deployed. Not so; you have to define it as a storage class (there’s an option for that I hope works once I get the rest worked out: openshift_storage_glusterfs_storageclass=True). And then you have to make that storage class the default (apparently a manual step at this time), and have the other things you want to use it (logging, metrics, etc.) use dynamic provisioning for their storage. Something to look forward to.
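If I’m reading the upstream docs right, making a storage class the default is just an annotation on the StorageClass object; something like the following, where the class name is my guess at what the playbook creates:

```yaml
kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: glusterfs-storage        # guessing at the name the playbook creates
  annotations:
    storageclass.kubernetes.io/is-default-class: "true"
provisioner: kubernetes.io/glusterfs
```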

I worked on a little PR to bring some sanity to the registry-console image.

Friday

Spent most of the morning checking email, reviewing PRs, checking bugs, and other such administrivia. Since we actually branched 3.7 yesterday, I updated the openshift-ansible image builder to start building 3.8 from master.

I decided to take a shot at an Ansible install of OpenShift v3.7 to take a look at some of the new features. The first thing I ran into is this lovely error:

 1. Hosts: localhost
 Play: Populate config host groups
 Task: Evaluate groups - Fail if no etcd hosts group is defined
 Message: Running etcd as an embedded service is no longer supported. If this is a new install please define an 'etcd' group with either one or three hosts. These hosts may be the same hosts as your masters
. If this is an upgrade you may set openshift_master_unsupported_embedded_etcd=true until a migration playbook becomes available.

Ah, here’s the problem, in a warning that scrolled by quickly at the front of the Ansible run:

 [WARNING]: No inventory was parsed, only implicit localhost is available

So that’s what you get with the default inventory (just the implicit localhost) when your real inventory doesn’t parse. That could be friendlier. It also seems like it should be more than a warning. It turns out that with Ansible 2.4 there is an option to make it an error, so I made a quick PR to turn that on.
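If I remember the option right, it’s the unparsed_is_failed setting that’s new in 2.4; in ansible.cfg it looks like:

```ini
[inventory]
# Fail the run instead of just warning when no inventory source can be parsed
unparsed_is_failed = True
```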

After that I ran into all kinds of fun stuff regarding internal registries and repos and kind of spun my wheels a lot.

Thursday

Tried to summarize what my team has been doing the last few months. It’s a depressingly short list. Although, you know, reasons.

I couldn’t get past that ansible problem from yesterday. Seems I’m not the only one seeing it (or at least something like it); I summarized my situation on another user’s issue.

Forward-looking question of the day: how could I work better/faster? I didn’t come up with anything right away.

Listening to an internal presentation on what’s coming out with OpenShift 3.7. Wow, do I have a lot to explore.

Realization: the pre-install checks shouldn’t even be in a separate location. They should just be baked in as preflight tasks in the roles where those tasks are performed. Same for post-install/post-upgrade checks. Ansible health checks should be reserved for ongoing verification that everything is still running as expected, looking for known problems and such.

Michael Gugino tracked down and addressed that ansible issue. Nice work.

Wednesday

Returning from a long silence, going to try once again to make a habit of journaling. Expect it to be mundane.

Also returning from a long vacation — two weeks (that’s long for me) plus two days of F2F with my team. So, a fair amount of time going through email, trying to respond to quick things, and turning the rest into personal Trello cards. For a long time I tried to turn things into todos in the Gmail app, which had the advantage of nice references back to emails so I could return and follow up when done with something. However, it didn’t do a very good job of capturing the state of each task, and I was clearly not really using it. So, trying something else. Not sure personal Trello will stick either, but I gotta keep trying things until something does.

Right now I’m stuck trying to get openshift-ansible to run to test a little change I’m making. The openshift_facts module is failing inexplicably:

<origin-master> (0, 'Traceback (most recent call last):\r\n File "/tmp/ansible_QyeeOK/ansible_module_openshift_facts.py", line 2470, in <module>\r\n main()\r\n File "/tmp/ansible_QyeeOK/ansible_module_openshift_facts.py", line 2457, in main\r\n protected_facts_to_overwrite)\r\n File "/tmp/ansible_QyeeOK/ansible_module_openshift_facts.py", line 1830, in __init__\r\n protected_facts_to_overwrite)\r\n File "/tmp/ansible_QyeeOK/ansible_module_openshift_facts.py", line 1879, in generate_facts\r\n facts = set_selectors(facts)\r\n File "/tmp/ansible_QyeeOK/ansible_module_openshift_facts.py", line 496, in set_selectors\r\n facts[\'logging\'][\'selector\'] = None\r\nTypeError: \'unicode\' object does not support item assignment\r\n', 'Shared connection to 192.168.122.156 closed.\r\n')
fatal: [origin-master]: FAILED! => {
 "changed": false, 
 "failed": true, 
 "module_stderr": "Shared connection to 192.168.122.156 closed.\r\n", 
 "module_stdout": "Traceback (most recent call last):\r\n File \"/tmp/ansible_QyeeOK/ansible_module_openshift_facts.py\", line 2470, in <module>\r\n main()\r\n File \"/tmp/ansible_QyeeOK/ansible_module_openshift_facts.py\", line 2457, in main\r\n protected_facts_to_overwrite)\r\n File \"/tmp/ansible_QyeeOK/ansible_module_openshift_facts.py\", line 1830, in __init__\r\n protected_facts_to_overwrite)\r\n File \"/tmp/ansible_QyeeOK/ansible_module_openshift_facts.py\", line 1879, in generate_facts\r\n facts = set_selectors(facts)\r\n File \"/tmp/ansible_QyeeOK/ansible_module_openshift_facts.py\", line 496, in set_selectors\r\n facts['logging']['selector'] = None\r\nTypeError: 'unicode' object does not support item assignment\r\n", 
 "msg": "MODULE FAILURE", 
 "rc": 0
}

And since that error happens early in init of the first master, it cascades to the node which fails trying to look up the master’s version, giving a lovely masking error at the end of the output:

fatal: [origin-node-1]: FAILED! => {
 "failed": true, 
 "msg": "The task includes an option with an undefined variable. The error was: {{ hostvars[groups.oo_first_master.0].openshift_version }}: 'dict object' has no attribute 'openshift_version'\n\nThe error appears to have been in '/home/lmeyer/go/src/github.com/openshift/openshift-ansible/playbooks/common/openshift-cluster/initialize_openshift_version.yml': line 16, column 5, but may\nbe elsewhere in the file depending on the exact syntax problem.\n\nThe offending line appears to be:\n\n pre_tasks:\n - set_fact:\n ^ here\n\nexception type: <class 'ansible.errors.AnsibleUndefinedVariable'>\nexception: {{ hostvars[groups.oo_first_master.0].openshift_version }}: 'dict object' has no attribute 'openshift_version'"

Yeah, so… Ansible has a great way of welcoming you back.