Wednesday

I’m trying out a new feature of Google Assistant and IFTTT: the email digest. It lets me leave notes for myself by calling out to my Google Home (or the Assistant on my phone) while I’m pacing around at the end of the hour because my Fitbit says I need to move. I’m trying to use it as a daily work journal. We’ll see. One shortcoming: if I get too loquacious, Google Home seems to get confused and says “I don’t understand,” while Assistant just runs a useless query based on what I said. Another is that the transcription is pretty atrocious for most names – product names, infrastructure names, technical terms and such.

Aaaaand… I didn’t write down anything else for the day. *sigh*

Thursday

My morning install ran into another new problem. Docker was running just fine at the beginning. The install reconfigures and restarts docker, and it fails to start, complaining about storage:

Error starting daemon: error initializing graphdriver: devicemapper:  Unable to take ownership of thin-pool (docker-docker--pool) that already has used data blocks

A Google search turns up some similar issues from a year or more ago, and it generally seems to relate to /var/lib/docker having been deleted after devicemapper storage was used. Nothing should have done that and I really don’t know why I’m seeing this while doing the same thing I’ve done before. Perhaps it’s a new version of docker in our internal repos. To get past it, I blew away /var/lib/docker and re-initialized storage.

Then things seemed to work until Docker actually needed to pull an image and run something. docker pull seemed completely broken:

$ sudo docker pull registry.access.redhat.com/rhgs3/rhgs-server-rhel7
Using default tag: latest
Trying to pull repository registry.access.redhat.com/rhgs3/rhgs-server-rhel7 ... 
latest: Pulling from registry.access.redhat.com/rhgs3/rhgs-server-rhel7
00406150827c: Pulling fs layer 
00c572151848: Pulling fs layer 
dfcd8fbc5ec3: Pulling fs layer 
open /var/lib/containers/docker/tmp/GetImageBlob003587718: no such file or directory

This went away after restarting docker… again. What the heck?

And other stuff went wrong. I gave up on it. Then I went to work on some Go code, and vim-go did its thing where it freezes for a while and spins up all my CPUs to run the Go oracle (or whatever), and I thought it might be a good time to find out about using evil mode in Emacs. Or, I guess, Spacemacs. Looks pretty cool (actually I learned some new vim sequences just by watching demo videos on YouTube, so even if I don’t make the switch… cool).

Wednesday

I thought it would be nice to have aws-launcher be able to attach an extra volume to the nodes it creates, and at the same time I could stand to learn a little about the Python boto3 module for manipulating AWS. As usual, navigating a new API takes a lot of fiddling around, and the docs just don’t connect all the dots. For instance, apparently after creating a volume I have to wait for it to become available before I can attach it to an instance; that’s just not obvious until I actually try it and get a failure message. And why are there separate boto3.resource('ec2') and boto3.client('ec2') objects with different methods, and why do you need both to attach a volume? Why is there an instance.wait_until_running() method but no volume.wait_until_available() method? Why does the client doc not mention how to set the (required!) region on the client? Why are the examples and tutorials so limited?
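For my own notes, here is a minimal sketch of that create/wait/attach dance as I currently understand it. The region, availability zone, size, instance ID, and device name below are made-up placeholders, not what aws-launcher actually uses:

import boto3

region = 'us-east-1'
ec2 = boto3.resource('ec2', region_name=region)     # the region goes on the constructor
client = boto3.client('ec2', region_name=region)    # waiters live on the client, not the resource

# Create the volume, then wait for it to leave the 'creating' state.
volume = ec2.create_volume(AvailabilityZone='us-east-1a', Size=100, VolumeType='gp2')
client.get_waiter('volume_available').wait(VolumeIds=[volume.id])

# Only once it's available can it be attached to an instance.
volume.attach_to_instance(InstanceId='i-0123456789abcdef0', Device='/dev/xvdf')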

Well, these are just the typical learning pains whenever I tackle a new API, and since it makes me want to avoid anything new, I need to get over it and just accept a certain amount of fumbling around until I get familiar enough.

Anyway, all that was fun, but it turns out all I really needed to do was specify the extra volume in the existing create_instances() call. That way I also don’t have to deal with state on the volume/instance (waiting until available, waiting until detached… why doesn’t EC2 have a fire-and-forget function for these?); the volume just lives as long as the instance does.
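Roughly, the simpler version looks like this. The AMI, instance type, key name, and device name here are placeholders, not the real aws-launcher values:

import boto3

ec2 = boto3.resource('ec2', region_name='us-east-1')

instances = ec2.create_instances(
    ImageId='ami-0123456789abcdef0',
    InstanceType='m4.xlarge',
    KeyName='my-key',
    MinCount=1,
    MaxCount=1,
    BlockDeviceMappings=[{
        'DeviceName': '/dev/xvdb',
        'Ebs': {
            'VolumeSize': 100,            # GiB
            'VolumeType': 'gp2',
            'DeleteOnTermination': True,  # the volume lives and dies with the instance
        },
    }],
)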

So that should make it easy to provide a storage volume for CNS.

Random little nugget. To run ansible repeatedly with the log going to a different file each time:

ANSIBLE_LOG_PATH=/tmp/ansible.log.$((count=count+1)) ansible-playbook -i ../hosts -vvv playbooks/byo/config.yml

Of course, there are better solutions for this, like ARA.

I ran into this little annoyance again while running a cluster install with Ansible:

2017-11-15 18:19:40,608 p=7182 u=ec2-user | Using module file /home/ec2-user/openshift-ansible/roles/openshift_facts/library/openshift_facts.py
2017-11-15 18:19:43,666 p=7182 u=ec2-user | failed: [ec2-54-152-246-175.compute-1.amazonaws.com] (item=prometheus) => {
 "changed": false, 
 "failed": true, 
 "item": "prometheus", 
 "module_stderr": "Shared connection to ec2-54-152-246-175.compute-1.amazonaws.com closed.\r\n", 
 "module_stdout": "Traceback (most recent call last):\r\n File \"/tmp/ansible_6AvhP1/ansible_module_openshift_facts.py\", line 2476, in <module>\r\n main()\r\n File \"/tmp/ansible_6AvhP1/ansible_module_openshift_facts.py\", line 2463, in main\r\n protected_facts_to_overwrite)\r\n File \"/tmp/ansible_6AvhP1/ansible_module_openshift_facts.py\", line 1836, in __init__\r\n protected_facts_to_overwrite)\r\n File \"/tmp/ansible_6AvhP1/ansible_module_openshift_facts.py\", line 1885, in generate_facts\r\n facts = set_selectors(facts)\r\n File \"/tmp/ansible_6AvhP1/ansible_module_openshift_facts.py\", line 504, in set_selectors\r\n facts['prometheus']['selector'] = None\r\nTypeError: 'str' object does not support item assignment\r\n", 
 "msg": "MODULE FAILURE", 
 "rc": 0
}

The difference is that when I saw it previously, it had to do with logging, on hosts I had installed a long time before, so the “schema” of the facts file had changed in the meantime. This time it was about prometheus, and it happened on the initial run. So that’s interesting. This failure keeps anything else from running. I disabled the prometheus options and deleted /etc/ansible/facts.d/openshift.fact on all hosts to continue. Then I ran into yet more breakage: couldn’t pull images. I had to leave at that point, so I don’t know what went wrong; I’ll try again tomorrow.
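Back to that TypeError for a second: it’s easy to reproduce outside of openshift-ansible. This is just a toy illustration, not the module’s real data, but it shows what happens when a cached fact that the code expects to be a dict turns out to be a plain string:

facts = {'prometheus': ''}                    # stand-in for an old-style cached fact: a string, not a dict
try:
    facts['prometheus']['selector'] = None    # the module assumes a dict here
except TypeError as err:
    print(err)                                # 'str' object does not support item assignment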

 

Tuesday

Monday: not so successful.

Today:

I thought it would be cool to set up a cluster using CNS (Container Native Storage) to back everything that needs storage. Well, it’s a learning experience at least.

The first thing that happened was that the install broke trying to install iptables-services, because iptables needed to be updated to match it at the same time. I’m not sure if this will be a common problem for others, but I updated my tooling to fix it up.

Then it turned out I hadn’t freed up the devices that GlusterFS needs on each node, so the CNS deploy failed. Once I fixed that up and ran it again I got a pretty mysterious error:

    "invocation": {
        "module_args": {
            "_raw_params": "oc rsh --namespace=glusterfs deploy-heketi-storage-1-fhh6f heketi-cli -s http://localhost:8080 --user admin --secret 'srtQRfJz4mh8PugHQjy3rgspHEfpumYC2dnBmQIoX9Y=' cluster list", 
   ...
    "stderr": "Error: signature is invalid\ncommand terminated with exit code 255", 

Turns out the problem was that the heketi pod had been created with the secret in its environment variables on the previous run, and then the secret was re-created for the second run. This playbook doesn’t handle consecutive runs too well yet. I added the openshift_storage_glusterfs_wipe=True option to the inventory so it would start fresh and tried again. This time it failed differently:

 "invocation": {
 "module_args": {
 "_raw_params": "oc rsh --namespace=glusterfs deploy-heketi-storage-1-nww6c heketi-cli -s http://localhost:8080 --user admin --secret 'IzajucIIGPp0Tm3FyueSvxNs51YYjyTLGvWAqsvfolY=' topology load --json=/tmp/openshift-glusterfs-ansible-dZSjA4/topology.json 2>&1", 
...
 "stdout": "Creating cluster ... ID: 00876e6ce506058e048c8d68500d194c\n\tAllowing file volumes on cluster.\n\tAllowing block volumes on cluster.\n\tCreating node ip-172-18-9-218.ec2.internal ... Unable to create node: New Node doesn't have glusterd running\n\tCreating node ip-172-18-8-20.ec2.internal ... Unable to create node: New Node doesn't have glusterd running\n\tCreating node ip-172-18-3-119.ec2.internal ... Unable to create node: New Node doesn't have glusterd running",

But if I rsh into the pods directly and check if glusterd is running, it actually is. So I’m not sure what’s going on yet.

jarrpa set me straight on a number of things while trying to help me out here. For one thing, I was thinking GlusterFS could magically be used to back everything once deployed. Not so; you have to define it as a storage class (there’s an option for that which I hope works once I get the rest sorted out: openshift_storage_glusterfs_storageclass=True). Then you have to make that storage class the default (apparently a manual step at this time) and have the other things you want to use it (logging, metrics, etc.) use dynamic provisioning for their storage. Something to look forward to.

I worked on a little PR to bring some sanity to the registry-console image.

Friday

Spent most of the morning checking email, reviewing PRs, checking bugs, and other such administrivia. Since we actually branched 3.7 yesterday, I updated the openshift-ansible image builder to start building 3.8 from master.

I decided to take a shot at an Ansible install of OpenShift v3.7 to take a look at some of the new features. The first thing I ran into is this lovely error:

 1. Hosts: localhost
 Play: Populate config host groups
 Task: Evaluate groups - Fail if no etcd hosts group is defined
 Message: Running etcd as an embedded service is no longer supported. If this is a new install please define an 'etcd' group with either one or three hosts. These hosts may be the same hosts as your masters. If this is an upgrade you may set openshift_master_unsupported_embedded_etcd=true until a migration playbook becomes available.

Ah, here’s the problem, in a warning that scrolled by quickly at the front of the Ansible run:

 [WARNING]: No inventory was parsed, only implicit localhost is available

So that’s what you get with the default inventory (just the implicit localhost) when your real inventory doesn’t parse. That could be friendlier, and it seems like it should be more than a warning. It turns out that with Ansible 2.4 there is an option to make it an error, so I made a quick PR to turn that on.

After that I ran into all kinds of fun stuff regarding internal registries and repos and kind of spun my wheels a lot.

Thursday

Tried to summarize what my team has been doing the last few months. It’s a depressingly short list. Although, you know, reasons.

I couldn’t get past that ansible problem from yesterday. Seems I’m not the only one seeing it (or at least something like it); I summarized my situation on another user’s issue.

Forward-looking question of the day: how could I work better/faster? I didn’t come up with anything right away.

Listening to an internal presentation on what’s coming out with OpenShift 3.7. Wow, do I have a lot to explore.

Realization: the pre-install checks shouldn’t even be in a separate location. They should just be baked in as preflight tasks in the roles where those tasks are performed. Same for post-install/post-upgrade checks. Ansible health checks should be reserved for ongoing verification that everything is still running as expected, looking for known problems and such.

Michael Gugino tracked down and addressed that ansible issue. Nice work.

A summary of my life with depression

This may not seem like a technical post but if you cross-reference to this talk it should be clear this is a problem that developers should really be aware of. I mean, not my personal issues, but the topic of mental illness in the developer world. So in the spirit of openness and sharing with others, I present my own story.

I’m not sure when depression began; I think it was kind of a slow progression over years. I’m 41 years old and I’ve always been kind of a negative person, always looking for flaws in things and worrying about what could go wrong. I liked to think this made me a better engineer. Somewhere along the way it became a state of mind where I could only see the negative.

I started to realize I was in trouble when it became clear that I didn’t really enjoy anything any more. It’s called anhedonia. It turns out you can function for quite a while being motivated only by negatives (fear of failure, fear of letting down your family/coworkers, etc.) and through sheer determination, but it’s a really miserable way to live. It’s hard to understand if you haven’t experienced it, so it’s not much use explaining it. Without anything that drives me to say “yes!” life seems pretty damn pointless. I was frustrated and angry all the time.

My wife encouraged me to get clinical help and eventually I did. Apparently sometime around September 2015, though I don’t really remember. I remember describing my depression not so much as “stuck in a pit” as “life in a dense fog”. The next year and few months my psychiatrist had me trying out various medications and tweaking dosages and such. Sometimes something would seem to be helping a little, but nothing really seemed to stick or make a big difference. It was discouraging to say the least. I was doing counseling, too, though I have yet to find a counselor who helps much.

In the summer of 2016 my mother was diagnosed with incurable cancer. In September 2016 my wife and I separated and I started shuttling our kids between us. Then in late November her health took a nosedive and I was left taking care of the kids alone, in addition to working full time with depression. I had always been able to deal with everything myself before, but something finally gave out in me. My job at Red Hat, which before had always been a refuge in turbulent times, became unbearable. I would spend all day staring at my screen and moving the mouse occasionally when it started to go dark because I hadn’t done anything for so long. I felt crushing guilt and shame that the one thing I had always been good at and enjoyed was now a joyless burden and I was letting everyone down.

I went on disability leave in early December. I didn’t even know you could go on disability for depression, but it was definitely disabling me, so it makes sense. My psychiatrist suggested trying a new course of treatment called TMS (Transcranial Magnetic Stimulation). In short, it uses an electromagnet to stimulate your brain, in daily treatments over the course of 6-8 weeks. I was expecting to get started with it ASAP, but it turned out I couldn’t start until January 4th.

I thought disability leave would be a relief, and it certainly was in the sense that I no longer had to feel guilty about the work I wasn’t doing (well, less guilty anyway – getting paid to do nothing really rubs me the wrong way). The downside is that it gave me a lot more time to brood over how useless I was and how I was going to lose everything and end up still depressed but in a homeless shelter, with my kids in foster care. I can look at things objectively and say that actually my situation is not that bad, and quite recoverable if I can just kick this depression thing, and there’s a good chance I can. And I’m so grateful for being able to take disability, and for my health insurance that covers all this pretty well, and for having my health otherwise. But the thing about being depressed is that you still feel hopeless, regardless of the reality of the situation.

I’m two weeks into TMS now. If anything, I feel worse because I’m starting to develop anxiety and having more trouble sleeping. My psychiatrist said her patients often saw improvement within a week (which made me more anxious when that week passed that I might be among the 20% or so that don’t benefit), but the TMS folks said actually to expect more like four weeks so I’m trying to be patient. If it doesn’t work out, I can do genetic testing to see if that helps pinpoint a medication that will actually help. I’m trying meditation, working on gratitude, connecting with people (something I never put much effort into before), contradicting my negative thoughts, and other random things in case they might help. And exercising, that seems to help. And just keeping busy to distract myself from feeling hopeless. I don’t have a happy ending yet, but everyone tells me things will get better if I just keep trying.

I guess if there’s a silver lining, it’s that people have come out of the woodwork to tell me they understand what I’m going through because they have been there. This is so common, there should be no reason to feel shame or to avoid treatment like I did for so long. It’s made me realize that in my fierce self-sufficiency I’ve never been open to being helped, or for that matter to helping others. But it turns out that nearly everybody needs some help sometimes, and I hope that out of this experience I’ll learn to be a more decent human being than I have been so far.