Wednesday

Returning from a long silence, going to try once again to make a habit of journaling. Expect it to be mundane.

Also returning from a long vacation: two weeks (that’s long for me) plus two days of F2F with my team. So, a fair amount of time going through email, trying to respond to quick things, turning the rest into personal Trello cards. For a long time I tried to turn things into todos in the Gmail app, which had the advantage of enabling nice references to emails so I could return to them and follow up when done with something. However, it didn’t do a very good job of capturing the state of each task and I was clearly not really using it. So, trying something else. Not sure personal Trello will stick either, but I gotta keep trying things until something does.

Right now I’m stuck trying to get openshift-ansible to run to test a little change I’m making. openshift_facts module is failing inexplicably:

<origin-master> (0, 'Traceback (most recent call last):\r\n File "/tmp/ansible_QyeeOK/ansible_module_openshift_facts.py", line 2470, in <module>\r\n main()\r\n File "/tmp/ansible_QyeeOK/ansible_module_openshift_facts.py", line 2457, in main\r\n protected_facts_to_overwrite)\r\n File "/tmp/ansible_QyeeOK/ansible_module_openshift_facts.py", line 1830, in __init__\r\n protected_facts_to_overwrite)\r\n File "/tmp/ansible_QyeeOK/ansible_module_openshift_facts.py", line 1879, in generate_facts\r\n facts = set_selectors(facts)\r\n File "/tmp/ansible_QyeeOK/ansible_module_openshift_facts.py", line 496, in set_selectors\r\n facts[\'logging\'][\'selector\'] = None\r\nTypeError: \'unicode\' object does not support item assignment\r\n', 'Shared connection to 192.168.122.156 closed.\r\n')
fatal: [origin-master]: FAILED! => {
 "changed": false, 
 "failed": true, 
 "module_stderr": "Shared connection to 192.168.122.156 closed.\r\n", 
 "module_stdout": "Traceback (most recent call last):\r\n File \"/tmp/ansible_QyeeOK/ansible_module_openshift_facts.py\", line 2470, in <module>\r\n main()\r\n File \"/tmp/ansible_QyeeOK/ansible_module_openshift_facts.py\", line 2457, in main\r\n protected_facts_to_overwrite)\r\n File \"/tmp/ansible_QyeeOK/ansible_module_openshift_facts.py\", line 1830, in __init__\r\n protected_facts_to_overwrite)\r\n File \"/tmp/ansible_QyeeOK/ansible_module_openshift_facts.py\", line 1879, in generate_facts\r\n facts = set_selectors(facts)\r\n File \"/tmp/ansible_QyeeOK/ansible_module_openshift_facts.py\", line 496, in set_selectors\r\n facts['logging']['selector'] = None\r\nTypeError: 'unicode' object does not support item assignment\r\n", 
 "msg": "MODULE FAILURE", 
 "rc": 0
}
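
The last frame of that traceback is at least easy to reproduce in isolation. Here is a minimal sketch, assuming (and this is only a guess at the cause) that facts['logging'] arrived as a string instead of a dict, say from an inventory variable or a stale cached fact. The set_selectors below is a hypothetical stand-in containing only the failing line, not the real module code, and on Python 3 the message names 'str' rather than 'unicode'.

```python
def set_selectors(facts):
    # Stand-in for the failing line in openshift_facts: it assumes
    # facts['logging'] is a dict and assigns a key on it.
    facts['logging']['selector'] = None
    return facts


def try_set_selectors(facts):
    """Return None on success, or the TypeError message on failure."""
    try:
        set_selectors(facts)
        return None
    except TypeError as err:
        return str(err)


# Assumption: something upstream set 'logging' to a plain string.
bad_facts = {'logging': 'true'}
good_facts = {'logging': {}}

print(try_set_selectors(bad_facts))   # an "item assignment" TypeError message
print(try_set_selectors(good_facts))  # None
```

If that guess is right, the thing to hunt for is whatever sets a bare "logging" value before the facts are generated.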

And since that error happens early in init of the first master, it cascades to the node which fails trying to look up the master’s version, giving a lovely masking error at the end of the output:

fatal: [origin-node-1]: FAILED! => {
 "failed": true, 
 "msg": "The task includes an option with an undefined variable. The error was: {{ hostvars[groups.oo_first_master.0].openshift_version }}: 'dict object' has no attribute 'openshift_version'\n\nThe error appears to have been in '/home/lmeyer/go/src/github.com/openshift/openshift-ansible/playbooks/common/openshift-cluster/initialize_openshift_version.yml': line 16, column 5, but may\nbe elsewhere in the file depending on the exact syntax problem.\n\nThe offending line appears to be:\n\n pre_tasks:\n - set_fact:\n ^ here\n\nexception type: <class 'ansible.errors.AnsibleUndefinedVariable'>\nexception: {{ hostvars[groups.oo_first_master.0].openshift_version }}: 'dict object' has no attribute 'openshift_version'"

Yeah, so… Ansible has a great way of welcoming you back.

 


Concerns for preflight check design

Lately I’m working on preflight checks for OpenShift v3 Ansible installs/upgrades. There is no piece right now that checks that you have everything you might reasonably need set up for an install/upgrade and bails out before doing anything if you don’t. What happens right now is that you get partway through the install/upgrade and then find out… oh, you have the wrong repos enabled or whatever, UGLY ERROR -> Fix it and start over again… bleah. Nobody enjoys SEV1 support calls in the middle of the night. For installs and particularly for upgrades, we’d really like the sysadmin to be able to run a preflight check before their outage window and find out about any common problems at that time.
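
To make the "find out everything up front" idea concrete, here is a rough sketch of the shape I have in mind: run every check against facts that were gathered read-only, collect all the failures, and report them together before anything is changed. The check names, the fact keys and the 10 GB threshold are all hypothetical, made up for illustration; this is not code from openshift-ansible.

```python
def run_preflight(checks, host_facts):
    """Run every check against read-only facts; return a list of failure messages."""
    failures = []
    for name, check in checks:
        problem = check(host_facts)
        if problem:
            failures.append('{}: {}'.format(name, problem))
    return failures


# Hypothetical checks; each returns None when satisfied, or a message.
def check_repos(facts):
    if 'origin' not in facts.get('enabled_repos', []):
        return 'no Origin repo enabled'
    return None


def check_disk(facts):
    if facts.get('free_gb', 0) < 10:  # threshold invented for the example
        return 'less than 10 GB free'
    return None


checks = [('repos', check_repos), ('disk', check_disk)]
failures = run_preflight(checks, {'enabled_repos': ['base'], 'free_gb': 40})
for line in failures:
    print(line)
```

The point of collecting rather than failing on the first problem is that the sysadmin gets the whole list in one run instead of fixing one thing and starting over.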

So my latest conundrum is figuring out what the user expects during a preflight check. This is not as straightforward as you might think. The installer does a pretty good job of figuring out what you meant without you having to specify everything down to the last detail (because humans are not reliably good at doing that). Thing is, it may install and configure a number of things on your systems… just in order to figure out how to run.

This isn’t a big deal in the installer, because when you run an install or upgrade, you expect to install and configure things. Preflight checks are different because you’d like to affect system state as little as possible. The whole idea is to do checks before you make changes. So if we just reuse the logic the installer uses, users may be unpleasantly surprised to find their systems being changed.

So, for example. Pretty much the first thing that we want is facts about the configuration and the systems, which the openshift_facts role provides. This role runs various custom Ansible modules on target systems, which requires several dependencies to be present on those systems. If they aren’t there, they’re installed.

An Origin RPM install requires enabling an Origin repo. Unless you configure one beforehand, for Origin this is usually set up by the openshift_repos role, which is a dependency of the openshift_version role. So if you want to run the preflight checks before an install, you won’t have any Origin repo to check RPMs unless the checks configure this repo like the installer does.

The openshift_version role itself relies on some clever things to determine the version to install. If you’re doing an RPM install, it uses the repoquery tool to determine the precise version of RPMs that are available, so it can match it with the precise version of images to run; thus yum-utils is installed to provide repoquery. If you’re doing an enterprise containerized install, it looks up the precise version of images available by running a docker image on the remote host — and on an RPM-based host, installs and configures firewalld and docker to run that.

So in thinking about this, I’ve tried to determine if there’s any way to tease out just what we need for preflight checks and put that in a shared role, without having to go through as thorough a setup as we would for an install or upgrade. Or if we can make simplifying assumptions to do only what we need. Without going through too detailed an analysis, I think the answer is basically… no. We do not want to create and maintain parallel logic in the preflight checks for the very complex ways in which the installer determines what to do.

Reflecting a bit further, letting preflight config setup alter the systems is not really a problem, practically speaking.  If the user is installing a new cluster or adding hosts to an existing one, the target hosts are not in production yet, so altering them should be acceptable. If the user is upgrading, all of the necessary config and dependencies should already be in place, so hosts won’t be substantially altered. So, just depend on the same logic from the installer (and perhaps improve the user-friendliness of the output when things go wrong even before preflight checks). And very clearly document expectations.

Running an OpenShift install into containers

For testing purposes, we would like the ability to set up and tear down a whole lot of OpenShift clusters (single- or multi-node). And why do this with VMs when we have all of this container technology? A container looks a lot like a VM, right? And we have the very nifty (but little-documented) docker connection plugin for Ansible to treat a container like a host. So we ought to be able to run the installer against containers.

Of course, things are not quite that simple. And even though I’m not sure how useful this will be, I set out to just see what happens. Perhaps we could at least have a base image from an actual Ansible install of OpenShift that runs an all-in-one cluster in a container, rather than going through oc cluster up or the like. Then we would have full configuration files and separate systemd units to work with in our testing.

So first, defining the “hosts”. It took me a few iterations to get to this given the examples go in a different direction, but I can just define containers in my inventory as if they were hosts, and specify the docker connection method for them as a host variable. Here’s my inventory for an Origin install:

[OSEv3:children]
masters
nodes
etcd

[OSEv3:vars]
deployment_type=origin
openshift_release=1.4
openshift_uninstall_images=False

[masters]
master_container ansible_connection=docker

[nodes]
master_container ansible_connection=docker
node_container ansible_connection=docker

[etcd]
master_container ansible_connection=docker
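
Since the goal is a whole lot of clusters, writing these inventories by hand will get old quickly. Here is a small sketch of generating one per cluster; render_inventory and the container names passed to it are hypothetical, and the output simply mirrors the layout of the inventory above.

```python
def render_inventory(master, nodes, release):
    """Return INI inventory text for one cluster of docker-connected 'hosts'."""
    def host(name):
        return '{} ansible_connection=docker'.format(name)

    lines = [
        '[OSEv3:children]', 'masters', 'nodes', 'etcd', '',
        '[OSEv3:vars]',
        'deployment_type=origin',
        'openshift_release={}'.format(release),
        'openshift_uninstall_images=False', '',
        '[masters]', host(master), '',
        # The master is also listed as a node, as in the inventory above.
        '[nodes]', host(master),
    ]
    lines += [host(n) for n in nodes]
    lines += ['', '[etcd]', host(master)]
    return '\n'.join(lines) + '\n'


# Hypothetical container names for a second test cluster.
print(render_inventory('c2_master', ['c2_node1', 'c2_node2'], '1.4'))
```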

To ensure the containers exist and are running before ansible tries to connect to them, I created a play to iterate over the inventory names and create them:

---
- name: start up containers
  hosts: localhost
  tasks:
  - name: start containers
    with_inventory_hostnames:
      - all
    docker_container:
      image: centos/systemd
      name: "{{ item }}"
      state: started
      volumes:
        - /var/run/docker.sock:/var/run/docker.sock:z

This uses the Ansible docker_container module to ensure there is a docker container for each hostname that is running the centos/systemd image (a base CentOS image that runs systemd init). Since I don’t really want to run a separate docker inside of each container once the cluster is up (remember, I want to start a lot of these, and they’ll pretty much all use the same images, so I’d really like to reuse the docker image cache), I’m mounting in the host’s docker socket so everyone will use one big happy docker daemon.

Then I just have to run the regular plays for an install (this assumes we’re in the openshift-ansible source directory):

- include: playbooks/byo/openshift-cluster/config.yml

Now of course it could not be that simple. After a few minutes of installing, I ran into an error:

TASK [openshift_clock : Start and enable ntpd/chronyd] *************************
fatal: [master_container]: FAILED! => {
 "changed": true, 
 "cmd": "timedatectl set-ntp true", 
 "delta": "0:00:00.200535", 
 "end": "2017-03-14 23:43:39.038562", 
 "failed": true, 
 "rc": 1, 
 "start": "2017-03-14 23:43:38.838027", 
 "warnings": []
}

STDERR:

Failed to create bus connection: No such file or directory

I looked around and found others who had similarly experienced this issue, and it seemed related to running dbus, but dbus is installed in the image and I couldn’t get it running. Eventually a colleague told me that you have to run the container privileged for dbus to work. Why this should be, I don’t know, but it’s easily enough done.

On to the next problem. I ran into an error from within Ansible that was trying to use 1.4 as a string when it’s specified as a float.

TASK [openshift_version : set_fact] **********************************************************************************
fatal: [master_container]: FAILED! => {
 "failed": true
}

MSG:

The conditional check 'openshift_release is defined and 
openshift_release[0] == 'v'' failed. The error was: 
error while evaluating conditional (openshift_release is 
defined and openshift_release[0] == 'v'): float object has no element 0

Having seen this sort of thing before, I could see this was due to how I specified openshift_release in my inventory. It looks like a number, so the inventory parser treats it as one. So I can just change it to "1.4" or v1.4 and it will be parsed as a string. I think this was only a problem when I was running Ansible from source; I didn’t see it with the released package.
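
A minimal sketch of the failure, assuming the value is read by trying a Python literal first and falling back to plain text. That is an assumption about how the parser behaves, not something lifted from the Ansible source, and parse_inventory_value and starts_with_v are names made up for this illustration.

```python
import ast


def parse_inventory_value(raw):
    """Mimic a parser that tries a Python literal first, then falls back to text."""
    try:
        return ast.literal_eval(raw)
    except (ValueError, SyntaxError):
        return raw


def starts_with_v(release):
    # Same shape as the failing conditional: openshift_release[0] == 'v'
    return release[0] == 'v'


for raw in ['1.4', '"1.4"', 'v1.4']:
    value = parse_inventory_value(raw)
    try:
        result = starts_with_v(value)
    except TypeError as err:
        result = 'TypeError: {}'.format(err)
    print('{!r:8} -> {:5} -> {}'.format(raw, type(value).__name__, result))
```

Only the bare 1.4 comes back as a float, and a float cannot be indexed, which is the same complaint the conditional made.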

Next problem. A playbook error because I’m using the docker connection plugin and so no ssh user is specified and thus it can’t be retrieved. Well, even though it’s unnecessary, just specify one in the inventory.

[OSEv3:vars]
ansible_user=root

Next problem. The installer complains that you need to have NetworkManager before running the install.

TASK [openshift_node_dnsmasq : fail] *******************************************
fatal: [master_container]: FAILED! => {
 "changed": false, 
 "failed": true
}

MSG:

Currently, NetworkManager must be installed and enabled prior to installation.

And I quickly found out that things will hang if you don’t restart dbus (possibly related to this old Fedora bug) after installing NetworkManager. Alright, just add that to my plays:

- name: set up NetworkManager
  hosts: all
  tasks:
  - name: ensure NetworkManager is installed
    package:
      name: NetworkManager
      state: present
  - name: ensure NetworkManager is enabled
    systemd:
      name: NetworkManager
      enabled: True
  - name: dbus needs a restart after this or NetworkManager and firewall-cmd choke
    systemd:
      name: dbus
      state: restarted

When I was first experimenting with this it went through just fine. On later tries, starting with fresh containers, this hung at starting NetworkManager, and I haven’t figured out why yet.

Finally it looked like everything was actually installing successfully, but then of course starting the actual node failed.

fatal: [node_container]: FAILED! => {
 "attempts": 1, 
 "changed": false, 
 "failed": true
}

MSG:

Unable to start service origin-node: Job for origin-node.service 
failed because the control process exited with error code. 
See "systemctl status origin-node.service" and "journalctl -xe" for details.

# docker exec -i --tty node_container bash
[root@9f7e04f06921 /]# journalctl --no-pager -eu origin-node 
[...]
systemd[1]: Starting Origin Node...
origin-node[8835]: F0315 19:13:21.972837 8835 start_node.go:131] 
cannot fetch "default" cluster network: Get 
https://cf42f96fd2f8:8443/oapi/v1/clusternetworks/default: 
dial tcp: lookup cf42f96fd2f8: no such host
systemd[1]: origin-node.service: main process exited, code=exited, status=255/n/a
systemd[1]: Failed to start Origin Node.


Actually I got a completely different error previously related to ovs that I’m not seeing now. These could be anything as far as I know, but it may be related to the fact that I didn’t expose any ports or specify any external IP addresses for my “hosts” to talk to each other nor arrange any DNS for them to resolve each other. In any case, something to remedy another day. So far the playbook and inventory look like this:

---
- name: start up containers
  hosts: localhost
  tasks:
    - name: start containers
      with_inventory_hostnames:
        - all
      docker_container:
        image: centos/systemd
        name: "{{ item }}"
        state: started
        privileged: True
        volumes:
          - /var/run/docker.sock:/var/run/docker.sock:z

- name: set up NetworkManager
  hosts: all
  tasks:
    - name: ensure NetworkManager is installed
      package:
        name: NetworkManager
        state: present
    - name: ensure NetworkManager is enabled
      systemd:
        name: NetworkManager
        enabled: yes
        state: started
    - name: dbus needs a restart after this or NetworkManager and firewall-cmd choke
      systemd:
        name: dbus
        state: restarted

- include: openshift-cluster/config.yml

 

[OSEv3:children]
masters
nodes
etcd

[OSEv3:vars]
deployment_type=origin
openshift_release="1.4"
openshift_uninstall_images=False
ansible_user=root

[masters]
master_container ansible_connection=docker

[nodes]
master_container ansible_connection=docker
node_container ansible_connection=docker

[etcd]
master_container ansible_connection=docker
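
For the "no such host" failure in the node log, a first diagnostic would be to check from inside each container which of the other names actually resolve. Here is a small sketch; unresolvable is a hypothetical helper, and the demo uses a fake resolver with a made-up address so it behaves the same anywhere. In a real container you would leave the resolve argument at its default, socket.gethostbyname.

```python
import socket


def unresolvable(hostnames, resolve=socket.gethostbyname):
    """Return the hostnames that the given resolver cannot look up."""
    missing = []
    for name in hostnames:
        try:
            resolve(name)
        except OSError:  # socket.gaierror is a subclass of OSError
            missing.append(name)
    return missing


# Fake resolver for the demo; the address is invented.
known = {'master_container': '172.17.0.2'}


def fake_resolve(name):
    if name not in known:
        raise socket.gaierror('no such host: {}'.format(name))
    return known[name]


print(unresolvable(['master_container', 'cf42f96fd2f8'], resolve=fake_resolve))
# prints ['cf42f96fd2f8']
```

Anything that shows up in that list needs some form of DNS or hosts-file entry before the node can reach the master by name.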