Running an OpenShift install into containers

For testing purposes, we would like the ability to set up and tear down a whole lot of OpenShift clusters (single- or multi-node). And why do this with VMs when we have all of this container technology? A container looks a lot like a VM, right? And we have the very nifty (but little-documented) docker connection plugin for Ansible to treat a container like a host. So we ought to be able to run the installer against containers.

Of course, things are not quite that simple. And even though I’m not sure how useful this will be, I set out to just see what happens. Perhaps we could at least have a base image from an actual Ansible install of OpenShift that runs an all-in-one cluster in a container, rather than going through oc cluster up or the like. Then we would have full configuration files and separate systemd units to work with in our testing.

So first, defining the “hosts”. It took me a few iterations to get here, since the examples go in a different direction, but I can simply define containers in my inventory as if they were hosts and specify the docker connection method for each as a host variable. Here’s my inventory for an Origin install:

[OSEv3:children]
masters
nodes
etcd

[OSEv3:vars]
deployment_type=origin
openshift_release=1.4
openshift_uninstall_images=False

[masters]
master_container ansible_connection=docker

[nodes]
master_container ansible_connection=docker
node_container ansible_connection=docker

[etcd]
master_container ansible_connection=docker

To ensure the containers exist and are running before Ansible tries to connect to them, I created a play that iterates over the inventory hostnames and creates a container for each:

---
- name: start up containers
  hosts: localhost
  tasks:
  - name: start containers
    with_inventory_hostnames:
      - all
    docker_container:
      image: centos/systemd
      name: "{{ item }}"
      state: started
      volumes:
        - /var/run/docker.sock:/var/run/docker.sock:z

This uses the Ansible docker_container module to ensure that, for each hostname, there is a running container based on the centos/systemd image (a base CentOS image that runs systemd as init). Since I don’t really want to run a separate docker daemon inside each container once the cluster is up (remember, I want to start a lot of these, and they’ll pretty much all use the same images, so I’d really like to reuse the docker image cache), I’m mounting in the host’s docker socket so everyone will use one big happy docker daemon.
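As a quick sanity check (assuming the inventory above is saved in a file named inventory, which is just my name for it), the docker connection plugin and the socket mount can be verified with something like:

ansible -i inventory all -m ping
docker exec master_container ls -l /var/run/docker.sock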

Then I just have to run the regular plays for an install (this assumes we’re in the openshift-ansible source directory):

- include: playbooks/byo/openshift-cluster/config.yml
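For what it’s worth, with the container-startup play and that include saved together in a playbook (containers.yml is a name I’m making up here) and the inventory in a file named inventory, the whole thing kicks off from the openshift-ansible checkout with:

ansible-playbook -i inventory containers.yml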

Now of course it could not be that simple. After a few minutes of installing, I ran into an error:

TASK [openshift_clock : Start and enable ntpd/chronyd] *************************
fatal: [master_container]: FAILED! => {
 "changed": true, 
 "cmd": "timedatectl set-ntp true", 
 "delta": "0:00:00.200535", 
 "end": "2017-03-14 23:43:39.038562", 
 "failed": true, 
 "rc": 1, 
 "start": "2017-03-14 23:43:38.838027", 
 "warnings": []
}

STDERR:

Failed to create bus connection: No such file or directory

I looked around and found others who had run into the same issue, and it seemed related to running dbus, but dbus was installed in the image and I couldn’t get it running. Eventually a colleague told me that you have to run the container privileged for dbus to work. Why that should be, I don’t know, but it’s easy enough to do.
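That just means adding privileged: True to the docker_container task in the first play:

    docker_container:
      image: centos/systemd
      name: "{{ item }}"
      state: started
      privileged: True
      volumes:
        - /var/run/docker.sock:/var/run/docker.sock:z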

On to the next problem. I ran into an error from within Ansible, which tried to index openshift_release as a string when the parser had read 1.4 as a float.

TASK [openshift_version : set_fact] **********************************************************************************
fatal: [master_container]: FAILED! => {
 "failed": true
}

MSG:

The conditional check 'openshift_release is defined and 
openshift_release[0] == 'v'' failed. The error was: 
error while evaluating conditional (openshift_release is 
defined and openshift_release[0] == 'v'): float object has no element 0

Having seen this sort of thing before, I could tell this was due to how I specified openshift_release in my inventory. It looks like a number, so the parser treats it as one. I can just change it to "1.4" or v1.4 and it will be treated as a string. I think this was only a problem when running Ansible from source; I didn’t see it with the released package.
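In the inventory, that’s just the quoted form (v1.4 works as well):

[OSEv3:vars]
deployment_type=origin
openshift_release="1.4"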

Next problem: a playbook error because, with the docker connection plugin, no ssh user is specified, so the installer can’t retrieve one. Even though it’s unnecessary for this connection type, just specify one in the inventory.

[OSEv3:vars]
ansible_user=root

Next problem. The installer complains that NetworkManager must be installed and enabled before running the install.

TASK [openshift_node_dnsmasq : fail] *******************************************
fatal: [master_container]: FAILED! => {
 "changed": false, 
 "failed": true
}

MSG:

Currently, NetworkManager must be installed and enabled prior to installation.

And I quickly found out that things will hang if you don’t restart dbus (possibly related to this old Fedora bug) after installing NetworkManager. Alright, just add that to my plays:

- name: set up NetworkManager
  hosts: all
  tasks:
  - name: ensure NetworkManager is installed
    package:
      name: NetworkManager
      state: present
  - name: ensure NetworkManager is enabled
    systemd:
      name: NetworkManager
      enabled: True
  - name: dbus needs a restart after this or NetworkManager and firewall-cmd choke
    systemd:
      name: dbus
      state: restarted

When I was first experimenting with this, it went through just fine. On later tries, starting with fresh containers, it hung at starting NetworkManager, and I haven’t figured out why yet.
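When it does hang, poking at the container from another terminal at least shows where it’s stuck, though it hasn’t told me the root cause yet:

docker exec -it master_container systemctl status NetworkManager dbus
docker exec -it master_container journalctl --no-pager -eu NetworkManager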

Finally, it looked like everything was actually installing successfully, but then of course starting the node failed.

fatal: [node_container]: FAILED! => {
 "attempts": 1, 
 "changed": false, 
 "failed": true
}

MSG:

Unable to start service origin-node: Job for origin-node.service 
failed because the control process exited with error code. 
See "systemctl status origin-node.service" and "journalctl -xe" for details.

# docker exec -i --tty node_container bash
[root@9f7e04f06921 /]# journalctl --no-pager -eu origin-node 
[...]
systemd[1]: Starting Origin Node...
origin-node[8835]: F0315 19:13:21.972837 8835 start_node.go:131] 
cannot fetch "default" cluster network: Get 
https://cf42f96fd2f8:8443/oapi/v1/clusternetworks/default: 
dial tcp: lookup cf42f96fd2f8: no such host
systemd[1]: origin-node.service: main process exited, code=exited, status=255/n/a
systemd[1]: Failed to start Origin Node.


Actually, I previously got a completely different error related to OVS that I’m not seeing now. These failures could be anything as far as I know, but they may be related to the fact that I didn’t expose any ports, specify any external IP addresses for my “hosts” to talk to each other, or arrange any DNS for them to resolve each other. In any case, that’s something to remedy another day. So far the playbook and inventory look like this:

---
- name: start up containers
  hosts: localhost
  tasks:
    - name: start containers
      with_inventory_hostnames:
        - all
      docker_container:
        image: centos/systemd
        name: "{{ item }}"
        state: started
        privileged: True
        volumes:
          - /var/run/docker.sock:/var/run/docker.sock:z

- name: set up NetworkManager
  hosts: all
  tasks:
    - name: ensure NetworkManager is installed
      package:
        name: NetworkManager
        state: present
    - name: ensure NetworkManager is enabled
      systemd:
        name: NetworkManager
        enabled: yes
        state: started
    - name: dbus needs a restart after this or NetworkManager and firewall-cmd choke
      systemd:
        name: dbus
        state: restarted

- include: openshift-cluster/config.yml

And the inventory:

[OSEv3:children]
masters
nodes
etcd

[OSEv3:vars]
deployment_type=origin
openshift_release="1.4"
openshift_uninstall_images=False
ansible_user=root

[masters]
master_container ansible_connection=docker

[nodes]
master_container ansible_connection=docker
node_container ansible_connection=docker

[etcd]
master_container ansible_connection=docker