How to debug Ansible issues?

Debugging modules

  • The most basic way is to run ansible/ansible-playbook with an increased verbosity level by adding -vvv to the execution line.

  • The most thorough way for the modules written in Python (Linux/Unix) is to run ansible/ansible-playbook with an environment variable ANSIBLE_KEEP_REMOTE_FILES set to 1 (on the control machine).

    It causes Ansible to leave the exact copy of the Python scripts it executed (either successfully or not) on the target machine.

    The path to the scripts is printed in the Ansible log and for regular tasks they are stored under the SSH user's home directory: ~/.ansible/tmp/.

    The exact logic is embedded in the scripts and depends on each module. Some are using Python with standard or external libraries, some are calling external commands.

Debugging playbooks

  • Similarly to debugging modules increasing verbosity level with -vvv parameter causes more data to be printed to the Ansible log

  • Since Ansible 2.1 a Playbook Debugger allows to debug interactively failed tasks: check, modify the data; re-run the task.

Debugging connections

  • Adding -vvvv parameter to the ansible/ansible-playbook call causes the log to include the debugging information for the connections.

Debugging Ansible tasks can be almost impossible if the tasks are not your own. Contrary to what Ansible website states.

No special coding skills needed

Ansible requires highly specialized programming skills because it is not YAML or Python, it is a messy mix of both.

The idea of using markup languages for programming has been tried before. XML was very popular in Java community at one time. XSLT is also a fine example.

As Ansible projects grow, the complexity grows exponentially as result. Take for example the OpenShift Ansible project which has the following task:

- name: Create the master server certificate
  command: >
    {{ hostvars[openshift_ca_host]['first_master_client_binary'] }} adm ca create-server-cert
    {% for named_ca_certificate in openshift.master.named_certificates | default([]) | lib_utils_oo_collect('cafile') %}
    --certificate-authority {{ named_ca_certificate }}
    {% endfor %}
    {% for legacy_ca_certificate in g_master_legacy_ca_result.files | default([]) | lib_utils_oo_collect('path') %}
    --certificate-authority {{ legacy_ca_certificate }}
    {% endfor %}
    --hostnames={{ hostvars[item].openshift.common.all_hostnames | join(',') }}
    --cert={{ openshift_generated_configs_dir }}/master-{{ hostvars[item].openshift.common.hostname }}/master.server.crt
    --key={{ openshift_generated_configs_dir }}/master-{{ hostvars[item].openshift.common.hostname }}/master.server.key
    --expire-days={{ openshift_master_cert_expire_days }}
    --signer-cert={{ openshift_ca_cert }}
    --signer-key={{ openshift_ca_key }}
    --signer-serial={{ openshift_ca_serial }}
    --overwrite=false
  when: item != openshift_ca_host
  with_items: "{{ hostvars
                  | lib_utils_oo_select_keys(groups['oo_masters_to_config'])
                  | lib_utils_oo_collect(attribute='inventory_hostname', filters={'master_certs_missing':True}) }}"
  delegate_to: "{{ openshift_ca_host }}"
  run_once: true

I think we can all agree that this is programming in YAML. Not a very good idea. This specific snippet could fail with a message like

fatal: [master0]: FAILED! => {"msg": "The conditional check 'item != openshift_ca_host' failed. The error was: error while evaluating conditional (item != openshift_ca_host): 'item' is undefined\n\nThe error appears to have been in '/home/user/openshift-ansible/roles/openshift_master_certificates/tasks/main.yml': line 39, column 3, but may\nbe elsewhere in the file depending on the exact syntax problem.\n\nThe offending line appears to be:\n\n\n- name: Create the master server certificate\n ^ here\n"}

If you hit a message like that you are doomed. But we have the debugger right? Okay, let's take a look what is going on.

master0] TASK: openshift_master_certificates : Create the master server certificate (debug)> p task.args
{u'_raw_params': u"{{ hostvars[openshift_ca_host]['first_master_client_binary'] }} adm ca create-server-cert {% for named_ca_certificate in openshift.master.named_certificates | default([]) | lib_utils_oo_collect('cafile') %} --certificate-authority {{ named_ca_certificate }} {% endfor %} {% for legacy_ca_certificate in g_master_legacy_ca_result.files | default([]) | lib_utils_oo_collect('path') %} --certificate-authority {{ legacy_ca_certificate }} {% endfor %} --hostnames={{ hostvars[item].openshift.common.all_hostnames | join(',') }} --cert={{ openshift_generated_configs_dir }}/master-{{ hostvars[item].openshift.common.hostname }}/master.server.crt --key={{ openshift_generated_configs_dir }}/master-{{ hostvars[item].openshift.common.hostname }}/master.server.key --expire-days={{ openshift_master_cert_expire_days }} --signer-cert={{ openshift_ca_cert }} --signer-key={{ openshift_ca_key }} --signer-serial={{ openshift_ca_serial }} --overwrite=false"}
[master0] TASK: openshift_master_certificates : Create the master server certificate (debug)> exit

How does that help? It doesn't.

The point here is that it is an incredibly bad idea to use YAML as a programming language. It is a mess. And the symptoms of the mess we are creating are everywhere.

Some additional facts. Provision of prerequisites phase on Azure of Openshift Ansible takes on +50 minutes. Deploy phase takes more than +70 minutes. Each time! First run or subsequent runs. And there is no way to limit provision to a single node. This limit problem was part of Ansible in 2012 and it is still part of Ansible today. This fact tells us something.

The point here is that Ansible should be used as was intended. For simple tasks without the YAML programming. Fine for lots of servers but it should not be used for complex configuration management tasks.

Ansible is a not Infrastructure as Code ( IaC ) tool.

If you ask how to debug Ansible issues, you are using it in a way it was not intended to be used. Don't use it as a IaC tool.