How Do You Troubleshoot an Ansible Playbook That Has No Visible Errors when It Runs?

Updated 8/1/21

Problem scenario

An Ansible playbook appears to run with no errors or explicit failures.  But the playbook is not working as you expect.  There is no message indicating what could be wrong.  What should you do to troubleshoot it so the intended effect(s) will happen? 

Tips and Possible Solutions in Nearly Random Order
1.  When you run the ansible-playbook command, use the "-vvvv" flag at the end of the command itself.  This provides greater verbosity (more detailed output) when the playbook is run.

2.  Check if any parts are commented out.  You may have commented out something for a special task and forgotten that the "#" symbol is still in front of the section of .yaml code you thought was executing.

3.  Does the playbook use "roles"?  If so, does the "roles" directory exist in the same directory as the playbook?  If the "roles" parent directory is not in the directory where the playbook (.yml or .yaml file) is, the ansible.cfg file must be configured to point to where the roles directory is.  If the roles exist where they should (either in the same directory as the playbook itself or where the ansible.cfg file is configured), does each role directory have a "tasks" subdirectory?  If a role directory (e.g., named "java", "foobar" or whatever name you chose) has no "tasks" subdirectory, then the tasks in the role are not being invoked.  Ansible looks for a role's tasks in the "tasks" subdirectory of that role if the role is specified in the playbook you are running.

The names of the subdirectories in the "roles" directory should be the same as the names of the roles used in the playbook.  These directories each should have a subdirectory named "tasks".

In other words be sure you have a path like this:
roles/foobar/tasks/main.yaml (where "foobar" is the name of the role you list in your playbook).

4.  Check the inventory hosts file.  The playbook may be running against different servers than the ones you intended.  Verify the labels in the [groupname] syntax.  Some labels you create may be confusingly similar.
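
To spot confusingly similar group labels, you could compare them programmatically. This is a rough Python sketch, assuming an INI-style inventory. The function names and the 0.8 similarity threshold are arbitrary choices for illustration, not anything Ansible defines.

```python
import re
from difflib import SequenceMatcher
from itertools import combinations

GROUP_HEADER = re.compile(r"^\[([^\]]+)\]\s*$")


def group_names(inventory_text):
    """Collect the [groupname] labels from INI-style inventory text."""
    names = []
    for line in inventory_text.splitlines():
        match = GROUP_HEADER.match(line.strip())
        if match:
            names.append(match.group(1))
    return names


def similar_groups(names, threshold=0.8):
    """Return pairs of group names whose similarity ratio meets the threshold."""
    return [
        (a, b)
        for a, b in combinations(names, 2)
        if SequenceMatcher(None, a, b).ratio() >= threshold
    ]
```

For example, an inventory with both [webservers] and [webserver] would have that pair reported, which is a hint to double-check the "hosts" line of the playbook.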

5.  Use shell, raw or command tasks in the playbook itself.  These are not preferred to built-in modules, but they can give you an alternative way of doing something.  In general you will want to avoid raw shell commands whenever possible, because their outcomes are not as well known to Ansible as those of modules.  But for variable computation and troubleshooting purposes, shell, raw, or command tasks may help.  They can give you a record of intermediate values during the course of the playbook run.  This way you will know the computation values and be able to diagnose what is wrong by tracing the steps carefully (e.g., by using 'echo "{{ variablename }}" > /tmp/result.txt'). Sometimes variables assume types you do not expect. To learn what type of variable you have, see this posting.

6.  It is quite possible that the playbook is working, and the real problem is a misconfiguration in the operating system or a software component.  If you manually perform what the playbook should do, is the problem reproducible?  This may be tedious, but it lets you rule out the Ansible playbook as the cause of your problem.

7.  Use the "debug" module to track down the error.  This module allows you to print out values when your playbook executes.  To learn more about it, see this external posting. This link may also be helpful.

8. Use the --step flag with "ansible-playbook".  This helps you find the problem as you can inspect the system at intermediate steps of the playbook.

9.  Do you have two sections of your playbook?  If one section is applied to a certain group of hosts and a second section is applied to a different group of hosts, the variable assignments may be easily confused (by humans).  One set of variables will only be valid for the scope of one section of your playbook.

10.  When there is output when you run "ansible-playbook" do you see green, yellow or red text?  Output from a playbook's operation in green text means nothing needed to be changed and nothing was changed.  Output from a playbook's operation in yellow text means that something was changed successfully.  Output from a playbook's operation in red text means that the operation failed (and the desired change was not made).

11.  If the playbook is trying to modify /etc/fstab, read this possible solution (#11).  If /etc/fstab is not being modified by the "mount" module with the "state: absent" attribute set, is the path correct?  If a "/" is missing from the end of the "path" in the playbook and the terminating "/" is in the /etc/fstab entry, the playbook will not change the /etc/fstab file.  The module is stricter about this than you might expect.

12.  If the problem pertains to mounting or unmounting (removing) a file or directory, see this posting.

13. Does your playbook have two sections of "tasks"? Here is an example where one section's tasks will not execute:

- name: This is a test.
  hosts: contintserver
  tasks:
    - shell:  "free -m > /tmp/memory.txt"
  tasks:
    - shell: "date > /tmp/date.txt"

The first "tasks" section will not run. To get it to run, use a new "name" stanza/line and a new "hosts" stanza/line before the second "tasks" section. Here is the same playbook as above, but with two "tasks" sections that both actually execute:

- name: This is a test.
  hosts: contintserver
  tasks:
    - shell:  "free -m > /tmp/memory.txt"
- name: Second section.
  hosts: contintserver
  tasks:
    - shell: "date > /tmp/date.txt"
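
A repeated key such as the second "tasks" in tip 13 can be detected before you run the playbook. The following is only a sketch in Python, not a YAML parser: it assumes each play starts with "- " in the first column and that play-level keys are indented by exactly two spaces, as in the examples above. The function name duplicate_play_keys is hypothetical.

```python
import re

KEY = re.compile(r"^([A-Za-z_][\w-]*):")


def duplicate_play_keys(playbook_text):
    """Return (play_number, key) for each key repeated within one play."""
    duplicates = []
    play_number = 0
    seen = set()
    for line in playbook_text.splitlines():
        if line.startswith("- "):
            # A new play begins; its first key shares the line with the dash.
            play_number += 1
            seen = set()
            body = line[2:]
        elif line.startswith("  ") and not line.startswith("   "):
            # A play-level key indented by exactly two spaces.
            body = line[2:]
        else:
            continue  # deeper nesting, blank lines, or anything else
        match = KEY.match(body)
        if match:
            key = match.group(1)
            if key in seen:
                duplicates.append((play_number, key))
            seen.add(key)
    return duplicates
```

Run against the first example in tip 13, this reports that "tasks" is repeated in play 1; run against the corrected example, it reports nothing.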

14. Is your playbook using the AWS SSM modules? The CLI aws ssm commands or the SSM via the web console have limitations. AWS SSM may think that something has been installed or uninstalled that is currently in a different status. For example if you manually remove a key file from Linux via a Bash command, AWS SSM will get its status from somewhere else. To ensure that the playbook works, run an "aws ssm" uninstall command every time. This way the install can happen in the event there was corruption in the existing installation (e.g., a person modified a key file without using AWS SSM). Playbooks can run and appear to have no failures or cause no changes due to a mix of traditional systems administration and AWS SSM invocations.

15. Does the playbook call a role? You may need to change the playbook from something like this:

roles:
  - foobar

It should be more like this:

roles:
  - foobar/subdirectoryName

(where subdirectoryName is a subdirectory of the role or directory foobar. The subdirectoryName directory has a subdirectory called "tasks" with a main.yml file.)

If the playbook has just "foobar" in "roles" with no subdirectory name, it will appear to complete with no errors and never make any changes. So remember to use the proper syntax, as the location of the parent directory of the role's files needs to be explicit and accurate.

16. Use Ansible meta. It can end a play or clear variables.

17. If the playbook uses one or more roles, try Molecule.

18. To follow recommended practices with Ansible, you may want to use ansible-lint. You can obtain it at https://github.com/ansible-community/ansible-lint.

19. On a Linux/Unix machine, run "echo $?" immediately after your command. Does it return a non-zero value? A non-zero value such as "1" indicates the command was not successful. This external page has more information. If it returns a "0", then as far as the OS is concerned, it worked. Hopefully another possible solution can help you if that is the case.
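
The same exit status that "echo $?" shows can be read from a script. Here is a minimal Python sketch using the standard subprocess module. The helper name exit_code is hypothetical, and the commands below are stand-ins; you would pass your own command (for example, your ansible-playbook invocation) as a list of arguments.

```python
import subprocess
import sys


def exit_code(command):
    """Run a command (a list of arguments) and return its exit status."""
    return subprocess.run(command).returncode


if __name__ == "__main__":
    # Stand-in commands: Python one-liners that exit with a chosen status.
    ok = exit_code([sys.executable, "-c", "import sys; sys.exit(0)"])
    failed = exit_code([sys.executable, "-c", "import sys; sys.exit(1)"])
    print(ok, failed)
```

A return value of 0 only tells you the operating system considers the command successful; it does not prove the playbook made the change you wanted.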

20. View additional documentation:
https://www.ansible.com/blog/introduction-to-ansible-test
https://docs.ansible.com/ansible/latest/reference_appendices/test_strategies.html
https://docs.ansible.com/ansible/latest/dev_guide/testing_units_modules.html

How Do You Use Testinfra (the Python module)?

Problem scenario
You prefer Python to Ruby for certain tasks, and you want to have a way of testing your configuration management tools. You want to use Testinfra to accomplish this (https://testinfra.readthedocs.io/en/latest/). You want to install Testinfra to test it out. What do you do?

Solution
Prerequisites
i. You must have pip3 installed on the server.
For Ubuntu/Debian systems, you would run this: sudo apt-get install -y python3-pip
For CentOS/RHEL/Fedora systems, you would run this: sudo yum install -y python3-pip

ii. This assumes that you have installed Nginx and have it running. With Ubuntu/Debian systems, you would run this:

sudo apt -y install nginx && sudo systemctl start nginx

For CentOS/RHEL/Fedora systems, you would run this:

sudo yum install -y nginx && sudo systemctl enable nginx && sudo systemctl start nginx

Procedures
1. Run this command: sudo pip3 install testinfra

2. Create this file test_nginx.py
It should have these lines in it:

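# Checks that the nginx package is installed and is a 1.14.x version.
def test_nginx_is_installed(host):
    nginx = host.package("nginx")
    assert nginx.is_installed
    assert nginx.version.startswith("1.14")

# Checks that the nginx service is running and enabled at boot.
def test_nginx_running_and_enabled(host):
    nginx = host.service("nginx")
    assert nginx.is_running
    assert nginx.is_enabled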

3. Run this command from the directory where the above file is: pytest

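4. The file above tests whether nginx 1.14 is installed and whether the nginx service is running and enabled. Change "1.14" in the file if your installed version differs. If you want to generate a failure, run this: sudo systemctl stop nginx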

How Do You Use the Python self Keyword?

Problem scenario
You have seen Python functions defined using the self keyword. You want to test it out. How do you do this?

Solution
First of all, self is NOT a keyword in Python. Yes, you should probably use the word "self" when it is invoked. This is a well-accepted convention. It may be an unwritten rule. But you do not need to use the word "self".

The word "self" [in the context of Python object-oriented programming] allows you to refer to the instance of the object itself (according to this posting).

Here is an example where self is not used:

class Cartesian(object):
    def __init__(wwxxyyzz,a = 0,b = 0):
        wwxxyyzz.a = a
        wwxxyyzz.b = b

    def distance(wwxxyyzz):
        return (wwxxyyzz.a**2 + wwxxyyzz.b**2) ** 0.5 # Pythagorean theorem.

contint = Cartesian(4, 1)
print(contint.distance())

In the above example, "wwxxyyzz" is the first parameter of the __init__ constructor and of the distance method. Python passes the object instance as that argument automatically when the object is created and when the method is called. You can replace "wwxxyyzz" with the word "self". When referring to the object itself outside of the class, use the object's name; we use "contint" in the above example. When referring to the object itself from within the class, use the word self.

Here is the same program but with the word "self":

class Cartesian(object):
    def __init__(self,a = 0,b = 0):
        self.a = a
        self.b = b

    def distance(self):
        return (self.a**2 + self.b**2) ** 0.5 # Pythagorean theorem.

contint = Cartesian(4, 1)
print(contint.distance())
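
To see that "self" is simply the instance being passed as the first argument, compare the two ways of calling the method below. This is a small sketch reusing the Cartesian class from above; the variable names via_instance and via_class are only for illustration.

```python
class Cartesian(object):
    def __init__(self, a=0, b=0):
        self.a = a
        self.b = b

    def distance(self):
        return (self.a**2 + self.b**2) ** 0.5


contint = Cartesian(4, 1)

# Calling the method on the instance passes "contint" as the first argument.
via_instance = contint.distance()

# Calling the function through the class requires passing the instance explicitly.
via_class = Cartesian.distance(contint)

print(via_instance == via_class)  # prints True
```

Both calls run the same function with the same object, which is why the name of the first parameter is a convention rather than a keyword.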

How Do You Browse .eml Files?

Problem scenario
You have some .eml files (e.g., emails from a web-based email account) that you want to view. How do you display them for free?

Solution

  1. Install Thunderbird (e.g., on Windows or Linux).
  2. Configure Thunderbird with an email account (e.g., gmail). It seems to work when you configure it with any old email account. Without an email account, I do not know if it will work.
  3. Create a special folder under the Inbox section.
  4. Place your .eml files in this new folder.

How Do You Set up Nginx as an HTTP Load Balancer So Client Requests (from Web Browsers) Do Not Go to Certain Nginx Servers unless Others Are Down?

Problem scenario
You have a web server running Nginx that acts as a reverse proxy server.  On occasion your regular web (Nginx) servers go down.  You want to have one or two web (Nginx) servers that are reserved exclusively as backups.  You do not want traffic going to these servers unless the main Nginx servers are unavailable (either due to network or server failure).  You can allocate RAM and CPU to these reserved servers on demand.  To save money, the business wants these servers to have few resources until they are brought online.  How do you have Nginx convey traffic to other Nginx servers only when the main Nginx servers are down?

Solution
Prerequisite

You must have configured Nginx as an HTTP load balancer (or reverse proxy server).  To do this, see this article, which will work for Nginx distributing traffic to regular Nginx websites, Apache websites, or Nginx websites running in Docker containers.

Procedures
In the default.conf file, use the "backup" directive near the servers' FQDNs or IP addresses as they appear in the "upstream backend" clause (within the braces of "upstream backend {}").

This "backup" designation is performed with Nginx in Docker containers the same way it is done with Nginx running directly on a server.  Arguably one difference would be that the IP address you use in the default.conf file is an internal IP address associated with the user defined Docker network.

To find the IP address to use when you know the container ID of the Nginx instance you want to be a backup, you could use this command:

docker inspect <containerID> | grep IPAddress | tail -n 1

That would give you the internal IP address of the Docker container of your Nginx instance.

The Nginx instance with the landing page should have a default.conf file (in /etc/nginx/conf.d/default.conf).  Here is an example of two Nginx servers reserved as "backup" in the default.conf file:

upstream backend {
  server 10.10.10.10;
  server 10.10.10.11;
  server 10.10.10.12 backup;
  server 10.10.10.13 backup;
}

These "backup" keywords do not take effect until you stop Nginx services and restart them.  These keywords work with Nginx running in Docker no differently from Nginx running directly on a server.
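
If you maintain the server lists elsewhere, the "upstream" block can be generated instead of typed by hand. This is a minimal Python sketch; the function name render_upstream is hypothetical, and the addresses are the example ones from above. It only builds the text; you would still place it in default.conf yourself.

```python
def render_upstream(primary, backups, name="backend"):
    """Build an Nginx "upstream" block with the backup servers marked."""
    lines = ["upstream %s {" % name]
    lines += ["  server %s;" % address for address in primary]
    lines += ["  server %s backup;" % address for address in backups]
    lines.append("}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(render_upstream(["10.10.10.10", "10.10.10.11"],
                          ["10.10.10.12", "10.10.10.13"]))
```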

FFR
If you need to troubleshoot problems with your Nginx load balancer, see this link.

In Linux without the Internet, How Do You Enter Left Double Quotes (“) or Right Double Quotes (”)?

Problem scenario
You are trying to search a file for left double quotes and right double quotes (e.g., using a Linux command terminal or a vi editor). How do you enter left double quotes or right double quotes using a keyboard (without copying the apparently italicized double quotes from a web page)?

Solution
Hold Alt, and then enter the sequence 8220 for left double quotes.

Hold Alt, and then enter the sequence 8221 for right double quotes.
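
The numbers 8220 and 8221 are the decimal Unicode code points of the two characters (U+201C and U+201D). If the goal is to search a file, a script can produce the characters from those numbers so nothing has to be typed or copied. This is a minimal Python sketch; the function name lines_with_curly_quotes is hypothetical.

```python
LEFT_DOUBLE_QUOTE = chr(8220)   # U+201C
RIGHT_DOUBLE_QUOTE = chr(8221)  # U+201D


def lines_with_curly_quotes(text):
    """Return (line_number, line) pairs that contain either curly double quote."""
    return [
        (number, line)
        for number, line in enumerate(text.splitlines(), start=1)
        if LEFT_DOUBLE_QUOTE in line or RIGHT_DOUBLE_QUOTE in line
    ]
```

You could read a file's contents into a string and pass it to the function to list the lines that contain either character.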

How Do You Write a Python Program to Test if a Word is a Substring of Another Word?

Problem scenario
You want to write a program that will test if a pattern is in another word. You are looking for the Python equivalent of "contains" in SQL. How do you test if a string is a substring of another word?

Solution
Use this program:

# Change "micro" and "microsoft" to the substring and string to be searched respectively:

a = "micro"
b = "microsoft"
if a in b:
  print(a + " is in " + b)
else:
  print(a + " is NOT in " + b)
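
The "in" operator is case-sensitive. If you also want a case-insensitive test, or the position of the match, a variation like the following may help. This is a small sketch; the function name contains and its ignore_case parameter are hypothetical, while str.find is a standard string method that returns -1 when there is no match.

```python
def contains(substring, text, ignore_case=False):
    """Return True if substring occurs in text, optionally ignoring case."""
    if ignore_case:
        substring, text = substring.lower(), text.lower()
    return substring in text


print(contains("Micro", "microsoft"))                    # prints False
print(contains("Micro", "microsoft", ignore_case=True))  # prints True

# str.find gives the index of the first match, or -1 if there is none.
print("microsoft".find("soft"))   # prints 5
print("microsoft".find("apple"))  # prints -1
```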

How Do You Find the URL of Your Kubernetes Cluster?

Problem scenario
You want to view the website that is powered by Kubernetes. But you do not know which URL to go to. What should you do from the back-end server with kubectl?

Solution

1. Run this: kubectl get services
With the resulting output, find a name that you want the URL for. (Services have names.) Let's assume the name was "foobar".

2. Run this command: kubectl get service foobar -o wide

The resulting output should include an "EXTERNAL-IP" value that you can use.
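
If you would rather read the address from a script, kubectl can print the service as JSON with "-o json". The sketch below, in Python, pulls the load balancer addresses out of that JSON. It assumes the service is of a type that reports an external address under status.loadBalancer.ingress; the function name external_addresses is hypothetical.

```python
import json


def external_addresses(service_json):
    """Extract load balancer IPs or hostnames from `kubectl get service <name> -o json` output."""
    service = json.loads(service_json)
    ingress = service.get("status", {}).get("loadBalancer", {}).get("ingress") or []
    # Each entry carries either an "ip" or a "hostname", depending on the provider.
    return [entry.get("ip") or entry.get("hostname") for entry in ingress]
```

You could save the output of "kubectl get service foobar -o json" to a file and pass its contents to this function. An empty list suggests no external address has been assigned to the service.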

How Do You Troubleshoot the Error “Got permission denied while trying to connect to the Docker daemon socket at unix:///var/run/docker.sock”?

Problem scenario
You try to run a Docker command, but you get this error: Got permission denied while trying to connect to the Docker daemon socket at unix:///var/run/docker.sock: Get http://%2Fvar%2Frun%2Fdocker.sock/v1.29/containers/json?all=1: dial unix /var/run/docker.sock: connect: permission denied

What should you do?

Solution
Have you added the user who was trying to execute the command to the "docker" group?

This command would add the user jdoe to the "docker" group:
sudo usermod -aG docker jdoe

If you ran the above command as the jdoe user, log out and log back in.

It is not advisable to run "docker" commands with "sudo " before them. In rare cases it may be necessary to use it to get the "docker" command to work.
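
To confirm the group change was recorded, you can check the system's group database. This is a minimal Python sketch for Linux/Unix using the standard grp and pwd modules; the function name user_in_group is hypothetical. It reads the group database, not your current login session, so it can report True before you have logged out and back in.

```python
import grp
import pwd


def user_in_group(username, groupname):
    """Return True if the user is listed in the group or has it as the primary group."""
    try:
        group = grp.getgrnam(groupname)
    except KeyError:
        return False  # the group does not exist on this system
    if username in group.gr_mem:
        return True
    try:
        # Primary group membership is not listed in gr_mem, so compare group IDs.
        return pwd.getpwnam(username).pw_gid == group.gr_gid
    except KeyError:
        return False  # the user does not exist on this system


if __name__ == "__main__":
    # "jdoe" is the example user from above.
    print(user_in_group("jdoe", "docker"))
```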

How Do You Get a PHP Script to Invoke a Bash Command?

Problem scenario
You are trying to get a PHP file to invoke a bash script. When you run it from the command line with "php foobar.php" you just see the content of the PHP file. What do you do to get PHP to invoke a bash script?

Solution
Make sure you have the correct header and closing symbols. Here is the content of foobar.php:

<?php
exec('bash /var/www/html/testScript.sh');
?>

To run it, do this: php foobar.php