Updated 8/1/21
Problem scenario
An Ansible playbook appears to run with no errors or explicit failures. But the playbook is not working as you expect. There is no message indicating what could be wrong. What should you do to troubleshoot it so the intended effect(s) will happen?
Tips and Possible Solutions in Nearly Random Order
1. When you run the ansible-playbook command, add the "-vvvv" flag (for example, at the end of the command). This increases verbosity (more detailed output) when the playbook is run.
2. Check whether any parts are commented out. You may have commented out a section for a one-off task and forgotten that the "#" symbol is still in front of the .yaml code you think is executing.
3. Does the playbook use "roles"? If so, does the "roles" directory exist in the same directory as the playbook (.yml or .yaml file)? If it does not, the ansible.cfg file must be configured (with the roles_path setting) to point to where the roles directory is. If the roles exist where they should (either in the same directory as the playbook itself or where the ansible.cfg file is configured to look), does each role directory (e.g., named "java", "foobar" or whatever name you chose) have a "tasks" subdirectory? If a role directory has no "tasks" subdirectory, that role's tasks are not being invoked. When a role is specified in the playbook you are running, Ansible looks for the role's task file in the role's "tasks" subdirectory.
The names of the subdirectories in the "roles" directory should match the names of the roles used in the playbook. Each of these directories should have a subdirectory named "tasks".
In other words, be sure you have a path like this: roles/foobar/tasks/main.yaml
(where "foobar" is the name of the role you list in your playbook).
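As a minimal sketch of that layout, assuming a hypothetical role named "foobar", the file roles/foobar/tasks/main.yaml could start with one harmless, visible task. If its message never shows up in the playbook output, the role's tasks are probably not being loaded from where you think they are.

```yaml
# roles/foobar/tasks/main.yaml ("foobar" is a hypothetical role name)
# A single harmless task: if this message never appears in the
# playbook output, the role's tasks are not being loaded.
- name: Confirm the foobar role is running
  debug:
    msg: "Role foobar reached on {{ inventory_hostname }}"
```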
4. Check the inventory hosts file. The playbook may be running against different servers than the ones you intend. Verify the labels in the [groupname] syntax. Some labels you create may be confusingly similar.
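One way to check which servers a play really targets is to have each host report itself. The sketch below assumes a hypothetical group called "groupname"; inventory_hostname and group_names are variables Ansible provides for every host.

```yaml
- name: Show which hosts this play really targets
  hosts: groupname        # hypothetical group; substitute the label from your inventory
  gather_facts: false
  tasks:
    - name: Print the host and the inventory groups it belongs to
      debug:
        msg: "{{ inventory_hostname }} is in groups {{ group_names }}"
```

If a host you did not expect appears in the output, or an expected one is missing, the [groupname] labels in the inventory are the first place to look.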
5. Use the shell, raw or command modules in the playbook itself. These are not preferred over purpose-built modules, but they can give you an alternative way of doing something. In general you will want to avoid raw shell commands whenever possible, because their outcomes are not as well known to Ansible as those of modules. But for variable computation and troubleshooting purposes, shell, raw, or command tasks may help. They can give you a record of intermediate values during the course of the playbook run. This way you will know the computed values and be able to diagnose what is wrong by tracing the steps carefully (e.g., by using 'echo "{{ variablename }}" > /tmp/result.txt'). Sometimes variables assume types you do not expect. To learn what type of variable you have, see this posting.
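The tracing idea above might look like the following sketch. The variable name and value are hypothetical. The type_debug filter is one way to see the type Ansible assigned to a value, and may or may not be the approach the linked posting describes.

```yaml
- name: Trace an intermediate value
  hosts: localhost
  gather_facts: false
  vars:
    variablename: "42"    # hypothetical value; quoted, so YAML reads it as a string
  tasks:
    - name: Write the current value to a file for later inspection
      shell: 'echo "{{ variablename }}" > /tmp/result.txt'

    - name: Show the value and the type Ansible assigned to it
      debug:
        msg: "variablename={{ variablename }} type={{ variablename | type_debug }}"
```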
6. It is quite possible that the playbook is working, and the real problem is a misconfiguration in the operating system or a software component. If you manually perform what the playbook should do, is the problem reproducible? This may be tedious, but it lets you rule out the Ansible playbook as the cause of your problem.
7. Use the "debug" module to track down the error. This module allows you to print out values when your playbook executes. To learn more about it, see this external posting. This link may also be helpful.
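A short sketch of the "debug" module in use follows. The variable package_name is hypothetical. The "var", "msg" and "verbosity" parameters belong to the module itself.

```yaml
- name: Debug module examples
  hosts: localhost
  gather_facts: false
  vars:
    package_name: foobar   # hypothetical variable
  tasks:
    - name: Print a variable by name
      debug:
        var: package_name

    - name: Print a formatted message
      debug:
        msg: "About to process {{ package_name }}"

    - name: Print only when run with -vv or higher
      debug:
        msg: "Extra detail for {{ package_name }}"
        verbosity: 2
```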
8. Use the --step flag with "ansible-playbook". It asks for confirmation before each task, which helps you find the problem because you can inspect the system at intermediate steps of the playbook.
9. Do you have two sections (plays) in your playbook? If one section is applied to a certain group of hosts and a second section is applied to a different group of hosts, the variable assignments are easy for humans to confuse. A set of variables defined for one section is only valid within the scope of that section.
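The scoping issue can be reproduced with a sketch like this one. The group names and the variable target_dir are hypothetical. A variable set under "vars" in the first section is simply not defined in the second.

```yaml
- name: First play
  hosts: groupa            # hypothetical group
  gather_facts: false
  vars:
    target_dir: /opt/app   # visible only to tasks in this play
  tasks:
    - name: target_dir is defined here
      debug:
        var: target_dir

- name: Second play
  hosts: groupb            # hypothetical group
  gather_facts: false
  tasks:
    - name: target_dir is not defined in this play
      debug:
        msg: "target_dir is {{ target_dir | default('UNDEFINED in this play') }}"
```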
10. When you run "ansible-playbook", is the output text green, yellow or red? Green text means nothing needed to be changed and nothing was changed. Yellow text means that something was changed successfully. Red text means that the operation failed (and the desired change was not made).
11. If the playbook is trying to modify /etc/fstab, read this possible solution (#11). If /etc/fstab is not being modified by the "mount" module with the "state: absent" attribute set, is the path correct? If a "/" is missing from the end of the "path" in the playbook and the terminating "/" is present in /etc/fstab, the playbook will not change the /etc/fstab file. The module is more particular about this than you might expect.
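For reference, a "mount" task with "state: absent" might look like the sketch below. The group name and path are hypothetical. In newer Ansible releases the module is provided by the ansible.posix collection, so you may need the fully qualified name ansible.posix.mount.

```yaml
- name: Remove a mount and its /etc/fstab entry
  hosts: groupname          # hypothetical group
  become: true
  tasks:
    - name: Unmount and remove the fstab entry
      mount:
        # Hypothetical path. Per the tip above, write it exactly as it
        # appears in /etc/fstab, including any trailing "/".
        path: /mnt/data/
        state: absent
```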
12. If the problem pertains to mounting or unmounting (removing) a file or directory, see this posting.
13. Does your playbook have two sections of "tasks"? Here is an example where one section's tasks will not execute:
- name: This is a test.
  hosts: contintserver
  tasks:
    - shell: "free -m > /tmp/memory.txt"
  tasks:
    - shell: "date > /tmp/date.txt"
The first "tasks" section will not run, because the second "tasks" key overrides it. To get the first "tasks" section to run, add a new "name" line and a new "hosts" line before the second "tasks" section. Here is the same playbook as above, but with both "tasks" sections actually executing:
- name: This is a test.
  hosts: contintserver
  tasks:
    - shell: "free -m > /tmp/memory.txt"
- name: Second section.
  hosts: contintserver
  tasks:
    - shell: "date > /tmp/date.txt"
14. Is your playbook using the AWS SSM modules? Both the "aws ssm" CLI commands and SSM via the web console have limitations. AWS SSM may believe that something is installed or uninstalled when its actual status is different. For example, if you manually remove a key file from Linux via a Bash command, AWS SSM will not notice, because it gets its status from somewhere else. To ensure that the playbook works, run an "aws ssm" uninstall command every time. This way the install can proceed even if the existing installation was corrupted (e.g., a person modified a key file without using AWS SSM). When traditional systems administration is mixed with AWS SSM invocations, playbooks can run, appear to have no failures, and still make no changes.
15. Does the playbook call a role whose "tasks" directory is not directly under the role's own directory? If so, you may need to change the playbook from something like this:
roles:
  - foobar
to something more like this:
roles:
  - foobar/subdirectoryName
(where subdirectoryName is a subdirectory of the role directory foobar, and subdirectoryName contains a directory called "tasks" with a main.yml file)
With such a layout, a playbook that lists just "foobar" as the role, with no subdirectory name, will appear to complete with no errors and never make any changes. So remember that the location of the directory containing the role's "tasks" directory needs to be explicit and accurate.
16. Use the Ansible "meta" module. It can end a play or clear gathered facts. To learn more, see these postings:
- https://stackoverflow.com/questions/36451793/how-do-i-exit-ansible-play-without-error-on-a-condition
- https://docs.ansible.com/ansible/2.9/modules/meta_module.html
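A minimal sketch of ending a play on a condition with "meta" follows. The skip_rest flag is hypothetical. Conversely, a forgotten "meta: end_play" is worth searching for when later tasks never seem to run.

```yaml
- name: Stop early when a condition is met
  hosts: localhost
  gather_facts: false
  vars:
    skip_rest: true   # hypothetical flag
  tasks:
    - name: This task runs
      debug:
        msg: "Before the check"

    - name: End the play without reporting a failure
      meta: end_play
      when: skip_rest | bool

    - name: This task is not reached when skip_rest is true
      debug:
        msg: "After the check"
```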
17. If the playbook uses one or more roles, try Molecule.
18. To follow recommended practices with Ansible, you may want to use ansible-lint. You can obtain it here: https://github.com/ansible-community/ansible-lint
19. On a Linux/Unix machine, run "echo $?" immediately after your command. Does it print a non-zero value? A non-zero value such as "1" indicates the command was not successful, so the command is definitely not working. This external page has more information. If it prints "0", then as far as the OS is concerned, the command worked. Hopefully another possible solution can help you if that is the case.
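The same check can be made from inside a playbook, because the registered result of a shell or command task includes the return code as "rc". In this sketch the grep command and the variable name cmd_result are hypothetical.

```yaml
- name: Inspect a command's exit status
  hosts: localhost
  gather_facts: false
  tasks:
    - name: Run a command and keep its result
      shell: "grep -q foobar /etc/hosts"   # hypothetical command
      register: cmd_result
      failed_when: false    # keep going so the return code can be inspected
      changed_when: false   # this check does not change the system

    - name: Show the return code (0 means success as far as the OS is concerned)
      debug:
        var: cmd_result.rc
```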
20. View additional documentation:
https://www.ansible.com/blog/introduction-to-ansible-test
https://docs.ansible.com/ansible/latest/reference_appendices/test_strategies.html
https://docs.ansible.com/ansible/latest/dev_guide/testing_units_modules.html