TripleO: Debugging Overcloud Deployment Failure

You run ‘openstack overcloud deploy’ and after a couple of minutes you find out it failed. If that’s not enough, you open the deployment log only to find a very (very!) long output that doesn’t give you a clue as to why the deployment failed. In the following sections we’ll see how we can get to the root of the problem.

Overcloud stack status

The overcloud is deployed using Heat templates. The result is a Heat stack which includes all the different resources of the overcloud (nested stacks, networks, subnets, etc.).

If you are not familiar with Heat, I recommend reading a little about it before proceeding, since the following might be more confusing than helpful without any knowledge of Heat.

So our first step in finding out what happened to our overcloud deployment is to list the existing stacks:

> openstack stack list

+--------------------------------------+------------+---------------+----------------------+--------------+
| ID                                   | Stack Name | Stack Status  | Creation Time        | Updated Time |
+--------------------------------------+------------+---------------+----------------------+--------------+
| 279addb0-1781-4d40-a62e-db8e063b9e81 | overcloud  | CREATE_FAILED | 2017-02-12T07:23:27Z | None         |
+--------------------------------------+------------+---------------+----------------------+--------------+

You can also use the deprecated command heat stack-list.

We get a lot of useful information from the above command. First of all, we now know what our overcloud stack is called: ‘overcloud’. We’ll use that name soon enough. We also know that the deployment was unsuccessful, due to the stack status: “CREATE_FAILED”.

What we still don’t know is why the deployment failed. We can choose one of two ways to continue.

The short way

If you are lucky enough to be deploying the latest available release (OSP 10+ / Newton+), then you can use the magical ‘openstack stack failures list’ command:

> openstack stack failures list overcloud

overcloud.AllNodesDeploySteps.ControllerDeployment_Step1.1:
  resource_type: OS::Heat::StructuredDeployment
  physical_resource_id: e0c37619-e3e8-49fb-9990-da0fdbdc862c
  status: CREATE_FAILED
  status_reason: |
    CREATE aborted
  deploy_stdout: |
    ...
    Notice: /Stage[main]/Pacemaker::Corosync/User[hacluster]/groups: groups changed '' to 'haclient'
    Notice: /Stage[main]/Pacemaker::Corosync/Exec[reauthenticate-across-all-nodes]: Triggered 'refresh' from 2 events
    Notice: /Stage[main]/Pacemaker::Corosync/Exec[wait-for-settle]/returns: Error: cluster is not currently running on this node
    Notice: /Stage[main]/Pacemaker::Stonith/Pacemaker::Property[Disable STONITH]/Exec[Creating cluster-wide property stonith-enabled]: Dependency Exec[wait-for-settle] has failures: true
    Notice: /Stage[main]/Haproxy/Haproxy::Instance[haproxy]/Haproxy::Config[haproxy]/Concat[/etc/haproxy/haproxy.cfg]/File[/etc/haproxy/haproxy.cfg]/content: content changed '{md5}1f337186b0e1ba5ee82760cb437fb810' to '{md5}674dff01894ec5b21f29ce257324b9b0'
    Notice: /File[/etc/haproxy/haproxy.cfg]/seluser: seluser changed 'unconfined_u' to 'system_u'
    Notice: /Stage[main]/Tripleo::Profile::Base::Haproxy/Exec[haproxy-reload]: Triggered 'refresh' from 1 events
    Notice: /Firewall[998 log all]: Dependency Exec[wait-for-settle] has failures: true
    Notice: /Firewall[999 drop all]: Dependency Exec[wait-for-settle] has failures: true
    Notice: Finished catalog run in 3729.91 seconds
    (truncated, view all with --long)

It will instantly give you all the information regarding the failure. The above output was very long, so I pasted only a portion of it.

You can see which resources in the stack failed, and the ‘deploy_stdout’ field shows exactly what happened.

Note that ‘overcloud’ in the command is the name of the stack, not a fixed keyword. Change it if your stack is called differently.

Older releases

Don’t have the ‘openstack stack failures list’ command? Don’t worry, we’ll simply use the good old way.

List nested stacks

From ‘openstack stack list’ it looks like there is only one stack for the overcloud deployment, but in fact there could be more than 50(!) stacks.

The ones we are interested in are the stacks that failed during creation (deployment). To see all of them we’ll use the ‘--nested’ flag, and to filter only those with a “FAILED” status we’ll use grep.

> openstack stack list --nested | grep FAILED 
# OR the deprecated version: heat stack-list --show-nested | grep FAILED

| c738ba27-b6d7-475e-86c6-937d5bd4ac6c | overcloud-AllNodesDeploySteps-pi6akgjya6pd-ControllerDeployment_Step1-ddultaifqbuo                                                           | CREATE_FAILED   | 2017-02-12T07:31:09Z | None         | fabf0c0e-77a1-4751-a61d-bfa9458fe89d |
| fabf0c0e-77a1-4751-a61d-bfa9458fe89d | overcloud-AllNodesDeploySteps-pi6akgjya6pd                                                                                                   | CREATE_FAILED   | 2017-02-12T07:30:00Z | None         | 279addb0-1781-4d40-a62e-db8e063b9e81 |
| 279addb0-1781-4d40-a62e-db8e063b9e81 | overcloud                                                                                                                                    | CREATE_FAILED   | 2017-02-12T07:23:27Z | None         | None                                 |
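Since Heat builds a nested stack’s name by appending the child resource names to the parent’s, the most deeply nested failed stack is simply the one with the longest name. A small pipeline (a sketch that assumes the default table output shown above) can pick out its ID automatically:

```shell
# Print the ID of the most deeply nested failed stack.
# Column 2 of the table is the stack ID, column 3 is the stack name;
# after trimming padding, the longest name is the deepest stack.
openstack stack list --nested | grep FAILED \
  | awk -F'|' '{gsub(/ /, "", $2); gsub(/ /, "", $3)
                if (length($3) > max) { max = length($3); id = $2 } }
               END { print id }'
```

The printed ID can be fed straight into ‘openstack stack resource list’ in the next step.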

We can see three failed stacks there. One of them (the last) is our original top-level stack, which we already saw when using ‘openstack stack list’.

The other two are nested stacks. So we still don’t know what caused the deployment to fail, but we are getting closer 🙂

List stack resources

Once we know which nested stack failed, we can proceed with checking which resources failed.

> openstack stack resource list c738ba27-b6d7-475e-86c6-937d5bd4ac6c
# OR the deprecated command: heat resource-list c738ba27-b6d7-475e-86c6-937d5bd4ac6c 

+---------------+--------------------------------------+--------------------------------+-----------------+----------------------+
| resource_name | physical_resource_id                 | resource_type                  | resource_status | updated_time         |
+---------------+--------------------------------------+--------------------------------+-----------------+----------------------+
| 1             | e0c37619-e3e8-49fb-9990-da0fdbdc862c | OS::Heat::StructuredDeployment | CREATE_FAILED   | 2017-02-12T07:31:09Z |
| 0             | 82627d0c-87c4-43b6-b3a9-51bcd020f46d | OS::Heat::StructuredDeployment | CREATE_FAILED   | 2017-02-12T07:31:09Z |
| 2             | 9bd21425-f7c7-4261-9d73-4b4b99c3d5f1 | OS::Heat::StructuredDeployment | CREATE_FAILED   | 2017-02-12T07:31:09Z |
+---------------+--------------------------------------+--------------------------------+-----------------+----------------------+

We got three resources, all of which failed for some reason (don’t worry, we’ll find out why eventually). The valuable information for us here is the physical_resource_id. This is the last piece of information we need to find out why our deployment failed.

If you are wondering where ‘c738ba27-b6d7-475e-86c6-937d5bd4ac6c’ came from, it’s from the previous command’s output: it is the ID of a failed nested stack. Why specifically this one? Because it’s the most deeply nested of the three stacks, as you can tell from its name.
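Instead of eyeballing the list, you can also walk every failed nested stack and dump its failed resources in one go. This is a sketch that assumes your undercloud credentials are sourced and that your client supports the ‘-f value’ output formatter:

```shell
# For each failed nested stack, print its ID and its failed resources.
# '-f value -c ID -c "Stack Status"' prints "<id> <status>" per line.
for id in $(openstack stack list --nested -f value -c ID -c 'Stack Status' \
              | awk '/FAILED/ {print $1}'); do
  echo "### stack $id"
  openstack stack resource list "$id" | grep FAILED
done
```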

The holy grail (a.k.a why my deployment failed)

We’ll now use the physical_resource_id of the first resource (‘0’) from the previous section to find out what happened to our deployment.

> openstack software deployment show 82627d0c-87c4-43b6-b3a9-51bcd020f46d --long
# OR heat deployment-show 82627d0c-87c4-43b6-b3a9-51bcd020f46d

+---------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Field | Value |
+---------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| id | 82627d0c-87c4-43b6-b3a9-51bcd020f46d |
| server_id | 2048ae49-c5a0-40d1-9efa-9816148f8734 |
| config_id | c6a0d957-54e6-4ad0-a8fc-65ec2d4633af |
| creation_time | 2017-02-12T07:31:10Z |
| updated_time | 2017-02-12T07:32:38Z |
| status | FAILED |
| status_reason | deploy_status_code : Deployment exited with non-zero status code: 6 |
| input_values | {u'step': 1, u'update_identifier': u'1486884195'} |
| action | CREATE |
| output_values | {u'deploy_stdout': u"Matching apachectl 'Server version: Apache/2.4.6 (Red Hat Enterprise Linux)\nServer built: Aug 3 2016 08:33:27' |
| | Notice: Scope(Class[Tripleo::Firewall::Post]): At this stage, all network traffic is blocked. |
| | Notice: Compiled catalog for overcloud-controller-0.localdomain in environment production in 5.90 seconds |
| | Notice: /Stage[setup]/Firewall::Linux::Redhat/Exec[/usr/bin/systemctl daemon-reload]/returns: executed successfully |
| | Notice: /Stage[setup]/Firewall::Linux::Redhat/Service[iptables]/ensure: ensure changed 'stopped' to 'running' |
| | Notice: /Stage[setup]/Tripleo::Firewall::Pre/Tripleo::Firewall::Rule[002 accept all to lo interface]/Firewall[002 accept all to lo interface]/ensure: |
| | Error: /sbin/pcs cluster start --all returned 1 instead of one of [0] |
...

We can finally see in the ‘output_values’ field what exactly the puppet modules executed and why the deployment failed:

| | Error: /sbin/pcs cluster start --all returned 1 instead of one of [0] |

If the output is too long for you, try grepping for “Error” with several lines of context before and after:

> openstack software deployment show 82627d0c-87c4-43b6-b3a9-51bcd020f46d --long | grep -A 3 -B 3 "Error"

Also, don’t forget to use the ‘--long’ flag to see the full output. Without it, you might miss the actual cause of the failure.
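If the table wrapping gets in the way, another option (assuming your client supports the standard ‘-f json’ formatter) is to dump the deployment as JSON and print only the script output with a short Python one-liner:

```shell
# Print only deploy_stdout, with real newlines instead of table wrapping.
# The field names match the table shown above.
openstack software deployment show 82627d0c-87c4-43b6-b3a9-51bcd020f46d --long -f json \
  | python -c 'import json,sys; print(json.load(sys.stdin)["output_values"]["deploy_stdout"])'
```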

To conclude, I’ll just say: use ‘openstack stack failures list’; life is too short.

4 thoughts on “TripleO: Debugging Overcloud Deployment Failure”

  1. Thanks Arie, this is a nice article indeed.
    When I run “openstack stack failures list overcloud” it shows me the error below:
    (undercloud) [stack@interop010 ~]$ openstack stack failures list overcloud
    overcloud.AllNodesDeploySteps.ComputeDeployment_Step1.0:
    resource_type: OS::Heat::StructuredDeployment
    physical_resource_id: 2820c220-e83f-402b-b691-17b066e63ad5
    status: CREATE_FAILED
    status_reason: |
    Error: resources[0]: Deployment to server failed: deploy_status_code : Deployment exited with non-zero status code: 2
    deploy_stdout: |

    “2017-12-21 13:39:19,703 ERROR: 17321 -- ERROR configuring crond”,
    “2017-12-21 13:39:19,703 ERROR: 17321 -- ERROR configuring nova_libvirt”
    ],
    “failed_when_result”: true
    }
    to retry, use: --limit @/var/lib/heat-config/heat-config-ansible/e9185878-81e0-4914-a2b8-8516ea6fd613_playbook.retry

    What does it mean? How do I fix it to have a successful deployment?

    thanks,
    IK


  2. “2017-12-21 13:39:19,703 ERROR: 17321 -- ERROR configuring crond”,
    “2017-12-21 13:39:19,703 ERROR: 17321 -- ERROR configuring nova_libvirt”

    Seems your template has the issue.

