[Linux-ha-dev] Problem in SLES11 SP2 (actions on removed resources)?

Ulrich Windl

2013-04-19 07:56:37 UTC

Hi!

I have some strange problems with the current update of the cluster software in SLES11 SP2 (I didn't see such problems before the update):

sbd monitoring went crazy (reporting running sbds when there were none, and complaining about being unable to stop sbd when there was none), so I stopped it.

Now that I re-activated it, the cluster talks about resources that had been deleted days ago, like:
---
Apr 19 08:56:19 h05 attrd: [13083]: notice: attrd_local_callback: Sending full refresh (origin=crmd)
Apr 19 08:56:19 h05 attrd: [13083]: notice: attrd_trigger_update: Sending flush op to all hosts for: last-failure-prm_stonith_sbd (1365148953)
Apr 19 08:56:19 h05 cib: [13080]: info: cib_process_request: Operation complete: op cib_delete for section //node_state[@uname='h05']/lrm (origin=local/crmd/6835, version=0.744.19): ok (rc=0)
Apr 19 08:56:19 h05 crmd: [13085]: info: abort_transition_graph: te_update_diff:320 - Triggered transition abort (complete=1, tag=lrm_rsc_op, id=prm_v06_v06_raid1_last_0, magic=0:7;117:15:7:de539cd3-5895-4bcd-a388-ebad29a7b63d, cib=0.744.19) : Resource op removal
---
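For reading the "magic" value in the last log line: judging from the lrm_rsc_op attributes in the CIB, it appears to be op-status and rc-code joined by a colon, followed by a semicolon and the transition key. The following is only a minimal sketch of how one could split it; the function name is made up, and the meaning of the individual transition-key fields (action id, transition id, expected rc, UUID) is my assumption, not something the logs confirm.

```python
def parse_transition_magic(magic):
    """Split a value like '0:7;117:15:7:<uuid>' into its parts.

    Assumed layout (hypothetical helper, not a Pacemaker API):
      <op-status>:<rc-code>;<action-id>:<transition-id>:<target-rc>:<uuid>
    """
    status_part, transition_key = magic.split(";", 1)
    op_status, rc_code = (int(x) for x in status_part.split(":"))
    # The UUID contains hyphens but no colons, so three splits are enough.
    action_id, transition_id, target_rc, uuid = transition_key.split(":", 3)
    return {
        "op_status": op_status,
        "rc_code": rc_code,
        "transition_key": transition_key,
        "action_id": int(action_id),        # assumption
        "transition_id": int(transition_id),  # assumption
        "target_rc": int(target_rc),        # assumption
        "uuid": uuid,
    }


parsed = parse_transition_magic(
    "0:7;117:15:7:de539cd3-5895-4bcd-a388-ebad29a7b63d"
)
```

If that reading is right, the removed entry was a probe (monitor with interval 0) that returned rc 7, which is the OCF code for "not running".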

The resource prm_v06_v06_raid1 had been removed several days before in:
Apr 15 10:08:16 h05 cib: [13080]: info: cib_replace_notify: Replaced: 0.733.19 -> 0.734.1 from <null>

Interestingly, a CIB dump taken minutes before the SBD change showed that the deleted resource still had an "lrm_resource" entry in the CIB:
---
<lrm_resource id="prm_v06_v06_raid1" type="Raid1" class="ocf" provider="heartbeat">
<lrm_rsc_op id="prm_v06_v06_raid1_last_0" operation_key="prm_v06_v06_raid1_monitor_0" operation="monitor" crm-debug-origin="build_active_RAs" crm_feature_set="3.0.6" transition-key="117:15:7:de539cd3-5895-4bcd-a388-ebad29a7b63d" transition-magic="0:7;117:15:7:de539cd3-5895-4bcd-a388-ebad29a7b63d" call-id="76" rc-code="7" op-status="0" interval="0" op-digest="0e6b2558abfd3cee98ee60cb7b03e6b0"/>
---
And the resource should have been removed before:
Apr 15 13:14:00 h05 crmd: [13085]: info: abort_transition_graph: te_update_diff:320 - Triggered transition abort (complete=1, tag=lrm_rsc_op, id=prm_v06_v06_raid1_last_0, magic=0:7;117:15:7:de539cd3-5895-4bcd-a388-ebad29a7b63d, cib=0.735.35) : Resource op removal
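To see whether other deleted resources have left entries behind as well, one could compare the status section of a CIB dump against the configuration section. This is only a rough sketch under some assumptions: the function name and the SAMPLE_CIB document are made up for illustration, and I assume the usual layout with resources under configuration/resources and lrm_resource elements under status/node_state/lrm/lrm_resources.

```python
import xml.etree.ElementTree as ET

# Tags that define resources in the configuration section (assumed list).
RESOURCE_TAGS = ("primitive", "group", "clone", "master")


def leftover_lrm_resources(cib_xml):
    """Return {node name: [lrm_resource ids with no configured resource]}."""
    root = ET.fromstring(cib_xml)

    configured = set()
    resources = root.find("./configuration/resources")
    if resources is not None:
        for elem in resources.iter():
            if elem.tag in RESOURCE_TAGS and elem.get("id"):
                configured.add(elem.get("id"))

    leftovers = {}
    for node in root.iterfind("./status/node_state"):
        uname = node.get("uname", "unknown")
        for res in node.iterfind("./lrm/lrm_resources/lrm_resource"):
            rid = res.get("id", "")
            # Clone instances may be recorded as "<id>:<n>"; compare the base id too.
            base = rid.split(":")[0]
            if rid not in configured and base not in configured:
                leftovers.setdefault(uname, []).append(rid)
    return leftovers


# Hypothetical, heavily shortened CIB for illustration only.
SAMPLE_CIB = """
<cib>
  <configuration>
    <resources>
      <primitive id="prm_stonith_sbd" class="stonith" type="external/sbd"/>
    </resources>
  </configuration>
  <status>
    <node_state uname="h05">
      <lrm>
        <lrm_resources>
          <lrm_resource id="prm_stonith_sbd" class="stonith" type="external/sbd"/>
          <lrm_resource id="prm_v06_v06_raid1" type="Raid1" class="ocf" provider="heartbeat"/>
        </lrm_resources>
      </lrm>
    </node_state>
  </status>
</cib>
"""

leftovers = leftover_lrm_resources(SAMPLE_CIB)
```

On a real cluster the input would be a dump such as the output of "cibadmin -Q" saved to a file and read in as a string.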

Isn't this very strange, or is there a reasonable explanation?

Regards,
Ulrich