[Linux-ha-dev] Problem in SLES11 SP2 (actions on removed resources)?
Ulrich Windl
2013-04-19 07:56:37 UTC
Hi!

I have some strange problems with the current update of the cluster software in SLES11 SP2 (I didn't see such problems before the update):

sbd monitoring went crazy (reporting running sbds when there were none, complaining about the inability to stop sbd when there was none), so I stopped it.

Now that I re-activated it, the cluster talks about resources that had been deleted days ago, like:
---
Apr 19 08:56:19 h05 attrd: [13083]: notice: attrd_local_callback: Sending full refresh (origin=crmd)
Apr 19 08:56:19 h05 attrd: [13083]: notice: attrd_trigger_update: Sending flush op to all hosts for: last-failure-prm_stonith_sbd (1365148953)
Apr 19 08:56:19 h05 cib: [13080]: info: cib_process_request: Operation complete: op cib_delete for section //node_state[@uname='h05']/lrm (origin=local/crmd/6835, version=0.744.19): ok (rc=0)
Apr 19 08:56:19 h05 crmd: [13085]: info: abort_transition_graph: te_update_diff:320 - Triggered transition abort (complete=1, tag=lrm_rsc_op, id=prm_v06_v06_raid1_last_0, magic=0:7;117:15:7:de539cd3-5895-4bcd-a388-ebad29a7b63d, cib=0.744.19) : Resource op removal
---

The resource prm_v06_v06_raid1 had been removed several days before in:
Apr 15 10:08:16 h05 cib: [13080]: info: cib_replace_notify: Replaced: 0.733.19 -> 0.734.1 from <null>

Interestingly, a CIB dump taken minutes before the sbd change showed that the deleted resource still had an "lrm_resource" entry in the CIB:
---
<lrm_resource id="prm_v06_v06_raid1" type="Raid1" class="ocf" provider="heartbeat">
<lrm_rsc_op id="prm_v06_v06_raid1_last_0" operation_key="prm_v06_v06_raid1_monitor_0" operation="monitor" crm-debug-origin="build_active_RAs" crm_feature_set="3.0.6" transition-key="117:15:7:de539cd3-5895-4bcd-a388-ebad29a7b63d" transition-magic="0:7;117:15:7:de539cd3-5895-4bcd-a388-ebad29a7b63d" call-id="76" rc-code="7" op-status="0" interval="0" op-digest="0e6b2558abfd3cee98ee60cb7b03e6b0"/>
---
And the resource should have been removed before:
Apr 15 13:14:00 h05 crmd: [13085]: info: abort_transition_graph: te_update_diff:320 - Triggered transition abort (complete=1, tag=lrm_rsc_op, id=prm_v06_v06_raid1_last_0, magic=0:7;117:15:7:de539cd3-5895-4bcd-a388-ebad29a7b63d, cib=0.735.35) : Resource op removal
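
For reading these log lines, the magic= value can be matched against the lrm_rsc_op attributes in the dump above: there, transition-magic is op-status and rc-code followed by the transition-key. A minimal Python sketch that splits it, assuming that layout; the field names action_id, transition_id, target_rc and uuid are my reading of the key, not something stated in the logs, and parse_magic is a hypothetical helper:

```python
from typing import NamedTuple


class TransitionMagic(NamedTuple):
    op_status: int      # matches op-status="0" in the dump
    rc_code: int        # matches rc-code="7" (7 is OCF "not running")
    action_id: int      # assumed meaning of the 1st key field
    transition_id: int  # assumed meaning of the 2nd key field
    target_rc: int      # assumed meaning of the 3rd key field
    uuid: str           # trailing UUID of the transition key


def parse_magic(magic: str) -> TransitionMagic:
    """Split 'op_status:rc;action:transition:target_rc:uuid'."""
    status_part, key = magic.split(";", 1)
    op_status, rc_code = status_part.split(":")
    action_id, transition_id, target_rc, uuid = key.split(":", 3)
    return TransitionMagic(int(op_status), int(rc_code), int(action_id),
                           int(transition_id), int(target_rc), uuid)


if __name__ == "__main__":
    print(parse_magic("0:7;117:15:7:de539cd3-5895-4bcd-a388-ebad29a7b63d"))
```

Read this way, the entry is a probe (monitor_0) that returned 7 and was expected to return 7, i.e. the resource was found not running, as intended.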

Isn't this very strange, or is there a reasonable explanation?

Regards,
Ulrich
Lars Marowsky-Bree
2013-04-19 08:22:06 UTC
Post by Ulrich Windl
sbd monitoring went crazy (reporting running sbds when there were none, complaining about the inability to stop sbd when there was none), so I stopped it.
What did you monitor? And what do you mean by "went crazy"?

(Besides, monitoring sbd is unnecessary anyway.)
Hm, is this creating an actual problem? The status section may have
records about orphan resources, but that should be harmless. (I think a
recent change made this better, too.)
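
To check which status records are orphans of this kind, a rough Python sketch could compare the lrm_resource entries in the status section against the primitives in the configuration section of a CIB dump (for example the output of cibadmin -Q). The function name orphaned_lrm_resources is hypothetical, and stripping a ":N" suffix to handle clone instances is an assumption about how instance ids are written:

```python
import xml.etree.ElementTree as ET
from typing import List, Tuple


def orphaned_lrm_resources(cib_xml: str) -> List[Tuple[str, str]]:
    """Return (node uname, lrm_resource id) pairs that have a status
    record but no matching primitive in the configuration."""
    root = ET.fromstring(cib_xml)  # root element is expected to be <cib>
    configured = {
        p.get("id")
        for p in root.iterfind("./configuration/resources//primitive")
    }
    orphans = []
    for node in root.iterfind("./status/node_state"):
        for rsc in node.iterfind("./lrm/lrm_resources/lrm_resource"):
            rsc_id = rsc.get("id", "")
            # Assumption: clone instances appear as "name:N".
            base_id = rsc_id.split(":", 1)[0]
            if base_id not in configured:
                orphans.append((node.get("uname"), rsc_id))
    return orphans
```

Usage would be along the lines of orphaned_lrm_resources(open("cib.xml").read()), where cib.xml is a saved dump. This only reports the leftover entries; it does not change the CIB.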


Regards,
Lars
--
Architect Storage/HA
SUSE LINUX Products GmbH, GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer, HRB 21284 (AG Nürnberg)
"Experience is the name everyone gives to their mistakes." -- Oscar Wilde