Discussion:
[Linux-ha-dev] TOTEM implementation error (SLES11 SP2)?
Ulrich Windl
2013-02-25 14:26:36 UTC
Hello,

I'm wondering about these messages:

Feb 25 14:53:31 so4 corosync[12457]: [TOTEM ] Retransmit List: 2a5 28b 28d 28e 295 296 297 298 299 29a 29b 29c 29d 29e 29f 2a0 2a1 2a2 2a3 2a4 2a6
Feb 25 14:53:31 so4 corosync[12457]: [TOTEM ] Retransmit List: 2a6 28b 28d 28e 295 296 297 298 299 29a 29b 29c 29d 29e 29f 2a0 2a1 2a2 2a3 2a4 2a5
Feb 25 14:53:31 so4 corosync[12457]: [TOTEM ] Retransmit List: 2a5 28b 28d 28e 295 296 297 298 299 29a 29b 29c 29d 29e 29f 2a0 2a1 2a2 2a3 2a4 2a6
Feb 25 14:53:32 so4 corosync[12457]: [TOTEM ] Retransmit List: 2a6 28b 28d 28e 295 296 297 298 299 29a 29b 29c 29d 29e 29f 2a0 2a1 2a2 2a3 2a4 2a5
Feb 25 14:53:32 so4 corosync[12457]: [TOTEM ] Retransmit List: 2a5 28b 28d 28e 295 296 297 298 299 29a 29b 29c 29d 29e 29f 2a0 2a1 2a2 2a3 2a4 2a6
Feb 25 14:53:32 so4 corosync[12457]: [TOTEM ] Retransmit List: 2a6 28b 28d 28e 295 296 297 298 299 29a 29b 29c 29d 29e 29f 2a0 2a1 2a2 2a3 2a4 2a5
Feb 25 14:53:32 so4 corosync[12457]: [TOTEM ] Retransmit List: 2a5 28b 28d 28e 295 296 297 298 299 29a 29b 29c 29d 29e 29f 2a0 2a1 2a2 2a3 2a4 2a6
Feb 25 14:53:32 so4 corosync[12457]: [TOTEM ] Retransmit List: 2a6 28b 28d 28e 295 296 297 298 299 29a 29b 29c 29d 29e 29f 2a0 2a1 2a2 2a3 2a4 2a5
Feb 25 14:53:32 so4 corosync[12457]: [TOTEM ] Retransmit List: 2a5 28b 28d 28e 295 296 297 298 299 29a 29b 29c 29d 29e 29f 2a0 2a1 2a2 2a3 2a4 2a6
Feb 25 14:53:32 so4 corosync[12457]: [TOTEM ] Retransmit List: 2a6 28b 28d 28e 295 296 297 298 299 29a 29b 29c 29d 29e 29f 2a0 2a1 2a2 2a3 2a4 2a5
Feb 25 14:53:32 so4 corosync[12457]: [TOTEM ] Retransmit List: 2a5 28b 28d 28e 295 296 297 298 299 29a 29b 29c 29d 29e 29f 2a0 2a1 2a2 2a3 2a4 2a6
Feb 25 14:53:32 so4 corosync[12457]: [TOTEM ] Retransmit List: 2a6 28b 28d 28e 295 296 297 298 299 29a 29b 29c 29d 29e 29f 2a0 2a1 2a2 2a3 2a4 2a5
Feb 25 14:53:32 so4 corosync[12457]: [TOTEM ] Retransmit List: 2a5 28b 28d 28e 295 296 297 298 299 29a 29b 29c 29d 29e 29f 2a0 2a1 2a2 2a3 2a4 2a6
Feb 25 14:53:32 so4 corosync[12457]: [TOTEM ] Retransmit List: 2a6 28b 28d 28e 295 296 297 298 299 29a 29b 29c 29d 29e 29f 2a0 2a1 2a2 2a3 2a4 2a5
Feb 25 14:53:32 so4 corosync[12457]: [TOTEM ] Retransmit List: 2a5 28b 28d 28e 295 296 297 298 299 29a 29b 29c 29d 29e 29f 2a0 2a1 2a2 2a3 2a4 2a6
Feb 25 14:53:32 so4 corosync[12457]: [TOTEM ] Retransmit List: 2a6 28b 28d 28e 295 296 297 298 299 29a 29b 29c 29d 29e 29f 2a0 2a1 2a2 2a3 2a4 2a5
Feb 25 14:53:32 so4 corosync[12457]: [TOTEM ] Retransmit List: 2a5 28b 28d 28e 295 296 297 298 299 29a 29b 29c 29d 29e 29f 2a0 2a1 2a2 2a3 2a4 2a6
Feb 25 14:53:32 so4 corosync[12457]: [TOTEM ] Retransmit List: 2a6 28b 28d 28e 295 296 297 298 299 29a 29b 29c 29d 29e 29f 2a0 2a1 2a2 2a3 2a4 2a5
Feb 25 14:53:32 so4 corosync[12457]: [TOTEM ] Retransmit List: 2a5 28b 28d 28e 295 296 297 298 299 29a 29b 29c 29d 29e 29f 2a0 2a1 2a2 2a3 2a4 2a6
Feb 25 14:53:32 so4 corosync[12457]: [TOTEM ] Retransmit List: 2a6 28b 28d 28e 295 296 297 298 299 29a 29b 29c 29d 29e 29f 2a0 2a1 2a2 2a3 2a4 2a5
Feb 25 14:53:32 so4 corosync[12457]: [TOTEM ] Retransmit List: 2a5 28b 28d 28e 295 296 297 298 299 29a 29b 29c 29d 29e 29f 2a0 2a1 2a2 2a3 2a4 2a6

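If you look at the first and the last item of each retransmit list, the entries 2a5 and 2a6 simply swap places from one message to the next while everything in between stays the same. So this obviously cannot be a ring buffer (which is what I was expecting). To me it looks like an implementation error.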
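
To make that observation checkable, here is a minimal sketch in Python. It assumes the list entries are hexadecimal sequence numbers, which the values suggest but the log itself does not state. The function names are made up for illustration; the sample line is copied from the log above.

```python
import re

# Matches the payload of a corosync "[TOTEM ] Retransmit List: ..." line.
RETRANSMIT_RE = re.compile(r"\[TOTEM\s*\] Retransmit List:\s*(.*)$")


def parse_retransmit_list(line):
    """Return the entries as integers, or None if the line does not match.

    Assumption: the entries are hexadecimal sequence numbers.
    """
    match = RETRANSMIT_RE.search(line)
    if match is None:
        return None
    return [int(token, 16) for token in match.group(1).split()]


def out_of_order_positions(seqs):
    """Return the indexes whose entry is smaller than the one before it."""
    return [i for i in range(1, len(seqs)) if seqs[i] < seqs[i - 1]]


sample = (
    "Feb 25 14:53:31 so4 corosync[12457]: [TOTEM ] Retransmit List: "
    "2a5 28b 28d 28e 295 296 297 298 299 29a 29b 29c 29d 29e 29f "
    "2a0 2a1 2a2 2a3 2a4 2a6"
)

seqs = parse_retransmit_list(sample)
print(len(seqs), "entries; out-of-order positions:", out_of_order_positions(seqs))
```

For this sample only the first entry breaks the ascending order; everything after it is sorted.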

Those messages appear and disappear for no apparent reason. Maybe the cause is having two independent rings combined with poor logging. Here is how the situation switches:

Feb 25 14:54:18 so4 corosync[12457]: [TOTEM ] Retransmit List: 2a5 28b 28d 28e 295 296 297 298 299 29a 29b 29c 29d 29e 29f 2a0 2a1 2a2 2a3 2a4 2a6
Feb 25 14:54:18 so4 corosync[12457]: [TOTEM ] Retransmit List: 2a6 28b 28d 28e 295 296 297 298 299 29a 29b 29c 29d 29e 29f 2a0 2a1 2a2 2a3 2a4 2a5
Feb 25 14:54:18 so4 corosync[12457]: [TOTEM ] Retransmit List: 2a5 28b 28d 28e 295 296 297 298 299 29a 29b 29c 29d 29e 29f 2a0 2a1 2a2 2a3 2a4 2a6
Feb 25 14:54:18 so4 corosync[12457]: [TOTEM ] Retransmit List: 3d 3e 3f 40 41 42 43 44 45 46
Feb 25 14:54:18 so4 corosync[12457]: [TOTEM ] Retransmit List: 1f 20 21 22 23 24 25 26 27 28
Feb 25 14:54:18 so4 corosync[12457]: [TOTEM ] Retransmit List: 3d 3e 3f 40 41 42 43 44 45 46
Feb 25 14:54:18 so4 corosync[12457]: [TOTEM ] Retransmit List: 1f 20 21 22 23 24 25 26 27 28
Feb 25 14:54:18 so4 corosync[12457]: [TOTEM ] Retransmit List: 3d 3e 3f 40 41 42 43 44 45 46
Feb 25 14:54:18 so4 corosync[12457]: [TOTEM ] Retransmit List: 1f 20 21 22 23 24 25 26 27 28
Feb 25 14:54:18 so4 corosync[12457]: [TOTEM ] Retransmit List: 3d 3e 3f 40 41 42 43 44 45 46
Feb 25 14:54:18 so4 corosync[12457]: [TOTEM ] Retransmit List: 1f 20 21 22 23 24 25 26 27 28
Feb 25 14:54:18 so4 corosync[12457]: [TOTEM ] Retransmit List: 3d 3e 3f 40 41 42 43 44 45 46
Feb 25 14:54:18 so4 corosync[12457]: [TOTEM ] Retransmit List: 1f 20 21 22 23 24 25 26 27 28

I doubt the network can have as many problems as TOTEM reports:

[...]
Feb 25 14:54:32 so4 corosync[12457]: [TOTEM ] Retransmit List: 780
Feb 25 14:54:32 so4 corosync[12457]: [TOTEM ] Retransmit List: 780
Feb 25 14:54:32 so4 corosync[12457]: [TOTEM ] Retransmit List: 782
Feb 25 14:54:32 so4 corosync[12457]: [TOTEM ] Retransmit List: 784
Feb 25 14:54:32 so4 corosync[12457]: [TOTEM ] Retransmit List: 784
Feb 25 14:54:32 so4 corosync[12457]: [TOTEM ] Retransmit List: 786
Feb 25 14:54:32 so4 corosync[12457]: [TOTEM ] Retransmit List: 786
Feb 25 14:54:32 so4 corosync[12457]: [TOTEM ] Marking ringid 1 interface 192.168.0.64 FAULTY
Feb 25 14:54:32 so4 corosync[12457]: [TOTEM ] Retransmit List: 788
Feb 25 14:54:32 so4 corosync[12457]: [TOTEM ] Retransmit List: 789
Feb 25 14:54:32 so4 corosync[12457]: [TOTEM ] Retransmit List: 78c
Feb 25 14:54:32 so4 corosync[12457]: [TOTEM ] Automatically recovered ring 1
Feb 25 14:54:32 so4 corosync[12457]: [TOTEM ] Automatically recovered ring 1
Feb 25 14:54:32 so4 corosync[12457]: [TOTEM ] Automatically recovered ring 1
Feb 25 14:54:32 so4 corosync[12457]: [TOTEM ] Retransmit List: 79a
Feb 25 14:54:32 so4 corosync[12457]: [TOTEM ] Retransmit List: 79c
Feb 25 14:54:32 so4 corosync[12457]: [TOTEM ] Retransmit List: 79c
Feb 25 14:54:32 so4 corosync[12457]: [TOTEM ] Retransmit List: 79e
Feb 25 14:54:32 so4 corosync[12457]: [TOTEM ] Retransmit List: 79e
Feb 25 14:54:32 so4 corosync[12457]: [TOTEM ] Retransmit List: 7a0
Feb 25 14:54:32 so4 corosync[12457]: [TOTEM ] Retransmit List: 7a2
[...]

# grep "Retransmit List" /var/log/messages | wc -l
5504

(All within less than an hour, while some nodes were booting.)
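
The raw line count does not show how many distinct lists are behind it. A rough sketch for tallying them, assuming the syslog format shown above; the function name is made up for illustration and the sample lines are copied from the excerpts:

```python
from collections import Counter


def tally_retransmit_lists(lines):
    """Count how often each distinct retransmit list occurs in the lines."""
    marker = "Retransmit List:"
    counts = Counter()
    for line in lines:
        _, found, rest = line.partition(marker)
        if found:
            # Normalize whitespace so identical lists compare equal.
            counts[" ".join(rest.split())] += 1
    return counts


sample_lines = [
    "Feb 25 14:54:18 so4 corosync[12457]: [TOTEM ] Retransmit List: 3d 3e 3f 40 41 42 43 44 45 46",
    "Feb 25 14:54:18 so4 corosync[12457]: [TOTEM ] Retransmit List: 1f 20 21 22 23 24 25 26 27 28",
    "Feb 25 14:54:18 so4 corosync[12457]: [TOTEM ] Retransmit List: 3d 3e 3f 40 41 42 43 44 45 46",
    "Feb 25 14:54:18 so4 corosync[12457]: [TOTEM ] Retransmit List: 1f 20 21 22 23 24 25 26 27 28",
    "Feb 25 14:54:32 so4 corosync[12457]: [TOTEM ] Marking ringid 1 interface 192.168.0.64 FAULTY",
]

counts = tally_retransmit_lists(sample_lines)
for entries, n in counts.most_common():
    print(n, entries)
```

To run it on the real log, pass an open file instead of the sample, e.g. `tally_retransmit_lists(open("/var/log/messages"))`.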

Regards,
Ulrich
Lars Marowsky-Bree
2013-02-26 10:54:16 UTC
Post by Ulrich Windl
Hello,
Feb 25 14:53:31 so4 corosync[12457]: [TOTEM ] Retransmit List: 2a5 28b 28d 28e 295 296 297 298 299 29a 29b 29c 29d 29e 29f 2a0 2a1 2a2 2a3 2a4 2a6
That has nothing to do with Linux HA; this belongs to the corosync list.

Or, as always, to support, if you want to have it fixed in our product
;-)

It's a corosync issue affecting some network environments that we are
actively tracing. It sometimes happens even with one ring, and it
persists even in 1.4.5. Alas. If you report it to support, we can add
that environment as a data point.


Regards,
Lars
--
Architect Storage/HA
SUSE LINUX Products GmbH, GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer, HRB 21284 (AG Nürnberg)
"Experience is the name everyone gives to their mistakes." -- Oscar Wilde