[Linux-ha-dev] RFC: pidfile handling; current worst case: stop failure and node level fencing

Discussion:

Lars Ellenberg

2014-10-20 19:17:29 UTC

Recent discussions with Dejan made me, once again, more acutely aware of a
few issues we probably all know about, but usually dismiss as not having
much relevance in the real world.

The facts:

* a pidfile typically only stores a pid
* a pidfile may be "stale", that is, not properly cleaned up
when the process it references died.
* pids are recycled

This is more of an issue if kernel.pid_max is small
relative to the number of processes created per unit time,
for example on some embedded systems,
or on some very busy systems.

But it may be an issue on any system,
even a mostly idle one, given "bad luck^W timing",
see below.

A common idiom in resource agents is to

kill_that_pid_and_wait_until_dead()
{
    local pid=$1
    is_alive $pid || return 0
    kill -TERM $pid
    while is_alive $pid ; do sleep 1; done
    return 0
}

The naïve implementation of is_alive() is
is_alive() { kill -0 $1 ; }

This is the main issue:
-----------------------

If the last-used-pid is just a bit smaller than $pid,
then during the sleep 1, $pid may die,
and the OS may already have created a new process with that exact pid.

Using the above "is_alive", kill_that_pid() will not notice that the
to-be-killed pid has actually terminated for as long as that new process
runs, which may be a very long time if it is some other long-running daemon.

This may result in stop failure and resulting node level fencing.

The question is: what better way do we have to detect whether some pid
died after we killed it? Or, related, and even better: how do we detect
whether the process currently running with some pid is in fact still the
process referenced by the pidfile?

I have two suggestions.

(I am trying to avoid bashisms in here,
but maybe I overlooked some.
Also, the code is typed, not sourced from some working script,
so there may be logic bugs and typos.
My intent should be obvious enough, though.)

using "cd /proc/$pid; stat ."
-----------------------------

# this is most likely Linux-specific
kill_that_pid_and_wait_until_dead()
{
    local pid=$1
    (
        cd /proc/$pid || return 0
        kill -TERM $pid
        while stat . >/dev/null 2>&1 ; do sleep 1; done
    )
    return 0
}

Once pid dies, /proc/$pid will become stale (but not completely go away,
because it is our cwd), and stat . will return "No such process".

Variants:

using test -ef
--------------

exec 7</proc/$pid || return 0
kill -TERM $pid
while :; do
exec 8</proc/$pid || break
test /proc/self/fd/7 -ef /proc/self/fd/8 || break
sleep 1
done
exec 7<&- 8<&-

using stat -c %Y /proc/$pid
---------------------------

# note: %Y is the time of last modification (mtime)
mtime0=$(stat -c %Y /proc/$pid)
kill -TERM $pid
while mtime=$(stat -c %Y /proc/$pid) && [ "$mtime" = "$mtime0" ] ; do sleep 1; done

"Why not use the inode number?", I hear you say.
Because it is not stable. Sorry.
Don't believe me? Don't want to read kernel source?
Try it yourself:

sleep 120 & k=$!
stat /proc/$k
echo 3 > /proc/sys/vm/drop_caches
stat /proc/$k

But that leads me to another proposal:
store the starttime together with the pid in a pidfile.

For Linux that would be:

(see proc(5) for /proc/pid/stat field meanings.
note that (comm) may contain both whitespace and ")",
which is the reason for my sed | cut below)

spawn_create_exclusive_pid_starttime()
{
    local pidfile=$1
    shift
    local reset
    # with set -C (noclobber), the redirection below fails if the pidfile exists
    case $- in *C*) reset=":";; *) set -C; reset="set +C";; esac
    if ! exec 3>"$pidfile" ; then
        $reset
        return 1
    fi

    $reset
    # double quotes inside the script, because the script itself is single-quoted
    setsid sh -c '
        read pid _ < /proc/self/stat
        starttime=$(sed -e "s/^.*) //" /proc/$pid/stat | cut -d" " -f 20)
        >&3 echo $pid $starttime
        3>&- exec "$@"
    ' -- "$@" &
    return 0
}
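
To make the sed | cut parsing above easier to check in isolation, here is a
small sketch. The helper name stat_line_starttime and the sample line are
made up for illustration; only the field layout is taken from proc(5).
The point is that the greedy sed match strips everything up to the last ") ",
so a comm containing whitespace or ")" cannot shift the field positions.

```sh
# Hypothetical helper: print the starttime (field 22 in proc(5)) from a
# /proc/<pid>/stat line passed as $1.
stat_line_starttime()
{
    # The greedy ".*" removes "pid (comm) " up to the LAST ") " in the line.
    # What remains starts at field 3 (state), so starttime is the 20th
    # remaining field.
    printf '%s\n' "$1" | sed -e 's/^.*) //' | cut -d' ' -f 20
}

# Made-up sample line; the comm "evil) name" contains both a space and a ")".
sample='1234 (evil) name) S 1 1234 1234 0 -1 4194560 100 0 0 0 1 2 0 0 20 0 1 0 987654 1000000 200 18446744073709551615'
stat_line_starttime "$sample"
```

A naive "cut -d' ' -f 22" on the raw line would pick the wrong field for such
a comm, which is why the prefix is stripped first.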

It does not seem possible to cycle through all available pids
within fractions of time smaller than the granularity of starttime,
so "pid starttime" should be a unique tuple (until the next reboot --
at least on linux, starttime is measured as strictly monotonic "uptime").
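
As a rough back-of-the-envelope check of that claim: proc(5) documents
starttime as being expressed in clock ticks, so recycling a pid within one
tick would need roughly pid_max forks per tick. The sketch below just does
that multiplication; the helper name min_wrap_fork_rate and the fallback
values 32768 and 100 are assumptions for illustration, not measurements.

```sh
# Hypothetical helper: forks per second needed to cycle through the whole
# pid space within a single starttime tick.
#   $1 = pid_max, $2 = clock ticks per second
min_wrap_fork_rate()
{
    echo $(( $1 * $2 ))
}

# Use the local values where available; the fallbacks are common defaults,
# assumed here only so the example runs everywhere.
pid_max=$(cat /proc/sys/kernel/pid_max 2>/dev/null || echo 32768)
ticks=$(getconf CLK_TCK 2>/dev/null || echo 100)
echo "pid_max=$pid_max ticks_per_second=$ticks"
echo "fork rate needed: $(min_wrap_fork_rate "$pid_max" "$ticks") per second"
```

With pid_max=32768 and 100 ticks per second, that is about 3.3 million
forks per second, sustained, just to hit the same pid within the same tick.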

If we have "pid starttime" in the pidfile,
we can:

get_proc_pid_starttime()
{
    # uses $pid and sets $proc_pid_starttime, both local to the caller
    proc_pid_starttime=$(sed -e 's/^.*) //' /proc/$pid/stat 2>/dev/null) || return 1
    proc_pid_starttime=$(echo "$proc_pid_starttime" | cut -d' ' -f 20)
}

kill_using_pidfile()
{
    local pidfile=$1
    local pid starttime proc_pid_starttime

    test -e "$pidfile" || return 0 # already dead
    read pid starttime <"$pidfile" || return # unreadable

    # check pid and starttime are both present, numeric only, ...
    # I have a version that distinguishes 16 distinct error
    # conditions; this is the short version only...

    local i=0
    while
        get_proc_pid_starttime &&
        [ "$starttime" = "$proc_pid_starttime" ]
    do
        : $(( i+=1 ))
        [ $i = 1 ] && kill -TERM $pid
        # MAYBE # [ $i = 30 ] && kill -KILL $pid
        sleep 1
    done

    # it's not (or no longer) the process we were looking for;
    # remove that pidfile.

    rm -f "$pidfile"
}
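
The same "pid starttime" comparison could also back a status or monitor
check. The following is only a sketch under the assumptions above (Linux
/proc, pidfile containing "pid starttime" on one line); the function name
pidfile_matches_process is hypothetical and not part of any existing
resource agent library.

```sh
# Hypothetical helper: return 0 if the "pid starttime" pair recorded in the
# pidfile $1 still matches a live process, 1 otherwise. Linux only.
pidfile_matches_process()
{
    local pidfile=$1
    local pid starttime current

    [ -r "$pidfile" ] || return 1
    read -r pid starttime < "$pidfile" || return 1

    # Reject empty or non-numeric fields before touching /proc.
    case $pid in ''|*[!0-9]*) return 1;; esac
    case $starttime in ''|*[!0-9]*) return 1;; esac

    # Same parsing as above: strip "pid (comm) ", then take field 20.
    current=$(sed -e 's/^.*) //' "/proc/$pid/stat" 2>/dev/null | cut -d' ' -f 20)
    [ "$current" = "$starttime" ]
}
```

A recycled pid would show a different starttime, so this returns 1 for a
stale pidfile even if some unrelated process now owns the pid.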

In other OSes, ps may be able to give a good enough equivalent?

Any comments?

Thanks,
Lars

_______________________________________________________
Linux-HA-Dev: Linux-HA-***@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev
Home Page: http://linux-ha.org/

Alan Robertson

2014-10-20 20:52:13 UTC