Hi. I have a requirement to develop a 'system health monitor' module
that starts, stops and continually monitors certain processes (and
their threads) on an embedded Linux system.

The idea is that if a thread is detected to have exited or
hung, this monitor would kill and restart the parent process and all
its threads.

My initial thought was that each thread would register itself with the
monitoring process, and then periodically send a heartbeat to it. But
on second thought, most threads spend their time blocked waiting for
events, and there is no guarantee that the events they block on
would occur often enough to tell whether the thread was in trouble or
merely legitimately blocked for a long time.
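
To make the idea concrete, here is a minimal sketch of the
register-then-heartbeat bookkeeping, written in Python for brevity
rather than in whatever would run on the target. The names
(HeartbeatRegistry, register, beat, stale) are made up for
illustration, and the monitor is in-process here; I assume the real
one would be a separate process reached over some form of IPC.

```python
import threading
import time


class HeartbeatRegistry:
    """Hypothetical in-process stand-in for the monitor's bookkeeping."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock        # injectable so the logic can be tested
        self._last_beat = {}       # thread name -> time of last heartbeat
        self._lock = threading.Lock()

    def register(self, name):
        # Registration counts as the first heartbeat.
        self.beat(name)

    def beat(self, name):
        with self._lock:
            self._last_beat[name] = self._clock()

    def stale(self, timeout):
        """Return names whose last heartbeat is older than `timeout` seconds."""
        now = self._clock()
        with self._lock:
            return sorted(name for name, t in self._last_beat.items()
                          if now - t > timeout)
```

The weak spot is the one described above: a thread that is
legitimately blocked for longer than `timeout` shows up in stale()
exactly like a hung one.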

I'm relatively ignorant when it comes to the guts of Linux, so I'm not
sure how to go about this exactly.

For instance, if a thread were to start a periodic timer that generated
a heartbeat message, would the absence of the heartbeat message
reliably indicate a problem with the thread?
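
My worry with the timer approach, sketched below in Python for
brevity, is the converse case. I am assuming the timer fires in its
own context (here a helper thread; the function names are made up),
in which case the heartbeat seems to keep arriving even while the
thread it is supposed to vouch for is stuck.

```python
import threading
import time


def run_demo(duration=0.2, interval=0.02):
    """Return (worker_still_blocked, beats_seen) after `duration` seconds."""
    beats = []                     # timestamps recorded by the "timer"
    stop = threading.Event()
    never_set = threading.Event()  # nobody sets this, so the worker hangs

    def worker():
        never_set.wait()           # simulates a thread stuck forever

    def heartbeat_timer():
        # Fires every `interval` seconds until told to stop, regardless
        # of what worker() is doing.
        while not stop.wait(interval):
            beats.append(time.monotonic())

    w = threading.Thread(target=worker, daemon=True)
    h = threading.Thread(target=heartbeat_timer, daemon=True)
    w.start()
    h.start()
    time.sleep(duration)
    stop.set()
    h.join()
    return w.is_alive(), len(beats)


if __name__ == "__main__":
    print(run_demo())
```

If that is right, a missing heartbeat would tell me the timer stopped,
but a present one would not tell me the thread is making progress.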

Am I talking about a known, solved problem here? If so, what is the
accepted solution?

cheers