Hi,
I am writing a parallel program using MPI for Beowulf clusters. I have realized that I need to define checkpoints in my program so that the work done so far is saved and I do not have to start from square one if the program crashes or the machine is rebooted. I am programming in C and am looking for sources and references that explain how to define checkpoints, and how (and what) information is saved when using them. I have never done this before, so I need material that starts from the very basics. I would very much appreciate the help!
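To make the question concrete, here is a rough sketch of what I imagine an application-level checkpoint might look like. It is only my guess, not something taken from a reference. The struct `checkpoint_t`, its fields, and the functions `save_checkpoint` and `load_checkpoint` are names I made up for illustration. I am assuming a POSIX file system (where `rename` replaces the old file) and that each MPI process writes its own file, with `rank` coming from `MPI_Comm_rank`.

```c
#include <stdio.h>

/* Hypothetical application state: whatever is needed to resume the run. */
typedef struct {
    int    iteration;    /* next loop iteration to execute */
    double partial_sum;  /* result accumulated so far */
} checkpoint_t;

/* Save the state to "<base>.<rank>". The data is written to a temporary
 * file first and then renamed, so a crash in the middle of the write
 * leaves the previous checkpoint untouched. Returns 0 on success. */
int save_checkpoint(const char *base, int rank, const checkpoint_t *st)
{
    char path[256], tmp[264];
    snprintf(path, sizeof path, "%s.%d", base, rank);
    snprintf(tmp, sizeof tmp, "%s.tmp", path);

    FILE *f = fopen(tmp, "wb");
    if (!f) return -1;
    size_t n = fwrite(st, sizeof *st, 1, f);
    if (fclose(f) != 0 || n != 1) {
        remove(tmp);
        return -1;
    }
    return rename(tmp, path) == 0 ? 0 : -1;
}

/* Load the state from "<base>.<rank>". Returns 0 on success, or -1 if no
 * usable checkpoint exists (for example on the very first run). */
int load_checkpoint(const char *base, int rank, checkpoint_t *st)
{
    char path[256];
    snprintf(path, sizeof path, "%s.%d", base, rank);

    FILE *f = fopen(path, "rb");
    if (!f) return -1;
    size_t n = fread(st, sizeof *st, 1, f);
    fclose(f);
    return n == 1 ? 0 : -1;
}
```

My idea is that each process would call `load_checkpoint` once at startup, begin from iteration 0 if it fails, and call `save_checkpoint` every so many iterations, probably after an `MPI_Barrier` so that all ranks save a consistent state. Is this roughly the right approach, or is there a standard way to do it?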
Thanks