/* StarPU --- Runtime system for heterogeneous multicore architectures.
 *
 * Copyright (C) 2019 Université de Bordeaux
 *
 * StarPU is free software; you can redistribute it and/or modify
 * it under the terms of the GNU Lesser General Public License as published by
 * the Free Software Foundation; either version 2.1 of the License, or (at
 * your option) any later version.
 *
 * StarPU is distributed in the hope that it will be useful, but
 * WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
 *
 * See the GNU Lesser General Public License in COPYING.LGPL for more details.
 */

/*! \page FaultTolerance Fault Tolerance

\section Introduction Introduction

Due to e.g. hardware errors, some tasks may fail, or even complete nodes may
fail. For now, StarPU provides some support for recovering from task failures.

\section TaskRetry Retrying tasks

If a task implementation notices that it failed to compute properly, it can
call starpu_task_failed() to notify StarPU of the failure.

<c>tests/fault-tolerance/retry.c</c> is an example of coping with such a
failure: the principle is that when submitting the task, one sets its prologue
callback to starpu_task_ft_prologue(). That prologue turns the task into a
meta-task which manages the repeated submission of try-tasks to perform the
computation until one of the computations succeeds.

By default, try-tasks are simply retried until one of them succeeds (i.e. the
task implementation does not call starpu_task_failed()). One can change this
behavior by passing a <c>check_failsafe</c> function as the prologue parameter;
it is called at the end of each try-task attempt. It can look at
<c>starpu_task_get_current()->failed</c> to determine whether the try-task
succeeded, in which case it can call starpu_task_ft_success() on the meta-task
to notify success, or whether it failed, in which case it can call
starpu_task_failsafe_create_retry() to create another try-task and submit it
with starpu_task_submit_nodeps().
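
The control flow described above can be sketched with a small standalone model.
This is only an illustration of the meta-task / try-task principle under
simplifying assumptions (sequential execution, no data handles): it does not
use the StarPU API, and every name in it (<c>model_task</c>,
<c>model_run_with_retry</c>, <c>model_check_default</c>,
<c>model_check_bounded</c>, <c>model_flaky_compute</c>) is hypothetical.

```c
// Standalone model of the retry pattern; it does NOT call StarPU.
// All identifiers below are hypothetical and exist only for illustration.
#include <assert.h>
#include <stddef.h>

struct model_task
{
	int (*compute)(void *arg); // try-task body, returns nonzero on failure
	void *arg;                 // argument passed to each attempt
	int failed;                // outcome of the last attempt
	int done;                  // set once success is notified on the meta-task
	unsigned attempts;         // number of try-tasks run so far
};

// Check policy, called after each attempt; nonzero means "run another try".
typedef int (*model_check_fn)(struct model_task *meta);

// Default policy: retry until one attempt succeeds.
int model_check_default(struct model_task *meta)
{
	if (!meta->failed)
	{
		meta->done = 1; // notify success on the meta-task
		return 0;
	}
	return 1;
}

// Custom policy: same as the default, but give up after 3 attempts.
int model_check_bounded(struct model_task *meta)
{
	if (!meta->failed)
	{
		meta->done = 1;
		return 0;
	}
	return meta->attempts < 3;
}

// Meta-task: runs try-tasks until the policy stops asking for retries.
// Returns 0 if an attempt succeeded, -1 if the policy gave up.
int model_run_with_retry(struct model_task *meta, model_check_fn check)
{
	assert(meta != NULL && meta->compute != NULL);
	if (check == NULL)
		check = model_check_default;
	meta->done = 0;
	meta->attempts = 0;
	do
	{
		meta->attempts++;
		meta->failed = meta->compute(meta->arg) != 0;
	}
	while (check(meta));
	return meta->done ? 0 : -1;
}

// Example try-task body: fails while the counter pointed to by arg is positive.
int model_flaky_compute(void *arg)
{
	int *failures_left = arg;
	if (*failures_left > 0)
	{
		(*failures_left)--;
		return 1;
	}
	return 0;
}
```

With a body that fails twice, <c>model_run_with_retry(&t, NULL)</c> returns 0
on the third attempt. With a body that keeps failing,
<c>model_check_bounded</c> makes it return -1 after three attempts. This is the
kind of policy change a custom check function allows.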

This can however only work if the task inputs are not modified, and is thus not
supported for tasks with data access mode ::STARPU_RW.
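
The reason for this restriction can be seen in a small standalone sketch. It
does not use the StarPU API, and all names in it are hypothetical. It compares
a kernel that updates its buffer in place (the read-write case) with one that
reads its input and writes a separate output, under the assumption that a
failing attempt has already touched its data before the failure is reported.

```c
// Standalone illustration; it does NOT call StarPU.
// All identifiers below are hypothetical.
#include <assert.h>

// In-place kernel (read-write access): the input is overwritten even when
// the attempt is then reported as failed. Returns nonzero on failure.
int model_increment_inplace(int *value, int fail)
{
	*value += 1;
	return fail;
}

// Separate input and output (read access plus write access): a failed
// attempt leaves the input intact, so the next attempt sees the same state.
int model_increment_copy(const int *in, int *out, int fail)
{
	*out = *in + 1;
	return fail;
}

// Retry the in-place kernel: it fails nfail times, then succeeds.
int model_retry_inplace(int value, int nfail)
{
	assert(nfail >= 0);
	while (model_increment_inplace(&value, nfail > 0))
		nfail--;
	return value;
}

// Retry the copying kernel under the same failure pattern.
int model_retry_copy(int value, int nfail)
{
	int out = 0;
	assert(nfail >= 0);
	while (model_increment_copy(&value, &out, nfail > 0))
		nfail--;
	return out;
}
```

Starting from 10 with two failed attempts, the in-place variant ends at 13
because each failed attempt has already incremented the value, whereas the
copying variant yields the expected 11.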

*/