/* StarPU --- Runtime system for heterogeneous multicore architectures. * * Copyright (C) 2019-2021 Université de Bordeaux, CNRS (LaBRI UMR 5800), Inria * * StarPU is free software; you can redistribute it and/or modify * it under the terms of the GNU Lesser General Public License as published by * the Free Software Foundation; either version 2.1 of the License, or (at * your option) any later version. * * StarPU is distributed in the hope that it will be useful, but * WITHOUT ANY WARRANTY; without even the implied warranty of * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. * * See the GNU Lesser General Public License in COPYING.LGPL for more details. */ /*! \page FaultTolerance Fault Tolerance \section FaultTolerance_Introduction Introduction Due to e.g. hardware error, some tasks may fail, or even complete nodes may fail. For now, StarPU provides some support for failure of tasks. \section TaskRetry Retrying tasks In case a task implementation notices that it fail to compute properly, it can call starpu_task_failed() to notify StarPU of the failure. tests/fault-tolerance/retry.c is an example of coping with such failure: the principle is that when submitting the task, one sets its prologue callback to starpu_task_ft_prologue(). That prologue will turn the task into a meta task which will manage the repeated submission of try-tasks to perform the computation until one of the computations succeeds. By default, try-tasks will be just retried until one of them succeeds (i.e. the task implementation does not call starpu_task_failed()). One can change the behavior by passing a check_failsafe function as prologue parameter, which will be called at the end of the try-task attempt. It can look at starpu_task_get_current()->failed to determine whether the try-task suceeded, in which case it can call starpu_task_ft_success() on the meta-task to notify success, or if it failed, in which case it can call starpu_task_failsafe_create_retry() to create another try-task, and submit it with starpu_task_submit_nodeps(). This can however only work if the task input are not modified, and is thus not supported for tasks with data access mode ::STARPU_RW. */