/* StarPU --- Runtime system for heterogeneous multicore architectures.
 *
 * Copyright (C) 2009-2021 Université de Bordeaux, CNRS (LaBRI UMR 5800), Inria
 *
 * StarPU is free software; you can redistribute it and/or modify
 * it under the terms of the GNU Lesser General Public License as published by
 * the Free Software Foundation; either version 2.1 of the License, or (at
 * your option) any later version.
 *
 * StarPU is distributed in the hope that it will be useful, but
 * WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
 *
 * See the GNU Lesser General Public License in COPYING.LGPL for more details.
 */
/*! \page TasksInStarPU Tasks In StarPU

\section TaskGranularity Task Granularity

Like any other runtime system, StarPU has some overhead for managing tasks. Since
it performs smart scheduling and data management, this overhead is not always
negligible. It is typically on the order of a couple of
microseconds, which is actually notably smaller than the CUDA overhead itself. The
amount of work that a task does should thus be somewhat
bigger, to make sure that the overhead becomes negligible. The offline
performance feedback can provide a measure of task length, which should thus be
checked if poor performance is observed. To get a grasp of the scalability
achievable according to task size, one can run
<c>tests/microbenchs/tasks_size_overhead.sh</c>, which draws curves of the
speedup of independent tasks of very small sizes.

To determine which task sizes your application is actually using, one can use
<c>starpu_fxt_data_trace</c>, see \ref DataTrace.

The choice of scheduler also has an impact on the overhead: for instance, the
scheduler <c>dmda</c> takes time to make a decision, while <c>eager</c> does
not. <c>tasks_size_overhead.sh</c> can again be used to get a grasp of how much
impact this has on the target machine.
\section TaskSubmission Task Submission

To let StarPU make online optimizations, tasks should be submitted
asynchronously as much as possible. Ideally, all tasks should be
submitted, with mere calls to starpu_task_wait_for_all() or
starpu_data_unregister() to wait for their
termination. StarPU will then be able to rework the whole schedule, overlap
computation with communication, manage accelerator local memory usage, etc.
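
For instance, a minimal sketch of this pattern (assuming a codelet <c>cl</c>
and an array of <c>N</c> registered handles <c>handles</c>):

\code{.c}
unsigned i;
/* Submit the whole task graph asynchronously; none of these calls blocks. */
for (i = 0; i < N; i++)
    starpu_task_insert(&cl, STARPU_RW, handles[i], 0);

/* Wait only once, after everything has been submitted. */
starpu_task_wait_for_all();
\endcode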
\section TaskPriorities Task Priorities

By default, StarPU will consider the tasks in the order in which they are submitted by
the application. If the application programmer knows that some tasks should
be executed in priority (for instance because their output is needed by many
other tasks and may thus be a bottleneck if not executed early
enough), the field starpu_task::priority should be set to provide the
priority information to StarPU.
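
For instance, a minimal sketch (assuming a codelet <c>cl</c> and a registered
handle <c>handle</c>):

\code{.c}
struct starpu_task *task = starpu_task_create();
task->cl = &cl;
task->handles[0] = handle;
/* Hint the scheduler to execute this task as early as possible. */
task->priority = STARPU_MAX_PRIO;
starpu_task_submit(task);
\endcode

When using starpu_task_insert(), the same can be expressed with the flag
::STARPU_PRIORITY.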
\section TaskDependencies Task Dependencies

\subsection SequentialConsistency Sequential Consistency

By default, task dependencies are inferred from data dependencies (sequential
consistency) by StarPU. The application can however disable sequential consistency
for some data, and dependencies can then be expressed explicitly.

Setting (or unsetting) sequential consistency can be done at the data
level by calling starpu_data_set_sequential_consistency_flag() for a
specific piece of data, or starpu_data_set_default_sequential_consistency_flag()
for all data.

Setting (or unsetting) sequential consistency can also be done at the task
level by setting the field starpu_task::sequential_consistency to \c 0.
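
For instance, a minimal sketch (assuming a registered handle <c>handle</c>):

\code{.c}
/* Disable implicit dependencies for this particular piece of data... */
starpu_data_set_sequential_consistency_flag(handle, 0);

/* ... or disable them for one given task only. */
struct starpu_task *task = starpu_task_create();
task->sequential_consistency = 0;
\endcode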
Sequential consistency can also be set (or unset) for each handle of a
specific task, by using the field
starpu_task::handles_sequential_consistency. When set, its value
should be an array whose number of elements is the number of
handles for the task, each element of the array being the sequential
consistency for the \c i-th handle of the task. The field can easily be
set when calling starpu_task_insert() with the flag
::STARPU_HANDLES_SEQUENTIAL_CONSISTENCY.

\code{.c}
char *seq_consistency = malloc(cl.nbuffers * sizeof(char));
seq_consistency[0] = 1;
seq_consistency[1] = 1;
seq_consistency[2] = 0;
ret = starpu_task_insert(&cl,
        STARPU_RW, handleA, STARPU_RW, handleB, STARPU_RW, handleC,
        STARPU_HANDLES_SEQUENTIAL_CONSISTENCY, seq_consistency,
        0);
free(seq_consistency);
\endcode
The internal algorithm used by StarPU to set up implicit dependencies is
as follows:

\code{.c}
if (sequential_consistency(task) == 1)
    for(i=0 ; i<STARPU_TASK_GET_NBUFFERS(task) ; i++)
        if (sequential_consistency(i-th data, task) == 1)
            if (sequential_consistency(i-th data) == 1)
                create_implicit_dependency(...)
\endcode
\subsection TasksAndTagsDependencies Tasks And Tags Dependencies

One can explicitly set dependencies between tasks using
starpu_task_declare_deps() or starpu_task_declare_deps_array(). Dependencies between tasks can also be
expressed through tags: associate each task with a tag with the field
starpu_task::tag_id, and declare dependencies between tags with starpu_tag_declare_deps()
or starpu_tag_declare_deps_array().
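
For instance, a minimal sketch of tag dependencies (the tag values are
arbitrary):

\code{.c}
/* Tag 0x42 will wait for tags 0x40 and 0x41. */
starpu_tag_declare_deps((starpu_tag_t)0x42, 2, (starpu_tag_t)0x40, (starpu_tag_t)0x41);

struct starpu_task *task = starpu_task_create();
task->use_tag = 1;   /* enable tag support for this task */
task->tag_id = 0x42; /* the task will thus wait for tags 0x40 and 0x41 */
\endcode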
The termination of a task can be delayed through the function
starpu_task_end_dep_add(), which specifies the number of calls to the function
starpu_task_end_dep_release() needed to trigger the task termination. One can
also use starpu_task_declare_end_deps() or starpu_task_declare_end_deps_array()
to delay the termination of a task until the termination of other tasks.
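
As a sketch of the end-dependency mechanism, a task can be kept from
terminating until the application explicitly releases it (assuming a codelet
<c>cl</c>):

\code{.c}
struct starpu_task *task = starpu_task_create();
task->cl = &cl;
/* One call to starpu_task_end_dep_release() will be needed
 * before the task can terminate. */
starpu_task_end_dep_add(task, 1);
starpu_task_submit(task);

/* ... later, e.g. from a callback of another task ... */
starpu_task_end_dep_release(task);
\endcode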
\section SettingManyDataHandlesForATask Setting Many Data Handles For a Task

The maximum number of data handles a task can manage is fixed by the macro
\ref STARPU_NMAXBUFS, whose default value can be changed
through the \c configure option \ref enable-maxbuffers "--enable-maxbuffers".

However, it is possible to define tasks managing more data by using
the field starpu_task::dyn_handles when defining a task, and the field
starpu_codelet::dyn_modes when defining the corresponding codelet.
\code{.c}
enum starpu_data_access_mode modes[STARPU_NMAXBUFS+1] =
{
    STARPU_R, STARPU_R, ...
};

struct starpu_codelet dummy_big_cl =
{
    .cuda_funcs = { dummy_big_kernel },
    .opencl_funcs = { dummy_big_kernel },
    .cpu_funcs = { dummy_big_kernel },
    .cpu_funcs_name = { "dummy_big_kernel" },
    .nbuffers = STARPU_NMAXBUFS+1,
    .dyn_modes = modes
};

task = starpu_task_create();
task->cl = &dummy_big_cl;
/* The handles array is allocated dynamically, since nbuffers exceeds STARPU_NMAXBUFS. */
task->dyn_handles = malloc(task->cl->nbuffers * sizeof(starpu_data_handle_t));
for(i=0 ; i<task->cl->nbuffers ; i++)
{
    task->dyn_handles[i] = handle;
}
starpu_task_submit(task);
\endcode

The same can be achieved with starpu_task_insert(), by passing the handles with
the flag ::STARPU_DATA_ARRAY:
\code{.c}
starpu_data_handle_t *handles = malloc(dummy_big_cl.nbuffers * sizeof(starpu_data_handle_t));
for(i=0 ; i<dummy_big_cl.nbuffers ; i++)
{
    handles[i] = handle;
}
starpu_task_insert(&dummy_big_cl,
        STARPU_VALUE, &dummy_big_cl.nbuffers, sizeof(dummy_big_cl.nbuffers),
        STARPU_DATA_ARRAY, handles, dummy_big_cl.nbuffers,
        0);
\endcode
The whole code for this complex data interface is available in the
file <c>examples/basic_examples/dynamic_handles.c</c>.

\section SettingVariableDataHandlesForATask Setting a Variable Number Of Data Handles For a Task

Normally, the number of data handles given to a task is set with
starpu_codelet::nbuffers. This field can however be set to
\ref STARPU_VARIABLE_NBUFFERS, in which case starpu_task::nbuffers
must be set, and starpu_task::modes (or starpu_task::dyn_modes,
see \ref SettingManyDataHandlesForATask) should be used to specify the modes for
the handles.
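
For instance, a minimal sketch (assuming a CPU kernel <c>var_kernel</c> and two
registered handles <c>handleA</c> and <c>handleB</c>):

\code{.c}
struct starpu_codelet var_cl =
{
    .cpu_funcs = { var_kernel },
    /* The number of handles will be given by each task. */
    .nbuffers = STARPU_VARIABLE_NBUFFERS,
};

struct starpu_task *task = starpu_task_create();
task->cl = &var_cl;
task->nbuffers = 2;
task->handles[0] = handleA;
task->modes[0] = STARPU_R;
task->handles[1] = handleB;
task->modes[1] = STARPU_RW;
starpu_task_submit(task);
\endcode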
\section UsingMultipleImplementationsOfACodelet Using Multiple Implementations Of A Codelet

One may want to write multiple implementations of a codelet for a single type of
device and let StarPU choose which one to run. As an example, we will show how
to use SSE to scale a vector. The codelet can be written as follows:

\code{.c}
#include <xmmintrin.h>

void scal_sse_func(void *buffers[], void *cl_arg)
{
    float *vector = (float *) STARPU_VECTOR_GET_PTR(buffers[0]);
    unsigned int n = STARPU_VECTOR_GET_NX(buffers[0]);
    unsigned int n_iterations = n/4;
    if (n % 4 != 0)
        n_iterations++;

    __m128 *VECTOR = (__m128*) vector;
    __m128 factor __attribute__((aligned(16)));
    factor = _mm_set1_ps(*(float *) cl_arg);

    unsigned int i;
    for (i = 0; i < n_iterations; i++)
        VECTOR[i] = _mm_mul_ps(factor, VECTOR[i]);
}
\endcode
\code{.c}
struct starpu_codelet cl =
{
    .cpu_funcs = { scal_cpu_func, scal_sse_func },
    .cpu_funcs_name = { "scal_cpu_func", "scal_sse_func" },
    .nbuffers = 1,
    .modes = { STARPU_RW }
};
\endcode

Schedulers which are multi-implementation aware (only <c>dmda</c> and
<c>pheft</c> for now) will use the performance models of all the
provided implementations, and pick the one which seems to be the fastest.
\section EnablingImplementationAccordingToCapabilities Enabling Implementation According To Capabilities

Some implementations may not run on some devices. For instance, some CUDA
devices do not support double floating point precision, and thus the kernel
execution would simply fail; or the device may not have enough shared memory for
the implementation being used. The field starpu_codelet::can_execute
makes it possible to express this. For instance:

\code{.c}
static int can_execute(unsigned workerid, struct starpu_task *task, unsigned nimpl)
{
    const struct cudaDeviceProp *props;
    if (starpu_worker_get_type(workerid) == STARPU_CPU_WORKER)
        return 1;
    /* CUDA device */
    props = starpu_cuda_get_device_properties(workerid);
    if (props->major >= 2 || props->minor >= 3)
        /* At least compute capability 1.3, supports doubles */
        return 1;
    /* Old card, does not support doubles */
    return 0;
}

struct starpu_codelet cl =
{
    .can_execute = can_execute,
    .cpu_funcs = { cpu_func },
    .cpu_funcs_name = { "cpu_func" },
    .cuda_funcs = { gpu_func },
    .nbuffers = 1,
    .modes = { STARPU_RW }
};
\endcode
This can be essential e.g. when running on a machine which mixes various models
of CUDA devices, to benefit from the new models without crashing on old models.

Note: the function starpu_codelet::can_execute is called by the
scheduler each time it tries to match a task with a worker, and should
thus be very fast. The function starpu_cuda_get_device_properties()
provides quick access to CUDA properties of CUDA devices to achieve
such efficiency.
Another example is to compile CUDA code for various compute capabilities,
resulting in two CUDA functions, e.g. <c>scal_gpu_13</c> for compute capability
1.3, and <c>scal_gpu_20</c> for compute capability 2.0. Both functions can be
provided to StarPU by using starpu_codelet::cuda_funcs, and
starpu_codelet::can_execute can then be used to rule out the
<c>scal_gpu_20</c> variant on a CUDA device which will not be able to execute it:

\code{.c}
static int can_execute(unsigned workerid, struct starpu_task *task, unsigned nimpl)
{
    const struct cudaDeviceProp *props;
    if (starpu_worker_get_type(workerid) == STARPU_CPU_WORKER)
        return 1;
    /* CUDA device */
    if (nimpl == 0)
        /* Trying to execute the 1.3 capability variant, we assume it is ok in all cases. */
        return 1;
    /* Trying to execute the 2.0 capability variant, check that the card can do it. */
    props = starpu_cuda_get_device_properties(workerid);
    if (props->major >= 2)
        /* At least compute capability 2.0, can run it */
        return 1;
    /* Old card, does not support 2.0, will not be able to execute the 2.0 variant. */
    return 0;
}

struct starpu_codelet cl =
{
    .can_execute = can_execute,
    .cpu_funcs = { cpu_func },
    .cpu_funcs_name = { "cpu_func" },
    .cuda_funcs = { scal_gpu_13, scal_gpu_20 },
    .nbuffers = 1,
    .modes = { STARPU_RW }
};
\endcode
Another example is having specialized implementations for some given common
sizes, for instance here a specialized implementation for 1024x1024
matrices:

\code{.c}
static int can_execute(unsigned workerid, struct starpu_task *task, unsigned nimpl)
{
    if (starpu_worker_get_type(workerid) == STARPU_CPU_WORKER)
        return 1;
    /* CUDA device */
    switch (nimpl)
    {
        case 0:
            /* Trying to execute the generic variant. */
            return 1;
        case 1:
        {
            /* Trying to execute the size == 1024 specific variant. */
            struct starpu_matrix_interface *interface = starpu_data_get_interface_on_node(task->handles[0], STARPU_MAIN_RAM);
            return STARPU_MATRIX_GET_NX(interface) == 1024 && STARPU_MATRIX_GET_NY(interface) == 1024;
        }
    }
    /* Unknown variant */
    return 0;
}

struct starpu_codelet cl =
{
    .can_execute = can_execute,
    .cpu_funcs = { cpu_func },
    .cpu_funcs_name = { "cpu_func" },
    .cuda_funcs = { potrf_gpu_generic, potrf_gpu_1024 },
    .nbuffers = 1,
    .modes = { STARPU_RW }
};
\endcode

Note that the most generic variant should be provided first, as some schedulers are
not able to try the different variants.
\section InsertTaskUtility Insert Task Utility

StarPU provides the wrapper function starpu_task_insert() to ease
the creation and submission of tasks.

Here is the implementation of a codelet:

\code{.c}
void func_cpu(void *descr[], void *_args)
{
    int *x0 = (int *)STARPU_VARIABLE_GET_PTR(descr[0]);
    float *x1 = (float *)STARPU_VARIABLE_GET_PTR(descr[1]);
    int ifactor;
    float ffactor;

    starpu_codelet_unpack_args(_args, &ifactor, &ffactor);
    *x0 = *x0 * ifactor;
    *x1 = *x1 * ffactor;
}

struct starpu_codelet mycodelet =
{
    .cpu_funcs = { func_cpu },
    .cpu_funcs_name = { "func_cpu" },
    .nbuffers = 2,
    .modes = { STARPU_RW, STARPU_RW }
};
\endcode
And the call to the function starpu_task_insert():

\code{.c}
starpu_task_insert(&mycodelet,
                   STARPU_VALUE, &ifactor, sizeof(ifactor),
                   STARPU_VALUE, &ffactor, sizeof(ffactor),
                   STARPU_RW, data_handles[0],
                   STARPU_RW, data_handles[1],
                   0);
\endcode

The call to starpu_task_insert() is equivalent to the following
code:

\code{.c}
struct starpu_task *task = starpu_task_create();
task->cl = &mycodelet;
task->handles[0] = data_handles[0];
task->handles[1] = data_handles[1];

char *arg_buffer;
size_t arg_buffer_size;
starpu_codelet_pack_args(&arg_buffer, &arg_buffer_size,
                         STARPU_VALUE, &ifactor, sizeof(ifactor),
                         STARPU_VALUE, &ffactor, sizeof(ffactor),
                         0);
task->cl_arg = arg_buffer;
task->cl_arg_size = arg_buffer_size;

int ret = starpu_task_submit(task);
\endcode
Here is a similar call using ::STARPU_DATA_ARRAY.

\code{.c}
starpu_task_insert(&mycodelet,
                   STARPU_DATA_ARRAY, data_handles, 2,
                   STARPU_VALUE, &ifactor, sizeof(ifactor),
                   STARPU_VALUE, &ffactor, sizeof(ffactor),
                   0);
\endcode
If some part of the task insertion depends on the value of some computation,
the macro ::STARPU_DATA_ACQUIRE_CB can be very convenient. For
instance, assuming that the index variable <c>i</c> was registered as handle
<c>A_handle[i]</c>:

\code{.c}
/* Compute which portion we will work on, e.g. pivot */
starpu_task_insert(&which_index, STARPU_W, i_handle, 0);

/* And submit the corresponding task */
STARPU_DATA_ACQUIRE_CB(i_handle, STARPU_R,
                       starpu_task_insert(&work, STARPU_RW, A_handle[i], 0));
\endcode

The macro ::STARPU_DATA_ACQUIRE_CB submits an asynchronous request for
acquiring data <c>i</c> for the main application, and will execute the code
given as third parameter when it is acquired. In other words, as soon as the
value of <c>i</c> computed by the codelet <c>which_index</c> can be read, the
portion of code passed as third parameter of ::STARPU_DATA_ACQUIRE_CB will
be executed, and is allowed to read from <c>i</c> to use it e.g. as an
index. Note that this macro is only available when compiling StarPU with
the compiler <c>gcc</c>.
StarPU also provides a utility function starpu_codelet_unpack_args() to
retrieve the ::STARPU_VALUE arguments passed to the task. There are several
ways of calling starpu_codelet_unpack_args(), as illustrated below.
\code{.c}
void func_cpu(void *descr[], void *_args)
{
    int ifactor;
    float ffactor;

    /* Unpack all the arguments at once. */
    starpu_codelet_unpack_args(_args, &ifactor, &ffactor);
}
\endcode

\code{.c}
void func_cpu(void *descr[], void *_args)
{
    int ifactor;
    float ffactor;

    /* Passing 0 stops the unpacking, so the first call retrieves only the
     * first argument; the second call retrieves both from the start. */
    starpu_codelet_unpack_args(_args, &ifactor, 0);
    starpu_codelet_unpack_args(_args, &ifactor, &ffactor);
}
\endcode

\code{.c}
void func_cpu(void *descr[], void *_args)
{
    int ifactor;
    float ffactor;
    char buffer[100];

    /* Unpack the first argument and copy the remaining ones into buffer,
     * from which they can be unpacked later. */
    starpu_codelet_unpack_args_and_copyleft(_args, buffer, 100, &ifactor, 0);
    starpu_codelet_unpack_args(buffer, &ffactor);
}
\endcode
\section GettingTaskChildren Getting Task Children

It may be interesting to get the list of tasks which depend on a given task,
notably when using implicit dependencies, since this list is computed by StarPU.
starpu_task_get_task_succs() provides it. For instance:

\code{.c}
struct starpu_task *tasks[4];
ret = starpu_task_get_task_succs(task, sizeof(tasks)/sizeof(*tasks), tasks);
\endcode
\section ParallelTasks Parallel Tasks

StarPU can leverage existing parallel computation libraries by the means of
parallel tasks. A parallel task is a task which is run by a set of CPUs
(called a parallel or combined worker) at the same time, by using an existing
parallel CPU implementation of the computation to be achieved. This can also be
useful to improve the load balance between slow CPUs and fast GPUs: since CPUs
work collectively on a single task, the completion time of tasks on CPUs becomes
comparable to the completion time on GPUs, thus relieving granularity
discrepancy concerns. <c>hwloc</c> support needs to be enabled to get
good performance, otherwise StarPU will not know how to better group
cores.

Two modes of execution exist to accommodate existing usages.
\subsection Fork-modeParallelTasks Fork-mode Parallel Tasks

In the Fork mode, StarPU will call the codelet function on one
of the CPUs of the combined worker. The codelet function can use
starpu_combined_worker_get_size() to get the number of threads it is
allowed to start to achieve the computation. The CPU binding mask for the whole
set of CPUs is already enforced, so that threads created by the function will
inherit the mask, and thus execute where StarPU expected, the OS being in charge
of choosing how to schedule threads on the corresponding CPUs. The application
can also choose to bind threads by hand, using e.g. <c>sched_getaffinity</c> to know
the CPU binding mask that StarPU chose.

For instance, using OpenMP (full source is available in
<c>examples/openmp/vector_scal.c</c>):

\snippet forkmode.c To be included. You should update doxygen if you see this text.

Other examples include for instance calling a BLAS parallel CPU implementation
(see <c>examples/mult/xgemm.c</c>).
\subsection SPMD-modeParallelTasks SPMD-mode Parallel Tasks

In the SPMD mode, StarPU will call the codelet function on
each CPU of the combined worker. The codelet function can use
starpu_combined_worker_get_size() to get the total number of CPUs
involved in the combined worker, and thus the number of calls that are made in
parallel to the function, and starpu_combined_worker_get_rank() to get
the rank of the current CPU within the combined worker. For instance:

\code{.c}
static void func(void *buffers[], void *_args)
{
    unsigned i;
    float *factor = _args;
    struct starpu_vector_interface *vector = buffers[0];
    unsigned n = STARPU_VECTOR_GET_NX(vector);
    float *val = (float *)STARPU_VECTOR_GET_PTR(vector);

    /* Compute the slice to work on */
    unsigned m = starpu_combined_worker_get_size();
    unsigned j = starpu_combined_worker_get_rank();
    unsigned slice = (n+m-1)/m;

    for (i = j * slice; i < (j+1) * slice && i < n; i++)
        val[i] *= *factor;
}

static struct starpu_codelet cl =
{
    .modes = { STARPU_RW },
    .type = STARPU_SPMD,
    .max_parallelism = INT_MAX,
    .cpu_funcs = { func },
    .cpu_funcs_name = { "func" },
    .nbuffers = 1,
};
\endcode
Of course, this trivial example will not really benefit from parallel task
execution, and was only meant to be simple to understand. The benefit comes
when the computation to be done is such that threads have to e.g. exchange
intermediate results, or write to the data in a complex but safe way in the same
buffer.
\subsection ParallelTasksPerformance Parallel Tasks Performance

To benefit from parallel tasks, a parallel-task-aware StarPU scheduler has to
be used. When exposed to codelets with the flag ::STARPU_FORKJOIN or
::STARPU_SPMD, the schedulers <c>pheft</c> (parallel-heft) and <c>peager</c>
(parallel eager) will indeed also try to execute tasks with
several CPUs. They will automatically try the various available combined
worker sizes (making several measurements for each worker size) and
will thus be able to avoid choosing a large combined worker if the codelet
does not actually scale so much.

This is however for now only a proof of concept, and has not really been optimized yet.
\subsection CombinedWorkers Combined Workers

By default, StarPU creates combined workers according to the architecture
structure as detected by <c>hwloc</c>. It means that for each object of the <c>hwloc</c>
topology (NUMA node, socket, cache, ...) a combined worker will be created. If
some nodes of the hierarchy have a big arity (e.g. many cores in a socket
without a hierarchy of shared caches), StarPU will create combined workers of
intermediate sizes. The variable \ref STARPU_SYNTHESIZE_ARITY_COMBINED_WORKER
allows tuning the maximum arity between levels of combined workers.

The combined workers actually produced can be seen in the output of the
tool <c>starpu_machine_display</c> (the environment variable
\ref STARPU_SCHED has to be set to a combined-worker-aware scheduler such
as <c>pheft</c> or <c>peager</c>).
\subsection ConcurrentParallelTasks Concurrent Parallel Tasks

Unfortunately, many environments and libraries do not support concurrent
calls.

For instance, most OpenMP implementations (including the main ones) do not
support concurrent <c>pragma omp parallel</c> statements without nesting them in
another <c>pragma omp parallel</c> statement, but StarPU does not yet support
creating its CPU workers by using such a pragma.

Other parallel libraries are also not safe when being invoked concurrently
from different threads, due to the use of global variables in their sequential
sections, for instance.

The solution is then to use only one combined worker at a time. This can be
done by setting the field starpu_conf::single_combined_worker to <c>1</c>, or
setting the environment variable \ref STARPU_SINGLE_COMBINED_WORKER
to <c>1</c>. StarPU will then run only one parallel task at a time (but other
CPU and GPU tasks are not affected and can be run concurrently). The parallel
task scheduler will however still try varying combined worker
sizes to look for the most efficient ones.
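
For instance, a minimal sketch of setting this field at initialization time:

\code{.c}
struct starpu_conf conf;
starpu_conf_init(&conf);
/* Run at most one parallel task at a time. */
conf.single_combined_worker = 1;
int ret = starpu_init(&conf);
\endcode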
\subsection SynchronizationTasks Synchronization Tasks

For the convenience of the application, it may be useful to define tasks which do not
actually perform any computation, but for instance carry dependencies between other
tasks or tags, or are to be submitted in callbacks, etc.

The obvious way is of course to make kernel functions empty, but such a task will
still have to wait for a worker to become ready, transfer data, etc.

A much lighter way to define a synchronization task is to set its field starpu_task::cl
to <c>NULL</c>. The task will thus be a mere synchronization point,
without any data access or execution content: as soon as its dependencies become
available, it will terminate, call the callbacks, and release its dependencies.
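
For instance, a minimal sketch of a pure synchronization point between tags
(the tag values are arbitrary):

\code{.c}
/* The synchronization point waits for tags 0x62 and 0x63. */
starpu_tag_declare_deps((starpu_tag_t)0x64, 2, (starpu_tag_t)0x62, (starpu_tag_t)0x63);

struct starpu_task *sync_task = starpu_task_create();
sync_task->cl = NULL; /* no computation, mere synchronization point */
sync_task->use_tag = 1;
sync_task->tag_id = 0x64;
starpu_task_submit(sync_task);
\endcode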
An intermediate solution is to define a codelet with its field
starpu_codelet::where set to \ref STARPU_NOWHERE, for instance:

\code{.c}
struct starpu_codelet cl =
{
    .where = STARPU_NOWHERE,
    .nbuffers = 1,
    .modes = { STARPU_R },
};

task = starpu_task_create();
task->cl = &cl;
task->handles[0] = handle;
starpu_task_submit(task);
\endcode

will create a task which simply waits for the value of <c>handle</c> to be
available for read. This task can then be depended on, etc.
*/