/* StarPU --- Runtime system for heterogeneous multicore architectures.
 *
 * Copyright (C) 2010-2018                                CNRS
 * Copyright (C) 2011,2012,2018                           Inria
 * Copyright (C) 2009-2011,2014-2016,2018                 Université de Bordeaux
 *
 * StarPU is free software; you can redistribute it and/or modify
 * it under the terms of the GNU Lesser General Public License as published by
 * the Free Software Foundation; either version 2.1 of the License, or (at
 * your option) any later version.
 *
 * StarPU is distributed in the hope that it will be useful, but
 * WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
 *
 * See the GNU Lesser General Public License in COPYING.LGPL for more details.
 */
 
/*! \page TasksInStarPU Tasks In StarPU
 
\section TaskGranularity Task Granularity

Like any other runtime system, StarPU has some overhead for managing tasks. Since
it performs smart scheduling and data management, this overhead is not always
negligible. It is typically on the order of a couple of microseconds, which is
actually notably smaller than the CUDA overhead itself. The amount of work that a
task performs should thus be somewhat larger, to make sure that the overhead
becomes negligible. The offline performance feedback can provide a measure of
task length, which should thus be checked if bad performance is observed. To get
a grasp of the scalability achievable according to task size, one can run
<c>tests/microbenchs/tasks_size_overhead.sh</c>, which draws curves of the
speedup of independent tasks of very small sizes.

The choice of scheduler also has an impact on the overhead: for instance, the
scheduler <c>dmda</c> takes time to make a decision, while <c>eager</c> does
not. <c>tasks_size_overhead.sh</c> can again be used to get a grasp of how much
impact that has on the target machine.
 
\section TaskSubmission Task Submission

To let StarPU make online optimizations, tasks should be submitted
asynchronously as much as possible. Ideally, all tasks should be
submitted first, and calls to starpu_task_wait_for_all() or
starpu_data_unregister() made only afterwards, to wait for
termination. StarPU will then be able to rework the whole schedule, overlap
computation with communication, manage accelerator local memory usage, etc.
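
A minimal sketch of this pattern, assuming a codelet <c>cl</c> and an array of
registered handles <c>handles</c> (both hypothetical):

\code{.c}
unsigned i;

/* Submit everything asynchronously; starpu_task_insert() returns as soon
 * as the task is queued, without waiting for its execution. */
for (i = 0; i < N; i++)
	starpu_task_insert(&cl, STARPU_RW, handles[i], 0);

/* Only now wait for the completion of the whole task graph. */
starpu_task_wait_for_all();
\endcode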
 
\section TaskPriorities Task Priorities

By default, StarPU considers tasks in the order they are submitted by
the application. If the application programmer knows that some tasks should
be performed with higher priority (for instance because their output is needed by many
other tasks and may thus be a bottleneck if not executed early
enough), the field starpu_task::priority should be set to transmit this
priority information to StarPU.
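
For instance, a minimal sketch (assuming an already-defined codelet <c>cl</c>
and a registered handle <c>handle</c>):

\code{.c}
struct starpu_task *task = starpu_task_create();
task->cl = &cl;
task->handles[0] = handle;
/* Ask the scheduler to run this task as early as possible. */
task->priority = STARPU_MAX_PRIO;
starpu_task_submit(task);
\endcode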
 
\section TaskDependencies Task Dependencies

\subsection SequentialConsistency Sequential Consistency

By default, task dependencies are inferred from data dependencies (sequential
coherency) by StarPU. The application can however disable sequential coherency
for some data, and dependencies can then be expressed explicitly.

Setting (or unsetting) sequential consistency can be done at the data
level by calling starpu_data_set_sequential_consistency_flag() for a
specific data handle, or starpu_data_set_default_sequential_consistency_flag()
for all data.

Setting (or unsetting) sequential consistency can also be done at the task
level by setting the field starpu_task::sequential_consistency to 0.
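
A minimal sketch of both approaches (assuming a registered handle <c>handle</c>):

\code{.c}
/* Disable implicit dependencies for this particular data handle. */
starpu_data_set_sequential_consistency_flag(handle, 0);

/* Or disable implicit dependencies for one particular task. */
struct starpu_task *task = starpu_task_create();
task->sequential_consistency = 0;
\endcode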
 
Sequential consistency can also be set (or unset) for each handle of a
specific task; this is done by using the field
starpu_task::handles_sequential_consistency. When set, its value
should be an array with as many elements as the number of
handles for the task, each element of the array being the sequential
consistency for the i-th handle of the task. The field can easily be
set when calling starpu_task_insert() with the flag
::STARPU_HANDLES_SEQUENTIAL_CONSISTENCY
 
\code{.c}
char *seq_consistency = malloc(cl.nbuffers * sizeof(char));
seq_consistency[0] = 1;
seq_consistency[1] = 1;
seq_consistency[2] = 0;
ret = starpu_task_insert(&cl,
	STARPU_RW, handleA, STARPU_RW, handleB, STARPU_RW, handleC,
	STARPU_HANDLES_SEQUENTIAL_CONSISTENCY, seq_consistency,
	0);
free(seq_consistency);
\endcode
 
The internal algorithm used by StarPU to set up implicit dependencies is
as follows:
\code{.c}
if (sequential_consistency(task) == 1)
    for(i=0 ; i<STARPU_TASK_GET_NBUFFERS(task) ; i++)
      if (sequential_consistency(i-th data, task) == 1)
        if (sequential_consistency(i-th data) == 1)
           create_implicit_dependency(...)
\endcode
 
\subsection TasksAndTagsDependencies Tasks And Tags Dependencies

One can explicitly set dependencies between tasks using
starpu_task_declare_deps() or starpu_task_declare_deps_array(). Dependencies can also be
expressed through tags: associate each task to a tag with the field
starpu_task::tag_id, and declare the dependencies with starpu_tag_declare_deps()
or starpu_tag_declare_deps_array().
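
For instance, a short sketch (the task pointers <c>task_a</c>, <c>task_b</c>
and <c>task_c</c> and the tag values are hypothetical):

\code{.c}
/* task_c will not start before task_a and task_b have completed. */
struct starpu_task *deps[] = { task_a, task_b };
starpu_task_declare_deps_array(task_c, 2, deps);

/* Equivalently with tags: tag 0x42 depends on tags 0x32 and 0x52. */
starpu_tag_declare_deps((starpu_tag_t) 0x42, 2,
			(starpu_tag_t) 0x32, (starpu_tag_t) 0x52);
\endcode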
 
The termination of a task can be delayed through the function
starpu_task_end_dep_add(), which specifies the number of calls to the function
starpu_task_end_dep_release() needed to trigger the task termination. One can
also use starpu_task_declare_end_deps() or starpu_task_declare_end_deps_array()
to delay the termination of a task until the termination of other tasks.
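
A minimal sketch of the end-dependency mechanism (where and when the release is
triggered is application-specific and hypothetical here):

\code{.c}
struct starpu_task *task = starpu_task_create();
task->cl = &cl;
/* The task will not be considered terminated before one
 * matching call to starpu_task_end_dep_release(). */
starpu_task_end_dep_add(task, 1);
starpu_task_submit(task);

/* ... later, e.g. from another task's callback ... */
starpu_task_end_dep_release(task);
\endcode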
 
\section SettingManyDataHandlesForATask Setting Many Data Handles For a Task

The maximum number of data handles a task can manage is fixed by the macro
\ref STARPU_NMAXBUFS, whose default value can be changed
through the configure option \ref enable-maxbuffers "--enable-maxbuffers".

However, it is possible to define tasks managing more data by using
the field starpu_task::dyn_handles when defining a task and the field
starpu_codelet::dyn_modes when defining the corresponding codelet.
 
\code{.c}
enum starpu_data_access_mode modes[STARPU_NMAXBUFS+1] =
{
	STARPU_R, STARPU_R, ...
};

struct starpu_codelet dummy_big_cl =
{
	.cuda_funcs = { dummy_big_kernel },
	.opencl_funcs = { dummy_big_kernel },
	.cpu_funcs = { dummy_big_kernel },
	.cpu_funcs_name = { "dummy_big_kernel" },
	.nbuffers = STARPU_NMAXBUFS+1,
	.dyn_modes = modes
};

struct starpu_task *task = starpu_task_create();
task->cl = &dummy_big_cl;
/* The handles go into a dynamically allocated array instead of
 * the fixed-size starpu_task::handles field. */
task->dyn_handles = malloc(task->cl->nbuffers * sizeof(starpu_data_handle_t));
for(i=0 ; i<task->cl->nbuffers ; i++)
{
	task->dyn_handles[i] = handle;
}
starpu_task_submit(task);
\endcode
 
The same can be achieved with starpu_task_insert(), by passing the handles
as an array with the flag ::STARPU_DATA_ARRAY:

\code{.c}
starpu_data_handle_t *handles = malloc(dummy_big_cl.nbuffers * sizeof(starpu_data_handle_t));
for(i=0 ; i<dummy_big_cl.nbuffers ; i++)
{
	handles[i] = handle;
}
starpu_task_insert(&dummy_big_cl,
		   STARPU_VALUE, &dummy_big_cl.nbuffers, sizeof(dummy_big_cl.nbuffers),
		   STARPU_DATA_ARRAY, handles, dummy_big_cl.nbuffers,
		   0);
\endcode
 
The whole code for this complex data interface is available in the
file <c>examples/basic_examples/dynamic_handles.c</c>.
 
\section SettingVariableDataHandlesForATask Setting a Variable Number Of Data Handles For a Task

Normally, the number of data handles given to a task is fixed in the
starpu_codelet::nbuffers codelet field. This field can however be set to
\ref STARPU_VARIABLE_NBUFFERS, in which case the starpu_task::nbuffers task field
must be set, and the starpu_task::modes field (or the starpu_task::dyn_modes field,
see \ref SettingManyDataHandlesForATask) should be used to specify the modes for
the handles.
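
A minimal sketch (the kernel <c>func</c> and the registered handles
<c>handle1</c> and <c>handle2</c> are hypothetical):

\code{.c}
struct starpu_codelet cl =
{
	.cpu_funcs = { func },
	/* The number of buffers is decided per-task. */
	.nbuffers = STARPU_VARIABLE_NBUFFERS,
};

struct starpu_task *task = starpu_task_create();
task->cl = &cl;
task->nbuffers = 2;
task->handles[0] = handle1;
task->handles[1] = handle2;
task->modes[0] = STARPU_R;
task->modes[1] = STARPU_RW;
starpu_task_submit(task);
\endcode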
 
\section UsingMultipleImplementationsOfACodelet Using Multiple Implementations Of A Codelet

One may want to write multiple implementations of a codelet for a single type of
device and let StarPU choose which one to run. As an example, we will show how
to use SSE to scale a vector. The codelet can be written as follows:
 
\code{.c}
#include <xmmintrin.h>

void scal_sse_func(void *buffers[], void *cl_arg)
{
    float *vector = (float *) STARPU_VECTOR_GET_PTR(buffers[0]);
    unsigned int n = STARPU_VECTOR_GET_NX(buffers[0]);
    unsigned int n_iterations = n/4;
    if (n % 4 != 0)
        n_iterations++;

    __m128 *VECTOR = (__m128*) vector;
    __m128 factor __attribute__((aligned(16)));
    factor = _mm_set1_ps(*(float *) cl_arg);

    unsigned int i;
    for (i = 0; i < n_iterations; i++)
        VECTOR[i] = _mm_mul_ps(factor, VECTOR[i]);
}
\endcode
 
\code{.c}
struct starpu_codelet cl =
{
    .cpu_funcs = { scal_cpu_func, scal_sse_func },
    .cpu_funcs_name = { "scal_cpu_func", "scal_sse_func" },
    .nbuffers = 1,
    .modes = { STARPU_RW }
};
\endcode
 
Schedulers which are multi-implementation aware (only <c>dmda</c> and
<c>pheft</c> for now) will use the performance models of all the
implementations they were given, and pick the one which seems to be the fastest.
 
\section EnablingImplementationAccordingToCapabilities Enabling Implementation According To Capabilities

Some implementations may not run on some devices. For instance, some CUDA
devices do not support double floating point precision, and thus the kernel
execution would just fail; or the device may not have enough shared memory for
the implementation being used. The field starpu_codelet::can_execute
permits expressing this. For instance:
 
\code{.c}
static int can_execute(unsigned workerid, struct starpu_task *task, unsigned nimpl)
{
  const struct cudaDeviceProp *props;
  if (starpu_worker_get_type(workerid) == STARPU_CPU_WORKER)
    return 1;
  /* Cuda device */
  props = starpu_cuda_get_device_properties(workerid);
  if (props->major >= 2 || props->minor >= 3)
    /* At least compute capability 1.3, supports doubles */
    return 1;
  /* Old card, does not support doubles */
  return 0;
}

struct starpu_codelet cl =
{
    .can_execute = can_execute,
    .cpu_funcs = { cpu_func },
    .cpu_funcs_name = { "cpu_func" },
    .cuda_funcs = { gpu_func },
    .nbuffers = 1,
    .modes = { STARPU_RW }
};
\endcode
 
This can be essential e.g. when running on a machine which mixes various models
of CUDA devices, to benefit from the new models without crashing on old models.

Note: the function starpu_codelet::can_execute is called by the
scheduler each time it tries to match a task with a worker, and should
thus be very fast. The function starpu_cuda_get_device_properties()
provides quick access to CUDA properties of CUDA devices to achieve
such efficiency.
 
Another example is to compile CUDA code for various compute capabilities,
resulting in two CUDA functions, e.g. <c>scal_gpu_13</c> for compute capability
1.3, and <c>scal_gpu_20</c> for compute capability 2.0. Both functions can be
provided to StarPU by using starpu_codelet::cuda_funcs, and
starpu_codelet::can_execute can then be used to rule out the
<c>scal_gpu_20</c> variant on a CUDA device which will not be able to execute it:
 
\code{.c}
static int can_execute(unsigned workerid, struct starpu_task *task, unsigned nimpl)
{
  const struct cudaDeviceProp *props;
  if (starpu_worker_get_type(workerid) == STARPU_CPU_WORKER)
    return 1;
  /* Cuda device */
  if (nimpl == 0)
    /* Trying to execute the 1.3 capability variant, we assume it is ok in all cases.  */
    return 1;
  /* Trying to execute the 2.0 capability variant, check that the card can do it.  */
  props = starpu_cuda_get_device_properties(workerid);
  if (props->major >= 2)
    /* At least compute capability 2.0, can run it */
    return 1;
  /* Old card, does not support 2.0, will not be able to execute the 2.0 variant.  */
  return 0;
}

struct starpu_codelet cl =
{
    .can_execute = can_execute,
    .cpu_funcs = { cpu_func },
    .cpu_funcs_name = { "cpu_func" },
    .cuda_funcs = { scal_gpu_13, scal_gpu_20 },
    .nbuffers = 1,
    .modes = { STARPU_RW }
};
\endcode
 
Another example is having specialized implementations for some given common
sizes, for instance here we have a specialized implementation for 1024x1024
matrices:
 
\code{.c}
static int can_execute(unsigned workerid, struct starpu_task *task, unsigned nimpl)
{
  if (starpu_worker_get_type(workerid) == STARPU_CPU_WORKER)
    return 1;
  /* Cuda device */
  switch (nimpl)
  {
    case 0:
      /* Trying to execute the generic capability variant.  */
      return 1;
    case 1:
    {
      /* Trying to execute the size == 1024 specific variant.  */
      struct starpu_matrix_interface *interface = starpu_data_get_interface_on_node(task->handles[0], STARPU_MAIN_RAM);
      return STARPU_MATRIX_GET_NX(interface) == 1024 && STARPU_MATRIX_GET_NY(interface) == 1024;
    }
  }
  return 0;
}

struct starpu_codelet cl =
{
    .can_execute = can_execute,
    .cpu_funcs = { cpu_func },
    .cpu_funcs_name = { "cpu_func" },
    .cuda_funcs = { potrf_gpu_generic, potrf_gpu_1024 },
    .nbuffers = 1,
    .modes = { STARPU_RW }
};
\endcode
 
Note: the most generic variant should be provided first, as some schedulers are
not able to try the different variants.
 
\section InsertTaskUtility Insert Task Utility

StarPU provides the wrapper function starpu_task_insert() to ease
the creation and submission of tasks.

Here is the implementation of the codelet:
 
\code{.c}
void func_cpu(void *descr[], void *_args)
{
        int *x0 = (int *)STARPU_VARIABLE_GET_PTR(descr[0]);
        float *x1 = (float *)STARPU_VARIABLE_GET_PTR(descr[1]);
        int ifactor;
        float ffactor;

        starpu_codelet_unpack_args(_args, &ifactor, &ffactor);
        *x0 = *x0 * ifactor;
        *x1 = *x1 * ffactor;
}

struct starpu_codelet mycodelet =
{
        .cpu_funcs = { func_cpu },
        .cpu_funcs_name = { "func_cpu" },
        .nbuffers = 2,
        .modes = { STARPU_RW, STARPU_RW }
};
\endcode
 
And the call to the function starpu_task_insert():

\code{.c}
starpu_task_insert(&mycodelet,
                   STARPU_VALUE, &ifactor, sizeof(ifactor),
                   STARPU_VALUE, &ffactor, sizeof(ffactor),
                   STARPU_RW, data_handles[0],
                   STARPU_RW, data_handles[1],
                   0);
\endcode
 
The call to starpu_task_insert() is equivalent to the following
code:

\code{.c}
struct starpu_task *task = starpu_task_create();
task->cl = &mycodelet;
task->handles[0] = data_handles[0];
task->handles[1] = data_handles[1];
char *arg_buffer;
size_t arg_buffer_size;
starpu_codelet_pack_args(&arg_buffer, &arg_buffer_size,
                         STARPU_VALUE, &ifactor, sizeof(ifactor),
                         STARPU_VALUE, &ffactor, sizeof(ffactor),
                         0);
task->cl_arg = arg_buffer;
task->cl_arg_size = arg_buffer_size;
int ret = starpu_task_submit(task);
\endcode
 
Here is a similar call using ::STARPU_DATA_ARRAY, the access modes for the
data handles being taken from the codelet:

\code{.c}
starpu_task_insert(&mycodelet,
                   STARPU_DATA_ARRAY, data_handles, 2,
                   STARPU_VALUE, &ifactor, sizeof(ifactor),
                   STARPU_VALUE, &ffactor, sizeof(ffactor),
                   0);
\endcode
 
If some part of the task insertion depends on the value of some computation,
the macro ::STARPU_DATA_ACQUIRE_CB can be very convenient. For
instance, assuming that the index variable <c>i</c> was registered as handle
<c>i_handle</c>:

\code{.c}
/* Compute which portion we will work on, e.g. pivot */
starpu_task_insert(&which_index, STARPU_W, i_handle, 0);

/* And submit the corresponding task */
STARPU_DATA_ACQUIRE_CB(i_handle, STARPU_R,
                       starpu_task_insert(&work, STARPU_RW, A_handle[i], 0));
\endcode
 
The macro ::STARPU_DATA_ACQUIRE_CB submits an asynchronous request for
acquiring data <c>i</c> for the main application, and will execute the code
given as third parameter when it is acquired. In other words, as soon as the
value of <c>i</c> computed by the codelet <c>which_index</c> can be read, the
portion of code passed as third parameter of ::STARPU_DATA_ACQUIRE_CB will
be executed, and is allowed to read from <c>i</c> to use it e.g. as an
index. Note that this macro is only available when compiling StarPU with
the compiler <c>gcc</c>.
 
There are several ways of calling the function starpu_codelet_unpack_args().
The simplest is to unpack all the arguments in a single call:

\code{.c}
void func_cpu(void *descr[], void *_args)
{
        int ifactor;
        float ffactor;

        starpu_codelet_unpack_args(_args, &ifactor, &ffactor);
}
\endcode
 
Arguments can also be unpacked in several calls, passing a 0 pointer for the
arguments which are not to be retrieved yet:

\code{.c}
void func_cpu(void *descr[], void *_args)
{
        int ifactor;
        float ffactor;

        starpu_codelet_unpack_args(_args, &ifactor, 0);
        starpu_codelet_unpack_args(_args, &ifactor, &ffactor);
}
\endcode
 
Finally, starpu_codelet_unpack_args_and_copyleft() unpacks the first
arguments and copies the remaining ones into a separate buffer, from which
they can be unpacked later:

\code{.c}
void func_cpu(void *descr[], void *_args)
{
        int ifactor;
        float ffactor;
        char buffer[100];

        starpu_codelet_unpack_args_and_copyleft(_args, buffer, 100, &ifactor, 0);
        starpu_codelet_unpack_args(buffer, &ffactor);
}
\endcode
 
\section GettingTaskChildren Getting Task Children

It may be interesting to get the list of tasks which depend on a given task,
notably when using implicit dependencies, since this list is computed by StarPU.
starpu_task_get_task_succs() provides it. For instance:

\code{.c}
struct starpu_task *tasks[4];
ret = starpu_task_get_task_succs(task, sizeof(tasks)/sizeof(*tasks), tasks);
\endcode
 
\section ParallelTasks Parallel Tasks

StarPU can leverage existing parallel computation libraries by the means of
parallel tasks. A parallel task is a task which is worked on by a set of CPUs
(called a parallel or combined worker) at the same time, by using an existing
parallel CPU implementation of the computation to be achieved. This can also be
useful to improve the load balance between slow CPUs and fast GPUs: since CPUs
work collectively on a single task, the completion time of tasks on CPUs becomes
comparable to the completion time on GPUs, thus relieving granularity
discrepancy concerns. <c>hwloc</c> support needs to be enabled to get
good performance, otherwise StarPU will not know how to properly group
cores.

Two modes of execution exist to accommodate existing usages.
 
\subsection Fork-modeParallelTasks Fork-mode Parallel Tasks

In the Fork mode, StarPU will call the codelet function on one
of the CPUs of the combined worker. The codelet function can use
starpu_combined_worker_get_size() to get the number of threads it is
allowed to start to achieve the computation. The CPU binding mask for the whole
set of CPUs is already enforced, so that threads created by the function will
inherit the mask, and thus execute where StarPU expects, the OS being in charge
of choosing how to schedule threads on the corresponding CPUs. The application
can also choose to bind threads by hand, using e.g. <c>sched_getaffinity</c> to know
the CPU binding mask that StarPU chose.

For instance, using OpenMP (full source is available in
<c>examples/openmp/vector_scal.c</c>):

\snippet forkmode.c To be included. You should update doxygen if you see this text.

Other examples include for instance calling a BLAS parallel CPU implementation
(see <c>examples/mult/xgemm.c</c>).
 
\subsection SPMD-modeParallelTasks SPMD-mode Parallel Tasks

In the SPMD mode, StarPU will call the codelet function on
each CPU of the combined worker. The codelet function can use
starpu_combined_worker_get_size() to get the total number of CPUs
involved in the combined worker, and thus the number of calls that are made in
parallel to the function, and starpu_combined_worker_get_rank() to get
the rank of the current CPU within the combined worker. For instance:
 
\code{.c}
static void func(void *buffers[], void *_args)
{
    unsigned i;
    float *factor = _args;
    struct starpu_vector_interface *vector = buffers[0];
    unsigned n = STARPU_VECTOR_GET_NX(vector);
    float *val = (float *)STARPU_VECTOR_GET_PTR(vector);

    /* Compute slice to compute */
    unsigned m = starpu_combined_worker_get_size();
    unsigned j = starpu_combined_worker_get_rank();
    unsigned slice = (n+m-1)/m;

    for (i = j * slice; i < (j+1) * slice && i < n; i++)
        val[i] *= *factor;
}

static struct starpu_codelet cl =
{
    .modes = { STARPU_RW },
    .type = STARPU_SPMD,
    .max_parallelism = INT_MAX,
    .cpu_funcs = { func },
    .cpu_funcs_name = { "func" },
    .nbuffers = 1,
};
\endcode
 
Of course, this trivial example will not really benefit from parallel task
execution, and was only meant to be simple to understand. The benefit comes
when the computation to be done is such that threads have to e.g. exchange
intermediate results, or write to the data in a complex but safe way in the same
buffer.
 
\subsection ParallelTasksPerformance Parallel Tasks Performance

To benefit from parallel tasks, a parallel-task-aware StarPU scheduler has to
be used. When exposed to codelets with a flag ::STARPU_FORKJOIN or
::STARPU_SPMD, the schedulers <c>pheft</c> (parallel-heft) and <c>peager</c>
(parallel eager) will indeed also try to execute tasks with
several CPUs. They will automatically try the various available combined
worker sizes (making several measurements for each worker size) and
thus be able to avoid choosing a large combined worker if the codelet
does not actually scale so much.
 
\subsection CombinedWorkers Combined Workers

By default, StarPU creates combined workers according to the architecture
structure as detected by <c>hwloc</c>. It means that for each object of the <c>hwloc</c>
topology (NUMA node, socket, cache, ...) a combined worker will be created. If
some nodes of the hierarchy have a big arity (e.g. many cores in a socket
without a hierarchy of shared caches), StarPU will create combined workers of
intermediate sizes. The environment variable \ref STARPU_SYNTHESIZE_ARITY_COMBINED_WORKER
permits tuning the maximum arity between levels of combined workers.

The combined workers actually produced can be seen in the output of the
tool <c>starpu_machine_display</c> (the environment variable
\ref STARPU_SCHED has to be set to a combined-worker-aware scheduler such
as <c>pheft</c> or <c>peager</c>).
 
\subsection ConcurrentParallelTasks Concurrent Parallel Tasks

Unfortunately, many environments and libraries do not support concurrent
calls.

For instance, most OpenMP implementations (including the main ones) do not
support concurrent <c>pragma omp parallel</c> statements without nesting them in
another <c>pragma omp parallel</c> statement, but StarPU does not yet support
creating its CPU workers by using such a pragma.

Other parallel libraries are also not safe when being invoked concurrently
from different threads, due to the use of global variables in their sequential
sections, for instance.

The solution is then to use only one combined worker at a time. This can be
done by setting the field starpu_conf::single_combined_worker to <c>1</c>, or
setting the environment variable \ref STARPU_SINGLE_COMBINED_WORKER
to <c>1</c>. StarPU will then run only one parallel task at a time (but other
CPU and GPU tasks are not affected and can be run concurrently). The parallel
task scheduler will however still try varying combined worker
sizes to look for the most efficient ones.
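
A minimal sketch of the field-based approach:

\code{.c}
struct starpu_conf conf;
starpu_conf_init(&conf);
/* Run at most one parallel task at a time. */
conf.single_combined_worker = 1;
starpu_init(&conf);
\endcode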
 
\subsection SynchronizationTasks Synchronization Tasks

For the application's convenience, it may be useful to define tasks which do not
actually perform any computation, but carry for instance dependencies between other
tasks or tags, or are to be submitted in callbacks, etc.

The obvious way is of course to make the kernel functions empty, but such a task
will still have to wait for a worker to become ready, transfer data, etc.

A much lighter way to define a synchronization task is to set its starpu_task::cl
field to <c>NULL</c>. The task will thus be a mere synchronization point,
without any data access or execution content: as soon as its dependencies become
available, it will terminate, call the callbacks, and release its dependencies.
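
A minimal sketch (the dependency array <c>deps</c> and the callback
<c>my_callback</c> are hypothetical):

\code{.c}
struct starpu_task *sync = starpu_task_create();
sync->cl = NULL; /* pure synchronization point, no computation */
starpu_task_declare_deps_array(sync, 2, deps);
sync->callback_func = my_callback; /* called when both dependencies are released */
starpu_task_submit(sync);
\endcode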
 
An intermediate solution is to define a codelet with its
starpu_codelet::where field set to \ref STARPU_NOWHERE, for instance:

\code{.c}
struct starpu_codelet cl =
{
	.where = STARPU_NOWHERE,
	.nbuffers = 1,
	.modes = { STARPU_R },
};

task = starpu_task_create();
task->cl = &cl;
task->handles[0] = handle;
starpu_task_submit(task);
\endcode

This will create a task which simply waits for the value of <c>handle</c> to be
available for read. This task can then be depended on, etc.
 
*/
 
 