
/* StarPU --- Runtime system for heterogeneous multicore architectures.
 *
 * Copyright (C) 2010-2019 CNRS
 * Copyright (C) 2011,2012,2018 Inria
 * Copyright (C) 2009-2011,2014-2016,2018 Université de Bordeaux
 *
 * StarPU is free software; you can redistribute it and/or modify
 * it under the terms of the GNU Lesser General Public License as published by
 * the Free Software Foundation; either version 2.1 of the License, or (at
 * your option) any later version.
 *
 * StarPU is distributed in the hope that it will be useful, but
 * WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
 *
 * See the GNU Lesser General Public License in COPYING.LGPL for more details.
 */
/*! \page TasksInStarPU Tasks In StarPU

\section TaskGranularity Task Granularity
Like any other runtime, StarPU has some overhead to manage tasks. Since
it does smart scheduling and data management, this overhead is not always
negligible. It is typically on the order of a couple of microseconds, which
is actually significantly smaller than the CUDA overhead itself. The amount
of work that a task does should thus be somewhat bigger, to make sure that
the overhead becomes negligible. The offline performance feedback can
provide a measure of task length, which should thus be checked if poor
performance is observed. To get an idea of the scalability achievable
depending on task size, one can run
<c>tests/microbenchs/tasks_size_overhead.sh</c>, which draws curves of the
speedup of independent tasks of very small sizes.

The choice of scheduler also has an impact on the overhead: for instance, the
scheduler <c>dmda</c> takes time to make a decision, while <c>eager</c> does
not. <c>tasks_size_overhead.sh</c> can again be used to measure how much
impact this has on the target machine.
\section TaskSubmission Task Submission

To let StarPU make online optimizations, tasks should be submitted
asynchronously as much as possible. Ideally, all tasks should be
submitted first, with mere calls to starpu_task_wait_for_all() or
starpu_data_unregister() made to wait for their termination. StarPU will
then be able to rework the whole schedule, overlap computation with
communication, manage accelerator local memory usage, etc.
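
For instance, a minimal sketch of this pattern (the codelet <c>cl</c>, the
handles and <c>ntasks</c> are illustrative):

\code{.c}
unsigned i;
for (i = 0; i < ntasks; i++)
{
        struct starpu_task *task = starpu_task_create();
        task->cl = &cl;
        task->handles[0] = handles[i];
        /* Asynchronous: returns as soon as the task is queued. */
        int ret = starpu_task_submit(task);
        STARPU_CHECK_RETURN_VALUE(ret, "starpu_task_submit");
}
/* Wait for the whole set at once; StarPU is free to reorder and
   overlap everything submitted above in the meantime. */
starpu_task_wait_for_all();
\endcode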
\section TaskPriorities Task Priorities

By default, StarPU will consider the tasks in the order they are submitted by
the application. If the application programmer knows that some tasks should
be given priority (for instance because their output is needed by many
other tasks and may thus be a bottleneck if not executed early
enough), the field starpu_task::priority should be set to transmit this
priority information to StarPU.
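
For instance, a minimal sketch (the codelet <c>cl</c> and <c>handle</c> are
illustrative):

\code{.c}
struct starpu_task *task = starpu_task_create();
task->cl = &cl;
task->handles[0] = handle;
/* Hint the scheduler that this task should be executed early. */
task->priority = STARPU_MAX_PRIO;
starpu_task_submit(task);

/* Equivalently, with the helper: */
starpu_task_insert(&cl, STARPU_PRIORITY, STARPU_MAX_PRIO, STARPU_RW, handle, 0);
\endcode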
\section TaskDependencies Task Dependencies

\subsection SequentialConsistency Sequential Consistency

By default, task dependencies are inferred from data dependencies (sequential
coherency) by StarPU. The application can however disable sequential coherency
for some data, and dependencies can be expressed explicitly.

Setting (or unsetting) sequential consistency can be done at the data
level by calling starpu_data_set_sequential_consistency_flag() for a
specific piece of data, or starpu_data_set_default_sequential_consistency_flag()
for all data.
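
For instance, a minimal sketch disabling it for one handle:

\code{.c}
/* Dependencies will no longer be inferred from accesses to this data;
   the application has to express them explicitly if needed. */
starpu_data_set_sequential_consistency_flag(handle, 0);
\endcode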
Setting (or unsetting) sequential consistency can also be done at the task
level by setting the field starpu_task::sequential_consistency to \c 0.

Sequential consistency can also be set (or unset) for each handle of a
specific task, through the field
starpu_task::handles_sequential_consistency. When set, its value
should be an array with as many elements as the task has handles,
each element of the array being the sequential consistency for the
\c i-th handle of the task. The field can easily be set when calling
starpu_task_insert() with the flag
::STARPU_HANDLES_SEQUENTIAL_CONSISTENCY.
\code{.c}
char *seq_consistency = malloc(cl.nbuffers * sizeof(char));
seq_consistency[0] = 1;
seq_consistency[1] = 1;
seq_consistency[2] = 0;
ret = starpu_task_insert(&cl,
        STARPU_RW, handleA, STARPU_RW, handleB, STARPU_RW, handleC,
        STARPU_HANDLES_SEQUENTIAL_CONSISTENCY, seq_consistency,
        0);
free(seq_consistency);
\endcode
The internal algorithm used by StarPU to set up implicit dependencies is
as follows:

\code{.c}
if (sequential_consistency(task) == 1)
        for (i = 0; i < STARPU_TASK_GET_NBUFFERS(task); i++)
                if (sequential_consistency(i-th data, task) == 1)
                        if (sequential_consistency(i-th data) == 1)
                                create_implicit_dependency(...)
\endcode
\subsection TasksAndTagsDependencies Tasks And Tags Dependencies

One can explicitly set dependencies between tasks using
starpu_task_declare_deps() or starpu_task_declare_deps_array(). Dependencies
between tasks can also be expressed through tags: a task is associated to a
tag with the field starpu_task::tag_id, and dependencies between tags are
declared with the function starpu_tag_declare_deps() or
starpu_tag_declare_deps_array().
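
For instance, a minimal sketch of both mechanisms (<c>taskA</c>, <c>taskB</c>,
<c>taskC</c> and the tag values are illustrative):

\code{.c}
/* Explicit task dependencies: taskC waits for taskA and taskB. */
struct starpu_task *deps[] = { taskA, taskB };
starpu_task_declare_deps_array(taskC, 2, deps);

/* The same expressed with tags: tag 0x42 waits for tags 0x32 and 0x52. */
taskC->use_tag = 1;
taskC->tag_id = 0x42;
starpu_tag_declare_deps((starpu_tag_t)0x42, 2, (starpu_tag_t)0x32, (starpu_tag_t)0x52);
\endcode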
The termination of a task can be delayed through the function
starpu_task_end_dep_add(), which specifies the number of calls to the function
starpu_task_end_dep_release() needed to trigger the task termination. One can
also use starpu_task_declare_end_deps() or starpu_task_declare_end_deps_array()
to delay the termination of a task until the termination of other tasks.
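
For instance, a minimal sketch, assuming the end-dependency functions above
(the asynchronous operations whose callbacks perform the releases are not
shown):

\code{.c}
/* The task will not be considered as terminated until
   starpu_task_end_dep_release() has been called twice on it. */
starpu_task_end_dep_add(task, 2);
starpu_task_submit(task);

/* ... later, e.g. from the callbacks of two other asynchronous operations: */
starpu_task_end_dep_release(task);
starpu_task_end_dep_release(task);
\endcode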
\section SettingManyDataHandlesForATask Setting Many Data Handles For a Task

The maximum number of data handles a task can manage is fixed by the
compile-time constant \ref STARPU_NMAXBUFS, whose default value can be changed
through the \c configure option \ref enable-maxbuffers "--enable-maxbuffers".
However, it is possible to define tasks managing more data by using
the field starpu_task::dyn_handles when defining a task, and the field
starpu_codelet::dyn_modes when defining the corresponding codelet.
\code{.c}
enum starpu_data_access_mode modes[STARPU_NMAXBUFS+1] =
{
        STARPU_R, STARPU_R, ...
};

struct starpu_codelet dummy_big_cl =
{
        .cuda_funcs = { dummy_big_kernel },
        .opencl_funcs = { dummy_big_kernel },
        .cpu_funcs = { dummy_big_kernel },
        .cpu_funcs_name = { "dummy_big_kernel" },
        .nbuffers = STARPU_NMAXBUFS+1,
        .dyn_modes = modes
};

task = starpu_task_create();
task->cl = &dummy_big_cl;
task->dyn_handles = malloc(task->cl->nbuffers * sizeof(starpu_data_handle_t));
for(i=0 ; i<task->cl->nbuffers ; i++)
{
        task->dyn_handles[i] = handle;
}
starpu_task_submit(task);
\endcode
When using starpu_task_insert(), the handles can be passed with
::STARPU_DATA_ARRAY, and StarPU will store them in starpu_task::dyn_handles
as needed:

\code{.c}
starpu_data_handle_t *handles = malloc(dummy_big_cl.nbuffers * sizeof(starpu_data_handle_t));
for(i=0 ; i<dummy_big_cl.nbuffers ; i++)
{
        handles[i] = handle;
}
starpu_task_insert(&dummy_big_cl,
        STARPU_VALUE, &dummy_big_cl.nbuffers, sizeof(dummy_big_cl.nbuffers),
        STARPU_DATA_ARRAY, handles, dummy_big_cl.nbuffers,
        0);
\endcode
The whole code for this example is available in the file
<c>examples/basic_examples/dynamic_handles.c</c>.
\section SettingVariableDataHandlesForATask Setting a Variable Number Of Data Handles For a Task

Normally, the number of data handles given to a task is fixed in the
starpu_codelet::nbuffers codelet field. This field can however be set to
\ref STARPU_VARIABLE_NBUFFERS, in which case the starpu_task::nbuffers task field
must be set, and the starpu_task::modes field (or the starpu_task::dyn_modes field,
see \ref SettingManyDataHandlesForATask) should be used to specify the modes for
the handles.
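
For instance, a minimal sketch (the kernel <c>variable_kernel</c> is
illustrative; it can retrieve its actual number of buffers with
STARPU_TASK_GET_NBUFFERS()):

\code{.c}
struct starpu_codelet variable_cl =
{
        .cpu_funcs = { variable_kernel },
        /* The actual number of buffers is given per task. */
        .nbuffers = STARPU_VARIABLE_NBUFFERS,
};

struct starpu_task *task = starpu_task_create();
task->cl = &variable_cl;
task->nbuffers = 2;
task->handles[0] = handleA;
task->handles[1] = handleB;
task->modes[0] = STARPU_R;
task->modes[1] = STARPU_RW;
starpu_task_submit(task);
\endcode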
\section UsingMultipleImplementationsOfACodelet Using Multiple Implementations Of A Codelet

One may want to write multiple implementations of a codelet for a single type of
device and let StarPU choose which one to run. As an example, we will show how
to use SSE to scale a vector. The codelet can be written as follows:
\code{.c}
#include <xmmintrin.h>

void scal_sse_func(void *buffers[], void *cl_arg)
{
        float *vector = (float *) STARPU_VECTOR_GET_PTR(buffers[0]);
        unsigned int n = STARPU_VECTOR_GET_NX(buffers[0]);
        unsigned int n_iterations = n/4;
        if (n % 4 != 0)
                n_iterations++;

        __m128 *VECTOR = (__m128*) vector;
        __m128 factor __attribute__((aligned(16)));
        factor = _mm_set1_ps(*(float *) cl_arg);

        unsigned int i;
        for (i = 0; i < n_iterations; i++)
                VECTOR[i] = _mm_mul_ps(factor, VECTOR[i]);
}
\endcode
The different implementations are then all listed in the field
starpu_codelet::cpu_funcs:

\code{.c}
struct starpu_codelet cl =
{
        .cpu_funcs = { scal_cpu_func, scal_sse_func },
        .cpu_funcs_name = { "scal_cpu_func", "scal_sse_func" },
        .nbuffers = 1,
        .modes = { STARPU_RW }
};
\endcode
Schedulers which are multi-implementation aware (only <c>dmda</c> and
<c>pheft</c> for now) will use the performance models of all the
implementations they were given, and pick the one which seems to be the fastest.
\section EnablingImplementationAccordingToCapabilities Enabling Implementation According To Capabilities

Some implementations may not run on some devices. For instance, some CUDA
devices do not support double floating point precision, and thus the kernel
execution would just fail; or the device may not have enough shared memory for
the implementation being used. The field starpu_codelet::can_execute
makes it possible to express this. For instance:
\code{.c}
static int can_execute(unsigned workerid, struct starpu_task *task, unsigned nimpl)
{
        const struct cudaDeviceProp *props;
        if (starpu_worker_get_type(workerid) == STARPU_CPU_WORKER)
                return 1;
        /* CUDA device */
        props = starpu_cuda_get_device_properties(workerid);
        if (props->major >= 2 || props->minor >= 3)
                /* At least compute capability 1.3, supports doubles */
                return 1;
        /* Old card, does not support doubles */
        return 0;
}

struct starpu_codelet cl =
{
        .can_execute = can_execute,
        .cpu_funcs = { cpu_func },
        .cpu_funcs_name = { "cpu_func" },
        .cuda_funcs = { gpu_func },
        .nbuffers = 1,
        .modes = { STARPU_RW }
};
\endcode
This can be essential e.g. when running on a machine which mixes various models
of CUDA devices, to benefit from the new models without crashing on old ones.

Note: the function starpu_codelet::can_execute is called by the
scheduler each time it tries to match a task with a worker, and should
thus be very fast. The function starpu_cuda_get_device_properties()
provides quick access to the CUDA properties of CUDA devices to achieve
such efficiency.
Another example is to compile CUDA code for various compute capabilities,
resulting in two CUDA functions, e.g. <c>scal_gpu_13</c> for compute capability
1.3, and <c>scal_gpu_20</c> for compute capability 2.0. Both functions can be
provided to StarPU by using starpu_codelet::cuda_funcs, and
starpu_codelet::can_execute can then be used to rule out the
<c>scal_gpu_20</c> variant on a CUDA device which will not be able to execute it:
\code{.c}
static int can_execute(unsigned workerid, struct starpu_task *task, unsigned nimpl)
{
        const struct cudaDeviceProp *props;
        if (starpu_worker_get_type(workerid) == STARPU_CPU_WORKER)
                return 1;
        /* CUDA device */
        if (nimpl == 0)
                /* Trying to execute the 1.3 capability variant, we assume it is ok in all cases. */
                return 1;
        /* Trying to execute the 2.0 capability variant, check that the card can do it. */
        props = starpu_cuda_get_device_properties(workerid);
        if (props->major >= 2)
                /* At least compute capability 2.0, can run it */
                return 1;
        /* Old card, does not support 2.0, will not be able to execute the 2.0 variant. */
        return 0;
}

struct starpu_codelet cl =
{
        .can_execute = can_execute,
        .cpu_funcs = { cpu_func },
        .cpu_funcs_name = { "cpu_func" },
        .cuda_funcs = { scal_gpu_13, scal_gpu_20 },
        .nbuffers = 1,
        .modes = { STARPU_RW }
};
\endcode
Another example is having specialized implementations for some given common
sizes, for instance a specialized implementation for 1024x1024 matrices:
\code{.c}
static int can_execute(unsigned workerid, struct starpu_task *task, unsigned nimpl)
{
        if (starpu_worker_get_type(workerid) == STARPU_CPU_WORKER)
                return 1;
        /* CUDA device */
        switch (nimpl)
        {
                case 0:
                        /* Trying to execute the generic capability variant. */
                        return 1;
                case 1:
                {
                        /* Trying to execute the size == 1024 specific variant. */
                        struct starpu_matrix_interface *interface = starpu_data_get_interface_on_node(task->handles[0], STARPU_MAIN_RAM);
                        return STARPU_MATRIX_GET_NX(interface) == 1024 && STARPU_MATRIX_GET_NY(interface) == 1024;
                }
        }
        return 0;
}

struct starpu_codelet cl =
{
        .can_execute = can_execute,
        .cpu_funcs = { cpu_func },
        .cpu_funcs_name = { "cpu_func" },
        .cuda_funcs = { potrf_gpu_generic, potrf_gpu_1024 },
        .nbuffers = 1,
        .modes = { STARPU_RW }
};
\endcode
Note: the most generic variant should be provided first, as some schedulers are
not able to try the different variants.

\section InsertTaskUtility Insert Task Utility

StarPU provides the wrapper function starpu_task_insert() to ease
the creation and submission of tasks.

Here is the implementation of the codelet:
\code{.c}
void func_cpu(void *descr[], void *_args)
{
        int *x0 = (int *)STARPU_VARIABLE_GET_PTR(descr[0]);
        float *x1 = (float *)STARPU_VARIABLE_GET_PTR(descr[1]);
        int ifactor;
        float ffactor;

        starpu_codelet_unpack_args(_args, &ifactor, &ffactor);
        *x0 = *x0 * ifactor;
        *x1 = *x1 * ffactor;
}

struct starpu_codelet mycodelet =
{
        .cpu_funcs = { func_cpu },
        .cpu_funcs_name = { "func_cpu" },
        .nbuffers = 2,
        .modes = { STARPU_RW, STARPU_RW }
};
\endcode

And the call to the function starpu_task_insert():

\code{.c}
starpu_task_insert(&mycodelet,
        STARPU_VALUE, &ifactor, sizeof(ifactor),
        STARPU_VALUE, &ffactor, sizeof(ffactor),
        STARPU_RW, data_handles[0],
        STARPU_RW, data_handles[1],
        0);
\endcode
The call to starpu_task_insert() is equivalent to the following code:

\code{.c}
struct starpu_task *task = starpu_task_create();
task->cl = &mycodelet;
task->handles[0] = data_handles[0];
task->handles[1] = data_handles[1];
char *arg_buffer;
size_t arg_buffer_size;
starpu_codelet_pack_args(&arg_buffer, &arg_buffer_size,
        STARPU_VALUE, &ifactor, sizeof(ifactor),
        STARPU_VALUE, &ffactor, sizeof(ffactor),
        0);
task->cl_arg = arg_buffer;
task->cl_arg_size = arg_buffer_size;
int ret = starpu_task_submit(task);
\endcode
Here is a similar call using ::STARPU_DATA_ARRAY:

\code{.c}
starpu_task_insert(&mycodelet,
        STARPU_DATA_ARRAY, data_handles, 2,
        STARPU_VALUE, &ifactor, sizeof(ifactor),
        STARPU_VALUE, &ffactor, sizeof(ffactor),
        0);
\endcode
If some part of the task insertion depends on the value of some computation,
the macro ::STARPU_DATA_ACQUIRE_CB can be very convenient. For
instance, assuming that the index variable <c>i</c> was registered as handle
<c>i_handle</c>:

\code{.c}
/* Compute which portion we will work on, e.g. pivot */
starpu_task_insert(&which_index, STARPU_W, i_handle, 0);

/* And submit the corresponding task */
STARPU_DATA_ACQUIRE_CB(i_handle, STARPU_R,
        starpu_task_insert(&work, STARPU_RW, A_handle[i], 0));
\endcode
The macro ::STARPU_DATA_ACQUIRE_CB submits an asynchronous request for
acquiring data <c>i</c> for the main application, and will execute the code
given as third parameter when it is acquired. In other words, as soon as the
value of <c>i</c> computed by the codelet <c>which_index</c> can be read, the
portion of code passed as third parameter of ::STARPU_DATA_ACQUIRE_CB will
be executed, and is allowed to read from <c>i</c> to use it e.g. as an
index. Note that this macro is only available when compiling StarPU with
the compiler <c>gcc</c>.

StarPU also provides a utility function starpu_codelet_unpack_args() to
retrieve the ::STARPU_VALUE arguments passed to the task. There are several
ways of calling starpu_codelet_unpack_args(), illustrated below.
All the arguments can be unpacked at once:

\code{.c}
void func_cpu(void *descr[], void *_args)
{
        int ifactor;
        float ffactor;

        starpu_codelet_unpack_args(_args, &ifactor, &ffactor);
}
\endcode

A \c 0 pointer can be given to skip retrieving an argument; here the first
call only retrieves <c>ifactor</c>, and the second call retrieves both:

\code{.c}
void func_cpu(void *descr[], void *_args)
{
        int ifactor;
        float ffactor;

        starpu_codelet_unpack_args(_args, &ifactor, 0);
        starpu_codelet_unpack_args(_args, &ifactor, &ffactor);
}
\endcode

Finally, starpu_codelet_unpack_args_and_copyleft() retrieves the first
arguments and copies the remaining, still-packed ones into a user buffer,
from which they can be unpacked later:

\code{.c}
void func_cpu(void *descr[], void *_args)
{
        int ifactor;
        float ffactor;
        char buffer[100];

        starpu_codelet_unpack_args_and_copyleft(_args, buffer, 100, &ifactor, 0);
        starpu_codelet_unpack_args(buffer, &ffactor);
}
\endcode
\section GettingTaskChildren Getting Task Children

It may be interesting to get the list of tasks which depend on a given task,
notably when using implicit dependencies, since this list is computed by StarPU.
starpu_task_get_task_succs() provides it. For instance:

\code{.c}
struct starpu_task *tasks[4];
ret = starpu_task_get_task_succs(task, sizeof(tasks)/sizeof(*tasks), tasks);
\endcode
\section ParallelTasks Parallel Tasks

StarPU can leverage existing parallel computation libraries by the means of
parallel tasks. A parallel task is a task which is worked on by a set of CPUs
(called a parallel or combined worker) at the same time, by using an existing
parallel CPU implementation of the computation to be achieved. This can also be
useful to improve the load balance between slow CPUs and fast GPUs: since CPUs
work collectively on a single task, the completion time of tasks on CPUs becomes
comparable to the completion time on GPUs, thus relieving granularity
discrepancy concerns. <c>hwloc</c> support needs to be enabled to get
good performance, otherwise StarPU will not know how to better group
cores.

Two modes of execution exist to accommodate existing usages.
\subsection Fork-modeParallelTasks Fork-mode Parallel Tasks

In the Fork mode, StarPU will call the codelet function on one
of the CPUs of the combined worker. The codelet function can use
starpu_combined_worker_get_size() to get the number of threads it is
allowed to start to achieve the computation. The CPU binding mask for the whole
set of CPUs is already enforced, so that threads created by the function will
inherit the mask, and thus execute where StarPU expected, the OS being in charge
of choosing how to schedule threads on the corresponding CPUs. The application
can also choose to bind threads by hand, using e.g. <c>sched_getaffinity</c> to know
the CPU binding mask that StarPU chose.

For instance, using OpenMP (full source is available in
<c>examples/openmp/vector_scal.c</c>):

\snippet forkmode.c To be included. You should update doxygen if you see this text.

Other examples include for instance calling a BLAS parallel CPU implementation
(see <c>examples/mult/xgemm.c</c>).
\subsection SPMD-modeParallelTasks SPMD-mode Parallel Tasks

In the SPMD mode, StarPU will call the codelet function on
each CPU of the combined worker. The codelet function can use
starpu_combined_worker_get_size() to get the total number of CPUs
involved in the combined worker, and thus the number of calls that are made in
parallel to the function, and starpu_combined_worker_get_rank() to get
the rank of the current CPU within the combined worker. For instance:
\code{.c}
static void func(void *buffers[], void *_args)
{
        unsigned i;
        float *factor = _args;
        struct starpu_vector_interface *vector = buffers[0];
        unsigned n = STARPU_VECTOR_GET_NX(vector);
        float *val = (float *)STARPU_VECTOR_GET_PTR(vector);

        /* Compute slice to compute */
        unsigned m = starpu_combined_worker_get_size();
        unsigned j = starpu_combined_worker_get_rank();
        unsigned slice = (n+m-1)/m;

        for (i = j * slice; i < (j+1) * slice && i < n; i++)
                val[i] *= *factor;
}

static struct starpu_codelet cl =
{
        .modes = { STARPU_RW },
        .type = STARPU_SPMD,
        .max_parallelism = INT_MAX,
        .cpu_funcs = { func },
        .cpu_funcs_name = { "func" },
        .nbuffers = 1,
};
\endcode
Of course, this trivial example will not really benefit from parallel task
execution, and was only meant to be simple to understand. The benefit comes
when the computation to be done is such that threads have to e.g. exchange
intermediate results, or write to the data in a complex but safe way in the same
buffer.
\subsection ParallelTasksPerformance Parallel Tasks Performance

To benefit from parallel tasks, a parallel-task-aware StarPU scheduler has to
be used. When exposed to codelets with a flag ::STARPU_FORKJOIN or
::STARPU_SPMD, the schedulers <c>pheft</c> (parallel-heft) and <c>peager</c>
(parallel eager) will indeed also try to execute tasks with
several CPUs. They will automatically try the various available combined
worker sizes (making several measurements for each worker size) and
thus be able to avoid choosing a large combined worker if the codelet
does not actually scale so much.
\subsection CombinedWorkers Combined Workers

By default, StarPU creates combined workers according to the architecture
structure as detected by <c>hwloc</c>. This means that for each object of the <c>hwloc</c>
topology (NUMA node, socket, cache, ...) a combined worker will be created. If
some nodes of the hierarchy have a big arity (e.g. many cores in a socket
without a hierarchy of shared caches), StarPU will create combined workers of
intermediate sizes. The variable \ref STARPU_SYNTHESIZE_ARITY_COMBINED_WORKER
makes it possible to tune the maximum arity between levels of combined workers.

The combined workers actually produced can be seen in the output of the
tool <c>starpu_machine_display</c> (the environment variable
\ref STARPU_SCHED has to be set to a combined-worker-aware scheduler such
as <c>pheft</c> or <c>peager</c>).
\subsection ConcurrentParallelTasks Concurrent Parallel Tasks

Unfortunately, many environments and libraries do not support concurrent
calls.

For instance, most OpenMP implementations (including the main ones) do not
support concurrent <c>pragma omp parallel</c> statements without nesting them in
another <c>pragma omp parallel</c> statement, but StarPU does not yet support
creating its CPU workers by using such a pragma.

Other parallel libraries are also not safe when being invoked concurrently
from different threads, due for instance to the use of global variables in
their sequential sections.

The solution is then to use only one combined worker at a time. This can be
done by setting the field starpu_conf::single_combined_worker to <c>1</c>, or
setting the environment variable \ref STARPU_SINGLE_COMBINED_WORKER
to <c>1</c>. StarPU will then run only one parallel task at a time (but other
CPU and GPU tasks are not affected and can be run concurrently). The parallel
task scheduler will however still try varying combined worker
sizes to look for the most efficient ones.
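
For instance, a minimal sketch of the programmatic variant:

\code{.c}
struct starpu_conf conf;
starpu_conf_init(&conf);
/* Run at most one parallel task at a time. */
conf.single_combined_worker = 1;
int ret = starpu_init(&conf);
\endcode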
\subsection SynchronizationTasks Synchronization Tasks

For the application's convenience, it may be useful to define tasks which do
not actually perform any computation, but bear for instance dependencies
between other tasks or tags, or are to be submitted in callbacks, etc.

The obvious way is of course to make kernel functions empty, but such a task
will still have to wait for a worker to become ready, transfer data, etc.

A much lighter way to define a synchronization task is to set its starpu_task::cl
field to <c>NULL</c>. The task will then be a mere synchronization point,
without any data access or execution content: as soon as its dependencies become
available, it will terminate, call the callbacks, and release its dependencies.
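
For instance, a minimal sketch of such an empty task (<c>taskA</c>,
<c>taskB</c> and <c>my_callback</c> are illustrative):

\code{.c}
struct starpu_task *sync_task = starpu_task_create();
/* No codelet: this task is a pure synchronization point. */
sync_task->cl = NULL;
struct starpu_task *deps[] = { taskA, taskB };
starpu_task_declare_deps_array(sync_task, 2, deps);
/* my_callback is called as soon as taskA and taskB have terminated. */
sync_task->callback_func = my_callback;
starpu_task_submit(sync_task);
\endcode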
An intermediate solution is to define a codelet with its
starpu_codelet::where field set to \ref STARPU_NOWHERE, for instance:
\code{.c}
struct starpu_codelet cl =
{
        .where = STARPU_NOWHERE,
        .nbuffers = 1,
        .modes = { STARPU_R },
};

task = starpu_task_create();
task->cl = &cl;
task->handles[0] = handle;
starpu_task_submit(task);
\endcode

will create a task which simply waits for the value of <c>handle</c> to be
available for read. This task can then be depended on, etc.

*/