/*
 * This file is part of the StarPU Handbook.
 * Copyright (C) 2009--2011  Université de Bordeaux 1
 * Copyright (C) 2010, 2011, 2012, 2013, 2014  Centre National de la Recherche Scientifique
 * Copyright (C) 2011, 2012 Institut National de Recherche en Informatique et Automatique
 * See the file version.doxy for copying conditions.
 */

/*! \page TasksInStarPU Tasks In StarPU

\section TaskGranularity Task Granularity

Like any other runtime, StarPU has some overhead to manage tasks. Since
it does smart scheduling and data management, that overhead is not always
negligible. Its order of magnitude is typically a couple of
microseconds, which is notably smaller than the CUDA overhead itself. The
amount of work that a task does should thus be somewhat
larger, to make sure that the overhead becomes negligible. The offline
performance feedback can provide a measure of task length, which should thus be
checked if poor performance is observed. To get a grasp of the scalability
achievable according to task size, one can run
<c>tests/microbenchs/tasks_size_overhead.sh</c>, which draws curves of the
speedup of independent tasks of very small sizes.

The choice of scheduler also has an impact on the overhead: for instance, the
scheduler <c>dmda</c> takes time to make a decision, while <c>eager</c> does
not. <c>tasks_size_overhead.sh</c> can again be used to get a grasp of how much
impact that has on the target machine.

\section TaskSubmission Task Submission

To let StarPU make online optimizations, tasks should be submitted
asynchronously as much as possible. Ideally, all tasks should be
submitted first, and mere calls to starpu_task_wait_for_all() or
starpu_data_unregister() should then be done to wait for
termination. StarPU will then be able to rework the whole schedule, overlap
computation with communication, manage accelerator local memory usage, etc.
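
As a minimal sketch of this pattern (the codelet <c>cl</c>, the handles
<c>handles[i]</c> and the count <c>ntasks</c> are hypothetical names, not part
of the StarPU API), everything is submitted asynchronously and the application
blocks only once at the end:

\code{.c}
unsigned i;
for (i = 0; i < ntasks; i++)  /* ntasks: hypothetical task count */
{
	struct starpu_task *task = starpu_task_create();
	task->cl = &cl;                /* hypothetical codelet */
	task->handles[0] = handles[i]; /* hypothetical handles */
	/* Returns as soon as the task is queued; StarPU schedules it and
	 * overlaps data transfers with computation behind the scenes. */
	starpu_task_submit(task);
}
/* Block only once, after the whole set of tasks has been submitted. */
starpu_task_wait_for_all();
\endcode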

\section TaskPriorities Task Priorities

By default, StarPU will consider the tasks in the order they are submitted by
the application. If the application programmer knows that some tasks should
be performed in priority (for instance because their output is needed by many
other tasks and may thus be a bottleneck if not executed early
enough), the field starpu_task::priority should be set to transmit the
priority information to StarPU.
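
For instance, a sketch which marks a task as high-priority could look as
follows (the codelet <c>cl</c> and the handle <c>handle</c> are hypothetical
names):

\code{.c}
struct starpu_task *task = starpu_task_create();
task->cl = &cl;            /* hypothetical codelet */
task->handles[0] = handle; /* hypothetical handle */
/* Hint to the scheduler that this task is on the critical path. */
task->priority = STARPU_MAX_PRIO;
starpu_task_submit(task);
\endcode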

\section SettingTheDataHandlesForATask Setting The Data Handles For A Task

The maximum number of data buffers a task can manage is fixed by the macro
\ref STARPU_NMAXBUFS, whose default value can be changed
through the configure option \ref enable-maxbuffers "--enable-maxbuffers".
However, it is possible to define tasks managing more data by using
the field starpu_task::dyn_handles when defining a task and the field
starpu_codelet::dyn_modes when defining the corresponding codelet.

\code{.c}
enum starpu_data_access_mode modes[STARPU_NMAXBUFS+1] =
{
	STARPU_R, STARPU_R, ...
};

struct starpu_codelet dummy_big_cl =
{
	.cuda_funcs = { dummy_big_kernel, NULL },
	.opencl_funcs = { dummy_big_kernel, NULL },
	.cpu_funcs = { dummy_big_kernel, NULL },
	.cpu_funcs_name = { "dummy_big_kernel", NULL },
	.nbuffers = STARPU_NMAXBUFS+1,
	.dyn_modes = modes
};

struct starpu_task *task = starpu_task_create();
task->cl = &dummy_big_cl;
task->dyn_handles = malloc(task->cl->nbuffers * sizeof(starpu_data_handle_t));
unsigned i;
for(i=0 ; i<task->cl->nbuffers ; i++)
{
	task->dyn_handles[i] = handle;
}
starpu_task_submit(task);
\endcode

\code{.c}
starpu_data_handle_t *handles = malloc(dummy_big_cl.nbuffers * sizeof(starpu_data_handle_t));
unsigned i;
for(i=0 ; i<dummy_big_cl.nbuffers ; i++)
{
	handles[i] = handle;
}
starpu_task_insert(&dummy_big_cl,
                   STARPU_VALUE, &dummy_big_cl.nbuffers, sizeof(dummy_big_cl.nbuffers),
                   STARPU_DATA_ARRAY, handles, dummy_big_cl.nbuffers,
                   0);
\endcode

The whole code for this complex data interface is available in the
file <c>examples/basic_examples/dynamic_handles.c</c>.

\section UsingMultipleImplementationsOfACodelet Using Multiple Implementations Of A Codelet

One may want to write multiple implementations of a codelet for a single type of
device and let StarPU choose which one to run. As an example, we will show how
to use SSE to scale a vector. The codelet can be written as follows:

\code{.c}
#include <xmmintrin.h>

void scal_sse_func(void *buffers[], void *cl_arg)
{
	float *vector = (float *) STARPU_VECTOR_GET_PTR(buffers[0]);
	unsigned int n = STARPU_VECTOR_GET_NX(buffers[0]);
	unsigned int n_iterations = n/4;
	if (n % 4 != 0)
		n_iterations++;

	__m128 *VECTOR = (__m128*) vector;
	__m128 factor __attribute__((aligned(16)));
	factor = _mm_set1_ps(*(float *) cl_arg);

	unsigned int i;
	for (i = 0; i < n_iterations; i++)
		VECTOR[i] = _mm_mul_ps(factor, VECTOR[i]);
}
\endcode

\code{.c}
struct starpu_codelet cl =
{
	.cpu_funcs = { scal_cpu_func, scal_sse_func, NULL },
	.cpu_funcs_name = { "scal_cpu_func", "scal_sse_func", NULL },
	.nbuffers = 1,
	.modes = { STARPU_RW }
};
\endcode

Schedulers which are multi-implementation aware (only <c>dmda</c> and
<c>pheft</c> for now) will use the performance models of all the
implementations they were given, and pick the one that seems to be the fastest.
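
For the scheduler to compare implementations, the codelet needs a performance
model attached to it. A minimal sketch of a history-based model shared by both
implementations above could look as follows (the symbol name
<c>"vector_scal"</c> is a hypothetical choice):

\code{.c}
static struct starpu_perfmodel perf_model =
{
	.type = STARPU_HISTORY_BASED,
	.symbol = "vector_scal" /* hypothetical symbol name */
};

struct starpu_codelet cl =
{
	.cpu_funcs = { scal_cpu_func, scal_sse_func, NULL },
	.cpu_funcs_name = { "scal_cpu_func", "scal_sse_func", NULL },
	.nbuffers = 1,
	.modes = { STARPU_RW },
	/* Timings are recorded per implementation under this model, which
	 * lets the scheduler pick the seemingly fastest variant. */
	.model = &perf_model
};
\endcode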

\section EnablingImplementationAccordingToCapabilities Enabling Implementation According To Capabilities

Some implementations may not run on some devices. For instance, some CUDA
devices do not support double floating point precision, and thus the kernel
execution would just fail; or the device may not have enough shared memory for
the implementation being used. The field starpu_codelet::can_execute
makes it possible to express this. For instance:

\code{.c}
static int can_execute(unsigned workerid, struct starpu_task *task, unsigned nimpl)
{
	const struct cudaDeviceProp *props;
	if (starpu_worker_get_type(workerid) == STARPU_CPU_WORKER)
		return 1;
	/* CUDA device */
	props = starpu_cuda_get_device_properties(workerid);
	if (props->major >= 2 || props->minor >= 3)
		/* At least compute capability 1.3, supports doubles */
		return 1;
	/* Old card, does not support doubles */
	return 0;
}

struct starpu_codelet cl =
{
	.can_execute = can_execute,
	.cpu_funcs = { cpu_func, NULL },
	.cpu_funcs_name = { "cpu_func", NULL },
	.cuda_funcs = { gpu_func, NULL },
	.nbuffers = 1,
	.modes = { STARPU_RW }
};
\endcode

This can be essential e.g. when running on a machine which mixes various models
of CUDA devices, to benefit from the newer models without crashing on old models.

Note: the function starpu_codelet::can_execute is called by the
scheduler each time it tries to match a task with a worker, and should
thus be very fast. The function starpu_cuda_get_device_properties()
provides quick access to CUDA properties of CUDA devices to achieve
such efficiency.

Another example is to compile CUDA code for various compute capabilities,
resulting in two CUDA functions, e.g. <c>scal_gpu_13</c> for compute capability
1.3, and <c>scal_gpu_20</c> for compute capability 2.0. Both functions can be
provided to StarPU by using starpu_codelet::cuda_funcs, and
starpu_codelet::can_execute can then be used to rule out the
<c>scal_gpu_20</c> variant on a CUDA device which will not be able to execute it:

\code{.c}
static int can_execute(unsigned workerid, struct starpu_task *task, unsigned nimpl)
{
	const struct cudaDeviceProp *props;
	if (starpu_worker_get_type(workerid) == STARPU_CPU_WORKER)
		return 1;
	/* CUDA device */
	if (nimpl == 0)
		/* Trying to execute the 1.3 capability variant, we assume it is ok in all cases. */
		return 1;
	/* Trying to execute the 2.0 capability variant, check that the card can do it. */
	props = starpu_cuda_get_device_properties(workerid);
	if (props->major >= 2)
		/* At least compute capability 2.0, can run it */
		return 1;
	/* Old card, does not support 2.0, will not be able to execute the 2.0 variant. */
	return 0;
}

struct starpu_codelet cl =
{
	.can_execute = can_execute,
	.cpu_funcs = { cpu_func, NULL },
	.cpu_funcs_name = { "cpu_func", NULL },
	.cuda_funcs = { scal_gpu_13, scal_gpu_20, NULL },
	.nbuffers = 1,
	.modes = { STARPU_RW }
};
\endcode

Note: the most generic variant should be provided first, as some schedulers are
not able to try the different variants.

\section InsertTaskUtility Insert Task Utility

StarPU provides the wrapper function starpu_task_insert() to ease
the creation and submission of tasks.

Here is the implementation of the codelet:

\code{.c}
void func_cpu(void *descr[], void *_args)
{
	int *x0 = (int *)STARPU_VARIABLE_GET_PTR(descr[0]);
	float *x1 = (float *)STARPU_VARIABLE_GET_PTR(descr[1]);
	int ifactor;
	float ffactor;

	starpu_codelet_unpack_args(_args, &ifactor, &ffactor);
	*x0 = *x0 * ifactor;
	*x1 = *x1 * ffactor;
}

struct starpu_codelet mycodelet =
{
	.cpu_funcs = { func_cpu, NULL },
	.cpu_funcs_name = { "func_cpu", NULL },
	.nbuffers = 2,
	.modes = { STARPU_RW, STARPU_RW }
};
\endcode

And the call to the function starpu_task_insert():

\code{.c}
starpu_task_insert(&mycodelet,
                   STARPU_VALUE, &ifactor, sizeof(ifactor),
                   STARPU_VALUE, &ffactor, sizeof(ffactor),
                   STARPU_RW, data_handles[0], STARPU_RW, data_handles[1],
                   0);
\endcode

The call to starpu_task_insert() is equivalent to the following
code:

\code{.c}
struct starpu_task *task = starpu_task_create();
task->cl = &mycodelet;
task->handles[0] = data_handles[0];
task->handles[1] = data_handles[1];

char *arg_buffer;
size_t arg_buffer_size;
starpu_codelet_pack_args(&arg_buffer, &arg_buffer_size,
                         STARPU_VALUE, &ifactor, sizeof(ifactor),
                         STARPU_VALUE, &ffactor, sizeof(ffactor),
                         0);
task->cl_arg = arg_buffer;
task->cl_arg_size = arg_buffer_size;
int ret = starpu_task_submit(task);
\endcode

Here is a similar call using ::STARPU_DATA_ARRAY.

\code{.c}
starpu_task_insert(&mycodelet,
                   STARPU_DATA_ARRAY, data_handles, 2,
                   STARPU_VALUE, &ifactor, sizeof(ifactor),
                   STARPU_VALUE, &ffactor, sizeof(ffactor),
                   0);
\endcode

If some part of the task insertion depends on the value of some computation,
the macro ::STARPU_DATA_ACQUIRE_CB can be very convenient. For
instance, assuming that the index variable <c>i</c> was registered as handle
<c>i_handle</c>:

\code{.c}
/* Compute which portion we will work on, e.g. pivot */
starpu_task_insert(&which_index, STARPU_W, i_handle, 0);

/* And submit the corresponding task */
STARPU_DATA_ACQUIRE_CB(i_handle, STARPU_R,
                       starpu_task_insert(&work, STARPU_RW, A_handle[i], 0));
\endcode

The macro ::STARPU_DATA_ACQUIRE_CB submits an asynchronous request for
acquiring data <c>i</c> for the main application, and will execute the code
given as third parameter when it is acquired. In other words, as soon as the
value of <c>i</c> computed by the codelet <c>which_index</c> can be read, the
portion of code passed as third parameter of ::STARPU_DATA_ACQUIRE_CB will
be executed, and is allowed to read from <c>i</c> to use it e.g. as an
index. Note that this macro is only available when compiling StarPU with
the compiler <c>gcc</c>.

\section ParallelTasks Parallel Tasks

StarPU can leverage existing parallel computation libraries by the means of
parallel tasks. A parallel task is a task which is worked on by a set of CPUs
(called a parallel or combined worker) at the same time, by using an existing
parallel CPU implementation of the computation to be achieved. This can also be
useful to improve the load balance between slow CPUs and fast GPUs: since CPUs
work collectively on a single task, the completion time of tasks on CPUs becomes
comparable to the completion time on GPUs, thus relieving granularity
discrepancy concerns. <c>hwloc</c> support needs to be enabled to get
good performance, otherwise StarPU will not know how to group
cores appropriately.

Two modes of execution exist to accommodate existing usages.

\subsection Fork-modeParallelTasks Fork-mode Parallel Tasks

In the Fork mode, StarPU will call the codelet function on one
of the CPUs of the combined worker. The codelet function can use
starpu_combined_worker_get_size() to get the number of threads it is
allowed to start to achieve the computation. The CPU binding mask for the whole
set of CPUs is already enforced, so that threads created by the function will
inherit the mask, and thus execute where StarPU expected, the OS being in charge
of choosing how to schedule threads on the corresponding CPUs. The application
can also choose to bind threads by hand, using e.g. sched_getaffinity to know
the CPU binding mask that StarPU chose.

For instance, using OpenMP (full source is available in
<c>examples/openmp/vector_scal.c</c>):

\snippet forkmode.c To be included. You should update doxygen if you see this text.

Other examples include for instance calling a BLAS parallel CPU implementation
(see <c>examples/mult/xgemm.c</c>).
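
The codelet declaration itself simply sets starpu_codelet::type to
::STARPU_FORKJOIN. The following sketch mirrors what the referenced example
does; names and details may differ from the actual source:

\code{.c}
void scal_cpu_func(void *buffers[], void *_args)
{
	unsigned i;
	float *factor = _args;
	struct starpu_vector_interface *vector = buffers[0];
	unsigned n = STARPU_VECTOR_GET_NX(vector);
	float *val = (float *)STARPU_VECTOR_GET_PTR(vector);

	/* The binding mask for the whole combined worker is already set;
	 * the threads created by the OpenMP runtime inherit it. */
#pragma omp parallel for num_threads(starpu_combined_worker_get_size())
	for (i = 0; i < n; i++)
		val[i] *= *factor;
}

static struct starpu_codelet cl =
{
	.modes = { STARPU_RW },
	.type = STARPU_FORKJOIN, /* one call, which forks internally */
	.max_parallelism = INT_MAX,
	.cpu_funcs = { scal_cpu_func, NULL },
	.cpu_funcs_name = { "scal_cpu_func", NULL },
	.nbuffers = 1,
};
\endcode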

\subsection SPMD-modeParallelTasks SPMD-mode Parallel Tasks

In the SPMD mode, StarPU will call the codelet function on
each CPU of the combined worker. The codelet function can use
starpu_combined_worker_get_size() to get the total number of CPUs
involved in the combined worker, and thus the number of calls that are made in
parallel to the function, and starpu_combined_worker_get_rank() to get
the rank of the current CPU within the combined worker. For instance:

\code{.c}
static void func(void *buffers[], void *_args)
{
	unsigned i;
	float *factor = _args;
	struct starpu_vector_interface *vector = buffers[0];
	unsigned n = STARPU_VECTOR_GET_NX(vector);
	float *val = (float *)STARPU_VECTOR_GET_PTR(vector);

	/* Compute slice to compute */
	unsigned m = starpu_combined_worker_get_size();
	unsigned j = starpu_combined_worker_get_rank();
	unsigned slice = (n+m-1)/m;

	for (i = j * slice; i < (j+1) * slice && i < n; i++)
		val[i] *= *factor;
}

static struct starpu_codelet cl =
{
	.modes = { STARPU_RW },
	.type = STARPU_SPMD,
	.max_parallelism = INT_MAX,
	.cpu_funcs = { func, NULL },
	.cpu_funcs_name = { "func", NULL },
	.nbuffers = 1,
};
\endcode

Of course, this trivial example will not really benefit from parallel task
execution, and was only meant to be simple to understand. The benefit comes
when the computation to be done is such that threads have to e.g. exchange
intermediate results, or write to the data in a complex but safe way in the same
buffer.

\subsection ParallelTasksPerformance Parallel Tasks Performance

To benefit from parallel tasks, a parallel-task-aware StarPU scheduler has to
be used. When exposed to codelets with a flag ::STARPU_FORKJOIN or
::STARPU_SPMD, the schedulers <c>pheft</c> (parallel-heft) and <c>peager</c>
(parallel eager) will indeed also try to execute tasks with
several CPUs. They will automatically try the various available combined
worker sizes (making several measurements for each worker size) and
will thus be able to avoid choosing a large combined worker if the codelet
does not actually scale that much.

\subsection CombinedWorkers Combined Workers

By default, StarPU creates combined workers according to the architecture
structure as detected by <c>hwloc</c>. This means that for each object of the <c>hwloc</c>
topology (NUMA node, socket, cache, ...) a combined worker will be created. If
some nodes of the hierarchy have a large arity (e.g. many cores in a socket
without a hierarchy of shared caches), StarPU will create combined workers of
intermediate sizes. The environment variable \ref STARPU_SYNTHESIZE_ARITY_COMBINED_WORKER
permits tuning the maximum arity between levels of combined workers.

The combined workers actually produced can be seen in the output of the
tool <c>starpu_machine_display</c> (the environment variable \ref STARPU_SCHED
has to be set to a combined-worker-aware scheduler such
as <c>pheft</c> or <c>peager</c>).
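For instance, running <c>STARPU_SCHED=pheft starpu_machine_display</c> lists
the combined workers synthesized on the current machine.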

\subsection ConcurrentParallelTasks Concurrent Parallel Tasks

Unfortunately, many environments and libraries do not support concurrent
calls.

For instance, most OpenMP implementations (including the main ones) do not
support concurrent <c>pragma omp parallel</c> statements without nesting them in
another <c>pragma omp parallel</c> statement, but StarPU does not yet support
creating its CPU workers by using such a pragma.

Other parallel libraries are also not safe when being invoked concurrently
from different threads, due to the use of global variables in their sequential
sections for instance.

The solution is then to use only one combined worker at a time. This can be
done by setting the field starpu_conf::single_combined_worker to <c>1</c>, or
setting the environment variable \ref STARPU_SINGLE_COMBINED_WORKER
to <c>1</c>. StarPU will then run only one parallel task at a time (but other
CPU and GPU tasks are not affected and can be run concurrently). The parallel
task scheduler will however still try varying combined worker
sizes to look for the most efficient ones.
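
As a minimal sketch, the field-based variant amounts to the usual
initialization sequence with one extra assignment:

\code{.c}
struct starpu_conf conf;
starpu_conf_init(&conf); /* fill in default values */
/* Run parallel tasks on only one combined worker at a time. */
conf.single_combined_worker = 1;
starpu_init(&conf);
\endcode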

*/