
@c -*-texinfo-*-
@c This file is part of the StarPU Handbook.
@c Copyright (C) 2009--2011 Universit@'e de Bordeaux 1
@c Copyright (C) 2010, 2011 Centre National de la Recherche Scientifique
@c Copyright (C) 2011 Institut National de Recherche en Informatique et Automatique
@c See the file starpu.texi for copying conditions.

@node Advanced Examples
@chapter Advanced Examples

@menu
* Using multiple implementations of a codelet::
* Task and Worker Profiling::
* Partitioning Data::           Partitioning Data
* Performance model example::
* Theoretical lower bound on execution time::
* Insert Task Utility::
* Debugging::                   When things go wrong.
* More examples::               More examples shipped with StarPU
@end menu
@node Using multiple implementations of a codelet
@section Using multiple implementations of a codelet

One may want to write multiple implementations of a codelet for a single type of
device and let StarPU choose which one to run. As an example, we will show how
to use SSE to scale a vector. The codelet can be written as follows:
@cartouche
@smallexample
#include <xmmintrin.h>

void scal_sse_func(void *buffers[], void *cl_arg)
@{
    float *vector = (float *) STARPU_VECTOR_GET_PTR(buffers[0]);
    unsigned int n = STARPU_VECTOR_GET_NX(buffers[0]);
    unsigned int n_iterations = n/4;
    if (n % 4 != 0)
        n_iterations++;

    __m128 *VECTOR = (__m128*) vector;
    __m128 factor __attribute__((aligned(16)));
    factor = _mm_set1_ps(*(float *) cl_arg);

    unsigned int i;
    for (i = 0; i < n_iterations; i++)
        VECTOR[i] = _mm_mul_ps(factor, VECTOR[i]);
@}
@end smallexample
@end cartouche
The @code{cpu_func} field of the @code{starpu_codelet} structure has to be set
to the special value @code{STARPU_MULTIPLE_CPU_IMPLEMENTATIONS}. Note that
@code{STARPU_MULTIPLE_CUDA_IMPLEMENTATIONS} and
@code{STARPU_MULTIPLE_OPENCL_IMPLEMENTATIONS} are also available.
@cartouche
@smallexample
starpu_codelet cl = @{
    .where = STARPU_CPU,
    .cpu_func = STARPU_MULTIPLE_CPU_IMPLEMENTATIONS,
    .cpu_funcs = @{ scal_cpu_func, scal_sse_func @},
    .nbuffers = 1
@};
@end smallexample
@end cartouche
Schedulers which are multi-implementation aware (only @code{dmda}, @code{heft}
and @code{pheft} for now) will use the performance models of all the
implementations they were given, and pick the one that seems to be the fastest.
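
For reference, the plain C implementation @code{scal_cpu_func} assumed in the
codelet above could look as follows (a minimal sketch of the scalar
counterpart, not shown in the original example):

@cartouche
@smallexample
void scal_cpu_func(void *buffers[], void *cl_arg)
@{
    float *vector = (float *) STARPU_VECTOR_GET_PTR(buffers[0]);
    unsigned int n = STARPU_VECTOR_GET_NX(buffers[0]);
    float factor = *(float *) cl_arg;

    unsigned int i;
    for (i = 0; i < n; i++)
        vector[i] *= factor;
@}
@end smallexample
@end cartouche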
@node Task and Worker Profiling
@section Task and Worker Profiling

A full example showing how to use the profiling API is available in
the StarPU sources in the directory @code{examples/profiling/}.
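
Profiling has to be enabled before tasks are submitted; a minimal sketch:

@cartouche
@smallexample
/* Enable profiling of all subsequently-created tasks and of workers,
 * typically right after starpu_init() */
starpu_profiling_status_set(STARPU_PROFILING_ENABLE);
@end smallexample
@end cartouche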
@cartouche
@smallexample
struct starpu_task *task = starpu_task_create();
task->cl = &cl;
task->synchronous = 1;
/* We will destroy the task structure by hand so that we can
 * query the profiling info before the task is destroyed. */
task->destroy = 0;

/* Submit and wait for completion (since synchronous was set to 1) */
starpu_task_submit(task);

/* The task is finished, get profiling information */
struct starpu_task_profiling_info *info = task->profiling_info;

/* How much time did it take before the task started? */
double delay = starpu_timing_timespec_delay_us(&info->submit_time, &info->start_time);

/* How long was the task execution? */
double length = starpu_timing_timespec_delay_us(&info->start_time, &info->end_time);

/* We don't need the task structure anymore */
starpu_task_destroy(task);
@end smallexample
@end cartouche
@cartouche
@smallexample
/* Display the occupancy of all workers during the test */
int worker;
for (worker = 0; worker < starpu_worker_get_count(); worker++)
@{
    struct starpu_worker_profiling_info worker_info;
    int ret = starpu_worker_get_profiling_info(worker, &worker_info);
    STARPU_ASSERT(!ret);

    double total_time = starpu_timing_timespec_to_us(&worker_info.total_time);
    double executing_time = starpu_timing_timespec_to_us(&worker_info.executing_time);
    double sleeping_time = starpu_timing_timespec_to_us(&worker_info.sleeping_time);

    float executing_ratio = 100.0*executing_time/total_time;
    float sleeping_ratio = 100.0*sleeping_time/total_time;

    char workername[128];
    starpu_worker_get_name(worker, workername, 128);
    fprintf(stderr, "Worker %s:\n", workername);
    fprintf(stderr, "\ttotal time : %.2lf ms\n", total_time*1e-3);
    fprintf(stderr, "\texec time : %.2lf ms (%.2f %%)\n", executing_time*1e-3,
            executing_ratio);
    fprintf(stderr, "\tblocked time : %.2lf ms (%.2f %%)\n", sleeping_time*1e-3,
            sleeping_ratio);
@}
@end smallexample
@end cartouche
@node Partitioning Data
@section Partitioning Data

An existing piece of data can be partitioned in sub parts to be used by different tasks, for instance:

@cartouche
@smallexample
int vector[NX];
starpu_data_handle handle;

/* Declare data to StarPU */
starpu_vector_data_register(&handle, 0, (uintptr_t)vector, NX, sizeof(vector[0]));

/* Partition the vector in PARTS sub-vectors */
starpu_filter f =
@{
    .filter_func = starpu_block_filter_func_vector,
    .nchildren = PARTS
@};
starpu_data_partition(handle, &f);
@end smallexample
@end cartouche
@cartouche
@smallexample
/* Submit a task on each sub-vector */
for (i=0; i<starpu_data_get_nb_children(handle); i++) @{
    /* Get subdata number i (there is only 1 dimension) */
    starpu_data_handle sub_handle = starpu_data_get_sub_data(handle, 1, i);
    struct starpu_task *task = starpu_task_create();

    task->buffers[0].handle = sub_handle;
    task->buffers[0].mode = STARPU_RW;
    task->cl = &cl;
    task->synchronous = 1;
    task->cl_arg = &factor;
    task->cl_arg_size = sizeof(factor);

    starpu_task_submit(task);
@}
@end smallexample
@end cartouche
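Once the tasks working on the sub-vectors have completed, the pieces can be
gathered back before the whole vector is accessed again; a minimal sketch:

@cartouche
@smallexample
/* Gather the sub-vectors back (into main memory, node 0)
 * and remove the partitioning */
starpu_data_unpartition(handle, 0);
starpu_data_unregister(handle);
@end smallexample
@end cartouche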
Partitioning can be applied several times, see
@code{examples/basic_examples/mult.c} and @code{examples/filters/}.
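For instance, a matrix can be partitioned along both dimensions at once; a
sketch, assuming a handle registered with @code{starpu_matrix_data_register}
and the matrix block filters of this StarPU series:

@cartouche
@smallexample
/* Cut the matrix in VERT vertical slices, each cut in HORIZ horizontal ones */
starpu_filter vert = @{
    .filter_func = starpu_vertical_block_filter_func,
    .nchildren = VERT
@};
starpu_filter horiz = @{
    .filter_func = starpu_block_filter_func,
    .nchildren = HORIZ
@};
starpu_data_map_filters(handle, 2, &vert, &horiz);

/* Block (i,j) is then obtained with */
starpu_data_handle block = starpu_data_get_sub_data(handle, 2, i, j);
@end smallexample
@end cartouche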
@node Performance model example
@section Performance model example

To achieve good scheduling, StarPU scheduling policies need to be able to
estimate in advance the duration of a task. This is done by giving to codelets
a performance model, by defining a @code{starpu_perfmodel} structure and
providing its address in the @code{model} field of the @code{starpu_codelet}
structure. The @code{symbol} and @code{type} fields of @code{starpu_perfmodel}
are mandatory: they give the model a name and specify its type, since
there are several kinds of performance models.
@itemize
@item
Measured at runtime (@code{STARPU_HISTORY_BASED} model type). This assumes that for a
given set of data input/output sizes, the performance will always be about the
same. This is very true for regular kernels on GPUs for instance (<0.1% error),
and just a bit less true on CPUs (~=1% error). This also assumes that there are
few different sets of data input/output sizes. StarPU will then keep record of
the average time of previous executions on the various processing units, and use
it as an estimation. History is done per task size, by using a hash of the input
and output sizes as an index.
StarPU will also save the history in @code{~/.starpu/sampling/codelets}
for further executions; it can be observed by using the
@code{starpu_perfmodel_display} command, or drawn by using the
@code{starpu_perfmodel_plot} command. The models are indexed by machine name. To
share the models between machines (e.g. for a homogeneous cluster), use
@code{export STARPU_HOSTNAME=some_global_name}. Measurements are only done
when using a task scheduler which makes use of them, such as @code{heft} or
@code{dmda}.

If e.g. the code is recompiled with other compilation options, or several
variants of the code are used, the symbol string should be changed to reflect
that, in order to recalibrate a new model from zero. The symbol string can even
be constructed dynamically at execution time, as long as this is done before
submitting any task using it. The following is a small code example.
@cartouche
@smallexample
static struct starpu_perfmodel mult_perf_model = @{
    .type = STARPU_HISTORY_BASED,
    .symbol = "mult_perf_model"
@};

starpu_codelet cl = @{
    .where = STARPU_CPU,
    .cpu_func = cpu_mult,
    .nbuffers = 3,
    /* for the scheduling policy to be able to use performance models */
    .model = &mult_perf_model
@};
@end smallexample
@end cartouche
@item
Measured at runtime and refined by regression (@code{STARPU_REGRESSION_*_BASED}
model type). This still assumes performance regularity, but can work
with various data input sizes, by applying regression over observed
execution times. @code{STARPU_REGRESSION_BASED} uses an a*n^b regression
form, while @code{STARPU_NL_REGRESSION_BASED} uses an a*n^b+c form (more precise than
@code{STARPU_REGRESSION_BASED}, but costs a lot more to compute). For instance,
@code{tests/perfmodels/regression_based.c} uses a regression-based performance
model for the @code{memset} operation; a declaration sketch is given after
this list. Of course, the application has to issue
tasks with varying size so that the regression can be computed. StarPU will not
trust the regression unless there is at least a 10% difference between the minimum
and maximum observed input size. Since computing the non-linear regression
is quite expensive, it is only done at termination of the application. This
means that the first execution uses a history-based performance model to perform
scheduling.
@item
Provided as an estimation from the application itself (@code{STARPU_COMMON} model type and @code{cost_model} field),
see for instance
@code{examples/common/blas_model.h} and @code{examples/common/blas_model.c}.
@item
Provided explicitly by the application (@code{STARPU_PER_ARCH} model type): the
@code{.per_arch[i].cost_model} fields have to be filled with pointers to
functions which return the expected duration of the task in micro-seconds, one
per architecture; a sketch is also given after this list.
@end itemize
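
Declaring a regression-based model only differs in the @code{type} field; a
minimal sketch:

@cartouche
@smallexample
static struct starpu_perfmodel memset_model = @{
    .type = STARPU_NL_REGRESSION_BASED,
    .symbol = "memset"
@};
@end smallexample
@end cartouche

For an explicit @code{STARPU_PER_ARCH} model, a minimal sketch follows; the
cost function is hypothetical, and the @code{per_arch} indexing is assumed to
match this StarPU series:

@cartouche
@smallexample
/* Hypothetical cost function, returning micro-seconds */
static double cpu_mult_cost(struct starpu_buffer_descr *descr)
@{
    uint32_t n = starpu_matrix_get_nx(descr[0].handle);
    return n * 0.1;  /* assumed per-element cost on a CPU core */
@}

static struct starpu_perfmodel mult_perarch_model = @{
    .type = STARPU_PER_ARCH
@};

/* To be done at runtime, before any task uses the model */
mult_perarch_model.per_arch[STARPU_CPU_DEFAULT].cost_model = cpu_mult_cost;
@end smallexample
@end cartouche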
How to use schedulers which can benefit from such performance models is explained
in @ref{Task scheduling policy}.

The same can be done for task power consumption estimation, by setting the
@code{power_model} field in the same way as the @code{model} field. Note: for
now, the application has to give the power consumption performance model
a name which is different from the execution time performance model.
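
A minimal sketch, reusing the codelet above (the @code{mult_power_model}
symbol is arbitrary, it only has to differ from the time model's):

@cartouche
@smallexample
static struct starpu_perfmodel mult_power_model = @{
    .type = STARPU_HISTORY_BASED,
    .symbol = "mult_power_model"  /* must differ from "mult_perf_model" */
@};

starpu_codelet cl = @{
    .where = STARPU_CPU,
    .cpu_func = cpu_mult,
    .nbuffers = 3,
    .model = &mult_perf_model,
    .power_model = &mult_power_model
@};
@end smallexample
@end cartouche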
The application can request time estimations from the StarPU performance
models by filling a task structure as usual without actually submitting
it. The data handles can be created by calling @code{starpu_data_register}
functions with a @code{NULL} pointer (and need to be unregistered as usual)
and the desired data sizes. The @code{starpu_task_expected_length} and
@code{starpu_task_expected_power} functions can then be called to get an
estimation of the task duration on a given arch. @code{starpu_task_destroy}
needs to be called to destroy the dummy task afterwards. See
@code{tests/perfmodels/regression_based.c} for an example.
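
A minimal sketch of such a dummy query; the exact prototype of
@code{starpu_task_expected_length} may vary between StarPU versions, so this
is only an outline:

@cartouche
@smallexample
starpu_data_handle handle;
/* Register a NULL vector of NX floats, just to carry the size */
starpu_vector_data_register(&handle, 0, (uintptr_t)NULL, NX, sizeof(float));

struct starpu_task *task = starpu_task_create();
task->cl = &cl;
task->buffers[0].handle = handle;
task->buffers[0].mode = STARPU_RW;

/* Query the model instead of submitting the task */
double expected_us = starpu_task_expected_length(task, STARPU_CPU_DEFAULT);

starpu_task_destroy(task);
starpu_data_unregister(handle);
@end smallexample
@end cartouche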
@node Theoretical lower bound on execution time
@section Theoretical lower bound on execution time

For kernels with history-based performance models, StarPU can very easily provide a theoretical lower
bound for the execution time of a whole set of tasks. See for
instance @code{examples/lu/lu_example.c}: before submitting tasks,
call @code{starpu_bound_start}, and after complete execution, call
@code{starpu_bound_stop}. @code{starpu_bound_print_lp} or
@code{starpu_bound_print_mps} can then be used to output a Linear Programming
problem corresponding to the schedule of your tasks. Run it through
@code{lp_solve} or any other linear programming solver, and that will give you a
lower bound for the total execution time of your tasks. If StarPU was compiled
with the glpk library installed, @code{starpu_bound_compute} can be used to
solve it immediately and get the optimized minimum, in ms. Its @code{integer}
parameter allows deciding whether integer resolution should be computed
and returned too.
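
A minimal usage sketch, assuming the @code{starpu_bound_start(deps, prio)}
prototype of this StarPU series:

@cartouche
@smallexample
starpu_bound_start(0, 0);
/* ... create and submit all the tasks ... */
starpu_task_wait_for_all();
starpu_bound_stop();

/* Output the corresponding Linear Programming problem */
FILE *f = fopen("bound.lp", "w");
starpu_bound_print_lp(f);
fclose(f);
@end smallexample
@end cartouche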
The @code{deps} parameter tells StarPU whether to take tasks and implicit data
dependencies into account. It must be understood that the linear programming
problem size is quadratic in the number of tasks and thus the time to solve it
will be very long; it can take minutes for just a few dozen tasks. You should
probably use @code{lp_solve -timeout 1 test.pl -wmps test.mps} to convert the
problem to MPS format and then use a better solver; @code{glpsol} might be
better than @code{lp_solve} for instance (the @code{--pcost} option may be
useful), but sometimes doesn't manage to converge. @code{cbc} might look
slower, but it is parallel. Be sure to try at least all the @code{-B} options
of @code{lp_solve}. For instance, we often just use
@code{lp_solve -cc -B1 -Bb -Bg -Bp -Bf -Br -BG -Bd -Bs -BB -Bo -Bc -Bi}, and
the @code{-gr} option can also be quite useful.
Setting @code{deps} to 0 will only take into account the actual computations
on processing units. It however still properly takes into account the varying
performances of kernels and processing units, which is much more accurate than
just comparing StarPU performances with the fastest of the kernels being used.

The @code{prio} parameter tells StarPU whether to simulate taking into account
the priorities as the StarPU scheduler would, i.e. schedule prioritized
tasks before less prioritized tasks, to check to what extent this results
in a less optimal solution. This further increases the computation time.

Note that for simplicity, all this however doesn't take into account data
transfers, which are assumed to be completely overlapped.
@node Insert Task Utility
@section Insert Task Utility

StarPU provides the wrapper function @code{starpu_insert_task} to ease
the creation and submission of tasks.

@deftypefun int starpu_insert_task (starpu_codelet *@var{cl}, ...)
Create and submit a task corresponding to @var{cl} with the following
arguments. The argument list must be zero-terminated.

The arguments following the codelet can be of the following types:
@itemize
@item
@code{STARPU_R}, @code{STARPU_W}, @code{STARPU_RW}, @code{STARPU_SCRATCH}, @code{STARPU_REDUX}: an access mode followed by a data handle;
@item
@code{STARPU_VALUE} followed by a pointer to a constant value and
the size of the constant;
@item
@code{STARPU_CALLBACK} followed by a pointer to a callback function;
@item
@code{STARPU_CALLBACK_ARG} followed by a pointer to be given as an
argument to the callback function;
@item
@code{STARPU_CALLBACK_WITH_ARG} followed by two pointers: one to a callback
function, and the other to be given as an argument to the callback
function; this is equivalent to using both @code{STARPU_CALLBACK} and
@code{STARPU_CALLBACK_ARG};
@item
@code{STARPU_PRIORITY} followed by an integer defining a priority level.
@end itemize

Parameters to be passed to the codelet implementation are defined
through the type @code{STARPU_VALUE}. The function
@code{starpu_unpack_cl_args} must be called within the codelet
implementation to retrieve them.
@end deftypefun
Here is the implementation of the codelet:

@smallexample
void func_cpu(void *descr[], void *_args)
@{
    int *x0 = (int *)STARPU_VARIABLE_GET_PTR(descr[0]);
    float *x1 = (float *)STARPU_VARIABLE_GET_PTR(descr[1]);
    int ifactor;
    float ffactor;

    starpu_unpack_cl_args(_args, &ifactor, &ffactor);
    *x0 = *x0 * ifactor;
    *x1 = *x1 * ffactor;
@}

starpu_codelet mycodelet = @{
    .where = STARPU_CPU,
    .cpu_func = func_cpu,
    .nbuffers = 2
@};
@end smallexample
And the call to the @code{starpu_insert_task} wrapper:

@smallexample
starpu_insert_task(&mycodelet,
                   STARPU_VALUE, &ifactor, sizeof(ifactor),
                   STARPU_VALUE, &ffactor, sizeof(ffactor),
                   STARPU_RW, data_handles[0], STARPU_RW, data_handles[1],
                   0);
@end smallexample
The call to @code{starpu_insert_task} is equivalent to the following
code:

@smallexample
struct starpu_task *task = starpu_task_create();
task->cl = &mycodelet;
task->buffers[0].handle = data_handles[0];
task->buffers[0].mode = STARPU_RW;
task->buffers[1].handle = data_handles[1];
task->buffers[1].mode = STARPU_RW;

char *arg_buffer;
size_t arg_buffer_size;
starpu_pack_cl_args(&arg_buffer, &arg_buffer_size,
                    STARPU_VALUE, &ifactor, sizeof(ifactor),
                    STARPU_VALUE, &ffactor, sizeof(ffactor),
                    0);
task->cl_arg = arg_buffer;
task->cl_arg_size = arg_buffer_size;

int ret = starpu_task_submit(task);
@end smallexample
If some part of the task insertion depends on the value of some computation,
the @code{STARPU_DATA_ACQUIRE_CB} macro can be very convenient. For
instance, assuming that the index variable @code{i} was registered as handle
@code{i_handle}:

@smallexample
/* Compute which portion we will work on, e.g. pivot */
starpu_insert_task(&which_index, STARPU_W, i_handle, 0);

/* And submit the corresponding task */
STARPU_DATA_ACQUIRE_CB(i_handle, STARPU_R,
                       starpu_insert_task(&work, STARPU_RW, A_handle[i], 0));
@end smallexample
The @code{STARPU_DATA_ACQUIRE_CB} macro submits an asynchronous request for
acquiring data @code{i} for the main application, and will execute the code
given as third parameter when it is acquired. In other words, as soon as the
value of @code{i} computed by the @code{which_index} codelet can be read, the
portion of code passed as third parameter of @code{STARPU_DATA_ACQUIRE_CB} will
be executed, and is allowed to read from @code{i} to use it e.g. as an
index. Note that this macro is only available when compiling StarPU with
the compiler @code{gcc}.
@node Debugging
@section Debugging

StarPU provides several tools to help debugging applications. Execution traces
can be generated and displayed graphically, see @ref{Generating traces}. Some
gdb helpers are also provided to show the whole StarPU state:

@smallexample
(gdb) source tools/gdbinit
(gdb) help starpu
@end smallexample
@node More examples
@section More examples

More examples are available in the StarPU sources in the @code{examples/}
directory. Simple examples include:

@table @asis
@item @code{incrementer/}:
Trivial incrementation test.
@item @code{basic_examples/}:
Simple documented Hello world (as shown in @ref{Hello World}), vector/scalar product (as shown
in @ref{Vector Scaling on an Hybrid CPU/GPU Machine}), matrix
product examples (as shown in @ref{Performance model example}), an example using the blocked matrix data
interface, an example using the variable data interface, and an example
using different formats on CPUs and GPUs.
@item @code{matvecmult/}:
OpenCL example from NVidia, adapted to StarPU.
@item @code{axpy/}:
AXPY CUBLAS operation adapted to StarPU.
@item @code{fortran/}:
Example of Fortran bindings.
@end table
More advanced examples include:

@table @asis
@item @code{filters/}:
Examples using filters, as shown in @ref{Partitioning Data}.
@item @code{lu/}:
LU matrix factorization, see for instance @code{xlu_implicit.c}.
@item @code{cholesky/}:
Cholesky matrix factorization, see for instance @code{cholesky_implicit.c}.
@end table