/*
 * This file is part of the StarPU Handbook.
 * Copyright (C) 2009--2011 Université de Bordeaux
 * Copyright (C) 2010, 2011, 2012, 2013, 2014 Centre National de la Recherche Scientifique
 * Copyright (C) 2011, 2012 Institut National de Recherche en Informatique et Automatique
 * See the file version.doxy for copying conditions.
 */

/*! \page DataManagement Data Management

This chapter presents how StarPU manages data, and notably how data coherency is maintained across the machine.

\section DataManagement Data Management

When the application allocates data, whenever possible it should use
the function starpu_malloc(), which will ask CUDA or OpenCL to make
the allocation itself and pin the corresponding allocated memory. This
is needed to permit asynchronous data transfers, i.e. to let data
transfers overlap with computations. Otherwise, the trace will show
that the <c>DriverCopyAsync</c> state takes a lot of time, because
CUDA or OpenCL then reverts to synchronous transfers.
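
For instance, a pinned input buffer could be allocated and registered as follows (a minimal sketch; <c>NX</c> and the surrounding task submission are assumptions of this example):

\code{.c}
float *vector;
starpu_data_handle_t vector_handle;

/* Ask StarPU (and thus CUDA/OpenCL) to allocate and pin the buffer,
 * so that later transfers can be asynchronous */
starpu_malloc((void **)&vector, NX * sizeof(vector[0]));

starpu_vector_data_register(&vector_handle, STARPU_MAIN_RAM, (uintptr_t)vector,
                            NX, sizeof(vector[0]));

/* ... submit tasks working on vector_handle ... */

starpu_data_unregister(vector_handle);
starpu_free(vector);
\endcode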

By default, StarPU leaves replicates of data wherever they were used, in case they
will be re-used by other tasks, thus saving the data transfer time. When some
task modifies some data, all the other replicates are invalidated, and only the
processing unit which ran that task will have a valid replicate of the data. If the application knows
that this data will not be re-used by further tasks, it should advise StarPU to
immediately replicate it to a desired list of memory nodes (given through a
bitmask). This can be understood like the write-through mode of CPU caches.

\code{.c}
starpu_data_set_wt_mask(img_handle, 1<<0);
\endcode

will for instance request to always automatically transfer a replicate into the
main memory (node <c>0</c>), as bit <c>0</c> of the write-through bitmask is set.

\code{.c}
starpu_data_set_wt_mask(img_handle, ~0U);
\endcode

will request to always automatically broadcast the updated data to all memory
nodes.

Setting the write-through mask to <c>~0U</c> can also be useful to make sure all
memory nodes always have a copy of the data, so that it is never evicted when
memory gets scarce.

Implicit data dependency computation can become expensive if a lot
of tasks access the same piece of data. If no dependency is required
on some piece of data (e.g. because it is only accessed in read-only
mode, or because write accesses are actually commutative), use the
function starpu_data_set_sequential_consistency_flag() to disable
implicit dependencies on that data.
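
For instance (a minimal sketch, assuming <c>data_handle</c> has already been registered):

\code{.c}
/* No implicit dependencies will be introduced for tasks accessing this handle;
 * any required ordering then has to be expressed explicitly by the application. */
starpu_data_set_sequential_consistency_flag(data_handle, 0);
\endcode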

In the same vein, accumulation of results in the same data can become a
bottleneck. Using the mode ::STARPU_REDUX makes it possible to optimize such
accumulation (see \ref DataReduction). To a lesser extent, using the
flag ::STARPU_COMMUTE does not remove the bottleneck, but at least allows
the accumulation to happen in any order, as illustrated below.
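
As an illustration, commutative accumulations could be submitted as follows (a sketch; the codelet <c>accumulate_cl</c> and the handles are hypothetical):

\code{.c}
for (i = 0; i < ncontribs; i++)
    /* The write accesses to sum_handle commute, so StarPU may run these
     * accumulations in any order instead of serializing them in submission order. */
    starpu_task_insert(&accumulate_cl,
                       STARPU_R, contrib_handle[i],
                       STARPU_RW|STARPU_COMMUTE, sum_handle,
                       0);
\endcode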

Applications often need a piece of data just for temporary results. In such a case,
registration can be made without an initial value; for instance this produces a vector data:

\code{.c}
starpu_vector_data_register(&handle, -1, 0, n, sizeof(float));
\endcode

StarPU will then allocate the actual buffer only when it is actually needed,
e.g. directly on the GPU without allocating in main memory.

In the same vein, once the temporary results are not useful any more, the
data should be thrown away. If the handle is not to be reused, it can be
unregistered:

\code{.c}
starpu_data_unregister_submit(handle);
\endcode

actual unregistration will be done after all tasks working on the handle
terminate.

If the handle is to be reused, instead of unregistering it, it can simply be invalidated:

\code{.c}
starpu_data_invalidate_submit(handle);
\endcode

the buffers containing the current value will then be freed, and reallocated
only when another task writes some value to the handle.

\section DataPrefetch Data Prefetch

The scheduling policies <c>heft</c>, <c>dmda</c> and <c>pheft</c>
perform data prefetch (see \ref STARPU_PREFETCH):
as soon as a scheduling decision is taken for a task, requests are issued to
transfer its required data to the target processing unit, if needed, so that
when the processing unit actually starts the task, its data will hopefully be
already available and it will not have to wait for the transfer to finish.

The application may want to perform some manual prefetching, for several reasons
such as excluding initial data transfers from performance measurements, or
setting up an initial statically-computed data distribution on the machine
before submitting tasks, which will thus guide StarPU toward an initial task
distribution (since StarPU will try to avoid further transfers).

This can be achieved by giving the function starpu_data_prefetch_on_node()
the handle and the desired target memory node.
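
For instance, prefetching a handle to the memory node of a given worker could be written as follows (a minimal sketch, assuming at least one CUDA worker is available):

\code{.c}
/* Memory node attached to the first CUDA worker */
int workerid = starpu_worker_get_by_type(STARPU_CUDA_WORKER, 0);
unsigned node = starpu_worker_get_memory_node(workerid);

/* Asynchronously request a replicate of the data on that node */
starpu_data_prefetch_on_node(handle, node, 1);
\endcode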

\section PartitioningData Partitioning Data

An existing piece of data can be partitioned in sub parts to be used by different tasks, for instance:

\code{.c}
int vector[NX];
starpu_data_handle_t handle;

/* Declare data to StarPU */
starpu_vector_data_register(&handle, STARPU_MAIN_RAM, (uintptr_t)vector,
                            NX, sizeof(vector[0]));

/* Partition the vector in PARTS sub-vectors */
struct starpu_data_filter f =
{
    .filter_func = starpu_vector_filter_block,
    .nchildren = PARTS
};
starpu_data_partition(handle, &f);
\endcode

The task submission then uses the function starpu_data_get_sub_data()
to retrieve the sub-handles to be passed as task parameters.

\code{.c}
/* Submit a task on each sub-vector */
for (i=0; i<starpu_data_get_nb_children(handle); i++) {
    /* Get subdata number i (there is only 1 dimension) */
    starpu_data_handle_t sub_handle = starpu_data_get_sub_data(handle, 1, i);
    struct starpu_task *task = starpu_task_create();

    task->handles[0] = sub_handle;
    task->cl = &cl;
    task->synchronous = 1;
    task->cl_arg = &factor;
    task->cl_arg_size = sizeof(factor);

    starpu_task_submit(task);
}
\endcode

Partitioning can be applied several times, see
<c>examples/basic_examples/mult.c</c> and <c>examples/filters/</c>.
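
For instance, a matrix could be partitioned along both dimensions in one go, in the spirit of <c>mult.c</c> (a minimal sketch, assuming a registered matrix handle <c>matrix_handle</c>):

\code{.c}
/* Vertical and horizontal block filters */
struct starpu_data_filter vert =
{
    .filter_func = starpu_matrix_filter_vertical_block,
    .nchildren = PARTS
};
struct starpu_data_filter horiz =
{
    .filter_func = starpu_matrix_filter_block,
    .nchildren = PARTS
};

/* Apply both filters recursively on the matrix */
starpu_data_map_filters(matrix_handle, 2, &vert, &horiz);

/* Sub-handles are then addressed with one index per filter level */
starpu_data_handle_t sub_handle = starpu_data_get_sub_data(matrix_handle, 2, i, j);
\endcode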

Wherever the whole piece of data is already available, the partitioning will
be done in-place, i.e. without allocating new buffers but just using pointers
inside the existing copy. This is particularly important to be aware of when
using OpenCL, where the kernel parameters are not pointers, but handles. The
kernel thus needs to be also passed the offset within the OpenCL buffer:

\code{.c}
void opencl_func(void *buffers[], void *cl_arg)
{
    cl_mem vector = (cl_mem) STARPU_VECTOR_GET_DEV_HANDLE(buffers[0]);
    unsigned offset = STARPU_VECTOR_GET_OFFSET(buffers[0]);

    ...
    clSetKernelArg(kernel, 0, sizeof(vector), &vector);
    clSetKernelArg(kernel, 1, sizeof(offset), &offset);
    ...
}
\endcode

And the kernel has to shift from the pointer passed by the OpenCL driver:

\code{.c}
__kernel void opencl_kernel(__global int *vector, unsigned offset)
{
    /* Shift the base pointer by the offset reported by StarPU (expressed in bytes) */
    vector = (__global int *)((__global char *)vector + offset);
    ...
}
\endcode

StarPU provides various interfaces and filters for matrices, vectors, etc.,
but applications can also write their own data interfaces and filters, see
<c>examples/interface</c> and <c>examples/filters/custom_mf</c> for an example.

\section DataReduction Data Reduction

In various cases, some piece of data is used to accumulate intermediate
results. For instance, the dot product of a vector, maximum/minimum finding,
the histogram of a photograph, etc. When these results are produced all over the
machine, it would not be efficient to accumulate them in only one place,
as that would incur a data transfer for each contribution, as well as contention
on concurrent accesses.

StarPU provides a mode ::STARPU_REDUX, which can be used to optimize
that case: it will allocate a buffer on each memory node, and accumulate
intermediate results there. When the data is eventually accessed in the normal
mode ::STARPU_R, StarPU will collect the intermediate results in just one
buffer.

For this to work, the user has to use the function
starpu_data_set_reduction_methods() to declare how to initialize these
buffers, and how to assemble partial results.

For instance, <c>cg</c> uses that to optimize its dot product: it first defines
the codelets for initialization and reduction:

\code{.c}
struct starpu_codelet bzero_variable_cl =
{
    .cpu_funcs = { bzero_variable_cpu },
    .cpu_funcs_name = { "bzero_variable_cpu" },
    .cuda_funcs = { bzero_variable_cuda },
    .nbuffers = 1,
};

static void accumulate_variable_cpu(void *descr[], void *cl_arg)
{
    double *v_dst = (double *)STARPU_VARIABLE_GET_PTR(descr[0]);
    double *v_src = (double *)STARPU_VARIABLE_GET_PTR(descr[1]);
    *v_dst = *v_dst + *v_src;
}

static void accumulate_variable_cuda(void *descr[], void *cl_arg)
{
    double *v_dst = (double *)STARPU_VARIABLE_GET_PTR(descr[0]);
    double *v_src = (double *)STARPU_VARIABLE_GET_PTR(descr[1]);

    /* cublasaxpy is the cg example's precision-dependent alias for cublasDaxpy */
    cublasaxpy(1, (double)1.0, v_src, 1, v_dst, 1);
    cudaStreamSynchronize(starpu_cuda_get_local_stream());
}

struct starpu_codelet accumulate_variable_cl =
{
    .cpu_funcs = { accumulate_variable_cpu },
    .cpu_funcs_name = { "accumulate_variable_cpu" },
    .cuda_funcs = { accumulate_variable_cuda },
    .nbuffers = 2,
};
\endcode

and attaches them as reduction methods for its handle <c>dtq</c>:

\code{.c}
starpu_variable_data_register(&dtq_handle, -1, NULL, sizeof(type));
starpu_data_set_reduction_methods(dtq_handle,
                                  &accumulate_variable_cl, &bzero_variable_cl);
\endcode

and <c>dtq_handle</c> can now be used in mode ::STARPU_REDUX for the
dot products with partitioned vectors:

\code{.c}
for (b = 0; b < nblocks; b++)
    starpu_task_insert(&dot_kernel_cl,
                       STARPU_REDUX, dtq_handle,
                       STARPU_R, starpu_data_get_sub_data(v1, 1, b),
                       STARPU_R, starpu_data_get_sub_data(v2, 1, b),
                       0);
\endcode

During registration, we have here provided <c>NULL</c>, i.e. there is
no initial value to be taken into account during reduction. StarPU
will thus only take into account the contributions from the tasks
<c>dot_kernel_cl</c>. Also, it will not allocate any memory for
<c>dtq_handle</c> before tasks <c>dot_kernel_cl</c> are ready to run.

If another dot product has to be performed, one could unregister
<c>dtq_handle</c>, and re-register it. But one can also call
starpu_data_invalidate_submit() with the parameter <c>dtq_handle</c>,
which will clear all data from the handle, thus resetting it back to
the initial status <c>register(NULL)</c>.

The example <c>cg</c> also uses reduction for the blocked gemv kernel,
leading to yet more relaxed dependencies and more parallelism.

::STARPU_REDUX can also be passed to starpu_mpi_task_insert() in the MPI
case. That will however not produce any MPI communication, but just pass
::STARPU_REDUX to the underlying starpu_task_insert(). It is up to the
application to call starpu_mpi_redux_data(), which posts tasks that will
reduce the partial results among MPI nodes into the MPI node which owns the
data. For instance, some hypothetical application which collects partial results
into data <c>res</c>, then uses it for other computation, before looping again
with a new reduction:

\code{.c}
for (i = 0; i < 100; i++) {
    starpu_mpi_task_insert(MPI_COMM_WORLD, &init_res, STARPU_W, res, 0);
    starpu_mpi_task_insert(MPI_COMM_WORLD, &work, STARPU_RW, A,
                           STARPU_R, B, STARPU_REDUX, res, 0);
    starpu_mpi_redux_data(MPI_COMM_WORLD, res);
    starpu_mpi_task_insert(MPI_COMM_WORLD, &work2, STARPU_RW, B, STARPU_R, res, 0);
}
\endcode

\section TemporaryBuffers Temporary Buffers

There are two kinds of temporary buffers: temporary data which just pass results
from one task to another, and scratch data which are needed only internally by
tasks.

\subsection TemporaryData Temporary Data

Data can sometimes be entirely produced by a task, and entirely consumed by
another task, without the need for other parts of the application to access
it. In such a case, registration can be done without prior allocation, by using
the special memory node number <c>-1</c>, and passing a zero pointer. StarPU will
actually allocate memory only when the task creating the content gets scheduled,
and destroy it on unregistration.

In addition to that, it can be tedious for the application to have to unregister
the data, since it will not use its content anyway. The unregistration can be
done lazily by using the function starpu_data_unregister_submit(),
which will record that no more tasks accessing the handle will be submitted, so
that it can be freed as soon as the last task accessing it is over.

The following code exemplifies both points: it registers the temporary
data, submits three tasks accessing it, and records the data for automatic
unregistration.

\code{.c}
starpu_vector_data_register(&handle, -1, 0, n, sizeof(float));
starpu_task_insert(&produce_data, STARPU_W, handle, 0);
starpu_task_insert(&compute_data, STARPU_RW, handle, 0);
starpu_task_insert(&summarize_data, STARPU_R, handle, STARPU_W, result_handle, 0);
starpu_data_unregister_submit(handle);
\endcode

The application may also want to see the temporary data initialized
on the fly before being used by the task. This can be done by using
starpu_data_set_reduction_methods() to set an initialization codelet (no redux
codelet is needed).
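
For instance (a minimal sketch; the initialization codelet <c>init_cl</c> is hypothetical, and passing <c>NULL</c> as the reduction codelet is an assumption of this example, since the handle is never accessed in ::STARPU_REDUX mode):

\code{.c}
/* Register the temporary data without an initial value... */
starpu_vector_data_register(&handle, -1, 0, n, sizeof(float));

/* ...and provide only an initialization codelet: StarPU will run it to
 * initialize the buffer before the first task accesses it. */
starpu_data_set_reduction_methods(handle, NULL, &init_cl);
\endcode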

\subsection ScratchData Scratch Data

Some kernels need temporary data to carry out their computations, i.e. a
workspace. The application could allocate it at the start of the codelet
function, and free it at the end, but that would be costly. It could also
allocate one buffer per worker (similarly to \ref
HowToInitializeAComputationLibraryOnceForEachWorker), but that would
make them systematic and permanent. A more optimized way is to use
the data access mode ::STARPU_SCRATCH, as exemplified below, which
provides per-worker buffers without content consistency.

\code{.c}
starpu_vector_data_register(&workspace, -1, 0, n, sizeof(float));
for (i = 0; i < N; i++)
    starpu_task_insert(&compute, STARPU_R, input[i],
                       STARPU_SCRATCH, workspace, STARPU_W, output[i], 0);
\endcode

StarPU will make sure that the buffer is allocated before executing the task,
and make this allocation per-worker: for CPU workers, notably, each worker has
its own buffer. This means that each task submitted above will actually have its
own workspace, which will in practice be the same for all tasks running one after
the other on the same worker. Also, if for instance GPU memory becomes scarce,
StarPU will notice that it can free such buffers easily, since the content does
not matter.
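
Inside the kernel, the scratch buffer is then accessed like any other buffer (a minimal sketch of a CPU implementation of the hypothetical <c>compute</c> codelet used above):

\code{.c}
void compute_cpu_func(void *buffers[], void *cl_arg)
{
    /* buffers[] follows the order of the starpu_task_insert() call:
     * 0: input (STARPU_R), 1: workspace (STARPU_SCRATCH), 2: output (STARPU_W) */
    float *input     = (float *)STARPU_VECTOR_GET_PTR(buffers[0]);
    float *workspace = (float *)STARPU_VECTOR_GET_PTR(buffers[1]);
    float *output    = (float *)STARPU_VECTOR_GET_PTR(buffers[2]);
    unsigned n = STARPU_VECTOR_GET_NX(buffers[1]);

    /* workspace[0..n-1] can be used freely; its previous content is undefined
     * and whatever is written there is not kept consistent across tasks */
    ...
}
\endcode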

The example <c>examples/pi</c> uses scratch data for its temporary buffer.

\section TheMultiformatInterface The Multiformat Interface

It may be interesting to represent the same piece of data using two different
data structures: one that would only be used on CPUs, and one that would only
be used on GPUs. This can be done by using the multiformat interface. StarPU
will be able to convert data from one data structure to the other when needed.
Note that the scheduler <c>dmda</c> is the only one optimized for this
interface. The user must provide StarPU with conversion codelets:

\snippet multiformat.c To be included. You should update doxygen if you see this text.

Kernels can be written almost as for any other interface. Note that
::STARPU_MULTIFORMAT_GET_CPU_PTR shall only be used for CPU kernels. CUDA kernels
must use ::STARPU_MULTIFORMAT_GET_CUDA_PTR, and OpenCL kernels must use
::STARPU_MULTIFORMAT_GET_OPENCL_PTR. ::STARPU_MULTIFORMAT_GET_NX may
be used in any kind of kernel.

\code{.c}
static void
multiformat_scal_cpu_func(void *buffers[], void *args)
{
    struct point *aos;
    unsigned int n;

    aos = STARPU_MULTIFORMAT_GET_CPU_PTR(buffers[0]);
    n = STARPU_MULTIFORMAT_GET_NX(buffers[0]);
    ...
}

extern "C" void multiformat_scal_cuda_func(void *buffers[], void *_args)
{
    unsigned int n;
    struct struct_of_arrays *soa;

    soa = (struct struct_of_arrays *) STARPU_MULTIFORMAT_GET_CUDA_PTR(buffers[0]);
    n = STARPU_MULTIFORMAT_GET_NX(buffers[0]);
    ...
}
\endcode

A full example may be found in <c>examples/basic_examples/multiformat.c</c>.

\section DefiningANewDataInterface Defining A New Data Interface

Let's define a new data interface to manage complex numbers.

\code{.c}
/* interface for complex numbers */
struct starpu_complex_interface
{
    double *real;
    double *imaginary;
    int nx;
};
\endcode

Registering such a data to StarPU is easily done using the function
starpu_data_register(). The last
parameter of the function, <c>interface_complex_ops</c>, will be
described below.

\code{.c}
void starpu_complex_data_register(starpu_data_handle_t *handle,
                                  unsigned home_node, double *real, double *imaginary, int nx)
{
    struct starpu_complex_interface complex =
    {
        .real = real,
        .imaginary = imaginary,
        .nx = nx
    };

    if (interface_complex_ops.interfaceid == STARPU_UNKNOWN_INTERFACE_ID)
    {
        interface_complex_ops.interfaceid = starpu_data_interface_get_next_id();
    }

    starpu_data_register(handle, home_node, &complex, &interface_complex_ops);
}
\endcode

Different operations need to be defined for a data interface through
the type starpu_data_interface_ops. We only define here the basic
operations needed to run simple applications. The source code for the
different functions can be found in the file
<c>examples/interface/complex_interface.c</c>.

\code{.c}
static struct starpu_data_interface_ops interface_complex_ops =
{
    .register_data_handle = complex_register_data_handle,
    .allocate_data_on_node = complex_allocate_data_on_node,
    .copy_methods = &complex_copy_methods,
    .get_size = complex_get_size,
    .footprint = complex_footprint,
    .interfaceid = STARPU_UNKNOWN_INTERFACE_ID,
    .interface_size = sizeof(struct starpu_complex_interface),
};
\endcode
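
As an illustration, the size and footprint methods could look as follows (a minimal sketch in the spirit of <c>complex_interface.c</c>, not a verbatim copy of it; it relies on the <c>starpu_complex_get_nx()</c> accessor declared below):

\code{.c}
static size_t complex_get_size(starpu_data_handle_t handle)
{
    struct starpu_complex_interface *complex_interface =
        (struct starpu_complex_interface *) starpu_data_get_interface_on_node(handle, STARPU_MAIN_RAM);

    /* Total size of the data: two arrays of nx doubles */
    return complex_interface->nx * 2 * sizeof(double);
}

static uint32_t complex_footprint(starpu_data_handle_t handle)
{
    /* Hash the parameters which determine the data layout, here only nx */
    return starpu_hash_crc32c_be(starpu_complex_get_nx(handle), 0);
}
\endcode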

Functions need to be defined to access the different fields of the
complex interface from a StarPU data handle.

\code{.c}
double *starpu_complex_get_real(starpu_data_handle_t handle)
{
    struct starpu_complex_interface *complex_interface =
        (struct starpu_complex_interface *) starpu_data_get_interface_on_node(handle, STARPU_MAIN_RAM);

    return complex_interface->real;
}

double *starpu_complex_get_imaginary(starpu_data_handle_t handle);
int starpu_complex_get_nx(starpu_data_handle_t handle);
\endcode

Similar functions need to be defined to access the different fields of the
complex interface from a <c>void *</c> pointer to be used within codelet
implementations.

\snippet complex.c To be included. You should update doxygen if you see this text.

Complex data interfaces can then be registered to StarPU.

\code{.c}
double real = 45.0;
double imaginary = 12.0;

starpu_complex_data_register(&handle1, STARPU_MAIN_RAM, &real, &imaginary, 1);
starpu_task_insert(&cl_display, STARPU_R, handle1, 0);
\endcode

and used by codelets.

\code{.c}
void display_complex_codelet(void *descr[], __attribute__ ((unused)) void *_args)
{
    int nx = STARPU_COMPLEX_GET_NX(descr[0]);
    double *real = STARPU_COMPLEX_GET_REAL(descr[0]);
    double *imaginary = STARPU_COMPLEX_GET_IMAGINARY(descr[0]);
    int i;

    for(i=0 ; i<nx ; i++)
    {
        fprintf(stderr, "Complex[%d] = %3.2f + %3.2f i\n", i, real[i], imaginary[i]);
    }
}
\endcode

The whole code for this complex data interface is available in the
directory <c>examples/interface/</c>.

\section SpecifyingATargetNode Specifying a target node for task data

When executing a task on a GPU for instance, StarPU would normally copy all the
needed data for the task to the embedded memory of the GPU. It may however
happen that the task kernel would rather have some of the data kept in the
main memory instead of copied to the GPU, a pivoting vector for instance.
This can be achieved by setting the starpu_codelet::specific_nodes flag to
1, and then filling the starpu_codelet::nodes array (or starpu_codelet::dyn_nodes when
starpu_codelet::nbuffers is greater than STARPU_NMAXBUFS) with the node numbers
where data should be copied to, or -1 to let StarPU copy it to the memory node
where the task will be executed. For instance, with the following codelet:

\code{.c}
struct starpu_codelet cl =
{
    .cuda_funcs = { kernel },
    .nbuffers = 2,
    .modes = {STARPU_RW, STARPU_RW},
    .specific_nodes = 1,
    .nodes = {STARPU_MAIN_RAM, -1},
};
\endcode

the first data of the task will be kept in the main memory, while the second
data will be copied to the CUDA GPU as usual.
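
The corresponding codelet function then receives a main-memory pointer for the first buffer and a GPU pointer for the second one (a minimal sketch; the vector layout of both buffers is an assumption of this example):

\code{.c}
void kernel(void *buffers[], void *cl_arg)
{
    /* First buffer was kept in main memory (nodes[0] == STARPU_MAIN_RAM):
     * this pointer can be dereferenced directly from host code. */
    int *pivot = (int *)STARPU_VECTOR_GET_PTR(buffers[0]);

    /* Second buffer was copied to the CUDA memory node as usual:
     * this is a device pointer, to be passed to a CUDA kernel launch. */
    float *data = (float *)STARPU_VECTOR_GET_PTR(buffers[1]);

    ...
}
\endcode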

*/