/*
 * This file is part of the StarPU Handbook.
 * Copyright (C) 2009--2011 Université de Bordeaux
 * Copyright (C) 2010, 2011, 2012, 2013, 2014 CNRS
 * Copyright (C) 2011, 2012 INRIA
 * See the file version.doxy for copying conditions.
 */

/*! \page CheckListWhenPerformanceAreNotThere Check List When Performance Is Not There

TODO: improve!

Simply encapsulating application kernels into tasks already makes it possible
to support CPUs and GPUs seamlessly at the same time. To achieve good
performance, the list of features below should be checked.

\section DataRelatedFeaturesToImprovePerformance Data Related Features That May Improve Performance

See \ref DataManagement and \ref DataPrefetch.

\section TaskRelatedFeaturesToImprovePerformance Task Related Features That May Improve Performance

See \ref TaskGranularity, \ref TaskSubmission and \ref TaskPriorities.

\section SchedulingRelatedFeaturesToImprovePerformance Scheduling Related Features That May Improve Performance

See \ref TaskSchedulingPolicy, \ref TaskDistributionVsDataTransfer,
\ref Power-basedScheduling and \ref StaticScheduling.

\section CUDA-specificOptimizations CUDA-specific Optimizations

Due to CUDA limitations, StarPU will have a hard time overlapping its own
communications with the codelet computations if the application does not use a
dedicated CUDA stream for its computations instead of the default stream,
which synchronizes all operations of the GPU. StarPU provides one through
starpu_cuda_get_local_stream(), which should be used by all CUDA codelet
operations to avoid this issue. For instance:

\code{.c}
func <<<grid,block,0,starpu_cuda_get_local_stream()>>> (foo, bar);
cudaStreamSynchronize(starpu_cuda_get_local_stream());
\endcode

Unfortunately, some CUDA libraries do not provide stream variants of their
kernels, which lowers the potential for overlapping.

Calling starpu_cublas_init() makes StarPU issue the appropriate calls for the
CUBLAS library. Some libraries like Magma may however change the current
stream; one then has to call
cublasSetKernelStream(starpu_cuda_get_local_stream()) at the beginning of the
codelet to make sure that CUBLAS really uses the proper stream.

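For instance, a minimal sketch of this pattern (assuming the legacy CUBLAS
API, to which cublasSetKernelStream() belongs):

\code{.c}
/* Once at application start-up, after starpu_init(): */
starpu_cublas_init();

/* At the beginning of a CUDA codelet whose current stream may have been
 * changed by another library such as Magma: */
cublasSetKernelStream(starpu_cuda_get_local_stream());
\endcode
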
If the kernel can be made to use only this local stream or other self-allocated
streams, i.e. the whole kernel submission can be made asynchronous, then
one should enable asynchronous execution of the kernel. That means setting
the flag ::STARPU_CUDA_ASYNC in the field starpu_codelet::cuda_flags, and dropping the
cudaStreamSynchronize() call at the end of the cuda_func function, so that it
returns immediately after having queued the kernel to the local stream. That way, StarPU will be
able to submit and complete data transfers while kernels are executing, instead of only at each
kernel submission. The kernel just has to make sure that StarPU can use the
local stream to synchronize with the kernel startup and completion.

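As a minimal sketch (the scal_kernel CUDA kernel and the vector layout are
hypothetical here), such an asynchronous codelet may look like:

\code{.c}
static void scal_cuda_func(void *buffers[], void *cl_arg)
{
	unsigned n = STARPU_VECTOR_GET_NX(buffers[0]);
	float *v = (float *)STARPU_VECTOR_GET_PTR(buffers[0]);

	/* Queue the kernel on the StarPU-provided stream and return
	 * immediately, without any cudaStreamSynchronize() call. */
	scal_kernel<<<(n+255)/256, 256, 0, starpu_cuda_get_local_stream()>>>(n, v);
}

static struct starpu_codelet scal_cl = {
	.cuda_funcs = { scal_cuda_func },
	.cuda_flags = { STARPU_CUDA_ASYNC },
	.nbuffers = 1,
	.modes = { STARPU_RW },
};
\endcode
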
Using the flag ::STARPU_CUDA_ASYNC also makes it possible to enable concurrent kernel
execution, on cards which support it (Kepler and later, notably). This is
enabled by setting the environment variable \ref STARPU_NWORKER_PER_CUDA to the
number of kernels to execute concurrently. This is useful when kernels are
small and do not feed the whole GPU with threads to run.

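For instance, to let up to four small kernels run concurrently on each CUDA
device (the value 4 is purely illustrative):

\verbatim
$ export STARPU_NWORKER_PER_CUDA=4
\endverbatim
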
\section OpenCL-specificOptimizations OpenCL-specific Optimizations

If the kernel can be made to use only the StarPU-provided command queue or other self-allocated
queues, i.e. the whole kernel submission can be made asynchronous, then
one should enable asynchronous execution of the kernel. This means setting
the flag ::STARPU_OPENCL_ASYNC in the field starpu_codelet::opencl_flags and dropping the
clFinish() and starpu_opencl_collect_stats() calls at the end of the kernel, so
that it returns immediately after having queued the kernel to the provided queue.
That way, StarPU will be able to submit and complete data transfers while kernels are executing, instead of
only at each kernel submission. The kernel just has to make sure
that StarPU can use the command queue it has provided to synchronize with the
kernel startup and completion.

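As a sketch (the scal_opencl_kernel cl_kernel object and its argument setup
are assumed to be created elsewhere), such an asynchronous codelet may look
like:

\code{.c}
static void scal_opencl_func(void *buffers[], void *cl_arg)
{
	cl_command_queue queue;
	size_t global = STARPU_VECTOR_GET_NX(buffers[0]);

	starpu_opencl_get_current_queue(&queue);
	/* ... clSetKernelArg() calls for scal_opencl_kernel ... */

	/* Queue the kernel on the StarPU-provided queue and return
	 * immediately, without any clFinish() call. */
	clEnqueueNDRangeKernel(queue, scal_opencl_kernel, 1, NULL,
			       &global, NULL, 0, NULL, NULL);
}

static struct starpu_codelet scal_cl = {
	.opencl_funcs = { scal_opencl_func },
	.opencl_flags = { STARPU_OPENCL_ASYNC },
	.nbuffers = 1,
	.modes = { STARPU_RW },
};
\endcode
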
\section DetectionStuckConditions Detecting Stuck Conditions

It may happen that, for some reason, StarPU does not make progress for a long
period of time. The cause is sometimes contention inside StarPU, but
sometimes it is external, such as a stuck MPI driver, CUDA driver, etc.

<c>export STARPU_WATCHDOG_TIMEOUT=10000</c> (\ref STARPU_WATCHDOG_TIMEOUT)
makes StarPU print an error message whenever it has not terminated
any task for 10ms. In addition to that,
<c>export STARPU_WATCHDOG_CRASH=1</c> (\ref STARPU_WATCHDOG_CRASH)
raises SIGABRT in that condition, thus allowing the situation to be caught in gdb.
It can also be useful to type <c>handle SIGABRT nopass</c> in gdb, so as to let
the process continue after its state has been inspected.

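A typical debugging session may thus look like this (my_application is a
placeholder for the actual program):

\verbatim
$ export STARPU_WATCHDOG_TIMEOUT=10000
$ export STARPU_WATCHDOG_CRASH=1
$ gdb ./my_application
(gdb) handle SIGABRT nopass
(gdb) run
\endverbatim
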
\section HowToLimitMemoryPerNode How to limit memory per node

TODO

Talk about
\ref STARPU_LIMIT_CUDA_devid_MEM, \ref STARPU_LIMIT_CUDA_MEM,
\ref STARPU_LIMIT_OPENCL_devid_MEM, \ref STARPU_LIMIT_OPENCL_MEM
and \ref STARPU_LIMIT_CPU_MEM

starpu_memory_get_total()

starpu_memory_get_available()

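As a minimal sketch of how these functions can be queried (here for the main
RAM node):

\code{.c}
starpu_ssize_t total = starpu_memory_get_total(STARPU_MAIN_RAM);
starpu_ssize_t avail = starpu_memory_get_available(STARPU_MAIN_RAM);
fprintf(stderr, "main RAM: %ld bytes available out of %ld\n",
	(long) avail, (long) total);
\endcode
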
\section HowToReduceTheMemoryFootprintOfInternalDataStructures How To Reduce The Memory Footprint Of Internal Data Structures

It is possible to reduce the memory footprint of the task and data internal
structures of StarPU by describing the shape of your machine and/or your
application at the configure step.

To reduce the memory footprint of the data internal structures of StarPU, one
can set the \ref enable-maxcpus "--enable-maxcpus", \ref enable-maxcudadev
"--enable-maxcudadev", \ref enable-maxopencldev "--enable-maxopencldev" and
\ref enable-maxnodes "--enable-maxnodes" configure parameters to give StarPU
the architecture of the machine it will run on, thus tuning the size of the
structures to the machine.

To reduce the memory footprint of the task internal structures of StarPU, one
can set the \ref enable-maxbuffers "--enable-maxbuffers" configure parameter to
give StarPU the maximum number of buffers that a task can use during an
execution. For example, in the Cholesky factorization (dense linear algebra
application), the GEMM task uses up to 3 buffers, so it is possible to set the
maximum number of task buffers to 3 to run a Cholesky factorization on StarPU.

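For instance, for a machine with 64 CPU cores, 4 CUDA devices and no OpenCL
devices, running an application whose tasks use at most 3 buffers, the
configure invocation might look like this (the values are purely illustrative
and must match the target machine; memory nodes are typically one for the main
RAM plus one per device):

\verbatim
$ ./configure --enable-maxcpus=64 --enable-maxcudadev=4 \
              --enable-maxopencldev=0 --enable-maxnodes=5 --enable-maxbuffers=3
\endverbatim
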
\section HowtoReuseMemory How to reuse memory

When your application needs to allocate more data than the available amount of
memory usable by StarPU (given by starpu_memory_get_available()), the
allocation cache system can reuse data buffers used by previously executed
tasks. For that system to work with MPI tasks, you need to submit tasks progressively instead
of as soon as possible, because in the case of MPI receives, the allocation cache check for reusing data
buffers will be done at submission time, not at execution time.

There are two options to control the task submission flow. The first one is to
control the number of submitted tasks during the whole execution. This can
be done either by setting the environment variables \ref
STARPU_LIMIT_MAX_NSUBMITTED_TASKS and \ref STARPU_LIMIT_MIN_NSUBMITTED_TASKS to
tell StarPU when to stop submitting tasks and when to wake up and submit tasks
again, or by explicitly calling starpu_task_wait_for_n_submitted() in
your application code for finer-grain control (for example, between two
iterations of a submission loop).

The second option is to control the memory size of the allocation cache. This
can be done in the application by jointly using
starpu_memory_get_available() and starpu_memory_wait_available() to submit
tasks only when there is enough memory space to allocate the data needed by the
task, i.e. when enough data buffers are available for reuse in the allocation cache.

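A sketch of this second option (data_size, node, and the task creation are
hypothetical placeholders):

\code{.c}
/* Before submitting each task, block until the allocation cache can
 * provide enough memory for its data. */
starpu_memory_wait_available(node, data_size);
starpu_task_submit(task);
\endcode
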
\section PerformanceModelCalibration Performance Model Calibration

Most schedulers are based on an estimation of codelet duration on each kind
of processing unit. For this to be possible, the application programmer needs
to configure a performance model for the codelets of the application (see
\ref PerformanceModelExample for instance). History-based performance models
use on-line calibration. StarPU will automatically calibrate codelets
which have never been calibrated yet, and save the result in
<c>$STARPU_HOME/.starpu/sampling/codelets</c>.
The models are indexed by machine name. To share the models between
machines (e.g. for a homogeneous cluster), use <c>export
STARPU_HOSTNAME=some_global_name</c> (\ref STARPU_HOSTNAME). To force continuing calibration,
use <c>export STARPU_CALIBRATE=1</c> (\ref STARPU_CALIBRATE). This may be necessary if your application
has not-so-stable performance. StarPU will force calibration (and thus ignore
the current result) until 10 (<c>_STARPU_CALIBRATION_MINIMUM</c>) measurements have been
made on each architecture, to avoid scheduling tasks badly just because the
first measurements were not so good. Details on the current performance model status
can be obtained with the <c>starpu_perfmodel_display</c> command: the <c>-l</c>
option lists the available performance models, and the <c>-s</c> option permits
choosing the performance model to be displayed. The result looks like:

\verbatim
$ starpu_perfmodel_display -s starpu_slu_lu_model_11
performance model for cpu_impl_0
# hash      size     flops         mean          dev           n
914f3bef    1048576  0.000000e+00  2.503577e+04  1.982465e+02  8
3e921964    65536    0.000000e+00  5.527003e+02  1.848114e+01  7
e5a07e31    4096     0.000000e+00  1.717457e+01  5.190038e+00  14
...
\endverbatim

This shows that for the LU 11 kernel with a 1MiB matrix, the average
execution time on CPUs was about 25ms, with a 0.2ms standard deviation, over
8 samples. It is a good idea to check this before doing actual performance
measurements.

A graph can be drawn by using the tool <c>starpu_perfmodel_plot</c>:

\verbatim
$ starpu_perfmodel_plot -s starpu_slu_lu_model_11
4096 16384 65536 262144 1048576 4194304
$ gnuplot starpu_starpu_slu_lu_model_11.gp
$ gv starpu_starpu_slu_lu_model_11.eps
\endverbatim

\image html starpu_starpu_slu_lu_model_11.png
\image latex starpu_starpu_slu_lu_model_11.eps "" width=\textwidth

If a kernel source code was modified (e.g. for a performance improvement), the
calibration information is stale and should be dropped, so as to re-calibrate
from scratch. This can be done by using <c>export STARPU_CALIBRATE=2</c> (\ref STARPU_CALIBRATE).

Note: history-based performance models get calibrated
only if a performance-model-based scheduler is chosen.

The history-based performance models can also be explicitly filled by the
application without execution, if e.g. the application already has a series of
measurements. This can be done by using starpu_perfmodel_update_history(),
for instance:

\code{.c}
static struct starpu_perfmodel perf_model = {
	.type = STARPU_HISTORY_BASED,
	.symbol = "my_perfmodel",
};

struct starpu_codelet cl = {
	.cuda_funcs = { cuda_func1, cuda_func2 },
	.nbuffers = 1,
	.modes = {STARPU_W},
	.model = &perf_model
};

void feed(void)
{
	struct my_measure *measure;
	struct starpu_task task;

	starpu_task_init(&task);
	task.cl = &cl;

	for (measure = &measures[0]; measure < &measures[last]; measure++) {
		starpu_data_handle_t handle;
		/* Register a dummy handle of the measured size, without
		 * allocating any actual memory (home node -1). */
		starpu_vector_data_register(&handle, -1, 0, measure->size, sizeof(float));
		task.handles[0] = handle;
		/* Feed the model with the pre-existing measurement. */
		starpu_perfmodel_update_history(&perf_model, &task,
				STARPU_CUDA_DEFAULT + measure->cudadev, 0,
				measure->implementation, measure->time);
		starpu_task_clean(&task);
		starpu_data_unregister(handle);
	}
}
\endcode

Measurements have to be provided in milliseconds for the completion time models,
and in Joules for the energy consumption models.

\section Profiling Profiling

A quick view of how many tasks each worker has executed can be obtained by setting
<c>export STARPU_WORKER_STATS=1</c> (\ref STARPU_WORKER_STATS). This is a convenient way to check that
execution did happen on accelerators, without penalizing performance with
the profiling overhead.

A quick view of how many data transfers have been issued can be obtained by setting
<c>export STARPU_BUS_STATS=1</c> (\ref STARPU_BUS_STATS).

More detailed profiling information can be enabled by using <c>export STARPU_PROFILING=1</c> (\ref STARPU_PROFILING)
or by calling starpu_profiling_status_set() from the source code.

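For instance, profiling can be turned on programmatically right after
initialization:

\code{.c}
starpu_profiling_status_set(STARPU_PROFILING_ENABLE);
\endcode
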
Statistics on the execution can then be obtained by using <c>export
STARPU_BUS_STATS=1</c> and <c>export STARPU_WORKER_STATS=1</c>.

More details on performance feedback are provided in the next chapter.

*/