
/*
 * This file is part of the StarPU Handbook.
 * Copyright (C) 2009--2011 Universit@'e de Bordeaux 1
 * Copyright (C) 2010, 2011, 2012, 2013, 2014 Centre National de la Recherche Scientifique
 * Copyright (C) 2011, 2012 Institut National de Recherche en Informatique et Automatique
 * See the file version.doxy for copying conditions.
 */
/*! \page CheckListWhenPerformanceAreNotThere Check List When Performance Is Not There

TODO: improve!

Simply encapsulating application kernels into tasks already permits
seamless support of CPUs and GPUs at the same time. To achieve good
performance, the following list of features should be checked.
\section DataRelatedFeaturesToImprovePerformance Data Related Features That May Improve Performance

link to \ref DataManagement

link to \ref DataPrefetch

\section TaskRelatedFeaturesToImprovePerformance Task Related Features That May Improve Performance

link to \ref TaskGranularity

link to \ref TaskSubmission

link to \ref TaskPriorities

\section SchedulingRelatedFeaturesToImprovePerformance Scheduling Related Features That May Improve Performance

link to \ref TaskSchedulingPolicy

link to \ref TaskDistributionVsDataTransfer

link to \ref Power-basedScheduling

link to \ref StaticScheduling
\section CUDA-specificOptimizations CUDA-specific Optimizations

Due to CUDA limitations, StarPU will have a hard time overlapping its own
communications with the codelet computations if the application does not use a
dedicated CUDA stream for its computations, instead of the default stream,
which synchronizes all operations of the GPU. StarPU provides one through
starpu_cuda_get_local_stream(), which should be used by all CUDA codelet
operations to avoid this issue. For instance:

\code{.c}
func <<<grid,block,0,starpu_cuda_get_local_stream()>>> (foo, bar);
cudaStreamSynchronize(starpu_cuda_get_local_stream());
\endcode
StarPU already makes the appropriate calls for the CUBLAS library.

If the kernel can be made to only use this local stream or other self-allocated
streams, i.e. the whole kernel submission can be made asynchronous, then
one should enable asynchronous execution of the kernel. This means setting
the corresponding flag in the cuda_flags[] field of the codelet and dropping the
cudaStreamSynchronize() call at the end of the kernel. That way, StarPU will be
able to pipeline task submission to GPUs, instead of synchronizing at each
kernel submission. The kernel just has to make sure that StarPU can use the
local stream to synchronize with the kernel startup and completion.

Unfortunately, some CUDA libraries do not provide stream variants of their
kernels, which lowers the potential for overlapping.
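As a sketch of what this looks like (the kernel and function names here are hypothetical, not part of the StarPU API):

\code{.c}
/* Codelet function: launches the kernel on the StarPU-provided local
 * stream, and does NOT call cudaStreamSynchronize() at the end. */
void my_cuda_func(void *buffers[], void *cl_arg)
{
	float *v = (float *)STARPU_VECTOR_GET_PTR(buffers[0]);
	unsigned n = STARPU_VECTOR_GET_NX(buffers[0]);
	my_kernel<<<(n+255)/256, 256, 0, starpu_cuda_get_local_stream()>>>(v, n);
}

struct starpu_codelet cl =
{
	.cuda_funcs = { my_cuda_func },
	/* Tell StarPU the implementation is asynchronous. */
	.cuda_flags = { STARPU_CUDA_ASYNC },
	.nbuffers = 1,
	.modes = { STARPU_RW },
};
\endcode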
\section OpenCL-specificOptimizations OpenCL-specific Optimizations

If the kernel can be made to only use the StarPU-provided command queue or other
self-allocated queues, i.e. the whole kernel submission can be made asynchronous,
then one should enable asynchronous execution of the kernel. This means setting
the corresponding flag in the opencl_flags[] field of the codelet and dropping the
clFinish() and starpu_opencl_collect_stats() calls at the end of the kernel.
That way, StarPU will be able to pipeline task submission to GPUs, instead of
synchronizing at each kernel submission. The kernel just has to make sure
that StarPU can use the command queue it has provided to synchronize with the
kernel startup and completion.
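A sketch of the corresponding codelet declaration (the codelet function name is hypothetical):

\code{.c}
struct starpu_codelet cl =
{
	.opencl_funcs = { my_opencl_func },
	/* Tell StarPU the implementation only enqueues asynchronous
	 * operations on the StarPU-provided command queue, with no
	 * clFinish() call at the end. */
	.opencl_flags = { STARPU_OPENCL_ASYNC },
	.nbuffers = 1,
	.modes = { STARPU_RW },
};
\endcode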
\section DetectionStuckConditions Detecting Stuck Conditions

It may happen that, for some reason, StarPU does not make progress for a long
period of time. The reason is sometimes contention inside StarPU, but sometimes
it is external, such as a stuck MPI or CUDA driver.

<c>export STARPU_WATCHDOG_TIMEOUT=10000</c>

makes StarPU print an error message whenever it has not terminated any task for
10ms. In addition to that,

<c>export STARPU_WATCHDOG_CRASH=1</c>

triggers a crash in that condition, thus allowing the situation to be caught in
gdb, etc.
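For instance, to catch such a hang in gdb (<c>my_app</c> being a placeholder for the actual application):

\verbatim
$ export STARPU_WATCHDOG_TIMEOUT=10000   # complain if no task terminates within 10ms
$ export STARPU_WATCHDOG_CRASH=1         # and abort, so the state can be inspected
$ gdb --args ./my_app
\endverbatim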
\section HowToLimitMemoryPerNode How to Limit Memory per Node

TODO

Talk about
\ref STARPU_LIMIT_CUDA_devid_MEM, \ref STARPU_LIMIT_CUDA_MEM,
\ref STARPU_LIMIT_OPENCL_devid_MEM, \ref STARPU_LIMIT_OPENCL_MEM
and \ref STARPU_LIMIT_CPU_MEM

starpu_memory_get_total()

starpu_memory_get_available()
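As a sketch, the amounts can be queried from the application like this (using STARPU_MAIN_RAM as an example node; the return values may be negative when the amount is unknown):

\code{.c}
starpu_ssize_t total = starpu_memory_get_total(STARPU_MAIN_RAM);
starpu_ssize_t avail = starpu_memory_get_available(STARPU_MAIN_RAM);
if (total >= 0 && avail >= 0)
	printf("main RAM: %ld bytes available out of %ld\n",
	       (long)avail, (long)total);
\endcode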
\section PerformanceModelCalibration Performance Model Calibration

Most schedulers are based on an estimation of codelet duration on each kind
of processing unit. For this to be possible, the application programmer needs
to configure a performance model for the codelets of the application (see
\ref PerformanceModelExample for instance). History-based performance models
use on-line calibration. StarPU will automatically calibrate codelets
which have never been calibrated yet, and save the result in
<c>$STARPU_HOME/.starpu/sampling/codelets</c>.
The models are indexed by machine name. To share the models between
machines (e.g. for a homogeneous cluster), use <c>export
STARPU_HOSTNAME=some_global_name</c>. To force continued calibration, use
<c>export STARPU_CALIBRATE=1</c>. This may be necessary if your application
has not-so-stable performance. StarPU will force calibration (and thus ignore
the current result) until 10 (<c>_STARPU_CALIBRATION_MINIMUM</c>) measurements
have been made on each architecture, to avoid badly scheduling tasks just
because the first measurements were not so good. Details on the current
performance model status can be obtained with the <c>starpu_perfmodel_display</c>
command: the <c>-l</c> option lists the available performance models, and the
<c>-s</c> option permits choosing the performance model to be displayed. The
result looks like:
\verbatim
$ starpu_perfmodel_display -s starpu_slu_lu_model_11
performance model for cpu_impl_0
# hash      size     flops         mean          dev           n
914f3bef    1048576  0.000000e+00  2.503577e+04  1.982465e+02  8
3e921964    65536    0.000000e+00  5.527003e+02  1.848114e+01  7
e5a07e31    4096     0.000000e+00  1.717457e+01  5.190038e+00  14
...
\endverbatim
This shows that for the LU 11 kernel with a 1MiB matrix, the average
execution time on CPUs was about 25ms, with a 0.2ms standard deviation, over
8 samples. It is a good idea to check this before doing actual performance
measurements.
A graph can be drawn by using the tool <c>starpu_perfmodel_plot</c>:

\verbatim
$ starpu_perfmodel_plot -s starpu_slu_lu_model_11
4096 16384 65536 262144 1048576 4194304
$ gnuplot starpu_starpu_slu_lu_model_11.gp
$ gv starpu_starpu_slu_lu_model_11.eps
\endverbatim

\image html starpu_starpu_slu_lu_model_11.png
\image latex starpu_starpu_slu_lu_model_11.eps "" width=\textwidth
If a kernel source code was modified (e.g. for a performance improvement), the
calibration information is stale and should be dropped, so as to re-calibrate
from scratch. This can be done by using <c>export STARPU_CALIBRATE=2</c>.

Note: due to CUDA limitations, to be able to measure kernel duration,
calibration mode needs to disable asynchronous data transfers. Calibration thus
disables data transfer / computation overlapping, and should therefore not be
used for actual benchmarks. Note 2: history-based performance models get
calibrated only if a performance-model-based scheduler is chosen.
The history-based performance models can also be explicitly filled by the
application without execution, if e.g. the application already has a series of
measurements. This can be done by using starpu_perfmodel_update_history(),
for instance:

\code{.c}
static struct starpu_perfmodel perf_model =
{
	.type = STARPU_HISTORY_BASED,
	.symbol = "my_perfmodel",
};

struct starpu_codelet cl =
{
	.cuda_funcs = { cuda_func1, cuda_func2, NULL },
	.nbuffers = 1,
	.modes = {STARPU_W},
	.model = &perf_model
};

void feed(void)
{
	struct my_measure *measure;
	struct starpu_task task;
	starpu_task_init(&task);
	task.cl = &cl;
	/* Iterate over the recorded measurements (note the pointer
	 * comparison against the address of the last element). */
	for (measure = &measures[0]; measure < &measures[last]; measure++)
	{
		starpu_data_handle_t handle;
		starpu_vector_data_register(&handle, -1, 0, measure->size, sizeof(float));
		task.handles[0] = handle;
		starpu_perfmodel_update_history(&perf_model, &task,
						STARPU_CUDA_DEFAULT + measure->cudadev, 0,
						measure->implementation, measure->time);
		starpu_task_clean(&task);
		starpu_data_unregister(handle);
	}
}
\endcode
Measurements have to be provided in milliseconds for the completion time models,
and in Joules for the energy consumption models.
\section Profiling Profiling

A quick view of how many tasks each worker has executed can be obtained by
setting <c>export STARPU_WORKER_STATS=1</c>. This is a convenient way to check
that execution did happen on accelerators, without penalizing performance with
the profiling overhead.

A quick view of how many data transfers have been issued can be obtained by
setting <c>export STARPU_BUS_STATS=1</c>.

More detailed profiling information can be enabled by using <c>export
STARPU_PROFILING=1</c> or by calling starpu_profiling_status_set() from the
source code.
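As a sketch, enabling profiling programmatically and inspecting the timing of a completed task could look like this (assuming a <c>task</c> submitted with <c>destroy</c> set to 0, so that its profiling information is still accessible):

\code{.c}
starpu_profiling_status_set(STARPU_PROFILING_ENABLE);

/* ... submit the task and wait for its completion ... */

struct starpu_profiling_task_info *info = task->profiling_info;
if (info)
	printf("task ran for %.2f us\n",
	       starpu_timing_timespec_delay_us(&info->start_time, &info->end_time));
\endcode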
Statistics on the execution can then be obtained by using <c>export
STARPU_BUS_STATS=1</c> and <c>export STARPU_WORKER_STATS=1</c>.

More details on performance feedback are provided in the next chapter.
*/