/*
 * This file is part of the StarPU Handbook.
 * Copyright (C) 2009--2011  Université de Bordeaux
 * Copyright (C) 2010, 2011, 2012, 2013, 2014, 2016, 2017  CNRS
 * Copyright (C) 2011, 2012, 2017  INRIA
 * See the file version.doxy for copying conditions.
 */
/*! \page CheckListWhenPerformanceAreNotThere Check List When Performance Is Not There

TODO: improve!

To achieve good performance, we give below a list of features which should be
checked. As a starting point, \ref OfflinePerformanceTools can produce a Gantt
chart showing roughly where time is spent, so that optimization efforts can be
focused accordingly.
\section ConfigurationImprovePerformance Configuration That May Improve Performance

The \ref enable-fast "--enable-fast" configuration option disables all
assertions and sanity checks, which makes StarPU noticeably faster for really
small tasks. Only use it for measurements and production, not for development,
since all basic checks are dropped.
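For instance, a typical configure invocation would then be:

\code
./configure --enable-fast
\endcode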
\section DataRelatedFeaturesToImprovePerformance Data Related Features That May Improve Performance

- \ref DataManagement
- \ref DataPrefetch
\section TaskRelatedFeaturesToImprovePerformance Task Related Features That May Improve Performance

- \ref TaskGranularity
- \ref TaskSubmission
- \ref TaskPriorities
\section SchedulingRelatedFeaturesToImprovePerformance Scheduling Related Features That May Improve Performance

- \ref TaskSchedulingPolicy
- \ref TaskDistributionVsDataTransfer
- \ref Energy-basedScheduling
- \ref StaticScheduling
\section CUDA-specificOptimizations CUDA-specific Optimizations

Due to CUDA limitations, StarPU will have a hard time overlapping its own
communications with the codelet computations if the application does not use a
dedicated CUDA stream for its computations, instead of the default stream,
which synchronizes all operations of the GPU. StarPU provides one through
starpu_cuda_get_local_stream(), which should be used by all CUDA codelet
operations to avoid this issue. For instance:
\code{.c}
func <<<grid,block,0,starpu_cuda_get_local_stream()>>> (foo, bar);
cudaStreamSynchronize(starpu_cuda_get_local_stream());
\endcode
Unfortunately, some CUDA libraries do not have stream variants of their
kernels, which lowers the potential for overlapping.

Calling starpu_cublas_init() will make StarPU perform the appropriate calls
for the CUBLAS library. Some libraries like Magma may however change the
current stream of CUBLAS v1; one then has to call
<c>cublasSetKernelStream(starpu_cuda_get_local_stream())</c> at the beginning
of the codelet to make sure that CUBLAS is really using the proper stream.
When using CUBLAS v2, starpu_cublas_get_local_handle() can be called to queue
CUBLAS kernels with the proper configuration.
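For instance, a minimal sketch of a CUBLAS v2 call inside a codelet, assuming
<c>d_A</c>, <c>d_B</c> and <c>d_C</c> are device pointers to <c>n</c>x<c>n</c>
single-precision matrices obtained from the task buffers:

\code{.c}
const float alpha = 1.0f, beta = 0.0f;

/* The handle returned by starpu_cublas_get_local_handle() is already
 * attached to the worker's local stream, so the kernel can overlap
 * with StarPU data transfers. */
cublasSgemm(starpu_cublas_get_local_handle(),
            CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
            &alpha, d_A, n, d_B, n, &beta, d_C, n);
\endcode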
Similarly, calling starpu_cusparse_init() makes StarPU create CUSPARSE handles
on each CUDA device; starpu_cusparse_get_local_handle() can then be used to
queue CUSPARSE kernels with the proper configuration.
If the kernel can be made to only use this local stream or other self-allocated
streams, i.e. the whole kernel submission can be made asynchronous, then
one should enable asynchronous execution of the kernel. This means setting
the flag ::STARPU_CUDA_ASYNC in the field starpu_codelet::cuda_flags, and dropping the
<c>cudaStreamSynchronize()</c> call at the end of the <c>cuda_func</c> function, so that it
returns immediately after having queued the kernel to the local stream. That way, StarPU will be
able to submit and complete data transfers while kernels are executing, instead of only at each
kernel submission. The kernel just has to make sure that StarPU can use the
local stream to synchronize with the kernel startup and completion.
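As a minimal sketch, assuming a hypothetical <c>scal_kernel</c> CUDA kernel, an
asynchronous codelet could look like this:

\code{.c}
static void scal_cuda_func(void *buffers[], void *cl_arg)
{
	float *v = (float *)STARPU_VECTOR_GET_PTR(buffers[0]);
	unsigned n = STARPU_VECTOR_GET_NX(buffers[0]);

	/* Queue the (hypothetical) kernel on the StarPU-provided stream
	 * and return immediately: no cudaStreamSynchronize() here. */
	scal_kernel<<<(n+255)/256, 256, 0, starpu_cuda_get_local_stream()>>>(v, n);
}

static struct starpu_codelet scal_cl =
{
	.cuda_funcs = { scal_cuda_func },
	.cuda_flags = { STARPU_CUDA_ASYNC },
	.nbuffers = 1,
	.modes = { STARPU_RW },
};
\endcode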
If the kernel uses its own non-default stream, one can synchronize that stream
with the StarPU-provided stream this way:
\code{.c}
cudaEvent_t event;
call_kernel_with_its_own_stream();
cudaEventCreateWithFlags(&event, cudaEventDisableTiming);
cudaEventRecord(event, get_kernel_stream());
cudaStreamWaitEvent(starpu_cuda_get_local_stream(), event, 0);
cudaEventDestroy(event);
\endcode
This code makes the StarPU-provided stream wait for a new event, which will be
triggered by the completion of the kernel.

Using the flag ::STARPU_CUDA_ASYNC also allows concurrent kernel execution, on
cards which support it (notably Kepler and later). This is enabled by setting
the environment variable \ref STARPU_NWORKER_PER_CUDA to the number of kernels
to execute concurrently, and is useful when kernels are small and do not feed
the whole GPU with threads to run.
\section OpenCL-specificOptimizations OpenCL-specific Optimizations

If the kernel can be made to only use the StarPU-provided command queue or other self-allocated
queues, i.e. the whole kernel submission can be made asynchronous, then
one should enable asynchronous execution of the kernel. This means setting
the flag ::STARPU_OPENCL_ASYNC in the field starpu_codelet::opencl_flags and dropping the
<c>clFinish()</c> and starpu_opencl_collect_stats() calls at the end of the kernel, so
that it returns immediately after having queued the kernel to the provided queue.
That way, StarPU will be able to submit and complete data transfers while kernels are executing, instead of
only at each kernel submission. The kernel just has to make sure
that StarPU can use the command queue it has provided to synchronize with the
kernel startup and completion.
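A minimal sketch of the corresponding codelet declaration, assuming a
hypothetical <c>my_opencl_func</c> implementation which only enqueues on the
StarPU-provided queue:

\code{.c}
static struct starpu_codelet cl =
{
	.opencl_funcs = { my_opencl_func },
	/* the function returns right after enqueueing, without clFinish() */
	.opencl_flags = { STARPU_OPENCL_ASYNC },
	.nbuffers = 1,
	.modes = { STARPU_RW },
};
\endcode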
\section DetectionStuckConditions Detecting Stuck Conditions

It may happen that StarPU does not make progress for a long period of time.
This is sometimes due to contention inside StarPU, but sometimes to external
causes, such as a stuck MPI or CUDA driver.

<c>export STARPU_WATCHDOG_TIMEOUT=10000</c> (\ref STARPU_WATCHDOG_TIMEOUT)
makes StarPU print an error message whenever it has not terminated any task
for 10ms (the timeout is expressed in µs), but lets the application continue
normally. In addition,
<c>export STARPU_WATCHDOG_CRASH=1</c> (\ref STARPU_WATCHDOG_CRASH)
raises <c>SIGABRT</c> in that condition, thus allowing to catch the situation
in <c>gdb</c>.

It can also be useful to type <c>handle SIGABRT nopass</c> in <c>gdb</c> to be
able to let the process continue after inspecting its state.
\section HowToLimitMemoryPerNode How to Limit Memory Used By StarPU And Cache Buffer Allocations

By default, StarPU makes sure to use at most 90% of the memory of GPU devices,
moving data in and out of the device as appropriate, with prefetch and
writeback optimizations. Concerning the main memory, by default it does not
limit its consumption, since by default it has nowhere to push the data to when
memory gets tight. This also means that by default StarPU will not cache buffer
allocations in main memory, since it does not know how much of the system
memory it can afford.
In the case of GPUs, the \ref STARPU_LIMIT_CUDA_MEM, \ref STARPU_LIMIT_CUDA_devid_MEM,
\ref STARPU_LIMIT_OPENCL_MEM, and \ref STARPU_LIMIT_OPENCL_devid_MEM environment variables
can be used to control how much (in MiB) of the GPU device memory should be
used at most by StarPU (their default values are 90% of the available memory).

In the case of the main memory, the \ref STARPU_LIMIT_CPU_MEM environment
variable can be used to specify how much (in MiB) of the main memory should be
used at most by StarPU for buffer allocations. This way, StarPU will be able to
cache buffer allocations (which can be a real benefit if a lot of buffers are
involved, or if allocation fragmentation can become a problem), and when using
\ref OutOfCore, StarPU will know when it should evict data out to the disk.
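For instance, with the following (illustrative) values, StarPU would leave
some GPU memory and main memory to the application itself:

\code
export STARPU_LIMIT_CUDA_MEM=3500
export STARPU_LIMIT_CPU_MEM=60000
\endcode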
It should be noted that by default only buffer allocations automatically
done by StarPU are accounted here, i.e. allocations performed through
starpu_malloc_on_node(), which are used by the data interfaces
(matrix, vector, etc.). This does not include allocations performed by
the application through e.g. malloc(), nor allocations performed through
starpu_malloc(). Only allocations performed explicitly with the
\ref STARPU_MALLOC_COUNT flag, i.e. by calling

\code{.c}
starpu_malloc_flags(&ptr, size, STARPU_MALLOC_COUNT);
\endcode

are taken into account.

If the application wants to make StarPU aware of its own allocations, so that
StarPU knows precisely how much data is allocated, and thus when to evict
allocation caches or data out to the disk, starpu_memory_allocate() can be
used to specify an amount of memory to be accounted for, and
starpu_memory_deallocate() can be used to account freed memory back. These can
for instance be used by data interfaces with dynamic data buffers: instead of
using starpu_malloc_on_node(), they would dynamically allocate data with
malloc()/realloc(), and notify StarPU of the delta by calling
starpu_memory_allocate() and starpu_memory_deallocate().
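For instance, a minimal sketch of accounting an application-managed buffer in
main memory:

\code{.c}
size_t size = 32*1024*1024;

/* Tell StarPU to account for (and if needed wait for) this amount */
starpu_memory_allocate(STARPU_MAIN_RAM, size, STARPU_MEMORY_WAIT);

void *buffer = malloc(size);
/* ... use the buffer ... */
free(buffer);

/* Account the memory back */
starpu_memory_deallocate(STARPU_MAIN_RAM, size);
\endcode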
starpu_memory_get_total() and starpu_memory_get_available()
can be used to get an estimation of how much memory is available.
starpu_memory_wait_available() can also be used to block until an
amount of memory becomes available, but it may be preferable to call

\code{.c}
starpu_memory_allocate(node, size, STARPU_MEMORY_WAIT);
\endcode

to reserve that amount immediately.
\section HowToReduceTheMemoryFootprintOfInternalDataStructures How To Reduce The Memory Footprint Of Internal Data Structures

It is possible to reduce the memory footprint of the task and data internal
structures of StarPU by describing the shape of your machine and/or your
application at the configure step.

To reduce the memory footprint of the data internal structures of StarPU, one
can set the
\ref enable-maxcpus "--enable-maxcpus",
\ref enable-maxnumanodes "--enable-maxnumanodes",
\ref enable-maxcudadev "--enable-maxcudadev",
\ref enable-maxopencldev "--enable-maxopencldev" and
\ref enable-maxnodes "--enable-maxnodes"
configure parameters to give StarPU
the architecture of the machine it will run on, thus tuning the size of the
structures to the machine.
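For instance, a sketch for a machine with 16 CPU cores and 4 CUDA devices (the
values are purely illustrative and should match your actual machine):

\code
./configure --enable-maxcpus=16 --enable-maxcudadev=4 --enable-maxnodes=6
\endcode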
To reduce the memory footprint of the task internal structures of StarPU, one
can set the \ref enable-maxbuffers "--enable-maxbuffers" configure parameter to
give StarPU the maximum number of buffers that a task can use during an
execution. For example, in the Cholesky factorization (a dense linear algebra
application), the GEMM task uses up to 3 buffers, so it is possible to set the
maximum number of task buffers to 3 to run a Cholesky factorization on StarPU.
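For the Cholesky example above, one would thus use:

\code
./configure --enable-maxbuffers=3
\endcode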
The size of the various structures of StarPU can be printed by
<c>tests/microbenchs/display_structures_size</c>.
It is also often useless to submit *all* the tasks at the same time. One can
make the starpu_task_submit() function block when a reasonable number of
tasks have been submitted, by setting the \ref STARPU_LIMIT_MIN_SUBMITTED_TASKS and
\ref STARPU_LIMIT_MAX_SUBMITTED_TASKS environment variables, for instance:

\code
export STARPU_LIMIT_MAX_SUBMITTED_TASKS=10000
export STARPU_LIMIT_MIN_SUBMITTED_TASKS=9000
\endcode

This makes StarPU block submission when 10000 tasks are submitted, and unblock
it when only 9000 tasks are still submitted, i.e. when 1000 tasks have
completed among the 10000 that were submitted when submission was blocked. Of
course this may reduce parallelism if the threshold is set too low. The precise
balance depends on the application task graph.
An idea of how much memory is used for tasks and data handles can be obtained
by setting the \ref STARPU_MAX_MEMORY_USE environment variable to <c>1</c>.
\section HowtoReuseMemory How To Reuse Memory

When your application needs to allocate more data than the available amount of
memory usable by StarPU (given by starpu_memory_get_available()), the
allocation cache system can reuse data buffers used by previously executed
tasks. For this system to work with MPI tasks, you need to submit tasks
progressively instead of as soon as possible, because in the case of MPI
receives, the allocation cache check for reusing data buffers is done at
submission time, not at execution time.
You have two options to control the task submission flow. The first one is to
control the number of submitted tasks during the whole execution. This can
be done either by setting the environment variables
\ref STARPU_LIMIT_MAX_SUBMITTED_TASKS and \ref STARPU_LIMIT_MIN_SUBMITTED_TASKS to
tell StarPU when to stop submitting tasks and when to wake up and submit tasks
again, or by explicitly calling starpu_task_wait_for_n_submitted() in
your application code for finer-grain control (for example, between two
iterations of a submission loop).
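A minimal sketch of the latter approach, assuming a hypothetical
<c>submit_one_task()</c> helper which builds and submits one task:

\code{.c}
unsigned i;
for (i = 0; i < ntasks; i++)
{
	submit_one_task(i);

	/* From time to time, block until at most 9000 tasks remain submitted */
	if (i % 1000 == 999)
		starpu_task_wait_for_n_submitted(9000);
}
\endcode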
The second option is to control the memory size of the allocation cache. This
can be done in the application by jointly using
starpu_memory_get_available() and starpu_memory_wait_available() to submit
tasks only when there is enough memory space to allocate the data needed by the
task, i.e. when enough data are available for reuse in the allocation cache.
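A minimal sketch of this approach, assuming hypothetical
<c>task_data_size()</c> and <c>submit_one_task()</c> helpers:

\code{.c}
/* Amount of memory the next task will need for its data */
size_t needed = task_data_size(i);

if (starpu_memory_get_available(STARPU_MAIN_RAM) < (starpu_ssize_t) needed)
	/* Block until the allocation cache has room for the task data */
	starpu_memory_wait_available(STARPU_MAIN_RAM, needed);

submit_one_task(i);
\endcode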
\section PerformanceModelCalibration Performance Model Calibration

Most schedulers are based on an estimation of codelet duration on each kind
of processing unit. For this to be possible, the application programmer needs
to configure a performance model for the codelets of the application (see
\ref PerformanceModelExample for instance). History-based performance models
use on-line calibration. StarPU will automatically calibrate codelets
which have never been calibrated yet, and save the result in
<c>$STARPU_HOME/.starpu/sampling/codelets</c>.
The models are indexed by machine name.
By default, StarPU stores separate performance models according to the hostname
of the system. To avoid having to calibrate performance models for each node
of a homogeneous cluster, for instance, the model can be shared by using
<c>export STARPU_HOSTNAME=some_global_name</c> (\ref STARPU_HOSTNAME), where
<c>some_global_name</c> is for instance the name of the cluster, which thus
overrides the hostname of the system.
By default, StarPU stores separate performance models for each GPU. To avoid
having to calibrate performance models for each GPU of a homogeneous set of
GPU devices, for instance, the model can be shared by setting
<c>export STARPU_PERF_MODEL_HOMOGENEOUS_CUDA=1</c>,
<c>export STARPU_PERF_MODEL_HOMOGENEOUS_OPENCL=1</c>,
<c>export STARPU_PERF_MODEL_HOMOGENEOUS_MIC=1</c>,
<c>export STARPU_PERF_MODEL_HOMOGENEOUS_MPI_MS=1</c>, or
<c>export STARPU_PERF_MODEL_HOMOGENEOUS_SCC=1</c> (depending on the device type).
To force continuing calibration,
use <c>export STARPU_CALIBRATE=1</c> (\ref STARPU_CALIBRATE). This may be necessary if your application
has not-so-stable performance. StarPU will force calibration (and thus ignore
the current result) until 10 (<c>_STARPU_CALIBRATION_MINIMUM</c>) measurements have been
made on each architecture, to avoid badly scheduling tasks just because the
first measurements were not so good. Details on the current performance model status
can be obtained with the tool <c>starpu_perfmodel_display</c>: the <c>-l</c>
option lists the available performance models, and the <c>-s</c> option permits
choosing the performance model to be displayed. The result looks like:
\verbatim
$ starpu_perfmodel_display -s starpu_slu_lu_model_11
performance model for cpu_impl_0
# hash      size     flops         mean          dev           n
914f3bef    1048576  0.000000e+00  2.503577e+04  1.982465e+02  8
3e921964    65536    0.000000e+00  5.527003e+02  1.848114e+01  7
e5a07e31    4096     0.000000e+00  1.717457e+01  5.190038e+00  14
...
\endverbatim
This shows that for the LU 11 kernel with a 1MiB matrix, the average
execution time on CPUs was about 25ms, with a 0.2ms standard deviation, over
8 samples. It is a good idea to check this before doing actual performance
measurements.
A graph can be drawn by using the tool <c>starpu_perfmodel_plot</c>:

\verbatim
$ starpu_perfmodel_plot -s starpu_slu_lu_model_11
4096 16384 65536 262144 1048576 4194304
$ gnuplot starpu_starpu_slu_lu_model_11.gp
$ gv starpu_starpu_slu_lu_model_11.eps
\endverbatim

\image html starpu_starpu_slu_lu_model_11.png
\image latex starpu_starpu_slu_lu_model_11.eps "" width=\textwidth
If the source code of a kernel is modified (e.g. for a performance
improvement), the calibration information is stale and should be dropped, so
as to re-calibrate from scratch. This can be done by using
<c>export STARPU_CALIBRATE=2</c> (\ref STARPU_CALIBRATE).

Note: history-based performance models get calibrated
only if a performance-model-based scheduler is chosen.
The history-based performance models can also be explicitly filled by the
application without execution, if e.g. the application already has a series of
measurements. This can be done by using starpu_perfmodel_update_history(),
for instance:
\code{.c}
static struct starpu_perfmodel perf_model =
{
	.type = STARPU_HISTORY_BASED,
	.symbol = "my_perfmodel",
};

struct starpu_codelet cl =
{
	.cuda_funcs = { cuda_func1, cuda_func2 },
	.nbuffers = 1,
	.modes = {STARPU_W},
	.model = &perf_model
};

void feed(void)
{
	struct my_measure *measure;
	struct starpu_task task;
	starpu_task_init(&task);
	task.cl = &cl;

	for (measure = &measures[0]; measure < &measures[last]; measure++)
	{
		starpu_data_handle_t handle;
		starpu_vector_data_register(&handle, -1, 0, measure->size, sizeof(float));
		task.handles[0] = handle;
		starpu_perfmodel_update_history(&perf_model, &task,
						STARPU_CUDA_DEFAULT + measure->cudadev, 0,
						measure->implementation, measure->time);
		starpu_task_clean(&task);
		starpu_data_unregister(handle);
	}
}
\endcode
Measurements have to be provided in milliseconds for completion time models,
and in Joules for energy consumption models.
\section Profiling Profiling

A quick view of how many tasks each worker has executed can be obtained by setting
<c>export STARPU_WORKER_STATS=1</c> (\ref STARPU_WORKER_STATS). This is a convenient way to check that
execution did happen on accelerators, without penalizing performance with
the profiling overhead.

A quick view of how many data transfers have been issued can be obtained by setting
<c>export STARPU_BUS_STATS=1</c> (\ref STARPU_BUS_STATS).

More detailed profiling information can be enabled by using <c>export STARPU_PROFILING=1</c> (\ref STARPU_PROFILING)
or by calling starpu_profiling_status_set() from the source code.
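For instance, enabling profiling programmatically is equivalent to setting
<c>STARPU_PROFILING=1</c>:

\code{.c}
starpu_profiling_status_set(STARPU_PROFILING_ENABLE);
\endcode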
Statistics on the execution can then be obtained by using <c>export
STARPU_BUS_STATS=1</c> and <c>export STARPU_WORKER_STATS=1</c>.

More details on performance feedback are provided in the next chapter.
\section OverheadProfiling Overhead Profiling

\ref OfflinePerformanceTools can already provide an idea of to what extent,
and in which parts, StarPU brings overhead on the execution time. To get a
more precise analysis of the parts of StarPU which bring the most overhead,
<c>gprof</c> can be used.

First, recompile and reinstall StarPU with <c>gprof</c> support:

\code
./configure --enable-perf-debug --disable-shared --disable-build-tests --disable-build-examples
\endcode
Make sure not to leave a dynamic version of StarPU in the target path: remove
any remaining <c>libstarpu-*.so</c> file.

Then relink your application with the static StarPU library, and make sure
that running <c>ldd</c> on your application does not mention any libstarpu
(i.e. that it is really statically linked):

\code
gcc test.c -o test $(pkg-config --cflags starpu-1.3) $(pkg-config --libs starpu-1.3)
\endcode
Now you can run your application; a <c>gmon.out</c> file should appear in the
current directory. You can process it by running <c>gprof</c> on your
application:

\code
gprof ./test
\endcode

This will dump an analysis of the time spent in StarPU functions.

*/