
/* StarPU --- Runtime system for heterogeneous multicore architectures.
 *
 * Copyright (C) 2009-2020 Université de Bordeaux, CNRS (LaBRI UMR 5800), Inria
 *
 * StarPU is free software; you can redistribute it and/or modify
 * it under the terms of the GNU Lesser General Public License as published by
 * the Free Software Foundation; either version 2.1 of the License, or (at
 * your option) any later version.
 *
 * StarPU is distributed in the hope that it will be useful, but
 * WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
 *
 * See the GNU Lesser General Public License in COPYING.LGPL for more details.
 */
/*! \page CheckListWhenPerformanceAreNotThere Check List When Performance Is Not There

To achieve good performance, we give below a list of features which should be
checked. For a start, you can use \ref OfflinePerformanceTools to get a Gantt
chart which will show roughly where time is spent, and focus correspondingly.

\section CheckTaskSize Check Task Size

Make sure that your tasks are not too small, as the StarPU runtime overhead
is not completely zero. As explained in \ref TaskSizeOverhead, you can
run the script \c tasks_size_overhead.sh to get an idea, on your own system,
of the scalability of tasks depending on their duration (in µs).

Typically, 10µs-ish tasks are definitely too small: the CUDA overhead alone is
much bigger than this. 1ms-ish tasks may be a good start, but will not
necessarily scale to many dozens of cores, so it is better to aim for
10ms-ish tasks.

Task durations can easily be observed when performance models are defined (see
\ref PerformanceModelExample) by using the tools <c>starpu_perfmodel_plot</c> or
<c>starpu_perfmodel_display</c> (see \ref PerformanceOfCodelets).

When using parallel tasks, the problem is even worse, since StarPU has to
synchronize the execution of the tasks.

\section ConfigurationImprovePerformance Configuration Which May Improve Performance

The \c configure option \ref enable-fast "--enable-fast" disables all
assertions. This makes StarPU more performant for really small tasks by
disabling all sanity checks. Only use this for measurements and production,
not for development, since it drops all basic checks.
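
For instance, such a production build could be configured as follows (a
minimal sketch; the installation prefix is illustrative):

\code
../configure --enable-fast --prefix=$HOME/starpu-fast
\endcode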

\section DataRelatedFeaturesToImprovePerformance Data Related Features Which May Improve Performance

link to \ref DataManagement

link to \ref DataPrefetch

\section TaskRelatedFeaturesToImprovePerformance Task Related Features Which May Improve Performance

link to \ref TaskGranularity

link to \ref TaskSubmission

link to \ref TaskPriorities

\section SchedulingRelatedFeaturesToImprovePerformance Scheduling Related Features Which May Improve Performance

link to \ref TaskSchedulingPolicy

link to \ref TaskDistributionVsDataTransfer

link to \ref Energy-basedScheduling

link to \ref StaticScheduling

\section CUDA-specificOptimizations CUDA-specific Optimizations

For proper overlapping of asynchronous GPU data transfers, data has to be pinned
by CUDA. Data allocated with starpu_malloc() is always properly pinned. If the
application registers to StarPU some data which has not been allocated with
starpu_malloc(), starpu_memory_pin() should be called to pin the data memory.

Due to CUDA limitations, StarPU will have a hard time overlapping its own
communications and the codelet computations if the application does not use a
dedicated CUDA stream for its computations instead of the default stream,
which synchronizes all operations of the GPU. The function
starpu_cuda_get_local_stream() returns a stream which can be used by all CUDA codelet
operations to avoid this issue. For instance:

\code{.c}
func <<<grid,block,0,starpu_cuda_get_local_stream()>>> (foo, bar);
cudaError_t status = cudaGetLastError();
if (status != cudaSuccess) STARPU_CUDA_REPORT_ERROR(status);
cudaStreamSynchronize(starpu_cuda_get_local_stream());
\endcode

The same holds for \c cudaMemcpyAsync(), etc.: for each CUDA operation, one
needs to use a version that takes a stream parameter.

Unfortunately, some CUDA libraries do not have stream variants of their
kernels. This seriously lowers the potential for overlapping.

If some CUDA calls are made without specifying this local stream, the
synchronization needs to be made explicit with cudaDeviceSynchronize() around these
calls, to make sure that they get properly synchronized with the calls using
the local stream. Notably, \c cudaMemcpy() and \c cudaMemset() are actually
asynchronous and need such explicit synchronization! Use \c cudaMemcpyAsync() and
\c cudaMemsetAsync() instead.

Calling starpu_cublas_init() will ensure that StarPU properly calls the
CUBLAS library initialization functions. Some libraries like Magma may however
change the current stream of CUBLAS v1; one then has to call
<c>cublasSetKernelStream(</c>starpu_cuda_get_local_stream()<c>)</c> at
the beginning of the codelet to make sure that CUBLAS is really using the proper
stream. When using CUBLAS v2, starpu_cublas_get_local_handle() can be called to queue CUBLAS
kernels with the proper configuration.

Similarly, calling starpu_cusparse_init() makes StarPU create CUSPARSE handles
on each CUDA device; starpu_cusparse_get_local_handle() can then be used to
queue CUSPARSE kernels with the proper configuration.
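
For instance, a CUBLAS v2 codelet function could look as follows (a minimal
sketch, assuming square tiles registered with the matrix interface and
identical leading dimensions; the function and variable names are
illustrative):

\code{.c}
#include <starpu.h>
#include <cublas_v2.h>
#include <starpu_cublas_v2.h>

void gemm_cuda_func(void *buffers[], void *cl_arg)
{
	float *A = (float *)STARPU_MATRIX_GET_PTR(buffers[0]);
	float *B = (float *)STARPU_MATRIX_GET_PTR(buffers[1]);
	float *C = (float *)STARPU_MATRIX_GET_PTR(buffers[2]);
	unsigned n = STARPU_MATRIX_GET_NX(buffers[0]);
	unsigned ld = STARPU_MATRIX_GET_LD(buffers[0]);
	const float alpha = 1.0f, beta = 0.0f;

	/* The handle returned by StarPU is already associated with the
	 * local stream, so the kernel is queued on the proper stream. */
	cublasSgemm(starpu_cublas_get_local_handle(), CUBLAS_OP_N, CUBLAS_OP_N,
		    n, n, n, &alpha, A, ld, B, ld, &beta, C, ld);
}
\endcode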

If the kernel can be made to only use this local stream or other self-allocated
streams, i.e. if the whole kernel submission can be made asynchronous, then
one should enable asynchronous execution of the kernel. This means setting
the flag ::STARPU_CUDA_ASYNC in the corresponding field starpu_codelet::cuda_flags, and dropping the
<c>cudaStreamSynchronize()</c> call at the end of the <c>cuda_func</c> function, so that it
returns immediately after having queued the kernel to the local stream. That way, StarPU will be
able to submit and complete data transfers while kernels are executing, instead of only at each
kernel submission. The kernel just has to make sure that StarPU can use the
local stream to synchronize with the kernel startup and completion.
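
A minimal sketch of such a codelet declaration (the function name is
illustrative):

\code{.c}
struct starpu_codelet cl =
{
	.cuda_funcs = { my_async_cuda_func },
	.cuda_flags = { STARPU_CUDA_ASYNC },
	.nbuffers = 1,
	.modes = { STARPU_RW },
};
\endcode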

If the kernel uses its own non-default stream, one can synchronize this stream
with the StarPU-provided stream this way:

\code{.c}
cudaEvent_t event;
call_kernel_with_its_own_stream();
cudaEventCreateWithFlags(&event, cudaEventDisableTiming);
cudaEventRecord(event, get_kernel_stream());
cudaStreamWaitEvent(starpu_cuda_get_local_stream(), event, 0);
cudaEventDestroy(event);
\endcode

This code makes the StarPU-provided stream wait for a new event, which will be
triggered by the completion of the kernel.

Using the flag ::STARPU_CUDA_ASYNC also makes it possible to enable concurrent kernel
execution, on cards which support it (Kepler and later, notably). This is
enabled by setting the environment variable \ref STARPU_NWORKER_PER_CUDA to the
number of kernels to be executed concurrently. This is useful when kernels are
small and do not feed the whole GPU with threads to run.

Concerning memory allocation, you should really not use \c cudaMalloc()/\c cudaFree()
within the kernel, since \c cudaFree() introduces a lot of synchronization
within CUDA itself. You should instead add a parameter to the codelet with the
::STARPU_SCRATCH access mode. You can then pass to the task a handle registered
with the desired size but with a \c NULL pointer; the handle can even be
shared between tasks. StarPU will allocate per-task data on the fly before task
execution, and reuse the allocated data between tasks.

See <c>examples/pi/pi_redux.c</c> for an example of use.
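
A minimal sketch of registering such a scratch buffer (the size and names are
illustrative):

\code{.c}
starpu_data_handle_t scratch_handle;
/* No home node (-1) and a NULL pointer: StarPU allocates the buffer on the
 * fly on whatever device the task runs on, and can reuse it between tasks. */
starpu_vector_data_register(&scratch_handle, -1, (uintptr_t)NULL, 1024, sizeof(float));

struct starpu_codelet cl =
{
	.cuda_funcs = { my_cuda_func },
	.nbuffers = 2,
	.modes = { STARPU_RW, STARPU_SCRATCH },
};
\endcode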

\section OpenCL-specificOptimizations OpenCL-specific Optimizations

If the kernel can be made to only use the StarPU-provided command queue or other self-allocated
queues, i.e. if the whole kernel submission can be made asynchronous, then
one should enable asynchronous execution of the kernel. This means setting
the flag ::STARPU_OPENCL_ASYNC in the corresponding field starpu_codelet::opencl_flags and dropping the
<c>clFinish()</c> and starpu_opencl_collect_stats() calls at the end of the kernel, so
that it returns immediately after having queued the kernel to the provided queue.
That way, StarPU will be able to submit and complete data transfers while kernels are executing, instead of
only at each kernel submission. The kernel just has to make sure
that StarPU can use the command queue it has provided to synchronize with the
kernel startup and completion.
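
As for CUDA, a minimal sketch of the codelet declaration (the function name is
illustrative):

\code{.c}
struct starpu_codelet cl =
{
	.opencl_funcs = { my_async_opencl_func },
	.opencl_flags = { STARPU_OPENCL_ASYNC },
	.nbuffers = 1,
	.modes = { STARPU_RW },
};
\endcode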

\section DetectionStuckConditions Detecting Stuck Conditions

It may happen that for some reason, StarPU does not make progress for a long
period of time. This is sometimes due to contention inside StarPU, but
sometimes to external causes, such as a stuck MPI or CUDA driver.

<c>export STARPU_WATCHDOG_TIMEOUT=10000</c> (\ref STARPU_WATCHDOG_TIMEOUT)
makes StarPU print an error message whenever it has not terminated
any task for 10ms (the value is expressed in µs), but lets the application
continue normally. In addition,
<c>export STARPU_WATCHDOG_CRASH=1</c> (\ref STARPU_WATCHDOG_CRASH)
raises <c>SIGABRT</c> in this condition, thus making it possible to catch the
situation in \c gdb.

It can also be useful to type <c>handle SIGABRT nopass</c> in <c>gdb</c> to be able to let
the process continue, after inspecting the state of the process.

\section HowToLimitMemoryPerNode How to Limit Memory Used By StarPU And Cache Buffer Allocations

By default, StarPU makes sure to use at most 90% of the memory of GPU devices,
moving data in and out of the device as appropriate, as well as using
prefetch and writeback optimizations.

The environment variables \ref STARPU_LIMIT_CUDA_MEM, \ref STARPU_LIMIT_CUDA_devid_MEM,
\ref STARPU_LIMIT_OPENCL_MEM, and \ref STARPU_LIMIT_OPENCL_devid_MEM
can be used to control how much (in MiB) of the GPU device memory
should be used at most by StarPU (the default value is to use 90% of the
available memory).

By default, the usage of the main memory is not limited, as the
default mechanisms do not provide means to evict main memory when it
gets too tight. This also means that by default StarPU will not cache buffer
allocations in main memory, since it does not know how much of the
system memory it can afford.

The environment variable \ref STARPU_LIMIT_CPU_MEM can be used to
specify how much (in MiB) of the main memory should be used at most by
StarPU for buffer allocations. This way, StarPU will be able to
cache buffer allocations (which can be a real benefit if a lot of buffers are
involved, or if allocation fragmentation can become a problem), and when using
\ref OutOfCore, StarPU will know when it should evict data out to the disk.

It should be noted that by default only buffer allocations automatically
done by StarPU are accounted here, i.e. allocations performed through
starpu_malloc_on_node(), which are used by the data interfaces
(matrix, vector, etc.). This does not include allocations performed by
the application through e.g. malloc(). It does not include allocations
performed through starpu_malloc() either; only allocations
performed explicitly with the \ref STARPU_MALLOC_COUNT flag, i.e. by calling

\code{.c}
starpu_malloc_flags(&ptr, size, STARPU_MALLOC_COUNT);
\endcode

are taken into account.

If the application wants to make StarPU aware of its own allocations, so that StarPU
knows precisely how much data is allocated, and thus when to evict allocation
caches or data out to the disk, starpu_memory_allocate() can be used to
specify an amount of memory to be accounted for; starpu_memory_deallocate()
can be used to account freed memory back. These can for instance be used by data
interfaces with dynamic data buffers: instead of using starpu_malloc_on_node(),
they would dynamically allocate data with \c malloc()/\c realloc(), and notify StarPU of
the delta by calling starpu_memory_allocate() and starpu_memory_deallocate().
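
A minimal sketch of such accounting (the node and size are illustrative):

\code{.c}
size_t size = 16*1024*1024;
float *buffer = malloc(size);
/* Tell StarPU that the application now uses this much main memory */
starpu_memory_allocate(STARPU_MAIN_RAM, size, STARPU_MEMORY_OVERFLOW);
/* ... use the buffer ... */
free(buffer);
/* Account the freed memory back */
starpu_memory_deallocate(STARPU_MAIN_RAM, size);
\endcode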

starpu_memory_get_total() and starpu_memory_get_available()
can be used to get an estimation of how much memory is available.
starpu_memory_wait_available() can also be used to block until an
amount of memory becomes available, but it may be preferable to call

\code{.c}
starpu_memory_allocate(node, size, STARPU_MEMORY_WAIT);
\endcode

to reserve this amount immediately.

\section HowToReduceTheMemoryFootprintOfInternalDataStructures How To Reduce The Memory Footprint Of Internal Data Structures

It is possible to reduce the memory footprint of the task and data internal
structures of StarPU by describing the shape of your machine and/or your
application when calling \c configure.

To reduce the memory footprint of the data internal structures of StarPU, one
can set the
\ref enable-maxcpus "--enable-maxcpus",
\ref enable-maxnumanodes "--enable-maxnumanodes",
\ref enable-maxcudadev "--enable-maxcudadev",
\ref enable-maxopencldev "--enable-maxopencldev" and
\ref enable-maxnodes "--enable-maxnodes"
\c configure parameters to give StarPU
the architecture of the machine it will run on, thus tuning the size of the
structures to the machine.

To reduce the memory footprint of the task internal structures of StarPU, one
can set the \ref enable-maxbuffers "--enable-maxbuffers" \c configure parameter to
give StarPU the maximum number of buffers that a task can use during an
execution. For example, in the Cholesky factorization (a dense linear algebra
application), the GEMM task uses up to 3 buffers, so it is possible to set the
maximum number of task buffers to 3 to run a Cholesky factorization on StarPU.

The size of the various structures of StarPU can be printed by
<c>tests/microbenchs/display_structures_size</c>.
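
For instance, for a machine with at most 64 CPU cores and 4 CUDA devices, and
tasks using at most 3 buffers, one could configure as follows (an illustrative
sketch; adapt the values to your machine and application):

\code
../configure --enable-maxcpus=64 --enable-maxcudadev=4 --enable-maxbuffers=3
\endcode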

It is also often useless to submit *all* the tasks at the same time.
Task submission can be blocked when a reasonable given number of
tasks have been submitted, by setting the environment variables \ref
STARPU_LIMIT_MIN_SUBMITTED_TASKS and \ref STARPU_LIMIT_MAX_SUBMITTED_TASKS.

\code
export STARPU_LIMIT_MAX_SUBMITTED_TASKS=10000
export STARPU_LIMIT_MIN_SUBMITTED_TASKS=9000
\endcode

will make StarPU block submission when 10000 tasks are submitted, and unblock
submission when only 9000 tasks are still submitted, i.e. 1000 tasks have
completed among the 10000 which were submitted when submission was blocked. Of
course this may reduce parallelism if the threshold is set too low. The precise
balance depends on the application task graph.

An idea of how much memory is used for tasks and data handles can be obtained by
setting the environment variable \ref STARPU_MAX_MEMORY_USE to <c>1</c>.

\section HowtoReuseMemory How To Reuse Memory

When your application needs to allocate more data than the available amount of
memory usable by StarPU (given by starpu_memory_get_available()), the
allocation cache system can reuse data buffers used by previously executed
tasks. For this system to work with MPI tasks, you need to submit tasks progressively instead
of as soon as possible, because in the case of MPI receives, the allocation cache check for reusing data
buffers will be done at submission time, not at execution time.

There are two options to control the task submission flow. The first one is
controlling the number of submitted tasks during the whole execution. This can
be done either by setting the environment variables
\ref STARPU_LIMIT_MAX_SUBMITTED_TASKS and \ref STARPU_LIMIT_MIN_SUBMITTED_TASKS to
tell StarPU when to stop submitting tasks and when to wake up and submit tasks
again, or by explicitly calling starpu_task_wait_for_n_submitted() in
your application code for finer-grain control (for example, between two
iterations of a submission loop), as in the sketch below.
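
A minimal sketch of such a submission loop (<c>submit_one_task()</c> is a
hypothetical helper which creates and submits one task):

\code{.c}
unsigned i;
for (i = 0; i < ntasks; i++)
{
	submit_one_task(i); /* hypothetical: creates and submits one task */
	/* Every 1000 tasks, block until fewer than 500 submitted tasks
	 * remain uncompleted */
	if (i % 1000 == 999)
		starpu_task_wait_for_n_submitted(500);
}
\endcode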

The second option is to control the memory size of the allocation cache. This
can be done in the application by jointly using
starpu_memory_get_available() and starpu_memory_wait_available() to submit
tasks only when there is enough memory space to allocate the data needed by the
task, i.e. when enough data are available for reuse in the allocation cache.
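
A minimal sketch of this second approach (the per-task memory footprint is
illustrative):

\code{.c}
/* Hypothetical footprint: the task allocates three blocks of block_size bytes */
size_t needed = 3 * block_size;
if (starpu_memory_get_available(STARPU_MAIN_RAM) < (starpu_ssize_t) needed)
	/* Block until enough memory was released, e.g. by the allocation cache */
	starpu_memory_wait_available(STARPU_MAIN_RAM, needed);
submit_one_task(i); /* hypothetical, as above */
\endcode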

\section PerformanceModelCalibration Performance Model Calibration

Most schedulers are based on an estimation of codelet duration on each kind
of processing unit. For this to be possible, the application programmer needs
to configure a performance model for the codelets of the application (see
\ref PerformanceModelExample for instance). History-based performance models
use on-line calibration. StarPU will automatically calibrate codelets
which have never been calibrated yet, and save the result in
<c>$STARPU_HOME/.starpu/sampling/codelets</c>.
The models are indexed by machine name.

By default, StarPU stores separate performance models according to the hostname
of the system. To avoid having to calibrate performance models for each node
of a homogeneous cluster for instance, the model can be shared by using
<c>export STARPU_HOSTNAME=some_global_name</c> (\ref STARPU_HOSTNAME), where
<c>some_global_name</c> is for instance the name of the cluster; this thus
overrides the hostname of the system.

By default, StarPU stores separate performance models for each GPU. To avoid
having to calibrate performance models for each GPU of a homogeneous set of GPU
devices for instance, the model can be shared by setting
<c>export STARPU_PERF_MODEL_HOMOGENEOUS_CUDA=1</c> (\ref STARPU_PERF_MODEL_HOMOGENEOUS_CUDA),
<c>export STARPU_PERF_MODEL_HOMOGENEOUS_OPENCL=1</c> (\ref STARPU_PERF_MODEL_HOMOGENEOUS_OPENCL),
<c>export STARPU_PERF_MODEL_HOMOGENEOUS_MIC=1</c> (\ref STARPU_PERF_MODEL_HOMOGENEOUS_MIC), or
<c>export STARPU_PERF_MODEL_HOMOGENEOUS_MPI_MS=1</c> (\ref STARPU_PERF_MODEL_HOMOGENEOUS_MPI_MS), depending on your device type.

To force continuing calibration,
use <c>export STARPU_CALIBRATE=1</c> (\ref STARPU_CALIBRATE). This may be necessary if your application
has not-so-stable performance. StarPU will force calibration (and thus ignore
the current result) until 10 (<c>_STARPU_CALIBRATION_MINIMUM</c>) measurements have been
made on each architecture, to avoid bad scheduling decisions just because the
first measurements were not so good.

Note that StarPU will not record the very first measurement for a given codelet
and a given size, because it would most often be perturbed by computation library
loading or initialization. StarPU will also throw measurements away if it
notices that, after computing an average execution time, most
subsequent tasks have an execution time largely outside the computed average
("Too big deviation for model..." warning messages). Looking at the details
of these messages and their reported measurements can highlight that your
computation library really has non-stable measurements, which is probably an
indication of an issue in the computation library, or in the execution environment
(e.g. rogue daemons).

Details on the current performance model status
can be obtained with the tool <c>starpu_perfmodel_display</c>: the
option <c>-l</c> lists the available performance models, and the
option <c>-s</c> makes it possible to choose the performance model to be
displayed. The result looks like:

\verbatim
$ starpu_perfmodel_display -s starpu_slu_lu_model_11
performance model for cpu_impl_0
# hash      size       flops          mean (µs)      dev (µs)       n
914f3bef    1048576    0.000000e+00   2.503577e+04   1.982465e+02   8
3e921964    65536      0.000000e+00   5.527003e+02   1.848114e+01   7
e5a07e31    4096       0.000000e+00   1.717457e+01   5.190038e+00   14
...
\endverbatim

which shows that for the LU 11 kernel with a 1MiB matrix, the average
execution time on CPUs was about 25ms, with a 0.2ms standard deviation, over
8 samples. It is a good idea to check this before doing actual performance
measurements.

A graph can be drawn by using the tool <c>starpu_perfmodel_plot</c>:

\verbatim
$ starpu_perfmodel_plot -s starpu_slu_lu_model_11
4096 16384 65536 262144 1048576 4194304
$ gnuplot starpu_starpu_slu_lu_model_11.gp
$ gv starpu_starpu_slu_lu_model_11.eps
\endverbatim

\image html starpu_starpu_slu_lu_model_11.png
\image latex starpu_starpu_slu_lu_model_11.eps "" width=\textwidth

If a kernel source code was modified (e.g. for a performance improvement), the
calibration information is stale and should be dropped, so as to re-calibrate
from the start. This can be done by using <c>export STARPU_CALIBRATE=2</c> (\ref STARPU_CALIBRATE).

Note: history-based performance models get calibrated
only if a performance-model-based scheduler is chosen.

The history-based performance models can also be explicitly filled by the
application without execution, if e.g. the application already has a series of
measurements. This can be done by using starpu_perfmodel_update_history(),
for instance:

\code{.c}
static struct starpu_perfmodel perf_model =
{
	.type = STARPU_HISTORY_BASED,
	.symbol = "my_perfmodel",
};

struct starpu_codelet cl =
{
	.cuda_funcs = { cuda_func1, cuda_func2 },
	.nbuffers = 1,
	.modes = {STARPU_W},
	.model = &perf_model
};

void feed(void)
{
	struct my_measure *measure;
	struct starpu_task task;
	starpu_task_init(&task);
	task.cl = &cl;

	for (measure = &measures[0]; measure < &measures[last]; measure++)
	{
		starpu_data_handle_t handle;
		starpu_vector_data_register(&handle, -1, 0, measure->size, sizeof(float));
		task.handles[0] = handle;
		starpu_perfmodel_update_history(&perf_model, &task, STARPU_CUDA_DEFAULT + measure->cudadev, 0, measure->implementation, measure->time);
		starpu_task_clean(&task);
		starpu_data_unregister(handle);
	}
}
\endcode

Measurements have to be provided in milliseconds for the completion time models,
and in Joules for the energy consumption models.

\section Profiling Profiling

A quick view of how many tasks each worker has executed can be obtained by setting
<c>export STARPU_WORKER_STATS=1</c> (\ref STARPU_WORKER_STATS). This is a convenient way to check that
execution did happen on accelerators, without penalizing performance with
the profiling overhead. \ref STARPU_WORKER_STATS_FILE can be defined
to specify a filename in which to display statistics; by default,
statistics are printed on the standard error stream.

A quick view of how many data transfers have been issued can be obtained by setting
<c>export STARPU_BUS_STATS=1</c> (\ref STARPU_BUS_STATS). \ref
STARPU_BUS_STATS_FILE can be defined to specify a filename in which to
display statistics; by default, statistics are printed on the standard error stream.

More detailed profiling information can be enabled by using <c>export STARPU_PROFILING=1</c> (\ref STARPU_PROFILING)
or by calling starpu_profiling_status_set() from the source code.
Statistics on the execution can then be obtained by using <c>export
STARPU_BUS_STATS=1</c> and <c>export STARPU_WORKER_STATS=1</c>.
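
Profiling can for instance be enabled programmatically as follows (a minimal
sketch):

\code{.c}
/* Enable profiling for the tasks submitted from now on */
starpu_profiling_status_set(STARPU_PROFILING_ENABLE);
\endcode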

More details on performance feedback are provided in the next chapter.

\section OverheadProfiling Overhead Profiling

\ref OfflinePerformanceTools can already provide an idea of to what extent and
in which parts StarPU brings overhead on the execution time. To get a more precise
analysis of which parts of StarPU bring the most overhead, <c>gprof</c> can be used.

First, recompile and reinstall StarPU with <c>gprof</c> support:

\code
../configure --enable-perf-debug --disable-shared --disable-build-tests --disable-build-examples
\endcode

Make sure not to leave a dynamic version of StarPU in the target path: remove
any remaining <c>libstarpu-*.so</c>.

Then relink your application with the static StarPU library, and make sure that
running <c>ldd</c> on your application does not mention any \c libstarpu
(i.e. it is really statically linked):

\code
gcc test.c -o test $(pkg-config --cflags starpu-1.3) $(pkg-config --libs starpu-1.3)
\endcode

Now you can run your application. This will create a file
<c>gmon.out</c> in the current directory, which can be processed by
running <c>gprof</c> on your application:

\code
gprof ./test
\endcode

This will dump an analysis of the time spent in StarPU functions.

*/