/*
 * This file is part of the StarPU Handbook.
 * Copyright (C) 2009--2011 Université de Bordeaux 1
 * Copyright (C) 2010, 2011, 2012, 2013 Centre National de la Recherche Scientifique
 * Copyright (C) 2011, 2012 Institut National de Recherche en Informatique et Automatique
 * See the file version.doxy for copying conditions.
 */

/*! \page performanceFeedback Performance Feedback
\section Using_the_Temanejo_task_debugger Using the Temanejo task debugger

StarPU can connect to Temanejo (see
http://www.hlrs.de/temanejo) to permit
nice visual task debugging. To do so, build Temanejo's <c>libayudame.so</c>,
install <c>Ayudame.h</c> to e.g. <c>/usr/local/include</c>, apply
<c>tools/patch-ayudame</c> to it to fix the C build, re-run <c>./configure</c>,
make sure that it found it, and rebuild StarPU. Run the Temanejo GUI, and give
it the path to your application, any options you want to pass it, and the path
to <c>libayudame.so</c>.

Make sure to specify at least the same number of CPUs in the dialog box as your
machine has, otherwise an error will happen during execution. Future versions
of Temanejo should be able to tell StarPU the number of CPUs to use.

Tag numbers have to be below <c>4000000000000000000ULL</c> to be usable with
Temanejo (so as to distinguish them from tasks).
\section On-line_performance_feedback On-line performance feedback

\subsection Enabling_on-line_performance_monitoring Enabling on-line performance monitoring

In order to enable online performance monitoring, the application can call
<c>starpu_profiling_status_set(STARPU_PROFILING_ENABLE)</c>. It is possible to
detect whether monitoring is already enabled or not by calling
starpu_profiling_status_get(). Enabling monitoring also reinitializes all
previously collected feedback. The <c>STARPU_PROFILING</c> environment variable
can also be set to 1 to achieve the same effect.

Likewise, performance monitoring is stopped by calling
<c>starpu_profiling_status_set(STARPU_PROFILING_DISABLE)</c>. Note that this
does not reset the performance counters, so that the application may consult
them later on.

More details about the performance monitoring API are available in section
@ref{Profiling API}.
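For instance, profiling can be turned on without touching the application code at all, through the environment variable mentioned above (<c>./app</c> is a placeholder for your own binary):

```shell
# Equivalent to the application itself calling
# starpu_profiling_status_set(STARPU_PROFILING_ENABLE):
STARPU_PROFILING=1 ./app
```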
\subsection Per-Task_feedback Per-task feedback

If profiling is enabled, a pointer to a <c>struct starpu_profiling_task_info</c>
is put in the <c>.profiling_info</c> field of the <c>starpu_task</c>
structure when a task terminates.
This structure is automatically destroyed when the task structure is destroyed,
either automatically or by calling starpu_task_destroy().

The <c>struct starpu_profiling_task_info</c> indicates the date when the
task was submitted (<c>submit_time</c>), started (<c>start_time</c>), and
terminated (<c>end_time</c>), relative to the initialization of
StarPU with starpu_init(). It also specifies the identifier of the worker
that has executed the task (<c>workerid</c>).
These dates are stored as <c>timespec</c> structures, which the user may convert
into microseconds using the starpu_timing_timespec_to_us() helper
function.
It is worth noting that the application may directly access this structure from
the callback executed at the end of the task. The <c>starpu_task</c> structure
associated with the callback currently being executed is indeed accessible with
the starpu_task_get_current() function.
\subsection Per-codelet_feedback Per-codelet feedback

The <c>per_worker_stats</c> field of the <c>struct starpu_codelet</c> structure is
an array of counters. The i-th entry of the array is incremented every time a
task implementing the codelet is executed on the i-th worker.
This array is not reinitialized when profiling is enabled or disabled.
\subsection Per-worker_feedback Per-worker feedback

The second argument returned by the starpu_profiling_worker_get_info()
function is a <c>struct starpu_profiling_worker_info</c> that gives
statistics about the specified worker. This structure specifies when StarPU
started collecting profiling information for that worker (<c>start_time</c>),
the duration of the profiling measurement interval (<c>total_time</c>), the
time spent executing kernels (<c>executing_time</c>), the time spent sleeping
because there is no task to execute at all (<c>sleeping_time</c>), and the
number of tasks that were executed while profiling was enabled.
These values give an estimation of the proportion of time spent doing real work,
and the time spent either sleeping because there are not enough executable
tasks, or simply wasted in pure StarPU overhead.

Calling starpu_profiling_worker_get_info() resets the profiling
information associated with the worker.
When an FxT trace is generated (see \ref Generating_traces_with_FxT), it is also
possible to use the <c>starpu_workers_activity</c> script (see \ref Monitoring_activity) to
generate a graphic showing the evolution of these values over time, for
the different workers.
\subsection Bus-related_feedback Bus-related feedback

TODO: add STARPU_BUS_STATS

\internal
how to enable/disable performance monitoring
what kind of information do we get ?
\endinternal

The bus speed measured by StarPU can be displayed by using the
<c>starpu_machine_display</c> tool, for instance:

\verbatim
StarPU has found:
        3 CUDA devices
                CUDA 0 (Tesla C2050 02:00.0)
                CUDA 1 (Tesla C2050 03:00.0)
                CUDA 2 (Tesla C2050 84:00.0)
from    to RAM          to CUDA 0       to CUDA 1       to CUDA 2
RAM     0.000000        5176.530428     5176.492994     5191.710722
CUDA 0  4523.732446     0.000000        2414.074751     2417.379201
CUDA 1  4523.718152     2414.078822     0.000000        2417.375119
CUDA 2  4534.229519     2417.069025     2417.060863     0.000000
\endverbatim
\subsection StarPU-Top_interface StarPU-Top interface

StarPU-Top is an interface which remotely displays the on-line state of a StarPU
application and permits the user to change parameters on the fly.

Variables to be monitored can be registered by calling the
starpu_top_add_data_boolean(), starpu_top_add_data_integer(),
starpu_top_add_data_float() functions, e.g.:

\code{.c}
starpu_top_data *data = starpu_top_add_data_integer("mynum", 0, 100, 1);
\endcode

The application should then call starpu_top_init_and_wait() to give its name
and wait for StarPU-Top to get a start request from the user. The name is used
by StarPU-Top to quickly reload a previously-saved layout of parameter display.

\code{.c}
starpu_top_init_and_wait("the application");
\endcode

The new values can then be provided thanks to
starpu_top_update_data_boolean(), starpu_top_update_data_integer(),
starpu_top_update_data_float(), e.g.:

\code{.c}
starpu_top_update_data_integer(data, mynum);
\endcode

Updatable parameters can be registered thanks to
starpu_top_register_parameter_boolean(), starpu_top_register_parameter_integer(),
starpu_top_register_parameter_float(), e.g.:

\code{.c}
float alpha;
starpu_top_register_parameter_float("alpha", &alpha, 0, 10, modif_hook);
\endcode

<c>modif_hook</c> is a function which will be called when the parameter is
modified; it can for instance print the new value:

\code{.c}
void modif_hook(struct starpu_top_param *d)
{
	fprintf(stderr, "%s has been modified: %f\n", d->name, alpha);
}
\endcode

Task schedulers should notify StarPU-Top when they have decided when a task will
be scheduled, so that it can show it in its Gantt chart, for instance:

\code{.c}
starpu_top_task_prevision(task, workerid, begin, end);
\endcode
Starting StarPU-Top (via the <c>starpu_top</c> binary) and the application can
be done in two ways:

<ul>
<li> The application is started by hand on some machine (and thus already
waiting for the start event). In the Preference dialog of StarPU-Top, the SSH
checkbox should be unchecked, and the hostname and port (default is 2011) on
which the application is already running should be specified. Clicking on the
connection button will thus connect to the already-running application.
</li>
<li> StarPU-Top is started first, and clicking on the connection button will
start the application itself (possibly on a remote machine). The SSH checkbox
should be checked, and a command line provided, e.g.:

\verbatim
$ ssh myserver STARPU_SCHED=dmda ./application
\endverbatim

If port 2011 of the remote machine cannot be accessed directly, an SSH port
forward should be added:

\verbatim
$ ssh -L 2011:localhost:2011 myserver STARPU_SCHED=dmda ./application
\endverbatim

and "localhost" should be used as the hostname to connect to.
</li>
</ul>
\section Off-line_performance_feedback Off-line performance feedback

\subsection Generating_traces_with_FxT Generating traces with FxT

StarPU can use the FxT library (see
https://savannah.nongnu.org/projects/fkt/) to generate traces
with a limited runtime overhead.

You can either get a tarball:

\verbatim
$ wget http://download.savannah.gnu.org/releases/fkt/fxt-0.2.11.tar.gz
\endverbatim

or use the FxT library from CVS (autotools are required):

\verbatim
$ cvs -d :pserver:anonymous@cvs.sv.gnu.org:/sources/fkt co FxT
$ ./bootstrap
\endverbatim

Compiling and installing the FxT library in the <c>$FXTDIR</c> path is
done following the standard procedure:

\verbatim
$ ./configure --prefix=$FXTDIR
$ make
$ make install
\endverbatim

In order to have StarPU generate traces, StarPU should be configured with
the <c>--with-fxt</c> option:

\verbatim
$ ./configure --with-fxt=$FXTDIR
\endverbatim

Or you can simply point <c>PKG_CONFIG_PATH</c> to
<c>$FXTDIR/lib/pkgconfig</c> and pass <c>--with-fxt</c> to <c>./configure</c>.

When FxT is enabled, a trace is generated when StarPU is terminated by calling
starpu_shutdown(). The trace is a binary file whose name has the form
<c>prof_file_XXX_YYY</c>, where <c>XXX</c> is the user name and
<c>YYY</c> is the pid of the process that used StarPU. This file is saved in the
<c>/tmp/</c> directory by default, or in the directory specified by
the <c>STARPU_FXT_PREFIX</c> environment variable.
\subsection Creating_a_Gantt_Diagram Creating a Gantt Diagram

When the FxT trace file <c>filename</c> has been generated, it is possible to
generate a trace in the Paje format by calling:

\verbatim
$ starpu_fxt_tool -i filename
\endverbatim

Alternatively, setting the <c>STARPU_GENERATE_TRACE</c> environment variable
to <c>1</c> before application execution will make StarPU do it automatically at
application shutdown.

This will create a <c>paje.trace</c> file in the current directory that
can be inspected with the <a href="http://vite.gforge.inria.fr/">ViTE trace
visualizing open-source tool</a>. It is possible to open the
<c>paje.trace</c> file with ViTE by using the following command:

\verbatim
$ vite paje.trace
\endverbatim

To get names of tasks instead of "unknown", fill the optional <c>name</c> field
of the codelets, or use a performance model for them.

In the MPI execution case, collect the trace files from the MPI nodes, and
specify them all on the <c>starpu_fxt_tool</c> command line, for instance:

\verbatim
$ starpu_fxt_tool -i filename1 -i filename2
\endverbatim

By default, all tasks are displayed using a green color. To display tasks with
varying colors, pass option <c>-c</c> to <c>starpu_fxt_tool</c>.

Traces can also be inspected by hand by using the <c>fxt_print</c> tool, for instance:

\verbatim
$ fxt_print -o -f filename
\endverbatim

Timings are in nanoseconds (while timings as seen in <c>vite</c> are in milliseconds).
\subsection Creating_a_DAG_with_graphviz Creating a DAG with graphviz

When the FxT trace file <c>filename</c> has been generated, it is possible to
generate a task graph in the DOT format by calling:

\verbatim
$ starpu_fxt_tool -i filename
\endverbatim

This will create a <c>dag.dot</c> file in the current directory. This file is a
task graph described using the DOT language. It is possible to get a
graphical output of the graph by using the graphviz library:

\verbatim
$ dot -Tpdf dag.dot -o output.pdf
\endverbatim
\subsection Monitoring_activity Monitoring activity

When the FxT trace file <c>filename</c> has been generated, it is possible to
generate an activity trace by calling:

\verbatim
$ starpu_fxt_tool -i filename
\endverbatim

This will create an <c>activity.data</c> file in the current
directory. A profile of the application showing the activity of StarPU
during the execution of the program can be generated:

\verbatim
$ starpu_workers_activity activity.data
\endverbatim

This will create a file named <c>activity.eps</c> in the current directory.
This picture is composed of two parts.

The first part shows the activity of the different workers. The green sections
indicate which proportion of the time was spent executing kernels on the
processing unit. The red sections indicate the proportion of time spent in
StarPU: a large overhead may indicate that the granularity is too
low, and that bigger tasks may be needed to use the processing unit more
efficiently. The black sections indicate that the processing unit was blocked
because there was no task to process: this may indicate a lack of parallelism,
which may be alleviated by creating more tasks where possible.

The second part of the <c>activity.eps</c> picture is a graph showing the
evolution of the number of tasks available in the system during the execution.
Ready tasks are shown in black, and tasks that are submitted but not
schedulable yet are shown in grey.
\section Performance_of_codelets Performance of codelets

The performance model of codelets (see \ref Performance_model_example) can be examined by using the
<c>starpu_perfmodel_display</c> tool:

\verbatim
$ starpu_perfmodel_display -l
file: <malloc_pinned.hannibal>
file: <starpu_slu_lu_model_21.hannibal>
file: <starpu_slu_lu_model_11.hannibal>
file: <starpu_slu_lu_model_22.hannibal>
file: <starpu_slu_lu_model_12.hannibal>
\endverbatim

Here, the codelets of the lu example are available. We can examine the
performance of the 22 kernel (in microseconds), which is history-based:

\verbatim
$ starpu_perfmodel_display -s starpu_slu_lu_model_22
performance model for cpu
# hash      size       mean          dev           n
57618ab0    19660800   2.851069e+05  1.829369e+04  109
performance model for cuda_0
# hash      size       mean          dev           n
57618ab0    19660800   1.164144e+04  1.556094e+01  315
performance model for cuda_1
# hash      size       mean          dev           n
57618ab0    19660800   1.164271e+04  1.330628e+01  360
performance model for cuda_2
# hash      size       mean          dev           n
57618ab0    19660800   1.166730e+04  3.390395e+02  456
\endverbatim

We can see that for the given size, over a sample of a few hundred
executions, the GPUs are about 20 times faster than the CPUs (numbers are in
microseconds). The standard deviation is extremely low for the GPUs, and less
than 10% for the CPUs.
This tool can also be used for regression-based performance models. It will then
display the regression formula, and in the case of non-linear regression, the
same performance log as for history-based performance models:

\verbatim
$ starpu_perfmodel_display -s non_linear_memset_regression_based
performance model for cpu_impl_0
	Regression : #sample = 1400
	Linear: y = alpha size ^ beta
		alpha = 1.335973e-03
		beta = 8.024020e-01
	Non-Linear: y = a size ^b + c
		a = 5.429195e-04
		b = 8.654899e-01
		c = 9.009313e-01
# hash      size    mean          stddev        n
a3d3725e    4096    4.763200e+00  7.650928e-01  100
870a30aa    8192    1.827970e+00  2.037181e-01  100
48e988e9    16384   2.652800e+00  1.876459e-01  100
961e65d2    32768   4.255530e+00  3.518025e-01  100
...
\endverbatim
The same can also be achieved by using StarPU's library API, see
@ref{Performance Model API} and notably the starpu_perfmodel_load_symbol()
function. The source code of the <c>starpu_perfmodel_display</c> tool can be a
useful example.

The <c>starpu_perfmodel_plot</c> tool can be used to draw performance models.
It writes a <c>.gp</c> file in the current directory, to be run with the
<c>gnuplot</c> tool, which shows the corresponding curve.

When the <c>flops</c> field of tasks is set, <c>starpu_perfmodel_plot</c> can
directly draw a GFlops curve, by simply adding the <c>-f</c> option:

\verbatim
$ starpu_perfmodel_plot -f -s chol_model_11
\endverbatim

This will however disable displaying the regression model, for which we cannot
compute GFlops.
When the FxT trace file <c>filename</c> has been generated, it is possible to
get a profiling of each codelet by calling:

\verbatim
$ starpu_fxt_tool -i filename
$ starpu_codelet_profile distrib.data codelet_name
\endverbatim

This will create profiling data files, and a <c>.gp</c> file in the current
directory, which draws the distribution of codelet time over the application
execution, according to data input size.

This is also available in the <c>starpu_perfmodel_plot</c> tool, by passing it
the FxT trace:

\verbatim
$ starpu_perfmodel_plot -s non_linear_memset_regression_based -i /tmp/prof_file_foo_0
\endverbatim

It will produce a <c>.gp</c> file which contains both the performance model
curves and the profiling measurements.

If you have the R statistical tool installed, you can additionally use

\verbatim
$ starpu_codelet_histo_profile distrib.data
\endverbatim

which will create one PDF file per codelet and per input size, showing a
histogram of the codelet execution time distribution.
\section Theoretical_lower_bound_on_execution_time Theoretical lower bound on execution time

StarPU can record a trace of what tasks are needed to complete the
application, and then, by using a linear system, provide a theoretical lower
bound on the execution time (i.e. with an ideal scheduling).

The computed bound is not really correct when dependencies are not taken into
account, but for an application which has enough parallelism, it is very
close to the bound computed with dependencies enabled (which takes much
more time to compute), and thus provides a good-enough estimation of the ideal
execution time.

\ref Theoretical_lower_bound_on_execution_time provides an example on how to
use this.
\section Memory_feedback Memory feedback

It is possible to enable memory statistics. To do so, you need to pass the option
<c>--enable-memory-stats</c> when running <c>configure</c>. It is then
possible to call the function starpu_display_memory_stats() to
display statistics about the current data handles registered within StarPU.

Moreover, statistics will be displayed at the end of the execution on
data handles which have not been cleared out. This can be disabled by
setting the environment variable <c>STARPU_MEMORY_STATS</c> to 0.

For example, if you do not unregister data at the end of the complex
example, you will get something similar to:

\verbatim
$ STARPU_MEMORY_STATS=0 ./examples/interface/complex
Complex[0] = 45.00 + 12.00 i
Complex[0] = 78.00 + 78.00 i
Complex[0] = 45.00 + 12.00 i
Complex[0] = 45.00 + 12.00 i
\endverbatim
\verbatim
$ STARPU_MEMORY_STATS=1 ./examples/interface/complex
Complex[0] = 45.00 + 12.00 i
Complex[0] = 78.00 + 78.00 i
Complex[0] = 45.00 + 12.00 i
Complex[0] = 45.00 + 12.00 i

#---------------------
Memory stats:
#-------
Data on Node #3
#-----
Data : 0x553ff40
Size : 16

#--
Data access stats
/!\ Work Underway
Node #0
	Direct access : 4
	Loaded (Owner) : 0
	Loaded (Shared) : 0
	Invalidated (was Owner) : 0

Node #3
	Direct access : 0
	Loaded (Owner) : 0
	Loaded (Shared) : 1
	Invalidated (was Owner) : 0

#-----
Data : 0x5544710
Size : 16

#--
Data access stats
/!\ Work Underway
Node #0
	Direct access : 2
	Loaded (Owner) : 0
	Loaded (Shared) : 1
	Invalidated (was Owner) : 1

Node #3
	Direct access : 0
	Loaded (Owner) : 1
	Loaded (Shared) : 0
	Invalidated (was Owner) : 0
\endverbatim
\section Data_statistics Data statistics

Different data statistics can be displayed at the end of the execution
of the application. To enable them, you need to pass the option
<c>--enable-stats</c> when calling <c>configure</c>. When calling
starpu_shutdown(), various statistics will be displayed: execution
statistics, MSI cache statistics, allocation cache statistics, and data
transfer statistics. The display can be disabled by setting the
environment variable <c>STARPU_STATS</c> to 0.

\verbatim
$ ./examples/cholesky/cholesky_tag
Computation took (in ms)
518.16
Synthetic GFlops : 44.21
#---------------------
MSI cache stats :
TOTAL MSI stats	hit 1622 (66.23 %)	miss 827 (33.77 %)
...
\endverbatim

\verbatim
$ STARPU_STATS=0 ./examples/cholesky/cholesky_tag
Computation took (in ms)
518.16
Synthetic GFlops : 44.21
\endverbatim

\internal
TODO: data transfer stats are similar to the ones displayed when
setting STARPU_BUS_STATS
\endinternal

*/