@c perf-feedback.texi
@c -*-texinfo-*-
@c This file is part of the StarPU Handbook.
@c Copyright (C) 2009--2011 Universit@'e de Bordeaux 1
@c Copyright (C) 2010, 2011, 2012 Centre National de la Recherche Scientifique
@c Copyright (C) 2011, 2012 Institut National de Recherche en Informatique et Automatique
@c See the file starpu.texi for copying conditions.

@menu
* Task debugger:: Using the Temanejo task debugger
* On-line:: On-line performance feedback
* Off-line:: Off-line performance feedback
* Codelet performance:: Performance of codelets
* Theoretical lower bound on execution time API::
* Memory feedback::
* Data statistics::
@end menu
@node Task debugger
@section Using the Temanejo task debugger

StarPU can connect to Temanejo (see
@url{http://www.hlrs.de/organization/av/spmt/research/temanejo/}) to permit
nice visual task debugging. To do so, build Temanejo's @code{libayudame.so},
install @code{Ayudame} to e.g. @code{/usr/local/include}, apply the
@code{tools/patch-ayudame} patch to it to fix the C build, re-run
@code{./configure}, make sure that it found it, and rebuild StarPU. Run the
Temanejo GUI, give it the path to your application, any options you want to
pass it, and the path to @code{libayudame.so}.

The number of CPUs currently cannot be set.
Only implicitly-detected dependencies are currently shown.
@node On-line
@section On-line performance feedback

@menu
* Enabling on-line performance monitoring::
* Task feedback:: Per-task feedback
* Codelet feedback:: Per-codelet feedback
* Worker feedback:: Per-worker feedback
* Bus feedback:: Bus-related feedback
* StarPU-Top:: StarPU-Top interface
@end menu

@node Enabling on-line performance monitoring
@subsection Enabling on-line performance monitoring

In order to enable online performance monitoring, the application can call
@code{starpu_profiling_status_set(STARPU_PROFILING_ENABLE)}. It is possible to
detect whether monitoring is already enabled or not by calling
@code{starpu_profiling_status_get()}. Enabling monitoring also reinitializes all
previously collected feedback. The @code{STARPU_PROFILING} environment variable
can also be set to 1 to achieve the same effect.

Likewise, performance monitoring is stopped by calling
@code{starpu_profiling_status_set(STARPU_PROFILING_DISABLE)}. Note that this
does not reset the performance counters, so that the application may consult
them later on.

More details about the performance monitoring API are available in section
@ref{Profiling API}.
@node Task feedback
@subsection Per-task feedback

If profiling is enabled, a pointer to a @code{starpu_task_profiling_info}
structure is put in the @code{.profiling_info} field of the @code{starpu_task}
structure when a task terminates.
This structure is automatically destroyed when the task structure is destroyed,
either automatically or by calling @code{starpu_task_destroy}.

The @code{starpu_task_profiling_info} structure indicates the date when the
task was submitted (@code{submit_time}), started (@code{start_time}), and
terminated (@code{end_time}), relative to the initialization of
StarPU with @code{starpu_init}. It also specifies the identifier of the worker
that has executed the task (@code{workerid}).
These dates are stored as @code{timespec} structures which the user may convert
into micro-seconds using the @code{starpu_timing_timespec_to_us} helper
function.
It is worth noting that the application may directly access this structure from
the callback executed at the end of the task. The @code{starpu_task} structure
associated to the callback currently being executed is indeed accessible with
the @code{starpu_task_get_current()} function.
@node Codelet feedback
@subsection Per-codelet feedback

The @code{per_worker_stats} field of the @code{struct starpu_codelet} structure is
an array of counters. The i-th entry of the array is incremented every time a
task implementing the codelet is executed on the i-th worker.
This array is not reinitialized when profiling is enabled or disabled.
@node Worker feedback
@subsection Per-worker feedback

The second argument returned by the @code{starpu_worker_get_profiling_info}
function is a @code{starpu_worker_profiling_info} structure that gives
statistics about the specified worker. This structure specifies when StarPU
started collecting profiling information for that worker (@code{start_time}),
the duration of the profiling measurement interval (@code{total_time}), the
time spent executing kernels (@code{executing_time}), the time spent sleeping
because there is no task to execute at all (@code{sleeping_time}), and the
number of tasks that were executed while profiling was enabled.
These values give an estimation of the proportion of time spent doing real
work, and of the time either spent sleeping because there are not enough
executable tasks or simply wasted in pure StarPU overhead.
Calling @code{starpu_worker_get_profiling_info} resets the profiling
information associated to a worker.

When an FxT trace is generated (see @ref{Generating traces}), it is also
possible to use the @code{starpu_workers_activity} script (described in
@ref{starpu-workers-activity}) to generate a graphic showing the evolution of
these values over time, for the different workers.
@node Bus feedback
@subsection Bus-related feedback

TODO: add @code{STARPU_BUS_STATS}

@c how to enable/disable performance monitoring
@c what kind of information do we get ?

The bus speed measured by StarPU can be displayed by using the
@code{starpu_machine_display} tool, for instance:

@example
StarPU has found:
        3 CUDA devices
                CUDA 0 (Tesla C2050 02:00.0)
                CUDA 1 (Tesla C2050 03:00.0)
                CUDA 2 (Tesla C2050 84:00.0)
from     to RAM        to CUDA 0     to CUDA 1     to CUDA 2
RAM      0.000000      5176.530428   5176.492994   5191.710722
CUDA 0   4523.732446   0.000000      2414.074751   2417.379201
CUDA 1   4523.718152   2414.078822   0.000000      2417.375119
CUDA 2   4534.229519   2417.069025   2417.060863   0.000000
@end example
@node StarPU-Top
@subsection StarPU-Top interface

StarPU-Top is an interface which remotely displays the on-line state of a StarPU
application and permits the user to change parameters on the fly.

Variables to be monitored can be registered by calling the
@code{starpu_top_add_data_boolean}, @code{starpu_top_add_data_integer},
@code{starpu_top_add_data_float} functions, e.g.:

@cartouche
@smallexample
starpu_top_data *data = starpu_top_add_data_integer("mynum", 0, 100, 1);
@end smallexample
@end cartouche

The application should then call @code{starpu_top_init_and_wait} to give its name
and wait for StarPU-Top to get a start request from the user. The name is used
by StarPU-Top to quickly reload a previously-saved layout of parameter display.
@cartouche
@smallexample
starpu_top_init_and_wait("the application");
@end smallexample
@end cartouche

The new values can then be provided thanks to
@code{starpu_top_update_data_boolean}, @code{starpu_top_update_data_integer},
@code{starpu_top_update_data_float}, e.g.:

@cartouche
@smallexample
starpu_top_update_data_integer(data, mynum);
@end smallexample
@end cartouche

Updateable parameters can be registered thanks to
@code{starpu_top_register_parameter_boolean},
@code{starpu_top_register_parameter_integer},
@code{starpu_top_register_parameter_float}, e.g.:

@cartouche
@smallexample
float alpha;
starpu_top_register_parameter_float("alpha", &alpha, 0, 10, modif_hook);
@end smallexample
@end cartouche

@code{modif_hook} is a function which will be called when the parameter is
modified; it can for instance print the new value:

@cartouche
@smallexample
void modif_hook(struct starpu_top_param *d) @{
  fprintf(stderr, "%s has been modified: %f\n", d->name, alpha);
@}
@end smallexample
@end cartouche
Task schedulers should notify StarPU-Top when they have decided when a task
will be scheduled, so that it can show it in its Gantt chart, for instance:

@cartouche
@smallexample
starpu_top_task_prevision(task, workerid, begin, end);
@end smallexample
@end cartouche

Starting StarPU-Top@footnote{StarPU-Top is started via the binary
@code{starpu_top}.} and the application can be done in two ways:

@itemize
@item The application is started by hand on some machine (and thus already
waiting for the start event). In the Preference dialog of StarPU-Top, the SSH
checkbox should be unchecked, and the hostname and port (default is 2011) on
which the application is already running should be specified. Clicking on the
connection button will thus connect to the already-running application.

@item StarPU-Top is started first, and clicking on the connection button will
start the application itself (possibly on a remote machine). The SSH checkbox
should be checked, and a command line provided, e.g.:

@example
ssh myserver STARPU_SCHED=heft ./application
@end example

If port 2011 of the remote machine can not be accessed directly, an SSH port
forwarding should be added:

@example
ssh -L 2011:localhost:2011 myserver STARPU_SCHED=heft ./application
@end example

and "localhost" should be used as the IP address to connect to.
@end itemize
@node Off-line
@section Off-line performance feedback

@menu
* Generating traces:: Generating traces with FxT
* Gantt diagram:: Creating a Gantt Diagram
* DAG:: Creating a DAG with graphviz
* starpu-workers-activity:: Monitoring activity
@end menu

@node Generating traces
@subsection Generating traces with FxT

StarPU can use the FxT library (see
@indicateurl{https://savannah.nongnu.org/projects/fkt/}) to generate traces
with a limited runtime overhead.

You can either get a tarball:
@example
% wget http://download.savannah.gnu.org/releases/fkt/fxt-0.2.2.tar.gz
@end example

or use the FxT library from CVS (autotools are required):
@example
% cvs -d :pserver:anonymous@@cvs.sv.gnu.org:/sources/fkt co FxT
% ./bootstrap
@end example

Compiling and installing the FxT library in the @code{$FXTDIR} path is
done following the standard procedure:

@example
% ./configure --prefix=$FXTDIR
% make
% make install
@end example

In order to have StarPU generate traces, StarPU should be configured with
the @code{--with-fxt} option:

@example
$ ./configure --with-fxt=$FXTDIR
@end example

Or you can simply point @code{PKG_CONFIG_PATH} to
@code{$FXTDIR/lib/pkgconfig} and pass @code{--with-fxt} to @code{./configure}.

When FxT is enabled, a trace is generated when StarPU is terminated by calling
@code{starpu_shutdown()}. The trace is a binary file whose name has the form
@code{prof_file_XXX_YYY} where @code{XXX} is the user name, and
@code{YYY} is the pid of the process that used StarPU. This file is saved in the
@code{/tmp/} directory by default, or in the directory specified by
the @code{STARPU_FXT_PREFIX} environment variable.
@node Gantt diagram
@subsection Creating a Gantt Diagram

When the FxT trace file @code{filename} has been generated, it is possible to
generate a trace in the Paje format by calling:

@example
% starpu_fxt_tool -i filename
@end example

Alternatively, setting the @code{STARPU_GENERATE_TRACE} environment variable
to 1 before application execution will make StarPU do it automatically at
application shutdown.

This will create a @code{paje.trace} file in the current directory that
can be inspected with the @url{http://vite.gforge.inria.fr/, ViTE trace
visualizing open-source tool}. It is possible to open the
@code{paje.trace} file with ViTE by using the following command:
@example
% vite paje.trace
@end example

To get names of tasks instead of "unknown", fill the optional @code{name} field
of the codelets, or use a performance model for them.

By default, all tasks are displayed using a green color. To display tasks with
varying colors, pass option @code{-c} to @code{starpu_fxt_tool}.
@node DAG
@subsection Creating a DAG with graphviz

When the FxT trace file @code{filename} has been generated, it is possible to
generate a task graph in the DOT format by calling:

@example
$ starpu_fxt_tool -i filename
@end example

This will create a @code{dag.dot} file in the current directory. This file is a
task graph described using the DOT language. It is possible to get a
graphical output of the graph by using the graphviz library:

@example
$ dot -Tpdf dag.dot -o output.pdf
@end example
@node starpu-workers-activity
@subsection Monitoring activity

When the FxT trace file @code{filename} has been generated, it is possible to
generate an activity trace by calling:

@example
$ starpu_fxt_tool -i filename
@end example

This will create an @code{activity.data} file in the current
directory. A profile of the application showing the activity of StarPU
during the execution of the program can be generated:

@example
$ starpu_workers_activity activity.data
@end example

This will create a file named @code{activity.eps} in the current directory.
This picture is composed of two parts.

The first part shows the activity of the different workers. The green sections
indicate which proportion of the time was spent executing kernels on the
processing unit. The red sections indicate the proportion of time spent in
StarPU: an important overhead may indicate that the granularity is too
low, and that bigger tasks may be needed to use the processing unit more
efficiently. The black sections indicate that the processing unit was blocked
because there was no task to process: this may indicate a lack of parallelism,
which may be alleviated by creating more tasks when possible.

The second part of the @code{activity.eps} picture is a graph showing the
evolution of the number of tasks available in the system during the execution.
Ready tasks are shown in black, and tasks that are submitted but not
schedulable yet are shown in grey.
@node Codelet performance
@section Performance of codelets

The performance model of codelets (described in @ref{Performance model
example}) can be examined by using the @code{starpu_perfmodel_display} tool:

@example
$ starpu_perfmodel_display -l
file: <malloc_pinned.hannibal>
file: <starpu_slu_lu_model_21.hannibal>
file: <starpu_slu_lu_model_11.hannibal>
file: <starpu_slu_lu_model_22.hannibal>
file: <starpu_slu_lu_model_12.hannibal>
@end example

Here, the codelets of the lu example are available. We can examine the
performance of the 22 kernel (in micro-seconds):

@example
$ starpu_perfmodel_display -s starpu_slu_lu_model_22
performance model for cpu
# hash      size       mean          dev           n
57618ab0    19660800   2.851069e+05  1.829369e+04  109
performance model for cuda_0
# hash      size       mean          dev           n
57618ab0    19660800   1.164144e+04  1.556094e+01  315
performance model for cuda_1
# hash      size       mean          dev           n
57618ab0    19660800   1.164271e+04  1.330628e+01  360
performance model for cuda_2
# hash      size       mean          dev           n
57618ab0    19660800   1.166730e+04  3.390395e+02  456
@end example

We can see that for the given size, over a sample of a few hundred
executions, the GPUs are about 20 times faster than the CPUs (numbers are in
us). The standard deviation is extremely low for the GPUs, and less than 10%
for the CPUs.
The @code{starpu_regression_display} tool does the same for regression-based
performance models. It also writes a @code{.gp} file in the current directory,
to be run with the @code{gnuplot} tool, which shows the corresponding curve.

The same can also be achieved by using StarPU's library API, see
@ref{Performance Model API} and notably the @code{starpu_perfmodel_load_symbol}
function. The source code of the @code{starpu_perfmodel_display} tool can be a
useful example.
@node Theoretical lower bound on execution time API
@section Theoretical lower bound on execution time

See @ref{Theoretical lower bound on execution time} for an example of how to
use this API. It permits recording a trace of the tasks that are needed to
complete the application, and then, by using a linear system, providing a
theoretical lower bound on the execution time (i.e. with an ideal scheduling).

The computed bound is not really correct when dependencies are not taken into
account, but for an application which has enough parallelism, it is very
close to the bound computed with dependencies enabled (which takes much more
time to compute), and thus provides a good-enough estimation of the ideal
execution time.

@deftypefun void starpu_bound_start (int @var{deps}, int @var{prio})
Start recording tasks (resets stats). @var{deps} tells whether
dependencies should be recorded too (this is quite expensive).
@end deftypefun

@deftypefun void starpu_bound_stop (void)
Stop recording tasks.
@end deftypefun

@deftypefun void starpu_bound_print_dot ({FILE *}@var{output})
Print the DAG that was recorded.
@end deftypefun

@deftypefun void starpu_bound_compute ({double *}@var{res}, {double *}@var{integer_res}, int @var{integer})
Get the theoretical upper bound (in ms) (needs glpk support detected by the
@code{configure} script). It returns 0 if some performance models are not
calibrated.
@end deftypefun

@deftypefun void starpu_bound_print_lp ({FILE *}@var{output})
Emit the Linear Programming system on @var{output} for the recorded tasks, in
the lp format.
@end deftypefun

@deftypefun void starpu_bound_print_mps ({FILE *}@var{output})
Emit the Linear Programming system on @var{output} for the recorded tasks, in
the mps format.
@end deftypefun

@deftypefun void starpu_bound_print ({FILE *}@var{output}, int @var{integer})
Emit statistics of actual execution vs theoretical upper bound. @var{integer}
permits choosing between integer solving (which takes a long time but is
correct), and relaxed solving (which provides an approximate solution).
@end deftypefun
@node Memory feedback
@section Memory feedback

It is possible to enable memory statistics. To do so, you need to pass the option
@code{--enable-memory-stats} when running @code{configure}. It is then
possible to call the function @code{starpu_display_memory_stats()} to
display statistics about the current data handles registered within StarPU.

Moreover, statistics will be displayed at the end of the execution on
data handles which have not been cleared out. This can be disabled by
setting the environment variable @code{STARPU_MEMORY_STATS} to 0.

For example, if you do not unregister data at the end of the complex
example, you will get something similar to:

@example
$ STARPU_MEMORY_STATS=0 ./examples/interface/complex
Complex[0] = 45.00 + 12.00 i
Complex[0] = 78.00 + 78.00 i
Complex[0] = 45.00 + 12.00 i
Complex[0] = 45.00 + 12.00 i
@end example

@example
$ STARPU_MEMORY_STATS=1 ./examples/interface/complex
Complex[0] = 45.00 + 12.00 i
Complex[0] = 78.00 + 78.00 i
Complex[0] = 45.00 + 12.00 i
Complex[0] = 45.00 + 12.00 i

#---------------------
Memory stats:
#-------
Data on Node #3
#-----
Data : 0x553ff40
Size : 16

#--
Data access stats
/!\ Work Underway
Node #0
        Direct access : 4
        Loaded (Owner) : 0
        Loaded (Shared) : 0
        Invalidated (was Owner) : 0

Node #3
        Direct access : 0
        Loaded (Owner) : 0
        Loaded (Shared) : 1
        Invalidated (was Owner) : 0

#-----
Data : 0x5544710
Size : 16

#--
Data access stats
/!\ Work Underway
Node #0
        Direct access : 2
        Loaded (Owner) : 0
        Loaded (Shared) : 1
        Invalidated (was Owner) : 1

Node #3
        Direct access : 0
        Loaded (Owner) : 1
        Loaded (Shared) : 0
        Invalidated (was Owner) : 0
@end example
@node Data statistics
@section Data statistics

Different data statistics can be displayed at the end of the execution
of the application. To enable them, you need to pass the option
@code{--enable-stats} when calling @code{configure}. When calling
@code{starpu_shutdown()}, various statistics will be displayed: execution
statistics, MSI cache statistics, allocation cache statistics, and data
transfer statistics. The display can be disabled by setting the
environment variable @code{STARPU_STATS} to 0.

@example
$ ./examples/cholesky/cholesky_tag
Computation took (in ms)
518.16
Synthetic GFlops : 44.21
#---------------------
MSI cache stats :
TOTAL MSI stats hit 1622 (66.23 %) miss 827 (33.77 %)
...
@end example

@example
$ STARPU_STATS=0 ./examples/cholesky/cholesky_tag
Computation took (in ms)
518.16
Synthetic GFlops : 44.21
@end example

@c TODO: data transfer stats are similar to the ones displayed when
@c setting STARPU_BUS_STATS