@c -*-texinfo-*-
@c This file is part of the StarPU Handbook.
@c Copyright (C) 2009--2011 Universit@'e de Bordeaux 1
@c Copyright (C) 2010, 2011, 2012, 2013 Centre National de la Recherche Scientifique
@c Copyright (C) 2011, 2012 Institut National de Recherche en Informatique et Automatique
@c See the file starpu.texi for copying conditions.

@menu
* Task debugger:: Using the Temanejo task debugger
* On-line:: On-line performance feedback
* Off-line:: Off-line performance feedback
* Codelet performance:: Performance of codelets
* Memory feedback::
* Data statistics::
@end menu
@node Task debugger
@section Using the Temanejo task debugger

StarPU can connect to Temanejo (see
@url{http://www.hlrs.de/temanejo}) to permit
nice visual task debugging. To do so, build Temanejo's @code{libayudame.so},
install @code{Ayudame.h} to e.g. @code{/usr/local/include}, and apply
@code{tools/patch-ayudame} to it to fix the C build. Then re-run
@code{./configure}, make sure that it detects Temanejo, and rebuild StarPU.
Run the Temanejo GUI, and give it the path to your application, any options you
want to pass it, and the path to @code{libayudame.so}.

Make sure to specify at least the same number of CPUs in the dialog box as your
machine has, otherwise an error will occur during execution. Future versions
of Temanejo should be able to tell StarPU the number of CPUs to use.

Tag numbers have to be below @code{4000000000000000000ULL} to be usable with
Temanejo (so as to distinguish them from tasks).
@node On-line
@section On-line performance feedback

@menu
* Enabling on-line performance monitoring::
* Task feedback:: Per-task feedback
* Codelet feedback:: Per-codelet feedback
* Worker feedback:: Per-worker feedback
* Bus feedback:: Bus-related feedback
* StarPU-Top:: StarPU-Top interface
@end menu

@node Enabling on-line performance monitoring
@subsection Enabling on-line performance monitoring
In order to enable on-line performance monitoring, the application can call
@code{starpu_profiling_status_set(STARPU_PROFILING_ENABLE)}. It is possible to
detect whether monitoring is already enabled or not by calling
@code{starpu_profiling_status_get()}. Enabling monitoring also reinitializes all
previously collected feedback. The @code{STARPU_PROFILING} environment variable
can also be set to @code{1} to achieve the same effect.

Likewise, performance monitoring is stopped by calling
@code{starpu_profiling_status_set(STARPU_PROFILING_DISABLE)}. Note that this
does not reset the performance counters, so that the application may consult
them later on.

More details about the performance monitoring API are available in section
@ref{Profiling API}.
@node Task feedback
@subsection Per-task feedback

If profiling is enabled, a pointer to a @code{starpu_task_profiling_info}
structure is put in the @code{.profiling_info} field of the @code{starpu_task}
structure when a task terminates. This structure is automatically destroyed
when the task structure is destroyed, either automatically or by calling
@code{starpu_task_destroy}.

The @code{starpu_task_profiling_info} structure indicates the date when the
task was submitted (@code{submit_time}), started (@code{start_time}), and
terminated (@code{end_time}), relative to the initialization of
StarPU with @code{starpu_init}. It also specifies the identifier of the worker
that executed the task (@code{workerid}).
These dates are stored as @code{timespec} structures, which the user may convert
into micro-seconds using the @code{starpu_timing_timespec_to_us} helper
function.

It is worth noting that the application may directly access this structure from
the callback executed at the end of the task. The @code{starpu_task} structure
associated with the callback currently being executed is indeed accessible with
the @code{starpu_task_get_current()} function.
@node Codelet feedback
@subsection Per-codelet feedback

The @code{per_worker_stats} field of the @code{struct starpu_codelet} structure is
an array of counters. The i-th entry of the array is incremented every time a
task implementing the codelet is executed on the i-th worker.
This array is not reinitialized when profiling is enabled or disabled.
@node Worker feedback
@subsection Per-worker feedback

The second argument returned by the @code{starpu_worker_get_profiling_info}
function is a @code{starpu_worker_profiling_info} structure that gives
statistics about the specified worker. This structure specifies when StarPU
started collecting profiling information for that worker (@code{start_time}),
the duration of the profiling measurement interval (@code{total_time}), the
time spent executing kernels (@code{executing_time}), the time spent sleeping
because there was no task to execute at all (@code{sleeping_time}), and the
number of tasks that were executed while profiling was enabled.
These values give an estimation of the proportion of time spent doing real
work, and of the time spent either sleeping because there are not enough
executable tasks or simply wasted in pure StarPU overhead.

Calling @code{starpu_worker_get_profiling_info} resets the profiling
information associated with a worker.

When an FxT trace is generated (see @ref{Generating traces}), it is also
possible to use the @code{starpu_workers_activity} script (described in
@ref{starpu-workers-activity}) to generate a graphic showing the evolution of
these values over time, for the different workers.
@node Bus feedback
@subsection Bus-related feedback

@c TODO: add STARPU_BUS_STATS
@c how to enable/disable performance monitoring
@c what kind of information do we get ?

The bus speed measured by StarPU can be displayed by using the
@code{starpu_machine_display} tool, for instance:

@example
StarPU has found:
        3 CUDA devices
                CUDA 0 (Tesla C2050 02:00.0)
                CUDA 1 (Tesla C2050 03:00.0)
                CUDA 2 (Tesla C2050 84:00.0)
from/to  RAM          to CUDA 0    to CUDA 1    to CUDA 2
RAM      0.000000     5176.530428  5176.492994  5191.710722
CUDA 0   4523.732446  0.000000     2414.074751  2417.379201
CUDA 1   4523.718152  2414.078822  0.000000     2417.375119
CUDA 2   4534.229519  2417.069025  2417.060863  0.000000
@end example
@node StarPU-Top
@subsection StarPU-Top interface

StarPU-Top is an interface which remotely displays the on-line state of a StarPU
application and permits the user to change parameters on the fly.

Variables to be monitored can be registered by calling the
@code{starpu_top_add_data_boolean}, @code{starpu_top_add_data_integer},
@code{starpu_top_add_data_float} functions, e.g.:

@cartouche
@smallexample
starpu_top_data *data = starpu_top_add_data_integer("mynum", 0, 100, 1);
@end smallexample
@end cartouche

The application should then call @code{starpu_top_init_and_wait} to give its name
and wait for StarPU-Top to get a start request from the user. The name is used
by StarPU-Top to quickly reload a previously-saved layout of parameter display.

@cartouche
@smallexample
starpu_top_init_and_wait("the application");
@end smallexample
@end cartouche

New values can then be provided thanks to
@code{starpu_top_update_data_boolean}, @code{starpu_top_update_data_integer},
@code{starpu_top_update_data_float}, e.g.:

@cartouche
@smallexample
starpu_top_update_data_integer(data, mynum);
@end smallexample
@end cartouche
Updatable parameters can be registered thanks to
@code{starpu_top_register_parameter_boolean},
@code{starpu_top_register_parameter_integer},
@code{starpu_top_register_parameter_float}, e.g.:

@cartouche
@smallexample
float alpha;
starpu_top_register_parameter_float("alpha", &alpha, 0, 10, modif_hook);
@end smallexample
@end cartouche

@code{modif_hook} is a function which will be called when the parameter is
modified; it can for instance print the new value:

@cartouche
@smallexample
void modif_hook(struct starpu_top_param *d) @{
  fprintf(stderr, "%s has been modified: %f\n", d->name, alpha);
@}
@end smallexample
@end cartouche
Task schedulers should notify StarPU-Top when they have decided where and when
a task will be executed, so that it can show it in its Gantt chart, for
instance:

@cartouche
@smallexample
starpu_top_task_prevision(task, workerid, begin, end);
@end smallexample
@end cartouche

Starting StarPU-Top@footnote{StarPU-Top is started via the binary
@code{starpu_top}.} and the application can be done in two ways:

@itemize
@item The application is started by hand on some machine (and is thus already
waiting for the start event). In the Preference dialog of StarPU-Top, the SSH
checkbox should be unchecked, and the hostname and port (default is 2011) on
which the application is already running should be specified. Clicking on the
connection button will then connect to the already-running application.
@item StarPU-Top is started first, and clicking on the connection button will
start the application itself (possibly on a remote machine). The SSH checkbox
should be checked, and a command line provided, e.g.:
@example
$ ssh myserver STARPU_SCHED=dmda ./application
@end example
If port 2011 of the remote machine cannot be accessed directly, an SSH port
bridge should be added:
@example
$ ssh -L 2011:localhost:2011 myserver STARPU_SCHED=dmda ./application
@end example
and @code{localhost} should then be used as the hostname to connect to.
@end itemize
@node Off-line
@section Off-line performance feedback

@menu
* Generating traces:: Generating traces with FxT
* Gantt diagram:: Creating a Gantt Diagram
* DAG:: Creating a DAG with graphviz
* starpu-workers-activity:: Monitoring activity
@end menu

@node Generating traces
@subsection Generating traces with FxT

StarPU can use the FxT library (see
@url{https://savannah.nongnu.org/projects/fkt/}) to generate traces
with a limited runtime overhead.

You can either get a tarball:
@example
$ wget http://download.savannah.gnu.org/releases/fkt/fxt-0.2.11.tar.gz
@end example

or use the FxT library from CVS (autotools are required):
@example
$ cvs -d :pserver:anonymous@@cvs.sv.gnu.org:/sources/fkt co FxT
$ ./bootstrap
@end example

Compiling and installing the FxT library in the @code{$FXTDIR} path is
done following the standard procedure:
@example
$ ./configure --prefix=$FXTDIR
$ make
$ make install
@end example
In order to have StarPU generate traces, StarPU should be configured with
the @code{--with-fxt} option:
@example
$ ./configure --with-fxt=$FXTDIR
@end example

Or you can simply point @code{PKG_CONFIG_PATH} to
@code{$FXTDIR/lib/pkgconfig} and pass @code{--with-fxt} to @code{./configure}.

When FxT is enabled, a trace is generated when StarPU is terminated by calling
@code{starpu_shutdown()}. The trace is a binary file whose name has the form
@code{prof_file_XXX_YYY}, where @code{XXX} is the user name and
@code{YYY} is the pid of the process that used StarPU. This file is saved in the
@code{/tmp/} directory by default, or in the directory specified by
the @code{STARPU_FXT_PREFIX} environment variable.
@node Gantt diagram
@subsection Creating a Gantt Diagram

When the FxT trace file @code{filename} has been generated, it is possible to
generate a trace in the Paje format by calling:
@example
$ starpu_fxt_tool -i filename
@end example

Alternatively, setting the @code{STARPU_GENERATE_TRACE} environment variable
to @code{1} before application execution will make StarPU do it automatically at
application shutdown.

This will create a @code{paje.trace} file in the current directory that
can be inspected with the @url{http://vite.gforge.inria.fr/, ViTE trace
visualizing open-source tool}. It is possible to open the
@code{paje.trace} file with ViTE by using the following command:
@example
$ vite paje.trace
@end example

To get names of tasks instead of "unknown", fill the optional @code{name} field
of the codelets, or use a performance model for them.

In the MPI execution case, collect the trace files from the MPI nodes, and
specify them all on the @code{starpu_fxt_tool} command line, for instance:
@smallexample
$ starpu_fxt_tool -i filename1 -i filename2
@end smallexample

By default, all tasks are displayed using a green color. To display tasks with
varying colors, pass option @code{-c} to @code{starpu_fxt_tool}.
@node DAG
@subsection Creating a DAG with graphviz

When the FxT trace file @code{filename} has been generated, it is possible to
generate a task graph in the DOT format by calling:
@example
$ starpu_fxt_tool -i filename
@end example

This will create a @code{dag.dot} file in the current directory. This file is a
task graph described using the DOT language. It is possible to get a
graphical output of the graph by using the graphviz library:
@example
$ dot -Tpdf dag.dot -o output.pdf
@end example
@node starpu-workers-activity
@subsection Monitoring activity

When the FxT trace file @code{filename} has been generated, it is possible to
generate an activity trace by calling:
@example
$ starpu_fxt_tool -i filename
@end example

This will create an @code{activity.data} file in the current
directory. A profile of the application showing the activity of StarPU
during the execution of the program can be generated:
@example
$ starpu_workers_activity activity.data
@end example

This will create a file named @code{activity.eps} in the current directory.
This picture is composed of two parts.

The first part shows the activity of the different workers. The green sections
indicate which proportion of the time was spent executing kernels on the
processing unit. The red sections indicate the proportion of time spent in
StarPU: an important overhead may indicate that the granularity is too
low, and that bigger tasks may be needed to use the processing unit more
efficiently. The black sections indicate that the processing unit was blocked
because there was no task to process: this may indicate a lack of parallelism,
which may be alleviated by creating more tasks when possible.

The second part of the @code{activity.eps} picture is a graph showing the
evolution of the number of tasks available in the system during the execution.
Ready tasks are shown in black, and tasks that are submitted but not
schedulable yet are shown in grey.
@node Codelet performance
@section Performance of codelets

The performance model of codelets (described in @ref{Performance model
example}) can be examined by using the @code{starpu_perfmodel_display} tool:
@example
$ starpu_perfmodel_display -l
file: <malloc_pinned.hannibal>
file: <starpu_slu_lu_model_21.hannibal>
file: <starpu_slu_lu_model_11.hannibal>
file: <starpu_slu_lu_model_22.hannibal>
file: <starpu_slu_lu_model_12.hannibal>
@end example

Here, the codelets of the lu example are available. We can examine the
performance of the 22 kernel (in micro-seconds), which is history-based:
@example
$ starpu_perfmodel_display -s starpu_slu_lu_model_22
performance model for cpu
# hash       size       mean          dev           n
57618ab0     19660800   2.851069e+05  1.829369e+04  109
performance model for cuda_0
# hash       size       mean          dev           n
57618ab0     19660800   1.164144e+04  1.556094e+01  315
performance model for cuda_1
# hash       size       mean          dev           n
57618ab0     19660800   1.164271e+04  1.330628e+01  360
performance model for cuda_2
# hash       size       mean          dev           n
57618ab0     19660800   1.166730e+04  3.390395e+02  456
@end example

We can see that for the given size, over a sample of a few hundred
executions, the GPUs are about 20 times faster than the CPUs (numbers are in
us). The standard deviation is extremely low for the GPUs, and less than 10% for
CPUs.
This tool can also be used for regression-based performance models. It will then
display the regression formula, and in the case of non-linear regression, the
same performance log as for history-based performance models:
@example
$ starpu_perfmodel_display -s non_linear_memset_regression_based
performance model for cpu_impl_0
    Regression : #sample = 1400
    Linear: y = alpha size ^ beta
        alpha = 1.335973e-03
        beta = 8.024020e-01
    Non-Linear: y = a size ^b + c
        a = 5.429195e-04
        b = 8.654899e-01
        c = 9.009313e-01
# hash      size    mean          stddev        n
a3d3725e    4096    4.763200e+00  7.650928e-01  100
870a30aa    8192    1.827970e+00  2.037181e-01  100
48e988e9    16384   2.652800e+00  1.876459e-01  100
961e65d2    32768   4.255530e+00  3.518025e-01  100
...
@end example
The same can also be achieved by using StarPU's library API, see
@ref{Performance Model API} and notably the @code{starpu_perfmodel_load_symbol}
function. The source code of the @code{starpu_perfmodel_display} tool can be a
useful example.

The @code{starpu_perfmodel_plot} tool can be used to draw performance models.
It writes a @code{.gp} file in the current directory, to be run with the
@code{gnuplot} tool, which shows the corresponding curve.

When the @code{flops} field of tasks is set, @code{starpu_perfmodel_plot} can
directly draw a GFlops curve, by simply adding the @code{-f} option:
@example
$ starpu_perfmodel_plot -f -s chol_model_11
@end example

This will however disable displaying the regression model, for which we cannot
compute GFlops.
When the FxT trace file @code{filename} has been generated, it is possible to
get a profiling of each codelet by calling:
@example
$ starpu_fxt_tool -i filename
$ starpu_codelet_profile distrib.data codelet_name
@end example

This will create profiling data files, and a @code{.gp} file in the current
directory, which draws the distribution of codelet time over the application
execution, according to data input size.

This is also available in the @code{starpu_perfmodel_plot} tool, by passing it
the FxT trace:
@example
$ starpu_perfmodel_plot -s non_linear_memset_regression_based -i /tmp/prof_file_foo_0
@end example

It will produce a @code{.gp} file which contains both the performance model
curves and the profiling measurements.

If you have the R statistical tool installed, you can additionally use
@example
$ starpu_codelet_histo_profile distrib.data
@end example

which will create one pdf file per codelet and per input size, showing a
histogram of the codelet execution time distribution.
@node Memory feedback
@section Memory feedback

It is possible to enable memory statistics. To do so, you need to pass the
option @code{--enable-memory-stats} when running @code{configure}. It is then
possible to call the function @code{starpu_display_memory_stats()} to
display statistics about the current data handles registered within StarPU.

Moreover, statistics will be displayed at the end of the execution on
data handles which have not been cleared out. This can be disabled by
setting the environment variable @code{STARPU_MEMORY_STATS} to 0.

For example, if you do not unregister data at the end of the complex
example, you will get something similar to:

@example
$ STARPU_MEMORY_STATS=0 ./examples/interface/complex
Complex[0] = 45.00 + 12.00 i
Complex[0] = 78.00 + 78.00 i
Complex[0] = 45.00 + 12.00 i
Complex[0] = 45.00 + 12.00 i
@end example
@example
$ STARPU_MEMORY_STATS=1 ./examples/interface/complex
Complex[0] = 45.00 + 12.00 i
Complex[0] = 78.00 + 78.00 i
Complex[0] = 45.00 + 12.00 i
Complex[0] = 45.00 + 12.00 i

#---------------------
Memory stats:
#-------
Data on Node #3
#-----
Data : 0x553ff40
Size : 16

#--
Data access stats
/!\ Work Underway
Node #0
    Direct access : 4
    Loaded (Owner) : 0
    Loaded (Shared) : 0
    Invalidated (was Owner) : 0

Node #3
    Direct access : 0
    Loaded (Owner) : 0
    Loaded (Shared) : 1
    Invalidated (was Owner) : 0

#-----
Data : 0x5544710
Size : 16

#--
Data access stats
/!\ Work Underway
Node #0
    Direct access : 2
    Loaded (Owner) : 0
    Loaded (Shared) : 1
    Invalidated (was Owner) : 1

Node #3
    Direct access : 0
    Loaded (Owner) : 1
    Loaded (Shared) : 0
    Invalidated (was Owner) : 0
@end example
@node Data statistics
@section Data statistics

Different data statistics can be displayed at the end of the execution
of the application. To enable them, you need to pass the option
@code{--enable-stats} when calling @code{configure}. When calling
@code{starpu_shutdown()}, various statistics will be displayed:
MSI cache statistics, allocation cache statistics, and data
transfer statistics. The display can be disabled by setting the
environment variable @code{STARPU_STATS} to 0.
@example
$ ./examples/cholesky/cholesky_tag
Computation took (in ms)
518.16
Synthetic GFlops : 44.21
#---------------------
MSI cache stats :
TOTAL MSI stats   hit 1622 (66.23 %)   miss 827 (33.77 %)
...
@end example

@example
$ STARPU_STATS=0 ./examples/cholesky/cholesky_tag
Computation took (in ms)
518.16
Synthetic GFlops : 44.21
@end example
@c TODO: data transfer stats are similar to the ones displayed when
@c setting STARPU_BUS_STATS