@c -*-texinfo-*-
@c This file is part of the StarPU Handbook.
@c Copyright (C) 2009--2011 Universit@'e de Bordeaux 1
@c Copyright (C) 2010, 2011, 2012 Centre National de la Recherche Scientifique
@c Copyright (C) 2011, 2012 Institut National de Recherche en Informatique et Automatique
@c See the file starpu.texi for copying conditions.

@menu
* On-line::                     On-line performance feedback
* Off-line::                    Off-line performance feedback
* Codelet performance::         Performance of codelets
* Theoretical lower bound on execution time API::
@end menu
@node On-line
@section On-line performance feedback

@menu
* Enabling monitoring::         Enabling on-line performance monitoring
* Task feedback::               Per-task feedback
* Codelet feedback::            Per-codelet feedback
* Worker feedback::             Per-worker feedback
* Bus feedback::                Bus-related feedback
* StarPU-Top::                  StarPU-Top interface
@end menu
@node Enabling monitoring
@subsection Enabling on-line performance monitoring

In order to enable on-line performance monitoring, the application can call
@code{starpu_profiling_status_set(STARPU_PROFILING_ENABLE)}. It is possible to
detect whether monitoring is already enabled or not by calling
@code{starpu_profiling_status_get()}. Enabling monitoring also reinitializes all
previously collected feedback. The @code{STARPU_PROFILING} environment variable
can also be set to 1 to achieve the same effect.

Likewise, performance monitoring is stopped by calling
@code{starpu_profiling_status_set(STARPU_PROFILING_DISABLE)}. Note that this
does not reset the performance counters, so that the application may consult
them later on.

More details about the performance monitoring API are available in section
@ref{Profiling API}.
@node Task feedback
@subsection Per-task feedback

If profiling is enabled, a pointer to a @code{starpu_task_profiling_info}
structure is put in the @code{.profiling_info} field of the @code{starpu_task}
structure when a task terminates. This structure is automatically destroyed
when the task structure is destroyed, either automatically or by calling
@code{starpu_task_destroy}.

The @code{starpu_task_profiling_info} structure indicates the date when the
task was submitted (@code{submit_time}), started (@code{start_time}), and
terminated (@code{end_time}), relative to the initialization of StarPU with
@code{starpu_init}. It also specifies the identifier of the worker that has
executed the task (@code{workerid}). These dates are stored as @code{timespec}
structures, which the user may convert into micro-seconds using the
@code{starpu_timing_timespec_to_us} helper function.
It is worth noting that the application may directly access this structure from
the callback executed at the end of the task. The @code{starpu_task} structure
associated with the callback currently being executed is indeed accessible with
the @code{starpu_task_get_current()} function.
@node Codelet feedback
@subsection Per-codelet feedback

The @code{per_worker_stats} field of the @code{struct starpu_codelet} structure
is an array of counters. The i-th entry of the array is incremented every time
a task implementing the codelet is executed on the i-th worker. This array is
not reinitialized when profiling is enabled or disabled.
@node Worker feedback
@subsection Per-worker feedback

The second argument returned by the @code{starpu_worker_get_profiling_info}
function is a @code{starpu_worker_profiling_info} structure that gives
statistics about the specified worker. This structure specifies when StarPU
started collecting profiling information for that worker (@code{start_time}),
the duration of the profiling measurement interval (@code{total_time}), the
time spent executing kernels (@code{executing_time}), the time spent sleeping
because there is no task to execute at all (@code{sleeping_time}), and the
number of tasks that were executed while profiling was enabled. These values
give an estimation of the proportion of time spent doing real work, and the
time spent either sleeping because there are not enough executable tasks or
simply wasted in pure StarPU overhead.

Calling @code{starpu_worker_get_profiling_info} resets the profiling
information associated with a worker.
When an FxT trace is generated (see @ref{Generating traces}), it is also
possible to use the @code{starpu_workers_activity} script (described in
@ref{starpu-workers-activity}) to generate a graphic showing the evolution of
these values over time, for the different workers.
@node Bus feedback
@subsection Bus-related feedback

TODO: add STARPU_BUS_STATS

@c how to enable/disable performance monitoring
@c what kind of information do we get ?

The bus speed measured by StarPU can be displayed by using the
@code{starpu_machine_display} tool, for instance:

@example
StarPU has found:
        3 CUDA devices
                CUDA 0 (Tesla C2050 02:00.0)
                CUDA 1 (Tesla C2050 03:00.0)
                CUDA 2 (Tesla C2050 84:00.0)

from    to RAM          to CUDA 0       to CUDA 1       to CUDA 2
RAM     0.000000        5176.530428     5176.492994     5191.710722
CUDA 0  4523.732446     0.000000        2414.074751     2417.379201
CUDA 1  4523.718152     2414.078822     0.000000        2417.375119
CUDA 2  4534.229519     2417.069025     2417.060863     0.000000
@end example
@node StarPU-Top
@subsection StarPU-Top interface

StarPU-Top is an interface which remotely displays the on-line state of a
StarPU application and permits the user to change parameters on the fly.

Variables to be monitored can be registered by calling the
@code{starpu_top_add_data_boolean}, @code{starpu_top_add_data_integer},
@code{starpu_top_add_data_float} functions, e.g.:

@cartouche
@smallexample
starpu_top_data *data = starpu_top_add_data_integer("mynum", 0, 100, 1);
@end smallexample
@end cartouche

The application should then call @code{starpu_top_init_and_wait} to give its
name and wait for StarPU-Top to get a start request from the user. The name is
used by StarPU-Top to quickly reload a previously-saved layout of parameter
display.
@cartouche
@smallexample
starpu_top_init_and_wait("the application");
@end smallexample
@end cartouche

The new values can then be provided thanks to
@code{starpu_top_update_data_boolean}, @code{starpu_top_update_data_integer},
@code{starpu_top_update_data_float}, e.g.:

@cartouche
@smallexample
starpu_top_update_data_integer(data, mynum);
@end smallexample
@end cartouche
Updatable parameters can be registered thanks to
@code{starpu_top_register_parameter_boolean},
@code{starpu_top_register_parameter_integer},
@code{starpu_top_register_parameter_float}, e.g.:

@cartouche
@smallexample
float alpha;
starpu_top_register_parameter_float("alpha", &alpha, 0, 10, modif_hook);
@end smallexample
@end cartouche

@code{modif_hook} is a function which will be called when the parameter is
modified; it can for instance print the new value:

@cartouche
@smallexample
void modif_hook(struct starpu_top_param *d) @{
        fprintf(stderr, "%s has been modified: %f\n", d->name, alpha);
@}
@end smallexample
@end cartouche
Task schedulers should notify StarPU-Top when they have decided when a task
will be scheduled, so that it can show it in its Gantt chart, for instance:

@cartouche
@smallexample
starpu_top_task_prevision(task, workerid, begin, end);
@end smallexample
@end cartouche
Starting StarPU-Top@footnote{StarPU-Top is started via the binary
@code{starpu_top}.} and the application can be done in two ways:

@itemize
@item The application is started by hand on some machine (and thus already
waiting for the start event). In the Preference dialog of StarPU-Top, the SSH
checkbox should be unchecked, and the hostname and port (default is 2011) on
which the application is already running should be specified. Clicking on the
connection button will thus connect to the already-running application.
@item StarPU-Top is started first, and clicking on the connection button will
start the application itself (possibly on a remote machine). The SSH checkbox
should be checked, and a command line provided, e.g.:

@example
ssh myserver STARPU_SCHED=heft ./application
@end example

If port 2011 of the remote machine cannot be accessed directly, an SSH port
forwarding should be added:

@example
ssh -L 2011:localhost:2011 myserver STARPU_SCHED=heft ./application
@end example

and "localhost" should be used as the IP address to connect to.
@end itemize
@node Off-line
@section Off-line performance feedback

@menu
* Generating traces::           Generating traces with FxT
* Gantt diagram::               Creating a Gantt Diagram
* DAG::                         Creating a DAG with graphviz
* starpu-workers-activity::     Monitoring activity
@end menu

@node Generating traces
@subsection Generating traces with FxT
StarPU can use the FxT library (see
@indicateurl{https://savannah.nongnu.org/projects/fkt/}) to generate traces
with a limited runtime overhead.

You can either get a tarball:
@example
% wget http://download.savannah.gnu.org/releases/fkt/fxt-0.2.2.tar.gz
@end example

or use the FxT library from CVS (autotools are required):
@example
% cvs -d :pserver:anonymous@@cvs.sv.gnu.org:/sources/fkt co FxT
% ./bootstrap
@end example

Compiling and installing the FxT library in the @code{$FXTDIR} path is
done following the standard procedure:
@example
% ./configure --prefix=$FXTDIR
% make
% make install
@end example

In order to have StarPU generate traces, StarPU should be configured with
the @code{--with-fxt} option:
@example
$ ./configure --with-fxt=$FXTDIR
@end example

Or you can simply point @code{PKG_CONFIG_PATH} to
@code{$FXTDIR/lib/pkgconfig} and pass @code{--with-fxt} to @code{./configure}.

When FxT is enabled, a trace is generated when StarPU is terminated by calling
@code{starpu_shutdown()}. The trace is a binary file whose name has the form
@code{prof_file_XXX_YYY}, where @code{XXX} is the user name and @code{YYY} is
the pid of the process that used StarPU. This file is saved in the
@code{/tmp/} directory by default, or in the directory specified by
the @code{STARPU_FXT_PREFIX} environment variable.
@node Gantt diagram
@subsection Creating a Gantt Diagram

When an FxT trace file has been generated, it is possible to generate a trace
in the Paje format by calling:
@example
% starpu_fxt_tool -i filename
@end example

Alternatively, setting the @code{STARPU_GENERATE_TRACE} environment variable
to 1 before application execution will make StarPU do it automatically at
application shutdown.

This will create a @code{paje.trace} file in the current directory that
can be inspected with the @url{http://vite.gforge.inria.fr/, ViTE trace
visualizing open-source tool}. It is possible to open the
@code{paje.trace} file with ViTE by using the following command:
@example
% vite paje.trace
@end example

To get names of tasks instead of "unknown", fill the optional @code{name} field
of the codelets, or use a performance model for them.

By default, all tasks are displayed using a green color. To display tasks with
varying colors, pass option @code{-c} to @code{starpu_fxt_tool}.
@node DAG
@subsection Creating a DAG with graphviz

When an FxT trace file has been generated, it is possible to generate a task
graph in the DOT format by calling:
@example
$ starpu_fxt_tool -i filename
@end example

This will create a @code{dag.dot} file in the current directory. This file is a
task graph described using the DOT language. It is possible to get a
graphical output of the graph by using the graphviz library:
@example
$ dot -Tpdf dag.dot -o output.pdf
@end example
@node starpu-workers-activity
@subsection Monitoring activity

When an FxT trace file has been generated, it is possible to generate an
activity trace by calling:
@example
$ starpu_fxt_tool -i filename
@end example

This will create an @code{activity.data} file in the current
directory. A profile of the application showing the activity of StarPU
during the execution of the program can then be generated:
@example
$ starpu_workers_activity activity.data
@end example

This will create a file named @code{activity.eps} in the current directory.
This picture is composed of two parts.

The first part shows the activity of the different workers. The green sections
indicate which proportion of the time was spent executing kernels on the
processing unit. The red sections indicate the proportion of time spent in
StarPU: a large overhead may indicate that the task granularity is too small,
and that bigger tasks may be needed to use the processing unit more
efficiently. The black sections indicate that the processing unit was blocked
because there was no task to process: this may indicate a lack of parallelism,
which may be alleviated by creating more tasks when possible.

The second part of the @code{activity.eps} picture is a graph showing the
evolution of the number of tasks available in the system during the execution.
Ready tasks are shown in black, and tasks that are submitted but not
schedulable yet are shown in grey.
@node Codelet performance
@section Performance of codelets

The performance model of codelets (described in @ref{Performance model
example}) can be examined by using the @code{starpu_perfmodel_display} tool:
@example
$ starpu_perfmodel_display -l
file: &lt;malloc_pinned.hannibal&gt;
file: &lt;starpu_slu_lu_model_21.hannibal&gt;
file: &lt;starpu_slu_lu_model_11.hannibal&gt;
file: &lt;starpu_slu_lu_model_22.hannibal&gt;
file: &lt;starpu_slu_lu_model_12.hannibal&gt;
@end example

Here, the codelets of the lu example are available. We can examine the
performance of the 22 kernel (in micro-seconds):
@example
$ starpu_perfmodel_display -s starpu_slu_lu_model_22
performance model for cpu
# hash          size            mean            dev             n
57618ab0        19660800        2.851069e+05    1.829369e+04    109
performance model for cuda_0
# hash          size            mean            dev             n
57618ab0        19660800        1.164144e+04    1.556094e+01    315
performance model for cuda_1
# hash          size            mean            dev             n
57618ab0        19660800        1.164271e+04    1.330628e+01    360
performance model for cuda_2
# hash          size            mean            dev             n
57618ab0        19660800        1.166730e+04    3.390395e+02    456
@end example

We can see that for the given size, over a sample of a few hundred executions,
the GPUs are about 20 times faster than the CPUs (numbers are in
micro-seconds). The standard deviation is extremely low for the GPUs, and less
than 10% for the CPUs.

The @code{starpu_regression_display} tool does the same for regression-based
performance models. It also writes a @code{.gp} file in the current directory,
to be run with the @code{gnuplot} tool, which shows the corresponding curve.

The same can also be achieved by using StarPU's library API, see
@ref{Performance Model API} and notably the @code{starpu_perfmodel_load_symbol}
function. The source code of the @code{starpu_perfmodel_display} tool can be a
useful example.
@node Theoretical lower bound on execution time API
@section Theoretical lower bound on execution time

See @ref{Theoretical lower bound on execution time} for an example on how to
use this API. It makes it possible to record a trace of which tasks are needed
to complete the application, and then, by solving a linear system, provide a
theoretical lower bound on the execution time (i.e. with an ideal scheduling).

The computed bound is not really correct when not taking dependencies into
account, but for an application which has enough parallelism it is very close
to the bound computed with dependencies enabled (which takes much more time to
compute), and thus provides a good-enough estimation of the ideal execution
time.
@deftypefun void starpu_bound_start (int @var{deps}, int @var{prio})
Start recording tasks (resets stats). @var{deps} tells whether dependencies
should be recorded too (this is quite expensive).
@end deftypefun

@deftypefun void starpu_bound_stop (void)
Stop recording tasks.
@end deftypefun

@deftypefun void starpu_bound_print_dot ({FILE *}@var{output})
Print the DAG that was recorded.
@end deftypefun

@deftypefun void starpu_bound_compute ({double *}@var{res}, {double *}@var{integer_res}, int @var{integer})
Get the theoretical upper bound (in ms) (needs glpk support detected by the
@code{configure} script). It returns 0 if some performance models are not
calibrated.
@end deftypefun

@deftypefun void starpu_bound_print_lp ({FILE *}@var{output})
Emit the Linear Programming system on @var{output} for the recorded tasks, in
the lp format.
@end deftypefun

@deftypefun void starpu_bound_print_mps ({FILE *}@var{output})
Emit the Linear Programming system on @var{output} for the recorded tasks, in
the mps format.
@end deftypefun

@deftypefun void starpu_bound_print ({FILE *}@var{output}, int @var{integer})
Emit statistics of actual execution vs theoretical upper bound. @var{integer}
permits choosing between integer solving (which takes a long time but is
correct) and relaxed solving (which provides an approximate solution).
@end deftypefun