@c -*-texinfo-*-
@c This file is part of the StarPU Handbook.
@c Copyright (C) 2009--2011 Universit@'e de Bordeaux 1
@c Copyright (C) 2010, 2011 Centre National de la Recherche Scientifique
@c Copyright (C) 2011 Institut National de Recherche en Informatique et Automatique
@c See the file starpu.texi for copying conditions.

@node Performance feedback
@chapter Performance feedback

@menu
* On-line:: On-line performance feedback
* Off-line:: Off-line performance feedback
* Codelet performance:: Performance of codelets
* Theoretical lower bound on execution time API::
@end menu
@node On-line
@section On-line performance feedback

@menu
* Enabling monitoring:: Enabling on-line performance monitoring
* Task feedback:: Per-task feedback
* Codelet feedback:: Per-codelet feedback
* Worker feedback:: Per-worker feedback
* Bus feedback:: Bus-related feedback
* StarPU-Top:: StarPU-Top interface
@end menu

@node Enabling monitoring
@subsection Enabling on-line performance monitoring
In order to enable on-line performance monitoring, the application can call
@code{starpu_profiling_status_set(STARPU_PROFILING_ENABLE)}. It is possible to
detect whether monitoring is already enabled by calling
@code{starpu_profiling_status_get()}. Enabling monitoring also reinitializes all
previously collected feedback. The @code{STARPU_PROFILING} environment variable
can also be set to 1 to achieve the same effect.

Likewise, performance monitoring is stopped by calling
@code{starpu_profiling_status_set(STARPU_PROFILING_DISABLE)}. Note that this
does not reset the performance counters, so that the application may consult
them later on.

More details about the performance monitoring API are available in section
@ref{Profiling API}.
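In code, the enable/disable pair can bracket the region of interest. A minimal sketch (assuming an initialized StarPU application; error handling omitted):

```c
#include <starpu.h>
#include <starpu_profiling.h>

/* Sketch: collect profiling feedback only around a region of interest. */
void profile_region(void)
{
    /* Enabling also resets any previously collected feedback. */
    starpu_profiling_status_set(STARPU_PROFILING_ENABLE);

    /* ... submit and execute tasks here ... */

    /* Counters are kept after disabling, so they can still be consulted. */
    starpu_profiling_status_set(STARPU_PROFILING_DISABLE);
}
```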
@node Task feedback
@subsection Per-task feedback

If profiling is enabled, a pointer to a @code{starpu_task_profiling_info}
structure is put in the @code{.profiling_info} field of the @code{starpu_task}
structure when a task terminates.
This structure is automatically destroyed when the task structure is destroyed,
either automatically or by calling @code{starpu_task_destroy}.

The @code{starpu_task_profiling_info} structure indicates the date when the
task was submitted (@code{submit_time}), started (@code{start_time}), and
terminated (@code{end_time}), relative to the initialization of
StarPU with @code{starpu_init}. It also specifies the identifier of the worker
that has executed the task (@code{workerid}).
These dates are stored as @code{timespec} structures, which the user may convert
into microseconds using the @code{starpu_timing_timespec_to_us} helper
function.

It is worth noting that the application may directly access this structure from
the callback executed at the end of the task. The @code{starpu_task} structure
associated with the callback currently being executed is indeed accessible with
the @code{starpu_get_current_task()} function.
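As an illustration of the arithmetic involved, a task duration can be derived from two of these timestamps. The sketch below mirrors what @code{starpu_timing_timespec_to_us} presumably computes; in real code the helper itself should be preferred:

```c
#include <time.h>

/* Convert a timespec to microseconds; mirrors what the
 * starpu_timing_timespec_to_us helper presumably computes. */
double timespec_to_us(const struct timespec *ts)
{
    return (double) ts->tv_sec * 1e6 + (double) ts->tv_nsec / 1e3;
}

/* Elapsed time in microseconds between two profiling timestamps,
 * e.g. start_time and end_time from starpu_task_profiling_info. */
double elapsed_us(const struct timespec *start, const struct timespec *end)
{
    return timespec_to_us(end) - timespec_to_us(start);
}
```

Such a computation would typically be done in the task's end callback, where the profiling structure is directly accessible.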
@node Codelet feedback
@subsection Per-codelet feedback

The @code{per_worker_stats} field of the @code{starpu_codelet_t} structure is
an array of counters. The i-th entry of the array is incremented every time a
task implementing the codelet is executed on the i-th worker.
This array is not reinitialized when profiling is enabled or disabled.
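For instance, the per-worker counters can be aggregated to obtain the total number of executions of a codelet. The sketch below works on a plain array of counters, as one might read them from @code{per_worker_stats} (the worker count and counter type are assumptions for illustration):

```c
#include <stddef.h>

/* Sum per-worker execution counters, as one might do with the
 * per_worker_stats array of a codelet. */
unsigned long total_executions(const unsigned long *per_worker_stats,
                               size_t nworkers)
{
    unsigned long total = 0;
    for (size_t i = 0; i < nworkers; i++)
        total += per_worker_stats[i];
    return total;
}
```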
@node Worker feedback
@subsection Per-worker feedback

The second argument returned by the @code{starpu_worker_get_profiling_info}
function is a @code{starpu_worker_profiling_info} structure that gives
statistics about the specified worker. This structure specifies when StarPU
started collecting profiling information for that worker (@code{start_time}),
the duration of the profiling measurement interval (@code{total_time}), the
time spent executing kernels (@code{executing_time}), the time spent sleeping
because there was no task to execute at all (@code{sleeping_time}), and the
number of tasks that were executed while profiling was enabled.
These values give an estimation of the proportion of time spent doing real
work, and of the time either spent sleeping because there are not enough
executable tasks, or simply wasted in pure StarPU overhead.

Calling @code{starpu_worker_get_profiling_info} resets the profiling
information associated with a worker.

When an FxT trace is generated (see @ref{Generating traces}), it is also
possible to use the @code{starpu_top} script (described in @ref{starpu-top}) to
generate a graphic showing the evolution of these values over time, for
the different workers.
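The proportions mentioned above can be computed from those fields. The sketch below does the arithmetic on plain microsecond values, as one might obtain them after converting the @code{timespec} fields of @code{starpu_worker_profiling_info} (the conversion itself is left out):

```c
/* Percentage of the measurement interval spent executing kernels,
 * given executing_time and total_time expressed in microseconds. */
double executing_ratio(double executing_time_us, double total_time_us)
{
    if (total_time_us <= 0.0)
        return 0.0;
    return 100.0 * executing_time_us / total_time_us;
}

/* Remaining share of the interval: sleeping time plus runtime overhead. */
double overhead_and_sleep_ratio(double executing_time_us, double total_time_us)
{
    return 100.0 - executing_ratio(executing_time_us, total_time_us);
}
```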
@node Bus feedback
@subsection Bus-related feedback

TODO
@c how to enable/disable performance monitoring
@c what kind of information do we get ?

The bus speed measured by StarPU can be displayed by using the
@code{starpu_machine_display} tool, for instance:

@example
StarPU has found :
        3 CUDA devices
                CUDA 0 (Tesla C2050 02:00.0)
                CUDA 1 (Tesla C2050 03:00.0)
                CUDA 2 (Tesla C2050 84:00.0)
from    to RAM          to CUDA 0       to CUDA 1       to CUDA 2
RAM     0.000000        5176.530428     5176.492994     5191.710722
CUDA 0  4523.732446     0.000000        2414.074751     2417.379201
CUDA 1  4523.718152     2414.078822     0.000000        2417.375119
CUDA 2  4534.229519     2417.069025     2417.060863     0.000000
@end example
@node StarPU-Top
@subsection StarPU-Top interface

StarPU-Top is an interface which remotely displays the on-line state of a StarPU
application and permits the user to change parameters on the fly.

Variables to be monitored can be registered by calling the
@code{starpu_top_add_data_boolean}, @code{starpu_top_add_data_integer},
@code{starpu_top_add_data_float} functions, e.g.:

@example
starpu_top_data *data = starpu_top_add_data_integer("mynum", 0, 100, 1);
@end example

The application should then call @code{starpu_top_init_and_wait} to give its name
and wait for StarPU-Top to get a start request from the user. The name is used
by StarPU-Top to quickly reload a previously-saved layout of parameter display.

@example
starpu_top_init_and_wait("the application");
@end example

The new values can then be provided thanks to
@code{starpu_top_update_data_boolean}, @code{starpu_top_update_data_integer},
@code{starpu_top_update_data_float}, e.g.:

@example
starpu_top_update_data_integer(data, mynum);
@end example

Updateable parameters can be registered thanks to
@code{starpu_top_register_parameter_boolean},
@code{starpu_top_register_parameter_integer},
@code{starpu_top_register_parameter_float}, e.g.:

@example
float alpha;
starpu_top_register_parameter_float("alpha", &alpha, 0, 10, modif_hook);
@end example

@code{modif_hook} is a function which will be called when the parameter is
modified; it can for instance print the new value:

@example
void modif_hook(struct starpu_top_param_t *d) @{
  fprintf(stderr, "%s has been modified: %f\n", d->name, alpha);
@}
@end example
Task schedulers should notify StarPU-Top when they have decided when a task will
be scheduled, so that it can be shown in the Gantt chart, for instance:

@example
starpu_top_task_prevision(task, workerid, begin, end);
@end example

Starting StarPU-Top and the application can be done in two ways:

@itemize
@item The application is started by hand on some machine (and thus already
waiting for the start event). In the Preference dialog of StarPU-Top, the SSH
checkbox should be unchecked, and the hostname and port (default is 2011) on
which the application is already running should be specified. Clicking on the
connection button will then connect to the already-running application.
@item StarPU-Top is started first, and clicking on the connection button will
start the application itself (possibly on a remote machine). The SSH checkbox
should be checked, and a command line provided, e.g.:

@example
ssh myserver STARPU_SCHED=heft ./application
@end example

If port 2011 of the remote machine cannot be accessed directly, an SSH port
forward should be added:

@example
ssh -L 2011:localhost:2011 myserver STARPU_SCHED=heft ./application
@end example

and "localhost" should be used as the IP address to connect to.
@end itemize
@node Off-line
@section Off-line performance feedback

@menu
* Generating traces:: Generating traces with FxT
* Gantt diagram:: Creating a Gantt Diagram
* DAG:: Creating a DAG with graphviz
* starpu-top:: Monitoring activity
@end menu

@node Generating traces
@subsection Generating traces with FxT

StarPU can use the FxT library (see
@indicateurl{https://savannah.nongnu.org/projects/fkt/}) to generate traces
with a limited runtime overhead.

You can either get a tarball:
@example
% wget http://download.savannah.gnu.org/releases/fkt/fxt-0.2.2.tar.gz
@end example

or use the FxT library from CVS (autotools are required):
@example
% cvs -d :pserver:anonymous@@cvs.sv.gnu.org:/sources/fkt co FxT
% ./bootstrap
@end example

Compiling and installing the FxT library in the @code{$FXTDIR} path is
done following the standard procedure:
@example
% ./configure --prefix=$FXTDIR
% make
% make install
@end example
In order to have StarPU generate traces, StarPU should be configured with
the @code{--with-fxt} option:
@example
$ ./configure --with-fxt=$FXTDIR
@end example

Alternatively, you can simply point @code{PKG_CONFIG_PATH} to
@code{$FXTDIR/lib/pkgconfig} and pass @code{--with-fxt} to @code{./configure}.

When FxT is enabled, a trace is generated when StarPU is terminated by calling
@code{starpu_shutdown()}. The trace is a binary file whose name has the form
@code{prof_file_XXX_YYY} where @code{XXX} is the user name, and
@code{YYY} is the pid of the process that used StarPU. This file is saved in the
@code{/tmp/} directory by default, or in the directory specified by
the @code{STARPU_FXT_PREFIX} environment variable.
@node Gantt diagram
@subsection Creating a Gantt Diagram

When the FxT trace file @code{filename} has been generated, it is possible to
generate a trace in the Paje format by calling:
@example
% starpu_fxt_tool -i filename
@end example

Alternatively, setting the @code{STARPU_GENERATE_TRACE} environment variable
to 1 before application execution will make StarPU do it automatically at
application shutdown.

This will create a @code{paje.trace} file in the current directory that can be
inspected with the ViTE open-source trace visualization tool. More information
about ViTE is available at @indicateurl{http://vite.gforge.inria.fr/}. It is
possible to open the @code{paje.trace} file with ViTE by using the following
command:
@example
% vite paje.trace
@end example
@node DAG
@subsection Creating a DAG with graphviz

When the FxT trace file @code{filename} has been generated, it is possible to
generate a task graph in the DOT format by calling:
@example
$ starpu_fxt_tool -i filename
@end example

This will create a @code{dag.dot} file in the current directory. This file is a
task graph described using the DOT language. It is possible to get a
graphical output of the graph by using the graphviz library:
@example
$ dot -Tpdf dag.dot -o output.pdf
@end example
@node starpu-top
@subsection Monitoring activity

When the FxT trace file @code{filename} has been generated, it is possible to
generate an activity trace by calling:
@example
$ starpu_fxt_tool -i filename
@end example

This will create an @code{activity.data} file in the current
directory. A profile of the application showing the activity of StarPU
during the execution of the program can then be generated:
@example
$ starpu_top activity.data
@end example

This will create a file named @code{activity.eps} in the current directory.
This picture is composed of two parts.

The first part shows the activity of the different workers. The green sections
indicate which proportion of the time was spent executing kernels on the
processing unit. The red sections indicate the proportion of time spent in
StarPU: a large overhead may indicate that the granularity is too low, and that
bigger tasks may be needed to use the processing unit more efficiently. The
black sections indicate that the processing unit was blocked because there was
no task to process: this may indicate a lack of parallelism, which may be
alleviated by creating more tasks when possible.

The second part of the @code{activity.eps} picture is a graph showing the
evolution of the number of tasks available in the system during the execution.
Ready tasks are shown in black, and tasks that are submitted but not
schedulable yet are shown in grey.
@node Codelet performance
@section Performance of codelets

The performance model of codelets (described in @ref{Performance model example})
can be examined by using the @code{starpu_perfmodel_display} tool:

@example
$ starpu_perfmodel_display -l
file: <malloc_pinned.hannibal>
file: <starpu_slu_lu_model_21.hannibal>
file: <starpu_slu_lu_model_11.hannibal>
file: <starpu_slu_lu_model_22.hannibal>
file: <starpu_slu_lu_model_12.hannibal>
@end example

Here, the codelets of the lu example are available. We can examine the
performance of the 22 kernel:

@example
$ starpu_perfmodel_display -s starpu_slu_lu_model_22
performance model for cpu
# hash      size       mean          dev           n
57618ab0    19660800   2.851069e+05  1.829369e+04  109
performance model for cuda_0
# hash      size       mean          dev           n
57618ab0    19660800   1.164144e+04  1.556094e+01  315
performance model for cuda_1
# hash      size       mean          dev           n
57618ab0    19660800   1.164271e+04  1.330628e+01  360
performance model for cuda_2
# hash      size       mean          dev           n
57618ab0    19660800   1.166730e+04  3.390395e+02  456
@end example

We can see that for the given size, over a sample of a few hundred
executions, the GPUs are about 20 times faster than the CPUs (numbers are in
us). The standard deviation is extremely low for the GPUs, and less than 10%
for the CPUs.

The @code{starpu_regression_display} tool does the same for regression-based
performance models. It also writes a @code{.gp} file in the current directory,
to be run with the @code{gnuplot} tool, which shows the corresponding curve.

The same can also be achieved by using StarPU's library API, see
@ref{Performance Model API} and notably the @code{starpu_load_history_debug}
function. The source code of the @code{starpu_perfmodel_display} tool can be a
useful example.
@node Theoretical lower bound on execution time API
@section Theoretical lower bound on execution time

See @ref{Theoretical lower bound on execution time} for an example of how to use
this API. It permits recording a trace of the tasks needed to complete the
application, and then, by solving a linear system, computing a theoretical lower
bound on the execution time (i.e. with an ideal scheduling).

The computed bound is not really correct when dependencies are not taken into
account, but for an application which has enough parallelism it is very close
to the bound computed with dependencies enabled (which takes considerably more
time to compute), and thus provides a good-enough estimation of the ideal
execution time.
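A typical call sequence, sketched from the functions documented below (assuming an initialized StarPU application; error handling omitted):

```c
#include <stdio.h>
#include <starpu.h>
#include <starpu_bound.h>

/* Sketch only: record the tasks of one run and print the computed
 * bound against the actual execution time. */
void record_and_print_bound(void)
{
    /* Record tasks without dependencies (deps = 0), ignoring priorities. */
    starpu_bound_start(0, 0);

    /* ... submit tasks and wait for their completion here,
     * e.g. with starpu_task_wait_for_all() ... */

    starpu_bound_stop();

    /* Compare actual execution with the relaxed (non-integer) bound;
     * needs glpk support detected at configure time. */
    starpu_bound_print(stderr, 0);
}
```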
@deftypefun void starpu_bound_start (int @var{deps}, int @var{prio})
Start recording tasks (resets stats). @var{deps} tells whether
dependencies should be recorded too (this is quite expensive).
@end deftypefun

@deftypefun void starpu_bound_stop (void)
Stop recording tasks.
@end deftypefun

@deftypefun void starpu_bound_print_dot ({FILE *}@var{output})
Print the DAG that was recorded.
@end deftypefun

@deftypefun void starpu_bound_compute ({double *}@var{res}, {double *}@var{integer_res}, int @var{integer})
Get the theoretical bound (in ms) (needs glpk support detected by the
@code{configure} script).
@end deftypefun

@deftypefun void starpu_bound_print_lp ({FILE *}@var{output})
Emit the Linear Programming system on @var{output} for the recorded tasks, in
the lp format.
@end deftypefun

@deftypefun void starpu_bound_print_mps ({FILE *}@var{output})
Emit the Linear Programming system on @var{output} for the recorded tasks, in
the mps format.
@end deftypefun

@deftypefun void starpu_bound_print ({FILE *}@var{output}, int @var{integer})
Emit statistics of actual execution vs the theoretical bound. @var{integer}
permits choosing between integer solving (which takes a long time but is
correct) and relaxed solving (which provides an approximate solution).
@end deftypefun