- @c -*-texinfo-*-
- @c This file is part of the StarPU Handbook.
- @c Copyright (C) 2009--2011 Universit@'e de Bordeaux 1
- @c Copyright (C) 2010, 2011, 2012 Centre National de la Recherche Scientifique
- @c Copyright (C) 2011, 2012 Institut National de Recherche en Informatique et Automatique
- @c See the file starpu.texi for copying conditions.
- @menu
- * Task debugger:: Using the Temanejo task debugger
- * On-line:: On-line performance feedback
- * Off-line:: Off-line performance feedback
- * Codelet performance:: Performance of codelets
- * Theoretical lower bound on execution time API::
- * Memory feedback::
- * Data statistics::
- @end menu
- @node Task debugger
- @section Using the Temanejo task debugger
- StarPU can connect to Temanejo (see
- @url{http://www.hlrs.de/organization/av/spmt/research/temanejo/}) to permit
- convenient visual task debugging. To do so, build Temanejo's @code{libayudame.so},
- install @code{Ayudame} to e.g. @code{/usr/local/include}, apply
- @code{tools/patch-ayudame} to it to fix the C build, re-run @code{./configure},
- make sure that it detects Ayudame, and rebuild StarPU. Then run the Temanejo
- GUI and give it the path to your application, any options you want to pass it,
- and the path to @code{libayudame.so}.
- The number of CPUs currently cannot be set.
- Only implicitly-detected dependencies are currently shown.
- @node On-line
- @section On-line performance feedback
- @menu
- * Enabling on-line performance monitoring::
- * Task feedback:: Per-task feedback
- * Codelet feedback:: Per-codelet feedback
- * Worker feedback:: Per-worker feedback
- * Bus feedback:: Bus-related feedback
- * StarPU-Top:: StarPU-Top interface
- @end menu
- @node Enabling on-line performance monitoring
- @subsection Enabling on-line performance monitoring
- In order to enable online performance monitoring, the application can call
- @code{starpu_profiling_status_set(STARPU_PROFILING_ENABLE)}. It is possible to
- detect whether monitoring is already enabled or not by calling
- @code{starpu_profiling_status_get()}. Enabling monitoring also reinitializes all
- previously collected feedback. The @code{STARPU_PROFILING} environment variable
- can also be set to 1 to achieve the same effect.
- Likewise, performance monitoring is stopped by calling
- @code{starpu_profiling_status_set(STARPU_PROFILING_DISABLE)}. Note that this
- does not reset the performance counters so that the application may consult
- them later on.
- More details about the performance monitoring API are available in section
- @ref{Profiling API}.
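- As an illustrative sketch, monitoring can thus be enabled around a phase of
- interest only (the surrounding task submission is left elided):
- @cartouche
- @smallexample
- /* Enable monitoring; this also resets previously collected feedback */
- starpu_profiling_status_set(STARPU_PROFILING_ENABLE);
- 
- /* ... submit and wait for the tasks of interest ... */
- 
- /* Stop monitoring; the counters are kept and may still be consulted */
- starpu_profiling_status_set(STARPU_PROFILING_DISABLE);
- @end smallexample
- @end cartouche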
- @node Task feedback
- @subsection Per-task feedback
- If profiling is enabled, a pointer to a @code{starpu_task_profiling_info}
- structure is put in the @code{.profiling_info} field of the @code{starpu_task}
- structure when a task terminates.
- This structure is automatically destroyed when the task structure is destroyed,
- either automatically or by calling @code{starpu_task_destroy}.
- The @code{starpu_task_profiling_info} structure indicates the date when the
- task was submitted (@code{submit_time}), started (@code{start_time}), and
- terminated (@code{end_time}), relative to the initialization of
- StarPU with @code{starpu_init}. It also specifies the identifier of the worker
- that has executed the task (@code{workerid}).
- These dates are stored as @code{timespec} structures, which the user may convert
- into micro-seconds using the @code{starpu_timing_timespec_to_us} helper
- function.
- It is worth noting that the application may directly access this structure from
- the callback executed at the end of the task. The @code{starpu_task} structure
- associated to the callback currently being executed is indeed accessible with
- the @code{starpu_task_get_current()} function.
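- For instance, the end-of-task callback can convert these dates into
- micro-seconds (a sketch; the callback and message wording are illustrative):
- @cartouche
- @smallexample
- void terminate_callback(void *arg)
- @{
-   struct starpu_task *task = starpu_task_get_current();
-   struct starpu_task_profiling_info *info = task->profiling_info;
-   /* Time the task waited between submission and execution start */
-   double queued = starpu_timing_timespec_to_us(&info->start_time)
-                 - starpu_timing_timespec_to_us(&info->submit_time);
-   /* Time the task actually ran, and on which worker */
-   double length = starpu_timing_timespec_to_us(&info->end_time)
-                 - starpu_timing_timespec_to_us(&info->start_time);
-   fprintf(stderr, "worker %d: queued %f us, ran %f us\n",
-           info->workerid, queued, length);
- @}
- @end smallexample
- @end cartouche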
- @node Codelet feedback
- @subsection Per-codelet feedback
- The @code{per_worker_stats} field of the @code{struct starpu_codelet} structure is
- an array of counters. The i-th entry of the array is incremented every time a
- task implementing the codelet is executed on the i-th worker.
- This array is not reinitialized when profiling is enabled or disabled.
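- As a sketch, these counters can be dumped once the application has finished
- (assuming a codelet variable @code{cl}; the counters are printed as unsigned
- longs here):
- @cartouche
- @smallexample
- unsigned worker;
- for (worker = 0; worker < starpu_worker_get_count(); worker++)
-   fprintf(stderr, "worker %u executed the codelet %lu times\n",
-           worker, cl.per_worker_stats[worker]);
- @end smallexample
- @end cartouche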
- @node Worker feedback
- @subsection Per-worker feedback
- The @code{starpu_worker_get_profiling_info} function fills its second
- argument, a @code{starpu_worker_profiling_info} structure, with
- statistics about the specified worker. This structure specifies when StarPU
- started collecting profiling information for that worker (@code{start_time}),
- the duration of the profiling measurement interval (@code{total_time}), the
- time spent executing kernels (@code{executing_time}), the time spent sleeping
- because there is no task to execute at all (@code{sleeping_time}), and the
- number of tasks that were executed while profiling was enabled
- (@code{executed_tasks}).
- These values give an estimation of the proportion of time spent doing real
- work, and of the time spent either sleeping because there are not enough
- executable tasks, or simply wasted in pure StarPU overhead.
- Calling @code{starpu_worker_get_profiling_info} resets the profiling
- information associated with a worker.
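- For instance, the proportion of time a given worker spent executing kernels
- can be estimated as follows (a sketch; error checking omitted):
- @cartouche
- @smallexample
- struct starpu_worker_profiling_info info;
- int workerid = 0; /* first worker, for the sake of the example */
- starpu_worker_get_profiling_info(workerid, &info);
- /* NB: this call also resets the profiling information of the worker */
- double total = starpu_timing_timespec_to_us(&info.total_time);
- double executing = starpu_timing_timespec_to_us(&info.executing_time);
- fprintf(stderr, "worker %d: %.1f%% of the time spent executing kernels\n",
-         workerid, 100. * executing / total);
- @end smallexample
- @end cartouche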
- When an FxT trace is generated (see @ref{Generating traces}), it is also
- possible to use the @code{starpu_workers_activity} script (described in
- @ref{starpu-workers-activity}) to generate a graphic showing the evolution of
- these values over time, for the different workers.
- @node Bus feedback
- @subsection Bus-related feedback
- @c TODO: document STARPU_BUS_STATS
- @c how to enable/disable performance monitoring
- @c what kind of information do we get ?
- The bus speed measured by StarPU can be displayed by using the
- @code{starpu_machine_display} tool, for instance:
- @example
- StarPU has found:
- 3 CUDA devices
- CUDA 0 (Tesla C2050 02:00.0)
- CUDA 1 (Tesla C2050 03:00.0)
- CUDA 2 (Tesla C2050 84:00.0)
- from        to RAM          to CUDA 0       to CUDA 1       to CUDA 2
- RAM         0.000000        5176.530428     5176.492994     5191.710722
- CUDA 0      4523.732446     0.000000        2414.074751     2417.379201
- CUDA 1      4523.718152     2414.078822     0.000000        2417.375119
- CUDA 2      4534.229519     2417.069025     2417.060863     0.000000
- @end example
- @node StarPU-Top
- @subsection StarPU-Top interface
- StarPU-Top is an interface which remotely displays the on-line state of a StarPU
- application and permits the user to change parameters on the fly.
- Variables to be monitored can be registered by calling the
- @code{starpu_top_add_data_boolean}, @code{starpu_top_add_data_integer},
- @code{starpu_top_add_data_float} functions, e.g.:
- @cartouche
- @smallexample
- starpu_top_data *data = starpu_top_add_data_integer("mynum", 0, 100, 1);
- @end smallexample
- @end cartouche
- The application should then call @code{starpu_top_init_and_wait} to give its name
- and wait for StarPU-Top to get a start request from the user. The name is used
- by StarPU-Top to quickly reload a previously-saved layout of parameter display.
- @cartouche
- @smallexample
- starpu_top_init_and_wait("the application");
- @end smallexample
- @end cartouche
- The new values can then be provided thanks to
- @code{starpu_top_update_data_boolean}, @code{starpu_top_update_data_integer},
- @code{starpu_top_update_data_float}, e.g.:
- @cartouche
- @smallexample
- starpu_top_update_data_integer(data, mynum);
- @end smallexample
- @end cartouche
- Updateable parameters can be registered thanks to @code{starpu_top_register_parameter_boolean}, @code{starpu_top_register_parameter_integer}, @code{starpu_top_register_parameter_float}, e.g.:
- @cartouche
- @smallexample
- float alpha;
- starpu_top_register_parameter_float("alpha", &alpha, 0, 10, modif_hook);
- @end smallexample
- @end cartouche
- @code{modif_hook} is a function which will be called when the parameter is modified; it can for instance print the new value:
- @cartouche
- @smallexample
- void modif_hook(struct starpu_top_param *d) @{
- fprintf(stderr,"%s has been modified: %f\n", d->name, alpha);
- @}
- @end smallexample
- @end cartouche
- Task schedulers should notify StarPU-Top when they have decided when a task
- will be scheduled, so that it can show it in its Gantt chart, for instance:
- @cartouche
- @smallexample
- starpu_top_task_prevision(task, workerid, begin, end);
- @end smallexample
- @end cartouche
- Starting StarPU-Top@footnote{StarPU-Top is started via the binary
- @code{starpu_top}.} and the application can be done two ways:
- @itemize
- @item The application is started by hand on some machine (and thus already
- waiting for the start event). In the Preference dialog of StarPU-Top, the SSH
- checkbox should be unchecked, and the hostname and port (default is 2011) on
- which the application is already running should be specified. Clicking on the
- connection button will thus connect to the already-running application.
- @item StarPU-Top is started first, and clicking on the connection button will
- start the application itself (possibly on a remote machine). The SSH checkbox
- should be checked, and a command line provided, e.g.:
- @example
- ssh myserver STARPU_SCHED=heft ./application
- @end example
- If port 2011 of the remote machine cannot be accessed directly, an SSH port forward should be added:
- @example
- ssh -L 2011:localhost:2011 myserver STARPU_SCHED=heft ./application
- @end example
- and @code{localhost} should be used as the IP address to connect to.
- @end itemize
- @node Off-line
- @section Off-line performance feedback
- @menu
- * Generating traces:: Generating traces with FxT
- * Gantt diagram:: Creating a Gantt Diagram
- * DAG:: Creating a DAG with graphviz
- * starpu-workers-activity:: Monitoring activity
- @end menu
- @node Generating traces
- @subsection Generating traces with FxT
- StarPU can use the FxT library (see
- @indicateurl{https://savannah.nongnu.org/projects/fkt/}) to generate traces
- with a limited runtime overhead.
- You can either get a tarball:
- @example
- % wget http://download.savannah.gnu.org/releases/fkt/fxt-0.2.2.tar.gz
- @end example
- or use the FxT library from CVS (autotools are required):
- @example
- % cvs -d :pserver:anonymous@@cvs.sv.gnu.org:/sources/fkt co FxT
- % ./bootstrap
- @end example
- Compiling and installing the FxT library in the @code{$FXTDIR} path is
- done following the standard procedure:
- @example
- % ./configure --prefix=$FXTDIR
- % make
- % make install
- @end example
- In order to have StarPU generate traces, StarPU should be configured with
- the @code{--with-fxt} option:
- @example
- $ ./configure --with-fxt=$FXTDIR
- @end example
- Alternatively, you can simply point @code{PKG_CONFIG_PATH} to
- @code{$FXTDIR/lib/pkgconfig} and pass @code{--with-fxt} to @code{./configure}.
- When FxT is enabled, a trace is generated when StarPU is terminated by calling
- @code{starpu_shutdown()}. The trace is a binary file whose name has the form
- @code{prof_file_XXX_YYY} where @code{XXX} is the user name, and
- @code{YYY} is the pid of the process that used StarPU. This file is saved in the
- @code{/tmp/} directory by default, or in the directory specified by
- the @code{STARPU_FXT_PREFIX} environment variable.
- @node Gantt diagram
- @subsection Creating a Gantt Diagram
- When the FxT trace file @code{prof_file_XXX_YYY} has been generated, it is possible to
- generate a trace in the Paje format by calling:
- @example
- % starpu_fxt_tool -i filename
- @end example
- Alternatively, setting the @code{STARPU_GENERATE_TRACE} environment variable
- to 1 before application execution will make StarPU do it automatically at
- application shutdown.
- This will create a @code{paje.trace} file in the current directory that
- can be inspected with the @url{http://vite.gforge.inria.fr/, ViTE trace
- visualizing open-source tool}. It is possible to open the
- @code{paje.trace} file with ViTE by using the following command:
- @example
- % vite paje.trace
- @end example
- To get names of tasks instead of "unknown", fill the optional @code{name} field
- of the codelets, or use a performance model for them.
- By default, all tasks are displayed using a green color. To display tasks with
- varying colors, pass option @code{-c} to @code{starpu_fxt_tool}.
- @node DAG
- @subsection Creating a DAG with graphviz
- When the FxT trace file @code{prof_file_XXX_YYY} has been generated, it is possible to
- generate a task graph in the DOT format by calling:
- @example
- $ starpu_fxt_tool -i filename
- @end example
- This will create a @code{dag.dot} file in the current directory. This file is a
- task graph described using the DOT language. It is possible to get a
- graphical output of the graph by using the graphviz library:
- @example
- $ dot -Tpdf dag.dot -o output.pdf
- @end example
- @node starpu-workers-activity
- @subsection Monitoring activity
- When the FxT trace file @code{prof_file_XXX_YYY} has been generated, it is possible to
- generate an activity trace by calling:
- @example
- $ starpu_fxt_tool -i filename
- @end example
- This will create an @code{activity.data} file in the current
- directory. A profile of the application showing the activity of StarPU
- during the execution of the program can be generated:
- @example
- $ starpu_workers_activity activity.data
- @end example
- This will create a file named @code{activity.eps} in the current directory.
- This picture is composed of two parts.
- The first part shows the activity of the different workers. The green sections
- indicate which proportion of the time was spent executing kernels on the
- processing unit. The red sections indicate the proportion of time spent in
- StarPU: a significant overhead may indicate that the granularity is too
- low, and that bigger tasks may be needed to use the processing unit more
- efficiently. The black sections indicate that the processing unit was blocked
- because there was no task to process: this may indicate a lack of parallelism,
- which may be alleviated by creating more tasks when possible.
- The second part of the @code{activity.eps} picture is a graph showing the
- evolution of the number of tasks available in the system during the execution.
- Ready tasks are shown in black, and tasks that are submitted but not
- schedulable yet are shown in grey.
- @node Codelet performance
- @section Performance of codelets
- The performance model of codelets (described in @ref{Performance model example}) can be examined by using the
- @code{starpu_perfmodel_display} tool:
- @example
- $ starpu_perfmodel_display -l
- file: <malloc_pinned.hannibal>
- file: <starpu_slu_lu_model_21.hannibal>
- file: <starpu_slu_lu_model_11.hannibal>
- file: <starpu_slu_lu_model_22.hannibal>
- file: <starpu_slu_lu_model_12.hannibal>
- @end example
- Here, the codelets of the lu example are available. We can examine the
- performance of the 22 kernel (in micro-seconds):
- @example
- $ starpu_perfmodel_display -s starpu_slu_lu_model_22
- performance model for cpu
- # hash size mean dev n
- 57618ab0 19660800 2.851069e+05 1.829369e+04 109
- performance model for cuda_0
- # hash size mean dev n
- 57618ab0 19660800 1.164144e+04 1.556094e+01 315
- performance model for cuda_1
- # hash size mean dev n
- 57618ab0 19660800 1.164271e+04 1.330628e+01 360
- performance model for cuda_2
- # hash size mean dev n
- 57618ab0 19660800 1.166730e+04 3.390395e+02 456
- @end example
- We can see that for the given size, over a sample of a few hundred
- executions, the GPUs are about 20 times faster than the CPUs (numbers are in
- us). The standard deviation is extremely low for the GPUs, and less than 10%
- for the CPUs.
- The @code{starpu_regression_display} tool does the same for regression-based
- performance models. It also writes a @code{.gp} file in the current directory,
- to be run in the @code{gnuplot} tool, which shows the corresponding curve.
- The same can also be achieved by using StarPU's library API, see
- @ref{Performance Model API} and notably the @code{starpu_perfmodel_load_symbol}
- function. The source code of the @code{starpu_perfmodel_display} tool can be a
- useful example.
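- As an illustrative sketch (error checking omitted), a model saved under a
- symbol can be loaded with this API, here using the symbol shown in the listing
- above:
- @cartouche
- @smallexample
- struct starpu_perfmodel model;
- starpu_perfmodel_load_symbol("starpu_slu_lu_model_22", &model);
- /* The loaded structure can then be inspected, in the same way as
-    the starpu_perfmodel_display tool does. */
- @end smallexample
- @end cartouche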
- @node Theoretical lower bound on execution time API
- @section Theoretical lower bound on execution time
- See @ref{Theoretical lower bound on execution time} for an example of how to
- use this API. It permits recording a trace of the tasks needed to complete the
- application, and then, by solving a linear system, provides a theoretical lower
- bound on the execution time (i.e. with an ideal scheduling).
- The computed bound is not really correct when dependencies are not taken into
- account, but for an application which has enough parallelism it is very
- close to the bound computed with dependencies enabled (which takes far more
- time to compute), and thus provides a good-enough estimation of the ideal
- execution time.
- @deftypefun void starpu_bound_start (int @var{deps}, int @var{prio})
- Start recording tasks (resets stats). @var{deps} tells whether
- dependencies should be recorded too (this is quite expensive).
- @end deftypefun
- @deftypefun void starpu_bound_stop (void)
- Stop recording tasks
- @end deftypefun
- @deftypefun void starpu_bound_print_dot ({FILE *}@var{output})
- Print the DAG that was recorded
- @end deftypefun
- @deftypefun void starpu_bound_compute ({double *}@var{res}, {double *}@var{integer_res}, int @var{integer})
- Get the theoretical lower bound (in ms); this needs glpk support, detected by the @code{configure} script. It returns 0 if some performance models are not calibrated.
- @end deftypefun
- @deftypefun void starpu_bound_print_lp ({FILE *}@var{output})
- Emit the Linear Programming system on @var{output} for the recorded tasks, in
- the lp format
- @end deftypefun
- @deftypefun void starpu_bound_print_mps ({FILE *}@var{output})
- Emit the Linear Programming system on @var{output} for the recorded tasks, in
- the mps format
- @end deftypefun
- @deftypefun void starpu_bound_print ({FILE *}@var{output}, int @var{integer})
- Emit statistics of actual execution vs. the theoretical lower bound.
- @var{integer} selects between integer solving (which takes a long time but is
- exact) and relaxed solving (which provides an approximate solution faster).
- @end deftypefun
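- As a sketch, a typical use brackets the task submission phase with the
- recording calls and prints the comparison before shutting down:
- @cartouche
- @smallexample
- starpu_bound_start(0, 0); /* no dependencies, no priorities: fast but loose */
- 
- /* ... submit the tasks of the application ... */
- 
- starpu_task_wait_for_all();
- starpu_bound_stop();
- starpu_bound_print(stderr, 0); /* relaxed (non-integer) solving */
- @end smallexample
- @end cartouche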
- @node Memory feedback
- @section Memory feedback
- It is possible to enable memory statistics. To do so, you need to pass the option
- @code{--enable-memory-stats} when running configure. It is then
- possible to call the function @code{starpu_display_memory_stats()} to
- display statistics about the current data handles registered within StarPU.
- Moreover, statistics will be displayed at the end of the execution on
- data handles which have not been cleared out. This can be disabled by
- setting the environment variable @code{STARPU_MEMORY_STATS} to 0.
- For example, if you do not unregister data at the end of the complex
- example, you will get something similar to:
- @example
- $ STARPU_MEMORY_STATS=0 ./examples/interface/complex
- Complex[0] = 45.00 + 12.00 i
- Complex[0] = 78.00 + 78.00 i
- Complex[0] = 45.00 + 12.00 i
- Complex[0] = 45.00 + 12.00 i
- @end example
- @example
- $ STARPU_MEMORY_STATS=1 ./examples/interface/complex
- Complex[0] = 45.00 + 12.00 i
- Complex[0] = 78.00 + 78.00 i
- Complex[0] = 45.00 + 12.00 i
- Complex[0] = 45.00 + 12.00 i
- #---------------------
- Memory stats:
- #-------
- Data on Node #3
- #-----
- Data : 0x553ff40
- Size : 16
- #--
- Data access stats
- /!\ Work Underway
- Node #0
- Direct access : 4
- Loaded (Owner) : 0
- Loaded (Shared) : 0
- Invalidated (was Owner) : 0
- Node #3
- Direct access : 0
- Loaded (Owner) : 0
- Loaded (Shared) : 1
- Invalidated (was Owner) : 0
- #-----
- Data : 0x5544710
- Size : 16
- #--
- Data access stats
- /!\ Work Underway
- Node #0
- Direct access : 2
- Loaded (Owner) : 0
- Loaded (Shared) : 1
- Invalidated (was Owner) : 1
- Node #3
- Direct access : 0
- Loaded (Owner) : 1
- Loaded (Shared) : 0
- Invalidated (was Owner) : 0
- @end example
- @node Data statistics
- @section Data statistics
- Different data statistics can be displayed at the end of the execution
- of the application. To enable them, you need to pass the option
- @code{--enable-stats} when calling @code{configure}. When calling
- @code{starpu_shutdown()}, various statistics will be displayed: execution
- statistics, MSI cache statistics, allocation cache statistics, and data
- transfer statistics. The display can be disabled by setting the
- environment variable @code{STARPU_STATS} to 0.
- @example
- $ ./examples/cholesky/cholesky_tag
- Computation took (in ms)
- 518.16
- Synthetic GFlops : 44.21
- #---------------------
- MSI cache stats :
- TOTAL MSI stats hit 1622 (66.23 %) miss 827 (33.77 %)
- ...
- @end example
- @example
- $ STARPU_STATS=0 ./examples/cholesky/cholesky_tag
- Computation took (in ms)
- 518.16
- Synthetic GFlops : 44.21
- @end example
- @c TODO: data transfer stats are similar to the ones displayed when
- @c setting STARPU_BUS_STATS
|