@c -*-texinfo-*-
@c This file is part of the StarPU Handbook.
@c Copyright (C) 2009--2011  Universit@'e de Bordeaux 1
@c Copyright (C) 2010, 2011, 2012  Centre National de la Recherche Scientifique
@c Copyright (C) 2011 Institut National de Recherche en Informatique et Automatique
@c See the file starpu.texi for copying conditions.

@menu
* On-line::                     On-line performance feedback
* Off-line::                    Off-line performance feedback
* Codelet performance::         Performance of codelets
* Theoretical lower bound on execution time API::
@end menu

@node On-line
@section On-line performance feedback

@menu
* Enabling monitoring::         Enabling on-line performance monitoring
* Task feedback::               Per-task feedback
* Codelet feedback::            Per-codelet feedback
* Worker feedback::             Per-worker feedback
* Bus feedback::                Bus-related feedback
* StarPU-Top::                  StarPU-Top interface
@end menu

@node Enabling monitoring
@subsection Enabling on-line performance monitoring

In order to enable on-line performance monitoring, the application can call
@code{starpu_profiling_status_set(STARPU_PROFILING_ENABLE)}. It is possible to
detect whether monitoring is already enabled or not by calling
@code{starpu_profiling_status_get()}. Enabling monitoring also reinitializes all
previously collected feedback. The @code{STARPU_PROFILING} environment variable
can also be set to 1 to achieve the same effect.

Likewise, performance monitoring is stopped by calling
@code{starpu_profiling_status_set(STARPU_PROFILING_DISABLE)}. Note that this
does not reset the performance counters, so that the application may consult
them later on.

More details about the performance monitoring API are available in section
@ref{Profiling API}.
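As a minimal sketch (the @code{submit_tasks()} helper is hypothetical and
stands for the application's own task submission), monitoring can be enabled
around a region of interest as follows:

@cartouche
@smallexample
/* Enable monitoring; this also resets previously collected feedback */
starpu_profiling_status_set(STARPU_PROFILING_ENABLE);

submit_tasks();                 /* application-specific task submission */
starpu_task_wait_for_all();

/* Stop monitoring; the counters remain available for later inspection */
starpu_profiling_status_set(STARPU_PROFILING_DISABLE);

if (starpu_profiling_status_get() == STARPU_PROFILING_DISABLE)
    fprintf(stderr, "profiling is now disabled\n");
@end smallexample
@end cartouche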
@node Task feedback
@subsection Per-task feedback

If profiling is enabled, a pointer to a @code{starpu_task_profiling_info}
structure is put in the @code{.profiling_info} field of the @code{starpu_task}
structure when a task terminates.

This structure is automatically destroyed when the task structure is destroyed,
either automatically or by calling @code{starpu_task_destroy}.

The @code{starpu_task_profiling_info} structure indicates the date when the
task was submitted (@code{submit_time}), started (@code{start_time}), and
terminated (@code{end_time}), relative to the initialization of
StarPU with @code{starpu_init}. It also specifies the identifier of the worker
that has executed the task (@code{workerid}).
These dates are stored as @code{timespec} structures, which the user may
convert into micro-seconds using the @code{starpu_timing_timespec_to_us} helper
function.

It is worth noting that the application may directly access this structure from
the callback executed at the end of the task. The @code{starpu_task} structure
associated to the callback currently being executed is indeed accessible with
the @code{starpu_get_current_task()} function.
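For instance, a termination callback can retrieve the task being finalized and
display how long it ran and on which worker. This is only a sketch (the
callback name is illustrative, and profiling is assumed to have been enabled
beforehand):

@cartouche
@smallexample
void display_task_timing_cb(void *arg)
@{
    struct starpu_task *task = starpu_get_current_task();
    struct starpu_task_profiling_info *info = task->profiling_info;

    /* info is NULL when profiling is not enabled */
    if (info)
    @{
        double start = starpu_timing_timespec_to_us(&info->start_time);
        double end = starpu_timing_timespec_to_us(&info->end_time);
        fprintf(stderr, "task ran for %f us on worker %d\n",
                end - start, info->workerid);
    @}
@}
@end smallexample
@end cartouche

Such a function would be set as the @code{callback_func} of the tasks whose
timing should be reported.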
@node Codelet feedback
@subsection Per-codelet feedback

The @code{per_worker_stats} field of the @code{struct starpu_codelet} structure
is an array of counters. The i-th entry of the array is incremented every time
a task implementing the codelet is executed on the i-th worker.
This array is not reinitialized when profiling is enabled or disabled.

@node Worker feedback
@subsection Per-worker feedback

The second argument returned by the @code{starpu_worker_get_profiling_info}
function is a @code{starpu_worker_profiling_info} structure that gives
statistics about the specified worker. This structure specifies when StarPU
started collecting profiling information for that worker (@code{start_time}),
the duration of the profiling measurement interval (@code{total_time}), the
time spent executing kernels (@code{executing_time}), the time spent sleeping
because there is no task to execute at all (@code{sleeping_time}), and the
number of tasks that were executed while profiling was enabled.
These values give an estimation of the proportion of time spent doing real
work, and of the time spent either sleeping because there are not enough
executable tasks, or simply wasted in pure StarPU overhead.

Calling @code{starpu_worker_get_profiling_info} resets the profiling
information associated to a worker.

When an FxT trace is generated (see @ref{Generating traces}), it is also
possible to use the @code{starpu_top} script (described in @ref{starpu-top}) to
generate a graphic showing the evolution of these values over time, for
the different workers.
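As an illustration, the following sketch displays the proportion of time each
worker spent executing kernels and sleeping. It assumes that the timing fields
are @code{timespec} structures, like the per-task ones, and that profiling was
enabled before the tasks were executed:

@cartouche
@smallexample
unsigned worker;
for (worker = 0; worker < starpu_worker_get_count(); worker++)
@{
    struct starpu_worker_profiling_info info;
    starpu_worker_get_profiling_info(worker, &info);

    double total = starpu_timing_timespec_to_us(&info.total_time);
    double executing = starpu_timing_timespec_to_us(&info.executing_time);
    double sleeping = starpu_timing_timespec_to_us(&info.sleeping_time);

    fprintf(stderr, "worker %u: %.1f%% executing, %.1f%% sleeping\n",
            worker, 100. * executing / total, 100. * sleeping / total);
@}
@end smallexample
@end cartouche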
@node Bus feedback
@subsection Bus-related feedback

TODO

@c how to enable/disable performance monitoring
@c what kind of information do we get ?

The bus speed measured by StarPU can be displayed by using the
@code{starpu_machine_display} tool, for instance:

@example
StarPU has found:
        3 CUDA devices
                CUDA 0 (Tesla C2050 02:00.0)
                CUDA 1 (Tesla C2050 03:00.0)
                CUDA 2 (Tesla C2050 84:00.0)
from    to RAM          to CUDA 0       to CUDA 1       to CUDA 2
RAM     0.000000        5176.530428     5176.492994     5191.710722
CUDA 0  4523.732446     0.000000        2414.074751     2417.379201
CUDA 1  4523.718152     2414.078822     0.000000        2417.375119
CUDA 2  4534.229519     2417.069025     2417.060863     0.000000
@end example

@node StarPU-Top
@subsection StarPU-Top interface

StarPU-Top is an interface which remotely displays the on-line state of a
StarPU application and permits the user to change parameters on the fly.

Variables to be monitored can be registered by calling the
@code{starpu_top_add_data_boolean}, @code{starpu_top_add_data_integer},
@code{starpu_top_add_data_float} functions, e.g.:

@cartouche
@smallexample
starpu_top_data *data = starpu_top_add_data_integer("mynum", 0, 100, 1);
@end smallexample
@end cartouche

The application should then call @code{starpu_top_init_and_wait} to give its
name and wait for StarPU-Top to get a start request from the user. The name is
used by StarPU-Top to quickly reload a previously-saved layout of parameter
display.

@cartouche
@smallexample
starpu_top_init_and_wait("the application");
@end smallexample
@end cartouche

The new values can then be provided thanks to
@code{starpu_top_update_data_boolean}, @code{starpu_top_update_data_integer},
@code{starpu_top_update_data_float}, e.g.:

@cartouche
@smallexample
starpu_top_update_data_integer(data, mynum);
@end smallexample
@end cartouche

Updatable parameters can be registered thanks to
@code{starpu_top_register_parameter_boolean},
@code{starpu_top_register_parameter_integer},
@code{starpu_top_register_parameter_float}, e.g.:

@cartouche
@smallexample
float alpha;
starpu_top_register_parameter_float("alpha", &alpha, 0, 10, modif_hook);
@end smallexample
@end cartouche

@code{modif_hook} is a function which will be called when the parameter is
modified; it can for instance print the new value:

@cartouche
@smallexample
void modif_hook(struct starpu_top_param *d)
@{
    fprintf(stderr,"%s has been modified: %f\n", d->name, alpha);
@}
@end smallexample
@end cartouche

Task schedulers should notify StarPU-Top when they have decided when a task
will be scheduled, so that it can show it in its Gantt chart, for instance:

@cartouche
@smallexample
starpu_top_task_prevision(task, workerid, begin, end);
@end smallexample
@end cartouche

Starting StarPU-Top and the application can be done in two ways:

@itemize
@item The application is started by hand on some machine (and thus already
waiting for the start event). In the Preference dialog of StarPU-Top, the SSH
checkbox should be unchecked, and the hostname and port (default is 2011) on
which the application is already running should be specified. Clicking on the
connection button will thus connect to the already-running application.

@item StarPU-Top is started first, and clicking on the connection button will
start the application itself (possibly on a remote machine). The SSH checkbox
should be checked, and a command line provided, e.g.:

@example
ssh myserver STARPU_SCHED=heft ./application
@end example

If port 2011 of the remote machine can not be accessed directly, an SSH port
forwarding should be added:

@example
ssh -L 2011:localhost:2011 myserver STARPU_SCHED=heft ./application
@end example

and "localhost" should be used as the IP address to connect to.
@end itemize

@node Off-line
@section Off-line performance feedback

@menu
* Generating traces::           Generating traces with FxT
* Gantt diagram::               Creating a Gantt Diagram
* DAG::                         Creating a DAG with graphviz
* starpu-top::                  Monitoring activity
@end menu

@node Generating traces
@subsection Generating traces with FxT

StarPU can use the FxT library (see
@indicateurl{https://savannah.nongnu.org/projects/fkt/}) to generate traces
with a limited runtime overhead.

You can either get a tarball:
@example
% wget http://download.savannah.gnu.org/releases/fkt/fxt-0.2.2.tar.gz
@end example

or use the FxT library from CVS (autotools are required):
@example
% cvs -d :pserver:anonymous@@cvs.sv.gnu.org:/sources/fkt co FxT
% ./bootstrap
@end example

Compiling and installing the FxT library in the @code{$FXTDIR} path is
done following the standard procedure:
@example
% ./configure --prefix=$FXTDIR
% make
% make install
@end example

In order to have StarPU generate traces, StarPU should be configured with
the @code{--with-fxt} option:
@example
$ ./configure --with-fxt=$FXTDIR
@end example

Or you can simply point @code{PKG_CONFIG_PATH} to
@code{$FXTDIR/lib/pkgconfig} and pass @code{--with-fxt} to @code{./configure}.

When FxT is enabled, a trace is generated when StarPU is terminated by calling
@code{starpu_shutdown()}. The trace is a binary file whose name has the form
@code{prof_file_XXX_YYY} where @code{XXX} is the user name, and
@code{YYY} is the pid of the process that used StarPU. This file is saved in
the @code{/tmp/} directory by default, or in the directory specified by
the @code{STARPU_FXT_PREFIX} environment variable.

@node Gantt diagram
@subsection Creating a Gantt Diagram

When the FxT trace file @code{filename} has been generated, it is possible to
generate a trace in the Paje format by calling:
@example
% starpu_fxt_tool -i filename
@end example

Alternatively, setting the @code{STARPU_GENERATE_TRACE} environment variable
to 1 before application execution will make StarPU do it automatically at
application shutdown.

This will create a @code{paje.trace} file in the current directory that can be
inspected with the ViTE trace visualizing open-source tool. More information
about ViTE is available at @indicateurl{http://vite.gforge.inria.fr/}. It is
possible to open the @code{paje.trace} file with ViTE by using the following
command:
@example
% vite paje.trace
@end example

@node DAG
@subsection Creating a DAG with graphviz

When the FxT trace file @code{filename} has been generated, it is possible to
generate a task graph in the DOT format by calling:
@example
$ starpu_fxt_tool -i filename
@end example

This will create a @code{dag.dot} file in the current directory. This file is
a task graph described using the DOT language.
It is possible to get a
graphical output of the graph by using the graphviz library:
@example
$ dot -Tpdf dag.dot -o output.pdf
@end example

@node starpu-top
@subsection Monitoring activity

When the FxT trace file @code{filename} has been generated, it is possible to
generate an activity trace by calling:
@example
$ starpu_fxt_tool -i filename
@end example

This will create an @code{activity.data} file in the current
directory. A profile of the application showing the activity of StarPU
during the execution of the program can then be generated:
@example
$ starpu_top activity.data
@end example

This will create a file named @code{activity.eps} in the current directory.
This picture is composed of two parts.

The first part shows the activity of the different workers. The green sections
indicate which proportion of the time was spent executing kernels on the
processing unit. The red sections indicate the proportion of time spent in
StarPU: an important overhead may indicate that the granularity is too small,
and that bigger tasks may be needed to use the processing unit more
efficiently. The black sections indicate that the processing unit was blocked
because there was no task to process: this may indicate a lack of parallelism,
which may be alleviated by creating more tasks when possible.

The second part of the @code{activity.eps} picture is a graph showing the
evolution of the number of tasks available in the system during the execution.
Ready tasks are shown in black, and tasks that are submitted but not
schedulable yet are shown in grey.

@node Codelet performance
@section Performance of codelets

The performance model of codelets (described in @ref{Performance model
example}) can be examined by using the @code{starpu_perfmodel_display} tool:

@example
$ starpu_perfmodel_display -l
file: <malloc_pinned.hannibal>
file: <starpu_slu_lu_model_21.hannibal>
file: <starpu_slu_lu_model_11.hannibal>
file: <starpu_slu_lu_model_22.hannibal>
file: <starpu_slu_lu_model_12.hannibal>
@end example

Here, the codelets of the lu example are available. We can examine the
performance of the 22 kernel (in micro-seconds):

@example
$ starpu_perfmodel_display -s starpu_slu_lu_model_22
performance model for cpu
# hash      size       mean          dev           n
57618ab0    19660800   2.851069e+05  1.829369e+04  109
performance model for cuda_0
# hash      size       mean          dev           n
57618ab0    19660800   1.164144e+04  1.556094e+01  315
performance model for cuda_1
# hash      size       mean          dev           n
57618ab0    19660800   1.164271e+04  1.330628e+01  360
performance model for cuda_2
# hash      size       mean          dev           n
57618ab0    19660800   1.166730e+04  3.390395e+02  456
@end example

We can see that, for the given size, over a sample of a few hundred
executions, the GPUs are about 20 times faster than the CPUs (times are given
in micro-seconds). The standard deviation is extremely low for the GPUs, and
less than 10% for the CPUs.

The @code{starpu_regression_display} tool does the same for regression-based
performance models. It also writes a @code{.gp} file in the current directory,
to be run with the @code{gnuplot} tool, which shows the corresponding curve.

The same can also be achieved by using StarPU's library API, see
@ref{Performance Model API} and notably the @code{starpu_load_history_debug}
function. The source code of the @code{starpu_perfmodel_display} tool can be a
useful example.
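As a rough sketch only (the exact prototype and return convention of
@code{starpu_load_history_debug} should be checked in
@ref{Performance Model API}; the symbol name is the one used in the example
above, and a zero return value is assumed to mean success), a model could be
loaded programmatically as follows:

@cartouche
@smallexample
struct starpu_perfmodel model;

/* Load the history-based model recorded under the given symbol */
if (starpu_load_history_debug("starpu_slu_lu_model_22", &model) != 0)
    fprintf(stderr, "could not load the performance model\n");
@end smallexample
@end cartouche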
@node Theoretical lower bound on execution time API
@section Theoretical lower bound on execution time

See @ref{Theoretical lower bound on execution time} for an example on how to
use this API. It permits recording a trace of the tasks needed to complete the
application and then, by solving a linear system, computing a theoretical lower
bound on the execution time (i.e. with an ideal scheduling).

The computed bound is not really accurate when dependencies are not taken into
account, but for an application which has enough parallelism it is very close
to the bound computed with dependencies enabled (which takes much more time to
compute), and thus provides a good-enough estimation of the ideal execution
time.

@deftypefun void starpu_bound_start (int @var{deps}, int @var{prio})
Start recording tasks (resets stats). @var{deps} tells whether
dependencies should be recorded too (this is quite expensive).
@end deftypefun

@deftypefun void starpu_bound_stop (void)
Stop recording tasks.
@end deftypefun

@deftypefun void starpu_bound_print_dot ({FILE *}@var{output})
Print the DAG that was recorded.
@end deftypefun

@deftypefun void starpu_bound_compute ({double *}@var{res}, {double *}@var{integer_res}, int @var{integer})
Get the theoretical lower bound (in ms) (needs glpk support detected by the
@code{configure} script).
@end deftypefun

@deftypefun void starpu_bound_print_lp ({FILE *}@var{output})
Emit the Linear Programming system on @var{output} for the recorded tasks, in
the lp format.
@end deftypefun

@deftypefun void starpu_bound_print_mps ({FILE *}@var{output})
Emit the Linear Programming system on @var{output} for the recorded tasks, in
the mps format.
@end deftypefun

@deftypefun void starpu_bound_print ({FILE *}@var{output}, int @var{integer})
Emit statistics of actual execution vs the theoretical lower bound.
@var{integer} selects between integer solving (which takes a long time but is
correct) and relaxed solving (which provides an approximate solution).
@end deftypefun
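Putting these functions together, a typical use looks like the following
sketch (the @code{submit_tasks()} helper is hypothetical and stands for the
application's own task submission; @code{starpu_bound_compute} requires glpk
support):

@cartouche
@smallexample
/* Record tasks, without dependencies nor priorities (cheaper to solve) */
starpu_bound_start(0, 0);

submit_tasks();                 /* application-specific task submission */
starpu_task_wait_for_all();

starpu_bound_stop();

double res, integer_res;
/* Relaxed (non-integer) solving: approximate but fast */
starpu_bound_compute(&res, &integer_res, 0);
fprintf(stderr, "theoretical lower bound: %f ms\n", res);
@end smallexample
@end cartouche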