/* StarPU --- Runtime system for heterogeneous multicore architectures.
 *
 * Copyright (C) 2009-2021  Université de Bordeaux, CNRS (LaBRI UMR 5800), Inria
 *
 * StarPU is free software; you can redistribute it and/or modify
 * it under the terms of the GNU Lesser General Public License as published by
 * the Free Software Foundation; either version 2.1 of the License, or (at
 * your option) any later version.
 *
 * StarPU is distributed in the hope that it will be useful, but
 * WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
 *
 * See the GNU Lesser General Public License in COPYING.LGPL for more details.
 */

/*! \page CheckListWhenPerformanceAreNotThere Check List When Performance Are Not There

TODO: improve!

To achieve good performance, we give below a list of features which should be
checked.

For a start, you can use \ref OfflinePerformanceTools to get a Gantt chart which
will show roughly where time is spent, and focus correspondingly.

\section CheckTaskSize Check Task Size

Make sure that your tasks are not too small, as the StarPU runtime overhead
is not completely zero. As explained in \ref TaskSizeOverhead, you can
run the script \c tasks_size_overhead.sh to get an
idea of the scalability of tasks depending on their duration (in µs), on your
own system.

Typically, 10µs-ish tasks are definitely too small, the CUDA overhead itself is
much bigger than this.

1ms-ish tasks may be a good start, but will not necessarily scale to many dozens
of cores, so it's better to try to get 10ms-ish tasks.

Tasks durations can easily be observed when performance models are defined (see
\ref PerformanceModelExample) by using the tools <c>starpu_perfmodel_plot</c> or
<c>starpu_perfmodel_display</c> (see \ref PerformanceOfCodelets).

When using parallel tasks, the problem is even worse since StarPU has to
synchronize the tasks' execution.

\section ConfigurationImprovePerformance Configuration Which May Improve Performance

The \c configure option \ref enable-fast "--enable-fast" disables all
assertions. This makes StarPU more performant for really small tasks by
disabling all sanity checks.
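For instance, a sketch of the corresponding \c configure invocation:

\code{.sh}
./configure --enable-fast
\endcode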
Only use this for measurements and production, not for development, since this
will drop all basic checks.

\section DataRelatedFeaturesToImprovePerformance Data Related Features Which May Improve Performance

link to \ref DataManagement

link to \ref DataPrefetch

\section TaskRelatedFeaturesToImprovePerformance Task Related Features Which May Improve Performance

link to \ref TaskGranularity

link to \ref TaskSubmission

link to \ref TaskPriorities

\section SchedulingRelatedFeaturesToImprovePerformance Scheduling Related Features Which May Improve Performance

link to \ref TaskSchedulingPolicy

link to \ref TaskDistributionVsDataTransfer

link to \ref Energy-basedScheduling

link to \ref StaticScheduling

\section CUDA-specificOptimizations CUDA-specific Optimizations

For proper overlapping of asynchronous GPU data transfers, data has to be pinned
by CUDA. Data allocated with starpu_malloc() is always properly pinned. If the
application registers to StarPU some data which has not been allocated with
starpu_malloc(), starpu_memory_pin() should be called to pin the data memory.

Due to CUDA limitations, StarPU will have a hard time overlapping its own
communications and the codelet computations if the application does not use a
dedicated CUDA stream for its computations instead of the default stream,
which synchronizes all operations of the GPU. The function
starpu_cuda_get_local_stream() returns a stream which can be used by all CUDA codelet
operations to avoid this issue. For instance:

\code{.c}
func<<<grid,block,0,starpu_cuda_get_local_stream()>>>(foo, bar);
cudaError_t status = cudaGetLastError();
if (status != cudaSuccess) STARPU_CUDA_REPORT_ERROR(status);
cudaStreamSynchronize(starpu_cuda_get_local_stream());
\endcode

Similarly, for each CUDA operation one needs to use a version that takes a
stream parameter, such as \c cudaMemcpyAsync(), etc.

Unfortunately, some CUDA libraries do not have stream variants of
kernels.
This will seriously lower the potential for overlapping.

If some CUDA calls are made without specifying this local stream,
synchronization needs to be made explicit with cudaDeviceSynchronize() around these
calls, to make sure that they get properly synchronized with the calls using
the local stream. Notably, \c cudaMemcpy() and \c cudaMemset() are actually
asynchronous and need such explicit synchronization! Use \c cudaMemcpyAsync() and
\c cudaMemsetAsync() instead.

Calling starpu_cublas_init() will ensure that StarPU properly calls the
CUBLAS library functions. Some libraries like Magma may however change the current stream of
CUBLAS v1; one then has to call <c>cublasSetKernelStream(</c>starpu_cuda_get_local_stream()<c>)</c> at
the beginning of the codelet to make sure that CUBLAS is really using the proper
stream. When using CUBLAS v2, starpu_cublas_get_local_handle() can be called to queue CUBLAS
kernels with the proper configuration.

Similarly, calling starpu_cusparse_init() makes StarPU create CUSPARSE handles
on each CUDA device, starpu_cusparse_get_local_handle() can then be used to
queue CUSPARSE kernels with the proper configuration.

If the kernel can be made to only use this local stream or other self-allocated
streams, i.e. the whole kernel submission can be made asynchronous, then
one should enable asynchronous execution of the kernel. This means setting
the flag ::STARPU_CUDA_ASYNC in the corresponding field starpu_codelet::cuda_flags, and dropping the
<c>cudaStreamSynchronize()</c> call at the end of the <c>cuda_func</c> function, so that it
returns immediately after having queued the kernel to the local stream. That way, StarPU will be
able to submit and complete data transfers while kernels are executing, instead of only at each
kernel submission.
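As an illustrative sketch (assuming a CUDA kernel \c my_kernel and a vector
parameter; these names are not part of StarPU), an asynchronous codelet then
looks like:

\code{.c}
/* Sketch only: my_kernel is an assumed application CUDA kernel. */
void cuda_func(void *buffers[], void *cl_arg)
{
    float *v = (float *)STARPU_VECTOR_GET_PTR(buffers[0]);
    unsigned n = STARPU_VECTOR_GET_NX(buffers[0]);

    /* Queue the kernel on the StarPU-provided stream and return
     * immediately, without calling cudaStreamSynchronize(). */
    my_kernel<<<(n + 255) / 256, 256, 0, starpu_cuda_get_local_stream()>>>(v, n);
}

struct starpu_codelet cl =
{
    .cuda_funcs = { cuda_func },
    /* Tell StarPU that cuda_func returns before kernel completion. */
    .cuda_flags = { STARPU_CUDA_ASYNC },
    .nbuffers = 1,
    .modes = { STARPU_RW },
};
\endcode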
The kernel just has to make sure that StarPU can use the
local stream to synchronize with the kernel startup and completion.

If the kernel uses its own non-default stream, one can synchronize this stream
with the StarPU-provided stream this way:

\code{.c}
cudaEvent_t event;
call_kernel_with_its_own_stream();
cudaEventCreateWithFlags(&event, cudaEventDisableTiming);
cudaEventRecord(event, get_kernel_stream());
cudaStreamWaitEvent(starpu_cuda_get_local_stream(), event, 0);
cudaEventDestroy(event);
\endcode

This code makes the StarPU-provided stream wait for a new event, which will be
triggered by the completion of the kernel.

Using the flag ::STARPU_CUDA_ASYNC also permits enabling concurrent kernel
execution, on cards which support it (Kepler and later, notably). This is
enabled by setting the environment variable \ref STARPU_NWORKER_PER_CUDA to the
number of kernels to be executed concurrently. This is useful when kernels are
small and do not feed the whole GPU with threads to run.

Concerning memory allocation, you should really not use \c cudaMalloc()/\c cudaFree()
within the kernel, since \c cudaFree() introduces an awful lot of synchronizations
within CUDA itself. You should instead add a parameter to the codelet with the
::STARPU_SCRATCH mode access. You can then pass to the task a handle registered
with the desired size but with the \c NULL pointer; the handle can even be
shared between tasks. StarPU will allocate per-task data on the fly before task
execution, and reuse the allocated data between tasks.
See <c>examples/pi/pi_redux.c</c> for an example of use.

\section OpenCL-specificOptimizations OpenCL-specific Optimizations

If the kernel can be made to only use the StarPU-provided command queue or other self-allocated
queues, i.e. the whole kernel submission can be made asynchronous, then
one should enable asynchronous execution of the kernel.
This means setting
the flag ::STARPU_OPENCL_ASYNC in the corresponding field starpu_codelet::opencl_flags and dropping the
<c>clFinish()</c> and starpu_opencl_collect_stats() calls at the end of the kernel, so
that it returns immediately after having queued the kernel to the provided queue.
That way, StarPU will be able to submit and complete data transfers while kernels are executing, instead of
only at each kernel submission. The kernel just has to make sure
that StarPU can use the command queue it has provided to synchronize with the
kernel startup and completion.

\section DetectionStuckConditions Detecting Stuck Conditions

It may happen that for some reason, StarPU does not make progress for a long
period of time. This is sometimes due to contention inside StarPU, but
sometimes to external reasons, such as a stuck MPI or CUDA
driver.

<c>export STARPU_WATCHDOG_TIMEOUT=10000</c> (\ref STARPU_WATCHDOG_TIMEOUT)
makes StarPU print an error message whenever StarPU does not terminate
any task for 10ms, but lets the application continue normally.
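Both watchdog variables can for instance be set together (the values shown here
are illustrative; the timeout is expressed in µs):

\code{.sh}
# Print an error if no task completes within 10ms (10000µs)...
export STARPU_WATCHDOG_TIMEOUT=10000
# ...and raise SIGABRT so the state can be inspected in a debugger.
export STARPU_WATCHDOG_CRASH=1
\endcode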
In addition to that,
<c>export STARPU_WATCHDOG_CRASH=1</c> (\ref STARPU_WATCHDOG_CRASH)
raises <c>SIGABRT</c> in this condition, thus allowing you to catch the
situation in \c gdb.

It can also be useful to type <c>handle SIGABRT nopass</c> in <c>gdb</c> to be able to let
the process continue, after inspecting the state of the process.

\section HowToLimitMemoryPerNode How to Limit Memory Used By StarPU And Cache Buffer Allocations

By default, StarPU makes sure to use at most 90% of the memory of GPU devices,
moving data in and out of the device as appropriate, as well as using
prefetch and writeback optimizations.

The environment variables \ref STARPU_LIMIT_CUDA_MEM, \ref STARPU_LIMIT_CUDA_devid_MEM,
\ref STARPU_LIMIT_OPENCL_MEM, and \ref STARPU_LIMIT_OPENCL_devid_MEM
can be used to control how much (in MiB) of the GPU device memory
should be used at most by StarPU (the default value is to use 90% of the
available memory).

By default, the usage of the main memory is not limited, as the
default mechanisms do not provide means to evict main memory when it
gets too tight. This also means that by default StarPU will not cache buffer
allocations in main memory, since it does not know how much of the
system memory it can afford.

The environment variable \ref STARPU_LIMIT_CPU_MEM can be used to
specify how much (in MiB) of the main memory should be used at most by
StarPU for buffer allocations. This way, StarPU will be able to
cache buffer allocations (which can be a real benefit if a lot of buffers are
involved, or if allocation fragmentation can become a problem), and when using
\ref OutOfCore, StarPU will know when it should evict data out to the disk.

It should be noted that by default only buffer allocations automatically
done by StarPU are accounted here, i.e. allocations performed through
starpu_malloc_on_node() which are used by the data interfaces
(matrix, vector, etc.). This does not include allocations performed by
the application through e.g. malloc().
It does not include allocations
performed through starpu_malloc() either; only allocations
performed explicitly with the flag \ref STARPU_MALLOC_COUNT, i.e. by calling

\code{.c}
starpu_malloc_flags(STARPU_MALLOC_COUNT)
\endcode

are taken into account. If the
application wants to make StarPU aware of its own allocations, so that StarPU
knows precisely how much data is allocated, and thus when to evict allocation
caches or data out to the disk, starpu_memory_allocate() can be used to
specify an amount of memory to be accounted for. starpu_memory_deallocate()
can be used to account freed memory back. Those can for instance be used by data
interfaces with dynamic data buffers: instead of using starpu_malloc_on_node(),
they would dynamically allocate data with \c malloc()/\c realloc(), and notify StarPU of
the delta by calling starpu_memory_allocate() and starpu_memory_deallocate().

starpu_memory_get_total() and starpu_memory_get_available()
can be used to get an estimation of how much memory is available.
starpu_memory_wait_available() can also be used to block until an
amount of memory becomes available, but it may be preferable to call

\code{.c}
starpu_memory_allocate(STARPU_MEMORY_WAIT)
\endcode

to reserve this amount immediately.

\section HowToReduceTheMemoryFootprintOfInternalDataStructures How To Reduce The Memory Footprint Of Internal Data Structures

It is possible to reduce the memory footprint of the task and data internal
structures of StarPU by describing the shape of your machine and/or your
application when calling \c configure.

To reduce the memory footprint of the data internal structures of StarPU, one
can set the \c configure parameters \ref enable-maxcpus "--enable-maxcpus",
\ref enable-maxnumanodes "--enable-maxnumanodes",
\ref enable-maxcudadev "--enable-maxcudadev",
\ref enable-maxopencldev "--enable-maxopencldev" and
\ref enable-maxnodes "--enable-maxnodes"
to give StarPU
the architecture of the machine it will run on, thus tuning the size of the
structures to the machine.

To
reduce the memory footprint of the task internal structures of StarPU, one
can set the \c configure parameter \ref enable-maxbuffers "--enable-maxbuffers" to
give StarPU the maximum number of buffers that a task can use during an
execution. For example, in the Cholesky factorization (dense linear algebra
application), the GEMM task uses up to 3 buffers, so it is possible to set the
maximum number of task buffers to 3 to run a Cholesky factorization on StarPU.

The size of the various structures of StarPU can be printed by
<c>tests/microbenchs/display_structures_size</c>.

It is also often useless to submit \b all the tasks at the same time.
Task submission can be blocked when a reasonable given number of
tasks have been submitted, by setting the environment variables
\ref STARPU_LIMIT_MIN_SUBMITTED_TASKS and \ref STARPU_LIMIT_MAX_SUBMITTED_TASKS.

\code{.sh}
export STARPU_LIMIT_MAX_SUBMITTED_TASKS=10000
export STARPU_LIMIT_MIN_SUBMITTED_TASKS=9000
\endcode

will make StarPU block submission when 10000 tasks are submitted, and unblock
submission when only 9000 tasks are still submitted, i.e. 1000 tasks have
completed among the 10000 which were submitted when submission was blocked. Of
course this may reduce parallelism if the threshold is set too low. The precise
balance depends on the application task graph.

An idea of how much memory is used for tasks and data handles can be obtained by
setting the environment variable \ref STARPU_MAX_MEMORY_USE to <c>1</c>.

\section HowtoReuseMemory How To Reuse Memory

When your application needs to allocate more data than the available amount of
memory usable by StarPU (given by starpu_memory_get_available()), the
allocation cache system can reuse data buffers used by previously executed
tasks.
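Task submission can for instance be throttled explicitly so that the allocation
cache gets a chance to reuse buffers. A sketch, where \c create_task() stands
for an assumed application helper returning a ready-to-submit task:

\code{.c}
#define TASK_MAX 10000
#define TASK_MIN  9000

for (i = 0; i < ntasks; i++)
{
    /* Every TASK_MAX submissions, wait until at most TASK_MIN tasks
     * remain submitted before pushing the next batch. */
    if (i > 0 && i % TASK_MAX == 0)
        starpu_task_wait_for_n_submitted(TASK_MIN);
    starpu_task_submit(create_task(i));
}
\endcode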
For this system to work with MPI tasks, you need to submit tasks progressively instead
of as soon as possible, because in the case of MPI receives, the allocation cache check for reusing data
buffers will be done at submission time, not at execution time.

There are two options to control the task submission flow. The first one is by
controlling the number of submitted tasks during the whole execution. This can
be done either by setting the environment variables
\ref STARPU_LIMIT_MAX_SUBMITTED_TASKS and \ref STARPU_LIMIT_MIN_SUBMITTED_TASKS to
tell StarPU when to stop submitting tasks and when to wake up and submit tasks
again, or by explicitly calling starpu_task_wait_for_n_submitted() in
your application code for finer-grain control (for example, between two
iterations of a submission loop).

The second option is to control the memory size of the allocation cache. This
can be done in the application by jointly using
starpu_memory_get_available() and starpu_memory_wait_available() to submit
tasks only when there is enough memory space to allocate the data needed by the
task, i.e. when enough data are available for reuse in the allocation cache.

\section PerformanceModelCalibration Performance Model Calibration

Most schedulers are based on an estimation of codelet duration on each kind
of processing unit. For this to be possible, the application programmer needs
to configure a performance model for the codelets of the application (see
\ref PerformanceModelExample for instance). History-based performance models
use on-line calibration. StarPU will automatically calibrate codelets
which have never been calibrated yet, and save the result in
<c>$STARPU_HOME/.starpu/sampling/codelets</c>.
The models are indexed by machine name.

By default, StarPU stores separate performance models according to the hostname
of the system.
To avoid having to calibrate performance models for each node
of a homogeneous cluster for instance, the model can be shared by using
<c>export STARPU_HOSTNAME=some_global_name</c> (\ref STARPU_HOSTNAME), where
<c>some_global_name</c> is the name of the cluster for instance, which thus
overrides the hostname of the system.

By default, StarPU stores separate performance models for each GPU. To avoid
having to calibrate performance models for each GPU of a homogeneous set of GPU
devices for instance, the model can be shared by using the environment
variables \ref STARPU_PERF_MODEL_HOMOGENEOUS_CUDA,
\ref STARPU_PERF_MODEL_HOMOGENEOUS_OPENCL and
\ref STARPU_PERF_MODEL_HOMOGENEOUS_MPI_MS depending on your GPU device
type.

\code{.sh}
export STARPU_PERF_MODEL_HOMOGENEOUS_CUDA=1
export STARPU_PERF_MODEL_HOMOGENEOUS_OPENCL=1
export STARPU_PERF_MODEL_HOMOGENEOUS_MPI_MS=1
\endcode

To force continuing calibration,
use <c>export STARPU_CALIBRATE=1</c> (\ref STARPU_CALIBRATE). This may be necessary if your application
has not-so-stable performance. StarPU will force calibration (and thus ignore
the current result) until 10 (<c>_STARPU_CALIBRATION_MINIMUM</c>) measurements have been
made on each architecture, to avoid bad scheduling decisions just because the
first measurements were not so good.

Note that StarPU will not record the very first measurement for a given codelet
and a given size, because it would most often be hit by computation library
loading or initialization. StarPU will also throw measurements away if it
notices that, after computing an average execution time, most
subsequent tasks have an execution time largely outside the computed average
("Too big deviation for model..." warning messages). By looking at the details
of the message and their reported measurements, it can be seen whether your
computation library really has non-stable measurements, which is probably an
indication of an issue in the computation library, or the execution environment
(e.g.
rogue daemons).

Details on the current performance model status
can be obtained with the tool <c>starpu_perfmodel_display</c>: the
option <c>-l</c> lists the available performance models, and the
option <c>-s</c> allows choosing the performance model to be
displayed. The result looks like:

\verbatim
$ starpu_perfmodel_display -s starpu_slu_lu_model_11
performance model for cpu_impl_0
# hash      size     flops         mean          dev           n
914f3bef    1048576  0.000000e+00  2.503577e+04  1.982465e+02  8
3e921964    65536    0.000000e+00  5.527003e+02  1.848114e+01  7
e5a07e31    4096     0.000000e+00  1.717457e+01  5.190038e+00  14
...
\endverbatim

which shows that for the LU 11 kernel with a 1MiB matrix, the average
execution time on CPUs was about 25ms, with a 0.2ms standard deviation, over
8 samples. It is a good idea to check this before doing actual performance
measurements.

A graph can be drawn by using the tool <c>starpu_perfmodel_plot</c>:

\verbatim
$ starpu_perfmodel_plot -s starpu_slu_lu_model_11
4096 16384 65536 262144 1048576 4194304
$ gnuplot starpu_starpu_slu_lu_model_11.gp
$ gv starpu_starpu_slu_lu_model_11.eps
\endverbatim

\image html starpu_starpu_slu_lu_model_11.png
\image latex starpu_starpu_slu_lu_model_11.eps "" width=\textwidth

If a kernel source code was modified (e.g. performance improvement), the
calibration information is stale and should be dropped, to re-calibrate from
scratch. This can be done by using <c>export STARPU_CALIBRATE=2</c> (\ref STARPU_CALIBRATE).

Note: history-based performance models get calibrated
only if a performance-model-based scheduler is chosen.

The history-based performance models can also be explicitly filled by the
application without execution, if e.g. the application already has a series of
measurements.
This can be done by using starpu_perfmodel_update_history(),
for instance:

\code{.c}
static struct starpu_perfmodel perf_model =
{
    .type = STARPU_HISTORY_BASED,
    .symbol = "my_perfmodel",
};

struct starpu_codelet cl =
{
    .cuda_funcs = { cuda_func1, cuda_func2 },
    .nbuffers = 1,
    .modes = {STARPU_W},
    .model = &perf_model
};

void feed(void)
{
    struct my_measure *measure;
    struct starpu_task task;
    starpu_task_init(&task);
    task.cl = &cl;
    for (measure = &measures[0]; measure < &measures[last]; measure++)
    {
        starpu_data_handle_t handle;
        starpu_vector_data_register(&handle, -1, 0, measure->size, sizeof(float));
        task.handles[0] = handle;
        starpu_perfmodel_update_history(&perf_model, &task, STARPU_CUDA_DEFAULT + measure->cudadev, 0, measure->implementation, measure->time);
        starpu_task_clean(&task);
        starpu_data_unregister(handle);
    }
}
\endcode

Measurements have to be provided in milliseconds for the completion time models,
and in Joules for the energy consumption models.

\section Profiling Profiling

A quick view of how many tasks each worker has executed can be obtained by setting
<c>export STARPU_WORKER_STATS=1</c> (\ref STARPU_WORKER_STATS). This is a convenient way to check that
execution did happen on accelerators, without penalizing performance with
the profiling overhead. The environment variable \ref STARPU_WORKER_STATS_FILE can be defined
to specify a filename in which to display statistics; by default,
statistics are printed on the standard error stream.

A quick view of how much data transfers have been issued can be obtained by setting
<c>export STARPU_BUS_STATS=1</c> (\ref STARPU_BUS_STATS).
The
environment variable \ref STARPU_BUS_STATS_FILE can be defined to specify a filename in which to
display statistics; by default, statistics are printed on the standard error stream.

More detailed profiling information can be enabled by using <c>export STARPU_PROFILING=1</c> (\ref STARPU_PROFILING) or by
calling starpu_profiling_status_set() from the source code.
Statistics on the execution can then be obtained by using <c>export
STARPU_BUS_STATS=1</c> and <c>export STARPU_WORKER_STATS=1</c>.
More details on performance feedback are provided in the next chapter.

\section OverheadProfiling Overhead Profiling

\ref OfflinePerformanceTools can already provide an idea of the extent to which,
and in which parts, StarPU brings overhead on the execution time. To get a more precise
analysis of which parts of StarPU bring the most overhead, <c>gprof</c> can be used.

First, recompile and reinstall StarPU with <c>gprof</c> support:

\code{.sh}
./configure --enable-perf-debug --disable-shared --disable-build-tests --disable-build-examples
\endcode

Make sure not to leave a dynamic version of StarPU in the target path: remove
any remaining <c>libstarpu-*.so</c>.

Then relink your application with the static StarPU library, and make sure that
running <c>ldd</c> on your application does not mention any \c libstarpu
(i.e. it's really statically-linked).

\code{.sh}
gcc test.c -o test $(pkg-config --cflags starpu-1.3) $(pkg-config --libs starpu-1.3)
\endcode

Now you can run your application; this will create a file
<c>gmon.out</c> in the current directory, which can be processed by
running <c>gprof</c> on your application:

\code{.sh}
gprof ./test
\endcode

This will dump an analysis of the time spent in StarPU functions.

*/