| 123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293949596979899100101102103104105106107108109110111112113114115116117118119120121122123124125126127128129130131132133134135136137138139140141142143144145146147148149150151152153154155156157158159160161162163164165166167168169170171172173174175176177178179180181182183184185186187188189190191192193194195196197198199200201202203204205206207208209210211212213214215216217218219220221222223224225226227228229230231232233234235236237238239240241242243244245246247248249250251252253254255256257258259260261262263264265266267268269270271272273274275276277278279280281282 | /* * This file is part of the StarPU Handbook. * Copyright (C) 2009--2011  Universit@'e de Bordeaux * Copyright (C) 2010, 2011, 2012, 2013, 2014  CNRS * Copyright (C) 2011, 2012 INRIA * See the file version.doxy for copying conditions. *//*! \page CheckListWhenPerformanceAreNotThere Check List When Performance Are Not ThereTODO: improve!Simply encapsulating application kernels into tasks already permits toseamlessly support CPU and GPUs at the same time. To achieve goodperformance, we give below a list of features which should be checked.\section DataRelatedFeaturesToImprovePerformance Data Related Features That May Improve Performancelink to \ref DataManagementlink to \ref DataPrefetch\section TaskRelatedFeaturesToImprovePerformance Task Related Features That May Improve Performancelink to \ref TaskGranularitylink to \ref TaskSubmissionlink to \ref TaskPriorities\section SchedulingRelatedFeaturesToImprovePerformance Scheduling Related Features That May Improve Performancelink to \ref TaskSchedulingPolicylink to \ref TaskDistributionVsDataTransferlink to \ref Power-basedSchedulinglink to \ref StaticScheduling\section CUDA-specificOptimizations CUDA-specific OptimizationsDue to CUDA limitations, StarPU will have a hard time overlapping its owncommunications and the codelet computations if the application does not use adedicated CUDA stream for its computations instead of the default stream,which synchronizes all operations of the GPU. StarPU provides one by the useof starpu_cuda_get_local_stream() which can be used by all CUDA codeletoperations to avoid this issue. For instance:\code{.c}func <<<grid,block,0,starpu_cuda_get_local_stream()>>> (foo, bar);cudaStreamSynchronize(starpu_cuda_get_local_stream());\endcodeUnfortunately, some CUDA libraries do not have stream variants ofkernels. That will lower the potential for overlapping.Calling starpu_cublas_init() makes StarPU already do appropriate calls for theCUBLAS library. Some libraries like Magma may however change the current stream,one then has to call cublasSetKernelStream(starpu_cuda_get_local_stream()); atthe beginning of the codelet to make sure that CUBLAS is really using the properstream.If the kernel can be made to only use this local stream or other self-allocatedstreams, i.e. the whole kernel submission can be made asynchronous, thenone should enable asynchronous execution of the kernel.  That means settingthe flag ::STARPU_CUDA_ASYNC in the corresponding field starpu_codelet::cuda_flags, and dropping thecudaStreamSynchronize() call at the end of the cuda_func function, so that itreturns immediately after having queued the kernel to the local stream. That way, StarPU will beable to submit and complete data transfers while kernels are executing, instead of only at eachkernel submission. The kernel just has to make sure that StarPU can use thelocal stream to synchronize with the kernel startup and completion.Using the flag ::STARPU_CUDA_ASYNC also permits to enable concurrent kernelexecution, on cards which support it (Kepler and later, notably). This isenabled by setting the environment variable \ref STARPU_NWORKER_PER_CUDA to thenumber of kernels to execute concurrently.  This is useful when kernels aresmall and do not feed the whole GPU with threads to run.\section OpenCL-specificOptimizations OpenCL-specific OptimizationsIf the kernel can be made to only use the StarPU-provided command queue or other self-allocatedqueues, i.e. the whole kernel submission can be made asynchronous, thenone should enable asynchronous execution of the kernel. This means settingthe flag ::STARPU_OPENCL_ASYNC in the corresponding field starpu_codelet::opencl_flags and dropping theclFinish() and starpu_opencl_collect_stats() calls at the end of the kernel, sothat it returns immediately after having queued the kernel to the provided queue.That way, StarPU will be able to submit and complete data transfers while kernels are executing, instead ofonly at each kernel submission. The kernel just has to make surethat StarPU can use the command queue it has provided to synchronize with thekernel startup and completion.\section DetectionStuckConditions Detection Stuck ConditionsIt may happen that for some reason, StarPU does not make progress for a longperiod of time.  Reason are sometimes due to contention inside StarPU, butsometimes this is due to external reasons, such as stuck MPI driver, or CUDAdriver, etc.<c>export STARPU_WATCHDOG_TIMEOUT=10000</c> (\ref STARPU_WATCHDOG_TIMEOUT)allows to make StarPU print an error message whenever StarPU does not terminateany task for 10ms. In addition to that,<c>export STARPU_WATCHDOG_CRASH=1</c> (\ref STARPU_WATCHDOG_CRASH)raises SIGABRT in that condition, thus allowing to catch the situation in gdb.It can also be useful to type "handle SIGABRT nopass" in gdb to be able to letthe process continue, after inspecting the state of the process.\section HowToLimitMemoryPerNode How to limit memory per nodeTODOTalk about\ref STARPU_LIMIT_CUDA_devid_MEM, \ref STARPU_LIMIT_CUDA_MEM,\ref STARPU_LIMIT_OPENCL_devid_MEM, \ref STARPU_LIMIT_OPENCL_MEMand \ref STARPU_LIMIT_CPU_MEMstarpu_memory_get_total()starpu_memory_get_available()\section HowToReduceTheMemoryFootprintOfInternalDataStructures How To Reduce The Memory Footprint Of Internal Data StructuresIt is possible to reduce the memory footprint of the task and data internalstructures of StarPU by describing the shape of your machine and/or yourapplication at the configure step.To reduce the memory footprint of the data internal structures of StarPU, onecan set the \ref enable-maxcpus "--enable-maxcpus", \ref enable-maxcudadev"--enable-maxcudadev", \ref enable-maxopencldev "--enable-maxopencldev" and\ref enable-maxnodes "--enable-maxnodes" configure parameters to give StarPUthe architecture of the machine it will run on, thus tuning the size of thestructures to the machine.To reduce the memory footprint of the task internal structures of StarPU, onecan set the \ref enable-maxbuffers "--enable-maxbuffers" configure parameter togive StarPU the maximum number of buffers that a task can use during anexecution. For example, in the Cholesky factorization (dense linear algebraapplication), the GEMM task uses up to 3 buffers, so it is possible to set themaximum number of task buffers to 3 to run a Cholesky factorization on StarPU.\section HowtoReuseMemory How to reuse memoryWhen your application needs to allocate more data than the available amount ofmemory usable by StarPU (given by \ref starpu_memory_get_available() ), theallocation cache system can reuse data buffers used by previously executedtasks. For that system to work with MPI tasks, you need to submit tasks progressively insteadof as soon as possible, because in the case of MPI receives, the allocation cache check for reusing databuffers will be done at submission time, not at execution time.You have two options to control the task submission flow. The first one is bycontrolling the number of submitted tasks during the whole execution. This canbe done whether by setting the environment variables \refSTARPU_LIMIT_MAX_NSUBMITTED_TASKS and \ref STARPU_LIMIT_MIN_NSUBMITTED_TASKS totell StarPU when to stop submitting tasks and when to wake up and submit tasksagain, or by explicitely calling \ref starpu_task_wait_for_n_submitted() inyour application code for finest grain control (for example, between twoiterations of a submission loop).The second option is to control the memory size of the allocation cache. Thiscan be done in the application by using jointly \refstarpu_memory_get_available() and \ref starpu_memory_wait_available() to submittasks only when there is enough memory space to allocate the data needed by thetask, i.e when enough data are available for reuse in the allocation cache.\section PerformanceModelCalibration Performance Model CalibrationMost schedulers are based on an estimation of codelet duration on each kindof processing unit. For this to be possible, the application programmer needsto configure a performance model for the codelets of the application (see\ref PerformanceModelExample for instance). History-based performance modelsuse on-line calibration.  StarPU will automatically calibrate codeletswhich have never been calibrated yet, and save the result in<c>$STARPU_HOME/.starpu/sampling/codelets</c>.The models are indexed by machine name. To share the models betweenmachines (e.g. for a homogeneous cluster), use <c>exportSTARPU_HOSTNAME=some_global_name</c> (\ref STARPU_HOSTNAME). To force continuing calibration,use <c>export STARPU_CALIBRATE=1</c> (\ref STARPU_CALIBRATE). This may be necessary if your applicationhas not-so-stable performance. StarPU will force calibration (and thus ignorethe current result) until 10 (<c>_STARPU_CALIBRATION_MINIMUM</c>) measurements have beenmade on each architecture, to avoid badly scheduling tasks just because thefirst measurements were not so good. Details on the current performance model statuscan be obtained from the command <c>starpu_perfmodel_display</c>: the <c>-l</c>option lists the available performance models, and the <c>-s</c> option permitsto choose the performance model to be displayed. The result looks like:\verbatim$ starpu_perfmodel_display -s starpu_slu_lu_model_11performance model for cpu_impl_0# hash    size     flops         mean          dev           n914f3bef  1048576  0.000000e+00  2.503577e+04  1.982465e+02  83e921964  65536    0.000000e+00  5.527003e+02  1.848114e+01  7e5a07e31  4096     0.000000e+00  1.717457e+01  5.190038e+00  14...\endverbatimWhich shows that for the LU 11 kernel with a 1MiB matrix, the averageexecution time on CPUs was about 25ms, with a 0.2ms standard deviation, over8 samples. It is a good idea to check this before doing actual performancemeasurements.A graph can be drawn by using the tool <c>starpu_perfmodel_plot</c>:\verbatim$ starpu_perfmodel_plot -s starpu_slu_lu_model_114096 16384 65536 262144 1048576 4194304 $ gnuplot starpu_starpu_slu_lu_model_11.gp$ gv starpu_starpu_slu_lu_model_11.eps\endverbatim\image html starpu_starpu_slu_lu_model_11.png\image latex starpu_starpu_slu_lu_model_11.eps "" width=\textwidthIf a kernel source code was modified (e.g. performance improvement), thecalibration information is stale and should be dropped, to re-calibrate fromstart. This can be done by using <c>export STARPU_CALIBRATE=2</c> (\ref STARPU_CALIBRATE).Note: history-based performance models get calibratedonly if a performance-model-based scheduler is chosen.The history-based performance models can also be explicitly filled by theapplication without execution, if e.g. the application already has a series ofmeasurements. This can be done by using starpu_perfmodel_update_history(),for instance:\code{.c}static struct starpu_perfmodel perf_model = {    .type = STARPU_HISTORY_BASED,    .symbol = "my_perfmodel",};struct starpu_codelet cl = {    .cuda_funcs = { cuda_func1, cuda_func2 },    .nbuffers = 1,    .modes = {STARPU_W},    .model = &perf_model};void feed(void) {    struct my_measure *measure;    struct starpu_task task;    starpu_task_init(&task);    task.cl = &cl;    for (measure = &measures[0]; measure < measures[last]; measure++) {        starpu_data_handle_t handle;	starpu_vector_data_register(&handle, -1, 0, measure->size, sizeof(float));	task.handles[0] = handle;	starpu_perfmodel_update_history(&perf_model, &task,	                                STARPU_CUDA_DEFAULT + measure->cudadev, 0,	                                measure->implementation, measure->time);	starpu_task_clean(&task);	starpu_data_unregister(handle);    }}\endcodeMeasurement has to be provided in milliseconds for the completion time models,and in Joules for the energy consumption models.\section Profiling ProfilingA quick view of how many tasks each worker has executed can be obtained by setting<c>export STARPU_WORKER_STATS=1</c> (\ref STARPU_WORKER_STATS). This is a convenient way to check thatexecution did happen on accelerators, without penalizing performance withthe profiling overhead.A quick view of how much data transfers have been issued can be obtained by setting<c>export STARPU_BUS_STATS=1</c> (\ref STARPU_BUS_STATS).More detailed profiling information can be enabled by using <c>export STARPU_PROFILING=1</c> (\ref STARPU_PROFILING)or bycalling starpu_profiling_status_set() from the source code.Statistics on the execution can then be obtained by using <c>exportSTARPU_BUS_STATS=1</c> and <c>export STARPU_WORKER_STATS=1</c> . More details on performance feedback are provided in the next chapter.*/
 |