@@ -28,8 +28,9 @@ will show roughly where time is spent, and focus correspondingly.

\section CheckTaskSize Check Task Size

-Make sure that your tasks are not too small, because the StarPU runtime overhead
-is not completely zero. You can run the tasks_size_overhead.sh script to get an
+Make sure that your tasks are not too small, as the StarPU runtime overhead
+is not completely zero. As explained in \ref TaskSizeOverhead, you can
+run the script \c tasks_size_overhead.sh to get an
idea of the scalability of tasks depending on their duration (in µs), on your
own system.
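+
+For instance (assuming the directory containing the script is in your \c PATH):
+
+\verbatim
+$ tasks_size_overhead.sh
+\endverbatim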

@@ -40,19 +41,18 @@ much bigger than this.
of cores, so it's better to try to get 10ms-ish tasks.

Tasks durations can easily be observed when performance models are defined (see
-\ref PerformanceModelExample) by using the <c>starpu_perfmodel_plot</c> or
-<c>starpu_perfmodel_display</c> tool (see \ref PerformanceOfCodelets)
+\ref PerformanceModelExample) by using the tools <c>starpu_perfmodel_plot</c> or
+<c>starpu_perfmodel_display</c> (see \ref PerformanceOfCodelets).

When using parallel tasks, the problem is even worse since StarPU has to
-synchronize the execution of tasks.
+synchronize the tasks' execution.

\section ConfigurationImprovePerformance Configuration Which May Improve Performance

-The \ref enable-fast "--enable-fast" \c configure option disables all
+The \c configure option \ref enable-fast "--enable-fast" disables all
assertions. This makes StarPU more performant for really small tasks by
disabling all sanity checks. Only use this for measurements and production, not for development, since this will drop all basic checks.
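+
+For instance, a production build could be configured with:
+
+\code
+./configure --enable-fast
+\endcode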

-
\section DataRelatedFeaturesToImprovePerformance Data Related Features Which May Improve Performance

link to \ref DataManagement
@@ -81,14 +81,14 @@ link to \ref StaticScheduling

For proper overlapping of asynchronous GPU data transfers, data has to be pinned
by CUDA. Data allocated with starpu_malloc() is always properly pinned. If the
-application is registering to StarPU some data which has not been allocated with
-starpu_malloc(), it should use starpu_memory_pin() to pin it.
+application registers to StarPU some data which has not been allocated with
+starpu_malloc(), starpu_memory_pin() should be called to pin the data memory.
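+
+A minimal sketch (the buffer size \c N and the vector registration are
+illustrative):
+
+\code{.c}
+starpu_data_handle_t handle;
+float *buffer = malloc(N * sizeof(float));
+/* Pin the buffer so that asynchronous CUDA transfers can overlap */
+starpu_memory_pin(buffer, N * sizeof(float));
+starpu_vector_data_register(&handle, STARPU_MAIN_RAM, (uintptr_t)buffer, N, sizeof(float));
+\endcode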

Due to CUDA limitations, StarPU will have a hard time overlapping its own
communications and the codelet computations if the application does not use a
dedicated CUDA stream for its computations instead of the default stream,
-which synchronizes all operations of the GPU. StarPU provides one by the use
-of starpu_cuda_get_local_stream() which can be used by all CUDA codelet
+which synchronizes all operations of the GPU. The function
+starpu_cuda_get_local_stream() returns a stream which can be used by all CUDA codelet
operations to avoid this issue. For instance:

\code{.c}
@@ -105,11 +105,11 @@ If some CUDA calls are made without specifying this local stream,
synchronization needs to be made explicit with cudaThreadSynchronize() around these
calls, to make sure that they get properly synchronized with the calls using
the local stream. Notably, \c cudaMemcpy() and \c cudaMemset() are actually
-asynchronous and need such explicit synchronization! Use cudaMemcpyAsync() and
-cudaMemsetAsync() instead.
+asynchronous and need such explicit synchronization! Use \c cudaMemcpyAsync() and
+\c cudaMemsetAsync() instead.
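+
+A sketch (\c dst, \c src and \c size are illustrative):
+
+\code{.c}
+cudaMemcpyAsync(dst, src, size, cudaMemcpyDeviceToDevice,
+                starpu_cuda_get_local_stream());
+cudaStreamSynchronize(starpu_cuda_get_local_stream());
+\endcode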

-Calling starpu_cublas_init() makes StarPU already do appropriate calls for the
-CUBLAS library. Some libraries like Magma may however change the current stream of CUBLAS v1,
+Calling starpu_cublas_init() makes StarPU perform the appropriate calls for the
+CUBLAS library. Some libraries like Magma may however change the current stream of CUBLAS v1,
one then has to call <c>cublasSetKernelStream(</c>starpu_cuda_get_local_stream()<c>)</c> at
the beginning of the codelet to make sure that CUBLAS is really using the proper
stream. When using CUBLAS v2, starpu_cublas_get_local_handle() can be called to queue CUBLAS
@@ -147,14 +147,14 @@ triggered by the completion of the kernel.
Using the flag ::STARPU_CUDA_ASYNC also makes it possible to enable concurrent kernel
execution, on cards which support it (Kepler and later, notably). This is
enabled by setting the environment variable \ref STARPU_NWORKER_PER_CUDA to the
-number of kernels to execute concurrently. This is useful when kernels are
+number of kernels to be executed concurrently. This is useful when kernels are
small and do not feed the whole GPU with threads to run.
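+
+For instance, to let two small kernels run concurrently on each CUDA device:
+
+<c>export STARPU_NWORKER_PER_CUDA=2</c>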

-Concerning memory allocation, you should really not use \c cudaMalloc/ \c cudaFree
-within the kernel, since \c cudaFree introduces a awfully lot of synchronizations
+Concerning memory allocation, you should really not use \c cudaMalloc()/\c cudaFree()
+within the kernel, since \c cudaFree() introduces an awful lot of synchronizations
within CUDA itself. You should instead add a parameter to the codelet with the
::STARPU_SCRATCH mode access. You can then pass to the task a handle registered
-with the desired size but with the \c NULL pointer, that handle can even be the
+with the desired size but with the \c NULL pointer; the handle can even be
shared between tasks: StarPU will allocate per-task data on the fly before task
execution, and reuse the allocated data between tasks.
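+
+A minimal sketch (the codelet \c my_cl and the vector size are illustrative):
+
+\code{.c}
+starpu_data_handle_t scratch;
+/* Register with the desired size, but a NULL pointer and no home node */
+starpu_vector_data_register(&scratch, -1, (uintptr_t)NULL, 1024, sizeof(float));
+
+struct starpu_task *task = starpu_task_create();
+task->cl = &my_cl; /* codelet declaring my_cl.modes[0] = STARPU_SCRATCH */
+task->handles[0] = scratch;
+starpu_task_submit(task);
+\endcode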

@@ -177,8 +177,8 @@ kernel startup and completion.

It may happen that for some reason, StarPU does not make progress for a long
period of time. This is sometimes due to contention inside StarPU, but
-sometimes this is due to external reasons, such as stuck MPI driver, or CUDA
-driver, etc.
+sometimes this is due to external reasons, such as a stuck MPI or CUDA
+driver.

<c>export STARPU_WATCHDOG_TIMEOUT=10000</c> (\ref STARPU_WATCHDOG_TIMEOUT)

@@ -187,30 +187,34 @@ any task for 10ms, but lets the application continue normally. In addition to th

<c>export STARPU_WATCHDOG_CRASH=1</c> (\ref STARPU_WATCHDOG_CRASH)

-raises <c>SIGABRT</c> in this condition, thus allowing to catch the situation in gdb.
+raises <c>SIGABRT</c> in this condition, thus making it possible to catch the
+situation in \c gdb.
+
It can also be useful to type <c>handle SIGABRT nopass</c> in <c>gdb</c> to be able to let
the process continue, after inspecting the state of the process.

\section HowToLimitMemoryPerNode How to Limit Memory Used By StarPU And Cache Buffer Allocations

By default, StarPU makes sure to use at most 90% of the memory of GPU devices,
-moving data in and out of the device as appropriate and with prefetch and
-writeback optimizations. Concerning the main memory, by default it will not
-limit its consumption, since by default it has nowhere to push the data to when
-memory gets tight. This also means that by default StarPU will not cache buffer
-allocations in main memory, since it does not know how much of the system memory
-it can afford.
-
-In the case of GPUs, the \ref STARPU_LIMIT_CUDA_MEM, \ref STARPU_LIMIT_CUDA_devid_MEM,
-\ref STARPU_LIMIT_OPENCL_MEM, and \ref STARPU_LIMIT_OPENCL_devid_MEM environment variables
-can be used to control how
-much (in MiB) of the GPU device memory should be used at most by StarPU (their
-default values are 90% of the available memory).
-
-In the case of the main memory, the \ref STARPU_LIMIT_CPU_MEM environment
-variable can be used to specify how much (in MiB) of the main memory should be
-used at most by StarPU for buffer allocations. This way, StarPU will be able to
-cache buffer allocations (which can be a real benefit if a lot of bufferes are
+moving data in and out of the device as appropriate, as well as using
+prefetch and writeback optimizations.
+
+The environment variables \ref STARPU_LIMIT_CUDA_MEM, \ref STARPU_LIMIT_CUDA_devid_MEM,
+\ref STARPU_LIMIT_OPENCL_MEM, and \ref STARPU_LIMIT_OPENCL_devid_MEM
+can be used to control how much (in MiB) of the GPU device memory
+should be used at most by StarPU (the default value is to use 90% of the
+available memory).
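+
+For example, to let StarPU use at most 2 GiB on each CUDA device:
+
+<c>export STARPU_LIMIT_CUDA_MEM=2048</c>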
+
+By default, the usage of the main memory is not limited, as the
+default mechanisms provide no means to evict data from main memory when it
+gets too tight. This also means that by default StarPU will not cache buffer
+allocations in main memory, since it does not know how much of the
+system memory it can afford.
+
+The environment variable \ref STARPU_LIMIT_CPU_MEM can be used to
+specify how much (in MiB) of the main memory should be used at most by
+StarPU for buffer allocations. This way, StarPU will be able to
+cache buffer allocations (which can be a real benefit if a lot of buffers are
involved, or if allocation fragmentation can become a problem), and when using
\ref OutOfCore, StarPU will know when it should evict data out to the disk.
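+
+For example, to let StarPU use at most 4 GiB of main memory for its buffer
+allocations:
+
+<c>export STARPU_LIMIT_CPU_MEM=4096</c>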

@@ -233,8 +237,8 @@ caches or data out to the disk, starpu_memory_allocate() can be used to
specify an amount of memory to be accounted for. starpu_memory_deallocate()
can be used to account freed memory back. Those can for instance be used by data
interfaces with dynamic data buffers: instead of using starpu_malloc_on_node(),
-they would dynamically allocate data with malloc/realloc, and notify starpu of
-the delta thanks to starpu_memory_allocate() and starpu_memory_deallocate() calls.
+they would dynamically allocate data with \c malloc()/\c realloc(), and notify StarPU of
+the delta by calling starpu_memory_allocate() and starpu_memory_deallocate().

starpu_memory_get_total() and starpu_memory_get_available()
can be used to get an estimation of how much memory is available.
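+
+A minimal sketch of this accounting (the size and the flag are illustrative):
+
+\code{.c}
+size_t size = 64 * 1024 * 1024;
+/* Account for the allocation, waiting if the memory budget is tight */
+starpu_memory_allocate(STARPU_MAIN_RAM, size, STARPU_MEMORY_WAIT);
+void *buffer = malloc(size);
+/* ... use buffer ... */
+free(buffer);
+starpu_memory_deallocate(STARPU_MAIN_RAM, size);
+\endcode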
@@ -251,7 +255,7 @@ to reserve this amount immediately.

It is possible to reduce the memory footprint of the task and data internal
structures of StarPU by describing the shape of your machine and/or your
-application at the \c configure step.
+application when calling \c configure.

To reduce the memory footprint of the data internal structures of StarPU, one
can set the
@@ -271,28 +275,27 @@ execution. For example, in the Cholesky factorization (dense linear algebra
application), the GEMM task uses up to 3 buffers, so it is possible to set the
maximum number of task buffers to 3 to run a Cholesky factorization on StarPU.

-The size of the various structures of StarPU can be printed by
+The size of the various structures of StarPU can be printed by
<c>tests/microbenchs/display_structures_size</c>.

-It is also often useless to submit *all* the tasks at the same time. One can
-make the starpu_task_submit() function block when a reasonable given number of
-tasks have been submitted, by setting the \ref STARPU_LIMIT_MIN_SUBMITTED_TASKS and
-\ref STARPU_LIMIT_MAX_SUBMITTED_TASKS environment variables, for instance:
+It is also often useless to submit *all* the tasks at the same time.
+Task submission can be blocked when a reasonable given number of
+tasks have been submitted, by setting the environment variables
+\ref STARPU_LIMIT_MIN_SUBMITTED_TASKS and \ref STARPU_LIMIT_MAX_SUBMITTED_TASKS.

<c>
export STARPU_LIMIT_MAX_SUBMITTED_TASKS=10000
-
export STARPU_LIMIT_MIN_SUBMITTED_TASKS=9000
</c>

-To make StarPU block submission when 10000 tasks are submitted, and unblock
+will make StarPU block submission when 10000 tasks are submitted, and unblock
submission when only 9000 tasks are still submitted, i.e. 1000 tasks have
completed among the 10000 which were submitted when submission was blocked. Of
course this may reduce parallelism if the threshold is set too low. The precise
balance depends on the application task graph.

An idea of how much memory is used for tasks and data handles can be obtained by
-setting the \ref STARPU_MAX_MEMORY_USE environment variable to <c>1</c>.
+setting the environment variable \ref STARPU_MAX_MEMORY_USE to <c>1</c>.

\section HowtoReuseMemory How To Reuse Memory

@@ -303,7 +306,7 @@ tasks. For this system to work with MPI tasks, you need to submit tasks progress
of as soon as possible, because in the case of MPI receives, the allocation cache check for reusing data
buffers will be done at submission time, not at execution time.

-You have two options to control the task submission flow. The first one is by
+There are two options to control the task submission flow. The first one is by
controlling the number of submitted tasks during the whole execution. This can
be done either by setting the environment variables
\ref STARPU_LIMIT_MAX_SUBMITTED_TASKS and \ref STARPU_LIMIT_MIN_SUBMITTED_TASKS to
@@ -348,11 +351,12 @@ To force continuing calibration,
use <c>export STARPU_CALIBRATE=1</c> (\ref STARPU_CALIBRATE). This may be necessary if your application
has not-so-stable performance. StarPU will force calibration (and thus ignore
the current result) until 10 (<c>_STARPU_CALIBRATION_MINIMUM</c>) measurements have been
-made on each architecture, to avoid badly scheduling tasks just because the
+made on each architecture, to avoid bad scheduling decisions just because the
first measurements were not so good. Details on the current performance model status
-can be obtained from the tool <c>starpu_perfmodel_display</c>: the <c>-l</c>
-option lists the available performance models, and the <c>-s</c> option permits
-to choose the performance model to be displayed. The result looks like:
+can be obtained with the tool <c>starpu_perfmodel_display</c>: the
+option <c>-l</c> lists the available performance models, and the
+option <c>-s</c> allows choosing the performance model to be
+displayed. The result looks like:

\verbatim
$ starpu_perfmodel_display -s starpu_slu_lu_model_11
@@ -364,7 +368,7 @@ e5a07e31 4096 0.000000e+00 1.717457e+01 5.190038e+00 14
...
\endverbatim

-Which shows that for the LU 11 kernel with a 1MiB matrix, the average
+which shows that for the LU 11 kernel with a 1MiB matrix, the average
execution time on CPUs was about 25ms, with a 0.2ms standard deviation, over
8 samples. It is a good idea to check this before doing actual performance
measurements.
@@ -373,7 +377,7 @@ A graph can be drawn by using the tool <c>starpu_perfmodel_plot</c>:

\verbatim
$ starpu_perfmodel_plot -s starpu_slu_lu_model_11
-4096 16384 65536 262144 1048576 4194304
+4096 16384 65536 262144 1048576 4194304
$ gnuplot starpu_starpu_slu_lu_model_11.gp
$ gv starpu_starpu_slu_lu_model_11.eps
\endverbatim
@@ -451,28 +455,29 @@ STARPU_BUS_STATS=1</c> and <c>export STARPU_WORKER_STATS=1</c> .
\section OverheadProfiling Overhead Profiling

\ref OfflinePerformanceTools can already provide an idea of to what extent and
-which part of StarPU bring overhead on the execution time. To get a more precise
-analysis of the parts of StarPU which bring most overhead, <c>gprof</c> can be used.
+which parts of StarPU bring overhead to the execution time. To get a more precise
+analysis of which parts of StarPU bring the most overhead, <c>gprof</c> can be used.

First, recompile and reinstall StarPU with <c>gprof</c> support:

\code
-./configure --enable-perf-debug --disable-shared --disable-build-tests --disable-build-examples
+../configure --enable-perf-debug --disable-shared --disable-build-tests --disable-build-examples
\endcode

Make sure not to leave a dynamic version of StarPU in the target path: remove
any remaining <c>libstarpu-*.so</c>.

Then relink your application with the static StarPU library, and make sure that
-running <c>ldd</c> on your application does not mention any libstarpu
+running <c>ldd</c> on your application does not mention any \c libstarpu
(i.e. it's really statically-linked).

\code
gcc test.c -o test $(pkg-config --cflags starpu-1.3) $(pkg-config --libs starpu-1.3)
\endcode

-Now you can run your application, and a <c>gmon.out</c> file should appear in the
-current directory, you can process it by running <c>gprof</c> on your application:
+Now you can run your application. This will create a file
+<c>gmon.out</c> in the current directory, which can be processed by
+running <c>gprof</c> on your application:

\code
gprof ./test