@@ -14,10 +14,11 @@
* Performance model example::
* Theoretical lower bound on execution time::
* Insert Task Utility::
+* Parallel Tasks::
+* Debugging::
* The multiformat interface::
* On-GPU rendering::
* More examples:: More examples shipped with StarPU
-* Debugging:: When things go wrong.
@end menu

@node Using multiple implementations of a codelet
@@ -555,6 +556,146 @@ be executed, and is allowed to read from @code{i} to use it e.g. as an
index. Note that this macro is only available when compiling StarPU with
the compiler @code{gcc}.

+@node Parallel Tasks
+@section Parallel Tasks
+
+StarPU can leverage existing parallel computation libraries by means of
+parallel tasks. A parallel task is a task worked on by a set of CPUs (called a
+parallel or combined worker) at the same time, using an existing parallel CPU
+implementation of the computation to be achieved. This can also be useful to
+improve the load balance between slow CPUs and fast GPUs: since CPUs work
+collectively on a single task, the completion time of tasks on CPUs becomes
+comparable to the completion time on GPUs, thus relieving granularity
+discrepancy concerns.
+
+Two modes of execution exist to accommodate existing usages.
+
+@subsection Fork-mode parallel tasks
+
+In the Fork mode, StarPU will call the codelet function on one
+of the CPUs of the combined worker. The codelet function can use
+@code{starpu_combined_worker_get_size()} to get the number of threads it is
+allowed to start to achieve the computation. The CPU binding mask is already
+enforced, so that threads created by the function will inherit the mask and
+thus execute where StarPU expects them to. For instance, using OpenMP (the full
+source is available in @code{examples/openmp/vector_scal.c}):
+
+@example
+void scal_cpu_func(void *buffers[], void *_args)
+{
+    unsigned i;
+    float *factor = _args;
+    struct starpu_vector_interface *vector = buffers[0];
+    unsigned n = STARPU_VECTOR_GET_NX(vector);
+    float *val = (float *)STARPU_VECTOR_GET_PTR(vector);
+
+#pragma omp parallel for num_threads(starpu_combined_worker_get_size())
+    for (i = 0; i < n; i++)
+        val[i] *= *factor;
+}
+
+static struct starpu_perfmodel vector_scal_model =
+{
+    .type = STARPU_HISTORY_BASED,
+    .symbol = "vector_scale_parallel"
+};
+
+static struct starpu_codelet cl =
+{
+    .modes = { STARPU_RW },
+    .where = STARPU_CPU,
+    .type = STARPU_FORKJOIN,
+    .max_parallelism = INT_MAX,
+    .cpu_funcs = {scal_cpu_func, NULL},
+    .nbuffers = 1,
+    .model = &vector_scal_model,
+};
+@end example
+
+Another example is calling a parallel BLAS CPU implementation (see
+@code{examples/mult/xgemm.c}).
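+
+Submitting a task that uses such a parallel CPU implementation is no different
+from submitting any other task. As a minimal sketch (the vector size and
+variable names are only illustrative), one could register a vector and submit
+one task using the codelet above as follows:
+
+@example
+float factor = 3.14f;
+float vector[2048];    /* assumed to be filled by the application */
+starpu_data_handle_t vector_handle;
+
+starpu_vector_data_register(&vector_handle, 0, (uintptr_t)vector,
+                            2048, sizeof(vector[0]));
+
+struct starpu_task *task = starpu_task_create();
+task->cl = &cl;
+task->handles[0] = vector_handle;
+task->cl_arg = &factor;    /* passed as-is to scal_cpu_func */
+starpu_task_submit(task);
+
+starpu_task_wait_for_all();
+starpu_data_unregister(vector_handle);
+@end example
+
+Whether the task is actually executed by several CPUs at once is then up to a
+parallel-task-aware scheduler, as described below.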
+
+@subsection SPMD-mode parallel tasks
+
+In the SPMD mode, StarPU will call the codelet function on
+each CPU of the combined worker. The codelet function can use
+@code{starpu_combined_worker_get_size()} to get the total number of CPUs
+involved in the combined worker, and thus the number of calls that are made in
+parallel to the function, and @code{starpu_combined_worker_get_rank()} to get
+the rank of the current CPU within the combined worker. For instance:
+
+@example
+static void func(void *buffers[], void *_args)
+@{
+    unsigned i;
+    float *factor = _args;
+    struct starpu_vector_interface *vector = buffers[0];
+    unsigned n = STARPU_VECTOR_GET_NX(vector);
+    float *val = (float *)STARPU_VECTOR_GET_PTR(vector);
+
+    /* Compute the slice of the vector assigned to this CPU */
+    unsigned m = starpu_combined_worker_get_size();
+    unsigned j = starpu_combined_worker_get_rank();
+    unsigned slice = (n+m-1)/m;
+
+    for (i = j * slice; i < (j+1) * slice && i < n; i++)
+        val[i] *= *factor;
+@}
+
+static struct starpu_codelet cl =
+@{
+    .modes = { STARPU_RW },
+    .where = STARPU_CPU,
+    .type = STARPU_SPMD,
+    .max_parallelism = INT_MAX,
+    .cpu_funcs = { func, NULL },
+    .nbuffers = 1,
+@};
+@end example
+
+Of course, this trivial example will not really benefit from parallel task
+execution, and is only meant to be easy to understand. The benefit comes when
+the computation to be done is such that threads have to, e.g., exchange
+intermediate results, or write to the same buffer in a complex but safe way.
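+
+For instance, a sketch of an SPMD kernel in which each CPU of the combined
+worker safely writes a partial result to its own slot of a shared output
+buffer could look as follows. This is only illustrative: it assumes a codelet
+of type @code{STARPU_SPMD} with two registered vectors, the second one having
+at least as many slots as the largest combined worker.
+
+@example
+static void partial_sum_func(void *buffers[], void *_args)
+@{
+    /* buffers[0]: input values, buffers[1]: one slot per participating CPU */
+    float *val = (float *)STARPU_VECTOR_GET_PTR(buffers[0]);
+    unsigned n = STARPU_VECTOR_GET_NX(buffers[0]);
+    float *partials = (float *)STARPU_VECTOR_GET_PTR(buffers[1]);
+
+    unsigned m = starpu_combined_worker_get_size();
+    unsigned j = starpu_combined_worker_get_rank();
+    unsigned slice = (n+m-1)/m;
+    unsigned i;
+    float sum = 0.;
+
+    for (i = j * slice; i < (j+1) * slice && i < n; i++)
+        sum += val[i];
+
+    /* Each CPU writes to a distinct slot, so no synchronization is needed */
+    partials[j] = sum;
+@}
+@end example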
+
+@subsection Parallel tasks performance
+
+To benefit from parallel tasks, a parallel-task-aware StarPU scheduler has to
+be used. When exposed to codelets with a Fork or SPMD flag, the @code{pheft}
+(parallel-heft) and @code{pgreedy} (parallel greedy) schedulers will indeed also
+try to execute tasks with several CPUs. They will automatically try the various
+available combined worker sizes, and thus avoid choosing a large combined
+worker if the codelet does not actually scale that well.
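+
+For instance, assuming the scheduler is not already overridden through the
+@code{STARPU_SCHED} environment variable, a minimal sketch of requesting
+@code{pheft} from the application is:
+
+@example
+struct starpu_conf conf;
+starpu_conf_init(&conf);
+conf.sched_policy_name = "pheft";    /* or "pgreedy" */
+int ret = starpu_init(&conf);
+@end example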
+
+@subsection Combined worker sizes
+
+By default, StarPU creates combined workers according to the architecture
+structure as detected by hwloc. This means that for each object of the hwloc
+topology (NUMA node, socket, cache, ...) a combined worker will be created. If
+some nodes of the hierarchy have a large arity (e.g. many cores in a socket
+without a hierarchy of shared caches), StarPU will create combined workers of
+intermediate sizes.
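+
+As a quick sanity check, one can for instance print how many combined workers
+were actually created on a given machine (a small sketch, assuming StarPU was
+built with hwloc support):
+
+@example
+starpu_init(NULL);
+printf("%u CPU workers, %u combined workers\n",
+       starpu_cpu_worker_get_count(),
+       starpu_combined_worker_get_count());
+@end example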
+
+@subsection Concurrent parallel tasks
+
+Unfortunately, many environments and libraries do not support concurrent
+calls.
+
+For instance, most OpenMP implementations (including the main ones) do not
+support concurrent @code{#pragma omp parallel} statements without nesting them
+in another @code{#pragma omp parallel} statement, but StarPU does not yet
+support creating its CPU workers from within such a pragma.
+
+Other parallel libraries are also not safe when invoked concurrently from
+different threads, due for instance to the use of global variables in their
+sequential sections.
+
+The solution is then to use only one combined worker at a time, spanning all
+the CPUs. This can be done by setting the @code{single_combined_worker} field
+of the @code{starpu_conf} structure to 1, or by setting the
+@code{STARPU_SINGLE_COMBINED_WORKER} environment variable to 1. StarPU will
+then run parallel tasks only on all the CPUs at the same time.
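+
+For instance, a minimal sketch of enabling this from the application rather
+than through the environment is:
+
+@example
+struct starpu_conf conf;
+starpu_conf_init(&conf);
+conf.single_combined_worker = 1;
+int ret = starpu_init(&conf);
+@end example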
+
@node Debugging
@section Debugging
@@ -631,23 +772,23 @@ kind of kernel.
static void
multiformat_scal_cpu_func(void *buffers[], void *args)
@{
- struct point *aos;
- unsigned int n;
+ struct point *aos;
+ unsigned int n;

- aos = STARPU_MULTIFORMAT_GET_PTR(buffers[0]);
- n = STARPU_MULTIFORMAT_GET_NX(buffers[0]);
- ...
+ aos = STARPU_MULTIFORMAT_GET_PTR(buffers[0]);
+ n = STARPU_MULTIFORMAT_GET_NX(buffers[0]);
+ ...
@}

extern "C" void multiformat_scal_cuda_func(void *buffers[], void *_args)
@{
- unsigned int n;
- struct struct_of_arrays *soa;
+ unsigned int n;
+ struct struct_of_arrays *soa;

- soa = (struct struct_of_arrays *) STARPU_MULTIFORMAT_GET_CUDA_PTR(buffers[0]);
- n = STARPU_MULTIFORMAT_GET_NX(buffers[0]);
+ soa = (struct struct_of_arrays *) STARPU_MULTIFORMAT_GET_CUDA_PTR(buffers[0]);
+ n = STARPU_MULTIFORMAT_GET_NX(buffers[0]);

- ...
+ ...
@}
@end smallexample
@end cartouche
@@ -667,8 +808,8 @@ be given the CUDA pointer at registration, for instance:
@cartouche
@smallexample
for (workerid = 0; workerid < starpu_worker_get_count(); workerid++)
- if (starpu_worker_get_type(workerid) == STARPU_CUDA_WORKER)
- break;
+ if (starpu_worker_get_type(workerid) == STARPU_CUDA_WORKER)
+ break;

cudaSetDevice(starpu_worker_get_devid(workerid));
cudaGraphicsResourceGetMappedPointer((void**)&output, &num_bytes, resource);
@@ -694,7 +835,7 @@ directory. Simple examples include:

@table @asis
@item @code{incrementer/}:
- Trivial incrementation test.
+ Trivial incrementation test.
@item @code{basic_examples/}:
Simple documented Hello world (as shown in @ref{Hello World}), vector/scalar product (as shown
in @ref{Vector Scaling on an Hybrid CPU/GPU Machine}), matrix
@@ -702,20 +843,20 @@ directory. Simple examples include:
interface, an example using the variable data interface, and an example
using different formats on CPUs and GPUs.
@item @code{matvecmult/}:
- OpenCL example from NVidia, adapted to StarPU.
+ OpenCL example from NVidia, adapted to StarPU.
@item @code{axpy/}:
- AXPY CUBLAS operation adapted to StarPU.
+ AXPY CUBLAS operation adapted to StarPU.
@item @code{fortran/}:
- Example of Fortran bindings.
+ Example of Fortran bindings.
@end table

More advanced examples include:

@table @asis
@item @code{filters/}:
- Examples using filters, as shown in @ref{Partitioning Data}.
+ Examples using filters, as shown in @ref{Partitioning Data}.
@item @code{lu/}:
- LU matrix factorization, see for instance @code{xlu_implicit.c}
+ LU matrix factorization, see for instance @code{xlu_implicit.c}
@item @code{cholesky/}:
- Cholesky matrix factorization, see for instance @code{cholesky_implicit.c}.
+ Cholesky matrix factorization, see for instance @code{cholesky_implicit.c}.
@end table