
Document parallel tasks

Samuel Thibault 13 years ago
parent
commit
15a2f1d26f
2 changed files with 170 additions and 20 deletions
  1. doc/chapters/advanced-examples.texi  +161 -20
  2. doc/chapters/basic-api.texi  +9 -0

+ 161 - 20
doc/chapters/advanced-examples.texi

@@ -14,10 +14,11 @@
 * Performance model example::   
 * Theoretical lower bound on execution time::  
 * Insert Task Utility::          
+* Parallel Tasks::
+* Debugging::                   When things go wrong.
 * The multiformat interface::
 * On-GPU rendering::
 * More examples::               More examples shipped with StarPU
-* Debugging::                   When things go wrong.
 @end menu
 
 @node Using multiple implementations of a codelet
@@ -555,6 +556,146 @@ be executed, and is allowed to read from @code{i} to use it e.g. as an
 index. Note that this macro is only available when compiling StarPU with
 the compiler @code{gcc}.
 
+@node Parallel Tasks
+@section Parallel Tasks
+
+StarPU can leverage existing parallel computation libraries by means of
+parallel tasks. A parallel task is a task which is worked on by a set of CPUs
+(called a parallel or combined worker) at the same time, by using an existing
+parallel CPU implementation of the computation to be achieved. This can also be
+useful to improve the load balance between slow CPUs and fast GPUs: since CPUs
+work collectively on a single task, the completion time of tasks on CPUs
+becomes comparable to the completion time on GPUs, thus relieving the
+application from granularity discrepancy concerns.
+
+Two modes of execution exist, to accommodate existing usage.
+
+@subsection Fork-mode parallel tasks
+
+In the Fork mode, StarPU will call the codelet function on one
+of the CPUs of the combined worker. The codelet function can use
+@code{starpu_combined_worker_get_size()} to get the number of threads it is
+allowed to start to achieve the computation. The CPU binding mask is already
+enforced, so that threads created by the function will inherit the mask, and
+thus execute where StarPU expects them to. For instance, using OpenMP (the
+full source is available in @code{examples/openmp/vector_scal.c}):
+
+@example
+void scal_cpu_func(void *buffers[], void *_args)
+{
+    unsigned i;
+    float *factor = _args;
+    struct starpu_vector_interface *vector = buffers[0];
+    unsigned n = STARPU_VECTOR_GET_NX(vector);
+    float *val = (float *)STARPU_VECTOR_GET_PTR(vector);
+
+#pragma omp parallel for num_threads(starpu_combined_worker_get_size())
+    for (i = 0; i < n; i++)
+        val[i] *= *factor;
+}
+
+static struct starpu_perfmodel vector_scal_model =
+{
+    .type = STARPU_HISTORY_BASED,
+    .symbol = "vector_scale_parallel"
+};
+
+static struct starpu_codelet cl =
+{
+    .modes = { STARPU_RW },
+    .where = STARPU_CPU,
+    .type = STARPU_FORKJOIN,
+    .max_parallelism = INT_MAX,
+    .cpu_funcs = {scal_cpu_func, NULL},
+    .nbuffers = 1,
+};
+@end example
+
+Other examples include calling a parallel BLAS CPU implementation (see
+@code{examples/mult/xgemm.c}).
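+
+As a minimal sketch, a fork-mode codelet function would typically just set the
+parallelism degree of the library before calling its parallel kernel (the
+@code{set_blas_num_threads()} helper below is hypothetical; the actual call
+depends on the BLAS implementation at hand):
+
+@example
+void gemm_cpu_func(void *buffers[], void *_args)
+{
+    /* Let the BLAS library use exactly as many threads as there are
+       CPUs in the combined worker this task was assigned to. */
+    set_blas_num_threads(starpu_combined_worker_get_size());
+
+    /* ... call the parallel BLAS kernel on the task's buffers ... */
+}
+@end example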
+
+@subsection SPMD-mode parallel tasks
+
+In the SPMD mode, StarPU will call the codelet function on
+each CPU of the combined worker. The codelet function can use
+@code{starpu_combined_worker_get_size()} to get the total number of CPUs
+involved in the combined worker, and thus the number of calls that are made in
+parallel to the function, and @code{starpu_combined_worker_get_rank()} to get
+the rank of the current CPU within the combined worker. For instance:
+
+@example
+static void func(void *buffers[], void *_args)
+@{
+    unsigned i;
+    float *factor = _args;
+    struct starpu_vector_interface *vector = buffers[0];
+    unsigned n = STARPU_VECTOR_GET_NX(vector);
+    float *val = (float *)STARPU_VECTOR_GET_PTR(vector);
+
+    /* Compute the slice this worker has to work on */
+    unsigned m = starpu_combined_worker_get_size();
+    unsigned j = starpu_combined_worker_get_rank();
+    unsigned slice = (n+m-1)/m;
+
+    for (i = j * slice; i < (j+1) * slice && i < n; i++)
+        val[i] *= *factor;
+@}
+
+static struct starpu_codelet cl =
+@{
+    .modes = { STARPU_RW },
+    .where = STARPU_CPU,
+    .type = STARPU_SPMD,
+    .max_parallelism = INT_MAX,
+    .cpu_funcs = { func, NULL },
+    .nbuffers = 1,
+@};
+@end example
+
+Of course, this trivial example will not really benefit from parallel task
+execution, and was only meant to be simple to understand. The benefit comes
+when the computation to be done is such that threads have to e.g. exchange
+intermediate results, or write to the same buffer in a complex but safe way.
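+
+Submitting such a parallel task does not differ from submitting a sequential
+one. For instance, with the insert-task helper (see @ref{Insert Task Utility}),
+assuming a vector handle and a scaling factor as in the previous examples:
+
+@example
+float factor = 3.14;
+starpu_insert_task(&cl,
+                   STARPU_RW, vector_handle,
+                   STARPU_VALUE, &factor, sizeof(factor),
+                   0);
+@end example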
+
+@subsection Parallel tasks performance
+
+To benefit from parallel tasks, a parallel-task-aware StarPU scheduler has to
+be used. When exposed to codelets with a Fork or SPMD flag, the @code{pheft}
+(parallel-heft) and @code{pgreedy} (parallel greedy) schedulers will indeed also
+try to execute tasks with several CPUs. They will automatically try the various
+available combined worker sizes, and can thus avoid choosing a large combined
+worker if the codelet does not actually scale that well.
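+
+For instance, one can select the parallel-heft scheduler through the usual
+@code{STARPU_SCHED} environment variable:
+
+@example
+$ STARPU_SCHED=pheft ./my_application
+@end example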
+
+@subsection Combined worker sizes
+
+By default, StarPU creates combined workers according to the architecture
+structure as detected by hwloc: for each object of the hwloc topology (NUMA
+node, socket, cache, ...) a combined worker will be created. If some nodes of
+the hierarchy have a large arity (e.g. many cores in a socket without a
+hierarchy of shared caches), StarPU will create combined workers of
+intermediate sizes.
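+
+Assuming the StarPU version at hand supports them, the
+@code{STARPU_MIN_WORKERSIZE} and @code{STARPU_MAX_WORKERSIZE} environment
+variables can be used to bound the sizes of the combined workers which get
+created:
+
+@example
+$ STARPU_MIN_WORKERSIZE=4 STARPU_MAX_WORKERSIZE=8 ./my_application
+@end example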
+
+@subsection Concurrent parallel tasks
+
+Unfortunately, many environments and libraries do not support concurrent
+calls.
+
+For instance, most OpenMP implementations (including the main ones) do not
+support concurrent @code{pragma omp parallel} statements without nesting them in
+another @code{pragma omp parallel} statement, but StarPU does not yet support
+creating its CPU workers through such a pragma.
+
+Other parallel libraries are also not safe when invoked concurrently from
+different threads, for instance because of the use of global variables in
+their sequential sections.
+
+The solution is then to use only a single combined worker, spanning all
+the CPUs. This can be done by setting the @code{single_combined_worker}
+field to 1 in the @code{starpu_conf} structure, or by setting the
+@code{STARPU_SINGLE_COMBINED_WORKER} environment variable to 1. StarPU will
+then run parallel tasks only over all the CPUs at the same time.
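+
+As a minimal sketch, setting this field at initialization time could look as
+follows:
+
+@example
+struct starpu_conf conf;
+starpu_conf_init(&conf);
+conf.single_combined_worker = 1;
+starpu_init(&conf);
+@end example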
+
 @node Debugging
 @section Debugging
 
@@ -631,23 +772,23 @@ kind of kernel.
 static void
 multiformat_scal_cpu_func(void *buffers[], void *args)
 @{
-	struct point *aos;
-	unsigned int n;
+    struct point *aos;
+    unsigned int n;
 
-	aos = STARPU_MULTIFORMAT_GET_PTR(buffers[0]);
-	n = STARPU_MULTIFORMAT_GET_NX(buffers[0]);
-	...
+    aos = STARPU_MULTIFORMAT_GET_PTR(buffers[0]);
+    n = STARPU_MULTIFORMAT_GET_NX(buffers[0]);
+    ...
 @}
 
 extern "C" void multiformat_scal_cuda_func(void *buffers[], void *_args)
 @{
-	unsigned int n;
-	struct struct_of_arrays *soa;
+    unsigned int n;
+    struct struct_of_arrays *soa;
 
-	soa = (struct struct_of_arrays *) STARPU_MULTIFORMAT_GET_CUDA_PTR(buffers[0]);
-	n = STARPU_MULTIFORMAT_GET_NX(buffers[0]);
+    soa = (struct struct_of_arrays *) STARPU_MULTIFORMAT_GET_CUDA_PTR(buffers[0]);
+    n = STARPU_MULTIFORMAT_GET_NX(buffers[0]);
 
-	...
+    ...
 @}
 @end smallexample
 @end cartouche
@@ -667,8 +808,8 @@ be given the CUDA pointer at registration, for instance:
 @cartouche
 @smallexample
 for (workerid = 0; workerid < starpu_worker_get_count(); workerid++)
-	if (starpu_worker_get_type(workerid) == STARPU_CUDA_WORKER)
-		break;
+        if (starpu_worker_get_type(workerid) == STARPU_CUDA_WORKER)
+                break;
 
 cudaSetDevice(starpu_worker_get_devid(workerid));
 cudaGraphicsResourceGetMappedPointer((void**)&output, &num_bytes, resource);
@@ -694,7 +835,7 @@ directory. Simple examples include:
 
 @table @asis
 @item @code{incrementer/}:
-	Trivial incrementation test.
+    Trivial incrementation test.
 @item @code{basic_examples/}:
         Simple documented Hello world (as shown in @ref{Hello World}), vector/scalar product (as shown
         in @ref{Vector Scaling on an Hybrid CPU/GPU Machine}), matrix
@@ -702,20 +843,20 @@ directory. Simple examples include:
         interface, an example using the variable data interface, and an example
         using different formats on CPUs and GPUs.
 @item @code{matvecmult/}:
-	OpenCL example from NVidia, adapted to StarPU.
+    OpenCL example from NVidia, adapted to StarPU.
 @item @code{axpy/}:
-	AXPY CUBLAS operation adapted to StarPU.
+    AXPY CUBLAS operation adapted to StarPU.
 @item @code{fortran/}:
-	Example of Fortran bindings.
+    Example of Fortran bindings.
 @end table
 
 More advanced examples include:
 
 @table @asis
 @item @code{filters/}:
-	Examples using filters, as shown in @ref{Partitioning Data}.
+    Examples using filters, as shown in @ref{Partitioning Data}.
 @item @code{lu/}:
-	LU matrix factorization, see for instance @code{xlu_implicit.c}
+    LU matrix factorization, see for instance @code{xlu_implicit.c}
 @item @code{cholesky/}:
-	Cholesky matrix factorization, see for instance @code{cholesky_implicit.c}.
+    Cholesky matrix factorization, see for instance @code{cholesky_implicit.c}.
 @end table

+ 9 - 0
doc/chapters/basic-api.texi

@@ -1169,6 +1169,15 @@ instance).
 @item @code{name} (optional)
 Codelets are allowed to have a name, which can be useful for debugging purposes.
 
+@item @code{type} (optional)
+The default is @code{STARPU_SEQ}, i.e. the usual sequential implementation.
+Other values (@code{STARPU_SPMD} or @code{STARPU_FORKJOIN}) declare that a
+parallel implementation is also available. See @ref{Parallel Tasks} for
+details.
+
+@item @code{max_parallelism} (optional)
+If a parallel implementation is available, this denotes the maximum combined
+worker size that StarPU will use to execute parallel tasks for this codelet.
+
 @end table
 @end deftp