Document parallel tasks

Samuel Thibault, 13 years ago
commit 15a2f1d26f
2 changed files with 170 additions and 20 deletions:
  1. doc/chapters/advanced-examples.texi (+161 -20)
  2. doc/chapters/basic-api.texi (+9 -0)

doc/chapters/advanced-examples.texi (+161 -20)

@@ -14,10 +14,11 @@
 * Performance model example::
 * Theoretical lower bound on execution time::
 * Insert Task Utility::
+* Parallel Tasks::
+* Debugging::                  When things go wrong.
 * The multiformat interface::
 * On-GPU rendering::
 * More examples::               More examples shipped with StarPU
-* Debugging::                   When things go wrong.
 @end menu

 @node Using multiple implementations of a codelet
@@ -555,6 +556,146 @@ be executed, and is allowed to read from @code{i} to use it e.g. as an
 index. Note that this macro is only available when compiling StarPU with
 the compiler @code{gcc}.
 
+@node Parallel Tasks
+@section Parallel Tasks
+
+StarPU can leverage existing parallel computation libraries by means of
+parallel tasks. A parallel task is a task on which a set of CPUs (called a
+parallel or combined worker) works at the same time, using an existing
+parallel CPU implementation of the computation to be achieved. This can also
+be useful to improve the load balance between slow CPUs and fast GPUs: since
+CPUs work collectively on a single task, the completion time of tasks on CPUs
+becomes comparable to the completion time on GPUs, thus alleviating concerns
+about granularity discrepancy.
+
+Two modes of execution exist, to accommodate existing usages.
+
+@subsection Fork-mode parallel tasks
+
+In the Fork mode, StarPU will call the codelet function on one
+of the CPUs of the combined worker. The codelet function can use
+@code{starpu_combined_worker_get_size()} to get the number of threads it is
+allowed to start to achieve the computation. The CPU binding mask is already
+enforced, so that threads created by the function will inherit the mask and
+thus execute where StarPU expects them to. For instance, using OpenMP (full
+source is available in @code{examples/openmp/vector_scal.c}):
+
+@example
+void scal_cpu_func(void *buffers[], void *_args)
+@{
+    unsigned i;
+    float *factor = _args;
+    struct starpu_vector_interface *vector = buffers[0];
+    unsigned n = STARPU_VECTOR_GET_NX(vector);
+    float *val = (float *)STARPU_VECTOR_GET_PTR(vector);
+
+    /* Spread the loop over the CPUs of the combined worker */
+#pragma omp parallel for num_threads(starpu_combined_worker_get_size())
+    for (i = 0; i < n; i++)
+        val[i] *= *factor;
+@}
+
+static struct starpu_perfmodel vector_scal_model =
+@{
+    .type = STARPU_HISTORY_BASED,
+    .symbol = "vector_scale_parallel"
+@};
+
+static struct starpu_codelet cl =
+@{
+    .modes = @{ STARPU_RW @},
+    .where = STARPU_CPU,
+    /* Fork mode: the function is called once and forks its own threads */
+    .type = STARPU_FORKJOIN,
+    .max_parallelism = INT_MAX,
+    .cpu_funcs = @{scal_cpu_func, NULL@},
+    .nbuffers = 1,
+@};
+@end example
+
+Other examples include calling a parallel BLAS CPU implementation (see
+@code{examples/mult/xgemm.c}).
+
+@subsection SPMD-mode parallel tasks
+
+In the SPMD mode, StarPU will call the codelet function on
+each CPU of the combined worker. The codelet function can use
+@code{starpu_combined_worker_get_size()} to get the total number of CPUs
+involved in the combined worker (and thus the number of calls made to the
+function in parallel), and @code{starpu_combined_worker_get_rank()} to get
+the rank of the current CPU within the combined worker. For instance:
+
+@example
+static void func(void *buffers[], void *_args)
+@{
+    unsigned i;
+    float *factor = _args;
+    struct starpu_vector_interface *vector = buffers[0];
+    unsigned n = STARPU_VECTOR_GET_NX(vector);
+    float *val = (float *)STARPU_VECTOR_GET_PTR(vector);
+
+    /* Compute the slice that this CPU has to work on */
+    unsigned m = starpu_combined_worker_get_size();
+    unsigned j = starpu_combined_worker_get_rank();
+    unsigned slice = (n+m-1)/m;
+
+    for (i = j * slice; i < (j+1) * slice && i < n; i++)
+        val[i] *= *factor;
+@}
+
+static struct starpu_codelet cl =
+@{
+    .modes = @{ STARPU_RW @},
+    .where = STARPU_CPU,
+    /* SPMD mode: the function is called once on each CPU of the worker */
+    .type = STARPU_SPMD,
+    .max_parallelism = INT_MAX,
+    .cpu_funcs = @{ func, NULL @},
+    .nbuffers = 1,
+@};
+@end example
+
+Of course, this trivial example will not really benefit from parallel task
+execution; it is only meant to be simple to understand. The benefit comes
+when the computation is such that threads have to e.g. exchange intermediate
+results, or write to the same buffer in a complex but safe concurrent way.
+
+@subsection Parallel tasks performance
+
+To benefit from parallel tasks, a parallel-task-aware StarPU scheduler has to
+be used. When exposed to codelets with a Fork or SPMD flag, the @code{pheft}
+(parallel-heft) and @code{pgreedy} (parallel greedy) schedulers will indeed also
+try to execute tasks with several CPUs. It will automatically try the various
+available combined worker sizes and thus be able to avoid choosing a large
+combined worker if the codelet does not actually scale so much.
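+
+One way to select such a scheduler is to set the @code{STARPU_SCHED}
+environment variable when running the application (@code{./my_program} below
+standing for any StarPU program):
+
+@example
+$ STARPU_SCHED=pheft ./my_program
+@end example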
+
+@subsection Combined worker sizes
+
+By default, StarPU creates combined workers according to the architecture
+structure as detected by HwLoc. It means that for each object of the Hwloc
+topology (NUMA node, socket, cache, ...) a combined worker will be created. If
+some nodes of the hierarchy have a big arity (e.g. many cores in a socket
+without a hierarchy of shared caches), StarPU will create combined workers of
+intermediate sizes.
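+
+Depending on the StarPU version, the @code{STARPU_MIN_WORKERSIZE} and
+@code{STARPU_MAX_WORKERSIZE} environment variables can be used to bound the
+combined worker sizes that StarPU creates, for instance:
+
+@example
+$ STARPU_MIN_WORKERSIZE=2 STARPU_MAX_WORKERSIZE=8 ./my_program
+@end example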
+
+@subsection Concurrent parallel tasks
+
+Unfortunately, many environments and librairies do not support concurrent
+calls.
+
+For instance, most OpenMP implementations (including the main ones) do not
+support concurrent @code{pragma omp parallel} statements without nesting them in
+another @code{pragma omp parallel} statement, but StarPU does not yet support
+creating its CPU workers by using such pragma.
+
+Other parallel libraries are likewise not safe to invoke concurrently from
+different threads, for instance due to their use of global variables in their
+sequential sections.
+
+The solution is then to use only one combined worker at a time, spanning all
+the CPUs. This can be done by setting the @code{single_combined_worker} field
+to 1 in the @code{starpu_conf} structure, or by setting the
+@code{STARPU_SINGLE_COMBINED_WORKER} environment variable to 1. StarPU will
+then run parallel tasks only over all the CPUs at the same time.
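+
+A minimal sketch of the @code{starpu_conf} route:
+
+@example
+struct starpu_conf conf;
+starpu_conf_init(&conf);
+conf.single_combined_worker = 1;
+starpu_init(&conf);
+@end example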
+
 @node Debugging
 @section Debugging
 
@@ -631,23 +772,23 @@ kind of kernel.
 static void
 multiformat_scal_cpu_func(void *buffers[], void *args)
 @{
-	struct point *aos;
-	unsigned int n;
+    struct point *aos;
+    unsigned int n;

-	aos = STARPU_MULTIFORMAT_GET_PTR(buffers[0]);
-	n = STARPU_MULTIFORMAT_GET_NX(buffers[0]);
-	...
+    aos = STARPU_MULTIFORMAT_GET_PTR(buffers[0]);
+    n = STARPU_MULTIFORMAT_GET_NX(buffers[0]);
+    ...
 @}

 extern "C" void multiformat_scal_cuda_func(void *buffers[], void *_args)
 @{
-	unsigned int n;
-	struct struct_of_arrays *soa;
+    unsigned int n;
+    struct struct_of_arrays *soa;

-	soa = (struct struct_of_arrays *) STARPU_MULTIFORMAT_GET_CUDA_PTR(buffers[0]);
-	n = STARPU_MULTIFORMAT_GET_NX(buffers[0]);
+    soa = (struct struct_of_arrays *) STARPU_MULTIFORMAT_GET_CUDA_PTR(buffers[0]);
+    n = STARPU_MULTIFORMAT_GET_NX(buffers[0]);

-	...
+    ...
 @}
 @end smallexample
 @end cartouche
@@ -667,8 +808,8 @@ be given the CUDA pointer at registration, for instance:
 @cartouche
 @smallexample
 for (workerid = 0; workerid < starpu_worker_get_count(); workerid++)
-	if (starpu_worker_get_type(workerid) == STARPU_CUDA_WORKER)
-		break;
+        if (starpu_worker_get_type(workerid) == STARPU_CUDA_WORKER)
+                break;

 cudaSetDevice(starpu_worker_get_devid(workerid));
 cudaGraphicsResourceGetMappedPointer((void**)&output, &num_bytes, resource);
@@ -694,7 +835,7 @@ directory. Simple examples include:

 @table @asis
 @item @code{incrementer/}:
-	Trivial incrementation test.
+    Trivial incrementation test.
 @item @code{basic_examples/}:
         Simple documented Hello world (as shown in @ref{Hello World}), vector/scalar product (as shown
         in @ref{Vector Scaling on an Hybrid CPU/GPU Machine}), matrix
@@ -702,20 +843,20 @@ directory. Simple examples include:
         interface, an example using the variable data interface, and an example
         using different formats on CPUs and GPUs.
 @item @code{matvecmult/}:
-	OpenCL example from NVidia, adapted to StarPU.
+    OpenCL example from NVidia, adapted to StarPU.
 @item @code{axpy/}:
-	AXPY CUBLAS operation adapted to StarPU.
+    AXPY CUBLAS operation adapted to StarPU.
 @item @code{fortran/}:
-	Example of Fortran bindings.
+    Example of Fortran bindings.
 @end table

 More advanced examples include:

 @table @asis
 @item @code{filters/}:
-	Examples using filters, as shown in @ref{Partitioning Data}.
+    Examples using filters, as shown in @ref{Partitioning Data}.
 @item @code{lu/}:
-	LU matrix factorization, see for instance @code{xlu_implicit.c}.
+    LU matrix factorization, see for instance @code{xlu_implicit.c}.
 @item @code{cholesky/}:
-	Cholesky matrix factorization, see for instance @code{cholesky_implicit.c}.
+    Cholesky matrix factorization, see for instance @code{cholesky_implicit.c}.
 @end table

doc/chapters/basic-api.texi (+9 -0)

@@ -1169,6 +1169,15 @@ instance).
 @item @code{name} (optional)
 Codelets are allowed to have a name, which can be useful for debugging purposes.
 
+@item @code{type} (optional)
+The default is @code{STARPU_SEQ}, i.e. the usual sequential implementation.
+Other values (@code{STARPU_SPMD} or @code{STARPU_FORKJOIN}) declare that a
+parallel implementation is also available. See @ref{Parallel Tasks} for
+details.
+
+@item @code{max_parallelism} (optional)
+If a parallel implementation is available, this denotes the maximum combined
+worker size that StarPU will use to execute parallel tasks for this codelet.
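+
+For instance, a minimal sketch of a codelet declaring an SPMD implementation
+that scales up to 8 CPUs (the field values are illustrative):
+
+@example
+static struct starpu_codelet cl =
+@{
+    .cpu_funcs = @{ func, NULL @},
+    .nbuffers = 1,
+    .type = STARPU_SPMD,
+    .max_parallelism = 8,
+@};
+@end example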
+
 @end table
 @end deftp