
Document parallel tasks

Samuel Thibault 13 years ago
parent
commit
15a2f1d26f
2 changed files with 170 additions and 20 deletions
  1. doc/chapters/advanced-examples.texi  +161 -20
  2. doc/chapters/basic-api.texi  +9 -0

+ 161 - 20
doc/chapters/advanced-examples.texi

@@ -14,10 +14,11 @@
 * Performance model example::   
 * Theoretical lower bound on execution time::  
 * Insert Task Utility::          
+* Parallel Tasks::
+* Debugging::                   When things go wrong.
 * The multiformat interface::
 * On-GPU rendering::
 * More examples::               More examples shipped with StarPU
-* Debugging::                   When things go wrong.
 @end menu
 
 @node Using multiple implementations of a codelet
@@ -555,6 +556,146 @@ be executed, and is allowed to read from @code{i} to use it e.g. as an
 index. Note that this macro is only available when compiling StarPU with
 the compiler @code{gcc}.
 
+@node Parallel Tasks
+@section Parallel Tasks
+
+StarPU can leverage existing parallel computation libraries by means of
+parallel tasks. A parallel task is a task which is worked on by a set of CPUs
+(called a parallel or combined worker) at the same time, by using an existing
+parallel CPU implementation of the computation to be achieved. This can also be
+useful to improve the load balance between slow CPUs and fast GPUs: since CPUs
+work collectively on a single task, the completion time of tasks on CPUs
+becomes comparable to the completion time on GPUs, thus relieving the
+application from granularity discrepancy concerns.
+
+Two modes of execution exist, to accommodate existing usage.
+
+@subsection Fork-mode parallel tasks
+
+In the Fork mode, StarPU will call the codelet function on one
+of the CPUs of the combined worker. The codelet function can use
+@code{starpu_combined_worker_get_size()} to get the number of threads it is
+allowed to start to achieve the computation. The CPU binding mask is already
+enforced, so that threads created by the function will inherit the mask, and
+thus execute where StarPU expects them to. For instance, using OpenMP (the
+full source is available in @code{examples/openmp/vector_scal.c}):
+
+@example
+void scal_cpu_func(void *buffers[], void *_args)
+{
+    unsigned i;
+    float *factor = _args;
+    struct starpu_vector_interface *vector = buffers[0];
+    unsigned n = STARPU_VECTOR_GET_NX(vector);
+    float *val = (float *)STARPU_VECTOR_GET_PTR(vector);
+
+#pragma omp parallel for num_threads(starpu_combined_worker_get_size())
+    for (i = 0; i < n; i++)
+        val[i] *= *factor;
+}
+
+static struct starpu_perfmodel vector_scal_model =
+{
+    .type = STARPU_HISTORY_BASED,
+    .symbol = "vector_scale_parallel"
+};
+
+static struct starpu_codelet cl =
+{
+    .modes = { STARPU_RW },
+    .where = STARPU_CPU,
+    .type = STARPU_FORKJOIN,
+    .max_parallelism = INT_MAX,
+    .cpu_funcs = {scal_cpu_func, NULL},
+    .nbuffers = 1,
+};
+@end example
+
+Other examples include calling a parallel BLAS CPU implementation (see
+@code{examples/mult/xgemm.c}).
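+
+As a minimal sketch, a fork-mode codelet function would typically just set the
+parallelism degree of the library before calling its parallel kernel (the
+@code{set_blas_num_threads()} helper below is hypothetical; the actual call
+depends on the BLAS implementation at hand):
+
+@example
+void gemm_cpu_func(void *buffers[], void *_args)
+{
+    /* Let the BLAS library use exactly as many threads as there are
+       CPUs in the combined worker this task was assigned to. */
+    set_blas_num_threads(starpu_combined_worker_get_size());
+
+    /* ... call the parallel BLAS kernel on the task's buffers ... */
+}
+@end example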
+
+@subsection SPMD-mode parallel tasks
+
+In the SPMD mode, StarPU will call the codelet function on
+each CPU of the combined worker. The codelet function can use
+@code{starpu_combined_worker_get_size()} to get the total number of CPUs
+involved in the combined worker, and thus the number of calls that are made in
+parallel to the function, and @code{starpu_combined_worker_get_rank()} to get
+the rank of the current CPU within the combined worker. For instance:
+
+@example
+static void func(void *buffers[], void *_args)
+@{
+    unsigned i;
+    float *factor = _args;
+    struct starpu_vector_interface *vector = buffers[0];
+    unsigned n = STARPU_VECTOR_GET_NX(vector);
+    float *val = (float *)STARPU_VECTOR_GET_PTR(vector);
+
+    /* Compute the slice this worker has to work on */
+    unsigned m = starpu_combined_worker_get_size();
+    unsigned j = starpu_combined_worker_get_rank();
+    unsigned slice = (n+m-1)/m;
+
+    for (i = j * slice; i < (j+1) * slice && i < n; i++)
+        val[i] *= *factor;
+@}
+
+static struct starpu_codelet cl =
+@{
+    .modes = { STARPU_RW },
+    .where = STARPU_CPU,
+    .type = STARPU_SPMD,
+    .max_parallelism = INT_MAX,
+    .cpu_funcs = { func, NULL },
+    .nbuffers = 1,
+@};
+@end example
+
+Of course, this trivial example will not really benefit from parallel task
+execution, and was only meant to be simple to understand. The benefit comes
+when the computation to be done is such that threads have to e.g. exchange
+intermediate results, or write to the same buffer in a complex but safe way.
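+
+Submitting such a parallel task does not differ from submitting a sequential
+one. For instance, with the insert-task helper (see @ref{Insert Task Utility}),
+assuming a vector handle and a scaling factor as in the previous examples:
+
+@example
+float factor = 3.14;
+starpu_insert_task(&cl,
+                   STARPU_RW, vector_handle,
+                   STARPU_VALUE, &factor, sizeof(factor),
+                   0);
+@end example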
+
+@subsection Parallel tasks performance
+
+To benefit from parallel tasks, a parallel-task-aware StarPU scheduler has to
+be used. When exposed to codelets with a Fork or SPMD flag, the @code{pheft}
+(parallel-heft) and @code{pgreedy} (parallel greedy) schedulers will indeed also
+try to execute tasks with several CPUs. They will automatically try the various
+available combined worker sizes, and can thus avoid choosing a large combined
+worker if the codelet does not actually scale that well.
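+
+For instance, one can select the parallel-heft scheduler through the usual
+@code{STARPU_SCHED} environment variable:
+
+@example
+$ STARPU_SCHED=pheft ./my_application
+@end example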
+
+@subsection Combined worker sizes
+
+By default, StarPU creates combined workers according to the architecture
+structure as detected by hwloc: for each object of the hwloc topology (NUMA
+node, socket, cache, ...) a combined worker will be created. If some nodes of
+the hierarchy have a large arity (e.g. many cores in a socket without a
+hierarchy of shared caches), StarPU will create combined workers of
+intermediate sizes.
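+
+Assuming the StarPU version at hand supports them, the
+@code{STARPU_MIN_WORKERSIZE} and @code{STARPU_MAX_WORKERSIZE} environment
+variables can be used to bound the sizes of the combined workers which get
+created:
+
+@example
+$ STARPU_MIN_WORKERSIZE=4 STARPU_MAX_WORKERSIZE=8 ./my_application
+@end example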
+
+@subsection Concurrent parallel tasks
+
+Unfortunately, many environments and libraries do not support concurrent
+calls.
+
+For instance, most OpenMP implementations (including the main ones) do not
+support concurrent @code{pragma omp parallel} statements without nesting them in
+another @code{pragma omp parallel} statement, but StarPU does not yet support
+creating its CPU workers through such a pragma.
+
+Other parallel libraries are also not safe when invoked concurrently from
+different threads, for instance because of the use of global variables in
+their sequential sections.
+
+The solution is then to use only a single combined worker, spanning all
+the CPUs. This can be done by setting the @code{single_combined_worker}
+field to 1 in the @code{starpu_conf} structure, or by setting the
+@code{STARPU_SINGLE_COMBINED_WORKER} environment variable to 1. StarPU will
+then run parallel tasks only over all the CPUs at the same time.
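+
+As a minimal sketch, setting this field at initialization time could look as
+follows:
+
+@example
+struct starpu_conf conf;
+starpu_conf_init(&conf);
+conf.single_combined_worker = 1;
+starpu_init(&conf);
+@end example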
+
 @node Debugging
 @section Debugging
 
@@ -631,23 +772,23 @@ kind of kernel.
 static void
 multiformat_scal_cpu_func(void *buffers[], void *args)
 @{
-	struct point *aos;
-	unsigned int n;
+    struct point *aos;
+    unsigned int n;
 
-	aos = STARPU_MULTIFORMAT_GET_PTR(buffers[0]);
-	n = STARPU_MULTIFORMAT_GET_NX(buffers[0]);
-	...
+    aos = STARPU_MULTIFORMAT_GET_PTR(buffers[0]);
+    n = STARPU_MULTIFORMAT_GET_NX(buffers[0]);
+    ...
 @}
 
 extern "C" void multiformat_scal_cuda_func(void *buffers[], void *_args)
 @{
-	unsigned int n;
-	struct struct_of_arrays *soa;
+    unsigned int n;
+    struct struct_of_arrays *soa;
 
-	soa = (struct struct_of_arrays *) STARPU_MULTIFORMAT_GET_CUDA_PTR(buffers[0]);
-	n = STARPU_MULTIFORMAT_GET_NX(buffers[0]);
+    soa = (struct struct_of_arrays *) STARPU_MULTIFORMAT_GET_CUDA_PTR(buffers[0]);
+    n = STARPU_MULTIFORMAT_GET_NX(buffers[0]);
 
-	...
+    ...
 @}
 @end smallexample
 @end cartouche
@@ -667,8 +808,8 @@ be given the CUDA pointer at registration, for instance:
 @cartouche
 @smallexample
 for (workerid = 0; workerid < starpu_worker_get_count(); workerid++)
-	if (starpu_worker_get_type(workerid) == STARPU_CUDA_WORKER)
-		break;
+        if (starpu_worker_get_type(workerid) == STARPU_CUDA_WORKER)
+                break;
 
 cudaSetDevice(starpu_worker_get_devid(workerid));
 cudaGraphicsResourceGetMappedPointer((void**)&output, &num_bytes, resource);
@@ -694,7 +835,7 @@ directory. Simple examples include:
 
 @table @asis
 @item @code{incrementer/}:
-	Trivial incrementation test.
+    Trivial incrementation test.
 @item @code{basic_examples/}:
         Simple documented Hello world (as shown in @ref{Hello World}), vector/scalar product (as shown
         in @ref{Vector Scaling on an Hybrid CPU/GPU Machine}), matrix
@@ -702,20 +843,20 @@ directory. Simple examples include:
         interface, an example using the variable data interface, and an example
         using different formats on CPUs and GPUs.
 @item @code{matvecmult/}:
-	OpenCL example from NVidia, adapted to StarPU.
+    OpenCL example from NVidia, adapted to StarPU.
 @item @code{axpy/}:
-	AXPY CUBLAS operation adapted to StarPU.
+    AXPY CUBLAS operation adapted to StarPU.
 @item @code{fortran/}:
-	Example of Fortran bindings.
+    Example of Fortran bindings.
 @end table
 
 More advanced examples include:
 
 @table @asis
 @item @code{filters/}:
-	Examples using filters, as shown in @ref{Partitioning Data}.
+    Examples using filters, as shown in @ref{Partitioning Data}.
 @item @code{lu/}:
-	LU matrix factorization, see for instance @code{xlu_implicit.c}
+    LU matrix factorization, see for instance @code{xlu_implicit.c}
 @item @code{cholesky/}:
-	Cholesky matrix factorization, see for instance @code{cholesky_implicit.c}.
+    Cholesky matrix factorization, see for instance @code{cholesky_implicit.c}.
 @end table

+ 9 - 0
doc/chapters/basic-api.texi

@@ -1169,6 +1169,15 @@ instance).
 @item @code{name} (optional)
 Codelets are allowed to have a name, which can be useful for debugging purposes.
 
+@item @code{type} (optional)
+The default is @code{STARPU_SEQ}, i.e. the usual sequential implementation.
+Other values (@code{STARPU_SPMD} or @code{STARPU_FORKJOIN}) declare that a
+parallel implementation is also available. See @ref{Parallel Tasks} for
+details.
+
+@item @code{max_parallelism} (optional)
+If a parallel implementation is available, this denotes the maximum combined
+worker size that StarPU will use to execute parallel tasks for this codelet.
+
 @end table
 @end deftp