Document parallel tasks

Samuel Thibault, 13 years ago
commit 15a2f1d26f
2 changed files with 170 additions and 20 deletions:
  1. doc/chapters/advanced-examples.texi (+161 -20)
  2. doc/chapters/basic-api.texi (+9 -0)

doc/chapters/advanced-examples.texi (+161 -20)

@@ -14,10 +14,11 @@
 * Performance model example::
 * Theoretical lower bound on execution time::
 * Insert Task Utility::
+* Parallel Tasks::
+* Debugging::                  When things go wrong.
 * The multiformat interface::
 * On-GPU rendering::
 * More examples::               More examples shipped with StarPU
-* Debugging::                   When things go wrong.
 @end menu

 @node Using multiple implementations of a codelet
@@ -555,6 +556,146 @@ be executed, and is allowed to read from @code{i} to use it e.g. as an
 index. Note that this macro is only available when compiling StarPU with
 the compiler @code{gcc}.
 
+@node Parallel Tasks
+@section Parallel Tasks
+
+StarPU can leverage existing parallel computation libraries by means of
+parallel tasks. A parallel task is a task on which a set of CPUs (called a
+parallel or combined worker) works at the same time, using an existing
+parallel CPU implementation of the computation to be achieved. This can also
+be useful to improve the load balance between slow CPUs and fast GPUs: since
+CPUs work collectively on a single task, the completion time of tasks on CPUs
+becomes comparable to the completion time on GPUs, thus alleviating concerns
+about granularity discrepancy.
+
+Two modes of execution exist, to accommodate existing usages.
+
+@subsection Fork-mode parallel tasks
+
+In the Fork mode, StarPU will call the codelet function on one
+of the CPUs of the combined worker. The codelet function can use
+@code{starpu_combined_worker_get_size()} to get the number of threads it is
+allowed to start to achieve the computation. The CPU binding mask is already
+enforced, so that threads created by the function will inherit the mask and
+thus execute where StarPU expects them to. For instance, using OpenMP (full
+source is available in @code{examples/openmp/vector_scal.c}):
+
+@example
+void scal_cpu_func(void *buffers[], void *_args)
+@{
+    unsigned i;
+    float *factor = _args;
+    struct starpu_vector_interface *vector = buffers[0];
+    unsigned n = STARPU_VECTOR_GET_NX(vector);
+    float *val = (float *)STARPU_VECTOR_GET_PTR(vector);
+
+    /* Spread the loop over the CPUs of the combined worker */
+#pragma omp parallel for num_threads(starpu_combined_worker_get_size())
+    for (i = 0; i < n; i++)
+        val[i] *= *factor;
+@}
+
+static struct starpu_perfmodel vector_scal_model =
+@{
+    .type = STARPU_HISTORY_BASED,
+    .symbol = "vector_scale_parallel"
+@};
+
+static struct starpu_codelet cl =
+@{
+    .modes = @{ STARPU_RW @},
+    .where = STARPU_CPU,
+    /* Fork mode: the function is called once and forks its own threads */
+    .type = STARPU_FORKJOIN,
+    .max_parallelism = INT_MAX,
+    .cpu_funcs = @{scal_cpu_func, NULL@},
+    .nbuffers = 1,
+@};
+@end example
+
+Other examples include calling a parallel BLAS CPU implementation (see
+@code{examples/mult/xgemm.c}).
+
+@subsection SPMD-mode parallel tasks
+
+In the SPMD mode, StarPU will call the codelet function on
+each CPU of the combined worker. The codelet function can use
+@code{starpu_combined_worker_get_size()} to get the total number of CPUs
+involved in the combined worker (and thus the number of calls made to the
+function in parallel), and @code{starpu_combined_worker_get_rank()} to get
+the rank of the current CPU within the combined worker. For instance:
+
+@example
+static void func(void *buffers[], void *_args)
+@{
+    unsigned i;
+    float *factor = _args;
+    struct starpu_vector_interface *vector = buffers[0];
+    unsigned n = STARPU_VECTOR_GET_NX(vector);
+    float *val = (float *)STARPU_VECTOR_GET_PTR(vector);
+
+    /* Compute the slice that this CPU has to work on */
+    unsigned m = starpu_combined_worker_get_size();
+    unsigned j = starpu_combined_worker_get_rank();
+    unsigned slice = (n+m-1)/m;
+
+    for (i = j * slice; i < (j+1) * slice && i < n; i++)
+        val[i] *= *factor;
+@}
+
+static struct starpu_codelet cl =
+@{
+    .modes = @{ STARPU_RW @},
+    .where = STARPU_CPU,
+    /* SPMD mode: the function is called once on each CPU of the worker */
+    .type = STARPU_SPMD,
+    .max_parallelism = INT_MAX,
+    .cpu_funcs = @{ func, NULL @},
+    .nbuffers = 1,
+@};
+@end example
+
+Of course, this trivial example will not really benefit from parallel task
+execution; it is only meant to be simple to understand. The benefit comes
+when the computation is such that threads have to e.g. exchange intermediate
+results, or write to the same buffer in a complex but safe concurrent way.
+
+@subsection Parallel tasks performance
+
+To benefit from parallel tasks, a parallel-task-aware StarPU scheduler has to
+be used. When exposed to codelets with a Fork or SPMD flag, the @code{pheft}
+(parallel-heft) and @code{pgreedy} (parallel greedy) schedulers will indeed also
+try to execute tasks with several CPUs. It will automatically try the various
+available combined worker sizes and thus be able to avoid choosing a large
+combined worker if the codelet does not actually scale so much.
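+
+One way to select such a scheduler is to set the @code{STARPU_SCHED}
+environment variable when running the application (@code{./my_program} below
+standing for any StarPU program):
+
+@example
+$ STARPU_SCHED=pheft ./my_program
+@end example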
+
+@subsection Combined worker sizes
+
+By default, StarPU creates combined workers according to the architecture
+structure as detected by HwLoc. It means that for each object of the Hwloc
+topology (NUMA node, socket, cache, ...) a combined worker will be created. If
+some nodes of the hierarchy have a big arity (e.g. many cores in a socket
+without a hierarchy of shared caches), StarPU will create combined workers of
+intermediate sizes.
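+
+Depending on the StarPU version, the @code{STARPU_MIN_WORKERSIZE} and
+@code{STARPU_MAX_WORKERSIZE} environment variables can be used to bound the
+combined worker sizes that StarPU creates, for instance:
+
+@example
+$ STARPU_MIN_WORKERSIZE=2 STARPU_MAX_WORKERSIZE=8 ./my_program
+@end example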
+
+@subsection Concurrent parallel tasks
+
+Unfortunately, many environments and librairies do not support concurrent
+calls.
+
+For instance, most OpenMP implementations (including the main ones) do not
+support concurrent @code{pragma omp parallel} statements without nesting them in
+another @code{pragma omp parallel} statement, but StarPU does not yet support
+creating its CPU workers by using such pragma.
+
+Other parallel libraries are likewise not safe to invoke concurrently from
+different threads, for instance due to their use of global variables in their
+sequential sections.
+
+The solution is then to use only one combined worker at a time, spanning all
+the CPUs. This can be done by setting the @code{single_combined_worker} field
+to 1 in the @code{starpu_conf} structure, or by setting the
+@code{STARPU_SINGLE_COMBINED_WORKER} environment variable to 1. StarPU will
+then run parallel tasks only over all the CPUs at the same time.
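+
+A minimal sketch of the @code{starpu_conf} route:
+
+@example
+struct starpu_conf conf;
+starpu_conf_init(&conf);
+conf.single_combined_worker = 1;
+starpu_init(&conf);
+@end example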
+
 @node Debugging
 @section Debugging
 
@@ -631,23 +772,23 @@ kind of kernel.
 static void
 multiformat_scal_cpu_func(void *buffers[], void *args)
 @{
-	struct point *aos;
-	unsigned int n;
+    struct point *aos;
+    unsigned int n;

-	aos = STARPU_MULTIFORMAT_GET_PTR(buffers[0]);
-	n = STARPU_MULTIFORMAT_GET_NX(buffers[0]);
-	...
+    aos = STARPU_MULTIFORMAT_GET_PTR(buffers[0]);
+    n = STARPU_MULTIFORMAT_GET_NX(buffers[0]);
+    ...
 @}

 extern "C" void multiformat_scal_cuda_func(void *buffers[], void *_args)
 @{
-	unsigned int n;
-	struct struct_of_arrays *soa;
+    unsigned int n;
+    struct struct_of_arrays *soa;

-	soa = (struct struct_of_arrays *) STARPU_MULTIFORMAT_GET_CUDA_PTR(buffers[0]);
-	n = STARPU_MULTIFORMAT_GET_NX(buffers[0]);
+    soa = (struct struct_of_arrays *) STARPU_MULTIFORMAT_GET_CUDA_PTR(buffers[0]);
+    n = STARPU_MULTIFORMAT_GET_NX(buffers[0]);

-	...
+    ...
 @}
 @end smallexample
 @end cartouche
@@ -667,8 +808,8 @@ be given the CUDA pointer at registration, for instance:
 @cartouche
 @smallexample
 for (workerid = 0; workerid < starpu_worker_get_count(); workerid++)
-	if (starpu_worker_get_type(workerid) == STARPU_CUDA_WORKER)
-		break;
+        if (starpu_worker_get_type(workerid) == STARPU_CUDA_WORKER)
+                break;

 cudaSetDevice(starpu_worker_get_devid(workerid));
 cudaGraphicsResourceGetMappedPointer((void**)&output, &num_bytes, resource);
@@ -694,7 +835,7 @@ directory. Simple examples include:

 @table @asis
 @item @code{incrementer/}:
-	Trivial incrementation test.
+    Trivial incrementation test.
 @item @code{basic_examples/}:
         Simple documented Hello world (as shown in @ref{Hello World}), vector/scalar product (as shown
         in @ref{Vector Scaling on an Hybrid CPU/GPU Machine}), matrix
@@ -702,20 +843,20 @@ directory. Simple examples include:
         interface, an example using the variable data interface, and an example
         using different formats on CPUs and GPUs.
 @item @code{matvecmult/}:
-	OpenCL example from NVidia, adapted to StarPU.
+    OpenCL example from NVidia, adapted to StarPU.
 @item @code{axpy/}:
-	AXPY CUBLAS operation adapted to StarPU.
+    AXPY CUBLAS operation adapted to StarPU.
 @item @code{fortran/}:
-	Example of Fortran bindings.
+    Example of Fortran bindings.
 @end table

 More advanced examples include:

 @table @asis
 @item @code{filters/}:
-	Examples using filters, as shown in @ref{Partitioning Data}.
+    Examples using filters, as shown in @ref{Partitioning Data}.
 @item @code{lu/}:
-	LU matrix factorization, see for instance @code{xlu_implicit.c}.
+    LU matrix factorization, see for instance @code{xlu_implicit.c}.
 @item @code{cholesky/}:
-	Cholesky matrix factorization, see for instance @code{cholesky_implicit.c}.
+    Cholesky matrix factorization, see for instance @code{cholesky_implicit.c}.
 @end table

doc/chapters/basic-api.texi (+9 -0)

@@ -1169,6 +1169,15 @@ instance).
 @item @code{name} (optional)
 Codelets are allowed to have a name, which can be useful for debugging purposes.
 
+@item @code{type} (optional)
+The default is @code{STARPU_SEQ}, i.e. the usual sequential implementation.
+Other values (@code{STARPU_SPMD} or @code{STARPU_FORKJOIN}) declare that a
+parallel implementation is also available. See @ref{Parallel Tasks} for
+details.
+
+@item @code{max_parallelism} (optional)
+If a parallel implementation is available, this denotes the maximum combined
+worker size that StarPU will use to execute parallel tasks for this codelet.
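+
+For instance, a minimal sketch of a codelet declaring an SPMD implementation
+that scales up to 8 CPUs (the field values are illustrative):
+
+@example
+static struct starpu_codelet cl =
+@{
+    .cpu_funcs = @{ func, NULL @},
+    .nbuffers = 1,
+    .type = STARPU_SPMD,
+    .max_parallelism = 8,
+@};
+@end example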
+
 @end table
 @end deftp