
a lot of documentation fixes and additions here and there

Samuel Thibault 14 years ago
parent
commit
d87e58afc7
5 changed files with 245 additions and 130 deletions
  1. 1 0
      configure.ac
  2. 228 120
      doc/starpu.texi
  3. 4 3
      doc/vector_scal_c.texi
  4. 5 3
      doc/vector_scal_cuda.texi
  5. 7 4
      doc/vector_scal_opencl.texi

+ 1 - 0
configure.ac

@@ -147,6 +147,7 @@ AC_ARG_ENABLE(nmaxcpus, [AS_HELP_STRING([--enable-nmaxcpus=<number>],
 			[maximum number of CPUs])],
 			nmaxcpus=$enableval, nmaxcpus=16)
 AC_MSG_RESULT($nmaxcpus)
+# TODO: rename to STARPU_MAXCPUS for coherency with CUDA/OpenCL?
 AC_DEFINE_UNQUOTED(STARPU_NMAXCPUS, [$nmaxcpus], [Maximum number of CPUs supported])
 
 AC_MSG_CHECKING(whether CPUs should be used)

+ 228 - 120
doc/starpu.texi

@@ -35,6 +35,7 @@ This manual documents the usage of StarPU.
 * Installing StarPU::           How to configure, build and install StarPU
 * Using StarPU::                How to run StarPU application
 * Basic Examples::              Basic examples of the use of StarPU
+* Performance options::         Performance options worth knowing
 * Performance feedback::        Performance debugging tools
 * Configuring StarPU::          How to configure StarPU
 * StarPU API::                  The API to use StarPU
@@ -70,7 +71,8 @@ StarPU is a runtime system that offers support for heterogeneous multicore
 architectures, it not only offers a unified view of the computational resources
 (i.e. CPUs and accelerators at the same time), but it also takes care of
 efficiently mapping and executing tasks onto an heterogeneous machine while
-transparently handling low-level issues in a portable fashion.
+transparently handling low-level issues such as data transfers in a portable
+fashion.
 
 @c this leads to a complicated distributed memory design
 @c which is not (easily) manageable by hand
@@ -100,7 +102,7 @@ fashion.
 @node Codelet and Tasks
 @subsection Codelet and Tasks
 
-One of StarPU primary data structure is the @b{codelet}. A codelet describes a
+One of the StarPU primary data structures is the @b{codelet}. A codelet describes a
 computational kernel that can possibly be implemented on multiple architectures
 such as a CPU, a CUDA device or a Cell's SPU.
 
@@ -109,7 +111,7 @@ such as a CPU, a CUDA device or a Cell's SPU.
 Another important data structure is the @b{task}. Executing a StarPU task
 consists in applying a codelet on a data set, on one of the architectures on
 which the codelet is implemented. In addition to the codelet that a task
-implements, it also describes which data are accessed, and how they are
+uses, it also describes which data are accessed, and how they are
 accessed during the computation (read and/or write).
 StarPU tasks are asynchronous: submitting a task to StarPU is a non-blocking
 operation. The task structure can also specify a @b{callback} function that is
@@ -117,9 +119,12 @@ called once StarPU has properly executed the task. It also contains optional
 fields that the application may use to give hints to the scheduler (such as
 priority levels).
 
-A task may be identified by a unique 64-bit number which we refer as a @b{tag}.
-Task dependencies can be enforced either by the means of callback functions, or
-by expressing dependencies between tags.
+A task may be identified by a unique 64-bit number chosen by the application
+which we refer to as a @b{tag}.
+Task dependencies can be enforced either by means of callback functions, by
+expressing dependencies between explicit tasks, or by expressing dependencies
+between tags (which can thus correspond to tasks that have not been submitted
+yet).
 
 @c TODO insert illustration f(Ar, Brw, Cr) + ..
 
@@ -174,6 +179,14 @@ can be used to install StarPU.
 @node Getting Sources
 @subsection Getting Sources
 
+The simplest way to get StarPU sources is to download the latest official
+release tarball from @indicateurl{https://gforge.inria.fr/frs/?group_id=1570} ,
+or the latest nightly snapshot from
+@indicateurl{http://starpu.gforge.inria.fr/testing/} . The following documents
+how to get the very latest version from the Subversion repository itself; this
+should be needed only if you want the very latest changes (i.e. less than a
+day old!).
+
 The source code is managed by a Subversion server hosted by the
 InriaGforge. To get the source code, you need:
 
@@ -181,16 +194,18 @@ InriaGforge. To get the source code, you need:
 @item
 To install the client side of the software Subversion if it is
 not already available on your system. The software can be obtained from
-@indicateurl{http://subversion.tigris.org}.
+@indicateurl{http://subversion.tigris.org} . If you are running
+on Windows, you will probably prefer to use TortoiseSVN from
+@indicateurl{http://tortoisesvn.tigris.org/} .
 
 @item
 You can check out the project's SVN repository through anonymous
 access. This will provide you with a read access to the
 repository.
 
-You can also choose to become a member of the project @code{starpu}.
-For this, you first need to get an account to the gForge server. You
-can then send a request to join the project
+If you need write access to the StarPU project, you can also choose to
+become a member of the project @code{starpu}. For this, you first need to get
+an account on the gForge server. You can then send a request to join the project
 (@indicateurl{https://gforge.inria.fr/project/request.php?group_id=1570}).
 
 @item
@@ -225,7 +240,7 @@ using your gForge account
 These steps require to run autoconf and automake to generate the
 @code{./configure} script. This can be done by calling
 @code{./autogen.sh}. The required version for autoconf is 2.60 or
-higher.
+higher. You will also need makeinfo.
 
 @example
 % ./autogen.sh
@@ -377,9 +392,9 @@ installed. This step is done only once per user and per machine.
 @section Using accelerators
 
 When both CUDA and OpenCL drivers are enabled, StarPU will launch an
-OpenCL worker only if CUDA is not already running on the GPU.
+OpenCL worker for NVIDIA GPUs only if CUDA is not already running on them.
 This design choice was necessary as OpenCL and CUDA can not run at the
-same time on the same GPU, as there is currently no interoperability
+same time on the same NVIDIA GPU, as there is currently no interoperability
 between them.
 
 Details on how to specify devices running OpenCL and the ones running
@@ -487,7 +502,7 @@ manipulated by the codelet: here the codelet does not access or modify any data
 that is controlled by our data management library. Note that the argument
 passed to the codelet (the @code{cl_arg} field of the @code{starpu_task}
 structure) does not count as a buffer since it is not managed by our data
-management library.
+management library, but just contains trivial parameters.
 
 @c TODO need a crossref to the proper description of "where" see bla for more ...
 We create a codelet which may only be executed on the CPUs. The @code{where}
@@ -501,7 +516,8 @@ which @emph{must} have the following prototype:
 
 In this example, we can ignore the first argument of this function which gives a
 description of the input and output buffers (e.g. the size and the location of
-the matrices). The second argument is a pointer to a buffer passed as an
+the matrices) since there is none.
+The second argument is a pointer to a buffer passed as an
 argument to the codelet by the means of the @code{cl_arg} field of the
 @code{starpu_task} structure.
 
@@ -510,7 +526,8 @@ Be aware that this may be a pointer to a
 @emph{copy} of the actual buffer, and not the pointer given by the programmer:
 if the codelet modifies this buffer, there is no guarantee that the initial
 buffer will be modified as well: this for instance implies that the buffer
-cannot be used as a synchronization medium.
+cannot be used as a synchronization medium. If synchronization is needed, data
+has to be registered to StarPU, see @ref{Scaling a Vector}.
 
 @node Submitting a Task
 @subsection Submitting a Task
@@ -573,9 +590,10 @@ The optional @code{cl_arg} field is a pointer to a buffer (of size
 @code{cl_arg_size}) with some parameters for the kernel
 described by the codelet. For instance, if a codelet implements a computational
 kernel that multiplies its input vector by a constant, the constant could be
-specified by the means of this buffer, instead of registering it.
+specified by the means of this buffer, instead of registering it as a StarPU
+data.
 
-Once a task has been executed, an optional callback function can be called.
+Once a task has been executed, an optional callback function is called.
 While the computational kernel could be offloaded on various architectures, the
 callback function is always executed on a CPU. The @code{callback_arg}
 pointer is passed as an argument of the callback. The prototype of a callback
@@ -583,10 +601,11 @@ function must be:
 
 @code{void (*callback_function)(void *);}
 
-If the @code{synchronous} field is non-null, task submission will be
+If the @code{synchronous} field is non-zero, task submission will be
 synchronous: the @code{starpu_task_submit} function will not return until the
 task was executed. Note that the @code{starpu_shutdown} method does not
-guarantee that asynchronous tasks have been executed before it returns,
+guarantee that asynchronous tasks have been executed before it returns;
+@code{starpu_task_wait_for_all} can be used to that effect.
 
 @node Execution of Hello World
 @subsection Execution of Hello World
@@ -595,7 +614,7 @@ guarantee that asynchronous tasks have been executed before it returns.
 % make hello_world
 cc $(pkg-config --cflags libstarpu)  $(pkg-config --libs libstarpu) hello_world.c -o hello_world
 % ./hello_world
-Hello world (array = @{1.000000, -1.000000@} )
+Hello world (params = @{1, 2.000000@} )
 Callback function (arg 42)
 @end smallexample
 
@@ -625,7 +644,7 @@ Before submitting those tasks, the programmer first needs to declare the
 different pieces of data to StarPU using the @code{starpu_*_data_register}
 functions. To ease the development of applications for StarPU, it is possible
 to describe multiple types of data layout. A type of data layout is called an
-@b{interface}. By default, there are different interfaces available in StarPU:
+@b{interface}. There are different predefined interfaces available in StarPU:
 here we will consider the @b{vector interface}.
 
 The following lines show how to declare an array of @code{NX} elements of type
@@ -644,10 +663,10 @@ starpu_vector_data_register(&vector_handle, 0, (uintptr_t)vector, NX,
 The first argument, called the @b{data handle}, is an opaque pointer which
 designates the array in StarPU. This is also the structure which is used to
 describe which data is used by a task. The second argument is the node number
-where the data currently resides. Here it is 0 since the @code{vector} array is in
-the main memory. Then comes the pointer @code{vector} where the data can be found,
+where the data originally resides. Here it is 0 since the @code{vector} array is in
+the main memory. Then comes the pointer @code{vector} where the data can be found in main memory,
 the number of elements in the vector and the size of each element.
-It is possible to construct a StarPU task that will manipulate the
+The following shows how to construct a StarPU task that will manipulate the
 vector and a constant factor.
 
 @cartouche
@@ -666,7 +685,8 @@ starpu_task_submit(task);
 @end smallexample
 @end cartouche
 
-Since the factor is a mere float value parameter, it does not need a preliminary registration, and
+Since the factor is a mere constant float value parameter,
+it does not need a preliminary registration, and
 can just be passed through the @code{cl_arg} pointer like in the previous
 example.  The vector parameter is described by its handle.
 There are two fields in each element of the @code{buffers} array.
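For context, the computation these tasks ultimately perform is just an in-place scale of the vector. A self-contained plain-C sketch (the function name vector_scal_cpu is ours, for illustration; the actual codelet kernels are shown later in the manual):

```c
#include <assert.h>

/* Illustrative plain-C equivalent of the vector-scaling kernel discussed
 * here: multiply each of the n elements of val by factor, in place. */
static void vector_scal_cpu(float *val, unsigned n, float factor)
{
    unsigned i;
    for (i = 0; i < n; i++)
        val[i] *= factor;
}
```

In the StarPU versions, val and n come from the vector-interface accessors and factor arrives through the cl_arg pointer.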
@@ -703,7 +723,7 @@ starpu_codelet cl = @{
 The first argument is an array that gives
 a description of all the buffers passed in the @code{task->buffers}@ array. The
 size of this array is given by the @code{nbuffers} field of the codelet
-structure. For the sake of generality, this array contains pointers to the
+structure. For the sake of genericity, this array contains pointers to the
 different interfaces describing each buffer.  In the case of the @b{vector
 interface}, the location of the vector (resp. its length) is accessible in the
 @code{ptr} (resp. @code{nx}) of this array. Since the vector is accessed in a
@@ -752,7 +772,7 @@ static __global__ void vector_mult_cuda(float *val, unsigned n,
                                         float factor)
 @{
-    unsigned i;
-    for(i = 0 ; i < n ; i++)
+    unsigned i = blockIdx.x*blockDim.x + threadIdx.x;
+    if (i < n)
         val[i] *= factor;
 @}
 
@@ -764,8 +784,10 @@ extern "C" void scal_cuda_func(void *buffers[], void *_args)
     unsigned n = STARPU_VECTOR_GET_NX(buffers[0]);
     /* local copy of the vector pointer */
     float *val = (float *)STARPU_VECTOR_GET_PTR(buffers[0]);
+    unsigned threads_per_block = 64;
+    unsigned nblocks = (n + threads_per_block-1) / threads_per_block;
 
-@i{    vector_mult_cuda<<<1,1>>>(val, n, *factor);}
+@i{    vector_mult_cuda<<<nblocks,threads_per_block>>>(val, n, *factor);}
 
 @i{    cudaThreadSynchronize();}
 @}
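The added launch configuration replaces the single-thread <<<1,1>>> launch with a grid sized by ceiling division, so that nblocks * threads_per_block always covers the n elements. That arithmetic in isolation (helper name ours, for illustration):

```c
#include <assert.h>

/* Smallest number of blocks of threads_per_block threads that covers
 * n elements: the ceiling division used in the hunk above. */
static unsigned nblocks_for(unsigned n, unsigned threads_per_block)
{
    return (n + threads_per_block - 1) / threads_per_block;
}
```

With 64 threads per block, 64 elements need one block and 65 elements need two; the per-thread `if (i < n)` guard in the kernel discards the excess threads of the last block.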
@@ -882,8 +904,8 @@ int main(int argc, char **argv)
     starpu_init(NULL);                            /* @b{Initialising StarPU} */
 
 #ifdef STARPU_USE_OPENCL
-    starpu_opencl_load_opencl_from_file("examples/basic_examples/vector_scal_opencl_codelet.cl",
-                                        &programs);
+    starpu_opencl_load_opencl_from_file(
+            "examples/basic_examples/vector_scal_opencl_codelet.cl", &programs);
 #endif
 
     vector = malloc(NX*sizeof(vector[0]));
@@ -917,6 +939,7 @@ int main(int argc, char **argv)
             return 1;
     @}
 
+@c TODO: Mmm, should rather be an unregistration with an implicit dependency, no?
     /* @b{Waiting for its termination} */
     starpu_task_wait_for_all();
 
@@ -985,7 +1008,8 @@ or for example, by disabling CPU devices:
 0.000000 3.000000 6.000000 9.000000 12.000000
 @end smallexample
 
-or by disabling CUDA devices:
+or by disabling CUDA devices (which may allow OpenCL to be used instead,
+see @ref{Using accelerators}):
 
 @smallexample
 % STARPU_NCUDA=0 ./vector_scal
@@ -1007,6 +1031,7 @@ task->synchronous = 1;
  * query the profiling info before the task is destroyed. */
 task->destroy = 0;
 
+/* Submit and wait for completion (since synchronous was set to 1) */
 starpu_task_submit(task);
 
 /* The task is finished, get profiling information */
@@ -1081,6 +1106,7 @@ starpu_data_partition(handle, &f);
 @smallexample
 /* Submit a task on each sub-vector */
 for (i=0; i<starpu_data_get_nb_children(handle); i++) @{
+    /* Get subdata number i (there is only 1 dimension) */
     starpu_data_handle sub_handle = starpu_data_get_sub_data(handle, 1, i);
     struct starpu_task *task = starpu_task_create();
 
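When the vector is partitioned into nparts children as above, each sub-vector covers a near-equal share of the elements. A sketch of one plausible sizing rule (our illustration only; StarPU's actual block filter may round differently):

```c
#include <assert.h>

/* Illustrative size of child i when splitting nx elements into nparts
 * near-equal chunks: the first nx % nparts children get one extra
 * element, so the sizes differ by at most one. Not StarPU filter code. */
static unsigned child_nx(unsigned nx, unsigned nparts, unsigned i)
{
    return nx / nparts + (i < nx % nparts ? 1u : 0u);
}
```

For example, 10 elements over 3 parts gives sub-vectors of 4, 3 and 3 elements, which sum back to the original length.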
@@ -1110,7 +1136,7 @@ performance model. There are several kinds of performance models.
 @item
 Providing an estimation from the application itself (@code{STARPU_COMMON} model type and @code{cost_model} field),
 see for instance
-@code{examples/common/blas_model.c} and @code{examples/common/blas_model.h}. It can also be provided for each architecture (@code{STARPU_PER_ARCH} model type and @code{per_arch} field)
+@code{examples/common/blas_model.h} and @code{examples/common/blas_model.c}. It can also be provided for each architecture (@code{STARPU_PER_ARCH} model type and @code{per_arch} field)
 @item
 Measured at runtime (STARPU_HISTORY_BASED model type). This assumes that for a
 given set of data input/output sizes, the performance will always be about the
@@ -1184,8 +1210,8 @@ the priorities as the StarPU scheduler would, i.e. schedule prioritized
 tasks before less prioritized tasks, to check to which extend this results
 to a less optimal solution. This increases even more computation time.
 
-Note that all this however doesn't take into account data transfer, which is
-assumed to be completely overlapped.
+Note that for simplicity, all this however doesn't take into account data
+transfers, which are assumed to be completely overlapped.
 
 @node More examples
 @section More examples
@@ -1222,6 +1248,18 @@ More advanced examples include:
 @c Performance feedback
 @c ---------------------------------------------------------------------
 
+@node Performance options
+@chapter Performance options worth knowing
+
+TODO: explain why execution should be tried with
+@code{STARPU_PREFETCH=1 STARPU_SCHED=dmda}, when to use
+@code{STARPU_CALIBRATE=2} to force re-calibration, and how to play with
+@code{STARPU_BETA=2} or more.
+
+@c ---------------------------------------------------------------------
+@c Performance feedback
+@c ---------------------------------------------------------------------
+
 @node Performance feedback
 @chapter Performance feedback
 
@@ -1381,9 +1419,9 @@ generate a trace in the Paje format by calling:
 @end example
 
 This will create a @code{paje.trace} file in the current directory that can be
-inspected with the Vite trace visualizing open-source tool. More information
-about Vite is available at @indicateurl{http://vite.gforge.inria.fr/}. It is
-possible to open the @code{paje.trace} file with Vite by using the following
+inspected with the ViTE trace visualizing open-source tool. More information
+about ViTE is available at @indicateurl{http://vite.gforge.inria.fr/}. It is
+possible to open the @code{paje.trace} file with ViTE by using the following
 command:
 @example
 % vite paje.trace
@@ -1497,7 +1535,7 @@ Augment the verbosity of the debugging messages.
 @subsubsection @code{--enable-coverage}
 @table @asis
 @item @emph{Description}:
-Enable flags for the coverage tool.
+Enable flags for the @code{gcov} coverage tool.
 @end table
 
 @node Configuring workers
@@ -1563,14 +1601,18 @@ Specify the directory where CUDA is installed. This directory should notably con
 @table @asis
 @item @emph{Description}:
 Specify the directory where CUDA headers are installed. This directory should
-notably contain @code{cuda.h}.
+notably contain @code{cuda.h}. This defaults to @code{/include} appended to the
+value given to @code{--with-cuda-dir}.
 @end table
 
 @node --with-cuda-lib-dir
 @subsubsection @code{--with-cuda-lib-dir=<path>}
 @table @asis
 @item @emph{Description}:
-Specify the directory where the CUDA library is installed.
+Specify the directory where the CUDA library is installed. This directory should
+notably contain the CUDA shared libraries (e.g. libcuda.so). This defaults to
+@code{/lib} appended to the value given to @code{--with-cuda-dir}.
+
 @end table
 
 @node --enable-maxopencldev
@@ -1601,15 +1643,18 @@ Specify the location of the OpenCL SDK. This directory should notably contain
 @table @asis
 @item @emph{Description}:
 Specify the location of OpenCL headers. This directory should notably contain
-@code{CL/cl.h}.
+@code{CL/cl.h}. This defaults to
+@code{/include} appended to the value given to @code{--with-opencl-dir}.
+
 @end table
 
 @node --with-opencl-lib-dir
 @subsubsection @code{--with-opencl-lib-dir=<path>}
 @table @asis
 @item @emph{Description}:
-Specify the location of the OpenCL library.
-@code{include/CL/cl.h}.
+Specify the location of the OpenCL library. This directory should notably
+contain the OpenCL shared libraries (e.g. libOpenCL.so). This defaults to
+@code{/lib} appended to the value given to @code{--with-opencl-dir}.
 @end table
 
 @node --enable-gordon
@@ -1705,7 +1750,8 @@ library has to be 'atlas' or 'goto'.
 @subsubsection @code{--with-magma=<path>}
 @table @asis
 @item @emph{Description}:
-Specify where magma is installed.
+Specify where magma is installed. This directory should notably contain
+@code{include/magmablas.h}.
 @end table
 
 @node --with-fxt
@@ -1843,9 +1889,9 @@ Specify the number of SPUs that StarPU can use.
 @item @emph{Description}:
 Passing an array of integers (starting from 0) in @code{STARPU_WORKERS_CPUID}
 specifies on which logical CPU the different workers should be
-bound. For instance, if @code{STARPU_WORKERS_CPUID = "1 3 0 2"}, the first
-worker will be bound to logical CPU #1, the second CPU worker will be bound to
-logical CPU #3 and so on.  Note that the logical ordering of the CPUs is either
+bound. For instance, if @code{STARPU_WORKERS_CPUID = "0 1 4 5"}, the first
+worker will be bound to logical CPU #0, the second CPU worker will be bound to
+logical CPU #1 and so on.  Note that the logical ordering of the CPUs is either
 determined by the OS, or provided by the @code{hwloc} library in case it is
 available.
 
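A sketch of how a STARPU_WORKERS_CPUID-style string such as "0 1 4 5" can be decoded into logical CPU numbers (illustrative only; this is not StarPU's actual parser, and the helper name is ours):

```c
#include <assert.h>
#include <stdlib.h>

/* Parse up to max whitespace-separated CPU numbers from s into out;
 * returns how many were found. strtol skips leading whitespace, and
 * parsing stops once no further number can be read. */
static unsigned parse_cpuid(const char *s, unsigned *out, unsigned max)
{
    unsigned n = 0;
    char *end;
    while (n < max) {
        long v = strtol(s, &end, 10);
        if (end == s)
            break;
        out[n++] = (unsigned)v;
        s = end;
    }
    return n;
}
```

For "0 1 4 5" this yields four entries, so worker 0 is bound to logical CPU #0, worker 1 to #1, worker 2 to #4, and worker 3 to #5, matching the example above.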
@@ -2480,7 +2526,7 @@ The application may access the requested data during the execution of this
 callback. The callback function must call @code{starpu_data_release} once the
 application does not need to access the piece of data anymore.
 Note that implicit data dependencies are also enforced by
-@code{starpu_data_acquire} in case they are enabled.
+@code{starpu_data_acquire_cb} in case they are enabled.
 Contrary to @code{starpu_data_acquire}, this function is non-blocking and may
 be called from task callbacks. Upon successful completion, this function
 returns 0.
@@ -2505,9 +2551,9 @@ This function releases the piece of data acquired by the application either by
 * Variable Interface::
 * Vector Interface::
 * Matrix Interface::
+* 3D Matrix Interface::
 * BCSR Interface for Sparse Matrices (Blocked Compressed Sparse Row Representation)::
 * CSR Interface for Sparse Matrices (Compressed Sparse Row Representation)::
-* Block Interface::
 @end menu
 
 @node Variable Interface
@@ -2582,33 +2628,45 @@ starpu_matrix_data_register(&matrix_handle, 0, (uintptr_t)matrix,
 @end cartouche
 @end table
 
-@node BCSR Interface for Sparse Matrices (Blocked Compressed Sparse Row Representation)
-@subsection BCSR Interface for Sparse Matrices (Blocked Compressed Sparse Row Representation)
+@node 3D Matrix Interface
+@subsection 3D Matrix Interface
 
 @table @asis
 @item @emph{Description}:
-This variant of @code{starpu_data_register} uses the BCSR sparse matrix interface.
-TODO
+This variant of @code{starpu_data_register} uses the 3D matrix interface.
+@code{ptr} is the address of the first element of the array in the home node.
+@code{ldy} is the number of elements between rows. @code{ldz} is the number
+of elements between z planes. @code{nx} is the number of elements in a row (this
+can be different from @code{ldy} if there are extra elements for alignment,
+for instance). @code{ny} is the number of rows in a z plane (likewise with
+@code{ldz}). @code{nz} is the number of z planes. @code{elemsize} is the size of
+each element.
 @item @emph{Prototype}:
-@code{void starpu_bcsr_data_register(starpu_data_handle *handle, uint32_t home_node, uint32_t nnz, uint32_t nrow,
-		uintptr_t nzval, uint32_t *colind, uint32_t *rowptr, uint32_t firstentry, uint32_t r, uint32_t c, size_t elemsize);}
+@code{void starpu_block_data_register(starpu_data_handle *handle, uint32_t home_node,
+                        uintptr_t ptr, uint32_t ldy, uint32_t ldz, uint32_t nx,
+                        uint32_t ny, uint32_t nz, size_t elemsize);}
 @item @emph{Example}:
 @cartouche
 @smallexample
+float *block;
+starpu_data_handle block_handle;
+block = (float*)malloc(nx*ny*nz*sizeof(float));
+starpu_block_data_register(&block_handle, 0, (uintptr_t)block,
+                           nx, nx*ny, nx, ny, nz, sizeof(float));
 @end smallexample
 @end cartouche
 @end table
 
-@node CSR Interface for Sparse Matrices (Compressed Sparse Row Representation)
-@subsection CSR Interface for Sparse Matrices (Compressed Sparse Row Representation)
+@node BCSR Interface for Sparse Matrices (Blocked Compressed Sparse Row Representation)
+@subsection BCSR Interface for Sparse Matrices (Blocked Compressed Sparse Row Representation)
 
 @table @asis
 @item @emph{Description}:
-This variant of @code{starpu_data_register} uses the CSR sparse matrix interface.
+This variant of @code{starpu_data_register} uses the BCSR sparse matrix interface.
 TODO
 @item @emph{Prototype}:
-@code{void starpu_csr_data_register(starpu_data_handle *handle, uint32_t home_node, uint32_t nnz, uint32_t nrow,
-		uintptr_t nzval, uint32_t *colind, uint32_t *rowptr, uint32_t firstentry, size_t elemsize);}
+@code{void starpu_bcsr_data_register(starpu_data_handle *handle, uint32_t home_node, uint32_t nnz, uint32_t nrow,
+		uintptr_t nzval, uint32_t *colind, uint32_t *rowptr, uint32_t firstentry, uint32_t r, uint32_t c, size_t elemsize);}
 @item @emph{Example}:
 @cartouche
 @smallexample
@@ -2616,31 +2674,19 @@ TODO
 @end cartouche
 @end table
 
-@node Block Interface
-@subsection Block Interface
+@node CSR Interface for Sparse Matrices (Compressed Sparse Row Representation)
+@subsection CSR Interface for Sparse Matrices (Compressed Sparse Row Representation)
 
 @table @asis
 @item @emph{Description}:
-This variant of @code{starpu_data_register} uses the 3D matrix interface.
-@code{ptr} is the address of the array of first element in the home node.
-@code{ldy} is the number of elements between rows. @code{ldz} is the number
-of rows between z planes. @code{nx} is the number of elements in a row (this
-can be different from @code{ldy} if there are extra elements for alignment
-for instance). @code{ny} is the number of rows in a z plane (likewise with
-@code{ldz}). @code{nz} is the number of z planes. @code{elemsize} is the size of
-each element.
+This variant of @code{starpu_data_register} uses the CSR sparse matrix interface.
+TODO
 @item @emph{Prototype}:
-@code{void starpu_block_data_register(starpu_data_handle *handle, uint32_t home_node,
-                        uintptr_t ptr, uint32_t ldy, uint32_t ldz, uint32_t nx,
-                        uint32_t ny, uint32_t nz, size_t elemsize);}
+@code{void starpu_csr_data_register(starpu_data_handle *handle, uint32_t home_node, uint32_t nnz, uint32_t nrow,
+		uintptr_t nzval, uint32_t *colind, uint32_t *rowptr, uint32_t firstentry, size_t elemsize);}
 @item @emph{Example}:
 @cartouche
 @smallexample
-float *block;
-starpu_data_handle block_handle;
-block = (float*)malloc(nx*ny*nz*sizeof(float));
-starpu_block_data_register(&block_handle, 0, (uintptr_t)block,
-                           nx, nx*ny, nx, ny, nz, sizeof(float));
 @end smallexample
 @end cartouche
 @end table
@@ -2661,24 +2707,28 @@ starpu_block_data_register(&block_handle, 0, (uintptr_t)block,
 @subsection @code{struct starpu_data_filter} -- StarPU filter structure
 @table @asis
 @item @emph{Description}:
-The filter structure describes a data partitioning function.
+The filter structure describes a data partitioning operation, to be given to the
+@code{starpu_data_partition} function; see @ref{starpu_data_partition} for an example.
 @item @emph{Fields}:
 @table @asis
 @item @code{filter_func}:
-TODO
+This function fills the @code{child_interface} structure with interface
+information for the @code{id}-th child of the parent @code{father_interface} (among @code{nparts}).
 @code{void (*filter_func)(void *father_interface, void* child_interface, struct starpu_data_filter *, unsigned id, unsigned nparts);}
 @item @code{get_nchildren}:
-TODO
+This returns the number of children.
 @code{unsigned (*get_nchildren)(struct starpu_data_filter *, starpu_data_handle initial_handle);}
 @item @code{get_child_ops}:
-TODO
+In case the resulting children use a different data interface, this function
+returns which interface is used by child number @code{id}.
 @code{struct starpu_data_interface_ops_t *(*get_child_ops)(struct starpu_data_filter *, unsigned id);}
-@item @code{filter_arg}:
-TODO
 @item @code{nchildren}:
-TODO
+This is the number of parts to partition the data into.
+@item @code{filter_arg}:
+Some filters take an additional parameter, but this is usually unused.
 @item @code{filter_arg_ptr}:
-TODO
+Some filters take an additional array parameter such as the sizes of the parts,
+but this is usually unused.
 @end table
 @end table
 
@@ -2687,9 +2737,22 @@ TODO
 
 @table @asis
 @item @emph{Description}:
-TODO
+This requests partitioning one StarPU data @code{initial_handle} into several
+subdata according to the filter @code{f}.
 @item @emph{Prototype}:
 @code{void starpu_data_partition(starpu_data_handle initial_handle, struct starpu_data_filter *f);}
+@item @emph{Example}:
+@cartouche
+@smallexample
+struct starpu_data_filter f = @{
+    .filter_func = starpu_vertical_block_filter_func,
+    .nchildren = nslicesx,
+    .get_nchildren = NULL,
+    .get_child_ops = NULL
+@};
+starpu_data_partition(A_handle, &f);
+@end smallexample
+@end cartouche
 @end table
 
 @node starpu_data_unpartition
@@ -2697,9 +2760,16 @@ TODO
 
 @table @asis
 @item @emph{Description}:
-TODO
+This unapplies one filter, thus unpartitioning the data. The pieces of data are
+collected back into one big piece in the @code{gathering_node} (usually 0).
 @item @emph{Prototype}:
 @code{void starpu_data_unpartition(starpu_data_handle root_data, uint32_t gathering_node);}
+@item @emph{Example}:
+@cartouche
+@smallexample
+starpu_data_unpartition(A_handle, 0);
+@end smallexample
+@end cartouche
 @end table
 
 @node starpu_data_get_nb_children
@@ -2707,9 +2777,9 @@ TODO
 
 @table @asis
 @item @emph{Description}:
-TODO
+This function returns the number of children.
 @item @emph{Return value}:
-This function returns returns the number of children.
+The number of children.
 @item @emph{Prototype}:
 @code{int starpu_data_get_nb_children(starpu_data_handle handle);}
 @end table
@@ -2721,11 +2791,22 @@ This function returns returns the number of children.
 
 @table @asis
 @item @emph{Description}:
-TODO
+After partitioning a StarPU data by applying a filter,
+@code{starpu_data_get_sub_data} can be used to get handles for each of the data
+portions. @code{root_data} is the parent data that was partitioned. @code{depth}
+is the number of filters to traverse (in case several filters have been applied,
+e.g. to partition in row blocks, and then in column blocks), and the subsequent
+parameters are the indexes.
 @item @emph{Return value}:
-TODO
+A handle to the subdata.
 @item @emph{Prototype}:
 @code{starpu_data_handle starpu_data_get_sub_data(starpu_data_handle root_data, unsigned depth, ... );}
+@item @emph{Example}:
+@cartouche
+@smallexample
+h = starpu_data_get_sub_data(A_handle, 1, taskx);
+@end smallexample
+@end cartouche
 @end table
 
 @node Predefined filter functions
@@ -2738,52 +2819,79 @@ TODO
 * Partitioning Block Data::     
 @end menu
 
-This section gives a list of the predefined partitioning functions.
-Examples on how to use them are shown in @ref{Partitioning Data}.
+This section gives a partial list of the predefined partitioning functions.
+Examples on how to use them are shown in @ref{Partitioning Data}. The complete
+list can be found in @code{starpu_data_filters.h}.
 
 @node Partitioning BCSR Data
 @subsubsection Partitioning BCSR Data
-@itemize
-@item
+
+@table @asis
+@item @emph{Description}:
 TODO
+@item @emph{Prototype}:
 @code{void starpu_canonical_block_filter_bcsr(void *father_interface, void *child_interface, struct starpu_data_filter *f, unsigned id, unsigned nparts);}
-@item
+@end table
+
+@table @asis
+@item @emph{Description}:
 TODO
+@item @emph{Prototype}:
 @code{void starpu_vertical_block_filter_func_csr(void *father_interface, void *child_interface, struct starpu_data_filter *f, unsigned id, unsigned nparts);}
-@end itemize
+@end table
 
 @node Partitioning BLAS interface
 @subsubsection Partitioning BLAS interface
-@itemize
-@item
-TODO
+
+@table @asis
+@item @emph{Description}:
+This partitions a dense matrix into horizontal blocks.
+@item @emph{Prototype}:
 @code{void starpu_block_filter_func(void *father_interface, void *child_interface, struct starpu_data_filter *f, unsigned id, unsigned nparts);}
-@item
-TODO
+@end table
+
+@table @asis
+@item @emph{Description}:
+This partitions a dense matrix into vertical blocks.
+@item @emph{Prototype}:
 @code{void starpu_vertical_block_filter_func(void *father_interface, void *child_interface, struct starpu_data_filter *f, unsigned id, unsigned nparts);}
-@end itemize
+@end table
 
 @node Partitioning Vector Data
 @subsubsection Partitioning Vector Data
-@itemize
-@item
-TODO
+
+@table @asis
+@item @emph{Description}:
+This partitions a vector into blocks of the same size.
+@item @emph{Prototype}:
 @code{void starpu_block_filter_func_vector(void *father_interface, void *child_interface, struct starpu_data_filter *f, unsigned id, unsigned nparts);}
-@item
-TODO
+@end table
+
+@table @asis
+@item @emph{Description}:
+This partitions a vector into blocks of sizes given in @code{filter_arg_ptr}.
+@item @emph{Prototype}:
 @code{void starpu_vector_list_filter_func(void *father_interface, void *child_interface, struct starpu_data_filter *f, unsigned id, unsigned nparts);}
-@item
-TODO
+@end table
+
+@table @asis
+@item @emph{Description}:
+This partitions a vector into two blocks, the first block size being given in @code{filter_arg}.
+@item @emph{Prototype}:
 @code{void starpu_vector_divide_in_2_filter_func(void *father_interface, void *child_interface, struct starpu_data_filter *f, unsigned id, unsigned nparts);}
-@end itemize
+@end table
 
 @node Partitioning Block Data
 @subsubsection Partitioning Block Data
-@itemize
-@item
-TODO
+
+@table @asis
+@item @emph{Description}:
+This partitions a 3D matrix along the X axis.
+@item @emph{Prototype}:
 @code{void starpu_block_filter_func_block(void *father_interface, void *child_interface, struct starpu_data_filter *f, unsigned id, unsigned nparts);}
-@end itemize
+@end table
 
 @node Codelets and Tasks
 @section Codelets and Tasks
@@ -3086,7 +3194,7 @@ because there is no task being executed at the moment.
 @subsection @code{starpu_display_codelet_stats} -- Display statistics
 @table @asis
 @item @emph{Description}:
-TODO
+Output on @code{stderr} some statistics on the codelet @code{cl}.
 @item @emph{Prototype}:
 @code{void starpu_display_codelet_stats(struct starpu_codelet_t *cl);}
 @end table
@@ -3133,7 +3241,7 @@ redundancy in the task dependencies.
 @subsection @code{starpu_tag_t} -- Task logical identifier
 @table @asis
 @item @emph{Description}:
-It is possible to associate a task with a unique ``tag'' and to express
+It is possible to associate a task with a unique ``tag'' chosen by the application, and to express
 dependencies between tasks by the means of those tags. To do so, fill the
 @code{tag_id} field of the @code{starpu_task} structure with a tag number (can
 be arbitrary) and set the @code{use_tag} field to 1.
@@ -3342,7 +3450,7 @@ TODO
 @subsection @code{starpu_force_bus_sampling}
 @table @asis
 @item @emph{Description}:
-TODO
+This forces sampling the bus performance model again.
 @item @emph{Prototype}:
 @code{void starpu_force_bus_sampling(void);}
 @end table

+ 4 - 3
doc/vector_scal_c.texi

@@ -48,8 +48,8 @@ int main(int argc, char **argv)
     starpu_init(NULL);
 
 #ifdef STARPU_USE_OPENCL
-        starpu_opencl_load_opencl_from_file("examples/basic_examples/vector_scal_opencl_kernel.cl",
-                                            &programs);
+        starpu_opencl_load_opencl_from_file(
+               "examples/basic_examples/vector_scal_opencl_kernel.cl", &programs);
 #endif
 
     /* Tell StarPU to associate the "vector" vector with the "vector_handle"
@@ -66,7 +66,8 @@ int main(int argc, char **argv)
      *  - the fifth argument is the size of each element.
      */
     starpu_data_handle vector_handle;
-    starpu_vector_data_register(&vector_handle, 0, (uintptr_t)vector, NX, sizeof(vector[0]));
+    starpu_vector_data_register(&vector_handle, 0, (uintptr_t)vector,
+                                NX, sizeof(vector[0]));
 
     float factor = 3.14;
 

+ 5 - 3
doc/vector_scal_cuda.texi

@@ -4,7 +4,7 @@ static __global__ void vector_mult_cuda(float *val, unsigned n,
                                         float factor)
 @{
-        unsigned i;
-        for(i = 0 ; i < n ; i++)
+        unsigned i = blockIdx.x*blockDim.x + threadIdx.x;
+        if (i < n)
                val[i] *= factor;
 @}
 
@@ -16,8 +16,10 @@ extern "C" void scal_cuda_func(void *buffers[], void *_args)
         unsigned n = STARPU_VECTOR_GET_NX(buffers[0]);
         /* local copy of the vector pointer */
         float *val = (float *)STARPU_VECTOR_GET_PTR(buffers[0]);
+        unsigned threads_per_block = 64;
+        unsigned nblocks = (n + threads_per_block-1) / threads_per_block;
 
-        vector_mult_cuda<<<1,1>>>(val, n, *factor);
+        vector_mult_cuda<<<nblocks,threads_per_block>>>(val, n, *factor);
 
-    cudaThreadSynchronize();
+        cudaThreadSynchronize();
 @}

+ 7 - 4
doc/vector_scal_opencl.texi

@@ -18,7 +18,8 @@ void scal_opencl_func(void *buffers[], void *_args)
     id = starpu_worker_get_id();
     devid = starpu_worker_get_devid(id);
 
-    err = starpu_opencl_load_kernel(&kernel, &queue, &programs, "vector_mult_opencl", devid);
+    err = starpu_opencl_load_kernel(&kernel, &queue, &programs, "vector_mult_opencl",
+                                    devid);
     if (err != CL_SUCCESS) STARPU_OPENCL_REPORT_ERROR(err);
 
     err = clSetKernelArg(kernel, 0, sizeof(cl_mem), &val);
@@ -28,16 +29,18 @@ void scal_opencl_func(void *buffers[], void *_args)
 
     @{
         size_t global=n;
-	size_t local;
+        size_t local;
         size_t s;
         cl_device_id device;
 
         starpu_opencl_get_device(devid, &device);
-        err = clGetKernelWorkGroupInfo (kernel, device, CL_KERNEL_WORK_GROUP_SIZE, sizeof(local), &local, &s);
+        err = clGetKernelWorkGroupInfo (kernel, device, CL_KERNEL_WORK_GROUP_SIZE,
+                                        sizeof(local), &local, &s);
         if (err != CL_SUCCESS) STARPU_OPENCL_REPORT_ERROR(err);
         if (local > global) local=global;
 
-        err = clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global, &local, 0, NULL, NULL);
+        err = clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global, &local, 0,
+                                     NULL, NULL);
         if (err != CL_SUCCESS) STARPU_OPENCL_REPORT_ERROR(err);
     @}