12 years ago · cb1b052539
--- a/doc/doxygen/Makefile.am
+++ b/doc/doxygen/Makefile.am
@@ -44,6 +44,7 @@ chapters =	\
 
				 	chapters/mic_scc_support.doxy \
			
 
				 	chapters/code/hello_pragma2.c \
			
 
				 	chapters/code/hello_pragma.c \
			
 
				+	chapters/code/scal_pragma.cu \
			
 
				 	chapters/code/matmul_pragma.c \
			
 
				 	chapters/code/matmul_pragma2.c \
			
 
				 	chapters/code/cholesky_pragma.c \
			
--- a/doc/doxygen/chapters/advanced_examples.doxy
+++ b/doc/doxygen/chapters/advanced_examples.doxy
@@ -92,12 +92,12 @@ thus be very fast. The function starpu_cuda_get_device_properties()
 
				 provides a quick access to CUDA properties of CUDA devices to achieve
			
 
				 such efficiency.
			
 
				 
			
 
				-Another example is compiling CUDA code for various compute capabilities,
			
 
				+Another example is to compile CUDA code for various compute capabilities,
			
 
				 resulting with two CUDA functions, e.g. <c>scal_gpu_13</c> for compute capability
			
 
				 1.3, and <c>scal_gpu_20</c> for compute capability 2.0. Both functions can be
			
 
				-provided to StarPU by using <c>cuda_funcs</c>, and <c>can_execute</c> can then be
			
 
				-used to rule out the <c>scal_gpu_20</c> variant on a CUDA device which
			
 
				-will not be able to execute it:
			
 
				+provided to StarPU by using starpu_codelet::cuda_funcs, and
			
 
				+starpu_codelet::can_execute can then be used to rule out the
			
 
				+<c>scal_gpu_20</c> variant on a CUDA device which will not be able to execute it:
			
 
				 
			
 
				 \code{.c}
			
 
				 static int can_execute(unsigned workerid, struct starpu_task *task, unsigned nimpl)
			
@@ -390,9 +390,9 @@ starpu_perfmodel::size_base however permits the application to
 
				 override that, when for instance some of the data do not matter for
			
 
				 task cost (e.g. mere reference table), or when using sparse
			
 
				 structures (in which case it is the number of non-zeros which matter), or when
			
 
				-there is some hidden parameter such as the number of iterations, etc. The
			
 
				-<c>examples/pi</c> examples uses this to include the number of iterations in the
			
 
				-base.
			
 
				+there is some hidden parameter such as the number of iterations, etc.
			
 
				+The example in the directory <c>examples/pi</c> uses this to include
			
 
				+the number of iterations in the base.
			
 
				 
			
 
				 How to use schedulers which can benefit from such performance model is explained
			
 
				 in \ref TaskSchedulingPolicy.
			
@@ -427,11 +427,11 @@ starpu_bound_print_lp() or starpu_bound_print_mps() can then be used
 
				 to output a Linear Programming problem corresponding to the schedule
			
 
				 of your tasks. Run it through <c>lp_solve</c> or any other linear
			
 
				 programming solver, and that will give you a lower bound for the total
			
 
				-execution time of your tasks. If StarPU was compiled with the glpk
			
 
				-library installed, starpu_bound_compute() can be used to solve it
			
 
				+execution time of your tasks. If StarPU was compiled with the library
			
 
				+<c>glpk</c> installed, starpu_bound_compute() can be used to solve it
			
 
				 immediately and get the optimized minimum, in ms. Its parameter
			
 
				 <c>integer</c> allows to decide whether integer resolution should be
			
 
				-computed and returned too.
			
 
				+computed and returned too.
			
 
				 
			
 
				 The <c>deps</c> parameter tells StarPU whether to take tasks, implicit
			
 
				 data, and tag dependencies into account. Tags released in a callback
			
@@ -549,7 +549,7 @@ STARPU_DATA_ACQUIRE_CB(i_handle, STARPU_R,
 
				 The macro ::STARPU_DATA_ACQUIRE_CB submits an asynchronous request for
			
 
				 acquiring data <c>i</c> for the main application, and will execute the code
			
 
				 given as third parameter when it is acquired. In other words, as soon as the
			
 
				-value of <c>i</c> computed by the <c>which_index</c> codelet can be read, the
			
 
				+value of <c>i</c> computed by the codelet <c>which_index</c> can be read, the
			
 
				 portion of code passed as third parameter of ::STARPU_DATA_ACQUIRE_CB will
			
 
				 be executed, and is allowed to read from <c>i</c> to use it e.g. as an
			
 
				 index. Note that this macro is only avaible when compiling StarPU with
			
@@ -609,7 +609,7 @@ struct starpu_codelet accumulate_variable_cl =
 
				 }
			
 
				 \endcode
			
 
				 
			
 
				-and attaches them as reduction methods for its <c>dtq</c> handle:
			
 
				+and attaches them as reduction methods for its handle <c>dtq</c>:
			
 
				 
			
 
				 \code{.c}
			
 
				 starpu_variable_data_register(&dtq_handle, -1, NULL, sizeof(type));
			
@@ -674,7 +674,7 @@ tasks.
 
				 Data can sometimes be entirely produced by a task, and entirely consumed by
			
 
				 another task, without the need for other parts of the application to access
			
 
				 it. In such case, registration can be done without prior allocation, by using
			
 
				-the special -1 memory node number, and passing a zero pointer. StarPU will
			
 
				+the special memory node number <c>-1</c>, and passing a zero pointer. StarPU will
			
 
				 actually allocate memory only when the task creating the content gets scheduled,
			
 
				 and destroy it on unregistration.
			
 
				 
			
@@ -704,9 +704,8 @@ function, and free it at the end, but that would be costly. It could also
 
				 allocate one buffer per worker (similarly to \ref
			
 
				 HowToInitializeAComputationLibraryOnceForEachWorker), but that would
			
 
				 make them systematic and permanent. A more  optimized way is to use
			
 
				-the ::STARPU_SCRATCH data access mode, as examplified below,
			
 
				-
			
 
				-which provides per-worker buffers without content consistency.
			
 
				+the data access mode ::STARPU_SCRATCH, as exemplified below, which
			
 
				+provides per-worker buffers without content consistency.
			
 
				 
			
 
				 \code{.c}
			
 
				 starpu_vector_data_register(&workspace, -1, 0, sizeof(float));
			
@@ -723,7 +722,7 @@ the other on the same worker. Also, if for instance GPU memory becomes scarce,
 
				 StarPU will notice that it can free such buffers easily, since the content does
			
 
				 not matter.
			
 
				 
			
 
				-The <c>examples/pi</c> example uses scratches for some temporary buffer.
			
 
				+The example <c>examples/pi</c> uses scratches for some temporary buffer.
			
 
				 
			
 
				 \section ParallelTasks Parallel Tasks
			
 
				 
			
@@ -734,8 +733,9 @@ parallel CPU implementation of the computation to be achieved. This can also be
 
				 useful to improve the load balance between slow CPUs and fast GPUs: since CPUs
			
 
				 work collectively on a single task, the completion time of tasks on CPUs become
			
 
				 comparable to the completion time on GPUs, thus relieving from granularity
			
 
				-discrepancy concerns. Hwloc support needs to be enabled to get good performance,
			
 
				-otherwise StarPU will not know how to better group cores.
			
 
				+discrepancy concerns. <c>hwloc</c> support needs to be enabled to get
			
 
				+good performance, otherwise StarPU will not know how to better group
			
 
				+cores.
			
 
				 
			
 
				 Two modes of execution exist to accomodate with existing usages.
			
 
				 
			
@@ -808,8 +808,8 @@ buffer.
 
				 
			
 
				 To benefit from parallel tasks, a parallel-task-aware StarPU scheduler has to
			
 
				 be used. When exposed to codelets with a flag ::STARPU_FORKJOIN or
			
 
				-::STARPU_SPMD, the <c>pheft</c> (parallel-heft) and <c>peager</c>
			
 
				-(parallel eager) schedulers will indeed also try to execute tasks with
			
 
				+::STARPU_SPMD, the schedulers <c>pheft</c> (parallel-heft) and <c>peager</c>
			
 
				+(parallel eager) will indeed also try to execute tasks with
			
 
				 several CPUs. It will automatically try the various available combined
			
 
				 worker sizes (making several measurements for each worker size) and
			
 
				 thus be able to avoid choosing a large combined worker if the codelet
			
@@ -846,9 +846,9 @@ from different threads, due to the use of global variables in their sequential
 
				 sections for instance.
			
 
				 
			
 
				 The solution is then to use only one combined worker at a time.  This can be
			
 
				-done by setting the field starpu_conf::single_combined_worker to 1, or
			
 
				+done by setting the field starpu_conf::single_combined_worker to <c>1</c>, or
			
 
				 setting the environment variable \ref STARPU_SINGLE_COMBINED_WORKER
			
 
				-to 1. StarPU will then run only one parallel task at a time (but other
			
 
				+to <c>1</c>. StarPU will then run only one parallel task at a time (but other
			
 
				 CPU and GPU tasks are not affected and can be run concurrently). The parallel
			
 
				 task scheduler will however still however still try varying combined worker
			
 
				 sizes to look for the most efficient ones.
			
@@ -1183,8 +1183,8 @@ directory <c>examples/basic_examples/dynamic_handles.c</c>.
 
				 
			
 
				 \section MoreExamples More Examples
			
 
				 
			
 
				-More examples are available in the StarPU sources in the <c>examples/</c>
			
 
				-directory. Simple examples include:
			
 
				+More examples are available in the StarPU sources in the directory
			
 
				+<c>examples/</c>. Simple examples include:
			
 
				 
			
 
				 <dl>
			
 
				 <dt> <c>incrementer/</c> </dt>
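Note on the <c>can_execute</c> hunk above: the decision that rules out <c>scal_gpu_20</c> on an older device can be sketched in standalone C as follows. This is only an illustration of the logic. The type <c>device_caps</c> and the functions <c>caps_at_least</c> and <c>variant_can_run</c> are hypothetical stand-ins, not StarPU API, and the ordering of implementations (0 for <c>scal_gpu_13</c>, 1 for <c>scal_gpu_20</c>) is an assumption. A real callback would read the device properties through starpu_cuda_get_device_properties().

```c
#include <stdbool.h>

/* Hypothetical stand-in for the CUDA device properties; not a StarPU type. */
struct device_caps
{
	int major; /* compute capability, major number */
	int minor; /* compute capability, minor number */
};

/* True if the device offers at least compute capability req_major.req_minor. */
static bool caps_at_least(const struct device_caps *caps, int req_major, int req_minor)
{
	return caps->major > req_major
		|| (caps->major == req_major && caps->minor >= req_minor);
}

/* Assumed ordering: implementation 0 is scal_gpu_13, implementation 1 is scal_gpu_20. */
static bool variant_can_run(const struct device_caps *caps, unsigned nimpl)
{
	switch (nimpl)
	{
	case 0:
		return caps_at_least(caps, 1, 3);
	case 1:
		return caps_at_least(caps, 2, 0);
	default:
		/* Unknown implementation index: never selectable. */
		return false;
	}
}
```

Comparing the major number first avoids accepting a device such as 1.5 for a 2.0 kernel, which a test on the minor number alone would allow.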
			
--- a/doc/doxygen/chapters/basic_examples.doxy
+++ b/doc/doxygen/chapters/basic_examples.doxy
@@ -12,7 +12,7 @@
 
				 
			
 
				 This section shows how to implement a simple program that submits a task
			
 
				 to StarPU using the StarPU C extension (\ref cExtensions). The complete example, and additional examples,
			
 
				-is available in the <c>gcc-plugin/examples</c> directory of the StarPU
			
 
				+is available in the directory <c>gcc-plugin/examples</c> of the StarPU
			
 
				 distribution. A similar example showing how to directly use the StarPU's API is shown
			
 
				 in \ref HelloWorldUsingStarPUAPI.
			
 
				 
			
@@ -24,7 +24,7 @@ has a single implementation for CPU:
 
				 
			
 
				 \snippet hello_pragma.c To be included
			
 
				 
			
 
				-The code can then be compiled and linked with GCC and the <c>-fplugin</c> flag:
			
 
				+The code can then be compiled and linked with GCC and the flag <c>-fplugin</c>:
			
 
				 
			
 
				 \verbatim
			
 
				 $ gcc `pkg-config starpu-1.2 --cflags` hello-starpu.c \
			
@@ -92,9 +92,9 @@ compiler implicitly do it as examplified above.
 
				 The field starpu_codelet::nbuffers specifies the number of data buffers that are
			
 
				 manipulated by the codelet: here the codelet does not access or modify any data
			
 
				 that is controlled by our data management library. Note that the argument
			
 
				-passed to the codelet (the field starpu_task::cl_arg) does not count
			
 
				-as a buffer since it is not managed by our data management library,
			
 
				-but just contain trivial parameters.
			
 
				+passed to the codelet (the parameter <c>cl_arg</c> of the function
			
 
				+<c>cpu_func</c>) does not count as a buffer since it is not managed by
			
 
				+our data management library, but just contains trivial parameters.
			
 
				 
			
 
				 \internal
			
 
				 TODO need a crossref to the proper description of "where" see bla for more ...
			
@@ -168,7 +168,7 @@ int main(int argc, char **argv)
 
				 \endcode
			
 
				 
			
 
				 Before submitting any tasks to StarPU, starpu_init() must be called. The
			
 
				-<c>NULL</c> argument specifies that we use default configuration. Tasks cannot
			
 
				+<c>NULL</c> argument specifies that we use the default configuration. Tasks cannot
			
 
				 be submitted after the termination of StarPU by a call to
			
 
				 starpu_shutdown().
			
 
				 
			
@@ -194,12 +194,13 @@ computational kernel that multiplies its input vector by a constant,
 
				 the constant could be specified by the means of this buffer, instead
			
 
				 of registering it as a StarPU data. It must however be noted that
			
 
				 StarPU avoids making copy whenever possible and rather passes the
			
 
				-pointer as such, so the buffer which is pointed at must kept allocated
			
 
				+pointer as such, so the buffer which is pointed at must be kept allocated
			
 
				 until the task terminates, and if several tasks are submitted with
			
 
				 various parameters, each of them must be given a pointer to their
			
 
				-buffer.	
			
 
				+own buffer.
			
 
				 
			
 
				-Once a task has been executed, an optional callback function is be called.
			
 
				+Once a task has been executed, an optional callback function
			
 
				+starpu_task::callback_func is called when defined.
			
 
				 While the computational kernel could be offloaded on various architectures, the
			
 
				 callback function is always executed on a CPU. The pointer
			
 
				 starpu_task::callback_arg is passed as an argument of the callback
			
@@ -211,7 +212,7 @@ void (*callback_function)(void *);
 
				 
			
 
				 If the field starpu_task::synchronous is non-zero, task submission
			
 
				 will be synchronous: the function starpu_task_submit() will not return
			
 
				-until the task was executed. Note that the function starpu_shutdown()
			
 
				+until the task has been executed. Note that the function starpu_shutdown()
			
 
				 does not guarantee that asynchronous tasks have been executed before
			
 
				 it returns, starpu_task_wait_for_all() can be used to that effect, or
			
 
				 data can be unregistered (starpu_data_unregister()), which will
			
@@ -237,12 +238,12 @@ we show how StarPU tasks can manipulate data.
 
				 
			
 
				 We will first show how to use the C language extensions provided by
			
 
				 the GCC plug-in (\ref cExtensions). The complete example, and
			
 
				-additional examples, is available in the <c>gcc-plugin/examples</c>
			
 
				-directory of the StarPU distribution. These extensions map directly
			
 
				+additional examples, is available in the directory <c>gcc-plugin/examples</c>
			
 
				+of the StarPU distribution. These extensions map directly
			
 
				 to StarPU's main concepts: tasks, task implementations for CPU,
			
 
				 OpenCL, or CUDA, and registered data buffers. The standard C version
			
 
				-that uses StarPU's standard C programming interface is given in the
			
 
				-next section (\ref VectorScalingUsingStarPUAPI).
			
 
				+that uses StarPU's standard C programming interface is given in \ref
			
 
				+VectorScalingUsingStarPUAPI.
			
 
				 
			
 
				 First of all, the vector-scaling task and its simple CPU implementation
			
 
				 has to be defined:
			
@@ -268,7 +269,7 @@ implemented:
 
				 
			
 
				 \snippet hello_pragma2.c To be included
			
 
				 
			
 
				-The <c>main</c> function above does several things:
			
 
				+The function <c>main</c> above does several things:
			
 
				 
			
 
				 <ul>
			
 
				 <li>
			
@@ -287,22 +288,20 @@ StarPU to transfer that memory region between GPUs and the main memory.
 
				 Removing this <c>pragma</c> is an error.
			
 
				 </li>
			
 
				 <li>
			
 
				-It invokes the <c>vector_scal</c> task.  The invocation looks the same
			
 
				+It invokes the task <c>vector_scal</c>.  The invocation looks the same
			
 
				 as a standard C function call.  However, it is an asynchronous
			
 
				 invocation, meaning that the actual call is performed in parallel with
			
 
				 the caller's continuation.
			
 
				 </li>
			
 
				 <li>
			
 
				-It waits for the termination of the <c>vector_scal</c>
			
 
				-asynchronous call.
			
 
				+It waits for the termination of the asynchronous call <c>vector_scal</c>.
			
 
				 </li>
			
 
				 <li>
			
 
				 Finally, StarPU is shut down.
			
 
				 </li>
			
 
				 </ul>
			
 
				 
			
 
				-The program can be compiled and linked with GCC and the <c>-fplugin</c>
			
 
				-flag:
			
 
				+The program can be compiled and linked with GCC and the flag <c>-fplugin</c>:
			
 
				 
			
 
				 \verbatim
			
 
				 $ gcc `pkg-config starpu-1.2 --cflags` vector_scal.c \
			
@@ -317,7 +316,7 @@ And voilà!
 
				 Now, this is all fine and great, but you certainly want to take
			
 
				 advantage of these newfangled GPUs that your lab just bought, don't you?
			
 
				 
			
 
				-So, let's add an OpenCL implementation of the <c>vector_scal</c> task.
			
 
				+So, let's add an OpenCL implementation of the task <c>vector_scal</c>.
			
 
				 We assume that the OpenCL kernel is available in a file,
			
 
				 <c>vector_scal_opencl_kernel.cl</c>, not shown here.  The OpenCL task
			
 
				 implementation is similar to that used with the standard C API
			
@@ -374,14 +373,14 @@ vector_scal_opencl (unsigned size, float vector[size], float factor)
 
				 \endcode
			
 
				 
			
 
				 The OpenCL kernel itself must be loaded from <c>main</c>, sometime after
			
 
				-the <c>initialize</c> pragma:
			
 
				+the pragma <c>initialize</c>:
			
 
				 
			
 
				 \code{.c}
			
 
				 starpu_opencl_load_opencl_from_file ("vector_scal_opencl_kernel.cl",
			
 
				                                        &cl_programs, "");
			
 
				 \endcode
			
 
				 
			
 
				-And that's it.  The <c>vector_scal</c> task now has an additional
			
 
				+And that's it.  The task <c>vector_scal</c> now has an additional
			
 
				 implementation, for OpenCL, which StarPU's scheduler may choose to use
			
 
				 at run-time.  Unfortunately, the <c>vector_scal_opencl</c> above still
			
 
				 has to go through the common OpenCL boilerplate; in the future,
			
@@ -404,40 +403,13 @@ The actual implementation of the CUDA task goes into a separate
 
				 compilation unit, in a <c>.cu</c> file.  It is very close to the
			
 
				 implementation when using StarPU's standard C API (\ref DefinitionOfTheCUDAKernel).
			
 
				 
			
 
				-\code{.c}
			
 
				-/* CUDA implementation of the `vector_scal' task, to be compiled with `nvcc'. */
			
 
				-
			
 
				-#include <starpu.h>
			
 
				-#include <stdlib.h>
			
 
				-
			
 
				-static __global__ void
			
 
				-vector_mult_cuda (unsigned n, float *val, float factor)
			
 
				-{
			
 
				-  unsigned i = blockIdx.x * blockDim.x + threadIdx.x;
			
 
				-
			
 
				-  if (i < n)
			
 
				-    val[i] *= factor;
			
 
				-}
			
 
				-
			
 
				-/* Definition of the task implementation declared in the C file. */
			
 
				-extern "C" void
			
 
				-vector_scal_cuda (size_t size, float vector[], float factor)
			
 
				-{
			
 
				-  unsigned threads_per_block = 64;
			
 
				-  unsigned nblocks = (size + threads_per_block - 1) / threads_per_block;
			
 
				-
			
 
				-  vector_mult_cuda <<< nblocks, threads_per_block, 0,
			
 
				-    starpu_cuda_get_local_stream () >>> (size, vector, factor);
			
 
				-
			
 
				-  cudaStreamSynchronize (starpu_cuda_get_local_stream ());
			
 
				-}
			
 
				-\endcode
			
 
				+\snippet scal_pragma.cu To be included
			
 
				 
			
 
				-The complete source code, in the <c>gcc-plugin/examples/vector_scal</c>
			
 
				-directory of the StarPU distribution, also shows how an SSE-specialized
			
 
				+The complete source code, in the directory <c>gcc-plugin/examples/vector_scal</c>
			
 
				+of the StarPU distribution, also shows how an SSE-specialized
			
 
				 CPU task implementation can be added.
			
 
				 
			
 
				-For more details on the C extensions provided by StarPU's GCC plug-in,
			
 
				+For more details on the C extensions provided by StarPU's GCC plug-in, see
			
 
				 \ref cExtensions.
			
 
				 
			
 
				 \section VectorScalingUsingStarPUAPI Vector Scaling Using StarPU's API
			
@@ -479,7 +451,7 @@ starpu_vector_data_register(&vector_handle, 0, (uintptr_t)vector, NX,
 
				 The first argument, called the <b>data handle</b>, is an opaque pointer which
			
 
				 designates the array in StarPU. This is also the structure which is used to
			
 
				 describe which data is used by a task. The second argument is the node number
			
 
				-where the data originally resides. Here it is 0 since the <c>vector array</c> is in
			
 
				+where the data originally resides. Here it is 0 since the array <c>vector</c> is in
			
 
				 the main memory. Then comes the pointer <c>vector</c> where the data can be found in main memory,
			
 
				 the number of elements in the vector and the size of each element.
			
 
				 The following shows how to construct a StarPU task that will manipulate the
			
@@ -569,7 +541,7 @@ The CUDA implementation can be written as follows. It needs to be compiled with
 
				 a CUDA compiler such as nvcc, the NVIDIA CUDA compiler driver. It must be noted
			
 
				 that the vector pointer returned by ::STARPU_VECTOR_GET_PTR is here a
			
 
				 pointer in GPU memory, so that it can be passed as such to the
			
 
				-<c>vector_mult_cuda</c> kernel call.
			
 
				+kernel call <c>vector_mult_cuda</c>.
			
 
				 
			
 
				 \code{.c}
			
 
				 #include <starpu.h>
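Note on the CUDA listing that the hunk above replaces with a snippet from <c>scal_pragma.cu</c>: it sizes the launch grid with a ceiling division, so the last block may hold threads beyond the end of the vector, which is why the kernel guards with <c>if (i < n)</c>. The arithmetic alone can be sketched in plain C. The helper name <c>blocks_for</c> is hypothetical and does not appear in the StarPU sources.

```c
#include <stddef.h>

/* Smallest number of blocks such that nblocks * threads_per_block >= size
 * (ceiling division). Assumes threads_per_block > 0. */
static unsigned blocks_for(size_t size, unsigned threads_per_block)
{
	return (unsigned) ((size + threads_per_block - 1) / threads_per_block);
}
```

With 64 threads per block, as in the removed listing, a vector of 65 elements needs two blocks, and 63 threads of the second block do nothing.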
			
--- a/doc/doxygen/chapters/building.doxy
+++ b/doc/doxygen/chapters/building.doxy
@@ -134,8 +134,8 @@ $ make install
 
				 \endverbatim
			
 
				 
			
 
				 Libtool interface versioning information are included in
			
 
				-libraries names (libstarpu-1.2.so, libstarpumpi-1.2.so and
			
 
				-libstarpufft-1.2.so).
			
 
				+libraries names (<c>libstarpu-1.2.so</c>, <c>libstarpumpi-1.2.so</c> and
			
 
				+<c>libstarpufft-1.2.so</c>).
			
 
				 
			
 
				 \section SettingUpYourOwnCode Setting up Your Own Code
			
 
				 
			
@@ -145,10 +145,10 @@ StarPU provides a pkg-config executable to obtain relevant compiler
 
				 and linker flags.
			
 
				 Compiling and linking an application against StarPU may require to use
			
 
				 specific flags or libraries (for instance <c>CUDA</c> or <c>libspe2</c>).
			
 
				-To this end, it is possible to use the <c>pkg-config</c> tool.
			
 
				+To this end, it is possible to use the tool <c>pkg-config</c>.
			
 
				 
			
 
				 If StarPU was not installed at some standard location, the path of StarPU's
			
 
				-library must be specified in the <c>PKG_CONFIG_PATH</c> environment variable so
			
 
				+library must be specified in the environment variable <c>PKG_CONFIG_PATH</c> so
			
 
				 that <c>pkg-config</c> can find it. For example if StarPU was installed in
			
 
				 <c>$prefix_dir</c>:
			
 
				 
			
@@ -175,10 +175,10 @@ Make sure that <c>pkg-config --libs starpu-1.2</c> actually produces some output
 
				 before going further: <c>PKG_CONFIG_PATH</c> has to point to the place where
			
 
				 <c>starpu-1.2.pc</c> was installed during <c>make install</c>.
			
 
				 
			
 
				-Also pass the <c>--static</c> option if the application is to be
			
 
				+Also pass the option <c>--static</c> if the application is to be
			
 
				 linked statically.
			
 
				 
			
 
				-It is also necessary to set the variable <c>LD_LIBRARY_PATH</c> to
			
 
				+It is also necessary to set the environment variable <c>LD_LIBRARY_PATH</c> to
			
 
				 locate dynamic libraries at runtime.
			
 
				 
			
 
				 \verbatim
			
@@ -283,10 +283,10 @@ multiplication using BLAS and cuBLAS. They output the obtained GFlops.
 
				 
			
 
				 \subsection CholeskyFactorization Cholesky Factorization
			
 
				 
			
 
				-<c>cholesky\*</c> perform a Cholesky factorization (single precision). They use different dependency primitives.
			
 
				+<c>cholesky/*</c> perform a Cholesky factorization (single precision). They use different dependency primitives.
			
 
				 
			
 
				 \subsection LUFactorization LU Factorization
			
 
				 
			
 
				-<c>lu\*</c> perform an LU factorization. They use different dependency primitives.
			
 
				+<c>lu/*</c> perform an LU factorization. They use different dependency primitives.
			
 
				 
			
 
				 */
			
--- a/doc/doxygen/chapters/fft_support.doxy
+++ b/doc/doxygen/chapters/fft_support.doxy
@@ -9,7 +9,7 @@
 
				 /*! \page FFTSupport FFT Support
			
 
				 
			
 
				 StarPU provides <c>libstarpufft</c>, a library whose design is very similar to
			
 
				-both fftw and cufft, the difference being that it takes benefit from both CPUs
			
 
				+both <c>fftw</c> and <c>cufft</c>, the difference being that it takes benefit from both CPUs
			
 
				 and GPUs. It should however be noted that GPUs do not have the same precision as
			
 
				 CPUs, so the results may different by a negligible amount.
			
 
				 
			
@@ -33,7 +33,7 @@ The documentation below is given with names for double precision, replace
 
				 
			
 
				 Only complex numbers are supported at the moment.
			
 
				 
			
 
				-The application has to call starpu_init() before calling starpufft functions.
			
 
				+The application has to call starpu_init() before calling <c>starpufft</c> functions.
			
 
				 
			
 
				 Either main memory pointers or data handles can be provided.
			
 
				 
			
@@ -66,6 +66,6 @@ $ pkg-config --cflags starpufft-1.2  # options for the compiler
 
				 $ pkg-config --libs starpufft-1.2    # options for the linker
			
 
				 \endverbatim
			
 
				 
			
 
				-Also pass the <c>--static</c> option if the application is to be linked statically.
			
 
				+Also pass the option <c>--static</c> if the application is to be linked statically.
			
 
				 
			
 
				 */
			
--- a/doc/doxygen/chapters/mpi_support.doxy
+++ b/doc/doxygen/chapters/mpi_support.doxy
@@ -31,7 +31,7 @@ $ pkg-config --cflags starpumpi-1.2  # options for the compiler
 
				 $ pkg-config --libs starpumpi-1.2    # options for the linker
			
 
				 \endverbatim
			
 
				 
			
 
				-You also need pass the <c>--static</c> option if the application is to
			
 
				+You also need to pass the option <c>--static</c> if the application is to
			
 
				 be linked statically.
			
 
				 
			
 
				 \code{.c}
			
@@ -257,7 +257,7 @@ int my_distrib(int x, int y, int nb_nodes) {
 
				 
			
 
				 Now the data can be registered within StarPU. Data which are not
			
 
				 owned but will be needed for computations can be registered through
			
 
				-the lazy allocation mechanism, i.e. with a <c>home_node</c> set to -1.
			
 
				+the lazy allocation mechanism, i.e. with a <c>home_node</c> set to <c>-1</c>.
			
 
				 StarPU will automatically allocate the memory when it is used for the
			
 
				 first time.
			
 
				 
			
--- a/doc/doxygen/chapters/optimize_performance.doxy
+++ b/doc/doxygen/chapters/optimize_performance.doxy
@@ -37,7 +37,7 @@ starpu_data_set_wt_mask(img_handle, 1<<0);
 
				 \endcode
			
 
				 
			
 
				 will for instance request to always automatically transfer a replicate into the
			
 
				-main memory (node 0), as bit 0 of the write-through bitmask is being set.
			
 
				+main memory (node <c>0</c>), as bit <c>0</c> of the write-through bitmask is being set.
			
 
				 
			
 
				 \code{.c}
			
 
				 starpu_data_set_wt_mask(img_handle, ~0U);
			
@@ -108,7 +108,7 @@ possibility according to task size, one can run
 
				 speedup of independent tasks of very small sizes.
			
 
				 
			
 
				 The choice of scheduler also has impact over the overhead: for instance, the
			
 
				-<c>dmda</c> scheduler takes time to make a decision, while <c>eager</c> does
			
 
				+scheduler <c>dmda</c> takes time to make a decision, while <c>eager</c> does
			
 
				 not. <c>tasks_size_overhead.sh</c> can again be used to get a grasp at how much
			
 
				 impact that has on the target machine.
			
 
				 
			
@@ -132,7 +132,7 @@ priority information to StarPU.
 
				 
			
 
				 \section TaskSchedulingPolicy Task Scheduling Policy
			
 
				 
			
 
				-By default, StarPU uses the <c>eager</c> simple greedy scheduler. This is
			
 
				+By default, StarPU uses the simple greedy scheduler <c>eager</c>. This is
			
 
				 because it provides correct load balance even if the application codelets do not
			
 
				 have performance models. If your application codelets have performance models
			
 
				 (\ref PerformanceModelExample), you should change the scheduler thanks
			
@@ -276,14 +276,14 @@ and in Joules for the energy consumption models.
 
				 
			
 
				 Distributing tasks to balance the load induces data transfer penalty. StarPU
			
 
				 thus needs to find a balance between both. The target function that the
			
 
				-<c>dmda</c> scheduler of StarPU
			
 
				+scheduler <c>dmda</c> of StarPU
			
 
				 tries to minimize is <c>alpha * T_execution + beta * T_data_transfer</c>, where
			
 
				 <c>T_execution</c> is the estimated execution time of the codelet (usually
			
 
				 accurate), and <c>T_data_transfer</c> is the estimated data transfer time. The
			
 
				 latter is estimated based on bus calibration before execution start,
			
 
				 i.e. with an idle machine, thus without contention. You can force bus
			
 
				 re-calibration by running the tool <c>starpu_calibrate_bus</c>. The
			
 
				-beta parameter defaults to 1, but it can be worth trying to tweak it
			
 
				+beta parameter defaults to <c>1</c>, but it can be worth trying to tweak it
			
 
				 by using <c>export STARPU_SCHED_BETA=2</c> for instance, since during
			
 
				 real application execution, contention makes transfer times bigger.
			
 
				 This is of course imprecise, but in practice, a rough estimation
			
@@ -291,7 +291,7 @@ already gives the good results that a precise estimation would give.
 
				 
			
 
				 \section DataPrefetch Data Prefetch
			
 
				 
			
 
				-The <c>heft</c>, <c>dmda</c> and <c>pheft</c> scheduling policies
			
 
				+The scheduling policies <c>heft</c>, <c>dmda</c> and <c>pheft</c>
			
 
				 perform data prefetch (see \ref STARPU_PREFETCH):
			
 
				 as soon as a scheduling decision is taken for a task, requests are issued to
			
 
				 transfer its required data to the target processing unit, if needeed, so that
			
@@ -310,9 +310,9 @@ the handle and the desired target memory node.
 
				 \section Power-basedScheduling Power-based Scheduling
			
 
				 
			
 
				 If the application can provide some power performance model (through
			
 
				-the <c>power_model</c> field of the codelet structure), StarPU will
			
 
				+the field starpu_codelet::power_model), StarPU will
			
 
				 take it into account when distributing tasks. The target function that
			
 
				-the <c>dmda</c> scheduler minimizes becomes <c>alpha * T_execution +
			
 
				+the scheduler <c>dmda</c> minimizes becomes <c>alpha * T_execution +
			
 
				 beta * T_data_transfer + gamma * Consumption</c> , where <c>Consumption</c>
			
 
				 is the estimated task consumption in Joules. To tune this parameter, use
			
 
				 <c>export STARPU_SCHED_GAMMA=3000</c> for instance, to express that each Joule
			
@@ -333,7 +333,7 @@ On-line task consumption measurement is currently only supported through the
 
				 <c>CL_PROFILING_POWER_CONSUMED</c> OpenCL extension, implemented in the MoviSim
			
 
				 simulator. Applications can however provide explicit measurements by
			
 
				 using the function starpu_perfmodel_update_history() (examplified in \ref PerformanceModelExample
			
 
				-with the <c>power_model</c> performance model. Fine-grain
			
 
				+with the <c>power_model</c> performance model). Fine-grain
			
 
				 measurement is often not feasible with the feedback provided by the hardware, so
			
 
				 the user can for instance run a given task a thousand times, measure the global
			
 
				 consumption for that series of tasks, divide it by a thousand, repeat for
			
@@ -446,9 +446,9 @@ $ STARPU_SCHED=dmda ./examples/matvecmult/matvecmult
 
				 TEST PASSED
			
 
				 \endverbatim
			
 
				 
			
 
				-Note that we force to use the dmda scheduler to generate performance
			
 
				-models for the application. The application may need to be run several
			
 
				-times before the model is calibrated.
			
 
				+Note that we force the use of the scheduler <c>dmda</c> to generate
			
 
				+performance models for the application. The application may need to be
			
 
				+run several times before the model is calibrated.
			
 
				 
			
 
				 \subsection Simulation Simulation
			
 
				 
			
--- a/doc/doxygen/chapters/performance_feedback.doxy
+++ b/doc/doxygen/chapters/performance_feedback.doxy
@@ -16,7 +16,7 @@ nice visual task debugging. To do so, build Temanejo's <c>libayudame.so</c>,
 
				 install <c>Ayudame.h</c> to e.g. <c>/usr/local/include</c>, apply the
			
 
				 <c>tools/patch-ayudame</c> to it to fix C build, re-<c>./configure</c>, make
			
 
				 sure that it found it, rebuild StarPU.  Run the Temanejo GUI, give it the path
			
 
				-to your application, any options you want to pass it, the path to libayudame.so.
			
 
				+to your application, any options you want to pass it, the path to <c>libayudame.so</c>.
			
 
				 
			
 
				 Make sure to specify at least the same number of CPUs in the dialog box as your
			
 
				 machine has, otherwise an error will happen during execution. Future versions
			
@@ -35,7 +35,7 @@ call starpu_profiling_status_set() with the parameter
 
				 is already enabled or not by calling starpu_profiling_status_get().
			
 
				 Enabling monitoring also reinitializes all previously collected
			
 
				 feedback. The environment variable \ref STARPU_PROFILING can also be
			
 
				-set to 1 to achieve the same effect.
			
 
				+set to <c>1</c> to achieve the same effect.
			
 
				 
			
 
				 Likewise, performance monitoring is stopped by calling
			
 
				 starpu_profiling_status_set() with the parameter
			
@@ -247,7 +247,7 @@ Or you can simply point the <c>PKG_CONFIG_PATH</c> to
 
				 \ref with-fxt "--with-fxt" to <c>./configure</c>
			
 
				 
			
 
				 When FxT is enabled, a trace is generated when StarPU is terminated by calling
			
 
				-starpu_shutdown()). The trace is a binary file whose name has the form
			
 
				+starpu_shutdown(). The trace is a binary file whose name has the form
			
 
				 <c>prof_file_XXX_YYY</c> where <c>XXX</c> is the user name, and
			
 
				 <c>YYY</c> is the pid of the process that used StarPU. This file is saved in the
			
 
				 <c>/tmp/</c> directory by default, or in the directory specified by
			
@@ -269,7 +269,7 @@ application shutdown.
 
				 This will create a file <c>paje.trace</c> in the current directory that
			
 
				 can be inspected with the <a href="http://vite.gforge.inria.fr/">ViTE trace
			
 
				 visualizing open-source tool</a>.  It is possible to open the
			
 
				-<c>paje.trace</c> file with ViTE by using the following command:
			
 
				+file <c>paje.trace</c> with ViTE by using the following command:
			
 
				 
			
 
				 \verbatim
			
 
				 $ vite paje.trace
			
@@ -322,7 +322,7 @@ generate an activity trace by calling:
 
				 $ starpu_fxt_tool -i filename
			
 
				 \endverbatim
			
 
				 
			
 
				-This will create an <c>activity.data</c> file in the current
			
 
				+This will create a file <c>activity.data</c> in the current
			
 
				 directory. A profile of the application showing the activity of StarPU
			
 
				 during the execution of the program can be generated:
			
 
				 
			
@@ -341,7 +341,7 @@ efficiently. The black sections indicate that the processing unit was blocked
 
				 because there was no task to process: this may indicate a lack of parallelism
			
 
				 which may be alleviated by creating more tasks when it is possible.
			
 
				 
			
 
				-The second part of the <c>activity.eps</c> picture is a graph showing the
			
 
				+The second part of the picture <c>activity.eps</c> is a graph showing the
			
 
				 evolution of the number of tasks available in the system during the execution.
			
 
				 Ready tasks are shown in black, and tasks that are submitted but not
			
 
				 schedulable yet are shown in grey.
			
@@ -360,8 +360,8 @@ file: <starpu_slu_lu_model_22.hannibal>
 
				 file: <starpu_slu_lu_model_12.hannibal>
			
 
				 \endverbatim
			
 
				 
			
 
				-Here, the codelets of the lu example are available. We can examine the
			
 
				-performance of the 22 kernel (in micro-seconds), which is history-based:
			
 
				+Here, the codelets of the example <c>lu</c> are available. We can examine the
			
 
				+performance of the kernel <c>22</c> (in micro-seconds), which is history-based:
			
 
				 
			
 
				 \verbatim
			
 
				 $ starpu_perfmodel_display -s starpu_slu_lu_model_22
			
@@ -414,7 +414,7 @@ starpu_perfmodel_load_symbol(). The source code of the tool
 
				 
			
 
				 The tool <c>starpu_perfmodel_plot</c> can be used to draw performance
			
 
				 models. It writes a <c>.gp</c> file in the current directory, to be
			
 
				-run in the <c>gnuplot</c> tool, which shows the corresponding curve.
			
 
				+run with the tool <c>gnuplot</c>, which shows the corresponding curve.
			
 
				 
			
 
				 When the field starpu_task::flops is set, <c>starpu_perfmodel_plot</c> can
			
 
				 directly draw a GFlops curve, by simply adding the <c>-f</c> option:
			
@@ -448,13 +448,13 @@ $ starpu_perfmodel_plot -s non_linear_memset_regression_based -i /tmp/prof_file_
 
				 It will produce a <c>.gp</c> file which contains both the performance model
			
 
				 curves, and the profiling measurements.
			
 
				 
			
 
				-If you have the <c>R</c> statistical tool installed, you can additionally use
			
 
				+If you have the statistical tool <c>R</c> installed, you can additionally use
			
 
				 
			
 
				 \verbatim
			
 
				 $ starpu_codelet_histo_profile distrib.data
			
 
				 \endverbatim
			
 
				 
			
 
				-Which will create one pdf file per codelet and per input size, showing a
			
 
				+Which will create one <c>.pdf</c> file per codelet and per input size, showing a
			
 
				 histogram of the codelet execution time distribution.
			
 
				 
			
 
				 \section TheoreticalLowerBoundOnExecutionTime Theoretical Lower Bound On Execution Time
			
@@ -475,13 +475,13 @@ use this.
 
				 \section MemoryFeedback Memory Feedback
			
 
				 
			
 
				 It is possible to enable memory statistics. To do so, you need to pass
			
 
				-the option \ref enable-memory-stats "--enable-memory-stats" when running configure. It is then
			
 
				-possible to call the function starpu_display_memory_stats() to
			
 
				+the option \ref enable-memory-stats "--enable-memory-stats" when running <c>configure</c>. It is then
			
 
				+possible to call the function starpu_data_display_memory_stats() to
			
 
				 display statistics about the current data handles registered within StarPU.
			
 
				 
			
 
				 Moreover, statistics will be displayed at the end of the execution on
			
 
				 data handles which have not been cleared out. This can be disabled by
			
 
				-setting the environment variable \ref STARPU_MEMORY_STATS to 0.
			
 
				+setting the environment variable \ref STARPU_MEMORY_STATS to <c>0</c>.
			
 
				 
			
 
				 For example, if you do not unregister data at the end of the complex
			
 
				 example, you will get something similar to:
			
@@ -552,7 +552,7 @@ of the application. To enable them, you need to pass the option
 
				 starpu_shutdown() various statistics will be displayed,
			
 
				 execution, MSI cache statistics, allocation cache statistics, and data
			
 
				 transfer statistics. The display can be disabled by setting the
			
 
				-environment variable \ref STARPU_STATS to 0.
			
 
				+environment variable \ref STARPU_STATS to <c>0</c>.
			
 
				 
			
 
				 \verbatim
			
 
				 $ ./examples/cholesky/cholesky_tag