@@ -52,29 +52,37 @@ func <<<grid,block,0,starpu_cuda_get_local_stream()>>> (foo, bar);
cudaStreamSynchronize(starpu_cuda_get_local_stream());
\endcode
 
+Unfortunately, some CUDA libraries do not have stream variants of
+kernels. That will lower the potential for overlapping.
+
StarPU already does appropriate calls for the CUBLAS library.
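+
+For a library routine which does accept a stream, this essentially amounts to
+redirecting its calls to the local stream, along the following lines (just a
+sketch: the cuBLAS handle and the matrix arguments are assumed to be set up
+elsewhere):
+
+\code
+/* handle and the GEMM operands are assumed to be set up elsewhere */
+cublasSetStream(handle, starpu_cuda_get_local_stream());
+cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, m, n, k,
+            &alpha, A, lda, B, ldb, &beta, C, ldc);
+cudaStreamSynchronize(starpu_cuda_get_local_stream());
+\endcode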
 
If the kernel can be made to only use this local stream or other self-allocated
streams, i.e. the whole kernel submission can be made asynchronous, then
-one should enable asynchronous execution of the kernel. This means setting
-the corresponding cuda_flags[] flag in the codelet and dropping the
-cudaStreamSynchronize() call at the end of the kernel. That way, StarPU will be
-able to pipeline submitting tasks to GPUs, instead of synchronizing at each
+one should enable asynchronous execution of the kernel. That means setting
+the STARPU_CUDA_ASYNC flag in cuda_flags[] in the codelet, and dropping the
+cudaStreamSynchronize() call at the end of the cuda_func function, so that it
+returns immediately after having queued the kernel to the local stream. That way, StarPU will be
+able to submit and complete data transfers while kernels are executing, instead of only at each
kernel submission. The kernel just has to make sure that StarPU can use the
local stream to synchronize with the kernel startup and completion.
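+
+As a sketch (the kernel, the function names and the codelet fields shown here
+are only illustrative), an asynchronous CUDA codelet could thus look like:
+
+\code
+static __global__ void scal_kernel(float *v, unsigned n, float factor)
+{
+    unsigned i = blockIdx.x * blockDim.x + threadIdx.x;
+    if (i < n)
+        v[i] *= factor;
+}
+
+void scal_cuda_func(void *buffers[], void *cl_arg)
+{
+    unsigned n = STARPU_VECTOR_GET_NX(buffers[0]);
+    float *v = (float *)STARPU_VECTOR_GET_PTR(buffers[0]);
+    float factor = *(float *)cl_arg;
+
+    /* Queue the kernel on the StarPU-provided stream and return right away:
+     * no cudaStreamSynchronize() call, StarPU will use the local stream to
+     * synchronize with the kernel's completion. */
+    scal_kernel<<<(n + 255) / 256, 256, 0, starpu_cuda_get_local_stream()>>>(v, n, factor);
+}
+
+struct starpu_codelet scal_cl =
+{
+    .cuda_funcs = { scal_cuda_func },
+    .cuda_flags = { STARPU_CUDA_ASYNC },
+    .nbuffers = 1,
+    .modes = { STARPU_RW },
+};
+\endcode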
 
-Unfortunately, some CUDA libraries do not have stream variants of
-kernels. That will lower the potential for overlapping.
+Using the STARPU_CUDA_ASYNC flag also permits enabling concurrent kernel
+execution, on cards which support it (Kepler and later, notably). This is
+enabled by setting the STARPU_NWORKER_PER_CUDA environment variable to the
+number of kernels to execute concurrently. This is useful when kernels are
+small and do not feed the whole GPU with threads to run.
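+
+For instance, with the illustrative value 4, one can either export
+STARPU_NWORKER_PER_CUDA=4 in the environment before running the application,
+or set it from the program before initializing StarPU:
+
+\code
+/* "4" is just an example value: ask for 4 CUDA workers per device */
+setenv("STARPU_NWORKER_PER_CUDA", "4", 1);
+int ret = starpu_init(NULL);
+\endcode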
 
\section OpenCL-specificOptimizations OpenCL-specific Optimizations
 
If the kernel can be made to only use the StarPU-provided command queue or other self-allocated
-streams, i.e. the whole kernel submission can be made asynchronous, then
+queues, i.e. the whole kernel submission can be made asynchronous, then
one should enable asynchronous execution of the kernel. This means setting
the corresponding opencl_flags[] flag in the codelet and dropping the
-clFinish() and starpu_opencl_collect_stats() calls at the end of the kernel.
-That way, StarPU will be able to pipeline submitting tasks to GPUs, instead of
-synchronizing at each kernel submission. The kernel just has to make sure
+clFinish() and starpu_opencl_collect_stats() calls at the end of the kernel, so
+that it returns immediately after having queued the kernel to the provided queue.
+That way, StarPU will be able to submit and complete data transfers while kernels are executing, instead of
+only at each kernel submission. The kernel just has to make sure
that StarPU can use the command queue it has provided to synchronize with the
kernel startup and completion.
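+
+As a sketch (kernel creation and argument setting are omitted, and the names
+are only illustrative), an asynchronous OpenCL codelet could look like:
+
+\code
+void scal_opencl_func(void *buffers[], void *cl_arg)
+{
+    size_t global = STARPU_VECTOR_GET_NX(buffers[0]);
+    cl_command_queue queue;
+
+    /* 'kernel' is assumed to be a cl_kernel already built for this device,
+     * with its arguments already set. */
+    starpu_opencl_get_current_queue(&queue);
+    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global, NULL, 0, NULL, NULL);
+    /* No clFinish() here: return right after queueing the kernel, StarPU
+     * synchronizes with it through the command queue it provided. */
+}
+
+struct starpu_codelet scal_opencl_cl =
+{
+    .opencl_funcs = { scal_opencl_func },
+    .opencl_flags = { STARPU_OPENCL_ASYNC },
+    .nbuffers = 1,
+    .modes = { STARPU_RW },
+};
+\endcode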