@@ -1430,10 +1430,16 @@ few additional changes are needed.
@node Data management
@section Data management

+When the application allocates data, whenever possible it should use the
+@code{starpu_malloc} function, which will ask CUDA or
+OpenCL to perform the allocation itself and pin the allocated
+memory. This is required for asynchronous data transfers, i.e. for data
+transfers to overlap with computations.
+
By default, StarPU leaves replicates of data wherever they were used, in case they
will be re-used by other tasks, thus saving the data transfer time. When some
task modifies some data, all the other replicates are invalidated, and only the
-processing unit will have a valid replicate of the data. If the application knows
+processing unit which ran that task will have a valid replicate of the data. If the application knows
that this data will not be re-used by further tasks, it should advise StarPU to
immediately replicate it to a desired list of memory nodes (given through a
bitmask). This can be understood like the write-through mode of CPU caches.
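In code, the allocation pattern described by the hunk above might look as follows. This is only an illustrative sketch, not part of the patch: @code{NX}, @code{vector} and @code{vector_handle} are made-up names, and error handling is reduced to checking the return code.

@example
float *vector;
starpu_data_handle vector_handle;

/* Pinned allocation, so that transfers involving this buffer
 * can run asynchronously and overlap with computations. */
if (starpu_malloc((void **)&vector, NX * sizeof(vector[0])) != 0)
  return; /* allocation failed */

/* Register the buffer and use vector_handle in task submissions. */
starpu_vector_data_register(&vector_handle, 0, (uintptr_t)vector,
                            NX, sizeof(vector[0]));
@end example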
@@ -1445,12 +1451,6 @@ starpu_data_set_wt_mask(img_handle, 1<<0);
will for instance request to always transfer a replicate into the main memory (node
0), as bit 0 of the write-through bitmask is being set.

-When the application allocates data, whenever possible it should use the
-@code{starpu_malloc} function, which will ask CUDA or
-OpenCL to make the allocation itself and pin the corresponding allocated
-memory. This is needed to permit asynchronous data transfer, i.e. permit data
-transfer to overlap with computations.
-
@node Task submission
@section Task submission

@@ -1491,13 +1491,13 @@ to configure a performance model for the codelets of the application (see
use on-line calibration. StarPU will automatically calibrate codelets
which have never been calibrated yet. To force continuing calibration, use
@code{export STARPU_CALIBRATE=1} . This may be necessary if your application
-have not-so-stable performance. Details on the current performance model status
+has not-so-stable performance. Details on the current performance model status
can be obtained from the @code{starpu_perfmodel_display} command: the @code{-l}
option lists the available performance models, and the @code{-s} option permits
to choose the performance model to be displayed. The result looks like:

@example
-€ starpu_perfmodel_display -s starpu_dlu_lu_model_22
+$ starpu_perfmodel_display -s starpu_dlu_lu_model_22
performance model for cpu
# hash      size     mean          dev           n
5c6c3401    1572864  1.216300e+04  2.277778e+03  1240
@@ -1527,10 +1527,11 @@ thus needs to find a balance between both. The target function that the
tries to minimize is @code{alpha * T_execution + beta * T_data_transfer}, where
@code{T_execution} is the estimated execution time of the codelet (usually
accurate), and @code{T_data_transfer} is the estimated data transfer time. The
-latter is however estimated based on bus calibration before execution start,
-i.e. with an idle machine. You can force bus re-calibration by running
+latter is estimated based on bus calibration before execution starts,
+i.e. with an idle machine, thus without contention. You can force bus re-calibration by running
@code{starpu_calibrate_bus}. The beta parameter defaults to 1, but it can be
-worth trying to tweak it by using @code{export STARPU_BETA=2} for instance.
+worth trying to tweak it by using @code{export STARPU_BETA=2} for instance,
+since contention during real application execution makes transfers take longer.
This is of course imprecise, but in practice, a rough estimation already gives
the good results that a precise estimation would give.

@@ -1563,7 +1564,7 @@ beta * T_data_transfer + gamma * Consumption} , where @code{Consumption}
is the estimated task consumption in Joules. To tune this parameter, use
@code{export STARPU_GAMMA=3000} for instance, to express that each Joule
(i.e kW during 1000us) is worth 3000us execution time penalty. Setting
-alpha and beta to zero permits to only take into account power consumption.
+@code{alpha} and @code{beta} to zero permits taking only power consumption into account.

This is however not sufficient to correctly optimize power: the scheduler would
simply tend to run all computations on the most energy-conservative processing
@@ -1603,7 +1604,7 @@ func <<<grid,block,0,starpu_cuda_get_local_stream()>>> (foo, bar);
cudaStreamSynchronize(starpu_cuda_get_local_stream());
@end example

-Unfortunately, a lot of CUDA libraries do not have stream variants of
+Unfortunately, some CUDA libraries do not have stream variants of
kernels. That will lower the potential for overlapping.

@c ---------------------------------------------------------------------