misc fixes

Samuel Thibault 14 years ago
parent commit 823be92ae0
1 changed file with 15 additions and 14 deletions

doc/starpu.texi (+15, -14)

@@ -1430,10 +1430,16 @@ few additional changes are needed.
 @node Data management
 @section Data management
 
+When the application allocates data, whenever possible it should use the
+@code{starpu_malloc} function, which will ask CUDA or
+OpenCL to make the allocation itself and pin the corresponding allocated
+memory. This is needed to permit asynchronous data transfer, i.e. permit data
+transfer to overlap with computations.
+
 By default, StarPU leaves replicates of data wherever they were used, in case they
 will be re-used by other tasks, thus saving the data transfer time. When some
 task modifies some data, all the other replicates are invalidated, and only the
-processing unit will have a valid replicate of the data. If the application knows
+processing unit which ran that task will have a valid replicate of the data. If the application knows
 that this data will not be re-used by further tasks, it should advise StarPU to
 immediately replicate it to a desired list of memory nodes (given through a
 bitmask). This can be understood like the write-through mode of CPU caches.
@@ -1445,12 +1451,6 @@ starpu_data_set_wt_mask(img_handle, 1<<0);
 will for instance request to always transfer a replicate into the main memory (node
 0), as bit 0 of the write-through bitmask is being set.
 
-When the application allocates data, whenever possible it should use the
-@code{starpu_malloc} function, which will ask CUDA or
-OpenCL to make the allocation itself and pin the corresponding allocated
-memory. This is needed to permit asynchronous data transfer, i.e. permit data
-transfer to overlap with computations.
-
 @node Task submission
 @section Task submission
 
@@ -1491,13 +1491,13 @@ to configure a performance model for the codelets of the application (see
 use on-line calibration.  StarPU will automatically calibrate codelets
 which have never been calibrated yet. To force continuing calibration, use
 @code{export STARPU_CALIBRATE=1}. This may be necessary if your application
-have not-so-stable performance. Details on the current performance model status
+has not-so-stable performance. Details on the current performance model status
 can be obtained from the @code{starpu_perfmodel_display} command: the @code{-l}
 option lists the available performance models, and the @code{-s} option permits
 to choose the performance model to be displayed. The result looks like:
 
 @example
- starpu_perfmodel_display -s starpu_dlu_lu_model_22
+$ starpu_perfmodel_display -s starpu_dlu_lu_model_22
 performance model for cpu
 # hash		size		mean		dev		n
 5c6c3401	1572864        	1.216300e+04   	2.277778e+03   	1240
@@ -1527,10 +1527,11 @@ thus needs to find a balance between both. The target function that the
 tries to minimize is @code{alpha * T_execution + beta * T_data_transfer}, where
 @code{T_execution} is the estimated execution time of the codelet (usually
 accurate), and @code{T_data_transfer} is the estimated data transfer time. The
-latter is however estimated based on bus calibration before execution start,
-i.e. with an idle machine. You can force bus re-calibration by running
+latter is estimated based on bus calibration before execution start,
+i.e. with an idle machine, thus without contention. You can force bus re-calibration by running
 @code{starpu_calibrate_bus}. The beta parameter defaults to 1, but it can be
-worth trying to tweak it by using @code{export STARPU_BETA=2} for instance.
+worth trying to tweak it by using @code{export STARPU_BETA=2} for instance,
+since during real application execution, contention makes transfer times bigger.
 This is of course imprecise, but in practice, a rough estimation already gives
 the good results that a precise estimation would give.
 
@@ -1563,7 +1564,7 @@ beta * T_data_transfer + gamma * Consumption} , where @code{Consumption}
 is the estimated task consumption in Joules. To tune this parameter, use
 @code{export STARPU_GAMMA=3000} for instance, to express that each Joule
 (i.e. kW during 1000us) is worth 3000us execution time penalty. Setting
-alpha and beta to zero permits to only take into account power consumption.
+@code{alpha} and @code{beta} to zero permits to only take into account power consumption.
 
 This is however not sufficient to correctly optimize power: the scheduler would
 simply tend to run all computations on the most energy-conservative processing
@@ -1603,7 +1604,7 @@ func <<<grid,block,0,starpu_cuda_get_local_stream()>>> (foo, bar);
 cudaStreamSynchronize(starpu_cuda_get_local_stream());
 @end example
 
-Unfortunately, a lot of CUDA libraries do not have stream variants of
+Unfortunately, some CUDA libraries do not have stream variants of
 kernels. That will lower the potential for overlapping.
 
 @c ---------------------------------------------------------------------