Samuel Thibault, 14 years ago
parent commit 823be92ae0
1 changed file with 15 additions and 14 deletions

doc/starpu.texi

@@ -1430,10 +1430,16 @@ few additional changes are needed.
 @node Data management
 @section Data management
 
+When the application allocates data, whenever possible it should use the
+@code{starpu_malloc} function, which will ask CUDA or
+OpenCL to make the allocation itself and pin the corresponding allocated
+memory. This is needed to permit asynchronous data transfers, i.e. to let data
+transfers overlap with computations.
+
 By default, StarPU leaves replicates of data wherever they were used, in case they
 will be re-used by other tasks, thus saving the data transfer time. When some
 task modifies some data, all the other replicates are invalidated, and only the
-processing unit will have a valid replicate of the data. If the application knows
+processing unit which ran that task will have a valid replicate of the data. If the application knows
 that this data will not be re-used by further tasks, it should advise StarPU to
 immediately replicate it to a desired list of memory nodes (given through a
 bitmask). This can be understood like the write-through mode of CPU caches.
@@ -1445,12 +1451,6 @@ starpu_data_set_wt_mask(img_handle, 1<<0);
 will for instance request to always transfer a replicate into the main memory (node
 0), as bit 0 of the write-through bitmask is being set.
 
-When the application allocates data, whenever possible it should use the
-@code{starpu_malloc} function, which will ask CUDA or
-OpenCL to make the allocation itself and pin the corresponding allocated
-memory. This is needed to permit asynchronous data transfers, i.e. to let data
-transfers overlap with computations.
-
 @node Task submission
 @section Task submission
 
@@ -1491,13 +1491,13 @@ to configure a performance model for the codelets of the application (see
 use on-line calibration.  StarPU will automatically calibrate codelets
 which have never been calibrated yet. To force continuing calibration, use
 @code{export STARPU_CALIBRATE=1}. This may be necessary if your application
-have not-so-stable performance. Details on the current performance model status
+has not-so-stable performance. Details on the current performance model status
 can be obtained from the @code{starpu_perfmodel_display} command: the @code{-l}
 option lists the available performance models, and the @code{-s} option selects
 the performance model to be displayed. The result looks like:
 
 @example
- starpu_perfmodel_display -s starpu_dlu_lu_model_22
+$ starpu_perfmodel_display -s starpu_dlu_lu_model_22
 performance model for cpu
 # hash		size		mean		dev		n
 5c6c3401	1572864        	1.216300e+04   	2.277778e+03   	1240
@@ -1527,10 +1527,11 @@ thus needs to find a balance between both. The target function that the
 tries to minimize is @code{alpha * T_execution + beta * T_data_transfer}, where
 @code{T_execution} is the estimated execution time of the codelet (usually
 accurate), and @code{T_data_transfer} is the estimated data transfer time. The
-latter is however estimated based on bus calibration before execution start,
-i.e. with an idle machine. You can force bus re-calibration by running
+latter is estimated based on bus calibration before execution start,
+i.e. with an idle machine, thus without contention. You can force bus re-calibration by running
 @code{starpu_calibrate_bus}. The beta parameter defaults to 1, but it can be
-worth trying to tweak it by using @code{export STARPU_BETA=2} for instance.
+worth trying to tweak it by using @code{export STARPU_BETA=2} for instance,
+since during real application execution, contention makes transfer times bigger.
 This is of course imprecise, but in practice a rough estimation already gives
 the same good results as a precise estimation would.
 
@@ -1563,7 +1564,7 @@ beta * T_data_transfer + gamma * Consumption} , where @code{Consumption}
 is the estimated task consumption in Joules. To tune this parameter, use
 @code{export STARPU_GAMMA=3000} for instance, to express that each Joule
 (i.e. 1 kW during 1000us) is worth a 3000us execution time penalty. Setting
-alpha and beta to zero permits to only take into account power consumption.
+@code{alpha} and @code{beta} to zero makes the scheduler take only power consumption into account.
 
 This is however not sufficient to correctly optimize power: the scheduler would
 simply tend to run all computations on the most energy-conservative processing
@@ -1603,7 +1604,7 @@ func <<<grid,block,0,starpu_cuda_get_local_stream()>>> (foo, bar);
 cudaStreamSynchronize(starpu_cuda_get_local_stream());
 @end example
 
-Unfortunately, a lot of CUDA libraries do not have stream variants of
+Unfortunately, some CUDA libraries do not have stream variants of
 kernels. That will lower the potential for overlapping.
 
 @c ---------------------------------------------------------------------