misc fixes

Samuel Thibault 14 years ago
parent commit 823be92ae0
1 changed file with 15 additions and 14 deletions

doc/starpu.texi (+15, -14)

@@ -1430,10 +1430,16 @@ few additional changes are needed.
 @node Data management
 @section Data management
 
+When the application allocates data, whenever possible it should use the
+@code{starpu_malloc} function, which will ask CUDA or
+OpenCL to make the allocation itself and pin the corresponding allocated
+memory. This is needed to permit asynchronous data transfer, i.e. permit data
+transfer to overlap with computations.
+
 By default, StarPU leaves replicates of data wherever they were used, in case they
 will be re-used by other tasks, thus saving the data transfer time. When some
 task modifies some data, all the other replicates are invalidated, and only the
-processing unit will have a valid replicate of the data. If the application knows
+processing unit which ran that task will have a valid replicate of the data. If the application knows
 that this data will not be re-used by further tasks, it should advise StarPU to
 immediately replicate it to a desired list of memory nodes (given through a
 bitmask). This can be understood like the write-through mode of CPU caches.
@@ -1445,12 +1451,6 @@ starpu_data_set_wt_mask(img_handle, 1<<0);
 will for instance request to always transfer a replicate into the main memory (node
 0), as bit 0 of the write-through bitmask is being set.
 
-When the application allocates data, whenever possible it should use the
-@code{starpu_malloc} function, which will ask CUDA or
-OpenCL to make the allocation itself and pin the corresponding allocated
-memory. This is needed to permit asynchronous data transfer, i.e. permit data
-transfer to overlap with computations.
-
 @node Task submission
 @section Task submission
 
@@ -1491,13 +1491,13 @@ to configure a performance model for the codelets of the application (see
 use on-line calibration.  StarPU will automatically calibrate codelets
 which have never been calibrated yet. To force continuing calibration, use
 @code{export STARPU_CALIBRATE=1}. This may be necessary if your application
-have not-so-stable performance. Details on the current performance model status
+has not-so-stable performance. Details on the current performance model status
 can be obtained from the @code{starpu_perfmodel_display} command: the @code{-l}
 option lists the available performance models, and the @code{-s} option permits
 to choose the performance model to be displayed. The result looks like:
 
 @example
- starpu_perfmodel_display -s starpu_dlu_lu_model_22
+$ starpu_perfmodel_display -s starpu_dlu_lu_model_22
 performance model for cpu
 # hash		size		mean		dev		n
 5c6c3401	1572864        	1.216300e+04   	2.277778e+03   	1240
@@ -1527,10 +1527,11 @@ thus needs to find a balance between both. The target function that the
 tries to minimize is @code{alpha * T_execution + beta * T_data_transfer}, where
 @code{T_execution} is the estimated execution time of the codelet (usually
 accurate), and @code{T_data_transfer} is the estimated data transfer time. The
-latter is however estimated based on bus calibration before execution start,
-i.e. with an idle machine. You can force bus re-calibration by running
+latter is estimated based on bus calibration before execution start,
+i.e. with an idle machine, thus without contention. You can force bus re-calibration by running
 @code{starpu_calibrate_bus}. The beta parameter defaults to 1, but it can be
-worth trying to tweak it by using @code{export STARPU_BETA=2} for instance.
+worth trying to tweak it by using @code{export STARPU_BETA=2} for instance,
+since during real application execution, contention makes transfer times bigger.
 This is of course imprecise, but in practice, a rough estimation already gives
 the good results that a precise estimation would give.
 
@@ -1563,7 +1564,7 @@ beta * T_data_transfer + gamma * Consumption} , where @code{Consumption}
 is the estimated task consumption in Joules. To tune this parameter, use
 @code{export STARPU_GAMMA=3000} for instance, to express that each Joule
 (i.e. kW during 1000us) is worth 3000us execution time penalty. Setting
-alpha and beta to zero permits to only take into account power consumption.
+@code{alpha} and @code{beta} to zero permits to only take into account power consumption.
 
 This is however not sufficient to correctly optimize power: the scheduler would
 simply tend to run all computations on the most energy-conservative processing
@@ -1603,7 +1604,7 @@ func <<<grid,block,0,starpu_cuda_get_local_stream()>>> (foo, bar);
 cudaStreamSynchronize(starpu_cuda_get_local_stream());
 @end example
 
-Unfortunately, a lot of CUDA libraries do not have stream variants of
+Unfortunately, some CUDA libraries do not have stream variants of
 kernels. That will lower the potential for overlapping.
 
 @c ---------------------------------------------------------------------