@@ -1430,10 +1430,16 @@ few additional changes are needed.
@node Data management
@section Data management

+When the application allocates data, whenever possible it should use the
+@code{starpu_malloc} function, which will ask CUDA or
+OpenCL to perform the allocation itself and pin the allocated
+memory. This is required for asynchronous data transfers, i.e. for data
+transfers to overlap with computations.
+
By default, StarPU leaves replicates of data wherever they were used, in case they
will be re-used by other tasks, thus saving the data transfer time. When some
task modifies some data, all the other replicates are invalidated, and only the
-processing unit will have a valid replicate of the data. If the application knows
+processing unit which ran that task will have a valid replicate of the data. If the application knows
that this data will not be re-used by further tasks, it should advise StarPU to
immediately replicate it to a desired list of memory nodes (given through a
bitmask). This can be understood like the write-through mode of CPU caches.
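In code, the allocation pattern described by the hunk above might look as follows. This is only an illustrative sketch, not part of the patch: @code{NX}, @code{vector} and @code{vector_handle} are made-up names, and error handling is reduced to checking the return code.

@example
float *vector;
starpu_data_handle vector_handle;

/* Pinned allocation, so that transfers involving this buffer
 * can run asynchronously and overlap with computations. */
if (starpu_malloc((void **)&vector, NX * sizeof(vector[0])) != 0)
  return; /* allocation failed */

/* Register the buffer and use vector_handle in task submissions. */
starpu_vector_data_register(&vector_handle, 0, (uintptr_t)vector,
                            NX, sizeof(vector[0]));
@end example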
@@ -1445,12 +1451,6 @@ starpu_data_set_wt_mask(img_handle, 1<<0);
will for instance request to always transfer a replicate into the main memory (node
0), as bit 0 of the write-through bitmask is being set.

-When the application allocates data, whenever possible it should use the
-@code{starpu_malloc} function, which will ask CUDA or
-OpenCL to make the allocation itself and pin the corresponding allocated
-memory. This is needed to permit asynchronous data transfer, i.e. permit data
-transfer to overlap with computations.
-
@node Task submission
@section Task submission

@@ -1491,13 +1491,13 @@ to configure a performance model for the codelets of the application (see
use on-line calibration. StarPU will automatically calibrate codelets
which have never been calibrated yet. To force continuing calibration, use
@code{export STARPU_CALIBRATE=1} . This may be necessary if your application
-have not-so-stable performance. Details on the current performance model status
+has not-so-stable performance. Details on the current performance model status
can be obtained from the @code{starpu_perfmodel_display} command: the @code{-l}
option lists the available performance models, and the @code{-s} option permits
to choose the performance model to be displayed. The result looks like:

@example
-€ starpu_perfmodel_display -s starpu_dlu_lu_model_22
+$ starpu_perfmodel_display -s starpu_dlu_lu_model_22
performance model for cpu
# hash      size     mean          dev           n
5c6c3401    1572864  1.216300e+04  2.277778e+03  1240
@@ -1527,10 +1527,11 @@ thus needs to find a balance between both. The target function that the
tries to minimize is @code{alpha * T_execution + beta * T_data_transfer}, where
@code{T_execution} is the estimated execution time of the codelet (usually
accurate), and @code{T_data_transfer} is the estimated data transfer time. The
-latter is however estimated based on bus calibration before execution start,
-i.e. with an idle machine. You can force bus re-calibration by running
+latter is estimated based on bus calibration before execution starts,
+i.e. with an idle machine, thus without contention. You can force bus re-calibration by running
@code{starpu_calibrate_bus}. The beta parameter defaults to 1, but it can be
-worth trying to tweak it by using @code{export STARPU_BETA=2} for instance.
+worth trying to tweak it by using @code{export STARPU_BETA=2} for instance,
+since contention during real application execution makes transfers take longer.
This is of course imprecise, but in practice, a rough estimation already gives
the good results that a precise estimation would give.

@@ -1563,7 +1564,7 @@ beta * T_data_transfer + gamma * Consumption} , where @code{Consumption}
is the estimated task consumption in Joules. To tune this parameter, use
@code{export STARPU_GAMMA=3000} for instance, to express that each Joule
(i.e kW during 1000us) is worth 3000us execution time penalty. Setting
-alpha and beta to zero permits to only take into account power consumption.
+@code{alpha} and @code{beta} to zero permits taking only power consumption into account.

This is however not sufficient to correctly optimize power: the scheduler would
simply tend to run all computations on the most energy-conservative processing
@@ -1603,7 +1604,7 @@ func <<<grid,block,0,starpu_cuda_get_local_stream()>>> (foo, bar);
cudaStreamSynchronize(starpu_cuda_get_local_stream());
@end example

-Unfortunately, a lot of CUDA libraries do not have stream variants of
+Unfortunately, some CUDA libraries do not have stream variants of
kernels. That will lower the potential for overlapping.

@c ---------------------------------------------------------------------