@@ -1344,7 +1344,8 @@ tries to minimize is @code{alpha * T_execution + beta * T_data_transfer}, where
@code{T_execution} is the estimated execution time of the codelet (usually
accurate), and @code{T_data_transfer} is the estimated data transfer time. The
latter is however estimated based on bus calibration before execution start,
-i.e. with an idle machine. When StarPU manages several GPUs, such estimation
+i.e. with an idle machine. You can force bus re-calibration by running
+@code{starpu_calibrate_bus}. When StarPU manages several GPUs, such estimation
is not accurate any more. Beta can then be used to correct this by hand. For
instance, you can use @code{export STARPU_BETA=2} to double the transfer
time estimation, e.g. because there are two GPUs in the machine. This is of
@@ -1390,6 +1391,7 @@ next chapter.
@menu
* On-line:: On-line performance feedback
* Off-line:: Off-line performance feedback
+* Codelet performance:: Performance of codelets
@end menu
@node On-line
@@ -1600,6 +1602,45 @@ evolution of the number of tasks available in the system during the execution.
Ready tasks are shown in black, and tasks that are submitted but not
schedulable yet are shown in grey.
+@node Codelet performance
+@section Performance of codelets
+
+The performance model of codelets can be examined by using the
+@code{starpu_perfmodel_display} tool:
+
+@example
+$ starpu_perfmodel_display -l
+file: <malloc_pinned.hannibal>
+file: <starpu_slu_lu_model_21.hannibal>
+file: <starpu_slu_lu_model_11.hannibal>
+file: <starpu_slu_lu_model_22.hannibal>
+file: <starpu_slu_lu_model_12.hannibal>
+@end example
+
+Here, the codelets of the @code{lu} example are available. We can examine
+the performance of the @code{22} kernel:
+
+@example
+$ starpu_perfmodel_display -s starpu_slu_lu_model_22
+performance model for cpu
+# hash     size       mean           dev            n
+57618ab0   19660800   2.851069e+05   1.829369e+04   109
+performance model for cuda_0
+# hash     size       mean           dev            n
+57618ab0   19660800   1.164144e+04   1.556094e+01   315
+performance model for cuda_1
+# hash     size       mean           dev            n
+57618ab0   19660800   1.164271e+04   1.330628e+01   360
+performance model for cuda_2
+# hash     size       mean           dev            n
+57618ab0   19660800   1.166730e+04   3.390395e+02   456
+@end example
+
+We can see that for the given size, over a sample of a few hundred
+executions, the GPUs are more than 20 times faster than the CPUs (the times
+are in microseconds). The standard deviation is extremely low for the GPUs,
+and less than 10% for the CPUs.
+
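+Such a model is attached to a codelet by declaring a performance model
+structure with a @code{symbol} name; the files listed above combine that
+symbol with the host name. As a purely illustrative sketch (a hypothetical
+@code{scal} codelet, not the actual code of the @code{lu} example; the exact
+type and field names may vary between StarPU versions), a history-based
+model is typically declared along these lines:
+
+@example
+/* hypothetical history-based model: one entry per observed data size */
+static struct starpu_perfmodel_t scal_model = @{
+    .type = STARPU_HISTORY_BASED,
+    /* symbol used to name the <symbol>.<hostname> files shown above */
+    .symbol = "scal_model"
+@};
+
+static starpu_codelet scal_cl = @{
+    .where = STARPU_CPU,
+    /* scal_cpu_func is the (hypothetical) CPU implementation */
+    .cpu_func = scal_cpu_func,
+    .nbuffers = 1,
+    .model = &scal_model
+@};
+@end example
+
+Once the application has run, the measurements collected for this model can
+be inspected with @code{starpu_perfmodel_display -s scal_model}, as shown
+above.
+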
@c ---------------------------------------------------------------------
@c MPI support
@c ---------------------------------------------------------------------