When a kernel is started while data requests are still pending, the measured
execution time can include the time of those data transfers if the application
relies on cudaThreadSynchronize rather than streams (as is the case for 99% of
our current applications). This workaround yields reasonable performance
models, but it adds a barrier that prevents data transfers from overlapping
with kernels.

Cédric Augonnet, 14 years ago
commit
c384b4f4e7
Showing 1 changed file with 7 additions and 0 deletions

+ 7 - 0
src/drivers/cuda/driver_cuda.c

@@ -166,6 +166,13 @@ static int execute_job_on_cuda(starpu_job_t j, struct starpu_worker_s *args)
 		return -EAGAIN;
 	}
 
+	if (calibrate_model)
+	{
+		cudaError_t cures = cudaThreadSynchronize();
+		if (STARPU_UNLIKELY(cures))
+			STARPU_CUDA_REPORT_ERROR(cures);
+	}
+
 	STARPU_TRACE_START_CODELET_BODY(j);
 
 	struct starpu_task_profiling_info *profiling_info;
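
The effect described in the commit message can be illustrated with a toy model. This is only a sketch and is not StarPU or CUDA code: `toy_device`, `toy_submit`, `toy_synchronize` and `toy_measure_kernel` are hypothetical names, and the device is assumed to execute queued work strictly in submission order, as a single default stream would.

```c
#include <assert.h>

/* Toy model of an in-order device queue. All names here are hypothetical
 * and times are in microseconds; no real CUDA call is made. */
struct toy_device {
    double now;        /* current host time */
    double busy_until; /* time at which all queued device work completes */
};

/* Queue asynchronous work (a data transfer or a kernel) of a given duration.
 * The host does not wait; the work starts once earlier work has finished. */
static void toy_submit(struct toy_device *d, double duration)
{
    assert(duration >= 0.0);
    double start = d->busy_until > d->now ? d->busy_until : d->now;
    d->busy_until = start + duration;
}

/* Block the host until the queue is empty
 * (stand-in for cudaThreadSynchronize). */
static void toy_synchronize(struct toy_device *d)
{
    if (d->busy_until > d->now)
        d->now = d->busy_until;
}

/* Time a kernel the way a worker would: timestamp, launch, wait, timestamp.
 * With calibrate != 0, pending work is drained before the first timestamp. */
static double toy_measure_kernel(struct toy_device *d, double kernel_us,
                                 int calibrate)
{
    if (calibrate)
        toy_synchronize(d); /* barrier: transfers finish before timing starts */

    double t0 = d->now;
    toy_submit(d, kernel_us);
    toy_synchronize(d);
    return d->now - t0;
}
```

In this model, queuing a 300 us transfer with `toy_submit` and then calling `toy_measure_kernel` for a 100 us kernel returns 400 without the barrier and 100 with it. The cost of the barrier is also visible: the host stalls until the transfer completes, so the transfer can no longer overlap with the kernel launch.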