
When a kernel is started while data requests are still pending, the measured
execution time can include the time of those data transfers if the application
synchronizes with cudaThreadSynchronize instead of using streams (as is the
case with 99% of our current applications). This workaround yields reasonable
performance models, but it adds a barrier that prevents data transfers from
overlapping with kernels.

Cédric Augonnet, 14 years ago
Parent
Current commit
c384b4f4e7
1 file changed, with 7 insertions and 0 deletions
  1. src/drivers/cuda/driver_cuda.c (+7, -0)

src/drivers/cuda/driver_cuda.c

@@ -166,6 +166,13 @@ static int execute_job_on_cuda(starpu_job_t j, struct starpu_worker_s *args)
 		return -EAGAIN;
 	}
 
+	if (calibrate_model)
+	{
+		cudaError_t cures = cudaThreadSynchronize();
+		if (STARPU_UNLIKELY(cures))
+			STARPU_CUDA_REPORT_ERROR(cures);
+	}
+
 	STARPU_TRACE_START_CODELET_BODY(j);
 
 	struct starpu_task_profiling_info *profiling_info;