소스 검색

When starting a kernel while there remains pending data requests, the measured
execution time can integrate that of the data transfers in case the application
is not using streams instead of cudaThreadSynchronize (as this is the case with
99% of our current applications). This workaround permits to have reasonnable
performance models, but adds a barrier that avoids to overlap data transfers
with kernels.

Cédric Augonnet 14 년 전
부모
커밋
c384b4f4e7
1개의 변경된 파일7개의 추가작업 그리고 0개의 파일을 삭제
  1. 7 0
      src/drivers/cuda/driver_cuda.c

+ 7 - 0
src/drivers/cuda/driver_cuda.c

@@ -166,6 +166,13 @@ static int execute_job_on_cuda(starpu_job_t j, struct starpu_worker_s *args)
 		return -EAGAIN;
 	}
 
+	if (calibrate_model)
+	{
+		cudaError_t cures = cudaThreadSynchronize();
+		if (STARPU_UNLIKELY(cures))
+			STARPU_CUDA_REPORT_ERROR(cures);
+	}
+
 	STARPU_TRACE_START_CODELET_BODY(j);
 
 	struct starpu_task_profiling_info *profiling_info;