Ver código fonte

When starting a kernel while there remains pending data requests, the measured
execution time can integrate that of the data transfers in case the application
is not using streams instead of cudaThreadSynchronize (as this is the case with
99% of our current applications). This workaround permits to have reasonnable
performance models, but adds a barrier that avoids to overlap data transfers
with kernels.

Cédric Augonnet 14 anos atrás
pai
commit
c384b4f4e7
1 arquivos alterados com 7 adições e 0 exclusões
  1. 7 0
      src/drivers/cuda/driver_cuda.c

+ 7 - 0
src/drivers/cuda/driver_cuda.c

@@ -166,6 +166,13 @@ static int execute_job_on_cuda(starpu_job_t j, struct starpu_worker_s *args)
 		return -EAGAIN;
 	}
 
+	if (calibrate_model)
+	{
+		cudaError_t cures = cudaThreadSynchronize();
+		if (STARPU_UNLIKELY(cures))
+			STARPU_CUDA_REPORT_ERROR(cures);
+	}
+
 	STARPU_TRACE_START_CODELET_BODY(j);
 
 	struct starpu_task_profiling_info *profiling_info;