When a kernel is started while data requests are still pending, the measured
execution time can include the time spent on those data transfers if the
application relies on cudaThreadSynchronize rather than on streams (as is the
case for 99% of our current applications). This workaround yields reasonable
performance models, but it adds a barrier that prevents data transfers from
overlapping with kernels.

Cédric Augonnet, 14 years ago
Commit c384b4f4e7
1 changed file with 7 additions and 0 deletions

src/drivers/cuda/driver_cuda.c (+7 -0)

@@ -166,6 +166,13 @@ static int execute_job_on_cuda(starpu_job_t j, struct starpu_worker_s *args)
 		return -EAGAIN;
 	}
 
+	if (calibrate_model)
+	{
+		cudaError_t cures = cudaThreadSynchronize();
+		if (STARPU_UNLIKELY(cures))
+			STARPU_CUDA_REPORT_ERROR(cures);
+	}
+
 	STARPU_TRACE_START_CODELET_BODY(j);
 
 	struct starpu_task_profiling_info *profiling_info;
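
The effect described in the commit message can be illustrated with a small host-only model. This is a minimal sketch and not StarPU or CUDA code: `fake_device`, `fake_synchronize` and `measure_kernel` are hypothetical names, and the "device" is just a counter of outstanding transfer time. It assumes that the timed region ends with a device-wide synchronization, which therefore also waits for any transfer still in flight.

```c
/* Hypothetical model of a device queue: asynchronously submitted transfers
 * accumulate here until the host waits for them. Not a CUDA or StarPU type. */
struct fake_device {
    double pending_ms; /* outstanding transfer time not yet waited for */
};

/* Stand-in for a device-wide barrier such as cudaThreadSynchronize():
 * waits for all pending work and returns how long the host had to wait. */
double fake_synchronize(struct fake_device *dev)
{
    double waited = dev->pending_ms;
    dev->pending_ms = 0.0;
    return waited;
}

/* Returns the duration that a host-side timer would attribute to the kernel.
 * With calibrate != 0, pending transfers are drained before the timer starts,
 * mirroring the barrier added in the diff above. */
double measure_kernel(struct fake_device *dev, double kernel_ms, int calibrate)
{
    if (calibrate)
        fake_synchronize(dev); /* barrier outside the timed region */

    /* timed region starts here */
    double measured = kernel_ms;
    /* the closing synchronization also waits for leftover transfers */
    measured += fake_synchronize(dev);
    /* timed region ends here */

    return measured;
}
```

With 4 ms of pending transfers and a 10 ms kernel, `measure_kernel` returns 14.0 without the barrier and 10.0 with it. The second figure is the one a performance model should record, at the cost of serializing transfers and kernel execution while calibrating.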