When a kernel is started while data requests are still pending, the measured execution time can also include that of the data transfers if the application does not use streams; in that case a cudaThreadSynchronize is issued instead (which is the case for 99% of our current applications). This workaround keeps the performance models reasonable, but it adds a barrier that prevents data transfers from being overlapped with kernel execution.
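
A minimal sketch of the kind of measurement this describes, assuming the barrier is a cudaThreadSynchronize placed before the timed kernel; the kernel name (scale_kernel), the sizes, and the event-based timing are illustrative, not taken from our code:

```
#include <cuda_runtime.h>
#include <stdio.h>

/* Hypothetical kernel used only to illustrate the timing. */
__global__ void scale_kernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= 2.0f;
}

int main(void)
{
    const int n = 1 << 20;
    float *h, *d;
    cudaMallocHost((void **)&h, n * sizeof(float));   /* pinned host buffer */
    cudaMalloc((void **)&d, n * sizeof(float));
    for (int i = 0; i < n; i++)
        h[i] = 1.0f;

    /* A data request that may still be in flight on the default stream. */
    cudaMemcpyAsync(d, h, n * sizeof(float), cudaMemcpyHostToDevice, 0);

    /* Workaround (assumed placement): flush pending transfers before timing,
     * so the measured time covers the kernel only.  cudaThreadSynchronize()
     * is the legacy equivalent of cudaDeviceSynchronize(); this is also the
     * barrier that prevents overlapping transfers with kernels. */
    cudaThreadSynchronize();

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    scale_kernel<<<(n + 255) / 256, 256>>>(d, n);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float ms;
    cudaEventElapsedTime(&ms, start, stop);
    printf("measured kernel time: %f ms\n", ms);   /* value fed to the performance model */

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d);
    cudaFreeHost(h);
    return 0;
}
```

Without the cudaThreadSynchronize, the asynchronous copy and the kernel would both be queued on the default stream, so the time between the two events could also cover the tail of the transfer, which is exactly the pollution of the performance model described above.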