
- merge trunk

Olivier Aumage, 11 years ago
commit fef05954ab

+ 3 - 0
ChangeLog

@@ -49,6 +49,7 @@ New features:
   * Add paje traces statistics tools.
   * Add CUDA concurrent kernel execution support through
     the STARPU_NWORKER_PER_CUDA environment variable.
+  * Use streams for GPUA->GPUB and GPUB->GPUA transfers.
 
 Small features:
   * New functions starpu_data_acquire_cb_sequential_consistency() and
@@ -340,6 +341,8 @@ Small changes:
   * Use C99 variadic macro support, not GNU.
   * Fix performance regression: dmda queues were inadvertently made
     LIFOs in r9611.
+  * Use big fat abortions when one tries to make a task or callback
+    sleep, instead of just returning EDEADLK which few people will test.
   * By default, StarPU FFT examples are not compiled and checked, the
     configure option --enable-starpufft-examples needs to be specified
     to change this behaviour.
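
One of the entries above introduces the <c>STARPU_NWORKER_PER_CUDA</c>
environment variable. A minimal sketch of how it can be used, assuming the
variable is set before StarPU is initialized (the value 4 is an arbitrary
example; exporting the variable from the shell works just as well):

\code{.c}
#include <stdlib.h>
#include <starpu.h>

int main(void)
{
	/* Drive each CUDA device with 4 workers so that up to 4 kernels
	 * can execute concurrently on the same GPU. */
	setenv("STARPU_NWORKER_PER_CUDA", "4", 1);

	if (starpu_init(NULL) != 0)
		return 1;

	/* ... submit CUDA tasks as usual ... */

	starpu_shutdown();
	return 0;
}
\endcode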

+ 10 - 8
doc/doxygen/chapters/08scheduling.doxy

@@ -33,22 +33,24 @@ worker.
 
 The <b>dm</b> (deque model) scheduler takes task execution performance models into
 account to perform a HEFT-similar scheduling strategy: it schedules tasks where their
-termination time will be minimal.
+termination time will be minimal. The difference from HEFT is that tasks are
+scheduled in the order in which they become available.
 
-The <b>dmda</b> (deque model data aware) scheduler is similar to dm, it also takes
+The <b>dmda</b> (deque model data aware) scheduler is similar to dm, but it also takes
 into account data transfer time.
 
 The <b>dmdar</b> (deque model data aware ready) scheduler is similar to dmda, but
 it also sorts tasks on per-worker queues by the number of already-available data
-buffers.
+buffers on the target device.
 
-The <b>dmdas</b> (deque model data aware sorted) scheduler is similar to dmda, it
-also supports arbitrary priority values.
+The <b>dmdas</b> (deque model data aware sorted) scheduler is similar to dmdar,
+except that it sorts tasks by priority order, which allows it to get even closer
+to HEFT.
 
-The <b>heft</b> (heterogeneous earliest finish time) scheduler is deprecated. It
-is now just an alias for <b>dmda</b>.
+The <b>heft</b> (heterogeneous earliest finish time) scheduler is a deprecated
+alias for <b>dmda</b>.
 
-The <b>pheft</b> (parallel HEFT) scheduler is similar to heft, it also supports
-The <b>pheft</b> (parallel HEFT) scheduler is similar to heft, it also supports
+The <b>pheft</b> (parallel HEFT) scheduler is similar to dmda, but it also supports
 parallel tasks (still experimental). It should not be used when several contexts
 using it are being executed simultaneously.
 

+ 33 - 0
doc/doxygen/chapters/16mpi_support.doxy

@@ -320,6 +320,39 @@ application can prune the task for loops according to the data distribution,
 so as to only submit tasks on nodes which have to care about them (either to
 execute them, or to send the required data).
 
+An easy way to achieve some of this pruning is simply to add an <c>if</c> test
+like this:
+
+\code{.c}
+    for(loop=0 ; loop<niter; loop++)
+        for (x = 1; x < X-1; x++)
+            for (y = 1; y < Y-1; y++)
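+                /* submit only from nodes that own one of the five data pieces involved */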
+                if (my_distrib(x,y,size) == my_rank
+                 || my_distrib(x-1,y,size) == my_rank
+                 || my_distrib(x+1,y,size) == my_rank
+                 || my_distrib(x,y-1,size) == my_rank
+                 || my_distrib(x,y+1,size) == my_rank)
+                    starpu_mpi_insert_task(MPI_COMM_WORLD, &stencil5_cl,
+                                           STARPU_RW, data_handles[x][y],
+                                           STARPU_R, data_handles[x-1][y],
+                                           STARPU_R, data_handles[x+1][y],
+                                           STARPU_R, data_handles[x][y-1],
+                                           STARPU_R, data_handles[x][y+1],
+                                           0);
+    starpu_task_wait_for_all();
+\endcode
+
+This avoids the cost of function call argument passing and parsing for the
+tasks which the node does not need to care about.
+
+If the compiler can inline the <c>my_distrib</c> function, it can optimize
+the test.
+
+If <c>size</c> can be made a compile-time constant, the compiler can optimize
+the test considerably further.
+
+If the distribution function is not too complex and the compiler is very good,
+it can even optimize the <c>for</c> loops themselves, thus dramatically reducing
+the cost of task submission.
+
 A function starpu_mpi_task_build() is also provided with the aim to
 only construct the task structure. All MPI nodes need to call the
 function, only the node which is to execute the task will return a
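
To illustrate the inlining and compile-time-constant points made in this hunk,
here is a hypothetical sketch; the body of <c>my_distrib</c> and the node count
are assumptions for illustration, not the actual stencil5 code:

\code{.c}
/* Hypothetical distribution function; the real my_distrib differs. */
#define NB_NODES 4	/* node count fixed at compile time */

static inline int my_distrib(int x, int y, int size)
{
	(void) y;
	(void) size;	/* the node count is now a compile-time constant */
	/* 1D block-cyclic distribution of rows: with NB_NODES a power of
	 * two, the compiler reduces this modulo to a simple bit mask. */
	return x % NB_NODES;
}
\endcode

With such a definition, each of the five <c>my_distrib</c> calls in the pruning
test compiles down to a few instructions, which is what keeps the cost of the
task-submission loops low.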

+ 1 - 1
src/core/workers.c

@@ -326,7 +326,7 @@ int starpu_combined_worker_can_execute_task(unsigned workerid, struct starpu_tas
  * Runtime initialization methods
  */
 
-#ifdef STARPU_USE_CUDA
+#if defined(STARPU_USE_CUDA) || defined(STARPU_SIMGRID)
 static struct _starpu_worker_set cuda_worker_set[STARPU_MAXCUDADEVS];
 #endif
 #ifdef STARPU_USE_MIC

+ 5 - 0
tools/Makefile.am

@@ -29,6 +29,11 @@ dist_pkgdata_DATA = gdbinit
 EXTRA_DIST =				\
 	dev/rename.sed			\
 	dev/rename.sh			\
+	valgrind/hwloc.suppr		\
+	valgrind/libnuma.suppr		\
+	valgrind/openmpi.suppr		\
+	valgrind/pthread.suppr		\
+	valgrind/starpu.suppr		\
 	msvc/starpu_clean.bat		\
 	msvc/starpu_open.bat		\
 	msvc/starpu_exec.bat		\