Add asynchronous partition planning. It only supports coherency through the main RAM for now.

Samuel Thibault 9 years ago
parent
commit
68605a2f15

+ 2 - 0
ChangeLog

@@ -107,6 +107,8 @@ New features:
   * Add Fortran 90 module and example using it
   * New StarPU-MPI gdb debug functions
   * Generate animated html trace of modular schedulers.
+  * Add asynchronous partition planning. It only supports coherency through
+    the main RAM for now.
 
 Small features:
   * Tasks can now have a name (via the field const char *name of

+ 56 - 3
doc/doxygen/chapters/07data_management.doxy

@@ -191,10 +191,63 @@ StarPU provides various interfaces and filters for matrices, vectors, etc.,
 but applications can also write their own data interfaces and filters, see
 <c>examples/interface</c> and <c>examples/filters/custom_mf</c> for an example.
 
-\section MultipleView Multiple views
+\section AsynchronousPartitioning Asynchronous Partitioning
 
-Partitioning is synchronous, which can be a problem for dynamic applications.
-Another way is to register several views on the same piece of data.
+The partitioning functions described in the previous section are synchronous:
+starpu_data_partition and starpu_data_unpartition both wait for all the tasks
+currently working on the data to complete. This can be a bottleneck for the
+application.
+
+An asynchronous API also exists; it only works on handles with sequential
+consistency enabled. The principle is to first plan the partitioning, which
+returns data handles for the pieces; these handles are not usable yet. Along
+with other task submissions, one can then submit the actual partitioning and
+use the handles of the pieces. Before using the handle of the whole data again,
+one has to submit the unpartitioning. <c>fmultiple_submit</c> is a complete
+example using this technique.
+
+In short, we first register a matrix and plan the partitioning:
+
+\code{.c}
+starpu_data_handle_t vert_handle[PARTS];
+starpu_matrix_data_register(&handle, STARPU_MAIN_RAM, (uintptr_t)matrix, NX, NX, NY, sizeof(matrix[0]));
+struct starpu_data_filter f_vert =
+{
+	.filter_func = starpu_matrix_filter_block,
+	.nchildren = PARTS
+};
+starpu_data_partition_plan(handle, &f_vert, vert_handle);
+\endcode
+
+starpu_data_partition_plan returns the handles for the partition in vert_handle.
+
+One can submit tasks working on the main handle, but not yet on the vert_handle
+handles. Now we submit the partitioning:
+
+\code{.c}
+starpu_data_partition_submit(handle, PARTS, vert_handle);
+\endcode
+
+And now we can submit tasks working on vert_handle handles (and not on the main
+handle any more). Eventually we want to work on the main handle again, so we
+submit the unpartitioning:
+
+\code{.c}
+starpu_data_unpartition_submit(handle, PARTS, vert_handle, -1);
+\endcode
+
+And now we can submit tasks working on the main handle again.
+
+All this code is asynchronous: it only submits which tasks, partitioning, and
+unpartitioning operations should be performed at runtime.
+
+Planning several partitionings of the same data is also possible; one just has
+to submit the unpartitioning (to get back to the initial handle) before
+submitting another partitioning.
+
+It is also possible to activate several partitionings at the same time, in
+read-only mode.
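+
+For instance, following the <c>fmultiple_submit</c> example, one can alternate
+between two partitioning plans of the same matrix (this sketch assumes a second
+plan <c>horiz_handle</c> was also created with starpu_data_partition_plan):
+
+\code{.c}
+starpu_data_partition_submit(handle, PARTS, vert_handle);
+/* ... submit tasks working on the vert_handle handles ... */
+starpu_data_unpartition_submit(handle, PARTS, vert_handle, -1);
+
+starpu_data_partition_submit(handle, PARTS, horiz_handle);
+/* ... submit tasks working on the horiz_handle handles ... */
+starpu_data_unpartition_submit(handle, PARTS, horiz_handle, -1);
+\endcode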
+
+\section ManualPartitioning Manual Partitioning
+
+One can also handle partitioning by hand, by registering several views on the
+same piece of data. The idea is then to manage the coherency of the various
+views through the common buffer in the main memory.
 <c>fmultiple_manual</c> is a complete example using this technique.
 
 In short, we first register the same matrix several times:

+ 95 - 3
doc/doxygen/chapters/api/data_partition.doxy

@@ -43,9 +43,7 @@ Here an example of how to use the function.
 \code{.c}
 struct starpu_data_filter f = {
         .filter_func = starpu_matrix_filter_block,
-        .nchildren = nslicesx,
-        .get_nchildren = NULL,
-        .get_child_ops = NULL
+        .nchildren = nslicesx
 };
 starpu_data_partition(A_handle, &f);
 \endcode
@@ -103,6 +101,100 @@ Applies \p nfilters filters to the handle designated by
 \p root_handle recursively. It uses a va_list of pointers to variables of
 the type starpu_data_filter.
 
+@name Asynchronous API
+\ingroup API_Data_Partition
+
+\fn void starpu_data_partition_plan(starpu_data_handle_t initial_handle, struct starpu_data_filter *f, starpu_data_handle_t *children)
+\ingroup API_Data_Partition
+This plans the partitioning of the StarPU data handle \p initial_handle into
+several pieces of subdata according to the filter \p f. The handles are
+returned in the \p children array, which has to be the same size as the number
+of parts described in \p f. These handles are not immediately usable;
+starpu_data_partition_submit has to be called to submit the actual partitioning.
+
+Here is an example of how to use the function:
+
+\code{.c}
+starpu_data_handle_t children[nslicesx];
+struct starpu_data_filter f = {
+        .filter_func = starpu_matrix_filter_block,
+        .nchildren = nslicesx
+};
+starpu_data_partition_plan(A_handle, &f, children);
+\endcode
+
+\fn void starpu_data_partition_submit(starpu_data_handle_t initial_handle, unsigned nparts, starpu_data_handle_t *children)
+\ingroup API_Data_Partition
+
+This submits the actual partitioning of \p initial_handle into the \p nparts
+\p children handles. This call is asynchronous: it only submits that the
+partitioning should be done. After this call, the \p children handles can be
+used to submit tasks, while \p initial_handle can not be used to submit tasks
+any more (to guarantee coherency).
+
+For instance,
+
+\code{.c}
+starpu_data_partition_submit(A_handle, nslicesx, children);
+\endcode
+
+\fn void starpu_data_partition_readonly_submit(starpu_data_handle_t initial_handle, unsigned nparts, starpu_data_handle_t *children)
+\ingroup API_Data_Partition
+
+This is the same as starpu_data_partition_submit, but does not invalidate \p
+initial_handle. This allows the application to keep using it, provided it is
+careful not to write to \p initial_handle or to the \p children handles, and
+only reads from them, since coherency is otherwise not guaranteed. This thus
+makes it possible to submit tasks which concurrently read from various
+partitionings of the data.
+
+When the application wants to write to \p initial_handle again, it should call
+starpu_data_unpartition_submit, which will properly add dependencies between the
+reads on the \p children and the writes to be submitted.
+
+If instead the application wants to write to \p children handles, it should
+call starpu_data_partition_readwrite_upgrade_submit, which will properly add
+dependencies between the reads on the \p initial_handle and the writes to be
+submitted.
+
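+For instance, reusing the \p children handles planned in the
+starpu_data_partition_plan example above:
+
+\code{.c}
+starpu_data_partition_readonly_submit(A_handle, nslicesx, children);
+\endcode
+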
+\fn void starpu_data_partition_readwrite_upgrade_submit(starpu_data_handle_t initial_handle, unsigned nparts, starpu_data_handle_t *children)
+\ingroup API_Data_Partition
+
+This assumes that a partitioning of \p initial_handle has already been submitted
+in read-only mode through starpu_data_partition_readonly_submit, and upgrades
+that partitioning to read-write mode for the \p children, by invalidating \p
+initial_handle and adding the necessary dependencies.
+
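+For instance, upgrading the read-only partitioning submitted above:
+
+\code{.c}
+starpu_data_partition_readwrite_upgrade_submit(A_handle, nslicesx, children);
+\endcode
+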
+\fn void starpu_data_unpartition_submit(starpu_data_handle_t initial_handle, unsigned nparts, starpu_data_handle_t *children, int gathering_node)
+\ingroup API_Data_Partition
+
+This assumes that \p initial_handle is partitioned into \p children, and submits
+an unpartitioning of it, i.e. it submits a gathering of the pieces on the
+requested \p gathering_node memory node, and an invalidation of the
+\p children.
+
+\p gathering_node can be set to -1 to let the runtime decide which memory node
+should be used to gather the pieces.
+
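+For instance, letting StarPU pick the gathering node:
+
+\code{.c}
+starpu_data_unpartition_submit(A_handle, nslicesx, children, -1);
+\endcode
+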
+\fn void starpu_data_unpartition_readonly_submit(starpu_data_handle_t initial_handle, unsigned nparts, starpu_data_handle_t *children, int gathering_node)
+\ingroup API_Data_Partition
+
+This assumes that \p initial_handle is partitioned into \p children, and submits
+just a read-only unpartitioning of it, i.e. it submits a gathering of the pieces
+on the requested \p gathering_node memory node, without invalidating the
+\p children. This brings \p initial_handle and the \p children handles to the
+same state as obtained with starpu_data_partition_readonly_submit.
+
+\p gathering_node can be set to -1 to let the runtime decide which memory node
+should be used to gather the pieces.
+
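+For instance:
+
+\code{.c}
+starpu_data_unpartition_readonly_submit(A_handle, nslicesx, children, -1);
+\endcode
+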
+\fn void starpu_data_partition_clean(starpu_data_handle_t root_data, unsigned nparts, starpu_data_handle_t *children)
+\ingroup API_Data_Partition
+
+This should be used to clear the partition planning established between \p
+root_data and \p children with starpu_data_partition_plan. This will notably
+submit an unregistration of all the \p children, which can thus not be used
+any more afterwards.
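+
+For instance, once the application is done with the plan:
+
+\code{.c}
+starpu_data_partition_clean(A_handle, nslicesx, children);
+\endcode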
+
 @name Predefined Vector Filter Functions
 \ingroup API_Data_Partition
 

+ 18 - 0
examples/Makefile.am

@@ -183,6 +183,8 @@ STARPU_EXAMPLES =				\
 	filters/fblock				\
 	filters/fmatrix				\
 	filters/fmultiple_manual		\
+	filters/fmultiple_submit		\
+	filters/fmultiple_submit_readonly	\
 	tag_example/tag_example			\
 	tag_example/tag_example2		\
 	tag_example/tag_example3		\
@@ -430,6 +432,22 @@ filters_fmultiple_manual_SOURCES +=		\
 	filters/fmultiple_cuda.cu
 endif
 
+filters_fmultiple_submit_SOURCES =		\
+	filters/fmultiple_submit.c
+
+if STARPU_USE_CUDA
+filters_fmultiple_submit_SOURCES +=		\
+	filters/fmultiple_cuda.cu
+endif
+
+filters_fmultiple_submit_readonly_SOURCES =	\
+	filters/fmultiple_submit_readonly.c
+
+if STARPU_USE_CUDA
+filters_fmultiple_submit_readonly_SOURCES +=	\
+	filters/fmultiple_cuda.cu
+endif
+
 examplebin_PROGRAMS +=				\
 	filters/shadow				\
 	filters/shadow2d			\

+ 28 - 1
examples/filters/fmultiple_cuda.cu

@@ -18,7 +18,7 @@
 
 #include <starpu.h>
 
-static __global__ void _fmultiple_check_cuda(int *val, int nx, int ny, unsigned ld, int start, int factor)
+static __global__ void _fmultiple_check_scale_cuda(int *val, int nx, int ny, unsigned ld, int start, int factor)
 {
         int i, j;
 	for(j=0; j<ny ; j++)
@@ -32,6 +32,33 @@ static __global__ void _fmultiple_check_cuda(int *val, int nx, int ny, unsigned
         }
 }
 
+extern "C" void fmultiple_check_scale_cuda(void *buffers[], void *cl_arg)
+{
+	int start, factor;
+	int nx = (int)STARPU_MATRIX_GET_NX(buffers[0]);
+	int ny = (int)STARPU_MATRIX_GET_NY(buffers[0]);
+        unsigned ld = STARPU_MATRIX_GET_LD(buffers[0]);
+	int *val = (int *)STARPU_MATRIX_GET_PTR(buffers[0]);
+
+	starpu_codelet_unpack_args(cl_arg, &start, &factor);
+
+        /* TODO: use more vals and threads in vals */
+	_fmultiple_check_scale_cuda<<<1,1, 0, starpu_cuda_get_local_stream()>>>(val, nx, ny, ld, start, factor);
+}
+
+static __global__ void _fmultiple_check_cuda(int *val, int nx, int ny, unsigned ld, int start, int factor)
+{
+        int i, j;
+	for(j=0; j<ny ; j++)
+	{
+		for(i=0; i<nx ; i++)
+		{
+			if (val[(j*ld)+i] != start + factor*(i+100*j))
+				asm("trap;");
+		}
+        }
+}
+
 extern "C" void fmultiple_check_cuda(void *buffers[], void *cl_arg)
 {
 	int start, factor;

+ 12 - 11
examples/filters/fmultiple_manual.c

@@ -16,9 +16,10 @@
 
 /*
  * This examplifies how to access the same matrix with different partitioned
- * views.
+ * views, doing the coherency by hand.
  * We first run a kernel on the whole matrix to fill it, then run a kernel on
- * each vertical slice, then run a kernel on each horizontal slice.
+ * each vertical slice to check the value and multiply it by two, then run a
+ * kernel on each horizontal slice to do the same.
  */
 
 #include <starpu.h>
@@ -55,7 +56,7 @@ struct starpu_codelet cl_fill =
 	.name = "matrix_fill"
 };
 
-void fmultiple_check(void *buffers[], void *cl_arg)
+void fmultiple_check_scale(void *buffers[], void *cl_arg)
 {
 	int start, factor;
 	unsigned i, j;
@@ -79,21 +80,21 @@ void fmultiple_check(void *buffers[], void *cl_arg)
 }
 
 #ifdef STARPU_USE_CUDA
-extern void fmultiple_check_cuda(void *buffers[], void *cl_arg);
+extern void fmultiple_check_scale_cuda(void *buffers[], void *cl_arg);
 #endif
-struct starpu_codelet cl_check =
+struct starpu_codelet cl_check_scale =
 {
 #ifdef STARPU_USE_CUDA
-	.cuda_funcs = {fmultiple_check_cuda},
+	.cuda_funcs = {fmultiple_check_scale_cuda},
 	.cuda_flags = {STARPU_CUDA_ASYNC},
 #else
 	/* Only enable it on CPUs if we don't have a CUDA device, to force remote execution on the CUDA device */
-	.cpu_funcs = {fmultiple_check},
-	.cpu_funcs_name = {"fmultiple_check"},
+	.cpu_funcs = {fmultiple_check_scale},
+	.cpu_funcs_name = {"fmultiple_check_scale"},
 #endif
 	.nbuffers = 1,
 	.modes = {STARPU_RW},
-	.name = "fmultiple_check"
+	.name = "fmultiple_check_scale"
 };
 
 void empty(void *buffers[] STARPU_ATTRIBUTE_UNUSED, void *cl_arg STARPU_ATTRIBUTE_UNUSED)
@@ -169,7 +170,7 @@ int main(int argc, char **argv)
 	{
 		int factor = 1;
 		int start = i*(NX/PARTS);
-		ret = starpu_task_insert(&cl_check,
+		ret = starpu_task_insert(&cl_check_scale,
 				STARPU_RW, vert_handle[i],
 				STARPU_VALUE, &start, sizeof(start),
 				STARPU_VALUE, &factor, sizeof(factor),
@@ -206,7 +207,7 @@ int main(int argc, char **argv)
 	{
 		int factor = 2;
 		int start = factor*100*i*(NY/PARTS);
-		ret = starpu_task_insert(&cl_check,
+		ret = starpu_task_insert(&cl_check_scale,
 				STARPU_RW, horiz_handle[i],
 				STARPU_VALUE, &start, sizeof(start),
 				STARPU_VALUE, &factor, sizeof(factor),

+ 207 - 0
examples/filters/fmultiple_submit.c

@@ -0,0 +1,207 @@
+/* StarPU --- Runtime system for heterogeneous multicore architectures.
+ *
+ * Copyright (C) 2015  Université Bordeaux
+ *
+ * StarPU is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU Lesser General Public License as published by
+ * the Free Software Foundation; either version 2.1 of the License, or (at
+ * your option) any later version.
+ *
+ * StarPU is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
+ *
+ * See the GNU Lesser General Public License in COPYING.LGPL for more details.
+ */
+
+/*
+ * This exemplifies how to access the same matrix with different partitioned
+ * views, doing the coherency through partition planning.
+ * We first run a kernel on the whole matrix to fill it, then run a kernel on
+ * each vertical slice to check the value and multiply it by two, then run a
+ * kernel on each horizontal slice to do the same.
+ */
+
+#include <starpu.h>
+
+#define NX    6
+#define NY    6
+#define PARTS 2
+
+#define FPRINTF(ofile, fmt, ...) do { if (!getenv("STARPU_SSILENT")) {fprintf(ofile, fmt, ## __VA_ARGS__); }} while(0)
+
+void matrix_fill(void *buffers[], void *cl_arg STARPU_ATTRIBUTE_UNUSED)
+{
+	unsigned i, j;
+
+	/* length of the matrix */
+	unsigned nx = STARPU_MATRIX_GET_NX(buffers[0]);
+	unsigned ny = STARPU_MATRIX_GET_NY(buffers[0]);
+	unsigned ld = STARPU_MATRIX_GET_LD(buffers[0]);
+	int *val = (int *)STARPU_MATRIX_GET_PTR(buffers[0]);
+
+	for(j=0; j<ny ; j++)
+	{
+		for(i=0; i<nx ; i++)
+			val[(j*ld)+i] = i+100*j;
+	}
+}
+
+struct starpu_codelet cl_fill =
+{
+	.cpu_funcs = {matrix_fill},
+	.cpu_funcs_name = {"matrix_fill"},
+	.nbuffers = 1,
+	.modes = {STARPU_W},
+	.name = "matrix_fill"
+};
+
+void fmultiple_check_scale(void *buffers[], void *cl_arg)
+{
+	int start, factor;
+	unsigned i, j;
+
+	/* length of the matrix */
+	unsigned nx = STARPU_MATRIX_GET_NX(buffers[0]);
+	unsigned ny = STARPU_MATRIX_GET_NY(buffers[0]);
+	unsigned ld = STARPU_MATRIX_GET_LD(buffers[0]);
+	int *val = (int *)STARPU_MATRIX_GET_PTR(buffers[0]);
+
+	starpu_codelet_unpack_args(cl_arg, &start, &factor);
+
+	for(j=0; j<ny ; j++)
+	{
+		for(i=0; i<nx ; i++)
+		{
+			STARPU_ASSERT(val[(j*ld)+i] == start + factor*((int)(i+100*j)));
+			val[(j*ld)+i] *= 2;
+		}
+	}
+}
+
+#ifdef STARPU_USE_CUDA
+extern void fmultiple_check_scale_cuda(void *buffers[], void *cl_arg);
+#endif
+struct starpu_codelet cl_check_scale =
+{
+#ifdef STARPU_USE_CUDA
+	.cuda_funcs = {fmultiple_check_scale_cuda},
+	.cuda_flags = {STARPU_CUDA_ASYNC},
+#else
+	/* Only enable it on CPUs if we don't have a CUDA device, to force remote execution on the CUDA device */
+	.cpu_funcs = {fmultiple_check_scale},
+	.cpu_funcs_name = {"fmultiple_check_scale"},
+#endif
+	.nbuffers = 1,
+	.modes = {STARPU_RW},
+	.name = "fmultiple_check_scale"
+};
+
+int main(int argc, char **argv)
+{
+	unsigned j, n=1;
+	int matrix[NX][NY];
+	int ret, i;
+
+	/* We haven't taken care otherwise */
+	STARPU_ASSERT((NX%PARTS) == 0);
+	STARPU_ASSERT((NY%PARTS) == 0);
+
+	starpu_data_handle_t handle;
+	starpu_data_handle_t vert_handle[PARTS];
+	starpu_data_handle_t horiz_handle[PARTS];
+
+	ret = starpu_init(NULL);
+	if (ret == -ENODEV)
+		return 77;
+	STARPU_CHECK_RETURN_VALUE(ret, "starpu_init");
+
+	/* Declare the whole matrix to StarPU */
+	starpu_matrix_data_register(&handle, STARPU_MAIN_RAM, (uintptr_t)matrix, NX, NX, NY, sizeof(matrix[0][0]));
+
+	/* Partition the matrix in PARTS vertical slices */
+	struct starpu_data_filter f_vert =
+	{
+		.filter_func = starpu_matrix_filter_block,
+		.nchildren = PARTS
+	};
+	starpu_data_partition_plan(handle, &f_vert, vert_handle);
+
+	/* Partition the matrix in PARTS horizontal slices */
+	struct starpu_data_filter f_horiz =
+	{
+		.filter_func = starpu_matrix_filter_vertical_block,
+		.nchildren = PARTS
+	};
+	starpu_data_partition_plan(handle, &f_horiz, horiz_handle);
+
+	/* Fill the matrix */
+	ret = starpu_task_insert(&cl_fill, STARPU_W, handle, 0);
+	if (ret == -ENODEV) goto enodev;
+	STARPU_CHECK_RETURN_VALUE(ret, "starpu_task_submit");
+
+	/* Now switch to vertical view of the matrix */
+	starpu_data_partition_submit(handle, PARTS, vert_handle);
+
+	/* Check the values of the vertical slices */
+	for (i = 0; i < PARTS; i++)
+	{
+		int factor = 1;
+		int start = i*(NX/PARTS);
+		ret = starpu_task_insert(&cl_check_scale,
+				STARPU_RW, vert_handle[i],
+				STARPU_VALUE, &start, sizeof(start),
+				STARPU_VALUE, &factor, sizeof(factor),
+				0);
+		if (ret == -ENODEV) goto enodev;
+		STARPU_CHECK_RETURN_VALUE(ret, "starpu_task_submit");
+	}
+
+	/* Now switch back to total view of the matrix */
+	starpu_data_unpartition_submit(handle, PARTS, vert_handle, -1);
+
+	/* And switch to horizontal view of the matrix */
+	starpu_data_partition_submit(handle, PARTS, horiz_handle);
+
+	/* Check the values of the horizontal slices */
+	for (i = 0; i < PARTS; i++)
+	{
+		int factor = 2;
+		int start = factor*100*i*(NY/PARTS);
+		ret = starpu_task_insert(&cl_check_scale,
+				STARPU_RW, horiz_handle[i],
+				STARPU_VALUE, &start, sizeof(start),
+				STARPU_VALUE, &factor, sizeof(factor),
+				0);
+		if (ret == -ENODEV) goto enodev;
+		STARPU_CHECK_RETURN_VALUE(ret, "starpu_task_submit");
+	}
+
+	/* Now switch back to total view of the matrix */
+	starpu_data_unpartition_submit(handle, PARTS, horiz_handle, -1);
+
+	/* And check the values of the whole matrix */
+	int factor = 4;
+	int start = 0;
+	ret = starpu_task_insert(&cl_check_scale,
+			STARPU_RW, handle,
+			STARPU_VALUE, &start, sizeof(start),
+			STARPU_VALUE, &factor, sizeof(factor),
+			0);
+	if (ret == -ENODEV) goto enodev;
+	STARPU_CHECK_RETURN_VALUE(ret, "starpu_task_submit");
+
+	/*
+	 * Unregister data from StarPU and shutdown.
+	 */
+	starpu_data_partition_clean(handle, PARTS, vert_handle);
+	starpu_data_partition_clean(handle, PARTS, horiz_handle);
+	starpu_data_unregister(handle);
+	starpu_shutdown();
+
+	return ret;
+
+enodev:
+	starpu_shutdown();
+	return 77;
+}

+ 367 - 0
examples/filters/fmultiple_submit_readonly.c

@@ -0,0 +1,367 @@
+/* StarPU --- Runtime system for heterogeneous multicore architectures.
+ *
+ * Copyright (C) 2015  Université Bordeaux
+ *
+ * StarPU is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU Lesser General Public License as published by
+ * the Free Software Foundation; either version 2.1 of the License, or (at
+ * your option) any later version.
+ *
+ * StarPU is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
+ *
+ * See the GNU Lesser General Public License in COPYING.LGPL for more details.
+ */
+
+/*
+ * This exemplifies how to access the same matrix with different partitioned
+ * views, doing the coherency through partition planning.
+ * We first run a kernel on the whole matrix to fill it, then run a kernel on
+ * each vertical slice to check the value and multiply it by two, then run a
+ * kernel on each horizontal slice to do the same.
+ */
+
+#include <starpu.h>
+
+#define NX    6
+#define NY    6
+#define PARTS 2
+
+#define FPRINTF(ofile, fmt, ...) do { if (!getenv("STARPU_SSILENT")) {fprintf(ofile, fmt, ## __VA_ARGS__); }} while(0)
+
+void matrix_fill(void *buffers[], void *cl_arg STARPU_ATTRIBUTE_UNUSED)
+{
+	unsigned i, j;
+
+	/* length of the matrix */
+	unsigned nx = STARPU_MATRIX_GET_NX(buffers[0]);
+	unsigned ny = STARPU_MATRIX_GET_NY(buffers[0]);
+	unsigned ld = STARPU_MATRIX_GET_LD(buffers[0]);
+	int *val = (int *)STARPU_MATRIX_GET_PTR(buffers[0]);
+
+	for(j=0; j<ny ; j++)
+	{
+		for(i=0; i<nx ; i++)
+			val[(j*ld)+i] = i+100*j;
+	}
+}
+
+struct starpu_codelet cl_fill =
+{
+	.cpu_funcs = {matrix_fill},
+	.cpu_funcs_name = {"matrix_fill"},
+	.nbuffers = 1,
+	.modes = {STARPU_W},
+	.name = "matrix_fill"
+};
+
+void fmultiple_check_scale(void *buffers[], void *cl_arg)
+{
+	int start, factor;
+	unsigned i, j;
+
+	/* length of the matrix */
+	unsigned nx = STARPU_MATRIX_GET_NX(buffers[0]);
+	unsigned ny = STARPU_MATRIX_GET_NY(buffers[0]);
+	unsigned ld = STARPU_MATRIX_GET_LD(buffers[0]);
+	int *val = (int *)STARPU_MATRIX_GET_PTR(buffers[0]);
+
+	starpu_codelet_unpack_args(cl_arg, &start, &factor);
+
+	for(j=0; j<ny ; j++)
+	{
+		for(i=0; i<nx ; i++)
+		{
+			STARPU_ASSERT(val[(j*ld)+i] == start + factor*((int)(i+100*j)));
+			val[(j*ld)+i] *= 2;
+		}
+	}
+}
+
+#ifdef STARPU_USE_CUDA
+extern void fmultiple_check_scale_cuda(void *buffers[], void *cl_arg);
+#endif
+struct starpu_codelet cl_check_scale =
+{
+#ifdef STARPU_USE_CUDA
+	.cuda_funcs = {fmultiple_check_scale_cuda},
+	.cuda_flags = {STARPU_CUDA_ASYNC},
+#else
+	/* Only enable it on CPUs if we don't have a CUDA device, to force remote execution on the CUDA device */
+	.cpu_funcs = {fmultiple_check_scale},
+	.cpu_funcs_name = {"fmultiple_check_scale"},
+#endif
+	.nbuffers = 1,
+	.modes = {STARPU_RW},
+	.name = "fmultiple_check_scale"
+};
+
+void fmultiple_check(void *buffers[], void *cl_arg)
+{
+	int start, factor;
+	unsigned i, j;
+
+	/* length of the matrix */
+	unsigned nx = STARPU_MATRIX_GET_NX(buffers[0]);
+	unsigned ny = STARPU_MATRIX_GET_NY(buffers[0]);
+	unsigned ld = STARPU_MATRIX_GET_LD(buffers[0]);
+	int *val = (int *)STARPU_MATRIX_GET_PTR(buffers[0]);
+
+	starpu_codelet_unpack_args(cl_arg, &start, &factor);
+
+	for(j=0; j<ny ; j++)
+	{
+		for(i=0; i<nx ; i++)
+		{
+			STARPU_ASSERT(val[(j*ld)+i] == start + factor*((int)(i+100*j)));
+		}
+	}
+}
+
+#ifdef STARPU_USE_CUDA
+extern void fmultiple_check_cuda(void *buffers[], void *cl_arg);
+#endif
+struct starpu_codelet cl_check =
+{
+#ifdef STARPU_USE_CUDA
+	.cuda_funcs = {fmultiple_check_cuda},
+	.cuda_flags = {STARPU_CUDA_ASYNC},
+#else
+	/* Only enable it on CPUs if we don't have a CUDA device, to force remote execution on the CUDA device */
+	.cpu_funcs = {fmultiple_check},
+	.cpu_funcs_name = {"fmultiple_check"},
+#endif
+	.nbuffers = 1,
+	.modes = {STARPU_R},
+	.name = "fmultiple_check"
+};
+
+int main(int argc, char **argv)
+{
+	int start, factor;
+	unsigned j, n=1;
+	int matrix[NX][NY];
+	int ret, i;
+
+	/* We haven't taken care otherwise */
+	STARPU_ASSERT((NX%PARTS) == 0);
+	STARPU_ASSERT((NY%PARTS) == 0);
+
+	starpu_data_handle_t handle;
+	starpu_data_handle_t vert_handle[PARTS];
+	starpu_data_handle_t horiz_handle[PARTS];
+
+	ret = starpu_init(NULL);
+	if (ret == -ENODEV)
+		return 77;
+	STARPU_CHECK_RETURN_VALUE(ret, "starpu_init");
+
+	/* Declare the whole matrix to StarPU */
+	starpu_matrix_data_register(&handle, STARPU_MAIN_RAM, (uintptr_t)matrix, NX, NX, NY, sizeof(matrix[0][0]));
+
+	/* Partition the matrix in PARTS vertical slices */
+	struct starpu_data_filter f_vert =
+	{
+		.filter_func = starpu_matrix_filter_block,
+		.nchildren = PARTS
+	};
+	starpu_data_partition_plan(handle, &f_vert, vert_handle);
+
+	/* Partition the matrix in PARTS horizontal slices */
+	struct starpu_data_filter f_horiz =
+	{
+		.filter_func = starpu_matrix_filter_vertical_block,
+		.nchildren = PARTS
+	};
+	starpu_data_partition_plan(handle, &f_horiz, horiz_handle);
+
+	/* Fill the matrix */
+	ret = starpu_task_insert(&cl_fill, STARPU_W, handle, 0);
+	if (ret == -ENODEV) goto enodev;
+	STARPU_CHECK_RETURN_VALUE(ret, "starpu_task_submit");
+	factor = 1;
+
+	/* Now switch to vertical view of the matrix, but readonly */
+	starpu_data_partition_readonly_submit(handle, PARTS, vert_handle);
+	/* as well as horizontal view of the matrix, but readonly */
+	starpu_data_partition_readonly_submit(handle, PARTS, horiz_handle);
+
+	/* Check the values of the vertical slices */
+	for (i = 0; i < PARTS; i++)
+	{
+		start = i*(NX/PARTS);
+		ret = starpu_task_insert(&cl_check,
+				STARPU_R, vert_handle[i],
+				STARPU_VALUE, &start, sizeof(start),
+				STARPU_VALUE, &factor, sizeof(factor),
+				0);
+		if (ret == -ENODEV) goto enodev;
+		STARPU_CHECK_RETURN_VALUE(ret, "starpu_task_submit");
+	}
+	/* Check the values of the horizontal slices */
+	for (i = 0; i < PARTS; i++)
+	{
+		start = factor*100*i*(NY/PARTS);
+		ret = starpu_task_insert(&cl_check,
+				STARPU_R, horiz_handle[i],
+				STARPU_VALUE, &start, sizeof(start),
+				STARPU_VALUE, &factor, sizeof(factor),
+				0);
+		if (ret == -ENODEV) goto enodev;
+		STARPU_CHECK_RETURN_VALUE(ret, "starpu_task_submit");
+	}
+	/* And of the main matrix */
+	start = 0;
+	ret = starpu_task_insert(&cl_check,
+			STARPU_R, handle,
+			STARPU_VALUE, &start, sizeof(start),
+			STARPU_VALUE, &factor, sizeof(factor),
+			0);
+	if (ret == -ENODEV) goto enodev;
+	STARPU_CHECK_RETURN_VALUE(ret, "starpu_task_submit");
+
+	/* Now switch back to total view of the matrix */
+	starpu_data_unpartition_submit(handle, PARTS, horiz_handle, -1);
+	starpu_data_unpartition_submit(handle, PARTS, vert_handle, -1);
+
+	/* Check and scale it */
+	start = 0;
+	ret = starpu_task_insert(&cl_check_scale,
+			STARPU_RW, handle,
+			STARPU_VALUE, &start, sizeof(start),
+			STARPU_VALUE, &factor, sizeof(factor),
+			0);
+	if (ret == -ENODEV) goto enodev;
+	STARPU_CHECK_RETURN_VALUE(ret, "starpu_task_submit");
+	factor = 2;
+
+	/* Now switch to vertical view of the matrix, but readonly */
+	starpu_data_partition_readonly_submit(handle, PARTS, vert_handle);
+	/* as well as horizontal view of the matrix, but readonly */
+	starpu_data_partition_readonly_submit(handle, PARTS, horiz_handle);
+
+	/* Check the values of the vertical slices */
+	for (i = 0; i < PARTS; i++)
+	{
+		start = factor*i*(NX/PARTS);
+		ret = starpu_task_insert(&cl_check,
+				STARPU_R, vert_handle[i],
+				STARPU_VALUE, &start, sizeof(start),
+				STARPU_VALUE, &factor, sizeof(factor),
+				0);
+		if (ret == -ENODEV) goto enodev;
+		STARPU_CHECK_RETURN_VALUE(ret, "starpu_task_submit");
+	}
+	/* Check the values of the horizontal slices */
+	for (i = 0; i < PARTS; i++)
+	{
+		start = factor*100*i*(NY/PARTS);
+		ret = starpu_task_insert(&cl_check,
+				STARPU_R, horiz_handle[i],
+				STARPU_VALUE, &start, sizeof(start),
+				STARPU_VALUE, &factor, sizeof(factor),
+				0);
+		if (ret == -ENODEV) goto enodev;
+		STARPU_CHECK_RETURN_VALUE(ret, "starpu_task_submit");
+	}
+	/* And of the main matrix */
+	start = 0;
+	ret = starpu_task_insert(&cl_check,
+			STARPU_R, handle,
+			STARPU_VALUE, &start, sizeof(start),
+			STARPU_VALUE, &factor, sizeof(factor),
+			0);
+	if (ret == -ENODEV) goto enodev;
+	STARPU_CHECK_RETURN_VALUE(ret, "starpu_task_submit");
+
+	/* Disable the read-only vertical view of the matrix */
+	starpu_data_unpartition_submit(handle, PARTS, vert_handle, -1);
+	/* And switch to read-write horizontal view of the matrix */
+	starpu_data_partition_readwrite_upgrade_submit(handle, PARTS, horiz_handle);
+
+	/* Check and scale the values of the horizontal slices */
+	for (i = 0; i < PARTS; i++)
+	{
+		start = factor*100*i*(NY/PARTS);
+		ret = starpu_task_insert(&cl_check_scale,
+				STARPU_RW, horiz_handle[i],
+				STARPU_VALUE, &start, sizeof(start),
+				STARPU_VALUE, &factor, sizeof(factor),
+				0);
+		if (ret == -ENODEV) goto enodev;
+		STARPU_CHECK_RETURN_VALUE(ret, "starpu_task_submit");
+	}
+	factor = 4;
+
+	/* Now switch back to total view of the matrix, but readonly */
+	starpu_data_unpartition_readonly_submit(handle, PARTS, horiz_handle, -1);
+	/* And also enable a vertical view of the matrix, but readonly */
+	starpu_data_partition_readonly_submit(handle, PARTS, vert_handle);
+
+	/* Check the values of the vertical slices */
+	for (i = 0; i < PARTS; i++)
+	{
+		start = factor*i*(NX/PARTS);
+		ret = starpu_task_insert(&cl_check,
+				STARPU_R, vert_handle[i],
+				STARPU_VALUE, &start, sizeof(start),
+				STARPU_VALUE, &factor, sizeof(factor),
+				0);
+		if (ret == -ENODEV) goto enodev;
+		STARPU_CHECK_RETURN_VALUE(ret, "starpu_task_submit");
+	}
+	/* Check the values of the horizontal slices */
+	for (i = 0; i < PARTS; i++)
+	{
+		start = factor*100*i*(NY/PARTS);
+		ret = starpu_task_insert(&cl_check,
+				STARPU_R, horiz_handle[i],
+				STARPU_VALUE, &start, sizeof(start),
+				STARPU_VALUE, &factor, sizeof(factor),
+				0);
+		if (ret == -ENODEV) goto enodev;
+		STARPU_CHECK_RETURN_VALUE(ret, "starpu_task_submit");
+	}
+	/* And of the main matrix */
+	start = 0;
+	ret = starpu_task_insert(&cl_check,
+			STARPU_R, handle,
+			STARPU_VALUE, &start, sizeof(start),
+			STARPU_VALUE, &factor, sizeof(factor),
+			0);
+	if (ret == -ENODEV) goto enodev;
+	STARPU_CHECK_RETURN_VALUE(ret, "starpu_task_submit");
+
+	/* Disable the read-only vertical view of the matrix */
+	starpu_data_unpartition_submit(handle, PARTS, vert_handle, -1);
+	/* And the read-only horizontal view of the matrix */
+	starpu_data_unpartition_submit(handle, PARTS, horiz_handle, -1);
+	/* Thus getting back to a read-write total view of the matrix */
+
+	/* And check and scale the values of the whole matrix */
+	start = 0;
+	ret = starpu_task_insert(&cl_check_scale,
+			STARPU_RW, handle,
+			STARPU_VALUE, &start, sizeof(start),
+			STARPU_VALUE, &factor, sizeof(factor),
+			0);
+	if (ret == -ENODEV) goto enodev;
+	STARPU_CHECK_RETURN_VALUE(ret, "starpu_task_submit");
+	factor = 8;
+
+	/*
+	 * Unregister data from StarPU and shutdown.
+	 */
+	starpu_data_partition_clean(handle, PARTS, vert_handle);
+	starpu_data_partition_clean(handle, PARTS, horiz_handle);
+	starpu_data_unregister(handle);
+	starpu_shutdown();
+
+	return ret;
+
+enodev:
+	starpu_shutdown();
+	return 77;
+}

+ 9 - 1
include/starpu_data_filters.h

@@ -1,6 +1,6 @@
 /* StarPU --- Runtime system for heterogeneous multicore architectures.
  *
- * Copyright (C) 2010-2012  Université de Bordeaux
+ * Copyright (C) 2010-2012, 2015  Université de Bordeaux
  * Copyright (C) 2010  Mehdi Juhoor <mjuhoor@gmail.com>
  * Copyright (C) 2010, 2011, 2012, 2013  CNRS
  *
@@ -42,6 +42,14 @@ struct starpu_data_filter
 void starpu_data_partition(starpu_data_handle_t initial_handle, struct starpu_data_filter *f);
 void starpu_data_unpartition(starpu_data_handle_t root_data, unsigned gathering_node);
 
+void starpu_data_partition_plan(starpu_data_handle_t initial_handle, struct starpu_data_filter *f, starpu_data_handle_t *children);
+void starpu_data_partition_submit(starpu_data_handle_t initial_handle, unsigned nparts, starpu_data_handle_t *children);
+void starpu_data_partition_readonly_submit(starpu_data_handle_t initial_handle, unsigned nparts, starpu_data_handle_t *children);
+void starpu_data_partition_readwrite_upgrade_submit(starpu_data_handle_t initial_handle, unsigned nparts, starpu_data_handle_t *children);
+void starpu_data_unpartition_submit(starpu_data_handle_t initial_handle, unsigned nparts, starpu_data_handle_t *children, int gathering_node);
+void starpu_data_unpartition_readonly_submit(starpu_data_handle_t initial_handle, unsigned nparts, starpu_data_handle_t *children, int gathering_node);
+void starpu_data_partition_clean(starpu_data_handle_t root_data, unsigned nparts, starpu_data_handle_t *children);
+
 int starpu_data_get_nb_children(starpu_data_handle_t handle);
 starpu_data_handle_t starpu_data_get_child(starpu_data_handle_t handle, unsigned i);
 

+ 11 - 0
src/datawizard/coherency.h

@@ -145,6 +145,17 @@ struct _starpu_data_state
 
 	starpu_data_handle_t children;
 	unsigned nchildren;
+	/* How many partition plans this handle has */
+	unsigned nplans;
+	/* Whether a partition plan is currently submitted and the
+	 * corresponding unpartition has not been submitted yet,
+	 * or the number of partition plans currently submitted in read-only
+	 * mode.
+	 */
+	unsigned partitioned;
+	/* Whether a partition plan is currently submitted in readonly mode */
+	unsigned readonly;
 
 	/* describe the state of the data in term of coherency */
 	struct _starpu_data_replicate per_node[STARPU_MAXNODES];

+ 204 - 34
src/datawizard/filters.c

@@ -22,8 +22,6 @@
 #include <datawizard/interfaces/data_interface.h>
 #include <core/task.h>
 
-static void starpu_data_create_children(starpu_data_handle_t handle, unsigned nchildren, struct starpu_data_filter *f);
-
 /*
  * This function applies a data filter on all the elements of a partition
  */
@@ -114,27 +112,37 @@ starpu_data_handle_t starpu_data_vget_sub_data(starpu_data_handle_t root_handle,
 	return current_handle;
 }
 
-void starpu_data_partition(starpu_data_handle_t initial_handle, struct starpu_data_filter *f)
+static unsigned _starpu_data_partition_nparts(starpu_data_handle_t initial_handle, struct starpu_data_filter *f)
+{
+	/* how many parts ? */
+	if (f->get_nchildren)
+	  return f->get_nchildren(f, initial_handle);
+	else
+	  return f->nchildren;
+
+}
+
+static void _starpu_data_partition(starpu_data_handle_t initial_handle, starpu_data_handle_t *childrenp, unsigned nparts, struct starpu_data_filter *f, int inherit_state)
 {
-	unsigned nparts;
 	unsigned i;
 	unsigned node;
 
 	/* first take care to properly lock the data header */
 	_starpu_spin_lock(&initial_handle->header_lock);
 
-	STARPU_ASSERT_MSG(initial_handle->nchildren == 0, "there should not be mutiple filters applied on the same data %p, futher filtering has to be done on children", initial_handle);
-
-	/* how many parts ? */
-	if (f->get_nchildren)
-	  nparts = f->get_nchildren(f, initial_handle);
-	else
-	  nparts = f->nchildren;
+	initial_handle->nplans++;
 
 	STARPU_ASSERT_MSG(nparts > 0, "Partitioning data %p in 0 piece does not make sense", initial_handle);
 
 	/* allocate the children */
-	starpu_data_create_children(initial_handle, nparts, f);
+	if (inherit_state)
+	{
+		initial_handle->children = (struct _starpu_data_state *) calloc(nparts, sizeof(struct _starpu_data_state));
+		STARPU_ASSERT(initial_handle->children);
+
+		/* this handle now has children */
+		initial_handle->nchildren = nparts;
+	}
 
 	unsigned nworkers = starpu_worker_get_count();
 
@@ -147,6 +155,7 @@ void starpu_data_partition(starpu_data_handle_t initial_handle, struct starpu_da
 	{
 		/* This is lazy allocation, allocate it now in main RAM, so as
 		 * to have somewhere to gather pieces later */
+		/* FIXME: mark as unevictable! */
 		int ret = _starpu_allocate_memory_on_node(initial_handle, &initial_handle->per_node[0], 0);
 #ifdef STARPU_DEVEL
 #warning we should reclaim memory if allocation failed
@@ -156,12 +165,29 @@ void starpu_data_partition(starpu_data_handle_t initial_handle, struct starpu_da
 
 	for (i = 0; i < nparts; i++)
 	{
-		starpu_data_handle_t child =
-			starpu_data_get_child(initial_handle, i);
+		starpu_data_handle_t child;
 
+		if (inherit_state)
+			child = &initial_handle->children[i];
+		else
+			child = childrenp[i];
 		STARPU_ASSERT(child);
 
+		struct starpu_data_interface_ops *ops;
+
+		/* each child may have his own interface type */
+		/* what's this child's interface ? */
+		if (f->get_child_ops)
+			ops = f->get_child_ops(f, i);
+		else
+			ops = initial_handle->ops;
+
+		_starpu_data_handle_init(child, ops, initial_handle->mf_node);
+
 		child->nchildren = 0;
+		child->nplans = 0;
+		child->partitioned = 0;
+		child->readonly = 0;
                 child->mpi_data = initial_handle->mpi_data;
 		child->root_handle = initial_handle->root_handle;
 		child->father_handle = initial_handle;
@@ -224,19 +250,29 @@ void starpu_data_partition(starpu_data_handle_t initial_handle, struct starpu_da
 			initial_replicate = &initial_handle->per_node[node];
 			child_replicate = &child->per_node[node];
 
-			child_replicate->state = initial_replicate->state;
-			child_replicate->allocated = initial_replicate->allocated;
+			if (inherit_state)
+				child_replicate->state = initial_replicate->state;
+			else
+				child_replicate->state = STARPU_INVALID;
+			if (inherit_state || !initial_replicate->automatically_allocated)
+				child_replicate->allocated = initial_replicate->allocated;
+			else
+				child_replicate->allocated = 0;
 			/* Do not allow memory reclaiming within the child for parent bits */
 			child_replicate->automatically_allocated = 0;
 			child_replicate->refcnt = 0;
 			child_replicate->memory_node = node;
 			child_replicate->relaxed_coherency = 0;
-			child_replicate->initialized = initial_replicate->initialized;
+			if (inherit_state)
+				child_replicate->initialized = initial_replicate->initialized;
+			else
+				child_replicate->initialized = 0;
 
 			/* update the interface */
 			void *initial_interface = starpu_data_get_interface_on_node(initial_handle, node);
 			void *child_interface = starpu_data_get_interface_on_node(child, node);
 
+			STARPU_ASSERT_MSG(!(!inherit_state && child_replicate->automatically_allocated && child_replicate->allocated), "partition planning is currently not supported when handle has some automatically allocated buffers");
 			f->filter_func(initial_interface, child_interface, f, i, nparts);
 		}
 
@@ -459,6 +495,7 @@ void starpu_data_unpartition(starpu_data_handle_t root_handle, unsigned gatherin
 	starpu_data_handle_t children = root_handle->children;
 	root_handle->children = NULL;
 	root_handle->nchildren = 0;
+	root_handle->nplans--;
 
 	/* now the parent may be used again so we release the lock */
 	_starpu_spin_unlock(&root_handle->header_lock);
@@ -468,31 +505,164 @@ void starpu_data_unpartition(starpu_data_handle_t root_handle, unsigned gatherin
 	_STARPU_TRACE_END_UNPARTITION(root_handle, gathering_node);
 }
 
-/* each child may have his own interface type */
-static void starpu_data_create_children(starpu_data_handle_t handle, unsigned nchildren, struct starpu_data_filter *f)
+void starpu_data_partition(starpu_data_handle_t initial_handle, struct starpu_data_filter *f)
 {
-	handle->children = (struct _starpu_data_state *) calloc(nchildren, sizeof(struct _starpu_data_state));
-	STARPU_ASSERT(handle->children);
+	unsigned nparts = _starpu_data_partition_nparts(initial_handle, f);
+	STARPU_ASSERT_MSG(initial_handle->nchildren == 0, "there should not be multiple filters applied on the same data %p, further filtering has to be done on children", initial_handle);
+	STARPU_ASSERT_MSG(initial_handle->nplans == 0, "partition planning and synchronous partitioning is not supported");
 
-	unsigned child;
+	initial_handle->children = NULL;
+	_starpu_data_partition(initial_handle, NULL, nparts, f, 1);
+}
+
+void starpu_data_partition_plan(starpu_data_handle_t initial_handle, struct starpu_data_filter *f, starpu_data_handle_t *childrenp)
+{
+	unsigned i;
+	unsigned nparts = _starpu_data_partition_nparts(initial_handle, f);
+	STARPU_ASSERT_MSG(initial_handle->nchildren == 0, "partition planning and synchronous partitioning is not supported");
+	STARPU_ASSERT_MSG(initial_handle->home_node == STARPU_MAIN_RAM, "partition planning is currently only supported from main RAM");
+	STARPU_ASSERT_MSG(initial_handle->sequential_consistency, "partition planning is currently only supported for data with sequential consistency");
+
+	for (i = 0; i < nparts; i++)
+		childrenp[i] = calloc(1, sizeof(struct _starpu_data_state));
+	_starpu_data_partition(initial_handle, childrenp, nparts, f, 0);
+}
+
+void starpu_data_partition_clean(starpu_data_handle_t root_handle, unsigned nparts, starpu_data_handle_t *children)
+{
+	unsigned i;
+
+	for (i = 0; i < nparts; i++)
+		starpu_data_unregister_submit(children[i]);
+
+	_starpu_spin_lock(&root_handle->header_lock);
+	root_handle->nplans--;
+	_starpu_spin_unlock(&root_handle->header_lock);
+}
+
+static void empty(void *buffers[] STARPU_ATTRIBUTE_UNUSED, void *cl_arg STARPU_ATTRIBUTE_UNUSED)
+{
+	/* This doesn't need to do anything, it's simply used to make coherency
+	 * between the two views, by simply running on the home node of the
+	 * data, thus getting back all data pieces there.  */
+}
+
+static struct starpu_codelet cl_switch =
+{
+	.cpu_funcs = {empty},
+	.nbuffers = STARPU_VARIABLE_NBUFFERS,
+	.name = "data_partition_switch"
+};
 
-	for (child = 0; child < nchildren; child++)
+void starpu_data_partition_submit(starpu_data_handle_t initial_handle, unsigned nparts, starpu_data_handle_t *children)
+{
+	STARPU_ASSERT_MSG(initial_handle->sequential_consistency, "partition planning is currently only supported for data with sequential consistency");
+	_starpu_spin_lock(&initial_handle->header_lock);
+	STARPU_ASSERT_MSG(initial_handle->partitioned == 0, "One can't submit several partition plannings at the same time");
+	STARPU_ASSERT_MSG(initial_handle->readonly == 0, "One can't submit a partition planning while a readonly partitioning is active");
+	initial_handle->partitioned++;
+	_starpu_spin_unlock(&initial_handle->header_lock);
+
+	unsigned i;
+	struct starpu_data_descr descr[nparts];
+	for (i = 0; i < nparts; i++)
 	{
-		starpu_data_handle_t handle_child;
-		struct starpu_data_interface_ops *ops;
+		STARPU_ASSERT_MSG(children[i]->father_handle == initial_handle, "children parameter of starpu_data_partition_submit must be the children of the parent parameter");
+		descr[i].handle = children[i];
+		descr[i].mode = STARPU_W;
+	}
+	/* TODO: assert nparts too */
+	starpu_task_insert(&cl_switch, STARPU_RW, initial_handle, STARPU_DATA_MODE_ARRAY, descr, nparts, 0);
+	starpu_data_invalidate_submit(initial_handle);
+}
 
-		/* what's this child's interface ? */
-		if (f->get_child_ops)
-		  ops = f->get_child_ops(f, child);
-		else
-		  ops = handle->ops;
+void starpu_data_partition_readonly_submit(starpu_data_handle_t initial_handle, unsigned nparts, starpu_data_handle_t *children)
+{
+	STARPU_ASSERT_MSG(initial_handle->sequential_consistency, "partition planning is currently only supported for data with sequential consistency");
+	_starpu_spin_lock(&initial_handle->header_lock);
+	STARPU_ASSERT_MSG(initial_handle->partitioned == 0 || initial_handle->readonly, "One can't submit a readonly partition planning at the same time as a readwrite partition planning");
+	initial_handle->partitioned++;
+	initial_handle->readonly = 1;
+	_starpu_spin_unlock(&initial_handle->header_lock);
+
+	unsigned i;
+	struct starpu_data_descr descr[nparts];
+	for (i = 0; i < nparts; i++)
+	{
+		STARPU_ASSERT_MSG(children[i]->father_handle == initial_handle, "children parameter of starpu_data_partition_submit must be the children of the parent parameter");
+		descr[i].handle = children[i];
+		descr[i].mode = STARPU_W;
+	}
+	/* TODO: assert nparts too */
+	starpu_task_insert(&cl_switch, STARPU_R, initial_handle, STARPU_DATA_MODE_ARRAY, descr, nparts, 0);
+}
+
+void starpu_data_partition_readwrite_upgrade_submit(starpu_data_handle_t initial_handle, unsigned nparts, starpu_data_handle_t *children)
+{
+	STARPU_ASSERT_MSG(initial_handle->sequential_consistency, "partition planning is currently only supported for data with sequential consistency");
+	_starpu_spin_lock(&initial_handle->header_lock);
+	STARPU_ASSERT_MSG(initial_handle->partitioned == 1, "One can't upgrade a readonly partition planning to readwrite while other readonly partition plannings are active");
+	STARPU_ASSERT_MSG(initial_handle->readonly == 1, "One can only upgrade a readonly partition planning");
+	initial_handle->readonly = 0;
+	_starpu_spin_unlock(&initial_handle->header_lock);
+
+	unsigned i;
+	struct starpu_data_descr descr[nparts];
+	for (i = 0; i < nparts; i++)
+	{
+		STARPU_ASSERT_MSG(children[i]->father_handle == initial_handle, "children parameter of starpu_data_partition_submit must be the children of the parent parameter");
+		descr[i].handle = children[i];
+		descr[i].mode = STARPU_W;
+	}
+	/* TODO: assert nparts too */
+	starpu_task_insert(&cl_switch, STARPU_RW, initial_handle, STARPU_DATA_MODE_ARRAY, descr, nparts, 0);
+	starpu_data_invalidate_submit(initial_handle);
+}
+
+void starpu_data_unpartition_submit(starpu_data_handle_t initial_handle, unsigned nparts, starpu_data_handle_t *children, int gather_node)
+{
+	STARPU_ASSERT_MSG(initial_handle->sequential_consistency, "partition planning is currently only supported for data with sequential consistency");
+	STARPU_ASSERT_MSG(gather_node == STARPU_MAIN_RAM || gather_node == -1, "gathering node different from main RAM is currently not supported");
+	_starpu_spin_lock(&initial_handle->header_lock);
+	STARPU_ASSERT_MSG(initial_handle->partitioned >= 1, "No partition planning is active for this handle");
+	initial_handle->partitioned--;
+	if (!initial_handle->partitioned)
+		initial_handle->readonly = 0;
+	_starpu_spin_unlock(&initial_handle->header_lock);
 
-		handle_child = &handle->children[child];
-		_starpu_data_handle_init(handle_child, ops, handle->mf_node);
+	unsigned i;
+	struct starpu_data_descr descr[nparts];
+	for (i = 0; i < nparts; i++)
+	{
+		STARPU_ASSERT_MSG(children[i]->father_handle == initial_handle, "children parameter of starpu_data_partition_submit must be the children of the parent parameter");
+		descr[i].handle = children[i];
+		descr[i].mode = STARPU_RW;
 	}
+	/* TODO: assert nparts too */
+	starpu_task_insert(&cl_switch, STARPU_W, initial_handle, STARPU_DATA_MODE_ARRAY, descr, nparts, 0);
+	for (i = 0; i < nparts; i++)
+		starpu_data_invalidate_submit(children[i]);
+}
+
+void starpu_data_unpartition_readonly_submit(starpu_data_handle_t initial_handle, unsigned nparts, starpu_data_handle_t *children, int gather_node)
+{
+	STARPU_ASSERT_MSG(initial_handle->sequential_consistency, "partition planning is currently only supported for data with sequential consistency");
+	STARPU_ASSERT_MSG(gather_node == STARPU_MAIN_RAM || gather_node == -1, "gathering node different from main RAM is currently not supported");
+	_starpu_spin_lock(&initial_handle->header_lock);
+	STARPU_ASSERT_MSG(initial_handle->partitioned >= 1, "No partition planning is active for this handle");
+	initial_handle->readonly = 1;
+	_starpu_spin_unlock(&initial_handle->header_lock);
 
-	/* this handle now has children */
-	handle->nchildren = nchildren;
+	unsigned i;
+	struct starpu_data_descr descr[nparts];
+	for (i = 0; i < nparts; i++)
+	{
+		STARPU_ASSERT_MSG(children[i]->father_handle == initial_handle, "children parameter of starpu_data_partition_submit must be the children of the parent parameter");
+		descr[i].handle = children[i];
+		descr[i].mode = STARPU_R;
+	}
+	/* TODO: assert nparts too */
+	starpu_task_insert(&cl_switch, STARPU_W, initial_handle, STARPU_DATA_MODE_ARRAY, descr, nparts, 0);
 }
 
 /*

+ 6 - 0
src/datawizard/interfaces/data_interface.c

@@ -252,6 +252,9 @@ static void _starpu_register_new_data(starpu_data_handle_t handle,
 
 	/* there is no hierarchy yet */
 	handle->nchildren = 0;
+	handle->nplans = 0;
+	handle->partitioned = 0;
+	handle->readonly = 0;
 	handle->root_handle = handle;
 	handle->father_handle = NULL;
 	handle->sibling_index = 0; /* could be anything for the root */
@@ -641,6 +644,9 @@ static void _starpu_data_unregister(starpu_data_handle_t handle, unsigned cohere
 {
 	STARPU_ASSERT(handle);
 	STARPU_ASSERT_MSG(handle->nchildren == 0, "data %p needs to be unpartitioned before unregistration", handle);
+	STARPU_ASSERT_MSG(handle->nplans == 0, "data %p needs its partition plans to be cleaned before unregistration", handle);
+	STARPU_ASSERT_MSG(handle->partitioned == 0, "data %p needs its partitioned plans to be unpartitioned before unregistration", handle);
+	/* TODO: also check that it has the latest coherency */
 	STARPU_ASSERT(!(nowait && handle->busy_count != 0));
 
 	int sequential_consistency = handle->sequential_consistency;