
In the hwloc case, create combined workers hierarchically along hwloc objects; introduce STARPU_SYNTHESIZE_ARITY_COMBINED_WORKER, which controls how intermediate levels of combined workers are synthesized

Samuel Thibault 13 years ago
parent
commit
464f6730ea
3 changed files with 159 additions and 24 deletions
  1. doc/chapters/advanced-examples.texi (+3 -11)
  2. doc/chapters/configuration.texi (+18 -13)
  3. src/sched_policies/detect_combined_workers.c (+138 -0)

+ 3 - 11
doc/chapters/advanced-examples.texi

@@ -882,23 +882,15 @@ try to execute tasks with several CPUs. It will automatically try the various
 available combined worker sizes and thus be able to avoid choosing a large
 combined worker if the codelet does not actually scale so much.
 
-@subsection Combined worker sizes
+@subsection Combined workers
 
 By default, StarPU creates combined workers according to the architecture
 structure as detected by hwloc. It means that for each object of the hwloc
 topology (NUMA node, socket, cache, ...) a combined worker will be created. If
 some nodes of the hierarchy have a big arity (e.g. many cores in a socket
 without a hierarchy of shared caches), StarPU will create combined workers of
-intermediate sizes.
-The user can give some hints to StarPU about combined workers sizes to favor.
-This can be done by using the environment variables @code{STARPU_MIN_WORKERSIZE}
-and @code{STARPU_MAX_WORKERSIZE}. When set, they will force StarPU to create the
-biggest combined workers possible without overstepping the defined boundaries.
-However, StarPU will create the remaining combined workers without abiding by
-the rules if not possible.
-For example : if the user specifies a minimum and maximum combined workers size
-of 3 on a machine containing 8 CPUs, StarPU will create a combined worker of
-size 2 beside the combined workers of size 3.
+intermediate sizes. The @code{STARPU_SYNTHESIZE_ARITY_COMBINED_WORKER}
+environment variable can be used to tune the maximum arity between levels of
+combined workers.
 
 The combined workers actually produced can be seen in the output of the
 @code{starpu_machine_display} tool (the @code{STARPU_SCHED} environment variable
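
As an aside, here is a minimal sketch (not part of this commit; the setenv()
call and the value are illustrative) of selecting the synthesis arity from
within an application, before StarPU detects the topology:

    #include <stdlib.h>
    #include <starpu.h>

    int main(void)
    {
        /* Equivalent to exporting the variable in the shell; must be set
         * before starpu_init() so that topology detection sees it. */
        setenv("STARPU_SYNTHESIZE_ARITY_COMBINED_WORKER", "3", 1);

        if (starpu_init(NULL) != 0)
            return 1;

        /* ... submit tasks with a parallel-task-aware scheduler ... */

        starpu_shutdown();
        return 0;
    }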

+ 18 - 13
doc/chapters/configuration.texi

@@ -235,8 +235,7 @@ MKL website} provides a script to determine the linking flags.
 * STARPU_WORKERS_CUDAID::       	Select specific CUDA devices
 * STARPU_WORKERS_OPENCLID::     	Select specific OpenCL devices
 * STARPU_SINGLE_COMBINED_WORKER:: 	Do not use concurrent workers
-* STARPU_MIN_WORKERSIZE::	 	Minimum size of the combined workers
-* STARPU_MAX_WORKERSIZE:: 		Maximum size of the combined workers
+* STARPU_SYNTHESIZE_ARITY_COMBINED_WORKER::	 	Maximum arity between combined worker levels
 @end menu
 
 @node STARPU_NCPU
@@ -325,17 +324,23 @@ If set, StarPU will create several workers which won't be able to work
 concurrently. It will create combined workers whose sizes go from 1 to the
 total number of CPU workers in the system.
 
-@node STARPU_MIN_WORKERSIZE
-@subsubsection @code{STARPU_MIN_WORKERSIZE} -- Minimum size of the combined workers
-
-Let the user give a hint to StarPU about which how many workers
-(minimum boundary) the combined workers should contain.
-
-@node STARPU_MAX_WORKERSIZE
-@subsubsection @code{STARPU_MAX_WORKERSIZE} -- Maximum size of the combined workers
-
-Let the user give a hint to StarPU about which how many workers
-(maximum boundary) the combined workers should contain.
+@node STARPU_SYNTHESIZE_ARITY_COMBINED_WORKER
+@subsubsection @code{STARPU_SYNTHESIZE_ARITY_COMBINED_WORKER} -- Maximum arity between levels of combined workers (default = 2)
+
+Let the user decide the maximum number of hwloc objects allowed between two
+levels of combined workers created from hwloc information. For instance, in
+the case of sockets with 6 cores without shared L2 caches, if
+@code{STARPU_SYNTHESIZE_ARITY_COMBINED_WORKER} is set to 6, no combined
+worker will be synthesized beyond one for the socket and one per core. If it
+is set to 3, 3 intermediate combined workers will be synthesized, to divide
+the socket cores into 3 chunks of 2 cores. If it is set to 2, 2 intermediate
+combined workers will be synthesized, to divide the socket cores into 2
+chunks of 3 cores, and then 2 additional combined workers will be
+synthesized, to divide each of those chunks into a pair of 2 cores plus the
+remaining core (for which no combined worker is synthesized, since there is
+already a normal worker for it).
+
+The default, 2, thus makes StarPU tend to build a binary tree of combined
+workers.
 
 @node Scheduling
 @subsection Configuring the Scheduling engine
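
To make the division rule concrete, here is a standalone sketch (illustrative,
not StarPU code) that prints the hierarchy produced by the chunking
arithmetic, chunks of ceil(n / arity) cores, for the 6-core socket example:

    #include <stdio.h>

    /* Recursively divide a group of n cores into chunks of
     * ceil(n / arity) cores, printing one line per combined worker,
     * until only single cores remain. */
    static void divide(unsigned n, unsigned arity, unsigned depth)
    {
        printf("%*scombined worker of %u cores\n", (int) (2 * depth), "", n);
        if (n <= arity)
            return; /* children are single cores from here on */
        unsigned chunk_size = (n + arity - 1) / arity;
        unsigned remaining = n;
        while (remaining > 0)
        {
            unsigned c = remaining < chunk_size ? remaining : chunk_size;
            if (c > 1) /* a lone core keeps only its normal worker */
                divide(c, arity, depth + 1);
            remaining -= c;
        }
    }

    int main(void)
    {
        divide(6, 2, 0); /* 6-core socket with arity set to 2 */
        return 0;
    }

With arity 2 this prints one worker of 6 cores, two of 3 cores, and two of
2 cores, matching the description above.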

+ 138 - 0
src/sched_policies/detect_combined_workers.c

@@ -25,6 +25,7 @@
 #ifdef STARPU_HAVE_HWLOC
 #include <hwloc.h>
 
+#if 0
 /* struct _starpu_tree
  * ==================
  * Purpose
@@ -410,6 +411,143 @@ static void find_and_assign_combinations_with_hwloc(struct starpu_machine_topolo
 
     free(tree.workers);
 }
+#endif
+
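+/* Recursively collect into cpu_workers the ids of the CPU workers found
+ * on the PUs below this hwloc object. */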
+static void find_workers(hwloc_obj_t obj, int cpu_workers[STARPU_NMAXWORKERS], unsigned *n)
+{
+    if (!obj->userdata)
+	/* Nothing to run on this object, don't care */
+	return;
+    if (obj->userdata == (void*) -1)
+    {
+	/* Intra node, recurse */
+	unsigned i;
+	for (i = 0; i < obj->arity; i++)
+	    find_workers(obj->children[i], cpu_workers, n);
+	return;
+    }
+
+    /* Got to a PU leaf */
+    struct _starpu_worker *worker = obj->userdata;
+    /* is it a CPU worker? */
+    if (worker->perf_arch == STARPU_CPU_DEFAULT)
+    {
+	_STARPU_DEBUG("worker %d is part of it\n", worker->workerid);
+	/* Add it to the combined worker */
+	cpu_workers[(*n)++] = worker->workerid;
+    }
+}
+
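+/* Split the children of an hwloc object into chunks of at most
+ * ceil(n / synthesize_arity) members, register one combined worker per
+ * chunk gathering several CPU workers, and recurse into each chunk. */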
+static void synthesize_intermediate_workers(struct starpu_machine_topology *topology, hwloc_obj_t *children, unsigned arity, unsigned n, unsigned synthesize_arity)
+{
+    unsigned nworkers, i, j;
+    unsigned chunk_size = (n + synthesize_arity - 1) / synthesize_arity;
+    unsigned chunk_start;
+    int cpu_workers[STARPU_NMAXWORKERS];
+    int ret;
+
+    if (n <= synthesize_arity)
+	/* Not too many children, do not synthesize */
+	return;
+
+    _STARPU_DEBUG("%d children > %d, synthesizing intermediate combined workers of size %d\n", n, synthesize_arity, chunk_size);
+
+    n = 0;
+    j = 0;
+    nworkers = 0;
+    chunk_start = 0;
+    for (i = 0 ; i < arity; i++)
+    {
+	if (children[i]->userdata) {
+	    n++;
+	    _STARPU_DEBUG("child %d\n", i);
+	    find_workers(children[i], cpu_workers, &nworkers);
+	    j++;
+	}
+	/* Completed a chunk, or last bit (but not if it's just 1 subobject) */
+	if (j == chunk_size || (i == arity-1 && j > 1)) {
+	    _STARPU_DEBUG("Adding it\n");
+	    ret = starpu_combined_worker_assign_workerid(nworkers, cpu_workers);
+	    STARPU_ASSERT(ret >= 0);
+	    /* Recurse into the chunk we just completed (children
+	     * chunk_start..i, i.e. i - chunk_start + 1 of them) */
+	    synthesize_intermediate_workers(topology, children+chunk_start, i - chunk_start + 1, n, synthesize_arity);
+	    /* And restart another one */
+	    n = 0;
+	    j = 0;
+	    nworkers = 0;
+	    chunk_start = i+1;
+	}
+    }
+}
+
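+/* Register a combined worker for this hwloc object if it gathers several
+ * CPU workers, synthesize intermediate combined workers when it has too
+ * many children, and recurse into the children which contain CPU workers. */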
+static void find_and_assign_combinations(struct starpu_machine_topology *topology, hwloc_obj_t obj, unsigned synthesize_arity)
+{
+    char name[64];
+    unsigned i, n, nworkers, last = 0;
+    int cpu_workers[STARPU_NMAXWORKERS];
+
+    int ret;
+
+    hwloc_obj_snprintf(name, sizeof(name), topology->hwtopology, obj, "#", 0);
+    _STARPU_DEBUG("Looking at %s\n", name);
+
+    for (n = 0, i = 0; i < obj->arity; i++)
+	if (obj->children[i]->userdata)
+	{
+	    /* it has a CPU worker */
+	    last = i;
+	    n++;
+	}
+
+    if (n == 1) {
+	/* Only one child contains a CPU worker (not necessarily the first
+	 * one), go to the next level right away */
+	find_and_assign_combinations(topology, obj->children[last], synthesize_arity);
+	return;
+    }
+
+    /* Add this object */
+    nworkers = 0;
+    find_workers(obj, cpu_workers, &nworkers);
+
+    if (nworkers > 1)
+    {
+	_STARPU_DEBUG("Adding it\n");
+	ret = starpu_combined_worker_assign_workerid(nworkers, cpu_workers);
+	STARPU_ASSERT(ret >= 0);
+    }
+
+    /* Add artificial intermediate objects recursively */
+    synthesize_intermediate_workers(topology, obj->children, obj->arity, n, synthesize_arity);
+
+    /* And recurse */
+    for (i = 0; i < obj->arity; i++)
+	if (obj->children[i]->userdata == (void*) -1)
+	    find_and_assign_combinations(topology, obj->children[i], synthesize_arity);
+}
+
+static void find_and_assign_combinations_with_hwloc(struct starpu_machine_topology *topology)
+{
+    unsigned i;
+    struct _starpu_machine_config *config = _starpu_get_machine_config();
+    int synthesize_arity = starpu_get_env_number("STARPU_SYNTHESIZE_ARITY_COMBINED_WORKER");
+
+    if (synthesize_arity == -1)
+	synthesize_arity = 2;
+
+    /* First, mark nodes which contain CPU workers, simply by setting their userdata field */
+    for (i = 0; i < topology->nworkers; i++)
+    {
+	struct _starpu_worker *worker = &config->workers[i];
+	if (worker->perf_arch == STARPU_CPU_DEFAULT)
+	{
+	    hwloc_obj_t obj = hwloc_get_obj_by_depth(topology->hwtopology, config->cpu_depth, worker->bindid);
+	    STARPU_ASSERT(obj->userdata == worker);
+	    obj = obj->parent;
+	    while (obj) {
+		obj->userdata = (void*) -1;
+		obj = obj->parent;
+	    }
+	}
+    }
+    find_and_assign_combinations(topology, hwloc_get_root_obj(topology->hwtopology), synthesize_arity);
+}
 
 #else /* STARPU_HAVE_HWLOC */
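
As a usage sketch (assuming StarPU's public combined-worker API, which is not
part of this diff), the combinations produced by the code above can be listed
after initialization, much as the starpu_machine_display tool does:

    #include <stdio.h>
    #include <starpu.h>

    int main(void)
    {
        if (starpu_init(NULL) != 0)
            return 1;

        /* Combined worker ids come right after the basic worker ids. */
        unsigned basic = starpu_worker_get_count();
        unsigned combined = starpu_combined_worker_get_count();
        unsigned i;
        for (i = 0; i < combined; i++)
        {
            int size, *members;
            starpu_combined_worker_get_description(basic + i, &size, &members);
            printf("combined worker %u gathers %d workers\n", i, size);
        }

        starpu_shutdown();
        return 0;
    }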