Mariem makni 5 years ago
commit 709176b07a

1 changed file with 307 additions and 71 deletions:
doc/doxygen/chapters/320_scheduling.doxy

@@ -1,6 +1,8 @@
 /* StarPU --- Runtime system for heterogeneous multicore architectures.
  *
- * Copyright (C) 2009-2020  Université de Bordeaux, CNRS (LaBRI UMR 5800), Inria
+ * Copyright (C) 2010-2019                                CNRS
+ * Copyright (C) 2011,2012,2016                           Inria
+ * Copyright (C) 2009-2011,2014-2019                      Université de Bordeaux
  *
  * StarPU is free software; you can redistribute it and/or modify
  * it under the terms of the GNU Lesser General Public License as published by
@@ -39,33 +41,33 @@ STARPU_SCHED. For instance <c>export STARPU_SCHED=dmda</c> . Use <c>help</c> to
 get the list of available schedulers.
 
 
-\subsection NonPerformanceModelingPolicies Non Performance Modelling Policies
+<b>Non Performance Modelling Policies:</b>
 
-- The <b>eager</b> scheduler uses a central task queue, from which all workers draw tasks
+The <b>eager</b> scheduler uses a central task queue, from which all workers draw tasks
 to work on concurrently. This however does not permit data prefetching, since the scheduling
 decision is taken late. If a task has a non-0 priority, it is put at the front of the queue.
 
-- The <b>random</b> scheduler uses a queue per worker, and distributes tasks randomly according to assumed worker
+The <b>random</b> scheduler uses a queue per worker, and distributes tasks randomly according to assumed worker
 overall performance.
 
-- The <b>ws</b> (work stealing) scheduler uses a queue per worker, and schedules
+The <b>ws</b> (work stealing) scheduler uses a queue per worker, and schedules
 a task on the worker which released it by
 default. When a worker becomes idle, it steals a task from the most loaded
 worker.
 
-- The <b>lws</b> (locality work stealing) scheduler uses a queue per worker, and schedules
+The <b>lws</b> (locality work stealing) scheduler uses a queue per worker, and schedules
 a task on the worker which released it by
 default. When a worker becomes idle, it steals a task from neighbour workers. It
 also takes into account priorities.
 
-- The <b>prio</b> scheduler also uses a central task queue, but sorts tasks by
+The <b>prio</b> scheduler also uses a central task queue, but sorts tasks by
 priority specified by the programmer (between -5 and 5).
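+
+A priority can for instance be assigned directly to a task before it is
+submitted (a minimal sketch; the codelet name is illustrative):
+
+\code{.c}
+struct starpu_task *task = starpu_task_create();
+task->cl = &my_codelet;   /* hypothetical codelet */
+task->priority = 3;       /* between -5 and 5, as described above */
+int ret = starpu_task_submit(task);
+\endcode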
 
-- The <b>heteroprio</b> scheduler uses different priorities for the different processing units.
+The <b>heteroprio</b> scheduler uses different priorities for the different processing units.
 This scheduler must be configured to work correctly and to obtain high performance,
 as described in the corresponding section.
 
-\subsection DMTaskSchedulingPolicy Performance Model-Based Task Scheduling Policies
+\section DMTaskSchedulingPolicy Performance Model-Based Task Scheduling Policies
 
 If (<b>and only if</b>) your application <b>codelets have performance models</b> (\ref
 PerformanceModelExample), you should change the scheduler thanks to the
@@ -87,84 +89,47 @@ family policy using performance model hints. A low or zero percentage may be
 the sign that performance models are not converging or that codelets do not
 have performance models enabled.
 
-- The <b>dm</b> (deque model) scheduler takes task execution performance models into account to
+<b>Performance Modelling Policies:</b>
+
+The <b>dm</b> (deque model) scheduler takes task execution performance models into account to
 perform a HEFT-similar scheduling strategy: it schedules tasks where their
 termination time will be minimal. The difference with HEFT is that <b>dm</b>
 schedules tasks as soon as they become available, and thus in the order they
 become available, without taking priorities into account.
 
-- The <b>dmda</b> (deque model data aware) scheduler is similar to dm, but it also takes
+The <b>dmda</b> (deque model data aware) scheduler is similar to dm, but it also takes
 into account data transfer time.
 
-- The <b>dmdap</b> (deque model data aware prio) scheduler is similar to dmda,
+The <b>dmdap</b> (deque model data aware prio) scheduler is similar to dmda,
 except that it sorts tasks by priority order, which allows it to come even closer
 to HEFT by respecting priorities after having made the scheduling decision (but
 it still schedules tasks in the order they become available).
 
-- The <b>dmdar</b> (deque model data aware ready) scheduler is similar to dmda,
+The <b>dmdar</b> (deque model data aware ready) scheduler is similar to dmda,
 but it also privileges tasks whose data buffers are already available
 on the target device.
 
-- The <b>dmdas</b> combines dmdap and dmdas: it sorts tasks by priority order,
+The <b>dmdas</b> combines dmdap and dmdar: it sorts tasks by priority order,
 but for a given priority it will privilege tasks whose data buffers are already
 available on the target device.
 
-- The <b>dmdasd</b> (deque model data aware sorted decision) scheduler is similar
+The <b>dmdasd</b> (deque model data aware sorted decision) scheduler is similar
 to dmdas, except that when scheduling a task, it takes into account its priority
 when computing the minimum completion time, since this task may get executed
 before others, and thus the latter should be ignored.
 
-- The <b>heft</b> (heterogeneous earliest finish time) scheduler is a deprecated
+The <b>heft</b> (heterogeneous earliest finish time) scheduler is a deprecated
 alias for <b>dmda</b>.
 
-- The <b>pheft</b> (parallel HEFT) scheduler is similar to dmda, it also supports
+The <b>pheft</b> (parallel HEFT) scheduler is similar to dmda; it also supports
 parallel tasks (still experimental). Should not be used when several contexts using
 it are being executed simultaneously.
 
-- The <b>peager</b> (parallel eager) scheduler is similar to eager, it also
+The <b>peager</b> (parallel eager) scheduler is similar to eager; it also
 supports parallel tasks (still experimental). Should not be used when several 
 contexts using it are being executed simultaneously.
 
-\subsection ExistingModularizedSchedulers Modularized Schedulers
-
-StarPU provides a powerful way to implement schedulers, as documented in \ref
-DefiningANewModularSchedulingPolicy . It is currently shipped with the following
-pre-defined Modularized Schedulers :
-
-
-- <b>modular-eager</b> , <b>modular-eager-prefetching</b> are eager-based Schedulers (without and with prefetching)), they are \n
-naive schedulers, which try to map a task on the first available resource
-they find. The prefetching variant queues several tasks in advance to be able to
-do data prefetching. This may however degrade load balancing a bit.
-
-- <b>modular-prio</b>, <b>modular-prio-prefetching</b>, <b>modular-eager-prio</b> are prio-based Schedulers (without / with prefetching):,
-similar to Eager-Based Schedulers. Can handle tasks which have a defined
-priority and schedule them accordingly.
-The <b>modular-eager-prio</b> variant integrates the eager and priority queue in a
-single component. This allows it to do a better job at pushing tasks.
-
-- <b>modular-random</b>, <b>modular-random-prio</b>, <b>modular-random-prefetching</b>, <b>modular-random-prio-prefetching</b> are random-based Schedulers (without/with prefetching) : \n
-Select randomly a resource to be mapped on for each task.
-
-- <b>modular-ws</b>) implements Work Stealing:
-Maps tasks to workers in round robin, but allows workers to steal work from other workers.
-
-- <b>modular-heft</b>, <b>modular-heft2</b>, and <b>modular-heft-prio</b> are
-HEFT Schedulers : \n
-Maps tasks to workers using a heuristic very close to
-Heterogeneous Earliest Finish Time.
-It needs that every task submitted to StarPU have a
-defined performance model (\ref PerformanceModelCalibration)
-to work efficiently, but can handle tasks without a performance
-model. <b>modular-heft</b> just takes tasks by priority order. <b>modular-heft2</b> takes
-at most 5 tasks of the same priority and checks which one fits best.
-<b>modular-heft-prio</b> is similar to <b>modular-heft</b>, but only decides the memory
-node, not the exact worker, just pushing tasks to one central queue per memory
-node.
-
-- <b>modular-heteroprio</b> is a Heteroprio Scheduler: \n
-Maps tasks to worker similarly to HEFT, but first attribute accelerated tasks to
-GPUs, then not-so-accelerated tasks to CPUs.
+TODO: describe modular schedulers
 
 \section TaskDistributionVsDataTransfer Task Distribution Vs Data Transfer
 
@@ -185,11 +150,6 @@ already gives the good results that a precise estimation would give.
 
 \section Energy-basedScheduling Energy-based Scheduling
 
-Note: by default StarPU does not let CPU workers sleep, to let them react to
-task release as quickly as possible. For idle time to really let CPU cores save
-energy, one needs to use the \ref enable-blocking-drivers
-"--enable-blocking-drivers" configuration option.
-
 If the application can provide some energy consumption performance model (through
 the field starpu_codelet::energy_model), StarPU will
 take it into account when distributing tasks. The target function that
@@ -205,12 +165,10 @@ simply tend to run all computations on the most energy-conservative processing
 unit. To account for the consumption of the whole machine (including idle
 processing units), the idle power of the machine should be given by setting
 <c>export STARPU_IDLE_POWER=200</c> (\ref STARPU_IDLE_POWER) for 200W, for instance. This value can often
-be obtained from the machine power supplier, e.g. by running
-
-<c>ipmitool -I lanplus -H mymachine-ipmi -U myuser -P mypasswd sdr type Current</c>
+be obtained from the machine power supplier.
 
 The energy actually consumed by the total execution can be displayed by setting
-<c>export STARPU_PROFILING=1 STARPU_WORKER_STATS=1</c> (\ref STARPU_PROFILING and \ref STARPU_WORKER_STATS).
+<c>export STARPU_PROFILING=1 STARPU_WORKER_STATS=1</c> .
 
 For OpenCL devices, on-line task consumption measurement is currently supported through the
 <c>CL_PROFILING_POWER_CONSUMED</c> OpenCL extension, implemented in the MoviSim
@@ -234,14 +192,292 @@ single task gives the consumption of the task in Joules, which can be given to
 starpu_perfmodel_update_history().
 
 Another way to provide the energy performance is to define a
-perfmodel with starpu_perfmodel::type ::STARPU_PER_ARCH or
-::STARPU_PER_WORKER , and set the starpu_perfmodel::arch_cost_function or
-starpu_perfmodel::worker_cost_function field to a function which shall return
-the estimated consumption of the task in Joules. Such a function can for instance
+perfmodel with starpu_perfmodel::type ::STARPU_PER_ARCH, and set the
+starpu_perfmodel::arch_cost_function field to a function which shall return the
+estimated consumption of the task in Joules. Such a function can for instance
 use starpu_task_expected_length() on the task (in µs), multiplied by the
 typical power consumption of the device, e.g. in W, and divided by 1000000. to
 get Joules.
 
+
+\subsection MeasuringEnergyandPower Measuring energy and power with StarPU
+
+The performance model of StarPU has been extended to measure the energy and power consumption of CPUs. These values are measured with the Performance API (PAPI) library, which provides a consistent interface and methodology for using the hardware performance counters found in most major microprocessors, and lets software engineers observe, in near real time, the relation between software performance and processor events.
+
+\subsection StarpuEnergyStart starpu_energy_start()
+
+Before calling <c>starpu_energy_start()</c>, a few variables need to be defined:
+
+- <c>const int N_EVTS = 2;</c> the number of <c>RAPL</c> events to be monitored; in this example we use two events.
+
+- To measure the energy consumption of the CPUs, we use the following events, provided they are available on the CPU architecture:
+
+const char* event_names[] = { "rapl::RAPL_ENERGY_PKG:cpu=%d",
+                              "rapl::RAPL_ENERGY_DRAM:cpu=%d" };
+
+where <c>RAPL_ENERGY_PKG</c> represents the power consumption of the whole CPU socket,
+and <c>RAPL_ENERGY_DRAM</c> the power consumption of the RAM.
+
+- <c>int EventSet = PAPI_NULL;</c> this variable must be initialized to <c>PAPI_NULL</c> before calling <c>PAPI_create_eventset()</c>.
+
+- <c>long long *values;</c> the array in which the values read from the event set are stored.
+
+As shown in the following code, <c>PAPI_library_init()</c> initializes the PAPI library; it must be called before any other low-level PAPI function. <c>starpu_energy_start()</c> then uses <c>PAPI_start()</c> to start counting.
+
+We use <c>hwloc_get_nbobjs_by_type()</c> to get the number of sockets <c>nsockets</c>, which of course depends on the machine being used.
+
+
+\code{.c}
+int starpu_energy_start()
+{
+    int retval, number;
+    int i;
+
+    struct _starpu_machine_config *config = _starpu_get_machine_config();
+    hwloc_topology_t topology = config->topology.hwtopology;
+
+    /* get the number of sockets */
+    nsockets = hwloc_get_nbobjs_by_type(topology, HWLOC_OBJ_PACKAGE);
+
+    values = calloc(nsockets * N_EVTS, sizeof(long long));
+    if (values == NULL)
+        exit(1);
+
+    /* initialize the PAPI library */
+    if ((retval = PAPI_library_init(PAPI_VER_CURRENT)) != PAPI_VER_CURRENT)
+        ERROR_RETURN(retval);
+
+    /* create the event set */
+    if ((retval = PAPI_create_eventset(&EventSet)) != PAPI_OK)
+        ERROR_RETURN(retval);
+
+    /* add the RAPL events of each socket to the event set */
+    for (i = 0; i < nsockets; i++)
+    {
+        /* get the OS index of the socket */
+        hwloc_obj_t obj = hwloc_get_obj_by_type(topology, HWLOC_OBJ_PACKAGE, i);
+        add_event(EventSet, obj->os_index);
+    }
+
+    /* get the number of events in the event set */
+    number = 0;
+    if ((retval = PAPI_list_events(EventSet, NULL, &number)) != PAPI_OK)
+        ERROR_RETURN(retval);
+
+    printf("There are %d events in the event set\n", number);
+
+    /* start counting */
+    if ((retval = PAPI_start(EventSet)) != PAPI_OK)
+        ERROR_RETURN(retval);
+
+    t1 = starpu_timing_now();
+
+    return retval;
+}
+\endcode
+
+\subsection StarpuEnergyStop starpu_energy_stop()
+The <c>starpu_energy_stop()</c> function uses <c>PAPI_stop()</c> to stop counting and store the values into the array. We then compute both the energy in <c>Joules</c> and the power consumption in <c>Watts</c>, and call starpu_perfmodel_update_history() to provide these explicit measurements to the performance model.
+
+
+\code{.c}
+int starpu_energy_stop(struct starpu_perfmodel *model, struct starpu_task *task, unsigned ntasks)
+{
+    double energy = 0.;
+    int retval;
+    unsigned workerid = 0;
+    unsigned cpuid = 0;
+    int k, s;
+
+    double t2 = starpu_timing_now();
+    double t = t2 - t1;
+
+    /* stop counting and store the values into the array */
+    if ((retval = PAPI_stop(EventSet, values)) != PAPI_OK)
+        ERROR_RETURN(retval);
+
+    for (s = 0; s < nsockets; s++)
+    {
+        for (k = 0; k < N_EVTS; k++)
+        {
+            energy += values[s * N_EVTS + k];
+
+            /* print the energy (in Joules) and average power (in Watts) of each event */
+            printf("%-40s%12.6f J\t(Average Power %.1fW)\n",
+                   event_names[k],
+                   (values[s * N_EVTS + k] / 1.0e9),
+                   ((values[s * N_EVTS + k] / 1.0e9) / (t * 1.0E-6)));
+        }
+    }
+
+    /* we divide here the energy by the number of tasks */
+    energy = energy / ntasks;
+
+    struct starpu_perfmodel_arch *arch = starpu_worker_get_perf_archtype(workerid, STARPU_NMAX_SCHED_CTXS);
+    starpu_perfmodel_update_history(model, task, arch, cpuid, 0, energy);
+
+    /* remove all events from the PAPI event set */
+    if ((retval = PAPI_cleanup_eventset(EventSet)) != PAPI_OK)
+        ERROR_RETURN(retval);
+
+    /* deallocate the memory associated with the empty PAPI event set */
+    if ((retval = PAPI_destroy_eventset(&EventSet)) != PAPI_OK)
+        ERROR_RETURN(retval);
+
+    /* free the resources used by PAPI */
+    PAPI_shutdown();
+
+    return retval;
+}
+\endcode
+
+
+\subsection TestExample Test example 
+In this example, we launch <c>ntasks</c> tasks in parallel, with different data sizes.
+\code{.c}
+        . . .
+        /* the different tasks are executed in parallel */
+        starpu_data_handle_t tab_handle[ntasks];
+
+        for (loop = 0; loop < ntasks; loop++)
+        {
+                task = starpu_task_create();
+                starpu_vector_data_register(&tab_handle[loop], -1, (uintptr_t)NULL, nelems, sizeof(int));
+
+                task->cl = codelet;
+                task->where = STARPU_CPU;
+                task->handles[0] = tab_handle[loop];
+
+                int ret = starpu_task_submit(task);
+                if (ret == -ENODEV)
+                        exit(STARPU_TEST_SKIPPED);
+                STARPU_CHECK_RETURN_VALUE(ret, "starpu_task_submit");
+        }
+
+        for (loop = 0; loop < ntasks; loop++)
+        {
+                starpu_data_unregister(tab_handle[loop]);
+        }
+        . . .
+\endcode
+
+
+\code{.c}
+int main(int argc, char **argv)
+{
+        ...
+
+        for (size = STARTlin; size < END; size *= 2)
+        {
+                starpu_vector_data_register(&handle, -1, (uintptr_t)NULL, size, sizeof(int));
+
+                struct starpu_task *task = starpu_task_create();
+                task->cl = &memset_cl;
+                task->handles[0] = handle;
+                task->synchronous = 1;
+                task->destroy = 0;
+
+                /* Start counting */
+                if ((retval = starpu_energy_start()) != 0)
+                        ERROR_RETURN(retval);
+
+                test_memset(size, &memset_cl);
+
+                /* Stop counting and store the values into the array */
+                if ((retval = starpu_energy_stop(&my_perfmodel, task, ntasks)) != 0)
+                        ERROR_RETURN(retval);
+
+                starpu_task_destroy(task);
+                starpu_data_unregister(handle);
+        }
+
+        ...
+}
+\endcode
+ 
+\section ExistingModularizedSchedulers Modularized Schedulers
+
+StarPU provides a powerful way to implement schedulers, as documented in \ref
+DefiningANewModularSchedulingPolicy . It is currently shipped with the following
+pre-defined Modularized Schedulers :
+
+- Eager-based Schedulers (with/without prefetching : \c modular-eager,
+\c modular-eager-prefetching) : \n
+Naive scheduler, which tries to map a task on the first available resource
+it finds. The prefetching variant queues several tasks in advance to be able to
+do data prefetching. This may however degrade load balancing a bit.
+
+- Prio-based Schedulers (with/without prefetching :
+\c modular-prio, \c modular-prio-prefetching, \c modular-eager-prio) : \n
+Similar to Eager-Based Schedulers. Can handle tasks which have a defined
+priority and schedule them accordingly.
+The \c modular-eager-prio variant integrates the eager and priority queue in a
+single component. This allows it to do a better job at pushing tasks.
+
+- Random-based Schedulers (with/without prefetching: \c modular-random,
+\c modular-random-prio, \c modular-random-prefetching, \c
+modular-random-prio-prefetching) : \n
+Selects randomly a resource to be mapped on for each task.
+
+- Work Stealing (\c modular-ws) : \n
+Maps tasks to workers in round robin, but allows workers to steal work from other workers.
+
+- HEFT Schedulers (\c modular-heft, \c modular-heft2, \c modular-heft-prio) : \n
+Map tasks to workers using a heuristic very close to
+Heterogeneous Earliest Finish Time.
+To work efficiently, they need every task submitted to StarPU to have a
+defined performance model (\ref PerformanceModelCalibration),
+but they can handle tasks without a performance
+model. \c modular-heft just takes tasks by priority order. \c modular-heft2 takes
+at most 5 tasks of the same priority and checks which one fits best. \c
+modular-heft-prio is similar to \c modular-heft, but only decides the memory
+node, not the exact worker, just pushing tasks to one central queue per memory
+node.
+
+- Heteroprio Scheduler (\c modular-heteroprio) : \n
+Maps tasks to workers similarly to HEFT, but first attributes accelerated tasks to
+GPUs, then not-so-accelerated tasks to CPUs.
+
+To use one of these schedulers, one can set the environment variable \ref STARPU_SCHED.
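+For instance: <c>export STARPU_SCHED=modular-heft</c> .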
+
 \section StaticScheduling Static Scheduling
 
 In some cases, one may want to force some scheduling, for instance force a given