
/* StarPU --- Runtime system for heterogeneous multicore architectures.
 *
 * Copyright (C) 2018-2020 Université de Bordeaux, CNRS (LaBRI UMR 5800), Inria
 *
 * StarPU is free software; you can redistribute it and/or modify
 * it under the terms of the GNU Lesser General Public License as published by
 * the Free Software Foundation; either version 2.1 of the License, or (at
 * your option) any later version.
 *
 * StarPU is distributed in the hope that it will be useful, but
 * WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
 *
 * See the GNU Lesser General Public License in COPYING.LGPL for more details.
 */
/*! \page StarPUCore StarPU Core

\section CoreEntities StarPU Core Entities

TODO

\subsection CoreEntitiesOverview Overview

Execution entities:

- <b>worker</b>: A worker (see \ref CoreEntitiesWorkers, \ref
CoreEntitiesWorkersAndContexts) is a CPU thread created by StarPU to manage
one computing unit. The computing unit can be a local CPU core, an accelerator
or GPU device, or --- on the master side when running in master-slave
distributed mode --- a remote slave computing node. It is responsible for
querying scheduling policies for tasks to execute.

- <b>sched_context</b>: A scheduling context (see \ref CoreEntitiesContexts, \ref
CoreEntitiesWorkersAndContexts) is a logical set of workers governed by an
instance of a scheduling policy. It defines the computing units to which the
scheduling policy instance may assign work entities.

- <b>driver</b>: A driver is the set of hardware-dependent routines used by a
worker to initialize its associated computing unit, execute work entities on
it, and finalize the computing unit usage at the end of the session.

Work entities:

- <b>task</b>: A task is a high-level work request submitted to StarPU by the
application, or internally by StarPU itself.

- <b>job</b>: A job is a low-level view of a work request. It is not exposed to
the application. A job structure may be shared among several task structures
in the case of a parallel task.

Data entities:

- <b>data handle</b>: A data handle is a high-level, application-opaque object
designating a piece of data currently registered to the StarPU data management
layer. Internally, it is a \ref _starpu_data_state structure.

- <b>data replicate</b>: A data replicate is a low-level object designating one
copy of a piece of data registered to StarPU as a data handle, residing in one
memory node managed by StarPU. It is not exposed to the application.
\subsection CoreEntitiesWorkers Workers

A <b>worker</b> is a CPU thread created by StarPU. Its role is to manage one
computing unit. This computing unit can be a local CPU core, in which case the
worker thread manages the actual CPU core to which it is assigned; or it can be
a computing device such as a GPU or an accelerator (or even a remote computing
node when StarPU is running in distributed master-slave mode). When a worker
manages a computing device, the CPU core to which the worker's thread is
assigned is, by default, dedicated to the device management work and does not
participate in computation.
\subsubsection CoreEntitiesWorkersStates States

<b>Scheduling operations related state</b>

While a worker is conducting a scheduling operation, e.g. while it is in the
process of selecting a new task to execute, its flag \c state_sched_op_pending
is set to \c !0; otherwise it is set to \c 0.

While \c state_sched_op_pending is \c !0, the following exhaustive list of
operations on that worker is restricted as stated:

- adding the worker to a context is not allowed;
- removing the worker from a context is not allowed;
- adding the worker to a parallel task team is not allowed;
- removing the worker from a parallel task team is not allowed;
- querying state information about the worker is only allowed while
<code>state_relax_refcnt > 0</code>;
- in particular, querying whether the worker is blocked on a parallel team entry is only
allowed while <code>state_relax_refcnt > 0</code>.

Entering and leaving the state_sched_op_pending state is done through calls to
\ref _starpu_worker_enter_sched_op() and \ref _starpu_worker_leave_sched_op()
respectively (see these functions in use in \ref _starpu_get_worker_task() and
\ref _starpu_get_multi_worker_task()). These calls ensure that any conflicting
operation deferred while the worker was in the state_sched_op_pending state is
performed in an orderly manner, as sketched below.
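A minimal sketch of how a scheduling operation is bracketed inside a worker's
loop follows. It assumes the internal calling convention of the two functions
(a pointer to the worker structure, with the worker's \c sched_mutex held) and
a hypothetical \c pick_task_from_scheduler() step; both are illustrative
assumptions, not the exact internal code.

\code{.c}
/* Sketch of the sched_op bracket; pick_task_from_scheduler() is a
 * hypothetical placeholder for the actual scheduling step. */
starpu_pthread_mutex_lock(&worker->sched_mutex);
_starpu_worker_enter_sched_op(worker);    /* state_sched_op_pending := !0 */
starpu_pthread_mutex_unlock(&worker->sched_mutex);

task = pick_task_from_scheduler(worker);  /* hypothetical scheduling step */

starpu_pthread_mutex_lock(&worker->sched_mutex);
_starpu_worker_leave_sched_op(worker);    /* state_sched_op_pending := 0;
                                             deferred requests get processed */
starpu_pthread_mutex_unlock(&worker->sched_mutex);
\endcode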
<br>
<b>Scheduling contexts related states</b>

Flag \c state_changing_ctx_notice is set to \c !0 when a thread is about to
add the worker to a scheduling context or remove it from a scheduling context,
and is currently waiting for a safe window to do so, until the targeted worker
is no longer in a scheduling operation or parallel task operation. While set
to \c !0, this flag also prevents the targeted worker from attempting a fresh
scheduling operation or parallel task operation, to avoid starvation. However,
a scheduling operation that was already in progress before the notice is
allowed to complete.

Flag \c state_changing_ctx_waiting is set to \c !0 when a scheduling context
worker addition or removal involving the targeted worker is about to occur and
the worker is currently performing a scheduling operation, to tell the
targeted worker that the initiator thread is waiting for the scheduling
operation to complete and should be woken up upon completion.
<br>
<b>Relaxed synchronization related states</b>

Any StarPU worker may participate in scheduling operations, and in this
process may be forced to observe state information from other workers. A
StarPU worker thread may therefore be observed by any thread, even by other
StarPU workers. Since workers may observe each other in any order, it is not
possible to rely exclusively on the \c sched_mutex of each worker to protect
the observation of worker state flags by other workers: worker A observing
worker B would involve locking workers in (A, B) sequence, while worker B
observing worker A would involve locking workers in (B, A) sequence, leading
to lock inversion deadlocks. Consequently, no thread must hold more than one
worker's \c sched_mutex at any time.

Instead, workers implement a relaxed locking scheme based on the
\c state_relax_refcnt counter, itself protected by the worker's
\c sched_mutex. When <code>state_relax_refcnt > 0</code>, the targeted
worker's state flags may be observed; otherwise the thread attempting the
observation must repeatedly wait on the targeted worker's \c sched_cond
condition until <code>state_relax_refcnt > 0</code>.

The relaxed mode, while on, can be seen as a transactional consistency model,
where concurrent accesses are authorized and potential conflicts are resolved
after the fact. When the relaxed mode is off, the consistency model becomes a
mutual exclusion model, where the \c sched_mutex of the worker must be held in
order to access or change the worker state. A sketch of an observation under
the relaxed scheme follows.
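The following sketch shows how an observer thread would read another worker's
state flags under the relaxed scheme. The field names follow the text above;
the loop structure is an illustrative assumption.

\code{.c}
/* Sketch: observing worker B's state flags under the relaxed scheme. */
starpu_pthread_mutex_lock(&B->sched_mutex);
while (B->state_relax_refcnt == 0)
	/* B is not observable yet: wait on its condition variable */
	starpu_pthread_cond_wait(&B->sched_cond, &B->sched_mutex);
int blocked = B->state_blocked_in_parallel;  /* the observation itself */
B->state_blocked_in_parallel_observed = 1;   /* taint B (see the parallel
                                                tasks related states below) */
starpu_pthread_mutex_unlock(&B->sched_mutex);
\endcode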
<br>
<b>Parallel tasks related states</b>

When a worker is scheduled to participate in the execution of a parallel task,
it must wait for the whole team of workers participating in the execution of
this task to be ready. While the worker waits for its teammates, it is not
available to run other tasks or perform other operations. Such a waiting
operation can therefore not start while conflicting operations, such as
scheduling operations and scheduling context resizing involving the worker,
are ongoing. Conversely, these and other operations may query whether the
worker is blocked on a parallel task entry with
\ref starpu_worker_is_blocked_in_parallel().

The \ref starpu_worker_is_blocked_in_parallel() function is allowed to proceed
while and only while <code>state_relax_refcnt > 0</code>. Due to the relaxed
worker locking scheme, the \c state_blocked_in_parallel flag of the targeted
worker may change after it has been observed by an observer thread.
Consequently, the \c state_blocked_in_parallel_observed flag of the targeted
worker is set to \c 1 by the observer immediately after the observation, to
"taint" the targeted worker. The targeted worker will clear this tainting flag
and defer the processing of parallel task related requests until a full
scheduling operation shot completes without the worker being tainted again.
The purpose of this tainting flag is to prevent parallel task operations from
being started immediately after the observation of a transient scheduling
state.
A worker's management of parallel tasks is governed by the following set of
state flags and counters (a sketch of the corresponding request/acknowledge
handshake follows the list):

- \c state_blocked_in_parallel: set to \c !0 while the worker is currently
blocked on a parallel task;

- \c state_blocked_in_parallel_observed: set to \c !0 to taint the worker when
a thread has observed the \c state_blocked_in_parallel flag of this worker
while its \c state_relax_refcnt counter was \c >0. Any pending request to add
or remove the worker from a parallel task team will be deferred until a whole
scheduling operation shot completes without the worker being tainted again;

- \c state_block_in_parallel_req: set to \c !0 when a thread is waiting on a
request for the worker to be added to a parallel task team. Must be protected
by the worker's \c sched_mutex;

- \c state_block_in_parallel_ack: set to \c !0 by the worker when
acknowledging a request for being added to a parallel task team. Must be
protected by the worker's \c sched_mutex;

- \c state_unblock_in_parallel_req: set to \c !0 when a thread is waiting on a
request for the worker to be removed from a parallel task team. Must be
protected by the worker's \c sched_mutex;

- \c state_unblock_in_parallel_ack: set to \c !0 by the worker when
acknowledging a request for being removed from a parallel task team. Must be
protected by the worker's \c sched_mutex;

- \c block_in_parallel_ref_count: counts the number of consecutive pending
requests to enter parallel task teams. Only the first of a train of requests
for entering parallel task teams triggers the transition of the
\c state_block_in_parallel_req flag from \c 0 to \c 1. Only the last of a
train of requests to leave a parallel task team triggers the transition of
flag \c state_unblock_in_parallel_req from \c 0 to \c 1. Must be protected by
the worker's \c sched_mutex.
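The following sketch illustrates the handshake from the initiator's side when
asking a worker to enter a parallel task team, using the flags and counter
defined above; the exact loop structure and the wake-up call are illustrative
assumptions.

\code{.c}
/* Sketch of the block_in_parallel request/acknowledge handshake,
 * initiator side; structure is an illustrative assumption. */
starpu_pthread_mutex_lock(&worker->sched_mutex);
if (worker->block_in_parallel_ref_count++ == 0)
{
	/* first request of a train: raise the request flag */
	worker->state_block_in_parallel_req = 1;
	starpu_pthread_cond_broadcast(&worker->sched_cond);
	while (!worker->state_block_in_parallel_ack)
		starpu_pthread_cond_wait(&worker->sched_cond,
		                         &worker->sched_mutex);
	worker->state_block_in_parallel_ack = 0;
}
starpu_pthread_mutex_unlock(&worker->sched_mutex);
\endcode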
\subsubsection CoreEntitiesWorkersOperations Operations

<b>Entry point</b>

All the operations of a worker are handled in an iterative fashion, either by
the application code on a thread launched by the application, or automatically
by StarPU on a device-dependent CPU thread launched by StarPU. Whether a
worker's operation cycle is managed automatically or not is controlled per
session by the field \c not_launched_drivers of the \c starpu_conf structure,
and is decided in the \ref _starpu_launch_drivers() function.

When managed automatically, cycles of operations for a worker are handled by
the corresponding driver-specific <code>_starpu_<DRV>_worker()</code>
function, where \c DRV is a driver name such as cpu (\c _starpu_cpu_worker) or
cuda (\c _starpu_cuda_worker). Otherwise, the application must supply a thread
which repeatedly calls \ref starpu_driver_run_once() for the corresponding
worker, as sketched below. In both cases, control is then transferred to
\ref _starpu_cpu_driver_run_once() (or the corresponding driver-specific
function).
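A minimal sketch of such an application-supplied thread follows, assuming the
corresponding driver was listed in the \c not_launched_drivers field of
\c starpu_conf so that StarPU does not launch it itself; the termination
condition (\c application_must_stop()) is a hypothetical placeholder left to
the application.

\code{.c}
#include <starpu.h>

/* Sketch of an application-managed worker loop. */
void *application_driver_thread(void *arg)
{
	/* e.g. d->type = STARPU_CPU_WORKER, d->id.cpu_id = 0 */
	struct starpu_driver *d = arg;
	starpu_driver_init(d);
	while (!application_must_stop())  /* hypothetical application condition */
		starpu_driver_run_once(d);
	starpu_driver_deinit(d);
	return NULL;
}
\endcode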
The cycle of operations typically includes, at least, the following
operations:

- <b>task scheduling</b>
- <b>parallel task team build-up</b>
- <b>task input processing</b>
- <b>data transfer processing</b>
- <b>task execution</b>

When the worker cycles are handled by StarPU automatically, the iterative
operation processing ends when the \c running field of \c _starpu_config
becomes false. This field should not be read directly; instead, it should be
read through the \ref _starpu_machine_is_running() function.
<br>
<b>Task scheduling</b>

If the worker does not yet have a queued task, it calls
\ref _starpu_get_worker_task() to try and obtain a task. This may involve
scheduling operations such as stealing a queued, but not yet executed, task
from another worker. The operation may not necessarily succeed if no tasks are
ready and/or suitable to run on the worker's computing unit.

<br>
<b>Parallel task team build-up</b>

If the worker has a task ready to run and the corresponding job has a size
\c >1, then the task is a parallel task and the worker must synchronize with
the other workers participating in the parallel execution of the job, so as to
assign a unique rank to each worker. The synchronization is done through the
job's \c sync_mutex mutex, as in the sketch below.
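A hypothetical sketch of this rank assignment step follows. Only \c sync_mutex
is named in the text above; the \c ready_count and \c sync_cond fields and the
rank computation are illustrative assumptions, not the exact internal layout.

\code{.c}
/* Hypothetical sketch of the team build-up synchronization. */
starpu_pthread_mutex_lock(&job->sync_mutex);
int rank = job->ready_count++;          /* unique rank for this worker */
if (job->ready_count == job->task_size)
	/* last teammate has arrived: wake the whole team */
	starpu_pthread_cond_broadcast(&job->sync_cond);
else
	while (job->ready_count < job->task_size)
		starpu_pthread_cond_wait(&job->sync_cond, &job->sync_mutex);
starpu_pthread_mutex_unlock(&job->sync_mutex);
\endcode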
<br>
<b>Task input processing</b>

Before the task can be executed, its input data must be made available on a
memory node reachable by the worker's computing unit. To do so, the worker
calls \ref _starpu_fetch_task_input().

<br>
<b>Data transfer processing</b>

The worker makes pending data transfers (involving memory node(s) that it is
driving) progress, with a call to \ref __starpu_datawizard_progress().

<br>
<b>Task execution</b>

Once the worker has a pending task assigned and the input data for that task
are available in the memory node reachable by the worker's computing unit, the
worker calls \ref _starpu_cpu_driver_execute_task() (or the corresponding
driver-specific function) to proceed to the execution of the task.
\subsection CoreEntitiesContexts Scheduling Contexts

A scheduling context is a logical set of workers governed by an instance of a
scheduling policy. Tasks submitted to a given scheduling context are confined
to the computing units governed by the workers belonging to this scheduling
context at the time they get scheduled.

A scheduling context is identified by an unsigned integer identifier between
\c 0 and <code>STARPU_NMAX_SCHED_CTXS - 1</code>. The \c STARPU_NMAX_SCHED_CTXS
identifier value is reserved to indicate an unallocated, invalid or deleted
scheduling context. An example of creating and using a context through the
public API is sketched below.

Accesses to the scheduling context structure are governed by a
multiple-readers/single-writer lock (\c rwlock field). Changes to the
structure contents, such as additions or removals of workers and statistics
updates, must all be done with proper exclusive write access.
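The following sketch uses the public scheduling context API to create a
context confining two workers to an "eager" policy instance, submit a task to
it, and delete the context; \c my_codelet is an assumed, application-provided
codelet.

\code{.c}
/* Sketch: create a scheduling context over workers 0 and 1. */
int workers[2] = {0, 1};
unsigned ctx = starpu_sched_ctx_create(workers, 2, "my_ctx",
                                       STARPU_SCHED_CTX_POLICY_NAME, "eager",
                                       0);

struct starpu_task *task = starpu_task_create();
task->cl = &my_codelet;   /* assumed codelet */
task->sched_ctx = ctx;    /* confine the task to this context's workers */
starpu_task_submit(task);

starpu_task_wait_for_all();
starpu_sched_ctx_delete(ctx);
\endcode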
\subsection CoreEntitiesWorkersAndContexts Workers and Scheduling Contexts

A worker can be assigned to one or more <b>scheduling contexts</b>. It
exclusively receives tasks submitted to the scheduling context(s) to which it
is assigned at the time such tasks are scheduled. A worker may add itself to
or remove itself from a scheduling context.

<br>
<b>Locking and synchronization rules between workers and scheduling contexts</b>

A thread currently holding a worker's \c sched_mutex must not attempt to
acquire a scheduling context's \c rwlock, neither for writing nor for reading.
Such an attempt constitutes a lock inversion and may result in a deadlock.

A worker currently in a scheduling operation must enter the relaxed state
before attempting to acquire a scheduling context's \c rwlock, either for
reading or for writing.

When the set of workers assigned to a scheduling context is about to be
modified, all the workers in the union of the workers belonging to the
scheduling context before the change and the workers expected to belong to it
after the change must be notified using the \ref
notify_workers_about_changing_ctx_pending() function prior to the update.
After the update, all the workers in that same union must be notified of the
update completion with a call to \ref
notify_workers_about_changing_ctx_done(), as sketched at the end of this
subsection.

The function \ref notify_workers_about_changing_ctx_pending() places every
worker passed in argument in a state compatible with changing the scheduling
context assignment of that worker, possibly blocking until that worker leaves
incompatible states, such as a pending scheduling operation. If the caller of
\c notify_workers_about_changing_ctx_pending() is itself a worker included in
the set of workers passed in argument, it does not notify itself, on the
assumption that the worker is already calling
\c notify_workers_about_changing_ctx_pending() from a state compatible with a
scheduling context assignment update.

Once a worker has been notified about a pending scheduling context change, it
cannot proceed with incompatible operations, such as a scheduling operation,
until it receives the notification that the context update operation is
complete.
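A minimal sketch of this notification bracket follows. The signatures of the
two internal functions (a worker count plus an array of worker ids) and the
\c compute_ctx_change_worker_union() helper are illustrative assumptions that
may differ from the actual code.

\code{.c}
/* Sketch of the notification bracket around a context update. */
unsigned workerids[STARPU_NMAXWORKERS];
unsigned nworkers;

/* hypothetical helper: union of workers before and after the change */
nworkers = compute_ctx_change_worker_union(ctx, workerids);

notify_workers_about_changing_ctx_pending(nworkers, workerids);
/* ... update the context's worker set, holding its rwlock for writing ... */
notify_workers_about_changing_ctx_done(nworkers, workerids);
\endcode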
\subsection CoreEntitiesDrivers Drivers

Each driver defines a set of routines depending on some specific hardware.
These routines include hardware discovery/initialization, task execution,
device memory management and data transfers.

While most hardware-dependent routines are in source files located in the \c
src/drivers subdirectory of the StarPU tree, some can be found elsewhere in
the tree, such as \c src/datawizard/malloc.c for memory allocation routines or
the subdirectories of \c src/datawizard/interfaces/ for data transfer
routines.

The driver ABI defined in the \ref _starpu_driver_ops structure includes the
following operations:

- \c .init: initialize a driver instance for the calling worker managing a
hardware computing unit compatible with this driver;
- \c .run_once: perform a single driver progress cycle for the calling worker
(see \ref CoreEntitiesWorkersOperations);
- \c .deinit: deinitialize the driver instance for the calling worker;
- \c .run: execute the following sequence automatically: call \c .init,
repeatedly call \c .run_once until the \ref _starpu_machine_is_running()
function returns false, then call \c .deinit, as in the sketch below.
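The \c .run sequence can be pictured as follows, expressed in terms of the
other entries; the argument convention of the \ref _starpu_driver_ops
operations is an illustrative assumption.

\code{.c}
/* Sketch of the .run sequence in terms of the other operations. */
static int driver_run(struct starpu_driver *d,
                      const struct _starpu_driver_ops *ops)
{
	ops->init(d);
	while (_starpu_machine_is_running())
		ops->run_once(d);
	ops->deinit(d);
	return 0;
}
\endcode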
The source code common to all drivers is shared in
<code>src/drivers/driver_common/driver_common.[ch]</code>. These files include
services such as grabbing a new task to execute on a worker, managing
statistics accounting on job startup and completion, and updating the worker
status.
\subsubsection CoreEntitiesDriversMP Master/Slave Drivers

A subset of the drivers corresponds to drivers managing computing units in
master/slave mode, that is, drivers involving a local master instance managing
one or more remote slave instances on the targeted device(s). This includes
devices such as discrete manycore accelerators (e.g. Intel's Knights Corner
boards), or pseudo-devices such as a cluster of CPU nodes driven through
StarPU's MPI master/slave mode. A driver instance on the master side is named
the \b source, while a driver instance on the slave side is named the \b sink.

A significant part of the work realized on the source and sink sides of
master/slave drivers is identical among all master/slave drivers, due to the
similarities in the software pattern. Therefore, many routines are shared
among all these drivers in the \c src/drivers/mp_common subdirectory. In
particular, a set of default commands to be used between sources and sinks is
defined, assuming the availability of some communication channel between them
(see enum \ref _starpu_mp_command).
TODO

\subsection CoreEntitiesTasksJobs Tasks and Jobs

TODO

\subsection CoreEntitiesData Data

TODO

*/