/* StarPU --- Runtime system for heterogeneous multicore architectures.
 *
 * Copyright (C) 2019-2021 Université de Bordeaux, CNRS (LaBRI UMR 5800), Inria
 *
 * StarPU is free software; you can redistribute it and/or modify
 * it under the terms of the GNU Lesser General Public License as published by
 * the Free Software Foundation; either version 2.1 of the License, or (at
 * your option) any later version.
 *
 * StarPU is distributed in the hope that it will be useful, but
 * WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
 *
 * See the GNU Lesser General Public License in COPYING.LGPL for more details.
 */
/*! \page FPGASupport FPGA Support

\section Introduction Introduction

Maxeler provides hardware and software solutions for accelerating
computing applications on dataflow engines (DFEs). DFEs are accelerators
designed in-house by Maxeler; they are built around high-end
reconfigurable FPGAs and are equipped with large amounts of DDR memory.

We extend StarPU, a task programming library that initially targeted
heterogeneous architectures, to support Field Programmable Gate Arrays
(FPGAs).

<c>StarPU/FPGA</c> applications that exploit DFE configurations are
built with MaxCompiler, which splits an application into three parts:

- <c>Kernel</c>, which implements the computational components of the
  application in hardware.
- <c>Manager configuration</c>, which connects Kernels to the CPU,
  engine RAM, other Kernels and other DFEs via MaxRing.
- <c>CPU application</c>, which interacts with the DFEs to read and
  write data to the Kernels and engine RAM.

The Simple Live CPU interface (SLiC) is Maxeler's application
programming interface for seamless CPU-DFE integration. SLiC allows
CPU applications to configure and load a number of DFEs, and then to
schedule and run actions on those DFEs using simple function calls.
In StarPU/FPGA applications, we use the <em>Dynamic SLiC Interface</em>
to exchange data streams between the CPU (Main Memory) and the DFE
(Local Memory).

\section PortingApplicationsToFPGA Porting Applications to FPGA

To port an application to FPGA, set the field
starpu_codelet::fpga_funcs to provide StarPU with the function that
implements the task on the FPGA, for instance:

\verbatim
struct starpu_codelet cl =
{
	.fpga_funcs = {myfunc},
	.nbuffers = 1,
};
\endverbatim

\subsection FPGAExample StarPU/FPGA Application

To illustrate the interface used to exchange data between the
<c>host</c> (CPU) and the <c>FPGA</c> (DFE), here is an example based
on one of the Maxeler examples
(https://trac.version.fz-juelich.de/reconfigurable/wiki/Public).

<c>StreamFMAKernel.maxj</c> contains the Java kernel code, which
implements a very simple kernel (<c>c=a+b</c>). <c>Test.c</c> starts it
from the <c>fpga_add</c> function, which first sets up streaming from
the CPU pointers, then triggers execution and waits for the result.
The API used to interact with DFEs is called <em>SLiC</em>; it relies
on the <c>MaxelerOS</c> runtime.

- <c>StreamFMAKernel.maxj</c>: the DFE part is described in the MaxJ
  programming language, which is a Java-based metaprogramming approach.

\code{.java}
package tests;

import com.maxeler.maxcompiler.v2.kernelcompiler.Kernel;
import com.maxeler.maxcompiler.v2.kernelcompiler.KernelParameters;
import com.maxeler.maxcompiler.v2.kernelcompiler.types.base.DFEType;
import com.maxeler.maxcompiler.v2.kernelcompiler.types.base.DFEVar;

class StreamFMAKernel extends Kernel
{
	private static final DFEType type = dfeInt(32);

	protected StreamFMAKernel(KernelParameters parameters)
	{
		super(parameters);

		DFEVar a = io.input("a", type);
		DFEVar b = io.input("b", type);
		DFEVar c;

		c = a+b;

		io.output("output", c, type);
	}
}
\endcode

- <c>StreamFMAManager.maxj</c>: the manager is also described in the
  MaxJ programming language; it orchestrates data movement between the
  host and the DFE.

\code{.java}
package tests;

import com.maxeler.maxcompiler.v2.build.EngineParameters;
import com.maxeler.maxcompiler.v2.managers.custom.blocks.KernelBlock;
import com.maxeler.platform.max5.manager.Max5LimaManager;

class StreamFMAManager extends Max5LimaManager
{
	private static final String kernel_name = "StreamFMAKernel";

	public StreamFMAManager(EngineParameters arg0)
	{
		super(arg0);

		KernelBlock kernel = addKernel(new StreamFMAKernel(makeKernelParameters(kernel_name)));

		kernel.getInput("a") <== addStreamFromCPU("a");
		kernel.getInput("b") <== addStreamFromCPU("b");
		addStreamToCPU("output") <== kernel.getOutput("output");
	}

	public static void main(String[] args)
	{
		StreamFMAManager manager = new StreamFMAManager(new EngineParameters(args));
		manager.build();
	}
}
\endcode

Once <c>StreamFMAKernel.maxj</c> and <c>StreamFMAManager.maxj</c> are
written, a few more steps are needed:

- Build the Java program (the Kernel and Manager <c>.maxj</c> files):
\verbatim
$ maxjc -1.7 -cp $MAXCLASSPATH streamfma/
\endverbatim

- Run the Java program to generate a DFE implementation (a <c>.max</c>
  file) that can be called from a StarPU/FPGA application, together
  with the SLiC headers (<c>.h</c>) for simulation:
\verbatim
$ java -XX:+UseSerialGC -Xmx2048m -cp $MAXCLASSPATH:. streamfma.StreamFMAManager DFEModel=MAIA maxFileName=StreamFMA target=DFE_SIM
\endverbatim

- Build the SLiC object file (simulation):
\verbatim
$ sliccompile StreamFMA.max
\endverbatim

- <c>Test.c</c>: to interface the StarPU task-based runtime system
  with Maxeler's DFE devices, we use the advanced dynamic interface of
  <em>SLiC</em> in <b>non_blocking</b> mode.
  Test code must include <c>MaxSLiCInterface.h</c> and <c>MaxFile.h</c>.
  The <c>.max</c> file contains the bitstream. The StarPU/FPGA
  application can be written in C, C++, etc.

\code{.c}
#include <stdio.h>
#include <starpu.h>
#include "StreamFMA.h"
#include "MaxSLiCInterface.h"

/* Set in main(); kept at file scope so that fpga_add() can use them */
static max_file_t *maxfile;
static max_engine_t *engine;

/* CPU implementation of the same computation, defined elsewhere */
void cpu_func(void *buffers[], void *cl_arg);

void fpga_add(void *buffers[], void *cl_arg)
{
	(void)cl_arg;

	int *a = (int*) STARPU_VECTOR_GET_PTR(buffers[0]);
	int *b = (int*) STARPU_VECTOR_GET_PTR(buffers[1]);
	int *c = (int*) STARPU_VECTOR_GET_PTR(buffers[2]);

	int size = STARPU_VECTOR_GET_NX(buffers[0]);

	/* actions to run on an engine */
	max_actions_t *act = max_actions_init(maxfile, NULL);

	/* set the number of ticks for a kernel */
	max_set_ticks(act, "StreamFMAKernel", size);

	/* send input streams */
	max_queue_input(act, "a", a, size*sizeof(a[0]));
	max_queue_input(act, "b", b, size*sizeof(b[0]));

	/* store output stream */
	max_queue_output(act, "output", c, size*sizeof(c[0]));

	/* run actions on the engine in non_blocking mode */
	printf("**** Run actions in non blocking mode **** \n");
	max_run_t *run0 = max_run_nonblock(engine, act);

	printf("*** wait for the actions on DFE to complete *** \n");
	max_wait(run0);

	/* deallocate the set of actions */
	max_actions_free(act);
}

static struct starpu_codelet cl =
{
	.cpu_funcs = {cpu_func},
	.cpu_funcs_name = {"cpu_func"},
	.fpga_funcs = {fpga_add},
	.nbuffers = 3,
	.modes = {STARPU_R, STARPU_R, STARPU_W}
};

int main(int argc, char **argv)
{
	...
	/* Implementation of a maxfile */
	maxfile = StreamFMA_init();

	/* Implementation of an engine */
	engine = max_load(maxfile, "*");

	starpu_init(NULL);

	... Task submission etc. ...

	starpu_shutdown();

	/* unload and deallocate an engine obtained by way of max_load */
	max_unload(engine);

	return 0;
}
\endcode

To write the StarPU/FPGA application, the programmer first describes
the codelet using StarPU's C API. This codelet provides both a CPU
implementation and an FPGA one. It also specifies that the task has
two inputs and one output through the starpu_codelet::nbuffers and
starpu_codelet::modes attributes.

<c>fpga_add</c> is the FPGA implementation. It is divided into six
steps:

- Initialize the actions to be run on the DFE.
- Add data to an input stream for an action.
- Add data storage space for an output stream.
- Run the actions on the DFE in <b>non_blocking</b> mode; a non-blocking
  call returns immediately, allowing the calling code to do more CPU
  work in parallel while the actions are run.
- Wait for the actions to complete.
- Free the actions.

In the <c>main</c> function, there are three important steps:

- Initialize the maxfile.
- Load a DFE.
- Unload and deallocate the DFE.

The maxfile and the engine are kept in file-scope variables so that
<c>fpga_add</c> can use them. The rest of the application (data
registration, task submission, etc.) is as usual with StarPU.
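
<c>fpga_add</c> uses two different size units: <c>max_set_ticks</c>
receives a number of kernel ticks (one per element in this example),
whereas <c>max_queue_input</c> and <c>max_queue_output</c> receive
byte counts. The following is a minimal sketch of a helper that derives
both from the element count. The names <c>stream_plan</c> and
<c>make_stream_plan</c> are hypothetical and belong to neither StarPU
nor SLiC, and the <c>granule</c> parameter assumes that the platform
requires stream lengths to be a multiple of some number of bytes, which
should be checked against the SLiC documentation.

```c
#include <assert.h>
#include <stddef.h>

// Sizes needed to stream n elements to or from a kernel (illustrative type).
struct stream_plan
{
	size_t ticks; // one kernel tick per element
	size_t bytes; // byte length of each stream
};

// Fills *plan and returns 0, or returns -1 when the byte length is not
// a multiple of granule (assumed transfer granularity, in bytes).
int make_stream_plan(size_t n, size_t elem_size, size_t granule, struct stream_plan *plan)
{
	assert(plan != NULL && elem_size > 0 && granule > 0);

	size_t bytes = n * elem_size;
	if (bytes % granule != 0)
		return -1;

	plan->ticks = n;
	plan->bytes = bytes;
	return 0;
}
```

A caller would pass <c>plan.ticks</c> where <c>fpga_add</c> passes
<c>size</c>, and <c>plan.bytes</c> where it passes
<c>size*sizeof(a[0])</c>; a granule of 1 disables the check.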

Loading the design can also be delegated to StarPU by specifying an
array of load specifications in <c>starpu_conf::fpga_load</c>.

Complete examples are available in <c>tests/fpga/*.c</c>.
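
The codelet of this example names a CPU implementation,
<c>cpu_func</c>, that is not shown here. As a minimal sketch, the
arithmetic it has to perform, and a check that can be used to compare
the DFE output against it, could look as follows. The function names
<c>vector_add_ref</c> and <c>first_mismatch</c> are illustrative only;
an actual <c>cpu_func</c> would obtain the pointers and the size from
its StarPU buffers, as <c>fpga_add</c> does.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

// Reference result for the kernel: c[i] = a[i] + b[i] on 32-bit integers.
void vector_add_ref(const int32_t *a, const int32_t *b, int32_t *c, size_t n)
{
	assert(a != NULL && b != NULL && c != NULL);

	for (size_t i = 0; i < n; i++)
	{
		// unsigned addition avoids signed-overflow undefined behaviour in C
		c[i] = (int32_t)((uint32_t)a[i] + (uint32_t)b[i]);
	}
}

// Returns the index of the first differing element, or n if all match.
size_t first_mismatch(const int32_t *got, const int32_t *expected, size_t n)
{
	for (size_t i = 0; i < n; i++)
	{
		if (got[i] != expected[i])
			return i;
	}
	return n;
}
```

After the task has completed and the output vector has been acquired
by the application, comparing it with the result of
<c>vector_add_ref</c> through <c>first_mismatch</c> gives a quick
sanity check of the simulated or real DFE run.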

\subsection FPGADataTransfers Data Transfers in StarPU/FPGA Applications

The communication between the host and the DFE goes through the
<em>advanced dynamic interface</em> of SLiC, which exchanges data
between the main memory and the local memory of the DFE.

For the moment, we use \ref STARPU_MAIN_RAM to send data to and store
data from the DFE's local memory. However, we aim to use a multiplexer
to choose which memory node is used to read and write data, so that the
user can specify, for example, whether the computational kernel takes
its data from the main memory or from the DFE's local memory.

In StarPU applications, when \ref starpu_codelet::specific_nodes is
set to 1, starpu_codelet::nodes specifies the memory node to which each
piece of data should be sent for task execution.

\subsection FPGAConfiguration FPGA Configuration

To build StarPU with support for FPGA accelerators, enable <c>FPGA</c>
with the \c configure option \ref with-fpga "--with-fpga".

Compiling and installing a StarPU/FPGA application then follows the
standard procedure:

\verbatim
$ make
$ make install
\endverbatim

\subsection FPGALaunchingprograms Launching Programs: Simulation

Maxeler provides a simple tutorial on using MaxCompiler
(https://trac.version.fz-juelich.de/reconfigurable/wiki/Public).

Running the Java program to generate the maxfile and SLiC headers for
hardware, that is, for Maxeler's DFE device, takes a very long time:
approximately 2 hours even for this very small example. This is why we
use simulation.

- To start the simulation of Maxeler's DFE device:
\verbatim
$ maxcompilersim -c LIMA -n StreamFMA restart
\endverbatim

- To set up the environment for running the binary (simulation):
\verbatim
$ export LD_LIBRARY_PATH=$MAXELEROSDIR/lib:$LD_LIBRARY_PATH
$ export SLIC_CONF="use_simulation=StreamFMA"
\endverbatim

- To force tasks to be scheduled on the FPGA, one can disable the use
  of CPU cores by setting the \ref STARPU_NCPU environment variable to 0:
\verbatim
$ STARPU_NCPU=0 ./StreamFMA
\endverbatim

- To stop the simulation:
\verbatim
$ maxcompilersim -c LIMA -n StreamFMA stop
\endverbatim
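
The <c>SLIC_CONF</c> and <c>STARPU_NCPU</c> settings above can also be
applied from within the program, before any SLiC or StarPU call. The
following is a minimal sketch only: the helper name
<c>setup_simulation_env</c> is hypothetical, and it assumes that both
libraries read these variables when they are initialised, which this
page does not confirm for SLiC. <c>LD_LIBRARY_PATH</c> is not handled,
since it has to be set before the process starts.

```c
#define _POSIX_C_SOURCE 200112L // for setenv()
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

// Sets SLIC_CONF and STARPU_NCPU for a simulated, FPGA-only run.
// sim_name is the simulator name passed to "maxcompilersim -n".
// Returns 0 on success, -1 on error.
int setup_simulation_env(const char *sim_name)
{
	char conf[256];

	assert(sim_name != NULL);
	if (strlen(sim_name) == 0)
		return -1;

	int len = snprintf(conf, sizeof(conf), "use_simulation=%s", sim_name);
	if (len < 0 || (size_t)len >= sizeof(conf))
		return -1; // name does not fit in the buffer

	if (setenv("SLIC_CONF", conf, 1) != 0)
		return -1;

	// no CPU workers, so tasks can only be scheduled on the FPGA
	if (setenv("STARPU_NCPU", "0", 1) != 0)
		return -1;

	return 0;
}
```

In the example application, such a helper would be called as
<c>setup_simulation_env("StreamFMA")</c> at the very beginning of
<c>main</c>, before <c>StreamFMA_init</c>, <c>max_load</c> and
<c>starpu_init</c>.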

*/