```{r Setup, echo=FALSE}
opts_chunk$set(echo=FALSE)
```
```{r Load_R_files_and_functions}
# Generate the C code declaring the parameter combinations of the model,
# to be copied into the task description in the application.
print_codelet <- function(reg, codelet)
{
  cat(paste("/* ############################################ */", "\n"))
  cat(paste("/*\t Automatically generated code */", "\n"))
  cat(paste("\t Check for potential errors and make sure parameter values are written in the correct order (alphabetical by default)", "\n"))
  cat(paste("\t Adjusted R-squared: ", summary(reg)$adj.r.squared, "*/\n\n"))

  ncomb <- reg$rank - 1
  cat(paste("\t ", codelet, ".model->ncombinations = ", ncomb, ";\n", sep=""))
  cat(paste("\t ", codelet, ".model->combinations = (unsigned **) malloc(", codelet, ".model->ncombinations*sizeof(unsigned *))", ";\n\n", sep=""))
  cat(paste("\t if (", codelet, ".model->combinations)", "\n", "\t {\n", sep=""))
  cat(paste("\t for (unsigned i = 0; i < ", codelet, ".model->ncombinations; i++)", "\n", "\t {\n", sep=""))
  cat(paste("\t ", codelet, ".model->combinations[i] = (unsigned *) malloc(", codelet, ".model->nparameters*sizeof(unsigned))", ";\n", "\t }\n", "\t }\n\n", sep=""))

  # Computing combinations: extract the exponent of each parameter
  # in each term of the model formula.
  df <- data.frame(attr(reg$terms, "factors"))
  df <- df/2
  df$Params <- row.names(df)
  df <- df[2:nrow(df),]
  options(warn=-1)
  for (i in (1:nrow(df)))
  {
    name <- df[i,]$Params
    if (grepl("I\\(*", name))
    {
      exp <- as.numeric(gsub("(.*?)\\^(.*?)\\)", "\\2", name))
      df[i,] <- as.numeric(df[i,]) * exp
      df[i,]$Params <- as.character(gsub("I\\((.*?)\\^(.*?)\\)", "\\1", name))
    }
  }
  df <- aggregate(. ~ Params, transform(df, Params), sum)
  options(warn=0)
  for (j in (2:length(df)))
  {
    for (i in (1:nrow(df)))
    {
      cat(paste("\t ", codelet, ".model->combinations[", j-2, "][", i-1, "] = ", as.numeric(df[i,j]), ";\n", sep=""))
    }
  }
  cat(paste("/* ############################################ */", "\n"))
}

df <- read.csv(input_trace, header=TRUE)
opts_chunk$set(echo=TRUE)
```
# Multiple Linear Regression Model Example

## Introduction

This document demonstrates the type of analysis needed to compute the
multiple linear regression model of a task. It relies on input data
benchmarked by StarPU (or by any other tool, as long as it follows the
same format). The input data used in this example is generated by the
task "mlr_init", from "examples/mlr/mlr.c". This document can be used
as a template for the analysis of any other task.
### How to compile

    ./starpu_mlr_analysis .starpu/sampling/codelets/tmp/mlr_init.out

### Software dependencies

In order to run the analysis you need to have R installed:

    sudo apt-get install r-base

In order to compile this document, you need *knitr* (although you can
also use the R code from this document on its own, without knitr). If
you decide to generate this document, start R (e.g., from a terminal)
and install the knitr package:

    R> install.packages("knitr")

No additional R packages are needed.
## First glimpse at the data

First, we show the relations between all parameters in a single plot.

```{r InitPlot}
plot(df)
```
For this example, all three parameters M, N, K have some influence,
but their relation is not easy to understand.

In general, this type of plot can show whether there are outliers. It
can also show whether there is a group of parameters that are mutually
perfectly correlated, in which case only one parameter from the group
should be kept for further analysis. Additionally, the plot can show
parameters that have a constant value; since these cannot have an
influence on the model, they should also be ignored.

However, drawing conclusions based solely on visual analysis can be
treacherous, and it is better to rely on statistical tools. The
multiple linear regression methods used in the following sections will
also be able to detect and ignore such irrelevant parameters.
Therefore, this initial visual look should only be used to get a basic
idea about the model, and all the parameters should be kept for now.
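The checks described above can also be sketched programmatically. The
snippet below uses a small synthetic data frame (`df_demo`, with made-up
columns `M`, `N`, `K`, `P`) rather than the real `df` read from the
StarPU trace, so it is an illustration, not part of the analysis itself.

```r
# Synthetic data: K is constant, P is perfectly correlated with M.
df_demo <- data.frame(M = rep(1:10, 10),
                      N = rep(1:10, each = 10))
df_demo$K <- 5                 # constant parameter: no influence on the model
df_demo$P <- 2 * df_demo$M     # perfectly correlated with M: keep only one

# Parameters with a constant value cannot influence the model.
constant <- names(df_demo)[sapply(df_demo, function(x) length(unique(x)) == 1)]

# Perfectly correlated pairs (constant columns excluded, since cor() is
# undefined for them); each pair appears twice, as the matrix is symmetric.
cc <- cor(df_demo[, setdiff(names(df_demo), constant)])
diag(cc) <- 0
correlated <- which(abs(cc) > 0.999, arr.ind = TRUE)
```

Here `constant` names the columns to drop outright, and the row names of
`correlated` identify groups from which only one parameter should be kept.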
## Initial model

At this point, an initial model is computed, using all the parameters,
but not taking into account their exponents or the relations between
them.

```{r Model1}
model1 <- lm(data=df, Duration ~ M+N+K)
summary(model1)
```
For each parameter and for the constant in the first column, an
estimate of the corresponding coefficient is provided along with the
95% confidence interval. If any parameter has an NA value, which
suggests that it is correlated with another parameter or that its
value is constant, it should not be used in the following model
computations. The stars in the last column indicate the significance
of each parameter. However, having the maximum of three stars for each
parameter does not necessarily mean that the model is perfect, and we
should always inspect the adjusted R^2 value (the closer it is to 1,
the better the model is). To users who are not familiar with multiple
linear regression analysis and R tools, we suggest consulting the R
documentation. Some explanations are also provided in the following
article: https://hal.inria.fr/hal-01180272.
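These quantities can also be pulled out of the fit programmatically.
The sketch below uses synthetic data (the real `df` has `Duration`, `M`,
`N`, `K` columns from the StarPU trace); `K` is deliberately aliased
with `M` so that `lm()` reports an NA coefficient for it.

```r
# Synthetic data with one aliased parameter.
set.seed(1)
d <- data.frame(M = runif(100, 1, 10), N = runif(100, 1, 10))
d$K <- d$M                                  # aliased with M: lm() reports NA
d$Duration <- 3 * d$M + 2 * d$N + rnorm(100, sd = 0.1)

fit <- lm(Duration ~ M + N + K, data = d)

# Parameters whose coefficient is NA should be dropped from later models.
dropped <- names(which(is.na(coef(fit))))

# Adjusted R^2: the closer to 1, the better the model.
adj_r2 <- summary(fit)$adj.r.squared
```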
In this example, all parameters M, N, K are very important. However,
it is not clear whether there are relations between them or whether
some of these parameters should be used with an exponent. Moreover,
the adjusted R^2 value is not extremely high, and we hope we can get a
better one. Thus, we proceed to a more advanced analysis.
## Refining the model

Now, we can look for relations between the parameters. Note that
trying all possible combinations for cases with a huge number of
parameters can be prohibitively long. Thus, it may be better to first
get rid of the parameters which seem to have very little influence
(typically the ones with no stars in the table from the previous
section).
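Such a pre-filtering step can be sketched as below, again on synthetic
data (the hypothetical parameter `K` here is generated with no real
influence on `Duration`, mimicking a "no stars" row in the summary table).

```r
# Synthetic data: K has no influence on Duration.
set.seed(5)
d <- data.frame(M = runif(100, 1, 10), N = runif(100, 1, 10),
                K = runif(100, 1, 10))
d$Duration <- 4 * d$M + 2 * d$N + rnorm(100)

fit <- lm(Duration ~ M + N + K, data = d)
p_values <- summary(fit)$coefficients[-1, "Pr(>|t|)"]   # skip the intercept
keep <- names(p_values)[p_values < 0.05]                # parameters worth keeping
```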
```{r Model2}
model2 <- lm(data=df, Duration ~ M*N*K)
summary(model2)
```
This model is more accurate, as the R^2 value increased. We can also
try some of these parameters with exponents.

```{r Model3}
model3 <- lm(data=df, Duration ~ I(M^2)+I(M^3)+I(N^2)+I(N^3)+I(K^2)+I(K^3))
summary(model3)
```
It seems that some parameters are important. Now we combine these and
try to find the optimal combination (here we go directly to the final
solution, although this process typically takes several iterations of
trying different combinations).

```{r Model4}
model4 <- lm(data=df, Duration ~ I(M^2):N+I(N^3):K)
summary(model4)
```

This seems to be the most accurate model, with a high R^2 value. We
can proceed to its validation.
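The iterative search above can be summarized as ranking candidate
formulas by adjusted R^2. The sketch below does this on synthetic data
generated to follow the shape of the final model; with real data, the
`df` read from the StarPU trace would be used instead.

```r
# Synthetic data following the shape of the final model.
set.seed(7)
d <- data.frame(M = runif(200, 1, 5), N = runif(200, 1, 5),
                K = runif(200, 1, 5))
d$Duration <- 1.5 * d$M^2 * d$N + 0.5 * d$N^3 * d$K + rnorm(200, sd = 0.5)

# The three candidate formulas tried in this document.
candidates <- list(Duration ~ M + N + K,
                   Duration ~ M * N * K,
                   Duration ~ I(M^2):N + I(N^3):K)
adj_r2 <- sapply(candidates, function(f) summary(lm(f, data = d))$adj.r.squared)
best <- candidates[[which.max(adj_r2)]]
```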
## Validation

Once the model has been computed, we should validate it. Apart from a
low adjusted R^2 value, the model's weaknesses can be observed even
better by inspecting the residuals. The results on the two following
plots (and thus the accuracy of the model) will greatly depend on the
variability of the measurements and on the design of the experiments.

```{r Validation}
par(mfrow=c(1,2))
plot(model4, which=c(1:2))
```
Generally speaking, if there are structures visible in the left plot,
this can indicate that there are certain phenomena not explained by
the model. Many points on the same horizontal line represent
repetitive occurrences of the task with the same parameter values,
which is typical for a single experiment run with homogeneous data.
The fact that there is some variability is normal, as executing
exactly the same code on a real machine will always take a slightly
different duration. However, a huge variability means that the
benchmarks were very noisy, and thus deriving an accurate model from
them will be hard.

The plot on the right may show that the residuals do not follow the
normal distribution. In that case, the model would overall have
limited predictive power.
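The Q-Q plot can be complemented with a formal normality test on the
residuals, for instance Shapiro-Wilk. A sketch on a small synthetic fit
is given below; on real data, `model4` would be passed to `residuals()`
instead.

```r
# Synthetic fit with normally distributed noise.
set.seed(3)
d <- data.frame(x = runif(50))
d$y <- 2 * d$x + rnorm(50, sd = 0.1)
fit <- lm(y ~ x, data = d)

# A small p-value (e.g. below 0.05) is evidence against normality of the
# residuals, and hence against the model's predictive power.
p <- shapiro.test(residuals(fit))$p.value
```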
If we are not satisfied with the accuracy of the observed models, we
should go back to the previous section and try to find a better one.
In some cases, the benchmarked data is just too noisy or the choice of
parameters is not appropriate, and thus the experiments should be
redesigned and rerun.

When we are finally satisfied with the model accuracy, we should
modify our task code, so that StarPU knows which parameter
combinations are used in the model.
## Generating C code

Depending on the way the task codelet is programmed, this section may
be useful. It provides a simple helper to generate the C code for the
parameter combinations, to be copied into the task description in the
application. The function generating the code is not very robust, so
make sure that the generated code correctly corresponds to the
computed model (e.g., that the parameters are considered in
alphabetical order).

```{r Code}
print_codelet(model4, "mlr_cl")
```
## Conclusion

We have computed the model for our benchmarked data using multiple
linear regression. After encoding this model into the task code,
StarPU will be able to automatically compute the coefficients and use
the model to predict task duration.