# StarPU --- Runtime system for heterogeneous multicore architectures.
#
# Copyright (C) 2016-2020 Université de Bordeaux, CNRS (LaBRI UMR 5800), Inria
#
# StarPU is free software; you can redistribute it and/or modify
# it under the terms of the GNU Lesser General Public License as published by
# the Free Software Foundation; either version 2.1 of the License, or (at
# your option) any later version.
#
# StarPU is distributed in the hope that it will be useful, but
# WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
#
# See the GNU Lesser General Public License in COPYING.LGPL for more details.
#

```{r Setup, echo=FALSE}
opts_chunk$set(echo=FALSE)
```

```{r Load_R_files_and_functions}
print_codelet <- function(reg, codelet)
{
    cat(paste("/* ############################################ */", "\n"))
    cat(paste("/*\t Automatically generated code */", "\n"))
    cat(paste("\t Check for potential errors and be sure parameter values are written in the right order (alphabetical by default)", "\n"))
    cat(paste("\t Adjusted R-squared: ", summary(reg)$adj.r.squared, "*/\n\n"))

    # One combination per model term; the intercept is not counted.
    ncomb <- reg$rank - 1
    cat(paste("\t ", codelet, ".model->ncombinations = ", ncomb, ";\n", sep=""))
    cat(paste("\t ", codelet, ".model->combinations = (unsigned **) malloc(", codelet, ".model->ncombinations*sizeof(unsigned *))", ";\n\n", sep=""))
    cat(paste("\t if (", codelet, ".model->combinations)", "\n", "\t {\n", sep=""))
    cat(paste("\t for (unsigned i = 0; i < ", codelet, ".model->ncombinations; i++)", "\n", "\t {\n", sep=""))
    cat(paste("\t ", codelet, ".model->combinations[i] = (unsigned *) malloc(", codelet, ".model->nparameters*sizeof(unsigned))", ";\n", "\t }\n", "\t }\n\n", sep=""))

    # Computing combinations: the "factors" attribute encodes which
    # parameter appears in which term of the model.
    df <- data.frame(attr(reg$terms, "factors"))
    # A parameter appearing only inside an interaction is coded as 2,
    # so normalize the encoding to 1.
    df <- df/2
    df$Params <- row.names(df)
    # Drop the first row, which corresponds to the response (Duration).
    df <- df[c(2:nrow(df)),]

    options(warn=-1)
    for(i in (1:nrow(df)))
    {
        name <- df[i,]$Params
        # For terms of the form I(param^exp), multiply the row by the
        # exponent and keep only the parameter name.
        if (grepl("^I\\(", name))
        {
            exp <- as.numeric(gsub("(.*?)\\^(.*?)\\)", "\\2", name))
            df[i,] <- as.numeric(df[i,]) * exp
            df[i,]$Params <- as.character(gsub("I\\((.*?)\\^(.*?)\\)", "\\1", name))
        }
    }
    # Merge rows referring to the same parameter by summing their exponents.
    df <- aggregate(. ~ Params, transform(df, Params), sum)
    options(warn=0)

    # Emit one exponent per (combination, parameter) pair.
    for(j in (2:length(df)))
    {
        for(i in (1:nrow(df)))
        {
            cat(paste("\t ", codelet, ".model->combinations[", j-2, "][", i-1, "] = ", as.numeric(df[i,j]), ";\n", sep=""))
        }
    }
    cat(paste("/* ############################################ */", "\n"))
}

df <- read.csv(input_trace, header=TRUE)
opts_chunk$set(echo=TRUE)
```

# Multiple Linear Regression Model Example

## Introduction

This document demonstrates the type of analysis needed to compute a multiple linear regression model of a task. It relies on input data benchmarked by StarPU (or by any other tool, as long as it follows the same format). The input data used in this example is generated by the task "mlr_init" from "examples/mlr/mlr.c". This document can be used as a template for the analysis of any other task.

### How to compile

    ./starpu_mlr_analysis .starpu/sampling/codelets/tmp/mlr_init.out

### Software dependencies

In order to run the analysis you need to have R installed:

    sudo apt-get install r-base

In order to compile this document, you need *knitr* (although you can perfectly well use only the R code from this document without knitr). If you decide to generate this document, start R (e.g., from a terminal) and install the knitr package:

    R> install.packages("knitr")

No additional R packages are needed.
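
### Input data format

Before starting the analysis, it can be useful to check that the trace was parsed as expected. The following chunk is a minimal sanity check, assuming only the `df` data frame loaded above; for the "mlr_init" example, the columns are the measured `Duration` plus one column per parameter (M, N and K).

```{r DataCheck}
# First benchmarked measurements: one row per task execution, with the
# measured Duration and the value of each parameter for that execution.
head(df)
# Types and dimensions of the columns, to spot parsing problems early.
str(df)
```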
## First glimpse at the data

First, we show the relations between all parameters in a single plot.

```{r InitPlot}
plot(df)
```

For this example, all three parameters M, N, K have some influence, but their relation is not easy to understand. In general, this type of plot can show whether there are outliers. It can also show whether there is a group of parameters that are mutually perfectly correlated, in which case only one parameter from the group should be kept for further analysis. Additionally, the plot can show parameters that have a constant value; since these cannot have an influence on the model, they should also be ignored. However, drawing conclusions based solely on visual analysis can be treacherous, and it is better to rely on statistical tools. The multiple linear regression methods used in the following sections will also be able to detect and ignore such irrelevant parameters. Therefore, this initial visual inspection should only be used to get a basic idea about the model, and all the parameters should be kept for now.

## Initial model

At this point, an initial model is computed, using all the parameters, but not taking into account their exponents or the relations between them.

```{r Model1}
model1 <- lm(data=df, Duration ~ M+N+K)
summary(model1)
```

For each parameter and for the constant, the first column provides an estimate of the corresponding coefficient, along with its standard error. If any parameter has an NA value, which suggests that it is correlated with another parameter or that its value is constant, that parameter should not be used in the following model computations. The stars in the last column indicate the significance of each parameter. However, having the maximum of three stars for each parameter does not necessarily mean that the model is perfect; we should always inspect the adjusted R^2 value (the closer it is to 1, the better the model). Users who are not familiar with multiple linear regression analysis and the R tools are referred to the R documentation. Some explanations are also provided in the following article: https://hal.inria.fr/hal-01180272

In this example, all parameters M, N, K are very important. However, it is not clear whether there are relations between them, or whether some of these parameters should be used with an exponent. Moreover, the adjusted R^2 value is not extremely high, and we hope to get a better one. Thus, we proceed to a more advanced analysis.

## Refining the model

Now we can look for relations between the parameters. Note that trying all possible combinations for cases with a huge number of parameters can be prohibitively long. Thus, it may be better to first get rid of the parameters which seem to have very little influence (typically the ones with no stars in the table from the previous section).

```{r Model2}
model2 <- lm(data=df, Duration ~ M*N*K)
summary(model2)
```

This model is more accurate, as the adjusted R^2 value has increased. We can also try some of these parameters with exponents.

```{r Model3}
model3 <- lm(data=df, Duration ~ I(M^2)+I(M^3)+I(N^2)+I(N^3)+I(K^2)+I(K^3))
summary(model3)
```

It seems that some of these parameters are important. Now we combine them and try to find the optimal combination (here we go directly to the final solution, although this process typically takes several iterations of trying different combinations).
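
When performing such iterations, the candidate models can be compared with an objective criterion rather than by eye. The following chunk is a minimal sketch using only base R and the model1, model2 and model3 objects fitted above; a lower AIC and a higher adjusted R-squared both indicate a better trade-off between fit and model complexity.

```{r CompareCandidates}
# Akaike information criterion of each candidate model.
AIC(model1, model2, model3)
# Adjusted R-squared of each candidate model.
sapply(list(model1, model2, model3),
       function(m) summary(m)$adj.r.squared)
```

The combination that we finally settle on is the following: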
```{r Model4}
model4 <- lm(data=df, Duration ~ I(M^2):N+I(N^3):K)
summary(model4)
```

This seems to be the most accurate model, with a high adjusted R^2 value. We can proceed to its validation.

## Validation

Once the model has been computed, we should validate it. Apart from a low adjusted R^2 value, model weaknesses can often be seen even more clearly by inspecting the residuals. The results in the two following plots (and thus the accuracy of the model) greatly depend on the variability of the measurements and on the design of the experiments.

```{r Validation}
par(mfrow=c(1,2))
plot(model4, which=c(1:2))
```

Generally speaking, if there are structures in the left plot, this can indicate that certain phenomena are not explained by the model. Many points on the same horizontal line represent repeated occurrences of the task with the same parameter values, which is typical for a single experiment run with homogeneous data. Some variability is normal, as executing exactly the same code on a real machine will always take a slightly different time. However, a huge variability means that the benchmarks were very noisy, and deriving an accurate model from them will be hard. The plot on the right may show that the residuals do not follow the normal distribution; such a model would then have limited predictive power overall.

If we are not satisfied with the accuracy of the observed models, we should go back to the previous section and try to find a better one. In some cases, the benchmarked data is simply too noisy, or the choice of parameters is not appropriate; the experiments should then be redesigned and rerun. When we are finally satisfied with the model accuracy, we should modify the task code so that StarPU knows which parameter combinations are used in the model.

## Generating C code

Depending on the way the task codelet is programmed, this section may be useful. This is a simple helper that generates the C code for the parameter combinations; its output should be copied into the task description in the application. The generating function is not very robust, so make sure that the generated code correctly corresponds to the computed model (e.g., that parameters are considered in alphabetical order).

```{r Code}
print_codelet(model4, "mlr_cl")
```

## Conclusion

We have computed a model for our benchmarked data using multiple linear regression. After encoding this model into the task code, StarPU will be able to automatically compute the coefficients and use the model to predict task duration.
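
As a final sanity check, the fitted model can also be queried directly from R. The following sketch predicts the task duration for a parameter combination that was not necessarily benchmarked, together with a 95% prediction interval; the parameter values below are arbitrary and only illustrate the call.

```{r Prediction}
# Hypothetical parameter values; replace them with a combination of
# interest from the actual parameter ranges.
newdata <- data.frame(M=10, N=20, K=30)
# predict() evaluates the fitted model4 terms on the new values and
# returns the estimated Duration with a 95% prediction interval.
predict(model4, newdata, interval="prediction")
```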