A Comparison between MLR, MARS, SVR and RF Techniques: Hydrological Time-series Modeling

Pan evaporation modeling is an essential part of water resources management and water budget governance. The objective of this study was to examine the suitability of regression and tree-based techniques for estimating pan evaporation from climatic variables. Multiple linear regression (MLR), multivariate adaptive regression splines (MARS), support vector regression (SVR), and random forest (RF) techniques were employed for weekly pan evaporation modeling at the Ranichauri station, situated in the Mid-Himalayan region of Uttarakhand, India. The most appropriate inputs among the climatic variables for mapping evaporation were determined by regression approaches. The data were divided into two parts: the first three years were used for calibration and the remaining year was used for model validation. Statistical indices, namely root mean square error (RMSE), Nash-Sutcliffe coefficient of efficiency (NSE), and coefficient of determination (R²), were used to assess the performance of the weekly pan evaporation models. The scatter plots show that the models both under-predict and over-predict the weekly pan evaporation values. The RMSE values ranged from 0.542 to 0.689, the NSE values ranged from 0.953 to 0.974, and the highest R² value was found for the SVR model for the testing period. Therefore, the SVR model was found to be superior and can be applied to predict weekly pan evaporation values for the Ranichauri site.


Introduction
Techniques for quantifying evaporation loss need to be strengthened to meet essential needs such as water requirements for domestic, industrial, agricultural, commercial, and cultural purposes. Evaporation is the main factor that should be precisely determined for the water budget. It is a form of water depletion, and the entire world faces this problem. Evaporation depends on factors such as relative humidity, temperature, wind speed, and surface area. An accurate and precise assessment of evaporation is essential for managing, planning, and developing water supplies [1,2]. Evaporation is a nonlinear, unstable, and very complex process [3][4][5]. Climate change has had a significant impact on the water system [6]. In water resource modeling, multivariate adaptive regression splines (MARS) and random forest (RF) have been used successfully [7]. Various learning algorithms have been used to build models from meteorological variables and to demonstrate their ability to estimate pan evaporation [8][9][10][11][12][13][14][15][16][17][18][19].
A number of research studies have been conducted in recent years on the implementation of machine learning (ML) models for evaporation estimation across various regions [20]. Yaseen (2020) [19] tested classification and regression tree (CART), gene expression programming (GEP), cascade correlation neural network (CCNN), and support vector machine (SVM) models for the prediction of pan evaporation at two stations in Iraq and found that SVM yielded the best performance. Along with appropriate model selection, the selection of inputs and outputs is of prime importance, as it is complex and requires a formal mathematical procedure. Applying several techniques and then choosing the best-performing model is also part of the planning, management, and development of water resources [19].
This paper explores and compares the estimation of pan evaporation in Ranichauri, Uttarakhand, India using four approaches: (1) multiple linear regression (MLR), (2) multivariate adaptive regression splines (MARS), (3) random forest (RF), and (4) support vector regression (SVR). The study also examines the association of climatic variables with pan evaporation and selects the most suitable model for pan evaporation estimation. The adequacy of the developed models was examined using statistical indices computed between observed and estimated pan evaporation values: root mean square error (RMSE), coefficient of determination (R²), and Nash-Sutcliffe coefficient of efficiency (NSE).
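The three indices can be illustrated with a minimal Python sketch. The paper does not state the exact formulas it used, so the definitions here are assumptions: the standard RMSE and NSE forms, and R² taken as the squared Pearson correlation between observed and estimated values. The sample values are hypothetical and are not taken from the study.

```python
import math

def rmse(observed, estimated):
    """Root mean square error between observed and estimated values."""
    n = len(observed)
    return math.sqrt(sum((o - e) ** 2 for o, e in zip(observed, estimated)) / n)

def nse(observed, estimated):
    """Nash-Sutcliffe efficiency: 1 - SSE / (sum of squares about the observed mean)."""
    mean_obs = sum(observed) / len(observed)
    sse = sum((o - e) ** 2 for o, e in zip(observed, estimated))
    sst = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - sse / sst

def r_squared(observed, estimated):
    """Coefficient of determination, assumed here to be the squared Pearson correlation."""
    n = len(observed)
    mo = sum(observed) / n
    me = sum(estimated) / n
    cov = sum((o - mo) * (e - me) for o, e in zip(observed, estimated))
    var_o = sum((o - mo) ** 2 for o in observed)
    var_e = sum((e - me) ** 2 for e in estimated)
    return cov ** 2 / (var_o * var_e)

# Hypothetical weekly pan evaporation values (mm), for illustration only
obs = [1.2, 2.5, 3.1, 4.0, 2.2]
est = [1.0, 2.7, 3.0, 3.6, 2.5]
print(rmse(obs, est), nse(obs, est), r_squared(obs, est))
```

A perfect model gives RMSE = 0 and NSE = R² = 1. Unlike R², NSE penalizes systematic bias, which is why a model can show a high R² together with a noticeably lower NSE.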

Description of the Study Area and Data
Ranichauri is a hill station and a small village located in the Tehri Garhwal district of Uttarakhand, a northern Indian state. Ranichauri is known for the College of Forestry of Uttarakhand University of Horticulture and Forestry, situated in the Mid-Himalayan region at an altitude of 1827 m (30°18' N latitude and 78°24' E longitude). The location map of the study area is shown in Figure 1. The weekly meteorological variables, namely maximum temperature (Tmax), minimum temperature (Tmin), wind speed (W), maximum relative humidity (RHmax), minimum relative humidity (RHmin), sunshine hours (S), and evaporation (E), were obtained from the College of Forestry of Uttarakhand University of Horticulture and Forestry for the period from January 2014 to December 2017. Out of the four years of data, three years (2014 to 2016) were used for calibration, and the remaining year (2017) was used for validation.
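The chronological split described above can be sketched in a few lines of Python. The record layout, a list of (date, evaporation) pairs, and the function name `split_by_year` are assumptions made for illustration; the sample values are hypothetical.

```python
from datetime import date

def split_by_year(records, last_calibration_year=2016):
    """Split (date, value) records chronologically into calibration and validation sets."""
    calibration = [r for r in records if r[0].year <= last_calibration_year]
    validation = [r for r in records if r[0].year > last_calibration_year]
    return calibration, validation

# Hypothetical weekly records (date, evaporation in mm)
records = [
    (date(2014, 1, 7), 1.1),
    (date(2015, 6, 16), 4.2),
    (date(2016, 12, 27), 0.9),
    (date(2017, 1, 3), 0.8),
]
calibration, validation = split_by_year(records)
print(len(calibration), len(validation))
```

Splitting by date rather than at random keeps the validation year strictly after the calibration years, which avoids leaking future information into a time-series model.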

Multiple Linear Regression (MLR)
MLR is the most commonly used statistical approach for predicting an output from several input variables, and it develops a linear relationship between multiple variables. In an MLR model, the independent variables and the dependent variable form a quantitative relationship in which the values of the independent variables are associated with the value of the dependent variable. MLR can be used to estimate impacts, to predict sequences and future values, and to describe the effect of the independent variables on the dependent variable. In this study, MLR is used to predict pan evaporation, with the dataset divided into training and testing periods. The expression for MLR is defined as follows:

$$\hat{Y} = \alpha_0 + \sum_{i=1}^{p} \alpha_i X_i \quad (1)$$

where $\hat{Y}$ = estimated value of the dependent variable; $\alpha_0$ = value of the dependent variable when all independent variables are zero; $\alpha_i$ = regression coefficients; $X_i$ = independent variables; and $p$ = number of independent variables.
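As an illustration of how the coefficients of the MLR expression can be estimated, the following minimal sketch solves the least-squares normal equations in plain Python. The paper does not say which software or solver was used, so the function names `fit_mlr` and `predict_mlr` and the sample data are hypothetical.

```python
def fit_mlr(X, y):
    """Least-squares estimate of [alpha_0, alpha_1, ..., alpha_p] via the normal equations."""
    rows = [[1.0] + list(r) for r in X]          # prepend the intercept column
    p = len(rows[0])
    # Augmented matrix [A^T A | A^T y]
    aug = [[sum(r[i] * r[j] for r in rows) for j in range(p)]
           + [sum(r[i] * t for r, t in zip(rows, y))] for i in range(p)]
    # Gauss-Jordan elimination with partial pivoting
    for c in range(p):
        piv = max(range(c, p), key=lambda k: abs(aug[k][c]))
        aug[c], aug[piv] = aug[piv], aug[c]
        for k in range(p):
            if k != c:
                f = aug[k][c] / aug[c][c]
                aug[k] = [a - f * b for a, b in zip(aug[k], aug[c])]
    return [aug[i][p] / aug[i][i] for i in range(p)]

def predict_mlr(coef, x):
    """Evaluate alpha_0 + sum_i alpha_i * x_i for one input row."""
    return coef[0] + sum(a * v for a, v in zip(coef[1:], x))

# Hypothetical inputs [Tmax, W], with outputs generated from y = 0.5 + 0.1*Tmax + 0.3*W
X = [[20, 2], [25, 3], [30, 1], [22, 4], [28, 2]]
y = [0.5 + 0.1 * a + 0.3 * b for a, b in X]
coef = fit_mlr(X, y)
print([round(c, 6) for c in coef])
```

Because the hypothetical outputs are generated from a known linear rule without noise, the fitted coefficients recover that rule, which makes the sketch easy to check.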

Multivariate Adaptive Regression Splines (MARS)
MARS provides a set of flexible regression models in which the solution space of each model is divided into intervals of the predictor variables, and splines are fitted to each interval [21]. To compare subsets of the model, a computationally less expensive technique called generalized cross-validation (GCV) is used [22]. It is commonly expressed as follows:

$$GCV = \frac{MSE}{\left[1 - \dfrac{M + d\,(M-1)/2}{n}\right]^{2}} \quad (2)$$

where $MSE$ = mean squared error; $M$ = number of basis functions; $d$ = basis function penalty; and $n$ = number of observations.
MARS builds the model in two phases, a forward pass and a backward pass [23]. In the forward pass, an over-fitted model is developed by introducing a large number of basis functions into the MARS model; in the backward pass, the basis functions that contribute least are pruned using the GCV criterion. The generalized form of the MARS model is given as [24]:

$$Y = \beta_0 + \sum_{i=1}^{M} \beta_i H_{ei}\left(X_{v(e,i)}\right) \quad (3)$$

where $Y$ = output parameter; $\beta_0$ = constant value; $M$ = number of basis functions; $H_{ei}(X_{v(e,i)})$ = ith basis function; and $\beta_i$ = corresponding coefficient of $H_{ei}(X_{v(e,i)})$.

MARS is an extended form of a linear model that does not presume a particular form for the relationship between the dependent and independent variables. The algorithm constructs a model in two stages: the first stage collects the basis functions, and the second stage estimates the least-squares model. The MARS model is defined as follows [25]:

$$f(X) = \beta_0 + \sum_{i=1}^{M} \beta_i h_i(X) \quad (4)$$

where $h_i(X)$ = spline functions; $\beta_0, \beta_i$ = coefficients of the spline functions, which can be determined by minimizing the sum of squared errors; and $M$ = total number of functions in the model.
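The role of the basis functions and of the GCV criterion can be made concrete with a short Python sketch. It assumes single-variable hinge functions of the form max(0, x - knot), which is the usual choice of MARS spline, and an effective parameter count of M + d(M-1)/2 in the GCV denominator, which is one common convention and may differ from the one used in the study. All coefficients and knots are hypothetical.

```python
def hinge(x, knot, direction=1):
    """Hinge basis function: max(0, x - knot) for direction=+1, max(0, knot - x) for -1."""
    return max(0.0, direction * (x - knot))

def mars_predict(x, intercept, terms):
    """Evaluate beta_0 + sum_i beta_i * h_i(x).

    `terms` is a list of (beta, knot, direction) tuples for single-variable hinges.
    """
    return intercept + sum(b * hinge(x, k, d) for b, k, d in terms)

def gcv(mse, n_basis, penalty, n_obs):
    """GCV = MSE / (1 - C/n)^2, with C = M + d*(M-1)/2 (assumed effective parameter count)."""
    c = n_basis + penalty * (n_basis - 1) / 2.0
    return mse / (1.0 - c / n_obs) ** 2

# Hypothetical model with one knot at 15 and a different slope on each side
terms = [(0.4, 15.0, 1), (-0.2, 15.0, -1)]
print(mars_predict(20.0, 1.0, terms))   # above the knot: only the first hinge is active
print(mars_predict(10.0, 1.0, terms))   # below the knot: only the second hinge is active
print(gcv(mse=0.25, n_basis=5, penalty=3, n_obs=156))
```

The GCV value is always larger than the raw MSE, and the gap grows with the number of basis functions, which is what allows the backward pass to favour smaller models.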

Support Vector Regression (SVR)
The support vector machine (SVM) is a nonlinear generalization algorithm used for classification and regression problems, introduced by Vapnik (1992) [26]; its regression form is known as support vector regression (SVR). It is based on the principle of structural risk minimization, and it maps patterns that are not linearly separable into a higher-dimensional feature space using kernel functions. The kernel function maps the original data into the higher-dimensional space without increasing the computational cost. Instead of minimizing only the empirical error, SVR attempts to minimize an upper bound on the generalization error. The hard-margin solution is considered the first formulation of the SVR, and it is prone to overfitting. Later approaches introduced the soft margin, which generalizes well in the presence of noise and outliers, prevents overfitting, and is widely used in forecasting research. The general form of the SVR regression equation is as follows:

$$f(x) = w \cdot \varphi(x) + b \quad (5)$$

where $w \cdot \varphi(x) = \sum_{i} w_i \varphi_i(x)$; $w$ = weight vector; $\varphi(x)$ = nonlinear mapping function of the inputs; and $b$ = bias. $w$ and $b$ can be estimated by minimizing the loss function.
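The ingredients named above (kernel function, loss function, and regression equation) can be sketched in plain Python. The sketch assumes a radial basis function kernel and the epsilon-insensitive loss that is standard for SVR, and it evaluates the regression equation in its kernel form, f(x) = sum_i c_i K(x_i, x) + b. The support vectors, coefficients, and bias are hypothetical; in practice they are obtained by solving the SVR optimization problem, which is not shown here.

```python
import math

def rbf_kernel(x, z, gamma=1.0):
    """Radial basis function kernel K(x, z) = exp(-gamma * ||x - z||^2)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, z)))

def eps_insensitive_loss(y_true, y_pred, eps=0.1):
    """Zero inside the epsilon tube around the target, linear outside it."""
    return max(0.0, abs(y_true - y_pred) - eps)

def svr_predict(x, support_vectors, coefs, bias, gamma=1.0):
    """Kernel form of w . phi(x) + b: sum_i c_i * K(x_i, x) + b."""
    return sum(c * rbf_kernel(sv, x, gamma)
               for sv, c in zip(support_vectors, coefs)) + bias

# Hypothetical fitted quantities, for illustration only
support_vectors = [[0.0, 0.0], [1.0, 1.0]]
coefs = [2.0, -1.0]
bias = 0.5
print(svr_predict([0.0, 0.0], support_vectors, coefs, bias))
print(eps_insensitive_loss(3.0, 3.05), eps_insensitive_loss(3.0, 3.5))
```

The kernel lets the prediction be computed from pairwise similarities to the support vectors, so the high-dimensional mapping is never formed explicitly.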

Random Forest (RF)
Random forest is a model used for both classification and regression problems, and it belongs to the family of classification and regression tree (CART) tools. The algorithm constructs several decision trees and can handle a large number of features. A random forest is an aggregation of tree predictors. Each tree is built from the values of a random vector drawn from the same distribution for all the trees in the forest [27]. In a random forest, each tree in the model is grown with a random subset of variables [28]. The bias of the bagged trees is the same as that of a single tree, whereas the variance is reduced by decreasing the correlation between the trees of the model [29]. The variable used to split each node is also chosen from a random subset of the available features, which reduces the correlation between trees. The mean squared error of the model can be evaluated as follows [22]:

$$MSE = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i^{OOB}\right)^{2} \quad (6)$$

where $y_i$ = measured variable value; $\hat{y}_i^{OOB}$ = mean of all out-of-bag (OOB) predictions; and $n$ = number of observations.
The number of trees is also a key parameter in random forest modeling, and the performance of the model can be evaluated through the out-of-bag (OOB) error. In a random forest, over-fitting to the training data can be avoided by randomly selecting input variables during training to establish variability among the weak learners [30]. The model builds multiple decision trees, and the model's output is calculated as the mean of the outputs of the individual trees [31]. The predicted values can be estimated as follows:

$$\hat{y} = \frac{1}{K}\sum_{k=1}^{K} T_k(x) \quad (7)$$

where $\hat{y}$ = predicted output of the random forest; $K$ = total number of trees (ntree) used in the model; and $T_k(x)$ = result of each individual tree.
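The averaging of tree outputs and the out-of-bag error can be illustrated with a deliberately small Python sketch. To keep it short, each tree is a depth-1 regression tree (a stump), so the random feature subset is drawn once per tree rather than at every node as in a full random forest. This is a simplification for illustration and not the implementation used in the study; the data and function names are hypothetical.

```python
import random

def fit_stump(X, y, feature_ids):
    """Fit a depth-1 regression tree, searching only the given feature subset."""
    best = None
    for j in feature_ids:
        for t in sorted({row[j] for row in X})[:-1]:   # candidate thresholds
            left = [yi for row, yi in zip(X, y) if row[j] <= t]
            right = [yi for row, yi in zip(X, y) if row[j] > t]
            ml, mr = sum(left) / len(left), sum(right) / len(right)
            sse = sum((v - ml) ** 2 for v in left) + sum((v - mr) ** 2 for v in right)
            if best is None or sse < best[0]:
                best = (sse, j, t, ml, mr)
    if best is None:                                   # no valid split in this sample
        m = sum(y) / len(y)
        return lambda row: m
    _, j, t, ml, mr = best
    return lambda row: ml if row[j] <= t else mr

def fit_forest(X, y, n_trees=100, mtry=1, seed=0):
    """Grow n_trees stumps on bootstrap samples with mtry randomly chosen features."""
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    trees, oob_sets = [], []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]     # bootstrap sample
        feats = rng.sample(range(p), mtry)             # random feature subset
        trees.append(fit_stump([X[i] for i in idx], [y[i] for i in idx], feats))
        oob_sets.append(set(range(n)) - set(idx))      # samples this tree never saw
    return trees, oob_sets

def predict_forest(trees, row):
    """Mean of the individual tree outputs, as in Equation (7)."""
    return sum(t(row) for t in trees) / len(trees)

def oob_mse(trees, oob_sets, X, y):
    """MSE where each sample is predicted only by trees that did not train on it."""
    errs = []
    for i, (row, yi) in enumerate(zip(X, y)):
        preds = [t(row) for t, oob in zip(trees, oob_sets) if i in oob]
        if preds:
            errs.append((yi - sum(preds) / len(preds)) ** 2)
    return sum(errs) / len(errs)

# Hypothetical data: two inputs and a smooth target
X = [[10 + i, (i * 7) % 5] for i in range(12)]
y = [0.2 * row[0] + 0.1 * row[1] for row in X]
trees, oob_sets = fit_forest(X, y, n_trees=100, mtry=1, seed=0)
print(predict_forest(trees, [15, 2]), oob_mse(trees, oob_sets, X, y))
```

Because every OOB prediction comes from trees that never saw the sample, the OOB error behaves like a built-in validation estimate and is usually larger than the error measured on the training data itself.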

Input Variable Selection
The selection of appropriate inputs for the model is an essential exercise. To avoid overfitting of the models and to reduce the complexity of the prediction, the Gamma test (GT) was used to eliminate unnecessary input parameters that play an irrelevant role in predicting the output. The GT reduces complexity by selecting the best inputs, which genuinely influence the results. The combinations of input variables for the pan evaporation estimation model were assessed using the Gamma value (Г), standard error (SE), and V-ratio. The Gamma test is a nonparametric approach based on the minimization of the mean square error in the modeling [32][33][34]. The best input combination is the one with the minimum value of the Gamma statistic (Г).
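A minimal Python sketch of the near-neighbour formulation of the Gamma test is given here to show how the Gamma statistic and the V-ratio are obtained. It assumes the usual procedure: for each k up to p nearest neighbours, the mean squared input distance delta(k) and half the mean squared output difference gamma(k) are computed, a straight line is fitted to the (delta, gamma) pairs, and the intercept is taken as Г. The choice p = 10 and the sample data are assumptions; the study does not report these settings.

```python
def gamma_test(X, y, p=10):
    """Return (Gamma statistic, slope, V-ratio) from the near-neighbour Gamma test."""
    n = len(X)
    deltas, gammas = [0.0] * p, [0.0] * p
    for i in range(n):
        # Squared distances from point i to all other points, nearest first
        nbrs = sorted((sum((a - b) ** 2 for a, b in zip(X[i], X[j])), j)
                      for j in range(n) if j != i)[:p]
        for k, (d2, j) in enumerate(nbrs):
            deltas[k] += d2 / n
            gammas[k] += (y[j] - y[i]) ** 2 / (2 * n)
    # Least-squares line gamma = Gamma + A * delta; the intercept is the Gamma statistic
    md, mg = sum(deltas) / p, sum(gammas) / p
    slope = (sum((d - md) * (g - mg) for d, g in zip(deltas, gammas))
             / sum((d - md) ** 2 for d in deltas))
    intercept = mg - slope * md
    mean_y = sum(y) / n
    var_y = sum((v - mean_y) ** 2 for v in y) / n
    return intercept, slope, intercept / var_y

# Hypothetical noise-free data: y is a smooth function of a single input
X = [[i / 50.0] for i in range(50)]
y = [2.0 * row[0] for row in X]
gamma_stat, slope, v_ratio = gamma_test(X, y)
print(gamma_stat, slope, v_ratio)
```

For noise-free data the Gamma statistic is close to zero, whereas for noisy data it estimates the variance of the noise. A V-ratio near zero therefore indicates that the chosen inputs can explain most of the output variance, which is the basis for comparing input combinations.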

Results and Discussion
For the study area, the input combination of maximum temperature (Tmax), maximum relative humidity (RHmax), minimum relative humidity (RHmin), wind speed (W), and sunshine hours (S) had the minimum Gamma value (0.213), standard error (0.057), and V-ratio (0.134); therefore, this combination was selected for modeling. There are no specific guidelines for separating training and testing data in predictive modeling. However, Zounemat-Kermani et al. (2019) [35] suggested using 80% of the total data to develop the models. Thus, we divided the data into training (80%) and testing (20%) sets to set up the studied models for evaporation estimation. The data from 2014 to 2016 were used for training, and the remaining data from 2017 were used for evaluating the applied models [36][37][38][39]. The descriptive statistics, viz. minimum, maximum, mean, standard deviation, first quartile, and third quartile, of the training, testing, and complete datasets are given in Table 1. It can be seen in Table 1 that the maximum value of evaporation is 7.15 mm for the training set and 4.38 mm for the test set. Similarly, the minimum values are 0.77 and 0.48 mm, respectively. The standard deviation is 1.31 for the training set and 1.05 for the test set. There is a slight variation between the first and third quartile values of evaporation for the training and testing sets.

Modeling Weekly Pan Evaporation
The MLR, MARS, SVR, and RF models were trained with the training data set. In the case of the MARS model, the generalized cross-validation (GCV) values used for comparing subsets of the model were found to be 0.273 and 0.636 for training and testing, respectively [40]. In the SVR model, the radial basis kernel function was implemented with a kernel parameter gamma of 1.0 and a convergence epsilon of 0.001. For the RF model, the number of variables randomly sampled as candidates at each split was set to 3, with 100 trees. The results for the training and testing periods of the MLR, MARS, RF, and SVR models in estimating pan evaporation for the Ranichauri station are shown in Table 2. For the testing period, the RMSE, NSE, and R² values of the MLR model are 0.67, 0.59, and 0.82. For the SVR model, the RMSE, NSE, and R² values are 0.67, 0.64, and 0.88. The MARS model has RMSE, NSE, and R² values of 0.68, 0.52, and 0.87. The RMSE, NSE, and R² values for the RF model were found to be 0.73, 0.47, and 0.86 for the testing period. The assessment of all models during the testing period shows that the SVR model outperformed the others.
The scatter plots of the applied models during the testing period are compared in Figure 3(a-d). The illustration shows that the scatter plot of each model follows the 1:1 best-fit line; however, the SVR model has less scatter. It can be seen from Table 2 that the models' performance is not consistent between the training and testing periods. The RF model performed best in the training period, but it could not generalize to the testing period. The MLR model, which is generally considered the simplest model, did considerably well on the RMSE and NSE statistics during the testing period. These differences in the results were perhaps due to the data division, input uncertainty, and model parameter optimization. The results of the present study show that evaporation values can be estimated with sufficient precision in similar climates using a few easily measured meteorological parameters. To determine the consistency of the considered models, they should be tested in different climates with varying data lengths and training-testing splits.
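The loss of accuracy between the training and testing periods can be quantified directly from the RMSE values reported in the text for Table 2. The following minimal sketch computes the increase in RMSE for each model; the helper name `generalization_gap` is hypothetical, and the difference in RMSE is only one simple way to express the gap.

```python
# RMSE values as reported in the text for Table 2: (training, testing)
reported_rmse = {
    "MLR": (0.53, 0.67),
    "MARS": (0.39, 0.68),
    "SVR": (0.33, 0.67),
    "RF": (0.22, 0.73),
}

def generalization_gap(train_rmse, test_rmse):
    """Increase in RMSE when moving from the training to the testing period."""
    return test_rmse - train_rmse

gaps = {name: generalization_gap(tr, te) for name, (tr, te) in reported_rmse.items()}
for name, gap in sorted(gaps.items(), key=lambda kv: kv[1]):
    print(f"{name}: {gap:.2f}")
```

On these reported values, the RF model shows the largest increase in RMSE and the MLR model the smallest, which is consistent with the observation that RF fitted the training data closely but did not generalize as well.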
As a final comment, the results suggest that the accuracy of the MLR, MARS, SVR, and RF techniques was adequate when using maximum temperature (Tmax), maximum relative humidity (RHmax), minimum relative humidity (RHmin), wind speed (W), and sunshine hours (S) as meteorological inputs. In addition, the accuracy of the different machine learning methods varied for the Ranichauri station, and the SVR output was better than that of the other models examined.

Conclusion
This work employed four approaches (MLR, MARS, SVR, and RF) to estimate pan evaporation using a short data set for the Ranichauri station. Among the climatic explanatory variables, the input combination of maximum temperature (Tmax), minimum relative humidity (RHmin), maximum relative humidity (RHmax), wind speed (W), and sunshine hours (S) has a significant influence on pan evaporation; thus, these variables were chosen as inputs for the models. All four approaches can estimate pan evaporation accurately (R² ≥ 0.82). The MLR approach is the simplest one and showed considerable generalization capability for the Ranichauri station for the given data set. The RF model performed very well in the training period but could not generalize to the testing period. The performance-wise ranking of the models was SVR, MARS, MLR, and RF. Among the four models, the SVR model performed best in the testing period and also performed well in the training period. Consequently, it can be surmised that the SVR model offers the strongest potential for estimating pan evaporation in Ranichauri.

Figure 3. Scatter plot of the observed and estimated weekly pan evaporation values by MLR, MARS, SVR and RF models during the testing period for the study area

Table 2. Comparison of the various models by using statistical indices
According to Table 2, for the training period, the RMSE, NSE, and R² values of the MLR model are 0.53, 0.80, and 0.84. The SVR model has RMSE, NSE, and R² values of 0.33, 0.94, and 0.94. For the MARS model, the RMSE, NSE, and R² values are 0.39, 0.97, and 0.91. Similarly, the RMSE, NSE, and R² values for the RF model were found to be 0.22, 0.97, and 0.98 for the training period. The assessment of all models shows that the RF model has the minimum RMSE value and the highest NSE and R² values during the training period. The MLR model is the simplest model, with the highest RMSE value and the lowest NSE and R² values for the study area during the training phase.