A Comparison of Machine Learning Approaches for Prediction of Permeability using Well Log Data in the Hydrocarbon Reservoirs

Permeability is a vital parameter in reservoir engineering that directly affects production, so determining it accurately is essential. In this paper, the permeability of two notable carbonate reservoirs (Ilam and Sarvak) in the southwest of Iran was predicted by several methods, and the accuracy of the resulting models was compared. For this purpose, Multi-Layer Perceptron Neural Network (MLP), Radial Basis Function Neural Network (RBF), Support Vector Regression (SVR), decision tree (DT), and random forest (RF) methods were chosen. The full set of real well-logging data was screened by random forest, and five variables were selected as the most influential. Depth, computed gamma-ray log (CGR), spectral gamma-ray log (SGR), neutron porosity log (NPHI), and density log (RHOB) were used as input data, while permeability, derived from core analysis, was the output. Statistical parameters, namely the coefficient of determination (R²), root mean square error (RMSE), and standard deviation (SD), were determined for the train, test, and total sets. Based on statistical and graphical results, the SVM and DT models perform more accurately than the others: the RMSE, SD, and R² values are 0.38, 1.63, and 0.97 for the SVM model and 0.44, 2.89, and 0.96 for the DT model, respectively. The results of the best models were then compared with those of empirical equations for permeability prediction. The comparison indicates that artificial intelligence methods estimate permeability more accurately than traditional methods such as empirical equations.


Introduction
Permeability is one of the most significant concepts in reservoir engineering owing to its role in identifying flow units, characterizing reservoirs, and planning perforation [1][2][3]. Many challenges faced during production can be related to this parameter. Moreover, permeability is required for the dynamic modeling of a reservoir. Therefore, determining this parameter accurately is of considerable importance. Several methods are commonly used for this purpose, such as well testing, core analysis, and well logging. Because of their high cost and time consumption, these methods are usually not available in all wells of a field or over all intervals of interest, and the low accuracy of some methods sometimes leads practitioners to disregard their results. Thus, a model that predicts the permeability of a reservoir can support many tasks, such as field development planning, production, and reserve estimation. This paper aims to predict permeability precisely in two significant Iranian carbonate reservoirs, the Ilam and Sarvak formations.
In this paper, permeability is predicted with high precision for the Ilam and Sarvak reservoirs, two significant reservoir intervals in Iranian oil fields. A full set of well-logging data and core measurements is available for these formations in a well in the southwestern part of Iran. Two artificial neural networks, Multi-Layer Perceptron Neural Network (MLP) and Radial Basis Function Neural Network (RBF), together with Support Vector Regression (SVR), are used for predicting permeability. In addition to these approaches, two data mining methods, decision tree (DT) and random forest (RF), are applied. Finally, the results of all techniques are compared using well-known statistical parameters to determine the most accurate model. Comparing all these methods for the studied field is the primary goal of this paper; previous studies have not made such a comprehensive comparison and were limited to one or two methods. This point can therefore be considered the main innovation of this paper. In addition, no previous study has addressed permeability prediction in the Ilam and Sarvak formations. Moreover, these artificial intelligence techniques are compared with equations previously proposed for permeability prediction in Ilam and Sarvak.

Geographical, Geological, and Data Descriptions
The southwestern part of Iran, with its large hydrocarbon reserves, is considered one of the world's most productive regions [30]. The studied well is located in the Abadan plain, which is a part of this area (Figure 1). With an approximate area of 26000 km², the Abadan plain, which contains numerous oil and gas wells, lies between the Iran-Iraq border, the Persian Gulf, and the Dezful Embayment [31]. The Jurassic and Cretaceous reservoirs of this region hold large volumes of oil and gas [32]. Ilam, Sarvak, Gadvan, and Fahliyan are considered oil reservoirs in this region. Still, only the limestone Ilam and Sarvak formations, with approximate thicknesses of 90 and 600 m, respectively, are investigated in this paper. The Santonian Ilam Formation, a proven significant hydrocarbon reservoir interval, comprises light gray shallow-marine limestones with beds of algal and rudist-bearing limestones capped by shales. These shallow-water carbonates contain natural fractures [35]. The late Albian-early Turonian Sarvak Formation is the most critical reservoir in the Abadan plain. It is composed of shallow-water massive limestone and deep-marine thin-bedded facies [36]. Two essential disconformity surfaces in the Cenomanian-Turonian intervals of the Sarvak Formation in this area control the quality of the reservoir. In the Abadan plain, an argillaceous-shaly limestone (the Coniacian Laffan Formation) with a thickness of about 10-15 m separates the Sarvak and Ilam formations [37].
Previous research indicates that the Ilam reservoir is a diagenetic reservoir in the Dezful Embayment and comprises five units, one of which is an excellent, high-quality reservoir [38]. The Sarvak Formation has its best reservoir quality in intervals with rudist limestones across the Zagros area and the Persian Gulf [39]. Also, Ilam and Sarvak are fractured reservoirs in the Strait of Hormuz, the Persian Gulf [40]. Another study of these two reservoirs indicates that the Ilam Formation shows better reservoir characteristics than Sarvak in the Dezful Embayment [41].
In this work, the full set of logging data, comprising computed gamma ray (CGR), spectral gamma ray (SGR), potassium (POTA), thorium (TH), uranium (URAN), sonic (DT), density (RHOB), neutron (NPHI), photoelectric absorption (PEF), Micro-Spherical Focused Log (MSFL), laterolog deep (LLD), and laterolog shallow (LLS), belongs to an oil well located in the Abadan plain. The statistical details of these data are summarized in Table 1. Moreover, core samples of the Ilam and Sarvak formations were analyzed in the core laboratory, and the measured permeability results are also shown in Table 1. Different computational intelligence methods were used to model the permeability with a dataset containing 171 data points. After the data were gathered, the required data were selected and then normalized to enhance the accuracy of the system [42]. Because the dataset has many features, building a machine learning model on it is complex, so selecting the essential features and reducing the dimension of the dataset is necessary [43]. This also reduces training time, makes the model easier to interpret, and decreases the expense of data gathering [44]. In this technique, the choice and importance of variables depend on the selected predictive model [45]. Thus, an appropriate model should be employed to find the variables with the most impact. In this paper, well-logging data are considered as inputs. The variation of all these values with depth is shown in Figure 2. The random forest approach, which is discussed later, was used to determine the importance of the variables, and its results are illustrated in Figure 3. As shown, the two most effective variables on permeability are depth and NPHI, followed by RHOB, CGR, and SGR. Therefore, depth, NPHI, RHOB, CGR, and SGR were selected as the most influential inputs for building the models.
It must be emphasized that this data selection is suitable for this situation and cannot be generalized to all reservoirs. Figure 4 shows all possible relations between the selected variables and permeability. As indicated in this figure, the regression coefficients of CGR and NPHI versus permeability are negative, while two other inputs show a positive correlation.
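The normalization step and the sign of the relation between an input and permeability can be illustrated with a minimal sketch. The paper does not state which normalization scheme was applied, so min-max scaling is an assumption here, and the sample values are hypothetical rather than taken from the studied well.

```python
import math

def min_max_normalize(values):
    """Scale a list of numbers to [0, 1] (assumed normalization scheme)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical toy values, not measurements from the studied well.
nphi = [0.05, 0.10, 0.15, 0.20, 0.25]
perm = [12.0, 9.5, 7.0, 4.0, 2.5]

# A negative r means the fitted regression line has a negative slope.
r = pearson_r(min_max_normalize(nphi), min_max_normalize(perm))
```

Min-max scaling is a positive linear transformation, so it changes the magnitude of the regression slope but not the sign of the correlation.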

Research Methodology
The classification and regression tree (CART) technique is a non-parametric supervised learning tool that is used widely in many fields and industrial applications because of its convincing performance in cause-and-effect analysis. In data mining, decision trees (DTs) fall into two classes, the classification tree and the regression tree, and both follow the same procedure. A DT is formed by splitting the predictors, which separates the observations [46]. This process is called binary recursive partitioning: each parent node is divided into two child nodes (binary splitting), and the process continues until it reaches nodes that are not split further, called terminal nodes [47]. A DT begins by finding the splitting criterion, which depends on the inputs, such that the squared error between observed and calculated outputs is minimized. The result is one root node and two child nodes [48]. For further splitting, the same pattern is repeated on all child nodes. Finally, the DT provides logical splitting criteria based on the inputs and a diagram of the procedure.
The type of output determines the type of tree. If the outputs are categorical, the tree is a classification tree; if they are continuous, it is a regression tree [49]. In this paper, since permeability (the output) is a continuous variable, a regression tree based on five inputs (depth, NPHI, RHOB, SGR, and CGR) is used. Figure 5 illustrates this tree. Statistical information for each node is presented in the plot, together with all splitting values and outputs.
The RF method, a set of unpruned decision trees grown sequentially in place of a single restricted tree, was first proposed by Breiman (2001). RF uses bootstrap sampling, that is, random sampling with replacement, to choose the data for each tree. The samples that are not drawn are then used to test the tree. Every tree follows this procedure, and the estimate improves because random sampling makes the trees differ from one another. Finally, the average error is calculated through an internal cross-validation across all trees. To train a tree in this approach, a small subset of m out of M features is chosen at each node [50]. Although the features change from node to node, m is held constant throughout the growth of the forest. Thus, the best split at each node is sought among the m features rather than all M features [51]. Three parameters govern how the trees grow in RF: the number of trees in the forest (ntree), the size of the feature subset (mtry), and the minimum leaf size of each tree. The optimum values of these parameters lead to a convincing prediction by the ensemble of trees. The precision of the model is recalculated after the values of each feature are randomly replaced. In this way, the features are ranked by their effect on the model outputs, which is an advantage of RF.
Decision trees suffer from large errors, high variance, and a tendency to overfit; however, they perform adequately in nonlinear systems [52]. These issues can be avoided by applying random forest, an ensemble learning method suggested by Breiman et al. (1984) [53]. Random forest is an appropriate tool for unsupervised learning as well as classification and regression [54]. Like other ensemble learning techniques, random forest can improve the efficiency of weak learners such as single decision trees and single perceptrons [55].
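The splitting step described above can be sketched as an exhaustive search for the single binary split that minimizes the summed squared error of the two child nodes. This is an illustrative sketch, not the implementation used in the paper; the function names and the choice of midpoints as candidate thresholds are assumptions.

```python
def sse(values):
    """Sum of squared errors of `values` around their mean."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def best_split(rows, targets):
    """Return (feature_index, threshold, total_sse) for the best binary split.

    rows    : list of feature vectors (lists of floats)
    targets : list of continuous outputs, one per row
    Returns None when no feature has two distinct values.
    """
    best = None
    n_features = len(rows[0])
    for j in range(n_features):
        # Candidate thresholds: midpoints between consecutive distinct values.
        distinct = sorted(set(row[j] for row in rows))
        for lo, hi in zip(distinct, distinct[1:]):
            threshold = (lo + hi) / 2.0
            left = [t for row, t in zip(rows, targets) if row[j] <= threshold]
            right = [t for row, t in zip(rows, targets) if row[j] > threshold]
            total = sse(left) + sse(right)
            if best is None or total < best[2]:
                best = (j, threshold, total)
    return best
```

Applying `best_split` recursively to the rows of each child node, until a stopping rule such as a minimum node size is met, gives the binary recursive partitioning described in the text.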
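The two sources of randomness in RF, bootstrap sampling of the data and the random choice of m out of M features, can be sketched as follows. The values of ntree and mtry below are hypothetical and are not the settings used in the paper; only the dataset size (171 points) and the number of inputs (5) come from the text.

```python
import random

def bootstrap_sample(n_samples, rng):
    """Draw n_samples indices with replacement; return (in_bag, out_of_bag)."""
    in_bag = [rng.randrange(n_samples) for _ in range(n_samples)]
    chosen = set(in_bag)
    # Samples never drawn are left out of the bag and can test the tree.
    out_of_bag = [i for i in range(n_samples) if i not in chosen]
    return in_bag, out_of_bag

def random_feature_subset(n_features, mtry, rng):
    """Pick mtry distinct feature indices out of n_features for one node."""
    return sorted(rng.sample(range(n_features), mtry))

rng = random.Random(0)          # fixed seed so the sketch is repeatable
ntree, mtry = 3, 2              # hypothetical settings, not the paper's
n_samples, n_features = 171, 5  # dataset size and input count from the text

for tree_id in range(ntree):
    in_bag, oob = bootstrap_sample(n_samples, rng)
    features = random_feature_subset(n_features, mtry, rng)
    # A real forest would now grow a tree on `in_bag`, drawing a fresh
    # feature subset at every node, and score it on `oob`.
```

Because sampling is done with replacement, some points appear several times in a bag while others are left out, which is what makes the trees differ from one another.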
The structure of ANNs was derived from the structure of the brain, and they simulate the way the brain learns [56]. Because ANNs are mathematically non-linear, accessible, and simple models, they are used widely [57,58]. One widely used artificial intelligence method is the multi-layer perceptron network [59,60]. These networks perform well in modeling non-linear systems [64]. The hidden layer(s) relate the input layer to the output layer of a model [62]. The effectiveness of an MLP-NN strongly depends on the number of hidden layers and the number of neurons in them [61]. Even one hidden layer is highly effective for constructing a proper model, and networks with two or more hidden layers are appropriate for very complicated systems. The value of a neuron in a hidden or output layer is obtained by multiplying the value of each connected neuron in the previous layer by its weight and summing the products. A bias is then added, and an activation function is applied to this value to determine the final output. There are several kinds of activation function. Usually, tansig is used for the hidden layer and purelin for the output layer [63], which are as follows: $\mathrm{tansig}(x) = \frac{2}{1 + e^{-2x}} - 1$ and $\mathrm{purelin}(x) = x$.
Among the learning algorithms, the Levenberg-Marquardt algorithm performs best in estimating the efficiency function [77], and it is used in this paper. To obtain a network well suited to modeling non-linear problems, the tansig function was selected for the hidden layer and purelin for the output layer. Networks with different numbers of neurons were trained, and the best-performing one was selected by comparison through a trial-and-error procedure. Meanwhile, the mean squared error (MSE) was calculated in all iterations during training. Figure 6 displays the MSE values for different numbers of neurons in both hidden layers.
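The forward computation described above (weighted sum, bias, activation) can be sketched in a few lines. This is a minimal illustration assuming tansig hidden layers and a purelin output layer; the weights are random placeholders, training with the Levenberg-Marquardt algorithm is not shown, and the layer sizes in the example are hypothetical.

```python
import math
import random

def tansig(x):
    """Hyperbolic tangent sigmoid; tanh(x) equals 2 / (1 + exp(-2x)) - 1."""
    return math.tanh(x)

def purelin(x):
    """Linear (identity) transfer function."""
    return x

def layer(inputs, weights, biases, activation):
    """One dense layer: activation(sum_j w[i][j] * inputs[j] + b[i])."""
    return [
        activation(sum(w * v for w, v in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]

def mlp_forward(x, params):
    """Forward pass: tansig on every hidden layer, purelin on the output."""
    *hidden, output = params
    for weights, biases in hidden:
        x = layer(x, weights, biases, tansig)
    weights, biases = output
    return layer(x, weights, biases, purelin)

def random_params(sizes, rng):
    """Random weights and zero biases for consecutive layer sizes (placeholders)."""
    return [
        ([[rng.uniform(-1.0, 1.0) for _ in range(n_in)] for _ in range(n_out)],
         [0.0] * n_out)
        for n_in, n_out in zip(sizes, sizes[1:])
    ]

# Five inputs, two hidden layers, one output; untrained, for illustration only.
net = random_params([5, 3, 5, 1], random.Random(0))
prediction = mlp_forward([0.1, 0.2, 0.3, 0.4, 0.5], net)
```

In a trial-and-error search, networks of several sizes would each be trained and the structure with the lowest MSE kept.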
According to Figure 6, the MLP structure with two hidden layers, in which the first has three neurons and the second has five neurons, is the best one, with the minimum MSE. All characteristics of the network are presented in Table 2.
Because RBF networks model non-linear systems well, they are used routinely [65]. They are well suited to modeling nonlinear data accurately. Another benefit is that they find the solution in a single step, in contrast to the iterative procedure of MLP networks [66]. There are different methods for finding the centers of RBF networks, such as random selection of centers, clustering, and density estimation. In this paper, clustering has been used. K-means is more precise than the other methods for finding centers and is therefore the most commonly used clustering method. The equation below gives the output of the RBF network [67,68]: $y(\mathbf{x}) = \sum_{i=1}^{N} w_i \, \varphi(\lVert \mathbf{x} - \mathbf{c}_i \rVert)$, where $\mathbf{c}_i$ is the center, $N$ is the number of neurons in the hidden layer, $\lVert \mathbf{x} - \mathbf{c}_i \rVert$ is the distance between the inputs and the centers, $w_i$ is the weight parameter, $\varphi$ is the kernel function, and $M$ is the number of inputs, so that $\mathbf{x} \in \mathbb{R}^M$.
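The RBF output, a weighted sum of kernel responses to the distance between the input and each center, can be sketched as below. The Gaussian kernel and the spread parameter `sigma` are assumptions, since the kernel used in the paper is not specified in this section; the centers are assumed to come from a clustering step such as K-means, which is not shown.

```python
import math

def gaussian_kernel(distance, sigma):
    """Gaussian radial basis function of the input-to-center distance."""
    return math.exp(-(distance ** 2) / (2.0 * sigma ** 2))

def euclidean_distance(x, center):
    """Euclidean norm ||x - c|| for two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, center)))

def rbf_output(x, centers, weights, sigma):
    """Weighted sum of kernel responses over the N hidden neurons."""
    return sum(
        weight * gaussian_kernel(euclidean_distance(x, center), sigma)
        for center, weight in zip(centers, weights)
    )

# Hypothetical two-neuron network with two inputs.
centers = [[0.0, 0.0], [3.0, 4.0]]
weights = [2.0, 1.0]
y = rbf_output([0.0, 0.0], centers, weights, sigma=5.0)
```

With the centers fixed, the output is linear in the weights, which is why the weights can be found in a single least-squares step rather than by iterative training.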
The schematic form of the RBF network used in this paper is shown in Figure 7. In addition, the proper number of neurons in the hidden layer must be determined for the RBF network. For this purpose, neurons were added one by one, and the MSE value was calculated each time. The minimum MSE corresponds to the best network structure. As illustrated in Figure 8, a network with 25 neurons performs best.
Moreover, SVR is a technique that estimates the target variables through the principles of statistical learning theory. The input data are mapped from the original space into a higher, m-dimensional feature space, where an appropriate linear regression model is used to estimate the unknown values [78]. The different phases of this technique are displayed in Figure 9.
As a final point, two main approaches were used in this paper to determine the optimum parameters of all implemented artificial intelligence approaches: the procedure of previous research [10] and trial and error.
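The SVR idea described above, a linear model in a higher-dimensional feature space, is usually evaluated through a kernel function rather than an explicit mapping. The sketch below assumes a Gaussian kernel and the standard epsilon-insensitive loss; neither the kernel nor the parameter values used in the paper are stated in this section, so `gamma`, `epsilon`, and the coefficients are hypothetical.

```python
import math

def rbf_kernel(a, b, gamma):
    """Gaussian kernel exp(-gamma * ||a - b||^2), one common kernel choice."""
    return math.exp(-gamma * sum((p - q) ** 2 for p, q in zip(a, b)))

def svr_predict(x, support_vectors, dual_coefs, bias, gamma):
    """f(x) = sum_i coef_i * K(sv_i, x) + bias for an already trained model."""
    return bias + sum(
        coef * rbf_kernel(sv, x, gamma)
        for sv, coef in zip(support_vectors, dual_coefs)
    )

def epsilon_insensitive_loss(y_true, y_pred, epsilon):
    """Errors smaller than epsilon cost nothing; larger ones grow linearly."""
    return max(0.0, abs(y_true - y_pred) - epsilon)

# Hypothetical trained model with a single support vector.
f = svr_predict([1.0, 2.0], [[1.0, 2.0]], [0.5], bias=0.1, gamma=1.0)
```

Finding the coefficients requires solving a constrained optimization problem, which is not shown here; only the prediction step and the loss that training minimizes are illustrated.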

Results and Discussion
This section of the paper focuses on analyzing models to select the best of them. Some statistical and graphical methods are used for this purpose [43,69].
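The performance and precision of the implemented models should be assessed with statistical parameters. Therefore, the errors are analyzed with three indices, the coefficient of determination (R²), root mean square error (RMSE), and standard deviation (SD) [70]. In their standard forms, $R^2 = 1 - \frac{\sum_{i=1}^{N} (y_i^{exp} - y_i^{pred})^2}{\sum_{i=1}^{N} (y_i^{exp} - \bar{y}^{exp})^2}$ and $\mathrm{RMSE} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (y_i^{exp} - y_i^{pred})^2}$, and SD follows the definition in [70], where $y_i^{exp}$ is the experimental value, $y_i^{pred}$ is the estimated value, $\bar{y}^{exp}$ is the average of the experimental values, and $N$ is the number of data points.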
RMSE measures the dispersion of the data around the regression line, whereas R² indicates how well the predicted data match the measured data. Besides, the standard deviation (SD) is a measure of the variation of a set of values. Therefore, these parameters can evaluate the precision of the implemented models by quantifying the discrepancies between measured and predicted data.
These indices are calculated and presented in Table 3, which reveals the high accuracy of the models, especially DT and SVM, and the lower precision of the RF model. As listed in this table, the statistical parameters have been calculated for the training, test, and total data sets. The training and test sets contain 120 and 51 data points, respectively. In the total set, the highest R² values belong to the SVM and DT methods (0.97 and 0.96). As shown in Table 3, the SVM and DT models also have the lowest RMSE and SD. These numerical assessments indicate the higher precision of these two methods, although the other implemented models (RF, RBF, and MLP) also show relatively high accuracy. To give a visual insight into permeability prediction by all methods, Figure 10 plots measured core data against predicted data for the part of the dataset in which the methods perform best. This figure depicts the predicted data (vertical axes), the measured data (horizontal axes), and their regression line. The permeability values of the test data are shown with red markers and those of the train data with blue markers. It can be deduced that the SVM model shows the closest match between measured and predicted data.
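As a minimal sketch, R² and RMSE can be computed as follows, assuming their standard definitions. SD is left out because its exact definition is given in [70] and is not reproduced here. The sample values are hypothetical.

```python
import math

def r_squared(measured, predicted):
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    mean = sum(measured) / len(measured)
    ss_res = sum((m - p) ** 2 for m, p in zip(measured, predicted))
    ss_tot = sum((m - mean) ** 2 for m in measured)
    return 1.0 - ss_res / ss_tot

def rmse(measured, predicted):
    """Root mean square error between measured and predicted values."""
    n = len(measured)
    return math.sqrt(sum((m - p) ** 2 for m, p in zip(measured, predicted)) / n)

# Hypothetical values, not results from the paper.
measured = [1.0, 2.0, 3.0, 4.0]
perfect = [1.0, 2.0, 3.0, 4.0]
reversed_trend = [4.0, 3.0, 2.0, 1.0]

r2_perfect = r_squared(measured, perfect)          # exact match
r2_reversed = r_squared(measured, reversed_trend)  # worse than the mean
```

With this definition, R² becomes negative whenever a model predicts worse than simply using the mean of the measured data, which is how a negative R² can arise for a poorly fitting equation.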
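The split of the 171 data points into 120 training and 51 test points corresponds to a 70/30 partition. A minimal sketch is given below; random shuffling and the seed value are assumptions, since the text does not state how the points were assigned to each set.

```python
import random

def train_test_split(indices, train_fraction, seed):
    """Shuffle the indices and split them into train and test parts."""
    shuffled = list(indices)
    random.Random(seed).shuffle(shuffled)  # assumed: random assignment
    n_train = round(train_fraction * len(shuffled))
    return shuffled[:n_train], shuffled[n_train:]

# 171 data points and a 70% training share, as reported in the text.
train_idx, test_idx = train_test_split(range(171), 0.70, seed=42)
```

Keeping the seed fixed makes the partition repeatable, so the statistical parameters of different models are compared on the same train and test sets.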

Figure 10. The cross plot of modeling prediction of permeability versus measured data for both train and test data
Moreover, Figure 11 illustrates the relative deviations of these models. In other words, these plots show the dispersion between the data and the outputs of the constructed models. These variations can be distinguished over different ranges of data (horizontal axes). As illustrated in Figure 11, the SVM model shows the lowest scatter and the most concentrated data. For better comprehension, the performances of all methods are depicted together in Figure 12 so that the differences can be assessed visually. In this figure, the measured permeability values of core data are shown with black dots, and colored lines show the permeability values obtained by the five methods. This figure confirms the superiority of the DT and SVM techniques for permeability prediction and also demonstrates the weaker performance of the RF model.
The use of experimental equations to estimate permeability has a long history [71][72][73][74][75]. These correlations relate permeability to pore characteristics such as porosity. The proposed empirical equations are not very precise, especially in carbonate rocks with a complicated texture. Also, these equations are applicable only in the region for which they were developed. In other words, generalizing these relations seems impossible [73]. In this section, the results of this paper are compared with previous experimental equations to assess the advantage of artificial intelligence methods. For this purpose, two equations proposed for permeability prediction in the Ilam and Sarvak formations in this well have been reported as follows [76]:
Sarvak Formation = 0/1505
The permeability values predicted by these two equations, together with those of the best models in this study (DT and SVM), are illustrated in Figure 13. This figure provides a visual comparison.
As presented, these two graphs are plotted against depth. Measured permeability values are shown as yellow circles, whereas the values predicted by the equations and the artificial intelligence approaches are shown as colored lines. It must be emphasized that there is a gap in the range of 3200 m to 3235 m in the Sarvak Formation, since core data have not been reported in this interval. Therefore, the Sarvak Formation has been assessed in two distinct zones. As depicted in Figure 13, the experimental equations perform worse than the DT and SVM models in both the Ilam and Sarvak formations. These equations are incapable of predicting permeability values that are far from the average of the data, whereas the DT and SVM models act reliably even for outlying data. These two artificial intelligence methods follow the trend of the measured data appropriately. Moreover, the statistical parameters R², RMSE, and SD are calculated for the permeability predicted by Equations 7 and 8 over the whole of the Ilam and Sarvak formations. These values are listed in Table 4. For better comparison, the corresponding statistical parameters of the DT and SVM methods are also listed.
As shown, the statistical values are dramatically different for the equations. R² is very low, indicating a poor fit. The RMSE and SD of the equations are also larger than those of the DT and SVM models. These results indicate the poor performance of empirical equations compared to artificial intelligence approaches.

Conclusion
Different types of data mining and neural network techniques were applied to predict the permeability of the Ilam and Sarvak formations, two reservoirs in the southwestern part of Iran. For this purpose, MLP and RBF neural networks and SVR, in addition to DT and RF techniques, were selected for an all-around comparison. Since other researchers investigated these methods individually, the comparison of several techniques can be considered the main novelty of the present paper. Well-logging data are used as input data. In the studied well, CGR, SGR, DT, LLD, LLS, MSFL, NPHI, RHOB, POTA, URAN, TH, and PEF logs are available. The variables with the most influence on permeability were selected, and the dimension reduced, by random forest. Depth, SGR, CGR, NPHI, and RHOB were chosen as the most influential parameters and were used as input data. 70% of the data were allocated for training and the rest for testing. The statistical parameters R², SD, and RMSE were calculated for all approaches. Furthermore, cross plots of measured and predicted data and the relative deviations of all models were plotted. The graphical and statistical results show that all of the constructed models perform fairly accurately. However, DT and SVM are the best techniques among these models for permeability prediction in the Ilam and Sarvak formations. The R² values of these two methods are 0.97 (SVM model) and 0.96 (DT model). Moreover, the SD and RMSE of these models are 1.63 and 0.38 (for SVM) and 2.89 and 0.44 (for DT), respectively. The performances of the DT and SVM models in permeability prediction were compared with the previously proposed equations for this purpose. The results indicate that the empirical equations perform poorly in permeability prediction (with statistical parameters of R² = -0.75, SD = 11.26, and RMSE = 5.78), whereas machine learning methods are more accurate.

Data Availability Statement
The data presented in this study are available in the article.