Review of machine learning algorithms' application in pharmaceutical technology

Machine learning algorithms, and artificial intelligence in general, have a wide range of applications in pharmaceutical technology. From formulation development to integration within the Quality by Design framework, these data science tools provide a better understanding of pharmaceutical formulations and their processing. Machine learning algorithms can be especially helpful in analyzing the large volumes of data generated by process analytical technologies. This paper provides a brief explanation of artificial neural networks, one of the most frequently used machine learning algorithms. The process of network training and testing is described and accompanied by illustrative examples of machine learning tools applied in pharmaceutical formulation development and related technologies, as well as an overview of future trends. Recently published studies on more sophisticated methods, such as deep neural networks and the light gradient boosting machine algorithm, are also described. The interested reader is referred to several official documents (guidelines) that pave the way for a more structured representation of machine learning models in prospective submissions to regulatory bodies.


Introduction
There is an ever-increasing need for the rapid development of pharmaceutical products, which can rely greatly on powerful computational methods. As in other research fields, artificial intelligence (AI), and especially machine learning (ML) algorithms, has proven its great potential for deciphering complex relationships within the multivariate data generated during pharmaceutical development. Both formulation composition and processing parameters can be efficiently optimized, while the variability in the quality of the final products is minimized. ML modeling also makes it possible to analyze unstructured datasets and to predict formulation and process properties for any given combination of independent variables. These are, therefore, important tools that, coupled with conventional statistical and regression methods, create a data science platform (Figure 1). In addition to providing a better understanding of the pharmaceutical formulation and its manufacturing technology, data science is also resource-efficient.
The introduction of process analytical technologies (PAT) has facilitated the acquisition of process-related data that are high in volume, variety, and velocity of generation. This is related to the big data concept (1), whereby ML tools can provide a greater understanding of the data generated during pharmaceutical processing through the identification of sensitivities and interdependencies of variables. In this way, the variability of the final products (dosage forms), which could potentially affect therapeutic efficacy, can be reduced.
[Figure: Interrelation of different data processing techniques]

Overview of the available AI methodologies
The concepts of artificial intelligence and machine learning were introduced in the middle of the 20th century. The versatility of their application in various fields, including healthcare, the development of new medicines, and (bio)medicine in general, has been growing ever since (2). ML can be based on supervised or unsupervised learning algorithms. In supervised learning, data are designated as input (independent) and output (dependent) variables, and the algorithm searches for the relationship between them that best supports generalization and prediction. These methods are comparable to conventional regression techniques. Unsupervised learning, on the other hand, assesses the dataset as a whole and recognizes patterns and features that provide the means for clustering, dimensionality reduction, etc. The type of the analyzed data (e.g. discrete or continuous, binary or multiple classes), as well as the type of the model (e.g. parametric or non-parametric), is also relevant for the selection of the appropriate machine learning algorithm. Artificial neural networks (ANNs), as the most frequently used machine learning algorithms, can be used for both regression and classification. Damiati has provided a comprehensive comparison of the different machine learning methods used in pharmaceutical research (1).
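The distinction between supervised and unsupervised learning can be illustrated with a minimal sketch in Python. The supervised part fits a straight line to labeled input/output pairs by least squares, and the unsupervised part groups unlabeled values into two clusters with a simple two-means procedure. The function names and the numerical values are hypothetical and serve only as an illustration; they are not taken from any of the cited studies.

```python
def fit_line(x, y):
    """Supervised learning: least-squares fit of y = slope * x + intercept."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((v - mean_x) ** 2 for v in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def two_means(values, n_iter=20):
    """Unsupervised learning: split unlabeled 1-D values into two clusters."""
    centers = [min(values), max(values)]      # deterministic starting centers
    groups = [[], []]
    for _ in range(n_iter):
        groups = [[], []]
        for v in values:
            nearest = 0 if abs(v - centers[0]) <= abs(v - centers[1]) else 1
            groups[nearest].append(v)
        # Move each center to the mean of its group (keep it if the group is empty).
        centers = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
    return centers, groups


# Labeled data: inputs with known outputs (hypothetical values).
slope, intercept = fit_line([1.0, 2.0, 3.0, 4.0], [3.0, 5.0, 7.0, 9.0])

# Unlabeled data: the algorithm has to find the grouping by itself.
centers, groups = two_means([1.0, 1.2, 0.8, 5.0, 5.2, 4.8])
```

In the first case the known outputs guide the fit, whereas in the second case only the structure of the data itself is used.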
There are many different types of ANNs. The multilayer perceptron (MLP) is the most often used, as one of the simplest yet most powerful networks. It is schematically represented in Figure 2. Input data are fed into the network through the input layer neurons, whereby the number of neurons corresponds to the number of input variables. The input data are then transformed via activation functions in the hidden layer. As represented in Figure 2, sigmoid or hyperbolic functions are often used, but many others can also be implemented (3). This versatility of the functions that can be applied to model the data is one of the most important reasons why ANNs can outperform conventional regression methods. The non-linear equations used by ANNs are fitted through a number of iterations in order to capture the variability within the data. This is referred to as the training process, and there are a number of methods that can be used to adjust the synapses, i.e. the network's weights and coefficients. The backpropagation algorithm, for instance, calculates the difference between the actual and predicted output and, based on the obtained error, adjusts the weights and coefficients. The process is repeated until the network converges to an optimal solution. Depending on the type of connections, networks can be fully or partially connected. Furthermore, the type of activation function may vary between layers. ANNs can have more than one hidden layer, and recurrence of the signal may also occur (3).
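The training process described above can be sketched in a few lines of Python. The example below is a minimal illustration, not the implementation used in any of the cited studies: it assumes a single hidden layer with hyperbolic tangent activation, a linear output neuron, a squared-error loss, and full-batch gradient descent. The function name train_mlp, the network size, the learning rate, and the toy dataset (two scaled inputs and their product as the output) are all hypothetical.

```python
import math
import random


def train_mlp(samples, n_hidden=4, lr=0.1, epochs=2000, seed=0):
    """Train a one-hidden-layer MLP (tanh hidden units, linear output)
    by full-batch gradient descent with backpropagation."""
    rng = random.Random(seed)
    n_in = len(samples[0][0])
    # Each weight row ends with a bias term (its input is fixed at 1.0).
    w_hid = [[rng.uniform(-0.5, 0.5) for _ in range(n_in + 1)]
             for _ in range(n_hidden)]
    w_out = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden + 1)]

    def forward(x):
        xb = list(x) + [1.0]
        h = [math.tanh(sum(w * v for w, v in zip(row, xb))) for row in w_hid]
        hb = h + [1.0]
        y_hat = sum(w * v for w, v in zip(w_out, hb))
        return xb, hb, y_hat

    history = []  # mean loss per epoch, measured before that epoch's update
    n = len(samples)
    for _ in range(epochs):
        g_hid = [[0.0] * (n_in + 1) for _ in range(n_hidden)]
        g_out = [0.0] * (n_hidden + 1)
        loss = 0.0
        for x, y in samples:
            xb, hb, y_hat = forward(x)
            err = y_hat - y                    # predicted minus actual output
            loss += 0.5 * err * err
            for j in range(n_hidden + 1):      # gradient for the output weights
                g_out[j] += err * hb[j]
            for j in range(n_hidden):          # error propagated back to hidden layer
                delta = err * w_out[j] * (1.0 - hb[j] ** 2)
                for i in range(n_in + 1):
                    g_hid[j][i] += delta * xb[i]
        history.append(loss / n)
        # Adjust all weights against the averaged gradient.
        for j in range(n_hidden + 1):
            w_out[j] -= lr * g_out[j] / n
        for j in range(n_hidden):
            for i in range(n_in + 1):
                w_hid[j][i] -= lr * g_hid[j][i] / n

    return (lambda x: forward(x)[2]), history


# Hypothetical toy data: two inputs scaled to [0, 1], output = their product.
grid = [0.0, 0.5, 1.0]
data = [((a, b), a * b) for a in grid for b in grid]
predict, history = train_mlp(data)
```

The list history holds the mean error per iteration and can be inspected to check whether training has converged; predict((0.5, 0.5)) returns the output of the trained network for a new combination of inputs.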
Deep learning is a more recent concept, based on complex ANN structures. In essence, deep learning neural networks have a large number of hidden layers (at least three, but usually more), each layer has many nodes, and these networks are predominantly built for large datasets. Deep learning networks include convolutional, recurrent, and fully connected networks, and are discussed in more detail elsewhere (4). Other ML algorithms, such as neuro-fuzzy logic, symbolic regression, decision trees, genetic programming, etc., have also attracted attention. Fuzzy models provide the opportunity to model relationships between the input and output variables with if-then logical statements (5). Prerequisites for such a modeling approach are membership functions and fuzzy sets. For more details on neuro-fuzzy logic and neuro-fuzzy inference systems, the interested reader is referred to (6). Genetic programming (GP) is based on the theory of natural selection, whereby the GP algorithm generates populations of fitting equations and then searches for the optimal population by applying genetic operations such as "reproduction", "crossover" or "mutation" (7).
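The role of membership functions and if-then statements in a fuzzy model can be illustrated with a minimal Python sketch. It assumes triangular membership functions and a zero-order Sugeno-type inference, in which each rule has a constant output and the rule outputs are averaged with the membership degrees as weights. The input variable (compression force in kN), the shape of the fuzzy sets, the two rules, and the output values are hypothetical and chosen only for illustration.

```python
def triangular(x, a, b, c):
    """Triangular membership function with feet at a and c and peak at b (a < b < c)."""
    if x <= a or x >= c:
        return 0.0
    if x <= b:
        return (x - a) / (b - a)
    return (c - x) / (c - b)


def capping_risk(force_kn):
    """Hypothetical two-rule fuzzy model:
    IF force is LOW  THEN risk = 0.1
    IF force is HIGH THEN risk = 0.9
    """
    memb_low = triangular(force_kn, -5.0, 5.0, 15.0)
    memb_high = triangular(force_kn, 5.0, 15.0, 25.0)
    rules = [(memb_low, 0.1), (memb_high, 0.9)]
    total = sum(weight for weight, _ in rules)
    if total == 0.0:
        raise ValueError("input lies outside every fuzzy set")
    # Weighted average of the rule outputs (zero-order Sugeno-type inference).
    return sum(weight * out for weight, out in rules) / total


risk_at_10_kn = capping_risk(10.0)  # belongs equally to LOW and HIGH
```

An input of 10 kN has a membership degree of 0.5 in both fuzzy sets, so both rules contribute equally to the predicted output.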
Machine learning algorithms may be used individually or combined in structures denoted as forests or ensembles. Ensembles are extremely powerful in the case of multicriteria optimization for large datasets.

Illustrative examples of ML tools applied in the context of pharmaceutical formulation development and related technologies
The current approach to pharmaceutical development should be based on quality by design (QbD) principles (8). The first step in the QbD approach is the definition of the quality target product profile (QTPP), followed by the identification of the critical quality attributes (CQAs) of the product. The most important aspect of QbD is establishing the relationship(s) between the CQAs and the critical material attributes (CMAs) and/or critical process parameters (CPPs) that affect them. Once these relationships are identified and quantified, the design space can be defined, providing the opportunity to optimize and continuously control the product's quality. Historically, quantitative analysis and definition of the design space were based on experimental design, regression methods, and conventional statistical analysis. However, there are no limits on the methods that can be used for quantitative assessments in the QbD context. In fact, a whole array of techniques is available under the data science umbrella (Figure 1) that can be used efficiently for a variety of QbD elements. ML algorithms, especially ANNs, have been used in numerous examples of QbD-based pharmaceutical development, specifically due to their non-linear nature and their ability to capture complex relationships between CMAs and/or CPPs and CQAs for various pharmaceutical dosage forms (9-17).
Simões et al. (18) have recently published a study on ANNs applied to the QbD-based development of a tablet formulation of a poorly soluble and poorly permeable drug (a class IV drug according to the Biopharmaceutics Classification System, BCS), which was manufactured in industrial settings and compared to the reference product in bioequivalence studies. In short, after the initial risk assessment, the following CMAs and CPPs were identified as critical for the dissolution profile as a CQA: particle size distribution, tablet hardness, impeller speed, mesh size for sieving of the dried granules, granulation time, and granulation liquid amount. Fully connected MLP networks were trained and validated. ANNs were then used to set product and process specifications, taking into account similarity factors for the predicted dissolution profiles. This allowed the establishment of a control strategy for the entire process. The optimal ANN model was validated on three industrial-scale batches manufactured with three different batches of the milled drug, and it successfully predicted the dissolution profiles of the manufactured batches. This study demonstrates the true power of ML algorithms in pharmaceutical product development. Lao et al. (19) have prepared an in-depth review of the application of ML in solid oral dosage form development in both academia and industry over the last three decades. Table I presents selected examples of recently published reports on ANN models used in pharmaceutical development (for formulation and/or process optimization). Additional reviews of ANNs applied in pharmaceutical development are available (20-26).
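As an illustration of how predicted and measured dissolution profiles can be compared, the sketch below computes the f2 similarity factor, f2 = 50 log10(100 / sqrt(1 + mean squared difference)), as commonly used in regulatory dissolution comparisons. It is an assumption that this is the similarity factor meant in the study above, and the dissolution values (percent dissolved at matching time points) are hypothetical.

```python
import math


def f2_similarity(reference, test):
    """f2 similarity factor for two dissolution profiles given as
    percent dissolved at the same time points."""
    if len(reference) != len(test) or not reference:
        raise ValueError("profiles must be non-empty and of equal length")
    n = len(reference)
    mean_sq_diff = sum((r - t) ** 2 for r, t in zip(reference, test)) / n
    return 50.0 * math.log10(100.0 / math.sqrt(1.0 + mean_sq_diff))


# Hypothetical profiles: measured reference vs. model-predicted test batch.
measured = [20.0, 45.0, 70.0, 85.0, 95.0]
predicted = [18.0, 47.0, 68.0, 88.0, 93.0]
f2 = f2_similarity(measured, predicted)  # about 80.5 for these values
```

Identical profiles give f2 = 100, and values of 50 or above are conventionally taken to indicate similar profiles.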
Millen et al. (27) have compared multiple linear regression, stepwise, ridge, and lasso regression, elastic net, regression trees, and boosted regression trees as modeling techniques for the assessment of particle size distribution and tablet quality attributes following wet granulation on different scales (from laboratory to full-scale production). Landin has demonstrated that the combination of neuro-fuzzy logic and genetic programming allows modeling of the wet granulation process for different sizes and geometries of the wet granulation equipment (28). Belič et al. have applied ANNs and fuzzy models to minimize the capping tendency by optimizing the tableting process (5). ANNs were also compared to an adaptive neuro-fuzzy inference system (ANFIS) and multiple linear regression in terms of modeling the compaction performance of novel pharmaceutical excipients (thermally and chemically modified starches) (29). Gams et al. (30) have developed an ML-based method, using decision trees and support vector machines (SVM), that evaluates material properties and processing parameters for the successful production of tablets.
Barmpalexis et al. (7) have compared multi-linear regression, particle swarm optimization (PSO) ANNs, and genetic programming in the development of mini-tablets. PSO is an optimization tool based on bird flocking behavior (7). PSO-ANNs were the only regression technique able to simultaneously model eight responses (describing the properties of powders and mini-tablets). The superiority of deep learning over conventional machine learning approaches (including multiple linear regression, partial least squares regression, support vector machines, artificial neural networks, random forest, and k-nearest neighbors) has recently been demonstrated (36). Yang et al. (36) extracted experimental data for 131 oral fast-disintegrating film formulations and 145 sustained-release matrix tablet formulations. The analyzed dataset contained the types and contents of active pharmaceutical ingredients and excipients, process parameters, and in vitro properties of the dosage forms. Disintegration times and dissolution profiles of the respective formulations were accurately modeled, and appropriate generalization was confirmed with external datasets.
Deep learning has been shown to be comparable, after 10-fold cross-validation, to random forest, single tree, and genetic algorithms for the prediction of drug release from poly-lactide-co-glycolide (PLGA) microspheres (37).
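The k-fold cross-validation procedure mentioned above can be outlined as follows. This is a minimal sketch under stated assumptions: the data are split into contiguous folds without shuffling (in practice, the samples are usually shuffled first), a simple straight-line fit stands in for the model being evaluated, and the dataset is hypothetical. None of the names below come from the cited study.

```python
import math


def k_fold_indices(n_samples, k=10):
    """Split range(n_samples) into k contiguous folds of near-equal size."""
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def fit_line(x, y):
    """Placeholder model: least-squares straight line, returned as (slope, intercept)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((v - mean_x) ** 2 for v in x)
    slope = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / sxx
    return slope, mean_y - slope * mean_x


def cross_validate(x, y, k=10):
    """Return the root mean squared error obtained on each held-out fold."""
    rmse_per_fold = []
    for test_idx in k_fold_indices(len(x), k):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(x)) if i not in held_out]
        # Fit on k-1 folds, then evaluate on the fold that was left out.
        slope, intercept = fit_line([x[i] for i in train_idx],
                                    [y[i] for i in train_idx])
        sq_err = [(slope * x[i] + intercept - y[i]) ** 2 for i in test_idx]
        rmse_per_fold.append(math.sqrt(sum(sq_err) / len(sq_err)))
    return rmse_per_fold


# Hypothetical dataset of 20 samples with an exactly linear response.
x_data = [float(i) for i in range(20)]
y_data = [2.0 * v + 1.0 for v in x_data]
scores = cross_validate(x_data, y_data, k=10)
```

Each sample is used for testing exactly once, so the average of the per-fold errors gives an estimate of how the model performs on data it was not trained on.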
In addition to deep learning, other advanced machine learning algorithms have demonstrated their prediction efficiency. The light gradient boosting machine algorithm (LightGBM), a high-performance boosted decision tree method, was used to predict complexation between cyclodextrins and active pharmaceutical ingredients (38), as well as the complexation of berberine into phospholipid complexes (39). The same algorithm was also used for the prediction of the particle size and polydispersity index (PDI) of nanocrystals prepared by top-down methods (40). Apart from predicting the target variables, LightGBM also provided information on the relative contribution of the formulation factors and process parameters to the nanocrystal size and PDI (40). Decision-tree-based methods were also shown to be capable of predicting the particle size of solid lipid nanoparticles (41).
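The principle behind boosted decision trees, and the way they can rank input variables, can be illustrated with a toy example. The sketch below is not LightGBM and does not reproduce its implementation: it assumes the simplest possible setting, in which single-split trees (stumps) are fitted one after another to the residuals of the current prediction, and the error reduction achieved by each split is credited to the feature that was split on. The dataset, in which only the first of two input variables influences the response, is hypothetical.

```python
def fit_stump(X, r):
    """Return (feature, threshold, left_mean, right_mean, gain) of the single
    split that most reduces the squared error of the residuals r."""
    n = len(X)
    base_sse = sum(v * v for v in r) - sum(r) ** 2 / n
    best = None
    for f in range(len(X[0])):
        # Dropping the largest value keeps both sides of the split non-empty.
        for t in sorted({row[f] for row in X})[:-1]:
            left = [r[i] for i in range(n) if X[i][f] <= t]
            right = [r[i] for i in range(n) if X[i][f] > t]
            ml, mr = sum(left) / len(left), sum(right) / len(right)
            sse = (sum((v - ml) ** 2 for v in left)
                   + sum((v - mr) ** 2 for v in right))
            if best is None or base_sse - sse > best[4]:
                best = (f, t, ml, mr, base_sse - sse)
    return best


def boost(X, y, n_rounds=20, lr=0.3):
    """Gradient boosting with stumps for squared error; returns the relative
    gain-based importance of each feature and the training error history."""
    pred = [sum(y) / len(y)] * len(y)          # start from the mean response
    importance = [0.0] * len(X[0])
    sse_history = [sum((a - p) ** 2 for a, p in zip(y, pred))]
    for _ in range(n_rounds):
        resid = [a - p for a, p in zip(y, pred)]
        f, t, ml, mr, gain = fit_stump(X, resid)
        importance[f] += gain                  # credit the split feature
        pred = [p + lr * (ml if row[f] <= t else mr)
                for p, row in zip(pred, X)]
        sse_history.append(sum((a - p) ** 2 for a, p in zip(y, pred)))
    total = sum(importance) or 1.0
    return [g / total for g in importance], sse_history


# Hypothetical data: the response depends on feature 0 only.
X = [[float(i), float((i * 7) % 5)] for i in range(10)]
y = [2.0 * row[0] for row in X]
relative_importance, sse_history = boost(X, y)
```

For these data, most of the accumulated gain is assigned to the first feature, which mirrors the way tree-based methods can indicate the relative contribution of formulation factors and process parameters.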
LightGBM, coupled with natural language processing (NLP) and blockchain technology, was also used to develop a management and recommendation system for the drug supply chain (42).

Regulatory aspects of machine learning algorithms application in pharmaceutical development
There is great interest and potential for the pharmaceutical industry to utilize machine learning algorithms in practically every aspect of the pharmaceutical product lifecycle. Henstock has reviewed and recommended steps for the successful integration of artificial intelligence, in a general context, into the pharmaceutical industry (43).
Regarding official guidelines and recommendations, the Food and Drug Administration (FDA) has recently presented a discussion paper, followed by an action plan published in 2021 (44), devoted to artificial intelligence and machine learning software used as a medical device. These documents provide valuable concepts, such as Good Machine Learning Practice and Real-World Performance, that might pave the way for similar AI and ML applications in pharmaceutical development. According to the action plan (44), good machine learning practices refer to efficient data management, feature extraction, training, interpretability, evaluation, and documentation. The European Medicines Agency (EMA) has, on the other hand, initiated a big data task force together with the Heads of Medicines Agencies (HMA). Presentations on the current state-of-the-art use of AI in medicines, from the perspective of the relevant stakeholders, are available on EMA's website (45). One of EMA's strategic goals is to strengthen its ability to validate AI algorithms, and to deal with big data in general, since it is inevitable that these tools will be increasingly applied by the pharmaceutical industry (46).
The Danish Medicines Agency has published a proposal for the criteria and questions that can be expected for AI/ML algorithms across the various GxP-regulated areas (47). It is stated that the proposal applies to static AI/ML algorithms that implement critical GxP-related functions and are trained using supervised learning.

Future expectations
Serialization (track and trace) could also rely on ML tools. Although it is not its primary purpose, the data generated by serialization could also be used in product development and for tracking patients' adherence. Also, ML algorithms could be used for the identification of falsified medicines (48).
More efficient manufacturing in the pharmaceutical industry can be expected with the integration of ML and PAT tools (49). Several such examples are described in Table II. One of the greatest challenges related to successful PAT implementation is the analysis of high-volume, multivariate data. Moreover, fast computation and decision making are of the utmost importance. These issues could potentially be solved by different ML tools. For example, Wong et al. (50) have developed a method based on recurrent neural networks that provides efficient regulation of critical quality attributes.
The many opportunities for the application of AI tools are accompanied by obstacles and challenges, especially if the algorithms and models are meant to be used in the mass production of medicines, whether for pharmaceutical development, production process monitoring, or both. The challenges are related to the volume of data and the speed of its accumulation, the size of datasets, training/learning time, over- or under-fitting of models, etc. (1). With the evolution of the big data concept and advances in computing capability, the technical challenges can be expected to diminish, but the need for critical consideration of AI-based models will remain, as for any other modeling approach.