QUALITY OF RESEARCH RESULTS IN AGRO-ECONOMY BY DATA MINING

Data mining (DM) in agro-economy is a scientific method that frees researchers from predetermined research scenarios, that is, from assumptions and hypotheses built on insignificant attributes. Instead, data mining makes it possible to detect the significant attributes and, more generally, the hidden facts on which a hypothesis can be set. The DM method does this iteratively, including key attributes and factors and their influence on the quality of agro-resources. The research was conducted on a random sample by analyzing the quality of eggs. The research subject is the possibility of classifying and predicting the significant variables (attributes) that determine the level of egg quality. The research starts from the use of data mining, as an area of machine learning, which significantly helps researchers optimize their research. The methodology applied includes analytical-synthetic procedures and data mining methods, with a special focus on supervised linear discriminant analysis and the decision tree. The results indicate significant possibilities for using DM as an additional analytical procedure in agro-research, and it can be concluded that it improves the effectiveness and validity of the research process.

EP 2015 (62) 4 (1137-1146) Gordana Vukelić, Slobodan Stanojević, Zorica Anđelić
researches in agro-economy a significant methodological turn. Classical methods, which subsume the normative-descriptive methods supported by classical multivariate statistical methods, have become a basis for establishing machine learning as a more productive and exact scientific method in agro-research (Bohanec et al., 2003). The method of machine learning gives a clear positive answer to the question of whether a computer can execute functions that we consider thought (Alan Turing, 1912-1954). Machine learning is a set of processes which include: collecting new declarative knowledge, developing and refining motor and cognitive skills through practice, structuring existing knowledge, and discovering new facts and theories by observation and active experimentation (Breiman, 2001). This learning process includes knowledge acquisition, i.e. learning new information of symbolic character, and training, i.e. improving the acquired knowledge, which is equal to the way people learn. Machine learning includes two models: learning based on examples, and learning by observation and own discovery. Of the two types of scientific knowledge they produce, explicit knowledge is represented by mathematical logic, production rules and similar systems; it is the subject of the authors' application, which covers data mining using different decision tree techniques, production rules of discriminant analysis, neural network models, etc. (Cherkassky, Mulier, 2007).
Research in the domain of data mining relates to the area of inductive learning, that is, deriving general laws from insight into specific occurrences (cases). Applying these methods requires grading the validity of the learned knowledge, so it is methodologically necessary to divide the data set into a learning set, which is used for learning, and a test set, which is used for testing the learned knowledge (Stanojević, 2013).
Useful inductive knowledge must have predictive accuracy, that is, a high percentage of success when the learned rules are used to classify new examples that were not considered during learning. This accuracy is graded with the cross-validation method and the bootstrap method (Kohavi, 1995). Machine learning, and especially the area of finding hidden knowledge, or data mining, includes an iterative process of discovering patterns, whether automatically or manually, in a setting where there are no predetermined assumptions or hypotheses, and that is the research goal (Stanojević, 2013).
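The two accuracy estimates mentioned above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the attribute values are made up (they are not the study's measurements), the classifier is a simple nearest-class-mean rule on one attribute, and the bootstrap estimate is the plain out-of-bag variant rather than any specific estimator discussed by Kohavi (1995).

```python
import random
from statistics import mean

def fit_nearest_mean(xs, ys):
    """Learn one mean per class from a single numeric attribute."""
    return {c: mean(x for x, y in zip(xs, ys) if y == c) for c in set(ys)}

def predict(model, x):
    """Assign x to the class whose learned mean is closest."""
    return min(model, key=lambda c: abs(x - model[c]))

def error_rate(model, xs, ys):
    """Share of examples that the model classifies incorrectly."""
    return sum(predict(model, x) != y for x, y in zip(xs, ys)) / len(xs)

def subset(values, indices):
    return [values[i] for i in indices]

def cross_validation_error(xs, ys, k=5, seed=0):
    """Average test error over k disjoint folds."""
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)
    errors = []
    for fold in (idx[i::k] for i in range(k)):
        held_out = set(fold)
        train = [i for i in idx if i not in held_out]
        model = fit_nearest_mean(subset(xs, train), subset(ys, train))
        errors.append(error_rate(model, subset(xs, fold), subset(ys, fold)))
    return mean(errors)

def bootstrap_error(xs, ys, rounds=200, seed=0):
    """Average out-of-bag error: train on a resample drawn with
    replacement, test on the examples that were not drawn."""
    rng = random.Random(seed)
    n = len(xs)
    errors = []
    for _ in range(rounds):
        drawn = [rng.randrange(n) for _ in range(n)]
        drawn_set = set(drawn)
        out_of_bag = [i for i in range(n) if i not in drawn_set]
        if not out_of_bag:
            continue  # nothing left to test on in this round
        model = fit_nearest_mean(subset(xs, drawn), subset(ys, drawn))
        errors.append(error_rate(model, subset(xs, out_of_bag), subset(ys, out_of_bag)))
    return mean(errors)

# Made-up, well-separated values: 25 items of class A and 33 of class B.
xs = [4.0 + 0.1 * i for i in range(25)] + [1.0 + 0.05 * i for i in range(33)]
ys = ["A"] * 25 + ["B"] * 33
cv_error = cross_validation_error(xs, ys)
boot_error = bootstrap_error(xs, ys)
print(f"cross-validation error: {cv_error:.3f}, bootstrap error: {boot_error:.3f}")
```

In both estimates the model is always tested on examples it did not see during learning, which is the point of dividing the data into a learning set and a test set.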
In general, machine learning is a special area of artificial intelligence that relates to the development of algorithms and techniques which enable computers to "learn" (Platt, 1998). The method of inductive machine learning creates computer programs by extracting rules and patterns of behavior from sets of data. Data mining is also known as the process of knowledge discovery in databases (KDD), or knowledge discovery and data mining (Chang, Lin, 2001). Data mining is defined as the process of employing one or more computer techniques with the goal of automatically analyzing and extracting knowledge from data found in a certain database (Witten et al., 2011). The purpose of data mining is to find and identify certain patterns and trends in data. All methods of data mining are used for induction-based learning (Kantardžić, 2002), that is, the process of forming general conceptual definitions by observing the specific examples from which learning is done (Stanojević, 2013).
Within the research, an analysis of available data in the area of agro-research was conducted by applying statistical methods of artificial intelligence and DM classification methods; the goal is to identify rules for classifying data in the area of agriculture, specifically poultry raising and the analysis of egg quality (Birch et al., 2003). The research used two DM techniques, both of them forms of supervised learning: linear discriminant analysis and the decision tree (Maindonald, Braun, 2007).

Research methodology
The research procedure starts from defined and well-grouped data, continues with determining the variables and applying the chosen methods, and ends with analyzing and interpreting the results obtained (Mihajlović, 2014).

Data
The authors' research relates to identifying the key factors of egg quality. The data were grouped in two samples: sample A, the first quality category, and sample B, the second quality category. Sample A contains 25 observations, while sample B contains 33 observations.

Variables
Four characteristics of egg quality were measured on both samples: X1 = yolk shadow, X2 = yolk color, X3 = height of egg white⁹, and X4 = egg white index¹⁰. These variables are treated as predictive attributes. The research goal is to identify the key variable that influences the classification of eggs into category A or B (the category, or class, is the goal variable in our research, i.e. the class attribute) (Breiman et al., 1984).

Methods
Identifying the key variable among X1, X2, X3 and X4 is a typical classification problem, and it is solved in two procedural phases. In the first phase, the machine learning model is trained on a training sample. The sample is organized in rows and columns. One of the attribute columns is the class attribute, which carries the quality category of the eggs. This phase is called supervised learning (Sohl, Venkatachalam, 1995). In the second phase, the model tries to classify objects that do not belong to the training sample.
The authors used the supervised linear discriminant function in the paper, with the following methods for validating classification accuracy: cross-validation and the bootstrap method.

Linear discriminant function (LDF)
The goal of applying this statistical method is to determine which variables are useful for classification.
In the first step, supervised learning is applied through the linear discriminant function with continuous predictor variables, while the class variable takes the value A or B.
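As an illustration of how a two-class linear discriminant function is obtained, here is a minimal sketch of Fisher's rule for two continuous predictors. It is not the authors' software or data: the measurements are invented toy values, equal prior probabilities are assumed, and the names `fit_ldf` and `classify` are hypothetical.

```python
from statistics import mean

def class_means(rows):
    """Mean of each of the two predictors within one class."""
    return mean(r[0] for r in rows), mean(r[1] for r in rows)

def pooled_covariance(groups):
    """Pooled within-class covariance of (x1, x2) as (s11, s12, s22)."""
    s11 = s12 = s22 = 0.0
    for rows in groups:
        m1, m2 = class_means(rows)
        for x1, x2 in rows:
            s11 += (x1 - m1) ** 2
            s12 += (x1 - m1) * (x2 - m2)
            s22 += (x2 - m2) ** 2
    dof = sum(len(rows) for rows in groups) - len(groups)
    return s11 / dof, s12 / dof, s22 / dof

def fit_ldf(class_a, class_b):
    """Return (w1, w2, b) so that Z = w1*x1 + w2*x2 + b."""
    s11, s12, s22 = pooled_covariance([class_a, class_b])
    det = s11 * s22 - s12 ** 2
    ma, mb = class_means(class_a), class_means(class_b)
    d1, d2 = ma[0] - mb[0], ma[1] - mb[1]
    # w = S^-1 (mean_A - mean_B), using the closed-form 2x2 inverse
    w1 = (s22 * d1 - s12 * d2) / det
    w2 = (-s12 * d1 + s11 * d2) / det
    # Intercept places Z = 0 halfway between the class means (equal priors).
    b = -(w1 * (ma[0] + mb[0]) + w2 * (ma[1] + mb[1])) / 2
    return w1, w2, b

def classify(model, row):
    """Positive scores fall on the side of class A, negative on class B."""
    w1, w2, b = model
    return "A" if w1 * row[0] + w2 * row[1] + b > 0 else "B"

# Invented toy measurements (x1, x2) for each class.
class_a = [(5.0, 3.0), (5.5, 3.4), (6.0, 2.9), (5.2, 3.6)]
class_b = [(2.0, 3.1), (2.5, 3.3), (1.8, 2.8), (2.2, 3.5)]
model = fit_ldf(class_a, class_b)
print(model, classify(model, (5.1, 3.2)))
```

In this toy data only the first predictor separates the classes, so its weight dominates the function; the numbers are unrelated to the results reported in the paper.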
The results show that the classification error is 1.7%, while the variance error in relation to the total Wilks' lambda (within the MANOVA method) is small, with p = 0.0.
9 The quality of egg white is graded by breaking an egg on a flat surface and measuring the height of the dense egg white, which is expressed in Haugh units (HU).
10 An egg of good eating quality has a flat yolk of bigger radius and egg white which is watery and covers a big area.
Source: Goulden, 1952
The summary result of the LDF cannot be considered reliable, because the essential question has not been asked: which variable is relevant for the research? According to the following LD function, the discriminant equation would be as follows.

Results of the estimation of accuracy of classification with the bootstrap method.
From the table of results given above, it was determined that the actual error in predicting the quality of eggs was 3.5%. The next step is introducing the STEPDISC¹⁴ component, whose purpose is to determine the number of variables needed for classifying membership in class A or B.
Source: Goulden, 1952
According to the results of the STEPDISC analysis, the only relevant attribute is the yolk shadow. The next step is to check the effectiveness of the resulting model (Demšar et al., 2003).
To that end, it is necessary to repeat the supervised linear discriminant analysis with the bootstrap method. Its error is 1.7%, which is lower than 3.5%; however, only one variable now occurs in the discriminant function, which is of crucial significance for determining the key factor in classifying into classes A and B. The classification function is
Z = 7.65 X1 - 28
with a decrease in bootstrap error to 1.2%, as shown in the following table.
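The classification function Z = 7.65 X1 - 28 quoted above can be applied to a single yolk shadow measurement as sketched here. The coefficients come from the text; which class lies on the positive side of Z is not stated there, so the mapping of positive scores to class A is an assumption, and the function names are hypothetical.

```python
def discriminant_score(yolk_shadow):
    """Z = 7.65 * X1 - 28, with X1 the yolk shadow."""
    return 7.65 * yolk_shadow - 28.0

def decision_boundary():
    """Value of X1 at which Z = 0."""
    return 28.0 / 7.65

def classify(yolk_shadow, positive_class="A", negative_class="B"):
    """ASSUMPTION: positive scores are assigned to class A."""
    return positive_class if discriminant_score(yolk_shadow) > 0 else negative_class

print(round(decision_boundary(), 2), classify(4.0), classify(3.0))
```

Because only one variable remains in the function, the rule reduces to comparing the yolk shadow with a single cut-off value.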
14 Stepwise discriminant analysis (STEPDISC) is a discriminant analysis which determines the variables relevant for classification by using the Wilks' lambda method. Wilks' lambda is a test statistic used in multivariate analysis of variance (MANOVA) to test the difference between the means of identified groups of subjects on a combination of dependent variables.
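The role of Wilks' lambda in selecting variables can be sketched for the simplest case. For a single variable, Wilks' lambda reduces to the within-group sum of squares divided by the total sum of squares, so values near 0 mean the groups are well separated. The sketch below shows only the first step of a forward selection (pick the variable with the smallest lambda); the attribute values are invented, and this is not a reproduction of the STEPDISC component.

```python
from statistics import mean

def wilks_lambda(groups):
    """Univariate Wilks' lambda: within-group SS / total SS."""
    all_values = [v for g in groups for v in g]
    grand_mean = mean(all_values)
    total_ss = sum((v - grand_mean) ** 2 for v in all_values)
    within_ss = sum((v - mean(g)) ** 2 for g in groups for v in g)
    return within_ss / total_ss

# Invented toy values per attribute: (class A values, class B values).
attributes = {
    "yolk_shadow": ([5.0, 6.0, 5.5], [2.0, 2.5, 3.0]),
    "yolk_color": ([3.0, 4.0, 5.0], [3.5, 4.5, 4.0]),
}

lambdas = {name: wilks_lambda(groups) for name, groups in attributes.items()}
best = min(lambdas, key=lambdas.get)  # first variable a forward selection would enter
print(lambdas, best)
```

A full stepwise procedure would continue by testing whether any remaining variable still lowers the multivariate lambda significantly once the first one is in the model.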

Application of the decision tree
In addition to discriminant analysis, the research applied the decision tree, using the C4.5 algorithm, which is based on a tree structure where every internal node represents an attribute test and every branch a test result (Quinlan, 1996). The goodness of a split is based on selecting the attributes that best separate the classes in the sample.
The results, with a classification accuracy of 9.6%, give the following decision tree:
Or graphically:
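The goodness-of-split idea can be sketched with the gain ratio criterion that C4.5 is known for, applied to one continuous attribute. This is a simplified illustration and not the full algorithm: the values are invented, candidate thresholds are taken as midpoints between neighbouring values, and the threshold is chosen directly by gain ratio.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def gain_ratio(values, labels, threshold):
    """Information gain of the split value <= threshold, divided by split information."""
    left = [y for x, y in zip(values, labels) if x <= threshold]
    right = [y for x, y in zip(values, labels) if x > threshold]
    if not left or not right:
        return 0.0  # the split separates nothing
    n = len(labels)
    remainder = len(left) / n * entropy(left) + len(right) / n * entropy(right)
    gain = entropy(labels) - remainder
    split_info = entropy(["left"] * len(left) + ["right"] * len(right))
    return gain / split_info

def best_threshold(values, labels):
    """Midpoint between neighbouring values with the highest gain ratio."""
    ordered = sorted(set(values))
    candidates = [(a + b) / 2 for a, b in zip(ordered, ordered[1:])]
    return max(candidates, key=lambda t: gain_ratio(values, labels, t))

# Invented toy values of one attribute and the class of each item.
shadow = [2.0, 2.5, 3.0, 5.0, 5.5, 6.0]
labels = ["B", "B", "B", "A", "A", "A"]
threshold = best_threshold(shadow, labels)
print(threshold, gain_ratio(shadow, labels, threshold))
```

When one attribute separates the classes completely, as in this toy example, a single test on it is enough and the tree consists of one node with two leaves.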

Graph 1. Yolk Shadow
Source: Goulden, 1952
Estimating the accuracy of classification with the cross-validation method, we get an error percentage of 0%.
Further steps in examining the value of classification include the application of the bagging method (Sadok et al., 2009). Together with this method we use the random tree bagging algorithm within supervised learning, which gives 0% error, or 100% accuracy (Đinović, 2013).
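The bagging idea can be sketched as follows: several models are trained on bootstrap resamples of the training set and their predictions are combined by majority vote. This is a simplified stand-in under stated assumptions, not the random tree bagging algorithm used in the study: the base model is a one-level decision stump on a single attribute, and the data are invented.

```python
import random
from collections import Counter

def fit_stump(xs, ys):
    """One-level tree: threshold and leaf labels with the fewest training errors."""
    best = None
    for t in sorted(set(xs)):
        for below, above in (("A", "B"), ("B", "A")):
            errors = sum((below if x <= t else above) != y for x, y in zip(xs, ys))
            if best is None or errors < best[0]:
                best = (errors, t, below, above)
    return best[1:]

def predict_stump(stump, x):
    threshold, below, above = stump
    return below if x <= threshold else above

def fit_bagging(xs, ys, n_models=25, seed=0):
    """Train each stump on a resample drawn with replacement."""
    rng = random.Random(seed)
    n = len(xs)
    models = []
    for _ in range(n_models):
        idx = [rng.randrange(n) for _ in range(n)]
        models.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return models

def predict_bagging(models, x):
    """Majority vote over all stumps."""
    votes = Counter(predict_stump(m, x) for m in models)
    return votes.most_common(1)[0][0]

# Invented toy values of one attribute and the class of each item.
xs = [2.0, 2.2, 2.5, 2.8, 3.0, 5.0, 5.2, 5.5, 5.8, 6.0]
ys = ["B"] * 5 + ["A"] * 5
models = fit_bagging(xs, ys)
print(predict_bagging(models, 2.4), predict_bagging(models, 5.5))
```

Voting over many resampled models reduces the dependence of the final classification on any single training sample, which matters when the training set is small.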

Conclusion
Agro-economy faces great challenges, especially in researching not only the quality of soil but also other food resources and sources of ecological food. The methods of finding hidden knowledge have an advantage over classical methods because they are more precise at classifying and have greater predictive capacity.
The aim of this research was to examine the usefulness and exactness of these methods on the example of egg quality (categories A and B), based on examining samples. Supervised linear discriminant analysis was used to identify the specific influence of the variables on egg quality, together with methods for validating classification accuracy, and to identify the key variable, which in this case is the yolk shadow. In addition to this method, the decision tree was used, which gave more precise results in determining the level of influence of particular variables. The results obtained are precise at the level of 99%. In contrast to classical multivariate research, in this research supervised discriminant analysis was used to examine the influence of four variables on egg quality, three of which turned out not to be key for classification. What the research needed in order to come to the foreground was achieved, namely a high degree of accuracy (at the level of 99%).
This methodological apparatus was of significant help to researchers in the area of agriculture, especially because research can be done on scarce training sets, which have a large number of attributes (describing the entity under study, for example land, quality of agricultural products, fruit, vegetables, eggs, meat and many others) and a very small number of examples. The problem of scarceness is related to the evaluation of task difficulty, which in the domain of data mining is solved by reducing the number of attributes (variables). These methodological approaches reveal previously hidden knowledge in agro-economy and agronomy, primarily about the key variables, attributes and factors that are decisive for solving research problems and for setting a hypothesis correctly, in agro-economy as well as in other areas of research.