Intelligent cluster analysis of data
Classification tasks are among the most important in scientific and technical research in biology, medicine, and geology, as well as in socio-economic studies. Such tasks are numerous and diverse in their formulation. One of them, classification without training (unsupervised classification), consists in dividing a set of objects described by a set of features into homogeneous groups called clusters. Cluster analysis methods are used to solve this problem: in the p-dimensional feature space they single out clusters of multidimensional objects of very different nature that have special properties (compactness, connectivity, and others). When classifying multidimensional observations, the results of cluster analysis give the researcher firmer grounds for attributing an unknown object to one or another known class.
Cluster analysis (clustering, taxonomy, self-learning, unsupervised learning) is designed to split a set of objects into a given or unknown number of classes based on some mathematical criterion of classification quality (a cluster is a bunch, a group of elements characterized by some common property). The clustering quality criterion (partition quality functional) to some extent reflects the following informal requirements:
within groups, objects must be closely related to each other;
objects of different groups should be far from each other;
other things being equal, objects should be distributed evenly among the groups.
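The three informal requirements above can be turned into numbers in many ways. The following is a minimal sketch of one possible choice, not a criterion prescribed by the text: the function name `partition_quality` and the three specific measures (mean distance to the cluster centroid, smallest distance between centroids, ratio of smallest to largest cluster size) are assumptions made for illustration. It expects at least two non-empty clusters.

```python
from itertools import combinations
from math import dist


def centroid(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def partition_quality(clusters):
    """Return (within, between, balance) for a list of clusters.

    within  : mean distance from an object to its cluster centroid
              (smaller means objects inside groups are closer together)
    between : smallest distance between two cluster centroids
              (larger means groups are farther apart)
    balance : smallest cluster size divided by the largest
              (1.0 means all groups have the same size)
    """
    centers = [centroid(c) for c in clusters]
    total = sum(len(c) for c in clusters)
    within = sum(
        dist(p, m) for c, m in zip(clusters, centers) for p in c
    ) / total
    between = min(dist(a, b) for a, b in combinations(centers, 2))
    sizes = [len(c) for c in clusters]
    return within, between, min(sizes) / max(sizes)


# Two compact, well-separated groups of equal size.
clusters = [[(0.0, 0.0), (0.0, 2.0)], [(10.0, 0.0), (10.0, 2.0)]]
within, between, balance = partition_quality(clusters)
print(within, between, balance)
```

A single quality functional would combine such terms with weights; how to weight them is exactly the kind of task-dependent decision discussed in the text.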
In clustering algorithms, the most important and least formalized element is the definition of homogeneity, that is, of a measure of proximity between objects and between clusters, and of the quality of the partition into groups (the objectivity of the resulting groupings); this definition largely determines the final classification result. In each specific task the problem is solved in its own way, depending on the objectives of the study and on the structure and type of the initial data, and it relies heavily on the researcher's intuition. For this reason, implementing such algorithms as application programs that run in batch mode is inefficient. To obtain the best heuristic solution of a clustering problem, the researcher must actively draw on the knowledge of experts in cluster analysis.
Thus, clustering objects, each of which is described by a set of features, means finding homogeneous groups of objects (clusters) and revealing their hidden structure without exact knowledge of typical representatives. Many clustering methods and algorithms exist, each oriented toward particular classification problems. The difficulty is that for each specific type of data and each arrangement of objects in the feature space, one must either choose a suitable algorithm, adapt an existing one, or develop a new one. Expert knowledge is widely used to solve this problem.
The "KARKAS" system includes a module for intelligent clustering of multidimensional data, in which a knowledge base is used to select a clustering algorithm.
The knowledge base model for cluster analysis covers the application of a number of automatic classification algorithms, such as the iterative K intragroup averages (K-means) algorithm, the "ISODATA" algorithm, the dynamic condensations (dynamic clusters) method, the hierarchical procedure, and others.
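The iterative K intragroup averages algorithm named above is commonly known as k-means. As an illustration only, here is a minimal sketch of its basic iteration in plain Python; it is not the implementation used in "KARKAS", and the function name `k_means`, the `max_iter` parameter, and the rule of leaving an empty group's center in place are assumptions.

```python
from math import dist


def k_means(points, centers, max_iter=100):
    """Alternate two steps until the centers stop moving:
    1) assign every point to its nearest center;
    2) move every center to the mean of the points assigned to it.
    """
    centers = [tuple(c) for c in centers]
    groups = [[] for _ in centers]
    for _ in range(max_iter):
        groups = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: dist(p, centers[i]))
            groups[nearest].append(p)
        new_centers = [
            # Assumption: an empty group keeps its previous center.
            tuple(sum(coords) / len(g) for coords in zip(*g)) if g else old
            for g, old in zip(groups, centers)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    return centers, groups


points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0)]
centers, groups = k_means(points, centers=[(0.0, 0.0), (5.0, 5.0)])
print(centers)
print(groups)
```

The number of clusters is fixed here by the number of initial centers. "ISODATA" differs in that it may split and merge clusters during the run, which this sketch does not attempt.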
The clustering process can be divided into a number of stages: setting a similarity measure for objects in the feature space; choosing a cluster formation strategy (for example, a hierarchical one, with different similarity measures between clusters); choosing a criterion for assessing the quality of the partition into clusters; and applying certain heuristic considerations.
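To make the stages concrete, the sketch below fixes one choice at each of the first two: Euclidean distance as the similarity measure between objects, and a hierarchical (agglomerative) strategy in which the intercluster distance is selectable. The names `LINKAGES` and `agglomerate`, and the stopping rule based on a target number of clusters, are assumptions made for illustration.

```python
from itertools import product
from math import dist

# Two common intercluster distances, expressed through object distances.
LINKAGES = {
    "single": min,    # distance between the two closest objects
    "complete": max,  # distance between the two farthest objects
}


def agglomerate(points, n_clusters, linkage="single"):
    """Start with one cluster per object and repeatedly merge the two
    closest clusters until only n_clusters remain."""
    link = LINKAGES[linkage]
    clusters = [[p] for p in points]
    while len(clusters) > n_clusters:
        best = None  # (distance, i, j) of the closest pair found so far
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = link(dist(a, b) for a, b in product(clusters[i], clusters[j]))
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        merged = clusters.pop(j)  # j > i, so index i stays valid
        clusters[i].extend(merged)
    return clusters


points = [(0.0,), (1.0,), (5.0,), (6.0,)]
result = agglomerate(points, n_clusters=2, linkage="complete")
print(result)
```

Swapping the `linkage` argument changes only the second stage; the similarity measure and the stopping rule stay the same, which is why the stages can be described and chosen separately.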
When knowledge about the clustering process is formalized, attributes can be distinguished, for example: features, distance, intercluster distance, number of clusters. Each attribute takes a specific value in each clustering algorithm. For example, the "features" attribute, depending on its value (quantitative, ordinal, nominal, dichotomous), determines the corresponding metric in the feature space or measure of similarity between objects. In addition, some attributes, such as the number of clusters, can change dynamically while the algorithm runs.
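The dependence of the metric on the value of the "features" attribute can be pictured as a small rule table. The sketch below is hypothetical: the text does not state which measure the knowledge base pairs with each feature type, so the pairings in `METRIC_FOR_FEATURES` (Euclidean, city-block, simple mismatch, Jaccard) are common textbook choices assumed for illustration, as are all function names.

```python
from math import dist


def euclidean(a, b):
    """Straight-line distance for quantitative features."""
    return dist(a, b)


def city_block(a, b):
    """Sum of absolute differences, applied here to ordinal ranks."""
    return sum(abs(x - y) for x, y in zip(a, b))


def mismatch(a, b):
    """Share of nominal features on which two objects differ."""
    return sum(x != y for x, y in zip(a, b)) / len(a)


def jaccard(a, b):
    """Jaccard distance for dichotomous (0/1) features."""
    both = sum(1 for x, y in zip(a, b) if x and y)
    either = sum(1 for x, y in zip(a, b) if x or y)
    return 1.0 - both / either if either else 0.0


# Assumed rules: value of the "features" attribute -> measure to use.
METRIC_FOR_FEATURES = {
    "quantitative": euclidean,
    "ordinal": city_block,
    "nominal": mismatch,
    "dichotomous": jaccard,
}


def select_metric(feature_type):
    """Look up the measure for a feature type; raises KeyError if unknown."""
    return METRIC_FOR_FEATURES[feature_type]


print(select_metric("quantitative")((0.0, 0.0), (3.0, 4.0)))
print(select_metric("nominal")(("red", "round"), ("red", "square")))
print(select_metric("dichotomous")((1, 1, 0), (1, 0, 0)))
```

A real knowledge base would hold similar rules for the other attributes (distance, intercluster distance, number of clusters) and could combine them when selecting an algorithm.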