### Statistical Pattern Recognition (2nd Edition)

A common example of a pattern-matching algorithm is regular expression matching, which looks for patterns of a given sort in textual data and is included in the search capabilities of many text editors and word processors. In contrast to pattern recognition, pattern matching is not generally a type of machine learning, although pattern-matching algorithms especially with fairly general, carefully tailored patterns can sometimes succeed in providing similar-quality output of the sort provided by pattern-recognition algorithms.

Pattern recognition is generally categorized according to the type of learning procedure used to generate the output value.

Supervised learning assumes that a set of training data the training set has been provided, consisting of a set of instances that have been properly labeled by hand with the correct output. A learning procedure then generates a model that attempts to meet two sometimes conflicting objectives: Perform as well as possible on the training data, and generalize as well as possible to new data usually, this means being as simple as possible, for some technical definition of "simple", in accordance with Occam's Razor , discussed below. Unsupervised learning , on the other hand, assumes training data that has not been hand-labeled, and attempts to find inherent patterns in the data that can then be used to determine the correct output value for new data instances.

Note that in cases of unsupervised learning, there may be no training data at all to speak of; in other words,and the data to be labeled is the training data. Note that sometimes different terms are used to describe the corresponding supervised and unsupervised learning procedures for the same type of output. For example, the unsupervised equivalent of classification is normally known as clustering , based on the common perception of the task as involving no training data to speak of, and of grouping the input data into clusters based on some inherent similarity measure e.

Note also that in some fields, the terminology is different: For example, in community ecology , the term "classification" is used to refer to what is commonly known as "clustering". The piece of input data for which an output value is generated is formally termed an instance. The instance is formally described by a vector of features , which together constitute a description of all known characteristics of the instance.

These feature vectors can be seen as defining points in an appropriate multidimensional space , and methods for manipulating vectors in vector spaces can be correspondingly applied to them, such as computing the dot product or the angle between two vectors. Typically, features are either categorical also known as nominal , i. Often, categorical and ordinal data are grouped together; likewise for integer-valued and real-valued data. Furthermore, many algorithms work only in terms of categorical data and require that real-valued or integer-valued data be discretized into groups e.

Many common pattern recognition algorithms are probabilistic in nature, in that they use statistical inference to find the best label for a given instance. Unlike other algorithms, which simply output a "best" label, often probabilistic algorithms also output a probability of the instance being described by the given label.

### Module will run

In addition, many probabilistic algorithms output a list of the N -best labels with associated probabilities, for some value of N , instead of simply a single best label. When the number of possible labels is fairly small e. Probabilistic algorithms have many advantages over non-probabilistic algorithms:. Feature selection algorithms attempt to directly prune out redundant or irrelevant features. A general introduction to feature selection which summarizes approaches and challenges, has been given.

For a large-scale comparison of feature-selection algorithms see. Techniques to transform the raw feature vectors feature extraction are sometimes used prior to application of the pattern-matching algorithm. For example, feature extraction algorithms attempt to reduce a large-dimensionality feature vector into a smaller-dimensionality vector that is easier to work with and encodes less redundancy, using mathematical techniques such as principal components analysis PCA.

The distinction between feature selection and feature extraction is that the resulting features after feature extraction has taken place are of a different sort than the original features and may not easily be interpretable, while the features left after feature selection are simply a subset of the original features.

In order for this to be a well-defined problem, "approximates as closely as possible" needs to be defined rigorously.

In decision theory , this is defined by specifying a loss function or cost function that assigns a specific value to "loss" resulting from producing an incorrect label. The particular loss function depends on the type of label being predicted. For example, in the case of classification , the simple zero-one loss function is often sufficient.

This corresponds simply to assigning a loss of 1 to any incorrect labeling and implies that the optimal classifier minimizes the error rate on independent test data i. The goal of the learning procedure is then to minimize the error rate maximize the correctness on a "typical" test set. For a probabilistic pattern recognizer, the problem is instead to estimate the probability of each possible output label given a particular input instance, i.

When the labels are continuously distributed e. This finds the best value that simultaneously meets two conflicting objects: To perform as well as possible on the training data smallest error-rate and to find the simplest possible model. Essentially, this combines maximum likelihood estimation with a regularization procedure that favors simpler models over more complex models.

The first pattern classifier — the linear discriminant presented by Fisher — was developed in the frequentist tradition. The frequentist approach entails that the model parameters are considered unknown, but objective.

Introduction to Statistical Pattern Recognition, Second Edition Computer Science and Scientific Comp

The parameters are then computed estimated from the collected data. For the linear discriminant, these parameters are precisely the mean vectors and the covariance matrix. Note that the usage of ' Bayes rule ' in a pattern classifier does not make the classification approach Bayesian. Bayesian statistics has its origin in Greek philosophy where a distinction was already made between the ' a priori ' and the ' a posteriori ' knowledge.

Later Kant defined his distinction between what is a priori known — before observation — and the empirical knowledge gained from observations. Moreover, experience quantified as a priori parameter values can be weighted with empirical observations — using e. The Bayesian approach facilitates a seamless intermixing between expert knowledge in the form of subjective probabilities, and objective observations. Within medical science, pattern recognition is the basis for computer-aided diagnosis CAD systems.

CAD describes a procedure that supports the doctor's interpretations and findings.

## Pattern Classification by David G. Stork

Other typical applications of pattern recognition techniques are automatic speech recognition , classification of text into several categories e. Optical character recognition is a classic example of the application of a pattern classifier, see OCR-example. The method of signing one's name was captured with stylus and overlay starting in Banks were first offered this technology, but were content to collect from the FDIC for any bank fraud and did not want to inconvenience customers..

Artificial neural networks neural net classifiers and deep learning have many real-world applications in image processing, a few examples:. For a discussion of the aforementioned applications of neural networks in image processing, see e. In psychology, pattern recognition making sense of and identifying objects is closely related to perception, which explains how the sensory inputs humans receive are made meaningful. Pattern recognition can be thought of in two different ways: the first being template matching and the second being feature detection.

A template is a pattern used to produce items of the same proportions. The template-matching hypothesis suggests that incoming stimuli are compared with templates in the long term memory. If there is a match, the stimulus is identified. Feature detection models, such as the Pandemonium system for classifying letters Selfridge, , suggest that the stimuli are broken down into their component parts for identification.

For example, a capital E has three horizontal lines and one vertical line. Algorithms for pattern recognition depend on the type of label output, on whether learning is supervised or unsupervised, and on whether the algorithm is statistical or non-statistical in nature. Statistical algorithms can further be categorized as generative or discriminative.

Parametric: . 