Introduction

Active Appearance Models were developed by Gareth Edwards et al. in 1998 and have ever since been a valuable extension to the extensively used Active Shape Models. The work proposed and implemented a statistical model capable of capturing the full appearance of an object -- an appearance that can be faithfully described by the generic object shape overlaid with texture. Such models express not only the variation of shape, but also the pixel intensities that are vital for full reconstruction and synthesis of valid, realistic model instances.

The creation of such a model firstly relies on landmarking, much as in the case of shape models. Annotation of edges, corners and T-junctions identifies unique attributes in an image that can be consistently located across a whole set of examples. What in fact classifies such models as ones of full appearance is their ability to extract intensity (brightness) values from a shape normalised to fit a global mean (Procrustes analysis is used here to apply translation, rotation and scaling). Although the technique was originally implemented for greyscale images, Stegmann et al. now provide an open-source API that supports a corresponding RGB appearance for face datasets. Not surprisingly, colour usually requires three times the amount of space and time to process, and any solution for greyscale data usually extends to colour by breaking each pixel into three components.
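
As a sketch of the shape normalisation step, the following aligns one 2-D shape to another under a similarity transform (translation, scaling and rotation). The function name and the use of NumPy are illustrative assumptions, not part of any particular AAM implementation, and reflections are not guarded against.

```python
import numpy as np

def align_shape(shape, target):
    """Map 'shape' onto 'target' with the best-fitting translation,
    scale and rotation (a similarity transform).

    Both arguments are (n, 2) arrays of landmark coordinates.
    """
    s = np.array(shape, dtype=float)
    t = np.array(target, dtype=float)
    mu_s, mu_t = s.mean(axis=0), t.mean(axis=0)
    s -= mu_s                                    # remove translation
    t -= mu_t
    s *= np.linalg.norm(t) / np.linalg.norm(s)   # match scale
    # Optimal rotation via SVD of the cross-covariance (Kabsch method).
    u, _, vt = np.linalg.svd(s.T @ t)
    return s @ (u @ vt) + mu_t
```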

The appearance models encompass both shape and texture, which are coded in a single vector. The means of finding correlation between the two is eigen-analysis of the covariance matrix, where Principal Component Analysis gives encouraging results and can reduce the dimensionality of the data considerably while still accounting for much of the variation (but not all of it, of course). There is a significant trade-off between the space required (hence speed) and the loss that Principal Component Analysis imposes. What is seen in practice is that the components in the analysis quickly shrink, that is, they have very small discriminatory power, and when their values become almost negligible they can be discarded. That, of course, will depend on the requirements of the system. For industrial inspection, where quality is crucial, or in medical image analysis, low error rates are usually required and the presence of abnormality is difficult to spot. On the contrary, if real-time object tracking in a video sequence is required, subsequent frames can compensate for an incorrect location and efficiency is at a premium.

Throughout the process of PCA, dimensionality reduction is performed firstly to make the shape representation more compact, and secondly to reduce the dimensionality of the vector describing texture variation (with the mean shape available for normalisation) in the observed (training) data.
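
A minimal sketch of this reduction, assuming the normalised training shapes (or texture samples) are stacked as the rows of a matrix; all names are illustrative rather than taken from any published implementation.

```python
import numpy as np

def pca(samples, var_kept=0.95):
    """Reduce a set of sample vectors with PCA.

    samples: (N, d) array, one normalised shape or texture per row.
    Returns the mean, the retained eigenvectors (as columns) and
    the corresponding eigenvalues.
    """
    mean = samples.mean(axis=0)
    x = samples - mean
    # Eigen-decomposition of the covariance matrix.
    cov = (x.T @ x) / (len(samples) - 1)
    evals, evecs = np.linalg.eigh(cov)
    order = np.argsort(evals)[::-1]          # descending by variance
    evals, evecs = evals[order], evecs[:, order]
    # Keep the fewest modes explaining var_kept of the total variance.
    k = np.searchsorted(np.cumsum(evals) / evals.sum(), var_kept) + 1
    return mean, evecs[:, :k], evals[:k]
```

Parameters for a new example then follow as b = P'(x - mean), and an instance is synthesised as x = mean + P b.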

To obtain a model that accounts for both of the above variations, namely shape and intensity, Principal Component Analysis is again used to reduce the dimensionality of the aggregation of the two. During this process the correlation between them is learned and a combined vector is formed. To account for the different representations of texture and shape, i.e. axis-aligned, normalised and centred coordinates versus 8-bit (24-bit for RGB by most conventions) encoding of colour, a matrix that scales both components by some given weighting is used. The elements of this weighting matrix, W, define a transformation that improves consistency of value range in the combined column vector. As a result of the process, a rather compact vector can be obtained which describes full appearance (shape and texture). Typically it is larger than either of the two components it merges; this is mandatory in order to account for the same amount of variation as before. Any level of model fidelity can be chosen, and it is directly connected to the number of elements the vector comprises.
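
Continuing the sketch, and reusing the pca helper above, the combined model might be built roughly as follows; choosing W as a scalar multiple of the identity is a common simplification here, not a requirement.

```python
import numpy as np

def build_combined_model(b_shape, b_texture):
    """Combine per-example shape and texture parameters.

    b_shape: (N, s) and b_texture: (N, t) hold one training
    example per row.  Returns the combined PCA model built from
    the weighted concatenation, together with the weight used.
    """
    # One common choice of W: a single scalar that makes the total
    # shape variance commensurate with the total texture variance.
    w = np.sqrt(b_texture.var(axis=0).sum() / b_shape.var(axis=0).sum())
    combined = np.hstack([w * b_shape, b_texture])   # (N, s + t)
    # A further PCA captures the correlation between shape and texture.
    return pca(combined), w
```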

A linear PCA is used to successively find the directions in which the variation of the data is maximal. Sometimes (for a manageable number of dimensions) we can visualise all vectorised data in space so that an imaginary cloud of points is formed. PCA is able to identify the component whose removal would be the most harmful to classification of that data, i.e. the direction that distinguishes different data instances most effectively. The eigenvalues corresponding to the data at hand indicate how significant each eigenvector is with respect to data discrimination. Hence, not all eigenvectors (whose number equals the data dimensionality) are equally useful in some new, more succinct vector representation. Some eigenvalues can be found to be 0, in which case the corresponding eigenvectors can be ignored entirely and dimensionality reduction that is not lossy becomes available.
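
In the usual shape-model notation (the standard formulation, not a quotation from the AAM papers), this amounts to:

```latex
S = \frac{1}{N}\sum_{i=1}^{N}(x_i - \bar{x})(x_i - \bar{x})^{T}
\qquad\text{(covariance of the training vectors)}

S\,p_j = \lambda_j\,p_j,
\qquad \lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_d \ge 0

x \approx \bar{x} + P_t\,b,
\qquad P_t = (p_1 \mid \dots \mid p_t),
\qquad b = P_t^{T}(x - \bar{x})
```

Modes with zero eigenvalue contribute nothing and can be dropped without loss; modes with small eigenvalues are dropped at the cost of a small reconstruction error.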

The allowed range of values for each parameter in the resulting appearance model indicates a general variability property. The modes of variation, that is, the collection of n modes, are sorted in descending order of influence on the overall appearance; mode n is in fact the nth vector element. The variation of the model and its allowed range are restricted by a set of parameters (virtually a column vector) bi which, considered collectively, describe a legal state of the model.
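
A common convention, assumed here rather than stated above, is to limit each parameter to about three standard deviations of its mode, i.e. |bi| <= 3 * sqrt(lambda_i). A sketch, reusing the eigenvalues returned by the pca helper:

```python
import numpy as np

def clamp_parameters(b, evals, k=3.0):
    """Constrain model parameters to a legal range: each b_i is
    limited to +/- k standard deviations of its mode, where the
    variance of mode i is the eigenvalue evals[i]."""
    limit = k * np.sqrt(evals)
    return np.clip(b, -limit, limit)

def synthesise(mean, modes, b):
    """Generate a model instance from parameters: x = mean + P b."""
    return mean + modes @ b
```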

Extensions to appearance models span a large range of applications and purposes. Work by Cootes et al. extended the application of AAMs to faces so that one can switch between different models depending on the viewpoint. Each of these models requires a separate training and learning process, as well as relevant data that can be hard to collect. This work was intended to allow greater flexibility as the head moves and rotates, which may be of interest if access control systems exploit AAMs and an almost strictly orthogonal view of a face is difficult to acquire. The normal assumptions of the model usually break when some landmarks become occluded; according to Lanitis et al. this happens when the angle between the frontal view and the camera extends over 22.5 degrees. Earlier work accounted for 3-D data, where slicing was a common requirement, as in the case of brain model (atlas) fitting. Current work attempts to automate much of the process, annotation being a particular problem. When this problem is solved, human intervention will become minimal and the cost of model acquisition will go down considerably.

To traverse image structures, the models produced are usually stretched to fit an image under a standard optimisation routine where image differences (pixel-wise differences between the synthesised model and the target image) are first and foremost taken into account. This is not the most attractive feature of this novel technique, but uses of the ability are beginning to emerge. Interpretation of gestures through the parameters b and motion tracking are among the more interesting directions that fitting a model to an image has taken; measurement, inspection and diagnosis are some of the more useful ones.

A more detailed (and somewhat tangential) aspect of AAM search is to do with optimisation, off-line training and speed-up. To allow quick and reliable convergence between a model and an image, the relationship between parameter values and the effect they have on the error measure (inferred from image differences) is learned before searching takes place. Not only the model parameters are taken into account, but also the rigid transformations that are vital for matching when, say, we know very little about the size of a target object in an image.

To achieve the above, a long sequence of alterations is applied to the models and the effect on intensities is learned and recorded in some matrix A. A collection of such matrices eventually guides the steps taken in each iteration towards better convergence. These matrices can be thought of as masks which allow any real number, and together they form a virtual image overlay.
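
A minimal sketch of that off-line stage; the residual_of callable stands in for the model-specific machinery (displace the parameters, sample the image, subtract the model texture) and is an assumption, as are all the names.

```python
import numpy as np

def train_update_matrix(examples, deltas, residual_of):
    """Learn the linear relation between parameter displacements
    and the texture residuals they induce.

    examples:    training images with known best-fit parameters.
    deltas:      (M, p) array of parameter perturbations to try.
    residual_of: callable(example, delta) -> (g,) residual vector,
                 i.e. sampled image texture minus model texture
                 after the displacement has been applied.
    Returns A such that delta ~ A @ residual during search.
    """
    D, R = [], []
    for example in examples:
        for delta in deltas:
            D.append(delta)
            R.append(residual_of(example, delta))
    D, R = np.array(D), np.array(R)            # (K, p) and (K, g)
    # Least squares: find X minimising ||R @ X - D||, then A = X.T,
    # so each displacement is predicted from its residual.
    X, *_ = np.linalg.lstsq(R, D, rcond=None)
    return X.T                                 # (p, g)
```

During search, each iteration then proposes the correction delta = A @ r for the current residual r.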

Good initialisation is normally required when it comes to the placement of a model in an image. The search inspects nearby pixels more than distant ones, and if nearby pixels show little potential for fitting, if any at all, then the algorithm will converge to some local minimum (or run forever, or until the maximum number of iterations is reached). To allow for robust performance, different resolutions of the image, as well as scaled models of appearance, can be used. Gaussian averaging is normally used to produce such analogous simpler (coarser) versions of the original data. The assumption is that at a coarse scale the problem is simplified, and what is learned can be passed forward to the later iterations that deal with finer image resolutions.
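
A rough sketch of that coarse-to-fine arrangement, using SciPy's gaussian_filter for the smoothing; search_at_level stands in for a single-resolution AAM search and is not a real API.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def gaussian_pyramid(image, levels=3, sigma=1.0):
    """Build a pyramid by repeated Gaussian smoothing and
    2x subsampling; the coarsest level comes first."""
    pyramid = [np.asarray(image, dtype=float)]
    for _ in range(levels - 1):
        smoothed = gaussian_filter(pyramid[-1], sigma)
        pyramid.append(smoothed[::2, ::2])
    return pyramid[::-1]

def coarse_to_fine_search(image, params, search_at_level):
    """Refine model parameters at each resolution in turn,
    passing the result forward to the next (finer) level.
    (Pose parameters would need rescaling between levels;
    that detail is omitted here.)"""
    for level, img in enumerate(gaussian_pyramid(image)):
        params = search_at_level(img, params, level)
    return params
```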
