Vision deals with the problem of deriving information about the world from the light reflected from it. Although the active and task-oriented nature of vision is only implicit in this formulation, this view captures several of the essential aspects of vision. As Marr (1982) phrased it in his book Vision, vision is an information processing task, in which an internal representation of information is of utmost importance. Only through representation can information be captured and made available to decision processes. The purpose of a representation is to make certain aspects of the information content explicit, that is, immediately accessible without any need for additional processing.
This introductory chapter deals with a fundamental aspect of early image representation---the notion of scale. As Koenderink (1984) emphasizes, the problem of scale must be faced in any imaging situation. An inherent property of objects in the world and details in images is that they only exist as meaningful entities over certain ranges of scale. A simple example of this is the concept of a branch of a tree, which makes sense only at a scale from, say, a few centimeters to at most a few meters. It is meaningless to discuss the tree concept at the nanometer or the kilometer level. At those scales it is more relevant to talk about the molecules that form the leaves of the tree, or the forest in which the tree grows. Consequently, a multi-scale representation is of crucial importance if one aims at describing the structure of the world, or more specifically the structure of projections of the three-dimensional world onto two-dimensional images.
The need for multi-scale representation is well understood, for example, in cartography; maps are produced at different degrees of abstraction. A map of the world contains the largest countries and islands, and possibly some of the major cities, whereas towns and smaller islands first appear in a map of a country. In a city guide, the level of abstraction is changed considerably to include streets, buildings, etc. In other words, maps constitute symbolic multi-scale representations of the world around us, although constructed manually and with very specific purposes in mind.
To compute any type of representation from image data, it is necessary to extract information, and hence interact with the data using certain operators. Some of the most fundamental problems in low-level vision and image analysis concern: what operators to use, where to apply them, and how large they should be. If these problems are not appropriately addressed, the task of interpreting the output results can be very hard. Ultimately, the task of extracting information from real image data is severely influenced by the inherent measurement problem that real-world structures, in contrast to certain ideal mathematical entities, such as ``points'' or ``lines'', appear in different ways depending upon the scale of observation.
Phrasing the problem in this way shows the intimate relation to physics. Any physical observation by necessity has to be done through some finite aperture, and the result will, in general, depend on the aperture of observation. This holds for any device that registers physical entities from the real world, including a vision system based on brightness data. Constant-size aperture functions may be sufficient in many (controlled) physical applications, e.g., fixed measurement devices, and the aperture functions of the basic sensors in a camera (or retina) may have to be determined a priori because of practical design constraints. Nevertheless, it is far from clear that registering data at a fixed level of resolution is sufficient. A vision system for handling objects of different sizes and at different distances needs a way to control the scale(s) at which the world is observed.
The goal of this chapter is to review some fundamental results concerning a framework known as scale-space that has been developed by the computer vision community for controlling the scale of observation and representing the multi-scale nature of image data. Starting from a set of basic constraints (axioms) on the first stages of visual processing, it will be shown that under reasonable conditions it is possible to substantially restrict the class of possible operations and to derive a (unique) set of weighting profiles for the aperture functions. In fact, the operators that are obtained bear qualitative similarities to receptive fields at the very earliest stages of (human) visual processing (Koenderink 1992). We shall mainly be concerned with the operations that are performed directly on raw image data by the processing modules that are collectively termed the visual front-end. The purpose of this processing is to register the information on the retina, and to make explicit those aspects of it that are to be used in later-stage processes. If the operations are to be local, they have to preserve the topology at the retina; for this reason the processing can be termed retinotopic processing.
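The idea of controlling the scale of observation can be illustrated with a minimal sketch. It assumes the standard result of linear scale-space theory that the aperture functions are Gaussian kernels parameterized by a scale (variance) t; this section itself does not name the kernel. The function names, the truncation radius, the edge-replicating border handling, and the test signal are all illustrative choices, and a sampled Gaussian is only an approximation to a proper discrete scale-space kernel.

```python
import math

def gaussian_kernel(t, truncate=4.0):
    """Sampled 1-D Gaussian with variance t, truncated and normalized to sum to 1."""
    sigma = math.sqrt(t)
    radius = max(1, int(math.ceil(truncate * sigma)))
    weights = [math.exp(-(k * k) / (2.0 * t)) for k in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]

def smooth(signal, t):
    """Observe the signal through a Gaussian aperture of scale t.

    Border samples are replicated (an assumption made for simplicity).
    """
    kernel = gaussian_kernel(t)
    radius = len(kernel) // 2
    last = len(signal) - 1
    out = []
    for i in range(len(signal)):
        acc = 0.0
        for j, w in enumerate(kernel):
            idx = min(max(i + j - radius, 0), last)
            acc += w * signal[idx]
        out.append(acc)
    return out

def count_local_extrema(values):
    """Count sign changes of the first difference, ignoring flat steps."""
    count = 0
    prev_sign = 0
    for a, b in zip(values, values[1:]):
        diff = b - a
        sign = (diff > 0) - (diff < 0)
        if sign != 0:
            if prev_sign != 0 and sign != prev_sign:
                count += 1
            prev_sign = sign
    return count

# Hypothetical test signal: a coarse structure (period 128) plus fine detail (period 8).
n = 256
signal = [math.sin(2 * math.pi * i / 128) + 0.3 * math.sin(2 * math.pi * i / 8)
          for i in range(n)]

raw_count = count_local_extrema(signal)
scales = [1.0, 4.0, 16.0, 64.0]
counts = {t: count_local_extrema(smooth(signal, t)) for t in scales}

print("extrema in raw signal:", raw_count)
for t in scales:
    print("scale t =", t, "-> extrema:", counts[t])
```

At fine scales the representation is dominated by the period-8 detail, while at coarse scales only the period-128 structure survives. No single scale is correct a priori, which is why the whole family of smoothed signals is retained.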
Early visual operations
An obvious problem concerns what information should be extracted and what computations should be performed at these levels. Is any type of operation feasible? An axiomatic approach that has been adopted in order to restrict the space of possibilities is to assume that the very first stages of visual processing should be able to function without any direct knowledge about what can be expected to be in the scene. As a consequence, the first stages of visual processing should be as uncommitted and make as few irreversible decisions or choices as possible.
The Euclidean nature of the world around us and the perspective mapping onto images impose natural constraints on a visual system. Objects move rigidly, the illumination varies, the size of objects at the retina changes with the depth from the eye, view directions may change etc. Hence, it is natural to require early visual operations to be unaffected by certain primitive transformations (e.g. translations, rotations, and grey-scale transformations). In other words, the visual system should extract properties that are invariant with respect to these transformations.
As we shall see below, these constraints lead to operations that correspond to spatio-temporal derivatives, which are then used for computing (differential) geometric descriptions of the incoming data flow. Based on the output of these operations, in turn, a large number of feature detectors can be expressed, as well as modules for computing surface shape.
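As a rough illustration of a derivative operator defined at a given scale, the sketch below convolves a 1-D signal with a sampled first-order derivative of a Gaussian. It is restricted to one spatial dimension and assumes Gaussian derivative kernels, which this section does not name explicitly. The helper names, the noise level, the border handling, and the step-edge test signal are hypothetical choices made for the example.

```python
import math
import random

def gaussian_derivative_kernel(t, truncate=4.0):
    """Sampled first derivative of a Gaussian with variance t.

    Uses g'(x) = -(x / t) * g(x), with g normalized to unit sum.
    Returns the kernel and its radius.
    """
    sigma = math.sqrt(t)
    radius = max(1, int(math.ceil(truncate * sigma)))
    offsets = range(-radius, radius + 1)
    g = [math.exp(-(k * k) / (2.0 * t)) for k in offsets]
    norm = sum(g)
    kernel = [-(k / t) * w / norm for k, w in zip(offsets, g)]
    return kernel, radius

def derivative_at_scale(signal, t):
    """First derivative of the signal observed at scale t.

    Computes the convolution sum over k of g'(k) * f(i - k), with border
    samples replicated (an assumption made for simplicity).
    """
    kernel, radius = gaussian_derivative_kernel(t)
    last = len(signal) - 1
    out = []
    for i in range(len(signal)):
        acc = 0.0
        for j, w in enumerate(kernel):
            k = j - radius
            idx = min(max(i - k, 0), last)
            acc += w * signal[idx]
        out.append(acc)
    return out

# Hypothetical test signal: a unit step at index 100 with small uniform noise.
rng = random.Random(0)
edge_position = 100
noisy_step = [(1.0 if i >= edge_position else 0.0) + rng.uniform(-0.1, 0.1)
              for i in range(200)]

response = derivative_at_scale(noisy_step, t=16.0)
detected = max(range(len(response)), key=lambda i: abs(response[i]))
print("strongest derivative response at index", detected)
```

Differentiating the raw samples directly would amplify the noise. Folding the derivative into the smoothing kernel yields a well-defined response whose strongest value lies near the step, which is the kind of output that later feature detectors can build on.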
The purpose of this chapter is to present a tutorial overview of the historical and current insights of linear scale-space theories as a paradigm for describing the structure of scalar images and as a basis for early vision. For other introductory texts on scale-space, see the monographs by Lindeberg (1991, 1994) and Florack (1993) as well as the overview articles by ter Haar Romeny and Florack (1993) and Lindeberg (1994).
Kluwer Academic Publishers, 1994. 1-41 p.