Architecture as Drawing, Perception and Cognition
Background for an exercise of computer modeling applied to the Church of Sta. Maria de Belém – Lisboa
Abstract. This work is about realizing that human perception is inherent to architecture. It is an asset, and a trait subject to training and development in an empirical way, involving physical and manual action; it cannot be taught literally through convention and logical reasoning. It is a human achievement of great significance, built on intellectual and scientific knowledge, and, being physical and empirical, it is supported by instrumental procedure. The computer, as a machine and an instrument, does not shorten the empirical experience of manipulation; on the contrary, it reinforces J.J. Gibson's findings about the perception of space in relation to eye and body movement. Being a cybernetic machine, the computer may, and shall, evolve and become perceptive; for that to happen, it is important to keep in mind the mechanism of human perception. By producing a computerized model of a major architectural work, we develop natural knowledge about its physical features and the thought that lies beneath them. Using the computer as an instrument requires explicit knowledge about its ways and mechanisms, knowledge that has to be made available. It involves training, which is to a great extent self-explanatory, and also explicit knowledge about the conventions being used, such as programming, reasoning and trigonometry.
The active role played by perception in cognition, as conceived by Arnheim (1969), connects to Guilford's (1967) theory of the Structure of the Intellect. Like Norberg-Schulz's (1967) writings toward a theory of architecture, both draw on Piaget's (1947) findings about the way sensory-motor activity develops to build one's mental schemes, which in turn articulate into the more complex mental constructs that we call cognition. Guilford develops his theory on the scientific possibility of measuring, in a precise and objective manner through standard tests, a set of intellectual faculties called factors, represented in his Structure of Intellect model. While insisting on scientific rigor, as stated in his Psychometric Methods (Guilford, 1936), he also acknowledged phenomenological reduction in his writing on qualitative descriptions (Guilford, 1967), where the phenomenological could be dealt with. Concerned with the objectiveness that should assist any scientific proceeding, he was aware of the difference between a mathematical representation and the need for indicators from the actual, observable world that we live in. In calling upon the mythic dimension of a mathematical infrastructure over which events take place in the world, Guilford was addressing what had been the basis for a theoretical representation of architecture (Guilford, 1936). His statement is that mathematics is a human invention and not a real discovery, and that its adjustment to events, which makes their prediction possible, is first of all a convenient coincidence.
Just as Arnheim would, Guilford underscores that, associated with intellectual development, there emerge the capacity to find visual constancy beneath the changing shape that context imposes on what was initially taken as a single unitary object, and the capacity to group similar unitary objects into homogeneous classes. The basis for that discovery is both visual and analogical: not an a priori symbolic operation, but a kind of visual reasoning in which prediction can be made, comparable to mathematical modeling. Objects have surfaces, contour, dimensions and distance; these become variables in vision correlated with reality (figs. 1 and 2). Arnheim (1969) associates perception (the images of thought) with the building of cognition, stating that there is a link between thought and perception. Good perception means that we can read from the perceived object its pertinent generic features, the ones that assemble the skeletal structure of an image, and this ability does not come without active thought. Being able to perceive a visual structure from visible images is being able to build abstractions, and that is the basis of perception and the start of cognition. Susanne Langer (1942) observes that this kind of abstraction comes through a disposition of imagination to isolate significant examples from a general context and to reapply them, through interpretation, to other conditions found in reality. Unlike Langer, Arnheim does not think that this work done by thought on visible material (which Langer calls representative abstraction, as opposed to science's generalizing abstraction) is a sole feature of artistic performance. He thinks that the ability of scientists to sample a set of cases prior to reaching conclusions is also representative abstraction: there is also in science an insight about what is going to be concluded, a formulation originating in Morris Cohen and Ernest Nagel's (1934) work.
This idea is then illustrated with the visual imagination involved in Copernicus's astronomical model. Images of thought thus have a quality that distinguishes them from an exact reproduction of all physical features of the perceived object: some degree of incompleteness. The fragmented images with which the mind operates are such because this fragmented character is a positive quality produced by the object's mental apprehension. It allows one to mentally process visual input, which is different from exactly determining an object's tangible and material dimensions.
figure 1. Literal vision; identifying surfaces and contour
figure 2. Descriptive dimensions and representation
Gestalt psychology had already stated this idea with the temporal Gestalt (Koffka, 1935). One conclusion to be drawn from this observation is that the fragmented pieces withdrawn from visual representation convey a bone-like structure of dynamic traits that plays an essential role in mental operations such as abstracting, generalizing and classifying. It is important to emphasize that, although fragmented, these images are visual constructs and not just the result of conventions, a difference addressed in Gibson's (1950) distinction between schematic perception and literal perception. The condition that assists the images of thought is that they are structurally similar to the actual images (fig. 3).
Seeing as science
Psychology has told us that the capacity to respond to problems raised by our environment, or to develop knowledge about that environment, is built upon the existence of 'norm images' (Arnheim, 1969). The problem with this statement is that, at first, it seems too simple, as if it were not rational or did not proceed from an intellectually valid discourse. From the start, the term 'norm images' gives this notion a connotation with the ability to see as painters, sculptors or architects see, and at the same time it puts it at some distance from activities that we do not normally associate with a predominance of 'seeing', such as scientific work engaged in mathematical and statistical manipulation, or the writing of a novel or an essay. What it actually means is that every act of perception calls upon the possibility of associating the event being perceived with a visual concept, the stock of previous perceptions treated as structured images, recalled as apt to be applied to that event: a process calling upon memory, classified as recognition.
figure 3. Visual construct
Architects are normally considered scientific artists, a label that, while addressing the type of synthesis tackled by architecture, involving technical ability, physical knowledge, social and cultural awareness, and artistic sensibility, does not quite give the activity the clearness of instrumental procedure expected in the knowledge-specialist world that we live in. While it hovers ambiguously between artistic irresponsibility and the strictness of scientific method, we should ask ourselves whether artists are really irresponsible, or scientists really strict. Presently, one of the reasons architects are considered so derives from the fact that architects draw (as in sketching), but they also draw scientifically (as in Gaspard Monge's descriptive geometry). Of course, we could also say that what architects draw depends on facts and data with scientific validation; but then the people who produced such data would be better suited to use it, knowing about it with greater ease, knowing it naturally, which, as we know, is not quite true. From the 'norm images' point of view, we are led to ask ourselves: what kind of image are we thinking of? The retinal impression engraved at random, some acquired notion derived from geometry, something completely different, or a mix of the above? Rudolf Arnheim (1969) gives us a sound notion of what is to be expected from sensorial input in order to form valid concepts that may be used as knowledge (fig. 4). Assessing the importance of sight, he distinguishes the retinal projection of an image from the human perception acquired through this projection. Perception becomes analogous to an intellectual concept, and in some circumstances they end up as the same.
A relevant fact in this process is the formation of a constant visual concept that one associates with a particular object, identified in order to deal with the practical necessities of everyday life: for instance, a lettuce that needs to be brought under attention, displaying its expected green color. Visual concepts thus need some sort of constancy in order to be easily manipulated in our minds; at the same time, a varying degree of intelligence creates variation in how this constancy is formed. The way the constant image is created, taking into consideration or not the context where the object is perceived, creates this variation.
The conclusion (Arnheim, 1969) is that the possibility of observing an object in a changing context is bound to give us new information about what is constant in that object, and that is why scientists should always be in quest of new situations capable of giving them new information. This is what should be associated with productive thought. On the other hand, if our constant image is frozen as a stereotype, we will never have its satisfactory sensorial verification taken from tangible experience; we will be blind to significant changes that may have occurred in the original concept, or to the revelation granted by new contexts.
The type of order that connects a perceived subject to changes in context is described as an ordered sequence of progressive change, where different points of view appear as the melting of different states of one single persistent object. Another type of constancy happens when different views appear as deviations or distortions of one simpler shape. Arnheim (1969) concludes by observing that these distortions not only allow but actively imply the discovery of the prototype; consequently, they are not perceived as a negative feature hiding the true shape of the invariant object, but positively, as the logical consequence of a condition acting upon the true shape of the object. As such, a tilted tree may be seen as a normal tree changed by the effect of winds.
Drawing as seeing as science
Gibson (1950) states that visual perception can be either schematic, i.e. originated in learning and prior convention about meaning, or literal. Realizing the convenience of sight for what he calls "getting about and doing things", he points out that there is something special taking place between what we perceive from seeing and the flat physiological retinal picture: what he calls "the puzzle of the third dimension". He says that this is a problem about perceiving space, which means identifying shapes, distinguishing them from a background and realizing their relative location and the position at which they lie. Objective and literal vision is what concerns the architect when dealing with sight. This depends upon the mechanisms of perception, in which literal visual input is dealt with as a flat, light-dependent impression – the visual field – while at the same time a three-dimensional space is realized, where objects stand with constant shapes – the visual world.
The theory is that we operate on correlates of objective properties, which become variables of perception, worked through the retinal projection, saccadic eye movement (Gibson, 1950) (fig. 5), and the movement of the observer in space. There is thus a capacity in visual perception to compose overlapping retinal projections into panoramic vision, where successive focuses of sight are combined through primary memory-vision (fig. 6). One of the most important operations that take place is that we are able to create distinctions upon the impressions of the visual field. From all the changes that take place when we look at something, whether staring or rapidly changing focus and shifting stance, we are able to perceive constancy among the changing impressions that concern one single object, and we are also capable of discriminating between objects, detecting an entity that at some point is not an object but a background. This kind of operation allows us to identify objects, naming them and associating meaning with their names and with their visual impressions (fig. 7). Taking the architect's stand on how we look, the objective is to seize the features of the literal visual world, to be able to look objectively and detect reality. There is meaning to be attached to his work, but at some point he must be able not to let himself, as a subject, interfere with this objective stand. It is not enough to take measurements and draw descriptively; he must be able to discriminate and identify, creating the objective correlates of those visual impressions.
Ignacio Araujo, a Spanish architect and teacher of architecture at the University of Navarra in Pamplona, made a synthesis of his academic writings (Araujo, 1976) in which he points out that an architect learns how to look by drawing. His drawing means objectively taking note of volumetric shapes and material textures, under the effect of light and color, by appropriate strokes of pen and pencil; strokes that must carry intent and correlate with reality. Modeling with the computer, one must carry over this capacity of looking through drawing. There is an increased distance from reality (fig. 8), but it should be no greater than being in a painter's studio, looking at reality through another artist's work. Reality is then conceptualized with a greater capacity to identify and discriminate.
figure 6. Differences in the visual field and factors of projection
Knowledge from drawing
There are factors in perception that should be taken into account, in connection with Guilford's (1967) distinction between the concrete and the abstract. These factors distinguish two kinds of mental images. One is associated with sensory stimulus and operates from a retinal projection, while the other is symbolic and builds upon the prior existence of conventions that have to be learned (fig. 9). We can easily associate the latter with symbols such as letter types connected into words or mathematical expressions, but these conventions could also be other types of symbols, comparable with Guilford's (1967) classification of the 'figural factor'. They are symbols where space manipulation is associated with semantic content conveyed in written form, and Arnheim (1969) mentions them when talking about representation, symbols and signs. As such, these kinds of constructs differ in nature from Guilford's concrete intelligence, which is built upon perception. Although visual, they are different from the classification operated through perception that Arnheim associates with the images of thought, because these correspond to the "bone-like structure of traits withdrawn from exterior visual stimulus, maintaining some degree of isomorphism with that stimulus".
Concreteness is a quality that should be emphasized. It is a term used by Piaget (1937) to explain what goes on at the early stages of a child's development, when he mentions the need to exercise sensory-motor manipulation in order to acquire knowledge and adequate schemes of behavior, a process that is also determined by interacting with others, by socializing. Arnheim (1969) calls upon the traditional distinction between the person who tends to work through symbolic manipulation and the one who operates manually on the concrete world, and he concludes that this is not an appropriate distinction, because the development and exercise of mental capacities is also associated with operations dealing with the physical manipulation of objects in the concrete world. He uses the expression "oriented towards ideas or towards objects". Physical behavior is determined by perceptual ability, and every sort of manipulation involves an assessment of how appropriate a solution might be, of how it works, which is typical of productive thought. Such manipulations take the form of physical behavior; thus we can say that sensory-motor behavior implies manipulating abstract ideas.
The conclusion to be drawn (Arnheim, 1969) is that any object that looks articulate will give away perceptual clues that in turn build up the elaboration of thought. There is then a cause-effect relationship between the way a child's environment looks, how physically manipulative he can be over that environment, and the buildup of his cognitive abilities. This happens because we operate with both analytic and synthetic judgment; consequently, whenever we are actively thinking, even if using words, we are recalling previous perceptive experience. Visual media become an advantage because they keep structural equivalents of features taken from objects, events and relationships. Thought of as perception, cognition is both the fruit of intuition about the whole, where one becomes aware of the general organization of shapes, colors, place and function in relation to each other, and the intellectual analysis of the elements, operating through the listing of each and every one and its particular properties.
figure 9. Symbolic meaning
We know that the world perceived through vision has depth, extends into the distance, and is filled with meaningful objects. We have the conviction that these qualities can be observed through sight, and we tend to think that sight is the same as one image. Actually, it is not. Through sight we acquire correlates of the world, which are perceived as an organized complex of variations. There are properties, both light-dependent and of a phenomenal type, that are correlated with patterns of variation, allowing us to establish constant and literal perception of reality. This variation and correlation integrates figural qualities (the norm image), patterns of change, projection, and discrepancies. Objects are discriminated and differentiated, identified, and detached from a background when constancy is perceived as a quality. This background is in turn made up of more objects. They establish a more complex perception organized through proximity, similarity, symmetry and good continuity, building on inclusive space (figs. 10 and 11).
References
Araujo, Ignacio: 1976, La Forma Arquitectónica, Universidad de Navarra (EUNSA), Pamplona.
Arnheim, Rudolf: 1969, Visual Thinking, University of California Press, Berkeley.
Broadbent, Geoffrey: 1973, Design in Architecture: Architecture and the Human Sciences, David Fulton Publishers Ltd., London.
Cohen, Morris & Nagel, Ernest: 1934, An Introduction to Logic and Scientific Method, Harcourt Brace, New York.
Gibson, James J.: 1950, The Perception of the Visual World, The Riverside Press, Cambridge, Massachusetts.
Guilford, Joy Paul: 1936, Psychometric Methods, McGraw-Hill, New York.
Guilford, Joy Paul: 1967, The Nature of Human Intelligence, McGraw-Hill, New York.
Koffka, Kurt: 1935, Principles of Gestalt Psychology, Harcourt Brace, New York.
Langer, Susanne: 1942, Philosophy in a New Key, Harvard University Press, Cambridge, Massachusetts.
Morris, Charles: 1946, Signs, Language and Behavior, Prentice-Hall, New York.
Norberg-Schulz, Christian: 1967, Intenciones en Arquitectura (orig. Oslo, Universitetsforlaget), Editorial Gustavo Gili S.A., Barcelona, 1979.
Piaget, Jean: 1947, La Psychologie de l'intelligence, Armand Colin, Paris.
Piaget, Jean: 1937, La construction du réel chez l'enfant, Delachaux & Niestlé, Paris.
Wertheimer, Max: 1959, On discrimination experiments, Psychological Review, 66, pp. 265-273.
The visual perception of 3D shapes
James T. Todd
A fundamental problem for the visual perception of 3D shape is that patterns of optical stimulation are inherently ambiguous. Recent mathematical analyses have shown, however, that these ambiguities can be highly constrained, so that many aspects of 3D structure are uniquely specified even though others might be underdetermined. Empirical results with human observers reveal a similar pattern of performance. Judgments about 3D shape are often systematically distorted relative to the actual structure of an observed scene, but these distortions are typically constrained to a limited class of transformations. These findings suggest that the perceptual representation of 3D shape involves a relatively abstract data structure that is based primarily on qualitative properties that can be reliably determined from visual information.
One of the most remarkable phenomena in the study of human vision is the ability of observers to perceive the 3D shapes of objects from patterns of light that project onto the retina. Indeed, were it not for our own perceptual experiences, it would be tempting to conclude that the visual perception of 3D shape is a mathematical impossibility, because the properties of optical stimulation appear to have so little in common with the properties of real objects encountered in nature. Whereas real objects exist in 3-dimensional space and are composed of tangible substances such as earth, metal or flesh, an optical image of an object is confined to a 2-dimensional projection surface and consists of nothing more than patterns of light. Nevertheless, for many animals, including humans, these seemingly uninterpretable patterns of light are the primary source of sensory information about the arrangement of objects and surfaces in the surrounding environment.
Scientists and philosophers have speculated about the nature of 3D shape perception for over two millennia, yet it remains an active area of research involving many different disciplines, including psychology, neuroscience, computer science, physics and mathematics. The present article reviews the current state of the field from the perspective of human vision: it will first summarize how patterns of light at a point of observation are mathematically related to the structure of the physical environment; it will then consider some recent psychophysical findings on the nature of 3D shape perception; it will evaluate some possible data structures by which 3D shapes might be perceptually represented; and it will summarize recent research on the neural processing of 3D shape.
Sources of information about 3D shape
There are many different aspects of optical stimulation that are known to provide perceptually salient information about 3D shape.
Several of these properties are exemplified in Figure 1. They include variations of image intensity or shading, gradients of optical texture from patterns of polka dots or surface contours, and line drawings that depict the edges and vertices of objects. Other sources of visual information are defined by systematic transformations among multiple images, including the disparity between each eye's view in binocular vision, and the optical deformations that occur when objects are observed in motion. How is it that the human visual system is able to make use of these different types of image structure to obtain perceptual knowledge about 3D shape? The first formal attempts to address this issue were proposed by James Gibson and his students in the 1950s. Gibson argued that to perceive a property of the physical environment, it must have a one-to-one correspondence with some measurable property of optical stimulation. According to this view, the problem of 3D shape perception is to invert (or partially invert) a function of the following form: L = f(Φ), where Φ is the space of environmental properties that can be perceived, and L is the space of measurable image properties that provide the relevant optical information.
Figure 1. Some possible sources of visual information for the depiction of 3D shape. The 3D shapes in the four different panels are perceptually specified by: (a) a pattern of image shading, (b) a pattern of lines that mark an object's occlusion contours and edges with high curvature, (c) gradients of optical texture from a pattern of random polka dots, and (d) gradients of texture from a pattern of parallel surface contours.
Supplementary data associated with this article can be found at doi:10.1016/j.tics.2004.01.006.
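The difficulty of inverting such a function can be made concrete with a short sketch (an illustration assumed here, not taken from the article): under a simple pinhole projection, an object scaled in both size and distance by the same factor produces the same image, so the mapping from scenes to images is many-to-one and cannot be inverted without further constraints.

```python
# Minimal pinhole-projection sketch (hypothetical example): an object
# scaled in size and distance by the same factor produces an identical
# image, so the mapping from 3D scenes to 2D images is many-to-one.

def project(point, focal=1.0):
    """Perspective projection of a 3D point (x, y, z) onto the image plane."""
    x, y, z = point
    return (focal * x / z, focal * y / z)

near = [(0.1, 0.2, 1.0), (-0.3, 0.6, 1.2)]          # small, close object
far = [(x * 5, y * 5, z * 5) for x, y, z in near]   # 5x larger, 5x farther

for p, q in zip(near, far):
    u1, v1 = project(p)
    u2, v2 = project(q)
    # Different 3D scenes, indistinguishable images:
    assert abs(u1 - u2) < 1e-12 and abs(v1 - v2) < 1e-12
```

The same size-distance ambiguity underlies the many-to-one mapping discussed in the text; any additional cue (stereo, motion, texture) acts as a constraint that prunes this family of equivalent scenes.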
The primary difficulty for this approach is that in most natural contexts the relation between environmental properties and optical stimulation is almost always a many-to-one mapping: that is to say, for any given pattern of optical stimulation, there is usually an infinity of possible 3D structures that could potentially have produced it. The traditional way of dealing with this problem is to assume the existence of environmental constraints that restrict the set of possible interpretations. For example, to analyze the pattern of texture in panel (c) of Figure 1, it would typically be assumed that the actual surface markings are all circular, and that the shape variations within the 2D image are due entirely to the effects of foreshortening from variations of surface orientation. Similarly, an analysis of the contour texture in panel (d) of Figure 1 would most likely assume that the actual contours carve up the surface into a series of parallel planar cuts. Although some of the constraints that have been invoked to resolve ambiguities in the visual mapping are intuitively quite plausible, others have been adopted more for their mathematical convenience than for their ecological validity.
The problem with this approach is that the resulting analyses of 3D shape might only function effectively within narrowly defined contexts, which have a small probability of occurrence in the natural environments of real biological organisms. The inherent ambiguity of visual information is not always as serious a problem as it might appear at first. Although a given pattern of optical stimulation can have an infinity of possible 3D interpretations, it is often the case that those interpretations are highly constrained, such that they are all related by a limited class of transformations.
Recent theoretical analyses have shown, for example, that the optical flow of 2-frame motion sequences or the pattern of shadows in an image are ambiguous up to a set of stretching or shearing transformations in depth (see Figure 2) [4,5]. It is important to note that these are linear transformations – sometimes called ‘affine’ – that preserve a variety of structural properties, such as the relative signs of curvature on a surface, the parallelism of lines or planes, and relative distance intervals in parallel directions. Thus, 2-frame motion sequences or patterns of shadows can accurately specify all aspects of 3D shape that are invariant over affine transformations, but they cannot specify other aspects of 3D structure involving metrical relations among distance intervals in different directions.
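As a rough illustration of why these affine ambiguities leave some properties intact, the sketch below (function names and coefficients are hypothetical, not from the article) applies a depth stretch-and-shear z' = s·z + a·x + b·y to two parallel 3D line segments and checks that they remain parallel, one of the invariants mentioned above.

```python
import numpy as np

# Sketch of the affine family of depth transformations discussed above:
# z' = s*z + a*x + b*y. Such maps change metric depth relations but
# preserve affine properties, e.g. parallel 3D lines remain parallel.

def depth_affine(points, s=0.5, a=0.3, b=-0.2):
    """Apply a depth stretch (s) and shear (a, b) to an Nx3 point array."""
    x, y, z = points.T
    return np.column_stack([x, y, s * z + a * x + b * y])

rng = np.random.default_rng(0)
p0 = rng.standard_normal(3)
d = np.array([1.0, 2.0, 0.5])                    # shared direction vector
steps = np.linspace(0.0, 1.0, 5)
line1 = p0 + np.outer(steps, d)
line2 = p0 + np.array([3.0, 0.0, 1.0]) + np.outer(steps, d)  # parallel offset

t1 = depth_affine(line1)
t2 = depth_affine(line2)

# Direction vectors after the transform are still parallel
# (their cross product vanishes), even though depths have changed:
v1 = t1[-1] - t1[0]
v2 = t2[-1] - t2[0]
assert np.allclose(np.cross(v1, v2), 0.0)
```

Metric quantities such as the ratio of depth extent to width are not preserved by this map, which is exactly why 2-frame motion or shading cannot specify them.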
It is important to point out in this context that there are some potential sources of visual information by which it is theoretically possible to determine the 3D shape of an object unambiguously. These include apparent motion sequences with three or more distinct views, and binocular displays with both horizontal and vertical disparities. Because motion and stereo are such powerful sources of information, especially when presented in combination, it should not be surprising that they are of primary importance for the perception of 3D shape in natural vision. However, there is a large body of empirical evidence to indicate that human observers are generally incapable of making full use of that information. When required to make judgments about 3D metric structure from moving or stereoscopic displays, observers almost always produce large systematic errors.
Psychophysical investigations of 3D shape perception
The earliest psychophysical experiments on perceived 3D shape were performed in the 19th century to investigate stereoscopic vision, although the stimuli used were generally restricted to small points of light presented in otherwise total darkness. These studies revealed that observers’ perceptions can be systematically distorted, such that physically straight lines in the environment can appear perceptually to be curved, and apparent intervals in depth become systematically compressed with increased viewing distance. Given the impoverished nature of the available information in these experiments, it is reasonable to be skeptical about the generality of their results, but more recent research has shown that these same patterns of distortion also occur for judgments of real objects in fully illuminated natural environments [9–14]. A particularly compelling example can be experienced while driving in an automobile along a multi-lane highway.
In the United States, the hash marks that separate passing lanes all have a length of 10 ft (3.05 m), but that is not how they appear perceptually to human observers: those in the distance appear much shorter than those closer to the observer. Until quite recently, most experiments on the perception of 3D shape have used relatively crude psychophysical measures, such as judging the magnitude of an object's extension in depth, or estimating the ratio of its depth and width. This type of procedure is obviously inadequate for revealing the richness of human perception: a sphere, a cube and a pyramid can have identical depths and widths, yet all observers would agree that their shapes are quite different. During the past decade our empirical understanding of 3D shape perception has been significantly enhanced by the development of more sophisticated psychophysical methods [15–19], several of which are described in Figure 3. What they all have in common is that observers are required to estimate some aspect of local 3D structure at many different probe points on an object's surface, and these responses are then analyzed to compute a surface that is maximally consistent, in a least-squares sense, with the overall pattern of an observer's judgments. Consider, for example, a recent experiment by Todd and co-workers. Observers in this study made profile adjustments (Figure 3c) for images of randomly shaped surfaces, similar to the one in the lower left panel of Figure 1, that were depicted with different types of texture. An analysis of these judgments revealed that they were almost perfectly correlated with the simulated 3D structures of the depicted surfaces. The correlation between different observers was also high, as was the test-retest reliability for individual observers across multiple experimental sessions. These findings indicate that observers' judgments about the general pattern of concavities and convexities were quite accurate, but there was one aspect of the apparent 3D structure that was systematically distorted.
Figure 2. The inherent ambiguity of image shading. Belhumeur, Kriegman and Yuille have shown that the pattern of shadows in an image is inherently ambiguous up to a stretching or shearing transformation in depth. (a) and (b) show the front and side views of a normal human head. (c) and (d), by contrast, show front and side views of this same head after it and the light source were subjected to an affine shearing transformation. Note that the two front views are virtually indistinguishable, even though the depicted 3D shapes are quite different.
The judged magnitude of relief was underestimated by all observers, and there were large individual differences in the extent of this underestimation that ranged from 38% to 75%.
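The depth-scaling analysis described above can be sketched numerically. The snippet below uses simulated rather than real observer data (the profile shape, compression factor and noise level are all illustrative, not taken from the actual experiments) to show how a least-squares fit recovers a relief-compression factor relating judged to simulated depth, while the profile correlation remains near ceiling:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated ground-truth depth profile along one cross-section of a surface.
x = np.linspace(0.0, 1.0, 50)
true_depth = 0.3 * np.sin(2 * np.pi * x) + 0.1 * np.cos(4 * np.pi * x)

# Hypothetical observer judgments: relief compressed to 55% of its true
# magnitude (an illustrative value), plus a small amount of response noise.
judged = 0.55 * true_depth + rng.normal(0.0, 0.01, size=x.size)

# Least-squares estimate of the scaling factor relating judged to true depth.
scale = np.linalg.lstsq(true_depth[:, None], judged, rcond=None)[0][0]

# Correlation between judged and simulated profiles: typically near 1 even
# though the magnitude of relief is substantially underestimated.
r = np.corrcoef(true_depth, judged)[0, 1]
print(f"estimated depth scaling: {scale:.2f}, correlation: {r:.3f}")
```

The point of the sketch is that a single scalar (the depth scaling) can absorb almost all of the disagreement between judged and simulated structure, which is exactly the pattern reported for texture-defined surfaces.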
These results suggest that the available information from texture gradients can only specify the 3D shape of a surface up to an indeterminate depth scaling. Thus, when observers are required by experimental task demands to estimate a specific magnitude of relief, they are forced to guess or to adopt some ad hoc heuristic. Although this general pattern of results is quite common in experiments on 3D shape perception, it is by no means universal [19,21,22]. Sometimes the variations among observers' judgments are more complex and cannot be accounted for by a simple depth-scaling transformation. A particularly clear example of this phenomenon has recently been reported by Koenderink et al. The stimuli in their study consisted of four shaded photographs of abstract sculptures by the Romanian artist Constantin Brancusi. Over a series of experimental sessions, the depicted shape in each photograph was judged by observers using all of the different methods described in Figure 3. These judgments were then analyzed to compute a best-fitting response surface for each observer in each condition, and these response surfaces were compared using regression analyses.
A particularly surprising result from this study was that, in some instances, the correlations between the judged 3D structures obtained by a given observer using different response tasks were close to zero! Subsequent analyses revealed, however, that almost all of the variance between these conditions could be accounted for by an affine shearing transformation in depth like the one depicted in Figure 2. This pattern of results is perfectly consistent with the inherent ambiguity of shading information described by Belhumeur et al. These findings indicate that when observers make judgments about objects depicted in shaded images, they are quite accurate at estimating those aspects of structure that are uniquely specified by the available information – in other words, those that are invariant over affine stretching or shearing transformations in depth (Figure 2). Because all remaining aspects of structure are inherently ambiguous, observers must subconsciously adopt some type of strategy to constrain their responses, and it appears that these strategies do not necessarily remain constant across different response tasks. A similar result has also been reported by Cornelis et al., who found affine shearing distortions between the judged 3D shapes of objects depicted in images viewed at different orientations.

Figure 3. Alternative methods for the psychophysical measurement of perceived 3D shape. (a) depicts a possible stimulus for a relative depth probe task. On each trial, observers must indicate by pressing an appropriate response key whether the red dot or the green dot appears closer in depth. (b) shows a common procedure for making judgments about local surface orientation. On each trial, observers are required to adjust the orientation of a circular disk until it appears to fit within the tangent plane of the depicted surface. Note that the probe on the upper right of the object appears to satisfy this criterion, but that the one on the lower left does not. (c) shows a possible stimulus for a depth-profile adjustment task. On each trial, an image of an object is presented with a linear configuration of equally spaced red dots superimposed on its surface. An identical row of dots is also presented against a blank background on a separate monitor, each of which can be moved in a perpendicular direction with a hand-held mouse. Observers are required to adjust the dots on the second monitor to match the apparent surface profile in depth along the designated cross-section. By obtaining multiple judgments at many different locations on the same object, it is possible with all of these procedures to compute a specific surface that is maximally consistent, in a least-squares sense, with the overall pattern of an observer's judgments.
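The shearing ambiguity discussed above can also be checked numerically. The sketch below is a simplified Lambertian rendering (it ignores attached-shadow clipping, and the surface, light direction and transformation parameters are arbitrary illustrative choices): applying a generalized bas-relief transformation to a surface, together with the matching transformation of the light source, leaves the rendered image unchanged, which is the identity reported by Belhumeur et al.:

```python
import numpy as np

# Depth map z = f(x, y) on a grid: a smooth Gaussian bump.
n = 64
y, x = np.meshgrid(np.linspace(-1, 1, n), np.linspace(-1, 1, n), indexing="ij")
f = np.exp(-(x**2 + y**2) * 3.0)

step = 2.0 / (n - 1)
fx = np.gradient(f, step, axis=1)
fy = np.gradient(f, step, axis=0)

# Albedo-scaled surface normals; with unit albedo these are proportional
# to (-fx, -fy, 1) at each pixel.
norm = np.sqrt(fx**2 + fy**2 + 1.0)
b = np.stack([-fx, -fy, np.ones_like(f)], axis=-1) / norm[..., None]

# Generalized bas-relief transform: f_bar = lam*f + mu*x + nu*y.
# The matrix G maps the scaled normals of f to those of f_bar (up to an
# implied albedo rescaling, absorbed into b_bar here).
lam, mu, nu = 0.6, 0.3, -0.2
G = np.array([[lam, 0.0, -mu],
              [0.0, lam, -nu],
              [0.0, 0.0, 1.0]])

# Lambertian intensities I = b . L (attached shadows ignored for clarity).
L = np.array([0.4, 0.2, 1.0])        # light direction * strength
b_bar = b @ G.T                       # transformed scaled normals, G b
L_bar = np.linalg.solve(G.T, L)       # transformed light, G^{-T} L

I_orig = b @ L
I_bar = b_bar @ L_bar
print(np.allclose(I_orig, I_bar))     # the two images are identical
```

The equality holds because (G b) · (G⁻ᵀ L) = b · L for any invertible G, so the sheared surface under the sheared light produces exactly the same shading pattern as the original.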
The perceptual representation of 3D shape
Almost all existing theoretical models for computing the 3D structures of arbitrary surfaces from visual information are designed to generate a particular form of data structure that can be referred to generically as a 'local property map'. The basic idea is quite simple and powerful. A visual scene is broken up into a matrix of small local neighborhoods, each of which is characterized by a number (or a set of numbers) representing some particular local aspect of 3D structure, such as depth or orientation. This idea was first proposed over 50 years ago by James Gibson, although he eventually rejected it as one of his biggest mistakes.
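As a rough sketch of this generic data structure (the attribute names and grid resolution here are hypothetical, not taken from any particular model), a local property map can be represented as a set of per-neighborhood arrays:

```python
import numpy as np

# A 'local property map': the scene is divided into a grid of small
# neighborhoods, each annotated with local 3D attributes such as depth and
# surface orientation (expressed here as slant and tilt angles).
rows, cols = 4, 4
local_property_map = {
    "depth": np.zeros((rows, cols)),   # distance to the surface patch
    "slant": np.zeros((rows, cols)),   # angle away from frontoparallel
    "tilt": np.zeros((rows, cols)),    # direction of steepest depth change
}

# Annotate one neighborhood: a patch at depth 2.0, slanted 30 degrees,
# tilted toward the vertical (illustrative values).
local_property_map["depth"][1, 2] = 2.0
local_property_map["slant"][1, 2] = np.deg2rad(30.0)
local_property_map["tilt"][1, 2] = np.deg2rad(90.0)
```

Note that every entry in such a map is defined relative to the observer, which is precisely why, as the following paragraph argues, the representation is unstable across changes of vantage point.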
One major shortcoming of local property maps as a possible data structure for the perceptual representation of 3D shape is that they are highly unstable. Consider what occurs, for example, when an object is viewed from multiple vantage points. In general, when an object moves relative to the observer (or vice versa), the depths and orientations of each visible surface point will change, so that any local property map based on those attributes will not exhibit the phenomenon of shape constancy. In principle, one could perform a rigid transformation of the perceived structure at different vantage points to see if they match. This would only work, however, if the perceived metric structure were veridical, and the empirical evidence shows quite clearly that this is not the case. What type of data structure could potentially capture the qualitative aspects of 3D surface shape without also requiring an accurate or stable representation of local metric properties? It has long been recognized that a convincing pictorial representation of an object can sometimes be achieved by drawing just a few salient features (see Figure 4). For example, one such feature that is especially important is the occlusion contour that separates an object from its background.
Indeed, an occlusion contour presented in isolation (e.g. a silhouette) can often provide sufficient information to recognize an object, and to reliably segment it into distinct parts [26–28]. Another class of features that is perceptually important for segmentation and recognition includes the edges and vertices of polyhedral surfaces [29,30], and there is some evidence to suggest that the topological arrangement of these features provides a relatively stable data structure that can facilitate the phenomenal experience of shape constancy (see Box 1).
Within the literature on both human and machine vision, there have been numerous attempts to analyze line drawings of 3D scenes. This research was initially focused on the interpretation of line drawings of simple plane-faced polyhedra [31,32]. Researchers were able to exhaustively catalog the different types of vertices that can arise in line drawings of these objects, and then use that catalog to label which lines in a drawing correspond to convex, concave, or occluding edges. Similar procedures were later developed to deal with other types of lines corresponding to shadows or cracks, and with the occlusion contours of smoothly curved surfaces. A closely related approach has also been used to segment objects into parts, which can be distinguished from one another by the different types of features of which they are composed. The arrangement of these parts provides sufficient information to successfully recognize a wide variety of common 3D objects. Moreover, because the classification of vertices and edges is generally unaffected by small changes in 3D orientation, this method of recognition has a high degree of viewpoint invariance relative to other approaches that have been proposed in the literature.
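The catalog-based labeling procedure described above can be illustrated with a toy constraint search. The junction catalog below is a deliberately simplified stand-in, not the actual Huffman-Clowes tables, and the 'triangle' drawing is a minimal hypothetical example; the point is only the mechanism: enumerate candidate edge labelings and keep those consistent with the catalog at every junction:

```python
from itertools import product

# Edge labels: '+' convex, '-' concave, '>' occluding.
LABELS = ("+", "-", ">")

# A *simplified, illustrative* junction catalog (NOT the full
# Huffman-Clowes tables): each junction type maps to the label tuples
# permitted for its incident edges, listed in order.
CATALOG = {
    "L": {(">", ">"), ("+", "+"), ("-", "-")},
    "arrow": {(">", ">", "+"), ("+", "+", "-")},
}

def label_drawing(junctions, n_edges):
    """Brute-force search for edge labelings consistent with the catalog.

    junctions: list of (junction_type, (edge_indices, ...)) pairs.
    Returns every globally consistent label assignment.
    """
    solutions = []
    for labels in product(LABELS, repeat=n_edges):
        if all(tuple(labels[i] for i in edges) in CATALOG[jtype]
               for jtype, edges in junctions):
            solutions.append(labels)
    return solutions

# A toy 'triangle' drawing: three edges meeting at three L-junctions.
junctions = [("L", (0, 1)), ("L", (1, 2)), ("L", (2, 0))]
solutions = label_drawing(junctions, 3)
print(solutions)
```

With this toy catalog the only consistent interpretations assign the same label to all three edges, mirroring how the real catalogs prune the space of physically realizable scenes. Practical systems replace the brute-force loop with constraint propagation (Waltz filtering), but the underlying idea is the same.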
There is an abundance of evidence from pictorial art and human psychophysics that occlusion contours and edges of high curvature play an important role in the perception of 3D shape, but the mechanisms by which these features are identified within 2D images remain poorly understood. One important reason why edge labeling is so difficult is that the pattern of 2D image structure can be influenced by a wide variety of environmental factors. Occlusion contours and edges of high curvature are most often specified by abrupt changes of image intensity, but similar abrupt changes can also be produced by cast shadows, specular highlights, or changes in surface reflectance (see Figures 1 and 4).
Human observers generally have little difficulty identifying these features, but there are no formal algorithms that are capable of achieving comparable performance, despite over three decades of active research on this problem. The identification of image features is also of crucial importance for traditional computational analyses of 3D structure from motion or binocular disparity. These analyses are all based on a fundamental assumption that visual features must projectively correspond to fixed locations on an object's surface. Although this assumption is satisfied for the motions or binocular disparities of textured surfaces, it is often strongly violated for other types of visual features, such as smooth occlusion boundaries, specular highlights or patterns of smooth shading. There is a growing amount of evidence to suggest, however, that the optical deformations of these features do not pose an impediment to perception, but rather provide powerful sources of information for the perceptual analysis of 3D shape. Some example videos that demonstrate the perceptual effects of these deformations are provided in the supplementary materials to this article.

Figure 4. Some important features of local surface structure that can provide perceptually useful information about 3D shape, even within schematic line drawings. (a) and (b) show shaded images of a smoothly curved surface and a plane-faced polyhedron. (c) and (d) show schematic line drawings of these scenes in which the lines denote occlusion contours or edges of high curvature. Several different types of singular points (identified with arrows) are particularly informative for specifying the qualitative 3D structure of an observed scene. A more complete analysis of these different types of image features is described in a classic paper by Malik.
The neural processing of 3D shape
Although most of our current knowledge about the perception of 3D shape has come from computational analyses and psychophysical investigations, there has been a growing effort in recent years to identify the neural mechanisms that are involved in the processing of 3D shape. The first sources of evidence relating to this topic were obtained from lesion studies in monkeys [35,36]. The results revealed that animals with bilateral ablations of the inferior temporal cortex are severely impaired in their ability to discriminate complex 2D patterns or shapes. Animals with lesions in the parietal cortex, by contrast, exhibit normal shape discrimination, but are impaired in their ability to localize objects in space. These findings led to a widely accepted conclusion that the primate visual system contains two functionally distinct visual pathways: a ventral ‘what’ pathway directed towards the temporal lobe that is involved in object recognition, and a dorsal ‘where’ pathway directed towards the parietal lobe that is involved in spatial localization and the visual control of action.
The best available method for studying the neural processing of 3D shape in humans involves functional magnetic resonance imaging (fMRI), which measures local variations in blood flow in different regions of the cortex, thus providing an indirect measure of neural activation. This technique is most often used to compare patterns of brain activation in different experimental conditions. For example, to identify the neural mechanisms involved in the processing of 3D structure from motion, several investigators have compared the activation patterns produced when observers view 3D objects defined by motion relative to those that are produced by moving 2D patterns [37–40]. One limitation of this approach, however, is that it can be difficult to distinguish which specific stimulus attributes are responsible for any observed differences in neural activation. Increased activation in the 3D motion condition could be due to the processing of 3D shape, or it could be due to the processing of 3D motion trajectories. The best way of overcoming this difficulty is to compare the activation patterns for different response tasks applied to identical sets of stimuli [41,42].
Areas involved in the processing of 3D shape would be expected to become more active when making judgments about 3D shape than would otherwise be the case for other possible response tasks, such as judgments of surface texture or motion trajectories. Recent research using both of these procedures for the perception of 3D shape from motion, shading and texture has produced a growing body of evidence that judgments of 3D shape involve both the dorsal and ventral pathways [37–40,43,44], which is somewhat surprising given the functional roles that have traditionally been attributed to these pathways. As would be expected from the results of earlier lesion studies, judgments of 3D shape produce significant activations in ventral regions of the cortex, although it is interesting to note that these do not overlap perfectly with regions involved in the analysis of 2D shape [45,46]. The analysis of 3D shape also occurs at numerous locations within the dorsal pathway, including the medial temporal cortex, and at several sites along the intraparietal sulcus. Some of these findings are also consistent with the results obtained using electrophysiological recordings of single neurons within the dorsal and ventral pathways of monkeys [47–50]. For example, Janssen, Vogels and Orban [51–54] have shown that neurons within the inferior temporal cortex that are selective to 3D surface curvature from binocular disparity are concentrated in a small area of the superior temporal sulcus, but that neurons selective to 2D shape are more broadly distributed.

Box 1. Sources of perceptual constancy

An important topic in the theoretical analysis of visual information is to identify informative properties of optical structure that remain stable over changing viewing directions. For example, researchers have shown that the terminations of edges and occlusion contours in an image have a highly stable topological structure. Although these features can sometimes appear or disappear suddenly, these transitions are highly constrained and only occur in a few possible ways, which have been exhaustively enumerated. Tarr and Kriegman have recently demonstrated that the occurrence of these abrupt events can dramatically improve the ability of observers to detect small changes in object orientation.

Stability over change can also be important for the perceptual representation of 3D shape. Unlike other aspects of local surface structure (e.g. depth or orientation), curvature is an intrinsically defined attribute that does not require an external frame of reference. Thus, because it provides a high degree of viewpoint invariance, a curvature-based representation of 3D shape could be especially useful for achieving perceptual constancy. Several sources of evidence suggest that local maxima or minima of curvature provide important landmarks in the perceptual organization of 3D surface structure. For example, when observers are asked to segment an object's occlusion boundary into perceptually distinct parts, they most often localize the part boundaries at local extrema of negative curvature [26–28]. A similar pattern of results is obtained when observers are asked to place markers along the ridges and valleys of a surface. These judgments remain remarkably stable over changes in surface orientation, and the marked points generally coincide with local maxima of curvature for ridges and local minima of curvature for valleys. There is other anecdotal evidence to suggest, moreover, that the depiction of smooth surfaces in line drawings can be perceptually enhanced by the inclusion of curvature ridge lines (see Figure I).

Figure I. Two methods of pictorial depiction of smoothly curved surfaces. (a) a shaded image of a randomly shaped object; (b) the same object depicted as a line drawing. The lines in the figure denote two different types of surface features: smooth occlusion contours, where the surface orientation at each point is perpendicular to the line of sight, and curvature ridge lines, where the surface curvature perpendicular to the contour is a local maximum or minimum. The configuration of contours provides compelling information about the overall pattern of 3D shape.
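The role of curvature extrema described in Box 1 can be illustrated numerically. The sketch below computes the signed curvature of a closed 2D contour (an illustrative 'peanut' shape, not a stimulus from any cited study) and checks that its waist, where observers typically place part boundaries, is a region of negative curvature:

```python
import numpy as np

# Signed curvature of a closed 2D contour, sampled densely. Part boundaries
# are typically placed at local extrema of negative curvature (concavities).
n = 800
t = np.linspace(0.0, 2.0 * np.pi, n, endpoint=False)
r = 1.0 + 0.4 * np.cos(2.0 * t)      # a 'peanut' shape with two concave waists
x, y = r * np.cos(t), r * np.sin(t)

# Numerical derivatives along the contour parameter, then the standard
# planar curvature formula kappa = (x'y'' - y'x'') / (x'^2 + y'^2)^(3/2).
dt = t[1] - t[0]
dx, dy = np.gradient(x, dt), np.gradient(y, dt)
ddx, ddy = np.gradient(dx, dt), np.gradient(dy, dt)
kappa = (dx * ddy - dy * ddx) / (dx**2 + dy**2) ** 1.5

# The contour bulges outward (convex) near t = pi, and pinches inward
# (concave) at the waist near t = pi/2.
print(kappa[n // 2] > 0, kappa[n // 4] < 0)
```

Because curvature is intrinsic to the contour, the same concavities are found regardless of how the shape is rotated in the image plane, which is the stability property that makes curvature-based landmarks attractive for perceptual constancy.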
It has generally been assumed within the field of neurophysiology that the monkey visual system provides an adequate model of human brain function, but, until quite recently, there has been no way to test the validity of that generalization. That situation has changed, however, owing to recent methodological innovations [55,56], which now make it possible to perform fMRI on alert behaving monkeys using exactly the same experimental protocols as are used with humans. In one of the first studies to exploit this approach, Vanduffel et al.  compared the patterns of activation produced by 2D and 3D motion displays in humans and monkeys. Although the activations of ventral cortex were quite similar in both species, the results obtained in parietal cortex were remarkably different.
Whereas the perception of 3D structure from motion produces numerous activations in humans along the intraparietal sulcus, those activations are completely absent in monkeys. These findings suggest that there may be substantial differences between the human and monkey visual systems in how visual information is analyzed for the determination of 3D shape. Because this is such a new area of research, there is much too little data at present to draw any firm conclusions. However, this is likely to be a particularly active topic of investigation over the next several years (see also Box 2).
Psychophysical investigations have revealed that observers’ judgments about 3D shape are often systematically distorted, but that these distortions are constrained to a limited set of transformations in a manner that is consistent with current computational analyses. These findings suggest that the perceptual representation of 3D shape is likely to be primarily based on qualitative aspects of 3D structure that can be determined reliably from visual information. One possible form of data structure for representing these qualitative properties involves arrangements of salient image features, such as occlusion contours or edges of high curvature, whose topological structures remain relatively stable over viewing directions. Other recent empirical studies have shown that the neural processing of 3D shape is broadly distributed throughout the ventral and dorsal visual pathways, suggesting that processes in both of these pathways are of fundamental importance to human perception and cognition.
The preparation of this manuscript was supported by grants from NIH (R01-EY12432) and NSF (BCS-0079277).
1 Gibson, J.J. (1950) The Perception of the Visual World, Houghton Mifflin
2 Malik, J. and Rosenholtz, R. (1997) Computing local surface orientation and shape from texture for curved surfaces. Int. J. Comput. Vis. 23, 149–168
3 Tse, P.U. (2002) A contour propagation approach to surface filling-in and volume formation. Psychol. Rev. 109, 91–115
4 Belhumeur, P.N. et al. (1999) The bas-relief ambiguity. Int. J. Comput. Vis. 35, 33–44
5 Koenderink, J.J. and van Doorn, A.J. (1991) Affine structure from motion. J. Opt. Soc. Am. A 8, 377–385
6 Ullman, S. (1979) The Interpretation of Visual Motion, MIT Press
7 Longuet-Higgins, H.C. (1981) A computer algorithm for reconstructing a scene from two projections. Nature 293, 133–135
8 Richards, W.A. (1985) Structure from stereo and motion. J. Opt. Soc. Am. A 2, 343–349
Box 2. Questions for future research
† Why are observers insensitive to potential information from motion and/or binocular disparity, by which it would be theoretically possible to perceive quite accurately the metrical relations among distance intervals in different directions? One speculative answer is that these relations might not be important for tasks that are crucial for survival.
† How do observers achieve stable perceptions of 3D structure from visual information that is inherently ambiguous? One possibility is that they exploit regularities of the natural environment to select an interpretation that is statistically most likely, although there is little hard evidence to support that hypothesis.
† How is the phenomenal experience of perceptual constancy achieved, even though observers are unable to make accurate judgments of 3D metric structure? That constancy is possible could indicate that the perceptual representation of 3D shape involves a relatively abstract data structure based on qualitative surface properties that can be reliably determined from visual information.
† How do observers identify different types of image features, such as occlusion contours, cast shadows, specular highlights or variations in surface reflectance? This is one of the oldest problems in computational vision, but researchers have made surprisingly little headway.
† How do observers obtain useful information about 3D shape from optical deformations of image features, such as smooth occlusion contours, diffuse shading, specular highlights or cast shadows, which do not remain projectively attached to fixed locations on an object's surface, as is required by current computational models of 3D structure from motion?
† What are the neural mechanisms by which 3D shapes are perceptually analyzed? Current research has identified several anatomical locations of 3D shape processing, but the precise computational functions performed in those regions have yet to be determined.
9 Todd, J.T. and Norman, J.F. (2003) The visual perception of 3-D shape. Percept. Psychophys. 65, 31–47
10 Hecht, H. et al. (1999) Compression of visual space in natural scenes and in their photographic counterparts. Percept. Psychophys. 61, 1269–1286
11 Koenderink, J.J. et al. (2000) Direct measurement of the curvature of visual space. Perception 29, 69–80
12 Koenderink, J.J. et al. (2002) Pappus in optical space. Percept. Psychophys. 64, 380–391
13 Loomis, J.M. and Philbeck, J.W. (1999) Is the anisotropy of perceived 3-D shape invariant across scale? Percept. Psychophys. 61, 397–402
14 Norman, J.F. et al. (1996) The visual perception of 3D length. J. Exp. Psychol. Hum. Percept. Perform. 22, 173–186
15 Koenderink, J.J. et al. (1996) Surface range and attitude probing in stereoscopically presented dynamic scenes. J. Exp. Psychol. Hum.
Percept. Perform. 22, 869–878
16 Koenderink, J.J. et al. (1996) Pictorial surface attitude and local depth comparisons. Percept. Psychophys. 58, 163–173
17 Koenderink, J.J. et al. (1997) The visual contour in depth. Percept. Psychophys. 59, 828–838
18 Koenderink, J.J. et al. (2001) Ambiguity and the ‘mental eye’ in pictorial relief. Perception 30, 431–448
19 Todd, J.T. et al. (1996) Effects of changing viewing conditions on the perceived structure of smoothly curved surfaces. J. Exp. Psychol. Hum.
Percept. Perform. 22, 695–706
20 Todd, J.T. et al. (2004) Perception of doubly curved surfaces from anisotropic textures. Psychol. Sci. 15, 40–46
21 Cornelis, E.V.K. et al. Mirror reflecting a picture of an object: what happens to the shape percept? Percept. Psychophys. (in press)
22 Todd, J.T. et al. (1997) Effects of texture, illumination and surface reflectance on stereoscopic shape perception. Perception 26, 807–822
23 Gibson, J.J. (1950) The perception of visual surfaces. Am. J. Psychol. 63, 367–384
24 Gibson, J.J. (1979) The Ecological Approach to Visual Perception, Houghton Mifflin
25 Koenderink, J.J. (1984) What does the occluding contour tell us about solid shape? Perception 13, 321–330
26 Hoffman, D.D. and Richards, W.A. (1984) Parts of recognition. Cognition 18, 65–96
27 Siddiqi, K. et al. (1996) Parts of visual form: psychological aspects. Perception 25, 399–424
28 Singh, M. and Hoffman, D.D. (1999) Completing visual contours: the relationship between relatability and minimizing inflections. Percept.
Psychophys. 61, 943–951
29 Biederman, I. (1987) Recognition-by-components: a theory of human image understanding. Psychol. Rev. 94, 115–147
30 Malik, J. (1987) Interpreting line drawings of curved objects. Int.
J. Comput. Vis. 1, 73–103
31 Huffman, D.A. (1977) Realizable configurations of lines in pictures of polyhedra. Machine Intelligence 8, 493–509
32 Mackworth, A.K. (1973) Interpreting pictures of polyhedral scenes.
Artif. Intell. 4, 121–137
33 Hummel, J.E. and Biederman, I. (1992) Dynamic binding in a neural network for shape recognition. Psychol. Rev. 99, 480–517
34 Norman, J.F. et al. Perception of 3D shape from specular highlights and deformations of shading. Psychol. Sci. (in press)
35 Dean, P. (1976) Effects of inferotemporal lesions on the behavior of monkeys. Psychol. Bull. 83, 41–71
36 Ungerleider, L.G. and Mishkin, M. (1982) Two cortical visual systems.
In Analysis of Visual Behavior (Ingle, D.J. et al., eds), pp. 549–586, MIT press
37 Kriegeskorte, N. et al. (2003) Human cortical object recognition from a visual motion flowfield. J. Neurosci. 23, 1451–1463
38 Murray, S.O. et al. (2003) Processing shape, motion and three dimensional shape-from-motion in the human cortex. Cereb. Cortex 13, 508–516
39 Orban, G.A. et al. (1999) Human cortical regions involved in extracting depth from motion. Neuron 24, 929–940
40 Paradis, A.L. et al. (2000) Visual perception of motion and 3-D structure from motion: an fMRI study. Cereb. Cortex 10, 772–783
41 Corbetta, M. et al. (1991) Selective and divided attention during visual discriminations of shape, color, and speed: functional anatomy by positron emission tomography. J. Neurosci. 11, 2383–2402
42 Peuskens, H. et al. Attention to 3D shape, 3D motion and texture in 3D structure from motion displays. J. Cogn. Neurosci. (in press)
43 Shikata, E. et al. (2001) Surface orientation discrimination activates caudal and anterior intraparietal sulcus in humans: an event-related fMRI study. J. Neurophysiol. 85, 1309–1314
44 Taira, M. et al. (2001) Cortical areas related to attention to 3D surface structures based on shading: an fMRI study. Neuroimage 14, 959–966
45 Kourtzi, Z. and Kanwisher, N. (2000) Cortical regions involved in perceiving object shape. J. Neurosci. 20, 3310–3318
46 Kourtzi, Z. and Kanwisher, N. (2001) Representation of perceived object shape by the human lateral occipital complex. Science 293, 1506–1509
47 Taira, M. et al. (2000) Parietal neurons represent surface orientation from the gradient of binocular disparity. J. Neurophysiol. 83, 3140–3146
48 Tsutsui, K. et al. (2001) Integration of perspective and disparity cues in surface-orientation-selective neurons of area CIP. J. Neurophysiol. 86, 2856–2867
49 Tsutsui, K. et al. (2002) Neural correlates for perception of 3D surface orientation from texture gradient. Science 298, 409–412
50 Xiao, D.K. et al. (1997) Selectivity of macaque MT/V5 neurons for surface orientation in depth specified by motion. Eur. J. Neurosci. 9, 956–964
51 Janssen, P. et al. (1999) Macaque inferior temporal neurons are selective for disparity-defined three-dimensional shapes. Proc. Natl. Acad. Sci. U. S. A. 96, 8217–8222
52 Janssen, P. et al. (2000) Three-dimensional shape coding in inferior temporal cortex. Neuron 27, 385–397
53 Janssen, P. et al. (2000) Selectivity for 3D shape that reveals distinct areas within macaque inferior temporal cortex. Science 288, 2054–2056
54 Janssen, P. et al. (2001) Macaque inferior temporal neurons are selective for three-dimensional boundaries and surfaces. J. Neurosci. 21, 9419–9429
55 Orban, G.A. et al. (2003) Similarities and differences in motion processing between the human and macaque brain: evidence from fMRI. Neuropsychologia 41, 1757–1768
56 Vanduffel, W. et al. (2002) Extracting 3D from motion: differences in human and monkey intraparietal cortex. Science 298, 413–415
57 Cipolla, R. and Giblin, P. (1999) Visual Motion of Curves and Surfaces, Cambridge University Press
58 Tarr, M.J. and Kriegman, D.J. (2001) What defines a view? Vision Res. 41, 1981–2004
59 Phillips, F. et al. (2003) Perceptual representation of visible surfaces. Percept. Psychophys. 65, 747–762
Running head: Fictive motion and eye movements
The integration of figurative language and static depictions: An eye movement study of fictive motion
Do we view the world differently if it is described to us in figurative rather than literal terms? An answer to this question would reveal something about both the conceptual representation of figurative language and the scope of top-down influences on scene perception. Previous work has shown that participants will look longer at a path region of a picture when it is described with a type of figurative language called fictive motion (The road goes through the desert) rather than without (The road is in the desert). The current experiment provided evidence that such fictive motion descriptions affect eye movements by evoking mental representations of motion. If participants heard contextual information that would hinder actual motion, it influenced how they viewed a picture when it was described with fictive motion. Inspection times and eye movements scanning along the path increased during fictive motion descriptions when the terrain was first described as difficult (The desert is hilly) as compared to easy (The desert is flat); there were no such effects for descriptions without fictive motion. It is argued that fictive motion evokes a mental simulation of motion that is immediately integrated with visual processing, and hence figurative language can have a distinct effect on perception.
Our comprehension of a picture is more than the sum of its pixels; our comprehension of a sentence is more than the sum of its words. Both words and pictures need interpretation. When spoken words describe what we see in front of us, we must integrate these interpretations on the fly. How do these visual and verbal processes interact? Since Cooper (1974) demonstrated that eye movements are often directed towards objects referred to in speech, research has revealed a close integration of visual and linguistic processing (see Henderson & Ferreira, 2004; Trueswell & Tanenhaus, 2005). For example, visual processes are engaged while processing syntactic structure (Tanenhaus, Spivey-Knowlton, Eberhard, & Sedivy, 1995), differentiating semantic roles (Altmann & Kamide, 1999) and resolving anaphoric reference (Runner, Sussman, & Tanenhaus, 2003), and the degree to which listeners' eye movements are coupled to speakers' reflects levels of comprehension (Richardson & Dale, 2005). Yet studies of verbal and visual integration have focused on literal language. Even though figurative expressions are pervasive in everyday language and exist in all cultures (Gibbs, 1994; Lakoff, 1987), research has not addressed how figurative language affects the process through which we perceive the world. In the current experiment, we investigated how a scene would be perceived when it was described by forms of literal and figurative language that are reported to have equivalent meaning. If the mental representation of a figurative expression is identical to that of a literal expression, then there would be no difference between eye movement patterns. Similarly, if the mental representation of a figurative expression does not interact with visual processes, then there would be no difference between eye movement patterns.
Therefore, any differences that are present in eye movement patterns can tell us about both the distinct mental representations that are evoked by figurative language, and the scope of the integration between visual and verbal processing.
We chose to study a class of figurative spatial descriptions known as fictive motion (FM) sentences. Two examples are shown in (1a) and (1b).

(1a) The road goes through the desert
(1b) The fence follows the coastline

Pervasive in English and many other languages, including Swedish, Finnish, Italian, Chinese, and Japanese, these descriptions are figurative because they contain a motion verb but describe no motion (Huumo, 2005; Matlock, 2004a; Matsumoto, 1996). They highlight the spatial relation between a path or linear entity and a landmark (Talmy, 2000), for instance, the road and the desert in (1a) and the fence and the coastline in (1b). In this way, fictive motion descriptions are equivalent to literal spatial descriptions, or non-fictive motion (non-FM) sentences, such as those in (2a) and (2b).

(2a) The road is in the desert
(2b) The fence is next to the coastline
Experimental evidence supports the idea that simulated motion is evoked by fictive motion sentences such as (1a) and (1b). Matlock, Ramscar, and Boroditsky (2005) showed that thinking about the meaning of fictive motion sentences affected how people conceptualized time spatially. Participants in the study were primed with FM sentences (e.g., The tattoo runs along his spine) or non-FM sentences (e.g., The tattoo is next to his spine) before answering this ambiguous question about time: “Next Wednesday’s meeting has been moved forward two days. What day is the meeting now that it has been re-scheduled?” The expression “moved forward” is ambiguous because both Monday and Friday are possible answers. When primed with fictive motion descriptions, participants in Matlock et al. (2005) were encouraged to take an ego-moving perspective and were more likely to say Friday (versus Monday), but when primed with non-FM descriptions they were split between Monday and Friday.
Similarly, fictive motion direction (either away or toward, as in The road goes all the way to New York or The road comes all the way from New York) affected how participants conceptualized time, namely, more Fridays with going away and more Mondays with coming toward. Together, the results of Matlock et al. (2005) parallel those of other studies on time, space, and motion (Boroditsky, 2000; Boroditsky & Ramscar, 2002; Ramscar, Matlock, & Boroditsky, 2005), suggesting that thinking about motion (fictive or actual) induces an ego-moving perspective when thinking about time.
We have found suggestive evidence that fictive motion descriptions can have an immediate and distinct effect on visual processing. Matlock and Richardson (2004) presented participants with simple drawings of paths such as roads, rivers and pipelines. Participants heard either FM or non-FM descriptions of these paths while their gaze was tracked. The FM descriptions caused participants to spend more time inspecting the region of the path. These gaze differences did not merely result from minor differences in sentence length. Nor did they result from different semantic content, for the FM and non-FM sentences were judged to have similar meanings, to be equally semantically sensible, and to be equally good descriptions of the pictures.
Why might fictive motion descriptions have influenced eye movements in this way? One possibility is that participants simply found the FM descriptions to be more interesting, and so viewers paid more attention to the paths. Another possibility is that comprehending fictive motion descriptions evokes mental representations of motion (Matlock, 2004a, 2004b; Matlock et al., 2005; Talmy, 2000), and that these motion representations result in more visual attention being directed to the path. The first goal of the current experiment was to distinguish between these two possibilities. The second goal was to learn more about the eye movements produced by fictive motion descriptions.
Is it simply that the whole path attracts more visual attention, or do fictive motion descriptions also evoke a pattern of eye movements that is related to motion along a path? We addressed these goals by introducing an additional experimental factor and an additional dependent variable. In Matlock’s (2004b) reading time studies, participants read stories about protagonists travelling through spatial domains (e.g., a valley), followed by target sentences with fictive motion (e.g., The road goes through the valley). In general, participants were quicker to process fictive motion target sentences after reading about terrains that were easy to traverse (e.g., The valley was flat and smooth) versus terrains that were difficult to traverse (e.g., The valley was bumpy and uneven). Critically, there was no difference for comparable literal target sentences without fictive motion (e.g., The road is in the valley). These results suggest that the comprehension of descriptions of fictive motion across a domain is influenced by factors that would affect actual motion across the domain. Following that logic, in the current experiment we presented participants with descriptions of easy and difficult terrains followed by FM or non-FM sentences. If terrain information modulated looking behavior with FM sentences, it would show that it was not merely something generally eye-catching about the combination of non-literal motion verb and path preposition (e.g., runs along, goes through) that influenced the looking times in Matlock and Richardson (2004), but rather the engagement of contextually appropriate simulated motion.
We hypothesized that fictive motion descriptions would activate representations of motion. If so, then perhaps we would see not only longer looking times to the path, but also sequences of eye movements that correspond to motion. Spivey and colleagues found that as participants listened to a narrative and looked at a blank screen (Spivey & Geng, 2001) or closed their eyes (Spivey, Tyler, Richardson, & Young, 2000), they tended to make eye movements that corresponded to spatial content in the stories. For example, more vertical eye movements were made when hearing about someone rappelling down a canyon wall, and more horizontal eye movements were made when hearing about a train pulling out of a station. Eye movements were increased along a specific axis of motion, rather than sequentially in a particular direction. We adapted this idea to our experiment, and counted the number of occasions that participants made path scanning eye movements, in which one region of the path was fixated immediately after any other path region. In addition to looking time differences, we predicted that participants would make more path scanning looks during a fictive motion description when they had previously heard a description of a difficult rather than an easy terrain, but that there would be no such difference for non-fictive motion descriptions.
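The path scanning measure defined above can be sketched in code. This is an illustrative reconstruction, not the authors’ analysis script; the function name and region labels are hypothetical, with None marking a fixation that falls off the path:

```python
def count_path_scans(fixations):
    """Count path scanning events: occasions where one path region is
    fixated immediately after a *different* path region.
    `fixations` is a sequence of path-region labels in temporal order,
    with None marking fixations outside the path."""
    scans = 0
    for prev, cur in zip(fixations, fixations[1:]):
        if prev is not None and cur is not None and prev != cur:
            scans += 1
    return scans

# Transitions 2 -> 3 and 3 -> 4 count as scans; a repeated region (3 -> 3)
# and moves through off-path fixations (None) do not.
print(count_path_scans([2, 3, 3, 4, None, 1, 1]))  # 2
```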
Participants. Sixty-three Stanford University psychology students with normal or corrected vision participated. Data from six participants were discarded because a successful calibration was not achieved. Stimuli. The visual stimuli consisted of 32 pictures of spatial scenes. All of these pictures were matched on luminance, and all were created with a Microsoft drawing program. Of the 32 pictures, 16 were experimental and 16 were fillers. All experimental pictures contained two paths, one represented vertically in the picture plane, and the other horizontally (see Figure 1). These paths were traversable objects, such as roads or trails, or linearly extended objects, such as fences or rows of trees.
The verbal stimuli consisted of 64 sentences recorded in 16 blocks of four sentences. Each block contained two pairs of descriptions. One pair described the vertical path, and the other described the horizontal path. Each pair contained two experimental sentences: a fictive motion (FM) sentence and a comparable non-fictive motion (non-FM) sentence, such as The road runs through the valley and The road is in the valley. The experiment was designed such that each participant would hear one sentence from each of the 16 blocks in addition to 16 sentences for the filler pictures. Norming studies reported in Matlock and Richardson (2004) showed that these FM and non-FM sentences were judged to be equal in semantic content and semantic sensibility, and to be equally good descriptions of the scenes.
We recorded two terrain descriptions to precede each experimental sentence. Each terrain description referred to a region in which movement could be conceptualized as easy or difficult, for example, The valley was flat and smooth (easy) and The valley was full of potholes (difficult). We did a norming study to ensure that all sentences would in fact be equally compatible with the scenes they described. The participants were told to judge how well the sentences went with the scenes in the pictures. Using a scale that ranged from 1 for “not at all” to 7 for “very well”, 10 Stanford undergraduates judged all pairs to be well-matched. The means were FM + slow terrain 5.72, FM + fast terrain 5.62, non-FM + slow terrain 5.74, non-FM + fast terrain 5.73. No combination of terrain description and experimental sentence was any better than another, F(3, 124) = 0.4, p > .1, suggesting that all sentence-picture combinations were plausible pairings. In addition to the primary stimuli, we created filler descriptions for all filler pictures.
An ASL 504 remote eye tracking camera was positioned at the base of a 17” LCD stimulus display that was set to 800×600 resolution. Participants were unrestrained and sat about 30” from the screen. The stimuli were 560 pixels square, which subtended approximately 18° square of visual angle. The camera detected pupil and corneal reflection position from the right eye, and the eye-tracking PC calculated point-of-gaze in terms of coordinates on the stimulus display. This information was passed to a PowerMac G4, which controlled the stimulus presentation and collected gaze duration data. Prior to the experiment proper, participants went through a nine-point calibration routine that took one to three minutes.
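As a quick sanity check on this geometry (assuming square pixels and a 17-inch diagonal, so that 800×600 gives roughly 59 pixels per inch), the visual angle subtended by the 560-pixel stimulus at a 30-inch viewing distance works out to about 18°, matching the value reported above:

```python
import math

def visual_angle_deg(size_px, px_per_inch, distance_in):
    """Visual angle (degrees) subtended by a stimulus of size_px pixels
    viewed from distance_in inches."""
    size_in = size_px / px_per_inch
    return math.degrees(2 * math.atan(size_in / (2 * distance_in)))

# 800x600 on a 17-inch diagonal -> 1000 px diagonal -> ~58.8 px per inch
px_per_inch = math.hypot(800, 600) / 17
angle = visual_angle_deg(560, px_per_inch, 30)
print(round(angle, 1))  # ~18 degrees
```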
After establishing a successful eye track, participants were told: “Look at the pictures and listen to the sentences.” Participants were first presented with 4 practice trials and then a random sequence of 16 filler trials and 16 experimental trials. At the beginning of every trial, they first saw a gray square that was the same size and luminance as the pictures. Next they heard a terrain sentence or a filler sentence. After 500ms, they saw a new picture and after a further 1000ms, they heard an FM sentence, a non-FM sentence, or a filler sentence. The picture remained on screen for a total of 6000ms. The trial ended with a 2000ms inter-stimulus interval.
Eye movements were recorded for the 6000ms that the picture was on the screen. The eye movement data consisted of which regions-of-interest were fixated at 1/30th of a second intervals. The path region-of-interest was a strip 80 pixels wide that extended vertically or horizontally across the image. This path was further divided into seven equally sized, square regions-of-interest.
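A minimal sketch of this parsing step, under the assumption that the 80-pixel path strip is centred in the 560-pixel image and runs horizontally; the constants and function name are illustrative, not taken from the authors’ software:

```python
# Map gaze samples (x, y in stimulus pixels, sampled at 30 Hz) onto the
# seven square regions of an 80-pixel-wide horizontal path strip.
STIM = 560                 # stimulus is 560 px square
REGION_W = STIM / 7        # seven equally sized regions along the path
PATH_TOP = STIM / 2 - 40   # 80-px-wide strip, assumed centred vertically

def region_of(x, y):
    """Return the path region index 0-6 for a gaze sample, or None if
    the sample falls outside the path strip."""
    if not (0 <= x < STIM and PATH_TOP <= y < PATH_TOP + 80):
        return None
    return int(x // REGION_W)

samples = [(30, 280), (110, 275), (400, 300), (400, 100)]
print([region_of(x, y) for x, y in samples])  # [0, 1, 5, None]
```

The resulting sequence of region labels is exactly the input needed to tally looking time per region or count region-to-region transitions.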
Participants’ eye movement data were parsed into two dependent variables: the total looking time in the region of the path, and the frequency of path scanning fixations, in which participants fixated one path region followed immediately by another. Analyses were performed by participants (F1) and items (F2). Though we intended for all paths in the visual images to have symmetrical arrangements, the path in one image was erroneously asymmetric; it contained an anomaly on one end (water coming out of a garden hose). As additional evidence of this image being inappropriate for our purposes, it elicited unusually long looking times to the bottom region of the vertical path, regardless of fictive or terrain condition. For this reason, that item was removed from all analyses. The listeners’ eye movements were influenced by a combination of terrain descriptions and fictive motion language, as shown in Figure 2. As predicted, looking times to the path were affected by an interaction of sentence type and terrain description (F1(1,56) = 11.78, p < .001; F2(1,14) = 15.25, p < .001). Critically, with FM sentences, participants spent more time inspecting paths after difficult terrain descriptions (M = 2014ms) than after easy terrain descriptions (M = 1621ms) (Tukey’s LSD, p < .05), but for non-FM sentences there was no reliable difference (1681ms and 1847ms, respectively). There were no main effects of terrain (F1(1,56) = 2.30; F2(1,14) = 0.10) or sentence type (F1(1,56) = 0.45; F2(1,14) = 1.21) for looking times.
This pattern of results was echoed by analysis of the path scanning data. There was a significant interaction between sentence type and terrain description (F1(1,56) = 6.87, p < .02; F2(1,14) = 4.77, p < .05). Participants made more path scanning fixations after hearing an FM sentence preceded by a difficult (M = 3.6) rather than an easy terrain description (M = 2.8) (Tukey’s LSD, p < .05), but there was no reliable difference for non-FM sentences (2.86 and 3.16, respectively). There were no main effects of terrain (F1(1,56) = 1.57; F2(1,14) = 0.16) or sentence type (F1(1,56) = 1.02; F2(1,14) = 0.98).
Figurative language can have an immediate effect on how we look at the world. Our results suggest that this is because of the distinct spatial representations that figurative descriptions can evoke that their literal counterparts do not. The way participants inspected paths was affected by information about the terrain and the figurative language that described the path. Critically, eye movements were not influenced by descriptions of difficult or easy terrain by themselves. They were influenced only when the terrain descriptions were paired with fictive motion sentences. A plausible explanation for the interaction between fictive motion language and terrain information, we argue, is that comprehending a fictive motion sentence involves a mental representation of motion along a path (Langacker, 1987; Matlock, 2004b; Talmy, 2000), and that the representation incorporates information about terrain. Consequently, difficult terrain would result in slow motion, for example, and the resulting representation is shown by the longer amount of time participants looked at a path and the increased number of fixations scanning along its length. Our interpretation of these results is congruent with perceptual simulation theories (Barsalou, 1999; Glenberg, 1997; Zwaan, 2004), which hold that language comprehension is a process of generating perceptual-motor representations. Comprehension of fictive motion descriptions led to eye movements along the depicted path that mirrored an internal simulation of movement. More generally, simulated motion is known to figure into a broad range of cognitive processes, such as inferring motion from static images (Freyd, 1983; Kourtzi & Kanwisher, 2000), comprehending descriptions of actual motion (Zwaan, Madden, Yaxley, & Aveyard, 2004), and solving everyday physics problems (Schwartz & Black, 1999).
Our fictive motion experiments are an interesting test case for perceptual simulation theories for two reasons. First, previous experiments compared different scenes, such as the nail was hammered into the floor versus into the wall (Stanfield & Zwaan, 2001), or concepts, such as a watermelon versus half a watermelon (Solomon & Barsalou, 2001), and found evidence for differing perceptual-motor activation. In contrast, we are comparing literal and figurative spatial descriptions of the same scene. Though the descriptions are equivalent in objective terms, they have different interactions with perceptual mechanisms. Therefore, we can distinguish between the identical semantic commitments of the sentences and their differing perceptual simulations. Second, previous experiments have been forced to infer the involvement of perceptual-motor representations in language comprehension from reaction time differences in concurrent tasks, such as sensibility judgements, picture matching or visual discriminations (Glenberg & Kaschak, 2002; Richardson, Spivey, McRae, & Barsalou, 2003; Zwaan, Stanfield, & Yaxley, 2002). In contrast to these studies, our eye movement paradigm allows us to directly measure the effect of figurative language on perceptual mechanisms that are unconstrained by any task other than looking and listening. In this experiment all we manipulated was the presence of figurative language, a change that did not alter the literal meaning or truth conditions of the sentence. Nevertheless, this change appeared to alter visual processing. We argue that eye movements were affected because fictive motion language evokes a dynamic mental simulation which interacts with the ways in which the visual system interprets and inspects the world. Our findings, which have consequences for both the linguistic accounts of figurative language and the scope of top-down influences in visual perception, help illuminate the ways in which verbal and visual processes are intertwined.
The authors are indebted to Herbert Clark, Natasha Kirkham, Paul Maglio, Michael Ramscar, Michael Spivey, and our anonymous reviewers for helpful discussions and comments.
Altmann, G. T. M., & Kamide, Y. (1999). Incremental interpretation at verbs: Restricting the domain of subsequent reference. Cognition, 73(3), 247-264.
Barsalou, L. W. (1999). Perceptual symbol systems. Behavioral and Brain Sciences, 22(4), 577-660.
Boroditsky, L. (2000). Metaphoric structuring: understanding time through spatial metaphors. Cognition, 75, 1-28.
Boroditsky, L., & Ramscar, M. (2002). The roles of body and mind in abstract thought. Psychological Science, 13(2), 185-189.
Cooper, R. M. (1974). The control of eye fixation by the meaning of spoken language: A new methodology for the real-time investigation of speech perception, memory, and language processing. Cognitive Psychology, 6(1), 84-107.
Freyd, J. J. (1983). The mental representation of movement when static stimuli are viewed. Perception & Psychophysics, 33(6), 575-581.
Gibbs, R. W., Jr. (1994). The poetics of mind: Figurative thought, language, and understanding. New York, NY: Cambridge University Press.
Glenberg, A. M. (1997). What memory is for. Behavioral and Brain Sciences, 20(1), 1- 55.
Glenberg, A. M., & Kaschak, M. P. (2002). Grounding language in action. Psychonomic Bulletin & Review, 9(3), 558-565.
Henderson, J. M., & Ferreira, F. (Eds.). (2004). The integration of language, vision, and action: Eye movements and the visual world. New York: Psychology Press.
Huumo, T. (2005). How fictive dynamicity motivates aspect marking: The riddle of the Finnish quasi-resultative construction. Cognitive Linguistics, 16, 113-144.
Kourtzi, Z., & Kanwisher, N. (2000). Activation in human MT/MST by static images with implied motion. Journal of Cognitive Neuroscience, 12(1), 48-55.
Lakoff, G. (1987). Women, fire and dangerous things. Chicago, IL: The University of Chicago Press.
Langacker, R. W. (1987). Foundations of cognitive grammar: Vol. 1. Theoretical prerequisites. Stanford, CA: Stanford University Press.
Matlock, T. (2004a). The conceptual motivation of fictive motion. In G. Radden & K. Panther (Eds.), Studies in linguistic motivation [Cognitive Linguistics Research]. New York and Berlin: Mouton de Gruyter.
Matlock, T. (2004b). Fictive motion as cognitive simulation. Memory & Cognition, 32, 1389-1400.
Matlock, T., Ramscar, M., & Boroditsky, L. (2005). The experiential link between spatial and temporal language. Cognitive Science, 29, 655-664.
Matlock, T., & Richardson, D. C. (2004). Do eye movements go with fictive motion? Paper presented at the 26th Annual Conference of the Cognitive Science Society, Chicago.
Matsumoto, Y. (1996). Subjective motion and English and Japanese verbs. Cognitive Linguistics, 7, 183-226.
Ramscar, M., Matlock, T., & Boroditsky, L. (2005). The experiential basis of abstract language comprehension. Manuscript under review.
Richardson, D. C., & Dale, R. (2005). Looking to understand: The coupling between speakers’ and listeners’ eye movements and its relationship to discourse comprehension. Cognitive Science, 29, 1045-1060.
Richardson, D. C., Spivey, M. J., McRae, K., & Barsalou, L. W. (2003). Spatial representations activated during real-time comprehension of verbs. Cognitive Science, 27, 767-780.
Runner, J. T., Sussman, R. S., & Tanenhaus, M. K. (2003). Assignment of reference to reflexives and pronouns in picture noun phrases: evidence from eye movements. Cognition, 89(1), B1-B13.
Schwartz, D., & Black, T. (1999). Inferences through imagined actions: Knowing by simulated doing. Journal of Experimental Psychology: Learning Memory and Cognition, 25(1), 116-136.
Solomon, K. O., & Barsalou, L. W. (2001). Representing properties locally. Cognitive Psychology, 43(2), 129-169.
Spivey, M. J., & Geng, J. J. (2001). Oculomotor mechanisms activated by imagery and memory: Eye movements to absent objects. Psychological Research/Psychologische Forschung, 65(4), 235-241.
Spivey, M. J., Tyler, M., Richardson, D. C., & Young, E. (2000). Eye movements during comprehension of spoken scene descriptions. Paper presented at the 22nd Annual Conference of the Cognitive Science Society, Philadelphia.
Stanfield, R. A., & Zwaan, R. A. (2001). The effect of implied orientation derived from verbal context on picture recognition. Psychological Science, 12, 153-156.
Talmy, L. (2000). Toward a cognitive semantics (Volume 1: Concept structuring systems). Cambridge, MA, US: The MIT Press.
Tanenhaus, M. K., Spivey-Knowlton, M. J., Eberhard, K. M., & Sedivy, J. C. (1995). Integration of visual and linguistic information in spoken language comprehension. Science, 268(5217), 1632-1634.
Trueswell, J. C., & Tanenhaus, M. K. (Eds.). (2005). Approaches to studying world-situated language use: bridging the language-as-product and language-as-action traditions. Cambridge: MIT Press.
Zwaan, R. A. (2004). The immersed experiencer: toward an embodied theory of language comprehension. In B. Ross (Ed.), The Psychology of Learning and Motivation (Vol. 44, pp. 35-62). New York: Academic Press.
Zwaan, R. A., Madden, C. J., Yaxley, R. H., & Aveyard, M. E. (2004). Moving words: Dynamic mental representations in language comprehension. Cognitive Science, 28, 611-619.
Zwaan, R. A., Stanfield, R. A., & Yaxley, R. H. (2002). Do language comprehenders routinely represent the shapes of objects? Psychological Science, 13, 168-171.
Daniel Richardson, Psychology Department, University of California, Santa Cruz
Teenie Matlock, Social and Cognitive Sciences, University of California, Merced
Contact information: Daniel Richardson, Psychology Department, 273 Social Sciences 2, Santa Cruz, CA 95064
Phone: (831) 459-2002