March 6 - 10, 2006
It is mathematically impossible to tell whether a surface is white, gray, or
black, by looking at it in isolation, since the luminance is the product of
two unknown variables, illumination and reflectance (albedo). Nonetheless
people can do it pretty well, proving that the human visual system is
smarter than the people who study it. Real surfaces, such as paper, cloth,
or stucco, have visual textures that depend on interreflections and specular
reflections, and some of the resultant image statistics are correlated with
surface properties such as albedo and gloss. By manipulating these
statistics, we can make the surface look lighter or darker (and duller or
shinier) without changing the mean luminance. In a related project, we are
exploring how local statistics can be used to separate shading and albedo
in natural images. Working in the derivative domain (as in Retinex), we
train on images with ground truth "intrinsic images" of shading and albedo,
and learn to estimate the derivatives based on local image patches. We then
do a pseudoinverse to retrieve the images. The results are good: we can
separate an image into its shading and albedo components better than
previous methods, including our own previous methods that relied on
classification rather than estimation.
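The derivative-domain idea can be sketched in one dimension. This is a minimal sketch, not the learned estimator described above: it assumes the classic Retinex heuristic that large log-domain derivatives belong to albedo and small ones to shading, and the function names, threshold, and synthetic signal are all illustrative assumptions. The pseudoinverse step recovers each component from its derivatives, up to an unknown constant.

```python
import numpy as np

def split_log_derivatives(log_signal, threshold):
    """Assign each log-domain derivative to albedo (large) or shading (small)."""
    d = np.diff(log_signal)
    albedo_d = np.where(np.abs(d) > threshold, d, 0.0)
    return albedo_d, d - albedo_d

def integrate_with_pseudoinverse(derivatives):
    """Recover the zero-mean signal whose forward differences match `derivatives`."""
    n = derivatives.size + 1
    forward_diff = np.diff(np.eye(n), axis=0)  # (n-1) x n difference operator
    return np.linalg.pinv(forward_diff) @ derivatives

# Synthetic scanline: luminance = illumination * reflectance (assumed values).
x = np.linspace(0.0, 1.0, 64)
illumination = 0.5 + 0.4 * x                 # smooth shading ramp
reflectance = np.where(x < 0.5, 0.2, 0.8)    # one albedo edge
log_luminance = np.log(illumination * reflectance)

albedo_d, shading_d = split_log_derivatives(log_luminance, threshold=0.1)
log_albedo = integrate_with_pseudoinverse(albedo_d)
log_shading = integrate_with_pseudoinverse(shading_d)
```

Because the product becomes a sum in the log domain, the two recovered components add back to the log luminance minus its mean; the lost constant reflects the ambiguity stated at the start of the abstract.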
The images generated by varying the underlying articulation parameters
of an object (pose, attitude, light source position, and so on) can be
viewed as points on a low-dimensional "image appearance manifold"
(IAM) in a high-dimensional ambient space. In this talk, we will
expand on the observation that typical IAMs are not differentiable, in
particular if the images contain sharp edges. However, all is not
lost, since IAMs have an intrinsic multiscale geometric structure. In
fact, each IAM has a family of approximate tangent spaces, each one
good at a certain resolution. In the first part of the talk, we will
focus on the particular inverse problem of estimating, from a given
image on or near an IAM, the underlying parameters that produced it.
Putting the multiscale structural aspect to work, we develop a new
algorithm for high-accuracy parameter estimation based on a
coarse-to-fine Newton iteration through the family of approximate
tangent spaces. This algorithm is reminiscent of recently proposed
algorithms for multiscale image registration and super-resolution. In
the second part of the talk, we will explore IAMs in the context of
"Compressive Imaging" (CI), where we attempt to recover an image from
a small number of (potentially random) projections. To date, CI has
focused on sparsity-based image models; we will discuss how IAM models
could offer better performance for geometry-rich images.
This is joint work with Michael Wakin, Hyeokho Choi, and David Donoho.
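A toy version of the coarse-to-fine Newton iteration can be written for the simplest IAM, a translating step edge. This is a minimal sketch under stated assumptions, not the algorithm from the talk: the edge model, the Gaussian blur standing in for "resolution", the scale schedule, and all function names are illustrative. At each scale the blurred manifold is differentiable, so a Gauss-Newton step along its tangent is well defined.

```python
import math
import numpy as np

_erf = np.vectorize(math.erf)

def blurred_edge(x, theta, scale):
    """Step edge at position theta, viewed after Gaussian blur of width `scale`."""
    return 0.5 * (1.0 + _erf((x - theta) / (scale * math.sqrt(2.0))))

def tangent(x, theta, scale):
    """Derivative of the blurred edge with respect to theta (tangent vector)."""
    u = (x - theta) / scale
    return -np.exp(-0.5 * u ** 2) / (scale * math.sqrt(2.0 * math.pi))

def estimate_shift(x, observe, theta, scales, steps_per_scale):
    """Gauss-Newton updates on theta, run through a sequence of blur scales."""
    for scale in scales:
        target = observe(scale)
        for _ in range(steps_per_scale):
            t = tangent(x, theta, scale)
            residual = target - blurred_edge(x, theta, scale)
            theta = theta + (t @ residual) / (t @ t)
    return theta

x = np.linspace(0.0, 1.0, 512)
theta_true, theta_start = 0.6, 0.3
# Toy shortcut: the blurred observation is generated in closed form
# rather than by filtering pixel data.
observe = lambda scale: blurred_edge(x, theta_true, scale)

coarse_to_fine = estimate_shift(x, observe, theta_start, [0.1, 0.03, 0.01], 4)
fine_only = estimate_shift(x, observe, theta_start, [0.01], 12)
```

With the same total of twelve updates, the coarse-to-fine schedule reaches the true parameter, while the fine-scale-only run is still far from it, because the fine-scale tangent carries information only close to the edge.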
I will present a simple first order model of how variation in
illumination affects the output of Filter Response Filters (FRF).
FRF are of interest because:
(a) they are commonly used as texture features in automated texture
classification systems, and
(b) they are typically proposed as the "back pocket model" of the first
stage of the human visual system.
I'll show how naïve classifiers built using these simple features can
fail, and how the model can be used to produce a classifier that is
robust to illumination variation.
What this will show is that single still images are often not
sufficient for the purposes of surface classification - either for human
or automated systems.
I'll conclude by describing some of our recent research that is
investigating our perceptions of surface texture.
Joint work with James Elder.
The statistics of natural scenes useful for contour grouping are examined from an information theoretic point of view. We focus
on two particularly important grouping cues: proximity and good continuation. Advances on previous studies of contour grouping
statistics include: 1. measurements based upon more accurately localized edges, 2. an analysis of grouping statistics as a function
of arc-length separation along the contour, 3. a comparison of competing methods for efficient representation of good continuation
cues, 4. a comparison of contour statistics for natural and human-made scenes. Our results reveal proximity to follow a power law
model, and parallelism and co-circularity to form an intuitive and efficient coding scheme for the angular relationships between
edges.
The perceived shapes of objects in images result from a collection of
visual clues. These clues follow from the interplay of geometric features
such as perceived boundaries, edges and corners, delineating curves on
object surfaces, and features resulting from illumination such as shadow
curves and specularity. Furthermore, a viewer gains such information not
just from static images but also from perceived changes resulting from
change in viewing direction.
In this talk, we explain how it is possible to determine a catalog of
possible local models for the generic interplay between geometric features
and shadow curves. This catalog can be expanded to include the expected
changes in such models under movement in viewing direction.
Such a catalog is constructed through the use of singularity theory,
which is a mathematical theory that allows construction of such
classifications based on stability and possible perturbations. We explain
the general features of the classification and indicate how it is obtained.
This is the result of joint work carried out with Peter Giblin and
Gareth Haslinger.
Compressive Sensing is an emerging field based on the revelation that a small group of non-adaptive linear projections of a
compressible signal contains enough information for reconstruction and processing. We propose algorithms and hardware to support a
new theory of Compressive Imaging. Our approach is
based on a new digital image/video camera that directly acquires random projections of the light field without first collecting the
pixels/voxels. Our camera architecture employs a digital micromirror array to perform optical calculations of linear projections of
an image onto pseudorandom binary patterns. Its hallmarks include the ability to obtain an image/video snapshot with a single
detection element while measuring the image/video fewer times than the number of pixels/voxels; this can significantly reduce the
computation required for image/video acquisition and encoding. Since our system relies on a single photon detector, it can also be
adapted to image at wavelengths that are currently impossible with conventional CCD and CMOS imagers. We are currently testing a
prototype design for the camera and present experimental results.
This is joint work with Michael Wakin, Jason Laska, Dror Baron, Shriram Sarvotham, Dharmpal Takhar, Kevin Kelly and Richard
Baraniuk.
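The measurement model can be illustrated numerically. This is a minimal sketch under stated assumptions, not the camera's actual reconstruction pipeline: the toy image is sparse in the pixel basis, the mirror patterns are 0/1 pseudorandom, one extra all-mirrors-on reading is assumed to be available for centering, and orthogonal matching pursuit is swapped in as a simple recovery method. All names and sizes are illustrative.

```python
import numpy as np

def omp(A, y, sparsity):
    """Greedy sparse recovery: pick the best-matching column, refit, repeat."""
    support, residual = [], y.copy()
    for _ in range(sparsity):
        support.append(int(np.argmax(np.abs(A.T @ residual))))
        coef, *_ = np.linalg.lstsq(A[:, support], y, rcond=None)
        residual = y - A[:, support] @ coef
    x_hat = np.zeros(A.shape[1])
    x_hat[support] = coef
    return x_hat

rng = np.random.default_rng(0)
n_pixels, n_measurements, sparsity = 256, 128, 3

# Toy scene: a few bright pixels on a dark background (assumed sparse).
image = np.zeros(n_pixels)
bright = rng.choice(n_pixels, size=sparsity, replace=False)
image[bright] = rng.uniform(1.0, 2.0, sparsity)

# Each row is one micromirror pattern; the single detector sums the
# light from the mirrors that are switched on.
patterns = rng.integers(0, 2, size=(n_measurements, n_pixels)).astype(float)
detector = patterns @ image
total_light = image.sum()  # assumed extra reading with all mirrors on

# Center the 0/1 patterns to +/-0.5 so that columns are nearly orthogonal.
A = patterns - 0.5
y = detector - 0.5 * total_light
recovered = omp(A, y, sparsity)
```

Here the scene is measured 128 times for 256 pixels; natural images would need a sparsifying transform (for example wavelets) in place of the pixel basis assumed in this toy.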
The important role of contours in visual perception has been recognized
for many years (e.g., Wertheimer 1923/1938). While early Gestalt
insights derive from observation of highly idealized images, decades of
computer vision research have demonstrated the computational complexity
of inferring and exploiting contours in natural images. Physiological
data, while generating some intriguing clues, are often too local
(single unit recording) or too global (imaging) to provide the data
needed to constrain existing models or inspire new ones.
In this talk I will discuss recent work that attempts to bring together
psychophysical, computational and physiological approaches to
understanding contour processing in natural images. A unifying
foundation for this effort is a continuing project to measure and model
the statistics of natural image contours. These ecological results lead
to new computer vision algorithms for natural contour grouping,
normative models for contour processing that may be evaluated
psychophysically, and to new models for neural selectivity to natural
image contours that may be tested against physiological data.
Camera shake during exposure leads to objectionable image blur and
ruins many photographs. Conventional blind deconvolution methods
typically assume frequency domain constraints on images, or overly
simplified parametric forms for the motion path during camera shake.
Real camera motions can follow convoluted paths, and a spatial domain
prior can better maintain visually salient image characteristics. We
introduce a multi-scale method to remove the effects of camera shake
from seriously blurred images, by estimating the most probable blur
and original image using a variational approximation to the posterior
probability, and assuming a heavy-tailed distribution for bandpassed
image statistics. Our method assumes a uniform camera blur over the
image, negligible in-plane camera rotation, and no blur caused by
moving objects in the scene. The algorithm operator specifies an
image region without saturation effects within which to estimate the
blur kernel. I'll discuss issues in this blind deconvolution problem,
and show results for a variety of digital photographs.
Invitation to submit examples: I invite audience members to submit
examples of motion-blurred photographs to me a few days ahead of time.
I'll show the images you submit, and the result of our algorithm
applied to them. If you have a favorite blind deconvolution or
restoration algorithm, please apply it to your image and send it and
I'll show that, too.
Joint work with: Rob Fergus, Barun Singh, both from MIT CSAIL, and
Aaron Hertzmann and Sam Roweis, both from the University of Toronto.
Joint work with Haleh Hagh-Shenas.
An ongoing goal of research in multivariate visualization is to
determine how to most effectively use visual features, such as color and
texture, to efficiently and accurately convey information about multiple
scalar-valued data distributions defined over a common domain. While
there currently exists an extensive knowledge base in issues related to
color perception and the effective use of color for uni-variate and
bi-variate data visualization, research into the effective use of
texture for data visualization is considerably less mature. In this
poster we present the findings from three pilot experiments with natural
texture images intended to provide insight into issues in texture
perception that have the potential to inform our efforts to more
effectively harness the full potential of texture as a visual variable
capable of simultaneously conveying information about multiple data
distributions.
The traditional model of primary visual cortex (V1) is in terms of a
retinotopically organized set of spatio-temporal filters. This
model has been extraordinarily fruitful, providing explanations of a
considerable body of psychophysical and neurophysiological results.
It has also produced compelling linkages between natural image
statistics, efficient coding theory, and neural responses.
However, there is increasing evidence that V1 is doing a whole lot more. We
can get insight into early cortical processing by studying not only
the relationship between image input and neural activity, but also
between human visual percepts and early cortical activity. Natural
percepts (in the sense of tapping into natural modes of processing)
are as important as understanding natural images when trying to
find out what primary visual cortex is doing. I will describe several
results from functional magnetic resonance imaging (fMRI) studies which
show that human V1 blood oxygenation level dependent (BOLD)
response to patterns perceived as well-organized is less than to
patterns perceived as less organized, V1 response to natural
image contrast is correlated with perceived contrast, and apparent size
modulates the spatial extent of V1 activity.
Concepts for describing curve points in a continuous space have long been
well known in mathematics. We apply those concepts to discrete space with
the aim of analysing curve-like structures in digital images. For the
characterization of 3D skeletons we distinguish between different types of
voxels. We discuss approaches to define those elements of skeletons and their
properties. We use the distribution and complexity of junctions to extract
features for 3D medical images.
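The abstract does not say which voxel types are distinguished. As a minimal sketch, assuming a common convention (not necessarily the authors'), skeleton voxels can be labelled by how many of their 26 neighbors also belong to the skeleton: one neighbor marks an end voxel, two a regular curve voxel, three or more a junction candidate. The function name and label codes are illustrative.

```python
import itertools
import numpy as np

def classify_skeleton_voxels(volume):
    """Label skeleton voxels: 1 = end, 2 = regular, 3 = junction candidate."""
    vol = np.asarray(volume, dtype=bool)
    padded = np.pad(vol, 1).astype(int)  # zero border so edges need no special case
    nz, ny, nx = vol.shape
    counts = np.zeros(vol.shape, dtype=int)
    for dz, dy, dx in itertools.product((-1, 0, 1), repeat=3):
        if (dz, dy, dx) == (0, 0, 0):
            continue
        counts += padded[1 + dz:1 + dz + nz, 1 + dy:1 + dy + ny, 1 + dx:1 + dx + nx]
    labels = np.zeros(vol.shape, dtype=int)
    labels[vol & (counts == 1)] = 1
    labels[vol & (counts == 2)] = 2
    labels[vol & (counts >= 3)] = 3
    return labels

# Example: a plus-shaped skeleton lying in the middle slice of a small volume.
vol = np.zeros((3, 7, 7), dtype=bool)
vol[1, 3, 1:6] = True
vol[1, 1:6, 3] = True
labels = classify_skeleton_voxels(vol)
```

Under 26-connectivity the voxels directly adjacent to a branch point also see three or more neighbors, so junction candidates usually need to be merged into a single junction before counting them as features.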
Joint work with Karsten Scheibe, DLR, Berlin-Adlershof.
The talk gives a general overview of new architectures of panoramic
cameras (as designed and produced at DLR, the German Air and Space
Institute at Berlin), their use for stereo imaging based on studies at
CITR, and the combination of those high-resolution images (about 350
Megapixel each) with range data generated by a laser range finder.
Results are illustrated for different objects such as the castle
"Neuschwanstein" in Bavaria/Germany.
The "Shading Cue" is conventionally framed in the context of perfectly smooth surfaces. "Shading" has ancient roots in the visual
arts, and became canonized in the late 20th century as the "Shape From Shading (SFS) Problem". I reconsider the problem as
conventionally posed, presenting a novel analysis of its "observational basis". When rough surfaces are considered the image
structure is augmented (from mere contrast gradient in the former case) with the image illuminance flow structure revealed by
texture. The direction and two differential invariants of this flow can be estimated robustly via the structure tensor. This
changes the nature of the "shading cue" qualitatively. Shading alone does not specify surface curvature orthogonal to the
illumination direction, a lack of data that has to be made up for by the surface integrability conditions. Hence conventional SFS
algorithms are based on partial differential equations with global boundary conditions. Allowing illuminance flow as an additional
observable alleviates this problem, and purely local, algebraic approaches to SFS become feasible. Algorithms can be shown to exist
that derive surface curvature from shading and flow observations through a linear operator applied to the observables, the operator
being a function of surface attitude and beam direction. Such an approach neatly reveals the remaining group of ambiguity
transformations in an intuitive way. I propose novel ways to deal with the intrinsic ambiguities of photomorphometrics. Instead of
attempting to find the full class of equivalent solutions I look for specific solutions given certain a priori guesses. Such
methods are much more similar to likely mechanisms of human psychogenesis, in particular visual perception, than the conventional
"Marrian" approach. I present methods that boil down to linear, local computation, and are thus very robust and possibly
implementable in neural wetware.
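The structure tensor estimate of a dominant texture direction can be sketched as follows. This is a minimal sketch under stated assumptions: it averages the tensor over the whole image rather than a local window, it returns only the orientation and a coherence measure (not the two differential invariants mentioned above), and the function name and test grating are illustrative.

```python
import numpy as np

def structure_tensor_orientation(image):
    """Dominant gradient orientation (radians, modulo pi) and its coherence."""
    gy, gx = np.gradient(np.asarray(image, dtype=float))  # axis 0 is y, axis 1 is x
    jxx, jxy, jyy = (gx * gx).mean(), (gx * gy).mean(), (gy * gy).mean()
    angle = 0.5 * np.arctan2(2.0 * jxy, jxx - jyy)
    # (lambda1 - lambda2) / (lambda1 + lambda2): 1 for one direction, 0 for isotropy
    coherence = np.hypot(jxx - jyy, 2.0 * jxy) / (jxx + jyy)
    return angle, coherence

# Synthetic texture whose intensity varies along a known direction.
yy, xx = np.mgrid[0:128, 0:128]
true_angle = 0.6
texture = np.sin(0.3 * (xx * np.cos(true_angle) + yy * np.sin(true_angle)))
angle, coherence = structure_tensor_orientation(texture)
```

The estimate involves only local products of derivatives and an average, which is the kind of linear, local computation the abstract argues for.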
Jan J. Koenderink and Pietro Perona Friday Short presentations
Collaborators:
1) John B. Garnett (UCLA, jbg@math.ucla.edu)
2) Peter W. Jones (Yale, jones@math.yale.edu)
3) Luminita A. Vese (UCLA, lvese@math.ucla.edu)
Natural images have many different scales of oscillations.
Texture can be seen as oscillations at smaller scales. Here, we present
variational image decomposition models, which decompose different scales
of an image f into the sum of two scales u+v. Here u is a piecewise smooth
image at a larger scale and v is texture or oscillations at finer scale.
We use different spaces of functions or of generalized functions to model
different scales in images. The use of generalized functions is motivated
by work of Y. Meyer, D. Mumford and B. Gidas. For the piecewise smooth
component u, we use the space BV (Bounded Variation), and the generalized
homogeneous Besov and Sobolev spaces (B(s,p,q), W(s,p) with s
Visual grouping and figure-ground discrimination were first studied by
the Gestalt school of visual perception nearly a century ago. By the use
of cleverly constructed examples, they were able to demonstrate the role
of factors such as proximity, similarity, curvilinear continuity and
common fate in visual grouping and factors such as convexity, size, and
symmetry in figure-ground discrimination. However, this left open (at
least) three major problems: (1) there wasn't a precise operationalization
of these factors for general images, (2) the interaction of these cues
was ill understood, and (3) there was no justification for why these
factors might be helpful to an observer interacting with the visual
world.
Over the last few years, we have been pursuing these problems in the
following paradigm: (1) We start with a set of natural images and use
human observers to mark the perceptual groups and assign figure-ground
labels to the various boundary contours. (2) We construct computational
models of various grouping and figure-ground factors. (3) We calibrate
and optimally combine the grouping and figure-ground factors by using the
principle that vision evolved to be adaptive to the statistics of objects
in the natural world.
In my talk I will report on two recent results in this paradigm. One is
on understanding the power of the figure-ground cues, specifically size,
lower-region and convexity. We compared the predictions of such a model
with psychophysics and found a pleasing agreement. The second is an
attempt at a unified probabilistic framework for mid-level vision using
conditional random fields defined on constrained Delaunay triangulations
of image edges.
This talk draws on joint work with Charless Fowlkes, David Martin and
Xiaofeng Ren; various papers can be found on the web site
http://www.eecs.berkeley.edu/Research/Projects/CS/vision/grouping
Joint work with Katja Doerschner, Huseyin Boyaci.
Researchers studying surface color perception have typically used
stimuli that consist of a small number of matte patches (real or
simulated) embedded in a plane perpendicular to the line of sight
(a Mondrian, Land & McCann, 1971). Reliable estimation of surface
properties analogous to color is a difficult if not impossible
computational problem in such limited scenes (Maloney, 1999). In more
realistic, three-dimensional scenes the problem is not intractable, in
part because considerable information about the spatial and spectral
distribution of the illumination is usually available. We describe a
series of experiments that explore (1) how the human visual system
discounts the spatial and spectral distribution of the illumination
(SSDI) in judging matte surface color and (2) what cues the visual
system uses in estimating the SSDI of a scene. We find that the
human visual system uses information from cast shadows and specular
reflections in estimating the SSDI and, when more than one cue type is
present, combines these cues effectively. The SSDI can be very complex
in scenes with many different light sources. We examine (3) the limits
of human visual representation of the SSDI, reporting an experiment
intended to test these limits. Our results indicate that the human
visual representation of the SSDI in a scene is well matched to the
task of matte surface color perception.
Land, E. H. & McCann, J. J. (1971), Lightness and retinex theory.
Journal of the Optical Society of America, 61, 1-11.
Maloney, L. T. (1999), Physics-based approaches to modeling surface
color perception. In Gegenfurtner, K. R., & Sharpe, L. T. [Eds],
Color Vision: From Genes to Perception. Cambridge, UK:
Cambridge University Press, pp. 387-422.
How many categories can you recognize? Currently the best
estimate is due to Irv Biederman: 3000 entry-level categories and
perhaps 3*10^{4} categories overall. This estimate was obtained
indirectly, by counting words in a dictionary. I will present a method
to obtain a direct estimate. Alongside the estimate one gets
frequencies of objects and categories for free. I will discuss the
implications for visual recognition and other visual problems.
The Time Dependent Ginzburg-Landau equations describe the evolution
of the order parameter and vector magnetic potential in a superconductor,
giving the density of electrons in the superconducting phase.
There are connections between TDGL and the variational/PDE
equations of image processing. We present an overview of numerical
modeling of a superconductor via TDGL with a focus on how
the mesh for the spatial discretization can have a profound effect
on simulation results. An inadequate mesh resolution will often give
rise to spurious solutions which seem physically correct, but are false.
The effects of different physical parameters and boundary conditions in 2D
and 3D are also presented.
Joint work with G. Brown (UofM and HP Labs) and G. Seroussi (MSRI).
A framework for studying texture in general, and for texture
mixing in particular, is presented in this work. The work follows
concepts from universal type classes and universal simulation.
Based on the well-known Lempel and Ziv (LZ) universal compression
scheme, the universal type class of a one dimensional sequence is
defined as the set of possible sequences of the same length which
produce the same dictionary (or parsing tree) with the classical
LZ incremental parsing algorithm. Universal simulation is realized
by sampling uniformly from the universal type class, which can be
efficiently implemented. Starting with a source texture image, we
use universal simulation to synthesize new textures that have,
asymptotically, the same statistics of any order as the source
texture, yet have as much uncertainty as possible, in the sense
that they are sampled from the broadest pool of possible sequences
that comply with the statistical constraint. When considering two
or more textures, a parsing tree is constructed for each one, and
samples from the trees are randomly interleaved according to
pre-defined proportions, thus obtaining a mixed texture. As with
single texture synthesis, the k-th order statistics of this
mixture, for any k, asymptotically approach the weighted mixture
of the k-th order statistics of each individual texture used in
the mixing. We present the underlying principles of universal
types, universal simulation, and their extensions and application
to mixing two or more textures with pre-defined proportions.
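The incremental parsing that defines the universal type class can be sketched for short strings. This is a minimal sketch, assuming the class is identified by the set of parsed phrases (the dictionary) and that an incomplete final phrase is ignored; uniform sampling from the class, which the work relies on, is not implemented here. Function names are illustrative.

```python
def lz78_parse(sequence):
    """LZ incremental parsing: each phrase is the shortest prefix not seen before."""
    dictionary, phrases, current = set(), [], ""
    for symbol in sequence:
        current += symbol
        if current not in dictionary:
            dictionary.add(current)
            phrases.append(current)
            current = ""
    return phrases, current  # `current` is the possibly incomplete tail

def same_universal_type(a, b):
    """True if two equal-length sequences produce the same LZ dictionary."""
    phrases_a, _ = lz78_parse(a)
    phrases_b, _ = lz78_parse(b)
    return len(a) == len(b) and set(phrases_a) == set(phrases_b)

phrases, tail = lz78_parse("abaabbab")
```

For example, "abaabbab" and "baabbbaa" parse into the same five phrases in a different order, so they fall in the same class, while "aaaaaaaa" does not.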
Joint work with Tian-Tsong Ng, Shih-Fu Chang,
Jessie Hsu, and Lexing Xie (Columbia University).
The increasing photorealism of computer graphics has made
computer graphics a convincing form of image forgery. Therefore,
classifying photographic images and photorealistic computer
graphics has become an important problem for image
forgery detection. We propose a new geometry based
image model, motivated by the physical image generation
process, to tackle the above-mentioned problem. The
proposed model reveals certain physical differences between
the two image categories, such as the gamma correction
in photographic images and the sharp structures in computer
graphics. For the problem of image forgery detection,
we propose two levels of image authenticity definition, i.e.,
imaging-process authenticity and scene authenticity, and analyze
our technique against these definitions. Such a definition
is important for making the concept of image authenticity
computable. Apart from offering physical insights, our technique
with a classification accuracy of 83.5% outperforms
those in the prior work, i.e., wavelet features at 80.3% and
cartoon features at 71.0%. We also consider a recapturing
attack scenario and propose a counter-attack measure.
The information contained in an image ("What does the image represent?")
also has a geometric interpretation ("Where does the image reside in the
ambient signal space?"). It is often enlightening to consider this
geometry in order to better understand the processes governing the
specification, discrimination, or understanding of an image. We discuss
manifold-based models for image processing imposed, for example, by the
geometric regularity of objects in images. We present an application in
image compression, where we see sharper images coded at lower bitrates
thanks to an atomic dictionary designed to capture the low-dimensional
geometry. We also discuss applications in computer vision, where we face
a surprising barrier -- the image manifolds arising in many interesting
situations are in fact nondifferentiable. Although this appears to
complicate the process of parameter estimation, we identify a multiscale
tangent structure to these manifolds that permits a coarse-to-fine
Newton method. Finally, we discuss applications in the emerging field of
Compressed Sensing, where in certain cases a manifold model can supplant
sparsity as the key for image recovery from incomplete information.
This is joint work with Justin Romberg, David Donoho, Hyeokho Choi, and
Richard Baraniuk.
Super-resolution seeks to produce a high-resolution image from a set of
low-resolution, possibly noisy, images such as in a video sequence. We
present a method for combining data from multiple images using the Total
Variation (TV) and Mumford-Shah functionals. We discuss the problem of
sub-pixel image registration and its effect on the final result.
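To make the Total Variation term concrete, here is a minimal sketch of a discrete TV functional. It assumes the anisotropic form (sum of absolute horizontal and vertical differences), which may differ from the form used in the method above; the function name and test images are illustrative.

```python
import numpy as np

def total_variation(image):
    """Anisotropic discrete TV: sum of absolute differences between neighbors."""
    image = np.asarray(image, dtype=float)
    vertical = np.abs(np.diff(image, axis=0)).sum()
    horizontal = np.abs(np.diff(image, axis=1)).sum()
    return vertical + horizontal

# A clean vertical edge of height 1 across 8 rows, and a noisy copy of it.
clean = np.zeros((8, 8))
clean[:, 4:] = 1.0
rng = np.random.default_rng(0)
noisy = clean + 0.1 * rng.standard_normal(clean.shape)

tv_clean = total_variation(clean)
tv_noisy = total_variation(noisy)
```

The sharp edge costs only its length times its height, while noise raises the value everywhere, which is why a TV penalty can suppress noise from the low-resolution frames without blurring edges.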