Inderjit
S. Dhillon (Department of Computer Sciences, University
of Texas at Austin) inderjit@cs.utexas.edu
http://www.cs.utexas.edu/users/inderjit/
Information-Theoretic
Clustering, Co-clustering and Matrix Approximations Papers:
jmlrdist.pdf
jmlrdist.ps
kdd_cocluster.pdf
kdd_cocluster.ps
Slides: html
pdf
ps
ppt
Given
a two-dimensional co-occurrence matrix, we consider the problem
of co-clustering or simultaneous clustering of the rows and
columns. We view the (scaled) co-occurrence matrix as an empirical
joint probability distribution of two discrete random variables,
and pose the co-clustering problem as an optimization problem
in information theory --- the optimal co-clustering maximizes
the mutual information between the clustered random variables
subject to constraints on the number of row and column clusters.
It can be shown that this objective is identical to the objective
of finding a "low-parameter" non-negative matrix approximation
that is closest in relative entropy to the given matrix under
cluster constraints. Based on this formulation, we present
a co-clustering algorithm that monotonically increases the
preserved mutual information by intertwining both the row
and column clusterings at all stages. We apply the algorithm
to some problems in text mining: distributional clustering
of words applied to text classification, and simultaneous
clustering of words and documents. Experimental results indicate
that distributional word clustering leads to improved text
classification when the training data is small in size, and
co-clustering can overcome problems due to sparsity.
This is joint work with Yuqiang
Guan, Subramanyam Mallela
and Dharmendra Modha.

Usama
Fayyad, Ph.D. (President and CEO, digiMine, Inc.)
Usama@digiMine.com
The
Business Evolution and Challenges of Data Mining
Data
Mining has received much attention as companies and organizations
started to ask how they can better utilize the huge data stores
they built up over the past few decades. While some interesting
progress has been achieved over the past few years, especially
when it comes to techniques and scalable algorithms, very
few organizations have managed to benefit from the technology.
This paradoxical situation of having too much data yet not
be able to utilize it or mine it arose because of both technical
and business challenges. We will cover these challenges, paint
a picture for where the data problems are, and then introduce
data mining and its role.
Rather
than focusing on the algorithmic aspects of data mining: scaling
to large databases, aspects and challenges for fitting data
mining with database systems, we will cover business challenges
of how to make the technology really work. In this coverage,
we will span the spectrum from technical to business issues.
We shall also cover applications in the business setting to
illustrate the challenges and contributions of data mining
in eBusinesses as an example. We conclude by revisiting the
technical challenges facing the field.
Bio
of Usama Fayyad:
Dr.
Fayyad is co-founder, Chairman and CEO of digiMine, Inc. a
privately held company of over 100 employees focused on data
mining and business intelligence solutions. As CEO, he raised
$45 million in venture capital from top investors and has
grown the company to serve many large Fortune 500 clients.
Prior to digiMine, Dr. Fayyad founded and led Microsoft Research's
Data Mining & Exploration (DMX) Group. At Microsoft he also
led the development of data mining components within Microsoft
products, including SQL Server 2000. From 1989 to 1995, Dr.
Fayyad was at NASA's Jet Propulsion Laboratory (JPL), California
Institute of Technology, where he founded and grew a multi-million
dollar advanced research program to develop data mining systems
for the analysis of large scientific databases. This work
solved some significant scientific problems and earned him
numerous awards including the most distinguished excellence
award from Caltech/JPL and a U.S. Government Medal from NASA.
Dr. Fayyad has published over 100 technical articles and is
co-editor of two books on Data Mining and Knowledge Discovery
in Databases. He is very active in the KDD community and has
served as program co-chair of KDD-94 and KDD-95 (the 1st International
Conference on Knowledge Discovery and Data Mining) and as
general chair of KDD-96 and KDD-99. He is Editor-in-Chief
of the scientific technical journal: Data Mining and Knowledge
Discovery. He is a Director of the ACM Special Interest Group
on knowledge Discovery and Data Mining (SIGKDD) and Editor-in-Chief
of its newsletter: SIGKDD Explorations. He serves on several
Editorial Boards including the Communications of the ACM.
He holds a Ph.D. in Computer Science and Engineering from
The University of Michigan, Ann Arbor (1991) in addition to
M.Sc. in Mathematics (1989), MSE in Computer Engineering (1986),
and two B.S.E.'s in EE and CS (1984).

Sudipto
Guha
(Department of Computer and Information Science, University
of Pennsylvania) sudipto@central.cis.upenn.edu
Approximate
Histogram Construction: Offline and Data Stream Algorithms
Slides: pdf
Obtaining
fast and accurate approximations to data distributions is
a problem of central interest to database management. A variety
of popular database applications including, approximate querying,
similarity searching and data mining in most application domains,
rely on such accurate approximations. Histogram based approximation
is a very popular method in database theory and practice to
succinctly represent a data distribution in a space efficient
manner.
In this talk, we will consider various variants of optimal
histogram construction. We will also investigate histogram
construction for data streams.

Thorsten
Joachims
(Department of Computer Science, Cornell University) tj@cs.cornell.edu
http://www.cs.cornell.edu/People/tj/
Transductive
Learning via Spectral Graph Partitioning Slides:
pdf
Consider
the following type of prediction problem: you have a large
database of object (e.g. text documents) that you would like
to organize according to some classification. You have the
classification for a small subset of objects and you would
like to infer the classes for the remaining objects. A typical
example is relevance feedback in information retrieval: a
user specifies some documents as relevant/non-relevant and
wants to find all other relevant documents in the collection.
This type of machine learning task is called transductive
inference. A key difference to the typical inductive machine
learning problem is that the location of the test points can
be observed by the learning algorithm. I will show that exploiting
cluster structure in the test set can help make more accurate
predictions, in particular for text classification.
However,
designing transductive learning algorithms that simultaneously
minimize training error and maximize a measure of clustering
in the test set is challenging. In this talk, I will present
an approach based on spectral graph partitioning. In this
approach, objects in the database form the nodes of a graph,
while the edges represent dependencies. For such graphs, I
will generalize ratio-cuts to find a clustering of the database
that obeys the known classifications for the training examples.
The relaxation of the resulting optimization problem can be
solved as an eigenvalue problem. Empirical results show improved
prediction accuracy especially for small training sets. Furthermore,
the framework is very flexible, making it possible to implement
co-training as a special case.

Michael
I. Jordan (Department of Electrical Engineering
and Computer Science, Department of Statistics, University
of California, Berkeley) jordan@cs.berkeley.edu
Kernel
Methods, Graphical Models and Convex Optimization Papers:
WaiJor_Semidef03tech.pdf
WaiJor_Semidef03tech.ps
csd-02-1202.pdf
csd-02-1202.ps
csd-02-1206.pdf
csd-02-1206.ps
A pervasive problem in data analysis is that of integrating
data from multiple, heterogeneous sources. Data sources may
have varying formats, semantics and degrees of reliability.
I illustrate this problem in two areas: one involving the
annotation of images, and the other involving the prediction
of the cellular location of proteins. In the case of image
annotation, I describe a methodology for solving data integration
problems that is based on probabilistic graphical models,
a formalism that exploits the conjoined talents of graph theory
and probability theory to build complex models out of simpler
pieces. I show how convex relaxations can be used to solve
the probabilistic inference problem that is at the core of
the graphical model approach. In the case of protein annotation,
I discuss an approach based on "kernel methods," an area of
machine learning that also makes significant use of convex
optimization techniques. I show how multiple kernels can be
combined, yielding a problem that is still within the scope
of convex optimization.
(with David Blei, Nello
Cristianini, Gert Lanckriet,
William Noble and Martin
Wainwright)

Adam
Kalai
(Applied Mathematics, Massachusetts Institute of Technology)
akalai@mit.edu
Algorithms
for Online Optimization Problems
Slides:
pdf
ps
Paper: pdf
ps
In
an online optimization problem, one makes a series of decisions
without knowledge of future data. As an example, a factory
may have a fixed feasible set of production options. Each
month, the factory must decide how many of each type of item
to produce, and only after production do they find out the
profits of each item. A second example would be: each day
you have to drive from home to work, and only afterwards do
you find out the times on each of the roads. The goals are
to maximize total profits, in the first case, and minimize
total time, in the second.
A
standard measure of regret is relative to the best decision
one would have made in hindsight, i.e. compared to the best
single choice to use if you had to use the same choice every
day. For a wide variety of online optimization problems, it
is known that one can guarantee low expected regret relative
to the best decision in hindsight. We demonstrate this and
show that one can also compute these online decisions as efficiently
as one can compute the best decision in hindsight.

Ravi
Kannan (Department of Computer Science, Yale University)
kannan@cs.yale.edu
Sampling on the Fly from Massive Data Slides:
pdf
ps
An
approach to problems where the data is too large to be stored
in Random Access Memory let alone be analyzed in full is Sampling.
The simplest kind of sampling, namely Uniform Sampling can
be implemented in one pass through the data. Recently, algorithms
for solving approximately a host of graph and combinatorial
problems which use a surprisingly small uniformly drawn sample
have been developed.
There
has also been progress on using non-uniform sampling in the
"streaming'' model, which allows just one pass through
the data. Clustering and frequency estimation problems have
been thus tackled.
We
will also consider problems, where, the data consists of a
large matrix which can be stored in external memory and read
a limited number of times. We will present algorithms to find
a succinct synopsis of the matrix, storable in RAM, from which,
later, we may answer Similarity (arising in Information Retrieval)
and other queries fast.

Robert
Krauthgamer
(University of California at Berkeley and International Computer
Institute (ICSI)) robi@cs.berkeley.edu
http://www.cs.berkeley.edu/~robi/
Navigating
Nets: Simple Algorithms for Proximity Search
Nearest-neighbor
search is the following problem: Given a set S of n points
in a metric space X, preprocess S so as to efficiently find
the point in S that is closest to a query point q
X. Recently, there has been a significant interest
in the case of general metrics, because in many practical
applications the underlying metric is not Euclidean (or another
lp-metric).
We
devise for this problem a simple and efficient dynamic data
structure (i.e., supports insertions and deletions to S) that
answers 1+
approximate nearest neighbor queries for any
> 0. Our algorithms are deterministic and provably correct
with no assumptions on the metric. Their (time and space)
complexity depends on the dimensionality of the data set S,
which we measure using a very natural notion borrowed from
functional analysis. For example, if S has constant dimension,
the above operations take logarithmic time (for fixed
> 0). We further show that the dependency of our bounds on
the dimension is near-optimal in a certain oracle model. Finally,
our data structure outperforms the metric skip list of Karger
and Ruhl [STOC, 2002] by being deterministic, dimension oblivious,
and conceptually simpler.
[Joint
work with James R. Lee.]

Lillian
Lee
(Department of Computer Science, Cornell University) llee@cs.cornell.edu
http://www.cs.cornell.edu/home/llee
The
Iterative Residual Rescaling algorithm: An analysis and generalization
of Latent Semantic Indexing Paper:
pdf
ps
Using
vector-based representations of document collections has enabled
the application of powerful dimension-reduction techniques
to information retrieval, document clustering, and other text
analysis tasks. One of the most prominent of these techniques
is Latent Semantic Indexing (LSI). However, despite ample
empirical experience with it, there is still little understanding
of when LSI can -- and, just as importantly, cannot -- be
expected to perform well. This talk consists of two parts.
First, after a self-contained introduction to LSI, we provide
a novel formal analysis of it that links its performance in
a precise way to the uniformity of the underlying topic-document
distribution. Second, we present a new algorithm, Iterative
Residual Rescaling (IRR), that corrects for skewed distributions
by automatically adjusting to non-uniformity in the topic
distribution without knowledge of the underlying topics. A
series of experiments over a variety of evaluation metrics
validates our analysis and demonstrates the effectiveness
of IRR. Joint work with Rie Kubota
Joint
work with Rie Kubota Ando.

Milena
Mihail
(College of Computing, Georgia Institute of Technology) mihail@cc.gatech.edu
http://www.cc.gatech.edu/fac/Milena.Mihail/
Conductance
and Spectra of Power Law and Scale Free Graphs
We
generalize the classical result of expansion of random regular
graphs for graphs whose degree sequences follow heavy tailed
statistics. Such graphs arise in complex networks like the
Internet, the WWW and biological networks. In particular,
we establish conductance Omega(1/logn) in the configurational
random graph model and conductance Omega(1) the evolutionary
model of growth with preferential attachment, under the assumption
that the minimum degree is at least 3 in the former and at
least 2 in the latter. A structural implication of our results
is separation of the 1st and 2nd eigenvalues of the stochastic
normalization of the adjacency matrix of the graph. The algorithmic
implications of our results include low congestion routing
on the Internet and fast crawling on the WWW. We give complementary
results for the spectrum of the non-normalized adjacency matrix
of the graph, with implications in information retrieval.
Joint
work with C. Gkantsidis, C.
Papadimitriou and A. Saberi.

Sridhar
Rajagopalan (IBM Almaden Research Center) sridhar@almaden.ibm.com
Practical
Models for Large Data Set Analysis Slides:
html
pdf
ps
ppt
In
this talk, I will introduce two practical models for large
data set analysis. The first is derived from the streaming
computational model by introducing a sort primitive. The second
is a more formal model, streaming networks. We will discuss
three issues in the new context. (1) Show that the two models,
though apparently different, are essentially the same. (2)
Describe efficient solutions to several combinatorial optimization
problems. (3) Describe our experiences in implementing these
models on modern commodity hardware.
I
will then argue that because the model is "stable" from a
complexity theory point of view, amenable to efficient and
cost effective implementation, and general enough to encompass
a wide range of problems, it can be the basis for a computational
platform for large scale data analysis.

Vijay
V. Vazirani
(College of Computing, Georgia Institute of Technology) vazirani@cc.gatech.edu
http://www.cc.gatech.edu/fac/Vijay.Vazirani/
How
Intractable is the "Invisible Hand'': Polynomial Time
Algorithms for Market Equilibria Papers:
market.pdf
market.ps
EC3.pdf
EC3.ps
Slides: html
pdf
ps
ppt
Although
the study of market equilibria has occupied center stage within
Mathematical Economics for over a century, polynomial time
algorithms for such questions have so far evaded researchers.
It has been customary to relegate such efficiency issues to
the "Invisible Hand'' of the market, a notion propounded
by Adam Smith (Wealth of Nations, 1776). In the context of
Algorithmic Game Theory, a nascent area attempting to address
new issues arising from the Internet, we provide the first
polynomial time algorithm for the linear version of a problem
defined by Irving Fisher in 1891.
Although
the linear case provides an excellent starting point, it is
not general enough to capture some of the more attractive
aspects of pricing mechanisms. To study these, we define a
new model which takes into consideration spending constraints
among buyers. Our algorithms are inspired by the primal-dual
schema from combinatorial optimization.
(First
result joint with Devanur, Papadimitriou, and Saberi, and
available at http://www.cc.gatech.edu/fac/Vijay.Vazirani)

Santosh
Vempala
(Department of Mathematics, Massachusetts Institute of Technology)
vempala@math.mit.edu
On
the Spectral Method for Clustering Mixture Models Slides:
html
Papers: mixtures.pdf
mixtures.ps
specfocs.pdf
specfocs.ps
A
mixture model is a weighted combination of probability distributions.
We consider the problem of classifying a sample according
to the underlying component distributions and focus on a simple
spectral algorithm. The success of the algorithm depends on
a geometric separation condition between the means of the
component distributions. For a mixture of $k$ spherical Gaussians,
the algorithm achieves asymptotically optimal bounds. We will
also discuss its performance for general mixtures.
Joint
work with R. Kannan and G.
Wang.

Xiangrong
Yin
(Department of Statistics, University of Georgia) xryin@stat.uga.edu
http://www.stat.uga.edu/~xryin/
Information Extraction: A Dimension Reduction Technique
Based on Information Theory (poster
session)
Modern graphical tools have enhanced our ability to learn
many things from data directly, but the issue is what to plot
with a high-dimensional data set. Thus dimension reduction
shows its importance. Based on information theory, here we
develop a dimension reduction technique for extracting important
information. Our method in general could be applied in any
area relating to information extraction. It involves density
estimation which can be estimated non-parametrically. And
that is feasible with the help of fast computing techniques.
We provide theoretical justification for connecting with central
subspace. Illustrative examples are presented.