May 6-9, 2003


2002-2003 Program: Optimization

Inderjit S. Dhillon (Department of Computer Sciences, University of Texas at Austin)  inderjit@cs.utexas.edu  http://www.cs.utexas.edu/users/inderjit/

Information-Theoretic Clustering, Co-clustering and Matrix Approximations    Papers:  jmlrdist.pdf    jmlrdist.ps     kdd_cocluster.pdf    kdd_cocluster.ps
Slides:   html    pdf    ps    ppt

Given a two-dimensional co-occurrence matrix, we consider the problem of co-clustering or simultaneous clustering of the rows and columns. We view the (scaled) co-occurrence matrix as an empirical joint probability distribution of two discrete random variables, and pose the co-clustering problem as an optimization problem in information theory --- the optimal co-clustering maximizes the mutual information between the clustered random variables subject to constraints on the number of row and column clusters. It can be shown that this objective is identical to the objective of finding a "low-parameter" non-negative matrix approximation that is closest in relative entropy to the given matrix under cluster constraints. Based on this formulation, we present a co-clustering algorithm that monotonically increases the preserved mutual information by intertwining both the row and column clusterings at all stages. We apply the algorithm to some problems in text mining: distributional clustering of words applied to text classification, and simultaneous clustering of words and documents. Experimental results indicate that distributional word clustering leads to improved text classification when the training data is small in size, and co-clustering can overcome problems due to sparsity.

This is joint work with Yuqiang Guan, Subramanyam Mallela and Dharmendra Modha.
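
For readers who want the objective in symbols, the following is a brief sketch of the formulation described above (our notation): writing X and Y for the row and column variables and X-hat, Y-hat for their clusterings, the co-clustering objective and its matrix-approximation view are

    \max_{\hat X, \hat Y} I(\hat X; \hat Y)
    \quad\Longleftrightarrow\quad
    \min_{\hat X, \hat Y} D\bigl(p(X,Y) \,\|\, q(X,Y)\bigr),
    \qquad
    q(x,y) = p(\hat x, \hat y)\, p(x \mid \hat x)\, p(y \mid \hat y),

where D(. || .) denotes relative entropy (KL divergence) and q is the "low-parameter" approximation determined by the cluster statistics. The exact statement and the monotone algorithm are in the papers linked above.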

Usama Fayyad, Ph.D. (President and CEO, digiMine, Inc.)  Usama@digiMine.com

The Business Evolution and Challenges of Data Mining

Data Mining has received much attention as companies and organizations have started to ask how they can better utilize the huge data stores they have built up over the past few decades. While some interesting progress has been achieved over the past few years, especially in techniques and scalable algorithms, very few organizations have managed to benefit from the technology. This paradoxical situation of having too much data yet not being able to utilize or mine it arose because of both technical and business challenges. We will cover these challenges, paint a picture of where the data problems are, and then introduce data mining and its role.

Rather than focusing on the algorithmic aspects of data mining, such as scaling to large databases and the challenges of integrating data mining with database systems, we will cover the business challenges of making the technology really work. In this coverage, we will span the spectrum from technical to business issues. We shall also cover applications in the business setting, using eBusiness as an example, to illustrate the challenges and contributions of data mining. We conclude by revisiting the technical challenges facing the field.

Bio of Usama Fayyad:
Dr. Fayyad is co-founder, Chairman and CEO of digiMine, Inc., a privately held company of over 100 employees focused on data mining and business intelligence solutions. As CEO, he raised $45 million in venture capital from top investors and has grown the company to serve many large Fortune 500 clients. Prior to digiMine, Dr. Fayyad founded and led Microsoft Research's Data Mining & Exploration (DMX) Group. At Microsoft he also led the development of data mining components within Microsoft products, including SQL Server 2000. From 1989 to 1995, Dr. Fayyad was at NASA's Jet Propulsion Laboratory (JPL), California Institute of Technology, where he founded and grew a multi-million dollar advanced research program to develop data mining systems for the analysis of large scientific databases. This work solved some significant scientific problems and earned him numerous awards, including the most distinguished excellence award from Caltech/JPL and a U.S. Government Medal from NASA. Dr. Fayyad has published over 100 technical articles and is co-editor of two books on Data Mining and Knowledge Discovery in Databases. He is very active in the KDD community and has served as program co-chair of KDD-94 and KDD-95 (the 1st International Conference on Knowledge Discovery and Data Mining) and as general chair of KDD-96 and KDD-99. He is Editor-in-Chief of the scientific journal Data Mining and Knowledge Discovery. He is a Director of the ACM Special Interest Group on Knowledge Discovery and Data Mining (SIGKDD) and Editor-in-Chief of its newsletter, SIGKDD Explorations. He serves on several editorial boards, including that of the Communications of the ACM. He holds a Ph.D. in Computer Science and Engineering from The University of Michigan, Ann Arbor (1991), in addition to an M.Sc. in Mathematics (1989), an M.S.E. in Computer Engineering (1986), and two B.S.E.'s in EE and CS (1984).

Sudipto Guha (Department of Computer and Information Science, University of Pennsylvania)  sudipto@central.cis.upenn.edu

Approximate Histogram Construction: Offline and Data Stream Algorithms    Slides:   pdf

Obtaining fast and accurate approximations to data distributions is a problem of central interest in database management. A variety of popular database applications, including approximate querying, similarity searching, and data mining, rely on such accurate approximations in most application domains. Histogram-based approximation is a very popular method in database theory and practice for representing a data distribution succinctly in a space-efficient manner.

In this talk, we will consider several variants of optimal histogram construction. We will also investigate histogram construction over data streams.
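
As background for the offline setting, one classical variant of optimal histogram construction (the V-optimal histogram, which minimizes total squared error subject to a fixed number of buckets) can be solved exactly by dynamic programming. The sketch below is a generic illustration of that variant, not necessarily one of the constructions discussed in the talk.

    def v_optimal_histogram(x, num_buckets):
        """Minimum total squared error of a histogram over x with the given
        number of buckets, via the classical O(n^2 * B) dynamic program.
        (Illustrative sketch of the V-optimal variant only.)"""
        n = len(x)
        prefix = [0.0] * (n + 1)      # prefix sums of values
        prefix_sq = [0.0] * (n + 1)   # prefix sums of squared values
        for i, v in enumerate(x):
            prefix[i + 1] = prefix[i] + v
            prefix_sq[i + 1] = prefix_sq[i] + v * v

        def sse(lo, hi):
            # squared error of approximating x[lo:hi] by its mean
            s = prefix[hi] - prefix[lo]
            s2 = prefix_sq[hi] - prefix_sq[lo]
            return s2 - s * s / (hi - lo)

        INF = float("inf")
        # opt[k][i] = best error covering the first i values with k buckets
        opt = [[INF] * (n + 1) for _ in range(num_buckets + 1)]
        opt[0][0] = 0.0
        for k in range(1, num_buckets + 1):
            for i in range(1, n + 1):
                for j in range(k - 1, i):   # last bucket spans x[j:i]
                    cand = opt[k - 1][j] + sse(j, i)
                    if cand < opt[k][i]:
                        opt[k][i] = cand
        return opt[num_buckets][n]

Approximate and data-stream constructions aim to avoid this quadratic cost while provably staying close to the optimal error.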

Thorsten Joachims (Department of Computer Science, Cornell University)  tj@cs.cornell.edu  http://www.cs.cornell.edu/People/tj/

Transductive Learning via Spectral Graph Partitioning    Slides:   pdf

Consider the following type of prediction problem: you have a large database of objects (e.g., text documents) that you would like to organize according to some classification. You have the classification for a small subset of the objects and you would like to infer the classes of the remaining objects. A typical example is relevance feedback in information retrieval: a user marks some documents as relevant or non-relevant and wants to find all other relevant documents in the collection. This type of machine learning task is called transductive inference. A key difference from the typical inductive machine learning problem is that the location of the test points can be observed by the learning algorithm. I will show that exploiting cluster structure in the test set can help make more accurate predictions, in particular for text classification.

However, designing transductive learning algorithms that simultaneously minimize training error and maximize a measure of clustering in the test set is challenging. In this talk, I will present an approach based on spectral graph partitioning. In this approach, objects in the database form the nodes of a graph, while the edges represent dependencies. For such graphs, I will generalize ratio-cuts to find a clustering of the database that obeys the known classifications for the training examples. The relaxation of the resulting optimization problem can be solved as an eigenvalue problem. Empirical results show improved prediction accuracy especially for small training sets. Furthermore, the framework is very flexible, making it possible to implement co-training as a special case.
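
To make the spectral ingredient concrete, here is a minimal sketch of the unconstrained ratio-cut relaxation; the transductive method described above additionally constrains the partition to respect the labeled training examples (see the slides for the actual formulation).

    import numpy as np

    def ratio_cut_partition(W):
        """Unconstrained ratio-cut relaxation (sketch only): take the
        eigenvector of the graph Laplacian with the second-smallest
        eigenvalue (the Fiedler vector) and split the nodes by its sign."""
        D = np.diag(W.sum(axis=1))
        L = D - W                          # unnormalized graph Laplacian
        _, eigvecs = np.linalg.eigh(L)     # eigenvalues in ascending order
        fiedler = eigvecs[:, 1]            # skip the constant eigenvector
        return np.where(fiedler >= 0, 1, -1)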

Michael I. Jordan (Department of Electrical Engineering and Computer Science, Department of Statistics, University of California, Berkeley)  jordan@cs.berkeley.edu

Kernel Methods, Graphical Models and Convex Optimization    Papers:    WaiJor_Semidef03tech.pdf    WaiJor_Semidef03tech.ps    csd-02-1202.pdf    csd-02-1202.ps    csd-02-1206.pdf    csd-02-1206.ps

A pervasive problem in data analysis is that of integrating data from multiple, heterogeneous sources. Data sources may have varying formats, semantics and degrees of reliability. I illustrate this problem in two areas: one involving the annotation of images, and the other involving the prediction of the cellular location of proteins. In the case of image annotation, I describe a methodology for solving data integration problems that is based on probabilistic graphical models, a formalism that exploits the conjoined talents of graph theory and probability theory to build complex models out of simpler pieces. I show how convex relaxations can be used to solve the probabilistic inference problem that is at the core of the graphical model approach. In the case of protein annotation, I discuss an approach based on "kernel methods," an area of machine learning that also makes significant use of convex optimization techniques. I show how multiple kernels can be combined, yielding a problem that is still within the scope of convex optimization.

(with David Blei, Nello Cristianini, Gert Lanckriet, William Noble and Martin Wainwright)
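
As a rough sketch of the kernel-combination idea (our notation; the precise optimization problems are in the linked reports): given base kernel matrices K_1, ..., K_m, one searches over nonnegative combinations

    K = \sum_{i=1}^{m} \mu_i K_i, \qquad \mu_i \ge 0,

which are automatically positive semidefinite, and optimizes the weights mu_i jointly with the classifier; under suitable normalization (e.g., a trace constraint on K) this can be posed as a convex, semidefinite-programming-style problem.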

Adam Kalai (Applied Mathematics, Massachusetts Institute of Technology)  akalai@mit.edu

Algorithms for Online Optimization Problems
   Slides:   pdf    ps    Paper:   pdf    ps

In an online optimization problem, one makes a series of decisions without knowledge of future data. As an example, a factory may have a fixed feasible set of production options. Each month, the factory must decide how many of each type of item to produce, and only after production does it find out the profit on each item. A second example: each day you have to drive from home to work, and only afterwards do you find out the travel times on each of the roads. The goals are to maximize total profit in the first case and to minimize total time in the second.

A standard measure of regret is performance relative to the best decision one could have made in hindsight, i.e., compared to the best single choice if you had been forced to use the same choice every day. For a wide variety of online optimization problems, it is known that one can guarantee low expected regret relative to the best decision in hindsight. We demonstrate this and show that one can also compute these online decisions as efficiently as one can compute the best decision in hindsight.
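
A minimal sketch of one way to achieve this, in the "follow the perturbed leader" spirit: each round, play whatever decision would have been best on the past data after adding a random perturbation, so the only computation needed is the offline best-in-hindsight problem. The decision set, payoff function, and perturbation scale below are illustrative; the actual regret guarantee depends on choosing the perturbation as in the paper.

    import random

    def follow_the_perturbed_leader(decisions, rounds, payoff, eps=1.0):
        """Each round, play the decision with the best perturbed past payoff.
        decisions    : finite list of feasible decisions (illustrative)
        payoff(d, t) : payoff of decision d in round t, revealed after playing
        eps          : controls the perturbation scale (illustrative choice)
        """
        cumulative = {d: 0.0 for d in decisions}
        total = 0.0
        for t in range(rounds):
            noise = {d: random.uniform(0.0, 1.0 / eps) for d in decisions}
            choice = max(decisions, key=lambda d: cumulative[d] + noise[d])
            total += payoff(choice, t)          # payoff revealed only now
            for d in decisions:                 # update hindsight totals
                cumulative[d] += payoff(d, t)
        return total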

Ravi Kannan (Department of Computer Science, Yale University)  kannan@cs.yale.edu

Sampling on the Fly from Massive Data    Slides:   pdf   ps

Sampling is an approach to problems where the data is too large to be stored in random access memory, let alone analyzed in full. The simplest kind of sampling, namely uniform sampling, can be implemented in one pass through the data. Recently, algorithms have been developed that approximately solve a host of graph and combinatorial problems using a surprisingly small, uniformly drawn sample.

There has also been progress on using non-uniform sampling in the "streaming" model, which allows just one pass through the data. Clustering and frequency-estimation problems have been tackled in this way.

We will also consider problems where the data consists of a large matrix that can be stored in external memory and read a limited number of times. We will present algorithms that compute a succinct synopsis of the matrix, storable in RAM, from which similarity queries (arising in information retrieval) and other queries can later be answered quickly.
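
One concrete instance of such a synopsis, sketched here using the standard "length-squared" (norm-squared) sampling idea rather than the exact algorithms of the talk: keep a few rows of the matrix, sampled with probability proportional to their squared norm and rescaled, so that the small sample approximates A^T A in expectation.

    import numpy as np

    def length_squared_row_sample(A, s, seed=0):
        """Illustrative sketch: return an s x n matrix S whose rows are
        rescaled rows of A, sampled with probability proportional to squared
        row norm, so that E[S.T @ S] = A.T @ A."""
        rng = np.random.default_rng(seed)
        norms_sq = (A * A).sum(axis=1)
        probs = norms_sq / norms_sq.sum()
        idx = rng.choice(A.shape[0], size=s, replace=True, p=probs)
        scale = 1.0 / np.sqrt(s * probs[idx])      # unbiased rescaling
        return A[idx] * scale[:, None]             # small synopsis, fits in RAM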

Robert Krauthgamer (University of California at Berkeley and International Computer Institute (ICSI))  robi@cs.berkeley.edu  http://www.cs.berkeley.edu/~robi/

Navigating Nets: Simple Algorithms for Proximity Search

Nearest-neighbor search is the following problem: given a set S of n points in a metric space X, preprocess S so as to efficiently find the point in S that is closest to a query point q ∈ X. Recently, there has been significant interest in the case of general metrics, because in many practical applications the underlying metric is not Euclidean (or another l_p metric).

We devise for this problem a simple and efficient dynamic data structure (i.e., it supports insertions into and deletions from S) that answers (1+epsilon)-approximate nearest neighbor queries for any epsilon > 0. Our algorithms are deterministic and provably correct with no assumptions on the metric. Their (time and space) complexity depends on the dimensionality of the data set S, which we measure using a very natural notion borrowed from functional analysis. For example, if S has constant dimension, the above operations take logarithmic time (for fixed epsilon > 0). We further show that the dependency of our bounds on the dimension is near-optimal in a certain oracle model. Finally, our data structure outperforms the metric skip list of Karger and Ruhl [STOC, 2002] by being deterministic, dimension-oblivious, and conceptually simpler.

[Joint work with James R. Lee.]
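
For reference, the query being answered can be stated as follows, together with the dimension notion involved (we are assuming the functional-analytic notion alluded to above is the doubling dimension):

    \text{a } (1+\epsilon)\text{-approximate nearest neighbor query returns } p \in S
    \text{ with } d(q, p) \le (1+\epsilon) \min_{s \in S} d(q, s);

    \dim(S) = \log_2 \lambda, \text{ where } \lambda \text{ is the least integer such that}
    \text{every ball in } S \text{ can be covered by } \lambda \text{ balls of half the radius.}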

Lillian Lee (Department of Computer Science, Cornell University)  llee@cs.cornell.edu   http://www.cs.cornell.edu/home/llee

The Iterative Residual Rescaling algorithm: An analysis and generalization of Latent Semantic Indexing    Paper:   pdf    ps

Using vector-based representations of document collections has enabled the application of powerful dimension-reduction techniques to information retrieval, document clustering, and other text analysis tasks. One of the most prominent of these techniques is Latent Semantic Indexing (LSI). However, despite ample empirical experience with it, there is still little understanding of when LSI can -- and, just as importantly, cannot -- be expected to perform well. This talk consists of two parts. First, after a self-contained introduction to LSI, we provide a novel formal analysis of it that links its performance in a precise way to the uniformity of the underlying topic-document distribution. Second, we present a new algorithm, Iterative Residual Rescaling (IRR), that corrects for skewed distributions by automatically adjusting to non-uniformity in the topic distribution without knowledge of the underlying topics. A series of experiments over a variety of evaluation metrics validates our analysis and demonstrates the effectiveness of IRR.

Joint work with Rie Kubota Ando.
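
For orientation, LSI itself is just a rank-k truncated SVD of the term-document matrix; a minimal sketch is below. IRR modifies this pipeline, roughly by extracting directions one at a time while rescaling residual document vectors, as described in the paper.

    import numpy as np

    def lsi_document_vectors(term_doc, k):
        """Represent each document (column of the term-document matrix) by its
        coordinates in the rank-k subspace of a truncated SVD. (Sketch of
        plain LSI, not of IRR.)"""
        U, s, Vt = np.linalg.svd(term_doc, full_matrices=False)
        return (s[:k, None] * Vt[:k]).T     # one k-dimensional row per document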

Milena Mihail (College of Computing, Georgia Institute of Technology)  mihail@cc.gatech.edu  http://www.cc.gatech.edu/fac/Milena.Mihail/

Conductance and Spectra of Power Law and Scale Free Graphs

We generalize the classical result on the expansion of random regular graphs to graphs whose degree sequences follow heavy-tailed statistics. Such graphs arise in complex networks like the Internet, the WWW and biological networks. In particular, we establish conductance Omega(1/log n) in the configurational random graph model and conductance Omega(1) in the evolutionary model of growth with preferential attachment, under the assumption that the minimum degree is at least 3 in the former and at least 2 in the latter. A structural implication of our results is the separation of the first and second eigenvalues of the stochastic normalization of the adjacency matrix of the graph. The algorithmic implications of our results include low-congestion routing on the Internet and fast crawling on the WWW. We give complementary results for the spectrum of the non-normalized adjacency matrix of the graph, with implications in information retrieval.

Joint work with C. Gkantsidis, C. Papadimitriou and A. Saberi.
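
For reference, the conductance bounds above use the standard notion (stated here in our notation):

    \Phi(G) = \min_{S \subseteq V,\; \operatorname{vol}(S) \le \operatorname{vol}(V)/2}
    \frac{|E(S, \bar S)|}{\operatorname{vol}(S)},
    \qquad \operatorname{vol}(S) = \sum_{v \in S} \deg(v),

and Cheeger-type inequalities translate a lower bound on \Phi(G) into the stated gap between the first and second eigenvalues of the stochastic normalization of the adjacency matrix.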

Sridhar Rajagopalan (IBM Almaden Research Center)  sridhar@almaden.ibm.com

Practical Models for Large Data Set Analysis    Slides:   html    pdf    ps    ppt

In this talk, I will introduce two practical models for large data set analysis. The first is derived from the streaming computational model by introducing a sort primitive. The second is a more formal model, streaming networks. We will discuss three issues in this new context: (1) we show that the two models, though apparently different, are essentially the same; (2) we describe efficient solutions to several combinatorial optimization problems; and (3) we describe our experiences implementing these models on modern commodity hardware.

I will then argue that because the model is "stable" from a complexity-theoretic point of view, amenable to efficient and cost-effective implementation, and general enough to encompass a wide range of problems, it can be the basis for a computational platform for large-scale data analysis.
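
As a toy illustration of the streaming-plus-sort style of computation (not one of the problems from the talk): to aggregate over a huge edge list, one sort pass groups records by key, and a single streaming pass then performs the aggregation.

    import itertools

    def out_degrees(edge_stream):
        """Out-degree of every source node in a large edge list, using one
        sort pass (sorted() stands in for an external-memory sort primitive)
        followed by one sequential streaming pass."""
        by_source = sorted(edge_stream)                     # sort primitive
        for src, group in itertools.groupby(by_source, key=lambda e: e[0]):
            yield src, sum(1 for _ in group)                # streaming pass

    # e.g., dict(out_degrees([("a", "b"), ("a", "c"), ("b", "c")]))
    #       == {"a": 2, "b": 1}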

Vijay V. Vazirani (College of Computing, Georgia Institute of Technology)  vazirani@cc.gatech.edu  http://www.cc.gatech.edu/fac/Vijay.Vazirani/

How Intractable is the "Invisible Hand": Polynomial Time Algorithms for Market Equilibria    Papers:   market.pdf    market.ps    EC3.pdf    EC3.ps    Slides:   html    pdf    ps    ppt

Although the study of market equilibria has occupied center stage within Mathematical Economics for over a century, polynomial time algorithms for such questions have so far evaded researchers. It has been customary to relegate such efficiency issues to the "Invisible Hand" of the market, a notion propounded by Adam Smith (Wealth of Nations, 1776). In the context of Algorithmic Game Theory, a nascent area attempting to address new issues arising from the Internet, we provide the first polynomial time algorithm for the linear version of a problem defined by Irving Fisher in 1891.

Although the linear case provides an excellent starting point, it is not general enough to capture some of the more attractive aspects of pricing mechanisms. To study these, we define a new model which takes into consideration spending constraints among buyers. Our algorithms are inspired by the primal-dual schema from combinatorial optimization.

(First result joint with Devanur, Papadimitriou, and Saberi, and available at http://www.cc.gatech.edu/fac/Vijay.Vazirani)
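
For context (standard background, not the algorithm of the talk): in the linear Fisher model, buyer i has budget m_i and utility u_{ij} per unit of good j, and equilibrium allocations are captured by the Eisenberg-Gale convex program

    \max \; \sum_i m_i \log u_i
    \quad \text{s.t.} \quad
    u_i = \sum_j u_{ij} x_{ij} \;\; \forall i, \qquad
    \sum_i x_{ij} \le 1 \;\; \forall j, \qquad
    x_{ij} \ge 0;

the contribution described above is a combinatorial, primal-dual polynomial time algorithm for computing such equilibria, rather than an appeal to general convex programming.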

Santosh Vempala (Department of Mathematics, Massachusetts Institute of Technology)  vempala@math.mit.edu

On the Spectral Method for Clustering Mixture Models    Slides:   html    Papers:   mixtures.pdf    mixtures.ps    specfocs.pdf    specfocs.ps

A mixture model is a weighted combination of probability distributions. We consider the problem of classifying a sample according to the underlying component distributions and focus on a simple spectral algorithm. The success of the algorithm depends on a geometric separation condition between the means of the component distributions. For a mixture of k spherical Gaussians, the algorithm achieves asymptotically optimal bounds. We will also discuss its performance for general mixtures.

Joint work with R. Kannan and G. Wang.
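
A minimal sketch of the spectral step follows; the separation conditions and the clustering performed inside the projected space are as in the papers linked above.

    import numpy as np

    def spectral_projection(samples, k):
        """Project the sample matrix (one sample per row) onto the span of its
        top-k right singular vectors; under sufficient separation of the
        component means, clustering in this k-dimensional space succeeds.
        (Illustrative sketch of the projection step only.)"""
        _, _, Vt = np.linalg.svd(samples, full_matrices=False)
        return samples @ Vt[:k].T           # n x k projected samples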

Xiangrong Yin (Department of Statistics, University of Georgia)  xryin@stat.uga.edu  http://www.stat.uga.edu/~xryin/

Information Extraction: A Dimension Reduction Technique Based on Information Theory (poster session)

Modern graphical tools have enhanced our ability to learn many things directly from data, but with a high-dimensional data set the issue is what to plot; this is what makes dimension reduction important. Based on information theory, we develop a dimension reduction technique for extracting important information. Our method could in principle be applied in any area involving information extraction. It relies on density estimation, which can be carried out non-parametrically and is feasible with the help of fast computing techniques. We provide theoretical justification by connecting the method with the central subspace. Illustrative examples are presented.
