Talk Abstract
Matrix Approximations for Large, Sparse Text Data Using Clustering

Inderjit S. Dhillon
Department of Computer Sciences
UT Austin
inderjit@cs.utexas.edu

Large text collections warrant new, effective and efficient methods that allow easy assimilation, retrieval and navigation. Such methods include automatic clustering of documents into conceptual categories, "intelligent" query retrieval, and classification of documents into pre-defined categories.

Toward this end, I will present a geometric, k-means-like clustering algorithm that, once the text has been converted to a vector-space model, identifies the main latent concepts in the document collection. The algorithm directly yields a partitioning of the documents. It also yields a "concept-revealing" linear subspace of the vector-space model and a new decomposition of the term-document matrix, which we term the "concept decomposition". This linear subspace can be used to improve query retrieval in a manner analogous to Latent Semantic Indexing (LSI). We investigate, both analytically and empirically, the relationship between this new subspace and the invariant subspaces of LSI. An advantage of our method is its computational efficiency, especially compared with the expensive eigenvector computations that LSI requires. I will present experimental results that validate these claims. If time permits, I will also present a scheme to "visualize" proximity relationships among the text documents.
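The abstract does not spell out the algorithm, so the following is only a minimal sketch of one way such a pipeline could look. It assumes that the "geometric k-means-like" step is k-means under cosine similarity on unit-normalized document vectors, and that the matrix decomposition is a least-squares projection of the term-document matrix onto the span of the cluster "concept vectors". The function names, the toy matrix, and all parameters are hypothetical and chosen for illustration.

```python
import numpy as np

def spherical_kmeans(X, k, n_iter=20, seed=0):
    """Cluster unit-norm document columns of X (terms x docs) by cosine similarity.

    Assumption: this stands in for the 'geometric k-means like' algorithm
    mentioned in the abstract; the actual method may differ.
    Returns (labels, C) where C (terms x k) holds unit-norm concept vectors.
    """
    rng = np.random.default_rng(seed)
    n_docs = X.shape[1]
    # Initialize concept vectors from k distinct, randomly chosen documents.
    C = X[:, rng.choice(n_docs, size=k, replace=False)].copy()
    for _ in range(n_iter):
        # Columns are unit norm, so the dot product is the cosine similarity.
        labels = np.argmax(C.T @ X, axis=0)
        for j in range(k):
            members = X[:, labels == j]
            if members.shape[1] == 0:
                continue  # empty cluster: keep the previous concept vector
            s = members.sum(axis=1)
            norm = np.linalg.norm(s)
            if norm > 0:
                C[:, j] = s / norm  # normalized centroid
    labels = np.argmax(C.T @ X, axis=0)
    return labels, C

def concept_decomposition(X, C):
    """Least-squares approximation of X in the span of the concept vectors."""
    Z, *_ = np.linalg.lstsq(C, X, rcond=None)
    return C @ Z

# Hypothetical toy term-document matrix: 4 terms x 6 documents, two topics.
A = np.array([
    [2, 1, 3, 0, 0, 0],
    [1, 2, 1, 0, 0, 0],
    [0, 0, 0, 1, 2, 3],
    [0, 0, 0, 2, 1, 1],
], dtype=float)
X = A / np.linalg.norm(A, axis=0)  # normalize each document to unit length

labels, C = spherical_kmeans(X, k=2)
X_hat = concept_decomposition(X, C)
print("cluster labels:", labels)
print("relative error:", np.linalg.norm(X - X_hat) / np.linalg.norm(X))
```

In this sketch the only linear algebra beyond the clustering loop is a small least-squares solve against a terms-by-k matrix, which is consistent in spirit with the efficiency argument made above, though it says nothing about performance on real collections.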


Back to IMA "HOT TOPICS" Workshop: Text Mining
