Abstracts and Talk Materials
User-Centered Modeling
May 07-11, 2012

Sinan Aral - New York University

Identifying (Causal) Influence in Social (Media) Networks
May 11, 2012 9:00 am - 10:00 am

Keywords of the presentation: Peer Effects, Causal Inference, Identification, Econometrics, Experimental Methods

Many of us are interested in whether "networks matter." Whether in the spread of disease, the diffusion of information, the propagation of social contagions, the effectiveness of viral marketing, or the magnitude of peer effects in a variety of settings, a key problem is understanding whether and when the statistical relationships we observe can be interpreted causally. Sinan will review what we know and where research might go with respect to identifying causal peer influence in social networks and the importance of causal inference for policy. He will provide three examples from large scale observational and experimental studies in online social media networks.

Amir Assadi - University of Wisconsin, Madison

Poster - BIGDATA in Plant Biology, Agriculture and Ecology: NLP From Biomolecular Networks to Sustainability Science
May 08, 2012 3:30 pm - 4:30 pm

The literature in biology is vast and rich with valuable empirical, heuristic and theoretical information. Systematic organization and knowledge mining of research articles play a major role in helping with large-scale project formulation, making groundbreaking discoveries and forming novel hypotheses.

The following is a preliminary report on the development of “Big Data” analysis tools tailored for systems biology, such as dynamic models of -omic networks, systems-level perturbation of pathways, etc. For example, we demonstrate the utility of Natural Language Processing (NLP) in the discovery of hidden and implicit correlations among pairs of genes in massive gene expression data for diurnal and circadian rhythms in wild type Arabidopsis thaliana (provided by the Chory Lab). A theory of Collective Intelligence in Plant Biology (PhytoCognito) is under development that provides the synthesis of cognitive, computational and informatics approaches that will sustain collaborative, user-centered efforts that aim at continuing the heritage of the past into the major successes of the future and scientific breakthroughs. One of the most important goals of genomic research is to extract functional information from gene expression time series data. Thanks to DNA microarray development, mRNA sampling of thousands of genes is now possible using a single chip. This technology has made it possible to measure gene expression of the whole genome over and over again to explore the model response of an organism to a change in condition, e.g., application of some drug or other treatment.

Tanya Berger-Wolf - University of Illinois, Chicago

Computational Behavioral Ecology
May 07, 2012 4:30 pm - 5:30 pm

Keywords of the presentation: Computational biology, social network analysis, image processing, machine learning

Computation has fundamentally changed the way we study nature. Recent advances in data collection technology, such as GPS and other mobile sensors, high definition cameras, satellite images, and genotyping, are giving biologists access to data about the natural world which are orders of magnitude richer than any previously collected. Such data offer the promise of answering some of the big questions about why animals do what they do, among other things.

Unfortunately, in this domain, our ability to analyze data lags substantially behind our ability to collect it. In this talk I will show how computational approaches can be part of every stage of the scientific process of understanding animal sociality, from data collection (identifying individual zebras from photographs by stripes) to hypothesis formulation (by designing a novel computational framework for analysis of dynamic social networks).


Francesco Bonchi - Yahoo! Research

Influence propagation in social networks: a data mining perspective
May 10, 2012 10:30 am - 11:30 am

The study of the spread of influence through a social network has a long history in the social sciences. The first studies focused on the adoption of medical and agricultural innovations. Later, marketing researchers investigated the "word-of-mouth" diffusion process as an important mechanism by which information can reach large populations, possibly influencing public opinion and driving new product market share and brand awareness. Recently, thanks to the success of online social networks and microblogging platforms such as Facebook and Twitter, the phenomenon of influence exerted by users of an online social network on other users, and how it propagates in the network, has attracted the interest of computer scientists and IT specialists.

One of the key problems in this area is the identification of influential users, by targeting whom certain desirable outcomes can be achieved. Here, targeting could mean giving free (or price-discounted) samples of a product, and the desired outcome may be to get as many customers to buy the product as possible.

In this talk we take a data mining perspective and discuss what can be learned from available traces of past propagations, and how. While doing this we provide a brief survey of recent progress in this area and discuss the open problems.

Ramon Caceres - AT&T Laboratories - Research

Human Mobility Characterization from Cellular Network Data
May 07, 2012 1:30 pm - 2:30 pm

Keywords of the presentation: Human mobility patterns, Call Detail Records

An improved understanding of human mobility patterns can help answer key questions in fields as varied as mobile computing, urban planning, ecology, and epidemiology. Cellular telephone networks can shed light on human movements cheaply, frequently, and on a large scale. We have developed techniques for analyzing anonymous cellphone locations to explore how large populations move in metropolitan areas such as Los Angeles and New York. Our results include measures of how far people travel each day, estimates of carbon footprints due to home-to-work commutes, density maps of the residential areas that contribute workers to a city, and relative volumes of traffic on commuting routes. We have validated our approach through comparisons against ground truth from volunteers and against independent sources such as the US Census Bureau. Throughout our work, we have taken measures to preserve individual privacy. This talk presents an overview of our methodologies and findings.

Sou-Cheng (Terrya) Choi - University of Chicago

Poster - Computing Google's PageRank by Sparse Approximation
May 08, 2012 3:30 pm - 4:30 pm

Google's PageRank eigenvector is sparse in the sense that most elements are extremely small. Basis Pursuit De-Noising is a reasonable tool for finding the tiny proportion of significant nonzeros.

This is joint work with Michael Saunders.
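
As a rough illustration of the sparsity described above, the following minimal sketch computes PageRank by plain power iteration on a toy graph. It is not Basis Pursuit De-Noising, the method the poster proposes; the toy graph, the damping factor of 0.85, and the function name `pagerank` are assumptions made for illustration only.

```python
def pagerank(links, damping=0.85, iters=100):
    """Power iteration on a dict {node: [outgoing neighbours]}."""
    nodes = sorted(links)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iters):
        # Every node starts with the uniform "teleport" share.
        new = {v: (1.0 - damping) / n for v in nodes}
        for v in nodes:
            out = links[v]
            if out:
                # Split this node's damped rank equally among its out-links.
                share = damping * rank[v] / len(out)
                for w in out:
                    new[w] += share
            else:
                # Dangling node: spread its damped rank uniformly.
                for w in nodes:
                    new[w] += damping * rank[v] / n
        rank = new
    return rank


# Hypothetical toy graph: every page links to "hub"; "hub" links back to "a".
links = {"hub": ["a"], "a": ["hub"], "b": ["hub"], "c": ["hub"], "d": ["hub"]}
ranks = pagerank(links)
```

In this toy graph the pages with no incoming links ("b", "c", "d") keep only the small teleport share, while nearly all of the mass concentrates on "hub" and "a".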

Brian Dalessandro - media6degrees

Poster - Causally Motivated Attribution for Online Advertising
May 08, 2012 3:30 pm - 4:30 pm

In many online advertising campaigns, multiple vendors, publishers or search engines (herein called channels) are contracted to serve advertisements to internet users on behalf of a client seeking specific types of conversion. In such campaigns, individual users are often served advertisements by more than one channel. The process of assigning conversion credit to the various channels is called "attribution," and is a subject of intense interest in the industry. This work presents a causally motivated methodology for conversion attribution in online advertising campaigns. We first propose a need for the standardization of attribution measurement and offer four principles upon which standardization may be based. Stemming from these standards, we offer an attribution solution that generalizes prior attribution work in cooperative game theory and recasts the prior work through the lens of a causal framework. We argue that in cases where causal assumptions are violated, our solution can be interpreted as a variable (or channel) importance measure. Finally, we present a practical solution towards managing the potential complexity of the generalized attribution methodology, and show examples of attribution measurement on several online advertising campaign data sets.
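
For readers unfamiliar with cooperative-game attribution, the sketch below computes Shapley values, the standard credit-allocation rule from cooperative game theory. The abstract does not say that the Shapley value is the prior work it generalizes, so treat that as an assumption; the channel names and conversion rates are invented, and the function name `shapley` is hypothetical.

```python
from itertools import permutations


def shapley(channels, value):
    """Average marginal contribution of each channel over all orderings."""
    credit = {c: 0.0 for c in channels}
    orders = list(permutations(channels))
    for order in orders:
        seen = frozenset()
        for c in order:
            # Credit the gain from adding channel c to those already seen.
            credit[c] += value(seen | {c}) - value(seen)
            seen = seen | {c}
    return {c: total / len(orders) for c, total in credit.items()}


# Invented conversion rates for each set of channels a user was exposed to.
rates = {
    frozenset(): 0.00,
    frozenset({"search"}): 0.02,
    frozenset({"display"}): 0.01,
    frozenset({"search", "display"}): 0.04,
}
credit = shapley(["search", "display"], lambda s: rates[frozenset(s)])
```

With these numbers, "search" receives 0.025 and "display" 0.015, which together account for the full 0.04 conversion rate of the combined campaign. Enumerating all orderings grows factorially with the number of channels, which is one reason a practical method needs some way to manage complexity.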

Dean Eckles - Stanford University

Identifying peer effects in online communication behaviors
May 08, 2012 9:00 am - 10:00 am

Keywords of the presentation: peer effects, social influence, causal inference, encouragement designs, experimentation

Peer effects can produce clustering of behavior in social networks, but so can homophily and common external causes. For observational studies, adjustment and matching estimators for peer effects require often implausible assumptions, but it is only rarely possible to conduct appropriate direct experiments to study peer influence in situ.

We illustrate the limitations of observational analysis with a constructed observational study that allows us to compare experimental and observational estimates of peer influence in link sharing via Facebook News Feed.

We describe research designs in which individuals are randomly encouraged to perform a focal behavior, which can subsequently influence their peers. Ubiquitous product optimization experiments on Internet services can be used for these analyses. This approach is illustrated with an analysis of peer effects in expressions of gratitude via Facebook on Thanksgiving Day 2010, with implications for the micro-foundations of culture.
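
One textbook way to analyze an encouragement design is the instrumental-variable (Wald) ratio: the effect of encouragement on the outcome divided by its effect on the focal behavior. The abstract does not state which estimator the authors use, so the sketch below is an assumption, and the data are made up. It also presumes that encouragement affects the outcome only through the focal behavior.

```python
def mean(xs):
    return sum(xs) / len(xs)


def wald_estimate(encouraged, behaved, outcome):
    """Effect of encouragement on the outcome, scaled by its effect on the behavior."""
    y1 = mean([y for z, y in zip(encouraged, outcome) if z == 1])
    y0 = mean([y for z, y in zip(encouraged, outcome) if z == 0])
    d1 = mean([d for z, d in zip(encouraged, behaved) if z == 1])
    d0 = mean([d for z, d in zip(encouraged, behaved) if z == 0])
    return (y1 - y0) / (d1 - d0)


# Made-up data for eight users.
encouraged = [1, 1, 1, 1, 0, 0, 0, 0]  # randomly assigned encouragement
behaved = [1, 1, 1, 0, 1, 0, 0, 0]     # user performed the focal behavior
outcome = [3, 3, 3, 1, 3, 1, 1, 1]     # e.g. number of peers who then did so
effect = wald_estimate(encouraged, behaved, outcome)
```

Here the outcome was constructed as 1 + 2 × behaved, and the ratio recovers the effect of 2 even though only some encouraged users actually performed the behavior.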

Deborah Estrin - University of California, Los Angeles

Participatory mobile health (mHealth): Innovative approaches to data collection and analysis
May 07, 2012 9:00 am - 10:00 am

Keywords of the presentation: mobile health, participatory sensing

The most significant health and wellness challenges increasingly involve chronic conditions, from diabetes, hypertension, and asthma to depression, chronic pain, sleep and neurological disorders. And three lifestyle behaviors contribute to many of these conditions. Participatory mobile health (mHealth) leverages the power and ubiquity of mobile and cloud technologies to assist individuals, clinicians, and researchers in monitoring, managing, and understanding symptoms, side effects and treatment outside the clinical setting; and to address the lifestyle factors that can bring on or exacerbate these conditions. By empowering individuals to track and manage their key health-related behaviors and outcomes, this approach has the potential to greatly improve people’s health and quality of life, while simultaneously reducing societies’ overall healthcare costs.

Participatory mHealth incorporates a variety of techniques, including automated activity traces, reminders and prompted inputs. This talk will present our experience to date with mHealth pilots and prototypes and will discuss areas in need of exploration: open modular tools for data collection, analysis and visualization across diverse data types; engagement such as adaptive goal setting and game mechanics; and privacy mechanisms.

Shawndra Hill - University of Pennsylvania

Mining Medical Discussion Boards for Adverse Effects to Drugs
May 07, 2012 10:30 am - 11:30 am

Keywords of the presentation: Discussion boards, social networks, drug surveillance, data mining

Medical message boards are online resources where users with a particular condition exchange information, some of which they might not otherwise share with medical providers. Many of these boards contain a large number of posts and patient opinions and experiences that would be potentially useful to clinicians and researchers. We present an approach that is able to collect a corpus of medical message board posts, de-identify the corpus, and extract information on potential adverse drug effects discussed by users. Using a corpus of posts to breast cancer message boards, we identified drug event pairs using co-occurrence statistics. We then compared the identified drug event pairs with adverse effects listed on the package labels of tamoxifen, anastrozole, exemestane, and letrozole. Of the pairs identified by our system, 75–80% were documented on the drug labels. Some of the undocumented pairs may represent previously unidentified adverse drug effects.
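
The abstract does not say which co-occurrence statistic was used. As a minimal sketch, the code below scores drug-event pairs by lift (how much more often the two terms share a post than independence would predict). The posts, the term lists, and the function name `cooccurrence_lift` are invented for illustration; a real pipeline would also need de-identification and proper term matching.

```python
from collections import Counter
from itertools import product


def cooccurrence_lift(posts, drugs, events):
    """Lift = P(drug and event in a post) / (P(drug) * P(event))."""
    n = len(posts)
    single, pair = Counter(), Counter()
    for text in posts:
        words = set(text.lower().split())
        d_hit = [d for d in drugs if d in words]
        e_hit = [e for e in events if e in words]
        single.update(d_hit + e_hit)       # posts mentioning each term
        pair.update(product(d_hit, e_hit))  # posts mentioning both terms
    return {
        (d, e): (c / n) / ((single[d] / n) * (single[e] / n))
        for (d, e), c in pair.items()
    }


# Invented toy posts, not real message-board data.
posts = [
    "tamoxifen gave me terrible nausea",
    "nausea again after tamoxifen",
    "letrozole and joint pain",
    "no problems on letrozole",
]
lift = cooccurrence_lift(posts, ["tamoxifen", "letrozole"], ["nausea", "pain"])
```

Pairs with lift well above 1 would be the candidates to compare against the package labels; pairs that never co-occur do not appear in the result at all.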

Jake Hofman - Yahoo! Inc.

Modeling Social Data
May 10, 2012 9:00 am - 10:00 am

Keywords of the presentation: demographics, networks, temporal models, social, web

This talk provides an overview of several recent projects in modeling social data, including user demographics, social network structure, and temporal behavior. First, we present a study which pairs browsing histories for 250,000 anonymized individuals with user-level demographic data to study variation in Web activity among different demographic groups. Next, we discuss work with the Yahoo! Mail team which aims to infer associations and groups among an individual's contacts. We conclude with a temporal model of communication patterns which, phrased as a hidden Markov model, provides an effective and interpretable characterization of both human and non-human activity.

Tony Jebara - Columbia University

Learning, Linking and Labeling Social Networks
May 08, 2012 10:30 am - 11:30 am

Keywords of the presentation: social networks, linking, privacy, matching, labeling, visualization

Many machine learning problems on data can naturally be formulated as problems on graphs. For example, dimensionality reduction and visualization are related to graph embedding. Given a sparse graph between N high-dimensional data nodes, how do we faithfully embed it in low dimension? We present an algorithm that improves dimensionality reduction by extending the Maximum Variance Unfolding method. But, given only a dataset of N samples, how do we construct a graph in the first place? The space to explore is daunting with 2^(N(N-1)/2) graphs to choose from yet two interesting subfamilies are tractable: matchings and b-matchings. By placing distributions over matchings and using loopy belief propagation, we can efficiently infer maximum weight subgraphs. These fast generalized matching algorithms leverage integral LP relaxations and perfect graph theory. Applications include graph reconstruction, graph embedding, graph transduction, and metric learning with emphasis on data from text, network, mobile and image domains.
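
To make the size of the search space concrete, the sketch below counts the graphs on N labeled nodes and finds a maximum weight perfect matching by brute force. This is only an illustration for tiny N; it is not the belief-propagation approach described in the talk, and the weight matrix and function names are assumptions.

```python
def count_graphs(n):
    """Number of undirected simple graphs on n labeled nodes: 2^(n(n-1)/2)."""
    return 2 ** (n * (n - 1) // 2)


def max_weight_perfect_matching(weights):
    """Brute-force search; weights is a symmetric n x n matrix with n even."""
    def best(nodes):
        if not nodes:
            return 0.0, []
        first, rest = nodes[0], nodes[1:]
        top = (float("-inf"), [])
        # Pair the first node with every possible partner, then recurse.
        for i, partner in enumerate(rest):
            score, pairs = best(rest[:i] + rest[i + 1:])
            score += weights[first][partner]
            if score > top[0]:
                top = (score, [(first, partner)] + pairs)
        return top

    return best(list(range(len(weights))))


# Hypothetical similarity weights between four data points.
weights = [
    [0, 5, 1, 1],
    [5, 0, 1, 1],
    [1, 1, 0, 4],
    [1, 1, 4, 0],
]
score, pairs = max_weight_perfect_matching(weights)
```

Four nodes already admit 64 graphs but only three perfect matchings, which hints at why restricting attention to matchings makes the problem tractable.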

Benjamin Letham - Massachusetts Institute of Technology

Poster - Sequential event prediction
May 08, 2012 3:30 pm - 4:30 pm

In sequential event prediction, we are given a "sequence database" of past sequences to learn from, and we aim to predict the next event within a current event sequence. We focus on applications where the set of past events has predictive power and not the specific order of those past events. Such applications arise in recommender systems, equipment maintenance, medical informatics, and in other domains. Our formalization of sequential event prediction draws on ideas from supervised ranking. We show how specific choices within this approach lead to different sequential event prediction problems and algorithms. We apply our approach to an online grocery store recommender system as well as a novel application in the health event prediction domain.

David Madigan - Columbia University

Big-Data-Driven Medicine
May 07, 2012 3:00 pm - 4:00 pm

Keywords of the presentation: observational study, predictive modeling, healthcare

In our data-rich world, key medical decisions, ranging from a regulator’s decision to curtail a drug to patient-specific treatment choices, require optimal consideration of myriad inputs. Statistical/epidemiological methods that can harness real-world medical data in useful ways do exist, but much work remains to achieve the full potential of a truly data-driven user-centric medical environment. I will lay out some of the key challenges before us and describe recent progress in the specific area of drug safety.

Tyler McCormick - University of Washington

Poster - Dynamic Logistic Regression and Dynamic Model Averaging for Binary Classification
May 08, 2012 3:30 pm - 4:30 pm

We propose an online binary classification procedure for cases when there is uncertainty about the model to use and parameters within a model change over time. We account for model uncertainty through Dynamic Model Averaging (DMA), a dynamic extension of Bayesian Model Averaging (BMA) in which posterior model probabilities may also change with time. We apply a state-space model to the parameters of each model and we allow the data-generating model to change over time according to a Markov chain. Calibrating a "forgetting" factor accommodates different levels of change in the data-generating mechanism. We propose an algorithm which adjusts the level of forgetting in an online fashion using the posterior predictive distribution, and so accommodates various levels of change at different times.
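
The abstract does not spell out the update equations. The sketch below assumes the exponential-forgetting form commonly used for DMA, in which model probabilities are raised to a power alpha before each Bayesian update. The function name `dma_step` and the numbers are illustrative, and the online tuning of the forgetting factor is not shown.

```python
def dma_step(probs, likelihoods, alpha=0.99):
    """One update of posterior model probabilities with forgetting.

    probs:       model probabilities after the previous observation
    likelihoods: each model's predictive likelihood of the new observation
    alpha:       forgetting factor in (0, 1]; alpha = 1 means no forgetting
    """
    # Prediction step: flatten the probabilities toward uniform (forgetting).
    powered = [p ** alpha for p in probs]
    total = sum(powered)
    predicted = [p / total for p in powered]
    # Update step: Bayes' rule with the predictive likelihoods.
    joint = [p * l for p, l in zip(predicted, likelihoods)]
    norm = sum(joint)
    return [j / norm for j in joint]


no_forgetting = dma_step([0.5, 0.5], [0.8, 0.2], alpha=1.0)
with_forgetting = dma_step([0.9, 0.1], [1.0, 1.0], alpha=0.5)
```

With alpha = 1 the step reduces to an ordinary Bayesian update. With a smaller alpha a previously dominant model loses some of its lead even when the new observation is uninformative, which lets the averaging adapt when the data-generating model changes.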

We apply our method to data from children with appendicitis who receive either a traditional (open) appendectomy or a laparoscopic procedure. Factors associated with which children receive a particular type of procedure changed substantially over the seven years of data collection, a feature that is not captured using standard regression modeling. Because our procedure can be implemented completely online, future data collection for similar studies would require storing sensitive patient information only temporarily, reducing the risk of a breach of confidentiality.

This is joint work by Tyler H. McCormick (University of Washington), Adrian E. Raftery (University of Washington), David Madigan (Columbia University), and Randall Burd (Children's National Medical Center).

Bamshad Mobasher - DePaul University

User Modeling for Context-Aware Recommendation
May 09, 2012 9:00 am - 10:00 am

Keywords of the presentation: recommender systems, context-awareness, collaborative tagging

The role of recommender systems as a fundamental utility for electronic commerce and information access is well established, with many commercially available recommender systems providing benefits to both users and businesses. However, recommender systems tend to use simplistic user models that are additive in nature: new user preferences are simply added to the existing profiles. This additive approach ignores the notion of "situated action," that is, the fact that users interact with systems within a particular context and items relevant within one context may be irrelevant in another. Little agreement exists among researchers as to what constitutes context, but its importance seems undisputed. In psychology, a change in context during learning has been shown to have an impact on recall. Research in linguistics has shown that context plays the important role of a disambiguation function. More recently, the role of context has been explored in intelligent information systems. In particular, a variety of approaches and architectures have emerged for incorporating context or situational awareness in the recommendation process. In this talk, we provide a broad overview of the problem of contextual recommendation and some of the recent solutions to the problem of modeling context. We will specifically focus on several approaches for integrating context in user modeling for personalized recommendation: an approach that is inspired by a model of human memory and emphasizes the modeling of context based on observations of user behavior; another that emphasizes the role of domain knowledge and semantics as an integral part of user context; and, finally, an approach that exploits social annotations, such as collaborative tagging, as the basis for inferring content.

Cathy O'Neil - Intent Media

Math in Business
May 10, 2012 3:00 pm - 4:00 pm

Cathy will talk about doing math in business, specifically drawing on her experiences as an assistant professor in math, as a quant at a hedge fund, and currently as a data scientist at an internet advertising startup. She will discuss the mathematical as well as the cultural differences of the three jobs, and will suggest how to decide where one may best fit in and why. She will also talk about how questions of ethics fit into the daily life of a mathematician in business.

Cynthia Rudin - Massachusetts Institute of Technology

Poster - Interpretable User-Centered Predictions
May 08, 2012 3:30 pm - 4:30 pm

I am working on the design of predictive models that are both accurate and interpretable by a human. These models are built from association rules such as "dyspepsia & epigastric pain -> heartburn." I will present three algorithms for "decision lists," where classification is based on a list of rules:

1) A very simple rule-based algorithm, which is to order rules based on the "adjusted confidence." In this case, users can understand the whole algorithm as well as the reason for the prediction.

2) A Bayesian hierarchical model for sequentially predicting conditions of medical patients, using association rules.

3) A mixed-integer optimization (MIO) approach for learning decision lists. This algorithm has high accuracy and interpretability - both owing to the use of MIO.

This is joint work with David Madigan, Tyler McCormick, Ben Letham, Allison Chang, and Dimitris Bertsimas.
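
The abstract does not define "adjusted confidence." The sketch below assumes the definition (count of histories containing both sides of a rule) / (count containing the left side + K), where K penalizes rules with little support, and uses it to order a small decision list. The patient histories, the rules, and the function names are invented for illustration.

```python
def adjusted_confidence(histories, lhs, rhs, k=1.0):
    """Assumed definition: #(lhs and rhs) / (#lhs + k)."""
    lhs = set(lhs)
    n_lhs = sum(1 for h in histories if lhs <= set(h))
    n_both = sum(1 for h in histories if lhs <= set(h) and rhs in h)
    return n_both / (n_lhs + k)


def predict(histories, rules, observed, k=1.0):
    """Return the right side of the top-ranked rule whose left side applies."""
    observed = set(observed)
    applicable = [(lhs, rhs) for lhs, rhs in rules if set(lhs) <= observed]
    ranked = sorted(
        applicable,
        key=lambda r: adjusted_confidence(histories, r[0], r[1], k),
        reverse=True,
    )
    return ranked[0][1] if ranked else None


# Invented toy histories, not real patient data.
histories = [
    {"dyspepsia", "epigastric pain", "heartburn"},
    {"dyspepsia", "epigastric pain", "heartburn"},
    {"dyspepsia", "nausea"},
    {"epigastric pain"},
]
rules = [
    (("dyspepsia", "epigastric pain"), "heartburn"),
    (("dyspepsia",), "nausea"),
]
prediction = predict(histories, rules, {"dyspepsia", "epigastric pain"})
```

Because the whole procedure is a sort over rule scores, a user can read off both the prediction and the rule that produced it.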


Using the Crowd for User-Centered Predictive Modeling
May 10, 2012 1:30 pm - 2:30 pm

Keywords of the presentation: personalized predictions, crowdsourcing, wisdom of crowds, lists, event sequences

I will describe work on three areas related to crowd-based user-centered modeling:

1) Growing Lists: We want to combine the knowledge of many people (experts) in order to create "sets" of things that go together, starting from a small seed. The experts have varying levels of expertise. This is the same problem that Google Sets was designed to solve. (With Ben Letham and Katherine Heller)

2) Sequential Event Prediction for Personalized Recommendations: We are given a "sequence database" of past event sequences to learn from (like sequences of products purchased by customers), and we aim to predict the next event within a current event sequence (the next product purchased). We focus on applications where the set of the past events has predictive power and not the specific order of those past events. This is useful for all different kinds of recommender systems and search engines. (With Ben Letham and David Madigan)

3) Approximating the Crowd on a Budget: The problem of "approximating the crowd" is that of estimating the crowd's majority opinion by querying only a subset of it. Algorithms that approximate the crowd can intelligently stretch a limited budget for a crowdsourcing task, and must balance between exploring the quality of the labelers and exploiting the best ones. (With Seyda Ertekin and Haym Hirsh)


Frank Shipman - Texas A & M University

Multi-Application User Interest Modeling
May 09, 2012 3:00 pm - 4:00 pm

User interest modeling attempts to represent user interests in a form that can be used to improve system support when users are searching for, selecting from, and browsing documents or other resources. Work on recognizing user interests based on their prior activities, such as their browsing behavior, is a common approach to implicit user interest modeling. The work presented expands on this approach by aggregating activity across multiple end-user applications. This talk presents the evolution of the Interest Profile Manager, a local application that collects activity data and acts as a service to those applications seeking to better support information access.

Vitaly Shmatikov - University of Texas, Austin

User Data: The End of Anonymity, the Beginning of Privacy
May 09, 2012 10:30 am - 11:30 am

Keywords of the presentation: privacy, user data, social networks

"We do not collect personally identifiable information"... "This dataset has been de-identified prior to release"... From advertisers tracking Web clicks to biomedical researchers sharing clinical records, anonymization is the main privacy protection mechanism used for sensitive user data today.

I will argue that the distinction between "personally identifiable" and "non-personally identifiable" information is fallacious by showing how to infer private information from fully anonymized data in three settings: (1) records of individual transactions and preferences, illustrated by the Netflix Prize dataset, (2) social networks, and (3) recommender systems, where temporal changes in aggregate statistics allow accurate inference of hidden individual transactions.

I will then outline a program for data privacy research. It includes several challenging problems in the design and implementation of privacy-preserving systems, domain-specific algorithmic research, as well as policy and economic issues.

Ori Stitelman - media6degrees

Poster - Doubly Robust Targeted Maximum Likelihood Estimation (TMLE) Of The Effect Of Display Advertising On Browser Conversion
May 08, 2012 3:30 pm - 4:30 pm

The effectiveness of online display ads beyond simple click-through evaluation is not well established in the literature. Are the high conversion rates seen for subsets of browsers the result of choosing to display ads to a group that has a naturally higher tendency to convert or does the advertisement itself cause an additional lift? How does showing an ad to different segments of the population affect their tendencies to take a specific action, or convert? We present an approach for assessing the effect of display advertising on customer conversion that does not require the cumbersome and expensive setup of a controlled experiment, but rather uses the observed events in a regular campaign setting. The general approach can be applied to many additional types of causal questions in display advertising and beyond. The approach relies on four steps:

  1. Defining the question of interest.
  2. Using domain knowledge and temporal cues to establish causal assumptions.
  3. Choosing a parameter of interest that directly answers the question of interest under the causal assumption.
  4. Estimating the parameter of interest as well as possible using Targeted Maximum Likelihood Estimation (TMLE), a doubly robust estimating procedure.

We apply the above approach to several display advertising campaigns for m6d, a display advertising company that observes over 5 billion actions a day and uses that data along with machine learning algorithms to determine the best prospects for a brand.

Paul Thompson - Dartmouth Medical School

Poster - Personalized Biomedical Information Retrieval: A Microbiome Case Study
May 08, 2012 3:30 pm - 4:30 pm

Relevance judgments provided by a neonatal human microbiome researcher were used to predict the relevance of additional publications to the researcher's information need. Six PubMed queries were run to retrieve documents which the researcher judged for relevance. These relevance judgments were used to produce training and test sets for the evaluation of two machine learning algorithms: C4.5 and support vector machines. These algorithms were evaluated in two ways: 1) tenfold cross-validation and 2) training on publications from 2008-2010 and testing on documents from 2011. It was found that the researcher's relevance judgments could be used to accurately predict relevance.
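
The two evaluation protocols can be sketched as index splits. The code below is a minimal illustration only: the function names, the interleaved fold assignment, and the `year` field are assumptions, and the classifiers themselves (C4.5 and support vector machines) are not shown.

```python
def kfold_indices(n, k=10):
    """Yield (train, test) index lists for k-fold cross-validation."""
    # Interleaved assignment: item i goes to fold i mod k.
    folds = [list(range(i, n, k)) for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, test


def temporal_split(docs, last_train_year=2010):
    """Train on documents up to last_train_year, test on later ones."""
    train = [d for d in docs if d["year"] <= last_train_year]
    test = [d for d in docs if d["year"] > last_train_year]
    return train, test


splits = list(kfold_indices(25, k=10))
docs = [{"year": 2008}, {"year": 2010}, {"year": 2011}]
train_docs, test_docs = temporal_split(docs)
```

The temporal split is the stricter test of the two, since it mimics predicting relevance for publications that did not exist when the judgments were collected.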

Fabian Wauthier - University of California, Berkeley

Bayesian Bias Mitigation for Crowdsourcing
May 09, 2012 1:30 pm - 2:30 pm

Keywords of the presentation: Crowdsourcing, Bias

Biased labelers are a systemic problem in crowdsourcing, and a comprehensive toolbox for handling their responses is still being developed. A typical crowdsourcing application can be divided into three steps: data collection, data curation, and learning. At present these steps are often treated separately. We present Bayesian Bias Mitigation for Crowdsourcing (BBMC), a Bayesian model to unify all three. Most data curation methods account for the effects of labeler bias by modeling all labels as coming from a single latent truth. Our model captures the sources of bias by describing labelers as influenced by shared random effects. This approach can account for more complex bias patterns that arise in ambiguous or hard labeling tasks and allows us to merge data curation and learning into a single computation. Active learning integrates data collection with learning, but is commonly considered infeasible with Gibbs sampling inference. We propose a general approximation strategy for Markov chains to efficiently quantify the effect of a perturbation on the stationary distribution and specialize this approach to allow active learning with Gibbs sampling in our model. Experiments show BBMC to outperform many common heuristics when a useful consensus labelling cannot be estimated.