
Abstracts for Statistical Methods for Gene Expression: Microarrays and Proteomics

September 29-October 3, 2003

**David Allison** (Section on Statistical Genetics, Department of Biostatistics, University of Alabama at Birmingham) DAllison@ms.soph.uab.edu

**Applying High-Dimensional Approaches to Microarray Research** Slides: pdf

Although termed the post-genomic era, our age may be more accurately labeled the genomic era. Draft sequences of several genomes coupled with new technologies allow study of the influences and responses of entire genomes rather than isolated single genes. This opens a new realm of highly dimensional biology (HDB) where questions involve multiplicity at unprecedented scales. HDB can involve thousands of genetic polymorphisms, gene expression levels, protein measurements, genetic sequences, or any combination of these and their interactions. Such situations demand creative approaches to the processes of inference, estimation, prediction, classification, and study design. Although bench scientists intuitively grasp the need for flexibility in the inferential process, the elaboration of formal statistical frameworks supporting this is just beginning. I will discuss some of the unique statistical challenges facing investigators studying high-dimensional biology, describe some approaches being developed by scientists at UAB and elsewhere, and offer an epistemological framework for the validation of proffered statistical procedures.

**Shilpi Arora** (Cellular and Molecular Biology, Princess Margaret Hospital/Ontario Cancer Institute, Toronto) sarora@uhnres.utoronto.ca

**Gene Expression Profiling of Human Oral Cancer Using cDNA Microarrays** (poster session)

Oral Squamous Cell Carcinoma (OSCC) is a clinically heterogeneous disease. Patients with stage-matched tumors show differences in treatment response and outcome, suggesting that a subclassification system may be possible. In the present study, we used cDNA microarrays and a novel method of analysis (Binary Tree-Structured Vector Quantization, BTSVQ) to classify 20 OSCC samples based on their gene expression profiles. BTSVQ analysis combines k-means clustering and self-organizing maps in a complementary fashion. In our study, the binary tree generated by BTSVQ revealed groups of patients that significantly correlated with male gender (P=0.035), T III-IV disease stage (P=0.035), and nodal metastasis (P=0.035). Further data mining revealed a subset of genes, present in the sample cluster enriched for node-positive tumors, that may represent potential biomarkers for metastasis. The differential expression of these genes was validated by quantitative real-time PCR. We conclude that molecular subtyping of OSCC can identify distinct patterns of gene expression that correlate with clinical-pathological parameters. The genes identified may influence tumor growth, development, and metastasis through overexpression of normal gene products, gene amplification, or mutation. They may therefore represent potential biomarkers for oral carcinomas. Our findings may help form the basis for a molecular classification of OSCC, thus improving diagnosis, therapeutic decisions, and outcome for patients with this lethal disease.

Joint with Giles C. Warner, Patricia P. Reis^{1}, Igor Jurisica^{2,3,4}, Mujahid Sultan^{4}, Christina Macmillan^{5}, Mahadeo Sukhai^{1,2}, Reidar Grenman^{6}, Richard A. Wells^{1}, Dale Brown^{7}, Ralph Gilbert^{7}, Patrick Gullane^{7}, Jonathan Irish^{7}, and Suzanne Kamel-Reid^{*1,2,5}.

^{1} Departments of Cellular and Molecular Biology, ^{2} Medical Biophysics, ^{3} Computer Science, ^{4} Cancer Informatics, ^{5} Laboratory Medicine and Pathobiology, ^{7} Otolaryngology/Surgical Oncology, University of Toronto, Princess Margaret Hospital, Ontario Cancer Institute, Toronto, Ontario, Canada. ^{6} Department of Otolaryngology, Turku University Central Hospital, Turku, Finland.

**Keith Baggerly** (M.D. Anderson Cancer Center) kabagg@odin.mdacc.tmc.edu

**The Analysis of Proteomics Spectra from Serum Samples**

Slides: pdf

Mass spectrometry profiles can provide quick summaries of the relative levels of hundreds of proteins. By surveying profiles from a large number of samples, we can hopefully zoom in on proteins that are linked with a difference of interest such as the presence or absence of cancer. Using examples from two case studies, we will address issues of experimental design, data cleaning and processing, discriminating subsets, and protecting against spurious structure.

**Fast Loess for Normalizing Microarray Data** (poster session)

Joint work with Ann Oberg and Terry Therneau.

Various methods have been developed for normalizing high-density oligonucleotide arrays (as well as other gene expression microarray technologies) so that meaningful comparisons of gene expression levels can be made across arrays (experiments). The most useful methods are those with two explicit features: (1) they use data from all arrays in the proposed comparison to perform the normalization, and (2) they account for the non-linear relationship of intensities among arrays. Commonly used non-linear normalization techniques include cyclic loess and quantile normalization. We propose a new method, fast loess, which is similar in concept to cyclic loess normalization but uses a linear models argument to normalize all arrays at once. Results comparing the performance of cyclic loess, quantile normalization, and fast loess on simulated and real data will be presented. Fast loess and cyclic loess produce similar results, but fast loess is considerably faster. Both produce superior results to quantile normalization.
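
The quantile normalization baseline the abstract compares against can be sketched in a few lines. This is a generic illustration of that baseline, not the fast loess method itself; `quantile_normalize` is a hypothetical helper that maps every array onto a common reference distribution, the mean of the sorted intensities:

```python
def quantile_normalize(arrays):
    """Force each array to share one reference distribution: the mean of
    the sorted intensities across arrays. Minimal sketch; real
    implementations also handle ties and missing values."""
    n = len(arrays[0])
    sorted_cols = [sorted(a) for a in arrays]
    reference = [sum(col[i] for col in sorted_cols) / len(arrays)
                 for i in range(n)]
    out = []
    for a in arrays:
        # assign each value the reference quantile matching its rank
        order = sorted(range(n), key=lambda i: a[i])
        normed = [0.0] * n
        for rank, idx in enumerate(order):
            normed[idx] = reference[rank]
        out.append(normed)
    return out
```

After normalization, every array has exactly the same distribution of values, which is what makes the method fast but also (as the abstract notes) less faithful than the loess-based alternatives.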

**Joseph Beyene** (Department of Public Health Sciences, University of Toronto) joseph@utstat.toronto.edu

**A Spectral Clustering Method for Microarray Data** (poster session)

Joint work with David Tritchler and Shafagh Fallah.

Cluster analysis is a commonly used dimension reduction technique. We introduce a clustering method computationally based on eigenanalysis. Our focus is on large problems, and we present the method in the context of clustering genes and arrays using microarray expression data. The computational algorithm for the method has complexity linear in the number of genes. We also introduce a method for assessing the number of clusters exhibited in microarray data based on the eigenvalues of a particular matrix.
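
The eigenanalysis idea can be illustrated with a toy two-cluster version. This is a generic spectral split, not the authors' algorithm: genes are partitioned by the sign of their projection onto the leading singular direction, and the eigen-decomposition is done on the small arrays-by-arrays matrix so the cost stays linear in the number of genes.

```python
import numpy as np

def spectral_two_cluster(X):
    """X: genes x arrays expression matrix. Split genes into two groups
    by the sign of their score on the top singular direction (generic
    sketch under the stated assumptions)."""
    Xc = X - X.mean(axis=1, keepdims=True)
    # eigen-decompose the small p x p matrix (p = number of arrays),
    # so the work is linear in the number of genes
    small = Xc.T @ Xc
    vals, vecs = np.linalg.eigh(small)
    v = vecs[:, -1]                  # top right singular vector of Xc
    scores = Xc @ v                  # projection of each gene
    return (scores > 0).astype(int)  # two clusters by sign
```

The sign of an eigenvector is arbitrary, so only the partition (not which group is labeled 0 or 1) is meaningful.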

**Atul Butte**, MD (Children's Hospital Informatics Program and Harvard Medical School) atul.butte@TCH.Harvard.edu

**Integrative Genomics and its Implications for Clinical Research and Care: What are the Real Issues Beyond Analysis?** Slides: pdf

Microarrays can provide systematic quantitative information on the expression of thousands of unique RNAs, and have been used to elucidate the transcriptional response in many basic biological and clinically relevant experiments, ranging from associative studies between therapeutics and expression, to aiding in diagnostic questions, to discovery of novel subtypes of disease.

Given the over 6000 arrays with data publicly available and the surfeit of microarray facilities, the rate-limiting step is no longer the sample collection, hybridization, scanning, or even the analysis. Instead, the new challenge is in taking findings, such as the traditional "list of genes" resulting from a microarray analysis, and ascertaining the meaning of the results, such as the biological relationships between the genes. However, tools that link these genes back to known biological pathways, as well as discovering new pathways, are in their infancy. Tools that automatically suggest the importance of particular findings have yet to be invented.

During this presentation, I will describe four packages we have made freely available to the academic genomics community. I will present examples of, and would like to discuss, the following points: (1) not all pathways can be reverse engineered using microarrays; (2) looking for simultaneous gene associations ignores the fact that biology takes time; (3) a discovered diagnostic model does not imply the underlying molecular physiology; (4) because information about the genes already measured changes rapidly, one is never truly finished analyzing a microarray dataset; and (5) the real bottleneck in microarray analysis is not the analysis, but the interpretation of the findings.

**David B. Dahl** (Statistics and Biostatistics & Medical Informatics, University of Wisconsin - Madison) dbdahl@stat.wisc.edu

**Modeling Differential Gene Expression using a Dirichlet Process Mixture Model** (poster session)

The literature has given considerable attention to the task of identifying differentially-expressed genes using data from DNA microarrays. This poster proposes a conjugate Dirichlet Process mixture model which naturally incorporates any number of treatment conditions, clusters genes based on their treatment effects and variance, and readily makes general inference on the treatment effects and variance. As a consequence of the model, probabilities of co-regulation are available and there is no need to estimate the correct number of clusters. Any number of hypotheses concerning the parameters can be tested and false discovery rates are easily computed. The proposed methods are applied to a dataset of 10,043 genes measured at 10 treatment conditions with 3 replicates per treatment.
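
A Dirichlet process mixture avoids fixing the number of clusters because its prior on partitions, the Chinese restaurant process, lets the cluster count grow with the data. The sketch below draws such a partition from the prior only; the conjugate likelihood and posterior inference of the poster are omitted, and the function name is illustrative.

```python
import random

def crp_partition(n, alpha=1.0, seed=0):
    """Draw a random partition of n genes from the Chinese restaurant
    process with concentration alpha: gene i joins an existing cluster k
    with probability counts[k]/(i+alpha), or opens a new cluster with
    probability alpha/(i+alpha)."""
    rng = random.Random(seed)
    labels = [0]
    counts = [1]
    for i in range(1, n):
        r = rng.random() * (i + alpha)
        acc = 0.0
        for k, c in enumerate(counts):
            acc += c
            if r < acc:
                labels.append(k)
                counts[k] += 1
                break
        else:
            labels.append(len(counts))  # open a new cluster
            counts.append(1)
    return labels
```

Under this prior the expected number of clusters grows only logarithmically in n, so even thousands of genes typically fall into a modest number of groups.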

**Sandrine Dudoit** (Division of Biostatistics, University of California, Berkeley) sandrine@stat.Berkeley.EDU http://www.stat.berkeley.edu/~sandrine

**Loss-based Estimation Methodology with Cross-validation: Prediction of Clinical Outcomes Using Microarray Data**

Slides: pdf

We propose a unified loss-based methodology for estimator construction, selection, and performance assessment with cross-validation. In this approach, the parameter of interest is defined as the risk minimizer for a suitable loss function and candidate estimators are generated using this (or possibly another) loss function. Cross-validation is applied to select an optimal estimator among the candidates and to assess the overall performance of the resulting estimator. Finite sample and asymptotic optimality results are derived for the cross-validation selector for general data generating distributions, loss functions (possibly depending on a nuisance parameter), and estimators. This general estimation framework encompasses a number of problems which have traditionally been treated separately in the statistical literature, including multivariate outcome prediction and density estimation based on censored data. Applications to genomic data analysis include the prediction of biological and clinical outcomes (possibly censored) using microarray gene expression measures, the identification of regulatory motifs in DNA sequences, and genetic mapping with single nucleotide polymorphisms (SNP). This talk will focus on tree-based estimation of patient survival with microarray expression measures.
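
The selection step of this framework can be sketched as a generic K-fold cross-validation selector: each candidate is fit on the training folds, its risk is estimated on the held-out fold under the chosen loss, and the candidate with smallest cross-validated risk wins. The `fit(X, y) -> predict` interface below is a hypothetical stand-in for the paper's candidate estimators.

```python
import numpy as np

def cv_select(candidates, X, y, loss, K=5, seed=0):
    """Return the index of the candidate with smallest K-fold
    cross-validated risk, plus the risks. `candidates` is a list of
    callables fit(X, y) -> predict; `loss(y, yhat)` returns
    elementwise losses. Minimal sketch of the selector idea."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    folds = np.array_split(idx, K)
    risks = []
    for fit in candidates:
        fold_losses = []
        for k in range(K):
            test = folds[k]
            train = np.concatenate([folds[j] for j in range(K) if j != k])
            predict = fit(X[train], y[train])
            fold_losses.append(np.mean(loss(y[test], predict(X[test]))))
        risks.append(float(np.mean(fold_losses)))
    return int(np.argmin(risks)), risks
```

The same loop also yields the honest performance estimate of the final estimator when wrapped in an outer cross-validation layer, which is the paper's second use of cross-validation.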

Joint work with: Mark van der Laan, Sunduz Keles, and Annette Molinaro.

**A Theoretical Framework For Reconstructing Missing Data in Genome-Wide Matrix** (poster session)

This is joint work with Amir Niknejad.

For the last decade, molecular biologists have been using DNA microarrays (chips) as a tool for analyzing information embedded in gene expression data. During the laboratory process, some spots on the array may be missed and the probing of genes might fail. Making chips to probe genes is still very costly.

There have been several attempts by molecular biologists, statisticians, and computer scientists to recover the missing gene expressions by ad-hoc methods. Most recently, microarray gene expression has been formulated as a gene-array matrix. In this setting, the analysis of missing gene expression on the array translates to recovering missing entries in the gene-expression matrix.

The most common methods for recovery are: (a) various cluster-analysis methods, such as K-nearest-neighbor clustering and hierarchical clustering; (b) SVD (singular value decomposition). In these methods, the recovery of missing data is done independently, i.e., the completion of each missing entry does not influence the completion of other entries.

We suggest here a new method in which the completion of missing entries is done simultaneously, i.e., the completion of one missing entry influences the completion of other entries. Our method is closely related to methods and techniques for solving inverse eigenvalue problems.
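
One widely used "simultaneous" scheme, iterative low-rank SVD completion, conveys the flavor of such coupling (it is not the authors' inverse-eigenvalue method): every imputed entry feeds back into the next low-rank reconstruction of all the others.

```python
import numpy as np

def svd_impute(X, rank=1, iters=50):
    """Fill missing entries (NaN) of a gene x array matrix by iterating
    a rank-`rank` SVD reconstruction. Each pass re-estimates all missing
    entries jointly, so completions influence one another."""
    mask = np.isnan(X)
    # start from row means as a crude initial fill
    filled = np.where(mask, np.nanmean(X, axis=1, keepdims=True), X)
    for _ in range(iters):
        U, s, Vt = np.linalg.svd(filled, full_matrices=False)
        approx = (U[:, :rank] * s[:rank]) @ Vt[:rank]
        filled = np.where(mask, approx, X)  # update only missing cells
    return filled
```

On data that are genuinely near low rank, the iteration converges to a completion consistent with the observed entries.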

**Wolfgang Huber** (German Cancer Research Center, Division of Molecular Genome Analysis) w.huber@dkfz-heidelberg.de

**Interpretation and Transformation of Microarray Data**

Slides: html pdf ps ppt

Data from microarray experiments is often reported in the form of logarithmic ratios or logarithm-transformed intensities. This amounts to the assumption that an increase from, say, 1000 units to 2000 units has the same biological significance as one from 10000 to 20000. While this approach is useful for large intensities, it fails when the true level of expression of a gene in one of the conditions is small or zero. However, these genes may be biologically relevant, perhaps even the most relevant ones.

We derive a measure of differential expression that has comparable resolution across the whole dynamic range of expression. Mathematically, it can be expressed in terms of a variance stabilizing transformation. The measure coincides with the log-ratio in those cases where the latter is well-defined, and is a meaningful extrapolation in those cases where the log-ratio is unstable. The measure is closely related to the standardized log-ratio ("moving-window z-score"), but has preferable mathematical and computational properties.
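
A common concrete choice of variance-stabilizing transformation with these properties is the arsinh-based generalized logarithm. The sketch below (with placeholder calibration parameters `a` and `c`; in practice they are estimated per array) shows how it matches the ordinary log for large intensities while remaining defined and smooth through zero and negative background-corrected values:

```python
import math

def glog(x, a=0.0, c=1.0):
    """Generalized log: asinh((x + a) / c). Behaves like log(2x/c) for
    large x, but is defined for x <= 0. a, c are calibration
    parameters (placeholders in this sketch)."""
    return math.asinh((x + a) / c)
```

Differences of `glog` values then play the role of log-ratios: for two large intensities the difference is the usual log-ratio, while near zero it interpolates smoothly instead of blowing up.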

We present a parametric statistical model that leads to a robust estimator for the transformation parameters, as well as the between-array normalization parameters. In applications to several benchmark datasets, this approach compares favorably to other normalization algorithms.

**Rebecka Jornsten** (Department of Statistics, Rutgers University) rebecka@stat.rutgers.edu http://www.stat.rutgers.edu/~rebecka

**Data Depth Based Clustering and Classification** (poster session)

Clustering and classification are important tasks for the analysis of microarray gene expression data. Classification of tissue samples can be a valuable diagnostic tool for diseases such as cancer. Clustering samples or experiments may lead to the discovery of subclasses of diseases. Clustering can also help identify groups of genes that respond similarly to a set of experimental conditions. In addition to these two tasks it is useful to have validation tools for clustering and classification. Here we focus on the identification of outliers - units that may have been misallocated, or mislabeled, or are not representative of the classes or clusters. We present two new methods: DDclust and DDclass, for clustering and classification. These non-parametric methods are based on the intuitively simple concept of data depth. We apply the methods to several gene expression and simulated data sets. We also discuss a convenient visualization and validation tool - the Relative Data Depth (ReD) plot.

**Christina Kendziorski** (Department of Biostatistics and Medical Informatics, University of Wisconsin, Madison) kendzior@biostat.wisc.edu

**Hidden Markov Models for Microarray Time Course Data in Multiple Biological Conditions**

Slides: html pdf ps ppt

Among the first microarray experiments were those measuring expression over time, and time course experiments remain common. Most methods to analyze time course data attempt to group genes sharing similar temporal profiles within a single biological condition. However, with time course data in multiple conditions, a main goal is to identify differential expression patterns over time. I will present a Hidden Markov modeling approach designed specifically to address this question. Simulation studies show a substantial increase in sensitivity without an increase in the false discovery rate when compared to a marginal analysis at each time point. Results from three case studies will be discussed.
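
Likelihood evaluation in such a model rests on the standard forward algorithm. A minimal generic sketch over per-time-point log-likelihoods of the hidden states (e.g. equal vs. differential expression; the two-state interpretation is an assumption of this sketch, not a detail from the talk):

```python
import math

def _logsumexp(xs):
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))

def forward_loglik(obs_loglik, log_trans, log_init):
    """Forward algorithm in log space. obs_loglik[t][s] is
    log P(data at time t | hidden state s); log_trans[r][s] and
    log_init[s] are log transition and initial probabilities.
    Returns log P(data) summed over all state paths."""
    S = len(log_init)
    alpha = [log_init[s] + obs_loglik[0][s] for s in range(S)]
    for t in range(1, len(obs_loglik)):
        alpha = [obs_loglik[t][s] +
                 _logsumexp([alpha[r] + log_trans[r][s] for r in range(S)])
                 for s in range(S)]
    return _logsumexp(alpha)
```

The same recursions, run backward as well, give the posterior probability of differential expression at each time point, which is the quantity of interest in the multi-condition setting.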

This is joint work with Ming Yuan, a graduate student in the Department of Statistics, University of Wisconsin (see the poster session for additional details).

**Kathleen Kerr** (Department of Biostatistics, University of Washington) katiek@u.washington.edu

**Empirical Evaluation of Methodologies for Microarray Data Analysis, With Some Thoughts on Statistical Implications** Slides: pdf

This talk will present recent results of empirical tests of various methodologies of microarray data analysis. The data are spike-in assays produced as part of a "standardization" experiment performed by the Toxicogenomics Research Consortium. Findings support the use of an intensity-based normalization procedure and provide strong evidence that the practice of local background subtraction is detrimental. Statistically, the most interesting findings pertain to the relative ability of various methodologies for detecting differentially expressed genes. The speaker will present these results along with some opinions on directions of research to advance statistical methodology for microarray data analysis.

**Boris Khots** and **Dmitriy Khots** (Iowa, USA) bkhots@cccglobal.com dkhots@blue.weeg.uiowa.edu

**Why Infinite-dimensional Topological Groups may Work for Genetics Data** (poster session)

Slides: html pdf ps ppt

Infinite-dimensional P-spaces, and the infinite-dimensional P-groups and P-algebras connected with them, may play a significant role in mathematical modeling and algorithms for processing genetic data, for example gene expression data (B.S. Khots, Groups of local analytical homeomorphisms of line and P-groups, Russian Math Surveys, v. XXXIII, 3 (201), Moscow, London, 1978, 189-190). Investigation of the topological-algebraic properties of P-spaces, P-groups, and P-algebras is connected with the solution of the infinite-dimensional fifth Hilbert problem. In genetic data processing, utilizing the topological-algebraic properties of P-spaces, P-groups, and P-algebras may permit finding "gene functionality". We applied these methods to the yeast Rosetta and Lee-Hood gene expression data and to the ALL-AML leukemia gene expression data, and found sets of gene-gene and gene-trait dependencies. In particular, the accuracy of leukemia diagnosis is 0.97. On the other hand, genetics requires the solution of new mathematical problems. For example, what are the topological-algebraic properties of a P-group (subgroups, normal subgroups, normal series, P-algebras, subalgebras, ideals, etc.) that is finitely generated by local homeomorphisms of some manifold onto itself?

**Pim (W.W.) Kuurman** (Animal Sciences Group, Wageningen UR, P.O. Box 65, 8200 AB Lelystad, The Netherlands) Pim.Kuurman@wur.nl

**Procedure for Standardisation and Normalisation of cDNA Microarrays** (poster session)

Poster file: pool.pdf

Talk handout: EAAPpresentatie.pdf EAAPpresentatie.doc

Joint work with M.H. Pool, B. Hulsegge, L.L.G Janss, J.M.J. Rebel, and S. van Hemert.

Expression levels for large numbers of genes under different conditions can be measured by using microarrays. In livestock species, cDNA arrays are often used for this purpose, because the complete genome sequences are not yet available to engineer oligo arrays, and cDNA arrays allow the direct use of available cDNA libraries. However, cDNA arrays exhibit larger variability than oligo arrays and therefore require more care to reduce noise and to standardise and normalise the data; they also require some different statistical approaches for analysis because two samples are measured on the same slide, unlike in oligo-array technology. This poster describes procedures developed to treat such data, consisting of: (1) correction for background using special blank spots; (2) automatic outlier treatment using iteratively reweighted analysis to allow for a robust fit, similar to using medians; (3) a lowess fit to allow for dye bias in the ratios with varying intensity; (4) a procedure to identify poor duplicated values (one duplicate is made within slide) by fitting a heterogeneous variance contour to allow for increasing repeatability with increasing intensity; (5) fitting of a heterogeneous variance contour for sample values to allow for decreasing variance with increasing intensity, used to provide weights for a weighted analysis. The procedure is illustrated on a data set showing differences in gene expression levels between malabsorption syndrome-infected and control chickens.


**Hongzhe Li** (Rowe Program in Human Genetics, UC Davis School of Medicine) hli@ucdavis.edu

**Microarray Time Course Gene Expression Studies: Some Problems and Statistical Methods**

Slides: pdf

Since many biological systems and processes in human health and disease are dynamic, genome-wide gene expression levels measured over time can often provide more insight into such systems. Important examples include developmental processes, the cell cycle, and the regulation of circadian rhythm. The noisy nature of microarray data and the potential dependency of the gene expression measurements over time make the analysis of such microarray time course (MTC) gene expression data challenging. In this talk, I will present some problems and statistical methods for analyzing such MTC gene expression data. Some details will be given on methods for identifying genes with different time course expression profiles and methods for identifying periodically regulated genes.

**Wentian Li** (The Robert S Boas Center for Genomics and Human Genetics, North Shore LIJ Research Institute, USA) wli@watson.nslij-genetics.org http://www.nslij-genetics.org/wli

**Extreme-Value Distribution Based Gene Selection Criteria for Discriminant Microarray Data Analysis Using Logistic Regression** (poster session)

Joint work with Fengzhu Sun (Department of Biological Sciences, Molecular and Computational Biology Program, University of Southern California, USA) and Ivo Grosse (Bioinformatics Center Gatersleben-Halle, Institute for Plant Genetics and Crop Plant Research, Germany).

We present a calculation of the expected maximum likelihood and the p-value for the top gene selected by logistic regression. This calculation is based on the maximum likelihood of the null model and the extreme value distribution of chi-square variables. Based on this calculation, we propose two corresponding gene selection criteria: the E-criterion and the P-criterion. In the E-criterion, a gene is selected if its maximum likelihood is greater than that expected of the top gene under the null model. In the P-criterion, a gene is selected if its p-value according to the null distribution of the top gene is smaller than a pre-determined value. Both gene selection criteria are conservative because non-top-ranked genes are judged by the expected value of the top gene. As a result, a much more compact set of genes is selected.
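
For 1-df chi-square likelihood-ratio statistics the extreme-value idea behind the P-criterion has a simple closed form: if the m null statistics are independent, then P(max > x) = 1 - F(x)^m, where F is the chi-square CDF. A minimal sketch (the independence assumption and df = 1 are simplifications of this sketch; the paper's calculation is more refined):

```python
import math

def chi2_cdf_df1(x):
    """CDF of a 1-df chi-square via the error function:
    P(Z^2 <= x) = erf(sqrt(x/2)) for standard normal Z."""
    return math.erf(math.sqrt(x / 2.0))

def top_gene_pvalue(stat, m):
    """P-value of the maximum of m independent 1-df chi-square
    statistics: P(max > stat) = 1 - F(stat)^m."""
    return 1.0 - chi2_cdf_df1(stat) ** m
```

With m = 1 this reduces to the ordinary chi-square p-value; as m grows, the same observed statistic becomes far less surprising, which is exactly why judging every gene against the top-gene null makes the criterion conservative.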

References:

[1] W Li, I Grosse (2003), "Gene selection criterion for discriminant microarray data analysis based on extreme value distributions", in RECOMB03: Proceedings of the Seventh Annual International Conference on Computational Biology, pp. 217-223 (ACM Press).

[2] W Li, F Sun, I Grosse (2003), "Extreme-value distribution based gene selection criteria for discriminant microarray data analysis using logistic regression", submitted to Journal of Computational Biology.

**Adriana Lopez** (Department of Statistics, University of Pittsburgh, Pittsburgh, PA) adl5+@pitt.edu

**Cancer Tumor Classification Using Gene Expression Data** (poster session)

At the end of the 1990s, biotechnologies such as microarrays were developed, and their use in cancer research has increased because they can lead to more precise and reliable classification of cancer tumors. This research concerned discriminant analysis, or classification, of cancer tumors into previously known classes using gene expression data from microarrays, via kernel density estimation and combinations of classifiers based on this methodology. The technique was compared to other well-known discriminant analysis techniques using the misclassification proportion, estimated with training and test sets that were either fixed or obtained by a 2:1 sampling scheme. The fixed kernel classifiers and the adaptive kernel classifiers performed equally efficiently on the three data sets studied, and in general the kernel classifier was the best nonparametric classifier.

**Geoff McLachlan** (Department of Mathematics and the Institute of Molecular Bioscience, University of Queensland) gjm@maths.uq.edu.au

**Classification of Microarray Gene-Expression Data**

Slides: html pdf ps ppt

In the context of cancer diagnosis and treatment, we consider the problem of classifying a relatively small number of tumour tissue samples containing the expression data on very many (possibly thousands) of genes from microarray experiments. For the supervised problem where there are tumour samples of known classification, we discuss the need to correct for the selection bias in assessing the error rate of a prediction rule formed from a small subset of selected genes. We also consider the unsupervised problem where the aim is to cluster the tumour samples on the basis of the gene expressions. The associated problem of assessing the number of clusters is addressed. Attention is concentrated on the mixture model-based approach called EMMIX-GENE. Its performance is demonstrated on various microarray data sets available in the bioinformatics literature.
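
The selection-bias point can be made concrete: an honest error estimate must redo the gene selection inside every cross-validation fold, rather than selecting genes once on all the data. The sketch below uses a simple mean-difference score and a nearest-centroid rule as generic stand-ins for the talk's methods, and leave-one-out as the cross-validation scheme:

```python
import numpy as np

def loo_error_with_internal_selection(X, y, n_keep=5):
    """Leave-one-out error for a two-class problem where the gene
    subset is re-selected within each fold (the unbiased protocol).
    X: samples x genes; y: 0/1 labels."""
    n = len(y)
    errors = 0
    for i in range(n):
        tr = np.arange(n) != i
        Xt, yt = X[tr], y[tr]
        # score genes on the training fold only, never on sample i
        d = np.abs(Xt[yt == 0].mean(0) - Xt[yt == 1].mean(0))
        keep = np.argsort(d)[-n_keep:]
        c0 = Xt[yt == 0][:, keep].mean(0)
        c1 = Xt[yt == 1][:, keep].mean(0)
        xi = X[i, keep]
        pred = 0 if np.linalg.norm(xi - c0) <= np.linalg.norm(xi - c1) else 1
        errors += pred != y[i]
    return errors / n
```

Selecting genes on the full data first and cross-validating only the classifier afterwards can report near-zero error even on pure noise, which is the bias the talk warns against.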

**Peter J. Munson** (Mathematical and Statistical Computing Laboratory, DCB, CIT, NIH, DHHS) munson@helix.nih.gov

**Mining a Gene Expression Database**

The now widespread interest in gene expression is motivated by the promise of important new findings in the context of disease or basic biology research. Because of cost constraints, most designed studies are relatively small, involving from 2 to 100 chips. Pooling the results of studies allows one to compare expression across potentially thousands of conditions, with the promise of additional insights. The NIHLIMS database houses data from about 30 ongoing studies at NIH, comprising about 1500 Affymetrix chips, and provides a platform for testing data mining approaches.

Serious data comparability challenges are encountered here, some of which can be addressed with appropriate data normalization. We investigate factors which distinguish patterns of expression. In addition to many technical factors, the cell or tissue type from which mRNA is prepared seems to be a primary source of variability. As a consequence, tissue-specific genes can be identified by this approach. Limited demographic information may be available, permitting, for example, the determination of gender-specific gene expression patterns. In one particular study, the identification of tissue-specific genes in humans was compared to tissue-specific genes in rodents for the homologous tissue, allowing for an evolutionary comparison of the relevant expression mechanisms.

We discuss several of the statistical techniques needed to compare data across studies, and present a list of challenges now facing data miners.

**Ann L. Oberg** (Department of Health Sciences Research, Division of Biostatistics, The Mayo Clinic) Oberg.Ann@mayo.edu http://www.mayo.edu/hsr/people/oberg.html

**Joint Estimation of Calibration and Expression for High-density Oligonucleotide Arrays** (poster session)

Joint work with Karla V. Ballman, Douglas W. Mahoney, and Terry M. Therneau.

There is an increasing awareness that the analysis of high-density oligonucleotide arrays is better modeled as a holistic rather than a piecemeal process. Affymetrix software summarizes each chip (including scaling, background subtraction, and removal of outliers) separately, with the results of that summarization "passed forward" to the next stage of analysis. Li (2001) introduced a "model-based" analysis, where all chips for a given experimental condition were fit in a single model, giving a more complete and accurate picture of both data errors and the fit. Chu (2002) recently extended this idea, using a random-effects model to encompass all chips in an experiment at once. For all of these, however, normalization of the data is done as a separate prior process. We propose a method that integrates the normalization, visualized as chip specific calibration curves based on differential binding characteristics, along with model fitting incorporating experimental design in a unified algorithm. The ability to incorporate experimental design into both the normalization process and the fit leads to more efficient and less biased estimates of the tissue gene expressions. Affycomp results will be presented.

**Michael Ochs** (Fox Chase Cancer Center, Philadelphia, PA) m_ochs@fccc.edu

**Encoding Prior Biological Knowledge in Functional Genomics Analysis** Slides: html pdf ps ppt

Cancer is a leading cause of death throughout the world. The fundamental cellular biology underlying the development of cancer is extremely complex, since cancer arises from a myriad of different cellular malfunctions. It is clear, however, that cellular signaling pathways that control cell growth, differentiation, apoptosis, and motility play a critical role in many cancers. New technologies such as microarrays and protein arrays offer the possibility of elucidating key pathways involved in cancer and of monitoring the effect of targeted therapeutics on those pathways. However, because of the limited nature of our knowledge of signaling pathways in humans and high noise levels in the data, difficulties arise during analysis. The inclusion of prior knowledge can enhance probabilistic reasoning in such a case. Analysis of functional genomics data is especially suitable for the inclusion of prior information, since a vast framework of biological knowledge exists.

Bayesian Decomposition is a Markov chain Monte Carlo method that uses Bayesian statistics to encode prior knowledge. The inclusion of biological information both during the analysis and when interpreting patterns identified in the data has greatly increased the power of the algorithm. This is demonstrated with three separate data sets. First, the recovery of a pattern related to the yeast mating pathway is accomplished by use of annotations from the Yeast Proteome Database. Second, tissue identification in Black6 mice is used to isolate tissue-specific expression patterns that can be interpreted using gene ontology. Third, links between genes known to be coregulated in yeast are used to demonstrate the effect of such prior knowledge on the analysis.

**John Quackenbush** (Department of Mammalian Genomics, The Institute for Genomic Research (TIGR)) johnq@tigr.org

**Beyond Significance: Integrating Diverse Data Types to Extract Biological Meaning from Microarrays**

Slides: pdf

Microarray expression analysis has rapidly become a mainstay in functional genomics laboratories. With the rapid expansion of this field has come equally rapid advances in statistical analysis methodologies that have revolutionized the way we design experiments and analyze data. However, even the best designed, conducted, and analyzed experiments yield, at best, statistically significant lists of genes. The scientific challenge we now face is placing these genes into a broader biological context through the use of diverse ancillary information. I will present an overview of the problem with some examples of how we have integrated diverse data types to add biological meaning to expression measures.

**Marco
F. Ramoni**
(Assistant Professor of Pediatrics, Medicine, Oral Medicine,
Infection and Immunity, Harvard Medical School, Boston, MA 02115)
marco_ramoni@harvard.edu
http://chip.tch.harvard.edu/people/marco

**Bayesian
Methods for Microarray Data Analysis** Slides:
pdf

Data produced by microarray experiments - measuring thousands of genes with limited replicates - present unparalleled opportunities to understand the global behavior of the genome, along with unprecedented analytical challenges. This talk will introduce a general Bayesian framework able to provide coherent solutions to some critical problems of microarray data analysis and to open new, unexplored avenues of discovery. The talk will start by describing a Bayesian approach to the analysis of comparative experiments that delivers high sensitivity and superior reproducibility. It will then describe a Bayesian solution to clustering gene expression data and introduce a principled probabilistic criterion for automatically identifying the optimal number of clusters underlying a set of microarray experiments. It will also show how this clustering method can be naturally extended to profile the temporal behavior of gene expression dynamics. Finally, the talk will take this Bayesian framework one step further and show how it can be used to dissect the regulatory mechanisms of gene expression using a new class of Bayesian networks, called Generalized Gamma Networks, specifically designed to handle the peculiar distributional nature of microarray data and the non-linearity of gene expression control.
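The abstract does not spell out its criterion for choosing the number of clusters, but the general Bayesian idea of comparing partitions by marginal likelihood can be sketched with a conjugate normal model (known variance, normal prior on each cluster mean; all hyperparameter values and data below are invented):

```python
import math

def log_marginal(y, sigma2=1.0, mu0=0.0, tau2=100.0):
    """Log marginal likelihood of one cluster under
    y_i ~ N(mu, sigma2), mu ~ N(mu0, tau2), sigma2 known.
    Derived by integrating out mu: ybar ~ N(mu0, sigma2/n + tau2)."""
    n = len(y)
    ybar = sum(y) / n
    s = sum((v - ybar) ** 2 for v in y)    # within-cluster scatter
    v = sigma2 / n + tau2                  # prior-predictive var of ybar
    return (-0.5 * n * math.log(2 * math.pi * sigma2)
            - s / (2 * sigma2)
            + 0.5 * math.log(2 * math.pi * sigma2 / n)
            - 0.5 * math.log(2 * math.pi * v)
            - (ybar - mu0) ** 2 / (2 * v))

low  = [-5.1, -4.9, -5.0, -5.2]   # toy expression values, group 1
high = [ 5.0,  4.8,  5.1,  4.9]   # toy expression values, group 2

one_cluster  = log_marginal(low + high)
two_clusters = log_marginal(low) + log_marginal(high)
# With well-separated groups the two-cluster partition has
# far higher evidence, so the criterion selects two clusters.
```

The same comparison extends to any candidate number of clusters: each partition's evidence is the product of its per-cluster marginals, and the partition with the highest evidence wins.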

**Lídia
Rejtö** (Statistics
Program, University of Delaware, 214 Townsend Hall, Newark,
DE 19717-1303, USA) rejto@udel.edu

**Bayesian
Analysis of Microarrays** (poster session)

Microarray technology enables the assessment of expression patterns of thousands of genes over time and under multiple conditions. Analyzing these patterns requires detecting whether observed differences in expression levels are significant. To perform the analysis, one must first normalize the data. Here we present a stochastic model that offers a method both to normalize the data and to detect differentially expressed genes. The model can handle more than two experimental conditions as well as time-series experiments.

We construct a model to describe the stochastic relationship between the real and the measured gene-expression levels. We introduce a Bayesian component, which assumes a prior probability for the event that the real expression levels differ under different conditions. The prior probability of the Bayesian component is estimated, together with the other model parameters, by the maximum-likelihood method. Given the estimated model parameters, we estimate the real gene-expression levels as conditional expectations. Furthermore, for each gene the posterior probability of differential expression is given. We estimated the variances of the estimates of the model parameters by bootstrapping. The fitted parametric model was validated by verifying differential gene expression with real-time quantitative RT-PCR (qRT-PCR) analysis. The comparison shows that the stochastic model is adequate for identifying differentially expressed genes on microarrays.
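Once the mixture components and the prior are estimated, the per-gene posterior probability of differential expression is a one-line Bayes computation. A toy sketch with an assumed two-component normal mixture on log-ratios (the parameter values are invented, not the authors' fitted model):

```python
import math

def normal_pdf(x, mu, var):
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def posterior_de(d, p1, var0, var1):
    """Posterior probability that a gene is differentially expressed,
    given its observed log-ratio d, prior probability p1 of DE,
    a tight null component (var0) and a wide DE component (var1)."""
    f0 = normal_pdf(d, 0.0, var0)          # null: no real change
    f1 = normal_pdf(d, 0.0, var1)          # DE: inflated spread
    return p1 * f1 / ((1 - p1) * f0 + p1 * f1)

# A gene with a large log-ratio gets a posterior near 1, a gene
# near zero gets a posterior near the (small) prior.
p_big   = posterior_de(3.0, p1=0.05, var0=0.25, var1=4.0)
p_small = posterior_de(0.1, p1=0.05, var0=0.25, var1=4.0)
```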

The software BAM (Bayesian Analysis of Microarrays) is available online at http://udgenome.ags.udel.edu/~cogburn/Gene_Expression_Studies.htm or by contacting bukszar@eecis.udel.edu.

Joint work with Gábor Tusnády¹, József Bukszár², Guang Gao² and Larry Cogburn³.

¹ Alfréd Rényi Mathematical Institute of the Hungarian Academy of Sciences, Budapest, P.O. Box 127, H-1364, Hungary

² Delaware Biotechnology Institute, 15 Innovation Way, Newark, DE 19711, USA

³ University of Delaware, Department of Animal and Food Sciences, Newark, DE 19717, USA

**David
M. Rocke** (Department of Applied Science (College
of Engineering), Division of Biostatistics (School of Medicine),
and Center for Image Processing and Integrated Computing, University
of California, Davis) dmrocke@ucdavis.edu
http://www.cipic.ucdavis.edu/~dmrocke

**Measurement
Errors and Data Transformation for Gene Expression Data, Proteomics
and Metabolomics Data **

Slides: pdf

Gene expression microarrays comprise a suite of related technologies for measuring the expression of thousands of genes simultaneously from a single biological sample. There are also numerous other high-throughput biological assays that can measure large numbers of proteins, lipids, and other biologically active compounds. In this talk, I will describe an important statistical challenge in the use of such data. Using raw data, logarithms, or ratios, the variability of the measurements is strongly dependent on the level of expression, causing a failure of the assumptions of most standard methods of statistical analysis. We present a solution to this problem via a specially tuned data transformation and show how it promotes the effectiveness of simple and sophisticated analyses of the data.
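One transformation of this kind is the generalized log (glog), which behaves like a logarithm at high intensities but stays nearly linear near zero. The simulation sketch below uses a two-component error model of the form y = α + μ·e^η + ε; all parameter values are invented for illustration, and the tuning constant is taken as c = σ_ε/σ_η:

```python
import math
import random

random.seed(0)

def glog(z, c):
    """Generalized log: ~log(2z) for large z, nearly linear
    (and finite, even for z < 0) around zero."""
    return math.log(z + math.sqrt(z * z + c * c))

def simulate(mu, n=2000, alpha=100.0, s_eta=0.2, s_eps=20.0):
    """Two-component error model y = alpha + mu*exp(eta) + eps,
    eta ~ N(0, s_eta^2), eps ~ N(0, s_eps^2) (values invented)."""
    return [alpha + mu * math.exp(random.gauss(0, s_eta))
            + random.gauss(0, s_eps) for _ in range(n)]

def sd(xs):
    m = sum(xs) / len(xs)
    return math.sqrt(sum((x - m) ** 2 for x in xs) / (len(xs) - 1))

levels = [50.0, 500.0, 5000.0]
raw_sds = [sd(simulate(mu)) for mu in levels]
# Subtract the background alpha and apply glog with c = s_eps/s_eta:
glog_sds = [sd([glog(y - 100.0, 100.0) for y in simulate(mu)])
            for mu in levels]
# The raw SD grows ~50-fold across the levels;
# the glog SD is nearly constant (close to s_eta).
```

After the transformation, the measurement variance no longer depends on the expression level, so standard constant-variance statistical methods apply.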

**Hae-Hiang
Song** (Department of Biostatistics, The Catholic University
of Korea, Seoul 137-701, Korea ) hhsong@catholic.ac.kr

**Statistical
Inference Methods for Detecting Altered Gene Associations**
(poster session)

Joint work with Sang-Heon Yoon and Je-Suk Kim.

In many gene expression studies, the assumption is that knowing where and when a gene is expressed carries important information about what the gene does. We consider the problem of understanding gene function with microarray expression data from histologically progressive grades, starting from dysplastic nodules in cirrhotic liver and progressing to hepatocellular carcinoma Edmondson grade III. The statistical procedures are divided into two parts. First, the microarray data are suitably normalized, including by a method based on analysis of variance (ANOVA). Opinions on the currently used normalization methods are quite diverse, so before proceeding to the second part of the analysis, the study of gene-pair associations, these normalization methods must be compared. On the assumption that the union of the significant gene sets from these normalization methods forms a sufficiently general and well-defined set of differentially expressed genes, the second part searches for evidence of gene-gene relationships that are altered with progression of the disease. Significantly altered gene-pair associations are identified with the ratio of gene-pair correlations. The phrase "difference between normal and tumor expression patterns," taken broadly, covers not only the information summarized by the first moment (average expression levels) but also correlation changes between the two stages, and the exploration can continue to higher-order moments. The need to study changes in association arises naturally when analyzing gene expression levels from multiple arrays obtained at different stages of progression. We identify altered gene-gene relationships with replicated microarray expression data.

Keywords: oligonucleotide array, normalization, correlation ratio statistic, hepatic nodular lesions
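The correlation-ratio idea can be sketched directly: estimate a gene pair's correlation separately from the normal and tumor replicates and compare the two. A toy example with invented replicate values in which the association flips sign:

```python
import math

def pearson(x, y):
    """Sample Pearson correlation of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = math.sqrt(sum((a - mx) ** 2 for a in x)
                    * sum((b - my) ** 2 for b in y))
    return num / den

# Replicate expression values for one gene pair (all invented):
g1_normal = [1.0, 2.0, 3.0, 4.0, 5.0]
g2_normal = [1.1, 2.0, 2.9, 4.2, 5.0]   # tracks g1 in normal tissue
g1_tumor  = [1.0, 2.0, 3.0, 4.0, 5.0]
g2_tumor  = [5.0, 4.1, 3.2, 1.9, 1.0]   # association reversed in tumor

r_normal = pearson(g1_normal, g2_normal)
r_tumor  = pearson(g1_tumor, g2_tumor)
ratio = r_tumor / r_normal   # a ratio far from 1 flags an altered pair
```

Note that a raw ratio is unstable when the denominator correlation is near zero; the authors' actual test statistic is not given in the abstract, and in practice correlations are often compared on the Fisher z scale instead.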

**Terry
Speed** (WEHI, Melbourne and UC Berkeley) terry@wehi.edu.au:

**Mining
a Tandem Mass Spectrometry Database to Determine the Trends
and Global Factors Influencing Peptide Fragmentation**

Slides: html
pdf
ps
ppt

Statistical and non-statistical methods have been used to analyse the gas-phase fragmentation behavior of protonated peptides, by mining a database of several thousand unique product-ion spectra derived from tryptic digestion and low-energy collision-induced dissociation in a quadrupole ion trap mass spectrometer. This bioinformatic approach has resulted in the derivation of a "relative proton mobility scale" that takes into account both the charge state and the amino acid composition of a peptide, and provides an effective classification system for categorizing peptide MS/MS spectra for subsequent data mining and statistical analysis. We show that the most important factor influencing fragmentation is proton mobility, and that peptides classified as non-mobile generally give scores below currently acceptable thresholds with current MS/MS search algorithms. Amino acid residue preferences for N- and/or C-terminal cleavage have been quantified in accordance with the proton mobility scale, and the trends determined are predictable based on an analysis of the most abundant cleavage sites. (Joint work with Eugene A. Kapp, Frédéric Schütz and Richard J. Simpson)

**Mahlet
G. Tadesse** (Department of Statistics, Texas A&M University)
mtadesse@stat.tamu.edu http://www.stat.tamu.edu/~mtadesse

**A
Bayesian Method for Class Discovery and Gene Selection** (poster
session)

Joint work with Naijun Sha and Marina Vannucci.

The analysis of the high-dimensional data (p >> n) generated by DNA microarrays poses a challenge to standard statistical methods, and has revived strong interest in clustering algorithms. A typical goal in these analyses is the discovery of new classes of disease and the identification of relevant genes. Currently, investigators resort to data filtering procedures or dimension reduction techniques before clustering the data. In addition, the clustering algorithms that are widely used do not provide an objective way to assess the number of classes. We propose a Bayesian method which simultaneously identifies the number of clusters in the data and selects the genes that best discriminate the different groups.

**Terry
M. Therneau **(Division of Biostatistics, Mayo Clinic)
therneau@mayo.edu

**Joint
Calibration and Fitting of Microarray Data**

Slides:
pdf
ps

Figure: description
pdf
ps

Joint work with Karla Ballman and Ann Oberg.

In biological assays it is common to have a "logistic" shaped dose-response curve, where the horizontal axis is the true level of the material we are trying to measure, and the vertical axis is the value derived from the assay. In ELISA assays, it is common to put known controls in several of the wells to estimate the calibration curve for a given plate directly. The analysis issues have long been known as well; see for instance D.J. Finney's tutorial paper on radioligand assay (Biometrics, 1976). The non-linearity is most severe when an assay spans a wide range; with values from 20 to 20,000 we would expect microarrays to be particularly affected. Plots of log(dose) vs. log(response) from the Affymetrix and Gene Logic spike-in data sets show precisely this shape, completely in agreement with Finney's observations.

If the calibration curve for each chip were known, the appropriate normalization would be clear. We fit models that alternate between estimation of the true level for each probe, using a linear model incorporating the experimental design of the study; estimation of the per-chip calibration curves from a plot of "true" vs. observed values; normalization of the data based on the calibration curves; refitting of the linear model; and so on. When the linear model is particularly simple, containing only an intercept per probe, this turns out to be equivalent to the cyclic loess method of normalization (but computationally much faster).
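The alternating scheme can be sketched with a per-chip linear calibration standing in for the loess curve, and a probe-mean model standing in for the full design-based linear model; everything below is an illustrative simplification, not the authors' implementation:

```python
def fit_alternating(Y, n_iter=20):
    """Alternate between (1) per-chip calibration y = a_j + b_j*theta_i
    (a linear stand-in for the per-chip loess curve) and
    (2) re-estimating each probe's true level theta_i.
    Y is a probes x chips matrix (list of lists)."""
    n_probe, n_chip = len(Y), len(Y[0])
    theta = [sum(row) / n_chip for row in Y]   # init: probe means
    a = [0.0] * n_chip
    b = [1.0] * n_chip
    for _ in range(n_iter):
        # (1) per-chip calibration: regress chip values on theta
        tm = sum(theta) / n_probe
        sxx = sum((t - tm) ** 2 for t in theta)
        for j in range(n_chip):
            yj = [Y[i][j] for i in range(n_probe)]
            ym = sum(yj) / n_probe
            sxy = sum((t - tm) * (y - ym) for t, y in zip(theta, yj))
            b[j] = sxy / sxx
            a[j] = ym - b[j] * tm
        # (2) re-estimate probe levels from calibrated values
        for i in range(n_probe):
            theta[i] = sum((Y[i][j] - a[j]) / b[j]
                           for j in range(n_chip)) / n_chip
    return theta, a, b

# Noise-free toy data: 6 probes, 3 chips with different (offset, slope).
theta_true = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
chips = [(0.0, 1.0), (5.0, 2.0), (-1.0, 0.8)]
Y = [[aj + bj * t for aj, bj in chips] for t in theta_true]
theta, a, b = fit_alternating(Y)
```

Note the model is only identified up to an affine rescaling of theta, which is why such procedures fix a reference scale in practice; here the recovered theta is an increasing affine image of the truth.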

The exciting aspect of this formulation is that it provides a framework in which other aspects of the array can be incorporated, e.g., joint use of the PM and MM probes, or biochemical data on predicted background binding affinity.

**Achim
Tresch** (Department of Bioinformatics, Fraunhofer
Institute for Algorithms and Scientific Computing (SCAI), Schloss
Birlinghoven 53754 Sankt Augustin Germany) gieger@scai.fhg.de
http://www.scai.fraunhofer.de/profil/mitarbeiter/gieger.html

**Using
Text Mining Networks for the Context Specific Interpretation
of Expression Data** (poster session)

Joint work with Christian Gieger, Daniel Hanisch, Juliane Fluck, Hartwig Deneke (Fraunhofer-Institute SCAI, Sankt Augustin, Germany), Tobias Mittelstädt, and Albert Becker (Institute for Neuropathology, University Hospital, Bonn, Germany).

Gene expression data are most often analysed without using biomedical a priori knowledge. Including metabolic, regulatory, or protein-protein interaction networks in the analysis process itself provides a way to put the results of expression experiments into a biological context. Unfortunately, network information stored in databases is often incomplete or not specific enough with respect to certain species or cell types. For this reason, we developed text mining methods for constructing interaction networks from biomedical free text. These methods were applied to the complete set of MEDLINE abstracts and yielded a substantial network of protein relations. The method uses an automatically generated and curated gene/protein dictionary together with a biomedical grammar that defines rules for extracting concepts describing relevant relations between genes/proteins and other biological entities. The resulting text mining network can be used for explorative data analysis by mapping the results of gene expression experiments onto the network. For this purpose, the ToPNet application was developed. Besides its visualization capabilities, it is able to identify sub-networks relevant to observed expression patterns by applying a new method called Significant Area Search. Our approach was successfully applied to data from two sets of gene expression experiments in the context of epilepsy and brain cancer research.
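The Significant Area Search algorithm itself is not described in the abstract, but the generic task of finding an expression-relevant sub-network can be sketched as a greedy search that grows a connected node set while the score keeps improving (the graph, node scores, and stopping rule below are all invented stand-ins):

```python
def greedy_subnetwork(edges, score, seed):
    """Grow a connected sub-network from `seed` by repeatedly adding
    the neighboring node with the best positive score.
    A generic stand-in, NOT the published Significant Area Search."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    sub = {seed}
    while True:
        frontier = {n for m in sub for n in adj.get(m, ()) if n not in sub}
        gains = {n: score[n] for n in frontier}
        if not gains or max(gains.values()) <= 0:
            return sub
        sub.add(max(gains, key=gains.get))

edges = [("A", "B"), ("B", "C"), ("C", "D"), ("B", "E")]
# Node scores, e.g. signed relevance derived from the expression
# data (all invented): positive = differentially expressed.
score = {"A": 2.0, "B": 1.5, "C": -0.5, "D": 3.0, "E": -1.0}
module = greedy_subnetwork(edges, score, seed="A")   # -> {"A", "B"}
```

A purely greedy rule stops at negative-scoring neighbors and so misses the high-scoring "D" behind "C"; real sub-network methods address this with significance testing or stochastic search.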

**Mark
van der Laan** (Division of Biostatistics, School of
Public Health, University of California, Berkeley) laan@stat.Berkeley.EDU

**Prediction
of Survival** Slides:
pdf

We propose a unified method for cross-validation which also applies to censored data, and propose a new deletion/substitution/addition algorithm for nonparametric multivariate regression. This combination provides us with a new black-box algorithm for multivariate regression on censored and uncensored outcomes. We show that the cross-validation selection procedure satisfies an oracle property in the sense that it performs asymptotically as well as the best possible selector when given the true data generating distribution. We also provide the finite sample properties of this procedure. In addition, we study the properties of the deletion/substitution/addition algorithm in simulations. We apply the method to detect binding sites in yeast gene expression experiments, and predict survival in cancer data sets.
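The cross-validation selection step, choosing among candidate estimators by estimated risk, can be sketched for uncensored outcomes as follows (both candidate estimators and all data below are invented stand-ins for the algorithm's candidates; the censored-data extension in the abstract is not shown):

```python
import random

random.seed(1)

def cv_risk(fit, predict, data, folds=5):
    """Mean squared prediction error estimated by V-fold cross-validation."""
    data = list(data)
    random.shuffle(data)
    chunks = [data[i::folds] for i in range(folds)]
    total, count = 0.0, 0
    for v in range(folds):
        train = [p for i, c in enumerate(chunks) if i != v for p in c]
        model = fit(train)
        for x, y in chunks[v]:
            total += (predict(model, x) - y) ** 2
            count += 1
    return total / count

def fit_mean(d):                     # candidate 1: constant predictor
    return sum(y for _, y in d) / len(d)

def pred_mean(m, x):
    return m

def fit_line(d):                     # candidate 2: least-squares line
    n = len(d)
    mx = sum(x for x, _ in d) / n
    my = sum(y for _, y in d) / n
    b = (sum((x - mx) * (y - my) for x, y in d)
         / sum((x - mx) ** 2 for x, _ in d))
    return my - b * mx, b

def pred_line(m, x):
    return m[0] + m[1] * x

# Data with a real linear trend: the selector picks the line.
data = [(float(x), 2.0 * x + random.gauss(0, 0.5)) for x in range(30)]
risk_mean = cv_risk(fit_mean, pred_mean, data)
risk_line = cv_risk(fit_line, pred_line, data)
```

The oracle property in the abstract says, roughly, that this risk-based selection performs asymptotically as well as picking the best candidate with knowledge of the true data-generating distribution.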

**Yee
Hwa (Jean) Yang** (Division of Biostatistics, University
of California, San Francisco) jean@biostat.ucsf.edu

**Comparing
Normalization Methods Based on Splice Array Experiments**
Slides: pdf

There are many sources of systematic variation in microarray experiments that affect measured gene expression levels. Normalization is the term used to describe the process of removing such variations. In this talk, I will describe a set of experiments based on splice-specific microarrays. These arrays provide a basis to investigate the effect of mutations and other factors on splicing events in the creation of mature mRNA. In particular, the design of these arrays provides a platform for comparing the performance of different normalization methods.

**Kenny
Q. Ye** (Department of Applied Mathematics and Statistics,
SUNY at Stony Brook, Stony Brook, New York, 11794-3600. (631)632-9344,
(631)632-8490(FAX)) kye@ams.sunysb.edu

**Pooling
or not Pooling in Microarray Experiments - an Experimental Design
Point of View **(poster session)

Joint work with Anil Dhundale, Department of Biomedical Engineering and Center for Biotechnology, SUNY at Stony Brook, Stony Brook, New York 11794-2580, (631) 632-8521, anil.dhundale@sunysb.edu

Microarray experiments are often used to detect differences in gene expression between two populations of cells, a test population versus a control population. In many cases, however, such as among individuals in a population, biological variability can introduce changes that are irrelevant to the question of interest, and it then becomes important to assay many individual samples to collect statistically meaningful results. Unfortunately, the cost of performing some types of microarray experiments can be prohibitive. A potentially effective but not well publicized alternative is to pool individual RNA samples together for hybridization on a single microarray. This method can dramatically reduce the experimental costs while maintaining high power to detect the changes in expression levels that relate to the specific question of interest. In this talk, we will discuss why this technique works and the optimal design strategy for pooling. The idea will also be illustrated by a synthetic experiment and a real experiment studying cardiac atrial fibrillation (AFib), a serious health condition that affects a large percentage of the population but remains mechanistically not well understood.
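Why pooling works can be seen from standard two-component variance arithmetic: pooling averages out biological variability before the (per-array) technical noise is added. A minimal sketch, with illustrative variance numbers that are not from the talk:

```python
def var_of_design(n_subjects, n_arrays, var_bio, var_tech):
    """Variance of the estimated mean expression when n_subjects
    individuals are split evenly into pools, one pool per array
    (n_subjects must be divisible by n_arrays in this sketch)."""
    k = n_subjects // n_arrays       # subjects per pool
    # Each array measures a pool mean: biological variance is
    # reduced by k, technical variance is paid once per array.
    per_array = var_bio / k + var_tech
    return per_array / n_arrays

# 30 subjects, one each on 30 arrays, vs. the same 30 subjects
# pooled six-per-array onto only 5 arrays:
v_individual = var_of_design(30, 30, var_bio=1.0, var_tech=0.25)
v_pooled     = var_of_design(30,  5, var_bio=1.0, var_tech=0.25)
```

With these (invented) numbers the pooled design uses one sixth as many arrays while its variance only doubles, so for biology-dominated variability pooling buys most of the precision at a fraction of the array cost.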

**Ming
Yuan**
(Department of Statistics, University of Wisconsin, Madison)
yuanm@stat.wisc.edu

**Hidden
Markov Models for Microarray Time Course Data in Multiple Biological
Conditions** (poster session)

Motivated by several real applications, an approach is proposed to compare expression profiles from different biological conditions over time. It is based on a hidden Markov model (HMM) with states corresponding to expression patterns across conditions. To investigate properties of the proposed approach, we have implemented the HMM assuming a parametric hierarchical mixture model for the emissions, here intensities. As shown in simulation studies comparing the HMM approach to one which simply overlooks the correlation over time, both the sensitivity and the specificity increase substantially without sacrificing the false discovery rate. I will present a detailed analysis of the methodology and its performance.

This is joint work with Prof. Kendziorski (see her talk on Thursday).
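The gain from modeling correlation over time can be seen in a small forward-algorithm sketch for a two-state Gaussian-emission HMM (all parameter values and observations below are invented): a series whose condition differences come in runs scores higher under persistent transitions than the same values in a scrambled order.

```python
import math

def forward_loglik(obs, pi, trans, means, var):
    """Log-likelihood of a time course under a 2-state HMM with
    Gaussian emissions, via the scaled forward algorithm."""
    def emit(x, s):
        return (math.exp(-(x - means[s]) ** 2 / (2 * var))
                / math.sqrt(2 * math.pi * var))
    alpha = [pi[s] * emit(obs[0], s) for s in range(2)]
    c = sum(alpha)
    loglik = math.log(c)
    alpha = [a / c for a in alpha]
    for x in obs[1:]:
        alpha = [emit(x, s) * sum(alpha[r] * trans[r][s] for r in range(2))
                 for s in range(2)]
        c = sum(alpha)              # rescale to avoid underflow
        loglik += math.log(c)
        alpha = [a / c for a in alpha]
    return loglik

# States: 0 = "equally expressed", 1 = "differentially expressed".
pi = [0.5, 0.5]
trans = [[0.9, 0.1], [0.1, 0.9]]    # expression patterns persist in time
means, var = [0.0, 2.0], 1.0

obs_run = [0.1, -0.2, 1.9, 2.2, 2.1]   # differences arrive as a run
obs_alt = [0.1, 1.9, -0.2, 2.2, 2.1]   # same values, scrambled order
ll_run = forward_loglik(obs_run, pi, trans, means, var)
ll_alt = forward_loglik(obs_alt, pi, trans, means, var)
```

A model that treats time points as independent assigns both orderings the same likelihood; the HMM's transition structure is what rewards the biologically plausible run, which is the source of the sensitivity gains reported above.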