February 13 - 17, 2012
Keywords of the presentation: Sparse covlutions, polynomials, finite fields, length reduction.
Efficient handling of sparse data is a key challenge in Computer Science. Binary convolutions, such as the Fast Fourier Transform or theWalsh Transform are a useful tool in many applications and are efficiently solved.
In the last decade, several problems required efficient solution of sparse binary convolutions.
Both randomized and deterministic algorithms were developed for efficiently computing the sparse FFT. The key operation in all these algorithms was length reduction. The sparse data is mapped into small vectors that preserve the convolution result. The reduction method used to-date was the modulo function since it preserves location (of the ”1” bits) up to cyclic shift.
In this paper we present a new method for length reduction - polynomials. We show that this method allows a faster deterministic computation of the sparse FFT than currently known in the literature. It also enables the development of an efficient algorithm for computing the binary sparse Walsh Transform. To our knowledge, this is the first such algorithm in the literature.
(Joint work with Oren Kappah, Ely Porat, and Amir Rothschild)
Keywords of the presentation: RNA structure characterization, high-throughput sequencing, maximum likelihood estimation
New regulatory roles continue to emerge for both natural and engineered noncoding RNAs, many of which have specific secondary and tertiary structures essential to their function. This highlights a growing need to develop technologies that enable rapid and accurate characterization of structural features within complex RNA populations. Yet, available structure characterization techniques that are reliable are also vastly limited by technological constraints, while the accuracy of popular computational methods is generally poor. These limitations thus pose a major barrier to the comprehensive determination of structure from sequence and thereby to the development of mechanistic understanding of transcriptome dynamics. To address this need, we have recently developed a high-throughput structure characterization technique, called SHAPE-Seq, which simultaneously measures quantitative, single nucleotide-resolution, secondary and tertiary structural information for hundreds of RNA molecules of arbitrary sequence. SHAPE-Seq combines selective 2’-hydroxyl acylation analyzed by primer extension (SHAPE) chemical mapping with multiplexed paired-end deep sequencing of primer extension products. This generates millions of sequencing reads, which are then analyzed using a fully automated data analysis pipeline. Previous bioinformatics methods, in contrast, are laborious, heuristic, and expert-based, and thus prohibit high-throughput chemical mapping.
In this talk, I will review recent developments in experimental RNA structure characterization as well as advances in sequencing technologies. I will then describe the SHAPE-Seq technique, focusing on its automated data analysis method, which relies on a novel probabilistic model of a SHAPE-Seq experiment, adjoined by a rigorous maximum likelihood estimation framework. I will demonstrate the accuracy and simplicity of our approach as well as its applicability to a general class of chemical mapping techniques and to more traditional SHAPE experiments that use capillary electrophoresis to identify and quantify primer extension products.
This is joint work with Lior Pachter, Julius Lucks, Stefanie Mortimer, Shujun Luo, Cole Trapnell, Gary Schroth, Jennifer Doudna and Adam Arkin.
Keywords of the presentation: Group testing, Explicit constructions
The basic goal in combinatorial group testing is to identify a set of up to d defective items within a large population of size n >> d using a pooling strategy. Namely, the items can be grouped together in pools, and a single measurement would reveal whether there are one or more defectives in the pool. The threshold model is a generalization of this idea where a measurement returns positive if the number of defectives in the pool exceeds a fixed threshold u, negative if this number is below a fixed lower threshold L 0.
Read More...
Keywords of the presentation: function evaluation, competitive analysis, Boolean functions, decision trees, adaptive testing, non-uniform costs
The problem of evaluating a function by sequentially testing a subset of variables whose values uniquely identify the function’s value arises in several domains of computer science.
A typical example is from the field of automatic diagnosis in Artificial Intelligence. Automatic diagnosis systems employ a sequence of tests and based on their outcomes, they provide a diagnosis (e.g., a patient has cancer or not, a component of a system is failing or not). In general, it is not necessary to perform all the available tests in order to determine the correct diagnosis. Since different tests might incur also very different costs, it is reasonable to look for testing procedures that minimize the cost incurred to produce the diagnosis.
Another example is query optimization, a major issue in the area of databases. Query optimization refers to the problem of defining strategies to evaluate queries in order to minimize the user response time. In a typical database query, thousands of tuples are scanned to determine which of them match the query. By carefully defining a strategy to evaluate the attributes, one can save a considerable amount of computational time. In general, it is not necessary to evaluate all attributes of a tuple in order to determine whether it matches the query or not and a smart choice of the attributes to evaluate first can avoid very expensive and/or redundant attribute evaluations.
This talk will focus on the function evaluation problem where tests have non-uniform costs and competitive analysis is employed for measuring the algorithm's performance. For monotone Boolean functions we will present an optimal algorithm, i.e., with the best possible competitiveness. We will also discuss the case of some classes of non-monotone functions.
Keywords of the presentation: Algorithms, Cryptography, Corruption-Localizing Hashing, Group Testing, Superimposed Codes.
Efficient detection of integrity violations is crucial for the reliability of both data at rest and data in transit. In addition to detecting corruptions, it is often desirable to have the capability of obtaining information about the location of corrupted data blocks. While ideally one would want to always find all changes in the corrupted data, in practice this capability may be expensive, and one may be content with localizing or finding a superset of any changes. Corruption-localizing hashing is a cryptographic primitive that enhances collision-intractable hash functions thus improving the detection property of these functions into a suitable localization property. Besides allowing detection of changes in the input data, corruption-localizing hash schemes also obtain some superset of the changes location, where the accuracy of this superset with respect to the actual changes is a metric of special interest, called localization factor.
In this talk we will address the problem of designing corruption-localizing hash schemes with reduced localization factor and with small time and storage complexity. We will show that this problem corresponds to a particular variant of nonadaptive group testing, and illustrate a construction technique based on superimposed codes.
This group testing approach allowed to obtain corruption-localizing hash schemes that greatly improve on previous constructions.
In particular, we will present a corruption-localizing hash scheme that achieves an arbitrarily small localization factor
while only incurring a sublinear time and storage complexity.
In contrast to the classical goal of group testing we want to find onw defective elements of $D$
defective elements. We examine four
different test functions. We give adaptive strategies and lower bounds for the number of tests.
We treat the cases if the number of defectives are known and if the number of defectives are bounded.
Given a binary matrix $M$, find submatrix with the same number of columns and minimum
number of raws such that all boolean sums of at most $d$ columns are distinct. We show
that for any fixed $d$, this problem has no polynomial-time $o(log n)$-approximation
unless NP-P. Meanwhile, some approximation algorithms are also presented.
Modern non-invasive brain imaging technologies, such as diffusion weighted magnetic resonance imaging
(DWI), enable the mapping of neural fiber tracts in the white matter, providing a basis to reconstruct a detailed
map of brain structural connectivity networks. Brain connectivity networks differ from random networks
in their topology, which can be measured using small worldness, modularity, and high-degree
nodes (hubs). Still, little is known about how individual differences in structural brain network properties relate
to age, sex, or genetic differences. Recently, some groups have reported brain network biomarkers that
enable differentiation among individuals, pairs of individuals, and groups of individuals. In addition to studying
new topological features, here we provide a unifying general method to investigate topological brain networks
and connectivity differences between individuals, pairs of individuals, and groups of individuals at
several levels of the data hierarchy, while appropriately controlling false discovery rate (FDR) errors. We
apply our new method to a large dataset of high quality brain connectivity networks obtained from High Angular
Resolution Diffusion Imaging (HARDI) tractography in 303 young adult twins, siblings, and unrelated
people. Our proposed approach can accurately classify brain connectivity networks based on sex (93% accuracy)
and kinship (88.5% accuracy). We find statistically significant differences associated with sex and kinship
both in the brain connectivity networks and in derived topological metrics, such as the clustering
coefficient and the communicability matrix.
In the past few years, we have experienced a paradigm shift in human genetics. Accumulating lines of evidence have highlighted the pivotal role of rare genetic variations in a wide variety of traits and diseases. Studying rare variations is a needle in a haystack problem, as large cohorts have to be assayed in order to trap the variations and gain statistical power. The performance of DNA sequencing is exponentially growing, providing sufficient capacity to profile an extensive number of specimens. However, sample preparation schemes do not scale as sequencing capacity. A brute force approach of preparing hundredths to thousands of specimens for sequencing is cumbersome and cost-prohibited. The next challenge, therefore, is to develop a scalable technique that circumvents the bottleneck in sample preparation.
My tutorial will provide background on rare genetic variations and DNA sequencing. I will present our sample prep strategy, called DNA Sudoku, that utilizes combinatorial pooling/compressed sensing approach to find rare genetic variations. More importantly, I will discuss several major distinction from the classical combinatorial due to sequencing specific constraints.
Keywords of the presentation: high-throughput sequencing, combinatorial pooling, liquid-handling, compressed sensing
Finding rare genetic variations in large cohorts requires tedious preparation of large numbers of specimens for sequencing. We are developing a solution, called DNA Sudoku, to reduce prep time and increase the throughput of samples. By using a combinatorial pooling approach, we multiplex specimens and then barcode the pools, rather than individuals, for sequencing in a single lane on the Illumina platform.
We have developed a protocol for quantifying, calibrating, and pooling DNA samples using a liquid-handling robot, which has required a significant amount of testing in order to reduce volume variation. I will discuss our protocol and the steps we have taken to reduce CV. For accurate decoding and to reduce the possibility of specimen dropout, it is important that the DNA samples are accurately quantified and calibrated so that equal amounts can be pooled and sequenced. We can determine the number of carriers in each pool from sequencing output and reconstruct the original identity of individual specimens based on the pooling design, allowing us to identify a small number of carriers in a large cohort.
There are many instances in Coding Theory when codewords must be restored from
partial information, like defected data (error correcting codes), or some superposition of the strings. These lead to superimposed codes, a close relative of group testing problems.
There are lots of versions and related problems, like Sidon sets, sum-free sets, unionfree families, locally thin families, cover-free codes and families, etc. We discuss two cases cancellative and union-free codes.
A family of sets Ƒ (and the corresponding code of 0-1 vectors) is called union-free if A ∪ B≠ C ∪ D and A,B,C,D ∈ F imply {A,B} = {C,D}. Ƒ is called t-cancellative if for all distict t + 2 members A_{1}, … ,A_{t} and B,C ∈ Ƒ
A1 ∪ ... ∪ A_{t} ∪ B ≠ A_{1} ∪ … A_{t} ∪C:
Let c_{t}(n) be the size of the largest t-cancellative code on n elements. We significantly improve the previous upper bounds of Körner and Sinaimeri, e.g., we show c2(n) ≤ 2^{0:322n} (for n > n_{0}).
Keywords of the presentation: error correcting codes, local decoding
Tutorial: Locally decodable codes
Locally decodable codes are error correcting codes with the extra property that, in order to retrieve the correct value of any one position of the input with
high probability, it is sufficient to read a small number of positions of the corresponding, possibly corrupted codeword. In applications involving large data sets it is desirable to have decoding procedures that work in sublinear time, that is, without having to access all of the encoded data. Locally decodable codes allow very efficient decoding procedures for recovering individual parts of the original data: not only with sublinear, but sometimes even small constant running times. I will survey constructions of locally decodable codes and the currently known tradeoffs between parameters of such codes.
Keywords of the presentation: streaming algorithms, group testing, sparse signal recovery, introduction.
My talk will be a tutorial about sparse signal recovery but, more importantly, I will provide an overview of what the research problems are at the intersection of biological applications of group testing, streaming algorithms, sparse signal recovery, and coding theory. The talk should help set the stage for the rest of the workshop.
Despite the rapid decrease in sequencing costs, sequencing a large group of individuals is still prohibitively expensive. Recently, several sophisticated pooling designs were suggested that can identify carriers of rare alleles in large cohorts with a significantly smaller number of lanes, thus dramatically reducing the cost of such large scale genotyping projects. These approaches all use combinatorial pooling designs where each individual is either present in a pool or absent from it. One can then infer the number of carriers in a pool, and by combining information across pools, reconstruct the identity of the carriers.
We show that one can gain further efficiency and cost reduction by using “weighted” designs, in which different individuals donate different amounts of DNA to the pools. Intuitively, in this situation the number of mutant reads in a pool does not only indicate the number of carriers, but also of the identity of the carriers.
We describe and study a simple but powerful example of such weighted designs, with non-overlapping pools. We demonstrate that even this naive approach is not only easier to implement and analyze but is also competitive in terms of accuracy with combinatorial designs when identifying very rare variants, and is superior to the combinatorial designs when genotyping more common variants.
We then discuss how weighting can be further incorporated into existing designs to increase their accuracy and demonstrate the resulting improvement in reconstruction efficiency using simulations. Finally, we argue that these weighted designs have enough power to facilitate detection of common alleles, so they can be used as a cornerstone of whole-exome or even whole-genome sequencing projects.
Keywords of the presentation: combinatorial pooling, DNA sequencing and assembly, genomics, next-generation sequencing,
We propose a new sequencing protocol that combines recent advances in
combinatorial pooling design and second-generation sequencing
technology to efficiently approach de novo selective genome
sequencing. We show that combinatorial pooling is a cost-effective
and practical alternative to exhaustive DNA barcoding when dealing
with hundreds or thousands of DNA samples, such as genome-tiling
gene-rich BAC clones. The novelty of the protocol hinges on the
computational ability to efficiently compare hundreds of million of
short reads and assign them to the correct BAC clones so that the
assembly can be carried out clone-by-clone. Experimental results on
simulated data for the rice genome show that the deconvolution is
extremely accurate (99.57% of the deconvoluted reads are assigned to
the correct BAC), and the resulting BAC assemblies have very high
quality (BACs are covered by contigs over about 77% of their length,
on average). Experimental results on real data for a gene-rich subset of
the barley genome confirm that the deconvolution is accurate (almost
70% of left/right pairs in paired-end reads are assigned to the same
BAC, despite being processed independently) and the BAC assemblies have
good quality (the average sum of all assembled contigs is about 88%
of the estimated BAC length).
Joint work with D. Duma (UCR), M. Alpert (UCR), F. Cordero (U of Torino),
M. Beccuti (U of Torino), P. R. Bhat (UCR and Monsanto), Y. Wu (UCR and Google),
G. Ciardo (UCR), B. Alsaihati (UCR), Y. Ma (UCR), S. Wanamaker (UCR),
J. Resnik (UCR), and T. J. Close (UCR).
Preprint available at http://arxiv.org/abs/1112.4438
Read More...
Screening experiments (SE) deal with finding a small number s Active
Inputs (AIs) out of a vast total amount t of inputs in a regressionmodel.
Of special interest in the SE theory is finding the so-called maximal rate
(capacity).
For a set of t elements, denote the set of its s-subsets by (s, t). Introduce
a noisy system with t binary inputs and one output y. Suppose
that only s 0. The f-capacity
Cf (s) = limt!1(log t/Nf (s, t,
)) for any
> 0 is the ‘limit for the per-
formance of the f-analysis’.
We obtained tight capacity bounds [4], [5] by formalizing CSS as a special
case of Multi-Access Communication Channels (MAC) of information
transmission capacity region construction developed by R. Ahlswede
in [1] and comparing CSS’ maximal rate (capacity) with small error for
two practical methods of outputs’ analysis under the optimal CSS design
motivated by applications like [2] . Recovering Active Inputs with
small error probability and accurate parameter estimation are both possible
with rates less than capacity and impossible with larger rates.
A staggering amount of attention was recently devoted to the study
of compressive sensing and related areas using sparse priors in over
parameterized linear models which may be viewed as a special case of
our models with continuous input levels.
The threshold phenomenon was empirically observed in early papers
[3], [6] : as the dimension of a randominstance of a problem grows there
is a sharp transition from successful recovery to failure as a function
of the number of observations versus the dimension and sparsity of the
unknown signal. Finding this threshold is closely related to our capacity
evaluation. Some threshold bounds for the compressive sensing were
made using standard information-theoretic tools, e.g. in [15]
A staggering amount of attention was recently devoted to the study
of compressed sensing and related areas using sparse priors in over
parameterized linear models under versions of linear programming (LP) analysis.
More recently popularity of the sparsity approach for the classical model of
group testing also increased.
The threshold phenomenon was empirically observed in early papers of
Donoho et al : as the dimension of a random instance of a problem
grows there is a sharp transition from successful recovery to failure as a
function of the number of observations versus the dimension and sparsity
of the unknown signal. There are strong parallels between this threshold
behavior and earlier Shannon's justification for similar phenomenon in
terms of the capacity of the information transmission.
We outline sharp asymptotical bounds for general models obtained
by 1979 that were inspired by the Multi-Access Capacity Region
construction that can greatly enhance the current understanding of recovery
of active inputs in sparse compressive sensing. These bounds were obtained for
asymptotically optimal designs which include random designs and
thus imply upper bounds for performance of recovery under arbitrary designs.
These optimal designs are a natural choice in settings where the experiment
can be predesigned, providing in addition a competitive performance of a simplified
analysis method: separate testing of inputs (STI). STI originated in 1959 and can be
further traced back to Fisher's idea of randomization in estimation.
The STI capacity exceeds that of LP relaxation for
randomized designs in a wide range of models and admits several straightforward
generalizations including significantly more powerful and still computationally
feasible greedy version. This will be demonstrated in our simulations.
Keywords of the presentation: group testing, MAC channel, quantization, graphical models
We consider a novel group testing framework where defectives obey a probabilistic model in which the number and set of defectives is governed by a graphical model. We furthermore assume that the defectives have importance factors that influence their strength on the test outcomes and the accuracy with which the outcomes are read. This kind of scenario arises, for example, in MAC channels with networked users using different communication powers, or in DNA pooling schemes which involve large families.
Keywords of the presentation: dimensionality reduction, johnson-lindenstrauss, numerical linear algebra, massive data
The Johnson-Lindenstrauss (JL) lemma (1984) states that any n points
in d-dimensional Euclidean space can be embedded into k = O((log
n)/eps^2) dimensions so that all pairwise distances are preserved up
to 1+/-eps. Furthermore, this embedding can be achieved via a linear
mapping. The JL lemma is a useful tool for speeding up solutions to
several high-dimensional problems: closest pair, nearest neighbor,
diameter, minimum spanning tree, etc. It also speeds up some
clustering and string processing algorithms, reduces the amount of
storage required to store a dataset, and can be used to reduce memory
required for numerical linear algebra problems such as linear
regression and low rank approximation.
The original proofs of the JL lemma let the linear mapping be
specified by a random dense k x d matrix (e.g. i.i.d. Gaussian
entries). Thus, performing an embedding requires dense matrix-vector
multiplication. We give the first construction of linear mappings for
JL in which only a subconstant fraction of the embedding matrix is
non-zero, regardless of how eps and n are related, thus always
speeding up the embedding time. Previous constructions only achieved
sparse embedding matrices for 1/eps >> log n.
This is joint work with Daniel Kane (Stanford).
Keywords of the presentation: biological networks, graph algorithms, network alignment
Sequence-based computational approaches have revolutionized biological understanding. However, they can fail to explain some biological phenomena. Since proteins aggregate to perform a function instead of acting in isolation, the connectivity of a protein interaction network (PIN) will provide additional insight into the inner working on the cell, over and above sequences of individual proteins. We argue that sequence and network topology give insights into complementary slices of biological information, which sometimes corroborate each other, but sometimes do not. Hence, the advancement depends on the development of sophisticated graph-theoretic methods for extracting biological knowledge purely from network topology before being integrated with other types of biological data (e.g., sequence). However, dealing with large networks is non-trivial, since many graph-theoretic problems are computationally intractable, so heuristic algorithms are sought.
Analogous to sequence alignments, alignments of biological networks will likely impact biomedical understanding. We introduce a family of topology-based network alignment (NA) algorithms, (that we call GRAAL algorithms), that produces by far the most complete alignments of biological networks to date: our alignment of yeast and human PINs demonstrates that even distant species share a surprising amount of PIN topology. We show that both species phylogeny and protein function can be extracted from our topological NA. Furtermore, we demonstrate that the NA quality improves with integration of additional data sources (including sequence) into the alignment algorithm: surprisingly, 77.7% of proteins in the baker’s yeast PIN participate in a connected subnetwork that is fully contained in the human PIN suggesting broad similarities in internal cellular wiring across all life on Earth. Also, we demonstrate that topology around cancer and non-cancer genes is different and when integrated with functional genomics data, it successfully predicts new cancer genes in melanogenesis-related pathways.
Keywords of the presentation: protein interaction networks, kinetics, affinity, protein arrays, surface plasmon resonance
Angiogenesis, consisting in the formation of blood vessels from preexisting ones, is of crucial importance in pathological situations such as cancer and diabetes and is a therapeutic target for these two diseases. We have developed protein and sugar arrays probed by surface plasmon resonance (SPR) imaging to identify binding partners of proteins, polysaccharides and receptors regulating angiogenesis. Interactions collected from our own experiments and from literature curation have been used to build a network comprising protein-protein and protein-polysaccharide interactions. To switch from a static to a dynamic and quantitative interaction network, we have measured kinetics and affinity of interactions by SPR to discriminate transient from stable interactions and to prioritize interactions in the network. We have also identified protein interaction sites either experimentally or by molecular modeling to discriminate interactions occurring simultaneously from those taking place sequentially. The ultimate step is to integrate, using bioinformatics tools, all these parameters in the interaction network together with inhibitors of these interactions and with gene and protein expression data available from Array Express or Gene Expression Omnibus and from the Human Protein Atlas. This dynamic network will allow us to understand how angiogenesis is regulated in a concerted fashion via several receptors and signaling pathways, to identify crucial interactions for the integrity of the network that are potential therapeutic targets, and to predict the side effects of anti-angiogenic treatments.
Keywords of the presentation: Code Concatenation, Coding theory, Group Testing, List Decoding, List Recovery
Group testing was formalized by Dorfman in his 1943 paper and was originally used in WW-II to identify soldiers with syphilis. The main insight in this application is that blood samples from different soldiers can be combined to check if at least one of soldiers in the pool has the disease. Since then group testing has found numerous applications in many areas such as (computational) biology, combinatorics and (theoretical) computer science.
Theory of error-correcting codes, or coding theory, was born in the works of Shannon in 1948 and Hamming in 1950. Codes are ubiquitous in our daily life and have also found numerous applications in theoretical computer science in general and computational complexity in particular.
Kautz and Singleton connected these two areas in their 1964 paper by using "code concatenation" to design good group testing schemes. All of the (asymptotically) best know explicit constructions of group testing schemes use the code concatenation paradigm. In this talk, we will focus on the "decoding" problem for group testing: i.e. given the outcomes of the tests on the pools, identify the infected soldiers. Recent applications of group testing in data stream algorithm require sub-linear time decoding, which is not guaranteed by the traditional constructions.
The talk will first survey the Kautz-Singleton construction and then will will show how recent developments in list decoding of codes lead in a modular way to sub-linear time decodable group testing schemes.
Keywords of the presentation: superimposed codes, screening designs, rate of code, threshold designs and codes
We will discuss superimposed codes and non-adaptive group testing designs arising from the potentialities of compressed genotyping models in molecular biology. The given survey is also motivated by the 30th anniversary of our recurrent upper bound on the rate of superimposed codes published in 1982.
I will talk about property testing of Boolean functions, concentrating on two topics: testing monotonicity (an example of a combinatorial property) and testing the property of being a low-degree polynomial (an algebraic property).
Keywords of the presentation: Genomewide association studies, Privacy, Pooled designs, Hypothesis testing, Local Asymptotic normality
Statistical power to detect associations in genome-wide association studies can be enhanced by combining data across studies in meta-analysis or replication studies. Such methods require data to flow freely in the scientific community, however, and this raises privacy concerns.
Till recently, many studies pooled individuals together, making only the allele frequencies of each SNP in the pool publicly available. However a technique that could be used to detect the presence of individual genotypes from such data prompted organizations such as the NIH to restrict public access to summary data . To again allow public access to data from association studies, we need to determine which set of SNPs can be safely exposed while preserving an acceptable level of privacy.
To answer this question, we provide an upper bound on the power achievable by any detection method as a function of factors such as the number and the allele frequencies of exposed SNPs, the number of individuals in the pool, and the false positive rate of the method. Our approach is based on casting the problem in a statistical hypothesis testing framework for which the likelihood ratio test (LR-test) attains the maximal power achievable.
Our analysis provides quantitative guidelines for researchers to make SNPs public without compromising privacy. We recommend, based on our analysis, that only common independent SNPs be exposed. The final decision regarding the exposed SNPs should be based on the analytical bound in conjunction with empirical estimates of the power of the LR test. To this end, we have implemented a tool, SecureGenome, that determines the set of SNPs that can be safely exposed for a given dataset.
Keywords of the presentation: (generalized) group testing, screening clone libraries, DNA microarrays
Group testing has made many contributions to modern molecular biology.
In the Human Genome Project large clone libraries where created
to amplify DNA. Widely used group testing schemes vastly accelerated
the detection of overlaps between the individual clones in these libraries through experiments, realizing savings both in effort and materials.
Modern molecular biology also contributed to group testing. The problem
of generalized group testing (in the combinatorial sense) arises
naturally, when one uses oligonucleotide probes to identify
biological agents present in a sample. In this setting a group testing
design cannot be chosen arbitrarily. The possible columns of a group
testing design matrix are prescribed by the biology, namely by the
hybridization reactions between target DNA and probes
Keywords of the presentation: compressed sensing, group testing, genetics, rare alleles
Identification of rare variants by resequencing is important both for detecting novel variations and for screening individuals for known disease alleles. New technologies enable low-cost resequencing of target regions, although it is still prohibitive to test more than a few individuals. We propose a novel pooling design that enables the recovery of novel or known rare alleles and their carriers in groups of individuals. The method is based on combining next-generation sequencing technology with a Compressed Sensing (CS) approach. The approach is general, simple and efficient, allowing for simultaneous identification of multiple variants and their carriers.
It reduces experimental costs, i.e., both sample preparation related costs and direct sequencing costs, by up to 70 fold, and thus allowing to scan much larger cohorts.
We demonstrate the performance of our approach over several publicly available data sets, including the 1000 Genomes Pilot 3 study.
We believe our approach may significantly improve cost effectiveness of future association studies, and in screening large DNA cohorts for specific risk alleles.
We will present initial results of two projects that were initiated following publication. The first project concerns identification of de novo SNPs in genetic disorders common among Ashkenazi Jews, based on sequencing 3000 DNA samples. The second project in plant genetics involves identifying SNPs related to water and silica homeostasis in Sorghum bicolor, based on sequencing 3000 DNA samples using 1-2 Illumina lanes.
Joint work with Amnon Amir from the Weizmann Institute of Science, and Or Zuk from the Broad Institute of MIT and Harvard
Keywords of the presentation: combinatorial group testing, smart-pooling, interactome mapping
Group testing, also know as smart-pooling, is a promising strategy for achieving high efficiency, sensitivity, and specificity in systems-level projects. It consists in assaying well-chosen pools of probes, such that each probe is present in several pools, hence tested several times. The goal is to construct the pools so that the positive probes can usually be directly identified from the pattern of positive pools, despite the occurrence of false positives and false negatives. While striving for this goal, two interesting mathematical or computational problems emerge: the pooling problem (how should the pools be designed?), and the decoding problem (how to interpret the outcomes?). In this talk I will discuss these questions and the solutions we have proposed: a flexible and powerful combinatorial construction for designing smart-pools (the Shifted Transveral Design, STD), and an efficient exact algorithm for interpreting results (interpool). I will then present the results of validation experiments that we have performed in the context of yeast two-hybrid interactome mapping.
Read More...
The data stream model has emerged as a way of analyzing algorithmic efficiency in the presence of massive data sets. Typically the algorithm is allowed a few (usually one) passes over the input, and must use limited memory and have very fast per input item processing time. I will give a survey of algorithms and lower bounds in this area, with an emphasis on problems such as norm estimation, numerical linear algebra, empirical entropy, l_p-sampling, and heavy hitters. Time-permitting I'll also discuss the extension to functional monitoring, in which there are multiple sites each with a data stream and a central coordinator should continuously maintain a statistic over the union of the data streams.
Keywords of the presentation: metagenomics, sparse representation, compressed seqeuncing
Determining the identities and frequencies of species present in a sample is a central problem in metagenomics, with scientific, environmental and clinical implications.
A popular approach to the problem is sequencing the Ribosomal 16s RNA gene in the sample using universal primers, and using variation in the gene's sequence between different species to identify the species present in the sample.
We present a novel framework for community reconstruction, based on sparse representation; while millions of microorganisms are present on earth, with known 16s sequences stored in a database, only a small minority (typically a few hundreds) are likely to be present in any given sample,
We discuss the statistical framework, algorithms used and results in terms of accuracy and species resolution.