Statistical Genomics

Statistical Genomics - Research

High dimensional genomic studies and multiplicity

Multivariate, high-dimensional data allow scientists to ask many questions. To appropriately evaluate the significance of the patterns that emerge, one has to account for the look-everywhere effect and for data-driven selection. The false discovery rate appears to be an appropriate criterion for global error, and we are interested in its adaptation to genomic settings. Multivariate linear models are often a powerful first step in understanding the dependence structure of multiple variables: one of the problems we tackle is how to carry out model selection in this context in the presence of a large number of genomic explanatory variables.
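
As a point of reference, the classical Benjamini-Hochberg step-up procedure is one standard way to control the false discovery rate across many tests. The sketch below is a minimal, generic illustration in Python, not the genomic adaptations developed in the papers listed in this section. The function name `benjamini_hochberg` and the target level `q` are illustrative assumptions, and the error guarantee assumes independent (or positively dependent) p-values.

```python
def benjamini_hochberg(pvalues, q=0.05):
    """Return a list of booleans marking which hypotheses are rejected.

    Generic Benjamini-Hochberg step-up rule (illustrative sketch):
    sort the p-values, find the largest rank k such that
    p_(k) <= q * k / m, and reject the k smallest p-values.
    """
    m = len(pvalues)
    # Indices of the hypotheses, ordered from smallest to largest p-value.
    order = sorted(range(m), key=lambda i: pvalues[i])

    # Largest rank (1-based) whose p-value falls under its threshold.
    k = 0
    for rank, idx in enumerate(order, start=1):
        if pvalues[idx] <= q * rank / m:
            k = rank

    # Reject every hypothesis ranked at or below k (step-up behaviour).
    rejected = [False] * m
    for idx in order[:k]:
        rejected[idx] = True
    return rejected


# Example: ten hypothetical association p-values, target FDR level 5%.
example_pvalues = [0.001, 0.008, 0.039, 0.041, 0.042,
                   0.06, 0.074, 0.205, 0.212, 0.216]
discoveries = benjamini_hochberg(example_pvalues, q=0.05)
```

In this hypothetical example only the two smallest p-values pass their rank-dependent thresholds, so only those two tests are declared discoveries.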

Sabatti, C., S. Service, and N. Freimer (2003) "False discovery rate in linkage and association genome screens for complex disorders," Genetics 164: 829-833. PMID: 12807801

Brodsky, J. (2011) "Block, Pass, Score: A Multivariate Methodology for Genome-wide Association Studies", UCLA dissertation

Sabatti, C. (2013) "Multivariate linear models for GWAS," in Advances in Statistical Bioinformatics, K. Do, S. Qin, M. Vannucci, eds., Cambridge University Press. Preprint

Bogdan, M., E. van den Berg, C. Sabatti, W. Su, E. Candes (2014) "SLOPE -- Adaptive Variable Selection via Convex Optimization," arXiv:1407.3824

Peterson, C., M. Bogomolov, Y. Benjamini and C. Sabatti (2015) "Many Phenotypes without Many False Discoveries: Error Controlling Strategies for Multi-Traits Association Studies," arXiv:1504.00701

Stell, L. and C. Sabatti (2015) "Genetic variant selection: learning across traits and sites," arXiv:1504.00946

Multivariate phenotypes - GTEx

Current genomic data often contain information on a large number of phenotypes: how can we capitalize on this? We are involved in a large study of endophenotypes for bipolar disorder, which has motivated methods development. We are also receiving funding from GTEx to develop statistical methods to identify eQTL with high sensitivity and at a low false positive rate, across multiple tissues. This is a collaboration with the group of Eleazar Eskin at UCLA.

Fears, S., S. Service, T. Teshiba, C. Araya, X. Araya, J. Bejarano, J. Gomez-Franco, B. Kremeyer, Z. Abaryan, I. Aldana, M. Ericson, M. Jalbrzikowski, J. Luykx, L. Navarro, N. Sharif, L. Altshuler, G. Bartzokis, J. Escobar, D. Glahn, J. Ospina-Duque, N. Risch, A. Ruiz-Linares, R. Cantor, C. Lopez-Jaramillo, G. Macaya, J. Molina, V. Reus, C. Sabatti, N. Freimer, and C. Bearden (2014) "Multi-system Component Phenotypes of Bipolar Disorder for Genetic Investigations of Extended Pedigrees," JAMA Psychiatry 71: 375-87. PMID: 24522887

Peterson, C., M. Bogomolov, Y. Benjamini and C. Sabatti (2015) "Many Phenotypes without Many False Discoveries: Error Controlling Strategies for Multi-Traits Association Studies," arXiv:1504.00701

Peterson, C., M. Bogomolov, Y. Benjamini, C. Sabatti (2015) "TreeQTL: hierarchical error control for eQTL findings"

Resequencing studies and identification of functional variants

Thanks to decreasing sequencing costs, we can acquire a comprehensive picture of genomic variation. How can statistical methods help identify the variants that are more likely to be functional?

Service, S., T. Teslovich, C. Fuchsberger, V. Ramensky, P. Yajnik, D. Koboldt, D. Larson, Q. Zhang, L. Lin, R. Welch, L. Ding, M. McLellan, M. O'Laughlin, C. Fronick, L. Fulton, V. Magrini, P. Elliott, M. Jarvelin, M. Kaakinen, M. McCarthy, L. Peltonen, A. Pouta, L. Bonnycastle, F. Collins, N. Narisu, H. Stringham, J. Tuomilehto, S. Ripatti, R. Fulton, C. Sabatti, R. Wilson, M. Boehnke, and N. Freimer (2014) "Re-sequencing Expands Our Understanding of the Phenotypic Impact of Variants at GWAS Loci," PLoS Genetics 10: e1004147. PMID: 24497850

Stell, L. and C. Sabatti (2015) "Genetic variant selection: learning across traits and sites," arXiv:1504.00946

DNA copy number variant reconstruction

Raw data from high-density genotyping arrays and resequencing can be used to reconstruct DNA copy number. We have been involved in both method development and data analysis projects.

Wang, H., Y. Lee, S. Nelson, and C. Sabatti (2005) "Inferring genomic loss and location of tumor suppressor genes from high density genotypes," UCLA Stat preprint 423, Journal of the French Statistical Society, 146: 153-171

Wang, H., J. Veldink, R. Ophoff, and C. Sabatti (2008) "Markov models for inferring Copy Number Variations from genotype data on Illumina platforms," UCLA Statistics Preprint #533 and Human Heredity, 68: 1-22.

Stefansson, H. et al. (2008) "Large recurrent microdeletions associated with schizophrenia," Nature 455: 232-6.

Vrijenhoek, T., J. Buizer-Voskamp, I. van der Stelt, E. Strengman, Genetic Risk and Outcome in Psychosis (GROUP) Consortium, C. Sabatti, A. van Kessel, H. Brunner, R. Ophoff, J. Veltman (2008) "Recurrent CNVs Disrupt Three Candidate Genes in Schizophrenia Patients," The American Journal of Human Genetics, 83: 504-510.

Zhang, Z., K. Lange, R. Ophoff, C. Sabatti (2010) "Reconstructing DNA copy number by penalized estimation and imputation," The Annals of Applied Statistics, 4: 1749-1773.

Buizer-Voskamp, J., J. Muntjewerff, Genetic Risk and Outcome in Psychosis (GROUP) Consortium, E. Strengman, C. Sabatti, H. Stefansson, J. Vorstman, R. Ophoff (2011) "Genome-Wide Analysis Shows Increased Frequency of Copy Number Variation Deletions in Dutch Schizophrenia Patients," Biological Psychiatry 70: 655-62.

Zhang, Z., K. Lange and C. Sabatti (2012) "Reconstructing DNA copy number by joint segmentation of multiple sequences" Stanford Technical Report, Biostatistics series BIO 261

Association mapping

We are broadly interested in association mapping. We have contributed to the development of a Bayesian method for haplotype mapping. We have also studied the problem of multiple comparisons in association genome scans and worked on approaches to account for hidden population structure.

Liu, J., C. Sabatti, J. Teng, B. Keats, and N. Risch (2001) "Bayesian analysis of haplotypes for linkage disequilibrium mapping," Genome Research 11: 1716-24. Preprint

Sabatti, C., S. Service, and N. Freimer (2003) "False discovery rate in linkage and association genome screens for complex disorders," Genetics 164: 829-833. Reprint

Freimer, N. and C. Sabatti (2003) "The human phenome project," Nature Genetics 34: 15-21. Reprint

Freimer, N. and C. Sabatti (2004) "Pedigree, sib-pair, and association studies of common diseases; genetic mapping and epidemiology," Nature Genetics 36: 1045-1051. Reprint

Sabatti, C. (2006) "Comment on 'Likelihood-Based Inference on haplotype effects in genetic association studies' by Lin and Zeng," Journal of the American Statistical Association 101: 104-106. (Invited contribution.)

Service, S., The international collaborative group on isolated populations, C. Sabatti, N. Freimer (2007) "Tag SNPs chosen from HapMap perform well in several population isolates," Genetic Epidemiology, Epub ahead of print.

Freimer, N. and C. Sabatti (2007) "Human genetics: variants in common diseases." Nature 445: 828-30. (Invited contribution.)

Ayers, K., C. Sabatti and K. Lange (2007) "A dictionary model for haplotyping, genotype calling, and association mapping," Genetic Epidemiology 31: 672-683.

Sabatti, C., S. Service, A. Hartikainen, A. Pouta, S. Ripatti, J. Brodsky, C. Jones, N. Zaitlen, T. Varilo, M. Kaakinen, U. Sovio, A. Ruokonen, J. Laitinen, E. Jakkula, C. Lachlan, C. Hoggart, P. Elliott, A. Collins, H. Turunen, S. Gabriel, M. McCarthy, M. Daly, M-R. Jarvelin, N. Freimer, L. Peltonen (2009) "Genomewide association analysis of metabolic phenotypes in a birth cohort from a founder population," Nature Genetics, 41: 35-46.

Kang, H., J-H. Sul, S. Service, N. Zaitlen, S. Kong, N. Freimer, C. Sabatti*, E. Eskin* (2010) "Variance component model to account for sample structure in genome-wide association studies," Nature Genetics, 42: 348-354.

Teslovich TM et al. (2010) "Biological, clinical and population relevance of 95 loci for blood lipids," Nature 466:707-713.

Linkage disequilibrium

I have long been interested in how to measure linkage disequilibrium (LD) and in how LD varies across the genome and across populations.

Sabatti, C. and N. Risch (2002) "Homozygosity and linkage disequilibrium," Genetics 160: 1707-1719. Preprint

Sabatti, C. (2002) "Measuring dependence with volume tests," The American Statistician 50: 191-195. Preprint

Ayers, K., C. Sabatti, and K. Lange (2006) "Reconstructing ancestral haplotypes with a dictionary model," Journal of Computational Biology, 3, 3: 767-785.

Wang, H., C. Lin, S. Service, The international collaborative group on isolated populations, Y. Chen, N. Freimer, C. Sabatti (2006) "Linkage disequilibrium and haplotype homozygosity in population samples genotyped at a high marker density," Human Heredity, 62: 175-189.

Chen, Y., C. Lin, C. Sabatti (2006) "Volume measures for linkage disequilibrium," BMC Genetics 7:54

High density SNP genotyping

We developed models for the intensity values of Affymetrix and Illumina genotyping arrays, to be used in genotype calling, linkage studies, and loss-of-heterozygosity studies. In general, we are interested in understanding the measurement error associated with novel technologies.

Sabatti, C. and K. Lange (2005) "Bayesian Gaussian mixture models for high density genotyping arrays," UCLA Stat preprint 421, to appear in JASA.

Gene regulation networks

To recover the dynamic behavior of regulatory proteins and their pathway of influence on cell behavior, we have combined sequence analysis with results of gene expression array experiments. We developed a sparse hidden component model to link transcription factor activity to gene expression. Most recently, we have become interested in incorporating measurements of methylation levels in our models.

Sabatti, C., L. Rohlin, M. Oh, and J. Liao (2002) "Co-expression pattern from DNA microarray experiments as a tool for operon prediction," Nucleic Acids Research 30: 2886-2893. Reprint

Liao, J., R. Boscolo, Y. Yang, L. Tran, C. Sabatti, and V. Roychowdhury (2003) "Network component analysis: reconstruction of regulatory signals in biological systems," Proceedings of the National Academy of Sciences 100: 15522-15527. Reprint

Kao, K., Y. Yang, R. Boscolo, C. Sabatti, V. Roychowdhury, and J. Liao (2004) "Determination of multiple transcription regulator activities in Escherichia coli using network component analysis," Proceedings of the National Academy of Sciences 101: 641-646. Reprint

Sabatti, C. and G. James (2006) "Bayesian sparse hidden components analysis for transcription regulation networks," Bioinformatics, 22: 739-746.

James, G., C. Sabatti, N. Zhou, and J. Zhu (2010) "Sparse Regulatory Networks," The Annals of Applied Statistics, 4: 663-686.

Gene expression array denoising

Gene expression arrays represent a formidable tool, as they allow investigation of thousands of genes at the same time. However, to fully exploit their potential, one has to deal successfully with the statistical issues involved in their analysis. We have suggested a de-noising approach based on thresholding. Using a Bayesian hierarchical model and an approach to multiple comparisons that is inspired by the False Discovery Rate, we denoise the signal coming from multiple array experiments with the specific goal of identifying the genes that are up-regulated or down-regulated in a given condition.
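
To make the idea of thresholding concrete, here is a minimal, generic sketch in Python. It applies hard thresholding with the classical universal threshold sigma * sqrt(2 log n), which is a swapped-in textbook rule and not the Bayesian, FDR-inspired rule of the papers listed below. The function name `hard_threshold`, the noise level `sigma`, and the example values are all illustrative assumptions.

```python
import math


def hard_threshold(values, sigma=1.0):
    """Zero out entries whose magnitude does not exceed the universal threshold.

    Illustrative assumption: `values` are standardized expression
    differences with noise standard deviation `sigma`. Entries at or
    below sigma * sqrt(2 * log(n)) in absolute value are treated as noise.
    """
    n = len(values)
    if n == 0:
        return []
    threshold = sigma * math.sqrt(2.0 * math.log(n))
    # Keep large signals unchanged; set everything else to zero.
    return [v if abs(v) > threshold else 0.0 for v in values]


# Example: eight hypothetical standardized expression changes.
observed = [0.3, -0.5, 4.2, 0.1, -3.9, 0.7, 0.2, -0.4]
denoised = hard_threshold(observed, sigma=1.0)
```

With eight values and unit noise the threshold is roughly 2.04, so only the two large entries (one up-regulated, one down-regulated) survive in this hypothetical example.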

Sabatti, C., S. Karsten, and D. Geschwind (2002) "Thresholding rules for recovering a sparse signal from microarray experiments," Mathematical Biosciences 176: 17-34. Preprint

Erickson, S. and C. Sabatti (2005) "Empirical Bayes estimation of a sparse vector of gene expression," Statistical Applications in Genetics and Molecular Biology, 4: 22.

Genomic scale identification of promoter binding sites

One of the best understood mechanisms of transcription regulation is the action of regulatory proteins, which bind to the upstream region of a gene and act as either promoters or suppressors of its transcription. We have developed a stochastic dictionary model to identify the positions of known binding sites on a genome-wide scale. We use this information to improve the clustering of array experiments and to reconstruct the regulatory network. Our model organism for these investigations has been E. coli.

Sabatti, C. and K. Lange (2002) "Genomewide motif identification using a dictionary model," Proceedings of the IEEE 90: 1803-1810. Preprint

Sabatti, C., L. Rohlin, K. Lange, and J. Liao (2005) "Vocabulon: a dictionary model approach for reconstruction and localization of transcription factor binding sites," Bioinformatics 21: 922-931. Preprint

High Throughput Screens

In collaboration with Koppany Visnyei and Harley Kornblum, we developed methods for the analysis of high-throughput screen data. Denise Ferrari has put together an R software package that implements our suggested pre-processing.

Sabatti, C., K. Visnyei, H. Kornblum (2008) "Statistical challenges in High-throughput Screens." UCLA Stat Preprint 532

Visnyei, K., H. Onodera, R. Damoiseaux, K. Saigusa, S. Petrosyan, D. De Vries, D. Ferrari, J. Saxe, E. Panosyan, M. Masterman-Smith, J. Mottahedeh, K. Bradley, J. Huang, C. Sabatti, I. Nakano, H. Kornblum (2011) "A molecular screening approach to identify and characterize inhibitors of glioblastoma multiforme stem cells," Molecular Cancer Therapeutics, to appear.
