Haplotype tagging SNP (htSNP) selection in the Multiethnic Cohort Study


Our htSNP selection algorithm is based on optimizing Rh2, the squared correlation between the estimated and the true number of copies of a particular haplotype h carried by a subject, averaged over all possible genotype data under an assumption of Hardy-Weinberg equilibrium. The estimate used in the Rh2 calculation is identical to the one computed in the expectation step of the E-M algorithm of Excoffier and Slatkin (1995) for estimating haplotype frequencies from genotype data on unrelated subjects.
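As a rough illustration of this definition, the Python sketch below computes Rh2 for each haplotype from a table of known haplotype frequencies and a candidate set of tag SNPs. It is not taken from TagSNPs (which is written in Fortran 90). The function name `rh2`, the "0"/"1" string encoding of haplotypes, and the brute-force enumeration of haplotype pairs are all assumptions made for this example. It relies on the fact that the squared correlation between a quantity and its conditional expectation equals the ratio of their variances, and on the true haplotype count being binomial with variance 2p(1-p) under Hardy-Weinberg equilibrium.

```python
from collections import defaultdict
from itertools import product


def rh2(haplotypes, freqs, tag_idx):
    """Return {haplotype: Rh2} for the tag SNPs at positions tag_idx.

    haplotypes: distinct strings of '0'/'1' alleles (hypothetical encoding).
    freqs: population frequencies, each strictly between 0 and 1, summing to 1.
    """
    prob_g = defaultdict(float)  # P(G = g) for each tag-SNP genotype g
    # copies[h][g] accumulates E[count of h * 1{G = g}]
    copies = {h: defaultdict(float) for h in haplotypes}

    # Under Hardy-Weinberg equilibrium an ordered pair (h1, h2)
    # has probability p1 * p2.
    for (h1, p1), (h2, p2) in product(zip(haplotypes, freqs), repeat=2):
        p = p1 * p2
        # Unphased genotype at the tag SNPs: number of '1' alleles per SNP.
        g = tuple(int(h1[i]) + int(h2[i]) for i in tag_idx)
        prob_g[g] += p
        copies[h1][g] += p
        copies[h2][g] += p  # a homozygous pair contributes two copies

    result = {}
    for h, ph in zip(haplotypes, freqs):
        var_true = 2.0 * ph * (1.0 - ph)  # variance of the true count
        # E[(E[count | G])^2] = sum over g of copies[h][g]^2 / P(G = g)
        second_moment = sum(copies[h][g] ** 2 / pg for g, pg in prob_g.items())
        var_est = second_moment - (2.0 * ph) ** 2
        result[h] = var_est / var_true
    return result


if __name__ == "__main__":
    haps = ["00", "01", "10", "11"]
    p = [0.4, 0.1, 0.2, 0.3]
    print(rh2(haps, p, tag_idx=[0, 1]))  # both SNPs genotyped
    print(rh2(haps, p, tag_idx=[0]))     # only the first SNP genotyped
```

Even with both SNPs genotyped, Rh2 stays below 1 in this example, because a subject heterozygous at both SNPs could carry either of two haplotype pairs.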

The calculation of Rh2 and the algorithm used to maximize Rh2 over subsets of candidate htSNPs are described in the paper “Stram DO, Haiman CA, Hirschhorn JN, Altshuler D, Kolonel LN, Henderson BE, Pike MC. Choosing haplotype-tagging SNPs based on unphased genotype data from a preliminary sample of unrelated subjects with an example from the Multiethnic Cohort Study. Human Heredity 2003.”


A reprint of this manuscript is available by clicking here.


TagSNPs, the Fortran 90 program that I have written to implement our selection algorithm, runs under either DOS or UNIX as a console (command-line) application.

The full program, documentation, and example datasets may be downloaded as a ZIP file (readable with, e.g., WinZip). A UNIX executable for SunOS 5.8 is included in the package as well. Other UNIX executables and/or the Fortran 90 source code for recompilation are available upon request. After downloading, read the README.txt file and tagSNPSDoc.pdf for instructions on setting up and using the program.


Download TagSNPs Version 1


A slide show illustrating basic uses of this program is provided here.

Features include

  • A very fast and flexible partition-ligation EM algorithm for estimating haplotype frequencies for large numbers of SNPs genotyped in unrelated subjects (the EM command)
  • Tag SNP selection using any of three statistical (R2) criteria (the RSQ command)
    • Rh2: for predicting haplotypes (see above) based on tagSNPs
    • Multivariate Rs2: for predicting unmeasured SNPs based on tagSNPs
    • Pairwise R2: for finding a smallest set of tagSNPs that optimizes the minimum bivariate correlation coefficient between measured and unmeasured SNPs
  • Haplotype and unmeasured SNP prediction (the PREDICT command)
  • Powerful and flexible command language
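To make the pairwise R2 criterion listed above more concrete, here is a small Python sketch. It computes the squared correlation between every pair of SNPs from a haplotype frequency table, then searches for a smallest tag set in which every unmeasured SNP is well correlated with at least one tag. The function names, the "0"/"1" haplotype encoding, the `threshold` parameter, and the exhaustive search over subsets are all assumptions chosen for illustration. The RSQ command in TagSNPs may proceed quite differently, and exhaustive search is only practical for a handful of SNPs.

```python
from itertools import combinations


def pairwise_r2(haplotypes, freqs, i, j):
    """Squared correlation between the alleles at SNPs i and j.

    Assumes both SNPs are polymorphic (allele frequency strictly in (0, 1)).
    """
    pairs = list(zip(haplotypes, freqs))
    pi = sum(f for h, f in pairs if h[i] == "1")
    pj = sum(f for h, f in pairs if h[j] == "1")
    pij = sum(f for h, f in pairs if h[i] == "1" and h[j] == "1")
    d = pij - pi * pj  # linkage disequilibrium coefficient D
    return d * d / (pi * (1 - pi) * pj * (1 - pj))


def smallest_tag_set(haplotypes, freqs, threshold=0.8):
    """Return (tags, score) for a smallest tag set reaching the threshold.

    score is the minimum, over unmeasured SNPs, of the best pairwise R2
    between that SNP and any tag SNP. threshold is a hypothetical cutoff.
    """
    n = len(haplotypes[0])
    r2 = [[pairwise_r2(haplotypes, freqs, i, j) for j in range(n)]
          for i in range(n)]
    for k in range(1, n + 1):
        best, best_score = None, -1.0
        for tags in combinations(range(n), k):
            rest = [s for s in range(n) if s not in tags]
            # With no unmeasured SNPs left, the criterion is trivially met.
            score = min((max(r2[s][t] for t in tags) for s in rest),
                        default=1.0)
            if score > best_score:
                best, best_score = tags, score
        if best_score >= threshold:
            return best, best_score


if __name__ == "__main__":
    # SNPs 0 and 1 always occur together; SNP 2 is nearly independent.
    haps = ["000", "110", "001", "111"]
    p = [0.4, 0.3, 0.2, 0.1]
    print(smallest_tag_set(haps, p))
```

In this toy example SNPs 0 and 1 are perfectly correlated, so one of them can stand in for the other, while SNP 2 must be genotyped directly.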


Download beta-test of Version 2 of tagSNPs   


New features include

  • EM estimation and haplotype prediction for father-mother-children genotypes
  • Direct importing of HapMap data (www.hapmap.org) (the HAPMAP command)
  • Many other enhancements: Genotype summarization; Mendel and HWE error checking; prediction of X chromosome (haploid) data when some genotypes are missing; enhanced data input and output


Users are requested to send me an email message at stram@usc.edu with contact information which will be used for notification of updates, bug fixes, etc.



This macro is used for haplotype imputation based on tag SNP data for candidate gene association studies. It calls tagSNPs to do the PLEM estimation and haplotype imputation, then merges and renames the haplotype imputations according to a naming convention that facilitates use of the estimated haplotype dosage variables in subsequent regression analysis (e.g., using PROC LOGISTIC in SAS). The documentation is contained in the file.






Happy gene-hunting!

Dan Stram

Department of Preventive Medicine

Keck School of Medicine

University of Southern California

  Last Updated: November 8, 2004