Whole-Genome Analysis of the Simons Simplex Collection (SSC)

We’re pleased to announce the recipients of the Whole-Genome Analysis for Autism Risk Variants grants and to provide details on the availability of whole-genome sequence (WGS) data from the Simons Simplex Collection (SSC) for the entire research community, regardless of SFARI funding.

Background on the Simons Simplex Collection

The SSC is a rigorously characterized sample of 2,644 simplex families. Each family includes one child with an autism spectrum disorder (ASD) and unaffected biological parents. More than 80 percent of the families also have at least one unaffected sibling.

In addition to the wide variety of clinical and phenotypic data available on individuals with ASD and their unaffected parents and siblings, various biospecimens are stored at the Rutgers University Cell and DNA Repository and are available by application through SFARI Base.

Other genomic datasets on the full SSC cohort include single nucleotide polymorphism (SNP) genotype data, whole-exome sequencing data and, for a fraction of the collection, gene expression data from lymphoblastoid cell lines. See here for more information about these datasets.

Approximately 1,500 SSC families have enrolled in a registry at SSC@IAN and are available for recontacting for additional studies. Qualified researchers may apply via SFARI Base for consideration to recontact these families. Details about the application process are available here.

Publications deriving from the SSC can be found here (note: subheadings for the SSC are listed for each of the publication years).

Pilot WGS study on 40 SSC families

In early 2015, SFARI sponsored a pilot study to sequence and analyze whole genomes from the whole blood of 40 SSC quad families (quads are defined as both biological parents, the affected child, and an unaffected designated sibling). The sequencing was performed at the New York Genome Center (NYGC). Sample selection and preliminary analyses of these data were done through a collaboration between the laboratories of Evan Eichler, Ph.D. (University of Washington), Matthew State, M.D., Ph.D. (University of California, San Francisco) and Michael Wigler, Ph.D. (Cold Spring Harbor Laboratory).

WGS of 500 additional SSC families

Based on preliminary analyses from the pilot study, SFARI sponsored the sequencing of an additional 500 quad families and announced the Whole-Genome Analysis for Autism Risk Variants request for applications (RFA) in July 2015 to support analyses of this expanded dataset.

Recipients of Whole-Genome Analysis for Autism Risk Variants grants

The following groups have been awarded 18-month grants in response to this targeted RFA:

Joseph Buxbaum, Ph.D. (Icahn School of Medicine at Mount Sinai), Michael Talkowski, Ph.D. (Massachusetts General Hospital), Xin He, Ph.D. (University of Chicago)
Integrating large-scale whole-exome data with whole-genome data

Hilary Coon, Ph.D. and Gabor Marth, Ph.D. (University of Utah)
Combining whole-genome sequencing data from Utah high-risk pedigrees and Simons Simplex Collection families

Gregory Cooper, Ph.D. (HudsonAlpha Institute for Biotechnology)
Discovery of regulatory variants underlying pediatric neurological disease

Evan Eichler, Ph.D. (University of Washington)
Structural variation and the genetic architecture of autism

Matthew State, M.D., Ph.D., Stephan Sanders, M.B.B.S., Ph.D., Jeremy Willsey, Ph.D. (University of California, San Francisco), David Goldstein, Ph.D. (Columbia University), Nenad Sestan, M.D., Ph.D. (Yale University)
Extending autism risk locus discovery to the noncoding genome

Which researchers can use the WGS data?

We emphasize that the WGS data are available for use by the entire research community, regardless of SFARI funding, with no restrictions other than an agreement not to publish findings until the whole genomes for all 540 families have been released to the community.

Genome sequencing details

All genomes are being sequenced to 30x mean coverage and will be available in BAM file format, containing all passed-filter reads and quality scores. Reads are being aligned to hg19/NCBI build 37 using the Burrows-Wheeler Aligner (BWA) software package (BWA-MEM algorithm), and all local realignment and variant calling follow Genome Analysis Toolkit (GATK) best practices (version 3.2-2 for the pilot 40 families and version 3.4 for the next 500 families). Note that the pilot 40 families will be realigned with version 3.4 in the near future. Single nucleotide variant (SNV), indel and structural variant calls will also be made available.
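For readers less familiar with the term, mean coverage is the total number of sequenced bases divided by the length of the reference genome. The Python sketch below illustrates that relationship only. The read length (150 bp) and the genome length (about 3.1 billion bases) are assumptions chosen for illustration and are not specified in this announcement; the function names are likewise hypothetical.

```python
# Illustrative only: relates read count, read length and genome length
# to mean coverage. Not part of the NYGC or SFARI pipelines.

ASSUMED_GENOME_LENGTH = 3_100_000_000  # assumption: approximate human genome size in bases
ASSUMED_READ_LENGTH = 150              # assumption: read length in bases, not stated in the text
TARGET_COVERAGE = 30                   # from the announcement: 30x mean coverage


def mean_coverage(num_reads: int, read_length: int, genome_length: int) -> float:
    """Mean coverage = total sequenced bases / genome length."""
    return num_reads * read_length / genome_length


def reads_needed(coverage: int, genome_length: int, read_length: int) -> int:
    """Smallest number of reads that reaches the requested mean coverage."""
    total_bases = coverage * genome_length
    # Integer ceiling division avoids floating-point rounding.
    return -(-total_bases // read_length)


if __name__ == "__main__":
    n = reads_needed(TARGET_COVERAGE, ASSUMED_GENOME_LENGTH, ASSUMED_READ_LENGTH)
    cov = mean_coverage(n, ASSUMED_READ_LENGTH, ASSUMED_GENOME_LENGTH)
    print(f"reads needed under these assumptions: {n:,}")
    print(f"resulting mean coverage: {cov:.1f}x")
```

Under different read lengths the read count changes proportionally, while the 30x target stays the same.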

The list of the IDs of the 540 families (together with summary information) is available here.

How to access the data

All data will be made available by request after logging into SFARI Base and completing an application. See here for more information about the application process.

Once the application is approved, the data can be accessed in either of two ways:

Cloud-based access (via Amazon Web Services)
Fermilab access (data can be transferred to research institute servers via GridFTP)

Additional details about how to download or access the data (roughly 450 TB in total) will be provided to researchers after their SFARI Base application has been approved.

Timeline for data release

As of 11 December 2015, BAM files from the first 40 pilot families (i.e., 160 genomes) are available, with variant call format (VCF) files and structural variant calls to be available shortly.

Of the 2,000 genomes being sequenced in the next phase, BAM files for 400 are currently available. Please note that these 400 genomes do not form complete quad families. Additional batches of 400 BAM files will be made available as soon as sequencing is completed and the data are transferred from the NYGC to our data storage archives. Jointly called SNVs, indels and structural variants will be made available periodically in batches of 100 complete quad families. We expect all sequencing to be complete by the end of February 2016. We will provide updates to this announcement as new batches of data are released.
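The release schedule above implies a small amount of arithmetic, sketched here in Python. It assumes that every batch is full-sized, which the announcement does not guarantee; the variable names are illustrative.

```python
import math

GENOMES_PER_QUAD = 4          # two parents, affected child, unaffected sibling
PHASE_FAMILIES = 500          # quad families in this phase (from the announcement)
BAM_BATCH_SIZE = 400          # genomes per BAM release (from the announcement)
VARIANT_BATCH_FAMILIES = 100  # complete quads per joint variant-call release

phase_genomes = PHASE_FAMILIES * GENOMES_PER_QUAD

# Assumption: all batches are full-sized, so the counts divide evenly.
bam_batches = math.ceil(phase_genomes / BAM_BATCH_SIZE)
variant_batches = math.ceil(PHASE_FAMILIES / VARIANT_BATCH_FAMILIES)

# One batch of 400 BAM files is already available.
remaining_bam_batches = bam_batches - 1

if __name__ == "__main__":
    print(f"genomes in this phase: {phase_genomes}")
    print(f"BAM batches in total: {bam_batches} ({remaining_bam_batches} still to come)")
    print(f"variant-call releases: {variant_batches}")
```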

Future plans for additional WGS of the SSC

We are also pleased to announce plans to sponsor the WGS of an additional 500 SSC quad families, to be released sometime in 2016. This will bring the total to 1,040 SSC quad families (i.e., 4,160 whole genomes). We will update our website as a more precise timeline for generation and release of these data becomes available.
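The family and genome totals quoted in this announcement follow from four genomes per quad family. A minimal Python sketch of that bookkeeping is shown below; the phase labels and the function name are illustrative and do not correspond to any SFARI Base identifiers.

```python
# Each SSC quad contributes four genomes: two parents, the affected child
# and one unaffected sibling.
GENOMES_PER_QUAD = 4

# Quad-family counts per sequencing phase, taken from this announcement.
# The dictionary keys are illustrative labels only.
PHASES = {
    "pilot": 40,
    "current_phase": 500,
    "planned_2016": 500,
}


def genomes_for(families: int) -> int:
    """Return the number of whole genomes for a given number of quad families."""
    return families * GENOMES_PER_QUAD


total_families = sum(PHASES.values())
total_genomes = genomes_for(total_families)

if __name__ == "__main__":
    for name, families in PHASES.items():
        print(f"{name}: {families} families, {genomes_for(families)} genomes")
    print(f"total: {total_families} families, {total_genomes} genomes")
```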

We are also aware of an effort funded by the National Institutes of Health to sequence a non-overlapping set of SSC families; we will post information about this dataset as soon as possible.