SFARI Gene Workshop Touches on the Future of Autism Gene Databases

SFARI Gene is a comprehensive database that integrates many types of data about genes that have been implicated in autism susceptibility. The resource, which is curated by MindSpec and supported by the Simons Foundation, has become a trusted source of information for the autism research community since its debut in 2008.

Users and developers of SFARI Gene and other data resources relevant to SFARI’s mission gathered in January to discuss how SFARI Gene might be reimagined 15 years after its creation, when both new sources of data about autism and new technologies for data curation have emerged. Workshop organizer and SFARI senior scientist Alan Packer asked participants to consider: “Given the state of the field and the range of resources in existence, what would a useful and sustainable autism genetics database look like in 2025 and beyond?”

Over a day and a half, workshop participants discussed existing databases and other tools focused on different aspects of autism genetics, as well as databases curating autism-relevant datasets that are broader in scope. Speakers highlighted the roles of their resources, how they are curated and maintained, and how they integrate with SFARI Gene or other resources or might do so in the future.

SFARI Gene’s creator and MindSpec president Sharmila Basu described the history of the database, which launched as AutismDB in 2007 as a publicly available, annotated catalog of genes that had been linked to autism susceptibility. The growing database is maintained by MindSpec’s team of scientists, developers, and analysts. Data are manually curated from peer-reviewed scientific literature and undergo significant standardization and cleaning before being exported to the database.

SFARI Gene is organized around human gene modules that include primary references, support studies, and ASD-associated variants, with links to modules on ASD-associated copy number variants, animal models, and gene scoring indicating the strength of evidence for an ASD association. The database now includes 1,416 autism-associated genes, with 44 new genes and more than 3,000 variants added in 2023. Looking toward the future, Basu discussed ways SFARI Gene might help close the gap between genetic diagnoses for autism and clinical management, with a key need being curation and standardization of genotype/phenotype data.

Presentations

Genotypes and Phenotypes in Families

Ivan Iossifov, a Cold Spring Harbor Laboratory professor and an associate faculty member at the New York Genome Center, discussed the Genotypes and Phenotypes in Families (GPF) platform, a tool for visualizing and analyzing genetic and phenotypic data from SFARI’s Simons Simplex Collection (SSC), Simons Searchlight and SPARK cohorts. GPF is integrated with SFARI Base, which handles approvals for access to protected data. Gene profiles and statistics as well as a collection of de novo variants are publicly available. An open-source tool, GPF can also be used to analyze and share data from other sources. Iossifov showed how GPF can integrate diverse data from different groups or sources and visualize variants’ occurrence in duos and trios as well as complex, multigenerational families. Users can browse data by genotype or phenotype, measure genotype/phenotype relationships and search for enrichment of de novo variants within gene sets.

SFARI Genome Browser

Monkol Lek, assistant professor at the Yale University School of Medicine, discussed the SFARI Genome Browser, a publicly available tool that integrates and visualizes sequencing data from SFARI cohorts. Developed by adapting the open-source code used in the Genome Aggregation Database (gnomAD), the Genome Browser offers users a quick way to find variants that have been discovered in genes of interest or assess the frequency of those variants within SFARI cohorts in individuals both with and without autism diagnoses. Direct links to specific genes in the SFARI Gene database provide additional information. The browser presents only summary data for variants; no individual or family data are shown. A limited amount of raw data is provided so users can evaluate the strength of variant calls. Users can also access and analyze data using the Genome Browser’s API.

VariCarta

More than 300,000 autism-related variant events are catalogued in the VariCarta database, with each event representing a variant’s occurrence in an individual subject. The data, curated from 120 published papers, come from 30,000 individuals who have been diagnosed with autism. Sanja Rogic, a research scientist and lab manager in Paul Pavlidis’s lab at the University of British Columbia, described VariCarta’s curation, standardization and harmonization process, including subject ID reconciliation, functional annotation and checking for overlaps in reported variant events. Information about each paper reporting a variant is provided, including study design, cohorts and the source of variant information within the paper. Approximately a quarter of the variants in the database have been reported in more than one paper. Data are available for download and the software is open source.
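
As a rough illustration of the overlap check described above, the sketch below groups reports of the same variant in the same subject across papers after reconciling subject IDs. It is not VariCarta's code: the `VariantEvent` fields, the `subject_aliases` map and both function names are assumptions made for this example, and a real pipeline would also have to harmonize genome builds and variant representations.

```python
from collections import defaultdict
from typing import NamedTuple


class VariantEvent(NamedTuple):
    """One reported occurrence of a variant in one subject (hypothetical schema)."""
    paper: str
    subject_id: str
    chrom: str
    pos: int
    ref: str
    alt: str


def group_events(events, subject_aliases=None):
    """Group reports that describe the same variant in the same subject.

    subject_aliases maps paper-specific subject IDs to one reconciled ID;
    IDs missing from the map are used unchanged.
    """
    subject_aliases = subject_aliases or {}
    papers_by_event = defaultdict(set)
    for ev in events:
        subject = subject_aliases.get(ev.subject_id, ev.subject_id)
        key = (subject, ev.chrom, ev.pos, ev.ref.upper(), ev.alt.upper())
        papers_by_event[key].add(ev.paper)
    return dict(papers_by_event)


def fraction_multiply_reported(grouped):
    """Share of distinct variant events that appear in more than one paper."""
    if not grouped:
        return 0.0
    repeated = sum(1 for papers in grouped.values() if len(papers) > 1)
    return repeated / len(grouped)
```

Keying each event on the reconciled subject plus the variant's coordinates and alleles means the same observation published twice collapses into one entry that records both papers, which is what makes overlapping reports countable.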

Denovo-db

Washington University in St. Louis assistant professor Tychele Turner discussed de novo variants in the human genome, which she and colleagues have catalogued in the database Denovo-db. Protein-coding de novo variants are enriched in populations with ASD, as well as intellectual disability (ID), epilepsy and congenital heart defects. Denovo-db’s latest release contains more than one million unique de novo variant sites, identified using short-read exomes, short-read genomes and long-read genomes from 72,633 trios in 80 studies. Variants at some sites have been identified in up to 40 individuals. De novo variants reported in the literature are included in the database regardless of phenotype. Currently, most are associated with ASD or other neurodevelopmental disorders, but other phenotypes are becoming better represented in the database as research on de novo variation broadens.

Gene Metrics for Functioning

Institut Pasteur professor Thomas Bourgeron talked about the importance of moving beyond autism diagnoses to deepen understanding of autism-associated genes’ effects on function. He highlighted the World Health Organization’s International Classification of Functioning framework, a 1,700-question system, as a tool for comprehensively assessing individuals’ body function and structure, activities, and participation (such as employment and social interactions). Bourgeron discussed his work investigating the phenotypic effects of genetic variants associated with autism. His team has found that people with loss-of-function variants in autism-associated genes but no diagnosis of autism have lower levels of education, employment, health and income than people without those variants. He also noted that genetic factors can impact study participation, citing an analysis showing that participants in the UK Biobank who have loss-of-function mutations associated with ASD were less likely than others to have a brain MRI or complete certain questionnaires [1].

SysNDD

Bernt Popp, a senior physician at Berlin Institute of Health at Charité, discussed SysNDD, which curates gene-disease relationships for neurodevelopmental disorders. SysNDD includes more than 3,000 entities, each comprising a gene, an inheritance pattern and a disease. Expert curators link information to each entity describing phenotypes and variant types associated with the disease, a clinical synopsis and relevant publications. Each is also assigned a confidence status. The actively maintained database currently includes nearly 1,800 definitive entries. Data can be viewed and analyzed in SysNDD’s web browser or accessed through an API. Popp demonstrated how SysNDD can enable systems biology and network analyses and stressed the need for standardization, open data and APIs to enable analysis across datasets and tools.

Evaluation of Autism Gene Link Evidence

Jacob Vorstman, professor at the University of Toronto’s Hospital for Sick Children, discussed the importance of distinguishing genetic findings associated with autism from those linked to other neurodevelopmental disorders, particularly to inform genetic counseling. The Evaluation of Autism Gene Link Evidence (EAGLE) is a framework for evaluating genes’ association specifically with ASD rather than neurodevelopmental disorders more broadly. EAGLE uses the same framework for evaluating evidence as ClinGen, with an additional layer for assessing the quality of the phenotype. The system supports a fine-grained evaluation of genes with definitive associations to ASD, with higher scores assigned to the genes with the most abundant evidence [2]. SFARI Gene includes an EAGLE score for many of the top-ranked genes in the database. Vorstman demonstrated how comparing genes with high EAGLE scores with gene lists from SFARI or ClinGen can point toward biological distinctions between ASD and intellectual disability without ASD.

Gene Portals to Enhance Knowledge Transfer

Dennis Lal, associate professor at the University of Texas Health Science Center at Houston, talked about gene portals designed to enhance knowledge transfer so research findings can better support clinical decision making and genetic counseling. Lal noted the challenges of interpreting genetic variants in clinical settings, where different variants in the same gene can have significant consequences for disease progression and appropriate treatment. Using the GRIN Portal, he demonstrated how expert-curated information about variants and clinical phenotypes is presented along with educational videos in a visual format accessible to clinicians and families. The portal also directs users to foundations, family groups, registries and outside resources. Information for researchers, such as structural mapping of variants, phenotype measures and variants’ effects on gene function, is also accessible through the portal.

Developmental Brain Disorder Gene Database

Christa Lese Martin, chief scientific officer at Geisinger Health System, discussed the Developmental Brain Disorder Gene Database, which uses a cross-disorder approach to curating genes associated with developmental brain disorders. Curators use associations with any of seven conditions—intellectual disability, autism, attention deficit hyperactivity disorder, schizophrenia, bipolar disorder, epilepsy and cerebral palsy—as evidence for a gene’s role in developmental brain disorders. Evidence comes from published literature including supplementary data, using a curation strategy that combines automated PubMed searches with manual expert curation. The level of evidence for each gene’s association is noted with a three-tier classification system. Martin also explained how ClinGen is engaging patients directly in data collection through Genome Connect, which has registered more than 10,000 participants and enables them to share their medical histories and genetic testing results for use in research.

SynGO

The SynGO consortium, coordinated by Guus Smit and speaker Matthijs Verhage, head of the Functional Genomics Department at Vrije Universiteit Medical Center in Amsterdam, has developed an ontology for describing the location and function of synaptic genes and proteins. Experts in synapse biology are using this ontology to annotate synaptic genes and proteins in the SynGO database. They also track the sources of evidence supporting the inclusion of each gene using a framework developed by the consortium. Database users can see these sources of evidence and filter search results according to evidence type. SynGO visualizes data with sunburst plots, which use concentric circles to depict pre- and post-synaptic cellular components and biological processes. With more than 1,500 genes now annotated in the database, SynGO2 has begun developing interactome networks. Verhage discussed how these can help uncover autism-relevant networks. The consortium’s eventual goal is to build causality models to inform predictions about how genetic variations impact synaptic function.

BrainRBPedia

Howard Lipshitz, professor at the University of Toronto, talked about his group’s use of machine learning to predict which RNA-binding proteins are likely to be associated with autism. Their model is trained and tested using data from several sources, including SFARI Gene. These data include information about each gene’s or protein’s enrichment in neurons or brain tissue, expression during neurodevelopment, loss-of-function intolerance, protein expression and association with ASD susceptibility. Information about these proteins is shared via BrainRBPedia. BrainRBPedia currently includes more than 1,000 RNA-binding proteins, including more than 400 canonical RNA-binding proteins with known RNA-binding domains and sequence-structure specificity, and more than 650 non-canonical RNA-binding proteins identified through mRNA-interaction experiments. SFARI Gene includes the genes for 91 of these RNA-binding proteins [3].

BioGRID

The BioGRID database, developed by speaker Kara Dolinski, director of genome databases at Princeton University, and collaborator Mike Tyers, catalogues data on molecular interactions. Its interface enables users to visualize the data, make comparisons across datasets and link directly to other databases for additional information. Originally developed as a repository for protein-protein interactions in yeast, BioGRID now includes nearly 2.7 million molecular interactions from 83 organisms, including chemical interactions and human gene/protein interactions. Interactions identified in both high- and low-throughput experiments are included. Users can easily see what type of experiments generated the data and find associated publications. AI tools are used to prioritize publications for curation. Curators manually extract interaction data from these publications and standardize them for inclusion in BioGRID. Dolinski highlighted some of the project’s themed curations, which focus on specific areas of biology, such as Alzheimer’s disease and coronaviruses.

LitVar

Zhiyong Lu, deputy director for literature search at the National Center for Biotechnology Information (NCBI), talked about how machine learning can improve literature searches, highlighting NCBI’s LitVar tool. LitVar uses text mining and tagging tools as it searches millions of abstracts, full-text publications and supplementary materials in PubMed and PubMed Central. It normalizes variant names so complete search results are returned despite variability in naming. Users can access LitVar through a web browser or via API requests. Lu noted that this type of natural language processing can be used to improve the efficiency of literature curation; his team has worked with Swiss-Prot, UniProt and GWAS Catalog to incorporate natural language processing tools into their curation pipelines.
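
To make the idea of variant-name normalization concrete, here is a deliberately narrow sketch that maps a few common spellings of a protein-level missense change to one canonical string. It is not LitVar's method, which relies on text mining at a much larger scale; the function name `normalize_protein_change` and the choice of one-letter `p.` notation as the canonical form are assumptions for this example only.

```python
import re

# Standard three-letter to one-letter amino acid codes.
THREE_TO_ONE = {
    "Ala": "A", "Arg": "R", "Asn": "N", "Asp": "D", "Cys": "C",
    "Gln": "Q", "Glu": "E", "Gly": "G", "His": "H", "Ile": "I",
    "Leu": "L", "Lys": "K", "Met": "M", "Phe": "F", "Pro": "P",
    "Ser": "S", "Thr": "T", "Trp": "W", "Tyr": "Y", "Val": "V",
}
ONE_LETTER = set(THREE_TO_ONE.values())

# Optional "p." prefix and optional parentheses around the change.
_THREE = re.compile(r"^(?:p\.)?\(?([A-Za-z]{3})(\d+)([A-Za-z]{3})\)?$")
_ONE = re.compile(r"^(?:p\.)?\(?([A-Za-z])(\d+)([A-Za-z])\)?$")


def normalize_protein_change(name):
    """Return a canonical 'p.<ref><position><alt>' string, or None if unrecognized."""
    text = name.strip()
    match = _THREE.match(text)
    if match:
        ref = THREE_TO_ONE.get(match.group(1).capitalize())
        alt = THREE_TO_ONE.get(match.group(3).capitalize())
        if ref and alt:
            return f"p.{ref}{match.group(2)}{alt}"
        return None
    match = _ONE.match(text)
    if match:
        ref, alt = match.group(1).upper(), match.group(3).upper()
        if ref in ONE_LETTER and alt in ONE_LETTER:
            return f"p.{ref}{match.group(2)}{alt}"
    return None
```

With a normalizer like this, a search for one spelling can be matched against papers that used another, because all spellings are compared through the same canonical key. Names the function does not recognize, such as DNA-level descriptions, return `None` rather than a guess.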

The Gene Curation Coalition

Many groups have developed classification systems for evaluating gene-disease relationships and share their findings through online resources. These resources use varied terminology and sometimes show contradictory conclusions about a gene’s role in disease. To better guide both research and genomic medicine, the Gene Curation Coalition (GenCC) is working to harmonize these resources by establishing universal standards for classifying the strength of evidence for a gene’s role in disease. Heidi Rehm, chief genomics officer at Massachusetts General Hospital and a Broad Institute member, explained how the group arrived at standardized terms for describing levels of gene-disease validity and how it is working to resolve curation conflicts between existing resources. More than 21,000 classifications of gene-disease relationships, involving more than 5,100 genes, have been submitted to the GenCC database by members of the coalition.
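
The harmonization problem can be pictured with a small sketch: map each resource's own wording onto a shared vocabulary, then flag gene-disease pairs on which submitters still disagree. Everything named here is hypothetical, including the source labels, the shared terms in `SHARED_TERM` and the submitter names; none of it is taken from GenCC's actual term definitions or data.

```python
from collections import defaultdict

# Hypothetical mapping from resource-specific labels to a shared vocabulary.
SHARED_TERM = {
    "definitive": "definitive",
    "confirmed": "definitive",
    "tier 1": "definitive",
    "probable": "moderate",
    "tier 2": "moderate",
    "possible": "limited",
    "tier 3": "limited",
}


def harmonize(label):
    """Translate one resource-specific label into the shared vocabulary."""
    key = label.strip().lower()
    if key not in SHARED_TERM:
        raise ValueError(f"no shared term defined for label: {label!r}")
    return SHARED_TERM[key]


def find_conflicts(submissions):
    """Return gene-disease pairs whose submitters disagree after harmonization.

    submissions is an iterable of (submitter, gene, disease, label) tuples.
    """
    by_pair = defaultdict(dict)
    for submitter, gene, disease, label in submissions:
        by_pair[(gene, disease)][submitter] = harmonize(label)
    return {
        pair: terms
        for pair, terms in by_pair.items()
        if len(set(terms.values())) > 1
    }
```

Translating labels first separates two kinds of disagreement: differences that are only terminological disappear after mapping, while the pairs that remain in the result reflect genuinely different conclusions and are the ones that need curators' attention.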

Discussion

Following speakers’ presentations, workshop participants considered priorities and opportunities for SFARI Gene’s future in a discussion moderated by Michael Gandal, associate professor at the University of Pennsylvania and Children’s Hospital of Philadelphia; Ian Simpson, professor at the University of Edinburgh; and Olga Troyanskaya, professor at Princeton University and deputy director for genomics at the Flatiron Institute. The discussion addressed what types of data should be included in a new version of the database, curation strategies, and how a SFARI Gene 2.0 should interact and integrate with other resources in the field.

What is the best role for SFARI Gene, given the current landscape of data resources?

While the resources discussed during the workshop serve a variety of audiences, including researchers, clinicians and families, most attendees agreed that SFARI Gene should remain a tool for the research community. Many advocated for a database with a broader scope than the current SFARI Gene, curating data for neurodevelopmental or brain disorders more broadly rather than isolating autism. At the same time, it was pointed out that autism-specific data must be available and clearly defined within the database. Some argued that given both the heterogeneity of ASD and the polygenic nature of most inherited brain disorders, it would make sense to move beyond disease diagnoses and focus instead on genetic variants’ relationships to phenotypes.

Workshop participants recognized that as advances in autism genetics have enabled researchers to deepen their investigations of autism biology, the field is generating and depending on more diverse data types than it did when SFARI Gene was created. An array of resources has emerged to curate and share these data. Still, there is no single, authoritative source for genetic and phenotypic data about autism. Speakers were concerned that this is inefficient both for users of these resources, who must seek information from multiple sources, potentially in incompatible formats and via different interfaces, and for those who maintain them. With so many databases curating overlapping datasets, effort is likely being duplicated, potentially at great cost.

Participants discussed what role SFARI Gene might play in better integrating existing resources. They were asked to consider what types of data SFARI Gene 2.0 should include in its own database, and where it might be more appropriate to direct users to outside resources, with the goal of creating a resource that is both valuable and sustainable. It was noted that it would be critical to ensure that any data accessed through SFARI Gene are both relevant and reliable.

It was also pointed out that SFARI could impact the field by leading an effort to develop ontologies and standardize autism-relevant data, both for publication and for inclusion in SFARI Gene or other databases, particularly for phenotyping data. Standards and ontologies developed by initiatives such as the Mondo Disease Ontology and the Gene Curation Coalition could be valuable in this effort.

What kinds of data should be included in SFARI Gene?

For genetic data, most workshop participants hoped to see a move toward clarifying the associations of specific variants, rather than genes, with disease or phenotype. It should be clear to SFARI Gene users where variants are located within a gene and their impact on gene function, as well as their effect size. They noted that it would also be valuable to have data on epigenetic variants, enhancers and affected isoforms, as well as information on molecular interactions. Expression data might be provided via a link to NIH-supported resources.

Notably, while ClinGen and other clinically focused resources curate genes’ associations to neurodevelopmental disorders, attendees thought it would be important for SFARI to continue to maintain its own gene list for researchers. This is expected to include genes whose link to autism may currently be insufficient to meet clinical criteria for a definitive association. Likewise, it should include gene variants that influence autism risk population-wide, even if their effect in individuals is too small to be considered clinically relevant.

Many participants wanted to see variants linked to phenotype instead of disease diagnosis, a shift that would require development of relevant ontologies. Some also pointed out that it would be valuable to capture data from individuals who carry disease-associated variants but do not meet the clinical criteria for a diagnosis.

Workshop attendees discussed what kinds of data on gene function would be most valuable and potential sources of that data. These data are scarcer because generating them is time-consuming and expensive, but even small amounts of well-chosen functional data would be valuable. Large efforts such as the Allen Brain Atlas, ENCODE, Impact of Gene Variation on Function and Convergent Neuroscience were mentioned as sources of functional data, with the caution that biological context would be essential for interpreting them and that not all screens of gene function will be relevant to the developing brain. Expression data from the BRAIN Initiative could help parse relevant data. It was suggested that the Simons Foundation could consider partnering with organizations working in this space to fill gaps in functional interpretation. Many participants also stressed the need for biological context for these data, particularly regarding developmental timing.

How should SFARI Gene be maintained and its data shared?

Many presenters during the workshop described curation processes that rely heavily on manual curation by experts, sometimes calling on artificial intelligence (AI) tools to help identify and/or prioritize publications for curation. During the discussion, it was noted that manual curation will continue to be essential, but to keep up with growing amounts of data, curators will increasingly need to rely on automated methods to improve efficiency. For instance, AI tools might be used to pre-annotate data for review by an expert curator. SFARI Gene might also turn to existing tools for involving the community in annotation, following a standardized framework that it develops. To improve automation, SFARI could develop corpora for training text-mining tools.

It was also proposed that SFARI could work with patient organizations to develop infrastructure for moving research data into SFARI Gene and that SFARI Gene could be made a repository for all data generated by SFARI Base users.

It was noted that SFARI Gene should allow users to both analyze and download data. In addition to interacting with the database through a web browser, users of other tools should be able to connect through APIs. It was also noted that SFARI Gene’s approach to open science should extend beyond the sharing of data to include sharing of its schemes and modeling choices for data presentation. Likewise, SFARI should leverage existing infrastructures for presenting complex datasets. SFARI should also consider how its data will be stored and how knowledge graph technologies can integrate it with question-answering tools like large language models.

References

  1. Rolland T. et al. Nat. Med. 29, 1671–1680 (2023)
  2. Schaaf C.P. et al. Nat. Rev. Genet. 21, 367–376 (2020)
  3. Han K. et al. bioRxiv (2023)