SEPIO Overview
Scientific assertions are statements made by a particular agent on a particular occasion, based on the evaluation of evidence for and against the proposition it puts forth as true. Across scientific domains, assertions are generated in a variety of ways and called by a variety of names, including "annotations" (e.g. Gene Ontology annotations [1] stating that a gene has a particular function), "associations" (e.g. genotype-to-phenotype (G2P) associations [2] stating that a genotype causes a particular phenotype), and "interpretations" (e.g. clinical variant interpretations [3] stating that a genetic variant predicts a particular clinical outcome). What these all have in common as scientific assertions is that they draw conclusions about the world that are based on the interpretation of information as evidence.
A given assertion is only as strong as the evidence and provenance that supports it. Evidence for an assertion includes any information that is used to evaluate the validity of the proposition it puts forth. Provenance information describes the process history behind the assertion, including how and by whom it was made, and how the information used as evidence was generated. An ability to evaluate the accumulated evidence and provenance behind an assertion is critical to the evolution and application of scientific knowledge - as it is on this foundation that an assertion may be accepted as fact and acted upon in research and clinical settings.
The Scientific Evidence and Provenance Information Ontology (SEPIO) was developed to support rich, computable representations of the evidence and provenance behind scientific assertions. The core ontology defines a generic model that can be applied in any domain and extended with domain-specific features. The ontological model is the foundation of a larger framework that provides mechanisms for creating custom ontology-based schemas for specific applications that leverage modern semantic web standards. This framework comprises four main components:
- SEPIO Core Ontology: defines the core, domain-agnostic model using the 'open world' OWL description logic language.
- SEPIO Information Model: provides a UML-like view of the ontology with the constraints of a 'closed world' data model, specifying how terms and design patterns defined in SEPIO may be used to structure data.
- SEPIO Profiles: application-specific data models that refine the maximal information model, and can extend it with domain-specific content to support custom schemas for a particular use case.
- SEPIO Value Sets: re-usable collections of terms that can be bound to attributes in a particular Profile to constrain data entry.
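The relationship between the maximal information model, Profiles, and Value Sets can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the type names, attribute names (`statement`, `direction`, `strength`, `condition`, and so on), the value set contents, and the helper functions `make_profile` and `validate` are all hypothetical placeholders, not terms or tooling taken from SEPIO itself.

```python
from typing import Dict, List, Optional, Set, Tuple

# Hypothetical stand-in for the 'maximal' information model:
# each type maps to every attribute that could be captured for it.
MAXIMAL_MODEL: Dict[str, Set[str]] = {
    "Assertion": {"statement", "evidence_lines", "created_by", "date_created", "method"},
    "EvidenceLine": {"evidence_items", "direction", "strength"},
}


def make_profile(
    selected: Dict[str, Set[str]],
    extensions: Optional[Dict[str, Set[str]]] = None,
) -> Dict[str, Set[str]]:
    """Build a profile: a subset of the maximal model plus optional extensions."""
    profile: Dict[str, Set[str]] = {}
    for type_name, attrs in selected.items():
        unknown = attrs - MAXIMAL_MODEL.get(type_name, set())
        if unknown:
            raise ValueError(f"{type_name}: not in maximal model: {sorted(unknown)}")
        profile[type_name] = set(attrs)
    # Extensions add domain-specific attributes that the maximal model lacks.
    for type_name, attrs in (extensions or {}).items():
        profile.setdefault(type_name, set()).update(attrs)
    return profile


def validate(
    record_type: str,
    record: Dict[str, str],
    profile: Dict[str, Set[str]],
    value_sets: Dict[Tuple[str, str], Set[str]],
) -> List[str]:
    """Return a list of problems found in one record (empty if none)."""
    errors: List[str] = []
    allowed = profile.get(record_type, set())
    for attr, value in record.items():
        if attr not in allowed:
            errors.append(f"{attr}: not part of this profile")
        elif (record_type, attr) in value_sets and value not in value_sets[(record_type, attr)]:
            errors.append(f"{attr}: {value!r} is not in the bound value set")
    return errors


# A hypothetical profile that keeps a few attributes and adds one extension.
profile = make_profile(
    selected={
        "Assertion": {"statement", "evidence_lines", "created_by"},
        "EvidenceLine": {"evidence_items", "direction"},
    },
    extensions={"Assertion": {"condition"}},
)

# A hypothetical value set bound to one attribute of the profile.
VALUE_SETS = {("EvidenceLine", "direction"): {"supports", "disputes"}}

ok = validate("EvidenceLine", {"direction": "supports"}, profile, VALUE_SETS)
bad = validate("EvidenceLine", {"direction": "maybe", "strength": "strong"}, profile, VALUE_SETS)
```

Here `ok` is empty, while `bad` reports both a value outside the bound value set and an attribute that the profile chose not to include. This mirrors the 'closed world' behaviour of a data model, in contrast to the 'open world' ontology.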
Rooting of SEPIO data models in ontologies enables generation of ‘semantically-enhanced’ data, where knowledge encoded in supporting ontologies can add value by making the data better interpretable by humans and machines. Practical benefits include enhanced search and data exploration, improved integration with external data, and the capacity for algorithmic derivation of new knowledge through automated reasoning and semantic analysis approaches.
The SEPIO model describes how information is interpreted as evidence in support of an assertion, and how this information is initially generated, accessed, and curated for use as evidence. It is built around a central, repeatable axis that defines the relationship between an Assertion, Evidence Lines, and Evidence Items, which are decorated by additional elements that capture their provenance (Figure 1).
Figure 1: Core high-level concepts and relationships in the SEPIO model. Three informational entities comprise the central axis (blue). The "Evidence Item" term in guillemets (<<>>) indicates that any Information Content Entity contributing to an Evidence Line is inferred to be an instance of a SEPIO 'Evidence Item'.
- Assertions are evidence-based statements of purported truth, as made by a particular agent on a particular occasion (e.g. Counsyl Genetics' 2015 assertion that the BRCA2:c.8023A>G variant is pathogenic for Breast Cancer).
- Evidence Items are the individual pieces of information (Information Content Entities) that are interpreted to build the arguments for or against an Assertion (e.g. population frequency data about the prevalence of the BRCA2:c.8023A>G variant in healthy individuals). Evidence Items can be primary data, statistical calculations derived from primary data, tables or figures depicting these data, statements summarizing the results of a particular study, or prior assertions describing other evidence-based conclusions.
- Evidence Lines are independent, meaningful arguments relevant to the validity of a target assertion, each supported by one or more Evidence Items (e.g. the argument made for the BRCA2:c.8023A>G variant's pathogenicity by the fact that it is absent in healthy populations). Representing the individual pieces of information used as evidence (Evidence Items) separately from the 'arguments' they make (Evidence Lines) is an important feature of the SEPIO model discussed in detail here.
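The central axis described in the three items above can be sketched as plain Python data classes, populated with the BRCA2 example from the text. This is a minimal illustration only: the class and field names (`proposition`, `asserted_by`, `asserted_on`, `argument`, `description`) are assumptions made for this sketch and are not identifiers from the SEPIO ontology or information model.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class EvidenceItem:
    """A piece of information interpreted as evidence."""
    description: str


@dataclass
class EvidenceLine:
    """An independent argument, supported by one or more Evidence Items."""
    argument: str
    evidence_items: List[EvidenceItem] = field(default_factory=list)


@dataclass
class Assertion:
    """A statement made by a particular agent on a particular occasion."""
    proposition: str
    asserted_by: str
    asserted_on: str
    evidence_lines: List[EvidenceLine] = field(default_factory=list)


# The example used in the text, expressed with the hypothetical classes above.
frequency_data = EvidenceItem(
    description="population frequency data about the prevalence of the "
                "BRCA2:c.8023A>G variant in healthy individuals"
)
absence_argument = EvidenceLine(
    argument="the variant is absent in healthy populations",
    evidence_items=[frequency_data],
)
assertion = Assertion(
    proposition="BRCA2:c.8023A>G is pathogenic for Breast Cancer",
    asserted_by="Counsyl Genetics",
    asserted_on="2015",
    evidence_lines=[absence_argument],
)
```

Keeping `EvidenceItem` separate from `EvidenceLine` means the same piece of information could be referenced by more than one argument without being duplicated.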
These central Assertion, Evidence Line, and Evidence Item entities are further described by elements that capture their provenance - who contributed to them, when, and how.
- Agents are the persons, organizations, or intelligent software that conceive of and create Assertions, Evidence Lines, and information used as Evidence Items.
- Activities include the research and curation processes that generate data, and the interpretation and reasoning tasks that apply this information as Evidence to make Assertions.
- Methods are directive specifications that can guide Agents in the execution of such Activities, such as guidelines or heuristics supporting cognitive tasks, or protocols supporting research processes.
- Documents are physical or digital artifacts created to capture and share information, including narrative publications and reports, or structured records in a database.
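As a rough illustration of how these four provenance elements could hang together, the sketch below links an Activity to the Agent that performed it, the Method that guided it, and the Document that records it. All class names, field names (`performed_by`, `guided_by`, `reported_in`), and example values are hypothetical and chosen only for this sketch; they are not relations defined by SEPIO.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Agent:
    name: str
    kind: str  # e.g. "person", "organization", or "software"


@dataclass(frozen=True)
class Method:
    title: str  # e.g. a guideline, heuristic, or protocol


@dataclass(frozen=True)
class Document:
    title: str  # e.g. a publication, report, or database record


@dataclass(frozen=True)
class Activity:
    description: str
    performed_by: Agent
    guided_by: Optional[Method] = None
    reported_in: Optional[Document] = None


def provenance_summary(activity: Activity) -> str:
    """Answer 'who, how, and where recorded' for a single activity."""
    parts = [f"{activity.description} by {activity.performed_by.name}"]
    if activity.guided_by is not None:
        parts.append(f"following {activity.guided_by.title}")
    if activity.reported_in is not None:
        parts.append(f"recorded in {activity.reported_in.title}")
    return ", ".join(parts)


# Placeholder values, not real provenance data.
interpretation = Activity(
    description="variant interpretation",
    performed_by=Agent(name="Example Curation Lab", kind="organization"),
    guided_by=Method(title="an example interpretation guideline"),
    reported_in=Document(title="an example database record"),
)
summary = provenance_summary(interpretation)
```

In a fuller model, an Activity would also point at the Assertion, Evidence Line, or Evidence Item it produced; that link is omitted here to keep the sketch short.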
It is important to note the inherent capacity of this simple core axis for expansion to support rich evidence structures. The cardinalities of relationships along the central axis allow for the model to expand "horizontally" to capture multiple lines of evidence for a given assertion and multiple evidence items supporting each line. Furthermore, the model can expand "vertically" in cases where the Evidence Item for one assertion is itself a prior assertion that has its own trail of evidence and provenance - creating the potential to expand downward to arbitrary depths to trace evidence and provenance through multiple levels of assertions and their supporting evidence. Examples of these expansions can be seen in the GO Annotation and ClinGen-ACMG data examples, and a deeper dive into their implementation can be found here.
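The horizontal and vertical expansion described above can be illustrated with a small recursive walk. In this sketch an Evidence Item may optionally carry a prior Assertion, and `trace` follows that link downward to any depth. The class names, the `prior_assertion` field, and the placeholder propositions are assumptions made for illustration. The function also assumes the evidence trail contains no cycles.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Assertion:
    proposition: str
    evidence_lines: List["EvidenceLine"] = field(default_factory=list)


@dataclass
class EvidenceItem:
    description: str
    # Set when this item is itself a prior assertion (vertical expansion).
    prior_assertion: Optional[Assertion] = None


@dataclass
class EvidenceLine:
    argument: str
    evidence_items: List[EvidenceItem] = field(default_factory=list)


def trace(assertion: Assertion, depth: int = 0) -> List[Tuple[int, str]]:
    """Return (depth, label) rows for the full evidence trail of an assertion."""
    rows = [(depth, f"Assertion: {assertion.proposition}")]
    for line in assertion.evidence_lines:          # horizontal: many lines
        rows.append((depth + 1, f"Evidence Line: {line.argument}"))
        for item in line.evidence_items:           # horizontal: many items
            rows.append((depth + 2, f"Evidence Item: {item.description}"))
            if item.prior_assertion is not None:   # vertical: recurse downward
                rows.extend(trace(item.prior_assertion, depth + 3))
    return rows


# Placeholder content: conclusion A rests partly on a prior conclusion B.
prior = Assertion(
    "prior conclusion B",
    [EvidenceLine("argument for B", [EvidenceItem("primary data")])],
)
top = Assertion(
    "conclusion A",
    [
        EvidenceLine("argument 1", [EvidenceItem("prior assertion B", prior)]),
        EvidenceLine("argument 2", [EvidenceItem("statistical calculation")]),
    ],
)

trail = trace(top)
for level, label in trail:
    print("  " * level + label)
```

Each level of indentation in the printed trail corresponds to one step along the central axis, so a prior assertion appears nested beneath the Evidence Item that cites it.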
Each concept defined in the core SEPIO model above can be characterized in detail by the relationships and design patterns defined in the ontology. The SEPIO Information Model provides a 'data model oriented' view of ontology content that highlights all attributes that can be captured for each core entity (Figure 2). This view can help adopters define SEPIO Profiles tailored to model data in a particular domain of application. SEPIO Profiles contain a subset of the types and attributes in the 'maximal' information model that are relevant for a particular data set or application, and can define extensions to represent domain-specific concepts and relationships. The GO Annotation and ClinGen-ACMG profiles provide relatively simple and complex examples, respectively, of SEPIO Profiles created in this way.
Figure 2: UML Diagram of the Maximal SEPIO Information Model. Boxes represent data model types, holding the attributes SEPIO can describe for each. Edges in the diagram represent key relationships between core data types. Attributes in orange are 'shortcut relations' that can be used to directly link objects connected by more than one relationship in a fully normalized model. Attributes with asterisks (*) are those for which sub-properties are defined in the ontology, allowing more precise relationships to be captured. See here for a more detailed view and discussion of the Information Model.
Collectively, the concepts and relationships defined in the SEPIO model can be applied to represent the reasoning tasks performed by agents evaluating evidence to make an assertion, the curation tasks involved in accessing and preparing data for use as evidence, and the experimental processes that generated the data in the first place. The features of the SEPIO Model and Framework outlined above support incremental expressivity, where simple or complex data structures can be built to capture the level of detail desired for a particular dataset or application. This flexibility is supported both by inherent features of the core SEPIO Model and by the extension mechanisms used to create SEPIO Profiles, and is critical to the utility and adoption of the model across diverse domains and communities of practice.
TO DO: Show CG Profile Data Example to illustrate complexity that is possible using SEPIO Model/Framework.
As noted above, the best way to appreciate these features in action is to explore the data examples provided in this Wiki, starting with the GO Annotation example, and moving on to examples in other domains. Deeper dives into the following topics can be found on Wiki pages dedicated to each. We recommend starting with those below:
- SEPIO GO Annotation Example (TO DO)
- SEPIO Framework
- SEPIO Use Cases (TO DO)
- SEPIO Core Ontology
- SEPIO Information Model
- SEPIO Core Concept Pages (TO DO)
- SEPIO Evidence Lines (TO DO)
- SEPIO Profiles (TO DO)
- ClinGen-ACMG SEPIO Profile
- SEPIO ClinGen-ACMG Variant Interpretation Example (TO DO)
- SEPIO CIViC Somatic Variant Interpretation Example