Source Artefact Identification

Overview

The basis for identifying source (i.e. authored) artefacts is to define a number of separate logically identifying properties, as well as a machine identifier. One or more human-readable identifier(s) can be generated from the non-uid identifying properties. For archetypes and templates, the relevant properties are described in the AOM Archetype package, including the ARCHETYPE_HRID class. Related properties are inherited from the AUTHORED_RESOURCE class into ARCHETYPE are shown, including the lifecycle_state property, as well as all other descriptive meta-data.

Figure 1. am.aom2.archetype Package

For other types of artefacts the detailed model will differ, but the principles are the same.

Three distinct groups of properties shown in the ARCHETYPE_HRID class that underpin the identification scheme described here, as follows:

namespace provides a way of distinguishing logical identifiers created by different organisations that would otherwise compete in a single semantic identifier space;
rm_publisher, rm_closure, rm_class, concept_id form the basis of the main part of a human-readable identifier, e.g. openEHR-EHR-OBSERVATION.bp_measurement;
properties supporting versioning:
- release_version, expressing a 3-part version identifier, e.g. '1.3.0';
- build_uid , incremented at every commit, supporting non-release version ids, such as '1.3.0-rc.28' and '1.3.0-alpha', where the build count is 28;
- description.lifecycle_state, expressing the development state of the artefact, and used to derive the 'rc' (release candidate) and 'alpha' (development) parts of non-release version ids.

Functions such as interface_id, physical_id and version_id are defined to return respectively the 'interface' and 'physical' archetype HRIDs (described below) as strings, and the full version string (computed from release_version , build_uid and description.lifecycle_state ). The functions major_version , minor_version and patch_version extract the various parts of the 3-part release_version property.

The uid property provides the machine identifier, and is assumed to be a Guid.

Both the uid and namespace properties are optional for legacy reasons, since most existing archetypes have neither. The interpretation of an artefact without these identifiers in this specification is that it is unmanaged , i.e. it has no recognised owner organisation. During a period of changeover to the identifiers specified here, there will clearly be artefacts that are in fact managed, and which need to have the uid and namespace properties assigned. This will obviously take some time, as it requires support from the tooling ecosystem.

Different types of human-readable identifiers are used for archetypes, templates and terminology subsets. The following sections describe the formal details of this identification scheme, and how it supports referencing between artefacts.

Formal Model

This section defines a formal grammar for human-readable indentifiers of knowledge artefacts. As described above, more than one human-readable identifier can be constructed from the identifying properties of the ARCHETYPE_HRID class. The grammar is shown in green below.

The highest level distinction is between managed and unmanaged artefacts, with managed status being indicated by the prepending of a namespace to what can be termed a 'local' HRID (i.e. local to a given namespace context).

    artefact_hrid           =   namespaced_hrid | local_hrid ;
    namespaced_hrid         =   namespace '::' local_hrid ;
    local_hrid              =   hrid_root '.v' version_id ;
    namespace               =   V_REVERSE_DOMAIN_NAME ;
    V_REVERSE_DOMAIN_NAME   =   ; (* See IETF RFCs 1035, 123, and 2181. *)

The namespace above is the publisher organisation reverse domain name. Reverse domain names are used in order to aid lexical sorting of identifiers and also tools that build directory structures based on reverse domain name segments. All managed artefacts, including archetypes and templates should include a namespace. Any archetype or template carriying an identifier without a namespace is assumed to be an unmanaged artefact.

Examples:

org.openehr : EHR archetypes library at openEHR.org
uk.nhs : UK National Health Service
edu.nci : US National Cancer Institute

Human-readable Identifier (HRID)

The archetype human-readable identifier consists of two logical parts: an identifier of the reference model (i.e. logical information model) class on which it is based, and an ontological identifier.

The identifier is defined by the following grammar rules, which are a slightly simplified version of the grammar for the openEHR ARCHETYPE_ID type:

    hrid_root                       =   qualified_rm_class_name '.' concept_id ;
    qualified_rm_class_name         =   rm_publisher '-' rm_closure '-' rm_class ;
    rm_publisher                    =   V_ALPHANUMERIC_NAME ;
    rm_closure                      =   V_ALPHANUMERIC_NAME ;
    rm_class                        =   V_ALPHANUMERIC_NAME ;
    concept_id                      =   V_SEGMENTED_ALPHANUMERIC_NAME ;

    V_ALPHANUMERIC_NAME             =   ? [a-zA-Z][a-zA-Z0-9_]+ ? ;  (* regex *)
    V_SEGMENTED_ALPHANUMERIC_NAME   =   ? [a-zA-Z][a-zA-Z0-9_-]+ ? ; (* allows hyphens *)

The field meanings are as follows:

rm_publisher: id of organisation originating the reference model on which this archetype is based;
rm_closure: identifier of the reference model top-level package closure on which the archetype is based;
rm_class: name of class or equivalent entity in the reference model on which the artefact is based;
concept_id: an identifier from an ontology of information artefacts (see below);

The first part takes the form of a 3-part identifier, such as:

openEHR-EHR-EVALUATION
ISO-ISO13606-ENTRY

This historically has been used in openEHR and CEN/ISO 13606-2 to identify the reference model class on which an archetype is based. It includes the publisher of the reference model (e.g. "ISO", "openEHR"), which top level 'closure' is being referred to, and finally which class.

The notion of 'closure' is a top level package from which the focal class can be reached. In general, a given class can be reached from more than one top level package, but an archetype of that class will only be suitable for one of those packages. For example, the openEHR class CLUSTER is used by classes in both the ehr and demographic top level packages. However, an archetype of CLUSTER will usually be designed for use with only one of those packages. The Cluster archetype physical_examination for example will only make sense in data defined by the ehr package. Consequently, it will have an archetype identifier of the form openEHR-EHR-CLUSTER . physical_examination .

The closure part of the identifier could be used by tools to ensure for example that an 'EHR' CLUSTER archetype was never attached to a 'demographic' information item.

Concept Identifier

The second part of the human-readable identifier is a 'short' ontological identifier (known in ADL 1.4 as the 'concept' or 'domain concept'). Such identifiers have historically been natural language words or phrases, typically in a short mnemonic form, e.g. 'bp_measurement' in the archetype identifier ISO-ISO13606-ENTRY.bp_measurement.v1 .

Legacy ADL 1.4 Semantics

Historically in ADL 1.4 (ISO 13606-2:2008), the 'concept' part of the identifier encoded the specialisation hierarchy of concepts as a series of hyphated segments, e.g. 'problem' and 'problem-diagnosis', with the latter identifiying a specialised form of the former.The requirement for the concept name to include specialisations is removed in this specification, as well as the ADL / AOM 1.5 specifications. This enables the domain concept of any artefact to be freely assigned according to the purpose of the artefact.

To allow for the fact that legacy specialised archetypes do in fact include the '-' style of separated domain concept identifier, the '-' character is still be allowed, but no longer has any semantic significance.

One consequence is that for archetypes with identifiers conforming to this specification, the level of specialisation can no longer be determined from the identifier. This new approach is in line with how source artefacts are named in object-oriented languages.

Concept Identifier Semantics

The more important aspect of the concept identifier, is its origin and semantics. Historically it has been part of the identifier for archetypes because it is human readable and facilitates debugging of systems where the data contain such identifiers. Clearly a purely ad hoc assignment of a human-readable identifier is not scalable or reliable. Consequently rules and mechanisms for assignment need to be identified.

This specification takes the point of view that the concept part of a managed knowledge artefact identifier must come from an ontology corresponding to the namespace of the identifier, in other words, an ontology maintained by a Custodian Organisation or some higher authority.

It is not the business of this specification to define the ontology, but we can indicate the general form as being an ontology of information entity types for use in the domain of health. It is assumed that there are nodes within the ontology are related to the classes from the information (i.e. 'reference') model. This leads to an ontology of the form shown below.

Figure 2. Information Artefact Ontology

This (putative) ontology consists of high-level health information recording entities (black), a set of record entry types derived from the Clinical Investigator Record ontology [1], and domain-specific entities in blue. It is assumed that the top node(s) of the ontology could be related to nodes in a published ontology such as the Information Artefact Ontology (IAO), but this is not a pre-requisite for establishing this ontology. More ideally, its categories would be related to categories in the Basic Formal Ontology (BFO).

The blue node measurement_of_systemic_arterial_blood_pressure (bottom left) describes an entity corresponding to a 'record of systemic arterial blood pressure measurement'. Long names such as this are standard in the ontology community, and are designed to ensure that the name of a category is sufficient to unamiguously define its meaning. Such names are typically too long and unwieldy for the purposes of managable lexical identifiers such as for archetypes.

We therefore assume that a system of 'short identifiers' is possible within the ontology, where a 'short id' is a synonym for a full node identifier. If we further assume that the ontology is constructed with tools (e.g. Protege) and that ontology identifiers are checked to ensure uniqueness.

Facilities to manage such ontologies should be available either centrally (e.g. openEHR.org or at The Open Biological and Biomedical Ontologies (OBO)), so that every added archetype, template or subset is assigned a short ontological identifier from the ontology.

Existing archetypes can be accommodated within such ontologies in two possible ways. If they have been in use, and data exist containing these identifiers, then their current ontological identifiers can be proposed as the short id for an ontology class defined for the archetype. If there is a clash, a new archetype concept short identifier will be needed, and the archetype will need to be republished under a different identifier.

Need for RM Class Name in Identifier

Theoretically, the Reference Model class identifier part (qualified_rm_class_name above) should not be needed in a well constructed identifier, on the basis that there should never be a clash of concept identifiers, regardless of the RM class, even though they can easily be similar. For example, a reasonable concept_id for an ENTRY (ISO 13606) or OBSERVATION (openEHR) structure archetyped to represent a generic lab result result might be 'lab_result'. For the COMPOSITION-level archetype designed to contain any 'lab result' ENTRY or OBSERVATION, a reasonable name would typically be 'lab_report' (or the equivalent in another language).

Unfortunately, for some informational concepts, the appropriate name for the actual core data level can appear to be perfectly reasonable also as a name for a higher level container of the same data. Without an efficient and essentially global ontology construction service or authority available, the inclusion of the qualified RM class name acts as a reasonable guard against such clashes.

References

[1] T. Beale and S. Heard, “An Ontology-based Model of Clinical Information.,” in Proceedings MedInfo 2007, K. K. et al., Ed., IOS Publishing, Aug. 2007, pp. 760–764. [Online]. Available: http://www.openehr.org/publications/health_ict/MedInfo2007-BealeHeard.pdf