Text Package

Overview

The rm.data_types.text package contains classes for representing all textual values in the health record, including plain text, coded terms, and narrative text. It is illustrated below.

RM data types.text
Figure 1. rm.data_types.text package

Requirements

The sections below describe the requirements of text data types. Two overriding principles should be noted at the outset with regard to text.

  1. Regardless of what terminologies are (or are not) available to the clinician and/or the software, the primary requirement is that in all cases clinicians are able to record exactly what they want to say. This means that if they want to record something very general, such as "cold" or a very specific term such as "Ross River virus infection" they should be able to, whether or not the appropriate coded terms are available. However, the facility should be available to additionally code any such textual item, at the time or indeed at some later time, so as to satisfy reporting or other needs.

  2. It is assumed that any client of terminology, such as the EHR, uses a terminology service which provides a complete interface to the terminology. The design of the DV_CODED_TEXT type reflects this. Accordingly, there is no concept of "post-coordination" outside the terminology environment allowed by the data types described here: the only thing that is available from the terminology service is a key which refers to a lexical entity, which may be a single term or a code phrase, and which may be part of a reference terminology and/or linked to element(s) of underlying ontologies. It is also assumed that there is no direct access to any particular terminology; access to all terminologies (whether simple coded lexicons or large semantic networks) is via the same abstract interface.

Terminology Ids are likely to be of various types.

  1. Terminology Id = "local": this constant value means that the origin of allowable values is described within the archetype. This is coded to allow translation. The local archetype then only needs the set of codes and the local translation. The archetype may contain a translation table if required.

  2. Terminology Id = "[authority]". This might be "openehr", "centc251", "hl7", etc;

  3. The variant Terminology Id = "[authority]:[Domain value set]" could also be supported, although it should not generally be necessary, since all codes should be unique within a given issuing authority. Examples might be "openehr:event math function", "hl7:gender";

  4. Terminology Id = "SNOMED-CT", "ICD10AM", etc. Idemtifiers of this kind must be unique values in an accepted set of terminologies from an authoritative source. These MUST be universally known. In openEHR, names from the US Natoinal Library of Medicine’s UMLS terminology name list are used. See the terminology package in the for details.

Narrative Text

Narrative text items are used in the EHR in a number of cases, including:

  • values of coded attributes in the reference model;

  • recording of subjective or imprecise patient responses, particularly quantities or dates not deemed sufficiently precise to be represented using structured quantitative or date/time date types;

  • recording of narrative statements, e.g. visual observations;

  • recording of tracts of prose, e.g. overall findings and conclusions, prognoses;

  • recording of values that would normally be coded, but for which no code and/or no terminology service is available.

While narrative text items themselves are not themselves coded, they may have code phrases associated with them, as described below under Mappings, and may be mixed within a paragraph with coded items.

Terminological Entities

Textual entities available in a terminology service are used in the health record to enable processing, from simple queries to decision support. Reasons for using terminology include the following.

  • To guarantee interoperability of meaning. For instance, if the term "cold" is recorded in plain text, it could be interpreted as "feeling cold", "C.O.L.D" (chronic obstructive lung disease), "rhinorrhoea", "coryza" or "U.R.T.I. (upper respiriatory tract infection), among others. If, however, it is coded from a terminology such as ICD10 or SNOMED CT, any party reading the data (including software) knows the intention, since the meaning of the code in the terminology is unambiguous.

  • To standardise textual renderings of terms and avoid informal shorthand. For example, practitioners wanting to write "systolic blood pressure" write things like "systolic BP", "systolic bp", "sys. BP." and so on; use of coded terms ensures that such abbreviations are either avoided, or associated with an unambiguous meaning.

  • For unambiguous naming of problems, medications or diagnoses for support of knowledgebased tools such as prescribing packages and other decision support applications.

  • For standardised names of things in the record e.g. a heading of "Physical examination" or an entry such as "Differential diagnosis".

  • For finite sets of values ('value sets'), e.g. Blood Group = 'A|B|AB|O'.

  • For classifying other data for the purpose of statistical studies, e.g. by putting ICD disease group classifiers on actual disease names entered in health records.

A basic requirement for interoperability of text items, coded using terms (i.e. where the text is the official rubric for the code), is that both the rubric and the code (or 'code-phrase') must be recorded, to ensure the originally intended text is retained for receivers of EHR information who do not have access to the terminologies used at the origin. However, where a terminology service is available, the key can be used to unambiguously locate the string value of the term, and can also be used to find translations in other languages. (Note that these comments do not apply to mappings, which are described below).

In some terminologies, there are semantic networks of links emanating from most coded terms, which classify them or relate them to other terms. Such links provide a means for decision support to make inferences about specific things found in the record. For instance if the term "leukaemia" is found, queries to the terminology service can be made in order to deduce that the patient has both a "cancer" and a "disease of the immune system" (assuming leukaemia is classified under these more general terms in the terminology).

This specification assumes the existence of a terminology service which is responsible for interrogating actual terminologies and performing validated coordination of terms, i.e. creating combinations deemed valid by the underlying source terminology, potentially without even assigning a new code to the result. All validated coordination is carried out inside the terminology service, and any "term" made available by the service is already 'coordinated'. The difference between 'pre-coordinated' and 'post-coordinated' terms is that the former have a single code, whereas the latter have a code phrase, or expression that is interpretable by the terminology. For example, the coordination "foot, left" (a shorthand way of writing the relationship "foot has-laterality left") could be created by the terminology service from the source terms "foot", "left" and "has-laterality" from a terminology such as SNOMED CT. Any such coordination must be valid within the source terminology, i.e. correspond to valid relationships defined therein.

The class DV_CODED_TEXT described here captures the association of two things:

  • the code phrase of a code phrase provided by the terminology service, recorded in the defining_code attribute.

  • the text rubric of the code phrase, recorded in the value attribute (inherited from DV_TEXT);

The class CODE_PHRASE.code_string records a key, in the form of arguments to some retrieval function in the terminology service interface.

The semantics attached to coordinations of terms may differ. Two categories of coordination described in the literature are 'qualification' and 'modification'. A common definition of the first is that "qualification narrows meaning" - i.e. creates a new term whose possible real world instances are within the set denoted by the original root term. Modification on the other hand changes the meaning of a root term. Various cases are described below under Meaning Modification. Both coordination types are assumed to be managed by the terminology service.

Coded terms may also be mapped to terms from other terminologies, which may be intended as equivalents, classifiers, or something in between. The section below on Mappings deals with these.

Text Formatting

In modern information systems in all industries, the ability to visually format text exists. EHR systems are no different, and authors of narrative text parts of the health record expect to be able to create longer tracts of text that include paragraphs, bolding, bullet lists and other typical structures, as well as simple atoms of text. Generally speaking, the text associated with a coded term is a simple string, devoid of formatting or newlines, and if the text is a direct copy of the rubric (i.e. term) from a terminology, this will almost always be the case. However there is no guarantee that this will not change in the future, with perhaps newlines and e.g. bolding being used in 'rich text' rubrics.

Clinicians need the freedom to maximise clarity of any text they write, and therefore should be able to include a reasonable level of formatting in text.

Design

All atomic text items are either instances of the type DV_TEXT or of DV_CODED_TEXT. The former allows the expression of text with optional formatting and hyperlinking. The latter additionally connects the text value to a key in the terminology service, with the implication that the key refers to a terminological entity lexically and semantically identical to the text value. Formatting and hyperlinking are discussed in detail below.

The model of DV_CODED_TEXT is designed to capture the actual coded term chosen by the user or software at runtime; it is implicitly assumed that this includes whichever synonym (term of equivalent meaning from the same terminology) was chosen, for terminologies supporting synonyms, and any coordination of underlying distinct terms. A DV_CODED_TEXT instance can only be used if the final textual value chosen by the user is lexically identical to the rubric returned by the terminology service for the key; if the user makes even the slightest change, the identity of rubric / key is lost, and a mapping (see Mappings) should be used instead.

The type DV_TEXT should be used wherever a coded or non-coded text item is allowed, while the type DV_CODED_TEXT should be used wherever a text item must be coded.

Deprecated: The type DV_PARAGRAPH was used to represent larger tracts of text built from lists of DV_TEXT instances, i.e. instances of DV_TEXT and DV_CODED_TEXT. Such text is now represented as plain text with new lines or markdown text, as described below.

Qualification

Qualification is the process of making a term more specific through the post-coordination of additional terms. It occurs when a terminology defines relationships between a primary term and other terms that qualify the primary. For example a coordination using the term "bronchitis" which creates a qualified term might be "acute bronchitis"; all real world instances of the latter are also instances of the former.

Meaning Modification

Terms that change the meaning of other terms are often known as "modifiers". The difference between modification and qualification is that modifiers change the meaning so that the modifed term as a whole does not refer to instances of the unmodified term. We describe below the particular types of modifiers and how they are represented using the text data types.

Mode-changing Terms

One class of modifers is exemplified by the addition of words like "risk of", "fear of", "history of" and so on. These are sometimes called mode-changing terms, since they change the "mode" of the root term from the present to the past ("history of"), a potential future ("risk of") or some other alternate reality. Terms which are modified in this way should never be matched in queries searching for the root term; for example, a query for "coronary diease" (of the patient) should not match "family history of coronary disease".

Context Sensitivity

There are many terms whose meaning is changed by the context in which they are stated, such as within a certain kind of note or test result. Consider the following:

  • a blood sugar level after a 75gm oral loading has a different meaning than a fasting blood sugar;

  • a systolic blood pressure in the pulmonary artery has a different meaning than a systemic arterial blood pressure;

  • "total hip replacement" in the context of a "planned procedure";

  • "meningitis" in the context of a "differential diagnosis".

Negation

Negation is a special kind of mode change and has been a serious design challenge in the past, because modifiers like "not" or "no" only make sense when attached to some terms, and create nonsensical values or ambiguities by arbitrarily association with other terms.

Representation of Meaning-Modifying Terms

Rather than provide explicit features for representing modifier terms within DV_CODED_TEXT, the general principle underlying representation of all post-coordinations other than qualifications, is that a higher-level, archetyped structure such as an ENTRY (defined in the EHR RM), is a minimal indivisible unit of information. Such higher-level entities can have internal structure, and it is possible (and desirable) to achieve the effect of combinations of terms through this structure. In the case of ENTRY, it will be via structuring of CLUSTER/ELEMENT objects. The general rule is: to obtain the full meaning of any terms found in the record, all of the node names in any ENTRY (coded or not) must be considered from the root to the relevant leaf. Conversely, the "final" meaning of any term in the record cannot be known in isolation from the rest of the terms in the structure.

Accordingly, the concept "family history of coronary disease" is represented as an ENTRY whose root is named (for example) "subject family history", and which includes further structure, which may be in greater of lesser detail; the coded term "coronary disease" would appear somewhere in this structure. The actual structure is completely defined by appropriate archetypes. Contrary to some perceptions, there is no general way to represent concepts such as "family history of coronary disease", since it will vary depending on how much detail is recorded. Where some GPs routinely record just the simplest form, others may record the details of which family members had heart problems and exactly what they were.

The same approach is used for context-dependent terms. Archetypes defining contexts such as "planned procedures" or "differential diagnosis" will use these terms as their root nodes; as a result, the meaning of any term appearing below the root can only be understood by including the root. Once again, the exact structures are completely dependent upon archetyping, and may be simple or quite sophisticated.

Negations are more complex than might first be apparent and are best handled by good archetype design. Terminologies might provide a term such as "No known allergies" which is helpful. But if someone has an allergy of some sort, the medicolegal requirement might be to record that the person has no known allergies to penicillin or another class of medication that is being prescribed. The oftenproposed approach of using a generic negation 'modifier' to deal with such issues results in further problems. Consider the use of negation with liver - "no liver", "no palpable liver", "no liver disease", "no history of liver disease", "no liver function", "no liver function tests". The meaning of negated terms may be non-sensical and difficult to interpret.

A basic principle of dealing with negatives is to realise that most naïve suggested use cases are quite ambiguous as stated. Does "no allergies" mean "no reported episode of allergy", "no allergic reactions ever", "no known allergies to medication" or something else? Does it mean that these statements are taken as given by the patient, or determined by tests? Like all medical phenomena, allergies must be described in some detail for the EHR to be of any real use. Almost inevitably, this precludes the use of negated terms. Since the actual information structure will be determined in advance by archetype designers, clinicians will almost never be in the situation of having to negate a term. However, if the need does arise, it should be dealt with by a negative or quantitative answer, i.e. a value rather than a name. For example, in any ENTRY describing current problems, the clinician may record the name/value pair "allergies: NONE". Here, "allergies" will be a DV_CODED_TEXT, and "NONE" will be either a DV_CODED_TEXT or a DV_TEXT; the two will be associated by a containing object, such as an instance of the ELEMENT class from the EHR RM. There is no explicit model of negation in openEHR.

Mappings

In a number of circumstances, both plain text and coded text items are mapped to terms from other terminologies. In theory, this should never occur, since it means that relationships between terms which should only be knowable in the knowledge base (in the form of the terminology service, or something else) are being created and transmitted as part of EHR information, potentially invalidating or overriding the knowledge base. Where mappings are required, the proper approach is to create thesauri within the knowledge environment, and map through them. Unfortunately, in some cases, activities in the real world do not respect the information/knowledge boundary, hence the model described here includes an explicit mapping concept, which itself includes a "purpose" and a "match" indicator. Matching corresponds to the categories described below.

Classification (Broader Terms)

Any text item, whether coded or not, may be classified with a coded term, for research, reporting and decision support purposes. For example, a GP working in tropical Australia may wish to write "Ross River infection", and be working with ICD9, which does not contain this term (although ICD9-CM does). He or she will use a plain text item, but will still be able to map it to an ICD9 classifier, such as the code for "arbovirus infection NOS". The same approach can be used for adding a classifying term to a coded text item. The utility of classifier terms is various: they allow decision support to make more powerful inferences; in situations where the available terminologies do not provide the classification inbuilt, and where it is known that not all users of EHR data will have terminologies available. In data terms, classification mapping can be visualised as illustrated below.

coded text mappings
Figure 2. Plain Text and Coded Text with Classifier(s)

Classifying mappings are represented by adding a term to the mappings list of the original term. Each mapping is explicitly represented with an instance of TERM_MAPPING, which indicates both the term being associated with the original text item, and a value of '>' for the match attribute, which indicates that the mapping is "broader". The possible values of the match attribute are '>' (broader), '<' (narrower), and '=' (equivalent); they are taken from the ISO standards [ISO_2788]) and [ISO_5964].

Equivalent / Synonymous Terms

Data from pathology laboratories has often been coded using a terminology local to the laboratory, due to lack of or economic unfeasibility of using existing widespread terminologies for the job. However, some laboratories also supply a nearest equivalent code from a well-known terminology such as LOINC, to enable the receiver of the data to process it in a more standard fashion. Here, "equivalence" is taken to mean a term of the same meaning but from a different vocabulary.

Another instance where equivalent terms might be supplied is to effect the translation of terms across specialist vocabularies such as nursing vocabularies when sharing EHRs across jurisdictions.

In theory, the cleanest way for senders and receivers of data coded with both a local and a more standard equivalent to deal with the mapping problem is for the originator of the local terminology to provide a complete thesaurus of translations into one or more recognised terminologies. However, in practice, laboratories using the HL7 v2.x messaging standard usually encode a primary term and equivalents with the HL7 CE data type, meaning that equivalents are included only with the term they are used with. A similar pragmatic approach to mapping equivalent terms in the EHR is likely to be used with the data types described here, and can be effected with the same mapping approach as for classification.

A further situation in which text values - this time plain text - is mapped to equivalent terms is when natural language processing is used to generate coded terms for existing free-text prose. The aim of such processing is to detect word phrases and associate them with a coded term of the same meaning, without obliterating the original text. In terms of the model described here, a CODE_PHRASE is associated with a DV_TEXT instance via the mappings attribute.

In all cases with equivalents, the value of the match attribute is '=', indicating that the mapping is a synonym.

More Specific Mappings (Narrower Terms)

Occasionally, there is a need to create a mapping to a term of narrower meaning than the original text item. Circumstances in which this occurs include when a clinician wants to record a syndrome such as "croup" or "influenza", but the terminology does not contain these general terms, although it does contain more specific terms, e.g. "viral laryngo-tracheitis" or "influenza type A". Clearly the clinician should be allowed to record what he/she wants (as plain text if necessary), but it should also be possible to add a mapping to the more precise term. For mappings to narrower terms, the value of the match attribute is '<'.

The Unified Medical Language System (UMLS)

It has been argued in GEHR [GeHR_Aus_req] that UMLS reference terms should also be supplied with occurrences of coded terms, in the form of the UMLS concept unique identifier, or "CUI". UMLS is a way of encoding terms developed at the National Library of Medicine in the United States, and consists of a meta-thesaurus, in which terms from any extant term set (such as ICDx, SNOMED CT, READ) can be cross-referenced. UMLS CUIs could turn out to be extremely useful for decision support and reporting.

The proper use of UMLS is that terms from particular terminologies are passed to a UMLS interface and a CUI + rubric received in response. However, the mapping approach described above could also be used to map UMLS CUIs to existing text or terms in an EHR; in this case, a DV_CODED_TEXT is constructed for each UMLS "term", where the code is the CUI and the rubric is the text rendering of the CUI (guaranteed unique in UMLS). The same approach can be used for any other thesaurus which becomes available in the future.

Legacy Mapping Scenarios

In cases where legacy data has to be converted to openEHR-compliant data, and only codes are available, e.g. ICD or ICPC codes, the following approach is recommended:

  • create a new DV_TEXT whose value is "(not available)"

  • add a mapping to the DV_TEXT, with:

    • purpose = "legacy conversion"

    • match = "="

    • target = CODE_PHRASE object whose code_string and terminology_id are set to correspond to the available code in the legacy data.

This expresses the reality that no text was ever recorded in the legacy system; rather a code was recorded directly in the data field. In the converted data, this code is more correctly considered a mapping.

Language Translations

In most cases the natural language of a text object is known from the enclosing Entry (i.e. Observation) or other enclosing context. Where it is different (e.g. a german sentence within an English language diagnosis), or there is no enclosing context, the DV_TEXT.language attribute can be set to indicate the language of the text item.

Formatting and Hyperlinking

Formatted text is represented by DV_TEXT and DV_CODED_TEXT instances. In versions prior to RM Release 1.0.4, DV_PARAGRAPH was provided as a way to represent larger tracts of text containing newlines and formatting of sub-sections, with each subsection being represented by one DV_TEXT or DV_CODED_TEXT.

DV_PARAGRAPH is now deprecated, and all text should be represented using the DV_TEXT or DV_CODED_TEXT. For legacy reasons, DV_PARAGRAPH remains legal, and should be supported in at least a basic way.

Formatting of the text in the value field of DV_TEXT and DV_CODED_TEXT instances may be indicated via the optional formatting attribute. How this attribute is used has changed from initial versions of the openEHR Reference Model.

Additionally, the value field was assumed to contain no newlines. This assumption is no longer made starting from RM Release 1.0.4, since DV_PARAGRAPH is no longer use to represent paragraphs.

A common model of text processing is of a processing pipeline that consumes readable text containing markings, which is then rendered into HTML, PDF or some other rendering format, and then displayed by a tool such as a web browser that consumes the latter format. Historical approaches to marking text were known as 'markup', such as nroff, troff, Latek, and then more recently, numerous formats whose base syntax is XML. These formats all have the disadvantages of poor human readability in raw form, and being specific to particular rendering engines, and are starting to fall out of favour for uses involving human reading of the raw text. Another common approach to formatting text is to use the output format, usually HTML, to express e.g. fonts, bolding etc. This is also problematic for human readability, being complicated by the use of CSS classes in tags, which then need to be defined.

Deprecated: This was the original approach assumed for the DV_TEXT.formatting attribute in openEHR, which was defined in versions prior to RM Release 1.0.4 as containing CSS font strings to apply to the whole of the DV_TEXT.value. This usage is now deprecated, while remaining legal in order to allow for legacy systems and data.

The current trend (emerging mainly for documentation in software) is to use 'markdown', which consists of markings in the text that visually indicate the intention: literal newlines, bolding via asterisks etc. No explicit CSS classes are needed, and conversion to rendering form is assumed to be done by an industry-standard markdown-to-HTML or other such converter. This is the approach taken by this specification, and is applied as follows.

The formatting attribute may contain one of the following values, with the associated meaning.

  • Void: if no value is supplied, no formatting marks are assumed to exist in the text, other than newlines, and it may safely be passed in a 'straight-through' fashion to a display device with no further processing, if desired;

  • "markdown": the text is assumed to be markdown format, strongly recommended use of the CommonMark form of markdown, and should preferably be rendered via conversion to HTML in the usual way, but may be passed straight through to display if unavoidable, since markdown is acceptably human-readable;

  • "plain": the text is treated as in the unformatted (Void) case above (newlines are allowed); this option is used to enable archetype-based modelling tools to clearly require a text object to be unformatted;

  • "plain_no_newlines": as for the plain case, but newlines are not allowed - the text is a simple 'atom';

  • (legacy - deprecated): for legacy purposes, it remains legal to use a string of the form "name:value; name:value…" , e.g. "font-weight : bold; font-family : Arial; font-size : 12pt;".

Related to the question of formatting, a second optional attribute, DV_TEXT.hyperlink of type DV_URI, exists to specify a web link associated with the whole text represented by the DV_TEXT instance. Similarly to the case of formatting, the more modern markdown approach to using the value field is also more flexible with regards to links, since multiple instances of the latter may be used within a single text.

Deprecated: Starting at RM Release 1.0.4, the use of the hyperlink field is deprecated in favour of inline link(s) in the value field, represented using markdown; in CommonMark, the basic form of a link is [text](uri).

Not all CommonMark features are allowed in the value field, if it contains markdown. The following should not be used:

  • HTML 'blocks';

  • raw HTML;

  • images - these may be included in openEHR content via the DV_MULTIMEDIA types.

Usage in DV_TEXT Instances

In a concrete instance of DV_TEXT, the following are the likely usages of the value and formatting attributes. The hyperlink is assumed to be Void in all non-legacy cases.

Requirement value attribute formatting attribute Legacy form

Plain text 'atom'
(no formatting, no newlines)

plain text string

Void or "plain_no_newlines"

(same)

Plain text 'atom' with a hyperlink
(no formatting, no newlines)

text string of form [text](uri) or other CommonMark link variant

"markdown"

value = plain text; hyperlink = URI/URL

Plain text paragraphs,
no other formatting

plain text string with newlines

Void or "plain"

DV_PARAGRAPH instance(s) containing plain legacy text atom form DV_TEXTs

Formatted text with paragraphs, bolding, italics, bullet lists, links etc

CommonMark text string

"markdown"

DV_PARAGRAPH instance(s) containing legacy form DV_TEXTs with CSS formatting and/or values in hyperlink

Usage in DV_CODED_TEXT Instances

In a concrete instance of DV_CODED_TEXT, the following are the likely usages of the value and formatting attributes. The hyperlink is assumed to be Void in all non-legacy cases.

Requirement value attribute formatting attribute Legacy form

Plain text of coded term

plain text string

Void

(same)

Plain text of coded term with a hyperlink

text string of form [text](uri) or other CommonMark link variant

"markdown"

value = plain text; hyperlink = URI/URL

Plain text of coded term or other text, containing newlines

Plain formatted text

Void or "plain"

(not applicable)

Class Descriptions

Unresolved include directive in modules/data_types/pages/text_package.adoc - include::../UML/classes/dv_text.adoc[]

Unresolved include directive in modules/data_types/pages/text_package.adoc - include::../UML/classes/term_mapping.adoc[]

Unresolved include directive in modules/data_types/pages/text_package.adoc - include::../UML/classes/code_phrase.adoc[]

Unresolved include directive in modules/data_types/pages/text_package.adoc - include::../UML/classes/dv_coded_text.adoc[]

Unresolved include directive in modules/data_types/pages/text_package.adoc - include::../UML/classes/dv_paragraph.adoc[]