Data Elements for Registries (2024)

1. Introduction

Selection of data elements for a registry requires a balancing of potentially competing considerations. These considerations include the importance of the data elements to the integrity of the registry, their reliability, their necessity for the analysis of the primary outcomes, their contribution to the overall response burden, and the incremental costs associated with their collection. Registries are generally designed for a specific purpose, and data elements not critical to the successful execution of the registry or to the core planned analyses should not be collected unless there are explicit plans for their analysis.

The selection of data elements for a registry begins with the identification of the domains that must be quantified to accomplish the registry purpose. The specific data elements can then be selected, with consideration given to standardized outcome measures, other data standards, common data definitions, and the use of patient identifiers. Next, the data element list can be refined to include only those elements that are necessary for the registry purpose. Once the selected elements have been incorporated into a data collection tool, the tool can be pilot tested to identify potential issues, such as the time required to complete the form, data that may be more difficult to access than realized during the design phase, and practical issues in data quality (such as appropriate range checks). This information can then be used to modify the data elements and reach a final set of elements.

2. Identifying Domains

Registry design requires explicit articulation of the goals of the registry and close collaboration among disciplines, such as epidemiology, health outcomes, statistics, informatics, and clinical specialties. Once the goals of the study are determined, the domains most likely to influence the desired outcomes must be identified and defined. Registries generally capture data on the characteristics of the patient and the disease or condition of interest; exposure(s), including treatments; and outcomes. The characteristics domain consists of data that describe the patient, such as information on patient demographics, medical history, health status, and any necessary patient identifiers. The exposure domain describes the patient’s experience with the product, disease, device, procedure, or service of interest to the registry. Exposure can also include other treatments that are known to influence outcome but are not necessarily the focus of the study, so that their confounding influence can be adjusted for in the planned analyses. The outcomes domain consists of information on the patient outcomes that are of interest to the registry; this domain should include both the primary endpoints and any secondary endpoints that are part of the overall registry goals. These domains are illustrated in the Outcome Measures Framework1 (see Figure 4-1).

In addition to the goals and desired outcomes, it is necessary to consider any important subsets when defining the domains. Measuring potential confounding factors (variables that are linked with both the exposure and outcome) should be taken into account in this stage of registry development. Collecting data on potential confounders will allow for analytic or design control. (See Chapters 3 and 13.) Variables that can change over time must include a time reference in order to distinguish cause-and-effect relationships. For example, a drug taken after an outcome is observed cannot possibly have contributed to the development of that outcome. Time reference periods can be addressed by including start and stop dates for variables that can change; they can also be addressed categorically, as is done in some quality improvement registries. For example, the Paul Coverdell National Acute Stroke Registry organized its patient-level information into categories to reflect the timeframe of the stroke event from onset through treatment to followup. In this case, the domains were categorized as prehospital, emergency evaluation and treatment, in-hospital evaluation and treatment, discharge information, and post-discharge followup.2
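The point about time references can be made concrete with a small sketch. It assumes each time-varying exposure is stored with a start date and an optional stop date; the function name and the three category labels are illustrative choices, not part of any registry standard.

```python
from datetime import date
from typing import Optional


def exposure_timing(start: date, stop: Optional[date], outcome: date) -> str:
    """Classify an exposure window relative to an outcome date.

    stop=None means no stop date was recorded (exposure treated as ongoing).
    """
    if start > outcome:
        # Began after the outcome, so it cannot have contributed to it.
        return "after_outcome"
    if stop is not None and stop < outcome:
        return "ended_before_outcome"
    return "ongoing_at_outcome"


outcome_date = date(2023, 5, 10)
print(exposure_timing(date(2023, 1, 1), None, outcome_date))
print(exposure_timing(date(2023, 6, 1), None, outcome_date))
```

Without the start and stop dates, the second exposure could not be separated from the first, and an analysis might wrongly treat it as a possible cause.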

3. Selecting Data Elements

The process of selecting data elements begins with identification of the data elements that best quantify each domain and the source(s) from which those data elements can be collected. When selecting data elements, gaining consensus among the registry stakeholders is important, but this must be achieved without undermining the purpose of the registry by including elements solely to please a stakeholder. Each data element should support the purpose of the registry and answer an explicit scientific question or address a specific issue or need. The most effective way to select data elements is to start with the study purpose and objective, and then decide what types of groupings, measurements, or calculations will be needed to analyze that objective. Data elements may also be selected based on performance or quality measures in a clinical area; this is a particularly relevant approach for registries that focus on quality improvement. (See Case Example 11.)

Once the plan of analysis is clear, it is possible to work backward to define the data elements necessary to implement that analysis plan. This process keeps the group focused on the registry purpose and limits the number of extraneous (“nice to know”) data elements that may be included.3 When selecting data elements, it is often helpful to gather input from statisticians, epidemiologists, psychometricians, and experts in health outcomes assessment who will be analyzing the data, as they may notice potential analysis issues that need to be considered at the time of data element selection.

3.1. Data Standards

The data element selection process can be simplified if standardized outcome measures or clinical data standards for a disease area exist, as discussed in Chapter 4. In cases where clinical data standards for the disease area do not exist, established datasets or common data elements may be widely used in the field. (See Case Example 12.) For example, the United Network for Organ Sharing (UNOS) collects a large amount of data on organ transplant patients. Creators of a registry in the transplant field should consider aligning their data definitions and data element formats with those of UNOS to simplify the training and data abstraction process for sites. The National Institutes of Health (NIH) maintains a repository of common data elements (CDEs) that can be used to find CDEs relevant for use in a wide range of condition areas. Another example of an established dataset is the U.S. Core Data for Interoperability (USCDI). The USCDI, which was developed by the Office of the National Coordinator for Health Information Technology (ONC), is a standardized set of health data classes and constituent data elements that are intended to support national, interoperable health information exchange.4

If clinical data standards for the disease area and established datasets do not exist, it is still possible to incorporate standard terminology into a registry. This will make it easier to compare the registry data with the data of other registries and reduce the training needs and data abstraction burden on sites. Examples of several standard terminologies used to classify important data elements are listed in Table 5-1.

Table 5-1

Standard terminologies.

In addition to these standard terminologies, numerous useful commercial code listings target specific needs, such as proficiency in checking for drug interactions or compatibility with widely used electronic medical record systems. Mappings between many of these element lists are also increasingly available. For example, SNOMED CT® (Systematized Nomenclature of Medicine Clinical Terms) can currently be mapped to ICD-10-CM (International Classification of Diseases, 10th Revision, Clinical Modification).5

Despite progress in the use of vocabulary and terminology standards, challenges still exist. Multiple standards are still used for some areas (e.g., medications), and some systems that capture electronic health data use local terminologies instead of existing standards. In addition, some types of electronic health data, such as radiographic images, pathology slides, and clinical notes, may not be recorded using vocabulary and terminology standards. After investigating data standards, registry planners may find that there are no useful standards or established datasets for the registry, or that these standards comprise only a small portion of the dataset. In these cases, the registry will need to select and define data elements with the guidance of its project team, which may include an advisory board.
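As a rough illustration of handling local terminologies, the sketch below maps site-specific terms to a standard code set and sets aside anything it cannot map for manual review. The local abbreviations and the tiny crosswalk are hypothetical; a real registry would rely on a curated, versioned map rather than a hand-written dictionary.

```python
# Hypothetical site-specific terms mapped to ICD-10-CM codes (illustration only).
LOCAL_TO_ICD10CM = {
    "HTN": "I10",     # essential (primary) hypertension
    "T2DM": "E11.9",  # type 2 diabetes mellitus without complications
}


def standardize(local_terms):
    """Return (mapped, unmapped): codes for known terms, leftovers for review."""
    mapped, unmapped = {}, []
    for term in local_terms:
        key = term.strip().upper()
        if key in LOCAL_TO_ICD10CM:
            mapped[term] = LOCAL_TO_ICD10CM[key]
        else:
            unmapped.append(term)
    return mapped, unmapped


mapped, unmapped = standardize(["htn", "T2DM ", "chest pain NOS"])
print(mapped)
print(unmapped)
```

Returning the unmapped terms, rather than dropping them, gives the project team a worklist of elements that still need a definition.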

3.2. Enrollment and Followup Data Elements

When beginning the process of selecting and defining data elements, it can be useful to start by considering the registry design. Since many registries are longitudinal, sites often collect data at multiple visits. In these cases, it is necessary to determine which data elements can be collected once and which data elements should be collected at every visit. Data elements that can be collected once are often collected at the enrollment or baseline visit. Other data elements may be collected at every followup visit or on a specified schedule (e.g., once per year) that reflects routine care. In other cases, the registry may collect data at an event level, meaning all data elements will be collected during the course of the event rather than in separate visits. In considering when to collect a data element, it is also important to determine the most appropriate order of data collection. Data elements that are related to each other temporally (e.g., dietary information and a fasting blood sample for glucose or lipids) should be collected in the same visit rather than in different visits.
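One possible way to record which elements are collected once and which recur is a simple schedule table. In this sketch the element names, the three timing labels, and the assumption of four visits per year are all hypothetical.

```python
# Hypothetical collection schedule: element name -> when it is collected.
SCHEDULE = {
    "date_of_birth": "enrollment",
    "medical_history": "enrollment",
    "current_medications": "every_visit",
    "blood_pressure": "every_visit",
    "quality_of_life_survey": "annual",
}


def elements_due(visit_number: int, visits_per_year: int = 4) -> list:
    """List elements to collect at a visit; visit 0 is enrollment/baseline."""
    due = []
    for element, timing in SCHEDULE.items():
        if timing == "every_visit":
            due.append(element)
        elif timing == "enrollment" and visit_number == 0:
            due.append(element)
        elif timing == "annual" and visit_number % visits_per_year == 0:
            due.append(element)
    return due


print(elements_due(0))  # baseline: every element
print(elements_due(1))  # routine followup: recurring elements only
```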

Table 5-2 provides examples of possible data elements to be collected at registry enrollment and followup visits, organized into the characteristics, exposures, and outcomes domains described above. The actual data elements selected for a specific registry will vary depending on the design, nature, and goals of the registry.

Table 5-2

Examples of potential registry data elements.

3.3. Data Definitions

Documentation of explicit data definitions for each variable to be collected is essential to the process of selecting data elements. This is important to ensure internal validity of the proposed study so that all participants in data collection are acquiring the requisite information in the same reproducible way. (See Chapter 11.) This process may be simplified if standardized data elements and data definitions are used (e.g., CDEs or data standards). Use of existing, standardized definitions also improves the ability of the registry to compare and exchange data with other systems in the future. However, registries may need to develop data definitions when existing standards do not meet the needs of the registry. The data definitions should include the ranges and allowable values for each individual data element, as well as the potential interplay of different data elements. For example, logic checks may be created for data elements that should be mutually exclusive.
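Ranges, allowable values, and logic checks of this kind can be written down as machine-readable definitions. The following is a minimal sketch; the element names, the numeric limits, and the smoking-status pair are invented for illustration and are not recommended clinical thresholds.

```python
# Hypothetical data definitions: numeric ranges and allowable values.
DEFINITIONS = {
    "age_years": {"type": "range", "min": 0, "max": 120},
    "systolic_bp_mmhg": {"type": "range", "min": 50, "max": 300},
    "sex": {"type": "choice", "allowed": {"female", "male", "other", "unknown"}},
}
# Pairs of yes/no elements that should never both be true.
MUTUALLY_EXCLUSIVE = [("never_smoker", "current_smoker")]


def validate(record: dict) -> list:
    """Return a list of human-readable problems found in one record."""
    errors = []
    for name, rule in DEFINITIONS.items():
        value = record.get(name)
        if value is None:
            continue  # completeness is a separate check
        if rule["type"] == "range" and not rule["min"] <= value <= rule["max"]:
            errors.append(f"{name}: {value} outside {rule['min']}-{rule['max']}")
        if rule["type"] == "choice" and value not in rule["allowed"]:
            errors.append(f"{name}: {value!r} is not an allowed value")
    for a, b in MUTUALLY_EXCLUSIVE:
        if record.get(a) and record.get(b):
            errors.append(f"{a} and {b} cannot both be true")
    return errors


print(validate({"age_years": 130, "never_smoker": True, "current_smoker": True}))
```

Keeping the rules in data rather than in code means the same definitions can be shown to abstractors in training materials and enforced by the data collection tool.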

When deciding on data definitions, it is important to determine which data elements are required and which elements may be optional. This is particularly true in cases where the registry may collect some “nice to know” data elements. The determination will differ depending on whether the registry is using existing medical record documentation to obtain a particular data element or whether the clinician is being asked directly. For example, the New York Heart Association Functional Class for heart failure is an important staging element but is often not documented.7 However, if clinicians are asked to provide the data point prospectively, they can readily do so. Consideration should also be given to accounting for missing or unknown data. In some cases, a data element may be unknown or not documented for a particular patient, and followup with the patient to answer the question may not be possible. Including an option on the form for “not documented” or “unknown” will allow the person completing the case report form (CRF) to provide a response to each question rather than leaving it blank. Depending on the analysis plans for the registry, the distinction between undocumented data and missing data may be important.
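The distinction between a blank response and an explicit "not documented" or "unknown" response can be preserved at data entry. A minimal sketch follows, assuming hypothetical element names and assuming the form stores the sentinel options as text.

```python
# Hypothetical required and optional elements for one form.
REQUIRED = {"enrollment_date", "nyha_class"}
OPTIONAL = {"employment_status"}
# Explicit responses meaning the value could not be obtained.
SENTINELS = {"not documented", "unknown"}


def summarize_completeness(record: dict) -> dict:
    """Separate blank required items from explicit 'unknown' responses."""
    summary = {"blank_required": [], "flagged_unknown": [], "answered": []}
    for name in sorted(REQUIRED | OPTIONAL):
        value = record.get(name)
        if value is None or str(value).strip() == "":
            if name in REQUIRED:
                summary["blank_required"].append(name)
        elif str(value).strip().lower() in SENTINELS:
            summary["flagged_unknown"].append(name)
        else:
            summary["answered"].append(name)
    return summary


print(summarize_completeness({"enrollment_date": "2024-01-15",
                              "nyha_class": "Not documented"}))
```

A blank required item can trigger a query to the site, while a flagged item records that the abstractor looked and found nothing.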

3.4. Patient Identifiers

When selecting patient identifiers, there are a variety of options, such as the patient’s name, date of birth, or some combination thereof, that are subject to legal and security considerations. More specific patient information may be needed when linkage to or integration with other data sources is planned, depending on the planned method of patient matching. In selecting patient identifiers, thought should be given primarily to patient privacy and security, as well as to the possibility that patient identifiers may change during the course of the registry. For example, patients may change their names during the course of the registry following marriage/divorce, or patients may move or change their telephone numbers. Patient identifiers can also be inaccurate because of intentional falsification by the patient (e.g., for privacy reasons in a sexually transmitted disease registry), unintentional misreporting by the patient or a parent (e.g., wrong date of birth), or typographical errors. In these cases, having more than one patient identifier for matching patient records can be invaluable. In addition, identifier needs will differ based on the registry goals. For example, a registry that tracks children will need identifiers related to the parents, and registries that are likely to include twins (e.g., immunization registries) should plan for the duplication of birth dates and other identifiers. In selecting patient identifiers for use in a registry, registry planners will need to determine what data are necessary for their purpose and plan for potential inaccurate and changing data.
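To show why more than one identifier helps, here is a deliberately simple deterministic matching sketch. The four identifier fields, the normalization, and the "at least 3 of 4 agree" threshold are assumptions for illustration; production linkage usually relies on more careful, often probabilistic, methods.

```python
# Hypothetical identifier fields compared between two records.
IDENTIFIERS = ("first_name", "last_name", "date_of_birth", "phone")


def normalize(value) -> str:
    """Lowercase and strip all whitespace so trivial differences do not matter."""
    return "".join(str(value).lower().split()) if value else ""


def agreement_count(a: dict, b: dict) -> int:
    """Count identifiers that are present in both records and agree."""
    count = 0
    for field in IDENTIFIERS:
        left, right = normalize(a.get(field)), normalize(b.get(field))
        if left and left == right:
            count += 1
    return count


def likely_same_patient(a: dict, b: dict, threshold: int = 3) -> bool:
    return agreement_count(a, b) >= threshold


enrolled = {"first_name": "Maria", "last_name": "Lopez",
            "date_of_birth": "1985-02-11", "phone": "555-0100"}
followup = {"first_name": "maria", "last_name": "Garcia",
            "date_of_birth": "1985-02-11", "phone": "555-0100"}
print(likely_same_patient(enrolled, followup))  # surname changed, still matched
```

Note that twins who share a surname, birth date, and household phone number would also reach this threshold, which is exactly the situation the text warns registries to plan for.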

Generally, patient identifiers can simplify the process of identifying and tracking patients for followup. Patient identifiers also allow for the possibility of identifying patients who are lost to followup due to death (i.e., through the National Death Index) and linking to birth certificates for studies in children. In addition, unique patient identifiers allow for analysis to remove duplicate patients.

When considering the advantages of patient identifiers, it is important to take into account the potential challenges that collecting patient identifiers can present and the privacy and security concerns associated with the collection and use of patient identifiers. Obtaining consent for the use of patient-identifiable information can be an obstacle to enrollment, as it can lead to the refusal of patients to participate. Chapter 7 contains more information on the ethical and legal considerations of using patient identifiers.

3.5. Multinational Registries

Registries are commonly multinational, and data elements must be tailored appropriately for each country. Even when the same concepts are captured, examination and laboratory test results or units may differ among countries, making standardization of data elements necessary at the data-entry level. Data elements relating to cost-effectiveness studies may be particularly challenging, since there is substantial variation among countries in healthcare delivery systems and practice patterns, as well as in the cost of medical resources that are used as “inputs.” Alternatively, if capture of internationally standardized data elements is not desirable or cannot be achieved, registry stakeholders should consider provisions to capture data elements according to local standards. Later, separate data conversions and merging outside the database for uniform reporting or comparison of data elements captured in multiple countries can be evaluated and performed as needed if the study design ensures that all data necessary for such conversions have been collected.

Multinational registries also must carefully consider translation of data elements and case report forms (CRFs) into different languages. Appropriate translation and linguistic validation of CRFs is critically important to maintain a high quality of systematic data collection in the registry and to ensure that data captured from different countries have the same definition and meaning. Linguistic validation is important even when the same language is spoken in different countries. For example, though persons in the United States and the United Kingdom (UK) both commonly speak English, content validity of the same CRF may differ between the two nations due to different cognitive interpretations. Consider patient-reported weight; a patient in the United States would typically write the full amount in pounds, while a patient in the UK would typically write the amount in stone or pounds or possibly kilograms.
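The weight example suggests capturing the unit alongside the value and converting to a common unit afterward. The sketch below assumes the form records a unit code of "kg", "lb", or "st" (with any additional pounds entered separately for stone); those codes and the rounding to one decimal place are illustrative choices.

```python
KG_PER_POUND = 0.45359237  # exact, by definition of the international pound
POUNDS_PER_STONE = 14


def weight_to_kg(value: float, unit: str, extra_pounds: float = 0.0) -> float:
    """Convert a reported weight to kilograms, rounded to 0.1 kg.

    extra_pounds is only used with unit "st" (e.g. 11 stone 4 pounds).
    """
    unit = unit.strip().lower()
    if unit == "kg":
        kg = value
    elif unit == "lb":
        kg = value * KG_PER_POUND
    elif unit == "st":
        kg = (value * POUNDS_PER_STONE + extra_pounds) * KG_PER_POUND
    else:
        raise ValueError(f"unsupported weight unit: {unit!r}")
    return round(kg, 1)


print(weight_to_kg(154, "lb"))
print(weight_to_kg(11, "st"))
```

Storing the original value and unit as entered, and deriving the converted value, keeps the conversion auditable if a site later reports that the wrong unit was selected.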

4. Registry Data Map

Once data elements have been selected, a data map should be created. The data map identifies all sources of data (Chapter 6) and explains how the sources of data will be integrated. Data maps are useful to defend the validity and/or reliability of the data, and they are typically an integral part of the data management plan (Chapter 11). Clear operational definitions for each data element are also important to facilitate eventual interpretation of the data.
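A data map can be as simple as a table with one row per data element. The sketch below uses hypothetical element names, sources, and integration methods, and adds two small helpers: one groups elements by source, the other lists selected elements that do not yet appear in the map.

```python
# Hypothetical data map: one row per data element.
DATA_MAP = [
    {"element": "date_of_birth", "source": "electronic health record",
     "integration": "scheduled extract"},
    {"element": "quality_of_life_score", "source": "patient survey",
     "integration": "web form entry"},
    {"element": "date_of_death", "source": "external death index",
     "integration": "periodic linkage"},
]


def elements_by_source(data_map: list) -> dict:
    """Group element names under the source that supplies them."""
    grouped = {}
    for row in data_map:
        grouped.setdefault(row["source"], []).append(row["element"])
    return grouped


def unmapped_elements(selected: list, data_map: list) -> list:
    """Return selected elements that have no row in the data map."""
    mapped = {row["element"] for row in data_map}
    return sorted(set(selected) - mapped)


print(elements_by_source(DATA_MAP))
print(unmapped_elements(["date_of_birth", "nyha_class"], DATA_MAP))
```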

5. Pilot Testing

After the data elements have been selected and the data map created, it is important to pilot test the data collection tools to determine the time and costs of obtaining the data and the resulting clinician and patient burden. For example, through pilot testing, registry planners might determine that it is wise to collect certain data elements that are either highly burdensome or only “nice to know” in only a subset of participating sites (nested registry) that agree to the more intensive data collection, so as not to endanger participation in the registry as a whole. Pilot testing should also help to identify the rate of missing data and any validity issues with the data collection system.
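Two of the pilot outputs mentioned here, completion time and the rate of missing data, are easy to summarize once pilot forms are collected. This sketch assumes each pilot form records the minutes taken to complete it and a dictionary of responses; the field names are hypothetical.

```python
from statistics import median


def pilot_summary(forms: list, elements: list) -> dict:
    """Summarize completion time and per-element missing rate for pilot forms."""
    if not forms:
        raise ValueError("no pilot forms supplied")
    n = len(forms)
    missing_rate = {
        # A response counts as missing if it is absent, None, or an empty string.
        name: sum(1 for f in forms if f["responses"].get(name) in (None, "")) / n
        for name in elements
    }
    return {
        "median_minutes": median(f["minutes_to_complete"] for f in forms),
        "missing_rate": missing_rate,
    }


forms = [
    {"minutes_to_complete": 12, "responses": {"nyha_class": "II", "time_to_read": ""}},
    {"minutes_to_complete": 20, "responses": {"nyha_class": "III", "time_to_read": "45"}},
    {"minutes_to_complete": 18, "responses": {"nyha_class": None}},
]
print(pilot_summary(forms, ["nyha_class", "time_to_read"]))
```

Elements with a high missing rate in the pilot are natural candidates for clearer definitions, a different source, or collection in a nested subset of sites.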

The burden of data collection is a major factor determining a registry’s success or failure, with major implications for the cost of participation and for the overall acceptance of the registry by healthcare personnel and patients. Moreover, knowing the anticipated time needed for patient recruitment/enrollment will allow better communication to potential sites regarding the scope and magnitude of commitment required to participate in the study. Registries that obtain information directly from patients include the additional issue of participant burden, with the potential for participant fatigue, leading to failure to answer all items in the registry. Highly burdensome questions can be collected in a prespecified subset of subjects. The purpose of these added questions should be carefully considered when determining the subset so that useful and accurate conclusions can be achieved.

Pilot testing the registry also allows the opportunity to identify issues and make refinements in the registry-specific data collection tools, including alterations in the format or order of data elements and clarification of item definitions. Piloting may also uncover problems in registry logistics, such as the ability to accurately or comprehensively identify subjects for inclusion. A fundamental aspect of pilot testing is evaluation of the accuracy and completeness of registry questions and the comprehensiveness of both instructional materials and training in addressing these potential issues. Gaps in clarity concerning questions can result in missing or misclassified data, which in turn may cause bias and result in inaccurate or misleading conclusions. For example, time points, such as time to radiologic interpretation of imaging test, may be difficult to obtain retrospectively and, if they do exist in the chart, may not be consistently documented. Without additional instruction, some hospitals may indicate the time the image was read by the radiologist and others may use the time when the interpretation was recorded in the chart. The two time points can have significant variation, depending on the documentation practices of the institution.

6. Summary

The selection of data elements requires balancing such factors as their importance for the integrity of the registry and for the analysis of primary outcomes, their reliability, their contribution to the overall burden for respondents, and the incremental costs associated with their collection. Data elements should be selected with consideration for established clinical data standards, common data definitions, and whether patient identifiers will be used. The role of patient-reported outcomes (PROs) and any other information provided directly by the patient is also important to consider. Lastly, it is important to determine which elements are absolutely necessary and which are desirable but not essential. Once data elements have been selected, a data map should be created with clear operational definitions for each variable, and the data collection tools should be pilot tested. Overall, the choice of data elements should be guided by parsimony, validity, and a focus on achieving the registry’s purpose.

References for Chapter 5

1. Gliklich RE, Leavy MB, Karl J, et al. A framework for creating standardized outcome measures for patient registries. J Comp Eff Res. 2014;3(5):473–80. PMID: 25350799. DOI: 10.2217/cer.14.38.
2. Wattigney WA, Croft JB, Mensah GA, et al. Establishing data elements for the Paul Coverdell National Acute Stroke Registry: Part 1: proceedings of an expert panel. Stroke. 2003;34(1):151–6. PMID: 12511767.
3. Good PI. A manager’s guide to the design and conduct of clinical trials. New York: John Wiley & Sons, Inc.; 2002.
4. U.S. Core Data for Interoperability. Office of the National Coordinator for Health Information Technology. Version 1. https://www.healthit.gov/isa/us-core-data-interoperability-uscdi. Accessed June 18, 2019.
5. National Library of Medicine. SNOMED CT to ICD-10-CM Map. https://www.nlm.nih.gov/research/umls/mapping_projects/snomedct_to_icd10cm.html. Accessed June 18, 2019.
6. Institute of Medicine. Rare Diseases and Orphan Products: Accelerating Research and Development. Washington, DC: National Academies Press; 2010. PMID: 21796826.
7. Yancy CW, Fonarow GC, Albert NM, et al. Influence of patient age and sex on delivery of guideline-recommended heart failure care in the outpatient cardiology practice setting: findings from IMPROVE HF. Am Heart J. 2009;157(4):754–62.e2. PMID: 19332206. DOI: 10.1016/j.ahj.2008.12.018.
8. Goldberg J, Gelfand HM, Levy PS. Registry evaluation methods: a review and case study. Epidemiol Rev. 1980;2:210–20. PMID: 7000537. DOI: 10.1093/oxfordjournals.epirev.a036224.
9. Sorensen HT, Sabroe S, Olsen J. A framework for evaluation of secondary data sources for epidemiological research. Int J Epidemiol. 1996;25(2):435–42. PMID: 9119571. DOI: 10.1093/ije/25.2.435.
10. Bland JM, Altman DG. Statistical methods for assessing agreement between two methods of clinical measurement. Lancet. 1986;1(8476):307–10. PMID: 2868172.

Case Examples for Chapter 5

Case Example 11. Using Recognition Measures To Develop a Dataset

Description	Get With The Guidelines® is the flagship program for in-hospital quality improvement of the American Heart Association and American Stroke Association. The Get With The Guidelines—Stroke program supports point of care data collection and real-time reports aligned with the latest evidence-based guidelines. The reports include achievement, quality, reporting, and descriptive measures that allow hospitals to trend their performance related to clinical and process outcomes.
Sponsor	American Heart Association/American Stroke Association
Year Started	2003
Year Ended	Ongoing
No. of Sites	Over 2,000 hospitals have participated
No. of Patients	> 5,000,000

Challenge

The primary purpose of the Get With The Guidelines—Stroke program is to improve the quality of in-hospital care for stroke patients. The program uses the PDSA (plan, do, study, act) quality improvement cycle, in which hospitals plan quality improvement initiatives, implement them, study the results, and then make adjustments to the initiatives. To help hospitals implement this cycle, the program uses a registry to collect data on stroke patients and generate real-time reports showing compliance with a set of standardized stroke recognition and quality measures. The reports also include benchmarking capabilities, enabling hospitals to compare themselves with other hospitals at a national and regional level, as well as with similar hospitals based on size or type of institution.

In developing the registry, the team faced the challenge of creating a dataset that would be comprehensive enough to satisfy evidence-based medicine but manageable by hospitals participating in the program. The program does not provide reimbursements to hospitals entering data, so it needed to keep the dataset as small as possible while still maintaining the ability to measure quality improvement.

Proposed Solution

The team began developing the dataset by working backward from the recognition measures. Recognition measures, based on the sponsor’s guidelines for stroke care, contain detailed inclusion and exclusion criteria to determine the measure population, and they group patients into denominator and numerator groups. Using these criteria, the team developed a dataset that framed the questions necessary to determine compliance with each of the guidelines. The team then added questions to gather information on the patient population characteristics. Since the inception of the program, data elements and measure reports have been added or updated to maintain alignment with the current stroke guidelines. Over time, certain measures have also been promoted to or demoted from the higher tiers of recognition measures, depending on current science and changes in quality improvement focus.
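The idea of working backward from a measure can be sketched in code: once the denominator and numerator criteria are written down, the fields they read are the minimum dataset. The measure below is invented for illustration and is not an actual Get With The Guidelines measure; the field names are likewise hypothetical.

```python
def measure_compliance(patients, in_denominator, in_numerator):
    """Return (numerator count, denominator count, compliance rate or None)."""
    denominator = [p for p in patients if in_denominator(p)]
    numerator = [p for p in denominator if in_numerator(p)]
    rate = len(numerator) / len(denominator) if denominator else None
    return len(numerator), len(denominator), rate


# Hypothetical measure: eligible patients who received the recommended therapy.
def eligible(p):
    # Inclusion: ischemic stroke. Exclusion: documented contraindication.
    return p["diagnosis"] == "ischemic_stroke" and not p["contraindication"]


def treated(p):
    return p["therapy_given"]


patients = [
    {"diagnosis": "ischemic_stroke", "contraindication": False, "therapy_given": True},
    {"diagnosis": "ischemic_stroke", "contraindication": False, "therapy_given": False},
    {"diagnosis": "ischemic_stroke", "contraindication": True, "therapy_given": False},
    {"diagnosis": "hemorrhagic_stroke", "contraindication": False, "therapy_given": False},
]
print(measure_compliance(patients, eligible, treated))
```

Only three fields are needed to compute this hypothetical measure, which illustrates how criteria-driven design keeps the dataset small.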

Results

By using this approach, the registry team was able to create the necessary dataset for measuring compliance with stroke guidelines. The program was launched in 2003. As of 2019, over 2,000 hospitals have participated, entering data on more than five million stroke patients. The data from the program have been used in many abstracts and have resulted in dozens of manuscripts since 2007.

Key Point

Registry teams should focus on the outcomes or endpoints of interest when selecting data elements. In cases where compliance with guidelines or quality measures is the outcome of interest, teams can work backward from the guidelines or measures to develop the minimum necessary dataset for their registry.

For More Information

http://www.heart.org
Ormseth CH, Sheth KN, Saver JL, et al. The American Heart Association’s Get With the Guidelines (GWTG)-Stroke development and impact on stroke care. Stroke Vasc Neurol. 2017;2(2):94–105. PMID: 28959497. DOI: 10.1136/svn-2017-000092.
Schwamm L, Fonarow G, Reeves M, et al. Get With the Guidelines—Stroke is associated with sustained improvement in care for patients hospitalized with acute stroke or transient ischemic attack. Circulation. 2009;119:107–11. PMID: 19075103. DOI: 10.1161/CIRCULATIONAHA.108.783688.

Case Example 12. Patient-Powered Registries: Developing Scalable and Reusable Infrastructure To Support Harmonized Data Collection Across Rare Diseases

Description	NORD created the IAMRARE™ Registry Program to address the lack of rare disease natural history data, developing a disease-agnostic registry platform to support harmonized longitudinal data collection for all rare diseases, with the goal of informing patient decision making, standards of care, and drug development.
Sponsor	National Organization for Rare Disorders (NORD)
Year Started	2014
Year Ended	Ongoing
No. of Sites	1 Platform; >20 disease-specific registries
No. of Patients	Ongoing enrollment; > 6,000 participants

Challenge

Rare diseases pose unique research challenges, such as geographically dispersed patient populations, lack of information on the natural history of the disease, absence of standards of care or treatment guidelines, and limited numbers of clinicians with relevant expertise and experience managing patients with the condition. Longitudinal, observational data captured through patient registries can provide important information about the prevalence, characteristics, and natural history of the disease. These data may be used to supplement clinical trial data and identify meaningful endpoints during drug development, and the registries may serve as vehicles for identifying potential clinical trial participants. Some rare disease registries are developed and managed by patient organizations. However, patient organizations often lack the resources needed to develop and manage a registry, which underscores the importance of a common rare disease registry infrastructure, not only to minimize the burden of conducting longitudinal research studies but also to facilitate cross-disease analyses and community ownership of the data.

Proposed Solution

NORD has partnered with rare disease stakeholders, including patients, caregivers, researchers, clinicians, and regulatory agencies, to develop a cloud-based registry platform and supplemental support program that facilitates longitudinal and episodic data entry by both patients and caregivers. Patient organizations can leverage the platform infrastructure and the support program to develop and manage patient registries. Common data elements (CDEs) serve as the foundation for each registry and are supplemented by validated disease-specific measures or, in cases where these do not exist, custom surveys. In addition to the platform, NORD provides support for facilitating the development of research consortia to encourage collaboration among partners working in the same disease space and organizing treatment and guideline review meetings that bring together experts across stakeholder groups to utilize registry data to inform the revision of standards of care. NORD works closely with partners to refine the study design, supporting documents, and overall study management protocols, and facilitates the registry launch process through its partnership with a centralized Institutional Review Board (IRB). NORD also provides training and resources through educational webinars, in-person workshops, and individualized guidance. Once a registry has launched, the platform supports concurrent sub-studies that branch off from existing registries to capture specific and complementary data, thereby reducing redundant registry efforts and community fragmentation.

Results

Since 2014, the program has grown to over 35 registries representing 9000+ users who have submitted more than 70,000 surveys (statistics as of May 2019). Throughout this period, continued program development has been driven by ongoing stakeholder engagement and input, collected via targeted questionnaires and meetings, as well as through consistent, open-ended dialogue. The program’s community portal and in-person leadership meetings offer forums for the registry partners to consider new concepts, share resources and lessons learned, and celebrate key milestones.

With NORD’s technical and programmatic infrastructure, harmonized data are collected across registries, from basic demographics to patient-reported outcomes. Preliminary registry data have been presented at national and international conferences, submitted for peer-reviewed publication, and analyzed to inform the development of patient-focused clinical trials.

Key Point

The use of CDEs, repeatable processes, and scalable infrastructure can produce efficiencies in registry development and operations and create opportunities for cross-disease analysis.

For More Information

IAMRARE™ Registry Program. https://rarediseases.org.