
The reporting quality of natural language processing studies: systematic review of studies of radiology reports

Abstract

Background

Automated language analysis of radiology reports using natural language processing (NLP) can provide valuable information on patients’ health and disease. With its rapid development, NLP studies should have transparent methodology to allow comparison of approaches and reproducibility. This systematic review aims to summarise the characteristics and reporting quality of studies applying NLP to radiology reports.

Methods

We searched Google Scholar for studies published in English that applied NLP to radiology reports of any imaging modality between January 2015 and October 2019. At least two reviewers independently performed screening and completed data extraction. We specified 15 criteria relating to data source, datasets, ground truth, outcomes, and reproducibility for quality assessment. The primary NLP performance measures were precision, recall and F1 score.

Results

Of the 4,836 records retrieved, we included 164 studies that used NLP on radiology reports. The commonest clinical applications of NLP were disease information or classification (28%) and diagnostic surveillance (27.4%). Most studies used English radiology reports (86%). Reports from mixed imaging modalities were used in 28% of the studies. Oncology (24%) was the most frequent disease area. Most studies had a dataset size > 200 (85.4%), but the proportions of studies that described their annotated, training, validation, and test sets were 67.1%, 63.4%, 45.7%, and 67.7% respectively. About half of the studies reported precision (48.8%) and recall (53.7%). Few studies reported external validation (10.8%), data availability (8.5%) or code availability (9.1%). There was no pattern of performance associated with overall reporting quality.

Conclusions

There is a range of potential clinical applications for NLP of radiology reports in health services and research. However, we found suboptimal reporting quality that precludes comparison, reproducibility, and replication. Our results support the need for development of reporting standards specific to clinical NLP studies.


Background

Medical imaging reports, written by radiologists, contain rich data about patients’ health and disease which are not routinely captured in structured healthcare administrative datasets. Ready access to these data would be of great benefit for research and healthcare quality improvement, particularly for examining the health of large populations. However, this resource is currently underutilised because manual extraction of data from free-text imaging reports is time-consuming. Natural language processing (NLP) is an automated technique used to analyse language (often free text) and convert it to a structured format that is easier to use; thus, NLP provides the means to retrieve granular information from imaging reports [1], bypassing the need for manual extraction and simplifying research with these data.

Systematic review of the clinical NLP literature is important to identify promising developments and potential harms, and to help avoid duplication of effort; however, research synthesis in this area is complicated by a lack of consistency in study methods and reporting [2]. There are no clear reporting guidelines for clinical NLP studies, perhaps because NLP is used in so many different study designs. Methods and reporting guidance for clinical trials using machine learning (ML) [3,4,5] have recently been published, and extended guidelines are also being developed for the reporting of predictive ML models [6, 7]. Structured reporting protocols have been suggested for NLP in clinical outcomes research [8], as have codes of practice for the use of Artificial Intelligence (AI) in radiology [9]. However, publications evaluating the reporting standards of ML studies [6] and of its sub-field, deep learning (DL) [10], in clinical settings have shown low reporting standards, which make this research difficult to interpret, replicate, or synthesise. Whether clinical NLP in general has better reporting is unclear from existing reviews [11].

In this systematic review, we examine the quality of reporting of studies that apply clinical NLP to imaging reports. We chose imaging reports because they are relatively accessible and of small size, with a restricted vocabulary [12], which makes them suitable for NLP. We aimed to establish the current state of reporting of studies that apply NLP to imaging reports and to identify NLP-specific criteria to assist future reporting. An accompanying informatics paper has been written which provides a more detailed overview of the NLP methods used and their clinical applications [13].

Methods

We published our review protocol (https://0-doi-org.brum.beds.ac.uk/10.17504/protocols.io.bmwhk7b6) [14] and this report follows the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) [15] guideline.

Search strategy

We designed an automated search of Google Scholar with 'Publish or Perish' software [16] to identify articles published between January 2015 and October 2019, building on an existing review by Pons et al. [11], which included literature published up to October 2014 (details of the automated search can be found in Additional file 1). Our search was executed on 27th November 2019 and our search terms were: ("radiology" OR "radiologist") AND ("natural language" OR "text mining" OR "information extraction" OR "document classification" OR "word2vec") NOT patent. We also used a snowballing method, conducting a citation search over the publications that cite the Pons et al. review [11] and the articles cited within it. The results of these two search approaches were combined [13].
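
For readers who wish to script a comparable query, the sketch below shows one way such a Google Scholar search could be automated in Python. It is illustrative only: the review itself used the 'Publish or Perish' desktop tool, and the third-party scholarly package, its result fields, and the record cap used here are assumptions rather than part of the original workflow.

```python
# Illustrative sketch only: the review used 'Publish or Perish', not this script.
# Assumes the third-party 'scholarly' package (pip install scholarly); its result
# fields ('bib', 'pub_year', 'pub_url') may change between versions, and Google
# Scholar rate-limits automated access.
from scholarly import scholarly

QUERY = ('("radiology" OR "radiologist") AND ("natural language" OR "text mining" '
         'OR "information extraction" OR "document classification" OR "word2vec") '
         'NOT patent')

records = []
for pub in scholarly.search_pubs(QUERY):
    bib = pub.get("bib", {})
    year = int(bib.get("pub_year", 0) or 0)
    if 2015 <= year <= 2019:                      # restrict to the review period
        records.append({"title": bib.get("title"),
                        "year": year,
                        "url": pub.get("pub_url")})
    if len(records) >= 1000:                      # arbitrary cap for this sketch
        break

print(f"Retrieved {len(records)} candidate records")
```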

Study selection

We first ran an automated screening of papers to remove any duplicates and irrelevant publications. The criteria used to filter out irrelevant publications were: language is not English; the word 'patent' is found in the title or URL; year of publication before 2015, as our review aimed to update a previous review by Pons et al. (2016); the words 'review' or 'overview' found in the title, or 'this review' found in the abstract; image keywords found in the title or abstract with no NLP terminology in the abstract; and finding either no radiology keywords or no NLP terminology in the title or abstract (more details can be found in Additional file 1).
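
As an illustration of the rule-based pre-screening described above, the sketch below applies comparable filters to candidate records. The keyword lists and record fields are hypothetical placeholders, not the exact terms used in the review (those are given in Additional file 1).

```python
# Illustrative sketch of rule-based pre-screening similar to that described above.
# Keyword lists and record fields are hypothetical placeholders, not the review's
# actual filters.
import re

NLP_TERMS = ["natural language", "text mining", "information extraction",
             "document classification", "word2vec"]               # assumed examples
RADIOLOGY_TERMS = ["radiology", "radiologist", "imaging report"]  # assumed examples
IMAGE_TERMS = ["segmentation", "pixel", "image classification"]   # assumed examples

def contains_any(text, terms):
    text = text.lower()
    return any(term in text for term in terms)

def keep_record(rec):
    """Return True if a candidate record survives the automated pre-screening."""
    title = rec.get("title", "")
    abstract = rec.get("abstract", "")
    if rec.get("language", "English") != "English":
        return False
    if "patent" in title.lower() or "patent" in rec.get("url", "").lower():
        return False
    if rec.get("year", 0) < 2015:
        return False
    if re.search(r"\b(review|overview)\b", title, re.I) or "this review" in abstract.lower():
        return False
    # image keywords present but no NLP terminology in the abstract
    if contains_any(title + " " + abstract, IMAGE_TERMS) and not contains_any(abstract, NLP_TERMS):
        return False
    # require both radiology and NLP terminology in the title or abstract
    combined = title + " " + abstract
    return contains_any(combined, RADIOLOGY_TERMS) and contains_any(combined, NLP_TERMS)

example = {"title": "Text mining of chest CT radiology reports", "language": "English",
           "abstract": "We apply natural language processing to classify findings.",
           "year": 2018, "url": "https://example.org/paper"}
print(keep_record(example))   # True
```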

Four reviewers (three NLP researchers [DD, AG, HD] and one epidemiologist [MTCP]) then screened all titles and abstracts for potentially eligible studies. All papers that two or more reviewers approved for inclusion progressed to full paper review, and papers selected by only one reviewer were discussed by these four reviewers to reach agreement on inclusion or exclusion. Lastly, eight reviewers (six NLP researchers [AG, HD, VS, AC, BA, HW] and two epidemiologists [ED and MP]) carried out the full-paper screening according to the inclusion and exclusion criteria specified below and resolved any uncertainties by group discussion. All papers were double reviewed by an NLP researcher.

Inclusion and exclusion criteria

We included studies that applied NLP to radiology reports of any imaging modality. Our exclusion criteria were: (1) wrong publication type e.g. case reports, reviews, conference abstracts, comments, or editorials; (2) research not using radiology reports (e.g. using lab reports or clinical notes); (3) research using radiology images only (not using NLP methods); (4) not reporting any NLP results; (5) not available in full text; (6) duplicates; (7) articles written in a language other than English; (8) published before 2015; and (9) patents. The last four criteria should already have been applied by the automated screening, but we retained them for consistency and to exclude any papers that the filtering had missed.

Data extraction

The key data extracted were: year of publication, primary clinical application and primary technical objective, study period, language of radiology reports, anatomical region, imaging modality, disease area, size of data set, annotated set size, training set size, validation set size, test set size, external validation performed, domain expert used, number of annotators, inter-annotator agreement, NLP technique(s) used, best reported results (recall, precision and F1 score), availability of data set, and availability of code.

Data categorisation

We categorised the primary clinical application of each study. ‘Clinical application’ was the reported health-related purpose of the study. We iteratively developed a classification to represent the literature in our review, extending an existing categorisation [11], which ultimately included the categories of diagnostic surveillance, disease information and classification, language discovery and knowledge structure, quality and compliance, cohort and epidemiology, and technical NLP (Table 1).

Table 1 Clinical application areas and their definitions

Quality assessment

There are no reporting guidelines or risk of bias tools available specifically for clinical NLP studies. To address this, we specified 15 criteria which we considered would need to be reported to enable assessment of risk of bias and to assist replication of these studies. In selecting these criteria, we took account of both existing guidelines for epidemiological research [17] and guidance emerging from the clinical NLP community [3,4,5,6,7,8,9], and we sought group consensus on items that were generic measures of quality, readily applicable across the broad selection of methods included under clinical NLP. These criteria are described in detail in Table 2 and fall under five headings: data source, datasets, ground truth, outcomes, and reproducibility. Our choice of criteria may not encompass everything necessary to assess all NLP studies in radiology. For example, there may be additional outcome metrics that need to be reported (other than precision and recall) depending on the NLP task and clinical application. There may also be additional, more specific, measures that would further assist reproduction and allow comparison of the performance of particular types of NLP, such as hyperparameter selection for ML [18]. However, as we included a broad remit of research across ML, DL, and rule-based systems, we were unable to include such granular measures specific to any particular method. Nevertheless, our criteria represent core considerations identified to allow a consistent overview of the quality of studies across the heterogeneous body of research comprising clinical NLP, and they could be further developed for specific methods.

Table 2 Items used to assess the quality of reporting criteria in the current review

Assessment of performance

We did not summarise results quantitatively due to the anticipated methodological heterogeneity. Our approach was a narrative synthesis of studies and visual summarisation of NLP performance stratified by quality of reporting and clinical application categories. We categorised studies into high and low reporting quality groups by the median number of qualities achieved. For this analysis, when a study reported precision and recall without F1 score, we derived the corresponding F1 score for summarisation.
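
Where only precision and recall were reported, the F1 score is recoverable as their harmonic mean; a minimal sketch of this standard derivation:

```python
def f1_score(precision: float, recall: float) -> float:
    """F1 is the harmonic mean of precision and recall (both on a 0-1 scale)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Example: a study reporting precision 0.90 and recall 0.80 implies F1 ≈ 0.847.
print(round(f1_score(0.90, 0.80), 3))
```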

Results

Study selection and characteristics

Our search identified 4,836 publications, of which 274 were potentially eligible. After full eligibility assessment, we included 164 studies that used NLP on radiology reports (Fig. 1). Figure 2 presents the number of studies identified by year and illustrates the breakdown of studies by clinical application category and by NLP method. The number of publications increased from 22 in 2015 to 55 in 2019 (up to October 2019). The use of deep learning techniques increased in more recent years (Fig. 2). Table 1 attributes the studies to their clinical application categories and Table S1 (in Additional file 1) provides a detailed description of the study characteristics.

Fig. 1 PRISMA flowchart outlining the study selection process [13]

Fig. 2 Distribution of studies by publication year and (a) clinical application, (b) NLP methods

The most common clinical applications of these NLP studies were disease information or classification (28%) and diagnostic surveillance (27.4%), followed by language discovery and knowledge structure (16.5%), quality and compliance (12.2%), and research (9.8%). Of the NLP methods used, rule-based methods alone (26%) and machine learning alone (24%) were most frequently applied. Deep learning methods alone were used in 16 studies (9.8%), rising from none in 2015 to 14 papers (25%) in 2019. Further details of the NLP methods and clinical applications can be found in our accompanying informatics paper [13]. The majority (86%) of studies used English-language radiology reports, with other reported languages including Chinese, Spanish, German, French, Italian, Portuguese, Polish and Hebrew. The imaging modalities reported were mixed (28%), computerised tomography (23%), magnetic resonance imaging (9.8%), X-ray (4.9%), ultrasound (2.4%), mammography (3%), and other types (15%). The most frequent disease area was oncology (24%), and images of mixed anatomical regions were most frequent (26.2%), followed by thorax (19.5%) and head and neck (15.2%). The size of the datasets varied greatly between studies: eleven studies did not give data sizes, and other studies reported numbers of sentences, patients, or mixed data sources rather than numbers of reports. With these caveats, the median dataset size was 3,032 (IQR 875–70,000).

Reporting quality of included studies

Reporting of the pre-specified criteria varied across the included studies and years of publication (Fig. 3a, b). The median number of qualities achieved was 5. Consistent image acquisition was the most incompletely reported aspect: only 11 (6.7%) studies included information on the number and type of imaging machines used, and just eight of these 11 specified that images were of consistent quality where various sites and imaging machines were used. Other criteria with particularly incomplete reporting were the results of external validation (reported by only 15/139 studies, 10.8%), the availability of study data for external use (14 studies, 8.5%), and the availability of study code for external use (15 studies, 9.1%).

Fig. 3 Quality of reporting in (a) individual studies and (b) between 2015 and 2019. Legend: (a) Studies are arranged by the total number of qualities reported in the study, from left to right in descending order. (b) Numbers indicate the percentage of studies in each year of publication reporting the corresponding quality

The method of sampling imaging reports was also incompletely reported: 71 (43.3%) studies specified their sampling strategy, and only 33 (46.5%) of these sampled imaging reports consecutively. Most studies reported the size of their overall dataset (93.3%) and 85.4% had a dataset size exceeding 200. However, the split of datasets into training, validation, and test sets was reported only moderately well (63.4%, 45.7%, and 67.7% respectively). Annotated datasets were reported for 110 (67.1%) of the studies. Just under half of the studies (47.6%) reported the annotator expertise, and 70 (42.7%) confirmed that a domain expert was used. The number of annotators was specified in 91 (55.5%) studies, and inter-annotator agreement was reported for 67 (60.9%) of the 110 studies that used annotated datasets. We found that 80 (48.8%) and 88 (53.7%) studies, respectively, reported the performance metrics of precision and recall for their applications. On visual inspection, there was no apparent improvement in reporting over time (Fig. 3b).

Study performance

When examining study performance, we also considered the 71 (43%) studies reporting an F1 score. In studies reporting at least one of the performance measures (precision, recall or F1 score), there was no clear pattern of performance associated with quality of reporting or with stratification by clinical application (Fig. 4).

Fig. 4 Precision, recall and F1 score by quality of reporting and clinical application category. Legend: NLP system performance reported as precision, recall and F1 score from included studies. The size of the bubbles represents the relative sizes of corpora in each graph. (a) Studies were categorised into high (> 5 qualities) and low (≤ 5 qualities) reporting quality, based on the median number of qualities reported as the cut-off point. Reporting of F1 score was not a quality criterion. (b) Performance stratified by clinical application

Discussion

We conducted a systematic review of the quality of reporting of studies of NLP in radiology reports between 2015 and 2019. This review chronologically updated an existing review by Pons et al., whose focus was on the clinical applications of NLP tools, the NLP methods, and their performance, and which did not assess quality of reporting. We found increased research output in the time period of our review, retrieving 164 relevant publications compared with 67 for the preceding review, which searched all publications indexed up to October 2014. In our review, as anticipated, the use of deep learning methods had increased, but we found that rule-based methods and traditional machine learning classifiers were still widely used. The main clinical applications reported remained broadly similar between the reviews, although we found more papers that did not specify any health-related purpose; we categorised these as 'Technical NLP' and 'Disease information and classification'. Pons et al. reported that many NLP tools remained at a 'proof-of-concept' stage, and our study determined that this problem persists in the body of literature we retrieved.

The main focus of our work was the reporting of clinical NLP studies, and we found that this was generally poor (under half of the included studies reported the criterion) for eight of our 15 pre-specified criteria. In particular, the three reproducibility criteria were met by only 15, 14 and 15 studies for external validation, availability of data, and availability of code respectively. Although this is an expanding field with a growing number of publications, we found that reporting remained inconsistent and incomplete between 2015 and 2019.

First, most studies reported dataset size. However, more detailed information on data sampling was often omitted, which had implications for assessing bias in these studies. For example, not reporting whether imaging reports were sampled from consecutive patients, and not detailing the demographics of study participants, hindered assessment of selection bias and of the generalisability of applications from one population to another. The dangers of utilising data from unrepresentative populations, particularly to train ML applications, have been stressed [19, 20], and considerations of equity and how models may vary across different settings have begun to be incorporated into existing guidance for ML [2]. The split of datasets between training, test and validation sets was also inadequately reported: 45.5%, 40.9% and 31.8% of studies published in 2015 reported these criteria respectively. However, these dataset criteria did appear to improve over time: 74.5%, 78.2% and 56.4% of studies published in 2019 reported them respectively. Assessment of information bias was difficult because of the lack of details about comparable imaging machines and about any annotation, including the number of annotators and whether they were domain experts.
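
For clarity on what reporting the dataset split entails, the sketch below produces and reports a train/validation/test split; the 70/15/15 ratio, corpus size, and random seed are arbitrary choices for illustration, not values drawn from the reviewed studies.

```python
# Illustrative sketch of the kind of train/validation/test split whose sizes the
# review recommends reporting. The ratios, corpus size, and seed are arbitrary.
from sklearn.model_selection import train_test_split

reports = [f"report_{i}" for i in range(3032)]   # placeholder corpus of reports
labels = [i % 2 for i in range(3032)]            # placeholder binary labels

# 70% training, then split the remaining 30% evenly into validation and test.
train_x, rest_x, train_y, rest_y = train_test_split(
    reports, labels, test_size=0.30, random_state=42, stratify=labels)
val_x, test_x, val_y, test_y = train_test_split(
    rest_x, rest_y, test_size=0.50, random_state=42, stratify=rest_y)

# Reporting these counts (and how the split was made) is one of the criteria
# assessed in the review.
print(f"training={len(train_x)}, validation={len(val_x)}, test={len(test_x)}")
```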

Second, as recognised in ML [6] and DL [10] research, most NLP algorithms were ‘private’ and had not been replicated by their developers in other settings. It is therefore uncertain whether these tools are transferable between settings. External validation is difficult because obtaining and accessing suitable alternative datasets on which to test NLP tools is not easy. There are few publicly available datasets, and those which are available [21,22,23,24] may not be representative of the datasets researchers want to use or of the populations for whom they are developing their tools. For example, clinical datasets available from the United States may not translate to other healthcare systems. External validation of clinical NLP tools is important to establish whether they can be adopted for more widespread use and clinical implementation.

Third, external validation can be facilitated by the sharing of code and data to replicate research, but we found that code was not available for many studies [25]. There are multiple institutional factors, some particular to healthcare data, which influence disclosure, including privacy considerations, inconsistency in decision-making by regulatory bodies, liability concerns arising because these technologies are viewed akin to medical devices, and concerns over cybersecurity [26]. Additionally, NLP researchers may not have the capacity to support the use of their NLP systems when used externally. The development of bodies to facilitate health data research, such as Health Data Research UK (HDRUK), promises to address many of these factors [27], but they may remain a barrier for some time; in the interim, encouraging direct collaboration between clinical NLP researchers working in similar areas may be the most efficient way to expedite external validation. The NLP community has taken active steps towards improving reproducibility, for ML in particular, including the development and implementation of ML-specific reproducibility checklists, and this shift in practice may spread to encompass other areas of the clinical NLP research community [28].

Fourth, specifying a clinical application is important to demonstrate that a tool has meaningful clinical relevance, and also because the transferability of algorithms to different clinical tasks is not assured. Vollmer et al. [2] recently proposed 20 questions concerning transparency, reproducibility, ethics, and effectiveness (TREE) for ML and AI research; their first question urges researchers, from the inception of a project, to stipulate the relevance of their work to healthcare and patients. This requirement is also borne out in the CONSORT-AI reporting guidelines [4]. For our review, we generated six clinical application categories, extending Pons et al.’s existing framework [11] and disaggregating the categories into underlying subcategories, and we discovered that many studies did not specify a clinical application. Our study taxonomy may be useful for other researchers wishing to identify existing work to build on or clinical areas with gaps that remain unaddressed. In addition, our inclusion of more disaggregated clinical application subcategories (Table 1) could facilitate future work to collate these applications within ‘like’ categories and examine their performance on similar clinical tasks.

Lastly, we summarised the performance of all 164 studies and sought trends according to their quality of reporting and clinical application, but no clear associations emerged. This is likely due to the heterogeneity of clinical NLP studies and their contextual nature. The best-performing methods for a clinical NLP tool are likely determined by the intersection of multiple factors, including the clinical application, the type of reports (including modality and indication), the specific information required (including the rarity of conditions), the need for clinical input, the complexity of the NLP task, and the performance required to be acceptable for clinical implementation and the minimisation of harm.

The implication of our findings for practice is that, despite a large body of work and the potential advantages of NLP in clinical settings, advancing these tools to the stage of widespread implementation is hindered by poor standards of reporting, particularly relating to external validation and the sharing of NLP code and data. This reflects the situation reported for the sub-fields of ML and AI, where systematic reviews identified that most studies failed to use or adhere to any existing reporting guidance [6, 10] and that data and code availability were lacking [10]. However, a move has begun to pursue transparency and replicability within AI, ML and DL research [2], which all clinical NLP should follow, including the development of extended reporting guidelines [3, 4, 7]. Where no extended guidelines exist, we recommend that researchers follow guidelines specific to their study type [29] and also consider reporting the 15 NLP-specific criteria which we have sought in our review.

Our review was strengthened by the large number and wide variety of studies identified. However, the heterogeneity of this literature was also a limitation in that it precluded any meta-analysis of outcomes. A further limitation was that we developed our own quality assessment criteria, owing to the lack of available tools in this field. We acknowledge that there may be additional criteria that could assist quality assessment, either for specific types of NLP (such as hyperparameter selection for ML) or more generally; for example, a description of computing infrastructure could also assist assessment of reproducibility and could be readily shared [18]. We also did not exclude any studies based on poor quality; however, we consider this approach fitting for a review in which meta-analysis is not undertaken and which focuses on demonstrating the breadth of work and assessing reporting quality across the whole body of work. Using an automated search in Google Scholar may have reduced our search sensitivity, although Google Scholar has been shown to have very comprehensive coverage [30]. Our clinical application categories were also developed through the review process, and there was overlap for some studies where reviewers had to decide on the primary application. These decisions were necessarily subjective and studies could be reassigned; however, decisions were discussed and agreed by at least two reviewers.

Conclusions and recommendations

Our systematic review of the use of NLP on radiology reports, for the period 2015–2019, found substantial growth in research activity but no clear improvement in the reporting of key data to allow reproducibility and replication. This impedes synthesis of the research field. In this paper we provide an overview of the current landscape and offer developments in both the categorisation of clinical applications for NLP on radiology reports and suggested criteria for inclusion in quality assessment of this research. This paper complements the limited guidance published to date in relation to AI in radiology [9], clinical NLP [8], and ML within NLP [2,3,4,5], and we hope that our criteria can contribute to the development of formally agreed standards specific to clinical NLP.

Availability of data and materials

All data generated or analysed during this study are included in this published article and its companion paper [including their additional/supplementary information files] [13].

Abbreviations

AI: Artificial Intelligence
CONSORT-AI: Consolidated Standards of Reporting Trials-Artificial Intelligence
CT: Computerised tomography
DL: Deep learning
HDRUK: Health Data Research UK
IQR: Interquartile range
ML: Machine learning
MRI: Magnetic resonance imaging
NLP: Natural language processing
PRISMA: Preferred Reporting Items for Systematic Reviews and Meta-Analyses
TREE: Transparency, reproducibility, ethics, and effectiveness

References

  1. Cai T, Giannopoulos AA, Yu S, Kelil T, Ripley B, Kumamaru KK, et al. Natural language processing technologies in radiology research and clinical applications. Radiographics. 2016;36(1):176–91.
  2. Vollmer S, Mateen BA, Bohner G, Király FJ, Ghani R, Jonsson P, et al. Machine learning and artificial intelligence research for patient benefit: 20 critical questions on transparency, replicability, ethics, and effectiveness. BMJ. 2020;368:l6927.
  3. Cruz Rivera S, Liu X, Chan A-W, Denniston AK, Calvert MJ, Ashrafian H, et al. Guidelines for clinical trial protocols for interventions involving artificial intelligence: the SPIRIT-AI extension. Lancet Digit Health. 2020;2(10):e549–60.
  4. Liu X, Cruz Rivera S, Moher D, Calvert MJ, Denniston AK, Ashrafian H, et al. Reporting guidelines for clinical trial reports for interventions involving artificial intelligence: the CONSORT-AI extension. Lancet Digit Health. 2020;2(10):e537–48.
  5. Bluemke DA, Moy L, Bredella MA, Ertl-Wagner BB, Fowler KJ, Goh VJ, et al. Assessing radiology research on artificial intelligence: a brief guide for authors, reviewers, and readers—from the Radiology editorial board. Radiology. 2019;294(3):487–9.
  6. Yusuf M, Atal I, Li J, Smith P, Ravaud P, Fergie M, et al. Reporting quality of studies using machine learning models for medical diagnosis: a systematic review. BMJ Open. 2020;10(3):e034568.
  7. Collins GS, Moons KGM. Reporting of artificial intelligence prediction models. Lancet. 2019;393(10181):1577–9.
  8. Velupillai S, Suominen H, Liakata M, Roberts A, Shah AD, Morley K, et al. Using clinical natural language processing for health outcomes research: overview and actionable suggestions for future advances. J Biomed Inform. 2018;88:11–9.
  9. Geis JR, Brady AP, Wu CC, Spencer J, Ranschaert E, Jaremko JL, et al. Ethics of artificial intelligence in radiology: summary of the Joint European and North American Multisociety Statement. Radiology. 2019;293(2):436–40.
  10. Nagendran M, Chen Y, Lovejoy CA, Gordon AC, Komorowski M, Harvey H, et al. Artificial intelligence versus clinicians: systematic review of design, reporting standards, and claims of deep learning studies. BMJ. 2020;368:m689.
  11. Pons E, Braun LMM, Hunink MGM, Kors JA. Natural language processing in radiology: a systematic review. Radiology. 2016;279(2):329–43.
  12. Bates J, Fodeh SJ, Brandt CA, Womack JA. Classification of radiology reports for falls in an HIV study cohort. J Am Med Inform Assoc. 2016;23(e1):e113–7.
  13. Casey A, Davidson E, Poon M, Dong H, Duma D, Grivas A, et al. A systematic review of natural language processing applied to radiology reports. 2021.
  14. NLP of radiology reports: systematic review protocol [Internet]. 2020. https://www.protocols.io/view/nlp-of-radiology-reports-systematic-review-protoco-bmwhk7b6.
  15. Moher D, Liberati A, Tetzlaff J, Altman DG, The PRISMA Group. Preferred reporting items for systematic reviews and meta-analyses: the PRISMA statement. PLoS Med. 2009;6(7):e1000097.
  16. Harzing.com. Publish or Perish. 2020. https://harzing.com/resources/publish-or-perish.
  17. von Elm E, Altman DG, Egger M, Pocock SJ, Gøtzsche PC, Vandenbroucke JP. The Strengthening the Reporting of Observational Studies in Epidemiology (STROBE) statement: guidelines for reporting observational studies. Int J Surg. 2014;12(12):1495–9.
  18. Dodge J, Gururangan S, Card D, Schwartz R, Smith NA. Show your work: improved reporting of experimental results. 2019. arXiv:1909.03004.
  19. Noor P. Can we trust AI not to further embed racial bias and prejudice? BMJ. 2020;368:m363.
  20. Rajkomar A, Hardt M, Howell MD, Corrado G, Chin MH. Ensuring fairness in machine learning to advance health equity. Ann Intern Med. 2018;169(12):866–72.
  21. Demner-Fushman D, Kohli MD, Rosenman MB, Shooshan SE, Rodriguez L, Antani S, et al. Preparing a collection of radiology examinations for distribution and retrieval. J Am Med Inform Assoc. 2016;23(2):304–10.
  22. Irvin J, Rajpurkar P, Ko M, Yu Y, Ciurea-Ilcus S, Chute C, et al. CheXpert: a large chest radiograph dataset with uncertainty labels and expert comparison. 2019.
  23. Johnson AEW, Pollard TJ, Berkowitz SJ, Greenbaum NR, Lungren MP, Deng C, et al. MIMIC-CXR, a de-identified publicly available database of chest radiographs with free-text reports. Sci Data. 2019;6(1):317.
  24. Kim C, Zhu V, Obeid J, Lenert L. Natural language processing and machine learning algorithm to identify brain MRI reports with acute ischemic stroke. PLoS ONE. 2019;14(2):e0212778.
  25. Haibe-Kains B, Adam GA, Hosny A, Khodakarami F, Shraddha T, Kusko R, et al. Transparency and reproducibility in artificial intelligence. Nature. 2020;586(7829):E14–6.
  26. McKinney SM, Karthikesalingam A, Tse D, Kelly CJ, Liu Y, Corrado GS, et al. Reply to: Transparency and reproducibility in artificial intelligence. Nature. 2020;586(7829):E17–8.
  27. HDRUK. National Implementation Project: National Text Analytics Resource [cited 2021 18th January]. https://www.hdruk.org/projects/national-text-analytics-project/.
  28. Pineau J, Vincent-Lamarre P, Sinha K, Larivière V, Beygelzimer A, d'Alché-Buc F, et al. Improving reproducibility in machine learning research (a report from the NeurIPS 2019 Reproducibility Program). 2020. arXiv:2003.12206.
  29. Enhancing the QUAlity and Transparency Of health Research (EQUATOR) Network [cited 2020 4th November]. https://www.equator-network.org/.
  30. Gehanno JF, Rollin L, Darmoni S. Is the coverage of Google Scholar enough to be used alone for systematic reviews. BMC Med Inform Decis Mak. 2013;13:7.


Acknowledgements

Not applicable.

Funding

This research was supported by the Alan Turing Institute, Alzheimer’s Society, MRC, HDR-UK and the Chief Scientist Office. B.A., A.C., D.D., A.G. and C.G. have been supported by the Alan Turing Institute via Turing Fellowships (B.A., C.G.) and Turing project funding (EPSRC grant EP/N510129/1). A.G. was also funded by a MRC Mental Health Data Pathfinder Award (MRC-MCPC17209). E.D. was supported by the Alzheimer’s Society. H.W. is an MRC/Rutherford Fellow, HDR UK (MR/S004149/1). H.D. is supported by the HDR UK National Phenomics Resource Project. V.S-P. is supported by the HDR UK National Text Analytics Implementation Project. W.W. is supported by a Scottish Senior Clinical Fellowship (CAF/17/01). M.T.C.P. is supported by a Cancer Research UK Brain Tumour Centre of Excellence Award (C157/A27589). The funders had no role in determining the study design; in the collection, analysis, and interpretation of data; in the writing of the report; or in the decision to submit the paper for publication.

Author information


Contributions

BA, WW and HW conceptualised this study. DD carried out the search, including automated filtering, and designed the meta-enriching steps. BA, AG, CG and RT advised on the automatic data collection method devised by DD. MTCP, AG, HD and DD carried out the first-stage review and AC, ED, VS-P, MTCP, AG, HD, BA and DD carried out the second-stage review. ED and MP wrote the main manuscript with contributions from all authors. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Emma M. Davidson.

Ethics declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Additional file 1. Additional details of the automated search and more detailed characteristics.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.


About this article


Cite this article

Davidson, E.M., Poon, M.T.C., Casey, A. et al. The reporting quality of natural language processing studies: systematic review of studies of radiology reports. BMC Med Imaging 21, 142 (2021). https://0-doi-org.brum.beds.ac.uk/10.1186/s12880-021-00671-8
