
A deep learning model incorporating spatial and temporal information successfully detects visual field worsening using a consensus based approach

Aug 23, 2023

Scientific Reports volume 13, Article number: 1041 (2023)


Glaucoma is a leading cause of irreversible blindness, and its worsening is most often monitored with visual field (VF) testing. Deep learning models (DLMs) may help identify VF worsening consistently and reproducibly. In this study, we developed and investigated the performance of a DLM on a large population of glaucoma patients. We included 5099 patients (8705 eyes) seen at one institution from June 1990 to June 2020 who had VF testing as well as clinician assessment of VF worsening. Since there is no gold standard to identify VF worsening, we used a consensus of six commonly used algorithmic methods, which include global regressions as well as point-wise change in the VFs. We used the consensus decision as a reference standard to train/test the DLM and to evaluate clinician performance. 80%, 10%, and 10% of patients were included in the training, validation, and test sets, respectively. Of the 873 eyes in the test set, 309 (60.6%) were from females and the median age was 62.4 years (IQR 54.8–68.9). The DLM achieved an AUC of 0.94 (95% CI 0.93–0.99). Even after removing the 6 most recent VFs, providing fewer data points to the model, the DLM successfully identified worsening with an AUC of 0.78 (95% CI 0.72–0.84). Clinician assessment of worsening (based on documentation from the health record at the time of the final VF in each eye) had an AUC of 0.64 (95% CI 0.63–0.66). Both the DLM and clinicians performed worse when the initial disease was more severe. These data show that a DLM trained on a consensus of methods to define worsening successfully identified VF worsening and could help guide clinicians during routine clinical care.

Glaucoma is the leading cause of irreversible blindness worldwide, and early identification of worsening is critical for prevention1,2. Visual field (VF) testing is one of the most important strategies to monitor disease worsening3. Identifying worsening in VFs is difficult due to fluctuating performance, test variability, and the lack of a gold standard4,5,6,7. One approach to address this problem is more frequent testing, though this can present a significant burden for patients while still requiring multiple years to identify progression8,9,10,11,12.

Various objective methods have been developed to help determine VF progression; these can be broadly divided into event- and trend-based methods. Event-based methods identify progression by scoring VFs with various rules based on the density and depth of defects compared to the baseline VF and have been used in major clinical trials such as EMGT, CIGTS, and AGIS13,14,15. Guided progression analysis (GPA), which is similar to the EMGT criteria, is commonly used in clinical practice, and previous studies have found that it identifies progression sooner but with lower specificity16,17. Trend-based methods use linear regression, which can be applied to global VF parameters or pointwise data. Previous work has suggested event-based methods could identify progression sooner than trend-based methods18,19. Two studies compared all of these methods on a large set of longitudinal VFs and showed weak agreement, suggesting a need for a consensus among the distinct algorithms to identify progression20,21.

The use of artificial intelligence represents one potential approach to identify worsening earlier and more consistently22,23,24,25,26. It has even been used to predict future VFs or identify patients at the highest risk of worsening27,28. Traditional machine learning approaches utilize pre-specified transformations of subcomponents of the data, while deep learning approaches allow training models with raw data29. Deep learning has a variety of approaches that can be useful depending on the structure of the data. In a recent paper, a specific kind of deep learning model (DLM), a convolutional long short-term memory (LSTM) model, showed success in identifying VF worsening30. This model is unique in that it allows the extraction of both spatial and temporal features, which are critical for assessing VFs.

The goal of the current work was to assess the performance of a convolutional LSTM at detecting VF worsening when trained on a consensus of event- and trend-based algorithms commonly used to detect worsening. To further evaluate the robustness of the DLM, we assessed its performance at identifying worsening when trained with fewer VFs. We also compared agreement among the various algorithms used to detect worsening to emphasize the need for a consensus measure of worsening. Based on the data presented here, a DLM could help clinicians identify VF worsening.

8705 eyes from 5099 patients were included (Fig. 1). The median age across all patients at their first VF was 62.3 years, and 56.2% were female (Table 1). The initial VF mean deviation (MD) across all eyes was − 2.5 dB, with a mean longitudinal decline of 0.19 dB/year. The distribution of baseline MD is shown in the histogram (Supplementary Fig. 1). Each eye had about 12 VFs performed approximately once per year. The patients were divided into training (80%), validation (10%), and test (10%) sets. Table 1 displays these and additional characteristics for training, validation, and test eyes. There was no statistically significant difference among the three groups (p > 0.05, ANOVA). Using only one eye from each patient in the test set (n = 510) did not change the results (data not shown).

Study inclusion criteria. The flow chart shows the total number of patients, eyes, and VF exams that were present at baseline. Eyes were excluded if they did not have complete VF data and did not have at least 7 reliable fields. The final criterion for inclusion was the clinicians’ decision of worsening at the time of VF testing, which was retrospective.

For each eye, all methods to assess progression were computed; the results are shown in Fig. 2. This plot shows the total number of progressing eyes to the left of each method's row; CIGTS had the highest number of progressing eyes with 2411 (27.7%), followed by GPA with 2192 (25.2%). VFI slope and AGIS identified the fewest progressing eyes with 643 (7.4%) and 784 (9.0%), respectively. The clinicians were in the middle, identifying 1353 (15.6%) progressing eyes. The columns show the number of eyes that had progression based on the combination of methods indicated in each row; in total, 126 eyes were found to be progressing by all methods and clinicians (rightmost column).

Upset plot with all methods to detect worsening. Each row in the table corresponds to a different method to detect worsening. The bar chart on the left indicates the total number of eyes identified as worsening by the indicated method with the gray lines identifying 1000 and 2000. The columns indicate, with dots and lines, the combination of methods being assessed. The bar chart above the column shows the number of eyes progressing for that specific combination of methods. The first seven columns show how many were identified as progressing by each method alone while the rightmost column shows how many eyes were identified as progressing by every method.

Kappa coefficients to compare the agreement among each method are shown in Table 2. The agreement across all methods of detecting VF worsening was calculated and Fleiss kappa (95% CI) was 0.34 (0.33, 0.36) when clinicians’ assessments of worsening were included and 0.41 (0.39, 0.42) when the clinician assessments were not included. Trend-based methods (MD slope, PLR slope, and VFI slope) in general had a higher agreement among themselves (darker shade). Of the event-based methods (AGIS, GPA, and CIGTS) CIGTS had the least agreement with other trend and event-based methods. The clinicians’ assessment of worsening had a weak agreement with all other methods.

The deep learning model (DLM) was trained to detect the worsening of visual fields based on the 4 of 6 reference standard (Fig. 3). The DLM had an AUC (95% CI) of 0.94 (0.93, 0.99) (blue line, Fig. 4). In the ROC plot (Fig. 4), the clinician assessment of worsening is demonstrated to have a lower true positive rate (TPR) and higher false positive rate (FPR) than the DLM. The clinician assessment had a TPR (95% CI) of 0.42 (0.32, 0.54) and an FPR (95% CI) of 0.16 (0.06, 0.37). At the clinician TPR (0.42), the all-VF DLM had an FPR (95% CI) of 0.024 (0.00, 0.062). At the clinician FPR (0.16), the all-VF DLM had a TPR of 0.93 (0.87, 0.99). The estimated AUC for clinicians was 0.63 (0.62, 0.64). One benefit of applying a DLM is that model performance can be assessed with fewer data points. For each eye, up to the most recent 6 VFs were removed and the model performance was assessed (multi-colored lines). The AUC decreased with the removal of more VFs, but all AUCs were still significantly larger than the clinician assessment using all the VF data (p < 0.001 for all models compared to the clinician). The DLM had a significantly higher AUC regardless of how many tests (1 out of 6 to 6 out of 6) were required for the reference standard (Supplementary Table 1). The mixed-effects model also had a lower AUC than the DLM, with an AUC of 0.82 (0.77–0.86, data not shown).

Deep learning model diagram. Deep learning architecture that incorporates data from visual fields and their 8 global metrics.

Test set performance of deep learning model and clinician assessment of VF worsening. The blue line shows the model performance with the full data. A decreasing AUC can be seen with the removal of more VFs (rightward shift of ROC curve) to the pink, where 6 of the last VFs were removed. The AUC decreased from 0.94 (0.91, 0.98) to 0.78 (0.72, 0.84) when comparing the full data and removal of 6 VFs, respectively. The cyan dot and 95% CI whiskers show the sensitivity and specificity of clinicians in detecting worsening in the same set of eyes during routine clinical practice. The estimated AUC for clinicians was 0.63 (0.62, 0.64).

Table 3 shows sensitivity and specificity for the DLM and clinicians after subdividing the data based on initial disease severity. The performance is significantly worse for both the DLM and clinicians when patients had more severe disease at baseline (p < 0.05 for both comparisons).

A similar analysis was conducted using the clinician assessment of worsening as the reference standard and the DLM was also able to successfully identify worsening with an AUC of 0.79 (Supplementary Fig. 2). The comparison of AUC for disease severity is also shown (Supplementary Table 2).

In this large population of patients, there was significant variability in agreement among the various methods to identify VF worsening. We show that a DLM trained to identify VF worsening based on a consensus of these methods performed well. Additionally, the DLM was robust and had a significantly higher AUC than clinician performance and the mixed-effects model when provided with less VF data than available to the clinician. Both the DLM and clinicians had more difficulty assessing worsening when the disease at onset was more advanced. The DLM can help clinicians better assess when VF is worsening.

Multiple studies have compared agreement among algorithms for identifying VF worsening. Initial studies showed event-based methods, namely GPA, had more sensitivity and earlier detection of worsening compared to trend-based, namely VFI and MD regression18,19. Various studies comparing event and trend-based methods show variation in agreement ranging from poor to moderate with kappa coefficients ranging from 0.22 to 0.5118,19,20,21. Agreement within event-based methods is better, ranging from 0.48 to 0.55. Trend-based methods also have high agreement up to 0.67 between MD and VFI, but also as low as 0.2 between MD and PLR20,21. Our study also showed moderate agreement between GPA and both AGIS (0.45) and CIGTS (0.48). We found higher agreement among distinct trend-based methods, ranging from 0.57 to 0.72. One unique strength of our study was assessing GPA agreement in a large sample. The other study with a large sample (~ 13,000 eyes) did not assess GPA20. Interestingly, the percentage of eyes identified as worsening has varied across studies. Our results contrast with a recent report that found PLR had the highest proportion of VFs progressing at almost 50% and CIGTS was the lowest at 10%20. Another report found CIGTS/GPA/PLR identified worsening in the highest number of eyes, while VFI rate was the lowest which is more similar to our findings21. Importantly, the differences here could arise from variability in patient population and practice patterns. The eyes in this study had more mild disease at baseline with mean MD of − 2.5 dB compared to about − 5 dB in the other studies. The demographics of patients in this study are comparable to other studies though there is a higher percentage of female and Black patients than seen in population studies31.

A range of factors may underlie when algorithms agree. With more VF tests, trend-based methods were able to find progression compared with GPA18. To specifically assess discordance, one study identified eyes where 3/6 algorithms identified worsening and the other 3 showed no worsening, and found that discordance was associated with worse initial MD, older age, more VFs, longer follow-up duration, and the institution the data came from20. These findings highlight the difficulty in identifying any single method as an objective reference standard. Even the decisions of clinical experts show significant variation32,33. In this study, we combined objective metrics to identify a consensus. Requiring consensus of too many algorithms would be too stringent; for example, requiring 5/6 and 6/6 algorithm agreement in one study identified worsening in only 3.1% and 2.5% of eyes, respectively20. In this study, the percentage of patients identified as worsening with 4, 5, and 6 algorithms identifying worsening was 10.0%, 6.8%, and 3.8%, respectively. We applied the definition of consensus as 4/6 or more agreement. One strength of requiring four algorithms was that any eye identified as worsening required at least one event- and one trend-based method to agree. Though our major focus here uses the consensus decision as a reference standard, we also conducted a supplementary analysis using the clinician decision as the reference standard. The DLM was successfully trained with an AUC of 0.79. This worse performance, when compared with the consensus as the reference standard, could be due to numerous reasons such as a less algorithmic approach by the clinicians or the inclusion of clinical factors which are not available to the model.

Traditional machine learning has been applied to glaucoma for many years and more recent advances in computing have allowed more complex models29. Since VF changes have a significant spatiotemporal component, a recent paper showed success using a convolutional LSTM (cLSTM) model which retains spatial and temporal features. In that study, the changes in VF were defined by trend-based methods and it was shown that cLSTM identified worsening successfully with AUC values as high as 0.93930. These values are higher than what was seen in traditional machine learning approaches, for example, the Gaussian mixture model had sensitivity and specificity of 89.9% and 93.8% with an AUC of 0.8622. However, these studies are all difficult to compare given the various reference standards. This study is unique in that cLSTM is used to identify VF worsening based on the consensus of multiple algorithms. We also compare the DLM with a mixed-effects model and show superior performance. Another comparison in this study is clinician performance which demonstrates the potential value of the DLM in routine clinical care. Though the clinician performance here has limitations, to our knowledge, this is the first study to show clinician performance in a large dataset and compare it to a DLM34. Previous deep learning studies have shown excellent results such as excellent accuracy30, ability to predict future VFs27,28, and earlier identification of progression35. However, comparison of deep learning performance with clinicians will be critical if such models will be deployed in a clinical setting to assess for worsening. Since other studies had shown the successful ability of deep learning to forecast future VFs, we assessed the performance of the model after removing the final VFs. Removing each additional VF caused worse performance of the model, but even after removing 5 of the most recent VFs the DLM performs as well as a mixed-effects model. 
These findings show that deep learning is useful not only for accurate diagnosis and detection of progression, but may also identify early markers of higher-risk patients.

This study has some limitations. The data are retrospective and from a tertiary referral center. Additionally, there was some filtering of the data to include only those eyes with longitudinal data and reliable VFs to allow accurate identification of worsening. This could create bias in the selected patients and limit the generalizability of the results. It is important to note, though, that patients throughout the disease severity spectrum were included in the study. External validation of our cLSTM model will be required before this model can be deployed for clinical use. The VF data in this study were based on SITA 24-2 testing from the Zeiss Humphrey Field Analyzer; utilization of other VF data (e.g. from the Haag-Streit Octopus perimeter) would require representative training data from those tests. Another limitation is that the clinician assessment of worsening was made retrospectively and at a single time point at the last visual field, and the clinicians were not specifically instructed in how to grade this assessment. However, the clinicians represent glaucoma specialists during routine clinical care who had access to all visual fields as well as progression diagrams containing GPA and MD/VFI slopes. Future directions include further comparisons of deep learning and clinician performance in more controlled and prospective settings, as well as the role of including additional parameters such as clinical data or structural testing in the assessment of worsening.

In conclusion, we show that there is significant variability among the objective methods to classify VF worsening and that the consensus of these methods represents one method to create a reference standard. Using this reference standard, we show that a DLM, specifically cLSTM, can successfully identify VF worsening and would help support clinicians during routine clinical care. After careful external validation, such models may be deployed to identify VF worsening accurately and automatically in glaucoma clinics.

This study was reviewed and approved by the Johns Hopkins University School of Medicine Institutional Review Board and adhered to the tenets of the Declaration of Helsinki. The requirement for informed consent was waived because of the retrospective nature of the study.

Demographic and clinical data were obtained from patients seen at the Johns Hopkins Wilmer Eye Institute from June 1990 to June 2020. The clinical assessment of worsening at the last visual field (VF) was extracted from Epic (Verona, Wisconsin). Clinicians rating eyes as either possibly or likely worsening on VF testing were labeled as worsening, while other choices (stable, possible improvement, or likely improvement) were labeled as not worsening. The VF data were HVF 24-2 studies extracted from FORUM (Zeiss, Dublin, CA). The majority of these were SITA-Standard, but the data also included SITA-Fast, full threshold, and SITA-Faster.

VFs were included only if they were considered reliable, with less than 15% false positives and less than either 25% false negatives for mild/moderate disease or 50% for severe disease36. We only included eyes with at least 7 reliable VFs so that an accurate determination of longitudinal change could be made. The last VF in the series for each eye was required to have a clinician assessment of VF worsening or not worsening recorded in the charts. The number of VF tests excluded at each step is shown in the flow chart (Fig. 1).
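The reliability filter above can be sketched as a simple predicate. This is an illustrative sketch, not the authors' code; the function name and the convention of expressing rates as fractions are our own.

```python
def reliable_vf(false_pos, false_neg, severe):
    """Inclusion filter described in the text: false positives < 15%,
    and false negatives < 25% for mild/moderate disease or < 50% for
    severe disease. Rates are fractions in [0, 1]."""
    fn_cut = 0.50 if severe else 0.25
    return false_pos < 0.15 and false_neg < fn_cut
```

For example, a field with 10% false positives and 30% false negatives would be excluded for a mild/moderate eye but retained for a severe one.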

There is no gold standard to assess VF worsening but there are numerous algorithms that have been commonly employed in the field. We used six of these automated methods. This includes three event-based methods: Guided Progression Analysis (GPA), Advanced Glaucoma Intervention Study (AGIS) scoring system, and Collaborative Initial Glaucoma Treatment Study (CIGTS) scoring system. We also used three trend-based methods: Mean deviation (MD) rate of change (MD slope), VF index (VFI) rate of change (VFI slope), and Pointwise linear regression (PLR). In addition to these algorithms, we also had access to clinician assessment of worsening for the last VF in each series. The description of each of these methods is outlined below. In all event-based methods, a baseline was needed which was calculated as the average of the first two VFs.

GPA is typically calculated by proprietary software and based on the Glaucoma Change Probability Analysis 3,21,37. Deviation values at each point in the VF are compared to the average of the values at the first two VFs. The points with a difference significantly higher than the test–retest variability at a p < 0.05 are identified. As we did not have access to the GPA database for thresholds for test–retest variability we determined thresholds for α < 0.05 based on an empiric normative database from the University of Iowa. We also used total deviation values instead of pattern deviation which is classically used by GPA, as previous studies have shown total deviation is more likely to detect progression38. We defined worsening as any three or more points worsening beyond the threshold level for three consecutive fields compared to the average of the first two VF exams.
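The GPA-style criterion above can be sketched as follows. This is a simplified illustration, not the authors' implementation: the per-point thresholds stand in for the test-retest variability limits the paper derives from the University of Iowa normative database, which is not reproduced here.

```python
import numpy as np

def gpa_like_worsening(series, thresholds, n_points=3, n_consec=3):
    """series: (n_vfs, 52) total-deviation values over time; the baseline
    is the mean of the first two fields. thresholds: (52,) per-point
    change limits (stand-ins for test-retest variability). Worsening:
    at least n_points locations decline beyond their threshold on
    n_consec consecutive follow-up fields."""
    baseline = series[:2].mean(axis=0)
    decline = baseline - series[2:]               # positive = worse
    flagged = (decline > thresholds).sum(axis=1) >= n_points
    run = 0
    for f in flagged:
        run = run + 1 if f else 0                 # count consecutive flags
        if run >= n_consec:
            return True
    return False
```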

AGIS score was calculated for each VF as described in the AGIS trial13. Briefly, each VF is graded based on the depth and number of defects in pre-specified locations on the VF. These pre-specified locations include nasal, superior, and inferior hemifields. The score ranges from 0 to 20 and scores for each VF are compared to the baseline scores. A computer program was used to calculate the score39. An AGIS score increase of at least four points which is sustained in three consecutive VFs was classified as worsening.

CIGTS score calculation has been previously described in the CIGTS trial15. This score uses the total deviation probability map and is calculated based on the density and depth of defects across the VF. VFs with multiple isolated defective points receive a lower score than VFs with clusters of defective points. The CIGTS score also ranges from 0 to 20, and an increase in score of three or more points sustained for three consecutive VFs was classified as worsening.
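The AGIS and CIGTS event criteria share the same "sustained increase over baseline" logic, which can be sketched as below. The score computation itself is not reproduced; this only illustrates the sustained-change test, with assumed scalar per-VF scores.

```python
def sustained_score_increase(scores, baseline, min_increase, n_consec=3):
    """Shared shape of the AGIS/CIGTS event criteria: worsening when a
    per-VF score exceeds the baseline score by at least min_increase
    (4 for AGIS, 3 for CIGTS) on n_consec consecutive fields."""
    run = 0
    for s in scores:
        run = run + 1 if (s - baseline) >= min_increase else 0
        if run >= n_consec:
            return True
    return False
```

A transient spike that returns toward baseline resets the run counter, so isolated fluctuations do not trigger the criterion.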

The MD slope was calculated as the simple linear regression of the MD values for the VFs. VF worsening was defined as a negative slope ≤ − 0.5 dB/year with a regression p-value less than 0.05. Similarly, the VFI slope was calculated as the linear regression of the VFI values. VF worsening was defined as a negative slope ≤ − 1.8%/year with a p-value of less than 0.0521.
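The global trend criteria above amount to an ordinary least-squares fit with a slope cutoff and a significance test, a minimal sketch of which (not the authors' code) is:

```python
import numpy as np
from scipy.stats import linregress

def trend_worsening(years, values, slope_cut, alpha=0.05):
    """Trend criterion described above: linear regression of a global
    index (MD in dB or VFI in %) on time; worsening when the slope is
    at or below slope_cut (-0.5 dB/yr for MD, -1.8 %/yr for VFI) with
    regression p-value below alpha."""
    fit = linregress(years, values)
    return fit.slope <= slope_cut and fit.pvalue < alpha

# hypothetical eye: MD declining by ~0.8 dB/yr over 10 annual fields
years = np.arange(10.0)
md = -2.5 - 0.8 * years + 0.01 * np.cos(years)
```

With this series, `trend_worsening(years, md, -0.5)` flags the eye as worsening.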

For PLR, linear regression was performed for the total deviation values of each of the 52 VF points separately. VF worsening was defined as the presence of any three points with a negative slope ≤ − 1 dB/year with a p-value ≤ 0.0121.
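The PLR criterion applies the same regression per location and counts significantly declining points. A hedged sketch (assumed array layout, not the authors' implementation):

```python
import numpy as np
from scipy.stats import linregress

def plr_worsening(years, td, slope_cut=-1.0, alpha=0.01, n_points=3):
    """Pointwise criterion described above: regress each of the 52
    total-deviation points on time; worsening when at least n_points
    locations have slope <= -1 dB/yr with p-value <= 0.01."""
    n_bad = 0
    for j in range(td.shape[1]):
        fit = linregress(years, td[:, j])
        if fit.slope <= slope_cut and fit.pvalue <= alpha:
            n_bad += 1
    return n_bad >= n_points
```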

Clinician assessment of worsening was determined for each eye by the clinician at the time of the last visual field and recorded in Epic. The clinician could choose from checkboxes that denoted likely worsening, possible worsening, stable, possible improvement, or likely improvement. A decision of likely or possible progression was classified as worsening while all other choices were classified as not worsening.

A reference standard for VF worsening was defined as at least four out of six algorithms (GPA, AGIS, CIGTS, MD slope, VFI slope, and PLR) identifying worsening. This was used as the label for worsening to train/test the deep learning model (DLM) and serves as the ground truth for VF worsening in this study. This reference was also used for the receiver-operating characteristic (ROC) curve in Fig. 4. A supplementary analysis was conducted with the clinician assessment of worsening used as the reference standard for training the DLM and generating the ROC curve (Supplementary Fig. 2).
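The consensus labeling rule above reduces to a simple vote count, sketched here (function name our own):

```python
def consensus_label(flags, k=4):
    """Reference standard described above: worsening when at least k of
    the six algorithm decisions (GPA, AGIS, CIGTS, MD slope, VFI slope,
    PLR) flag progression."""
    return int(sum(bool(f) for f in flags) >= k)
```

Because there are only three event-based and three trend-based methods, any 4/6 majority necessarily includes at least one method of each family, as noted in the discussion.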

The DLM architecture is described in Fig. 3. The input to the network consists of two parts: (1) a set of 7 or more VF images, where each image has 54 points which were radially blurred onto a 12 × 12 grid, stacked together; (2) a stack of 7 or more sets of 8 global metrics from each VF (age, VFI in %, PSD in dB, MD in dB, false negatives in %, false positives in %, test duration in sec, and fixation losses). The DLM architecture can receive unevenly spaced temporal data from each VF series. The dataset was split into 80%, 10%, and 10% for training, validation, and testing, respectively. The data were split at the patient level so that if both eyes were included, they would fall within the same set. Including only one eye from each patient did not change the results of the study. The data were randomly distributed so that all datasets (training, validation, and testing) consisted of eyes that were and were not determined to be worsening. For the deep learning architecture, we implemented a single 2D convolutional LSTM with a 3 × 3 kernel size. Batch normalization was also integrated into the model to reduce internal covariate shift. The output of the model was the probability of VF worsening.
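The shape of the image input can be illustrated as below. The `cells` mapping of VF locations to grid positions is hypothetical, and the paper's radial blurring is not reproduced; this only shows how point values scattered onto a 12 × 12 grid and stacked over time yield a (T, 12, 12, 1) tensor suitable for a convolutional LSTM.

```python
import numpy as np

def to_grid_sequence(vf_series, cells, size=12):
    """vf_series: (T, 54) per-field point values; cells: assumed
    (row, col) grid placement for each VF location. Returns a
    (T, size, size, 1) array, the image-sequence input format for a
    ConvLSTM layer."""
    T, n_points = vf_series.shape
    grid = np.zeros((T, size, size, 1))
    for t in range(T):
        for k, (r, c) in enumerate(cells):
            grid[t, r, c, 0] = vf_series[t, k]
    return grid
```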

An additional analysis was carried out by removing VFs from the end of the series included for each eye and re-training the model with fewer data points. This tested the DLM’s ability to judge worsening before it had access to all of the information used by the 4 out of 6 algorithm reference standard. The VFs were removed sequentially from the end (removing the final VF, removing the final two VFs, removing the final three VFs, etc.). This was done up to a maximum of removing the final 6 VFs, since all included eyes were required to have at least 7 VFs. This allowed each eye to have at least 1 VF entering the model as input, though about 87% of eyes had more than this minimum number. The label for worsening and the assessment of performance were still based on the original consensus of 4 out of 6 using all the VFs.
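The truncation ablation is a simple slicing of each series while the label from the full series is retained, sketched here for clarity:

```python
def truncated_inputs(vf_series, k_removed):
    """Ablation described above: drop the final k fields from an eye's
    VF series while keeping the label computed from the full series.
    Every included eye had at least 7 fields, so up to 6 can be removed
    while leaving at least one input field."""
    assert 0 <= k_removed <= len(vf_series) - 1
    return vf_series[:len(vf_series) - k_removed]
```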

Since multiple methods were used to identify VF worsening, we wanted to calculate the level of agreement among these methods. The pairwise agreement was identified based on Cohen’s kappa coefficient. Based on previous literature a kappa coefficient of 0 to 0.2 indicated slight agreement, 0.2 to 0.4 fair agreement, 0.4 to 0.6 moderate agreement, and 0.6 to 0.8 substantial agreement40. Agreement across more than two methods was also determined by calculating the Fleiss’ kappa coefficient41.
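The pairwise agreement statistic above can be computed directly from the two binary call vectors; a minimal sketch of Cohen's kappa for the binary case:

```python
import numpy as np

def cohens_kappa(a, b):
    """Chance-corrected pairwise agreement between two binary worsening
    calls (1 = worsening, 0 = not worsening)."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    p_obs = np.mean(a == b)                               # observed agreement
    p_exp = a.mean() * b.mean() + (1 - a.mean()) * (1 - b.mean())  # chance
    return (p_obs - p_exp) / (1 - p_exp)
```

Identical call vectors give kappa = 1, while agreement no better than chance gives kappa = 0.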

Another model for identifying worsening was created using a mixed-effects model that was provided with all the same data as the LSTM (Fig. 3) with “Patient ID” and “Eye ID” treated as random effects and all other features treated as fixed effects.

For the deep learning prediction, we constructed a ROC curve, which visualizes the performance of the DLM at all classification thresholds (Fig. 4). An AUC value and its 95% confidence interval were calculated as a measure of prediction performance. The Clopper-Pearson method was used to calculate the 95% confidence intervals of the false positive and true positive rates42. The same approach was used to identify an AUC for the mixed-effects model. For the clinician assessment of worsening, a fixed true positive rate and false positive rate were calculated; an exact ROC curve cannot be calculated for the clinician assessment since it is a discrete, binary classification. To evaluate clinician prediction performance, a best minmax AUC score and its upper and lower bounds were calculated, assuming the clinician ROC curve is concave or monotone43.
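The Clopper-Pearson interval used for the clinician TPR/FPR is the exact binomial confidence interval, obtainable from beta-distribution quantiles. A brief sketch (not the authors' code):

```python
from scipy.stats import beta

def clopper_pearson(k, n, alpha=0.05):
    """Exact (Clopper-Pearson) confidence interval for a binomial rate,
    e.g. a true positive rate of k detections out of n worsening eyes."""
    lo = beta.ppf(alpha / 2, k, n - k + 1) if k > 0 else 0.0
    hi = beta.ppf(1 - alpha / 2, k + 1, n - k) if k < n else 1.0
    return lo, hi
```

For a hypothetical 42 detections in 100 eyes, this returns an interval of roughly (0.32, 0.52) around the 0.42 point estimate, similar in width to the clinician CIs reported above.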

Unless specified otherwise all comparisons and performance analyses were calculated on the test dataset only. The DLM was developed using Python (Python Software Foundation, Wilmington, Delaware). SPSS was used for statistical comparisons (IBM Corp, Armonk, NY).

American Glaucoma Society, Paper Presentation, Nashville, TN, 2022.

The datasets generated and/or analyzed during the current study are not publicly available because they contain protected health information. The raw data are not available to share.

1. McKean-Cowdin, R. et al. Impact of visual field loss on health-related quality of life in glaucoma: The Los Angeles Latino Eye Study. Ophthalmology 115(6), 941–948.e1. https://doi.org/10.1016/j.ophtha.2007.08.037 (2008).
2. Tham, Y. C. et al. Global prevalence of glaucoma and projections of glaucoma burden through 2040: A systematic review and meta-analysis. Ophthalmology 121(11), 2081–2090. https://doi.org/10.1016/j.ophtha.2014.05.013 (2014).
3. Heijl, A. et al. Measuring visual field progression in the early manifest glaucoma trial. Acta Ophthalmol. Scand. 81(3), 286–293. https://doi.org/10.1034/j.1600-0420.2003.00070.x (2003).
4. Russell, R. A., Crabb, D. P., Malik, R. & Garway-Heath, D. F. The relationship between variability and sensitivity in large-scale longitudinal visual field data. Investig. Ophthalmol. Vis. Sci. 53(10), 5985–5990. https://doi.org/10.1167/iovs.12-10428 (2012).
5. Heijl, A., Lindgren, A. & Lindgren, G. Test-retest variability in glaucomatous visual fields. Am. J. Ophthalmol. 108(2), 130–135. https://doi.org/10.1016/0002-9394(89)90006-8 (1989).
6. Wall, M., Woodward, K. R., Doyle, C. K. & Artes, P. H. Repeatability of automated perimetry: A comparison between standard automated perimetry with stimulus size III and V, matrix, and motion perimetry. Investig. Ophthalmol. Vis. Sci. 50(2), 974–979. https://doi.org/10.1167/iovs.08-1789 (2009).
7. Spry, P. G. D. & Johnson, C. A. Identification of progressive glaucomatous visual field loss. Surv. Ophthalmol. 47(2), 158–173. https://doi.org/10.1016/S0039-6257(01)00299-5 (2002).
8. Weinreb, R. N., Aung, T. & Medeiros, F. A. The pathophysiology and treatment of glaucoma. JAMA 311(18), 1901. https://doi.org/10.1001/jama.2014.3192 (2014).
9. Chauhan, B. C. et al. Practical recommendations for measuring rates of visual field change in glaucoma. Br. J. Ophthalmol. 92(4), 569–573. https://doi.org/10.1136/bjo.2007.135012 (2008).
10. Nouri-Mahdavi, K., Zarei, R. & Caprioli, J. Influence of visual field testing frequency on detection of glaucoma progression with trend analyses. Arch. Ophthalmol. 129(12), 1521–1527. https://doi.org/10.1001/archophthalmol.2011.224 (2011).
11. Malik, R., Baker, H., Russell, R. A. & Crabb, D. P. A survey of attitudes of glaucoma subspecialists in England and Wales to visual field test intervals in relation to NICE guidelines. BMJ Open 3(5), e002067. https://doi.org/10.1136/bmjopen-2012-002067 (2013).
12. Wu, Z., Saunders, L. J., Daga, F. B., Diniz-Filho, A. & Medeiros, F. A. Frequency of testing to detect visual field progression derived using a longitudinal cohort of glaucoma patients. Ophthalmology 124(6), 786–792. https://doi.org/10.1016/j.ophtha.2017.01.027 (2017).
13. The Advanced Glaucoma Intervention Study Investigators. Advanced Glaucoma Intervention Study 2: Visual field test scoring and reliability. Ophthalmology 101(8), 1445–1455. https://doi.org/10.1016/S0161-6420(94)31171-7 (1994).
14. Heijl, A., Leske, M. C., Bengtsson, B., Bengtsson, B. & Hussein, M., Early Manifest Glaucoma Trial Group. Measuring visual field progression in the early manifest glaucoma trial. Acta Ophthalmol. Scand. 81(3), 286–293. https://doi.org/10.1034/j.1600-0420.2003.00070.x (2003).
15. Musch, D. C., Lichter, P. R., Guire, K. E. & Standardi, C. L. The collaborative initial glaucoma treatment study: Study design, methods, and baseline characteristics of enrolled patients. Ophthalmology 106(4), 653–662. https://doi.org/10.1016/S0161-6420(99)90147-1 (1999).
16. Vesti, E., Johnson, C. A. & Chauhan, B. C. Comparison of different methods for detecting glaucomatous visual field progression. Investig. Ophthalmol. Vis. Sci. 44(9), 3873–3879. https://doi.org/10.1167/iovs.02-1171 (2003).
17. Heijl, A. et al. A comparison of visual field progression criteria of 3 major glaucoma trials in early manifest glaucoma trial patients. Ophthalmology 115(9), 1557–1565. https://doi.org/10.1016/j.ophtha.2008.02.005 (2008).
18. Casas-Llera, P. et al. Visual field index rate and event-based glaucoma progression analysis: Comparison in a glaucoma population. Br. J. Ophthalmol. 93(12), 1576–1579. https://doi.org/10.1136/bjo.2009.158097 (2009).
19. Rao, H. L. et al. Agreement between event-based and trend-based glaucoma progression analyses. Eye 27(7), 803–808. https://doi.org/10.1038/eye.2013.77 (2013).
20. Saeedi, O. J. et al. Agreement and predictors of discordance of 6 visual field progression algorithms. Ophthalmology 126(6), 822–828. https://doi.org/10.1016/j.ophtha.2019.01.029 (2019).

Rabiolo, A. et al. Comparison of methods to detect and measure glaucomatous visual field progression. Transl. Vis. Sci. Technol. https://doi.org/10.1167/tvst.8.5.2 (2019).

Article Google Scholar

Yousefi, S. et al. Unsupervised Gaussian mixture-model with expectation maximization for detecting glaucomatous progression in standard automated perimetry visual fields. Transl. Vis. Sci. Technol. https://doi.org/10.1167/tvst.5.3.2 (2016).

Article Google Scholar

Yousefi, S. et al. Asymmetric patterns of visual field defect in primary open-angle and primary angle-closure glaucoma. Investig. Ophthalmol. Vis. Sci. 59(3), 1279–1287. https://doi.org/10.1167/iovs.17-22980 (2018).

Article Google Scholar

Goldbaum, M. H. et al. Progression of patterns (POP): A machine classifier algorithm to identify glaucoma progression in visual fields. Investig. Ophthalmol. Vis. Sci. 53(10), 6557–6567. https://doi.org/10.1167/iovs.11-8363 (2012).

Article Google Scholar

Park, K., Kim, J. & Lee, J. Visual field prediction using recurrent neural network. Sci. Rep. 9(1), 1–12. https://doi.org/10.1038/s41598-019-44852-6 (2019).

Article CAS Google Scholar

Wang, M. et al. An artificial intelligence approach to detect visual field progression in glaucoma based on spatial pattern analysis. Investig. Ophthalmol. Vis. Sci. https://doi.org/10.1167/iovs.18-25568 (2019).

Article Google Scholar

Wen, J. C. et al. Forecasting future humphrey visual fields using deep learning. PLoS One 14(4), 1–14. https://doi.org/10.1371/journal.pone.0214875 (2019).

Article CAS Google Scholar

Shuldiner, S. R. et al. Predicting eyes at risk for rapid glaucoma progression based on an initial visual field test using machine learning. PLoS One 16, 1–16. https://doi.org/10.1371/journal.pone.0249856 (2021).

Article CAS Google Scholar

Thompson, A. C., Jammal, A. A. & Medeiros, F. A. A review of deep learning for screening, diagnosis, and detection of glaucoma progression. Transl. Vis. Sci. Technol. 9(2), 1–19. https://doi.org/10.1167/tvst.9.2.42 (2020).

Article Google Scholar

Dixit, A., Yohannan, J. & Boland, M. V. Assessing glaucoma progression using machine learning trained on longitudinal visual field and clinical data. Ophthalmology 128(7), 1016–1026. https://doi.org/10.1016/j.ophtha.2020.12.020 (2021).

Article Google Scholar

Gupta, P. et al. Prevalence of glaucoma in the United States: The 2005–2008 national health and nutrition examination survey. Investig. Ophthalmol. Vis. Sci. 57(6), 2905–2913. https://doi.org/10.1167/iovs.15-18469 (2016).

Article Google Scholar

Tanna, A. P. et al. Interobserver agreement and intraobserver reproducibility of the subjective determination of glaucomatous visual field progression. Ophthalmology 118(1), 60–65. https://doi.org/10.1016/j.ophtha.2010.04.038 (2011).

Article Google Scholar

Viswanathan, A. C. et al. Interobserver agreement on visual field progression in glaucoma: A comparison of methods. Br. J. Ophthalmol. 87(6), 726–730. https://doi.org/10.1136/bjo.87.6.726 (2003).

Article CAS Google Scholar

Brigatti, L., Nouri-Mahdavi, K., Weitzman, M. & Caprioli, J. Automatic detection of glaucomatous visual field progression with neural networks. Arch. Ophthalmol. 115(6), 725–728. https://doi.org/10.1001/archopht.1997.01100150727005 (1997).

Article CAS Google Scholar

Yousefi, S. et al. Detection of longitudinal visual field progression in glaucoma using machine learning. Am. J. Ophthalmol. 193, 71–79. https://doi.org/10.1016/j.ajo.2018.06.007 (2018).

Article Google Scholar

Yohannan, J. et al. Evidence-based criteria for assessment of visual field reliability. Ophthalmology 124(11), 1612–1620. https://doi.org/10.1016/j.ophtha.2017.04.035 (2017).

Article Google Scholar

Morgan, R. K., Feuer, W. J. & Anderson, D. R. Statpac 2 glaucoma change probability. Arch. Ophthalmol. 109(12), 1690–1692. https://doi.org/10.1001/archopht.1991.01080120074029 (1991).

Article CAS Google Scholar

Artes, P. H. et al. Longitudinal and cross-sectional analyses of visual field progression in participants of the Ocular Hypertension Treatment Study. Arch. Ophthalmol. 128(12), 1528–1532. https://doi.org/10.1001/archophthalmol.2010.292 (2010).

Article Google Scholar

Tseng B. AGIS visual field score web applet.

Landis, J. R. & Koch, G. G. The measurement of observer agreement for categorical data. Biometrics 33(1), 159–174 (1977).

Article CAS MATH Google Scholar

Fleiss, J. L. Measuring nominal scale agreement among many raters. Psychol. Bull. 76(5), 378–382. https://doi.org/10.1037/h0031619 (1971).

Article Google Scholar

Sakakibara, I., Haramo, E., Muto, A., Miyajima, I. & Kawasaki, Y. Comparison of five exact confidence intervals for the binomial proportion. Am. J. Biostat. 4(1), 11–20. https://doi.org/10.3844/amjbsp.2014.11.20 (2014).

Article Google Scholar

van den Hout, W. B. The area under an ROC curve with limited information. Med. Decis. Mak. 23(2), 160–166. https://doi.org/10.1177/0272989X03251246 (2003).

Article Google Scholar

Download references

This work was supported by NIH Grant 5 K23 EY032204-02 (JY) and an unrestricted grant from Research to Prevent Blindness (RPB), New York.

These authors contributed equally: Jasdeep Sabharwal and Kaihua Hou.

Wilmer Eye Institute, Johns Hopkins University School of Medicine, Baltimore, MD, USA

Jasdeep Sabharwal, Chris Bradley, Pradeep Y. Ramulu & Jithin Yohannan

Malone Center for Engineering, Johns Hopkins University, Baltimore, MD, USA

Kaihua Hou, Patrick Herbert, Mathias Unberath & Jithin Yohannan

Department of Ophthalmology and Visual Sciences, University of Iowa, Iowa City, IA, USA

Chris A. Johnson & Michael Wall


J.S. and K.H. are co-first authors. All authors contributed to the methodology and experiments; K.H., P.H., and J.Y. developed the DLM; J.S., K.H., and J.Y. analyzed the results; and J.S., K.H., and J.Y. wrote the original draft. All authors reviewed, edited, and approved the manuscript.

Correspondence to Jithin Yohannan.

The authors declare no competing interests.

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.


Sabharwal, J., Hou, K., Herbert, P. et al. A deep learning model incorporating spatial and temporal information successfully detects visual field worsening using a consensus based approach. Sci Rep 13, 1041 (2023). https://doi.org/10.1038/s41598-023-28003-6


Received: 29 August 2022

Accepted: 11 January 2023

Published: 19 January 2023
