Deep learning for obstructive sleep apnea diagnosis based on single channel oximetry
Databases
The challenge of robustness is often raised in ML, especially for medical applications27. Indeed, there are many sources of distribution shifts, defined here as changes in the model input distribution, such as differences across ethnicity groups. A model may learn a certain "bias" of the training set and generalize poorly to external test sets. We performed a large retrospective multicenter analysis including a total of 12,923 PSG recordings, totaling 115,866 hours of continuous data, from six independent databases. The databases include distribution shifts in age, sex, ethnicity, and comorbidity. Institutional review board approval (number 62-2019) was obtained from the Technion-IIT Rappaport Faculty of Medicine to use the retrospective databases obtained from the open-access sleepdata.org resource for this research. Table S4 provides summary statistics on demographics to describe the population samples for each of these databases.
Sleep Heart Health Study (SHHS)
SHHS28 is a multicenter cohort study conducted by the National Heart, Lung, and Blood Institute (ClinicalTrials.gov Identifier: NCT0000527) to determine the cardiovascular and other consequences of sleep-disordered breathing. Individuals were recruited to undergo a type II home PSG. The Nonin XPOD 3011 pulse oximeter (Nonin Medical, Inc., Plymouth, MI, USA) was used for recording, with the signal sampled at 1 Hz. The first visit, denoted SHHS1, enrolled 6441 men and women aged over 40 years between November 1, 1995 and January 31, 1998. Recordings from 5793 subjects who underwent unattended full-night PSG at baseline are available. A second visit, denoted SHHS2, was performed from January 2001 to June 2003 and includes 3295 participants.
Río Hortega University Hospital of Valladolid (UHV)
UHV29 is composed of 369 oximetry recordings. The original database, composed of 350 in-lab PSG recordings, is further described in Andres Blanco et al. and Levy et al.13,29. A total of 19 recordings were added for this research. The Nonin WristOx2 3150 was used to perform portable oximetry (simultaneously with the PSG), sampled at 1 Hz for the first 350 recordings and at 16 Hz for the additional 19. UHV was the only database that was not part of the National Sleep Research Resource (NSRR, available on sleepdata.org). However, the protocol for annotating the UHV PSG recordings also followed the AASM 2012 recommendations11, and scoring was performed by certified sleep technicians. The UHV database contains patients with suspected sleep-disordered breathing, including 78 patients with chronic obstructive pulmonary disease (COPD), a distribution shift relative to the other databases. COPD is a lung pathology characterized by persistent, usually progressive airflow limitation and an enhanced chronic inflammatory response to noxious particles or gases in the airways and the lungs30.
Cleveland Family Study (CFS)
The CFS database31 is made up of 2284 individuals from 361 families, with one recording per patient. A subset of the original database, composed of 728 recordings, was available on the NSRR and was used for this study. CFS is a large familial sleep apnea study designed to quantify the familial aggregation of sleep apnea. The oximetry was recorded using a Nonin 8000 sensor and sampled at 1 Hz. The database was acquired in the hospital while patients underwent a type I PSG. Among the 728 available recordings, 427 (59%) are from Black and African American participants.
Osteoporotic Fractures in Men Study (MROS)
MROS32 is an ancillary study of the parent Osteoporotic Fractures in Men Study. Between 2000 and 2002, 5994 community-dwelling men aged 65 years or older were enrolled at 6 clinical centers in a baseline examination. Between December 2003 and March 2005, 3135 of these participants were recruited to the sleep study, during which they underwent a type II home PSG and 3–5 days of actigraphy. The objectives of the sleep study were to understand the relationship between sleep disorders and falls, fractures, mortality, and vascular disease. The oximetry signal was recorded with a Nonin 8000 sensor and sampled at 1 Hz.
Multi Ethnic Study of Atherosclerosis (MESA)
MESA33 is a six-center collaborative longitudinal investigation of factors associated with the development of subclinical cardiovascular disease. The study includes PSGs of 2056 men and women aged 45–84 years, divided into four ethnic groups: Black and African American participants (n = 572), white participants (n = 743), Hispanic participants (n = 491), and Chinese American participants (n = 250), recorded between 2000 and 2002 with a type II home PSG. The oximetry signals were recorded using a Nonin 8000 sensor with a sampling rate of 1 Hz.
Scoring rules
Figure 5 presents the distribution of the reference AHI for each database. Table 3 summarizes the basic demographic data and the AHI of each database. We ensured that the definitions of apnea and hypopnea events, and thus the computation of the reference AHI, were homogeneous across databases.
The SHHS, CFS, MROS, and MESA databases provided by the NSRR were scored following the procedure described in Redline et al.28. Briefly, physiological recordings were originally scored in Compumedics Profusion, where apnea and hypopnea respiratory events were scored according to drops in airflow lasting more than 10 s, without criteria of arousal or desaturation for hypopneas. After apneas and hypopneas were identified, the Compumedics Profusion software linked each event to data from the oxygen saturation and EEG channels. This allowed each event to be characterized according to various degrees of associated desaturation, associated arousals, and/or combinations of these parameters. This top-down approach enabled the NSRR to generate AHI variables following different recommendations and rules (recommended/alternative). These AHI variables are available on the NSRR website. We used the ahi_a0h3a variable, which is consistent with the AASM12 recommended rule, as specified on the NSRR website and re-confirmed through private correspondence with the NSRR administrators. Data from the UHV were recorded after 2012 and followed the AASM 2012 guidelines. Annotations for all the NSRR databases (SHHS, CFS, MROS, and MESA) as well as the UHV database were made by certified technicians.
Following the American Academy of Sleep Medicine (AASM) 2012 and ICSD-3 guidelines, the AHI was defined as the average number of apneas and hypopneas per hour of sleep. Apneas were scored if (i) there was a drop in the peak signal excursion by ≥90% of the pre-event baseline using an oronasal thermal sensor (diagnostic study), positive airway pressure device flow (titration study), or an alternative apnea sensor; and (ii) the duration of the ≥90% drop in sensor signal was ≥10 s. Following the recommended rule, hypopneas were defined as a ≥30% fall in an appropriate hypopnea sensor for ≥10 s, accompanied by a ≥3% desaturation or an associated arousal.
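As a brief worked illustration of this definition (the event counts and sleep time below are hypothetical numbers, not study data):

```python
# Hypothetical example: 45 apneas and 75 hypopneas scored over 6 h of sleep.
apneas, hypopneas = 45, 75
total_sleep_time_h = 6.0
ahi = (apneas + hypopneas) / total_sleep_time_h
print(ahi)  # 20.0 events/hour -> moderate OSA under the thresholds used below
```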
Preprocessing and exclusion criteria
Recordings with technical faults (missing oximetry channel or corrupted file) and patients under 18 years old were excluded. Recordings from the UHV database were re-sampled at 1 Hz so that all databases had the same sampling rate. The Delta Filter34,35 was applied to the oximetry time series to remove non-physiological values caused by motion of the oximeter or lack of proper contact between the finger and the probe. If there were fewer than three consecutive non-physiological values in the signal, linear interpolation was performed to fill in the missing values.
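As an illustration, a minimal NumPy/pandas sketch of this step could look as follows. The physiological range (50–100%) and the naive decimation are assumptions for illustration; the study used the Delta Filter implementation from the POBM toolbox:

```python
import numpy as np
import pandas as pd

def preprocess_spo2(spo2, fs=1):
    """Sketch: resample to 1 Hz, mask non-physiological values, and
    linearly interpolate runs of fewer than three missing samples."""
    spo2 = np.asarray(spo2, dtype=float)
    if fs != 1:
        spo2 = spo2[::int(fs)]                  # naive decimation (e.g., 16 Hz UHV -> 1 Hz)
    spo2[(spo2 < 50) | (spo2 > 100)] = np.nan   # assumed physiological range
    s = pd.Series(spo2)
    # interpolate only gaps of at most two consecutive samples
    return s.interpolate(limit=2, limit_area="inside").to_numpy()
```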
Initiating sleep may take some time, and individuals with severe OSA may have numerous overnight awakenings. When computing the AHI in regular PSG examinations, the wake periods are excluded, i.e., the cumulative number of apnea and hypopnea events is divided by the total sleep time. To partially account for this in our experiments, we defined sleep onset as the beginning of the first consecutive 5-min segment labeled as sleep, and sleep offset as the end of the last consecutive 5-min segment labeled as sleep, on the hypnogram provided for each recording. We approximated the total sleep time (\(\widehat{\mathrm{TST}}\)) as the time interval between sleep onset and sleep offset. In practice, this can be easily estimated from the photoplethysmography (PPG) signal recorded by the pulse oximeter, as we have demonstrated in our previous work26.
Signals with \(\widehat{\mathrm{TST}} < 4\) h (i.e., less than 4 h of sleep) were excluded28. All remaining signals were padded to 7 h, which enables handling signals of different lengths, from 4 to 7 h.
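A sketch of the onset/offset estimation and padding is given below. The 30-s epoch length, the wake code 0 in the hypnogram, and zero-padding are assumptions for illustration:

```python
import numpy as np

def crop_and_pad(spo2, hypnogram, epoch_s=30, target_h=7):
    """Sketch: crop the 1 Hz SpO2 signal between estimated sleep
    onset/offset and pad to a fixed 7 h length."""
    sleep = np.asarray(hypnogram) > 0               # assume 0 codes wake
    win = int(300 // epoch_s)                       # 5 min of consecutive sleep epochs
    runs = np.convolve(sleep, np.ones(win, int), mode="valid") == win
    onset_ep = np.argmax(runs)                      # first fully-sleep 5-min window
    offset_ep = len(runs) - 1 - np.argmax(runs[::-1])
    onset, offset = onset_ep * epoch_s, (offset_ep + win) * epoch_s
    if (offset - onset) / 3600 < 4:
        return None                                 # excluded (< 4 h of sleep)
    x = spo2[onset:offset][: target_h * 3600]       # truncate recordings > 7 h
    return np.pad(x, (0, target_h * 3600 - len(x)))
```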
Baseline model
Two baseline classical ML models were implemented to benchmark against the DL approach. The first model included a single oximetry feature, the ODI with a threshold of 3%13. For the second model, SpO2 features were computed from the oximetry time series using the open-source POBM toolbox35. These biomarkers are divided into five categories: (1) General statistics: time-based statistics describing the oxygen saturation data distribution, for example, Zero-Crossing36 and the delta index37. (2) Complexity: quantifies the presence of long-range correlations in non-stationary time series, for example, Approximate Entropy38 and Detrended Fluctuation Analysis (DFA)39. (3) Periodicity: quantifies consecutive events to identify periodicity in the oxygen saturation time series, for example, Phase-Rectified Signal Averaging (PRSA)40 and power spectral density (PSD). (4) Desaturations: time-based descriptive measures of the desaturation patterns occurring throughout the time series, for example, area, slope, length, and depth of the desaturations. (5) Hypoxic burden: time-based measures quantifying the overall degree of hypoxemia imposed on the heart and other organs during the recording period, for example, cumulative time under the baseline (CT)41. The code for computing the digital oximetry features has been made open source on physiozoo.com. In addition, the features, including their mathematical definitions, are described more extensively in our previous work35.
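For illustration, a naive ODI computation could be sketched as follows. This is a deliberate simplification (a single median baseline rather than a running one); the study used the POBM toolbox implementations:

```python
import numpy as np

def odi3(spo2, tst_h):
    """Naive ODI sketch: count onsets of drops >= 3% below a crude
    per-recording baseline, normalized per hour of sleep."""
    baseline = np.nanmedian(spo2)            # crude baseline assumption
    below = spo2 <= baseline - 3
    events = np.sum(np.diff(below.astype(int)) == 1)  # rising edges
    return events / tst_h
```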
A CatBoost regressor42 was trained using a total of 178 engineered features; a further description of the features is available in Supplementary Note 1. We used the maximum relevance minimum redundancy (mRMR) algorithm for feature selection. The model was optimized with fivefold cross-validation using Bayesian hyperparameter search. For both the ODI and OBM models, sex and age were added as two additional demographic features (Table S4).
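A minimal sketch of this baseline pipeline is shown below. The file paths, hyperparameter values, and random seed are placeholders (the study tuned hyperparameters by Bayesian search), and the mRMR selection step is omitted:

```python
import numpy as np
from catboost import CatBoostRegressor
from sklearn.model_selection import KFold

# Placeholders: X is the (n_recordings, 178) feature matrix incl. age and sex,
# y the reference AHI per recording.
X, y = np.load("features.npy"), np.load("ahi.npy")

scores = []
for tr, va in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    model = CatBoostRegressor(depth=6, learning_rate=0.05, iterations=1000,
                              loss_function="RMSE", verbose=False)
    model.fit(X[tr], y[tr], eval_set=(X[va], y[va]))
    scores.append(model.get_best_score()["validation"]["RMSE"])
print(np.mean(scores))  # cross-validated RMSE for one hyperparameter setting
```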
OxiNet
Architecture
In contrast to classical ML approaches, DL techniques can automatically learn and extract relevant features from the time series. The model takes as input the preprocessed overnight signal. Recordings with a \(\widehat{\mathrm{TST}}\) of over 4 h were included. Those with a \(\widehat{\mathrm{TST}}\) of 4–7 h were padded to 7 h. For those with a \(\widehat{\mathrm{TST}}\) of more than 7 h (less than 3% of all test set examples), only the first 7 h were included. The oximetry signal is independently processed by two branches, inspired by the architecture proposed by Interdonato et al.43. The first branch, a Convolutional Neural Network (CNN), is based on convolutions and extracts useful patterns in the time series. The second branch, a Convolutional Recurrent Neural Network (CRNN), exploits the long-range temporal correlations present in the time series.
For the CNN branch, the signals were split into overlapping windows of length Lwindow. The windows are fed to the first part of the branch, which extracts local features. This first part is composed of nB sequential blocks that extract features from each window. Each block is composed of nL 1D convolutional layers with kernel size 3, batch normalization, and Leaky ReLU activation, followed by a maxpool layer of stride 2. There are skip connections between blocks. The local features of each window are then concatenated, and the second part of the branch extracts long-range temporal features using dilated convolutions. A total of nDB dilated blocks are applied sequentially. A dilated block is made of nC 1D convolutional filters with kernel size Kdilated, followed by a Leaky ReLU activation. The dilation rate is progressively increased to enlarge the network's field of view, starting at ratedilation and being multiplied by 2 between blocks. The CNN branch produces the feature vector \({V}_{\mathrm{CNN}}\in {\mathbb{R}}^{{N}_{\mathrm{CNN}}}\).
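The following PyTorch sketch illustrates the two building blocks described above. Channel counts, nL = 2, and the 1x1 shortcut projection used to realize the skip connection are our illustrative assumptions, not the published configuration:

```python
import torch
import torch.nn as nn

class ConvBlock(nn.Module):
    """One CNN block: n_l conv layers (kernel 3) + BN + Leaky ReLU,
    a skip connection (1x1 projection), then maxpool of stride 2."""
    def __init__(self, c_in, c_out, n_l=2):
        super().__init__()
        layers, c = [], c_in
        for _ in range(n_l):
            layers += [nn.Conv1d(c, c_out, 3, padding=1),
                       nn.BatchNorm1d(c_out), nn.LeakyReLU()]
            c = c_out
        self.body = nn.Sequential(*layers)
        self.skip = nn.Conv1d(c_in, c_out, 1)
        self.pool = nn.MaxPool1d(2)

    def forward(self, x):
        return self.pool(self.body(x) + self.skip(x))

class DilatedBlock(nn.Module):
    """n_c dilated conv layers sharing one dilation rate; the rate is
    doubled from block to block by the caller."""
    def __init__(self, ch, k=3, n_c=2, dilation=1):
        super().__init__()
        pad = dilation * (k - 1) // 2           # keeps the sequence length
        self.body = nn.Sequential(*[nn.Sequential(
            nn.Conv1d(ch, ch, k, padding=pad, dilation=dilation),
            nn.LeakyReLU()) for _ in range(n_c)])

    def forward(self, x):
        return self.body(x)
```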
For the CRNN branch, a representation of the data is first created to reduce the temporal resolution. This is done using 2 CNN blocks, where each block is composed of a convolutional layer with kernel size kCRNN, batch normalization, and Leaky ReLU activation, followed by a maxpool layer of stride 2. Then, two stacked layers of bidirectional Long Short-Term Memory (LSTM) with nLSTM units are applied. The CRNN branch produces the feature vector \({V}_{\mathrm{CRNN}}\in {\mathbb{R}}^{{N}_{\mathrm{CRNN}}}\).
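Continuing the PyTorch sketch above (and reusing its ConvBlock with a single conv layer per block), the CRNN branch could be expressed as follows; the channel sizes and nLSTM = 64 are assumptions:

```python
class CRNNBranch(nn.Module):
    """Sketch of the CRNN branch: two conv blocks reduce the temporal
    resolution by 4, then a 2-layer bidirectional LSTM."""
    def __init__(self, n_lstm=64):
        super().__init__()
        self.encoder = nn.Sequential(ConvBlock(1, 16, n_l=1),
                                     ConvBlock(16, 32, n_l=1))
        self.lstm = nn.LSTM(32, n_lstm, num_layers=2,
                            bidirectional=True, batch_first=True)

    def forward(self, x):              # x: (batch, 1, time)
        z = self.encoder(x)            # (batch, 32, time / 4)
        out, _ = self.lstm(z.transpose(1, 2))
        return out[:, -1]              # V_CRNN: (batch, 2 * n_lstm)
```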
The clinical metadata (META) is processed by a fully connected layer, producing the feature vector \({V}_{\mathrm{META}}\in {\mathbb{R}}^{{N}_{\mathrm{META}}}\). The aggregated feature vector Vfinal = [VCNN, VCRNN, VMETA] is processed by nclassifier classifier blocks to give the final prediction. A classifier block is composed of a fully connected layer, batch normalization, Leaky ReLU activation, and dropout (with dropout rate dclassifier). Each classifier block reduces the dimensionality of its input by 2. A last fully connected layer is then applied to output the predicted AHI.
This approach allows the model to learn complementary features and better exploit the information hidden in the time series. To reinforce the discriminative power of the different subsets of features, we adopted the approach proposed by Hou et al.44. Two auxiliary regressors were created, operating on the VCNN and VCRNN vectors, respectively. These regressors were not involved in the final prediction of the model but helped the training process by ensuring that each subset of features was trained to be independently discriminative. Figure 6 presents the architecture of the resulting OxiNet model.
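Continuing the sketch, the fusion head with the two auxiliary regressors could look as follows; the feature dimensions, nclassifier = 2, the dropout rate, and the linear auxiliary heads are assumptions for illustration:

```python
class OxiNetHead(nn.Module):
    """Sketch of the fusion head: concatenated [V_CNN, V_CRNN, V_META]
    passes through classifier blocks that halve the width; auxiliary
    regressors act on V_CNN and V_CRNN during training only."""
    def __init__(self, d_cnn, d_crnn, d_meta, n_blocks=2, p_drop=0.3):
        super().__init__()
        d = d_cnn + d_crnn + d_meta
        blocks = []
        for _ in range(n_blocks):
            blocks += [nn.Linear(d, d // 2), nn.BatchNorm1d(d // 2),
                       nn.LeakyReLU(), nn.Dropout(p_drop)]
            d //= 2
        self.classifier = nn.Sequential(*blocks, nn.Linear(d, 1))
        self.aux_cnn = nn.Linear(d_cnn, 1)    # auxiliary AHI regressor
        self.aux_crnn = nn.Linear(d_crnn, 1)  # auxiliary AHI regressor

    def forward(self, v_cnn, v_crnn, v_meta):
        ahi = self.classifier(torch.cat([v_cnn, v_crnn, v_meta], dim=1))
        return ahi, self.aux_cnn(v_cnn), self.aux_crnn(v_crnn)
```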
The experiments were performed on a PowerEdge R740 with one NVIDIA Ampere A100 GPU (40 GB) and 512 GB of RAM. The model was optimized using the Mean Squared Error (MSE) loss combined with L2 regularization, and two data augmentation techniques were used: moving-window and jitter augmentation. More details about the loss and the data augmentation are given below and in the supplements.
Loss
For our diagnostic objective, evenly distributed errors with low variance are preferable, as they might not change the final diagnosis of the model. For the above reason, the model was optimized using the Mean Squared Error (MSE) loss combined with L2 regularization. This loss was computed three times: for the two auxiliary regressors, and for the final prediction. The final loss function used was:
$$\mathcal{L}=\mathcal{L}_{\mathrm{aggregated}}+\lambda_{\mathrm{CNN}}\cdot\mathcal{L}_{\mathrm{CNN}}+\lambda_{\mathrm{CRNN}}\cdot\mathcal{L}_{\mathrm{CRNN}}$$
(1)
where λCNN and λCRNN are two hyperparameters controlling the impact of \(\mathcal{L}_{\mathrm{CNN}}\) and \(\mathcal{L}_{\mathrm{CRNN}}\), respectively. At the beginning of the training process, λCRNN = λCNN = 1. Every four epochs, both hyperparameters are multiplied by 0.8, so the weight of the auxiliary regressors in the final loss decreases. The intuition is that these regressors help the final model converge but are not part of it; their weight therefore decays as training progresses.
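In code, the combined loss of Eq. (1) and its decay schedule could be sketched as follows; handling the L2 term through the optimizer's weight decay is an assumption:

```python
import torch.nn.functional as F

def oxinet_loss(pred, aux_cnn, aux_crnn, target, lam_cnn, lam_crnn):
    """MSE for the final prediction plus weighted auxiliary terms (Eq. 1);
    the L2 regularization is assumed to be applied via weight_decay."""
    return (F.mse_loss(pred, target)
            + lam_cnn * F.mse_loss(aux_cnn, target)
            + lam_crnn * F.mse_loss(aux_crnn, target))

lam_cnn = lam_crnn = 1.0
for epoch in range(100):
    if epoch > 0 and epoch % 4 == 0:
        lam_cnn *= 0.8    # decay the auxiliary weights every four epochs
        lam_crnn *= 0.8
    ...                   # one training epoch using oxinet_loss
```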
Data augmentation
The OxiNet model comprises approximately 870,000 parameters, a few orders of magnitude more than the number of examples in our training set. Data augmentation was used to effectively enlarge the training set, in particular jitter augmentation, which adds white noise to the signal. The generated signal is:
$$X_{\mathrm{new}}=X+N,\quad N\sim\mathcal{N}(0,\,\sigma_{\mathrm{noise}})$$
(2)
where Xnew is the generated signal, X is the original signal, and N is the added noise; σnoise is a hyperparameter of the model. Figure S4 presents the original and generated signals with σnoise = 0.5. Although the generated signal may not be biologically feasible, this augmentation technique adds variance to the samples fed to the model and prevents overfitting the training set.
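Eq. (2) amounts to a one-line transform; a minimal sketch:

```python
import numpy as np

def jitter(x, sigma_noise=0.5, rng=None):
    """Jitter augmentation (Eq. 2): add zero-mean Gaussian white noise."""
    rng = np.random.default_rng() if rng is None else rng
    return x + rng.normal(0.0, sigma_noise, size=x.shape)
```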
Training strategy
The SHHS1 database was split into a 90% training set and a 10% test set. All the hyperparameters of the model were optimized using Bayesian search over 100 iterations. To this end, the SHHS1 training set was further split into 70% training and 30% validation. In the first step, the model was trained on the SHHS1 training set for 100 epochs using the Adam optimization algorithm with a learning rate of 0.005. The set of hyperparameters leading to the smallest validation loss was retained. Performance measures were then reported on the test set of each database independently.
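The splits and optimizer setup could be sketched as follows; `shhs1_records` and `model` are placeholder objects, and the random seeds are illustrative:

```python
import torch
from sklearn.model_selection import train_test_split

# Placeholders: shhs1_records is the list of SHHS1 recordings,
# model an instantiated OxiNet.
train_full, test = train_test_split(shhs1_records, test_size=0.10,
                                    random_state=0)
train, val = train_test_split(train_full, test_size=0.30, random_state=0)

optimizer = torch.optim.Adam(model.parameters(), lr=0.005)
# 100-epoch training on `train`; hyperparameters chosen by Bayesian search
# (100 iterations) to minimize the loss on `val`, then performance reported
# on each external test database separately.
```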
Explainability
Explainability is critical to ensure that the model is trustworthy and can be integrated into clinical practice. It enables the identification of contributing factors and provides explanations for the predictions made by the model. Indeed, DL models are known for their black-box nature, making it difficult to understand how they arrive at their predictions. To that end, we adapted the Feature Occlusion (FO) algorithm proposed by Zeiler et al.45, originally developed for image recognition; it has already been used several times in the context of time series prediction46,47. The algorithm computes an importance score as the change in output after replacing each contiguous region with a given baseline. We defined a region as a window of Lregion seconds in the oximetry signal and performed the occlusion with a sliding window of stride Lregion/2, in order to obtain an importance score for each segment of Lregion/2 seconds. The baseline value was set to the overall mean of the signal.
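A sketch of this procedure is shown below, assuming `model` is a callable mapping a 1 Hz SpO2 array to a predicted AHI; Lregion = 60 s is a placeholder value:

```python
import numpy as np

def feature_occlusion(model, x, l_region=60):
    """Sketch of FO importance scores: occlude each sliding window
    (stride l_region / 2) with the signal mean and record the change
    in the model output."""
    base = model(x)
    mean = x.mean()
    step = l_region // 2
    scores = []
    for start in range(0, len(x) - l_region + 1, step):
        x_occ = x.copy()
        x_occ[start:start + l_region] = mean   # replace region with baseline
        scores.append(abs(model(x_occ) - base))
    return np.asarray(scores)  # one score per l_region/2-second segment
```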
Performance measures
The Kruskal–Wallis test was applied (p value cut-off of 0.05) to evaluate whether individual demographic features discriminated between the four groups of OSA severity: non-OSA (AHI < 5), mild (5 ≤ AHI < 15), moderate (15 ≤ AHI < 30), or severe (AHI ≥ 30) OSA. Table S4 presents the summary statistics for these variables across the databases. Bland–Altman and correlation plots were generated to analyze the agreement and association between the estimated and reference AHI. The agreement was displayed as the median difference between \(\widehat{\mathrm{AHI}}\) and AHI and the 5th and 95th percentiles of their difference. For the regression task, the Intraclass Correlation Coefficient (ICC) was reported, defined as follows:
$$\mathrm{ICC}=\frac{\mathrm{MS}_{\mathrm{I}}-\mathrm{MS}_{\mathrm{E}}}{\mathrm{MS}_{\mathrm{I}}+(O-1)\,\mathrm{MS}_{\mathrm{E}}+O\cdot\frac{\mathrm{MS}_{\mathrm{O}}-\mathrm{MS}_{\mathrm{E}}}{n}}$$
(3)
where O is the number of observers (two in this case: the reference and the predicted AHI), n is the number of instances, MSI is the instances mean square, MSE is the mean square error, and MSO is the observers mean square.
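For reference, Eq. (3) corresponds to a standard two-way ANOVA decomposition with O = 2 observers; a direct NumPy sketch:

```python
import numpy as np

def icc(y_true, y_pred):
    """ICC of Eq. (3), computed from the two-way ANOVA mean squares of
    the (n instances) x (O = 2 observers) rating matrix."""
    data = np.stack([y_true, y_pred], axis=1)   # shape (n, O)
    n, O = data.shape
    grand = data.mean()
    ms_i = O * np.sum((data.mean(axis=1) - grand) ** 2) / (n - 1)
    ms_o = n * np.sum((data.mean(axis=0) - grand) ** 2) / (O - 1)
    resid = data - data.mean(axis=1, keepdims=True) - data.mean(axis=0) + grand
    ms_e = np.sum(resid ** 2) / ((n - 1) * (O - 1))
    return (ms_i - ms_e) / (ms_i + (O - 1) * ms_e + O * (ms_o - ms_e) / n)
```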
After converting the AHI into the four levels of severity (i.e., non-OSA, mild, moderate, and severe OSA), the macro-averaged F1 score was reported as the measure of diagnostic accuracy. The F1 score was computed as follows:
$$\mathrm{Se}_{\mathrm{M}}=\frac{1}{4}\sum_{k=1}^{4}\frac{\mathrm{TP}_{k}}{\mathrm{TP}_{k}+\mathrm{FN}_{k}}$$
(4)
$$\mathrm{PPV}_{\mathrm{M}}=\frac{1}{4}\sum_{k=1}^{4}\frac{\mathrm{TP}_{k}}{\mathrm{TP}_{k}+\mathrm{FP}_{k}}$$
(5)
$$F_{1,\mathrm{M}}=2\,\frac{\mathrm{PPV}_{\mathrm{M}}\cdot\mathrm{Se}_{\mathrm{M}}}{\mathrm{PPV}_{\mathrm{M}}+\mathrm{Se}_{\mathrm{M}}}$$
(6)
where, for a given class k, TPk is the number of true positives, TNk the number of true negatives, FPk the number of false positives, and FNk the number of false negatives. Additional performance measures are defined in the Supplementary Note.
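A direct implementation of Eqs. (4)–(6), together with the AHI-to-severity mapping defined above, might look as follows (note that this harmonic mean of the macro-averaged Se and PPV differs slightly from averaging per-class F1 scores):

```python
import numpy as np

def severity(ahi):
    """Map AHI to severity class: 0 non-OSA, 1 mild, 2 moderate, 3 severe."""
    return np.digitize(ahi, [5, 15, 30])

def f1_macro(y_true, y_pred):
    """Macro F1 per Eqs. (4)-(6) over the four severity classes."""
    se, ppv = [], []
    for k in range(4):
        tp = np.sum((y_pred == k) & (y_true == k))
        fn = np.sum((y_pred != k) & (y_true == k))
        fp = np.sum((y_pred == k) & (y_true != k))
        se.append(tp / (tp + fn))
        ppv.append(tp / (tp + fp))
    se_m, ppv_m = np.mean(se), np.mean(ppv)
    return 2 * ppv_m * se_m / (ppv_m + se_m)
```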
We estimated the confidence interval for the F1 and ICC scores of the different models using bootstrapping, similar to the work of Biton et al.48. That is, the F1 and ICC scores were repeatedly computed on randomly sampled 80% of the test set (with replacement). The procedure was repeated 1000 times and used to obtain the intervals, which are defined as follows:
$$C_{\mathrm{n}}=\bar{x}\pm z_{0.95}\cdot\mathrm{se}_{\mathrm{boot}}$$
where \(\bar{x}\) is the bootstrap mean, z0.95 is the critical value from the normal distribution, and seboot is the bootstrap estimate of the standard error. The bootstrap was performed on each database separately. To determine whether there was a statistically significant difference, the Wilcoxon rank-sum test was applied with a p value cut-off of 0.05. The same test was used to determine whether there was a significant difference in performance measures between male and female patients.
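A sketch of this bootstrap procedure is given below; `metric` stands for any of the scores above (e.g., the f1_macro or icc functions), and z = 1.96 is our assumption for the two-sided 95% interval:

```python
import numpy as np

def bootstrap_ci(y_true, y_pred, metric, n_boot=1000, frac=0.8,
                 rng=np.random.default_rng(0)):
    """Resample 80% of the test set with replacement 1000 times and
    form a normal-approximation confidence interval for the metric."""
    n = len(y_true)
    stats = []
    for _ in range(n_boot):
        idx = rng.choice(n, size=int(frac * n), replace=True)
        stats.append(metric(y_true[idx], y_pred[idx]))
    stats = np.asarray(stats)
    z = 1.96  # assumed critical value for a two-sided 95% interval
    return (stats.mean() - z * stats.std(ddof=1),
            stats.mean() + z * stats.std(ddof=1))
```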