Readnote: Scalable and accurate deep learning for electronic health records

Data representation and processing

  • To represent events in a patient’s timeline, we adopted the FHIR standard (Fast Healthcare Interoperability Resources)
  • Each event is derived from a FHIR resource and may contain multiple attributes
  • This representation captures the entire EHR in temporal order: data are organized by patient and by time
  • Data in each attribute were split into discrete values, which we refer to as tokens.
    • Text was split into a sequence of tokens, one for each word. Numeric values were normalized
  • The entire sequence of time-ordered tokens, from the beginning of a patient’s record until the point of prediction, formed the patient’s personalized input to the model.
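
The token sequence described above can be sketched roughly as follows. This is a minimal illustration, not the paper's pipeline: the `Event` class, its field names, and the `attribute:word` token format are all assumptions made for the example, and real FHIR resources carry far more structure.

```python
from dataclasses import dataclass


@dataclass
class Event:
    """Hypothetical, simplified stand-in for one FHIR-derived event."""
    time_hours: float  # hours since the start of the record
    attribute: str     # e.g. "medication" or "note" (illustrative names)
    value: str         # raw attribute content


def tokenize_event(event):
    # Free text becomes one token per word; each token keeps its attribute name.
    return [f"{event.attribute}:{word.lower()}" for word in event.value.split()]


def patient_tokens(events, prediction_time_hours):
    """Return (time, token) pairs in temporal order, up to the prediction time."""
    tokens = []
    for event in sorted(events, key=lambda e: e.time_hours):
        if event.time_hours > prediction_time_hours:
            break  # events after the point of prediction are not visible
        tokens.extend((event.time_hours, tok) for tok in tokenize_event(event))
    return tokens


events = [
    Event(30.0, "note", "Patient stable"),
    Event(2.0, "medication", "vancomycin"),
    Event(5.0, "note", "Reports chest pain"),
]
sequence = patient_tokens(events, prediction_time_hours=24.0)
```

Here the event at hour 30 is dropped because it falls after the 24-hour prediction point, which mirrors the rule that only data available before the prediction is used.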


  • an important clinical outcome (death),
  • a standard measure of quality of care (readmissions),
  • a measure of resource utilization (length of stay),
  • and a measure of understanding of a patient’s problems (diagnoses)


  • health systems collect and store electronic health records in various formats in databases
  • all available data for each patient is converted to events recorded in containers based on the Fast Healthcare Interoperability Resource (FHIR) specification
  • FHIR resources are placed in temporal order, depicting all events recorded in the EHR (i.e., the timeline). The deep learning model uses this full history to make each prediction

This conversion did not harmonize or standardize the data from each health system beyond mapping them to the appropriate resource. The deep learning model could use all data available prior to the point when the prediction was made. Therefore each prediction, regardless of the task, used the same data.


  • LSTM
  • attention-based time-aware neural network model (TANN)
  • neural network with boosted time-based decision stumps

The final model is an ensemble of predictions from the three underlying model architectures.

LSTM and TANN models were trained with TensorFlow, and the boosting model was implemented in C++. Statistical analyses and baseline models were done with scikit-learn in Python.

Weighted RNN

In the RNN model, sparse features of each category (such as medications or procedures) were embedded into the same d-dimensional embedding space, with d chosen for each category based on the number of possible features in that category. Embeddings from different categories are concatenated; embeddings from the same category at the same time are averaged according to an automatically learned weighting.

The sequence of embeddings was further reduced to a shorter sequence. Typically, the shorter sequences were split into time-steps of 12 hours where the embeddings for all features within a category in the same day were combined using weighted averaging. The weighted averaging is done by associating each feature with a non-negative weight that is trained jointly with the model. These weights are also used for prediction attribution. The log of the average time-delta at each time-step is also embedded into a small floating-point vector (which is also randomly initialized) and concatenated to the input embedding at each time-step.
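
The weighted averaging within a time-step can be sketched as below. This is an illustration under stated assumptions rather than the paper's code: the function names are hypothetical, and using softplus to keep each learned weight non-negative is one possible choice that the source does not specify. In a real model both the embeddings and the raw weights would be trained by gradient descent.

```python
import math
from collections import defaultdict

BUCKET_HOURS = 12.0  # time-step width taken from the text


def softplus(x):
    # Maps any real-valued parameter to a positive weight (assumed choice).
    return math.log1p(math.exp(x))


def bag_embeddings(events, embeddings, raw_weights):
    """Weighted-average the embeddings of one category within each time-step.

    events: list of (time_hours, feature) pairs for a single category.
    Returns {time_step_index: averaged embedding vector}.
    """
    sums = {}
    totals = defaultdict(float)
    for time_hours, feature in events:
        bucket = int(time_hours // BUCKET_HOURS)
        weight = softplus(raw_weights[feature])
        vector = embeddings[feature]
        acc = sums.setdefault(bucket, [0.0] * len(vector))
        for i, v in enumerate(vector):
            acc[i] += weight * v
        totals[bucket] += weight
    return {b: [s / totals[b] for s in acc] for b, acc in sums.items()}


embeddings = {"drug_a": [1.0, 0.0], "drug_b": [0.0, 1.0]}
raw_weights = {"drug_a": 0.0, "drug_b": 0.0}  # equal weights for the demo
steps = bag_embeddings(
    [(1.0, "drug_a"), (3.0, "drug_b"), (13.0, "drug_a")], embeddings, raw_weights
)
```

With equal weights, the first time-step averages the two drug embeddings and the second contains only `drug_a`. The per-feature weights are what the notes say are reused for attribution: a feature with a larger weight contributes more to its time-step's vector.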

This reduced sequence of embeddings was then fed to an n-layer Recurrent Neural Network (RNN), specifically a Long Short-Term Memory network (LSTM).

The hidden state of the final time-step of the LSTM was fed into an output layer, and the model was optimized to minimize the log-loss (either a logistic regression or softmax loss depending on the task). We applied a number of regularization techniques to the model, namely embedding dropout, feature dropout, LSTM input dropout and variational RNN dropout [4]. We also used a small level of L2 weight decay, which adds a penalty for large weights into the loss. We trained with a batch size of 128 and clipped the norm of the gradients to 50. Finally, we optimized everything jointly with Adagrad [6]. We trained using the TensorFlow framework on the Tesla P100 GPU. The regularization hyperparameters and learning rate were found via a Gaussian-process based hyperparameter search on each dataset’s validation performance.
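
Gradient-norm clipping, mentioned above, can be illustrated with a short sketch. The notes do not say whether the norm was computed per tensor or jointly over all gradients; the version below assumes a joint (global) norm, and the function name is hypothetical.

```python
import math


def clip_by_global_norm(gradients, max_norm=50.0):
    """Rescale a list of gradient vectors so their joint L2 norm is <= max_norm.

    Returns (clipped_gradients, original_global_norm).
    """
    global_norm = math.sqrt(sum(g * g for grad in gradients for g in grad))
    if global_norm <= max_norm:
        return [list(grad) for grad in gradients], global_norm
    scale = max_norm / global_norm
    return [[g * scale for g in grad] for grad in gradients], global_norm


clipped, norm = clip_by_global_norm([[60.0, 80.0]])  # norm is 100, halved to 50
```

Clipping keeps the direction of the update and only shrinks its length, which limits the effect of an occasional very large gradient in a long sequence.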

Boosted, embedded time-series model

Every binary rule, which we refer to as a binary predicate, was assigned a scalar weight, and the weighted sum was passed through a softmax layer to create a prediction. To train, we first created a bias binary predicate which was true for all examples and its weight was assigned as the log-odds ratio of the positive label class across the dataset.
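
A minimal sketch of the weighted-predicate score follows. The predicate names and token strings are made up for illustration. The sketch also uses a sigmoid over a single logit, which is what a two-class softmax reduces to for a binary label; this is a simplification of the "softmax layer" named in the text.

```python
import math


def log_odds(labels):
    """Log-odds of the positive class, used to initialize the bias weight."""
    positives = sum(labels)
    negatives = len(labels) - positives
    return math.log(positives / negatives)


# Hypothetical predicates: each maps a patient's token set to True or False.
predicates = {
    "bias": lambda tokens: True,  # true for every example
    "has_vancomycin": lambda tokens: "medication:vancomycin" in tokens,
}


def predict(tokens, weights):
    # Weighted sum over the predicates that fire, squashed to a probability.
    logit = sum(weights[name] for name, rule in predicates.items() if rule(tokens))
    return 1.0 / (1.0 + math.exp(-logit))


train_labels = [1, 0, 0, 0]  # toy data: 25% positive
weights = {"bias": log_odds(train_labels), "has_vancomycin": 0.0}
baseline = predict(set(), weights)
```

With only the bias predicate carrying weight, the model predicts the base rate of the positive class (0.25 in this toy data), which is the point of initializing the bias to the log-odds.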

The final binary predicates were then embedded into a 1024-dimensional vector space and fed to a feed-forward network of depth 2 with 1024 hidden units per layer and ELU non-linearity. For regularization, Gaussian noise of mean 0 and standard deviation 0.8 was added to the input of the feed-forward network. We also used multiplicative Bernoulli noise of p=0.01 (also known as dropout) at the input and output (just before applying the sigmoid function to the logits) of the feed-forward layer. At test time, no Gaussian or Bernoulli noise was used. We optimized everything with Adam.
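
The noise-based regularization above can be sketched as follows. Two details are assumptions on my part: p=0.01 is read as the probability of dropping a unit, and kept units are rescaled by 1/(1-p) ("inverted dropout"), which the notes do not state. The function names are hypothetical.

```python
import math
import random


def elu(x, alpha=1.0):
    # Exponential linear unit: identity for positive inputs, smooth below zero.
    return x if x > 0 else alpha * (math.exp(x) - 1.0)


def regularize_input(vector, rng, training, noise_std=0.8, drop_p=0.01):
    """Add Gaussian noise, then apply dropout; a no-op at test time."""
    if not training:
        return list(vector)
    out = []
    for v in vector:
        noisy = v + rng.gauss(0.0, noise_std)  # additive Gaussian noise
        keep = rng.random() >= drop_p          # multiplicative Bernoulli noise
        out.append(noisy / (1.0 - drop_p) if keep else 0.0)
    return out


rng = random.Random(0)
x = [0.5, -1.0, 2.0]
train_view = regularize_input(x, rng, training=True)
test_view = regularize_input(x, rng, training=False)
```

The `training` flag captures the rule that neither kind of noise is applied at test time.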


Patients were randomly split into development (80%), validation (10%) and test (10%) sets. Model accuracy is reported on the test set, and 1000 bootstrapped samples were used to calculate 95% confidence intervals. To prevent overfitting, the test set remained unused (and hidden) until final evaluation.
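
A patient-level 80/10/10 split can be sketched like this. The function name and the fixed seed are illustrative. Splitting on patient IDs rather than on individual admissions keeps all of one patient's records in a single set.

```python
import random


def split_patients(patient_ids, seed=0):
    """Randomly split unique patient IDs into development/validation/test."""
    ids = sorted(set(patient_ids))
    random.Random(seed).shuffle(ids)
    n_dev = len(ids) * 8 // 10  # 80% development
    n_val = len(ids) // 10      # 10% validation, remainder is test
    return ids[:n_dev], ids[n_dev:n_dev + n_val], ids[n_dev + n_val:]


dev, val, test = split_patients(range(100))
```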

We assessed model discrimination by calculating area under the receiver operating characteristic curve (AUROC) and model calibration using comparisons of predicted and empirical probability curves.
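
AUROC and a bootstrapped confidence interval can be computed along these lines. This is a sketch using the pairwise-ranking definition of AUROC and the percentile bootstrap; the notes do not say which bootstrap variant the paper used, so the percentile method is an assumption, as are the function names.

```python
import random


def auroc(labels, scores):
    """Probability that a random positive outranks a random negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def bootstrap_ci(labels, scores, n_boot=1000, seed=0):
    """95% percentile interval for AUROC from n_boot resamples with replacement."""
    rng = random.Random(seed)
    n = len(labels)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if 0 < sum(ys) < n:  # AUROC needs both classes in the resample
            stats.append(auroc(ys, [scores[i] for i in idx]))
    stats.sort()
    return stats[n_boot * 25 // 1000], stats[n_boot * 975 // 1000 - 1]


labels = [0, 0, 1, 1, 0, 1, 0, 1]
scores = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.5, 0.9]
point = auroc(labels, scores)
low, high = bootstrap_ci(labels, scores)
```

The quadratic pairwise loop is fine for a small illustration; on a test set of realistic size a rank-based implementation would be the practical choice.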

Automated Hyperparameter Tuning

The hyperparameters, which are settings that affect the performance of all the neural networks above, were tuned automatically using Google Vizier [35] with a total of >201,000 GPU hours.

Baseline models

All values were log-transformed and standardized to have a mean of zero and a standard deviation of 1.
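
The preprocessing for the baseline features can be sketched as below. The use of log(1 + v) to handle zeros is an assumption, since the notes do not specify the exact transform, and the sketch uses the population standard deviation.

```python
import math


def log_standardize(values):
    """Log-transform, then rescale to mean 0 and standard deviation 1."""
    logged = [math.log1p(v) for v in values]  # log(1 + v); the offset is assumed
    mean = sum(logged) / len(logged)
    std = math.sqrt(sum((x - mean) ** 2 for x in logged) / len(logged))
    return [(x - mean) / std for x in logged]


z = log_standardize([1.0, 10.0, 100.0, 1000.0])
```

In practice the mean and standard deviation would be estimated on the development set and reused unchanged for validation and test data.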


Our study’s approach uses a single data representation of the entire EHR as a sequence of events, allowing this system to be used for any prediction that would be clinically or operationally useful with minimal additional data preparation.

Future research is needed to determine how models trained at one site can best be applied to another site.

Our methods are computationally intensive and at present require specialized expertise to implement.


Accurate predictive models can be built directly from EHR data for a variety of important clinical problems with explanations highlighting evidence in the patient’s chart.

