DEEP LEARNING IN THE CANA CAR: A CANCER TRIAL DATA SCIENCE CHALLENGE
Updated: Oct 16, 2020
As a Senior Operations Research Analyst at CANA Advisors, I’ve had the opportunity to apply both my military logistics experience and healthcare analytics expertise to a variety of challenging problems. I recently shared a learning experience with peers at our CANA Analytics Roundtable (CAR). The CANA “CAR” is a monthly gathering, open to everyone in the company, that highlights employee-chosen topics related to analytics techniques and technologies. Each CAR typically hosts four to five short presentations. During our most recent roundtable, I talked about my strategy for, and initial execution of, a cancer trial data science challenge hosted by Oak Ridge National Laboratory.
The challenge set forth was to analyze cancer patient information and data sets to appropriately determine an individual’s assignment to a select clinical trial. I identified 25 input features within the patients’ information database that would eventually feed into the target variable – “Selected for Study, Yes or No.” I noted a specific need for improvement based on doctors’ feedback: trial names did not always match to the relevant disease site, thereby missing a critical linkage point between patients and potential trial participation.
I used PyTextRank as one means to address the data challenge. Although the “bag of words” is a common model in text mining, it focuses mostly on simple word identification and counting. In this instance, I felt Python’s PyTextRank library was the right tool, given the types of trials and abstracts in the challenge, to classify titles to the correct cancer sites. PyTextRank can be used not only for identification and counting but also to select keywords, assign an importance to each word, and build summary sentences from text. I used PyTextRank to review the summaries and to establish an eigenvector centrality metric, which ranks a node’s importance based not only on the number of its connections but also on the quality of the nodes it connects to, essentially creating a network strength metric.
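To make the centrality idea concrete, here is a minimal sketch in plain Python. It is not PyTextRank's implementation and not the code used in the challenge; it only illustrates eigenvector centrality, the usual name for a metric that scores a node by both the number and the importance of its neighbors. The function names, the window size, and the sample title are all made up for illustration.

```python
from collections import defaultdict


def cooccurrence_graph(tokens, window=2):
    """Link each word to the words that follow it within `window` tokens."""
    graph = defaultdict(set)
    for i, word in enumerate(tokens):
        for other in tokens[i + 1 : i + 1 + window]:
            if other != word:
                graph[word].add(other)
                graph[other].add(word)
    return graph


def eigenvector_centrality(graph, iterations=200, tol=1e-10):
    """Power iteration on (A + I), normalized to unit length each round.

    Adding the node's own score (the identity term) keeps the iteration
    from oscillating on bipartite graphs; it does not change the ranking.
    """
    nodes = list(graph)
    scores = {node: 1.0 for node in nodes}
    for _ in range(iterations):
        updated = {
            node: scores[node] + sum(scores[nbr] for nbr in graph[node])
            for node in nodes
        }
        norm = sum(v * v for v in updated.values()) ** 0.5
        updated = {node: v / norm for node, v in updated.items()}
        converged = max(abs(updated[n] - scores[n]) for n in nodes) < tol
        scores = updated
        if converged:
            break
    return scores


# Hypothetical trial title, invented for this example.
title = "radiation therapy trial for breast cancer and breast cancer recurrence"
graph = cooccurrence_graph(title.split())
ranking = sorted(eigenvector_centrality(graph).items(), key=lambda kv: -kv[1])
for word, score in ranking[:3]:
    print(f"{word}: {score:.3f}")
```

A plain bag of words would only count how often each word appears. Here a word scores highly when it sits next to other well-connected words, which is the "quality of the connected nodes" idea described above.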
Another critical tool was Keras, the deep learning API, which I used through its R interface. Although it appears difficult to work with, most of the heavy lifting turned out to be in data preprocessing, that is, putting the data into matrix and vector format. Keras was critical to my goal of defining and training a model that matches a large body of cancer trials to specific cancer anatomical sites, e.g., brain, breast, prostate, etc., thereby enabling patients to be matched effectively to a potentially useful trial. To put these different elements together, I used the reticulate package to call Python’s PyTextRank from R. This approach was fairly effective, and I was able to demonstrate initial iterations of my model.
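As a rough picture of what "matrix and vector format" means for a title classifier, the sketch below turns a few titles into a count matrix and the site labels into one-hot vectors. It is an assumption-laden illustration in plain Python, not the preprocessing actually used in the challenge: the helper names, the sample titles, and the site list are all hypothetical, and the Keras model itself is not shown.

```python
def build_vocabulary(titles):
    """Assign each distinct lowercase word a column index, in order of first appearance."""
    vocab = {}
    for title in titles:
        for word in title.lower().split():
            vocab.setdefault(word, len(vocab))
    return vocab


def vectorize(titles, vocab):
    """Return one row of word counts per title (a simple document-term matrix)."""
    rows = []
    for title in titles:
        row = [0] * len(vocab)
        for word in title.lower().split():
            if word in vocab:
                row[vocab[word]] += 1
        rows.append(row)
    return rows


def one_hot(labels, classes):
    """Encode each label as a 0/1 vector with a single 1 at its class position."""
    index = {name: i for i, name in enumerate(classes)}
    return [[1 if index[label] == j else 0 for j in range(len(classes))]
            for label in labels]


# Hypothetical titles and anatomical sites, invented for this example.
titles = ["Breast cancer trial", "Brain tumor trial"]
sites = ["brain", "breast", "prostate"]

vocab = build_vocabulary(titles)
X = vectorize(titles, vocab)           # inputs: one row per trial title
y = one_hot(["breast", "brain"], sites)  # targets: one row per trial title
print(X)
print(y)
```

Arrays shaped like `X` and `y` are the kind of input a dense classification network expects, whichever front end (R or Python) is used to define it.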
As I continued through validation and analysis of my approach, I realized the classification model did not produce what I considered significant results. I need to do further feature engineering on the text corpus and improve the model's input features. This iterative process will help determine the features that best represent, classify, and connect the data flowing into the model. My next steps are to experiment with and test the methods described in the tutorial at https://cloud4scieng.org/2020/08/28/deep-learning-on-graphs-a-tutorial/.
This deep learning approach may reveal more about the underlying structure of the cancer study data; define the nodes and edges that detail its connections and features; identify or predict links and communities; and enable classification between classes. I intend to, quite literally, connect the dots of the data to solve this cancer clinical trial challenge.
Jerome Dixon is a Senior Operations Research Analyst at CANA Advisors. email@example.com