PLUS

Positive and unlabeled Learning from Unbalanced cases and Sparse structures (PLUS) is the first method to use a positive and unlabeled learning framework to explicitly model under-diagnosis when predicting cancer metastasis potential. PLUS is tailored to metastasis studies: it handles both the unbalanced allocation of instances between classes and unknown metastasis prevalence, which other existing methods do not address. Its robustness makes it possible to harness big data by integrating large-scale datasets from different cancer types. Insights from this research should be useful for the diagnosis and treatment of clinical metastatic disease.

The motivation of PLUS

(a) Among the patients who were diagnosed as non-metastatic, some were under-diagnosed. Traditional classification that uses the diagnosis as the response may therefore underestimate cancer metastasis potential. PLUS is designed to account for this under-diagnosis bias, so that patients with higher metastasis potential can be classified accurately. (b) In the TCGA Pan-Cancer study, for patients who were clinically diagnosed as non-metastatic (M0) at baseline in each cancer type (columns), the top three rows show the proportions of patients with follow-up information who were found alive with non-progressed disease (NP-Alive), alive with progressed disease (P-Alive), and dead (Dead); the bottom three rows show the same proportions for patients who were diagnosed as metastatic (M1) at baseline. (c) The median follow-up time for patients who were diagnosed as non-metastatic (blue) and as metastatic (yellow) at baseline for each cancer type.

Installation

# Install dependencies (uncomment the lines below if they are not already installed)

#install.packages("glmnet")

#install.packages("devtools")

devtools::install_github("xiaoyulu95/PLUS",force=TRUE)

Usage

Prediction=PLUS(train_data=X,Label.obs=Label,Sample_use_time=30,l.rate=1,qq=0.1)

Arguments

train_data: Gene expression matrix with N samples and M variables
Label.obs: Observed positive/unlabeled label for each sample; 1 marks a known positive sample, 0 marks an unlabeled sample
Sample_use_time: Used in the stopping criterion; how many times each sample is used in the training process
l.rate: Controls how much information from the previous iteration is carried into the next one
qq: Quantile of the predicted probability among positive samples, used to determine the cutoff between positive and negative
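
To make the role of qq more concrete, here is a minimal sketch of a quantile-based cutoff. It assumes the cutoff is the qq-quantile of the predicted probabilities among the known positive samples, which is one reading of the description above and not necessarily the exact rule used inside PLUS. The vectors pred_prob and label_obs are made-up illustration data.

```r
# Hypothetical illustration data, not output from PLUS
set.seed(1)
label_obs <- c(rep(1, 20), rep(0, 80))              # 1 = known positive, 0 = unlabeled
pred_prob <- c(runif(20, 0.5, 1), runif(80, 0, 1))  # assumed predicted probabilities

qq <- 0.1
# Assumed rule: take the qq-quantile of the probabilities among known positives
cutoff <- unname(quantile(pred_prob[label_obs == 1], probs = qq))

# Samples at or above the cutoff are called positive
pred_label <- as.integer(pred_prob >= cutoff)

# With qq = 0.1, about 90% of the known positives stay above the cutoff
mean(pred_label[label_obs == 1])
```

Under this reading, a smaller qq gives a lower cutoff and therefore calls more unlabeled samples positive.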

Value

The result is a list with three elements: pred.y gives the probability that each sample is predicted as positive; cutoff is the reference cutoff for converting the continuous probabilities into binary 0/1 labels; pred.coef1 holds the variable coefficients used in the prediction model.
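
As a rough sketch of how these three elements might be used together, the snippet below works on a mock list in place of real PLUS output. The values, the gene names, the use of >= at the cutoff, and the assumption that pred.coef1 is a named numeric vector are all illustrative choices, not confirmed details of the package.

```r
# Mock result list standing in for the output of PLUS(); all values are made up
res <- list(
  pred.y     = c(0.92, 0.15, 0.48, 0.71, 0.05),
  cutoff     = 0.45,
  pred.coef1 = c(geneA = 0.8, geneB = 0, geneC = -0.3)  # assumed named vector
)

# Convert continuous probabilities to binary 0/1 labels using the reference cutoff
pred_label <- as.integer(res$pred.y >= res$cutoff)

# Variables with non-zero coefficients, assuming a sparse coefficient vector
selected <- names(res$pred.coef1)[res$pred.coef1 != 0]

print(pred_label)
print(selected)
```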

Example

# Load the R packages required by PLUS
library(PLUS)
library(glmnet)

X=PLUS::example_data$train_data
Label=PLUS::example_data$Label.obs
Prediction=PLUS(train_data=X,Label.obs=Label,Sample_use_time=30,l.rate=1,qq=0.1)

PU data simulation

The PLUS package can also simulate positive and unlabeled data under different settings. Details: (https://github.com/xiaoyulu95/PLUS/tree/master/PU%20data%20simulation)
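
For readers who want a quick feel for what positive and unlabeled data look like, here is a minimal, self-contained sketch. It is not the simulation code from the repository linked above: the logistic model, the sample sizes, and the hide_frac parameter are assumptions chosen only for illustration.

```r
set.seed(42)
n <- 200   # number of samples (arbitrary)
m <- 50    # number of variables (arbitrary)
X <- matrix(rnorm(n * m), nrow = n, ncol = m)

# Assumed generative model: only the first 5 variables drive the true label
beta <- c(rep(1.5, 5), rep(0, m - 5))
prob_true <- as.vector(1 / (1 + exp(-(X %*% beta))))
label_true <- rbinom(n, size = 1, prob = prob_true)

# Hide a fraction of the true positives to mimic under-diagnosis
hide_frac <- 0.5   # hypothetical setting
pos_idx <- which(label_true == 1)
hidden <- sample(pos_idx, size = floor(hide_frac * length(pos_idx)))
label_obs <- label_true
label_obs[hidden] <- 0   # hidden positives become unlabeled

# Cross-tabulate true labels against observed labels
table(true = label_true, observed = label_obs)
```

Here X and label_obs have the shapes described for train_data and Label.obs, so a dataset built this way could be used to try out the Usage call on data where the hidden positives are known.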

Contact Information

Xiaoyu Lu (lu14@iu.edu)

Ph.D. candidate, Indiana University School of Medicine

Junyi Zhou (junyzhou@iu.edu)

Ph.D. candidate, Department of Biostatistics, Indiana University

Reference

Zhou, J., Lu, X., Chang, W., Wan, C., Zhang, C. and Cao, S., 2020. PLUS: predicting pan-cancer metastasis potential based on positive and unlabeled learning.