Journal of Applied Microbiology, Vol.127, No.6, 1656-1664, 2019
Capreomycin resistance prediction in two species of Mycobacterium using a stacked ensemble method
Aims: Predicting bacterial resistance provides valuable information that can assist in clinical decisions. With recent advances in whole genome sequencing technology, the detection of antibiotic resistance (AR) proteins directly from genomic data is becoming feasible. AR genes/proteins can be identified using best-hit methods that work by comparing candidate sequences with known AR genes in public databases. However, these approaches may fail to detect resistance genes with sequences that differ significantly from known sequences. Our goal is to develop a machine learning technique to accurately predict capreomycin resistance in Mycobacteria with low false discovery rates.
Methods and Results: We present a stacked ensemble learning model as an alternative to traditional DNA sequence alignment-based methods using optimal features generated from the physicochemical, evolutionary and secondary structure properties of protein sequences. We train logistic regression, C5.0 and support vector machine (SVM) algorithms as our base classifiers, and our stacked ensemble predictors combine the results from the base classifiers to achieve higher accuracy. Compared with our most accurate base classifier (SVM), our most accurate stacked ensemble predictor increases training accuracy by 2.43%. Our stacked ensemble predictors achieve test accuracy up to 81.25%.
Conclusions: We developed a stacked ensemble model to predict capreomycin resistance for Mycobacteria with an accuracy >80% using protein sequences with sequence similarity ranging between 10% and 70%. This performance cannot be achieved with best-hit methods due to differences in sequence similarity.
Significance and Impact of the Study: Today an estimated one-half million cases of multidrug-resistant (MDR) and extensively drug-resistant (XDR) tuberculosis (TB) occur annually worldwide at a great cost. Because capreomycin is a second-line drug used to treat drug-resistant TB, the ability to use a machine learning approach to classify capreomycin-resistant TB in a timely manner is crucial for the successful treatment of MDR or XDR TB.
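The stacking scheme described in the abstract (base classifiers whose outputs are combined by a second-level predictor) can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes scikit-learn is available, uses a synthetic feature matrix in place of the study's protein-derived features, and substitutes a CART decision tree for C5.0, which scikit-learn does not provide. All hyperparameters and the logistic regression meta-learner are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the protein feature matrix (NOT the study's data).
# Label 1 plays the role of "resistant", label 0 of "non-resistant".
X, y = make_classification(
    n_samples=300, n_features=20, n_informative=8, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Base classifiers: logistic regression, a decision tree (assumed stand-in
# for C5.0), and an SVM. Scaling is applied where the model is scale-sensitive.
base_learners = [
    ("logreg", make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))),
    ("tree", DecisionTreeClassifier(max_depth=5, random_state=0)),
    ("svm", make_pipeline(StandardScaler(), SVC(kernel="rbf"))),
]

# The meta-learner is trained on out-of-fold outputs of the base classifiers
# (5-fold cross-validation), which is the usual way to build a stacked ensemble.
stack = StackingClassifier(
    estimators=base_learners,
    final_estimator=LogisticRegression(max_iter=1000),
    cv=5,
)
stack.fit(X_train, y_train)

predictions = stack.predict(X_test)
stack_accuracy = accuracy_score(y_test, predictions)

# False discovery rate = FP / (FP + TP): the share of predicted-resistant
# samples that are actually non-resistant.
tn, fp, fn, tp = confusion_matrix(y_test, predictions, labels=[0, 1]).ravel()
false_discovery_rate = fp / (fp + tp) if (fp + tp) > 0 else 0.0

print(f"stacked test accuracy: {stack_accuracy:.4f}")
print(f"false discovery rate:  {false_discovery_rate:.4f}")
```

The numbers printed by this sketch come from synthetic data and say nothing about the accuracy figures reported in the abstract; the sketch only shows how base classifier outputs feed a second-level model and how accuracy and false discovery rate are computed from the resulting predictions.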
Keywords: antibiotic resistance; capreomycin resistance; ensemble learning; feature selection; machine learning; physicochemical features; secondary structure features; tuberculosis