Bounded coordinate-descent for biological sequence classification in high dimensional predictor space
Publication: Contribution to book/anthology/report › Conference contribution in proceedings › Research › peer-reviewed
Standard
Bounded coordinate-descent for biological sequence classification in high dimensional predictor space. / Ifrim, Georgiana; Wiuf, Carsten.
Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD'11. 2011. pp. 708-716.
RIS
TY - GEN
T1 - Bounded coordinate-descent for biological sequence classification in high dimensional predictor space
AU - Ifrim, Georgiana
AU - Wiuf, Carsten
PY - 2011/9/16
Y1 - 2011/9/16
N2 - We present a framework for discriminative sequence classification where linear classifiers work directly in the explicit high-dimensional predictor space of all subsequences in the training set (as opposed to kernel-induced spaces). This is made feasible by employing a gradient-bounded coordinate-descent algorithm for efficiently selecting discriminative subsequences without having to expand the whole space. Our framework can be applied to a wide range of loss functions, including binomial log-likelihood loss of logistic regression and squared hinge loss of support vector machines. When applied to protein remote homology detection and remote fold recognition, our framework achieves comparable performance to the state-of-the-art (e.g., kernel support vector machines). In contrast to state-of-the-art sequence classifiers, our models are simply lists of weighted discriminative subsequences and can thus be interpreted and related to the biological problem - a crucial requirement for the bioinformatics and medical communities.
AB - We present a framework for discriminative sequence classification where linear classifiers work directly in the explicit high-dimensional predictor space of all subsequences in the training set (as opposed to kernel-induced spaces). This is made feasible by employing a gradient-bounded coordinate-descent algorithm for efficiently selecting discriminative subsequences without having to expand the whole space. Our framework can be applied to a wide range of loss functions, including binomial log-likelihood loss of logistic regression and squared hinge loss of support vector machines. When applied to protein remote homology detection and remote fold recognition, our framework achieves comparable performance to the state-of-the-art (e.g., kernel support vector machines). In contrast to state-of-the-art sequence classifiers, our models are simply lists of weighted discriminative subsequences and can thus be interpreted and related to the biological problem - a crucial requirement for the bioinformatics and medical communities.
KW - Greedy coordinate-descent
KW - Logistic regression
KW - Sequence classification
KW - String classification
KW - Support vector machines
UR - http://www.scopus.com/inward/record.url?scp=80052661040&partnerID=8YFLogxK
U2 - 10.1145/2020408.2020519
DO - 10.1145/2020408.2020519
M3 - Article in proceedings
AN - SCOPUS:80052661040
SN - 9781450308137
SP - 708
EP - 716
BT - Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD'11
T2 - 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD'11
Y2 - 21 August 2011 through 24 August 2011
ER -
ID: 203900304