Fast Search-By-Classification for Large-Scale Databases Using Index-Aware Decision Trees and Random Forests

Research output: Contribution to journal › Conference article › Research › peer-review

Standard

Fast Search-By-Classification for Large-Scale Databases Using Index-Aware Decision Trees and Random Forests. / Lülf, Christian; Mayr Lima Martins, Denis; Vaz Salles, Marcos Antonio; Zhou, Yongluan; Gieseke, Fabian Cristian.

In: Proceedings of the VLDB Endowment, Vol. 16, No. 13, 2023, p. 2845–2857.

Harvard

Lülf, C, Mayr Lima Martins, D, Vaz Salles, MA, Zhou, Y & Gieseke, FC 2023, 'Fast Search-By-Classification for Large-Scale Databases Using Index-Aware Decision Trees and Random Forests', Proceedings of the VLDB Endowment, vol. 16, no. 13, pp. 2845–2857. https://doi.org/10.14778/3611479.3611492

APA

Lülf, C., Mayr Lima Martins, D., Vaz Salles, M. A., Zhou, Y., & Gieseke, F. C. (2023). Fast Search-By-Classification for Large-Scale Databases Using Index-Aware Decision Trees and Random Forests. Proceedings of the VLDB Endowment, 16(13), 2845–2857. https://doi.org/10.14778/3611479.3611492

Vancouver

Lülf C, Mayr Lima Martins D, Vaz Salles MA, Zhou Y, Gieseke FC. Fast Search-By-Classification for Large-Scale Databases Using Index-Aware Decision Trees and Random Forests. Proceedings of the VLDB Endowment. 2023;16(13):2845–2857. https://doi.org/10.14778/3611479.3611492

Author

Lülf, Christian ; Mayr Lima Martins, Denis ; Vaz Salles, Marcos Antonio ; Zhou, Yongluan ; Gieseke, Fabian Cristian. / Fast Search-By-Classification for Large-Scale Databases Using Index-Aware Decision Trees and Random Forests. In: Proceedings of the VLDB Endowment. 2023 ; Vol. 16, No. 13. pp. 2845–2857.

Bibtex

@inproceedings{25c6a396318a4fb1a707e9e6b2e85c8d,
title = "Fast Search-By-Classification for Large-Scale Databases Using Index-Aware Decision Trees and Random Forests",
abstract = "The vast amounts of data collected in various domains pose great challenges to modern data exploration and analysis. To find “interesting” objects in large databases, users typically define a query using positive and negative example objects and train a classification model to identify the objects of interest in the entire data catalog. However, this approach requires a scan of all the data to apply the classification model to each instance in the data catalog, making this method prohibitively expensive to be employed in large-scale databases serving many users and queries interactively. In this work, we propose a novel framework for such search-by-classification scenarios that allows users to interactively search for target objects by specifying queries through a small set of positive and negative examples. Unlike previous approaches, our framework can rapidly answer such queries at low cost without scanning the entire database. Our framework is based on an index-aware construction scheme for decision trees and random forests that transforms the inference phase of these classification models into a set of range queries, which in turn can be efficiently executed by leveraging multidimensional indexing structures. Our experiments show that queries over large data catalogs with hundreds of millions of objects can be processed in a few seconds using a single server, compared to hours needed by classical scanning-based approaches.",
author = "Christian L{\"u}lf and {Mayr Lima Martins}, Denis and {Vaz Salles}, {Marcos Antonio} and Yongluan Zhou and Gieseke, {Fabian Cristian}",
year = "2023",
doi = "10.14778/3611479.3611492",
language = "English",
volume = "16",
pages = "2845--2857",
journal = "Proceedings of the VLDB Endowment",
issn = "2150-8097",
publisher = "VLDB Endowment",
number = "13",

}

RIS

TY - GEN

T1 - Fast Search-By-Classification for Large-Scale Databases Using Index-Aware Decision Trees and Random Forests

AU - Lülf, Christian

AU - Mayr Lima Martins, Denis

AU - Vaz Salles, Marcos Antonio

AU - Zhou, Yongluan

AU - Gieseke, Fabian Cristian

PY - 2023

Y1 - 2023

N2 - The vast amounts of data collected in various domains pose great challenges to modern data exploration and analysis. To find “interesting” objects in large databases, users typically define a query using positive and negative example objects and train a classification model to identify the objects of interest in the entire data catalog. However, this approach requires a scan of all the data to apply the classification model to each instance in the data catalog, making this method prohibitively expensive to be employed in large-scale databases serving many users and queries interactively. In this work, we propose a novel framework for such search-by-classification scenarios that allows users to interactively search for target objects by specifying queries through a small set of positive and negative examples. Unlike previous approaches, our framework can rapidly answer such queries at low cost without scanning the entire database. Our framework is based on an index-aware construction scheme for decision trees and random forests that transforms the inference phase of these classification models into a set of range queries, which in turn can be efficiently executed by leveraging multidimensional indexing structures. Our experiments show that queries over large data catalogs with hundreds of millions of objects can be processed in a few seconds using a single server, compared to hours needed by classical scanning-based approaches.

AB - The vast amounts of data collected in various domains pose great challenges to modern data exploration and analysis. To find “interesting” objects in large databases, users typically define a query using positive and negative example objects and train a classification model to identify the objects of interest in the entire data catalog. However, this approach requires a scan of all the data to apply the classification model to each instance in the data catalog, making this method prohibitively expensive to be employed in large-scale databases serving many users and queries interactively. In this work, we propose a novel framework for such search-by-classification scenarios that allows users to interactively search for target objects by specifying queries through a small set of positive and negative examples. Unlike previous approaches, our framework can rapidly answer such queries at low cost without scanning the entire database. Our framework is based on an index-aware construction scheme for decision trees and random forests that transforms the inference phase of these classification models into a set of range queries, which in turn can be efficiently executed by leveraging multidimensional indexing structures. Our experiments show that queries over large data catalogs with hundreds of millions of objects can be processed in a few seconds using a single server, compared to hours needed by classical scanning-based approaches.

U2 - 10.14778/3611479.3611492

DO - 10.14778/3611479.3611492

M3 - Conference article

VL - 16

SP - 2845

EP - 2857

JO - Proceedings of the VLDB Endowment

JF - Proceedings of the VLDB Endowment

SN - 2150-8097

IS - 13

ER -

ID: 359258813