Simple Hashing-Based Algorithms with Strong Theoretical Guarantees

Research output: Book/ReportPh.D. thesisResearch

In recent years, demand has immensely increased for algorithms for handling and analyzing large-scale data. The data often arrives in a stream, and if it is not processed in time, the information is lost. Indeed, the high volume of the data makes storing a complete copy and performing exact computations an impossible task. These challenges motivate the following general question. "Can we design efficient algorithms and data structures that provide reliable statistical analyses in large-scale data scenarios?". A powerful tool in designing such algorithms is randomization, and in particular, randomization through the use of hash functions. Conceptually, a hash function can be thought of as distributing a set of balls into a set of bins. Ideally it should do this in a fully random way, but unfortunately such fully random hashing is impossible to implement in practice. This thesis seeks to investigate practical, pseudorandom hash functions that still have enough randomness in them to give the same good theoretical guarantees as fully random hashing. First, we introduce a new family of hash functions, tabulation-permutation, which is simple to implement and demonstrated to be extremely fast in practice. We prove that this family of hash function provides reliable statistics on data, essentially showing that certain hash based sums nicely follow a normal distribution. We further show how access to such a hashing scheme leads to speedups of various streaming algorithms. We also study the distribution of non-empty bins when using simple tabulation hashing, another fast and practical hashing scheme. In several ways, this distribution is shown to resemble that from the fully random scenario.

Second, we study algorithms for estimating the frequencies of elements in a large stream of data. We show how such algorithms can profit from having access to a machine learning oracle that can predict whether an item is a heavy hitter or not.

Finally, we present a basic graph algorithm which can be understood as solving the following problem. For a city, can we make all the streets of the city one-way such that it is still possible to get from any place to any other place? The algorithm decides in linear time whether this can be done, also finding such a one-way orientation of the streets if it exists.
Original languageEnglish
PublisherDepartment of Computer Science, Faculty of Science, University of Copenhagen
Number of pages189
Publication statusPublished - 2020

ID: 250973835