Two-pass greedy regular expression parsing
Research output: Chapter in Book/Report/Conference proceeding › Article in proceedings › Research › peer-review
We present new algorithms for producing greedy parses for regular
expressions (REs) in a semi-streaming fashion. Our lean-log
algorithm executes in time O(mn) for REs of size m and input
strings of size n and outputs a compact bit-coded parse tree
representation. It improves on previous algorithms by:
operating in only 2 passes; using only O(m) words of
random-access memory (independent of n); requiring only kn
bits of sequentially written and read log storage, where k <
1/3 m is the number of alternatives and Kleene stars in the
RE; processing the input string as a symbol stream and not
requiring it to be stored at all. Previous RE parsing
algorithms do not scale linearly with input size, or require
substantially more log storage and employ 3 passes where the
first consists of reversing the input, or do not or are not
known to produce a greedy parse. The performance of our
unoptimized C-based prototype indicates that the superior
performance of our lean-log algorithm can also be observed in
practice; it is also surprisingly competitive with RE tools
not performing full parsing, such as Grep.
expressions (REs) in a semi-streaming fashion. Our lean-log
algorithm executes in time O(mn) for REs of size m and input
strings of size n and outputs a compact bit-coded parse tree
representation. It improves on previous algorithms by:
operating in only 2 passes; using only O(m) words of
random-access memory (independent of n); requiring only kn
bits of sequentially written and read log storage, where k <
1/3 m is the number of alternatives and Kleene stars in the
RE; processing the input string as a symbol stream and not
requiring it to be stored at all. Previous RE parsing
algorithms do not scale linearly with input size, or require
substantially more log storage and employ 3 passes where the
first consists of reversing the input, or do not or are not
known to produce a greedy parse. The performance of our
unoptimized C-based prototype indicates that the superior
performance of our lean-log algorithm can also be observed in
practice; it is also surprisingly competitive with RE tools
not performing full parsing, such as Grep.
Original language | English |
---|---|
Title of host publication | Implementation and Application of Automata : 18th International Conference, CIAA 2013, Halifax, NS, Canada, July 16-19, 2013. Proceedings |
Editors | Stavros Konstantinidis |
Number of pages | 12 |
Publisher | Springer |
Publication date | 2013 |
Pages | 60-71 |
ISBN (Print) | 978-3-642-39273-3 |
ISBN (Electronic) | 978-3-642-39274-0 |
DOIs | |
Publication status | Published - 2013 |
Event | 18th International Conference on Implementation and Application of Automata - Halifax, Canada Duration: 16 Jul 2013 → 19 Jul 2013 Conference number: 18 |
Conference
Conference | 18th International Conference on Implementation and Application of Automata |
---|---|
Nummer | 18 |
Land | Canada |
By | Halifax |
Periode | 16/07/2013 → 19/07/2013 |
ID: 94401773