Taming the Length Field in Binary Data: Calc-Regular Languages
Joshua König and Norina Marie Grosch
Presented at the 2017 LangSec Workshop
at the IEEE Symposium on Security & Privacy Workshops
May 25, 2017
San Jose, CA
When binary data are sent over a byte stream, the binary format that sender and receiver use is a "data serialization language", either explicitly specified or implied by the implementations. Security is at risk when sender and receiver disagree on details of this language. If, for example, the receiver fails to reject invalid messages, an adversary may craft such messages to compromise the receiver's security.
Many data serialization languages are length-prefix languages. When a field F of flexible size is sent or stored, it is encoded at the binary level as a pair (|F|, F), where |F| represents the length of F (typically in bytes).
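To make the pair (|F|, F) concrete, here is a minimal Python sketch. The 2-byte big-endian length field and the function names encode_field and decode_field are assumptions made for illustration; they are not taken from the paper or from any particular format.

```python
import struct

# Assumption for illustration: |F| is stored as a 2-byte big-endian integer.
LENGTH_BYTES = 2
MAX_LENGTH = 0xFFFF


def encode_field(payload: bytes) -> bytes:
    """Encode F as the pair (|F|, F)."""
    if len(payload) > MAX_LENGTH:
        raise ValueError("payload too long for a 2-byte length field")
    return struct.pack(">H", len(payload)) + payload


def decode_field(data: bytes) -> bytes:
    """Decode one (|F|, F) pair, rejecting any input where |F| and F disagree."""
    if len(data) < LENGTH_BYTES:
        raise ValueError("truncated length field")
    (length,) = struct.unpack(">H", data[:LENGTH_BYTES])
    body = data[LENGTH_BYTES:]
    if len(body) != length:
        raise ValueError("length field disagrees with payload size")
    return body
```

The strict check in decode_field reflects the security point above: a receiver that silently accepts a mismatched length speaks a different language than the sender.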
This paper's main contributions and results are as follows.
(1) Length-prefix languages are not context-free. This might seem to justify the conjecture that parsing those languages is difficult and inefficient.
(2) The class of "calc-regular languages" is proposed, a minimal extension of regular languages that can additionally handle length fields. Calc-regular languages can be specified via "calc-regular expressions", a natural extension of regular expressions.
(3) Calc-regular languages are almost as easy to parse as regular languages, using finite-state machines with additional accumulators. This disproves the conjecture from (1).
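The idea behind (3) can be illustrated with a small recognizer: a two-state machine plus a single integer accumulator that counts down the remaining payload bytes. This is a hedged sketch, not the paper's construction. The toy language (zero or more records, each a 1-byte length followed by exactly that many payload bytes) and the names State and accepts are assumptions chosen for illustration.

```python
from enum import Enum, auto


class State(Enum):
    READ_LENGTH = auto()   # next byte is a length field
    READ_PAYLOAD = auto()  # consuming payload bytes


def accepts(stream: bytes) -> bool:
    """Accept zero or more records of the form (1-byte length, payload)."""
    state = State.READ_LENGTH
    remaining = 0  # accumulator: payload bytes still expected
    for byte in stream:
        if state is State.READ_LENGTH:
            remaining = byte           # load the accumulator from the length field
            if remaining > 0:
                state = State.READ_PAYLOAD
        else:
            remaining -= 1             # one payload byte consumed
            if remaining == 0:
                state = State.READ_LENGTH
    # Valid only if the input ends on a record boundary.
    return state is State.READ_LENGTH
```

The recognizer reads the input once, left to right, and keeps only the current state and one counter, which is the sense in which such parsing stays close to ordinary finite-state matching.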