ISLa Specs

ISLa is a grammar-aware string constraint solver with its own specification language.

View on GitHub rindphi/isla

The ISLa Language Specification

The Input Specification Language (ISLa) is a notation for formally specifying context-sensitive properties of strings structured by a context-free grammar. The purpose of this document is to precisely specify ISLa’s syntax and semantics.

The ISLa version considered in this document is ISLa 0.8.16. Please consult the ISLa CHANGELOG to find out if there were recent additions.

Introduction

Strings are the basic datatype for software testing and debugging at the system level: All programs inputs and outputs are strings, or can be straightforwardly represented as such. In parsing and fuzzing, Context-Free Grammars (CFGs) are a popular formalism to decompose unstructured strings of data.

Consider, for example, a simple language of assignments such as x := 1 ; y := x. The following grammar, here presented in Extended Backus–Naur Form (EBNF), can be used to parse and produce syntactically valid assignment programs:

stmt  = assgn, { " ; ", stmt } ;
assgn = var, " := ", rhs ;
rhs   = var | digit ;
var   = "a" | "b" | "c" | ... ;
digit = "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9" ;

Using this grammar, the input x := 1 ; y := x can be parsed into the following tree structure:

stmt
├─ assgn
│  ├─ var
│  │  └─ "x"
│  ├─ " := "
│  └─ rhs
│     └─ digit
│        └─ "1"
├─ " ; "
└─ stmt
   └─ assgn
      ├─ var
      │  └─ "y"
      ├─ " := "
      └─ rhs
         └─ var
            └─ "x"

In the context of parsing, such trees are called “parse trees;” in the fuzzing domain, the notion “derivation tree” is preferred. In this document, we will use the term “derivation tree.”

We call labels of the inner nodes of derivation trees such as stmt that have to be further expanded to produce or parse a string “nonterminal elements” or simply “nonterminals;” consequently, the leaves of the tree are labeled with “terminal elements” or “terminals.” From a tree, we obtain a string by chaining the terminals in the order in which they are visited in a depth-first traversal of the tree. CFGs map nonterminals to one or more expansion alternatives or, simply, expansions. When using a grammar for fuzzing, we can expand a nonterminal using any alternative. During parsing, we have to find the right alternative for the given input (that is, one right alternative, since CFGs can be ambiguous).

CFGs allow decomposing strings into their elements. However, they are—by definition—too coarse to capture context-sensitive language features. In the case of our assignment language, x := 1 ; y := z is not considered a valid element of the language, since the identifier z has not been assigned before (such that its value is undefined). Similarly, the program x := x is “illegal.” This property, that all right-hand side variables must have been assigned in a previous assignment, could, in principle, be expressed in a less restricted grammar. Examples are context-sensitive or even unrestricted grammars, where left-hand sides can contain additional context in addition to a single nonterminal value. However, such grammars are tedious to use in specification, and we do not know any parsing or fuzzing tool based on more general grammar formalisms.

Enter ISLa. ISLa specifications are based on a reference grammar. The nonterminals of that grammar determine the vocabulary of the grammar. They take the roles of variables in unit-level specification languages like JML. The following ISLa constraint restricts the language of the reference grammar shown above to exactly those assignment programs using only previously assigned variables as right-hand sides:

forall <assgn> assgn:
  exists <assgn> decl: (
    before(decl, assgn) and 
    assgn.<rhs>.<var> = decl.<var>
  )

In ISLa language, nonterminals are surrounded by angular brackets (see also the section on grammars). The above constraint specifies that

Note that the . syntax allows accessing immediate children of elements in the parse tree; decl.<var> thus uniquely identifies the left-hand side of an assignment (since variables in right-hand sides appear as a child of a <rhs> nonterminal).

In the remainder of this document, we specify the syntax and semantics of ISLa formulas.

Syntax

In this section, we describe the syntax of ISLa’s reference grammars and the syntax of ISLa formulas themselves. We introduce the ISLa syntax on a high level by providing grammars in EBNF. In the section on ISLa’s semantics, we discuss the individual ISLa syntax elements in more details and explain their meaning formally and based on examples.

Grammars

ISLa’s uses simple CFGs as reference grammars, i.e., without repetition etc. Valid ISLa grammars are exactly those that can be expressed in Backus-Naur Form (BNF).1

The EBNF grammar for the concrete syntax of ISLa reference grammars looks as follows, where NO_ANGLE_BRACKET represents any character but < and >:

bnf_grammar = derivation_rule, { derivation_rule } ;

derivation_rule = NONTERMINAL, "::=", alternative, { "|", alternative } ;

alternative = ( STRING | NONTERMINAL ) { STRING | NONTERMINAL } ;

NONTERMINAL = "<", NO_ANGLE_BRACKET, { NO_ANGLE_BRACKET }, ">" ;

STRING = '"' { ESC|. }? '"';
ESC = '\\' ("b" | "t" | "n" | "r" | '"' | "\\") ;

Here’s how our example grammar from the introduction looks in this format (we abbreviated the definition of <var>):

<start> ::= <stmt>
<stmt> ::= <assgn> | <assgn> " ; " <stmt>
<assgn> ::= <var> " := " <rhs>
<rhs> ::= <var> | <digit>
<var> ::= "a" | "b" | "c" | ...
<digit> ::= "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9"

Lexer Rules

ISLa’s lexer grammar is shown below. In addition to the rules shown, ISLa knows Python-style line comments starting with #. These comments as well as whitespace between tokens are ignored during lexing. The only string delimiter known to ISLa are double quotes ". Inside strings, double quotes are escaped using a backslash character: \". Most notably, this also holds for SMT-LIB expressions, which is a deviation from the SMT-LIB standard where quotes are escaped by doubling them. In standard SMT-LIB, a quotation mark inside double quotes is expressed (""""), whereas in ISLa, one writes "\"".

AND = "and" ;
OR = "or" ;
NOT = "not" ;

XOR = "xor" ;
IMPLIES_SMT = "=>" ;
IMPLIES_ISLA = "implies" ;

SMT_INFIX_RE_STR =
      "re.++"
    | "str.++"
    | "str.<="
    ;

SMT_NONBINARY_OP =
      ABS
    | "re.+"
    | "re.*"
    | "str.len"
    | "str.in_re"
    | "str.to_re"
    | "re.none"
    | "re.all"
    | "re.allchar"
    | "str.at"
    | "str.substr"
    | "str.prefixof"
    | "str.suffixof"
    | "str.contains"
    | "str.indexof"
    | "str.replace"
    | "str.replace_all"
    | "str.replace_re"
    | "str.replace_re_all"
    | "re.comp"
    | "re.diff"
    | "re.opt"
    | "re.range"
    | "re.loop"
    | "str.is_digit"
    | "str.to_code"
    | "str.from_code"
    | "str.to.int"
    | "str.from_int"
    ;

XPATHEXPR = (ID | VAR_TYPE), XPATHSEGMENT, { XPATHSEGMENT } ;

XPATHSEGMENT =
      DOT, VAR_TYPE
    | DOT, VAR_TYPE, BROP, INT, BRCL
    | TWODOTS, VAR_TYPE
    ;

VAR_TYPE  = LT, ID, GT ;

DIV = "div" ;
MOD = "mod" ;
ABS = "abs" ;

STRING = '"', { ESC | . }?, '"';
ID = ID_LETTER, { ID_LETTER | DIGIT } ;
INT  = DIGIT, { DIGIT } ;
ESC  = "\\", ( "b" | "t" | "n" | "r" | '"' | "\\" ) ;

DOT  = "." ;
TWODOTS  = ".." ;
BROP  = "[" ;
BRCL  = "]" ;

MUL = "*" ;
PLUS = "+" ;
MINUS = "-" ;
GEQ = ">=" ;
LEQ = "<=" ;
GT = ">" ;
LT = "<" ;

ID_LETTER  = "a".."z" | "A".."Z" | "_" | "\\" | "-" | "." | "^" ;
DIGIT  = "0".."9" ;

Parser Rules

Below, you find ISLa’s parser grammar. SMT-LIB expressions are usually expressed in a Lisp-like S-expression syntax, e.g., (= x (+ y 13)). This is fully supported by ISLa, and is robust to extensions in the SMT-LIB format as long as new function symbols can be parsed as alphanumeric identifiers. Our prefix and infix syntax that we added on top of S-expressions, as well as expressions using operators with special characters, are only parsed correctly if the operators appear in the lexer grammar. This is primarily to distinguish expressions in prefix syntax (op(arg1, arg1, ...)) from structural and semantic predicates. In future versions of the grammar, we might relax this constraint.

The top-level constant declaration const my_const: <my_type>; is optional. We default to const start: <start>;. Consequently, if no top-level constant is provided, the start symbol of the reference grammar must be <start> and all quantified formulas without an explicit in ... specification address elements in start.

Match expressions (see the section on quantifiers) are hidden inside the underspecified nonterminal MATCH_EXPR. We describe the lexer and parser grammars for match expressions further below.

isla_formula = [ const_decl ], formula;

const_decl = "const", ID, ":", VAR_TYPE, ";" ;

formula =
    "forall", VAR_TYPE, [ ID ],                  [ "in" (ID | VAR_TYPE) ], ":", formula
  | "exists", VAR_TYPE, [ ID ],                  [ "in" (ID | VAR_TYPE) ], ":", formula
  | "forall", VAR_TYPE, [ ID ], "=", MATCH_EXPR, [ "in" (ID | VAR_TYPE) ], ":", formula
  | "exists", VAR_TYPE, [ ID ], "=", MATCH_EXPR, [ "in" (ID | VAR_TYPE) ], ":", formula
  | "exists", "int", ID, ":", formula
  | "forall", "int", ID, ":", formula
  | "not", formula
  | formula, AND, formula
  | formula, OR, formula
  | formula, XOR, formula
  | formula, IMPLIES_ISLA, formula
  | formula, "iff", formula
  | ID, "(", predicate_arg, { ",", predicate_arg }, ")"
  | "(", formula, ")"
  | sexpr
  ;

sexpr =
    "true"
  | "false"
  | INT
  | ID
  | XPATHEXPR
  | VAR_TYPE
  | STRING
  | SMT_NONBINARY_OP 
  | smt_binary_op
  | SMT_NONBINARY_OP, "(", [ sexpr, { "," sexpr } ], ")"
  | sexpr, SMT_INFIX_RE_STR, sexpr
  | sexpr, ( PLUS | MINUS ), sexpr
  | sexpr, ( MUL | DIV | MOD ), sexpr
  | sexpr, ( "=" | GEQ | LEQ | GT | LT ), sexpr
  | "(", sexpr, sexpr, { sexpr }, ")"
  ;

predicate_arg = ID | VAR_TYPE | INT | STRING | XPATHEXPR ;


smt_binary_op:
  '=' | GEQ | LEQ | GT | LT | MUL | DIV | MOD | PLUS | MINUS | SMT_INFIX_RE_STR | AND | OR | IMPLIES_SMT | XOR ;

Match Expression Lexer Rules

We show the actual ANTLR rules of the match expression lexer, since they use ANTLR “modes” to parse variable declarations and optional match expression elements into tokens. For details on match expressions, we refer to the section on quantifiers.

BRAOP : '{' -> pushMode(VAR_DECL) ;

OPTOP : '[' -> pushMode(OPTIONAL) ;

TEXT : (~ [{[]) + ;

NL : '\n' + -> skip ;

mode VAR_DECL;
BRACL : '}' -> popMode ;
ID: ID_LETTER (ID_LETTER | DIGIT) * ;
fragment ID_LETTER : 'a'..'z'|'A'..'Z' | [_\-.] ;
fragment DIGIT : '0'..'9' ;
GT: '>' ;
LT: '<' ;
WS : [ \t\n\r]+ -> skip ;

mode OPTIONAL;
OPTCL : ']' -> popMode ;
OPTTXT : (~ ']') + ;

Match Expression Parser Rules

The parser rules for match expressions are depicted below in the EBNF format.

matchExpr = matchExprElement, { matchExprElement } ;

matchExprElement =
    BRAOP, varType, ID, BRACL
  | OPTOP, OPTTXT, OPTCL
  | TEXT
  ;

varType : LT ID GT ;

Simplified Syntax

ISLa 0.3 added a simplified syntax layer allowing to specify ISLa constraints much more concisely in many cases. Furthermore, SMT-LIB expressions can be expressed in the more common prefix or infix instead of S-expression syntax. All the additions described in this section are “syntactic sugar;” they are translated to core ISLa during parsing. In the semantics section, we thus exclusively focus on “core ISLa” (with explicit variable names, without “free nonterminals” and XPath expressions, etc.), assuming that all features described in this section are translated to that language core as described here.

Simplified Syntax by Example

Before we describe the syntactic additions one by one, we demonstrate most of them along the example of the assignment language from the introduction. The definition-use constraint discussed there is expressed as follows in core ISLa:

forall <assgn> assgn="<var> := {<var> rhs}" in start:
  exists <assgn> decl="{<var> lhs} := <rhs>" in start: (
    before(decl, assgn) and 
    (= lhs rhs)
  )

As a first simplification step, we can remove the in start, which is the default (since start is the default “global” constant; cf. the explanations in the semantics section):

forall <assgn> assgn="<var> := {<var> rhs}":
  exists <assgn> decl="{<var> lhs} := <rhs>": (
    before(decl, assgn) and 
    (= lhs rhs)
  )

Binary SMT-LIB expressions can be written in infix syntax:

forall <assgn> assgn="<var> := {<var> rhs}":
  exists <assgn> decl="{<var> lhs} := <rhs>": (
    before(decl, assgn) and 
    lhs = rhs
  )

While match expressions such as "<var> := {<var> rhs}" permit a lot of flexibility for pattern matching and binding variables in subtrees, it is often easier (but subject to individual taste) to use ISLa’s “XPath” syntax. This syntax allows addressing children of variables using a dot notation. It is inspired by the XPath abbreviated syntax, but uses dots . instead of slashes /. Our assignment language constraint can be equivalently expressed as

forall <assgn> assgn:
  exists <assgn> decl: (
    before(decl, assgn) and 
    assgn.<rhs>.<var> = decl.<var>
  )

Finally, we can omit the universal quantifiers, and instead use its nonterminal symbol in the quantified formula. Such “free nonterminals” are during parsing replaced by a new variable bound by an added top-level universal quantifier. Thus, the most concise formulation of the definition-use constraint is

exists <assgn> decl: (
  before(decl, <assgn>) and 
  <assgn>.<rhs>.<var> = decl.<var>
)

Generalized SMT-LIB syntax

At the core of ISLa are SMT-LIB expressions; quantifiers and predicate formulas are constructed around those. In the SMT-LIB language, all expressions are written in the S-expression format (op arg1 arg2) known from languages like LISP and Scheme. In addition to this syntax, ISLa supports the prefix notation op(arg1, arg2) more commonly used in mathematics and programming languages and, for binary operators, the infix notation arg1 op arg2. Thus, one may write

17 + str.to.int(y) = str.to.int(x)

instead of

(= (+ 17 (str.to.int y)) (str.to.int x))

in ISLa. When parsing an ISLa formula, all such prefix and infix expressions are translated to S-expressions, which is why S-expressions are printed when unparsing a formula, regardless of whether you used the generalized syntax to specify the formula originally or not.

Name conflicts with ISLa predicates. To distinguish SMT-LIB expressions in prefix syntax from ISLa predicates, the ISLa parser checks whether the provided operator is an SMT-LIB function. If so, the expression is parsed as an SMT-LIB expression. This implies that if an operator name is used both in an SMT-LIB theory and for an ISLa predicate, the SMT-LIB operator always “wins;” it will not be possible to parse the predicate. Thus, name clashes with SMT-LIB operators should be avoided when defining a new ISla predicate.

Omission of in start

Let const my_const: <my_type>; formula be an ISLa specification; if the const declaration is omitted, we assume a specification const start: <start>; as a default. Then, all subformulas forall <type> name: and exists <type> name: in formula (where the name part is optional) are translated to forall <type> name in start: and exists <type> name in start: during parsing.

Omission of Bound Variable Names

ISLa permits the omission of a name for the bound variable in quantifiers. In that case, the nonterminal type of the bound variable may be used to address that variable, as in

exists <assgn>: <assgn> = "x := y"

Formulas of this or similar shape translate to

exists <assgn> assgn in start: (= assgn "x := y")

More formally, in all formulas Q <type> in v: formula, where Q is forall or exists, we choose a “fresh” variable name name and replace the formula Q <type> name in v: formula', where formula' results from formula by replacing all occurrences of <typ> appearing at places where a variable may appear name. Fresh means that the name name is not used anywhere in that formula.

Free Nonterminals

In an ISLa formula, you can use a nonterminal from the reference grammar at every position where a variable may occur, also if that nonterminal is not bound by a quantifier. Those nonterminals represent universally bound variables. A formula formula with at least one unbound occurrence of a nonterminal symbol <type> is turned into a formula forall <type> name in start: formula', where

X-Path Expressions

As an alternative to match expressions, ISLa supports accessing derivation tree children using a notation inspired by the XPath abbreviated syntax. In particular, it supports the “child” and “descendant” axes, i.e., referring direct children and “deeper” descendants of those. In contrast to the original XPath syntax, we use dots . instead of slashes /. We still use the term “XPath expression” to refer to such expressions.

At any position in an ISLa formula where a variable may occur, one may alternatively use an XPath expression. An XPath expression in ISLa consists of one or more segments of the form var.<type1>[pos1].<type2>[pos2] (for the first segment) or <type>.<type1>[pos1].<type2>[pos2] (for the second and later segments). Note that the second form can also be used when specifying an ISLa constraint, but is translated to the first form by universally closing over the free nonterminal <type>. Segments are connected using the .. operator (descendent axis). Each segment consists of one of more child axis usages connected by .. The [pos] specifiers are optional and default to [1].

Semantically, var.<type1>[pos1] refers to the pos1-th direct child of type <type1> in the derivation tree associated to var, where counting starts from 1; var.<type1>[pos1].<type2>[pos2] to the pos2-th child of type <type2> of that element, and so on.

Using the descendent axis, we can address derivation tree children at a greater distance (than 1). The exact distance cannot be specified. An XPath expression var.<type1>[pos1]..<type2>.<type3>[pos3] refers to

A simple example for the use of the .. axis is the formula

checksum(<header>, <header>..<checksum>)

It specifies that a checksum predicate should hold for a <header> element and all <checksum> elements somewhere below <header>.

Translation of XPath expressions. During parsing, XPath segments are translated into match expressions. If more than one possible such translation is possible, we build a conjunction (for universal formulas) or disjunction (for existential formulas). We illustrate this along the example of an XML constraint. The example is based on the following grammar:

<start> ::= <xml-tree>
<xml-tree> ::=   <text> 
               | <xml-open-tag> <xml-tree> <xml-close-tag> 
               | <xml-tree> <xml-tree>
<xml-open-tag> ::= "<" <id> ">" | "<" <id> " " <xml-attribute> ">"
<xml-close-tag> ::= "</" <id> ">"
<xml-attribute> ::= <id> "=" <id> | <xml-attribute> " " <xml-attribute>
<id> ::= <LETTER> | <id> <LETTER>
<text> ::= <text> <LETTER_SPACE> | <LETTER_SPACE>
<LETTER> ::= "a" | "b" | "c" | ... | "0" | "1" | ... | "\"" | "'" | "."
<LETTER_SPACE> ::= "a" | "b" | "c" | ... | "\"" | "'" | " " | "\t"

Now, consider the constraint <xml-open-tag>.<id> = "a" specifying that identifiers in opening tags should equal the letter a. First, we introduce a new variable for <xml-open-tag> which we bind in a universal quantifier (as described in the section on free nonterminals):

forall <xml-open-tag> optag in start: optag.<id> = "a"

Next, we search the reference grammar for expansions of <xml-open-tag> containing at least one <id> nonterminal, which are both expansions for that nonterminal. Thus, we create a match expression for both of these alternative, and obtain the following conjunction of two quantified formulas:

forall <xml-open-tag> optag="<{<id> id}>" in start: id = "a" and
forall <xml-open-tag> optag="<{<id> id} <xml-attribute>>" in start: id = "a"

We can also allow longer identifiers, but restrict the allowed letters to “a.” This can be concisely expressed with the descendant axis ..: <xml-open-tag>.<id>..<LETTER> = "a". When translating this formula to core ISLa, we start from the outside and obtain

forall <xml-open-tag> optag="<{<id> id}>" in start: id..<LETTER> = "a" and
forall <xml-open-tag> optag="<{<id> id} <xml-attribute>>" in start: id..<LETTER> = "a"

Next, we “eliminate” the first XPath segment by introducing a universal quantifier inside the already added one:

forall <xml-open-tag> optag="<{<id> id}>" in start: 
  forall <LETTER> letter in id:
    letter = "a" and
forall <xml-open-tag> optag="<{<id> id} <xml-attribute>>" in start:
  forall <LETTER> letter in id:
    letter = "a"

In summary,

Semantics

In this section, we discuss ISLa’s semantics, i.e., what an ISLa specification means. Clearly, there has to be a relation between ISLa formulas and strings, since ISLa is a specification language for strings. However, it is more convenient to define the semantics of an ISLa formula as the set of derivation trees it represents.

On a high level, we define the semantics of a context-free grammar as the set of derivation trees that can be (transitively) derived from its start symbol. In the subsequent sections, we define (for each ISLa syntax element) a relation \(t\models{}\varphi\) that holds if, and only if, the derivation tree \(t\) satisfies the ISLa formula \(\varphi\). Finally, the semantics of an ISLa formula \(\varphi\) are all derivation trees represented by the reference grammar that satisfy \(\varphi\).

The language of CFGs, i.e., the strings they represent, is thoroughly defined in the standard literature.2 We follow the same style. We assume a relation \(t\Rightarrow{}t’\) between derivation trees that holds if \(t’\) can be derived from \(t\) by adding to some leaf node in \(t\) labeled with a nonterminal symbol \(n\) new children nodes corresponding to some expansion alternative for \(n\). For example, consider the following derivation tree:

<stmt>
├─ <assgn>
│  ├─ <var>
│  │  └─ "x"
│  ├─ " := "
│  └─ <rhs>
│     └─ <digit>
│        └─ "1"
├─ " ; "
└─ <stmt>

Using the expansion alternative <stmt> ::= <assgn> from the (BNF) grammar for our assignment language, we can expand the open <stmt> node by adding an <assgn> child. The result looks as follows:

<stmt>
├─ <assgn>
│  ├─ <var>
│  │  └─ "x"
│  ├─ " := "
│  └─ <rhs>
│     └─ <digit>
│        └─ "1"
├─ " ; "
└─ <stmt>
   └─ <assgn>

This is not the only option: We can also expand <stmt> with the expansion alternative <stmt> ::= <assgn> " ; " <stmt>, which results in

<stmt>
├─ <assgn>
│  ├─ <var>
│  │  └─ "x"
│  ├─ " := "
│  └─ <rhs>
│     └─ <digit>
│        └─ "1"
├─ " ; "
└─ <stmt>
   ├─ <assgn>
   ├─ " ; "
   └─ <stmt>

If \(t\) is the initial tree and \(t_1\) and \(t_2\) are the two expansions, then both \(t\Rightarrow{}t_1\) and \(t\Rightarrow{}t_2\) hold. Now, let \(\Rightarrow^\star\) be the reflexive and transitive closure of \(\Rightarrow\). Then, the set of derivation trees \(T(G)\) represented by a CFG \(G\) is defined as \(T(G):=\{t\,\vert\,t_0\Rightarrow^\star{}t\}\), where \(t_0\) is a derivation tree consisting only of the grammar’s start symbol.

Assuming the relation \(\models\) has been defined, we define the semantics \([\![\varphi]\!]\) of an ISLa formula \(\varphi\) as \([\![\varphi]\!]:=\{t\in{}T(G)\,\vert\,t\models\varphi\wedge\mathit{closed}(t)\}\), where \(G\) is the reference grammar for \(\varphi\) and the predicate \(\mathit{closed}\) holds for all derivation trees whose leaves are labeled with terminals.

In the remaining parts of this section, we discuss each element of the ISLa syntax and define the relation \(\models\) along the way.

When doing so, we also need (and define step by step) a function \(\mathit{freeVars}(\varphi)\) that returns the free variables of a formula \(\varphi\). Those are the variables that are not bound by a quantifier.

In ISLa, all variables are of “string” sort. This is especially important when writing SMT-LIB expressions, since appropriate conversions have to be added when, e.g., comparing a variable representing an integer to an actual integer.

To define \(\models\) for formulas with free variables, we use an additional variable assignment \(\beta\) associating variables with derivation trees. We write \(\beta\models\varphi\) to express that \(\varphi\) holds when instantiating free variables in \(\varphi\) according to the assignments in \(\beta\).

The notation \(t\models\varphi\) used above, where \(t\) is a derivation tree, is a shortcut. When specifying an ISLa formula, we can declare a global constant using the syntax const constant_name: <constant_type>; (cf. the ISLa grammar). The declaration is optional; if it is not present, a constant start of type <start> will be assumed. Assuming c is this constant, then \(t\models\varphi\) is

Atoms

The name ISLa has a double meaning: First, it is an acronym for “Input specification language;” and second, “isla” is the Spanish word for “island.” The reason for this second meaning is that ISLa embeds the SMT-LIB language as an island language. Around this embedded language, ISLa essentially adds quantifiers aware of the structure of context-free grammars. Thus, SMT-LIB expressions are the heart and the most important atomic ISLa formulas. Atomic means that they do not contain additional ISLa subformulas. ISLa also knows another type of atomic formula: predicate formulas. Here, we distinguish structural and semantic predicates. Structural predicates allow addressing structural relations such as “before” of “inside;” semantic predicates complement SMT-LIB and allow expressing complex constraints that are out of reach of the SMT-LIB language. This section address all three types of ISLa atoms.

SMT-LIB Expressions

ISLa embeds the SMT-LIB language. Since all ISLa variables are strings, the SMT-LIB string theory is the most relevant theory in the ISLa context. The function str.to.int3 converts strings to integers, such that integer operations using the integer theory are possible. A typical SMT-LIB ISLa constraint (inspired by our ISLa tutorial) is (>= (str.to.int pagesize) 100). For this to work, all derivation trees that can be substituted for the pagesize variable have to be positive integers (cf. the response by Z3’s lead developer Nikolaj Bjorner in this GitHub issue). SMT-LIB uses a Lisp-like S-expression syntax. We abstain from discussing this syntax here and instead refer to the SMT-LIB documentation.

Free variables. The set \(freeVars(\varphi)\) for an SMT-LIB expression \(\varphi\) consists of all symbols not part of the SMT-LIB language and not contained in one of the (built-in) SMT-LIB theories, in particular the integer and string theories.

We assume a function \(\mathit{sat}\) mapping an SMT-LIB formula (expression of Boolean type) to the values \(\mathit{SAT}\), \(\mathit{UNSAT}\), or \(\mathit{UNDEFINED}\). \(\mathit{SAT}\) means that there exists a variable assignment for which the formula holds. A returned \(\mathit{UNSAT}\) value implies that there does not exist any such an assignment. Furthermore, the \(\mathit{UNDEFINED}\) value is issued if no definitive decision could be made (e.g., due to a timeout or a prover insufficiency). We will not define \(\mathit{sat}\) formally in this document, since it is no original contribution of ISLa. The ISLa solver implements \(\mathit{sat}\) by calling the Z3 theorem prover.

Semantics. Let \(\beta\) be a variable assignment and \(\varphi\) an SMT-LIB formula. Furthermore, let \(\varphi’\) be a formula resulting from negating the instantiation of \(\varphi\) according to \(\beta\). Then, \(\beta\models\varphi\) holds if, and only if, \(\mathit{sat}(\varphi’)=\mathit{UNSAT}\). When instantiating \(\varphi\), we have to convert the derivation trees in \(\beta\) to strings first, since SMT and Z3 do not know derivation trees.

Structural Predicates

Consider the assignment language from the introduction. The relevant semantic property for this language is that for each right-hand side variable, there has to exist a preceding assignment to that variable. SMT-LIB expressions are insufficient for expressing such structural relations because they are unaware of grammatical structure. In simple cases, we could in principle express them using string functions (e.g., comparing indices). Too many such constraints, however, often result in solver timeouts. More complex use cases (e.g., some element has to occur inside or as the n-th child of another one) can even be impossible to express using SMT-LIB alone. To that end, ISLa introduces so-called structural predicates. These predicates mostly reason about the relative positions of elements in a parse tree. An example is the before predicate used for the assignment language example:

forall <assgn> assgn:
  exists <assgn> decl: (
    before(decl, assgn) and 
    assgn.<rhs>.<var> = decl.<var>
  )

This predicate accepts two variables as arguments, and holds if, and only if, the first argument occurs earlier in the reference tree than the second one. The reference tree is the instantiation of the global constant (implicitly) declared in an ISLa specification. Structural predicates might also accept string literals as arguments. For example, level("GE", "<block>", decl, expr), inspired by the definition-use constraint of a C-like language, expresses that the declaration decl has to occur at the same or a greater <block> level than expr, as in {int x; {int y = x;}}, where expr would be the x in int y = x; and decl the int x; declaration. The program {{int x;} int y = x;} would not satisfy this constraint.

In the following table, we informally describe the meaning of the ISLa built-in structural predicates.

Predicate Intuitive Meaning
after(node_1, node_2) node_1 occurs after node_2 (not below) in the parse tree.
before(node_1, node_2) node_1 occurs before node_2 (not below) in the parse tree.
consecutive(node_1, node_2) node_1 and node_2 are consecutive leaves in the parse tree.
different_position(node_1, node_2) node_1 and node_2 occur at different positions (cannot be the same node).
direct_child(node_1, node_2) node_1 is a direct child of node_2 in the derivation tree.
inside(node_1, node_2) node_1 is a subtree of node_2.
level(PRED, NONTERMINAL, node_1, node_2) node_1 and node_2 are related relatively to each other as specified by PRED and NONTERMINAL (see below). PRED and NONTERMINAL are Strings.
nth(N, node_1, node_2) node_1 is the N-th occurrence of a node with its nonterminal symbol within node_2. N is a numeric String.
same_position(node_1, node_2) node_1 and node_2 occur at the same position (have to be the same node).

Free variables. All structural predicate atoms \(\varphi\) are of the form \(\mathit{pred}(a_1,a_2,\dots,a_n)\). The free variables \(\mathit{free_vars}(\varphi)\) of \(\varphi\) consists of all those \(a_i\) that are not string literals.

Semantics. To evaluate structural predicates, we need to access elements in the derivation tree assigned to the global ISLa constant \(c\) in the variable assignment \(\beta\). Thus, we assume that we know \(c\).4 The semantics of structural predicate atoms can only be defined for each predicate individually. Below, we demonstrate this along the example of before.

Semantics of before. Since before reasons about relative positions of derivation trees, we need to encode these positions. To that end, we introduce the notion of paths in derivation trees. A path is a (potentially empty) tuple of natural numbers \((n_1,n_2,\dots,n_2)\). In a tree \(t\), the empty path \(()\) points to the tree itself, the path \((n)\) to the \(n\)-th child of \(t\)’s root, the path \((n,m)\) to the \(m\)-th child of the \(n\)-th child of the root, and so on. Counting start from 0, so \((0)\) is the path of the root’s first child. Below, we indicate the paths of each subtree in the derivation tree from the introduction:

<stmt>                 <- ()
├─ <assgn>             <- (0)
│  ├─ <var>            <- (0,0)
│  │  └─ "x"           <- (0,0,0)
│  ├─ " := "           <- (0,1)
│  └─ <rhs>            <- (0,2)
│     └─ <digit>       <- (0,2,0)
│        └─ "1"        <- (0,2,0,0)
├─ " ; "               <- (1)
└─ <stmt>              <- (2)
   └─ <assgn>          <- (2,0)
      ├─ <var>         <- (2,0,0)
      │  └─ "y"        <- (2,0,0,0)
      ├─ " := "        <- (2,0,1)
      └─ <rhs>         <- (2,0,2)
         └─ <var>      <- (2,0,2,0)
            └─ "x"     <- (2,0,2,0,0)

The subtree with path \((2,0,2)\) in this tree, for example, is the tree

<rhs>
└─ <var>
   └─ "x"

We assume a partial function \(\mathit{path}(t_1, t_2)\) that returns that path pointing to \(t_1\) in the tree \(t_2\).5 The function is defined if, and only if, \(t_1\) occurs in \(t_1\). For example, if \(t_1\) is the <rhs>-tree above and \(t_2\) the <stmt> tree from which we extracted it, \(\mathit{path}(t_1, t_2)=(0,2,0)\).

Let \(\beta\) be a variable assignment, \(c\) the global constant from the ISLa specification, and \(p_i\) be the paths \(\mathit{path}(t_i, \beta(c))\) for \(i=1,2\). Now, \(\beta\models\mathit{before}(t_1, t_2)\) holds if, and only if, the function \(\mathit{isBefore}(p_1, p_2)\) returns \(\top\). We recursively define \(\mathit{isBefore}\) as follows:

\[\mathit{isBefore}(p_1, p_2) = \begin{cases} \bot & \text{if } p_1 = () \text{ or } p_2 = () \\ \bot & \text{if } p_2^1 < p_1^1 \\ \top & \text{if } p_1^1 < p_2^1 \\ isBefore((\mathit{tail}(p_1), \mathit{tail}(p_2)) & \text{otherwise} \end{cases}\]

where \(p^1\) is the first element of the tuple \(p\) and \(\mathit{tail}(p)\) is the (possibly empty) tuple resulting from removing the first element of \(p\).

The remaining structural predicates can be defined similarly. Only for predicates like level we additionally need to retrieve labels of subtrees (grammar symbols) in addition to paths.

Semantic Predicates

Semantic predicates were introduced mainly to compensate shortcomings of SMT-LIB expressions. A classic example would be a constraint involving a checksum computation. For example, the checksum for TAR archives involves setting the original checksum to spaces, summing up the TAR header, and converting the results to an octal number. Even if this was expressible in SMT-LIB, it would probably result in solver timeouts in many cases. When using semantic predicates in input generation, they not only can evaluate to a Boolean value, but also change the reference derivation tree. For example, they might assign the part of the tree corresponding to the checksum value with a tree representing the correct checksum.

The below table displays the semantic predicates currently supported by ISLa. For special purposes, it is possible to add dedicated semantic predicates (e.g., for checksum computation in binary formats) that must be implemented in Python.

Predicate Intuitive Meaning
count(in_tree, NEEDLE, NUM) There are NUM occurrences of the NEEDLE nonterminal in in_tree. NEEDLE is a String, NUM a numeric String or int variable.

The currently only standard semantic ISLa predicate is count. For example, count(in_tree, "<line>", "3") holds if, and only if, there exist exactly three children labeled with the <line> nonterminal inside the derivation tree in_tree. What distinguishes count from a structural predicate is, e.g., that it is possible to pass a numeric variable (see the section on numeric quantifiers) instead of a string literal as the third argument. Structural predicates could not handle this: Whether the predicate holds or not would depend on the actual—not present—value of that variable. The semantic predicate count can, however, count the <line> occurrences in in_tree and assign the resulting number to the variable.

Free variables and semantics. For the purpose of defining the semantics of semantic predicates, we can, however, re-use the definitions from the section on structural predicates. Since the variable assignment \(\beta\) must assign values to all variables, there do not occur any remaining “free” variables in the arguments of a semantic predicates. Thus, also a semantic predicate atom can either hold or not hold for a given variable assignment and list of arguments.

Propositional Combinators

Propositional combinators allow combining ISLa formulas to more complex specifications. ISLa supports not, and, and or. Additionally, one can use the derived combinators xor, implies, and iff. The formula A xor B is translated to (A and (not B)) or (B and (not A)); A implies B is translated to (not A) or B; and A iff B is translated to (A and B) or ((not A) and (not B)).

In many cases, we can omit parentheses based on the operator precedence of propositional combinators. This precedence is specified by the order of the combinators in the ISLa parser grammar. The list of operators ordered by their binding strength in increasing order is not, and, or, xor, implies, iff. Thus, (A and (not B)) or (B and (not A)) (the encoding of A xor B) is equivalent to A and not B or B and not A—we do not need any parentheses.

Free variables. For not, we define \(\mathit{freeVars}(\mathtt{not}~\varphi):=\mathit{freeVars}(\varphi)\). Let \(\circ\) be one of the binary propositional combinators and \(\varphi,\,\psi\) be two ISLa formulas. Then,

\[\mathit{freeVars}(\varphi\circ\psi):=\mathit{freeVars}(\varphi)\cup\mathit{freeVars}(\psi)\]

Semantics. We define the semantics for not, and, and or as follows:

Quantifiers

Quantifiers come in two flavors in ISLa. First, we have quantifiers over derivation trees, e.g., forall <type> name in tree: .... In the introduction, we have shown examples for those. A second type of quantifier are the quantifiers over integers. They have the form forall int name: ... or exists int name: .... The variable name is, as all ISLa variables, of string sort, but ranges over numeric values.

Tree Quantifiers

There are four types of tree quantifiers in ISLa, as universal quantifiers (forall) and existential quantifiers (exists) can come with or without a specified match expression. We first discuss the semantics of tree quantifiers without match expressions, and then focus on those with match expressions.

Tree Quantifiers without Match Expressions

Intuitively, a formula forall <type> name in tree: A is true if A holds for all subtrees in tree labeled with <type>. Conversely, exists <type> name in tree: A holds if there is just some such subtree in tree.

Free variables. In a quantified tree formula without match expression, the free variables are the free variables of the formula in the core of the quantifier as well as the variable after in; the variable name is excluded from this set, since it is bound by the quantifier. Let \(Q\) be either forall or exists. We define

\[\mathit{freeVars}(Q~T~v~\mathtt{in}~\mathit{t}:~\varphi) :=\left(\mathit{freeVars}(\varphi)\cup\{t\}\right)\setminus\{v\}\]

Semantics. Let \(\mathit{subtrees}(N, t)\) be the subtrees of the derivation tree \(t\) labeled with the nonterminal symbol \(N\). Then, we define the semantics of quantified formulas without match expressions as follows:

where \(\beta[v\mapsto{}t’]\) is the variable assignment mapping variable \(v\) to the derivation tree \(t’\) and all other variables to their mapping in \(\beta\).

Tree Quantifiers with Match Expressions

A match expression is an abstract word from the language of the reference grammar. For example, x := y is a concrete word from our assignment language (see introduction); <var> := y would be a valid match expression. Additionally, match expressions allow binding elements of that “abstract” word to variables, as in {<var> myvar} := y. By using match expressions in quantifiers, we can (1) restrict the possible instantiations of the bound variable to those that match the match expression, and (2) access elements of these instantiations by giving them names that can be referred in the quantified formula.

To increase the flexibility of match expressions, they can contain optional elements inside square brackets [...]. Elements inside these optionals cannot be bound to variables: Only plain terminal and nonterminal symbols may occur there. For example, <assgn>[ ; <stmt>] is a possible match expression for a <stmt> nonterminal in the assignment language. It matches statements consisting of a single assignment and one with an assignment followed by a statement. Match expressions can also go “deeper.” The expression <assgn>[ ; <assgn>] is also a valid match expression for a <stmt>, matching any statement consisting of exactly one or two assignments.

Free variables. For a match expression \(\mathit{mexpr}\), let \(\mathit{freeVars}(\mathit{mexpr})\) be the variables specified in curly braces inside \(\mathit{mexpr}\). Let \(Q\) be forall or exists. Then, we define

\[\mathit{freeVars}(Q~T~v\mathtt{=}\text{"$\mathit{mexpr}$"}~\mathtt{in}~\mathit{t}:~\varphi) :=\left(\mathit{freeVars}(\varphi)\cup\mathit{freeVars}(\mathit{mexpr})\cup\{t\}\right)\setminus\{v\}\]

To define the semantics of quantifiers with match expressions, we parse the match expression into a set of (typically open, i.e., containing nonterminal leaves) derivation trees. We can get more than one result: Either because optionals are used, or because the reference grammar is ambiguous. If the match expression contains variables, we record the paths from the root of these trees to the nodes corresponding to the bound elements. Then, we continue as for quantified formulas without match expressions: If Q <type> var="mexpr" in tree: A is the formula (Q being either forall or exists), we collect subtrees from tree with roots labeled with <type>. Then, however, we select only those of which some derivation tree for the match expression is a prefix. Then, we instantiate var and all variables in mexpr according to the found match, and check whether A holds for all/one of these instantiations.

To make this—slightly—more formal, we “define” a function \(\mathit{mexprTrees}\) by example. The result of \(\mathit{mexprTrees}(T,\mathit{mexpr})\), for a nonterminal symbol \(T\) match expression \(\mathit{mexpr}\), is a set of pairs \((t,P)\) of (1) a derivation tree \(t\) corresponding to the parsed match expression, where \(T\) is used as a start symbol (such that the root of \(t\) is labeled with \(T\)), and (2) a mapping \(P\) of variables bound in match expressions to the paths where the corresponding tree elements occur in \(t\).

Consider again the assignment language and let the match expression \(\mathit{mexpr}\) be {<var> lhs} := {<var> rhs}[ ; <assgn>]. We visualize the result of \(\mathit{mexprTrees}(\mathtt{\langle{}stmt\rangle},\mathit{mexpr})\) by indicating the mapping \(P\) in the textual representation of the two resulting derivation trees. The first element of the result set, where the optional part has been left out, is

stmt
└─ assgn
   ├─ var      lhs ↦ (0,0)
   ├─ " := "
   └─ rhs
      └─ var   rhs ↦ (0,2,0)

The second element, which contains the optional second assignment, is

stmt
├─ assgn
│  ├─ var      lhs ↦ (0,0)
│  ├─ " := "
│  └─ rhs
│     └─ var   rhs ↦ (0,2,0)
├─ " ; "
└─ stmt
   └─ assgn

We need a function \(\mathit{match}(t, t’, P)\) taking a derivation tree \(t\) and a pair \(t’,P\) computed by \(\mathit{mexprTrees}\) and returning (1) \(\bot\) if \(t’\) does not match \(t\), or otherwise (2) a variable assignment mapping variables in \(P\) to the corresponding subtrees in \(t\).

We now recursively define \(\mathit{match}\). Thereby,

\[\mathit{match}(t, t', P) := \begin{cases} \bot & \text{if }l(t)\neq{}l(t')\vee(\mathit{numc}(t')>0\wedge \\ & \qquad\mathit{numc}(t)\neq\mathit{numc}(t')) \\ [v\mapsto{}t] & \text{if }P=[v\mapsto{}()]\text{ for some }v \\ \bot & \text{if }\mathit{match}(\mathit{child}(t, i), \mathit{child}(t', i), P_i)=\bot \\ & \qquad\text{for any }i\in[1,\dots,\mathit{numc}(t)] \\ \bigcup_{i=1}^{\mathit{numc}(t)}\Big( & \\ \quad\mathit{match}\big(\mathit{child}(t, i), & \\ \quad\hphantom{\mathit{match}\big(}\mathit{child}(t', i), P_i\big)\Big) & \text{otherwise} \\ \end{cases}\]

Based on \(match\) and \(mexprTrees\), we can now define the semantics of quantified formulas with match expressions.

Numeric Quantifiers

Numeric quantifiers are of the form forall int var; A or exists int var; A. They allow introducing a fresh variable, unconnected to the reference grammar, ranging over strings representing positive integers. We only allow positive integers since the SMT solver used by ISLa, Z3, does not support converting negative numeric strings to integers and back (see this GitHub issue). The main use case of numeric quantifiers is connecting several semantic predicates. Consider, for example, the following ISLa constraint for CSV files:

exists int colno: (
  str.to.int(colno) >= 3 and 
  str.to.int(colno) <= 5 and 
  count(<csv-header>, "<column>", colno) and 
  forall <csv-record> line:
    count(line, "<column>", colno))

This constraint expresses that the number of columns in the <csv-header> field of a CSV header is between 3 and 5, and that all lines after the header have to share the same number of columns. If it was possible to express the count predicate using pure SMT-LIB expressions, we could have expressed this constraint without semantic predicates and, consequently, without the existential integer quantifier by forming an equation of the number of columns in the header and the number of columns in each line.

The forall int quantifier expresses that a property has to hold for all possible positive integers. In our experience, it is not often used in practice. However, if we wanted to negate the above property for CSV files, the result would contain such a quantifier.

Free variables. The free variables set is computed similarly as for tree quantifiers. Let \(Q\) be forall or exists. We define

\[\mathit{freeVars}(Q~\mathtt{int}~v:~\varphi):=\mathit{freeVars}(\varphi)\setminus\{v\}\]

Semantics. Let \(N\) be the set of all derivation trees whose string representation correspond to that of a positive integer, e.g., “0”, 1”, “17”, etc. We define the semantics of quantified integer formulas as follows:

Footnotes

  1. From ISLa 0.8.14 on, the ISLaSolver and the evaluate function both accept grammars in concrete syntax in addition to the Python dictionary format of the Fuzzing Book

  2. For example, John E. Hopcroft, Rajeev Motwani, Jeffrey D. Ullman: Introduction to Automata Theory, Languages, and Computation, 3rd Edition. Pearson international edition, Addison-Wesley 2007, ISBN 978-0-321-47617-3. 

  3. In the SMT-LIB standard, this function is called str.to_int. ISLa, however, uses the Z3 SMT solver, where the corresponding function has the name str.to.int. Obviously, Z3 supported str.to.int before str.to_int became an official standard

  4. In the ISLa implementation, we distinguish two types (in the sense of object-oriented classes) of variables: bound variables (bound by quantifiers) and constants. Thus, we can always extract reference trees from variable assignments. 

  5. In the ISLa system, derivation trees have unique identifiers, such that we can distinguish trees with identical structure.