Lexical Format
Characters
The text format assigns meaning to source text, which consists of a sequence of characters. Characters are assumed to be represented as valid Unicode (Section 2.4) scalar values.
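As an illustration of the scalar-value assumption, the following minimal sketch checks that every character of a source string is a Unicode scalar value, that is, a code point outside the surrogate range U+D800 to U+DFFF. The function names `is_unicode_scalar_value` and `check_source` are hypothetical and used only for this example.

```python
def is_unicode_scalar_value(cp: int) -> bool:
    """True if cp is a Unicode scalar value (any code point except surrogates)."""
    return 0 <= cp <= 0xD7FF or 0xE000 <= cp <= 0x10FFFF


def check_source(text: str) -> None:
    """Raise ValueError if text contains a code point that is not a scalar value."""
    for index, ch in enumerate(text):
        if not is_unicode_scalar_value(ord(ch)):
            raise ValueError(
                f"character at index {index} (U+{ord(ch):04X}) is not a scalar value"
            )
```

A Python `str` can hold lone surrogates, so a check of this kind is only relevant when the source text did not come from a strict decoder such as `bytes.decode("utf-8")`.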
Tokens
The character stream in the source text is divided, from left to right, into a sequence of tokens, as defined by the following grammar.
Tokens are formed from the input character stream according to the longest match rule. That is, the next token always consists of the longest possible sequence of characters that is recognized by the above lexical grammar. Tokens can be separated by white space, but except for strings, they cannot themselves contain white space.
Keyword tokens are defined either implicitly, by an occurrence of a terminal symbol in literal form in a syntactic production of this chapter, or explicitly where they arise in this chapter.
Any token that does not fall into any of the other categories is considered reserved, and cannot occur in source text.
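The longest match rule and the role of reserved tokens can be sketched roughly as follows. This is a simplified illustration, not the normative grammar: the character class `IDCHAR`, the simplified `STRING` pattern, the decimal-only number check, and the helper names `tokenize` and `classify` are all assumptions made for this example, and the input is assumed to contain no comments.

```python
import re

# Assumed set of identifier-like characters; this section does not define it.
IDCHAR = r"[0-9A-Za-z!#$%&'*+\-./:<=>?@\\^_`|~]"
# Simplified string literal: quotes around ordinary characters or backslash escapes.
STRING = r'"(?:[^"\\]|\\.)*"'
# A maximal run of identifier-like characters and strings forms one token.
CHUNK = re.compile(rf"(?:{IDCHAR}|{STRING})+")


def tokenize(src: str) -> list[str]:
    """Split src into tokens, always taking the longest possible match."""
    tokens = []
    pos = 0
    while pos < len(src):
        ch = src[pos]
        if ch in " \t\n\r":          # white space only separates tokens
            pos += 1
        elif ch in "()":             # parentheses are tokens on their own
            tokens.append(ch)
            pos += 1
        else:
            m = CHUNK.match(src, pos)
            if m is None:
                raise ValueError(f"unexpected character {ch!r} at index {pos}")
            tokens.append(m.group())
            pos = m.end()
    return tokens


def classify(tok: str) -> str:
    """Assign a rough category; anything unrecognized is reserved."""
    if tok in ("(", ")"):
        return "paren"
    if re.fullmatch(STRING, tok):
        return "string"
    if re.fullmatch(r"\$" + IDCHAR + "+", tok):
        return "id"
    if re.fullmatch(r"[0-9]+", tok):  # decimal integers only, for brevity
        return "number"
    if re.fullmatch(r"[a-z]" + IDCHAR + "*", tok):
        return "keyword-shaped"
    return "reserved"
```

Under these assumptions, `tokenize('(func $f 0$x)')` yields five tokens, the fourth of which, `0$x`, is classified as reserved rather than being split into a number and an identifier.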
Note
The effect of defining the set of reserved tokens is that all tokens must be separated by either parentheses, white space, or comments.
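For example, '0$x' is a single reserved token, as is '"a""b"'. Consequently, they are not recognized as two separate tokens '0' and '$x', or '"a"' and '"b"', respectively, but instead disallow the enclosing module.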
White Space
White space is any sequence of literal space characters, formatting characters, or comments.
The allowed formatting characters correspond to a subset of the ASCII format effectors, namely, horizontal tabulation (U+09), line feed (U+0A), and carriage return (U+0D).
The only relevance of white space is to separate tokens. It is otherwise ignored.
Comments
A comment can either be a line comment, started with a double semicolon ';;' and extending to the end of the line, or a block comment, enclosed in the delimiters '(;' and ';)'.
Block comments can be nested.
Here, the pseudo token eof indicates the end of the input.
The look-ahead restrictions on the productions for blockchar disambiguate the grammar such that only well-bracketed uses of block comment delimiters are allowed.
Note
Any formatting and control characters are allowed inside comments.
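The treatment of white space and nested block comments described above can be sketched as a small scanner. This is an illustrative sketch under stated assumptions, not the normative grammar: the function name `skip_white_space` is hypothetical, and an unterminated block comment is reported with a `ValueError` as one possible way of rejecting the input.

```python
def skip_white_space(src: str, pos: int = 0) -> int:
    """Return the index of the first character at or after pos that is not white space.

    White space here means spaces, tab, line feed, carriage return,
    line comments (';;' to end of line or end of input), and
    block comments ('(;' ... ';)'), which may nest.
    """
    n = len(src)
    while pos < n:
        if src[pos] in " \t\n\r":
            pos += 1
        elif src.startswith(";;", pos):
            # A line comment ends at the next line feed or at the end of the input.
            newline = src.find("\n", pos)
            pos = n if newline == -1 else newline + 1
        elif src.startswith("(;", pos):
            depth = 0
            while True:
                if src.startswith("(;", pos):    # opening delimiter, possibly nested
                    depth += 1
                    pos += 2
                elif src.startswith(";)", pos):  # closing delimiter
                    depth -= 1
                    pos += 2
                    if depth == 0:
                        break
                elif pos >= n:
                    raise ValueError("unterminated block comment")
                else:
                    pos += 1                     # any other character is allowed
        else:
            break
    return pos
```

Checking for the two-character delimiters before consuming a single character plays the role of the look-ahead restriction: a '(' followed by ';' always opens a nested comment, and a ';' followed by ')' always closes one, so only well-bracketed delimiter uses are accepted.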