Lexical Format

Characters

The text format assigns meaning to source text, which consists of a sequence of characters. Characters are assumed to be represented as valid Unicode (Section 2.4) scalar values.

\[\begin{split}\begin{array}{llll} \def\mathdef3961#1{{}}\mathdef3961{source} & \href{../text/lexical.html#text-source}{\mathtt{source}} &::=& \href{../text/lexical.html#text-char}{\mathtt{char}}^\ast \\ \def\mathdef3961#1{{}}\mathdef3961{character} & \href{../text/lexical.html#text-char}{\mathtt{char}} &::=& \def\mathdef4001#1{\mathrm{U{+}#1}}\mathdef4001{00} ~|~ \dots ~|~ \def\mathdef4002#1{\mathrm{U{+}#1}}\mathdef4002{D7FF} ~|~ \def\mathdef4003#1{\mathrm{U{+}#1}}\mathdef4003{E000} ~|~ \dots ~|~ \def\mathdef4004#1{\mathrm{U{+}#1}}\mathdef4004{10FFFF} \\ \end{array}\end{split}\]
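As an illustration only, the character production can be read as a range check on code points. The sketch below uses hypothetical helper names (`is_char`, `is_source`) that do not appear in the grammar, and it assumes the source has already been decoded into a Python string.

```python
def is_char(code_point: int) -> bool:
    """Return True if code_point lies in U+00..U+D7FF or U+E000..U+10FFFF.

    These two ranges are exactly the Unicode scalar values, i.e. every
    code point except the surrogates U+D800..U+DFFF.
    """
    return 0x00 <= code_point <= 0xD7FF or 0xE000 <= code_point <= 0x10FFFF


def is_source(text: str) -> bool:
    """Return True if every character of text matches the char production."""
    return all(is_char(ord(c)) for c in text)
```

A lone surrogate such as U+D800 is rejected, while any ordinary text, including non-ASCII text, is accepted.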

Note

While source text may contain any Unicode character in comments or string literals, the rest of the grammar is formed exclusively from the characters supported by the 7-bit ASCII subset of Unicode.

Tokens

The character stream in the source text is divided, from left to right, into a sequence of tokens, as defined by the following grammar.

\[\begin{split}\begin{array}{llll} \def\mathdef3961#1{{}}\mathdef3961{token} & \href{../text/lexical.html#text-token}{\mathtt{token}} &::=& \href{../text/lexical.html#text-keyword}{\mathtt{keyword}} ~|~ \href{../text/values.html#text-int}{\def\mathdef3983#1{{\mathtt{u}#1}}\mathdef3983{N}} ~|~ \href{../text/values.html#text-int}{\def\mathdef3989#1{{\mathtt{s}#1}}\mathdef3989{N}} ~|~ \href{../text/values.html#text-float}{\def\mathdef3997#1{{\mathtt{f}#1}}\mathdef3997{N}} ~|~ \href{../text/values.html#text-string}{\mathtt{string}} ~|~ \href{../text/values.html#text-id}{\mathtt{id}} ~|~ \def\mathdef4005#1{\mbox{‘}\mathtt{#1}\mbox{’}}\mathdef4005{(} ~|~ \def\mathdef4006#1{\mbox{‘}\mathtt{#1}\mbox{’}}\mathdef4006{)} ~|~ \href{../text/lexical.html#text-reserved}{\mathtt{reserved}} \\ \def\mathdef3961#1{{}}\mathdef3961{keyword} & \href{../text/lexical.html#text-keyword}{\mathtt{keyword}} &::=& (\def\mathdef4007#1{\mbox{‘}\mathtt{#1}\mbox{’}}\mathdef4007{a} ~|~ \dots ~|~ \def\mathdef4008#1{\mbox{‘}\mathtt{#1}\mbox{’}}\mathdef4008{z})~\href{../text/values.html#text-idchar}{\mathtt{idchar}}^\ast \qquad (\mathrel{\mbox{if}}~\mbox{occurring as a literal terminal in the grammar}) \\ \def\mathdef3961#1{{}}\mathdef3961{reserved} & \href{../text/lexical.html#text-reserved}{\mathtt{reserved}} &::=& (\href{../text/values.html#text-idchar}{\mathtt{idchar}} ~|~ \href{../text/values.html#text-string}{\mathtt{string}} ~|~ \def\mathdef4009#1{\mbox{‘}\mathtt{#1}\mbox{’}}\mathdef4009{,} ~|~ \def\mathdef4010#1{\mbox{‘}\mathtt{#1}\mbox{’}}\mathdef4010{;} ~|~ \def\mathdef4011#1{\mbox{‘}\mathtt{#1}\mbox{’}}\mathdef4011{[} ~|~ \def\mathdef4012#1{\mbox{‘}\mathtt{#1}\mbox{’}}\mathdef4012{]} ~|~ \def\mathdef4013#1{\mbox{‘}\mathtt{#1}\mbox{’}}\mathdef4013{\{} ~|~ \def\mathdef4014#1{\mbox{‘}\mathtt{#1}\mbox{’}}\mathdef4014{\}})^+ \\ \end{array}\end{split}\]

Tokens are formed from the input character stream according to the longest match rule. That is, the next token always consists of the longest possible sequence of characters that is recognized by the above lexical grammar. Tokens can be separated by white space, but except for strings, they cannot themselves contain white space.

Keyword tokens are defined either implicitly by an occurrence of a terminal symbol in literal form, such as \(\def\mathdef4015#1{\mbox{‘}\mathtt{#1}\mbox{’}}\mathdef4015{keyword}\), in a syntactic production of this chapter, or explicitly where they arise in this chapter.

Any token that does not fall into any of the other categories is considered reserved, and cannot occur in source text.

Note

The effect of defining the set of reserved tokens is that all tokens must be separated by either parentheses, white space, or comments. For example, \(\def\mathdef4016#1{\mbox{‘}\mathtt{#1}\mbox{’}}\mathdef4016{0\$x}\) is a single reserved token, as is \(\def\mathdef4017#1{\mbox{‘}\mathtt{#1}\mbox{’}}\mathdef4017{"a""b"}\). Consequently, they are not recognized as two separate tokens \(\def\mathdef4018#1{\mbox{‘}\mathtt{#1}\mbox{’}}\mathdef4018{0}\) and \(\def\mathdef4019#1{\mbox{‘}\mathtt{#1}\mbox{’}}\mathdef4019{\$x}\), or \("a"\) and \("b"\), respectively, but instead disallowed. This property of tokenization is not affected by the fact that the definition of reserved tokens overlaps with other token classes.
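The following is a minimal sketch of the longest match rule, not a complete lexer. It only splits input into lexemes and does not classify them as keywords, numbers, identifiers, or reserved tokens. Several things are assumptions made for illustration: the function names `scan_string` and `split_lexemes` are hypothetical, the set of identifier characters is taken from the definition of idchar in the values part of the text format rather than from this section, string escapes are handled in a simplified way, and comments and annotations are not handled at all.

```python
# Assumed set of idchar characters (defined outside this section).
IDCHARS = set("0123456789"
              "abcdefghijklmnopqrstuvwxyz"
              "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
              "!#$%&'*+-./:<=>?@\\^_`|~")
RESERVED_PUNCT = set(",;[]{}")   # extra characters allowed in reserved tokens
BLANKS = set(" \t\n\r")          # space and the three formatting characters


def scan_string(src: str, pos: int) -> int:
    """Return the index just past the string literal opening at src[pos]."""
    i = pos + 1
    while i < len(src):
        if src[i] == "\\":
            i += 2               # simplified: skip one escaped character
        elif src[i] == '"':
            return i + 1
        else:
            i += 1
    raise ValueError("unterminated string literal")


def split_lexemes(src: str) -> list[str]:
    """Split src into lexemes, always taking the longest possible match."""
    lexemes = []
    pos = 0
    while pos < len(src):
        c = src[pos]
        if c in BLANKS:
            pos += 1
        elif c in "()":
            lexemes.append(c)
            pos += 1
        else:
            # Consume a maximal run of idchars, strings, and reserved
            # punctuation; adjacent pieces fuse into a single lexeme.
            start = pos
            while pos < len(src):
                c = src[pos]
                if c == '"':
                    pos = scan_string(src, pos)
                elif c in IDCHARS or c in RESERVED_PUNCT:
                    pos += 1
                else:
                    break
            if pos == start:
                raise ValueError(f"illegal character at offset {start}")
            lexemes.append(src[start:pos])
    return lexemes
```

Under these assumptions, `split_lexemes('0$x')` yields one lexeme rather than two, and so does `split_lexemes('"a""b"')`, which mirrors the examples in the note above. A full lexer would then reject both as reserved tokens.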

White Space

White space is any sequence of literal space characters, formatting characters, comments, or annotations. The allowed formatting characters correspond to a subset of the ASCII format effectors, namely, horizontal tabulation (\(\def\mathdef4020#1{\mathrm{U{+}#1}}\mathdef4020{09}\)), line feed (\(\def\mathdef4021#1{\mathrm{U{+}#1}}\mathdef4021{0A}\)), and carriage return (\(\def\mathdef4022#1{\mathrm{U{+}#1}}\mathdef4022{0D}\)).

\[\begin{split}\begin{array}{llclll@{\qquad\qquad}l} \def\mathdef3961#1{{}}\mathdef3961{white space} & \href{../text/lexical.html#text-space}{\mathtt{space}} &::=& (\def\mathdef4023#1{\mbox{‘}\mathtt{#1}\mbox{’}}\mathdef4023{~~} ~|~ \href{../text/lexical.html#text-format}{\mathtt{format}} ~|~ \href{../text/lexical.html#text-comment}{\mathtt{comment}} ~|~ \href{../text/lexical.html#text-annot}{\mathtt{annot}})^\ast \\ \def\mathdef3961#1{{}}\mathdef3961{format} & \href{../text/lexical.html#text-format}{\mathtt{format}} &::=& \href{../text/lexical.html#text-newline}{\mathtt{newline}} ~|~ \def\mathdef4024#1{\mathrm{U{+}#1}}\mathdef4024{09} \\ \def\mathdef3961#1{{}}\mathdef3961{newline} & \href{../text/lexical.html#text-newline}{\mathtt{newline}} &::=& \def\mathdef4025#1{\mathrm{U{+}#1}}\mathdef4025{0A} ~|~ \def\mathdef4026#1{\mathrm{U{+}#1}}\mathdef4026{0D} ~|~ \def\mathdef4027#1{\mathrm{U{+}#1}}\mathdef4027{0D}~\def\mathdef4028#1{\mathrm{U{+}#1}}\mathdef4028{0A} \\ \end{array}\end{split}\]

The only relevance of white space is to separate tokens. It is otherwise ignored.
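The newline production has three alternatives, and the pair U+0D U+0A is a single newline rather than two. A tool that reports line numbers might handle this as in the sketch below; the names `newline_length` and `count_lines` are hypothetical, and line counting itself is not something the grammar requires.

```python
def newline_length(src: str, pos: int) -> int:
    """Return the length of the newline at src[pos], or 0 if there is none.

    The two-character form U+0D U+0A is tried first so that it is
    treated as one newline (longest match).
    """
    if src.startswith("\r\n", pos):
        return 2
    if pos < len(src) and src[pos] in "\r\n":
        return 1
    return 0


def count_lines(src: str) -> int:
    """Count lines in src, treating each newline as one line break."""
    lines = 1
    pos = 0
    while pos < len(src):
        n = newline_length(src, pos)
        if n:
            lines += 1
            pos += n
        else:
            pos += 1
    return lines
```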

Comments

A comment can either be a line comment, started with a double semicolon \(\def\mathdef3982#1{\mbox{‘}\mathtt{#1}\mbox{’}}\mathdef3982{{;}{;}}\) and extending to the end of the line, or a block comment, enclosed in delimiters \(\def\mathdef3980#1{\mbox{‘}\mathtt{#1}\mbox{’}}\mathdef3980{{(}{;}} \dots \def\mathdef3981#1{\mbox{‘}\mathtt{#1}\mbox{’}}\mathdef3981{{;}{)}}\). Block comments can be nested.

\[\begin{split}\begin{array}{llclll@{\qquad\qquad}l} \def\mathdef3961#1{{}}\mathdef3961{comment} & \href{../text/lexical.html#text-comment}{\mathtt{comment}} &::=& \href{../text/lexical.html#text-comment}{\mathtt{linecomment}} ~|~ \href{../text/lexical.html#text-comment}{\mathtt{blockcomment}} \\ \def\mathdef3961#1{{}}\mathdef3961{line comment} & \href{../text/lexical.html#text-comment}{\mathtt{linecomment}} &::=& \def\mathdef3982#1{\mbox{‘}\mathtt{#1}\mbox{’}}\mathdef3982{{;}{;}}~~\href{../text/lexical.html#text-comment}{\mathtt{linechar}}^\ast~~(\href{../text/lexical.html#text-newline}{\mathtt{newline}} ~|~ \mathtt{eof}) \\ \def\mathdef3961#1{{}}\mathdef3961{line character} & \href{../text/lexical.html#text-comment}{\mathtt{linechar}} &::=& c{:}\href{../text/lexical.html#text-char}{\mathtt{char}} & (\mathrel{\mbox{if}} c \neq \def\mathdef4029#1{\mathrm{U{+}#1}}\mathdef4029{0A} \land c \neq \def\mathdef4030#1{\mathrm{U{+}#1}}\mathdef4030{0D}) \\ \def\mathdef3961#1{{}}\mathdef3961{block comment} & \href{../text/lexical.html#text-comment}{\mathtt{blockcomment}} &::=& \def\mathdef3980#1{\mbox{‘}\mathtt{#1}\mbox{’}}\mathdef3980{{(}{;}}~~\href{../text/lexical.html#text-comment}{\mathtt{blockchar}}^\ast~~\def\mathdef3981#1{\mbox{‘}\mathtt{#1}\mbox{’}}\mathdef3981{{;}{)}} \\ \def\mathdef3961#1{{}}\mathdef3961{block character} & \href{../text/lexical.html#text-comment}{\mathtt{blockchar}} &::=& c{:}\href{../text/lexical.html#text-char}{\mathtt{char}} & (\mathrel{\mbox{if}} c \neq \def\mathdef4031#1{\mbox{‘}\mathtt{#1}\mbox{’}}\mathdef4031{;} \land c \neq \def\mathdef4032#1{\mbox{‘}\mathtt{#1}\mbox{’}}\mathdef4032{(}) \\ &&|& \def\mathdef4033#1{\mbox{‘}\mathtt{#1}\mbox{’}}\mathdef4033{;} & (\mathrel{\mbox{if}}~\mbox{the next character is not}~\def\mathdef4034#1{\mbox{‘}\mathtt{#1}\mbox{’}}\mathdef4034{)}) \\ &&|& \def\mathdef4035#1{\mbox{‘}\mathtt{#1}\mbox{’}}\mathdef4035{(} & (\mathrel{\mbox{if}}~\mbox{the next character is not}~\def\mathdef4036#1{\mbox{‘}\mathtt{#1}\mbox{’}}\mathdef4036{;}) \\ &&|& \href{../text/lexical.html#text-comment}{\mathtt{blockcomment}} \\ \end{array}\end{split}\]

Here, the pseudo token \(\mathtt{eof}\) indicates the end of the input. The look-ahead restrictions on the productions for \(\href{../text/lexical.html#text-comment}{\mathtt{blockchar}}\) disambiguate the grammar such that only well-bracketed uses of block comment delimiters are allowed.

Note

Any formatting and control characters are allowed inside comments.
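One way to realize the comment grammar is a pair of scanning functions, with a depth counter standing in for the recursive blockcomment alternative. This is a sketch under stated assumptions: the function names are hypothetical, the input is a decoded Python string, and the end of the string plays the role of the pseudo token eof.

```python
def skip_line_comment(src: str, pos: int) -> int:
    """Return the index just past the line comment starting at src[pos].

    The comment ends with a newline (which is consumed) or at end of input.
    """
    if not src.startswith(";;", pos):
        raise ValueError("not a line comment")
    i = pos + 2
    while i < len(src) and src[i] not in "\r\n":
        i += 1
    if src.startswith("\r\n", i):
        return i + 2             # U+0D U+0A counts as one newline
    if i < len(src):
        return i + 1             # single U+0A or U+0D
    return i                     # eof


def skip_block_comment(src: str, pos: int) -> int:
    """Return the index just past the block comment starting at src[pos].

    Nested block comments are tracked with a depth counter, so only
    well-bracketed uses of the delimiters are accepted.
    """
    if not src.startswith("(;", pos):
        raise ValueError("not a block comment")
    depth = 0
    i = pos
    while i < len(src):
        if src.startswith("(;", i):
            depth += 1
            i += 2
        elif src.startswith(";)", i):
            depth -= 1
            i += 2
            if depth == 0:
                return i
        else:
            i += 1
    raise ValueError("unterminated block comment")
```

Checking for an opening delimiter before a closing one at each position corresponds to the look-ahead conditions on blockchar: a `(` followed by `;` can only start a nested comment, and a `;` followed by `)` can only close one. As a consequence, `(;)` is not a complete comment, whereas `(;;)` is an empty one.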

Annotations

An annotation is a bracketed token sequence headed by an annotation id of the form \(\def\mathdef4037#1{\mbox{‘}\mathtt{#1}\mbox{’}}\mathdef4037{@id}\) or \(\def\mathdef4038#1{\mbox{‘}\mathtt{#1}\mbox{’}}\mathdef4038{@"..."}\). No space is allowed between the opening parenthesis and this id. Annotations are intended to be used for third-party extensions; they can appear anywhere in a program but are ignored by the WebAssembly semantics itself, which treats them as white space.

Annotations can contain other parenthesized token sequences (including nested annotations), as long as they are well-nested. String literals and comments occurring in an annotation must also be properly nested and closed.

\[\begin{split}\begin{array}{llclll@{\qquad\qquad}l} \def\mathdef3961#1{{}}\mathdef3961{annotation} & \href{../text/lexical.html#text-annot}{\mathtt{annot}} &::=& \def\mathdef4039#1{\mbox{‘}\mathtt{#1}\mbox{’}}\mathdef4039{(@}~\href{../text/lexical.html#text-annot}{\mathtt{annotid}} ~(\href{../text/lexical.html#text-space}{\mathtt{space}} ~|~ \href{../text/lexical.html#text-token}{\mathtt{token}})^\ast~\def\mathdef4040#1{\mbox{‘}\mathtt{#1}\mbox{’}}\mathdef4040{)} \\ \def\mathdef3961#1{{}}\mathdef3961{annotation identifier} & \href{../text/lexical.html#text-annot}{\mathtt{annotid}} &::=& \href{../text/values.html#text-idchar}{\mathtt{idchar}}^+ ~|~ \href{../text/values.html#text-name}{\mathtt{name}} \\ \end{array}\end{split}\]
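A tool that does not interpret annotations still has to find where each one ends. The sketch below shows one possible way to do that by tracking parenthesis depth while stepping over strings and comments. It is illustrative only: all function names are hypothetical, the idchar set is assumed from the values part of the text format, string escapes are simplified, and the tokens inside the annotation are not validated.

```python
# Assumed set of idchar characters (defined outside this section).
IDCHARS = set("0123456789"
              "abcdefghijklmnopqrstuvwxyz"
              "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
              "!#$%&'*+-./:<=>?@\\^_`|~")


def skip_string(src: str, pos: int) -> int:
    """Return the index just past the string literal opening at src[pos]."""
    i = pos + 1
    while i < len(src):
        if src[i] == "\\":
            i += 2               # simplified: skip one escaped character
        elif src[i] == '"':
            return i + 1
        else:
            i += 1
    raise ValueError("unterminated string literal")


def skip_block_comment(src: str, pos: int) -> int:
    """Return the index just past the (possibly nested) block comment."""
    depth = 0
    i = pos
    while i < len(src):
        if src.startswith("(;", i):
            depth += 1
            i += 2
        elif src.startswith(";)", i):
            depth -= 1
            i += 2
            if depth == 0:
                return i
        else:
            i += 1
    raise ValueError("unterminated block comment")


def skip_annotation(src: str, pos: int) -> tuple[str, int]:
    """Return (annotation id, index just past the annotation at src[pos])."""
    if not src.startswith("(@", pos):
        raise ValueError("not an annotation")
    start = pos + 2
    # The id must follow '(@' immediately: idchar+ or a quoted name.
    if start < len(src) and src[start] == '"':
        end = skip_string(src, start)
    else:
        end = start
        while end < len(src) and src[end] in IDCHARS:
            end += 1
    if end == start:
        raise ValueError("missing annotation id")
    annot_id = src[start:end]    # a quoted id keeps its quotes here
    i = end
    depth = 1
    while i < len(src):
        c = src[i]
        if c == '"':
            i = skip_string(src, i)
        elif src.startswith(";;", i):
            while i < len(src) and src[i] not in "\r\n":
                i += 1           # line comment runs to the end of the line
        elif src.startswith("(;", i):
            i = skip_block_comment(src, i)
        elif c == "(":
            depth += 1
            i += 1
        elif c == ")":
            depth -= 1
            i += 1
            if depth == 0:
                return annot_id, i
        else:
            i += 1
    raise ValueError("unterminated annotation")
```

Parentheses inside string literals and comments do not affect the depth count, which is why strings and comments must be properly closed before the annotation can end.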

Note

The annotation id is meant to be an identifier categorising the extension, and plays a role similar to the name of a custom section. By convention, annotations corresponding to a custom section should use the custom section’s name as an id.

Implementations are expected to ignore annotations with ids that they do not recognize. On the other hand, they may impose restrictions on annotations that they do recognize, e.g., requiring a specific structure by superimposing a more concrete grammar. It is up to an implementation how it deals with errors in such annotations.