Specification

Note: This specification is a work in progress.

Introduction

This is a reference manual for the Antimony programming language.

Antimony is a general-purpose language designed with simplicity in mind. It is strongly typed and supports multiple compile targets. Programs are constructed from modules, whose properties allow efficient management of dependencies.

Notation

The syntax is specified using a variant of Extended Backus-Naur Form (EBNF):

Production  = production_name "=" [ Expression ] "." .
Expression  = Alternative { "|" Alternative } .
Alternative = Term { Term } .
Term        = production_name | token [ "..." token ] | Group | Option | Repetition .
Group       = "(" Expression ")" .
Option      = "[" Expression "]" .
Repetition  = "{" Expression "}" .

Productions are expressions constructed from terms and the following operators, in increasing precedence:

|   alternation
()  grouping
[]  option (0 or 1 times)
{}  repetition (0 to n times)

Lower-case production names are used to identify lexical tokens. Non-terminals are in CamelCase. Lexical tokens are enclosed in double quotes "" or back quotes ``.

The form a ... b represents the set of characters from a through b as alternatives. The ellipsis ... is also used elsewhere in the spec to informally denote various enumerations or code snippets that are not further specified. The single-character horizontal ellipsis … (as opposed to the three characters ...) is not a token of the Antimony language.

Source Code Representation

Source code is Unicode text encoded in UTF-8. The text is not canonicalized, so a single accented code point is distinct from the same character constructed from combining an accent and a letter; those are treated as two code points. For simplicity, this document will use the unqualified term character to refer to a Unicode code point in the source text.

Each code point is distinct; for instance, upper and lower case letters are different characters.

Implementation restriction: For compatibility with other tools, a compiler may disallow the NUL character (U+0000) in the source text.
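
The effect of not canonicalizing the source text can be illustrated with a short Python sketch. It is not part of the Antimony toolchain, and the helper name code_points is made up for this example. It compares the precomposed and the decomposed spelling of the same accented letter:

```python
import unicodedata

composed = "\u00e9"      # "é" as a single code point (U+00E9)
decomposed = "e\u0301"   # "e" followed by COMBINING ACUTE ACCENT (U+0301)


def code_points(text):
    """Return the code points of text as 'U+XXXX' strings."""
    return [f"U+{ord(ch):04X}" for ch in text]


# Without canonicalization the two spellings are different character sequences.
same_without_normalization = composed == decomposed

# Only after NFC normalization (which, per this section, is not applied to
# source text) would the two spellings compare equal.
same_after_nfc = unicodedata.normalize("NFC", decomposed) == composed
```

Under this reading, two identifiers that render identically but differ in composition would be different identifiers.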

Characters

The following terms are used to denote specific Unicode character classes:

newline        = /* the Unicode code point U+000A */ .
unicode_char   = /* an arbitrary Unicode code point except newline */ .
unicode_letter = /* a Unicode code point classified as "Letter" */ .
unicode_digit  = /* a Unicode code point classified as "Number, decimal digit" */ .

Letters and digits

The underscore character _ (U+005F) is considered a letter.

letter        = unicode_letter | "_" .
decimal_digit = "0" ... "9" .
binary_digit  = "0" | "1" .
octal_digit   = "0" ... "7" .
hex_digit     = "0" ... "9" | "A" ... "F" | "a" ... "f" .
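
As a rough illustration, the character classes above can be expressed as predicates in Python. This is a sketch only: the function names are hypothetical, and it assumes that Python's unicodedata general categories ("L*" for letters, "Nd" for decimal digits) correspond to the Unicode classes named in the productions.

```python
import unicodedata


def is_letter(ch):
    """letter = unicode_letter | "_" ."""
    return len(ch) == 1 and (ch == "_" or unicodedata.category(ch).startswith("L"))


def is_unicode_digit(ch):
    """unicode_digit: general category 'Number, decimal digit' (Nd)."""
    return len(ch) == 1 and unicodedata.category(ch) == "Nd"


def is_decimal_digit(ch):
    """decimal_digit = "0" ... "9" ."""
    return len(ch) == 1 and "0" <= ch <= "9"


def is_binary_digit(ch):
    """binary_digit = "0" | "1" ."""
    return ch in ("0", "1")


def is_octal_digit(ch):
    """octal_digit = "0" ... "7" ."""
    return len(ch) == 1 and "0" <= ch <= "7"


def is_hex_digit(ch):
    """hex_digit = "0" ... "9" | "A" ... "F" | "a" ... "f" ."""
    return len(ch) == 1 and ch in "0123456789ABCDEFabcdef"
```

Note that unicode_digit is broader than decimal_digit: a character such as U+0663 (ARABIC-INDIC DIGIT THREE) satisfies the former but not the latter.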

Lexical elements

Comments

Comments serve as program documentation. A comment starts with the character sequence // and stops at the end of the line.

A comment cannot start inside a string literal, or inside a comment.
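
The rule that // inside a string literal does not start a comment can be sketched in Python as follows. The function name strip_comment is hypothetical, and the sketch assumes that only double-quoted string literals with backslash escapes need to be recognized on the line.

```python
def strip_comment(line):
    """Return line with a trailing // comment removed, if there is one.

    Assumption: strings are delimited by double quotes only, and a
    backslash inside a string escapes the character that follows it.
    """
    in_string = False
    i = 0
    while i < len(line):
        ch = line[i]
        if in_string:
            if ch == "\\":
                i += 2          # skip the backslash and the escaped character
                continue
            if ch == '"':
                in_string = False
        elif ch == '"':
            in_string = True
        elif line.startswith("//", i):
            return line[:i]     # everything from // to end of line is comment
        i += 1
    return line
```

A second // inside a comment has no effect, because everything after the first // outside a string is already discarded.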

Tokens

Tokens form the vocabulary of the Antimony programming language. There are four classes: identifiers, keywords, operators and punctuation, and literals. White space, formed from spaces (U+0020), horizontal tabs (U+0009), carriage returns (U+000D), and newlines (U+000A), is ignored except as it separates tokens that would otherwise combine into a single token.

Identifiers

Identifiers name program entities such as variables and types. An identifier is a sequence of one or more letters and digits. The first character in an identifier must be a letter.

identifier = letter { letter | unicode_digit } .

a
_x9
This_is_aValidIdentifier
αβ
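
A possible check for the identifier production, written in Python for illustration. The names is_letter and is_identifier are hypothetical, and the sketch assumes that Python's unicodedata categories match the Unicode classes used by this specification. It does not reject keywords.

```python
import unicodedata


def is_letter(ch):
    """letter = unicode_letter | "_" ."""
    return ch == "_" or unicodedata.category(ch).startswith("L")


def is_identifier(text):
    """identifier = letter { letter | unicode_digit } ."""
    if not text or not is_letter(text[0]):
        return False
    return all(
        is_letter(ch) or unicodedata.category(ch) == "Nd"
        for ch in text[1:]
    )
```

All four examples above pass this check, while a string such as 9x fails because its first character is a digit.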

Keywords

The following keywords are reserved and may not be used as identifiers.

break
continue
else
false
fn
for
if
import
in
let
match
new
return
self
struct
true
while
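
A lexer typically scans a word with the identifier rule first and then consults the keyword list. A minimal Python sketch of that second step, with the hypothetical helper classify_word and the assumption that the word has already matched the identifier grammar:

```python
KEYWORDS = frozenset({
    "break", "continue", "else", "false", "fn", "for", "if", "import",
    "in", "let", "match", "new", "return", "self", "struct", "true",
    "while",
})


def classify_word(word):
    """Classify a word that already matches the identifier grammar."""
    return "keyword" if word in KEYWORDS else "identifier"
```

Only exact matches are reserved, so a word such as fnord or lets remains an ordinary identifier.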

Operators and Punctuation

The following character sequences represent operators (including assignment operators) and punctuation:

+
+=
&&
==
!=
(
)
-
-=
||
<
<=
[
]
*
*=
>
>=
{
}
/
/=
++
=
,
;
%
--
!
.
:
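
Several operators are prefixes of longer ones (for example + and +=). This specification does not state how such cases are resolved. The Python sketch below assumes the common longest-match rule, which is consistent with the statement that white space separates tokens that would otherwise combine into a single token. The names match_operator and scan_operators are hypothetical.

```python
OPERATORS = [
    "+", "+=", "&&", "==", "!=", "(", ")",
    "-", "-=", "||", "<", "<=", "[", "]",
    "*", "*=", ">", ">=", "{", "}",
    "/", "/=", "++", "=", ",", ";",
    "%", "--", "!", ".", ":",
]

# Try longer operators first so that "+=" wins over "+" followed by "=".
_LONGEST_FIRST = sorted(OPERATORS, key=len, reverse=True)


def match_operator(text, pos=0):
    """Return the longest operator starting at text[pos], or None."""
    for op in _LONGEST_FIRST:
        if text.startswith(op, pos):
            return op
    return None


def scan_operators(text):
    """Split a string made up of operators and white space into tokens."""
    tokens = []
    pos = 0
    while pos < len(text):
        if text[pos] in " \t\r\n":
            pos += 1
            continue
        op = match_operator(text, pos)
        if op is None:
            raise ValueError(f"no operator at position {pos}")
        tokens.append(op)
        pos += len(op)
    return tokens
```

Under this assumption, += is scanned as one token, whereas + = (with a space) yields two.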

Integer Literals

An integer literal is a sequence of digits representing an integer constant. An optional prefix sets a non-decimal base: 0b or 0B for binary, 0, 0o, or 0O for octal, and 0x or 0X for hexadecimal. A single 0 is considered a decimal zero. In hexadecimal literals, letters a through f and A through F represent values 10 through 15.

For readability, an underscore character _ may appear after a base prefix or between successive digits; such underscores do not change the literal's value.

int_lit        = decimal_lit | binary_lit | octal_lit | hex_lit .
decimal_lit    = "0" | ( "1" ... "9" ) [ [ "_" ] decimal_digits ] .
binary_lit     = "0" ( "b" | "B" ) [ "_" ] binary_digits .
octal_lit      = "0" [ "o" | "O" ] [ "_" ] octal_digits .
hex_lit        = "0" ( "x" | "X" ) [ "_" ] hex_digits .

decimal_digits = decimal_digit { [ "_" ] decimal_digit } .
binary_digits  = binary_digit { [ "_" ] binary_digit } .
octal_digits   = octal_digit { [ "_" ] octal_digit } .
hex_digits     = hex_digit { [ "_" ] hex_digit } .

42
4_2
0600
0_600
0o600
0O600       // second character is capital letter 'O'
0xBadFace
0xBad_Face
0x_67_7a_2f_cc_40_c6
170141183460469231731687303715884105727
170_141183_460469_231731_687303_715884_105727

_42         // an identifier, not an integer literal
42_         // invalid: _ must separate successive digits
4__2        // invalid: only one _ at a time
0_xBadFace  // invalid: _ must separate successive digits
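
The integer literal grammar translates fairly directly into regular expressions. The following Python sketch is illustrative only: parse_int_literal is a hypothetical helper, and Python's arbitrary-precision int stands in for whatever integer type an Antimony compiler would use. It returns the value of a valid literal and None for anything else.

```python
import re

# One pattern per production; group 1 captures the digits without the prefix.
_PATTERNS = [
    (10, re.compile(r"(0|[1-9](?:_?[0-9])*)")),                   # decimal_lit
    (2, re.compile(r"0[bB]_?([01](?:_?[01])*)")),                 # binary_lit
    (8, re.compile(r"0[oO]?_?([0-7](?:_?[0-7])*)")),              # octal_lit
    (16, re.compile(r"0[xX]_?([0-9a-fA-F](?:_?[0-9a-fA-F])*)")),  # hex_lit
]


def parse_int_literal(text):
    """Return the integer value of text, or None if it is not an int_lit."""
    for base, pattern in _PATTERNS:
        match = pattern.fullmatch(text)
        if match:
            # Underscores are separators only and do not affect the value.
            return int(match.group(1).replace("_", ""), base)
    return None
```

With these patterns, 0600, 0_600, 0o600 and 0O600 all denote the same value (384), and the four invalid examples above are rejected.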

Floating-point literals

TO BE IMPLEMENTED

Rune literals

TO BE IMPLEMENTED

String literals

A string literal represents a string constant obtained from concatenating a sequence of characters. String literals are character sequences between double quotes, as in "bar". Within the quotes, any character may appear except newline and unescaped double quote.

If a \ character appears in a string literal, the character following it is interpreted specially:

  1. \ and " are included unchanged (e.g. "C:\\Users" -> C:\Users)
  2. n emits the newline control character (U+000A)
  3. r emits the carriage return control character (U+000D)
  4. b emits the backspace control character (U+0008)
  5. t emits a horizontal tab (U+0009)
  6. f emits a form feed (U+000C)
  7. Unknown escape sequences must raise a compile error

TODO: byte values

TODO: Currently, " and ' are valid string characters. Remove ' and only use them for runes.
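
The escape rules listed above can be sketched in Python as follows. The function name unescape is hypothetical, a ValueError stands in for the compile error, and the sketch assumes it is given the text between the two double quotes of a literal.

```python
# Character after the backslash -> character emitted.
_ESCAPES = {
    "\\": "\\",
    '"': '"',
    "n": "\n",   # newline (U+000A)
    "r": "\r",   # carriage return (U+000D)
    "b": "\b",   # backspace (U+0008)
    "t": "\t",   # horizontal tab (U+0009)
    "f": "\f",   # form feed (U+000C)
}


def unescape(body):
    """Decode the text between the quotes of a string literal."""
    out = []
    i = 0
    while i < len(body):
        ch = body[i]
        if ch == "\\":
            nxt = body[i + 1] if i + 1 < len(body) else ""
            if nxt not in _ESCAPES:
                raise ValueError(f"unknown escape sequence \\{nxt}")
            out.append(_ESCAPES[nxt])
            i += 2
        elif ch == '"' or ch == "\n":
            raise ValueError("unescaped double quote or newline in string")
        else:
            out.append(ch)
            i += 1
    return "".join(out)
```

For instance, the literal body C:\\Users decodes to C:\Users, and a body containing \q is rejected.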

string_escape =
    "\n" | /* Newline (U+000A) */
    "\r" | /* Carriage return (U+000D) */
    "\t" | /* Horizontal tab (U+0009) */
    "\f" | /* Form feed (U+000C) */
    "\b" | /* Backspace (U+0008) */
    `\"` | "\\" .
any = /* Any Unicode code point except newline (U+000A) and double quote (U+0022) */ .
string_lit = `"` { any | string_escape } `"` .

"abc"
"Hello, world!"
"Hello\nworld"
"C:\\Users"  // yields C:\Users
"日本語"