FixScript

Documentation

Types

There are only nine types in the language: a 32-bit integer, a 32-bit float (without denormals), a resizable array, a string (just a different variant of an array), a shared array (an array of fixed length and element size), a hash table, function references, weak references and native handles.

The array type is automatically compressed to use only unsigned bytes or shorts if the values don't exceed the ranges of these smaller types. This allows byte buffers and Unicode strings to be stored efficiently.

The float type reuses the mark used for array references by limiting valid array references to 23 bits (the size of the stored significand). If the integer bitwise representation of a (positive) float value falls in this range, it denotes a denormalized number. Since the utility of such numbers is often negative (many CPUs have slow fallbacks or don't implement them at all) and the upsides are very limited, they're flushed to zero, which makes it possible to share floats with array references without collisions.

The language lacks direct support for 64-bit integers and floats. However, intrinsic functions are provided to support arithmetic with 64-bit integers and 64-bit floats (doubles).

The boolean type uses zero for the false value and any non-zero value (including array references, as these have integer values bigger than zero) for the true value. This is used by several statements when testing conditions.

Objects

Objects are defined and used by convention and are stored in arrays. Auto-incrementing constants are used to define the object fields as well as the size of the object. For example:

const {
    @OBJ_field1,
    @OBJ_field2,
    OBJ_foo,
    OBJ_bar,
    OBJ_SIZE
};

The @ is used to mark private fields. Extending objects is also possible (provided that the SIZE constant is not private):

const {
    SUBCLASS_some = OBJ_SIZE,
    SUBCLASS_more_fields,
    SUBCLASS_SIZE
};

To create and extend instances of objects you can use the following functions:

function obj_create(foo, bar)
{
    var obj = object_create(OBJ_SIZE);
    obj->OBJ_foo = foo;
    obj->OBJ_bar = bar;
    return obj;
}

function subclass_create(foo, bar, some)
{
    var subclass = object_extend(obj_create(foo, bar), SUBCLASS_SIZE);
    subclass->OBJ_foo = null;
    subclass->SUBCLASS_some = some;
    return subclass;
}

Alternative syntax (using token processing):

class Object
{
    var @foo: Foo;
    var @bar: Bar;

    constructor create(foo: Foo, bar: Bar)
    {
        this.foo = foo;
        this.bar = bar;
    }
}

class Subclass: Object
{
    var @some;

    constructor create(foo: Foo, bar: Bar, some)
    {
        super::create(foo, bar);
        this.foo = null;
        this.some = some;
    }
}

To access fields you can use the -> operator, which is just nicer syntax for array access using a named constant.

obj->OBJ_foo = 5;
obj[OBJ_foo] = 5; // the same

Arithmetic

There are standard operators for addition (+), subtraction (-), multiplication (*), division (/), remainder (%), bitwise AND (&), bitwise OR (|), bitwise XOR (^), bitwise negation (~), boolean negation (!), signed comparison (<, <=, >=, >), equivalency comparison (==, !=), signed shifts (<<, >>), unsigned shift (>>>), logical AND (&&) and logical OR (||).

You can also combine these with an assignment with these combined operators: +=, -=, *=, /=, %=, &=, |=, ^=, <<=, >>= and >>>=.

The logical AND and OR operators short-circuit the evaluation of their operands.

Syntax

Literals

Integer literals can be specified as decimal or hexadecimal numbers. Hexadecimal numbers are prefixed with 0x. You can also use a character literal, which simply yields the corresponding Unicode value as an integer. A character literal is a single character enclosed in single quotes ('); you can use several escape sequences to get all kinds of characters. You can also specify up to 4 characters (in LATIN1, having values 0-255); these are combined as individual bytes of the resulting 32-bit integer, stored in little endian format.
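
For example:

var dec = 1000;     // decimal
var hex = 0xFF;     // hexadecimal (255)
var chr = 'A';      // character literal: Unicode value 65
var tab = '\t';     // escape sequence: 9
var four = 'RIFF';  // four LATIN1 characters packed in little endian order: 0x46464952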

String literals are enclosed in double quotes ("). String literals are read only; to make them writable use string concatenation (for example: {"mutable string"}), which creates a new string instance.

This is the list of escape sequences for string and character literals:

\r CR (13, 0x0D)
\n LF (10, 0x0A)
\t TAB (9, 0x09)
\\ backslash (92, 0x5C)
\' apostrophe (39, 0x27), not needed in string literals
\" quotes (34, 0x22), not needed in char literals
\XX 8-bit value (each X is a hex digit)
\uXXXX 16-bit Unicode code point, excluding surrogate pairs 0xD800..0xDFFF (each X is a hex digit)
\UXXXXXX 21-bit Unicode code point with a maximum of 0x10FFFF, excluding surrogate pairs 0xD800..0xDFFF (each X is a hex digit)

Extended operator

The extended operator is wrapped in the { and } symbols. It has five forms:

  1. empty hash initializer - when the body of the operator is empty: {}
  2. string concatenation - one or more elements delimited by a comma, for example: {"value of PI: ", 3.141592654}
  3. float operation - when two elements are delimited by a float operator (+,-,*,/), for example: {1.0 * 2.0}
  4. hash initializer - when one or more sets of two elements are delimited by a colon (each set delimited by a comma), for example: {"key": "value", 123: 456}
  5. statement expression - allows putting statements inside expressions, for example: { var a = func(); =a*a }

The string concatenation takes the string representation of each element and concatenates them together. The single-element form is often used to create a mutable instance of a constant string. For example: {"this is now a mutable string"}
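
A quick overview of the forms in code:

var h = {};                                  // empty hash table
var msg = {"value of PI: ", 3.141592654};    // string concatenation (a new mutable string)
var product = {1.5 * 2.0};                   // float multiplication
var table = {"key": "value", 123: 456};      // hash initializer
var squared = { var a = 3; =a*a };           // statement expression, evaluates to 9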

Operator precedence

unary ~ ! + - ++ --
multiplicative * / %
additive + -
bitwise << >> >>> & | ^
comparison < <= > >= == != === !==
logical && ||
ternary ?:
assignment = += -= *= /= %= &= |= ^= <<= >>= >>>=

The operator operands are processed from left to right for the whole expression (including assignments). The exception is the conditional operators (the ternary operator and the short-circuiting logical operators). The operators themselves are applied according to the precedence table.

This affects expressions containing pre/post increments/decrements, inner assignments, statement expressions, function calls, etc.
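
For example:

var a = 1;
var b = (a = 2) + a;  // b == 4: the inner assignment runs before the right operand is read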

Predefined constants

There are three predefined constants, null (0), false (0) and true (1), to make the intent of the code clearer.

Functions

Intrinsic functions

These functions are handled specially by the interpreter, using direct bytecodes for better performance. The float functions can also work with doubles (and with 64-bit integers in the case of the float and int functions), with each such parameter taking two slots, low-order half first.

Length

length(value)
Returns the length of the given value: the array length for arrays and the number of elements for hash tables. Emits an error in case the value is not an array or a hash table.

Min/max/clamp

min(a, b)
Returns the smaller integer.
max(a, b)
Returns the bigger integer.
clamp(x, min, max)
Returns the integer clamped to given range (inclusive).

32-bit arithmetic

abs(value)
Returns the absolute integer.
add32(a, b)
add32(a, b, carry)
Returns the result of 32-bit modular addition (including the carry flag when the second result is obtained). Optionally can add using the carry flag from the previous addition.
sub32(a, b)
sub32(a, b, borrow)
Returns the result of 32-bit modular subtraction (including the borrow flag when the second result is obtained). Optionally can subtract using the borrow flag from the previous subtraction.
mul32(a, b)
Returns the result of 32-bit multiplication, wrapping when overflow happens.
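
For illustration, a 64-bit addition can be built from add32 and its carry flag (a sketch only; it assumes the var (a, b) = ... syntax for capturing two result values, and in practice the add64 intrinsic below does this directly):

function my_add64(a_lo, a_hi, b_lo, b_hi)
{
    var (lo, carry) = add32(a_lo, b_lo);  // low halves; the carry comes as the second result
    var hi = add32(a_hi, b_hi, carry);    // high halves plus the carry from the low halves
    return lo, hi;
}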

64-bit arithmetic

add64(a_lo, a_hi, b_lo, b_hi)
Returns the result of 64-bit modular addition as two result values.
sub64(a_lo, a_hi, b_lo, b_hi)
Returns the result of 64-bit modular subtraction as two result values.
mul64(v1, v2)
Returns the lower and upper 32 bits of the product as two result values.
umul64(v1, v2)
The unsigned version of the mul64 function.
mul64(v1_lo, v1_hi, v2_lo, v2_hi)
Returns the lower and upper 32 bits of the product as two result values.
div64(v1_lo, v1_hi, v2_lo, v2_hi)
Returns the lower and upper 32 bits of the 64-bit division as two result values.
udiv64(v1_lo, v1_hi, v2_lo, v2_hi)
The unsigned version of the div64 function.
rem64(v1_lo, v1_hi, v2_lo, v2_hi)
Returns the lower and upper 32 bits of the 64-bit remainder as two result values.
urem64(v1_lo, v1_hi, v2_lo, v2_hi)
The unsigned version of the rem64 function.

Float functions (32-bit & 64-bit)

float(a)
float(lo, hi)
Converts integer to float.
int(a)
int(lo, hi)
Converts float to integer.
fconv(a)
fconv(lo, hi)
Converts float to double or double to float.
fadd(a_lo, a_hi, b)
fadd(a_lo, a_hi, b_lo, b_hi)
Returns the result of addition of two 64-bit floating point numbers (or 32-bit float for second value).
fsub(a_lo, a_hi, b)
fsub(a_lo, a_hi, b_lo, b_hi)
Returns the result of subtraction of two 64-bit floating point numbers (or 32-bit float for second value).
fmul(a_lo, a_hi, b)
fmul(a_lo, a_hi, b_lo, b_hi)
Returns the result of multiplication of two 64-bit floating point numbers (or 32-bit float for second value).
fdiv(a_lo, a_hi, b)
fdiv(a_lo, a_hi, b_lo, b_hi)
Returns the result of division of two 64-bit floating point numbers (or a 32-bit float for the second value).
fcmp_lt(a_lo, a_hi, b)
fcmp_lt(a_lo, a_hi, b_lo, b_hi)
Returns true when the first float is smaller than the second.
fcmp_le(a_lo, a_hi, b)
fcmp_le(a_lo, a_hi, b_lo, b_hi)
Returns true when the first float is smaller than or equal to the second.
fcmp_gt(a_lo, a_hi, b)
fcmp_gt(a_lo, a_hi, b_lo, b_hi)
Returns true when the first float is bigger than the second.
fcmp_ge(a_lo, a_hi, b)
fcmp_ge(a_lo, a_hi, b_lo, b_hi)
Returns true when the first float is bigger than or equal to the second.
fcmp_eq(a_lo, a_hi, b)
fcmp_eq(a_lo, a_hi, b_lo, b_hi)
Returns true when the first float is equal to the second.
fcmp_ne(a_lo, a_hi, b)
fcmp_ne(a_lo, a_hi, b_lo, b_hi)
Returns true when the first float is not equal to the second.
fabs(a)
fabs(lo, hi)
Returns the absolute value of a float number.
fmin(a, b)
fmin(a_lo, a_hi, b)
fmin(a_lo, a_hi, b_lo, b_hi)
Returns the smaller float number.
fmax(a, b)
fmax(a_lo, a_hi, b)
fmax(a_lo, a_hi, b_lo, b_hi)
Returns the bigger float number.
fclamp(x, min, max)
fclamp(x_lo, x_hi, min, max)
fclamp(x_lo, x_hi, min_lo, min_hi, max_lo, max_hi)
Returns the float clamped to given range (inclusive).
floor(a)
floor(lo, hi)
Returns the float value rounded down (as a float).
ifloor(a)
ifloor(lo, hi)
Returns the float value rounded down (as an int).
ceil(a)
ceil(lo, hi)
Returns the float value rounded up (as a float).
iceil(a)
iceil(lo, hi)
Returns the float value rounded up (as an int).
round(a)
round(lo, hi)
Returns the float value rounded to the nearest integer (as a float).
iround(a)
iround(lo, hi)
Returns the float value rounded to the nearest integer (as an int).
pow(a, b)
pow(a_lo, a_hi, b)
pow(a_lo, a_hi, b_lo, b_hi)
Returns the float value a raised to the power of b.
sqrt(a)
sqrt(lo, hi)
Returns the square root of the float value.
cbrt(a)
cbrt(lo, hi)
Returns the cube root of the float value.
exp(a)
exp(lo, hi)
Returns e raised to the power of a.
ln(a)
ln(lo, hi)
Returns the natural logarithm of a.
log2(a)
log2(lo, hi)
Returns the base 2 logarithm of a.
log10(a)
log10(lo, hi)
Returns the base 10 logarithm of a.
sin(a)
sin(lo, hi)
Returns the sine of a.
cos(a)
cos(lo, hi)
Returns the cosine of a.
asin(a)
asin(lo, hi)
Returns the arc sine of a.
acos(a)
acos(lo, hi)
Returns the arc cosine of a.
tan(a)
tan(lo, hi)
Returns the tangent of a.
atan(a)
atan(lo, hi)
Returns the arc tangent of a.
atan2(y, x)
atan2(y_lo, y_hi, x_lo, x_hi)
Returns the arc tangent of y/x, using the signs of both coordinates to determine the quadrant of the result.
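
For example, a small computation with doubles (a sketch; it assumes the var (lo, hi) = ... syntax for capturing two result values):

function double_example()
{
    var (a_lo, a_hi) = string_parse_double("1.5");
    var (r_lo, r_hi) = fmul(a_lo, a_hi, 2.0);  // the second operand is given as a 32-bit float
    log(string_from_double(r_lo, r_hi));       // logs the textual form of 3.0
}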

Value functions

is_int(value)
Returns true when the value is an integer.
is_float(value)
Returns true when the value is a float.
is_array(value)
Returns true when the value is an array or string.
is_string(value)
Returns true when the value is a string.
is_hash(value)
Returns true when the value is a hash table.
is_shared(value)
Returns true when the value is a shared array.
is_const(value)
Returns true when the value is a constant string.
is_funcref(value)
Returns true when the value is a function reference (either resolved or unresolved).
is_weakref(value)
Returns true when the value is a weak reference.
is_handle(value)
Returns true when the value is a native handle.
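
For example, a simple dispatch on the value type (a sketch):

function describe(value)
{
    if (is_int(value)) {
        return {"integer ", value};
    }
    if (is_float(value)) {
        return {"float ", value};
    }
    if (is_string(value)) {  // checked before is_array, because strings are arrays too
        return {"string of length ", length(value)};
    }
    if (is_array(value)) {
        return {"array of length ", length(value)};
    }
    if (is_hash(value)) {
        return {"hash with ", length(value), " entries"};
    }
    return {"some other value"};
}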

Built-in functions

Clone

clone(value)
Returns the shallow duplicate of the given value. Constant strings and shared arrays are always cloned by reference.
clone_deep(value)
Returns the deep duplicate of the given value. Constant strings and shared arrays are always cloned by reference.

Array

array_create(length)
array_create(length, element_size)
Creates an array of given length and optionally initial element size (1, 2 or 4 bytes).
array_create_shared(length, element_size)
Creates an array of fixed length and fixed element size (1, 2 or 4 bytes). Only integers and floats (when the element size is 4 bytes) can be put in; no references or native handles are allowed. Upgrading the array to a bigger element size results in an error. This type of array is suitable for sharing between different heaps (possibly running in different threads). It can also be used for fixed-size buffers with an enforced element size or to avoid the overhead of garbage collecting the contained values.
array_get_shared_count(array)
Returns the number of different heaps that contain a reference to the given shared array.
array_get_element_size(array)
Returns the (current) element size of the array in bytes (1, 2 or 4 bytes).
array_set_length(array, length)
Sets the length of the given array.
array_copy(dest, dest_off, src, src_off, count)
Copies content of one array into another (or the same).
array_fill(array, value)
array_fill(array, off, count, value)
Fills a portion of the array (or whole) with the given value.
array_extract(array, off, count)
Returns a copy of a portion of the array (string).
array_insert(array, off, value)
Inserts the given value into the array, pushing the following elements to higher indices and increasing the array length.
array_insert_array(dest, idx, src)
array_insert_array(dest, idx, src, off, len)
Inserts an array into another array at given position.
array_append(array, other)
array_append(array, other, off, count)
Appends an array to another array.
array_replace_range(dest, start, end, src)
array_replace_range(dest, start, end, src, off, len)
Replaces the range between start and end (exclusive) with the content of the given array.
array_remove(array, off)
array_remove(array, off, count)
Removes a portion of the array, pushing the following elements to lower indices and decreasing the array length. The count is set to 1 when omitted.
array_clear(array)
Clears the array by setting the length of the given array to zero.
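
A few of these in action (a sketch):

function array_example()
{
    var buf = array_create(4, 1);         // length 4, starting with 1-byte elements
    array_fill(buf, 0);                   // zero it out
    buf[0] = 200;                         // still fits into the 1-byte element size
    var head = array_extract(buf, 0, 2);  // a new array with the first two elements
    array_append(buf, head);              // buf now has length 6
    array_remove(buf, 0);                 // drop the first element, length is 5
    log(length(buf));                     // logs 5
}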

String

string_const(s)
string_const(s, off, len)
Returns a constant string (one that can't be modified) for the given string. You can specify a portion of the original string or use the whole string; in the latter case the same string is returned if it is already a constant. There is always only a single instance of each unique constant string.
string_parse_int(s)
string_parse_int(s, default_value)
string_parse_int(s, off, len)
string_parse_int(s, off, len, default_value)
Parses the string as an integer. Optionally it can return the provided default value instead of an error.
string_parse_float(s)
string_parse_float(s, default_value)
string_parse_float(s, off, len)
string_parse_float(s, off, len, default_value)
Parses the string as a float. Optionally it can return the provided default value instead of an error.
string_parse_long(s)
string_parse_long(s, off, len)
string_parse_long(s, off, len, default_lo, default_hi)
Parses the string as a 64-bit integer. The result is returned as two values. Optionally it can return the provided default value instead of an error. To check for an error, use the is_int function to distinguish between a valid value and an error.
string_parse_double(s)
string_parse_double(s, off, len)
string_parse_double(s, off, len, default_lo, default_hi)
Parses the string as a 64-bit float. The result is returned as two values. Optionally it can return the provided default value instead of an error. To check for an error, use the is_int function to distinguish between a valid value and an error.
string_from_long(lo, hi)
string_from_long(s, lo, hi)
Returns string representation of a 64-bit integer. Optionally it can append it to an existing string.
string_from_double(lo, hi)
string_from_double(s, lo, hi)
Returns string representation of a 64-bit float. Optionally it can append it to an existing string.
string_from_utf8(arr)
string_from_utf8(arr, off, len)
string_from_utf8(s, arr)
string_from_utf8(s, arr, off, len)
Returns a string decoded from a UTF-8 byte array. Optionally it can append it to an existing string.
string_to_utf8(s)
string_to_utf8(s, off, len)
string_to_utf8(arr, s)
string_to_utf8(arr, s, off, len)
Returns the UTF-8 encoded form (a byte array). Optionally it can append it to an existing byte array.
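
For example (a sketch):

function string_example()
{
    var num = string_parse_int("123");         // 123
    var safe = string_parse_int("oops", -1);   // -1 is returned instead of an error
    var bytes = string_to_utf8("héllo");       // a byte array with the UTF-8 encoding
    var back = string_from_utf8(bytes);        // decoded back into a string
}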

Object

object_create(size)
Creates a new array with the given size. This is practically the same as creating a new array by calling the array_create function, except that its intent is the creation of objects.
object_extend(obj, size)
Sets the length of the existing array; the new length must be the same or bigger. This is almost the same as the array_set_length function, except that its intent is the extension of objects and the array is returned.

Weak reference

weakref_create(obj)
weakref_create(obj, container)
weakref_create(obj, container, key)
Creates a new weak reference (or returns an already existing instance). Optionally you can pass a container (hash table or array) for an automatic action to occur once the target object is garbage collected. In the case of hash tables, either the weak reference or the provided key is removed. For arrays, either the weak reference or the provided key is appended to it. Be sure to periodically empty the array to prevent memory leaks.
Note: weak references can't reference directly other weak references (including the key).
weakref_get(ref)
Obtains the reference value for given weak reference.

Function reference

funcref_call(func, params)
Calls the function reference with the parameters passed in an array.

Hash

hash_get(hash, key, default_value)
Returns the value for the given key, or the provided default value when the key is not present. In case you need to check for the presence of any kind of value, you can use a reference to a private function as a unique default value that can't be present in any hash (see the sketch after this list).
hash_entry(hash, idx)
Returns both the key and the value for given index as two result values. Returns zero for both the key and the value in case of an error (the only possible errors are: invalid hash reference or index out of bounds).
hash_contains(hash, key)
Returns true when the hash table contains the key.
hash_remove(hash, key)
Removes the given key from the hash table. Returns the value that was present or emits an error in case the key wasn't present.
hash_keys(hash)
Returns the keys from the hash as a new array.
hash_values(hash)
Returns the values from the hash as a new array.
hash_pairs(hash)
Returns both the keys and the values from the hash as a new array of twice the length with keys and values interleaved.
hash_clear(hash)
Removes all entries from the hash table.
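
The presence check mentioned for hash_get could look like this (a sketch; it assumes private functions are marked with @ like private constants, the name#argcount syntax for function references, and that === performs an exact comparison):

function @missing() {}

function lookup(hash, key)
{
    var value = hash_get(hash, key, missing#0);  // the private function reference is a sentinel
    if (value === missing#0) {
        return 0, error("key not present");
    }
    return value;
}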

Error

error(message)
Returns an error description with a stack trace. It consists of an array of length 2 where the first element is the message and the second is an array of individual stack trace entries (each being just a string like "func#1 (file.fix:123)"). It is permitted to pass another error as a message.

Log

log(value)
Prints the given string (other values are converted to a string) to the debugging channel provided by the application. Newlines are added automatically. The default implementation prints to standard error stream.
dump(value)
Pretty prints the given value using the log function.
to_string(value)
to_string(value, newlines)
Returns the string representation of the given value. You can optionally specify if you want the newlines (not used by default).

Heap

heap_collect()
Collects the garbage in the heap, removing unused arrays from the memory.
heap_size()
Computes and returns the heap size in kilobytes. It may do nothing (and return zero) depending on the implementation. It returns maximum value in case the heap is bigger than that.

Performance

perf_reset()
Resets the performance debugging timer.
perf_log(msg)
Logs the message (using the log function) with information about elapsed time since the previous perf_log call and also since the performance debugging timer was reset.

Serialization

serialize(value)
serialize(buf, value)
Serializes the given value into a byte array and returns it. It can optionally append the bytes to an existing array. Serialization produces an error for native handles, function references and weak references. The serialization format is fixed and suitable for long-term storage.
unserialize(buf)
unserialize(buf, off, len)
Unserializes a value from the given byte array. The array must contain only the serialized data (no extra data or multiple serialized values).
unserialize(buf, off_ref)
Unserializes a value from the given byte array at a given offset passed as a reference (therefore wrapped in an array). The offset is adjusted to the position after the read value once the operation finishes.
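
For example (a sketch):

function serialization_example()
{
    var original = {"numbers": [1, 2, 3], "label": "point"};
    var bytes = serialize(original);
    var restored = unserialize(bytes);    // an independent copy of the original value

    var pos = [0];
    var first = unserialize(bytes, pos);  // pos[0] is advanced past the value that was read
}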

Token processing functions

These functions are available only during token processing. They always return an error if called otherwise.

script_query(name, file_name, constants, locals, functions)
Retrieves information about the file name, constants (both private and public), public local variables and public functions. The script is loaded if it wasn't already compiled. The file name is a mutable string. The constants parameter is a hash table where the key is the name (with @ at the beginning if private) and the value is either an integer, a float or a string; in case the constant references some other constant, the value is an array whose elements are: the value, the referenced script name and the constant name. The locals and functions are just arrays of names. All of the output parameters are optional and the order of the values reflects the order in the source file (after being processed by the token processors).
script_line(line)
script_line(fname, tokens, src, line)
Returns the script file name and the line in this format: "script.fix(123)". This function is used for error reporting in token processors. It also correctly adjusts the file name and line based on the @stack_trace_lines constant in the tokens. The tokens for currently processed script are used when not provided. Providing the file name is optional.
script_postprocess(func, data)
Registers the provided function to be called after processing by the token processors. The registered functions are called in reverse order to allow wrapping the behavior of different token processors. The signature of the function is:
postprocess(data, fname, tokens, src)
(the associated data for the function, the file name, the tokens and the source)
script_compile(src)
script_compile(tokens, src)
Compiles the given source code (or tokens) in the context of the token processing and returns a hash table with a list of public functions that can be called (the order of the values reflects the order in the source file).
tokens_parse(tokens, src, s, line)
tokens_parse(tokens, src, s, off, len, line)
Parses the given string into tokens. These are appended to the tokens array and the source code is appended to the src string (UTF-8 encoded). The reference to the tokens array is returned to simplify the code when creating a new tokens array.
token_parse_string(s)
token_parse_string(s, off, len)
Parses the given string (or character) literal from its UTF-8 source code form.
token_escape_string(s)
Converts a string into its UTF-8 encoded source code form.

Optimizations

Arrays have the ability to compress their element size to just 8-bit or 16-bit unsigned integers; this allows working with binary data and Unicode strings efficiently. As a side effect many other arrays are also compressed, as the stored numbers generally tend to be near zero.

The compiler directly emits bytecode, skipping any intermediate forms such as AST trees. The forward jump bytecodes use a fixed-length encoding to simplify the compiler. This allows jumps of up to 2048 bytecodes (which should suffice for most cases); when a bigger jump is encountered, the whole script is recompiled with long jumps.

The integer arithmetic operators work on the raw 32-bit integer value, even when it's a reference or a float. The result is always an integer. Similarly, the float operators (used through the extended operator) interpret this 32-bit value as a float and always return a float. This also allows mixing integer operations with floats to work directly with the bitwise representation of floats.

The garbage collector is non-compacting, meaning the integer values of references stay the same. This can be used to create hash keys that are compared as references and not by value, simply by using an arithmetic operation that doesn't modify the integer value to get the raw reference index (for example by ORing with zero). However you have to make sure that the original reference is still referenced somewhere. You can use weak references if you can't provide this guarantee, at the expense of allocating an extra object per key.

Multithreading

The general approach is to use multiple smaller heaps (one or more per thread) at natural sections of the application.

This approach makes GC pauses quite localized and very quick, making them a non-issue. It's best to communicate between threads/heaps by cloning the data. You can use shared memory (shared arrays) to avoid copying the actual data, with some minor adjustments needed in the code, eg. for storing structured data.

Basically you can store an array of objects in a shared array simply by adjusting the syntax a bit: using shared_array[obj+OBJ_field] instead of the usual obj->OBJ_field. You also need to do your own pointer arithmetic, using obj as an integer offset and adding OBJ_SIZE when you want to go to the next element. You can store pointers to other objects as simple offsets into the shared array.
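
For example (a sketch, the VEC_* names are illustrative):

const {
    VEC_x,
    VEC_y,
    VEC_SIZE
};

function shared_objects_example()
{
    var arr = array_create_shared(100 * VEC_SIZE, 4);  // room for 100 records, 4-byte elements
    var obj = 1 * VEC_SIZE;  // the "pointer" to the second record is just an integer offset
    arr[obj+VEC_x] = 10;     // instead of obj->VEC_x
    arr[obj+VEC_y] = 20;
}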

Garbage collection

Having smaller heaps also allows collecting the garbage as needed without many downsides. For example you can call the GC after processing a network request or after a spike of work, and you can avoid the need for obnoxious try/finally blocks to reclaim native handles in case of an error.

Token processing (macros)

The language supports arbitrary modification of the tokens before they're fed into the compiler. This allows adding new syntax, adjusting the existing one or even changing it completely.

This is achieved by using the use keyword, which runs the specified script with the tokens to modify. Usage example:

use "foreach"; // at the top of the file, before any imports

foreach (var k, v in hash_table) {
    ...
}

The implementation script has a single function, process_tokens(fname, tokens, src), that accepts the file name, a packed array of tokens (every entry taking multiple slots) and the original UTF-8 encoded source (the whole file). The tokens start right after the string literal in the use statement. This allows potentially passing parameters to the processing script.

The token types are these (these are final, no changes will be made to them):

const {
	@TOK_IDENT,
	@TOK_FUNC_REF,
	@TOK_NUMBER,
	@TOK_HEX_NUMBER,
	@TOK_FLOAT_NUMBER,
	@TOK_CHAR,
	@TOK_STRING,
	@TOK_UNKNOWN,

	@KW_DO,
	@KW_IF,
	@KW_FOR,
	@KW_USE,
	@KW_VAR,
	@KW_CASE,
	@KW_ELSE,
	@KW_BREAK,
	@KW_CONST,
	@KW_WHILE,
	@KW_IMPORT,
	@KW_RETURN,
	@KW_SWITCH,
	@KW_DEFAULT,
	@KW_CONTINUE,
	@KW_FUNCTION
};

The tokens structure has these members:

const {
	@TOK_type,
	@TOK_off,
	@TOK_len,
	@TOK_line,
	@TOK_SIZE
};

These are the type, the offset into the source and the length of the token. It also contains the line number in the file for error reporting (used both at compile time and at runtime). The length of the tokens array is always divisible by 4 (the size of the structure). To add new tokens, the corresponding source code fragment can simply be appended to the source and referenced by the indices.

Supported symbols are directly represented by their ASCII value in the type. Multi-character symbols are represented as individual bytes in little endian format (eg. >= is 0x3D3E) and map directly to multi-character literals, eg. simply using '>=' in the source code.

Invalid tokens are passed to the token processors with the TOK_UNKNOWN type. The smallest number of characters that produces the same error is used. For unknown characters, the maximum consecutive run of such characters is put into a single TOK_UNKNOWN token. This allows processing arbitrary syntaxes even when they collide with the built-in syntax.

When manipulating the tokens, special care must be taken not to desynchronize the symbols stored in the type from the source representation defined by TOK_off and TOK_len. Also, each token must have a unique offset into the source code; some token processors use this offset to uniquely identify a token even when it's moved within the tokens array.
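
A minimal token processor that walks the token array (using the TOK_* constants above) and appends new tokens might look like this (illustrative only; the constant appended at the end is made up):

function process_tokens(fname, tokens, src)
{
    // walk the packed token array, one entry per TOK_SIZE slots
    for (var i=0; i<length(tokens); i+=TOK_SIZE) {
        if (tokens[i+TOK_type] == '+') {
            // single character symbols are stored directly as their ASCII value
        }
    }

    // the simplest way to add new tokens is to parse a source fragment with tokens_parse,
    // which appends both the tokens and the source text
    tokens_parse(tokens, src, "const @generated = 1;", 1);
}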

Potential usages are very broad; however, remember that with great power comes great responsibility.

Good luck & have fun on your token processing voyage! :-)

Symbols

Length Symbols
1 character ( ) { } [ ] , ; ~ : @ ? + - * / % & | ^ = ! < > # $ . \ `
2 characters += ++ -= -- -> *= /= %= &= && |= || ^= <= << >= >> == != ..
3 characters === !== <<= >>= >>>
4 characters >>>=

When you convert tokens back to valid source code (eg. for dumping the tokens as readable code) you may want to omit any unneeded whitespace. You can omit the whitespace between any two symbols that do not produce another symbol when concatenated; these are the combinations that require an extra whitespace:

+ +
+ =
+ +=
+ ++
+ ==
+ ===
- -
- =
- >
- -=
- --
- ->
- >=
- >>
- ==
- ===
- >>=
- >>>
- >>>=
* =
* ==
* ===
/ =
/ ==
/ ===
% =
% ==
% ===
& &
& =
& &=
& &&
& ==
& ===
| |
| =
| |=
| ||
| ==
| ===
^ =
^ ==
^ ===
= =
= ==
= ===
! =
! ==
! ===
< =
< <
< <=
< <<
< ==
< ===
< <<=
> =
> >
> >=
> >>
> ==
> ===
> >>=
> >>>
> >>>=
<< =
<< ==
<< ===
>> =
>> >
>> >=
>> >>
>> ==
>> ===
>> >>=
>> >>>
>> >>>=
== =
== ==
== ===
!= =
!= ==
!= ===
>>> =
>>> ==
>>> ===
. .
. ..

Execution environment

The token processors should use only built-in functions. They may use optional native functions provided by customized compilers, but should generally work without them.

The running environment can differ: it can be either the same heap as the other code (in the case of the interpreter), or a separate heap with possible incremental compilation (various tools and compilers). When incremental compilation is used, the heap used for token processing is serialized to disk so it can be resumed later. Not much is needed to support this, other than being prepared to process source files repeatedly.

Communication between token processors

Sometimes you need to provide extra metadata (eg. class descriptions) so that different token processors (or just different versions of the same one) can work together. Usually you would use local variables to track such data (which has the benefit of not storing it in the final result in the case of static compilation), but they're limited to the particular version of the token processor.

The official way to do this is to use private constants with descriptive names that don't clash with normal constants. These can use different kinds of values, though string constants are the most useful as you can put custom micro-syntax into them. Or, in case the data is complicated and generated anyway, you can just serialize the data into a string; it can then be directly unserialized because of the way strings are implemented.

Remember that this metadata is part of the API and should therefore be designed for backward and forward compatibility.

It is preferred that you provide human readable syntax for your metadata. For example:

const @class_SomeClass = "prefix=some_class,static=create#2:create_other#3";
const @method_SomeClass_create_2 = "(Integer, Float): SomeClass";

The language implementation already implements two such definable metadata constants for adjusting the stack traces of errors.

Custom function names in stack traces

You can set a custom function name in error stack traces. Example:

const @function_some_func_2 = "SomeFunc(int,float)";
const @function_hidden_0 = ""; // removes function from the stack trace (use with caution!)

function some_func(a, b)
{
    ...
}

function hidden()
{
    // best used for some generated wrapper functions
    // that are unlikely to cause error on their own
    // and would unpleasantly obscure the stack trace
}

Change file names and virtual function insertion in stack traces

You can use the @stack_trace_lines constant to set different file names for ranges of lines. You can also optionally insert virtual functions into the stack trace (useful for macros). The value is an array serialized into a string constant. The array contains multiple entries, each having 5 slots. The slots are: start line, end line (inclusive), file name, line number and an optional name of an inserted virtual function.

The array is processed from the beginning to the end. The virtual functions are inserted at the top (therefore they appear in reverse order in the stack trace). The file name of the last matching range without a virtual function is used. This allows putting broader ranges before more specific ranges.

const {
    @LINE_start,
    @LINE_end,
    @LINE_file_name,
    @LINE_line_num,
    @LINE_func_name,
    @LINE_SIZE
};

function process_tokens(fname, tokens, src)
{
    var lines = [];
    ...
    lines[off+LINE_start] = 1000;
    lines[off+LINE_end] = 2000;
    lines[off+LINE_file_name] = "some_name.fix";
    lines[off+LINE_line_num] = 123;
    lines[off+LINE_func_name] = null;
    ...
    tokens_parse(tokens, src, {"const @stack_trace_lines = ", token_escape_string(serialize(lines)), ";"});
    ...
}

Error reporting

Use errors in this format for reporting any syntax errors: "script.fix(123): syntax error". This is easily achieved using the built-in script_line function like this:

return 0, error({script_line(line), ": syntax error"});

This makes sure the proper file name is included (even when it was changed using the @stack_trace_lines private constant).

Serialization format

The serialization format has a fixed structure and won't be changed in the future. It is therefore suited for both temporary and long-term storage, as well as for exchanging data between different systems.

The format is binary; all numbers are stored in little endian byte order. There is no header, but application specific usages may add one. A single value is encoded, or in the case of arrays and hashes more values recursively follow. Arrays are encoded on their first use and assigned an internal ID starting from zero. Whenever a reference to the same array is encountered again, only the ID is stored. This allows serializing complete graphs of objects. An empty value is encoded simply as an integer with the value of zero or as an empty array (depending on the application).

Each value starts with a type and a length in a single byte. The type is in the low 4 bits and the length is in the high 4 bits. The length is present only for arrays, strings and hashes (it is an error for it to be present on other types). This allows putting lengths of up to 12 elements directly into the first byte. The length is instead read as an additional unsigned byte (when the length field is 13), unsigned short (when it is 14) or 32-bit integer (when it is 15). The shortest representation must be used and it's an error to accept a bigger representation of a smaller length.

The floats are stored with denormals flushed to zero. It's an error to accept values that have them present. Flushing is done on the bitwise integer representation of the float, setting bits 0-22 (inclusive, 23 bits in total) to zero in case all of the bits 23-30 (the rest, excluding the sign bit) are zero. Similarly, when deserializing, if the bitwise representation (with the sign bit masked away) is between 1 and 0x7FFFFF (inclusive), an error must be emitted as that would be a float number in denormalized form.

The floats must have NaN (not-a-number) values normalized to a quiet NaN without any payload. If the exponent bits are all set (meaning it's an infinity or a NaN) and the low 23 bits are non-zero (so it's a NaN), change the low 23 bits to have only their most significant bit set. It's an error to accept non-normalized NaNs.
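
The two rules expressed on the bitwise integer representation (a sketch):

function normalize_float_bits(bits)
{
    if ((bits & 0x7F800000) == 0) {
        // denormal (or zero): clear bits 0-22, keeping the sign and the (zero) exponent
        bits &= 0xFF800000;
    }
    else if ((bits & 0x7F800000) == 0x7F800000 && (bits & 0x007FFFFF) != 0) {
        // NaN: keep the sign and exponent, set only the most significant significand bit
        bits = (bits & 0xFF800000) | 0x00400000;
    }
    return bits;
}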

The smallest array/string/ref/int/float form must always be used for storage (in the case of an empty array/string the ARRAY_BYTE/STRING_BYTE type must be used). It's an error to accept bigger array/string/ref/int/float forms for values that don't require them. The hash tables must not contain duplicate keys, and it's an error to accept duplicate keys.

These restrictions are to ensure having a canonical format suitable for hash keys and exchange between different systems and implementations as well as to minimize covert channels for data leakage.

Here is a table of all types:

Type ID Description
ZERO 0 zero integer value
BYTE 1 8-bit unsigned integer
SHORT 2 16-bit unsigned integer
INT 3 32-bit signed integer
FLOAT 4 32-bit float (stored with the denormals flushed to zero and having normalized NaNs)
FLOAT_ZERO 5 positive zero float value
REF 6 reference to previously encountered array/string/hash (as 32-bit index)
REF_SHORT 7 reference to previously encountered array/string/hash (as 16-bit index)
ARRAY 8 array of values
ARRAY_BYTE 9 array of unsigned 8-bit integers
ARRAY_SHORT 10 array of unsigned 16-bit integers
ARRAY_INT 11 array of signed 32-bit integers
STRING_BYTE 12 string with each character stored as an unsigned 8-bit integer
STRING_SHORT 13 string with each character stored as an unsigned 16-bit integer
STRING_INT 14 string with each character stored as a signed 32-bit integer
HASH 15 hash table with the insertion order preserved, contains given number of pairs of key and value

Advanced topics

Runtime detection of user defined types

Sometimes you need to identify object types at runtime. Since there are no user types in the language, you have to use a little trick (directly or indirectly through a token processor). It uses the property of function references that they are identified uniquely even across different heaps. This allows using them as markers in data structures, even when the structures are deep cloned into another heap.

However, for ordinary cases it's better to use a type member in the base class. This runtime detection is more suitable for cases where you pass around unknown kinds of objects and still want to identify certain types without a doubt.
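
A sketch of the trick (OBJ_marker and OBJ_SIZE are illustrative constants; it assumes the name#argcount syntax for function references and that === performs an exact comparison):

function @my_type_marker() {}

function my_type_create()
{
    var obj = object_create(OBJ_SIZE);
    obj->OBJ_marker = my_type_marker#0;  // the function reference acts as a unique type tag
    return obj;
}

function is_my_type(value)
{
    return is_array(value) && length(value) >= OBJ_SIZE && value->OBJ_marker === my_type_marker#0;
}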

Distributed garbage collection between threads

While you would usually use native handles to determine the liveness of objects between different threads (heaps), you can also achieve that with pure language constructs.

Say you want to make some object available to a different heap as a reference instead of copying it. Shared arrays have the property that when cloned into another heap they retain the same reference within that heap if they were already cloned there before. Additionally, you can check how many heaps are referencing a shared array.

From these two features you can easily construct handles that you can pass around and use to determine global liveness. To create such a handle, just create a zero-sized shared array and associate the internal data with it using a hash table. Then pass the shared array reference around. To obtain the internal data, just get it from the hash table. To determine when the reference is no longer used in other heaps, traverse the hash table and check whether the number of active heaps is just one. Then you know you can release the internal data and invalidate the handle. To make it less processor intensive, just iterate the hash table in small batches.

Weak references

Weak references are useful in various scenarios. In the basic form they simply allow the target object to be garbage collected once no normal references are pointing to it. When this occurs, the reference to the target is simply cleared from the weak reference and you can't obtain the original object anymore.

This is useful in cases where you register an object to receive some events (change listeners, timers, etc.) but the original object has become unreferenced in the meantime. Ordinarily you would need to manually deregister it from receiving the events, which would be painful to implement when it's usually not needed for anything else.

Instead, in your listener you use a weak reference to the object, and once you detect that the object is not around anymore you just deregister from receiving the event. To avoid accumulating many such listener proxies, the event source object can directly implement weak listeners.

In the more advanced form, you can set up an automatic action for when the target object disappears. You can provide a container (hash table or array) from which an entry is removed (for a hash) or to which an entry is added (for an array). With this you can create mappings to objects (to externally attach additional data to existing objects), implement self-clearing caches or even generally detect object destruction to run arbitrary code (however you need to check periodically for newly collected objects).
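
The array-container form could be used like this (a sketch):

function watch(obj, collected)
{
    // the weak reference is appended to the collected array once obj is garbage collected
    return weakref_create(obj, collected);
}

function process_collected(collected)
{
    for (var i=0; i<length(collected); i++) {
        log("a watched object was garbage collected");  // react here, eg. release resources
    }
    array_clear(collected);  // the array must be emptied periodically to prevent a memory leak
}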

Non-local GOTO

In some rare cases you may need to jump out of nested functions and continue on another execution path. You can use the exception mechanism for that, but instead of creating an error you can pass a function reference as a marker. You can then uniquely detect it and choose a different execution path. This operation is fast as no stack trace needs to be gathered.

The only caveat is that you must make sure that no code between the throwing and the catching wraps the function reference into an error.

Circular imports

Sometimes you end up in a situation where two scripts depend on each other. This is currently not supported in FixScript because all the possible solutions quickly lead to an overcomplicated and fragile mess, especially when combined with token processors. Maybe a solution will be found in the future, but it seems unlikely. There are also some overcomplicated hacks possible using token processors, but these are limited as well (for example combining multiple complex token processors would be next to impossible).

The solution is to simply merge such scripts into one; after all, if they depend on each other so much, they should be one unit. Still, there are cases where putting it all together would create an unmaintainable mess. This can be solved using a token processor that simply includes the content of the other script, making it a single unit yet separated into different files.

Other times there is some dependency but it is minimal. For such cases it's better to work around the problem by calling functions dynamically or by creating a third script file with the common parts.

Guarding internal state of objects

An important property of objects is that they encapsulate their internal state from the outside. Often you need to store strings or other mutable objects, but simply storing the reference would expose the internal state to outside manipulation. Therefore you need to guard it by making a copy and storing that internally. And when returning the data back to the user you need to create another copy so the internal state stays fully guarded.

While this works, it is clearly quite inefficient. The issue is mostly with strings; for other kinds of objects, storing them by reference is usually the intended usage.

It is therefore recommended to use the string_const function and store a constant variant of the provided string. The function returns the same string when it's already a constant, and it also maintains a set of existing constant strings so there is never more than one instance of the same constant string in the heap.

This way there is at most one copy made when storing (none when the string was already constant), and there is no need to do anything when returning it to the user.

However, for objects intended to work with serialization there is another complication. Once deserialized, the strings are no longer constant. You would either need to make all strings constant by recursively traversing the whole data structure, or, more efficiently, convert them in the getter just before returning them to the user.

The getter needs to call the string_const function and store the result back into the object before returning it to the user. At first it will convert the string, but only for those strings that are actually obtained, and when obtained multiple times it will just return the same constant string.
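
For example (a sketch; OBJ_name is an illustrative field constant):

function get_name(obj)
{
    var name = string_const(obj->OBJ_name);
    obj->OBJ_name = name;  // cache the constant variant so later calls return it directly
    return name;
}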