ets-module-collection/AIO_Server/Dep_LuaSrcDiet/technotes.txt

Tech Notes for LuaSrcDiet
=========================

The following are notes on the optimization process for easy reference.


Miscellaneous Ideas, TODO Stuff
===============================

Other Ideas:
(a) remove unused locals that can be removed in the source
(b) remove declarations using nil
(c) remove table constructor elements using nil
(d) extra optional semicolon removal
(e) extra comma or semicolon removal in table constructors
(f) special number forms: using ^ and * to shorten constants: 1^16
(g) simple declaration of locals that can be merged: local a,b,c,d
(h) warn of opportunity for using a local to zap a bunch of globals
(i) warn of trailing whitespace in strings or long strings
(j) spaces to tabs in comments/long comments/long strings
(k) convert long strings to normal strings, vice versa
(l) simplify logical or relational operator expression


Modified Lexer Output
=====================

The lexer is works almost exactly like llex.c in 'normal' Lua. Instead
of returning one token on each call, the lexer processes the entire
string (from an entire file) and returns. Two lists (tokens and semantic
information items) are set up in the module for use by the caller.

For maximum flexibility during processing, the lexer returns non-grammar
lexical elements as tokens too. Non-grammar elements, such as comments,
whitespace, line endings, are classified along with 'normal' tokens. The
lexer classifies 7 kinds of grammar tokens and 4 kinds of non-grammar
tokens:

---------------------------------------------------------------------
TOKEN CLASS	DESCRIPTION
---------------------------------------------------------------------
"TK_KEYWORD" 	keywords
"TK_NAME"	identifiers
"TK_NUMBER" 	numbers (unconverted, kept in original form)
"TK_STRING" 	strings (no translation is done, includes delimiters)
"TK_LSTRING" 	long strings (no translation is done, includes delimiters)
"TK_OP" 	operators and punctuation (most single-char, some double)
"TK_EOS" 	end-of-stream (there is only one for each file/stream)
---------------------------------------------------------------------
"TK_SPACE" 	whitespace (generally, spaces, \t, \v and \f)
"TK_COMMENT" 	comments (includes delimiters, also includes special
		first line shbang, which is handled specially in the
		optimizer)
"TK_LCOMMENT" 	block comments (includes delimiters)
"TK_EOL" 	end-of-lines (excludes those embedded in strings)
---------------------------------------------------------------------


Table for Lexer-Based Optimizations
===================================

We aim to keep lexer-based optimizations free of parser considerations,
i.e. we allow for generalized optimization of token sequences. The table
below considers the requirements for all combinations of significant
tokens. Other tokens are whitespace-like. Comments can be considered to
be a special kind of whitespace, e.g. a short comment needs to have a
following EOL token, if we do not want to optimize away short comments.

FIRST	SECOND TOKEN
TOKEN	  |
  |	  V
  V	Keyword	  Name	Number	String	LString	 Oper
--------------------------------------------------------
Keyword	  [S]	  [S]	  [S]	   0	   0	   0
Name	  [S]	  [S]	  [S]	   0	   0	   0
Number	  [S]	  [S]	  [S]	   0	   0	  [1]
String	   0	   0	   0	   0	   0	   0
LString	   0	   0	   0	   0	   0	   0
Oper	   0	   0	  [1]	   0	   0	  [2]
--------------------------------------------------------

[S] = need at least one whitespace (set as either a space or keep EOL)

[1] = need a space if operator is a '.', all others okay; a '+' or '-'
      is used as part of a floating-point spec, but there does not
      appear to be any way of creating a float by joining with number
      with a a '+' or '-' plus another number, because an 'e' has to
      be somewhere in the first token, this can't be done

[2] = normally there cannot be consecutive operators, but we plan to
      allow for generalized optimization of token sequences, i.e. even
      sequences that are grammatically illegal; so disallow adjacent
      operators if:
      (a) the first is in [=<>] and the second is '='
      (b) disallow dot sequences to be adjacent, but "..." first okay
      (c) disallow '[' followed by '=' or '[' (not optimal)

Also, a minus '-' cannot preceed a Comment or LComment, because comments
start with a '--' prefix. Apart from that, all Comment or LComment
tokens can be set abut with a real token.


Sequence of Lexer-Based Optimization Process
============================================

*** TODO ***


Description of Locals Optimization
==================================

VARIABLE TRACKING AND LOCAL VARIABLE RENAMING

(A) TK_NAME token class considerations
--------------------------------------

A TK_NAME token means a number of things, and some of these cannot be
renamed. We are interested in the use of TK_NAME in the following:

  (a) global variable access
  (b) local variable declaration, includes local statements, local
      functions, function parameters, implicit "self" local, etc.
  (c) local variable access, including upvalue access

TK_NAME is also used in parts of grammar as constant strings -- these
tokens cannot be optimized without user assistance. These include:

  (d) as the key in key=value pairs in table construction
  (e) as field or method names in a:b or a.b syntax forms

For local variable name optimization, we do not need to consider (d) and
(e), and while global variables cannot be renamed (since we do not have
support for user assistance), they need to be considered as part of
Lua's variable access scheme.

(B) Lifetime of a local variable
--------------------------------

Take the following example:

  local string, table = string, table

In the example, the two locals are assigned the values of the globals
with the same names. Therefore, when Lua encounters the declaration
portion:

  local string, table

The parser cannot immediately make the two local variable available to
following code. In the parser and code generator, locals are inactive
when its entry is created. They are activated only when the function
adjustlocalvars() is called to activate the appropriate local variables.

In the example, the two local variables are activated only after the
whole statement has been parsed, that is, after the last "table" token.
Hence, the statement works as expected. Also, once the two local
variables goes out of scope, removevars() is called to deactivate them,
allowing other variables of the same name to become visible again.

Another example worth mentioning is:

  local a, a, a, = 1, 2, 3

The above will assign 3 to 'a'.

Thus, when optimizing local variable names, (1) we need to consider
accesses of global variable names affecting the namespace, (2) for the
local variable names themselves, we need to consider when they are
declared, activated and removed, and (3) within the 'live' time of
locals, we need to know when they are accessed (since locals that are
never accessed don't really matter.)

(C) Local variable tracking
---------------------------

Every local variable declaration is considered an object to be renamed.

From the parser, we have the original name of the local variable, the
token positions for declaration, activation and removal, and the token
position for all the TK_NAME tokens which references this local. All
instances of the implicit "self" local variable are flagged as such.

In addition to local variable information, all global variable accesses
are tabled, one object entry for one name, and each object has a
corresponding list of token positions for the TK_NAME tokens, which is
where the global variables were accessed.

The key criteria is: Our act of renaming cannot change the visibility of
any of these locals and globals at the time they are accessed.

Of course, if every variable has a unique name, then there is no need
for a name allocation algorithm, as there will be no conflict. But, in
order to maximize utilization of short identifier names, we want to
reuse the names as much as possible. In addition, fewer names will
likely reduce symbol entropy and may slightly improve compressibility of
the source code.

(D) Name allocation theory
--------------------------

To understand the renaming algorithm, first we need to establish how
different local and global variables can operate happily without
interfering with each other.

Consider three objects, local object A, local object B and global object
G. A and B involve declaration, activation and removal, and within the
period it is active, there may be zero or more accesses of the local.
For G, there are only global variable accesses to look into.

Assume that we have assigned a new name to A and we wish to consider its
effects on other locals and globals, for which we choose B and G as
examples. We assume local B has not been assigned a new name as we
expect our algorithm to take care of collisions.

A's lifetime is something like this:

        Decl            Act                             Rem
        +               +-------------------------------+
        -------------------------------------------------

where 'Decl' is the time of declaration, 'Act' is the time of
activation, and 'Rem' is the time of removal. Between 'Act' and 'Rem',
the local is alive or 'live' and Lua can see it if its corresponding
TK_NAME identifier comes up.

        Decl            Act                             Rem
        +               +-------------------------------+
        -------------------------------------------------
   *            *             *                              *
  (1)          (2)           (3)                            (4)

Recall that the key criteria is to not change the visibility of globals
and locals. Consider local and global accesses at (1), (2), (3) and (4).

A global G of the same name as A will only collide at (3), where Lua
will see A and not G. Since G must be accessed at (3) according to what
the parser says, and we cannot modify the positions of 'Decl', 'Act' and
'Rem', it follows that A cannot have the same name as G.

                Decl    Act                     Rem
                +       +-----------------------+
                ---------------------------------
 (1)+   +---+   (2)+   +---+      (3)+   +---+	    (4)+   +---+
    ---------      ---------         ---------         ---------

For the case of A and B having the same names and colliding, consider
the cases for which B is at (1), (2), (3) or (4) in the above.

(1) and (4) means that A and B are completely isolated from each other,
hence in the two cases, A and B can safely use the same variable names.
To be specific, since we have assigned A, B is considered completely
isolated from A if B's Activation-to-Removal period is isolated from
the time of A's first access to last access, meaning B's active time
will never affect any of A's accesses.

For (2) and (3), we have two cases where we need to consider which one
has been activated first. For (2), B is active before A, so A cannot
impose on B. But A's accesses are valid while B is active, since A can
override B. For no collision in the case of (2), we simply need to
ensure that the last access of B occurs before A is activated.

For (3), B is activated before A, hence B can override A's accesses. For
no collision, all of A's accesses cannot happen while B is active. Thus
position (3) follows the "A is never accessed when B is active" rule in
a general way. Local variables of a child function are in the position
of (3). To illustrate, the local B can use the same name as local A and
live in a child function or block scope if each time A is accessed, Lua
sees A and not B. So we have to check all accesses of A and see whether
they collide with the active period of B. Now if A was never accessed,
then B can be active anywhere.

The above appears to resolve all sorts of cases where the active times
of A and B overlap. If there is a more simple scheme, do let me know.
Note that in the above, the allocator does not need to know how locals
are separated according to function prototypes. Perhaps the allocator
can be simplified if knowledge of function structure is utilized.

(E) Name allocation algorithm
-----------------------------

To begin with, the name generator is mostly separate from the name
allocation algorithm. The name generator returns the next shortest name
for the algorithm to apply to local variables. To attempt to reduce
symbol entropy (which benefit compression algorithms), the name
generator follows English frequent letter usage. Later, there may be an
option to calculate an actual symbol entropy table from the input data.

Since there are 53 one-character identifiers and (53*63-4) two-character
identifiers (minus a few keywords), there isn't a pressing need to
optimally maximize name reuse. Sample files show that we can eventually
get 20 tokens per identifier name, thus a source file can have over 1000
local variable tokens that are all single character in length.

In theory, we should need no more than 260 local identifiers by default.
Why? Since LUAI_MAXVARS is 200 and LUAI_MAXUPVALUES is 60, at any block
scope, there can be at most (LUAI_MAXVARS + LUAI_MAXUPVALUES) locals
referenced, or 260. Also, those from outer scopes not referenced in
inner scopes can reuse identifiers. The net effect of this is that a
local variable name allocation method should not allocate more than 260
identifier names for locals.

The current algorithm is a simple first-come first-served scheme:

  (a) One local object that use the most tokens is named first.
  (b) Any other non-conflicting locals with respect to the first
      object are assigned the same name.
  (c) Assigned locals are removed and the procedure is repeated for
      objects that have not been assigned new names. (a) to (c)
      repeats until no local objects are left.

In addition, there are a few extra issues to take care of:

  (d) Implicit "self" locals that have been flagged as such are
      already "assigned to" and so they are left unmodified.
  (e) The name generator skips "self" to avoid conflicts. This
      is not optimal but it is unlikely a script will use so many
      local variables as to reach "self".
  (f) Keywords are also skipped for the name generator.
  (g) Global name conflict resolution.

For (g), global name conflict resolution is handled just after the new
name is generated. The name can still be used for some locals even if it
conflicts with other locals. To remove conflicts, global variable
accesses for the particular identifier name is checked. Any local
variables that are active when a global access is made is marked to be
skipped. The rest of the local objects can then use that name.

The algorithm has special handling for locals that use the same name in
the same scope. For example:

  local foo = 10 -- (1)
  ...
  local foo = 20 -- (2)
  ...
  print(e)

Since we are considering name visibility, the first 'foo' does not
really cease to exist when the second 'foo' is declared, because if we
were to make that assumption, and the first 'foo' is removed before (2),
then I should be able to use 'e' as the name for the first 'foo' and
after (2), it should not conflict with variables in the outer scope
with the same name. To illustrate:

  local e = 10 -- 'foo ' renamed to 'e'
  ...
  local t = 20 -- error if we assumed 'e' removed here
  ...
  print(e)

Since 'e' is a global in the example, we now have an error as the
name as been taken over by a local. Thus, the first 'foo' local must
have its active time extend to the end of the current scope. If there
is no conflict between the first and second 'foo', the algorithm may
still assign the same names to them.

The current fix to deal with the above chains local objects in order to
find the removal position. Practically, the parser can be modified to
simplify this.


END.