The part that's incompatible with the current semantics of symbols is importing that symbol as
an immutable symbolic reference: not really a "variable" reference, but a binding
of a symbol to a value in the run-time namespace (or package in CL terminology, although
as far as I know CL does not provide a way to specify what I'm suggesting either).
However, that would capture the semantics of ELF shared objects, whose text and read-only data
segments are loaded into memory that is in fact immutable for a userspace program.
It looks to me like the portable dump code/format could be adapted to serve the purpose I have in mind here. What needs to be added is a way to limit the scope of the dump so that only the appropriate set of objects is captured.
I'm going to start with a copy of pdumper.c and pdumper.h renamed to ndumper (n for namespace). The pdmp format conceptually organizes the Emacs executable space into a graph with three nodes: an "Emacs executable" node (the temacs text and ro sections), an "Emacs static" node (sections of the executable loaded into writable memory), and a "dump" node, corresponding to heap-allocated objects that were live at the time of the dump. The dump node has relocations that can point into itself or to the Emacs executable, and "discardable" relocations for values instantiated into the "Emacs static" node. While the data structure doesn't require it, the only values saved from the Emacs static data are symbols, primitive subrs (not native compiled), and the thread structure for the main thread.
There can be cycles between these nodes in the memory graph, but cutting the edge[s] between the Emacs executable and the Emacs static nodes yields a DAG.
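To make the graph claim concrete, here is a toy check, written in Python only for brevity. The edge set is my assumption about which references exist between the three nodes; it is not extracted from pdumper.c, and references inside a single node (such as dump-to-dump relocations) are left out because they do not matter at this level.

```python
# Toy model of the three-node memory graph.
# ASSUMPTION: this edge set is illustrative, not derived from pdumper.c.
EDGES = {
    ("dump", "executable"),    # relocations into the Emacs executable
    ("dump", "static"),        # "discardable" relocations into Emacs static
    ("static", "executable"),  # e.g. primitive subrs pointing at code
    ("executable", "static"),  # code referring to writable static data
}


def has_cycle(edges):
    """Return True if the directed graph given as (src, dst) pairs has a cycle."""
    graph = {}
    for src, dst in edges:
        graph.setdefault(src, set()).add(dst)
        graph.setdefault(dst, set())
    state = {}  # node -> "active" while on the DFS stack, "done" afterwards

    def visit(node):
        if state.get(node) == "done":
            return False
        if state.get(node) == "active":
            return True  # reached a node that is still on the stack
        state[node] = "active"
        if any(visit(nxt) for nxt in graph[node]):
            return True
        state[node] = "done"
        return False

    return any(visit(node) for node in graph)


def cut(edges, a, b):
    """Drop every edge between nodes a and b, in either direction."""
    return {(s, d) for (s, d) in edges if {s, d} != {a, b}}


print(has_cycle(EDGES))                                # cyclic as given
print(has_cycle(cut(EDGES, "executable", "static")))   # acyclic after the cut
```

With this edge set the only cycle runs through the executable and static nodes, so removing the edges between them leaves a DAG.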
Note, pdumper does not make the partition I'm describing explicitly. I'm inferring that there must be such a partition. The discardable relocations should be ones that instantiate into static data of the temacs executable.
My plan is to refine the structure of the Emacs process introduced by pdumper to yield a namespace graph structure with the same property: cutting the edge from executable to runtime state yields a DAG whose only root is the Emacs executable.
Each ndmp namespace (or module or cl-package) would have its own symbol table and a unique namespace identifier, with a runtime mapping to the file backing it (if loaded from a file).
Interned symbols will be extended with three additional properties: static value, constant value and constant function. For variables, scope resolution will be done at compile time:
* Value if not void (undefined), else
* Static value
A constant symbol is referenced by importing it, either from another namespace or from a variable in the current namespace's compile-time environment. An attempt at run time to rebind a symbol bound by an import form will signal an error. If multiple imports bind a particular symbol at run time, the later binding will effectively shadow the earlier one. Any sequence of imports and other forms that would make the resolution of a particular variable ambiguous at compile time will signal an error. That is, a given symbol will have only one associated binding in the namespace scope during a particular evaluation time (eval, compile, compile-compile, etc.).
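A small Python model of the run-time half of these import rules may help. All names here (Namespace, import_from, ImportedBindingError) are hypothetical, and the compile-time ambiguity check is not modeled.

```python
class ImportedBindingError(Exception):
    """Hypothetical error for an attempt to rebind an imported symbol."""


class Namespace:
    def __init__(self, name):
        self.name = name
        self.exports = {}  # symbol name -> constant value offered to importers
        self.imports = {}  # symbol name -> (source namespace name, value)
        self.values = {}   # ordinary, rebindable bindings

    def export(self, symbol, value):
        self.exports[symbol] = value

    def import_from(self, other, symbol):
        # A later import of the same symbol shadows the earlier one.
        self.imports[symbol] = (other.name, other.exports[symbol])

    def set(self, symbol, value):
        # Rebinding a symbol bound by an import is an error.
        if symbol in self.imports:
            raise ImportedBindingError(symbol)
        self.values[symbol] = value

    def lookup(self, symbol):
        if symbol in self.imports:
            return self.imports[symbol][1]
        return self.values[symbol]


a, b, user = Namespace("a"), Namespace("b"), Namespace("user")
a.export("limit", 10)
b.export("limit", 20)
user.import_from(a, "limit")
user.import_from(b, "limit")   # shadows the binding imported from "a"
print(user.lookup("limit"))
```

Calling user.set("limit", 30) at this point raises ImportedBindingError, while setting a symbol that was never imported goes through normally.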
A static value binding will be global but not dynamic. A constant value binding will result from an export form in an eval-when-compile form encountered while compiling the source of the ndmp module. Since static bindings capture the "global" aspect of the current semantics of special variable bindings, dynamic scope can be safely restricted to provide thread-local semantics. Instantiation of a compiled ndmp object will initialize the bindings to be consistent with the current semantics of defvar and setq in global scope, as well as the separation of compile-time and eval-time variable bindings. [I am not yet certain what the exact approach to ensuring that will be.] Note that constant bindings are only created by "importing" from the compile-time environment through eval-when-compile under the current semantics model. This approach simply avoids the beta substitution of compile-time variable references performed in the current implementation of eval-when-compile semantics. Macro expansion is still available to insert such values directly in forms from the compile-time environment.
A function symbol will resolve to the function property if not void, and the constant function property otherwise.
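The variable rule listed earlier and the function rule above have the same shape, so one sketch covers both. This is a Python model with hypothetical names (Symbol, VOID, resolve_variable, resolve_function); the constant value property is left out because its place in the variable lookup order is not spelled out in the list.

```python
VOID = object()  # sentinel standing in for an unbound (void) cell


class Symbol:
    def __init__(self, name):
        self.name = name
        self.value = VOID              # existing value cell
        self.static_value = VOID       # proposed: global, non-dynamic binding
        self.function = VOID           # existing function cell
        self.constant_function = VOID  # proposed: imported constant function


def resolve_variable(sym):
    """Value cell if not void, else the static value."""
    if sym.value is not VOID:
        return sym.value
    if sym.static_value is not VOID:
        return sym.static_value
    raise LookupError(f"void variable: {sym.name}")


def resolve_function(sym):
    """Function cell if not void, else the constant function."""
    if sym.function is not VOID:
        return sym.function
    if sym.constant_function is not VOID:
        return sym.constant_function
    raise LookupError(f"void function: {sym.name}")


s = Symbol("fill-column")
s.static_value = 70
print(resolve_variable(s))   # falls back to the static value
s.value = 80
print(resolve_variable(s))   # the value cell takes precedence once bound
```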
Each ndmp module will explicitly identify the symbols it exports, and those it imports. The storage of variable bindings for unexported symbols will not be directly referenceable from any other namespace. Constant bindings may be enforced by loading into a read-only page of memory, a write barrier implemented by the system, or unenforced. In other words, attempting to set a constant binding is an error with unspecified effect. Additional declarations may be provided to require the signaling of an error, the enforcement of constancy (without an error), both, or neither. The storage of static and constant variables may or may not be incorporated directly in the symbol object. For example, such storage may be allocated using separate hash tables for static and constant symbol tables to reduce the allocation of space for variables without a static or constant binding.
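As an illustration of the side-table option and of the four enforcement combinations, here is a Python sketch. The class, its flags, and the error type are all hypothetical, and treating "signal without enforce" as "write, then raise" is only one possible reading.

```python
class ConstantBindingError(Exception):
    """Hypothetical error for an attempt to set a constant binding."""


class SymbolStore:
    """Static and constant bindings kept in side tables, not in each symbol."""

    def __init__(self, enforce=True, signal=True):
        self.enforce = enforce    # keep the old value on an attempted write
        self.signal = signal      # raise an error on an attempted write
        self.static_table = {}    # only symbols with a static binding use space
        self.constant_table = {}  # only symbols with a constant binding use space

    def define_static(self, name, value):
        self.static_table[name] = value

    def define_constant(self, name, value):
        self.constant_table[name] = value

    def set_constant(self, name, value):
        if not self.enforce:
            self.constant_table[name] = value  # unenforced: the write lands
        if self.signal:
            raise ConstantBindingError(name)


quiet = SymbolStore(enforce=True, signal=False)
quiet.define_constant("max-depth", 100)
quiet.set_constant("max-depth", 5)          # silently ignored
print(quiet.constant_table["max-depth"])    # still the original value
```

A symbol with neither kind of binding never appears in either table, which is the space saving the hash-table layout is meant to provide.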
When compiling a form that imports a symbol from an ndmp module, importing in an eval-when-compile context will resolve to the constant value binding of the symbol, as though the source forms were concatenated during compilation to have a single compile time environment. Otherwise, the resolution will proceed as described above.
There will be a distinguished ndmp object that contains relocations instantiated into the Emacs static nodes, serving the baseline function of pdmp. There will also be a distinguished ndmp object "ELISP" that exports all the primitives of Emacs Lisp. The symbols of this namespace will be implicitly imported into every ndmp unless overridden by a special form to be specified. In this way, a namespace may use an alternative Lisp semantic model, e.g. CL. Additional forms for importing symbols from other namespaces remain to be specified.
Ideally the byte code vm would be able to treat an ndmp object as an extended byte code vector, but the restriction of the byte-codes to 16-bit addressing is problematic.
For 64-bit machines, the ndmp format will restrict the (stored) addresses to 32 bits, and use the remaining bits of relocs not already used for administrative purposes as an index into a vector of imported namespaces in the ndmp file itself, where the 0 value corresponds to an "un-interned" namespace that is not backed by a (permanent) file. I don't know what the split should be in 32-bit systems (without the wide-int option). The interpretation of the bits is specific to file-backed compiled namespaces, so it may restrict the number of namespace imports in a compiled object without restricting the number of namespaces imported in the runtime namespace.
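One way to picture the 64-bit split is the following Python sketch. Only the 32-bit stored address and the meaning of index 0 come from the description above; the 8-bit width of the administrative field, the field order, and the function names are assumptions chosen for illustration.

```python
ADDR_BITS = 32   # stored address width, as described above
ADMIN_BITS = 8   # ASSUMPTION: width of the administrative field
NS_BITS = 64 - ADDR_BITS - ADMIN_BITS  # remaining bits index the import vector
UNINTERNED = 0   # index 0: namespace not backed by a (permanent) file


def pack_reloc(ns_index, admin, addr):
    """Pack the three fields into one 64-bit integer."""
    if not 0 <= addr < (1 << ADDR_BITS):
        raise ValueError("address does not fit in 32 bits")
    if not 0 <= admin < (1 << ADMIN_BITS):
        raise ValueError("administrative field out of range")
    if not 0 <= ns_index < (1 << NS_BITS):
        raise ValueError("too many imported namespaces for this layout")
    return (ns_index << (ADDR_BITS + ADMIN_BITS)) | (admin << ADDR_BITS) | addr


def unpack_reloc(word):
    """Split a packed relocation back into (ns_index, admin, addr)."""
    addr = word & ((1 << ADDR_BITS) - 1)
    admin = (word >> ADDR_BITS) & ((1 << ADMIN_BITS) - 1)
    ns_index = word >> (ADDR_BITS + ADMIN_BITS)
    return ns_index, admin, addr


word = pack_reloc(ns_index=3, admin=0x11, addr=0x1000)
print(unpack_reloc(word))
```

Under these assumed widths a compiled object could name fewer than 2^24 imported namespaces, which is a limit of the file format only and says nothing about how many namespaces the running process may import.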
Once implemented, this functionality should significantly reduce the need for a monolithic dump or "redumping" functionality. Or rather, "dumping" will be done incrementally.
My ultimate goal is to introduce a clean way to express a compiled object that has multiple code labels, and a mechanism to call or jump to them directly, so that the expressible control-flow structure of native and byte compiled code will be equivalent (I believe the technical term is that there will be a bisimulation between their operational semantics, but it's been a while). An initial version might move in this direction by encoding the namespaces using a byte-code vector to trampoline to the code-entry points, but this would not provide a bisimulation. Eventually, the byte-code VM and compiler will have to be modified to make full use of ndmp objects as primary semantic objects without intermediation through byte-code vectors as currently implemented.
If there's an error in my interpretation of the current implementation (particularly pdumper), I'd be happy to find out about it now.
As a practical matter, I've been working with the 28.1 source. Am I better off continuing with that, or starting from a more recent commit to the main branch?
Lynn