Surviving the Structured Clone algorithm
Used internally by IndexedDB, Workers and other communication channels, or used directly via the structuredClone global utility, the structured clone algorithm is both “a wonder” for JS primitives and “a curse” for developers.
The Wonder
One could transfer, store or clone almost everything that the platform provides, most notably very complex primitives such as entire files, personal storage access credentials, buffers or views of all kinds and whatnot.
There are just a few limitations that, differently from JSON, which silently ignores or loses data while stringifying, mostly throw hard at runtime if found within the structured data (see the snippet after this list):
- no functions allowed (reasonable: functions have scope and context access that cannot be brought elsewhere)
- DOM nodes (also reasonable: workers, as example, have no DOM and surely not the same one live on the main thread)
- accessors and/or private fields, plus some special properties and the whole prototype chain, are fully ignored
- symbols will throw … but so will Proxies !!!
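To make the contrast concrete, here is a minimal sketch of that difference (the shape of data is illustrative):
// JSON silently loses the function and the symbol-keyed entry
const data = { fn() {}, [Symbol('k')]: 1, ok: true };
JSON.stringify(data); // '{"ok":true}' … no error, data just vanished
// structured clone throws hard on the function instead
try {
  structuredClone(data);
} catch ({ name }) {
  console.log(name); // "DataCloneError"
}
// a symbol value, or a Proxy around even plain data, throws too
structuredClone(new Proxy({}, {})); // DataCloneError again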
The Curse
- No Proxies means that most complex projects need to implement their own logic to avoid passing special references around, including the foreign-interface based wrappers that are the only kind of identity any WASM targeting programming language can offer to interop with JS … meaning: you have a Python Proxy around that contains just some data and you want to
postMessage({ some: py_proxy_map })
somewhere else? Nope ☠️
- No Classes (part 1) means that even if you extend native classes with extra sugar on top, nobody can distinguish that special taste elsewhere, even when the very same library and classes used at the origin of the message are available and identical in the receiving world 🤦 (see the sketch after this list)
- No Classes (part 2) also means there is no way to circumvent the fact that functions cannot travel, because only instances of (derived) native classes can travel: forget about your logic being transferable anywhere else 🥲
- No Classes (part 3) also means that accessors cannot sit where they belong, which is the prototype: if they do, that data won’t travel, but if they don’t, the accessor logic and/or reactivity will be lost 😱
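A minimal sketch of the “part 1” pain: a native-derived instance travels, but its identity does not (the Tag class here is purely illustrative):
class Tag extends Map {
  get first() { return [...this.keys()].at(0); }
}
const tag = new Tag([['a', 1]]);
tag.first; // 'a'
// the clone is a plain Map: the Tag prototype, and its
// accessor, are gone, even if Tag exists on both sides
const clone = structuredClone(tag);
clone instanceof Map; // true
clone instanceof Tag; // false
clone.first;          // undefined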
The Common (non) Solution …
There are literally dozens of attempts to solve this very same issue at the serialization level, providing libraries that accept extensions where a gazillion instanceof
operations are performed and, if some registered extension returns something, that “something” gets serialized among the rest of the data. In all my benchmarks, though, that dance is a slow and bloated overhead I really don’t want to deal with anymore (a sketch of it follows the list):
- requires both serialization and deserialization
- requires a lot of callbacks invocation for extensions of all kinds
- a way to retrieve instances back on the other side is not always provided
- it’s most of the time fully focused on cross programming language portability, while I want to solve my JS issues in JS and get rid of any unnecessary abstraction that makes the critical path way slower than it could be
- it produces, in any case, something that needs to travel via postMessage and be received via message handlers on the other side, when the project is JS
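For reference, this is roughly the extension dance such libraries impose, as a hand-rolled sketch (not any specific library’s API):
// every value crawled by the serializer is tested against
// every registered extension: lots of instanceof checks
const extensions = [];
const register = (Class, toData, fromData) =>
  extensions.push({ Class, toData, fromData });
const serialize = value => {
  for (const { Class, toData } of extensions) {
    if (value instanceof Class)
      return { tag: Class.name, data: toData(value) };
  }
  return { tag: '', data: value };
};
const deserialize = ({ tag, data }) => {
  for (const { Class, fromData } of extensions) {
    if (tag === Class.name) return fromData(data);
  }
  return data;
};
// usage: one callback pair per type, invoked per crawled value
register(Date, date => date.toISOString(), iso => new Date(iso));
deserialize(serialize(new Date())); // a Date again, at a cost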
Don’t get me wrong, projects such as MessagePack or cbor-x are fantastic, but they don’t solve my specific issue: I want to send JS references to a JS target and I don’t always need a binary overhead, because binary serialization makes sense only with synchronous Atomics.wait, while every other case is better off with just async dances that don’t need 3rd party libraries or binary serialization to work!
The JSON limit …
I can hear people already thinking “mate, just use JSON then”, but that misses the point on so many levels:
- the toJSON() escape hatch won’t get triggered by structured clone + it does not itself provide a way to revive whatever was returned afterwards
- JSON cannot represent recursive data, and any recursion-capable alternative is not nearly as fast as JSON is, surely not faster than structured clone
- all complex primitives need to be serialized into a JSON friendly format, where if you pass a buffer as Uint8Array, all its keys from 0 to its length will also pass through the optional callback you provided thinking you were smart in there (see the snippet after this list) … it’s a slippery slope to slowness that unfortunately not many out there realize: as soon as any callback to serialize or parse back is provided, goodbye performance!
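To see that slope in practice, count how many times a replacer callback runs for one tiny view:
let calls = 0;
const view = new Uint8Array([1, 2, 3, 4]);
JSON.stringify({ view }, (key, value) => {
  calls++;
  return value;
});
// the root, the "view" key, plus one call per index:
console.log(calls); // 6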
In short, JSON remains the fastest and preferred way to deal with simple data + it never throws, unless unexpected recursion is passed along, but it’s definitely not a solution. Many have used it to work around the Proxy issue by providing normal data via toJSON, when accessed, yet we have no way to have the cake (structured clone) and eat it too (a better mechanism than toJSON to both serialize and deserialize).
My Proposal @ WHATWG
I wasn’t joking when I said it was my birthday wish, ’cause if there’s something that is driving me crazy, and is extremely infuriating for the kind of projects I deal with daily (WASM driven PLs hooked into JS via main or worker threads), it’s the inability to intercept the structured clone internal (recursion capable) data crawling and provide hooks that would let me decide how any Proxy around users’ code gets transformed or reflected elsewhere, in a way that can be restored on the other side of the affair.
In my case, what travels can be anything the Reflect namespace can handle, via postMessage orchestration and SharedArrayBuffer based synchronous cross-realm communication: even DOM nodes from the main thread, nodes that can be passed around as method arguments (think of an element.appendChild(other) that happens within a worker) and whatnot!
The limit, in my case, is not even my imagination, it’s simply the lack of a better way to deal with all this … so here I come with a proposal: a SerializationRegistry namespace to rule them all!
class Serializable {
static revive([count, data]) {
const ref = new this(data);
ref.#count = count;
return ref;
}
// private properties via reviver? ✅
#count = 0;
#data;
constructor(data) {
this.#data = data;
}
// accessors? ✅
get access() {
return this.#count;
}
get data() {
this.#count++;
return this.#data;
}
// define how to travel? ✅
get [SerializationRegistry.symbol]() {
return SerializationRegistry.transfer(
'my-project@Serializable',
[this.#count, this.#data],
);
}
}
// define how to revive? ✅
SerializationRegistry.register(
'my-project@Serializable',
Serializable.revive.bind(Serializable),
);
OK then, we have our very own class that:
- defines a SerializationRegistry.symbol accessor, just the way Symbol.toStringTag or others work, so that it’s clear no argument would ever be passed while cloning; that accessor will be invoked while cloning
- that accessor explicitly returns something to transfer, which assumes the class has been registered with that unique identifier, either in this world or in the receiving one
- SerializationRegistry.register registers that unique identifier so that this class, as a module, would work to both send and receive its own kind (a userland sketch of these semantics follows)
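For clarity, here is a minimal userland sketch of what such a registry could look like; the __sr__ marker and the revive hook are purely illustrative, not part of the proposal:
const SerializationRegistry = {
  // the well-known symbol classes implement as an accessor
  symbol: Symbol('SerializationRegistry'),
  revivers: new Map(),
  register(id, revive) { this.revivers.set(id, revive); },
  unregister(id) { this.revivers.delete(id); },
  // wraps the payload so the clone algorithm knows its identifier
  transfer(id, data) { return { __sr__: id, data }; },
  // what a hooked structured clone would do on the receiving side
  revive({ __sr__: id, data }) {
    const revive = this.revivers.get(id);
    if (!revive) throw new Error(`unregistered identifier: ${id}`);
    return revive(data);
  },
};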
… and that’s it? Let’s see it in practice:
const ref = new Serializable({ some: 'data' });
// just to trigger the accessor and increment count
ref.data; // { some: 'data' }
ref.data; // { some: 'data' }
ref.access; // 2
// let's post that reference
postMessage({ extras: true, data: [1, ref, 2] });
// on the receiver side
self.onmessage = event => {
const { extras, data } = event.data;
const ref = data.at(1);
console.log(ref); // instance of Serializable
  ref.access; // 2 (the count traveled too)
  ref.data; // { some: 'data' }
};
… how wonderful is that?
- we can just define when/where appropriate a way to both serialize and deserialize data
- we don’t need to change anything else around the code
- proxies can handle that symbol when accessed and never throw
- no special IDs, properties, extra checks or extra crawling are needed to send data, or retrieve it back, as we meant it in our program
- … profit for everyone?
Not just transferable …
Of course, the SerializationRegistry offers a way to register, unregister or explicitly transfer data, but its current special symbol, which ideally could instead be a global Symbol.toStructuredClone, so that it’d be detached from the registry logic (although used internally), allows one to simply intercept clone intents and provide a substitute:
const pythonHandler = {
get(target, prop) {
if (prop === SerializationRegistry.symbol) {
if (target instanceof PythonProxy)
return target.to_js();
}
return Reflect.get(target, prop);
}
};
const ref = new Proxy(python_ref, pythonHandler);
structuredClone(ref); // it will not throw 🥳
In Summary
It took me years of experience in the Proxy, Atomics, Workers, MessageChannel, SharedArrayBuffer with binary serialization and FinalizationRegistry field to land what seems to be an obvious and simple enough proposal to tame the most powerful, yet limited, API we have to send or clone data in JS. I would be more than happy to answer any question around this proposal but please, if you think it solves it all for you too, help me and everyone else working with JS and Web based primitives to move this proposal forward, because it is really missing out there and it’s really bad that only the PL itself can decide what class can travel and what cannot … so thanks in advance to whoever will help us all have a way to fix the structured clone related issues that keep affecting our less trivial projects 🙏