
Is encouraging binary encoding/decoding a good idea? Should it be so prominent? #6

Closed
domenic opened this issue Jul 1, 2021 · 10 comments

Comments


domenic commented Jul 1, 2021

I found the arguments at whatwg/html#6811 (comment) by @Kaiido somewhat persuasive. Basically, if you're encoding your bytes to and from a string, you're probably doing something wrong, and you should instead modify your APIs or endpoints to accept bytes anyway.

There are definitely cases where it's useful, mostly around parsing and serializing older file formats. But I'm not sure they need to be promoted to the language (or web platform).

Relatedly, even if we think this is a capability worth including, I worry that putting it on ArrayBuffer makes it seem too prominent. It makes base64 decoding/encoding feel "promoted" on the same level as fundamental binary-data operations such as slicing or indexed access. From this perspective something like https://github.com/lucacasonato/proposal-binary-encoding (with static methods) seems nicer in that it silos off this functionality to a separate utility class.
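
For concreteness, a rough sketch of the two shapes being contrasted; the names here are purely illustrative, not the actual APIs proposed by either repository:

// Encoding as an instance method on the binary object itself (illustrative name only):
const encoded = buffer.toBase64();

// Encoding siloed off into static methods on a separate utility namespace,
// in the spirit of proposal-binary-encoding (also illustrative names only):
const encoded2 = Base64.encode(buffer);
const decoded = Base64.decode(encoded2);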


bakkot commented Jul 1, 2021

you should instead modify your APIs or endpoints to accept bytes anyway

I don't know about you, but a substantial portion of the code I write talks to APIs which I am not in a position to modify. I think that's probably the case for many developers. Just to pick a few examples I've encountered: Google's speech-to-text API, the Google Drive API, and GitHub's API all expect you to provide binary data encoded with base64 in some circumstances.

In the other direction, a great many APIs return data in base64, usually as part of a larger response. For example, JSON-based APIs generally base64-encode binary data which they wish to return as part of the response (what else could you do?).
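
To make that concrete, a minimal sketch of what consuming such a response looks like in a browser today; the endpoint and field names are made up for illustration:

// Hypothetical JSON API that embeds binary data as a base64 string:
const { thumbnailBase64 } = await (await fetch('/api/item')).json();

// Today the decode goes through atob() plus a manual copy into a typed array:
const binary = atob(thumbnailBase64);
const bytes = Uint8Array.from(binary, c => c.charCodeAt(0));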

It makes base64 decoding/encoding feel "promoted" on the same level as fundamental binary-data operations such as slicing or indexed access.

Ehh... I don't think the fact that two APIs are exposed in the same way implies they are equally promoted. (Though actually indexing is done with syntax, not a method call, so indexing is strictly more promoted than this would be.) I have wanted Array.prototype.map approximately a thousand times more frequently than I've wanted Array.prototype.copyWithin, for example.

And ASCII serialization/deserialization is a pretty fundamental operation on binary data, so ArrayBuffer seems like the right place to put those methods. Certainly I as a developer would not think to look for a class outside of ArrayBuffer to find the method for base64-encoding an ArrayBuffer.


sffc commented Jul 14, 2021

JSON is a fundamental part of the language, and JSON requires that array buffers be stored as text, so I think Base64 is fundamental enough to be this prominent.


bathos commented Jul 16, 2021

Not disagreeing with the conclusion, but there are other ways to represent binary data in JSON and the suitability varies. Strings are often the most practical option, but for small binary values, arrays of numbers are usually better. A real world example of where strings are the worst option is the “challenge” and “user handle” binary values that get exchanged in WebAuthn.

Every demo of WebAuthn I’ve seen encodes these (tiny — 64 bytes or less) binaries as urlsafe base64 strings in JSON during interchange. (I’m not sure why they add the extra steps for urlsafe given it’s sent in a JSON body — any flavor of base64 would be fine — but they all seem to do it.)

Encoding those values as ordinary JSON arrays of numbers is more direct, less error-prone, and the size doesn’t make a material difference:

// JSON-serializable representation:
[ ...new Uint8Array(buffer) ];

// Simpler and safer restoration from JSON:
Uint8Array.from(array);
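
Putting those two pieces together, the full round trip through JSON looks something like this (a minimal sketch; challenge stands in for e.g. a WebAuthn challenge ArrayBuffer):

// Serialize: spread the bytes into a plain array of numbers
const json = JSON.stringify({ challenge: [...new Uint8Array(challenge)] });

// Deserialize: rebuild the typed array directly from the parsed numbers
const restored = Uint8Array.from(JSON.parse(json).challenge);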

@dead-claudia

I would like to point out that there are a few encoding/decoding types that are practically everywhere both client-side and server-side:

  • Base64 string ↔ raw binary, due to JSON, XML, and URLs not supporting arbitrary data without lots of escaping
    • The "URL-safe" encoding simply replaces + and / with _ and - - this would be a good candidate for an encoder option, but that's about it as a single decoder could easily decode both by simply changing a lookup table slightly.
  • Hex string ↔ raw binary, used both for raw data (Base64 would be better, but some people are just lazy) and for cryptographic constants
  • Native string ↔ raw UTF-8, because basically everything requires it (it's been the default text-to-binary conversion for Node's buffers since the moment .toString was added, and WHATWG's TextEncoder has only ever produced UTF-8); rough userland equivalents of all three are sketched below
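
For reference, a sketch of how these three conversions are commonly written in userland today (browser-flavoured; Node's Buffer covers the same ground with its own API):

// base64 (or base64url, after mapping '-'/'_' back) string -> bytes
const fromBase64 = s =>
  Uint8Array.from(atob(s.replace(/-/g, '+').replace(/_/g, '/')), c => c.charCodeAt(0));

// hex string -> bytes
const fromHex = s =>
  Uint8Array.from(s.match(/../g) ?? [], pair => parseInt(pair, 16));

// native string <-> raw UTF-8
const utf8 = new TextEncoder().encode('héllo');
const text = new TextDecoder().decode(utf8);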

I've had a WIP transcoding proposal sitting privately, and there is a very significant performance boost to be had by doing this within the engine: engines can iterate strings via their native representation (whether that's a cons string or a flat string) and build the result with zero unnecessary copies. Additionally, while this isn't in and of itself an argument for putting it in JS engines, all three are embarrassingly parallel tasks very well suited to SSE vectorization, and JS engines are much more likely than embedders to pursue such optimizations where possible, since they already have to care a lot about architectural specifics for their JIT and WebAssembly support.


jimmywarting commented Feb 15, 2022

I do agree with the original commenter that base64 should be discouraged. It's wasteful to use 33% more bandwidth and to spend unnecessary processing time encoding/decoding to and from strings.

And the WebAuthn/JSON case of sending things back and forth between APIs can be dealt with using other communication strategies, such as FormData + Blob.

I have abused fetch's power to retrieve multiple files back from a server to the browser by doing something like:

// parse the multipart/form-data response body
const fd = await response.formData()
// collect every part appended under the name 'files'
const files = fd.getAll('files')
// read the first file's raw bytes
const ab = await files[0].arrayBuffer()

Much quicker and easier than having to use any zip/tar stuff. There isn't any reason you can't do the same thing on the server now either, since Node.js and Deno can use the same fetch API on the backend.

Just send the WebAuthn binary using something like:

const fd = new FormData()
fd.append('challenge', new Blob([uint8array]))
fetch(url, { method: 'POST', body: fd })
// and do this on the backend:
const fd = await req.formData()

This way of sending FormData works in both directions.

As such, I am -1 on implementing a new binary encoder.

The platform has evolved to handle binary data better nowadays, without requiring things to be sent via base64 or JSON.
We have things such as BSON and Protobuf and other binary representations; JSON isn't the only solution, and it isn't the best solution for everything.

You can also use fetch to convert a base64 data URL into something else:

// wrap the base64 string in a data: URL and let fetch do the decoding
const b64toRes = (base64, type = 'application/octet-stream') =>
  fetch(`data:${type};base64,${base64}`)

// then read the response in whichever form you need:
const res = await b64toRes(base64)
const ab = await res.arrayBuffer() // or res.blob(), res.json(), or res.body for a stream

but again, base64 should have been avoided in the first place


Speaking of FormData (off topic)... would it be a good idea to have something like formData.append('stuff', typedArray)?

@dead-claudia

In retrospect, I am coming around to the opinion that @domenic is right that this doesn't belong on the ArrayBuffer itself, but maybe in a built-in module or something. It's also worth mentioning that a built-in module or separate global would be a lot easier to implement for those maintaining embedded runtimes like XS.

@bathos

This comment was marked as off-topic.

@jimmywarting

This comment was marked as off-topic.


bathos commented Feb 19, 2022

(Hid my question in turn, but appreciate the answer, thanks.)


bakkot commented Feb 8, 2024

It is the opinion of the committee that this is worth doing. It's true that it's better to avoid the overhead when possible, but often it simply isn't, and we should make accommodations for that reality.

bakkot closed this as completed Feb 8, 2024