The Base45 Data Encoding

(datatracker.ietf.org)

93 pointsb55y ago47 comments

37 comments · 11 top-level

codeflo5y ago· 7 in thread

So instead of extending QR codes, which are inherently binary, to efficiently handle binary payloads, we invent yet another ASCII-based tunneling scheme. Why ever fix any problem when we can just pile workaround upon workaround upon workaround?

masklinn5y ago

Qrcode already has a binary mode. The problem is that qrcode software treats qrcode data as text and fucks up. You’re not going to get every software out there fixed.

This does not extend qrcode at all; instead it defines a binary-to-text encoding designed to fit in qrcode’s existing alphanumeric mode, exactly like base32, base64 or base85 (or base36, base62, binhex, quopri) but fitting the specific constraints of the qrcode medium.

dolmen5y ago

> You’re not going to get every software out there fixed.

You don't need every piece of QR code software to be fixed anyway. Only the ones which will use those QR codes, which don't exist yet. That software will have to understand Base45 anyway.

If you had written that QRcode is not extensible (I don't know if that's the case) I would have agreed.

rkangel5y ago

Realistically, if you test with the camera apps from Apple, Google and Samsung I suspect you have 90% of the public's usage covered and everyone else will be forced to follow.

kstenerud5y ago

Unfortunately, the QR code "binary" mode specification defaults to ISO 8859-1 for the encoding (because it was not originally intended to store actual binary data), and there's also no way to indicate what format is actually encoded. So all decoders of course just assume ISO 8859-1 because they have no way of knowing otherwise.

However, we could in theory get around this by using binary data formats that always begin with an invalid text character (0x80-0x9f). This way, an implementation can know that the data is not ISO 8859-1, and try to decode whatever format it discovers through the beginning byte signature.

I've actually put this into Concise Encoding [1]

[1] https://github.com/kstenerud/concise-encoding/blob/master/cb...
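A minimal sketch of the sniffing idea described above, assuming the decoder has the raw Byte-mode payload in hand. The function name is hypothetical and this is not Concise Encoding's actual implementation; it only illustrates checking the first byte against the 0x80-0x9f range.

```python
def classify_byte_mode_payload(payload: bytes) -> str:
    """Guess whether a QR Byte-mode payload is binary or ISO 8859-1 text.

    Hypothetical helper: 0x80-0x9f are C1 control codes in ISO 8859-1,
    so ordinary text should never start with one of them.
    """
    if payload and 0x80 <= payload[0] <= 0x9F:
        return "binary"
    return "text"


print(classify_byte_mode_payload(b"\x83\x01\x02"))  # leading byte in 0x80-0x9f
print(classify_byte_mode_payload("café".encode("latin-1")))
```

A reader that gets "binary" back would then dispatch on the byte signature instead of decoding the payload as text.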

atoav5y ago

Because sometimes you need your solution to be used by the public, voluntarily. If they have to install an extra app to read your "looks-like-QR-code-but-not-quite" thing, then you have at least an increased risk of not getting the adoption you want.

This is not a realistic choice for many projects.

dolmen5y ago

Do you mean that humans will have to decode Base45?

If not, there is a software layer anyway. Adding another skin to the onion is not the best way to compress data.

tlamponi5y ago

Because QR code scanners are already very widespread, QR codes can be read by even basic cameras nowadays, and some devices are hard to update or would need yet another self-invented thingy, which would make people more suspicious if done for something like the use case this initially had: codes for verifying at borders and the like that a person is tested, vaccinated or recovered.

I at least find it good that existing solutions still get optimized and enhanced, vs. just throwing everything away at the slightest issue and redoing everything; that churn just costs a lot of $€£ while never bringing out something mature, i.e., useful for the masses.

eventreduce15y ago· 5 in thread

What are the benefits to base58?

Base45 uses chars like backslash. This is super annoying when the encoded string is used in an url.

edflsafoiewq5y ago

It's not for avoiding looks-similar-to-a-human chars like Base58. It's for QR codes. The alphabet is exactly the alphabet for Alphanumeric mode QR codes.

A 40-L code allows 4296 characters in Alphanumeric-mode = 2864 bytes after Base45 decoding.

The same code allows 2953 characters in Byte mode. If you use Byte mode to hold Base64, that's 2214 bytes after decoding, so Base45 is more efficient.
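The arithmetic behind those two figures can be checked with a short sketch. The capacities 4296 and 2953 are the ones quoted above; the helper names are made up for illustration.

```python
ALNUM_CHARS_40L = 4296  # alphanumeric-mode capacity of a 40-L code (from the comment)
BYTE_CHARS_40L = 2953   # byte-mode capacity of a 40-L code (from the comment)


def base45_payload_bytes(chars: int) -> int:
    """Bytes recoverable from `chars` Base45 characters (3 chars -> 2 bytes, 2 chars -> 1 byte)."""
    groups, rest = divmod(chars, 3)
    return groups * 2 + (1 if rest == 2 else 0)


def base64_payload_bytes(chars: int) -> int:
    """Bytes recoverable from `chars` padded Base64 characters (4 chars -> 3 bytes)."""
    return (chars // 4) * 3


print(base45_payload_bytes(ALNUM_CHARS_40L))  # 2864
print(base64_payload_bytes(BYTE_CHARS_40L))   # 2214
```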

The reason it gives for why you can't use Byte mode to directly hold binary data is

> Even in Byte mode a typical QR-code reader tries to interpret a byte sequence as an UTF-8 or ISO/IEC 8859-1 encoded text. Thus QR-codes cannot be used to encode arbitrary binary data directly.

It also says you're not supposed to use it anywhere other than QR codes (in URLs, for example):

> If the data is to be sent via some other transport [not stored in a QR-code], a transport encoding suitable for that transport should be used instead of Base45. It is not recommended to first encode data in Base45 and then encode the resulting string in for example Base64 if the data is to be sent via email. Instead the Base45 encoding should be removed, and the data itself should be encoded in Base64.

DocTomoe5y ago

The draft specifically tells you not to do that:

"If the data is to be sent via some other transport, a transport encoding suitable for that transport should be used instead of Base45."

Base45 is mainly useful for binary information in QR codes.

travisgriggs5y ago

I wish they would call it something like QR45 then, not Base45.

zaxomi5y ago

slash = /

backslash = \

They are using slash.

And space, which is confusing. Did not see the space in the "Hello!!" example since the space is the last character on the line.
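For reference, a minimal Python sketch of the encoding as the RFC describes it: two bytes become three characters, least significant digit first, and a trailing single byte becomes two characters. The function names are my own; treat this as an illustration, not a vetted implementation.

```python
ALPHABET = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ $%*+-./:"  # QR alphanumeric set, 45 symbols


def b45encode(data: bytes) -> str:
    out = []
    for i in range(0, len(data) - 1, 2):
        n = data[i] * 256 + data[i + 1]      # two bytes as one 16-bit number
        e, rest = divmod(n, 45 * 45)
        d, c = divmod(rest, 45)
        out += [ALPHABET[c], ALPHABET[d], ALPHABET[e]]  # least significant digit first
    if len(data) % 2:                        # trailing single byte -> two characters
        d, c = divmod(data[-1], 45)
        out += [ALPHABET[c], ALPHABET[d]]
    return "".join(out)


def b45decode(text: str) -> bytes:
    if len(text) % 3 == 1:
        raise ValueError("invalid Base45 length")
    vals = [ALPHABET.index(ch) for ch in text]  # ValueError on characters outside the alphabet
    out = bytearray()
    for i in range(0, len(vals) - 2, 3):
        n = vals[i] + vals[i + 1] * 45 + vals[i + 2] * 45 * 45
        if n > 0xFFFF:
            raise ValueError("triplet does not fit in two bytes")
        out += n.to_bytes(2, "big")
    if len(vals) % 3 == 2:
        n = vals[-2] + vals[-1] * 45
        if n > 0xFF:
            raise ValueError("pair does not fit in one byte")
        out.append(n)
    return bytes(out)


print(repr(b45encode(b"Hello!!")))  # '%69 VD92EX0' (note the space in the middle)
```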

sixothree5y ago

Trailing spaces too?

Confiks5y ago· 4 in thread

Note that this encoding isn't of the same efficiency as QR binary mode, as it converts 2 bytes into 3 base45 characters. So it's more like 'base41' using the base45 charset.

I'm still a bit sad that with this standard and the packages available now, the namespace of 'base45' is clobbered with this suboptimal implementation. It can best just be renamed to 'base41'. It's a good tradeoff for the DCC, but not for the rest of possible implementations.

For the Dutch variant of the green pass using unlinkable signatures [1], we need all the space we can get, so we use a base45 encoding that uses the exact same method as base58 [2][3], and which has the exact same efficiency as QR binary mode.

[1] https://github.com/minvws/nl-covid19-coronacheck-app-coordin...

[2] https://gist.github.com/confiks/8fcb480d87a50cf1bb5e40e2f093...

[3] https://github.com/confiks/base45-go/tree/main/base45
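A rough Python sketch of the base58-style approach mentioned above: treat the whole payload as one big integer and write it in base 45. This is not the linked Go implementation; the function names and the leading-zero handling (borrowed from how base58 is commonly done) are assumptions.

```python
ALPHABET = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ $%*+-./:"


def bigint_b45encode(data: bytes) -> str:
    n = int.from_bytes(data, "big")
    digits = []
    while n:
        n, r = divmod(n, 45)
        digits.append(ALPHABET[r])
    # Leading zero bytes vanish in the integer, so keep one '0' per zero byte.
    pad = len(data) - len(data.lstrip(b"\x00"))
    return ALPHABET[0] * pad + "".join(reversed(digits))


def bigint_b45decode(text: str) -> bytes:
    n = 0
    for ch in text:
        n = n * 45 + ALPHABET.index(ch)
    pad = len(text) - len(text.lstrip(ALPHABET[0]))
    body = n.to_bytes((n.bit_length() + 7) // 8, "big")
    return b"\x00" * pad + body


payload = b"\xff" * 100
print(len(bigint_b45encode(payload)))  # fewer than the 150 characters of 2-bytes-to-3-chars
```

The price is arbitrary-precision arithmetic over the whole payload instead of a fixed-size block loop.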

mjevans5y ago

32 bits raw <==> 33 bits (3 pairs of 11 bits) QR alphanum.

In blocks of 4 bytes this encodes as 6 'base45' (QR alphanum) characters, and uses the same lookup table.

https://en.wikipedia.org/wiki/QR_code#Encoding

The "Alphanumeric character codes" table, at least at a visual glance, is identical to the RFC's lookup table.
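A small sketch of how QR alphanumeric mode packs characters, to make the 33-bit figure concrete: each pair of characters becomes one 11-bit value (first * 45 + second), and an odd trailing character takes 6 bits. The mode indicator and character count header are left out, and the function name is made up.

```python
ALPHABET = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ $%*+-./:"


def alnum_data_bits(text: str) -> str:
    """Bit string for the data part of a QR alphanumeric segment (no header)."""
    vals = [ALPHABET.index(ch) for ch in text]
    bits = []
    for i in range(0, len(vals) - 1, 2):
        bits.append(format(vals[i] * 45 + vals[i + 1], "011b"))  # 45*45 = 2025 <= 2**11
    if len(vals) % 2:
        bits.append(format(vals[-1], "06b"))                     # lone char: 45 <= 2**6
    return "".join(bits)


print(alnum_data_bits("HE"))           # 01100001011
print(len(alnum_data_bits("BB8BB8")))  # 33 bits for the 6 characters that carry 4 bytes
```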

lifthrasiir5y ago

The GP is saying that the equal efficiency is possible with using only 41 out of 45 characters, so reducing the symbol set would make base45 (now base41) more useful as a general encoding, not just an encoding for QR codes.

pulse75y ago

One could take 10 digits, 26 letters and 5 operands (+ - * / %) for Base41 encoding. 41 * 41 * 41 = 68921 is enough for 2 bytes (65536 combinations).

lifthrasiir5y ago

/ and % don't play well with URLs. I would take + - * . $ instead.

radicalbyte5y ago· 4 in thread

A nice tidbit: this RFC has its roots in the EU Covid Certificate project. The encoding was designed to cut the size of the QR payload (which for DCC is a CBOR - binary encoded - object) :)

The smaller the payload, the better and faster the scanning. Which is important for something that is designed to be used during border crossings and the like.

We have a number of implementations here:

https://github.com/ehn-dcc-development/

lawl5y ago

Interesting. The Swiss system, which is supposedly compatible with the EU, uses JSON + base64 [0].

Seems we also use RSA, and from a quick glance the EU seems to take any x509 certificate authority [1].

Does anyone know if there is a reason elliptic curves weren't mandated, which should result in smaller signatures than RSA and thus a smaller payload?

[0] https://github.com/admin-ch/CovidCertificate-Apidoc/#respons...

[1] https://github.com/ehn-dcc-development/dgc-java/blob/main/cr...

Nursie5y ago

>> Does anyone know if there is a reason elliptic curves weren't mandated

Support still patchy?

I work in this area (certificates mostly) and find issues every so often with various platforms having gaps in their EC stuff. That said, most things can now deal with the base subset, and sticking to (say) P-256 would probably be pretty trouble-free. Getting Europe-wide agreement on this would likely be hard though, and European standards[0] often include stuff that's not widely supported, like FRP256v1, which doesn't seem to be in openssl yet (or wasn't last time I looked), let alone in more obscure or outdated implementations.

[0] - e.g. https://www.etsi.org/deliver/etsi_ts/119300_119399/119312/01...

chugy15y ago

base64 is just used to encode the PDF and PNG to send them over JSON (JSON does not support binary content). The data in the QR code is encoded in base45 as in the EU.

thehappypm5y ago

Honestly feels like premature/unnecessary optimization.

chrismorgan5y ago· 4 in thread

So, it’s using a 45-character alphabet which matches the QR code alphanumeric values table, which lets the QR code encoder switch to a more efficient mode that takes less space.

I just tried rendering a QR code of the 692 characters of the introductory paragraph (with lines joined appropriately), and compared it with a QR code of the same text, uppercased and with out-of-range characters `,`, `[` and `]` changed to %. This reduced an 89×89 code down to 77×77, a 24% reduction in area. If this is roughly the ratio, then Base45-encoding binary data by QR code will yield roughly 17% area savings compared with Base64. (Base45 gets 50% bloat, then 24% shrink = multiple of 1.14; Base64 gets 33% bloat = multiple of 1.33; Base45 / Base64 = 1.14 / 1.33 = 0.83.)

[Edit: edflsafoiewq’s figures on 40-L codes at https://news.ycombinator.com/item?id=27627915 come to about 23% savings, markedly more than my 17%.]

I can’t help but wonder if any of the other modes could be more efficient still—numeric mode, kanji mode and byte mode.

Of course, the specs for all this are ISO specs, so I can’t read them without coughing up the moolah.

In case it’s not clear, I am utterly inexpert in this domain.

Further thoughts:

Base45 encoding two octets in three characters is pretty wasteful: 45³ ÷ 256² ≈ 1.39, which is 39% waste. (By contrast, Base64 is 100% efficient with its alphabet: 64⁴ = 256³.) This means that if you were willing to do more complex encoding and decoding, you could shrink your QR code by roughly 39% more—to about 52% of the size of the Base64, rather than 83%. Leaving such a huge gap on the table puzzles me—I’d have thought that either you’d want something simple (where Base64 is well-understood) or want to minimise your QR codes, and Base45 sits in an awkward place in the middle.

For UTF-8, base-128 will be the most efficient you get. That’ll be ~14% inflation (7 bytes in 8 characters). Which… huh, that looks to be within ε of Base45’s 50% bloat and 24% shrinkage. Not sure if that’s a coincidence or not because I don’t know how alphanumeric mode versus byte mode works in QR codes. But this suggests that alphanumeric mode and Base45-but-not-wasteful would be markedly more efficient than byte mode and Base-122. Still leaves numeric and kanji modes open as possibilities. Again, I’m inexpert and don’t know how the encodings are actually done, and that’ll matter.

On edflsafoiewq’s 40-L figures: Base64 gets 2214 bytes, Base45 gets 2864 bytes, optimally-efficient base-45 would get log₂₅₆ 45⁴²⁹⁶ = 2949 bytes, only around 3% more. I think I must have made a mistake somewhere with some of my numbers.

anderskaseorg5y ago

The right way to measure the area ratio is using entropy. An optimal encoding would save at most 3% area over Base45:

Base64 in binary: log₂ 64 / 8 = 75.000% efficient

Base45 in alphanumeric: 4 log₂ 256 / 33 = 96.970% efficient

Optimal numeric: 3 log₂ 10 / 10 = 99.657% efficient

Optimal alphanumeric: 2 log₂ 45 / 11 = 99.851% efficient

Optimal binary (ISO 8859-1): log₂ 191 / 8 = 94.718% efficient

Optimal binary (UTF-8, single-byte subset): log₂ 128 / 8 = 87.500% efficient

Optimal binary (UTF-8, full): 1 = 128 / 2^(8α) + 1920 / 2^(16α) + 63488 / 2^(24α) + 1048576 / 2^(32α) ⇒ α = 89.706% efficient

Optimal kanji (JIS X 0208): log₂ 6879 / 13 = 98.061% efficient

The mistake in your 39% calculation is that you forgot to take logarithms before calculating the ratio.
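The figures above can be reproduced with a few lines of Python. The closed-form entries are direct transcriptions of the formulas in the list; the UTF-8 value is found by bisection on the equation given there. The names are mine and the bisection bounds are an assumption that happens to bracket the root.

```python
from math import log2


def utf8_alpha(lo: float = 0.5, hi: float = 1.0, iters: int = 60) -> float:
    """Solve 1 = 128/2^(8a) + 1920/2^(16a) + 63488/2^(24a) + 1048576/2^(32a) for a."""
    def total(a: float) -> float:
        return (128 / 2 ** (8 * a) + 1920 / 2 ** (16 * a)
                + 63488 / 2 ** (24 * a) + 1048576 / 2 ** (32 * a))

    for _ in range(iters):
        mid = (lo + hi) / 2
        if total(mid) > 1:   # total() decreases as a grows, so the root is above mid
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


EFFICIENCY = {
    "Base64 in binary": log2(64) / 8,
    "Base45 in alphanumeric": 4 * log2(256) / 33,
    "Optimal numeric": 3 * log2(10) / 10,
    "Optimal alphanumeric": 2 * log2(45) / 11,
    "Optimal binary (ISO 8859-1)": log2(191) / 8,
    "Optimal binary (UTF-8, single-byte subset)": log2(128) / 8,
    "Optimal binary (UTF-8, full)": utf8_alpha(),
    "Optimal kanji (JIS X 0208)": log2(6879) / 13,
}

for name, value in EFFICIENCY.items():
    print(f"{name}: {value:.3%}")
```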

chrismorgan5y ago

Ah hah, yes, the 39% was linear but needed to be log. Thanks for that, and all the other figures too.
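Spelled out as a quick check: the linear ratio compares counts of combinations, while the size of a code scales with the number of bits, i.e. the logarithm. The variable names are just for illustration.

```python
from math import floor, log2

# Linear ratio of combinations: three base-45 digits vs two bytes.
linear_ratio = 45 ** 3 / 256 ** 2   # about 1.39, the "39%"

# Ratio of information content in bits: this is what tracks code area.
bits_ratio = 3 * log2(45) / 16      # about 1.03, i.e. roughly 3% headroom

# Bytes an ideal base-45 packing could fit into 4296 alphanumeric characters.
optimal_bytes = floor(4296 * log2(45) / 8)

print(f"{linear_ratio:.4f} {bits_ratio:.4f} {optimal_bytes}")
```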

lifthrasiir5y ago

> I can’t help but wonder if any of the other modes could be more efficient still—numeric mode, kanji mode and byte mode.

Byte mode is obviously most efficient. Alphanumeric mode uses 5.5 bits per each character, so combined with base45 it uses 5.5 * 3 = 16.5 bits to pack two octets. Base45 is actually not that bad as it seems (3.1% overhead). [EDIT: I've since seen edflsafoiewq's mention that typical QR readers try to interpret binary data as UTF-8, so it is not pointless.]

[MORE EDIT: I've completely missed the last paragraph, so I was actually just confirming what chrismorgan said.]

chrismorgan5y ago

> Maybe a compatibility issue?

From the introduction:

> Even in Byte mode a typical QR-code reader tries to interpret a byte sequence as an UTF-8 or ISO/IEC 8859-1 encoded text. Thus QR-codes cannot be used to encode arbitrary binary data directly.

Back to your comment:

> Base45 is actually not that bad as it seems (3.1% overhead)

That’s very close to my logarithm calculations from edflsafoiewq’s 40-L figures (the log₂₅₆ 45⁴²⁹⁶ bit of my comment, just added). Would you mind explaining to me what I did wrong with my 45³ ÷ 256² calculation that said 39%?

EdSchouten5y ago· 2 in thread

The idea of this encoding is to store two bytes of data in three characters. To me it's not obvious why you need a base as high as 45 for that.

Assuming you either want to store two bytes, or a trailing one, you have 256*256 + 256 combinations: 65792. Using three base45 characters, you can get up to 45^3=91125 combinations. It looks like base41 would have been sufficient. That way you can get rid of some of those special characters, making it easier to use through different transports.
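The counting argument can be checked with a tiny helper (hypothetical name) that finds the smallest base whose fixed number of digits covers a given number of values.

```python
def min_base(values: int, digits: int) -> int:
    """Smallest base b such that b**digits >= values."""
    b = 2
    while b ** digits < values:
        b += 1
    return b


print(min_base(256 * 256, 3))        # 41: two bytes in three characters
print(min_base(256 * 256 + 256, 3))  # 41: still enough with the trailing-byte case added
print(45 ** 3, 41 ** 3)              # 91125 68921
```

One caveat for the QR use case: alphanumeric mode spends 11 bits per character pair no matter how many of the 45 symbols actually occur, so a 41-symbol alphabet would not make the code any smaller; it would only make the string friendlier to other transports.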

dolmen5y ago

This would make it possible to avoid space, %, / and +, which would be more URL-friendly and so allow more uses.

kristov5y ago

Having these chars is a pain (particularly space). Manipulating lists of these in files with standard Unix tools will mean having to escape, quote, etc.

kstenerud5y ago

However, we could in theory get around this by using binary data formats that always begin with an invalid text character (such as 0x80-0x9f). This way, an implementation can know that the data is not ISO 8859-1, and try to decode whatever format it discovers through the beginning byte signature.

I've actually put this into Concise Encoding [1]

[1] https://github.com/kstenerud/concise-encoding/blob/master/cb...

nly5y ago

For context, it seems this encoding was covered by this discussion a few days ago:

What's Inside the EU Green Pass QR Code?

https://news.ycombinator.com/item?id=27589913

mjevans5y ago

This could be useful for storing non-ascii armored crypto keys within a printed QR code format that can be stored in a fireproof safe and reasonably OCRed and decoded for use in disaster recovery or other applications.

simojk5y ago

How does this compare with Data Matrix and C40 encoding?

upofadown5y ago

Why would this be an RFC? It seems quite specific to QR codes. Nothing specific to the Internet.
