I'm still a bit sad that with this standard and the packages available now, the namespace of 'base45' is clobbered with this suboptimal implementation. It can best just be renamed to 'base41'. It's a good tradeoff for the DCC, but not for the rest of possible implementations.
For the Dutch variant of the green pass using unlinkable signatures [1], we need all the space we can get, so we use a base45 encoding that uses the exact same method as base58 [2][3], and which has the exact same efficiency as QR binary mode.
[1] https://github.com/minvws/nl-covid19-coronacheck-app-coordin...
[2] https://gist.github.com/confiks/8fcb480d87a50cf1bb5e40e2f093...
In blocks of 4 bytes this encodes as 6 'base45' (QR alphanum) characters, and uses the same lookup table.
https://en.wikipedia.org/wiki/QR_code#Encoding
The "Alphanumeric character codes" table, at least at a visual glance, is identical to the RFC's lookup table.
The smaller the payload, the better and faster the scanning. Which is important for something that is designed to be used during border crossings and the like.
We have a number of implementations here:
Seems we also use RSA, and from a quick glance the EU seems to take any x509 certificate authority [1].
Does anyone know if there is a reason elliptic curves weren't been mandated, which should cause smaller signatures than RSA and thus a smaller payload?
[0] https://github.com/admin-ch/CovidCertificate-Apidoc/#respons...
[1] https://github.com/ehn-dcc-development/dgc-java/blob/main/cr...
Support still patchy ?
I work in this area (certificates mostly) and find issues every so often with various platforms having gaps in their EC stuff. That said most things can now deal with the base subset and sticking to (say) P-256 would probably be pretty trouble-free. Getting europe-wide agreement on this would likely be hard though, and european standards[0] often have stuff in that's not widely supported, like FRP256v1, which doesn't seem to be in openssl yet, or wasn't last time I looked, let alone more obscure or outdated implementations.
[0] - e.g. https://www.etsi.org/deliver/etsi_ts/119300_119399/119312/01...
Assuming you either want to store two bytes, or a trailing one, you have 256*256 + 256 combinations: 65792. Using three base45 characters, you can get up to 45^3=91125 combinations. It looks like base41 would have been sufficient. That way you can get rid of some of those special characters, making it easier to use through different transports.
However, we could in theory get around this by using binary data formats that always begin with an invalid text character (such as 0x80-0x9f). This way, an implementation can know that the data is not ISO 8859-1, and try to decode whatever format it discovers through the beginning byte signature.
I've actually put this into Concise Encoding [1]
[1] https://github.com/kstenerud/concise-encoding/blob/master/cb...
What's Inside the EU Green Pass QR Code?
This does not extent qrcode at all, instead it defines a binary-to-text encoding designed to fit in qrcode’s existing alphanumeric mode, exactly like base32, base64 or base85 (or base36, base62, binhex, quopri) but fitting the specific constraints of the qrcode medium.
You don't need every QR code software to be fixed anyway. Only the one which will use those QR codes which don't yet exist. These softwares will have to understand Base45 anyway.
If you wrote that QRcode is not extensible (I don't know if that's the case) I would have agreed.
However, we could in theory get around this by using binary data formats that always begin with an invalid text character (0x80-0x9f). This way, an implementation can know that the data is not ISO 8859-1, and try to decode whatever format it discovers through the beginning byte signature.
I've actually put this into Concise Encoding [1]
[1] https://github.com/kstenerud/concise-encoding/blob/master/cb...
This is not a realistic choice for many projects.
If not, there is a software layer anyway. Adding another skin to the onion is not the best way to compress data.
I at least find it good that existing solutions still get optimizations, even and are enhanced, vs. just throwing everything away at the slightest issue and redo everything, the that churn just costs a lot of $€£ while never bringing out something mature, i.e., useful for the masses.
I just tried rendering a QR code of the 692 characters of the introductory paragraph (with lines joined appropriately), and compared it with a QR code of the same text, uppercased and with out-of-range characters `,`, `[` and `]` changed to %. This reduced an 89×89 code down to 77×77, a 24% reduction in area. If this is roughly the ratio, then Base45-encoding binary data by QR code will yield roughly 17% area savings compared with Base64. (Base45 gets 50% bloat, then 24% shrink = multiple of 1.14; Base64 gets 33% bloat = multiple of 1.33; Base45 / Base64 = 1.14 / 1.33 = 0.83.)
[Edit: edflsafoiewq’s figures on 40-L codes at https://news.ycombinator.com/item?id=27627915 come to about 23% savings, markedly more than my 17%.]
I can’t help but wonder if any of the other modes could be more efficient still—numeric mode, kanji mode and byte mode.
Of course, the specs for all this are ISO specs, so I can’t read them without coughing up the moolah.
In case it’s not clear, I am utterly inexpert in this domain.
Further thoughts:
Base45 encoding two octets in three characters is pretty wasteful: 45³ ÷ 256² ≈ 1.39, which is 39% waste. (By contrast, Base64 is 100% efficient with its alphabet: 64⁴ = 256³.) This means that if you were willing to do more complex encoding and decoding, you could shrink your QR code by roughly 39% more—to about 52% of the size of the Base64, rather than 83%. Leaving such a huge gap on the table puzzles me—I’d have thought that either you’d want something simple (where Base64 is well-understood) or want to minimise your QR codes, and Base45 sits in an awkward place in the middle.
For UTF-8, base-128 will be the most efficient you get. That’ll be ~14% inflation (7 bytes in 8 characters). Which… huh, that looks to be within ε of Base45’s 50% bloat and 24% shrinkage. Not sure if that’s a coincidence or not because I don’t know how alphanumeric mode versus byte mode works in QR codes. But this suggests that alphanumeric mode and Base45-but-not-wasteful would be markedly more efficient than byte mode and Base-122. Still leaves numeric and kanji modes open as possibilities. Again, I’m inexpert and don’t know how the encodings are actually done, and that’ll matter.
On edflsafoiewq’s 40-L figures: Base64 gets 2214 bytes, Base45 gets 2864 bytes, optimally-efficient base-45 would get log₂₅₆ 45⁴²⁹⁶ = 2949 bytes, only around 3% more. I think I must have made a mistake somewhere with some of my numbers.
Base64 in binary: log₂ 64 / 8 = 75.000% efficient
Base45 in alphanumeric: 4 log₂ 256 / 33 = 96.970% efficient
Optimal numeric: 3 log₂ 10 / 10 = 99.657% efficient
Optimal alphanumeric: 2 log₂ 45 / 11 = 99.851% efficient
Optimal binary (ISO 8859-1): log₂ 191 / 8 = 94.718% efficient
Optimal binary (UTF-8, single-byte subset): log₂ 128 / 8 = 87.500% efficient
Optimal binary (UTF-8, full): 1 = 128 / 2^(8α) + 1920 / 2^(16α) + 63488 / 2^(24α) + 1048576 / 2^(32α) ⇒ α = 89.706% efficient
Optimal kanji (JIS X 0208): log₂ 6879 / 13 = 98.061% efficient
The mistake in your 39% calculation is that you forgot to take logarithms before calculating the ratio.
Byte mode is obviously most efficient. Alphanumeric mode uses 5.5 bits per each character, so combined with base45 it uses 5.5 * 3 = 16.5 bits to pack two octets. Base45 is actually not that bad as it seems (3.1% overhead). [EDIT: I've since seen edflsafoiewq's mention that typical QR readers try to interpret binary data as UTF-8, so it is not pointless.]
[MORE EDIT: I've completely missed the last paragraph, so I was actually just confirming what chrismorgan said.]
From the introduction:
> Even in Byte mode a typical QR-code reader tries to interpret a byte sequence as an UTF-8 or ISO/IEC 8859-1 encoded text. Thus QR-codes cannot be used to encode arbitrary binary data directly.
Back to your comment:
> Base45 is actually not that bad as it seems (3.1% overhead)
That’s very close to my logarithm calculations from edflsafoiewq’s 40-L figures (the log₂₅₆ 45⁴²⁹⁶ bit of my comment, just added). Would you mind explaining to me what I did wrong with my 45³ ÷ 256² calculation that said 39%?
Base45 uses chars like backslash. This is super annoying when the encoded string is used in an url.
A 40-L code allows 4296 characters in Alphanumeric-mode = 2864 bytes after Base45 decoding.
The same code allows 2953 characters in Byte mode. If you use Byte mode to hold Base64, that's 2214 bytes after decoding, so Base45 is more efficient.
The reason it gives for why you can't use Byte mode to directly hold binary data is
> Even in Byte mode a typical QR-code reader tries to interpret a byte sequence as an UTF-8 or ISO/IEC 8859-1 encoded text. Thus QR-codes cannot be used to encode arbitrary binary data directly.
It also says you're not supposed to use it anywhere but QR codes (like URLs)
> If the data is to be sent via some other transport [not stored in a QR-code], a transport encoding suitable for that transport should be used instead of Base45. It is not recommended to first encode data in Base45 and then encode the resulting string in for example Base64 if the data is to be sent via email. Instead the Base45 encoding should be removed, and the data itself should be encoded in Base64.
"If the data is to be sent via some other transport, a transport encoding suitable for that transport should be used instead of Base45."
Base45 is mainly useful for binary information in QR codes.