Thank you for emphasizing this. Many junior devs have been bitten by not being told early enough the difference between encryption (requires a secret to be reversed), hashing (cannot be reversed) and encoding (can always be trivially reversed).
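To make the distinction concrete, here is a minimal Python sketch using only the standard library. The sample value is made up, and encryption is left out because the standard library has no general-purpose cipher; treat it as an illustration, not a recipe.

```python
import base64
import hashlib

secret = b"hunter2"  # made-up sample value, purely illustrative

# Encoding: anyone can reverse it, no key or secret involved.
encoded = base64.b64encode(secret)
decoded = base64.b64decode(encoded)

# Hashing: a fixed-size digest; there is no "decode" operation for it.
digest = hashlib.sha256(secret).hexdigest()

print("encoded:", encoded)
print("decoded:", decoded)
print("sha256 :", digest)
```

The round trip through base64 gives back the original bytes, which is exactly why it offers no secrecy.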
Also good to know: while the output looks random, it has exactly the same entropy as the input. I.e., don't base64-encode your password to make it stronger.
If however your password is engineered to be easier to remember, for example by using dictionaries or some kind of scheme that has a lower entropy, then the base64 encoding step adds a single bit of strength to your password. Meaning anyone who is brute forcing your password using a smart password cracker, has to configure that cracker to also consider base64 encoding as a scheme, basically forcing it to perform a single extra operation for every try.
Anyway, useless information, you shouldn't be using password schemes like that. The horse shoe battery staple type password style should be quite sufficient I think.
I wonder if it's better to make an encoder that uses words, so the output looks like "horse shoe battery staple", except you don't release the dictionary of potential words the encoder draws from. Then you can always re-create a password if you lose it, assuming you don't lose the dictionary file.
Hashing has many purposes besides security, and for that reason there are many hash libraries. If you plan on using hashes for something related to security or cryptography, you need to use a hash designed for that purpose! Yes, CRC hashing is really fast, and that's great, but it's not great when you use it for user passwords.
I still see newly released projects that choose md5. Like, sure, for the intended use case, probably nobody will construct a collision, but why even allow the possibility?
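For the password case specifically, a rough sketch of what "a hash designed for that purpose" can look like in Python, using `hashlib.pbkdf2_hmac` from the standard library. The function names, the 16-byte salt, and the iteration count are my own assumptions for illustration, not a recommendation from this thread.

```python
import hashlib
import hmac
import os
import zlib
from typing import Optional, Tuple

def hash_password(password: str, salt: Optional[bytes] = None,
                  iterations: int = 200_000) -> Tuple[bytes, bytes]:
    """Return (salt, digest) using PBKDF2-HMAC-SHA256 (hypothetical helper)."""
    if salt is None:
        salt = os.urandom(16)  # fresh random salt per password
    digest = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, iterations)
    return salt, digest

def verify_password(password: str, salt: bytes, expected: bytes,
                    iterations: int = 200_000) -> bool:
    """Recompute the digest and compare in constant time."""
    _, digest = hash_password(password, salt, iterations)
    return hmac.compare_digest(digest, expected)

# For contrast: CRC32 is a fast 32-bit checksum meant for error detection,
# so it is a poor fit for anything security related.
checksum = zlib.crc32(b"correct horse battery staple")
```

The slow, salted function is deliberate: it makes each brute-force guess expensive, which a fast checksum never will.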
If they want to encrypt something just tell them to use ROT13 twice.
Perpetuates the idea that there's "binary" and "text", which is incorrect, but also implies you can't encode ordinary ASCII text into base64.
I understand what you mean, of course text can be represented and treated as binary, and the inverse often as well although it isn't necessarily true. Even in Windows-1252, where the upper 127 characters are in use, there are control characters such as null, delete, and EOT which I'd be impressed if a random chat program preserves them across the wire.
I also don't read an implication that ASCII couldn't be converted to b64
Well, there is binary, and there is text. Sure, all text - like "strawman" ;) - is binary somehow, but not all binary data is text, nor can be even interpreted as such, even if you tried really hard ... like all those poor hex editors.
I don't think it implies that at all.
Text isn't binary. Text can be encoded in binary, and there are different ways to do it. ASCII, UTF-8/16/32, latin-1, Shift-JIS, Windows-1252, etc. Many can't encode all text characters, especially languages that don't use the Latin alphabet.
The fact that you have to ensure you're using the correct encoding when processing text from a binary stream is proof enough that text isn't binary. Python before 3.x allowed you to treat binary and text as equal, and it often caused problems.
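A small Python 3 sketch of that point: the same text becomes different bytes under different encodings, and decoding with the wrong one gives garbage rather than an error. The sample string is arbitrary.

```python
text = "naïve café"  # arbitrary sample with two non-ASCII characters

utf8_bytes = text.encode("utf-8")
latin1_bytes = text.encode("latin-1")

# Same text, two different byte sequences (and different lengths).
print(len(text), len(utf8_bytes), len(latin1_bytes))

# Decoding UTF-8 bytes as latin-1 "works" but yields mojibake.
mojibake = utf8_bytes.decode("latin-1")
print(mojibake)

# ASCII cannot represent this text at all.
try:
    text.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False
print("encodable as ASCII:", ascii_ok)
```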
If you want to eliminate ambiguity for human readers, you can drop to Base58 but in almost all cases, if you are BaseXX-encoding something, it’s long enough that copy-pasting is the norm, so it doesn’t usually matter.
Base32 is sufficient in most cases and can avoid some incidental swear words.
If you want density go for Z85, which is a 4 -> 5 byte chunked encoding and therefore much more efficient on a pipelined CPU.
No, typically the extra characters used are “-” and “_”. That’s what the table in the IETF link shows.
Well, you're in luck: tilde and dot aren't part of base64url
Since a base64 string with padding is always guaranteed to be a multiple of four characters long, if you get a string that is not a multiple of four in length, you can figure out how much padding it should have had, which tells you how to handle the last three bytes of decoding.
Which makes it a little confusing why base64 needs == padding in the first place.
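As a sketch of how the padding can be inferred from the length alone, in Python. `restore_padding` is a hypothetical helper name, not something from a library.

```python
import base64

def restore_padding(s: str) -> str:
    """Re-add the '=' characters an unpadded base64 string would have had."""
    remainder = len(s) % 4
    if remainder == 1:
        # One leftover character carries only 6 bits, not enough for a byte.
        raise ValueError("not a valid unpadded base64 length")
    return s + "=" * (-len(s) % 4)

print(restore_padding("YWJjZGU"))                    # 7 chars, needs one '='
print(base64.b64decode(restore_padding("YWJjZGU")))  # decodes once padded
```

A length that is 1 modulo 4 can never come out of an encoder, so the helper treats it as an error instead of guessing.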
As these are probably not compressible that means there really isn’t a whole lot lost compared to the optimal solution.
- When do you use =, when do you use ==, and do you always add = / == or are there cases where you don't?
- How do you precisely handle leftover bits, for example with the string "5byte"? And is there anything to consider when decoding?
For context: since a base64 character represents 6 bits, every block of three data bytes corresponds to a block of four base64 encoded characters. (8*3 == 24 == 6*4)
That means it's often convenient to process base64 data 4 characters at a time. (in the same way that it's often convenient to process hexadecimal data 2 characters at a time)
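To illustrate the 3-bytes-to-4-characters grouping, a minimal Python sketch that encodes a single full block by hand. `encode_block` is a hypothetical helper and deliberately ignores padding.

```python
import base64
import string

# Standard base64 alphabet: A-Z, a-z, 0-9, '+', '/'
ALPHABET = string.ascii_uppercase + string.ascii_lowercase + string.digits + "+/"

def encode_block(block: bytes) -> str:
    """Encode exactly three bytes as four base64 characters."""
    if len(block) != 3:
        raise ValueError("this sketch only handles full 3-byte blocks")
    n = int.from_bytes(block, "big")  # the 24 bits as one integer
    # Slice the 24 bits into four 6-bit groups, most significant first.
    return "".join(ALPHABET[(n >> shift) & 0x3F] for shift in (18, 12, 6, 0))

print(encode_block(b"Man"))                 # TWFu
print(base64.b64encode(b"Man").decode())    # TWFu, same as the library
```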
1) You use = to pad the encoded string to a multiple of 4 characters, adding zero, one, or two as needed to hit the next multiple-of-4.
So, "543210" becomes "543210==", "6543210" becomes "6543210=", and "76543210" doesn't need padding.
(You'll never need three = for padding, since one byte of data already needs at least two base64 characters)
2) Leftover bits should just be set to zero; the decoder can see that there's not enough bits for a full byte and discard them.
3) In almost all modern cases, the padding isn't necessary, it's just convention.
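For the "5byte" example from the question, a quick Python check of how many = characters show up as the input grows one byte at a time. `padding_count` is a hypothetical helper name.

```python
import base64

def padding_count(data: bytes) -> int:
    """Number of '=' characters the standard encoder appends for this input."""
    return base64.b64encode(data).count(b"=")

for size in range(1, 6):
    chunk = b"5byte"[:size]
    print(size, base64.b64encode(chunk).decode("ascii"), padding_count(chunk))

# The full string is 5 bytes: one complete 3-byte block plus 2 leftover
# bytes, so it gets a single '='.
print(base64.b64encode(b"5byte"))  # b'NWJ5dGU='
```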
The Wikipedia article is pretty exhaustive: https://en.wikipedia.org/wiki/Base64
Padding chars at the end (of stream / file / string) can be inferred from the length already processed, and thus are not strictly necessary.
Note how padding is treated is quite subtle, and has resulted in interesting variations in handling as discussed at: https://eprint.iacr.org/2022/361.pdf
So you're essentially encoding in groups of 24 bits at a time. Once the data ends, you pad out the remainder of the 24 bits with = instead of A because A represents 000000 as data.
For the record, I had to read the whole thing twice to understand that too.
This is because if you’ve only got one of the three bytes you’re going to need, your data looks like this:
XXXXXXXX
Then when you group into 6 bit base64 numbers you get XXXXXX XX????
Which you have to pad with two bytes worth of zeroes because otherwise you don’t even have a full second digit. XXXXXX XX0000 000000 000000
so to encode all your data you still need the first two of these four base64 digits - although the second one will always have four zeroes in it, so it’ll be 0, 16, 32, or 48.
The ‘=’ isn’t just telling you those last 12 bits are zeroes - it’s telling you to ignore the last four bits of the previous digit too.
Similarly with two bytes remaining:
XXXXXXXX YYYYYYYY
That groups as XXXXXX XXYYYY YYYY??
Which pads out with one byte of zeroes to XXXXXX XXYYYY YYYY00 000000
And now your third digit is some multiple of 4 because it’s forced to contain zeroes.
Funny side effect of this:
Some base64 decoders will accept a digit right before the padding that isn’t either a multiple of four (with one byte of padding) or of 16 (with two).
They will decode the digit as normal, then discard the lower bits.
That means it’s possible in some decoders for dissimilar base64 strings to decode to the same binary value.
Which can occasionally be a security concern, when base64 strings are checked for equality, rather than their decoded values.
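A small demonstration of that side effect with Python's standard decoder, which (as far as I know) does not check the discarded bits by default. "QQ==" is the canonical encoding of the single byte "A"; "QR==" differs only in bits that get thrown away.

```python
import base64

canonical = b"QQ=="  # 'Q' = 16, second digit has its low four bits zero
variant = b"QR=="    # 'R' = 17, low four bits non-zero, ignored on decode

print(base64.b64decode(canonical))  # b'A'
print(base64.b64decode(variant))    # b'A'

# The encoded strings differ even though the decoded values match,
# so comparing the encoded form is not the same as comparing the data.
print(canonical == variant)         # False
```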
People have been making the argument for over a decade that base64 is incumbent and so people stick with it due to interoperability. But base85 output is about 6% smaller (25% overhead instead of 33%) for basically free from a computational perspective. Isn't that worth switching over as a widely used standard?
An alternative would be the encoding specified by RFC 1924, which uses a different, noncontiguous set of characters. It still has the drawback that dividing by 85 is a bit slower than dividing by 64 (which is just a bit shift).
Last but not least, Base64 has the benefit of being easily recognizable by a human. Due to its relatively restricted character set, it doesn’t look like just line noise, but also doesn’t look like more intelligible syntax, or like hex, etc. It sits in a middle sweet spot.
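For a sense of scale on the size difference, a quick Python comparison. `base64.b85encode` in the standard library uses an 85-character alphabet in the style of RFC 1924; the 1200-byte payload size is an arbitrary choice that divides evenly by both 3 and 4.

```python
import base64
import os

payload = os.urandom(1200)  # arbitrary size, a multiple of both 3 and 4

b64 = base64.b64encode(payload)  # 4 output chars per 3 input bytes
b85 = base64.b85encode(payload)  # 5 output chars per 4 input bytes

print(len(payload), len(b64), len(b85))
print("base85 is smaller than base64 by", 1 - len(b85) / len(b64))
```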
CPUs were slower relative to memory in those days, so it may have made a difference. Many more programs were CPU bound, and people wrote assembly more often
Also base64 is just easier to code
Usenet has existed since 1979/80 [1] and base64 was first described in 1987 [2].
[1] https://en.wikipedia.org/wiki/Usenet
[2] https://base64.guru/learn/what-is-base64
$ function iter {
N="$1"
CMD="$2"
STATE=$(cat)
for i in $(seq 1 $N); do
STATE=$(echo -n $STATE | $CMD)
done
cat <<EOF
$STATE
EOF
}
$ echo "HN" | iter 20 base64 | head -1
Vm0wd2QyUXlVWGxWV0d4V1YwZDRWMVl3WkRSV01WbDNXa1JTVjAxV2JETlhhMUpUVmpBeFYySkVU
$ echo "Hello Hacker News" | iter 20 base64 | head -1
Vm0wd2QyUXlVWGxWV0d4V1YwZDRWMVl3WkRSV01WbDNXa1JTVjAxV2JETlhhMUpUVmpBeFYySkVU
$ echo "Bonjour Hacker News" | iter 20 base64 | head -1
Vm0wd2QyUXlVWGxWV0d4V1YwZDRWMVl3WkRSV01WbDNXa1JTVjAxV2JETlhhMUpUVmpBeFYySkVU
EDIT: I just remembered that when I found that out by pure serendipity more than 10 years ago I tweeted cryptically about it [1] and someone made a blog post on the subject which I submitted here but it didn't generate discussion [2]. Someone else posted it on Reddit /r/compsci and it generated fruitful discussion there, correcting the blog post [3]. The blog is down now but the internet archive has a copy of it [4].
[1] https://twitter.com/p4bl0/status/298900842076045312
[2] https://news.ycombinator.com/item?id=5181256
[3] https://www.reddit.com/r/compsci/comments/18234a/the_base64_...
[4] https://web.archive.org/web/20130315082932/http://fmota.eu/b...
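The same experiment can be sketched in Python. Note this is not byte-for-byte the shell version above: `base64.b64encode` adds no trailing newline and no line wrapping, so I would only expect the leading characters to agree. The helper names are made up for this example.

```python
import base64

def iterate_b64(data: bytes, rounds: int) -> bytes:
    """Apply base64 encoding to its own output `rounds` times."""
    for _ in range(rounds):
        data = base64.b64encode(data)
    return data

def common_prefix(a: bytes, b: bytes) -> bytes:
    """Longest shared prefix of two byte strings."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return a[:n]

first = iterate_b64(b"HN", 20)
second = iterate_b64(b"Bonjour Hacker News", 20)
shared = common_prefix(first, second)
print(len(shared), shared[:40])
```

The intuition is that each output character depends only on the input bytes at the same relative position or earlier, and "Vm0" encodes to "Vm0w", so the start of the string settles on a fixed point regardless of what it began as.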
$ echo -n "abcde" |base64
Otherwise, without the -n, echo injects an extra newline character to the end of the string that would become encoded. (However, the parent's use of "echo" would be fine as it's not using a variable and so won't be interpreting a dash as an extra option etc.)
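The effect of that stray newline is easy to see in Python, where nothing is appended implicitly:

```python
import base64

without_newline = base64.b64encode(b"abcde")    # like `echo -n "abcde" | base64`
with_newline = base64.b64encode(b"abcde\n")     # like `echo "abcde" | base64`

print(without_newline)  # b'YWJjZGU='
print(with_newline)     # b'YWJjZGUK'
```

The newline fills the slot that would otherwise be padding, which is why the trailing "=" turns into a "K".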
I got it down to about thirteen lines of GLSL:
https://github.com/Rezmason/excel_97_egg/blob/main/glsl/base...
I use it for Cursed Mode of my side project, which renders the WebGL framebuffer to a Base64-encoded, 640x480 pixel, indexed color BMP, about 15 times per second:
The base64 under "useful alphabets" is the "natural", iterative divide by radix, base. There's the RFC's "bucket" conversion base under extras.
If the length of your input data isn't exactly a multiple of 3 bytes, then encoding it will use either 2 or 3 base64 characters to encode the final 1 or 2 bytes. Since each base64 character is 6 bits, this means you'll be using either 12 or 18 bits to represent 8 or 16 bits. Which means you have an extra 4 or 2 bits which don't encode anything.
In the RFC, encoders are required to set those bits to 0, but decoders only "MAY" choose to reject input which does not have those set to 0. In practice, nothing rejects those by default, and as far as I know only Ruby, Rust, and Go allow you to fail on such inputs - Python has a "validate" option, but it doesn't validate those bits.
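If you need to reject such inputs in Python anyway, one workaround (my own sketch, not something the RFC or the standard library prescribes) is to decode, re-encode, and compare. `is_canonical` is a hypothetical helper and assumes padded, standard-alphabet input with no whitespace.

```python
import base64
import binascii

def is_canonical(encoded: bytes) -> bool:
    """True if `encoded` is exactly what the standard encoder would emit."""
    try:
        decoded = base64.b64decode(encoded, validate=True)
    except binascii.Error:
        return False
    # Any non-zero leftover bits (or other deviation) changes the re-encoding.
    return base64.b64encode(decoded) == encoded

print(is_canonical(b"QQ=="))  # canonical encoding of b"A"
print(is_canonical(b"QR=="))  # same decoded byte, non-zero leftover bits
```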
The other major difference is in handling of whitespace and other non-base64 characters. A surprising number of implementations, including Python, allow arbitrary characters in the input, and silently ignore them. That's a problem if you get the alphabet wrong - for example, in Python `base64.standard_b64decode(base64.urlsafe_b64encode(b'\xFF\xFE\xFD\xFC'))` will silently give you the wrong output, rather than an error. Ouch!
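Spelled out as a runnable Python snippet, with the strict variant for comparison. This reflects CPython's behavior as I understand it; the exact wrong bytes are not important, only that no error is raised by default.

```python
import base64
import binascii

data = b"\xff\xfe\xfd\xfc"
urlsafe = base64.urlsafe_b64encode(data)  # uses '-' and '_' instead of '+' and '/'

# Default decoding with the standard alphabet skips the '_' characters
# and returns whatever the remaining characters decode to.
lenient = base64.standard_b64decode(urlsafe)

# With validate=True, characters outside the alphabet raise an error.
try:
    base64.b64decode(urlsafe, validate=True)
    strict_rejected = False
except binascii.Error:
    strict_rejected = True

print(urlsafe, lenient == data, strict_rejected)
```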
Another fun fact is that Ruby's base64 encoder will put linebreaks every 60 characters, which is a wild choice because no standard encoding requires lines that short except PEM, but PEM requires _exactly_ 64 characters per line.
I have a writeup of some of the differences among programming languages and some JavaScript libraries here [1], because I'm working on getting a better base64 added to JS [2].
[1] https://gist.github.com/bakkot/16cae276209da91b652c2cb3f612a...
1. "Another common use case is when we have to store or transmit some binary data over the network that's supposed to handle text, or US-ASCII data. This ensures data remains unchanged during transport."
What does it mean by a network that handles text? Why should the network care about the kind of data in the packet? If the receiver's end is expecting binary data, then why is there a need to encode it using base64? Also, if data is changed during transport by "bit-flipping" or some other corruption, shouldn't that affect the integrity of the base64 encoded data as well?
2. "they cannot be misinterpreted by legacy computers and programs unlike characters such as <, >, \n and many others."
My question here is: what happens if the legacy computers interpret characters like < and > incorrectly? If you send binary data, isn't that better, since it's just 0's and 1's and only the program that understands that binary data will interpret it?
For example, in JavaScript, that involves making sure it's a well-formed string. I did a write-up of that here: https://web.dev/articles/base64-encoding.