Our encoding for encrypted filenames is now open source
Why do Boxcryptor's encrypted filenames consist of Chinese characters? This is a question we often hear about our folder and filename encryption and today we want to give you a detailed explanation.
Filename Character Limit
The underlying problem in choosing an appropriate character set is the path length limit. Core elements of Microsoft Windows can only handle file paths with a maximum length of 260 characters. This limitation is relevant, for example, when using Windows Explorer. Since files with longer paths would not be accessible there, many cloud providers (Dropbox, for example) don’t synchronize files with a path length of more than 260 characters.
However, encrypted filenames require even more bytes than plaintext filenames. Thus, we had to find a way to reduce the character count of encrypted filenames, so Boxcryptor would not constantly trigger sync issues.
Base4K
The solution to this problem is to use a much larger set of Unicode characters instead of ASCII characters. The limit counts characters, not bytes, and the more distinct characters an encoding can choose from, the more information each character carries. A Base64 character carries 6 bits, while one character out of 4,096 carries 12 bits, or 1.5 bytes. Thus, two such characters are enough for every 3 bytes.
Comparable to existing encoding methods like Base64, we have developed the binary-to-text encoding Base4K, which uses 4,096 different Unicode characters. While Base64 requires more characters than bytes (4 characters for every 3 bytes), Base4K requires fewer characters than bytes (2 characters for every 3 bytes). In terms of character count, Base4K output is therefore roughly 50 percent shorter than Base64 output.
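To make the comparison concrete, here is a small sketch of the arithmetic. It assumes a 4,096-character alphabet (12 bits per character) with no extra padding characters. The function names are hypothetical and are not taken from the published Base4K code.

```python
import base64
import math


def base64_length(n_bytes: int) -> int:
    """Characters in padded Base64 output: 4 characters per started 3-byte group."""
    return 4 * math.ceil(n_bytes / 3)


def twelve_bit_length(n_bytes: int) -> int:
    """Characters needed when every character carries 12 bits (assumed, no padding)."""
    return (n_bytes * 8 + 11) // 12  # integer ceiling of bits / 12


if __name__ == "__main__":
    for n in (16, 32, 64):
        # Cross-check the Base64 formula against the standard library.
        assert base64_length(n) == len(base64.b64encode(bytes(n)))
        print(n, "bytes:", base64_length(n), "Base64 characters vs.",
              twelve_bit_length(n), "12-bit characters")
```

For a 32-byte encrypted name, this gives 44 characters in Base64 and 22 characters at 12 bits per character.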
Asian Characters?
So why do we use Asian characters for Boxcryptor? The answer is quite simple: We need around 4,000 different characters, and neither ASCII nor the Latin alphabet offers nearly enough. In addition, the characters had to meet several technical requirements:
- Consecutive code points in Unicode
- Universal representability (this is not the case, for example, with some symbols such as control characters)
- Unambiguous distinguishability
The large size of the Asian character set proved to be perfectly suitable here. Base4K therefore maps bytes (0x00-0xff) to Unicode code points in the ranges 0x4000-0x40ff and 0x6000-0x6fff. Thus, the selection was based solely on technical requirements.
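The following sketch illustrates how such a mapping can work. It is not the published Base4K implementation. It uses the two ranges named above and assumes that every 12 bits map to one code point in 0x6000-0x6fff, while leftover bits at the end of the input go into 0x4000-0x40ff. The exact bit layout and all names are assumptions made for illustration.

```python
BULK_BASE = 0x6000  # 4,096 code points, 12 bits each (range from the text)
TAIL_BASE = 0x4000  # 256 code points for leftover bits (range from the text)


def encode(data: bytes) -> str:
    """Illustrative 12-bit encoding: 3 bytes become 2 characters."""
    out = []
    full = len(data) - len(data) % 3
    for i in range(0, full, 3):
        chunk = int.from_bytes(data[i:i + 3], "big")  # 24 bits
        out.append(chr(BULK_BASE + (chunk >> 12)))
        out.append(chr(BULK_BASE + (chunk & 0xFFF)))
    rest = data[full:]
    if len(rest) == 1:
        # Assumed layout: a single leftover byte becomes one tail character.
        out.append(chr(TAIL_BASE + rest[0]))
    elif len(rest) == 2:
        # Assumed layout: 16 leftover bits = 12 bulk bits + 4 tail bits.
        value = int.from_bytes(rest, "big")
        out.append(chr(BULK_BASE + (value >> 4)))
        out.append(chr(TAIL_BASE + (value & 0xF)))
    return "".join(out)


def decode(text: str) -> bytes:
    """Inverse of encode(); raises ValueError on characters outside the ranges."""
    codes = [ord(c) for c in text]
    tail = None
    if codes and TAIL_BASE <= codes[-1] <= TAIL_BASE + 0xFF:
        tail = codes.pop() - TAIL_BASE
    bulk = []
    for code in codes:
        if not BULK_BASE <= code <= BULK_BASE + 0xFFF:
            raise ValueError(f"unexpected code point U+{code:04X}")
        bulk.append(code - BULK_BASE)
    out = bytearray()
    pairs = len(bulk) - len(bulk) % 2
    for i in range(0, pairs, 2):
        out += ((bulk[i] << 12) | bulk[i + 1]).to_bytes(3, "big")
    if len(bulk) % 2 == 1:
        # An odd bulk character must be completed by a 4-bit tail.
        if tail is None or tail > 0xF:
            raise ValueError("truncated or malformed input")
        out += ((bulk[-1] << 4) | tail).to_bytes(2, "big")
    elif tail is not None:
        out.append(tail)
    return bytes(out)


if __name__ == "__main__":
    sample = bytes(range(20))
    assert decode(encode(sample)) == sample
    print(len(sample), "bytes ->", len(encode(sample)), "characters")
```

Because both ranges are runs of consecutive code points, encoding and decoding reduce to adding or subtracting a base value. This is one reason the "consecutive code points" requirement is convenient.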
The Asian character set contains many times more symbols than the Latin alphabet. Therefore, it is perfectly suited for our encoding of folder and file names. However, real Asian (e.g., Chinese) file names would look completely different, since the encoded symbols follow neither syntax nor language logic.
By the way: If you use Boxcryptor to encrypt files and folders with real Asian names, you should have no problem in most cases. “Normal” Chinese file names, for example, do not correspond to the structure of encrypted Boxcryptor file names. Accordingly, they are not recognized as encrypted. In the unlikely event that Boxcryptor misinterprets a file as encrypted, you can simply enable the “show also non-decryptable files” setting. Misinterpreted “real” file names will then be displayed normally as well.
Now Also Open Source
We ourselves are big fans of open-source software and also use various open-source libraries for Boxcryptor. This is why we have made the Base4K implementation for various languages (C/C++, C#, Java, JavaScript) freely available to all under the MIT license. You can find it on our GitHub account.
By the way, although we will not publish the source code of Boxcryptor in its entirety, we have also added a sample implementation of the Boxcryptor encryption algorithm on GitHub.