TL;DR: I wrote a rather fast Base64 encoding/decoding library for Clojure.
Update: Alexander Taggart mentioned his library data.codec in the comments. It is about 3-5 times faster than the Apache Commons implementation, so I should have used it as my reference point. By adopting `(set! *unchecked-math* true)`, which I saw there, I could at least lift my own implementation above the Apache one, performance-wise. I can say with confidence, however, that data.codec is about as fast as it's gonna get.
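For reference, enabling that flag looks like this. This is a minimal illustration of my own (not code from data.codec); `pack-triple` is a made-up example function:

```clojure
;; With *unchecked-math* set, ordinary arithmetic and bit operations on
;; primitives compile to unchecked JVM instructions, with no overflow
;; checks or promotion to BigInt -- that is where the speed-up over the
;; default checked math comes from.
(set! *unchecked-math* true)

;; e.g. packing three bytes into a 24-bit value, as Base64 does:
(defn pack-triple ^long [^long a ^long b ^long c]
  (bit-or (bit-shift-left a 16)
          (bit-shift-left b 8)
          c))

(pack-triple 77 97 110) ;; => 5071214
```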
So, starting with this post, I tried to implement Base64 encoding and decoding in Clojure. I used seqs to represent the input data, a vector as the encoding alphabet, and a map for decoding. Very Clojure-y, as you can see, but slow.
Taking the Apache Commons Base64 codec as a reference point, my intuitive and short solution took about 100,000 times as long as the Apache one (~300 milliseconds vs. ~3 microseconds for the example quote given on the Wikipedia page) on a 4-core i5 machine (4 GB RAM) I had access to at the time. Surprisingly, on my laptop (an i7 with 8 cores and 8 GB RAM) it sped up to around 270 microseconds, only about 0.09% of the original time. I really can't grasp what kind of optimization the JVM is performing on the faster machine to achieve that stunning result; maybe some kind of built-in parallelization over the lazy input seqs? Or maybe it's not as stunning as I think it is, and the computing power simply is that much higher? Whatever the answer, I'd like to hear some opinions on that. (The code I was benchmarking can be found here.)
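For context, the seq-based approach looked roughly like this. This is a simplified sketch of the idea, not the exact code I benchmarked:

```clojure
;; Naive, seq-based Base64 encoding: a vector as the encode alphabet,
;; partition-all/mapcat over the input bytes. Short and idiomatic, but
;; every step allocates lazy seq cells and boxes its numbers.
(def alphabet
  (vec "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"))

(defn encode-naive [^String s]
  (apply str
         (mapcat (fn [[a b c]]
                   (let [n  (bit-or (bit-shift-left a 16)
                                    (bit-shift-left (or b 0) 8)
                                    (or c 0))
                         cs (map #(alphabet (bit-and (bit-shift-right n %) 0x3F))
                                 [18 12 6 0])]
                     (cond
                       (nil? b) (concat (take 2 cs) "==")
                       (nil? c) (concat (take 3 cs) "=")
                       :else    cs)))
                 ;; mask to 0..255 since Java bytes are signed
                 (partition-all 3 (map #(bit-and % 0xFF) (.getBytes s "UTF-8"))))))

(encode-naive "Man") ;; => "TWFu"
```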
Anyway, I started from scratch, trying to squeeze out as much performance as possible:
- use a String to store the encode alphabet and employ it for character lookups;
- create the decode alphabet at compile time using a macro;
- mark the encode and decode alphabets as constants;
- put nearly everything in macros to avoid function calls;
- use unchecked integer arithmetic and a lot of manual casts (int, byte, …) to avoid reflection calls;
- compare integers with == instead of =;
- avoid seqs and laziness: I do not map over the input data but access it using indices;
- use a ByteBuffer for result generation.
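Two of those techniques sketched out, with my own illustrative naming (this is not the library's actual source): a String alphabet indexed with `.charAt`, and a decode table generated at compile time by a macro.

```clojure
(def ^String base64-alphabet
  "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/")

(defmacro decode-table
  "Expands at compile time into code building a 128-entry lookup table
  that maps Base64 character codes back to their 6-bit values."
  []
  (let [t (reduce (fn [t i]
                    (assoc t (int (.charAt base64-alphabet i)) i))
                  (vec (repeat 128 -1))
                  (range 64))]
    ;; ~t embeds the fully computed vector as a literal in the expansion,
    ;; so no work is left for runtime beyond allocating the array
    `(int-array ~t)))

(def ^ints decode-lookup (decode-table))

;; lookups are then plain String/array accesses, no map or seq involved:
(.charAt base64-alphabet 19)  ;; => \T
(aget decode-lookup (int \T)) ;; => 19
```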
So, what's the result of my efforts? Well, you can have a look at base64-clj on GitHub and judge for yourself.
First of all, it's ugly. I like Clojure so much that it really hurts to see it be all Java-like and verbose. A simple "+" is now "unchecked-add-int", and there is a cast to "int" at basically every turn of the code.
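To illustrate the verbosity (a contrived addition, not a function from the library):

```clojure
;; the idiomatic version: one character, but checked, boxed arithmetic
(defn sum-idiomatic [a b]
  (+ a b))

;; the optimized version: explicit int casts plus the unchecked operator
(defn sum-unchecked [a b]
  (unchecked-add-int (int a) (int b)))

(sum-unchecked 20 22) ;; => 42
```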
Secondly, it's fast. I ran benchmarks with Criterium, and my implementation is only around 1.5 to 2.5 times slower than Apache's pure-Java one. Have a look at the results.
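For reference, numbers like those below come out of Criterium's `bench`. An invocation looks roughly like this, assuming Criterium is on the classpath; the `encode` here is only a stand-in for the function under test:

```clojure
(require '[criterium.core :refer [bench]])

;; stand-in for the function under test (the real benchmarks ran the
;; library's encode/decode and the Apache Commons codec)
(defn encode [^String s]
  (javax.xml.bind.DatatypeConverter/printBase64Binary (.getBytes s "UTF-8")))

;; bench warms up the JIT first, then samples and reports the mean,
;; std-deviation and quantiles shown below
(bench (encode "Hello, World!"))
```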
amd64 Linux 3.2.0-41-generic 8 cpu(s)
Java HotSpot(TM) 64-Bit Server VM 23.21-b01
Runtime arguments: -XX:+TieredCompilation -Xmx1g -Dclojure.compile.path=/git/base64-clj/target/classes -Dbase64-clj.version=0.1.1-SNAPSHOT -Dfile.encoding=UTF-8 -Dclojure.debug=false
Apache Commons Base64 Codec
Benchmarking `apache-base64-encode` ...
Evaluation count : 16451460 in 60 samples of 274191 calls.
Execution time sample mean : 3.639461 µs
Execution time mean : 3.639549 µs
Execution time sample std-deviation : 47.434118 ns
Execution time std-deviation : 47.654358 ns
Execution time lower quantile : 3.564183 µs ( 2.5%)
Execution time upper quantile : 3.736487 µs (97.5%)
Overhead used : 1.040998 ns

Benchmarking `apache-base64-decode` ...
Evaluation count : 17019600 in 60 samples of 283660 calls.
Execution time sample mean : 3.607791 µs
Execution time mean : 3.607120 µs
Execution time sample std-deviation : 67.940161 ns
Execution time std-deviation : 68.545808 ns
Execution time lower quantile : 3.489565 µs ( 2.5%)
Execution time upper quantile : 3.749907 µs (97.5%)
Overhead used : 1.040998 ns
base64-clj

Benchmarking `encode` ...
Evaluation count : 7829460 in 60 samples of 130491 calls.
Execution time sample mean : 7.881315 µs
Execution time mean : 7.880720 µs
Execution time sample std-deviation : 141.536183 ns
Execution time std-deviation : 143.438947 ns
Execution time lower quantile : 7.672325 µs ( 2.5%)
Execution time upper quantile : 8.204788 µs (97.5%)
Overhead used : 1.036409 ns

Benchmarking `decode` ...
Evaluation count : 8894580 in 60 samples of 148243 calls.
Execution time sample mean : 6.901715 µs
Execution time mean : 6.901556 µs
Execution time sample std-deviation : 99.689121 ns
Execution time std-deviation : 100.470285 ns
Execution time lower quantile : 6.726199 µs ( 2.5%)
Execution time upper quantile : 7.088927 µs (97.5%)
Overhead used : 1.036409 ns
Well, that’s as fast as I’m gonna get, I think. Thank you for reading.