TL;DR: I needlessly optimized Message Digest creation in Clojure. Find it here.
For one of the projects I’m currently working on (a file/directory watcher I will need sometime in the future), I needed to compute checksums for files to detect modifications. Of course, someone already created a library to do this – and so I tried it out. It worked well, was usable via a small set of functions (which were not fully documented) and operated reliably. Long story short, I’d recommend it to anyone looking for a way to use MD5, SHA-1, etc. from Clojure.
So, I had a look at the source code, as I wanted to know how files were accessed. Was a BufferedReader used? Was the file read into memory completely before the hash was created? Stuff like that. I got my answers but I also got to see some code that looked very slow, e.g. the conversion from a byte array to a String (done using a BigInteger), the creation of a new buffer every time a block of data is read from an InputStream, and so on.
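To make the contrast concrete, here is a sketch (my own code, not the library’s actual source) of the BigInteger-based hex conversion next to a version using a lookup table and a single reused read buffer:

```clojure
(import 'java.security.MessageDigest)

;; The pattern I found: hex string via BigInteger. It allocates a fresh
;; BigInteger per digest and needs explicit zero-padding (here for MD5's
;; 16 bytes via "%032x").
(defn hex-via-biginteger [^bytes bs]
  (format "%032x" (BigInteger. 1 bs)))

;; Alternative sketch: a character lookup table instead of BigInteger ...
(def ^:private ^String hex-chars "0123456789abcdef")

(defn bytes->hex [^bytes bs]
  (let [sb (StringBuilder. (* 2 (alength bs)))]
    (doseq [b bs]
      (.append sb (.charAt hex-chars (bit-shift-right (bit-and b 0xF0) 4)))
      (.append sb (.charAt hex-chars (bit-and b 0x0F))))
    (str sb)))

;; ... and one buffer that is reused for every read instead of a fresh
;; allocation per block. `digest-stream` is a hypothetical helper name.
(defn digest-stream [^java.io.InputStream in algorithm]
  (let [md  (MessageDigest/getInstance algorithm)
        buf (byte-array 2048)]
    (loop []
      (let [n (.read in buf)]
        (when (pos? n)
          (.update md buf 0 n)
          (recur))))
    (bytes->hex (.digest md))))
```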
You don’t really need to optimize that. When creating file checksums, the bottleneck is not caused by some 2KB of memory that get repeatedly allocated and garbage-collected, but by the I/O capabilities and performance of the file system.
I attempted it anyway. I couldn’t really fork the repository, change everything and then open a pull request, so I created my own repository. There are some things the two still have in common (using a protocol for dispatch), others that just can’t be realized any other way, and those that try to improve the overall experience.
You can have a look at the repository here: https://github.com/xsc/pandect
And, maybe more interesting, I ran some benchmarks to check whether I really improved anything. I think I did.
TL;DR: I wrote a rather fast Base64 encoding/decoding library for Clojure.
Update: Alexander Taggart mentioned his library data.codec in the comments. It is about 3-5 times faster than the Apache Commons implementation, so I should have used it as my reference point. By using (set! *unchecked-math* true), which I saw there, I could at least lift my own implementation above the Apache one, performance-wise. I can say with confidence, however, that data.codec really is as fast as it’s gonna get.
So, starting with this post I tried to implement Base64 encoding and decoding in Clojure. I used seqs to represent the input data, had a vector as encoding alphabet and a map for decoding. Very clojure-y, as you can see, but slow.
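For reference, the naive approach looked roughly like this – a reconstruction of the idea, not the original post’s exact code:

```clojure
;; Seq-based Base64 encoding: partition the input bytes into groups of
;; three, pack each group into one integer, slice it into four 6-bit
;; values and look them up in a vector alphabet.
(def base64-alphabet
  (vec "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"))

(defn encode-seq [bytes]
  (apply str
    (mapcat (fn [[a b c]]
              (let [n     (bit-or (bit-shift-left a 16)
                                  (bit-shift-left (or b 0) 8)
                                  (or c 0))
                    chars (map #(base64-alphabet (bit-and (bit-shift-right n %) 0x3F))
                               [18 12 6 0])]
                (cond
                  (nil? b) (concat (take 2 chars) "==")  ;; one trailing byte
                  (nil? c) (concat (take 3 chars) "=")   ;; two trailing bytes
                  :else    chars)))
            (partition-all 3 bytes))))
```

Elegant and short, but every group allocation and lazy step costs time – which is exactly what the rewrite below attacks.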
Taking the Apache Commons Base64 Codec as a reference point, my intuitive and short solution took about 100,000 times as long as the Apache one (~300 milliseconds vs. ~3 microseconds for the example quote given on the Wiki page) on a 4-core i5 machine (4GB RAM) I had access to at that time. Surprisingly, using my laptop – an i7 with 8 cores and 8GB RAM – it sped up to around 270 microseconds, only 0.09% of the original time. I really can’t grasp which kind of optimization the JVM is performing on the faster machine to achieve that stunning result; maybe some kind of built-in parallelization on the lazy input seqs? Or maybe it’s not as stunning as I think it is – maybe the computing power just is that much higher? Whatever the answer, I’d like to hear some opinions on that. (The code I was benchmarking can be found here.)
Anyway, I started from scratch, trying to squeeze out as much performance as possible:
- use a String to store the encode alphabet and employ .charAt for lookups;
- create the decode alphabet at compile time using a macro;
- mark encode and decode alphabet as ^:const;
- put nearly everything in macros to avoid function calls;
- use unchecked integer arithmetic and a lot of manual casts (byte, …) to avoid reflection calls;
- compare integers with == instead of =;
- avoid seqs and laziness – I do not map over the input data but access it using indices;
- use a ByteBuffer for result generation.
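Put together, the flavor of the resulting code looks roughly like this – a sketch of the style on a single 3-byte group, with names of my own choosing, not base64-clj’s actual internals:

```clojure
(set! *unchecked-math* true)

(def ^:const ^String ENCODE-ALPHABET
  "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/")

;; Encode one 3-byte group into four characters written to a ByteBuffer.
;; Index arithmetic is unchecked, alphabet lookups go through String.charAt,
;; and explicit int casts keep the compiler from falling back to reflection.
(defn encode-group! [^bytes in ^long idx ^java.nio.ByteBuffer out]
  (let [a (int (bit-and (aget in idx) 0xFF))
        b (int (bit-and (aget in (unchecked-inc idx)) 0xFF))
        c (int (bit-and (aget in (unchecked-add idx 2)) 0xFF))
        n (int (bit-or (bit-shift-left a 16) (bit-shift-left b 8) c))]
    (.put out (byte (int (.charAt ENCODE-ALPHABET (bit-and (bit-shift-right n 18) 0x3F)))))
    (.put out (byte (int (.charAt ENCODE-ALPHABET (bit-and (bit-shift-right n 12) 0x3F)))))
    (.put out (byte (int (.charAt ENCODE-ALPHABET (bit-and (bit-shift-right n 6) 0x3F)))))
    (.put out (byte (int (.charAt ENCODE-ALPHABET (bit-and n 0x3F)))))))
```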
So, what’s the result of my efforts? Well, you can have a look at base64-clj on Github and judge for yourself.
First of all, it’s ugly. I like Clojure so much, it really hurts to see it be all Java-like and verbose. A simple “+” is now “unchecked-add-int” and there is a cast to “int” at basically every turn of the code.
Secondly, it’s fast. I’m running benchmarks with Criterium and my implementation is only around 1.5 to 2.5 times slower than Apache’s pure-Java one. Have a look at the results.
amd64 Linux 3.2.0-41-generic 8 cpu(s) Java HotSpot(TM) 64-Bit Server VM 23.21-b01 Runtime arguments: -XX:+TieredCompilation -Xmx1g -Dclojure.compile.path=/git/base64-clj/target/classes -Dbase64-clj.version=0.1.1-SNAPSHOT -Dfile.encoding=UTF-8 -Dclojure.debug=false
Apache Commons Base64 Codec
Benchmarking `apache-base64-encode` ...
Evaluation count : 16451460 in 60 samples of 274191 calls.
Execution time sample mean : 3.639461 µs
Execution time mean : 3.639549 µs
Execution time sample std-deviation : 47.434118 ns
Execution time std-deviation : 47.654358 ns
Execution time lower quantile : 3.564183 µs ( 2.5%)
Execution time upper quantile : 3.736487 µs (97.5%)
Overhead used : 1.040998 ns

Benchmarking `apache-base64-decode` ...
Evaluation count : 17019600 in 60 samples of 283660 calls.
Execution time sample mean : 3.607791 µs
Execution time mean : 3.607120 µs
Execution time sample std-deviation : 67.940161 ns
Execution time std-deviation : 68.545808 ns
Execution time lower quantile : 3.489565 µs ( 2.5%)
Execution time upper quantile : 3.749907 µs (97.5%)
Overhead used : 1.040998 ns
base64-clj

Benchmarking `encode` ...
Evaluation count : 7829460 in 60 samples of 130491 calls.
Execution time sample mean : 7.881315 µs
Execution time mean : 7.880720 µs
Execution time sample std-deviation : 141.536183 ns
Execution time std-deviation : 143.438947 ns
Execution time lower quantile : 7.672325 µs ( 2.5%)
Execution time upper quantile : 8.204788 µs (97.5%)
Overhead used : 1.036409 ns

Benchmarking `decode` ...
Evaluation count : 8894580 in 60 samples of 148243 calls.
Execution time sample mean : 6.901715 µs
Execution time mean : 6.901556 µs
Execution time sample std-deviation : 99.689121 ns
Execution time std-deviation : 100.470285 ns
Execution time lower quantile : 6.726199 µs ( 2.5%)
Execution time upper quantile : 7.088927 µs (97.5%)
Overhead used : 1.036409 ns
Well, that’s as fast as I’m gonna get, I think. Thank you for reading.
Thanks to Planet Clojure I stumbled across an article describing how one could implement Base64 Encoding/Decoding in Clojure. This is not a “Why would you reinvent the wheel?” response but a “How about doing it this way?” suggestion.
I only changed a single thing in the original code’s concept: I’m using integers to represent my 8-bit or 6-bit numbers instead of a seq of zeros and ones. Crazy, I know, but suddenly a lot of complexity just vanished. I only had to define one helper function (create-n-tet) and everything else just fell into place.
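In code, the concept looks something like this – my reconstruction of the idea, not the post’s actual create-n-tet:

```clojure
;; Re-partition a seq of `from`-bit numbers into `to`-bit numbers using
;; plain integer shifts and masks instead of seqs of zeros and ones.
;; With from=8, to=6 this is the encoding direction; 6 -> 8 decodes.
(defn create-n-tet [from to xs]
  (loop [xs xs, acc 0, bits 0, out []]
    (cond
      ;; enough bits accumulated: emit the top `to` bits
      (>= bits to)
      (let [rest-bits (- bits to)]
        (recur xs
               (bit-and acc (dec (bit-shift-left 1 rest-bits)))
               rest-bits
               (conj out (bit-shift-right acc rest-bits))))
      ;; otherwise pull in the next input number
      (seq xs)
      (recur (rest xs)
             (bit-or (bit-shift-left acc from) (first xs))
             (+ bits from)
             out)
      ;; input exhausted: left-pad a final partial group with zero bits
      :else
      (if (pos? bits)
        (conj out (bit-shift-left acc (- to bits)))
        out))))
```

One loop, two directions, and no intermediate bit-seqs to build and tear down.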
Yesterday, I released the first kind-of-usable version of thrift-clj, a wrapper library around Java classes generated by the Apache Thrift compiler. I made a lot of changes since I first wrote down my thoughts ten days ago, extended its capabilities and most importantly wrote a fair amount of unit tests and filled the wiki with hopefully enough documentation to get anyone interested started.
Uh, and I published a short overview of the project using Github Pages. You can find it here.
So, thanks for reading, any feedback is – as always – very appreciated.
Travis CI is a continuous integration environment easily usable via GitHub service hooks. This means that every time you push a commit to your GitHub repository, Travis gets notified, clones your data and runs a user-specified test or build script, sending an email if something went wrong or was fixed.
One major advantage (stemming from the fact that each Travis worker runs in a virtual machine/sandbox) is the possibility to execute nearly any command, resulting in custom dependency fetching, custom test execution, custom anything-you-like. A Travis worker instance runs on Ubuntu 12.04 LTS (32bit), allowing for use of apt-get and add-apt-repository to extend your environment – just as you’d do it at home.
I’m writing this post because my latest pet, thrift-clj, needs the Thrift compiler installed to compile Thrift IDL test files to Java source and class files. The process mentioned here does not suit me (I do not need any runtime libraries that Leiningen cannot provide), so I looked for a 3rd-party-repository containing the Thrift executable – and I found it.
So, mesdames et messieurs, to install Thrift 0.9.0 (the compiler, not the libraries) in a Travis CI instance, put the following in your .travis.yml:
...
before_install:
  - sudo add-apt-repository -y ppa:wnoronha/thrift
  - sudo apt-get update -qq
  - sudo apt-get install -qq thrift-compiler
  - which thrift
...
See it in context here.
TL;DR I wrote a plugin for Leiningen to compile Thrift IDL files to Java class files. It’s this one.
Since I’m currently working on making Apache Thrift usable from Clojure I need a way to generate Java classes from my raw Thrift IDL sources. I could obviously do this by hand but I’m not going to explain why that is nowhere near a sufficient solution.
Now, thankfully Leiningen, everyone’s favourite Clojure build tool, offers the possibility to have plugins execute such tasks and – not surprisingly – someone already created one for Thrift. But it has two minor flaws:
- It is marked as “deceased” on Leiningen’s plugin page;
- and it always compiles everything even if nothing changed.
It was mostly the second point that prompted the development of lein-thriftc. It manages a file with modification times of your sources and only recompiles the ones that have been updated since the last run.
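A minimal sketch of that staleness check (hypothetical helper names and file format; lein-thriftc’s actual implementation may differ):

```clojure
(require '[clojure.edn :as edn]
         '[clojure.java.io :as io])

;; Keep a map of path -> last seen modification time in an EDN file and
;; only report sources whose timestamp has changed since the last run.
(defn stale-sources [timestamp-file source-files]
  (let [seen (if (.exists (io/file timestamp-file))
               (edn/read-string (slurp timestamp-file))
               {})]
    (filter (fn [^java.io.File f]
              (not= (get seen (.getPath f)) (.lastModified f)))
            source-files)))

;; After a successful compile, remember the current timestamps.
(defn record-timestamps! [timestamp-file source-files]
  (spit timestamp-file
        (pr-str (into {} (map (fn [^java.io.File f]
                                [(.getPath f) (.lastModified f)])
                              source-files)))))
```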
That’s about it. Oh, and check out thrift-clj!
Update: Please be aware that the examples given below use thrift-clj “0.1.0-alpha1”. There have been some API changes since.
I’ve had my eye on Apache Thrift for quite a while now but never got around to trying it. Now, with Clojure residing on the JVM and Thrift IDL files being compilable to Java code, I saw the possibility of finally diving into this great (albeit very badly documented) framework.
Now, a quick Google Search for “thrift” and “clojure” revealed:
- a short interaction about Clojure code generation from Thrift IDLs (which would be great but two years later there still doesn’t seem to be anything in that direction);
- small examples for Thrift serialization, wrapping existing Java classes, …;
- a Leiningen Plugin.
Now, while it is certainly feasible to create Thrift Java code and access it (after all, Clojure’s interop is magnificent), there’s always a thought in the back of my head: This could be easier. And if not that, at least prettier.
TL;DR Lexington now offers facilities to create, combine, transform and visualize ε-NFAs, NFAs and DFAs. Hooray.
As mentioned on lexington’s GitHub page, I have one slight problem with Clojure’s/Java’s regular expressions: they operate on Strings, not on lazy sequences. Lexington, however, takes exactly those kinds of sequences as input, so as soon as a single regular expression is involved, the incoming data has to be fully realized and transformed into a String object. For small inputs that might not be a problem, but scanning, for example, a 10MB/20MB/100MB table of data from a file and having to load everything into memory will certainly not be ideal.
So, I thought, let’s just create our own regular expressions, applicable to lazy inputs and implemented in pure Clojure – I mean, how hard can it be? But let’s start with one of the prerequisites for this final task: generating, transforming and handling finite state machines (FSMs).
Since a regular expression describes a regular language and every regular language can be recognized by an FSM, this is exactly what we need. So, let’s see what lexington has to offer now.
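To give an idea of why FSMs and lazy seqs go together so well, here is a toy DFA runner (illustrative only; lexington’s actual FSM representation is different):

```clojure
;; A DFA as plain data: state -> {input element -> next state}, plus a
;; set of accepting states. This one recognizes the language a+b*.
(def dfa
  {:states  {:q0 {\a :q1}
             :q1 {\a :q1, \b :q2}
             :q2 {\b :q2}}
   :accept? #{:q1 :q2}
   :initial :q0})

;; Consume the input one element at a time; a lazy seq is never realized
;; further than the runner has already walked.
(defn accepts? [{:keys [states accept? initial]} input]
  (loop [state initial, xs input]
    (if-let [[x & more] (seq xs)]
      (if-let [next-state (get-in states [state x])]
        (recur next-state more)
        false)
      (boolean (accept? state)))))
```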
Recently, when experimenting with finite state machines (FSM) in Clojure and trying to implement a simple regular expression, I noticed that my solution was much, much slower than using Java’s native RegEx support (java.util.regex.*) – the FSM execution took about three times longer than the pattern matching.
Usually, I wouldn’t even think about reinventing the wheel here, but working on another project made me realize one small problem with Java’s/Clojure’s patterns: they operate on strings, not on seqs of characters. (Please correct me if I’m wrong. Seriously.) So, even though I only provide expressions that match the beginning of an input string - which, for all I know, might be infinite – I have to fully generate the complete string by concatenating the single characters. I don’t like that.
I decided to examine the matter using a very simple example: counting all “a” characters at the beginning of an input seq. I will now merely write down my observations, a followup post will be created if I can find satisfying explanations for them. Any input is greatly appreciated!
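For context, the two contenders looked roughly like this – my own sketch, not the benchmark’s exact source:

```clojure
;; Seq-based: count leading \a's without ever realizing more of the
;; (possibly very long or lazy) input than necessary.
(defn count-leading-as-seq [cs]
  (count (take-while #(= \a %) cs)))

;; Regex-based: requires the input as a fully realized String.
(defn count-leading-as-regex [^String s]
  (count (re-find #"^a*" s)))
```

The seq version only ever realizes the prefix it consumes, which is the whole point of the exercise.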
For quite some time I was not sure how to get back into programming and/or writing about it. I seemed to be waiting for the one project, the one task worthy to be done, before I’d even think about sharing anything with the world. Needless to say, it didn’t come up. But Clojure did – and I liked it.
Seeing that I gathered my first programming experiences using another functional language, Scheme (the PLT variant that has now developed into Racket), I immediately felt nostalgic at the sight of Clojure’s parenthesis-enclosed constructs. And shortly after, I was truly overwhelmed by its beauty, compactness and efficiency – something I never recognized when I sat down – holy sh*t – 11 years ago and wrote my first Fibonacci-generating functions. Having been forced to focus mainly on Java and C at university and work, it’s really refreshing to see something that appears incredibly clean, stripped of a lot of unnecessary constraints. Rich Hickey, the mind behind the language, remarked:
“I think programmers have become inured to incidental complexity… when they encounter complexity, they consider it a challenge to overcome, rather than an obstacle to remove. Overcoming complexity isn’t work, it’s waste.”
I’m not going to glorify Clojure since I seem to be an easy-to-overwhelm kind of person and had comparable reactions when first encountering Java (“OOP is the only way to go nowadays!”), XML (“This is awesome; I can describe anything with it!”), JSON (“Phew, finally something not as bloated as XML.”) and others. But I like it and hold the strong opinion that everyone should at least have a look into functional paradigms (yeah, I know, so brave…). And that Clojure is a very good choice for exploring those paradigms, due to its thriving ecosystem among other reasons.
Yesterday morning I started lexington, a pure Clojure lexer generator intended to be the base of some future Clojure parser generator, and later that day I created the GitHub project and wrote documentation. I shudder at the thought of the time I could have wasted trying to replicate its functionality using Java or C.
I don’t know if this is the right thing to announce, but maybe (only maybe!) if Clojure is suited for this task, I might attempt to create some kind of LastSharp, my first big project whose C#-codebase drove me insane after a while.
But no more insanity now! Hahahaha! Ha.