xscDev

Something Something Development

Counting Characters in Clojure (Part 1?)

with 4 comments

Recently, when experimenting with finite state machines (FSM) in Clojure and trying to implement a simple regular expression, I noticed that my solution was much, much slower than using Java’s native RegEx support (java.util.regex.*) – the FSM execution took about three times longer than the pattern matching.

Usually, I wouldn’t even think about reinventing the wheel here, but working on another project made me realize one small problem with Java’s/Clojure’s patterns: they operate on strings, not on seqs of characters. (Please correct me if I’m wrong. Seriously.) So, even though I only provide expressions that match the beginning of an input string  - which, for all I know, might be infinite – I have to fully generate the complete string by concatenating the single characters. I don’t like that.

I decided to examine the matter using a very simple example: counting all “a” characters at the beginning of an input seq. I will now merely write down my observations, a followup post will be created if I can find satisfying explanations for them. Any input is greatly appreciated!

The RegEx Way

We create a single java.util.regex.Pattern instance, then call re-find on the input string:

(def count-a-with-regex
  (let [regex (re-pattern "^a*")]
    (fn [string]
      (count (re-find regex string)))))

Let’s verify its behaviour:

(count-a-with-regex "aaaab")
;-> 4

(count-a-with-regex "baaaa")
;-> 0

We will use the following macro to measure the performance of the different functions:

(defmacro time-it
  [f s]
  (let [n 100000]
    `(let [s# ~s] 
       (doseq [_# (range 5)] 
         (time (doseq [_# (range ~n)] 
                 (~f s#)))))))

Which gives us:

(time-it count-a-with-regex "aaaab")
"Elapsed time: 202.827021 msecs"
"Elapsed time: 216.600224 msecs"
"Elapsed time: 204.28781 msecs"
"Elapsed time: 199.013597 msecs"
"Elapsed time: 200.563942 msecs"

(time-it count-a-with-regex "baaaa")
"Elapsed time: 156.790841 msecs"
"Elapsed time: 183.576875 msecs"
"Elapsed time: 177.559577 msecs"
"Elapsed time: 186.805941 msecs"
"Elapsed time: 197.167199 msecs"

So this is our reference point. We will try longer strings later on.

The loop/recur Way

Okay, now let’s do this by hand and in a less-than-functional kind of fashion:

(defn count-a-with-loop
  [string]
  (loop [characters string
         a-count 0]
    (if (= (first characters) \a)
      (recur (rest characters) (inc a-count))
      a-count)))

Same procedure as above:

(count-a-with-loop (doall (seq "aaaab")))
;-> 4

(count-a-with-loop (doall (seq "baaaa")))
;-> 0

(time-it count-a-with-loop (doall (seq "aaaab")))
"Elapsed time: 221.661533 msecs"
"Elapsed time: 205.445214 msecs"
"Elapsed time: 206.349205 msecs"
"Elapsed time: 201.797586 msecs"
"Elapsed time: 195.219226 msecs"

(time-it count-a-with-loop (doall (seq "baaaa")))
"Elapsed time: 124.39273 msecs"
"Elapsed time: 112.555949 msecs"
"Elapsed time: 90.874501 msecs"
"Elapsed time: 93.509943 msecs"
"Elapsed time: 96.545018 msecs"

Keep in mind that these are very short inputs and the performance will change drastically once we tackle larger ones. Still, we get approximately the same performance as regular expressions in the first example, and a better one in the second. This stems probably from the fact that the check “is the first element \a ?” produces less overhead than the initialization of the RegEx matcher.

The reduce Way

I will just give the code here since it obviously won’t scale well. It always processes the complete sequence, no matter where the first non-”a”-character is located.

(defn count-a-with-reduce
  [string]
  (or (first
        (reduce (fn [[a-count b?] character]
                  (if (and (not b?) (= character \a)) 
                    [(inc a-count)]
                    [a-count true]))
                [0]  
                string))
      0))

The performance is terrible, the code is ugly. Don’t do this.

The take-while Way

This is probably the most Clojure-idiomatic variant of our counter. take-while lazily reads elements from the input sequence matching a given predicate:

(defn count-a-with-take-while
  [string]
  (count (take-while #{\a} string)))

I like it – but will it blend?

(time-it count-a-with-take-while (doall (seq "aaaab")))
"Elapsed time: 477.671781 msecs"
"Elapsed time: 458.32137 msecs"
"Elapsed time: 453.263662 msecs"
"Elapsed time: 461.302077 msecs"
"Elapsed time: 462.934911 msecs"

(time-it count-a-with-take-while (doall (seq "baaaa")))
"Elapsed time: 148.851222 msecs"
"Elapsed time: 145.938989 msecs"
"Elapsed time: 138.214618 msecs"
"Elapsed time: 136.06743 msecs"
"Elapsed time: 138.623688 msecs"

Well, this is awkward… Apparently, the whole internal lazy-seq creation and maintaining is more expensive than anticipated. Plus, depending on the implementation of count, we might be traversing the sequence produced by take-while a second (unnecessary) time. Not even the “Don’t take the first element!”-case is that impressive.

Stepping it up a notch

Now we will confront the different implementations with a 512-character string, consisting of 128 “a” characters and 384 “b” characters, with one test having the As in front, the other one producing Bs first. The code used for this test can be viewed at https://gist.github.com/4107920.

The results speak strongly in favor of regular expressions. First, the “aaa…bbb…” string:

;; Test count-a-with-regex
"Elapsed time: 1066.622254 msecs"
"Elapsed time: 1040.48187 msecs"
"Elapsed time: 1093.826735 msecs"
"Elapsed time: 1076.481674 msecs"
"Elapsed time: 1076.680782 msecs"

;; Test count-a-with-loop
"Elapsed time: 4254.646883 msecs"
"Elapsed time: 4210.534028 msecs"
"Elapsed time: 4215.728785 msecs"
"Elapsed time: 4227.580608 msecs"
"Elapsed time: 4208.980236 msecs"

;; Test count-a-with-take-while
"Elapsed time: 12583.842842 msecs"
"Elapsed time: 12680.212826 msecs"
"Elapsed time: 12578.855909 msecs"
"Elapsed time: 12704.122106 msecs"
"Elapsed time: 12690.439912 msecs"

Now the “bbb…aaa…” string:

;; Test count-a-with-regex
"Elapsed time: 179.352658 msecs"
"Elapsed time: 185.260893 msecs"
"Elapsed time: 181.33897 msecs"
"Elapsed time: 182.686526 msecs"
"Elapsed time: 166.968629 msecs"

;; Test count-a-with-loop
"Elapsed time: 116.102362 msecs"
"Elapsed time: 134.116661 msecs"
"Elapsed time: 116.450281 msecs"
"Elapsed time: 118.110229 msecs"
"Elapsed time: 103.911074 msecs"

;; Test count-a-with-take-while
"Elapsed time: 174.526236 msecs"
"Elapsed time: 164.532302 msecs"
"Elapsed time: 180.466069 msecs"
"Elapsed time: 180.633587 msecs"
"Elapsed time: 156.064908 msecs"

Why is it so hard to just count characters? What kind of optimization is done when using Java’s regular expressions and how much can the resulting algorithm even differ from our loop-based example? Perhaps I’m just running a not-so-clever JVM (OpenJDK here), perhaps I’m missing something – whatever it is: Any input on this would be greatly appreciated! 

And I will keep on trying.

Written by Yannick

November 18th, 2012 at 11:58 pm

Posted in Clojure

Tagged with , ,

  • maclo

    Hi

    my favourite is this version

    (defn count-a-with-split [string]
    (count (first (split string #”[^a]” 2))))

    concise, lispy and reasonable efficient.
    m

    • http://dev.xscheme.de/ Yannick

      Hi maclo,

      first of all, thanks for your comment. I tried your version and, unfortunately, it is about 9 times slower than the regular expression matching and more than 2 times slower than the loop/recur one. Plus, it does not handle the case where there is no “a” at the beginning nearly as well.

      (time-it count-a-with-split “bbbbbbbbbbbbbbbbbbbbbbbbbaaa”)
      “Elapsed time: 844.545396 msecs”
      (time-it count-a-with-loop “bbbbbbbbbbbbbbbbbbbbbbbbbaaa”)
      “Elapsed time: 191.419734 msecs”

      When working with a String the best variant is an index/String.charAt-based loop (I tried that out today), followed by regular expression matching. However, I’m looking for a solution that operates on (possibly lazy) sequences of characters – and it just seems as if there was too much overhead in the sequence implementation…

  • Austin Yun

    Lazy seqs in Clojure are chunked, 32 items per chunk if I recall. The laziness only has a positive performance impact for *large* data sets I think. Grab a Project Gitenberg .txt of Les Misérables and try it out on that.

    • http://dev.xscheme.de/ Yannick

      I was really just playing around here. The benchmarks are not reliable (no JVM warmup, no “-server” flag, etc…) and the “results” are non-existent. From what I’ve seen since I wrote this, the JVM is really able to optimize access to seqs, lazy or not, but if I really had to access a large file or other kinds of not-yet-in-memory data, I’d probably rely on a plain-old InputStream-based solution.

      That is, if performance really, really matters. The beauty and simplicity of Clojure’s seqs and seq functions shall not be traded for minor gains! ;)