TL;DR Lexington now offers facilities to create, combine, transform and visualize ε-NFAs, NFAs and DFAs. Hooray.
As mentioned on lexington’s GitHub page, I have one slight problem with Clojure’s/Java’s regular expressions: they operate on Strings, not on lazy sequences. Lexington, however, takes exactly those kind of sequences as input, posing the need to fully realize the incoming data and transform it into a String object once there is a single regular expression involved. For small inputs that might not be a problem, but scanning, for example, a 10MB/20MB/100MB table of data from a file and having to load everything into memory will certainly not be ideal.
So, I thought, let’s just create our own regular expressions, applicable to lazy inputs and implemented in pure Clojure – I mean, how hard can it be? But let’s start with one of the prerequisites for this final task: generating, transforming and handling finite state machines (FSMs).
Since a regular expression describes a regular language and every regular language can be recognized by an FSM, this is exactly what we need. So, let’s see what lexington has to offer now.
ε-NFA, NFA and DFA
Lexington is available on Clojars and can thus be included e.g. via Leiningen’s project.clj:
It supports three types of state machines:
- Deterministic Finite Automaton (DFA)
There is only one target state per input.
- Non-Deterministic Finite Automaton (NFA)
There are multiple target states per input.
- Non-Deterministic Finite Automaton with ε-transitions (ε-NFA)
There are multiple target states per input. Also, it is possible to change states without consuming input.
Conversions between each of these types are possible (though not always directly). Most commonly used are:
- ε-NFA to NFA to remove non-consuming transitions, and
- NFA to DFA to remove non-determinism.
Lexington supports those conversions. And now it’s example time!
- All you need to define FSMs can be found in lexington.fsm.core.
- Combination functions (e.g.
concat-nfa) are located at lexington.fsm.nfa and lexington.fsm.dfa.
- Analysis and transformation functions reside in lexington.fsm.utils.
- The namespace lexington.fsm.minimize contains an implementation of Hopcroft’s DFA minimization algorithm, but it is not fully tested and still seems to have some problems.
- lexington.fsm.visualize contains the functions
show-fsm!(which opens a window to display a given FSM) and
save-fsm!(which can save PNG/SVG/… files with the FSM’s state diagram).
ε-NFAs for easy combination
ε-NFAs are state machines that allow transitions without consuming input. This makes them prime candidates whenever two FSMs shall be combined either sequentially or in parallel.
Let’s start with three basic machines, each accepting a specific character any number of times:
(ns test.fsm (:use lexington.fsm.core lexington.fsm.nfa)) (def a (nfa (:accept :a \a -> _))) ;; "_" is a reference to the state itself (def b (nfa (:accept :b \b -> _))) (def c (nfa (:accept :c \c -> _)))
You can view the created state machines using
lexington.fsm.visualize/show-fsm!. (Note: You will have to have GraphViz installed.) The results will be:
A more complicated example would be an FSM that combines the basic machines given here and accepts subsequent (possibly empty) blocks of “a”, “b” and “c” (for example “aabbbbbc” and “bbc”). When trying to create it manually, one might end up with something like:
(def abc (nfa (:state :a \a -> _ \b -> :b) (:state :b \b -> _ \c -> :c) (:accept :c \c -> _)))
It looks okay, but one can quite clearly see that it won’t accept the inputs we were aiming for: It only ends up in the accepting state if there is at least one “b” and one “c”. Even in this simple case the result is not immediately clear – and now imagine the hassle with larger state machines!
Well, of course there are algorithms to concatenate two FSMs and an initial step could be, for example, to connect all accepting states to the subsequent FSM’s initial state via ε-transitions.
(def abc (nfa (:state :a \a -> _ epsi -> :b) (:state :b \b -> _ epsi -> :c) (:accept :c \c -> _))) ;; (show-fsm! abc) ;; also: (show-fsm! (concat-nfa a b c))
This machine exhibits non-deterministic behaviour since in states a and b it has to decide whether to consume the expected letter or to switch to another state without touching the input.
In my opinion, this kind of automaton offers the nicest and most comprehensible view onto recognition logic. (And on a sidenote: Regular expressions can be directly translated into ε-NFAs, as described in this very informative article. So this is what I will focus on once I’m in the regex-creating business.)
NFAs for easy evaluation
Pure NFAs do not allow transitions without consumption of an input element. This increases their verbosity (not necessarily complexity) but simplifies the actual recognition task since the transitions triggered by an input always end up in a directly connected state. Take the following, slightly changed variant of the above NFA:
(def abca (-> abc (nfa-add :c \a :c))) ;; (show-fsm! abca)
Consuming an “a” in the first state will let the NFA end up in both state “a” (transition [a]) and state “c” (transition [ε,ε,a]). It’s not dramatic but having a pure NFA will reduce the complexity of the eventual recognition algorithm a little bit.
(def abca-pure (epsilon-nfa->nfa abca)) ;; (show-fsm! abca-pure)
If you examine the two examples you will see that they are indeed equivalent regarding the inputs they accept. (And I’m still figuring out what’s wrong with the state numbers generated by
lexington.fsm.utils/fsm-reindex. There should be no gaps.)
DFAs for linear complexity
DFAs only allow one target state for each source state and input element. This means that the FSM will always be in exactly one state (as opposed to NFA recognizers) and that for an input of length n the FSM will either accept or reject it after n transitions.
This means that DFA-based recognition will exhibit linear time and memory characteristics. However, DFAs can grow very big very quickly when they are forced to resolve ambiguities. Let’s convert the NFA above to its deterministic pendant:
(def abca-dfa (nfa->dfa abca-pure)) ;; (show-fsm! abca-dfa)
That looks… nice. But without knowing the source NFA and ε-NFA it would take me some time to understand what inputs would suit this FSM. Well, let’s stop here.
Things to try
- Check out
- Create DFA intersections/unions/differences using the functions in
- Save Images of your FSMs using
- Create more complex NFAs/DFAs using the
dfamacros and special transitions, e.g.
(:except ...), …
- Leave a comment.
I tried to document lexington’s code the best I could. So, if you’re really crazy enough to dive into it, I hope you won’t be greeted by disappointment. However, proper documentation is somewhere on the top half of my TODO list.