How to parse a clojure expression?

This answer is based on the Mathematica functional parsers package FunctionalParsers.m used and described in these blog posts.

This command loads that package:

Import["https://raw.githubusercontent.com/antononcube/MathematicaForPrediction/master/FunctionalParsers.m"]

Using direct parser definitions (first answer)

Consider the EBNF:

<space> = { 'WhitespaceCharacter' } ;
<var> = { 'WordCharacter' } ;
<number> = { 'DigitCharacter' } ;
<atom> =  <var> | <number> ;
<vector> = '[' , <atom> , { <space> , <atom> } , ']' ;
<arg> =  <atom> | <vec> | <tree> ;  
<args-list> = <arg> , { <space> , <arg> } ; 
<head> = <var> ;
<tree> = '(' , <head> , [ <space> , <args-list> ] , ')' ;

Define the corresponding parsers:

pSPACE = ParsePredicate[StringMatchQ[#, WhitespaceCharacter ..] &];

pVAR = var[StringJoin[#]] &\[CircleDot]ParseMany1[
    ParsePredicate[
     StringMatchQ[#, 
       Except[{"(", ")", "[", "]", WhitespaceCharacter}] ..] &]];

pNUMBER = 
  num[ToExpression[StringJoin[#]]] &\[CircleDot]ParseMany1[
    ParsePredicate[StringMatchQ[#, DigitCharacter ..] &]];

pATOM = ParseSpaces[pNUMBER\[CirclePlus]pVAR];

pVECTOR = 
  vec @@ # &\[CircleDot]ParseShortest[
    ParseBracketed[ParseListOf[pATOM, pSPACE]]];

pARGSLIST = 
  ParseShortest[
   ParseListOf[
    ParseSpaces[pATOM\[CirclePlus]pVECTOR\[CirclePlus]pTREE], pSPACE]];

pHEAD = hd @@ # &\[CircleDot]pVAR;

ParseRecursiveDefinition[pTREE, 
  clj @@ Join[{#[[1]]}, #[[2]]] &\[CircleDot]ParseParenthesized[
    pHEAD\[CircleTimes]ParseOption[
      pSPACE \[RightTriangle] pARGSLIST]]];

Note that pVAR is defined for a larger set of strings than <var> in the EBNF. Here is a prettier view on that code:

enter image description here

\[CirclePlus], \[CircleTimes], and \[CircleDot] correspond to the parsers ParseAlternativeComposition, ParseSequentialComposition, and ParseApply respectively. A triangle points to the parser the output of which is picked up in a sequential composition.

Parse the example in the question:

res = pTREE[Characters["(defn double-int [i] (* i 2))"]]

(* {{{}, clj[hd["defn"], {var["double-int"], vec[var["i"]], 
    clj[hd["*"], {var["i"], num[2]}]}]}} *)

and plot the result:

TreeForm[res[[1, 2]]]

enter image description here Here are other parsing examples:

In[164]:= pTREE[Characters["(defn myfun [i j k] (list (* i 2) (+ j 5) (sq k)))"]]

Out[164]= {{{}, clj[hd["defn"], {var["myfun"], vec[var["i"], var["j"], var["k"]], 
    clj[hd["list"], {clj[hd["*"], {var["i"], num[2]}], 
      clj[hd["+"], {var["j"], num[5]}], clj[hd["sq"], {var["k"]}]}]}]}}

In[165]:= pTREE[Characters["(defn myfun [i j k] (list))"]]

Out[165]= {{{}, clj[hd["defn"], {var["myfun"], vec[var["i"], var["j"], var["k"]], 
    clj[hd["list"]]}]}}

Using parser generation from EBNF (update)

Instead of defining the parsers for each of rule of the EBNF grammar we can use the FunctionalParsers.m package to generate the parsers from the EBNF grammar.

ebnfCode = "
  <tree> = '(' &> <head>, [ <args-list> ] <& ')' <@ clj@@#& ;
  <head> = <var> <@ hd@@#& ;
  <args-list> = { <arg> } ; 
  <arg> = <atom> | <vector> | <tree>; 
  <atom> = <var>|<number>; 
  <var> = '_WordString' <@ var ;
  <number> = '_?NumberQ' <@ num ;
  <vector> = '[' &> { <atom> } <& ']' <@ vec@@#& ;
  ";

The strings "&>" and "<&" stand for "parse sequentially and pick right" and "parse sequentially and pick left" respectively. The string "<@" stands for "apply to the parsing result of #1 the modifier function #2".

This command generates the parsers:

GenerateParsersFromEBNF[ParseToEBNFTokens[ebnfCode]];

For each EBNF rule a separate parser is made. E.g. for <vector> = ... the parser pVECTOR is made.

The generated definition of the parser pVAR has to be extended in order to allow the symbols "-", "_", etc. to be in the variable names.

pVAR = var\[CircleDot]ParsePredicate[
    StringMatchQ[#, Except[{"(", ")", "[", "]"}] ..] &];

Let us define a dedicated tokenizer for the considered Clojure expressions:

ToClojureTokens = ParseToTokens[#, {"(", ")", "[", "]"}, {" ", "\n"}] &;

With the tokenizer we can make the following table of parsing results:

statements = {"[7  34  5 72]", "(ar [7  34  5 72])", "(m 3 (p 4 6))", 
   "(defn doubleint [i] (* i 2))", "(list (* i 2) (+ j 5) (sq k))", 
   "(defn myfun [i j k] (list (* i 2) (+ j 5) (sq k)))"};
ParsingTestTable[pTREE, statements, 
 "TokenizerFunction" -> ToClojureTokens, "Layout" -> "Vertical"]

enter image description here

Here are the tree forms of the parsed expressions in the table:

Map[TreeForm@*First@*Reverse@*First@*pTREE@*ToClojureTokens, statements]

enter image description here

Other discussions

The functional parsers approach has been discussed also in answers of other MSE questions.

  1. "CSS Selectors for Symbolic XML" ;

  2. "General strategies to write big code in Mathematica?"


I have a solution which is brittle only in that the Clojure expressions "(", ")" could mess up bracket detection and that I rely on strings like "<4>" to keep track of things. It's not very brittle once string handling is properly treated. Don't know if it's any better than yours, but it was fun to write and I think it's rather slick, modulo aforementioned fixable brittleness. Special mention to the use of a closure varname to count the leafs.

  • parse just takes a depth-1 Clojure expression like (head arg1 arg2 arg3) and returns "head"["arg1", "arg2", "arg3"].
  • prepare replaces [contents] with (List contents) in the string.
  • splitTree breaks out the expression into {top-expression, {{label -> sub-expression, label -> sub-expression, …}}}; see example below.

parse[expr_String?(StringFreeQ[StringTake[#, 2 ;; -2], "("] &)] := 
 (#[[1]] @@ Rest[#]) &@
  StringSplit[StringTake[expr, 2 ;; -2], " "]

varname[start_: 0] := 
 Module[{p = start}, (p++; "<" <> ToString[p] <> ">") &]

prepare[str_] := 
 (
  newvars = varname[];
  StringReplace[str, 
   Shortest["[" ~~ x__ ~~ "]"] :> "(List " <> x <> ")"];
 )

splitTree[str_] := 
 Reap@FixedPoint[
  StringReplace[#, 
   x : r__ ~~ int : Shortest["(" ~~ __ ~~ ")"] ~~ s__ :>
    (With[{var = newvars[]}, Sow[var -> int]; 
    r <> var <> s])] &, str]

Example usage:

prepare["(defn double-int [i] (* i (+ 3 4)))"]

outputs

"(defn double-int (List i) (* i (+ 3 4)))"

then

output = MapAt[parse, splitTree[%], {2, 1, All, 2}]

outputs

{"(defn double-int <3> <2>)",
 {{"<1>" -> "+"["3", "4"], 
   "<2>" -> "*"["i", "<1>"],
   "<3>" -> "List"["i"]}}}

then

parse[output[[1]]] //. output[[2, 1]]

outputs

"defn"["double-int", "List"["i"], "*"["i", "+"["3", "4"]]]