Best approach to render MediaWiki in C#?

Update as of 2017:
You can use ParseoidSharp to get a fully compatible MediaWiki-renderer.
It uses the official Wikipedia Parsoid library via NodeServices.
It targets .NET Standard 2.0. Since Parsoid is licensed under GPL 2.0, and the GPL code runs in Node.js in a separate process that is invoked over the network, you can even use any license you like ;)


Pre-2017

Problem solved. As originally assumed, the solution lies in using one of the existing alternative parsers from C#.
WikiModel (Java) works well for that purpose.

The first attempt was to P/Invoke kiwi. It worked, but turned out to be unusable because:

  • kiwi uses char* (fails on anything that is not English/ASCII)
  • it is not thread safe
  • it requires a native DLL for every architecture (I added x86 and amd64, then it went kaboom on my ARM processor)

The second attempt was mwlib. That failed because somehow IronPython doesn't work as it should.

The third attempt was Sweble, which essentially turned out to be academic vaporware.

The fourth attempt was using the original MediaWiki renderer via Phalanger. That failed because the MediaWiki renderer is not really modular.

The fifth attempt was using Wiky.php via Phalanger, which worked, but was slow, and Wiky.php implements MediaWiki markup only incompletely.

The sixth attempt was using bliki via ikvmc, which failed because of its excessive use of 3rd party libraries: it compiles, but yields nothing but null-reference exceptions.

The seventh attempt was using JavaScript in C#, which worked but was very slow, plus the MediaWiki functionality implemented was very incomplete.

The 8th attempt was writing my own "parser" via Regex.
But the time required to make it work was just excessive, so I stopped.

The 9th attempt was successful. Using ikvmc on WikiModel yields a useful DLL. The problem there was that the example code was hopelessly out of date. But using Google and the WikiModel source code, I was able to piece it together.

The end-result can be found here:
https://github.com/ststeiger/MultiWikiParser


Why shouldn't this be possible with regular expressions?

inputString = Regex.Replace(inputString, @"(?:''''')(.*?)(?:''''')", @"<strong><em>$1</em></strong>");
inputString = Regex.Replace(inputString, @"(?:''')(.*?)(?:''')", @"<strong>$1</strong>");
inputString = Regex.Replace(inputString, @"(?:'')(.*?)(?:'')", @"<em>$1</em>");

This will, as far as I can see, render all 'Bold and italic', 'Bold' and 'Italic' text.


Here is how I once implemented a solution:

  • define your regular expressions for Markup->HTML conversion
  • regular expressions must be non greedy
  • collect the regular expressions in a Dictionary<char, List<Regex>>

The char is the first (markup) character of each Regex, and the Regexes in each list must be sorted by markup keyword length, descending, e.g. === before ==.

Iterate through the characters of the input string, and check Dictionary.ContainsKey(char). If the key exists, search its List for a Regex that matches at the current position. The first matching Regex wins.

As MediaWiki allows recursive markup (except for <pre> and others), the string inside the markup must also be processed in this fashion recursively.

If there is a match, skip ahead the number of characters matching the RegEx in input string. Otherwise proceed to next character.
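
The steps above can be sketched roughly as follows. This is only a minimal illustration, not the code I originally wrote: the names MarkupRule, MiniWiki and Render are made up for this example, the rule table covers just quotes and two heading levels, and there is no HTML escaping and no special handling for <pre>. It assumes that prefixing each pattern with \G is an acceptable way to force a match at the current position.

```csharp
using System;
using System.Collections.Generic;
using System.Text;
using System.Text.RegularExpressions;

// Hypothetical helper type: one markup pattern plus the HTML that wraps its content.
public sealed class MarkupRule
{
    public Regex Pattern { get; }
    public string Open { get; }
    public string Close { get; }

    public MarkupRule(string pattern, string open, string close)
    {
        // \G anchors the match at the start index passed to Regex.Match(input, index).
        Pattern = new Regex(@"\G" + pattern, RegexOptions.Singleline);
        Open = open;
        Close = close;
    }
}

public static class MiniWiki
{
    // Key: first markup character. Each list is ordered longest delimiter first.
    private static readonly Dictionary<char, List<MarkupRule>> Rules =
        new Dictionary<char, List<MarkupRule>>
        {
            ['\''] = new List<MarkupRule>
            {
                new MarkupRule(@"'''''(.+?)'''''", "<strong><em>", "</em></strong>"),
                new MarkupRule(@"'''(.+?)'''", "<strong>", "</strong>"),
                new MarkupRule(@"''(.+?)''", "<em>", "</em>"),
            },
            ['='] = new List<MarkupRule>
            {
                new MarkupRule(@"===(.+?)===", "<h3>", "</h3>"),
                new MarkupRule(@"==(.+?)==", "<h2>", "</h2>"),
            },
        };

    public static string Render(string input)
    {
        var output = new StringBuilder();
        int i = 0;
        while (i < input.Length)
        {
            Match hit = null;
            MarkupRule rule = null;
            if (Rules.TryGetValue(input[i], out var candidates))
            {
                foreach (var candidate in candidates)
                {
                    Match m = candidate.Pattern.Match(input, i);
                    if (m.Success) { hit = m; rule = candidate; break; } // first match wins
                }
            }

            if (hit != null)
            {
                // Process the inner text recursively, then skip the whole matched span.
                output.Append(rule.Open).Append(Render(hit.Groups[1].Value)).Append(rule.Close);
                i += hit.Length;
            }
            else
            {
                output.Append(input[i]); // no markup starts here: copy the character
                i++;
            }
        }
        return output.ToString();
    }
}
```

With these rules, MiniWiki.Render("'''bold ''both'' text'''") returns <strong>bold <em>both</em> text</strong>, and a lone apostrophe as in "it's" is copied through unchanged because no rule matches at that position.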