Developing a markdown parser for Blogger

Friday, June 5, 2015

TL;DR: I wrote a small script that lets me use a few Markdown features in Blogger; the code can be found on [GitHub](https://github.com/ValYouW/simple-md-blogger){_blank}.
For instructions on how to add it to your blog, jump to the "Adding to a Blogger blog" section at the bottom.

When writing a blog post it is much easier to insert code blocks using 3 backticks `\`\`\`` (GitHub Flavored Markdown), for example:
```
```js
function foo() {
    console.log('Im foo !!');
}
\```
```
Or adding inline code quotes using a single backtick, like: `\`quote\``.
Or adding titles using hashes, like: `# This is a title`.
Or adding links using this pattern: `\[text\](url){target}`.
This can be done by parsing the post's HTML and replacing the Markdown text with HTML tags. I came up with 2 methods.

## Using Regex
Here is one option that uses RegExp only (note: this method does not support the links pattern):
```js
function parseWithRegex(post) {
    var html = post.innerHTML;

    // Replace all titles (h1/h2/etc), regex explanation:
    // ^[\s]* - Line can start with spaces, we ignore those
    // ([#]+) - Match a sequence of #
    // (?:\s)* - After the hash we can have many spaces, we ignore those
    // (.*?)(?:<br>)?$ - Then match everything in non-greedy mode (as little as possible) until the end, line can end
    // with <br> so we ignore it
    html = html.replace(/^[\s]*([#]+)(?:\s)*(.*?)(?:<br>)?$/gm, function(match, hashes, title) {
        var headerLevel = hashes.length;
        return '<h' + headerLevel + '>' + title + '</h' + headerLevel + '>';
    });

    // Replace all code blocks, regex explanation:
    // ^``` - line should start with ```
    // ([a-zA-Z]+)? - Then optionally we could have a language (characters only)
    // ((?:.|\n)*?)``` - Then we match anything until ```, since '.' doesn't match \n we need (.|\n),
    // and we don't want that brackets in the match so adding ?:
    // and we want to match until first occurrence of ``` (non-greedy) so adding *?
    // (?:<br>)?$ - Then we expect an optional br tag followed by end-of-line
    html = html.replace(/^```([a-zA-Z]+)?((?:.|\n)*?)```(?:<br>)?$/gm, '<pre><code class="$1">$2</code></pre>');

    // Next replace code quotes, regex explanation:
    // Match anything between backticks, the character before the closing backtick should not be "\" (escaped backtick)
    html = html.replace(/`(.*?[^\\])`/g, '<code>$1</code>');

    // Next replace all escaped backticks (\`) with backticks
    html = html.replace(/\\`/g, '`');
    // Put the final html back in the post element
    post.innerHTML = html;

    // Loop thru all the created code blocks and remove all <br>, also remove the first "\n"
    var codes = post.querySelectorAll('pre > code');
    for (var i = 0; i < codes.length; ++i) {
        var code = codes[i].innerHTML;
        code = code.replace(/<br>/g, '').replace(/^\n/, '');
        codes[i].innerHTML = code;
    }
}
```
The problem with the method above is that it will also replace non-escaped single backticks (`\`xxx\``) inside a code block with a `<code>` tag, which is not desirable.
This can be solved by escaping all backticks within the code block, but let's try another approach that leaves code blocks untouched.
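To make the problem concrete, here is a minimal sketch that applies the same two replacements (code block first, then inline code quotes) to a plain string instead of a post element. The `sample` text and the `regexPass` helper are made up for this illustration; only the two regular expressions are copied from `parseWithRegex`.

```js
// Hypothetical sample: a fenced block whose code itself contains backticks
var sample = [
    '```js',
    'var cmd = `ls`;',
    '```'
].join('\n');

// Hypothetical helper: the same two replacements as parseWithRegex, in the same order
function regexPass(html) {
    // Code blocks first
    html = html.replace(/^```([a-zA-Z]+)?((?:.|\n)*?)```(?:<br>)?$/gm, '<pre><code class="$1">$2</code></pre>');
    // Then inline code quotes, which also run over the content of the code block
    html = html.replace(/`(.*?[^\\])`/g, '<code>$1</code>');
    return html;
}

var out = regexPass(sample);
console.log(out);
```

The backticks around `ls` are part of the code, yet the second replacement turns them into a nested `<code>` tag, so the rendered code block no longer shows the original source.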

## Parsing line-by-line
We can take the post's HTML and split it on `\n` into an array (so each line is a cell in the array), then loop through the lines and do the replacements. When we identify the start of a code block we set a flag and leave the lines alone until the code block closes. (Actually there is one replacement we should still do inside a code block: un-escaping an escaped end-of-block marker, `\\`\`\``.)
```js
function parseWithArray(post) {
    // Define the regex we are going to use
    var brRegex = /<br>/g;

    // Line starting with ``` and an optional lang specified, can have optional <br>
    var beginCodeRegex = /^```([a-zA-Z]+)?(?:<br>)?/;
    // Line starts with ``` and nothing but spaces till the end
    var endCodeRegex = /^```[\s]*$/;
    // Escaped code is \``` and nothing till the end
    var escapedEndCodeRegex = /^\\```[\s]*$/;
    // Anything between backticks as long as the char before the closing backtick is not a "\" (escaped backtick)
    var codeQuoteRegex = /`(.*?[^\\])`/g;
    // Escaped backtick: \`
    var escapedCodeQuoteRegex = /\\`/g;

    // \[([^\]]+[^\\])] - Anything that is between brackets as long as it is not "]" and there is no "\" before the closing bracket (escaped bracket)
    // \((.*?)\) - Anything between round brackets
    // (?:{(.*?)})? - Optionally can have some text between curly brackets
    var linksRegex=/\[([^\]]+[^\\])]\((.*?)\)(?:{(.*?)})?/g;

    // Match any escaped brackets "\[" or "\]"
    var escapedLinksRegex = /\\(\[|])/g;

    // Line can start with spaces, then comes multiple "#", then can have spaces, then any text, and optional <br> before line ends
    var titleRegex = /^[\s]*([#]+)[\s]*(.*?)(?:<br>)?$/;

    var html = post.innerHTML;
    var lines = html.split('\n');
    var incode = false;
    for (var j = 0; j < lines.length; ++j) {
        var line = lines[j];
        if (line === null) {continue;} // See later that we put null in the array[j+1], skip those cells

        // Check if we are in code block
        if (incode) {
            // No <br> allowed in code blocks
            line = line.replace(brRegex, '');

            // Check if this is the end of the code block
            if (endCodeRegex.test(line)) {
                incode = false;
                line = '</code></pre>';
            } else {
                line = line.replace(escapedEndCodeRegex, '```');
            }
            lines[j] = line;
            continue;
        }

        /*  Not in code block */

        // Check if this is a beginning of code
        var m = line.match(beginCodeRegex);
        if (m) {
            incode = true;
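            // The first line in a pre block must be on the same line of the pre block,
            // otherwise we will get an empty space (as pre render \n)
            // Guard against a fence on the very last line, where there is no next line
            var firstCodeLine = (j + 1 < lines.length) ? lines[j+1].replace(brRegex, '') : '';
            lines[j] = '<pre><code class="' + (m[1] ? m[1] : '') + '">' + firstCodeLine;
            lines[j+1] = null;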
            continue;
        }

        /* This a regular line */

        // Replace titles (#)
        line = line.replace(titleRegex, function(match, hashes, title) {
            var headerLevel = hashes.length;
            return '<h' + headerLevel + '>' + title + '</h' + headerLevel + '>';
        });

        // replace any inline code quotes `xxx`
        line = line.replace(codeQuoteRegex, '<code>$1</code>');

        // Next replace all escaped backtick (\`) with backtick
        line = line.replace(escapedCodeQuoteRegex, '`');

        // Next replace links
        line = line.replace(linksRegex, '<a href="$2" target="$3">$1</a>');

        // Next replace all escaped brackets \[\] with brackets
        line = line.replace(escapedLinksRegex, '$1');

        lines[j] = line;
    }
    post.innerHTML = lines.filter(function(l) {return l !== null;}).join('\n');
}
```
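The links handling is the part that is new in this method, so here is a small sketch that runs only those two replacements on a few strings. The `renderLinks` helper and the sample inputs are made up for this illustration; the two regular expressions are copied from `parseWithArray`.

```js
// Same expressions as in parseWithArray
var linksRegex = /\[([^\]]+[^\\])]\((.*?)\)(?:{(.*?)})?/g;
var escapedLinksRegex = /\\(\[|])/g;

// Hypothetical helper: apply the links replacement, then un-escape \[ and \]
function renderLinks(line) {
    return line
        .replace(linksRegex, '<a href="$2" target="$3">$1</a>')
        .replace(escapedLinksRegex, '$1');
}

// Link with an explicit target
var withTarget = renderLinks('See [GitHub](https://github.com){_blank}');
// Link without {target}: the unmatched group is replaced with an empty string
var noTarget = renderLinks('See [GitHub](https://github.com)');
// Escaped brackets are not treated as a link, only un-escaped
var escaped = renderLinks('Array access: arr\\[0\\]');

console.log(withTarget);
console.log(noTarget);
console.log(escaped);
```

Two things to keep in mind with this pattern: a link without `{target}` still gets an empty `target` attribute, and because the text group needs at least two characters, a one-character link text such as `[a](url)` is left as-is.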

## Adding to a Blogger blog
In order to use this in your blog:

* Edit your blog's template HTML
* Find the `<head>` tag and add the following script:
```html
<script src='//cdn.rawgit.com/ValYouW/simple-md-blogger/8a142bd56bd29221e5a73e746f49462b7046ebc4/blogger-md.js'/>
```
* Then add a script tag that will parse the blog posts when the page loads:
```html
<script type='text/javascript'>
document.addEventListener("DOMContentLoaded", function(event) {
    // NOTE: Check what is the CSS class of the posts in your blog
    var postsSelector = '.post-body';
    window.VYW.parse(postsSelector);
    // If you have some library to format code blocks (like highlight.js), call it now.
});
</script>
```
* Titles are converted to `<h1>`, `<h2>`, etc. tags. The default style might not suit your needs, so you can style those tags yourself; for example, I added the following style tag:
```html
<style type='text/css'>
.post-body h2 {
    font-size: 1.5em;
    font-weight: bold;
}
.post-body h3 {
    font-size: 1.2em;
    font-weight: bold;
}
</style>
```
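If you also use a code highlighter, the order matters: the markdown parsing has to run first, because it is what creates the code blocks. Below is a rough sketch of that ordering, assuming highlight.js is loaded as a global named `hljs`. The `parseAndHighlight` helper is made up for this illustration, and it checks which entry point exists (`highlightAll` in newer highlight.js versions, `initHighlighting` in older ones) instead of assuming a version.

```js
// Hypothetical helper: parse the markdown first, then let the highlighter run
function parseAndHighlight(win, postsSelector) {
    // Markdown first: this is what creates the pre/code elements
    win.VYW.parse(postsSelector);

    // Assumption: highlight.js, if present, is exposed as a global named hljs
    var hljs = win.hljs;
    if (!hljs) { return; }

    if (typeof hljs.highlightAll === 'function') {
        hljs.highlightAll();
    } else if (typeof hljs.initHighlighting === 'function') {
        hljs.initHighlighting();
    }
}

// In the browser, run it once the page has loaded
if (typeof document !== 'undefined') {
    document.addEventListener('DOMContentLoaded', function() {
        // NOTE: Check what is the CSS class of the posts in your blog
        parseAndHighlight(window, '.post-body');
    });
}
```

This would replace the inline script from the previous step rather than being added next to it, otherwise the posts would be parsed twice.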

## Performance
The performance of the two methods is quite similar; a jsperf test can be found [here](http://jsperf.com/vyw-md-parsing){_blank}.
That's it. Again, the code can be found on [GitHub](https://github.com/ValYouW/simple-md-blogger){_blank}.
Enjoy!

2 comments :

  1. Awesome Article Valyouw! Thanks for sharing it with us.
    How can we add Highlighted blocks with Code Prettify and stuff?

    For example, using one of these plugins:

    http://google-code-prettify.googlecode.com/svn/trunk/README.html
    https://github.com/google/code-prettify

    1. Hi,

      I am using highlight.js with no problems, just make sure you call its initHighlighting() method AFTER you have done the markdown parsing.
      Check the section "Adding to a Blogger blog" in this blog post, there is an example there (in the code example I put a comment to show where a call to highlight.js should be done).
