Converting Markdown to HTML is something us engineers usually take for granted. We usually just import a library and do not give a second thought about how the conversion takes place. I first encountered Markdown to HTML as an interview question. The question typically goes: we need to implement a Markdown to HTML converter, we need to convert a set list of Markdown elements to HTML like bold, paragraph and heading.
Personally I think it's a bit tough to give the correct answer in an interview setting. So I'd like to explore the solution space in this post. We'll first look at a naive way of completing it, then we will look at what a markdown library actually and see if that is something that is a valid solution to the interview question.
When faced with the interview question, my first instinct is to break it down piece by piece: focusing on a particular
markdown feature first like heading. Heading in markdown is defined with # text here
and in html as
<h1>text here</h1>
. I would create a converter function that would do the actual conversion. Something like:
def heading_conversion(text: str) -> str:
hashtag_count = 0
i = 0
while text[i] == '#':
hashtag_count+=1
i+=1
if hashtag_count == 0:
return ""
return f"<h{hashtag_count}>{text[i:].strip()}</h{hashtag_count}>"
I would repeat a converter like this for all the other features, then I would string these converter functions up in a for loop.
converters = [heading_conversion, paragraph_conversion, link_conversion]
for c in converters:
text = c(text)
So what is wrong with this approach? Well for one this does not scale. Each converter function processes the entire text. If the Markdown text is large we are doing O(kn) operations where k is the number of converter functions we have. This is also doing a lot of repeated work. If we know a block is a paragraph, then it could never be a heading. Ideally, we do not want to reprocess a previously processed block.
We looked at what I would've done in an interview setting. This is obviously not the optimal answer, so let's look at what is. Let's look at what a Markdown Conversion Library would do. Let's look at goMarkdown/markdown.
Looking at the repo, we see that there are basically three components: ast, html, and parser. The gist of it is that parser will parse Markdown text into a Markdown AST (Abstract Syntax Tree).
Abstract Syntax Tree is simply an object representation of the markdown in a tree structure. Each block of text can be broken down into individual nodes. Each Markdown syntax can be broken down into 1 or more nodes. This is how most languages are represented in code. This is also how most programming languages are executed.
Let's follow the heading tag and see how this works. First, we see that the parser will detect a heading block. Once a
heading block has been detected, the parser will parse the block into an ast.heading
node object.
It will then add the node to the tree. When we have completed generating the syntax tree, we can then render the HTML
with the renderer. We see that the heading rendering
is broken up into two parts. a headingEnter
function and a headingExit
function. There's nothing special about these,
they just add the heading opening tag and the heading closing tag.
With an Abstract Syntax Tree, the text is processed only once. When the node is generated, the parser does not regenerate the same node again, so no work is repeated.
Markdown Conversion is being asked in an interview setting. We first looked at what a candidate would do under an interview environment. Next, we explored what the standard library is doing under the hood for converting Markdown to HTML. This exercise in exploring the solution space was really compelling. My mind wouldn't have gone to AST by just looking at the interview question. Now, thinking back it makes a lot os sense. I do question, however, how realistic is it to code up an AST solution in an interview setting. We shall see.