Making my own markup language for fun and profit

So, earlier today i saw una had started using djot in her garden. I think it's neat. Then i thought to myself, what if i made my own markup language, just for fun? So i'm going to try to do that now. This site is the first ever document to be written in this new language, and i'm writing it as i'm implementing the language. I'll see how long it takes to get it working, and i'll try to hook 11ty up to it afterwards (if you can see this, that means i succeeded!). Outputting HTML seems fine to begin with. Writing this document also seems like a nice way to get a grip on what features i want. It might just end up being a subset of markdown, i'm not sure yet.

I'll just be doing this in my garden folder, live on my server. Why not!

So, let's start laying out the types. I'll be using purescript, as it is my favourite language, and it's nice to write parsers in.

npm install -D purescript spago@next purs-tidy
npx spago init
npx purs-tidy generate-config -ua -w 80 -isi
# ^ defaults i like for autoformatting.
# if the config file is in the directory, the vscode purescript plugin
# automatically formats the file on save
npx spago install foldable-traversable qualified-do strings parsing
# ^ dependencies we'll need

Let's say a document is an array of blocks.

type Document = Array Block

Then for now let's say a block is this:

data Block
  = Paragraph InlineContent
  | CodeBlock { language ∷ Maybe String, code ∷ String }
  | OrderedList (Array InlineContent)
  | UnorderedList (Array InlineContent)

This might be a good time to say i don't really know what i'm doing, and i'm not trying to be super proper with this. There are probably better decisions you can make here. Maybe lists should be able to have blocks in them. idk.

But okay, now let's decide what inline content is:

type InlineContent = NonEmptyArray SingleInlineContent

data SingleInlineContent
  = TextContent String
  | Italic InlineContent
  | Bold InlineContent
  | InlineCode { language ∷ Maybe String, code ∷ String }
  | Strikethrough InlineContent
  | Link { content ∷ InlineContent, destination ∷ String }

This is probably fine for now. Okay! We have a cute little AST. Notice how most of the SingleInlineContent stuff is recursive, so you can put a bold outside an italic outside some text content, for example.

I think the next thing i'd like to do is write the printer. It feels like this should come after the parser, maybe just because that's the order it will run in, but i feel like it's easier to start with the printer.

First, some helper functions for making html elements:

elem' ∷ Boolean → String → Array (Tuple String String) → String → String
elem' isBlock name attrs content = Semigroup.do
  "<" <> name <> foldedAttrs <> ">"
  (if isBlock then "\n" else "") <> content
  (if isBlock then "\n" else "") <> "</" <> name <> ">"
  where
  foldedAttrs = foldMap
    (\(Tuple attr value) → " " <> attr <> "=\"" <> value <> "\"")
    attrs

elem ∷ Boolean → String → String → String
elem isBlock name content = elem' isBlock name [] content

blockElem ∷ String → String → String
blockElem = elem true

blockElem' ∷ String → Array (Tuple String String) → String → String
blockElem' = elem' true

inlineElem ∷ String → String → String
inlineElem = elem false

inlineElem' ∷ String → Array (Tuple String String) → String → String
inlineElem' = elem' false

htmlEscape ∷ String → String
htmlEscape _ = ""

There's no indenting, so we'll have to live with unindented html output for now. I might thread an indent state through this later. There's also a stub for htmlEscape, which we need for putting raw text and code in html elements. I'll implement it later.

Now we can render blocks like this:

renderDocument ∷ Document → String
renderDocument blocks = intercalate "\n\n" (renderBlock <$> blocks)

renderBlock ∷ Block → String
renderBlock (Paragraph content) = blockElem "p" (renderInline content)
renderBlock (CodeBlock { language, code }) = blockElem "pre" $ case language of
  Just lang →
    blockElem' "code" [ Tuple "class" $ "language-" <> lang ] $ htmlEscape code
  Nothing → blockElem "code" $ htmlEscape code
renderBlock (OrderedList items) = blockElem "ol" $
  foldMap (\item → inlineElem "li" $ renderInline item) items
renderBlock (UnorderedList items) = blockElem "ul" $
  foldMap (\item → inlineElem "li" $ renderInline item) items

And inline content like this:

renderInline ∷ InlineContent → String
renderInline content = foldMap renderSingleInline content

renderSingleInline ∷ SingleInlineContent → String
renderSingleInline (TextContent str) = htmlEscape str
renderSingleInline (Italic content) = inlineElem "em" $ renderInline content
renderSingleInline (Bold content) = inlineElem "strong" $ renderInline content
renderSingleInline (InlineCode { language, code }) = case language of
  Just lang →
    inlineElem' "code" [ Tuple "class" $ "language-" <> lang ] $ htmlEscape code
  Nothing → inlineElem "code" $ htmlEscape code
renderSingleInline (Strikethrough content) =
  inlineElem "s" $ renderInline content
renderSingleInline (Link { content, destination }) =
  inlineElem' "a" [ Tuple "href" destination ] $ renderInline content

Let's get back to the HTML escape function:

htmlEscape ∷ String → String
htmlEscape = Compose.do
  replaceAll (Pattern "&") (Replacement "&amp;")
  replaceAll (Pattern "<") (Replacement "&lt;")
  replaceAll (Pattern ">") (Replacement "&gt;")

This should do. A lot easier than i dreaded.
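
One detail worth spelling out: the order of the three replacements matters. The sketch below mirrors htmlEscape in plain JavaScript (the language of the 11ty config), assuming Compose.do applies the replacements top to bottom. The names escapeAmpFirst, escapeAmpLast and escapeAttr are made up for illustration. It also shows a possible extra step for attribute values, which the function above doesn't handle.

```javascript
// Hypothetical JavaScript mirror of htmlEscape, to show why "&" goes first.
const escapeAmpFirst = s =>
  s.replaceAll("&", "&amp;").replaceAll("<", "&lt;").replaceAll(">", "&gt;");

// Same replacements with "&" last: the "&" in "&lt;" gets escaped a second time.
const escapeAmpLast = s =>
  s.replaceAll("<", "&lt;").replaceAll(">", "&gt;").replaceAll("&", "&amp;");

// Assumption: text placed inside a double-quoted attribute (like href) would
// also need the double quote escaped.
const escapeAttr = s => escapeAmpFirst(s).replaceAll('"', "&quot;");

console.log(escapeAmpFirst("a <> b")); // a &lt;&gt; b
console.log(escapeAmpLast("a <> b"));  // a &amp;lt;&amp;gt; b
console.log(escapeAttr('x"y'));        // x&quot;y
```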

Alright! I think it's parsing time.

document ∷ Parser String Document
document = whiteSpace *> many (block <* whiteSpace) <* eof

word ∷ Parser String String
word = fromCodePointArray <$> NEA.toArray <$> many1 (satisfyCodePoint isLetter)

block ∷ Parser String Block
block = Alt.do
  CodeBlock <$> ado
    string "```"
    language ← optionMaybe word
    string "\n"
    Tuple code _ ← anyTill (string "\n```")
    in { language, code }
  -- many1 gives a NonEmptyArray, but UnorderedList wants a plain Array
  UnorderedList <<< NEA.toArray <$> many1
    (string "-" *> whiteSpace *> inlineContent <* whiteSpace)
  Paragraph <$> inlineContent

inlineContent ∷ Parser String InlineContent
inlineContent = many1 singleInlineContent

inlineContentUntil ∷ String → Parser String InlineContent
inlineContentUntil until = fromFoldable1 <$> many1Till
  singleInlineContent
  (string until)

singleInlineContent ∷ Parser String SingleInlineContent
singleInlineContent = do
  notFollowedBy (string "\n\n" <|> string "\n-" <|> eof $> "")
  Alt.do
    Bold <$> (string "**" *> inlineContentUntil "**")
    Italic <$> (string "*" *> inlineContentUntil "*")
    Strikethrough <$> (string "~~" *> inlineContentUntil "~~")
    InlineCode <$> ado
      string "`"
      Tuple code _ ← anyTill (string "`")
      language ← optionMaybe $ string "{" *> word <* string "}"
      in { code, language }
    Link <$> ado
      content ← string "[" *> inlineContentUntil "]"
      Tuple destination _ ← string "(" *> anyTill (string ")")
      in { content, destination }
    TextContent <$> do
      Tuple content _ ← anyTill
        (oneOfMap (lookAhead <<< string) textEnders <|> eof $> "")
      if content == "" then fail "Empty text content"
      else pure $ replaceAll (Pattern "\n") (Replacement " ") content
  where
  textEnders = [ "\n\n", "\n-", "*", "~~", "[", "]", "`" ]

Something like this should be good. This was a bit harder than i expected: the text content parser needs a list of special sequences that end it, so that whichever parser called it gets control back in time to match its own closing characters. There's probably a better way to do this (LR parsing?). I also decided to not implement ordered lists for now.
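
The "text ends at a special sequence" idea can be sketched outside the parser library too. This is a plain JavaScript approximation, not the code above: takeText is a made-up name, it works on string indices instead of a parser state, and unlike the PureScript version it returns an empty text instead of failing.

```javascript
// Hypothetical helper, not part of the parser above: scan plain text starting
// at `pos` and stop right before the first "ender" sequence (or at end of input).
const textEnders = ["\n\n", "\n-", "*", "~~", "[", "]", "`"];

function takeText(input, pos) {
  let end = pos;
  while (end < input.length && !textEnders.some(e => input.startsWith(e, end))) {
    end++;
  }
  // Like the PureScript version, single newlines inside the text become spaces.
  return { text: input.slice(pos, end).replaceAll("\n", " "), next: end };
}

console.log(takeText("some *bold* text", 0)); // { text: 'some ', next: 5 }
```

The caller (the bold parser, say) would then pick up at `next` and try to match its own delimiter there.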

Alright! I think it's time to try it. Let's make a very simple function to run it:

renderArtup ∷ String → String
renderArtup input = case runParser input document of
  Right result → renderDocument result
  Left error → intercalate "\n" $ parseErrorHuman input 20 error

Time to test it on this document!

Oops, there's an infinite loop somewhere...

Okay, fixed it. And then the huge number of other bugs that showed up afterwards. I went back in this document and pasted the finished code so that i'm not showing off broken code.

Now that we have a functioning parser and printer, it's time to hook it up to 11ty, which can be done by adding the following to the config:

// importing from the purescript output
import { renderArtup } from "./output/Artup/index.js";

export default cfg => {
	// ...

	cfg.addTemplateFormats("artup");
	cfg.addExtension("artup", {
		compile: async inputContent => {
			return async _ => renderArtup(inputContent);
		}
	});
}

The only problem now is that syntax highlighting doesn't work. I naïvely thought that my highlight plugin would work its magic on the outputted HTML, but it's not that simple. Prism doesn't have a function for highlighting a whole html document as a string, only for outputting highlighted code from a string as html, or doing mutations on the DOM (which would have to happen clientside). The 11ty plugin exposes the function to generate html from a source code string, so i'll have to pass that into purescript and use that, instead of just generating the <code> elements myself. Time to edit the printing functions!

renderDocument ∷ (String → String → String) → Document → String
renderDocument highlight blocks = intercalate "\n\n"
  (renderBlock highlight <$> blocks)

renderBlock ∷ (String → String → String) → Block → String
renderBlock _ (Paragraph content) = blockElem "p" (renderInline content)
renderBlock highlight (CodeBlock { language, code }) =
  highlight (fromMaybe "text" language) (htmlEscape code)
renderBlock _ (OrderedList items) = blockElem "ol" $
  foldMap (\item → inlineElem "li" $ renderInline item) items
renderBlock _ (UnorderedList items) = blockElem "ul" $
  foldMap (\item → inlineElem "li" $ renderInline item) items

renderArtup ∷ (String → String → String) → String → String
renderArtup highlight input = case runParser input document of
  Right result → renderDocument highlight result
  Left error → intercalate "\n" $ parseErrorHuman input 20 error

I didn't bother highlighting code spans (i never set any language for them in this document anyway). Prism is only really meant for highlighting blocks, so if you want to highlight spans you have to strip the <pre> like una does in her config here (in the handler for verbatim under djot compilation).
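
For reference, stripping the wrapper could look roughly like this. It's a sketch only: stripPre is a made-up helper, and the shape of the highlighter output in the example is an assumption, so check what the plugin actually emits before relying on the pattern.

```javascript
// Hypothetical helper: drop an outer <pre ...>...</pre> wrapper if there is one,
// leaving the inner <code> element for use as an inline span.
function stripPre(html) {
  const match = html.trim().match(/^<pre\b[^>]*>([\s\S]*)<\/pre>$/);
  return match ? match[1] : html;
}

// Assumed shape of highlighter output; real output may differ.
const block = '<pre class="language-js"><code class="language-js">let x</code></pre>';
console.log(stripPre(block)); // <code class="language-js">let x</code>
```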

And now we just adapt and pass in the highlight function from the plugin (purescript functions compile to curried javascript functions that take one argument at a time, so we need to curry when calling them or when passing functions in to them):

cfg.addExtension("artup", {
	// a regular function instead of an arrow function, so that `this` is available
	compile: async function(inputContent) {
		const highlight = this.config.javascriptFunctions.highlight;
		return async _ => renderArtup
			(lang => code => highlight(lang, code))
			(inputContent);
	}
});
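
To make the currying concrete, here is a tiny standalone sketch with stand-ins for both sides. renderArtupStub, highlightStub and curry2 are invented names for illustration, and the stub's output format is made up; the real renderArtup comes out of the purescript compiler.

```javascript
// Stand-in for a compiled two-argument PureScript function:
// it takes one argument and returns a function that takes the next one.
const renderArtupStub = highlight => input => highlight("text")(input);

// Stand-in for an uncurried JavaScript highlighter (hypothetical output format).
const highlightStub = (lang, code) => `<pre class="language-${lang}">${code}</pre>`;

// Adapter: wrap a (a, b) function so it can be called as f(a)(b).
const curry2 = f => a => b => f(a, b);

console.log(renderArtupStub(curry2(highlightStub))("hi"));
// <pre class="language-text">hi</pre>
```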

And that's it! Time to have some fun with the following:

italics
bold
bold italics
~~strikethrough~~
i tried to make a line where i mixed multiple of them but it broke, oops

## Conclusion (i didn't implement headers either, just pretend this text is big thanks)

I thought this would be easy and maybe done in an hour. That was absolutely not the case. When the parser didn't work, i eventually ended up going to bed, and came back two days later (today) to fix it and hook it up to 11ty.

I think the way i went about this doesn't really work that well. It might have gone better if i'd preprocessed input with a lexer, but it's probably still not easy. This style of writing parsers might just not be suited for this task at all. The only other big thing i've really seen it used for is parsing programming languages (which it's pretty good at). Markup languages are way less rigid, full of loose text everywhere.

It was an interesting experience, but also a little frustrating, and i don't think i'll be writing any more garden plants in artup. It's like, really bad. And if you write something the parser doesn't like, the error message is just printed straight in the page, which looks really bad because it's formatted for a monospace terminal, and html eats the line breaks. I guess in a markup language, you ideally don't want any error messages. There's always a parse.
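
If the error output were ever worth keeping, one low-effort option would be to escape it and wrap it in a pre element, which keeps the line breaks and gets a monospace font by default. A sketch on the javascript side, with made-up names (renderErrorHtml, the artup-error class), assuming the error arrives as an array of lines:

```javascript
// Hypothetical fallback for parse failures: escape the message lines and wrap
// them in <pre>, so the browser keeps the line breaks and column alignment.
const escapeHtml = s =>
  s.replaceAll("&", "&amp;").replaceAll("<", "&lt;").replaceAll(">", "&gt;");

const renderErrorHtml = lines =>
  `<pre class="artup-error">${lines.map(escapeHtml).join("\n")}</pre>`;

console.log(renderErrorHtml(["Unexpected input", "  a <b"]));
```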