Golang regexp: matching newline

03/31/2020 regular expressions sources



To make code syntax highlighting available on this website, I use regular expressions. The logic is simple: I put plain source code into a special HTML tag. When a post loads, I process these tags: I find them with a regular expression and replace the source code with the highlighted version.

I spent a lot of time trying to understand why some code examples were not matched by the regular expression. I used the dot "." special character to match any symbol inside my tag.
Look at the following regexp and text example and guess whether it matches or not:

Regexp:

<tag>(.*?)</tag>

Text:

<tag>1
2
3</tag>

If you have PHP experience, your answer will probably be "yes".

But a simple example from the Go Playground makes it clear that the answer is actually "no":

// Same pattern and text as above; "." does not match "\n" by default.
match, _ := regexp.MatchString("<tag>(.*?)</tag>", "<tag>1\n2\n3</tag>")
fmt.Println(match)

// false

Then I searched the Go sources, trying to understand how Go deals with character classes, and found the following list of flags in the regexp/syntax package:

const (
	FoldCase      Flags = 1 << iota // case-insensitive match
	Literal                         // treat pattern as literal string
	ClassNL                         // allow character classes like [^a-z] and [[:space:]] to match newline
	DotNL                           // allow . to match newline
	OneLine                         // treat ^ and $ as only matching at beginning and end of text
	NonGreedy                       // make repetition operators default to non-greedy
	PerlX                           // allow Perl extensions
	UnicodeGroups                   // allow \p{Han}, \P{Han} for Unicode group and negation
	WasDollar                       // regexp OpEndText was $, not \z
	Simple                          // regexp contains no counted repetition

	MatchNL = ClassNL | DotNL

	Perl        = ClassNL | OneLine | PerlX | UnicodeGroups // as close to Perl as possible
	POSIX Flags = 0                                         // POSIX syntax
)

According to the sources, Go works the following way:

  • It parses the regexp using syntax.Parse
  • syntax.Parse uses Flags to "plan" the regular expression execution (to map each regexp symbol to an operation)
  • regexp.Regexp (the public struct) is created from the result of syntax.Parse

So we need to compile the regexp with the DotNL flag.

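When I searched for all the places where the internal regexp compile function is called, I found that only two flag sets are ever passed: POSIX and Perl. That means the exported compile functions give no way to pass DotNL as an argument; the dot's behaviour can only be changed from inside the pattern (the s flag described in the regexp/syntax docs) or avoided altogether.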
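For comparison, the regexp/syntax documentation lists an inline flag, s, that lets . match \n from within the pattern itself. A minimal sketch of the same test with and without a (?s) prefix; the helper name matches is made up for this illustration:

```go
package main

import (
	"fmt"
	"regexp"
)

// matches (hypothetical helper) compiles pattern and reports whether
// it matches anywhere in text.
func matches(pattern, text string) bool {
	return regexp.MustCompile(pattern).MatchString(text)
}

func main() {
	text := "<tag>1\n2\n3</tag>"

	// Default behaviour: "." stops at "\n", so the tag is not found.
	fmt.Println(matches(`<tag>(.*?)</tag>`, text)) // false

	// With the inline s flag, "." also matches "\n".
	fmt.Println(matches(`(?s)<tag>(.*?)</tag>`, text)) // true
}
```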

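So the regexp that really works for me, with no dot at all, is below (written as it appears inside a double-quoted Go string, hence the doubled backslash):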

<tag>([[:graph:]\\s]*?)</tag>
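As a quick check, here is a sketch that runs this pattern against the multi-line text from the beginning of the post and extracts the captured body. The names tagRe and tagBody are made up for this illustration.

```go
package main

import (
	"fmt"
	"regexp"
)

// tagRe is the pattern from the post: [[:graph:]] covers visible ASCII
// characters and \s covers whitespace, including "\n".
var tagRe = regexp.MustCompile("<tag>([[:graph:]\\s]*?)</tag>")

// tagBody (hypothetical helper) returns the text between the first
// <tag> and </tag> pair, and whether such a pair was found.
func tagBody(text string) (string, bool) {
	m := tagRe.FindStringSubmatch(text)
	if m == nil {
		return "", false
	}
	return m[1], true
}

func main() {
	body, ok := tagBody("<tag>1\n2\n3</tag>")
	fmt.Printf("%q %v\n", body, ok) // "1\n2\n3" true
}
```

The non-greedy *? matters here: [[:graph:]] also matches "<", "/" and ">", so a greedy repetition would run on to the last closing tag in the text.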

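There are also a lot of predefined character classes, documented in the regexp/syntax package docs. I used two of them inside the [] brackets: [[:graph:]] for visible ASCII characters and \s for whitespace, including newline. Together they cover practically every character in my code samples, but note that both classes are ASCII-only, so non-ASCII text inside the tag would not match.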
