03/31/2020 regular expressions sources
To provide syntax highlighting for code on this website I use regular expressions. The logic is simple: I put plain source code into a special HTML tag. When a post loads, I process these tags: I find each one with a regular expression and replace its source code with a highlighted version.
I spent a lot of time trying to understand why some code examples were not matched by the regular expression.
I used the dot "." special character to match any symbol inside my tag.
Look at the following regexp and text example and guess whether it matches:
Regexp:
<tag>(.*?)</tag>
Text:
<tag>1
2
3</tag>
If you are used to PHP patterns with the "s" (dot-all) modifier, your answer will probably be "yes".
But a simple example from the Go Playground makes it clear that the answer is actually "no":
match, _ := regexp.MatchString("<tag>(.*)</tag>", "<tag>1\n2\n3</tag>")
fmt.Println(match) // false
Then I searched the Go sources to understand how Go deals with character classes, and found the following list of flags:
const (
	FoldCase      Flags = 1 << iota // case-insensitive match
	Literal                         // treat pattern as literal string
	ClassNL                         // allow character classes like [^a-z] and [[:space:]] to match newline
	DotNL                           // allow . to match newline
	OneLine                         // treat ^ and $ as only matching at beginning and end of text
	NonGreedy                       // make repetition operators default to non-greedy
	PerlX                           // allow Perl extensions
	UnicodeGroups                   // allow \p{Han}, \P{Han} for Unicode group and negation
	WasDollar                       // regexp OpEndText was $, not \z
	Simple                          // regexp contains no counted repetition

	MatchNL = ClassNL | DotNL

	Perl        = ClassNL | OneLine | PerlX | UnicodeGroups // as close to Perl as possible
	POSIX Flags = 0                                         // POSIX syntax
)
According to the sources, Go works the following way: syntax.Parse uses Flags to "plan" regular expression execution (to map each regexp symbol to an operation).
So we need to compile the regexp with the DotNL flag.
When I searched for all the callers of the regexp.compile function, I found that only two flag sets are ever passed: POSIX and Perl.
That means the functions exported by the regexp package give no way to pass DotNL as a separate compile flag.
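It is worth noting that the regexp/syntax documentation also lists flags that can be set inside the pattern itself, and one of them, (?s), is described as "let . match \n". So DotNL can apparently be switched on without touching the compile call. A minimal sketch, where the helper name tagBody is my own:

```go
package main

import (
	"fmt"
	"regexp"
)

// tagBody (hypothetical helper) returns the text captured by the first
// group of pattern, and whether the pattern matched at all.
func tagBody(pattern, text string) (string, bool) {
	m := regexp.MustCompile(pattern).FindStringSubmatch(text)
	if m == nil {
		return "", false
	}
	return m[1], true
}

func main() {
	text := "<tag>1\n2\n3</tag>"

	// Plain dot: stops at the first newline, so the pattern fails.
	_, ok := tagBody(`<tag>(.*?)</tag>`, text)
	fmt.Println(ok) // false

	// Same pattern prefixed with the in-pattern (?s) flag.
	body, ok := tagBody(`(?s)<tag>(.*?)</tag>`, text)
	fmt.Printf("%v %q\n", ok, body) // true "1\n2\n3"
}
```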
So the regexp that really works for me is below (inside a double-quoted Go string, \s has to be written as \\s):
<tag>([[:graph:]\s]*?)</tag>
There are also a lot of predefined character classes, documented here. I used two of them inside the [] brackets: [[:graph:]] covers the visible ASCII characters and \s covers whitespace, including newlines.
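To check the character-class approach, here is a small runnable sketch. The names tagRe and extract are my own. It also shows a limitation: [[:graph:]] is an ASCII-only class in Go, so a body containing non-ASCII letters is not matched by this pattern.

```go
package main

import (
	"fmt"
	"regexp"
)

// tagRe uses a raw string literal, so \s needs no extra escaping.
var tagRe = regexp.MustCompile(`<tag>([[:graph:]\s]*?)</tag>`)

// extract (hypothetical helper) returns the body of the first <tag>
// element in text, and whether one was found.
func extract(text string) (string, bool) {
	m := tagRe.FindStringSubmatch(text)
	if m == nil {
		return "", false
	}
	return m[1], true
}

func main() {
	// Multi-line ASCII body: matched, newlines included.
	body, ok := extract("<tag>1\n2\n3</tag>")
	fmt.Printf("%v %q\n", ok, body) // true "1\n2\n3"

	// Non-ASCII letter: outside both [[:graph:]] and \s, so no match.
	_, ok = extract("<tag>héllo</tag>")
	fmt.Println(ok) // false
}
```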