Golang regexp: matching newline

March 30, 2020 Regular Expressions Sources



Why regular expressions with dot (".") work differently in Go compared to PHP and JavaScript.

To enable code syntax highlighting on this website, I use regular expressions. The logic is simple — I put source code into special HTML tags. When a post loads, I process these tags — search for them using regular expressions, and replace the source code with highlighted versions.

I spent a lot of time trying to understand why some code examples were not matched by my regular expressions. I used the dot “.” special character to match any character inside my tag. Look at the following regexp and text, and guess whether it matches:

Regexp:

<tag>(.*?)</tag>

Text:

<tag>1
2
3</tag>

If you usually write PHP patterns with the s (dotall) modifier, your answer will probably be “yes”.

However, a simple example from the Go Playground makes it clear that the answer is actually “no”:

// Requires the fmt and regexp packages; the pattern is the one shown above.
match, _ := regexp.MatchString("<tag>(.*?)</tag>", "<tag>1\n2\n3</tag>")
fmt.Println(match)

// false

Then I searched the Go sources to understand how Go deals with character classes, and I found the following list of flags in the regexp/syntax package:

const (
	FoldCase      Flags = 1 << iota // case-insensitive match
	Literal                         // treat pattern as literal string
	ClassNL                         // allow character classes like [^a-z] and [[:space:]] to match newline
	DotNL                           // allow . to match newline
	OneLine                         // treat ^ and $ as only matching at beginning and end of text
	NonGreedy                       // make repetition operators default to non-greedy
	PerlX                           // allow Perl extensions
	UnicodeGroups                   // allow \p{Han}, \P{Han} for Unicode group and negation
	WasDollar                       // regexp OpEndText was $, not \z
	Simple                          // regexp contains no counted repetition

	MatchNL = ClassNL | DotNL

	Perl        = ClassNL | OneLine | PerlX | UnicodeGroups // as close to Perl as possible
	POSIX Flags = 0                                         // POSIX syntax
)

According to the sources, Go works in the following way:

  • It compiles regexp using syntax.Parse
  • syntax.Parse uses Flags to “plan” regular expression execution (to match regexp symbols to operations)
  • regexp.Regexp (a public struct) is created using the results of syntax.Parse

So we need to compile the regexp with the DotNL flag.

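When I looked at how the regexp package calls the parser, I found only two flag sets in use: Perl (regexp.Compile) and POSIX (regexp.CompilePOSIX). Neither contains DotNL, and neither function takes a flags argument, so DotNL cannot be passed in from the calling code. The flag can still be switched on inside the pattern: the regexp/syntax documentation lists the inline flag s (“let . match \n”), so (?s)<tag>(.*?)</tag> matches the text above.

A regexp that works without any flags is below: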

<tag>([[:graph:]\s]*?)</tag>

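Go has many predefined character classes, listed in the regexp/syntax package documentation. I combined two of them inside the [] brackets: [[:graph:]] for visible characters and \s for whitespace, including newlines. Inside an interpreted Go string literal the second class is written \\s; in a raw (backquoted) literal it is \s. Note that [[:graph:]] covers ASCII only, so this pattern does not match non-ASCII text inside the tag.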
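A short sketch comparing the patterns on the sample text from this post. The third pattern is an addition for comparison: it uses the inline s flag, which the regexp/syntax documentation describes as letting . match \n. The helper name matches is an assumption made for the example.

```go
package main

import (
	"fmt"
	"regexp"
)

// text is the multi-line sample from the post.
const text = "<tag>1\n2\n3</tag>"

// matches is a hypothetical helper: it reports whether pattern
// matches the sample text. It panics on an invalid pattern.
func matches(pattern string) bool {
	return regexp.MustCompile(pattern).MatchString(text)
}

func main() {
	// Plain dot: does not cross the newlines.
	fmt.Println(matches(`<tag>(.*?)</tag>`)) // false
	// Character classes: visible ASCII characters plus whitespace.
	fmt.Println(matches(`<tag>([[:graph:]\s]*?)</tag>`)) // true
	// Inline s flag: dot also matches newline.
	fmt.Println(matches(`(?s)<tag>(.*?)</tag>`)) // true
}
```

Raw string literals are used here so that \s does not need a doubled backslash.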


Test Your Knowledge

1. In Go, does the dot (.) character in regular expressions match newline characters by default?
2. What flag is needed to allow the dot (.) to match newline characters in Go?
3. Which flag sets does Go's regexp package pass to the parser when compiling a pattern?
4. Which regexp pattern does this post use to match all characters, including newlines, between the tags?
