Regular Expression Tools: grep

Author: jicanmeng

Date: September 14, 2015


grep searches text using a regular expression and prints the lines that match.

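  1. Example 1: Searching for a specific string
  2. Example 2: The line-start anchor ^ and the line-end anchor $
  3. Example 3: Any single character . and the repetition character *
  4. Example 4: Using brackets [] to match any one of a set of characters
  5. Example 5: Using \{ and \} to limit how many times the preceding RE character repeats
  6. Example 6: Using \< and \> to match whole words
  7. Example 7: Using \( and \) for backreferences
  8. BRE and ERE
  9. Example 8: Zero or one occurrence of the preceding character
  10. Example 9: One or more occurrences of the preceding character
  11. Example 10: Searching for multiple strings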

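Example 1: Searching for a specific string:

  1. Find lines containing the string the, and print line numbers:
    grep -n 'the' regular.txt
  2. Find lines that do not contain the string the, and print line numbers:
    grep -vn 'the' regular.txt
  3. Find lines containing the string the, print line numbers, and ignore case:
    grep -in 'the' regular.txt
  4. As in the previous item, but also highlight the matched keyword in color:
    grep -in --color=auto 'the' regular.txt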
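
The options above can be tried on a small sample file. This is only a sketch: the real contents of regular.txt are not shown in this article, so a three-line stand-in file (an assumption) is created in a temporary location.

```shell
# Hypothetical stand-in for regular.txt; the real file is not shown in this article.
sample=$(mktemp)
printf '%s\n' 'The quick fox' 'jumps over the dog' 'and runs away' > "$sample"

with_the=$(grep -n 'the' "$sample")       # lines containing lowercase "the"
without_the=$(grep -vn 'the' "$sample")   # lines that do NOT contain "the"
any_case=$(grep -in 'the' "$sample")      # "the" in any letter case

printf '%s\n' "$with_the" '--' "$without_the" '--' "$any_case"
rm -f "$sample"
```

Line 1 only shows up once -i is added, because its "The" is capitalized.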

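Example 2: The line-start anchor ^ and the line-end anchor $:

  1. Find lines that start with the, and print line numbers:
    grep -n '^the' regular.txt
  2. Find lines that end with the, and print line numbers:
    grep -n 'the$' regular.txt
  3. Find lines that end with a period (.), and print line numbers:
    grep -n '\.$' regular.txt
  4. Find blank lines, and print line numbers:
    grep -n '^$' regular.txt
  5. Remove blank lines and lines that start with #, and display the remaining lines:
    grep -v '^$' regular.txt | grep -v '^#'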
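
As a rough illustration of items 3 to 5, the sketch below runs the same patterns against a made-up, config-style file (the file contents are an assumption). It also shows a single-command form of the filter using two -e patterns, which should give the same result as the pipeline.

```shell
# Hypothetical config-style sample; not the article's regular.txt.
sample=$(mktemp)
printf '%s\n' '# a comment' '' 'value=1' 'ends with a dot.' > "$sample"

dot_end=$(grep -n '\.$' "$sample")                   # lines ending in a literal "."
blank=$(grep -n '^$' "$sample")                      # empty lines
kept=$(grep -v '^$' "$sample" | grep -v '^#')        # two-stage filter from item 5
kept_one_pass=$(grep -v -e '^$' -e '^#' "$sample")   # same filter in a single grep

printf '%s\n' "$dot_end" "$blank" "$kept"
rm -f "$sample"
```

With -v and several -e patterns, grep keeps only the lines that match none of the patterns.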

The Anchor Characters: ^ and $:
The character "^" is the starting anchor, and the character "$" is the end anchor. The regular expression "^A" will match all lines that start with a capital A. The expression "A$" will match all lines that end with the capital A. If the anchor characters are not used at the proper end of the pattern, then they no longer act as anchors. That is, the "^" is only an anchor if it is the first character in a regular expression. The "$" is only an anchor if it is the last character.

Pattern   Matches
^A        "A" at the beginning of a line
A$        "A" at the end of a line
A^        "A^" anywhere on a line
$A        "$A" anywhere on a line
^^        "^" at the beginning of a line
$$        "$" at the end of a line
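
Several rows of the table can be checked with grep's default (basic) syntax. The sample lines below are invented for this sketch, one per table row being tested.

```shell
# Hypothetical sample lines, chosen to exercise rows of the table above.
sample=$(mktemp)
printf '%s\n' 'Apple' 'BA' 'xA^y' '^caret' 'cost$' > "$sample"

start_A=$(grep -n '^A' "$sample")         # "A" at the beginning of a line
end_A=$(grep -n 'A$' "$sample")           # "A" at the end of a line
literal_caret=$(grep -n 'A^' "$sample")   # "^" is not first, so it is literal
caret_first=$(grep -n '^^' "$sample")     # anchor followed by a literal "^"
dollar_last=$(grep -n '$$' "$sample")     # literal "$" followed by the anchor

printf '%s\n' "$start_A" "$end_A" "$literal_caret" "$caret_first" "$dollar_last"
rm -f "$sample"
```

The patterns are single-quoted so that the shell does not expand $$ before grep sees it.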

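Example 3: Any single character . and the repetition character *:

  1. Find lines containing a four-character string that starts with g and ends with d, and print line numbers:
    grep -n 'g..d' regular.txt
  2. Find lines containing two or more consecutive letters o, and print line numbers:
    grep -n 'ooo*' regular.txt
  3. Find lines containing a string that starts with g and ends with g, and print line numbers:
    grep -n 'g.*g' regular.txt
Note: * repeats the preceding character zero or more times.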

Match any character with .:
The character "." is one of those special meta-characters. By itself it will match any character, except the end-of-line character. The pattern that will match a line with a single character is
^.$
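
A small sketch of . and * in action follows. The five sample lines are assumptions made up for the illustration.

```shell
# Hypothetical sample lines for the "." and "*" patterns.
sample=$(mktemp)
printf '%s\n' 'good' 'god' 'gold ring' 'x' 'going' > "$sample"

four_chars=$(grep -n 'g..d' "$sample")    # g, any two characters, d
two_plus_o=$(grep -n 'ooo*' "$sample")    # "oo" followed by zero or more "o"
g_to_g=$(grep -n 'g.*g' "$sample")        # g, anything (possibly nothing), g
single_char=$(grep -n '^.$' "$sample")    # lines of exactly one character

printf '%s\n' "$four_chars" "$two_plus_o" "$g_to_g" "$single_char"
rm -f "$sample"
```

Note that god is not matched by 'g..d': the two dots each need a character of their own.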

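Example 4: Using brackets [] to match any one of a set of characters:

  1. Find lines containing the string test or tast, and print line numbers:
    grep -n 't[ea]st' regular.txt
  2. Find lines containing a three-character string that starts with a and ends with b, where the middle character may be an uppercase letter, a lowercase letter, a digit, or an underscore, and print line numbers:
    grep -n 'a[A-Za-z0-9_]b' regular.txt
  3. Note: no matter how many characters appear inside [], the brackets stand for exactly one of them.
  4. Find lines containing the string oo where the character before oo is neither g nor b:
    grep -n '[^gb]oo' regular.txt
  5. Note: inside [], ^ means negation, but only when it is the first character inside the brackets. [a^g] matches any one of the three characters a, ^, and g.
    The command in item 4 may produce output such as:
        18: google is best tool for search keyword.
        19: gooooogle yes!

    Line 18 is printed because of tool, and line 19 because of an oo preceded by another o inside gooooogle: in both cases the character before oo is neither g nor b. To exclude such lines, change the command to:
    grep -n 'oo' regular.txt | grep -v '[gb]oo'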
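
The difference between the two commands can be seen on a few invented lines. The first two mimic the output quoted above; the other two are assumptions added for contrast.

```shell
# Hypothetical sample; lines 1 and 2 imitate the output quoted above.
sample=$(mktemp)
printf '%s\n' 'google is best tool' 'gooooogle yes!' 'goo' 'too soon' > "$sample"

# One pattern: some "oo" on the line is preceded by a character other than g or b.
negated=$(grep -n '[^gb]oo' "$sample")

# Pipeline: keep lines with "oo", then drop every line containing "goo" or "boo".
filtered=$(grep -n 'oo' "$sample" | grep -v '[gb]oo')

printf '%s\n' "$negated" '--' "$filtered"
rm -f "$sample"
```

The pipeline is stricter than the wording of item 4 suggests: it drops line 1 because of google, even though tool on the same line would satisfy the condition by itself.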

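Example 5: Using \{ and \} to limit how many times the preceding RE character repeats:

  1. Find lines containing the string oo, and print line numbers:
    grep -n 'oo' regular.txt
    or grep -n 'o\{2\}' regular.txt
  2. Find lines containing a string made of g, then 2 to 5 letters o, then g, and print line numbers:
    grep -n 'go\{2,5\}g' regular.txt
  3. Find lines containing a string made of g, then 2 or more letters o, then g, and print line numbers:
    grep -n 'go\{2,\}g' regular.txt
    or grep -n 'gooo*g' regular.txt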
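
The interval forms can be compared on lines with 1, 2, 5, and 6 letters o between two g characters. The sample lines are assumptions chosen to sit on both sides of the 2-to-5 limit.

```shell
# Hypothetical sample: 1, 2, 5 and 6 letters "o" between two "g" characters.
sample=$(mktemp)
printf '%s\n' 'gog' 'goog' 'gooooog' 'goooooog' > "$sample"

bounded=$(grep -n 'go\{2,5\}g' "$sample")     # 2 to 5 "o"
open_ended=$(grep -n 'go\{2,\}g' "$sample")   # 2 or more "o"
star_form=$(grep -n 'gooo*g' "$sample")       # same meaning, written with "*"

printf '%s\n' "$bounded" '--' "$open_ended"
rm -f "$sample"
```

On this sample, 'go\{2,\}g' and 'gooo*g' select the same lines, which is what item 3 claims.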

Matching a specific number of sets with \{ and \}:

There is a special pattern you can use to specify the minimum and maximum number of repeats. This is done by putting those two numbers between "\{" and "\}". The backslashes deserve a special discussion. Normally a backslash turns off the special meaning for a character. A period is matched by a "\." and an asterisk is matched by a "\*".

If a backslash is placed before a "<", ">", "{", "}", "(", ")", or before a digit, the backslash turns on a special meaning. This was done because these special functions were added late in the life of regular expressions. Changing the meaning of "{" would have broken old expressions. This is a horrible crime punishable by a year of hard labor writing COBOL programs. Instead, adding a backslash added functionality without breaking old programs. Rather than complain about the unsymmetry, view it as evolution.

You must remember that modifiers like "*" and "\{1,5\}" only act as modifiers if they follow a character set. If they were at the beginning of a pattern, they would not be a modifier. Here is a list of examples, and the exceptions:

Regular Expression   Matches
*                    Any line with an asterisk
\*                   Any line with an asterisk
\\                   Any line with a backslash
^*                   Any line starting with an asterisk
^A*                  Any line
^A\*                 Any line starting with an "A*"
^AA*                 Any line if it starts with one "A"
^AA*B                Any line with one or more "A"'s followed by a "B"
^A\{4,8\}B           Any line starting with 4, 5, 6, 7 or 8 "A"'s followed by a "B"
^A\{4,\}B            Any line starting with 4 or more "A"'s followed by a "B"
^A\{4\}B             Any line starting with "AAAAB"
\{4,8\}              Any line with "{4,8}"
A{4,8}               Any line with "A{4,8}"
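
A few rows of the table can be checked directly. This sketch assumes grep's default basic syntax and uses four invented lines; it leaves out the rows whose behavior may differ between grep implementations.

```shell
# Hypothetical sample lines for a few rows of the table above.
sample=$(mktemp)
printf '%s\n' '*note' 'AAAAB' 'AAB' 'plain' > "$sample"

any_line=$(grep -c '^A*' "$sample")           # zero or more "A": counts every line
star_first=$(grep -n '^*' "$sample")          # "*" right after "^" is a literal asterisk
one_or_more=$(grep -n '^AA*B' "$sample")      # one or more "A", then "B"
four_to_eight=$(grep -n '^A\{4,8\}B' "$sample")   # 4 to 8 "A", then "B"

printf '%s\n' "$any_line" "$star_first" "$one_or_more" "$four_to_eight"
rm -f "$sample"
```

'^A*' matches all four lines because zero occurrences of A also count as a match.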

Example 6: Using \< and \> to match whole words:

  1. Find lines containing the word the, and print line numbers:
    grep -n '\<the\>' regular.txt

Matching words with \< and \>:

Searching for a word isn't quite as simple as it at first appears. The string "the" will match the word "other". You can put spaces before and after the letters and use this regular expression: " the ". However, this does not match words at the beginning or end of the line. And it does not match the case where there is a punctuation mark after the word.

There is an easy solution. The characters "\<" and "\>" are similar to the "^" and "$" anchors, as they don't occupy a position of a character. They do "anchor" the expression between to only match if it is on a word boundary. The pattern to search for the word "the" would be "\<the\>". The character before the "t" must be either a new line character, or anything except a letter, number, or underscore. The character after the "e" must also be a character other than a number, letter, or underscore or it could be the end of line character.
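
The three approaches described above can be compared side by side. The sample lines are assumptions, picked so that the word sits at the start of a line, inside another word, before punctuation, and between spaces.

```shell
# Hypothetical sample: "the" at line start, inside a word, before punctuation, between spaces.
sample=$(mktemp)
printf '%s\n' 'the end' 'other' 'see the.' 'in the box' > "$sample"

substring=$(grep -n 'the' "$sample")        # also matches inside "other"
spaced=$(grep -n ' the ' "$sample")         # misses line start and punctuation
whole_word=$(grep -n '\<the\>' "$sample")   # word boundaries on both sides

printf '%s\n' "$substring" '--' "$spaced" '--' "$whole_word"
rm -f "$sample"
```

Only the \< and \> form finds the word on lines 1, 3, and 4 while skipping other.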

Example 7: Using \( and \) for backreferences:

Backreferences - Remembering patterns with \(, \) and \1:

Another pattern that requires a special mechanism is searching for repeated words. The expression "[a-z][a-z]" will match any two lower case letters. If you wanted to search for lines that had two adjoining identical letters, the above pattern wouldn't help. You need a way of remembering what you found, and seeing if the same pattern occurred again. You can mark part of a pattern using "\(" and "\)". You can recall the remembered pattern with "\" followed by a single digit. Therefore, to search for two identical letters, use
\([a-z]\)\1

You can have 9 different remembered patterns. Each occurrence of "\(" starts a new pattern. The regular expression that would match a 5 letter palindrome, (e.g. "radar"), would be
\([a-z]\)\([a-z]\)[a-z]\2\1
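
Both backreference patterns above can be passed to grep as they are. The four sample words are assumptions for this sketch.

```shell
# Hypothetical sample words for the two backreference patterns above.
sample=$(mktemp)
printf '%s\n' 'hello' 'radar' 'world' 'level up' > "$sample"

doubled=$(grep -n '\([a-z]\)\1' "$sample")                         # two identical adjacent letters
palindrome=$(grep -n '\([a-z]\)\([a-z]\)[a-z]\2\1' "$sample")      # 5-letter palindrome

printf '%s\n' "$doubled" '--' "$palindrome"
rm -f "$sample"
```

In the palindrome pattern, \2 must come before \1 so that the remembered letters are replayed in reverse order.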

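BRE and ERE:

Examples 1 through 7 cover basic regular expressions (Basic Regular Expression, BRE). There is also a second flavor, extended regular expressions (Extended Regular Expression, ERE).

Both BRE and ERE follow the POSIX specification.
POSIX stands for Portable Operating System Interface for uniX. It is a family of specifications that define the features a UNIX operating system should support, so "POSIX regular expressions" really means "the POSIX specification for regular expressions", which defines two flavors: BRE and ERE.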

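BRE
Among common Linux/Unix tools, grep, vi, and sed belong to the BRE camp. Its syntax looks odd: the metacharacters (, ), {, and } take on their special meaning only when escaped. So the regular expression (a)b matches only the string (a)b, not the string ab; the regular expression a{1,2} matches only the string a{1,2}, while a\{1,2\} matches the string a or aa.

The reason for this awkwardness is that these tools appeared very early, while many regular expression features evolved gradually. These metacharacters may originally have had no special meaning, so escaping was the only way to preserve backward compatibility. Some features are missing altogether: POSIX BRE has no + or ? quantifiers and no alternation |, although it does define the backreferences \1, \2, and so on.

Today, however, pure BRE is rare. Everyone now takes it for granted that regular expressions support alternation and similar features, and going without them is simply too inconvenient. So although vi belongs to the BRE camp, it provides these features. GNU has also extended BRE to support +, ?, and |, which must be written as \+, \?, and \|, alongside the \1, \2 style of backreference. As a result, GNU grep and similar tools nominally belong to the BRE camp, but the more accurate name is GNU BRE.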

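ERE
Among common Linux/Unix tools, egrep and awk belong to the ERE camp. Although BRE is called "basic" and ERE "extended", ERE does not require compatibility with BRE syntax; it is a system of its own. Its metacharacters therefore need no escaping (putting a backslash before a metacharacter removes its special meaning), so (ab|cd) matches the string ab or cd, and the quantifiers +, ?, and {n,m} can be used directly. ERE does not explicitly specify backreference support, but quite a few tools support \1, \2, and so on.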

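To summarize the differences between BRE and ERE in the POSIX specification, ERE changes the following relative to BRE:

  1. Three metacharacters are added: +, ?, and |.
  2. {, }, (, and ) no longer need the escape character \ in front of them.
  3. Backreferences are no longer supported.
  4. The anchor characters ^ and $ always mean start and end of line. To search for them as ordinary characters, put the escape character \ in front.
  5. For example, suppose the file text.txt contains the five characters ab^cd. To search for them, the BRE syntax is grep 'ab^cd' text.txt, while the ERE syntax is egrep 'ab\^cd' text.txt.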

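The GNU tools, however, modify these specifications in their implementations:

  1. GNU BRE also allows the three characters +, ?, and | (written as \+, \?, and \|).
  2. GNU ERE still supports backreferences.

So on the Linux systems we use day to day, there are only two differences between GNU BRE and GNU ERE:

  1. In BRE, the seven characters (, ), {, }, ?, +, and | must be escaped to act as special characters; in ERE they need no escaping.
  2. In BRE, ^ and $ in the middle of a regular expression are ordinary characters; in ERE, ^ and $ in the middle of a regular expression are still anchors and must be escaped to be treated as ordinary characters.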
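
The second difference can be checked with the ab^cd case mentioned earlier. This sketch assumes GNU grep and writes the extended form as grep -E, which is equivalent to egrep; the two-line file is a stand-in for text.txt.

```shell
# Hypothetical stand-in for text.txt, plus a line without the caret for contrast.
sample=$(mktemp)
printf '%s\n' 'ab^cd' 'abcd' > "$sample"

bre=$(grep -n 'ab^cd' "$sample")          # BRE: "^" in the middle is a literal character
ere=$(grep -En 'ab\^cd' "$sample")        # ERE: the caret must be escaped to be literal
# ERE without the escape: "^" is an anchor in mid-pattern, so nothing can match.
# grep exits with status 1 when no line matches, hence the "|| true".
ere_unescaped=$(grep -Ec 'ab^cd' "$sample" || true)

printf '%s\n' "$bre" "$ere" "$ere_unescaped"
rm -f "$sample"
```

The unescaped ERE pattern asks for a line start after ab, which no line can satisfy, so the count is 0.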

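Example 8: Zero or one occurrence of the preceding character:

  1. Find lines containing the string gd or the string god, and print line numbers:
    grep -n 'go\?d' regular.txt
    or egrep -n 'go?d' regular.txt

Example 9: One or more occurrences of the preceding character:

  1. Find lines containing the strings god, good, goood, and so on, and print line numbers:
    grep -n 'go\+d' regular.txt
    or egrep -n 'go+d' regular.txt

Example 10: Searching for multiple strings:

  1. Find lines containing the string gd, good, or dog, and print line numbers:
    grep -n 'gd\|good\|dog' regular.txt
    or egrep -n 'gd|good|dog' regular.txt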
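
Examples 8 to 10 can be run in both syntaxes to confirm that each pair selects the same lines. The sketch assumes GNU grep, since \?, \+, and \| in basic syntax are GNU extensions, and it writes egrep as grep -E; the sample lines are invented.

```shell
# Hypothetical sample lines for the ?, + and | operators.
sample=$(mktemp)
printf '%s\n' 'gd' 'god' 'good' 'dog' 'cat' > "$sample"

opt_bre=$(grep -n 'go\?d' "$sample")             # GNU BRE: zero or one "o"
opt_ere=$(grep -En 'go?d' "$sample")             # ERE: same meaning, no backslash
plus_bre=$(grep -n 'go\+d' "$sample")            # GNU BRE: one or more "o"
plus_ere=$(grep -En 'go+d' "$sample")
alt_bre=$(grep -n 'gd\|good\|dog' "$sample")     # GNU BRE: alternation
alt_ere=$(grep -En 'gd|good|dog' "$sample")

printf '%s\n' "$opt_bre" '--' "$plus_bre" '--' "$alt_bre"
rm -f "$sample"
```

Recent GNU grep releases treat the egrep command as obsolescent, so grep -E is the safer spelling in new scripts.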

References

  1. Chapter 12, Regular Expressions and File Formatting (第十二章、正规表示法与文件格式化处理):
    http://vbird.dic.ksu.edu.tw/linux_basic/0330regularex.php#grep
  2. Regular Expressions and Extended Pattern Matching (written by Bruce Barnett):
    http://www.grymoire.com/Unix/Regular.html
  3. Linux/Unix Tools and the POSIX Specification for Regular Expressions (Linux/Unix工具与正则表达式的POSIX规范):
    http://www.infoq.com/cn/news/2011/07/regular-expressions-6-POSIX