Author: jicanmeng
Date: September 14, 2015
grep searches text using regular expressions and prints the lines that match.
grep -n 'the' regular.txt
grep -vn 'the' regular.txt
grep -in 'the' regular.txt
grep -in --color=auto 'the' regular.txt
grep -n '^the' regular.txt
grep -n 'the$' regular.txt
grep -n '\.$' regular.txt
grep -n '^$' regular.txt
grep -v '^$' regular.txt | grep -v '^#'
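To try the options above without the original regular.txt at hand, here is a minimal sketch that runs them against a small throwaway file. The file contents and the variable names are made up for illustration; they are not taken from the article's regular.txt.

```shell
# Hypothetical sample data standing in for regular.txt.
sample=$(mktemp)
printf '%s\n' 'The quick fox' '# the comment' '' 'another line' 'no match here' > "$sample"

# -n prefixes each matching line with its line number.
plain=$(grep -n 'the' "$sample")
# -i ignores case, so "The" on line 1 matches as well.
nocase=$(grep -in 'the' "$sample")
# -v inverts the match: print the lines that do NOT contain "the".
inverted=$(grep -vn 'the' "$sample")
# Drop blank lines first, then drop lines that start with "#".
content=$(grep -v '^$' "$sample" | grep -v '^#')

printf '%s\n' "$plain" '====' "$nocase" '====' "$inverted" '====' "$content"
rm -f "$sample"
```

Note that "another" contains the letters "the", so a plain search for 'the' reports that line too.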
The Anchor Characters: ^ and $:
The character "^" is the starting anchor, and the character "$" is the end anchor. The regular expression "^A" will match all lines that start with a capital A. The expression "A$" will match all lines that end with the capital A. If the anchor characters are not used at the proper end of the pattern, then they no longer act as anchors. That is, the "^" is only an anchor if it is the first character in a regular expression. The "$" is only an anchor if it is the last character.
Pattern | Matches |
^A | "A" at the beginning of a line |
A$ | "A" at the end of a line |
A^ | "A^" anywhere on a line |
$A | "$A" anywhere on a line |
^^ | "^" at the beginning of a line |
$$ | "$" at the end of a line |
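The table can be checked with a short sketch. The sample lines and variable names below are invented for illustration, and the behavior shown is that of basic regular expressions as implemented by grep without -E.

```shell
# Hypothetical sample data; single quotes keep "$A" away from the shell.
sample=$(mktemp)
printf '%s\n' 'Apple pie' 'BANANA' 'an A^ mark' 'price $A each' > "$sample"

# ^A: the caret is first, so it anchors the match to the start of the line.
starts_with_a=$(grep -n '^A' "$sample")
# A$: the dollar sign is last, so it anchors the match to the end of the line.
ends_with_a=$(grep -n 'A$' "$sample")
# A^: the caret is not first, so it is an ordinary character.
literal_caret=$(grep -n 'A^' "$sample")
# $A: the dollar sign is not last, so it is an ordinary character.
literal_dollar=$(grep -n '$A' "$sample")

printf '%s\n' "$starts_with_a" "$ends_with_a" "$literal_caret" "$literal_dollar"
rm -f "$sample"
```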
grep -n 'g..d' regular.txt
grep -n 'ooo*' regular.txt
grep -n 'g.*g' regular.txt
Match any character with .:
The character "." is one of those special meta-characters. By itself it will match any character, except the end-of-line character. The pattern that will match a line with a single character is
^.$
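As a quick illustration of the dot, the sketch below runs three of the patterns from this section against a few made-up lines (the sample words and variable names are assumptions, not the contents of regular.txt).

```shell
# Hypothetical sample data.
sample=$(mktemp)
printf '%s\n' 'good' 'glad' 'gd' 'x' 'google' > "$sample"

# g..d: "g", then exactly two characters of any kind, then "d".
two_between=$(grep -n 'g..d' "$sample")
# ^.$: the whole line is exactly one character.
single_char=$(grep -n '^.$' "$sample")
# g.*g: "g", then any run of characters (possibly empty), then another "g".
g_to_g=$(grep -n 'g.*g' "$sample")

printf '%s\n' "$two_between" "$single_char" "$g_to_g"
rm -f "$sample"
```

The line "gd" is not reported by 'g..d' because each dot must consume one character.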
grep -n 't[ea]st' regular.txt
grep -n 'a[A-Za-z0-9_]b' regular.txt
grep -n '[^gb]oo' regular.txt
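18: google is best tool for search keyword.
19: gooooogle yes!
Lines 18 and 19 are printed because "tool" on line 18 and "oooo" on line 19 both contain an "oo" whose preceding character is neither g nor b. To exclude such lines, change the command to: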
grep -n 'oo' regular.txt | grep -v '[gb]oo'
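The difference between the two commands is easier to see on a small file. This is a sketch with invented sample lines and variable names, loosely modeled on the two lines quoted above.

```shell
# Hypothetical sample data.
sample=$(mktemp)
printf '%s\n' 'google is best tool' 'gooooogle yes!' 'football game' 'goo' > "$sample"

# [^gb]oo matches any "oo" preceded by a character other than g or b.
# "tool" (line 1) and the inner "ooo" of "gooooogle" (line 2) both qualify.
negated=$(grep -n '[^gb]oo' "$sample")
# Two-step filter: keep lines with "oo", then drop every line
# that contains "goo" or "boo" anywhere.
filtered=$(grep -n 'oo' "$sample" | grep -v '[gb]oo')

printf '%s\n' "$negated" '====' "$filtered"
rm -f "$sample"
```

The two-step form is stricter than the bracket expression alone: it discards a whole line as soon as "goo" or "boo" appears in it, even if the same line also has an "oo" preceded by some other letter.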
grep -n 'oo' regular.txt
grep -n 'o\{2\}' regular.txt
grep -n 'go\{2,5\}g' regular.txt
grep -n 'go\{2,\}g' regular.txt
grep -n 'gooo*g' regular.txt
Matching a specific number of sets with \{ and \}:
There is a special pattern you can use to specify the minimum and maximum number of repeats. This is done by putting those two numbers between "\{" and "\}". The backslashes deserve a special discussion. Normally a backslash turns off the special meaning for a character. A period is matched by a "\." and an asterisk is matched by a "\*".
If a backslash is placed before a "<", ">", "{", "}", "(", ")", or before a digit, the backslash turns on a special meaning. This was done because these special functions were added late in the life of regular expressions. Changing the meaning of "{" would have broken old expressions. This is a horrible crime punishable by a year of hard labor writing COBOL programs. Instead, adding a backslash added functionality without breaking old programs. Rather than complain about the unsymmetry, view it as evolution.
You must remember that modifiers like "*" and "\{1,5\}" only act as modifiers if they follow a character set. If they were at the beginning of a pattern, they would not be a modifier. Here is a list of examples, and the exceptions:
Regular Expression | Matches |
* | Any line with an asterisk |
\* | Any line with an asterisk |
\\ | Any line with a backslash |
^* | Any line starting with an asterisk |
^A* | Any line |
^A\* | Any line starting with an "A*" |
^AA* | Any line if it starts with one "A" |
^AA*B | Any line starting with one or more "A"'s followed by a "B" |
^A\{4,8\}B | Any line starting with 4, 5, 6, 7 or 8 "A"'s followed by a "B" |
^A\{4,\}B | Any line starting with 4 or more "A"'s followed by a "B" |
^A\{4\}B | Any line starting with "AAAAB" |
\{4,8\} | Any line with "{4,8}" |
A{4,8} | Any line with "A{4,8}" |
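The interval forms can be compared side by side with a sketch like the following. The sample lines and variable names are made up for illustration.

```shell
# Hypothetical sample data: "g", a run of 1, 2, 3 or 6 "o"s, "g", plus two "A" lines.
sample=$(mktemp)
printf '%s\n' 'gog' 'goog' 'gooog' 'goooooog' 'AAAAB' 'AAAB' > "$sample"

# Between 2 and 5 "o"s between the two "g"s.
bounded=$(grep -n 'go\{2,5\}g' "$sample")
# 2 or more "o"s, with no upper limit.
open_ended=$(grep -n 'go\{2,\}g' "$sample")
# The same thing spelled with "*": two required "o"s, then zero or more.
star_form=$(grep -n 'gooo*g' "$sample")
# Exactly four "A"s at the start of the line, then "B".
exact=$(grep -n '^A\{4\}B' "$sample")

printf '%s\n' "$bounded" '====' "$open_ended" '====' "$star_form" '====' "$exact"
rm -f "$sample"
```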
grep -n '\<the\>' regular.txt
Matching words with \< and \>:
Searching for a word isn't quite as simple as it at first appears. The string "the" will match the word "other". You can put spaces before and after the letters and use this regular expression: " the ". However, this does not match words at the beginning or end of the line. And it does not match the case where there is a punctuation mark after the word.
There is an easy solution. The characters "\<" and "\>" are similar to the "^" and "$" anchors, as they don't occupy a position of a character. They "anchor" the expression between them so that it only matches on a word boundary. The pattern to search for the word "the" would be "\<the\>". The character before the "t" must be either a new line character, or anything except a letter, number, or underscore. The character after the "e" must also be a character other than a number, letter, or underscore, or it could be the end of line character.
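Here is a small sketch of the three approaches discussed above. It assumes GNU grep, where "\<" and "\>" are available as extensions; the sample lines and variable names are invented.

```shell
# Hypothetical sample data.
sample=$(mktemp)
printf '%s\n' 'the cat' 'other side' 'see the.' 'bathe' > "$sample"

# A bare 'the' also matches inside "other" and "bathe".
anywhere=$(grep -n 'the' "$sample")
# Padding with spaces misses "the" at the start of a line and before a period.
# grep -c prints 0 and exits non-zero when nothing matches, hence "|| true".
spaced_count=$(grep -c ' the ' "$sample" || true)
# Word anchors match "the" only as a whole word.
whole_word=$(grep -n '\<the\>' "$sample")

printf '%s\n' "$anywhere" '====' "$spaced_count" '====' "$whole_word"
rm -f "$sample"
```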
Backreferences - Remembering patterns with \(, \) and \1:
Another pattern that requires a special mechanism is searching for repeated words. The expression "[a-z][a-z]" will match any two lower case letters. If you wanted to search for lines that had two adjoining identical letters, the above pattern wouldn't help. You need a way of remembering what you found, and seeing if the same pattern occurred again. You can mark part of a pattern using "\(" and "\)". You can recall the remembered pattern with "\" followed by a single digit. Therefore, to search for two identical letters, use
\([a-z]\)\1
You can have 9 different remembered patterns. Each occurrence of "\(" starts a new pattern. The regular expression that would match a 5 letter palindrome, (e.g. "radar"), would be
\([a-z]\)\([a-z]\)[a-z]\2\1
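Both back-reference patterns can be tried on a handful of words. The word list and variable names in this sketch are assumptions chosen for illustration.

```shell
# Hypothetical sample data.
sample=$(mktemp)
printf '%s\n' 'radar' 'hello' 'abcde' 'level' > "$sample"

# \1 must repeat whatever the first \( \) group matched: a doubled letter.
doubled=$(grep -n '\([a-z]\)\1' "$sample")
# Two remembered letters, a middle letter, then the two letters in reverse order.
palindrome=$(grep -n '\([a-z]\)\([a-z]\)[a-z]\2\1' "$sample")

printf '%s\n' "$doubled" '====' "$palindrome"
rm -f "$sample"
```

Only "hello" has two adjoining identical letters, while "radar" and "level" read the same in both directions.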
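Examples 1 through 7 cover basic regular expressions (Basic Regular Expression, BRE). There is also a second flavor, extended regular expressions (Extended Regular Expression, ERE).
Both BRE and ERE are defined by the POSIX standard:
POSIX stands for Portable Operating System Interface for uniX. It is a family of standards that defines the features a UNIX operating system should support, so "POSIX regular expressions" really just means "the POSIX standard for regular expressions", which defines two flavors: BRE and ERE.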
BRE:
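Among the common Linux/Unix tools, grep, vi, and sed belong to the BRE camp. BRE syntax looks rather odd: the metacharacters (, ), {, and } take on their special meaning only when escaped. So the regular expression (a)b matches only the string (a)b, not the string ab; the regular expression a{1,2} matches only the string a{1,2}, and it takes a\{1,2\} to match the string a or aa.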
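The reason for this awkwardness is that these tools are very old, while many regular expression features evolved gradually. These metacharacters may originally have had no special meaning, so backward compatibility left escaping as the only option. Some features are missing altogether: BRE has no + or ? quantifiers and no alternation |. Back-references such as \1 and \2 are a different case: POSIX BRE does define them.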
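Today, though, pure BRE is rare. People take it for granted that regular expressions support alternation, back-references, and similar features, and doing without them is simply too inconvenient. So although vi belongs to the BRE camp, it provides these features. GNU has extended BRE as well, adding +, ?, and |, which must be written as \+, \?, and \|, alongside back-references such as \1 and \2. GNU grep and similar tools are therefore nominally BRE tools, but the more precise name for their flavor is GNU BRE.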
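The escaping rules for BRE can be verified with a short sketch. The sample lines and variable names are made up; -x (match the whole line) is used in the last command only to keep the result easy to read.

```shell
# Hypothetical sample data.
sample=$(mktemp)
printf '%s\n' '(a)b' 'ab' 'a{1,2}' 'aa' 'aaa' > "$sample"

# Unescaped parentheses are ordinary characters in BRE.
literal_parens=$(grep -n '(a)b' "$sample")
# Escaped parentheses form a group, so this looks for "ab".
grouped=$(grep -n '\(a\)b' "$sample")
# Unescaped braces are ordinary characters in BRE.
literal_braces=$(grep -n 'a{1,2}' "$sample")
# Escaped braces form an interval: the whole line is one or two "a"s.
interval=$(grep -n -x 'a\{1,2\}' "$sample")

printf '%s\n' "$literal_parens" "$grouped" "$literal_braces" "$interval"
rm -f "$sample"
```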
ERE:
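Among the common Linux/Unix tools, egrep and awk belong to the ERE camp. Although BRE is called "basic" and ERE "extended", ERE does not require compatibility with BRE syntax; it is a system of its own. Its metacharacters need no escaping (putting a backslash in front of a metacharacter removes its special meaning), so (ab|cd) matches the string ab or cd, and the quantifiers +, ?, and {n,m} can be used directly. ERE does not explicitly specify back-references, but many tools support \1, \2, and so on.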
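The ERE rules are the mirror image, as the sketch below suggests. It uses grep -E, the POSIX spelling of egrep; the sample lines and variable names are invented for illustration.

```shell
# Hypothetical sample data.
sample=$(mktemp)
printf '%s\n' 'ab' 'cd' '(ab|cd)' 'aab' > "$sample"

# Unescaped ( | ) are metacharacters: the whole line is "ab" or "cd".
alternation=$(grep -E -n -x '(ab|cd)' "$sample")
# Unescaped braces are an interval: one or two "a"s, then "b".
interval=$(grep -E -n -x 'a{1,2}b' "$sample")
# A backslash turns the metacharacter back into an ordinary "(".
literal_paren=$(grep -E -n '\(ab' "$sample")

printf '%s\n' "$alternation" '====' "$interval" '====' "$literal_paren"
rm -f "$sample"
```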
To summarize the differences between BRE and ERE in the POSIX standard:
Compared with BRE, ERE makes the following changes:
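For example, a "^" that is not the first character of the pattern is an ordinary character in BRE but must be escaped in ERE, so the BRE syntax is grep 'ab^cd' text.txt, while the ERE syntax is egrep 'ab\^cd' test.txt.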
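The GNU tools' implementations of these standards, however, make some further changes:
So on the Linux systems we use day to day, GNU BRE and GNU ERE differ in only two respects: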
grep -n 'go\?d' regular.txt
egrep -n 'go?d' regular.txt
grep -n 'go\+d' regular.txt
egrep -n 'go+d' regular.txt
grep -n 'gd\|good\|dog' regular.txt
egrep -n 'gd|good|dog' regular.txt
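The three pairs above can be checked for equivalence with a sketch like this one. It assumes GNU grep, since \?, \+, and \| in BRE are GNU extensions, and it uses grep -E in place of egrep; the sample words and variable names are made up.

```shell
# Hypothetical sample data.
sample=$(mktemp)
printf '%s\n' 'gd' 'god' 'good' 'dog' 'gooood' > "$sample"

# Zero or one "o": \? in GNU BRE, ? in ERE.
bre_optional=$(grep -n 'go\?d' "$sample")
ere_optional=$(grep -E -n 'go?d' "$sample")
# One or more "o"s: \+ in GNU BRE, + in ERE.
bre_plus=$(grep -n 'go\+d' "$sample")
ere_plus=$(grep -E -n 'go+d' "$sample")
# Alternation: \| in GNU BRE, | in ERE.
bre_alt=$(grep -n 'gd\|good\|dog' "$sample")
ere_alt=$(grep -E -n 'gd|good|dog' "$sample")

printf '%s\n' "$bre_optional" '====' "$bre_plus" '====' "$bre_alt"
rm -f "$sample"
```

Each BRE command reports the same lines as its ERE counterpart; only the backslashes differ.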