
随机过程

http://superware.blog.163.com

 
 
 


 
 

Regular Expressions  

2015-01-16 17:18:43 | Category: Default category


I. For single characters

1. Regular expression


        A regular expression is a notation for specifying and matching strings. A regular expression is either a basic expression or one created by applying operators to component expressions. To understand the strings matched by a regular expression, we need to understand the strings matched by its components.

======================================================================================
                                                              Regular Expressions

i). The regular expression meta-characters are:

\ ^ $ . [ ] | ( ) * + ?


ii). A basic regular expression is one of the following:

(1). a non-meta-character, such as A, that matches itself.

(2). an escape sequence that matches a special symbol, such as \t, which matches a tab.

(3). a quoted meta-character, such as \*, that matches the meta-character literally.

(4). ^, which matches the beginning of a string.

(5). $, which matches the end of a string.

(6). ., which matches any single character.

(7). [abc], which is a character class, matches any of the characters a, b, or c.

(8). [A-Za-z], which is a character class with ranges, matches any single letter.

(9). [^0-9], which is a complemented character class, matches any character except a digit.


iii). These operators combine regular expressions into larger ones:

A|B, alternation, matches A or B.

AB, concatenation, matches A immediately followed by B.

A*, closure, matches zero or more A's.

A+, positive closure, matches one or more A's.

A?, zero or one, matches the null string or A.

(r), parentheses, matches the same string as r does.

======================================================================================
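The basic expressions and operators in the box above can be tried out with a short awk sketch. The function names and the sample strings are invented for illustration and are not part of the original text; an awk pattern with no action prints every matching line.

```shell
# Sample input lines, made up for this illustration.
sample() {
    printf '%s\n' 'A' 'a*b' 'abc' 'x9' '42'
}

# Each hypothetical helper filters the sample with one regular expression.
literal_star()    { sample | awk '/\*/'; }         # quoted meta-character: a literal *
starts_with_a()   { sample | awk '/^a/'; }         # ^ anchors at the beginning
ends_with_digit() { sample | awk '/[0-9]$/'; }     # $ anchors at the end
single_char()     { sample | awk '/^.$/'; }        # . matches any single character
a_or_42()         { sample | awk '/^(A|42)$/'; }   # alternation inside parentheses

literal_star       # prints: a*b
single_char        # prints: A
```

Without the anchors ^ and $, a pattern matches anywhere inside the line, which is why /\*/ finds the * in the middle of a*b.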

2. Character class (characters enclosed in brackets)

        Character class: a regular expression consisting of a group of characters enclosed in brackets. The character class matches any one of the enclosed characters.

        Ranges of characters can be abbreviated in a character class by using a hyphen. The character immediately to the left of the hyphen defines the beginning of the range; the character immediately to the right defines the end. Thus, [0-9] matches any digit, and [a-zA-Z][0-9] matches a letter followed by a digit. Without both a left and right operand, a hyphen in a character class denotes itself, so the character classes [+-] and [-+] match either a + or a -. The character class [A-Za-z-]+ matches words that include hyphens.

[A-Za-z-]+ : matches one or more letters or hyphens, such as "-", "A", or "a". The + after the brackets means that the null string is not matched.
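As a rough check of the range and hyphen rules above, the following awk sketch filters a few made-up words. The helper names and the input are assumptions for illustration; LC_ALL=C is set so that ranges such as A-Z follow ASCII order.

```shell
LC_ALL=C; export LC_ALL

# Sample words, invented for this illustration.
words() {
    printf '%s\n' 'well-known' 'a1' 'B7x' '+' '-' '3.14'
}

letter_digit() { words | awk '/^[a-zA-Z][0-9]$/'; }  # a letter followed by a digit
hyphen_words() { words | awk '/^[A-Za-z-]+$/'; }     # letters and hyphens only
sign_first()   { words | awk '/^[+-]$/'; }           # trailing hyphen denotes itself
sign_last()    { words | awk '/^[-+]$/'; }           # leading hyphen denotes itself

letter_digit   # prints: a1
hyphen_words   # prints: well-known and -
```

Both sign_first and sign_last select the same two lines, + and -, since the hyphen has no range operands in either class.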



i). Complemented character class (^ after [)


        A complemented character class is one in which the first character after the [ is a ^. Such a class matches any character not in the group following the caret.

        For example:
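^[^0-9]+ : matches one or more (+) non-digit characters ([^0-9]) at the beginning of the string (^);

^[ABC] : matches an A, B, or C at the beginning of the string;
[^ABC] : matches any single character other than A, B, or C;
^[^ABC] : matches any single character other than A, B, or C at the beginning of the string;
^[^a-z]$ : matches any string that is a single character, provided it is not a lowercase letter a-z.

A character class [ ] matches any one of the characters it encloses.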
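The difference between ^ as an anchor and ^ as a complement can be seen in a small awk sketch. The sample lines and function names below are hypothetical; LC_ALL=C keeps the range a-z limited to ASCII lowercase letters.

```shell
LC_ALL=C; export LC_ALL

# Sample lines, invented for this illustration.
lines() {
    printf '%s\n' 'Apple' 'zebra' '7up' 'Q' 'b'
}

non_digit_start()  { lines | awk '/^[^0-9]+/'; }  # begins with one or more non-digits
abc_start()        { lines | awk '/^[ABC]/'; }    # begins with A, B, or C
non_abc_start()    { lines | awk '/^[^ABC]/'; }   # begins with anything but A, B, C
single_non_lower() { lines | awk '/^[^a-z]$/'; }  # exactly one character, not a-z

abc_start          # prints: Apple
single_non_lower   # prints: Q
```

Outside the brackets the caret anchors the match to the beginning of the string; as the first character inside the brackets it complements the class.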

ii). Repetitions

        The symbols *, +, and ? are unary operators used to specify repetitions in regular expressions. If r is a regular expression, then:

(r)*  matches any string consisting of zero or more consecutive substrings matched by r.
(r)+ matches any string consisting of one or more consecutive substrings matched by r.
(r)? matches the null string or any string matched by r.

If r is a basic regular expression, then the parentheses can be omitted.
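A minimal awk sketch of the three repetition operators follows, using invented sample strings and hypothetical helper names. The last helper shows a case where the parentheses cannot be omitted, because the repeated expression ab is not a basic regular expression.

```shell
# Sample lines, invented for this illustration.
reps() {
    printf '%s\n' 'b' 'ab' 'aab' 'abab' 'c'
}

star()       { reps | awk '/^a*b$/'; }     # zero or more a's, then b
plus()       { reps | awk '/^a+b$/'; }     # one or more a's, then b
opt()        { reps | awk '/^a?b$/'; }     # an optional a, then b
group_plus() { reps | awk '/^(ab)+$/'; }   # one or more repetitions of ab

star         # prints: b, ab, aab
group_plus   # prints: ab, abab
```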

        In regular expressions, the alternation operator | has the lowest precedence, then concatenation, and finally the repetition operators *, +, and ?, which have the highest. As in arithmetic expressions, operators of higher precedence are applied before lower ones. These conventions often allow parentheses to be omitted:

ab|cd is the same as (ab)|(cd), and ^ab|cd*e$ is the same as (^ab)|(c(d*)e$).
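One way to check the precedence rules is to run the implicit and the fully parenthesized form of ^ab|cd*e$ over the same input and compare the results. The sketch below does this with invented sample lines and hypothetical function names; the third helper groups the expression differently to show that parentheses can change the meaning.

```shell
# Sample lines, invented for this illustration.
prec() {
    printf '%s\n' 'abx' 'xab' 'ce' 'xcddde' 'cdcde' 'cdex'
}

implicit()  { prec | awk '/^ab|cd*e$/'; }           # relies on operator precedence
explicit()  { prec | awk '/(^ab)|(c(d*)e$)/'; }     # same grouping, written out
regrouped() { prec | awk '/^(ab|cd)*e$/'; }         # a different grouping

implicit    # prints: abx, ce, xcddde, cdcde
regrouped   # prints: cdcde
```

The first two helpers select the same four lines, while the regrouped expression selects only cdcde.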

II. For strings

1. Parentheses

        Parentheses are used in regular expressions to specify how components are grouped. There are two binary regular expression operators: alternation and concatenation. The alternation operator | is used to specify alternatives: if r_1 and r_2 are regular expressions, then r_1 | r_2 matches any string matched by r_1 or by r_2.

        There is no explicit concatenation operator. If r_1 and r_2 are regular expressions, then (r_1)(r_2) (with no blank between (r_1) and (r_2)) matches any string of the form xy where r_1 matches x and r_2 matches y. The parentheses around r_1 or r_2 can be omitted if the contained regular expression does not contain the alternation operator, because alternation has lower precedence than concatenation. For example, the regular expression

(Asian|European|American)(male|female)(black|blue)bird

matches twelve strings ranging from
Asianmaleblackbird
to

Americanfemalebluebird
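The count of twelve comes from 3 x 2 x 2 combinations of the three alternations. As a sketch, the shell loops below generate every combination and let awk count how many of them the expression matches; the function names are hypothetical, and the anchors ^ and $ are added here so that the whole line must match.

```shell
# Generate all combinations of the three alternations, one per line.
birds() {
    for region in Asian European American; do
        for sex in male female; do
            for color in black blue; do
                printf '%s\n' "${region}${sex}${color}bird"
            done
        done
    done
}

# Count the generated lines that the regular expression matches.
count_matches() {
    birds | awk '/^(Asian|European|American)(male|female)(black|blue)bird$/ { n++ }
                 END { print n + 0 }'
}

count_matches   # prints: 12
```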


III. Regular expressions enclosed in slashes
 
       Any regular expression enclosed in slashes can be used as the right-hand operand of a matching operator (~ or !~). The program
$2 !~ /^[0-9]+$/
prints all lines in which the second field is not a string of digits.

For example:

$ cat test.txt

China Asia

USA America

Germany Europe


# print the lines in which the 2nd field is Asia or Europe.

$ awk '$2 ~ /^(Asia|Europe)$/ { print }' test.txt

China Asia

Germany Europe


# print the lines in which the 2nd field is not a string of digits.

$ awk '$2 !~ /^[0-9]+$/ { print }' test.txt

China Asia

USA America

Germany Europe


        Since + and . are meta-characters, they have to be preceded by backslashes to be matched literally. These backslashes are not needed within character classes. For example,

/^(\+|-)?[0-9]+\.?[0-9]*$/ # [0-9]*$ means that the string ends with zero or more digits

or

/^[+-]?[0-9]+[.]?[0-9]*$/ # [.]? is equivalent to \.?

The two expressions are alternate ways to describe the same numbers: an optional sign, one or more digits, an optional decimal point, and zero or more further digits.
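To see that the backslash form and the character-class form accept the same strings, both can be run over the same candidates. This is only a sketch: the candidate strings and function names are invented for illustration.

```shell
LC_ALL=C; export LC_ALL

# Candidate strings, invented for this illustration.
nums() {
    printf '%s\n' '42' '+3.14' '-7.' '.5' '1e3' '--2' 'abc'
}

with_backslashes() { nums | awk '/^(\+|-)?[0-9]+\.?[0-9]*$/'; }  # escaped + and .
with_classes()     { nums | awk '/^[+-]?[0-9]+[.]?[0-9]*$/'; }   # + and . inside [ ]

with_backslashes   # prints: 42, +3.14, -7.
with_classes       # prints the same three lines
```

Note that .5 is rejected by both forms, because [0-9]+ requires at least one digit before the optional decimal point.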