1. Homepage of Dr. Zoltán Porkoláb

Regular expressions

Regular expressions (regexps) are fundamental to many areas of informatics, such as

  • Search engines
  • Programmers' editors
  • Word processors
  • Text processing utilities

Programming languages provide them in one of three ways:

  • Built-in (JavaScript, Perl, Ruby, Tcl)
  • Standard library (C#, Java, C++11)
  • External libraries

In 1956, S. C. Kleene described regular languages. One of their first uses in computing came when Ken Thompson built regular expressions into the QED editor. Later he added regexps to ed, the standard UNIX editor. Regular expressions are used in many UNIX filters, such as grep, whose name comes from the ed command g/re/p (globally search for a regular expression and print).

$ grep hello
hallo
hello
hello

The family of grep utilities:

  • grep, grep -G: the pattern is a Basic Regular Expression (BRE)
  • egrep, grep -E: the pattern is an Extended Regular Expression (ERE)
  • fgrep, grep -F: the pattern is a fixed string
  • grep -P: the pattern is a Perl-compatible regular expression (PCRE)

$ egrep 'a|e'
hallo
hallo
hello
hello
hullo

In BRE, metacharacters such as | must be escaped with a \ character to get their special meaning (in BRE, \| is a GNU extension).

$ grep 'a\|e'
hallo
hallo
hello
hello
hullo

In ERE it is the other way round: to match a metacharacter literally, we have to escape it.

$ egrep 'a\|e'
hallo
ha|ello
ha|ello
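The escaping rules above can be compared side by side in a non-interactive form. The following is a minimal sketch, assuming a POSIX shell and a grep that accepts -E; the variable names and the sample words are chosen here for illustration only.

```shell
# Sample input: three words plus one line containing a literal '|'.
input=$(printf '%s\n' hallo hello hullo 'ha|ello')

# ERE: an unescaped | means alternation (lines containing a or e).
ere_alt=$(printf '%s\n' "$input" | grep -E 'a|e')

# BRE: an unescaped | is an ordinary character (lines containing "a|e").
bre_literal=$(printf '%s\n' "$input" | grep 'a|e')

# ERE: an escaped | is an ordinary character again.
ere_literal=$(printf '%s\n' "$input" | grep -E 'a\|e')

printf 'ERE a|e:\n%s\n\n' "$ere_alt"
printf 'BRE a|e:\n%s\n\n' "$bre_literal"
printf 'ERE a\\|e:\n%s\n' "$ere_literal"
```

Feeding the input through a pipe instead of typing it makes it easier to see which lines each pattern selects, because only the matching lines are printed.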

Regexp syntax

Tokens

Tokens are elementary symbols that match themselves literally.

Concatenation

Symbols can be concatenated:

re re

$ grep hello
hello
hello

Boolean or

re | re

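With plain grep (BRE), an unescaped | is an ordinary character, so the pattern below matches only the literal text a|e; use egrep 'a|e' or grep 'a\|e' for alternation.

$ grep 'a|e'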
hallo
hello
ha|ello
ha|ello

Grouping

( re )

$ egrep 'h(a|e)llo'
hallo
hallo
hello
hello
hullo
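Grouping matters because alternation binds more loosely than concatenation. The sketch below, which assumes a POSIX shell and grep -E and uses sample words invented for this illustration, contrasts the grouped pattern with its ungrouped form.

```shell
words=$(printf '%s\n' hallo hello hullo happy)

# Grouped: h, then a or e, then llo.
grouped=$(printf '%s\n' "$words" | grep -E 'h(a|e)llo')

# Ungrouped: the alternatives are the whole strings "ha" and "ello".
ungrouped=$(printf '%s\n' "$words" | grep -E 'ha|ello')

printf 'grouped:\n%s\n\nungrouped:\n%s\n' "$grouped" "$ungrouped"
```

Without the parentheses the line happy is selected as well, since it contains ha.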

Any character

.

$ egrep 'h.llo'
hallo
hallo
hullo
hullo
hllo

Character groups

[ tokens ]

$ egrep 'h[ae]llo'
hallo
hallo
hello
hello
$ egrep 'h[0123456789]llo'
h1llo
h1llo
h2llo
h2llo
$ egrep 'h[0-9]llo'
h1llo
h1llo
h2llo
h2llo
$ egrep 'h[aeiou]llo'
hallo
hallo
hullo
hullo
$ egrep 'h[^ae]llo'
hxllo
hxllo
hallo

Inside a bracket expression, metacharacters such as . lose their special meaning and match themselves:

$ egrep 'h[i.]llo'
hillo
hillo
hallo
h.llo
h.llo
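The bracket expression rules above (literal dot, ranges, negation) can be checked together in one script. This is a minimal sketch assuming a POSIX shell; the variable names and sample words are illustrative, and LC_ALL=C is set on the assumption that ranges should cover ASCII characters only.

```shell
words=$(printf '%s\n' h1llo hallo h.llo hxllo)

# Outside brackets, . matches any single character: count the matching lines.
any_count=$(printf '%s\n' "$words" | grep -c 'h.llo')

# Inside brackets, . is literal: only the line with a real dot matches.
dot_only=$(printf '%s\n' "$words" | grep 'h[.]llo')

# A range and a negated range; LC_ALL=C keeps a-z limited to ASCII letters.
digit=$(printf '%s\n' "$words" | LC_ALL=C grep 'h[0-9]llo')
not_letter=$(printf '%s\n' "$words" | LC_ALL=C grep 'h[^a-z]llo')

printf 'h.llo matches %s lines\n' "$any_count"
printf 'h[.]llo:\n%s\n' "$dot_only"
printf 'h[0-9]llo:\n%s\n' "$digit"
printf 'h[^a-z]llo:\n%s\n' "$not_letter"
```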

Quantification

* A sequence of zero or more matches of the atom

+ A sequence of 1 or more matches of the atom

? Zero or 1 matches of the atom

$ egrep 'ha*llo'
hallo
hallo
haaaaaaaallo
haaaaaaaallo
hllo
hllo
$ egrep 'ha+llo'
hallo
hallo
haaaaaaaallo
haaaaaaaallo
hllo
$ egrep 'ha?llo'
hallo
hallo
hllo
hllo

{ m } Matches exactly m occurrences of the atom.

{ m , n } Matches k occurrences of the atom, where m <= k <= n.

{ m , } Matches k occurrences of the atom, where m <= k.

{ , n } Matches k occurrences of the atom, where k <= n (a GNU extension).

$ egrep 'ha{2,3}llo'
hallo
haallo
haallo
haaallo
haaallo
haaaallo
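To compare the quantifiers on the same input, one can count the matching lines with grep -c. The sketch below assumes a POSIX shell and grep -E; the helper function count and the sample words are hypothetical names introduced only for this illustration.

```shell
words=$(printf '%s\n' hllo hallo haallo haaallo haaaallo)

# Hypothetical helper: count the lines of $words matched by the ERE in $1.
count() {
    printf '%s\n' "$words" | grep -E -c "$1"
}

star=$(count 'ha*llo')          # zero or more a
plus=$(count 'ha+llo')          # one or more a
opt=$(count 'ha?llo')           # zero or one a
interval=$(count 'ha{2,3}llo')  # two or three a
at_least=$(count 'ha{2,}llo')   # two or more a

printf '*: %s  +: %s  ?: %s  {2,3}: %s  {2,}: %s\n' \
    "$star" "$plus" "$opt" "$interval" "$at_least"
```

With five input lines containing zero to four a characters, each quantifier selects a different subset, which makes the boundaries of every form visible at a glance.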

Anchors

^ Beginning of the line

$ End of the line

$ egrep '^b'
line
line begin
begin of line
begin of line
$ egrep 'e$'
line
line
end
end of line
end of line
$ egrep '^b.*e$'
end of line
begin
begin of line
begin of line
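Anchors are what turn a "contains" search into a whole-line check. The following sketch assumes a POSIX shell and grep -E; the date-like pattern, the variable names, and the sample lines are invented for illustration. It also uses the standard -x option, which requires the pattern to match the entire line.

```shell
lines=$(printf '%s\n' '2024-01-31' 'on 2024-01-31' '2024-1-31' '2024-01-31 rev')
date_re='[0-9]{4}-[0-9]{2}-[0-9]{2}'

# Unanchored: the pattern may match anywhere inside the line.
unanchored=$(printf '%s\n' "$lines" | grep -E "$date_re")

# Anchored with ^ and $: the pattern must span the whole line.
anchored=$(printf '%s\n' "$lines" | grep -E "^${date_re}\$")

# -x has the same effect as wrapping the pattern in ^ and $.
whole_line=$(printf '%s\n' "$lines" | grep -E -x "$date_re")

printf 'unanchored:\n%s\n\nanchored:\n%s\n\n-x:\n%s\n' \
    "$unanchored" "$anchored" "$whole_line"
```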

Character classes

POSIX       Perl  ASCII                                 Description
[:alnum:]         [A-Za-z0-9]                           Alphanumeric characters
            \w    [A-Za-z0-9_]                          Word characters
            \W    [^A-Za-z0-9_]                         Non-word characters
[:alpha:]         [A-Za-z]                              Alphabetic characters
[:blank:]         [ \t]                                 Space and tab
            \b    (?<=\W)(?=\w)|(?<=\w)(?=\W)           Word boundaries
[:cntrl:]         [\x00-\x1F\x7F]                       Control characters
[:digit:]   \d    [0-9]                                 Digits
            \D    [^0-9]                                Non-digits
[:graph:]         [\x21-\x7E]                           Visible characters
[:lower:]         [a-z]                                 Lowercase letters
[:print:]         [\x20-\x7E]                           Visible characters and space
[:punct:]         [][!"#$%&'()*+,./:;<=>?@\^_`{|}~-]    Punctuation characters
[:space:]   \s    [ \t\r\n\v\f]                         Whitespace characters
            \S    [^ \t\r\n\v\f]                        Non-whitespace characters
[:upper:]         [A-Z]                                 Uppercase letters
[:xdigit:]        [A-Fa-f0-9]                           Hexadecimal digits

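In the grep utilities, a POSIX class name is only valid inside a bracket expression, so we write [[:digit:]] instead of [:digit:].

When the class is already part of a larger bracket expression, such as a negated one, no extra pair of brackets is needed: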

$ grep '[^[:alnum:]]'
%asdglksjd
%asdglksjd
%/!+
%/!+
lahdglhflj
askdhsdkl6776
$ grep '[^[:digit:]]'
laksjgléaksdg
laksjgléaksdg
762491264597
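The POSIX classes combine naturally with quantifiers and anchors. This is a minimal sketch assuming a POSIX shell and grep -E; the variable names and sample lines are made up for the illustration.

```shell
lines=$(printf '%s\n' abc123 'hello world' x_y 42)

# Lines consisting only of alphanumeric characters.
only_alnum=$(printf '%s\n' "$lines" | grep -E '^[[:alnum:]]+$')

# Lines consisting only of digits.
only_digits=$(printf '%s\n' "$lines" | grep -E '^[[:digit:]]+$')

# Lines containing at least one whitespace character.
has_space=$(printf '%s\n' "$lines" | grep '[[:space:]]')

# Lines containing at least one non-alphanumeric character.
has_other=$(printf '%s\n' "$lines" | grep '[^[:alnum:]]')

printf 'only alnum:\n%s\n\nonly digits:\n%s\n\n' "$only_alnum" "$only_digits"
printf 'has space:\n%s\n\nhas non-alnum:\n%s\n' "$has_space" "$has_other"
```

Note that the underscore in x_y is not part of [:alnum:], which is the difference between that class and the Perl \w shorthand in the table above.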