\input texinfo @c -*-texinfo-*-
@c
@c -- Stuff that needs adding: ----------------------------------------------
@c (document the `;' command-separator)
@c --------------------------------------------------------------------------
@c Check for consistency: regexps in @code, text that they match in @samp.
@c
@c Tips:
@c    @command for command
@c    @samp for command fragments: @samp{cat -s}
@c    @code for sed commands and flags
@c Use ``quote'' not `quote' or "quote".
@c
@c %**start of header
@setfilename sed.info
@settitle sed, a stream editor
@c %**end of header

@c @smallbook

@include version.texi

@c Combine indices.
@syncodeindex ky cp
@syncodeindex pg cp
@syncodeindex tp cp

@defcodeindex op
@syncodeindex op fn

@include config.texi

@copying
This file documents version @value{VERSION} of
@value{SSED}, a stream editor.

Copyright @copyright{} 1998, 1999, 2001, 2002, 2003, 2004 Free
Software Foundation, Inc.

This document is released under the terms of the @acronym{GNU} Free
Documentation License as published by the Free Software Foundation;
either version 1.1, or (at your option) any later version.

You should have received a copy of the @acronym{GNU} Free Documentation
License along with @value{SSED}; see the file @file{COPYING.DOC}.
If not, write to the Free Software Foundation, 59 Temple Place - Suite
330, Boston, MA 02110-1301, USA.

There are no Cover Texts and no Invariant Sections; this text, along
with its equivalent in the printed manual, constitutes the Title Page.
@end copying

@setchapternewpage off

@titlepage
@title @command{sed}, a stream editor
@subtitle version @value{VERSION}, @value{UPDATED}
@author by Ken Pizzini, Paolo Bonzini

@page
@vskip 0pt plus 1filll
Copyright @copyright{} 1998, 1999 Free Software Foundation, Inc.

@insertcopying

Published by the Free Software Foundation, @*
51 Franklin Street, Fifth Floor @*
Boston, MA 02110-1301, USA
@end titlepage


@node Top
@top

@ifnottex
@insertcopying
@end ifnottex

@menu
* Introduction::               Introduction
* Invoking sed::               Invocation
* sed Programs::               @command{sed} programs
* Examples::                   Some sample scripts
* Limitations::                Limitations and (non-)limitations of @value{SSED}
* Other Resources::            Other resources for learning about @command{sed}
* Reporting Bugs::             Reporting bugs

* Extended regexps::           @command{egrep}-style regular expressions
@ifset PERL
* Perl regexps::               Perl-style regular expressions
@end ifset

* Concept Index::              A menu with all the topics in this manual.
* Command and Option Index::   A menu with all @command{sed} commands and
                               command-line options.

@detailmenu
--- The detailed node listing ---

sed Programs:
* Execution Cycle::                 How @command{sed} works
* Addresses::                       Selecting lines with @command{sed}
* Regular Expressions::             Overview of regular expression syntax
* Common Commands::                 Often used commands
* The "s" Command::                 @command{sed}'s Swiss Army Knife
* Other Commands::                  Less frequently used commands
* Programming Commands::            Commands for @command{sed} gurus
* Extended Commands::               Commands specific to @value{SSED}
* Escapes::                         Specifying special characters

Examples:
* Centering lines::
* Increment a number::
* Rename files to lower case::
* Print bash environment::
* Reverse chars of lines::
* tac::                             Reverse lines of files
* cat -n::                          Numbering lines
* cat -b::                          Numbering non-blank lines
* wc -c::                           Counting chars
* wc -w::                           Counting words
* wc -l::                           Counting lines
* head::                            Printing the first lines
* tail::                            Printing the last lines
* uniq::                            Make duplicate lines unique
* uniq -d::                         Print duplicated lines of input
* uniq -u::                         Remove all duplicated lines
* cat -s::                          Squeezing blank
                                    lines

@ifset PERL
Perl regexps::                      Perl-style regular expressions
* Backslash::                       Introduces special sequences
* Circumflex/dollar sign/period::   Behave specially with regard to new lines
* Square brackets::                 Are a bit different in strange cases
* Options setting::                 Toggle modifiers in the middle of a regexp
* Non-capturing subpatterns::       Are not counted when backreferencing
* Repetition::                      Allows for non-greedy matching
* Backreferences::                  Allows for more than 10 back references
* Assertions::                      Allows for complex look ahead matches
* Non-backtracking subpatterns::    Often gives more performance
* Conditional subpatterns::         Allows if/then/else branches
* Recursive patterns::              For example to match parentheses
* Comments::                        Because things can get complex...
@end ifset

@end detailmenu
@end menu


@node Introduction
@chapter Introduction

@cindex Stream editor
@command{sed} is a stream editor.
A stream editor is used to perform basic text
transformations on an input stream
(a file or input from a pipeline).
While in some ways similar to an editor which
permits scripted edits (such as @command{ed}),
@command{sed} works by making only one pass over the
input(s), and is consequently more efficient.
But it is @command{sed}'s ability to filter text in a pipeline
which particularly distinguishes it from other types of
editors.


@node Invoking sed
@chapter Invocation

Normally @command{sed} is invoked like this:

@example
sed SCRIPT INPUTFILE...
@end example

The full format for invoking @command{sed} is:

@example
sed OPTIONS... [SCRIPT] [INPUTFILE...]
@end example

If you do not specify @var{INPUTFILE}, or if @var{INPUTFILE} is @file{-},
@command{sed} filters the contents of the standard input.
The @var{script}
is actually the first non-option parameter, which @command{sed} specially
considers a script and not an input file if (and only if) none of the
other @var{options} specifies a script to be executed, that is if neither
of the @option{-e} and @option{-f} options is specified.

@command{sed} may be invoked with the following command-line options:

@table @code
@item --version
@opindex --version
@cindex Version, printing
Print out the version of @command{sed} that is being run and a copyright notice,
then exit.

@item --help
@opindex --help
@cindex Usage summary, printing
Print a usage message briefly summarizing these command-line options
and the bug-reporting address,
then exit.

@item -n
@itemx --quiet
@itemx --silent
@opindex -n
@opindex --quiet
@opindex --silent
@cindex Disabling autoprint, from command line
By default, @command{sed} prints out the pattern space
at the end of each cycle through the script (@pxref{Execution Cycle, ,
How @code{sed} works}).
These options disable this automatic printing,
and @command{sed} only produces output when explicitly told to
via the @code{p} command.

@item -e @var{script}
@itemx --expression=@var{script}
@opindex -e
@opindex --expression
@cindex Script, from command line
Add the commands in @var{script} to the set of commands to be
run while processing the input.

@item -f @var{script-file}
@itemx --file=@var{script-file}
@opindex -f
@opindex --file
@cindex Script, from a file
Add the commands contained in the file @var{script-file}
to the set of commands to be run while processing the input.

@item -i[@var{SUFFIX}]
@itemx --in-place[=@var{SUFFIX}]
@opindex -i
@opindex --in-place
@cindex In-place editing, activating
@cindex @value{SSEDEXT}, in-place editing
This option specifies that files are to be edited in-place.
@value{SSED} does this by creating a temporary file and
sending output to this file rather than to the standard
output.@footnote{This applies to commands such as @code{=},
@code{a}, @code{c}, @code{i}, @code{l}, @code{p}.  You can
still write to the standard output by using the @code{w}
@cindex @value{SSEDEXT}, @file{/dev/stdout} file
or @code{W} commands together with the @file{/dev/stdout}
special file.}

This option implies @option{-s}.

When the end of the file is reached, the temporary file is
renamed to the output file's original name.  The extension,
if supplied, is used to modify the name of the old file
before renaming the temporary file, thereby making a backup
copy.@footnote{Note that @value{SSED} creates the backup
file whether or not any output is actually changed.}

@cindex In-place editing, Perl-style backup file names
This rule is followed: if the extension doesn't contain a @code{*},
then it is appended to the end of the current filename as a
suffix; if the extension does contain one or more @code{*}
characters, then @emph{each} asterisk is replaced with the
current filename.  This allows you to add a prefix to the
backup file, instead of (or in addition to) a suffix, or
even to place backup copies of the original files into another
directory (provided the directory already exists).

If no extension is supplied, the original file is
overwritten without making a backup.

@item -l @var{N}
@itemx --line-length=@var{N}
@opindex -l
@opindex --line-length
@cindex Line length, setting
Specify the default line-wrap length for the @code{l} command.
A length of 0 (zero) means to never wrap long lines.  If
not specified, it is taken to be 70.

@item --posix
@cindex @value{SSEDEXT}, disabling
@value{SSED} includes several extensions to @acronym{POSIX}
sed.
In order to simplify writing portable scripts, this
option disables all the extensions that this manual documents,
including additional commands.
@cindex @code{POSIXLY_CORRECT} behavior, enabling
Most of the extensions accept @command{sed} programs that
are outside the syntax mandated by @acronym{POSIX}, but some
of them (such as the behavior of the @command{N} command
described in @ref{Reporting Bugs}) actually violate the
standard.  If you want to disable only the latter kind of
extension, you can set the @code{POSIXLY_CORRECT} variable
to a non-empty value.

@item -b
@itemx --binary
@opindex -b
@opindex --binary
This option is available on every platform, but is only effective where the
operating system makes a distinction between text files and binary files.
When such a distinction is made---as is the case for MS-DOS, Windows,
Cygwin---text files are composed of lines separated by a carriage return
@emph{and} a line feed character, and @command{sed} does not see the
ending CR.  When this option is specified, @command{sed} will open
input files in binary mode, thus not requesting this special processing
and considering lines to end at a line feed.

@item --follow-symlinks
@opindex --follow-symlinks
This option is available only on platforms that support
symbolic links and has an effect only if option @option{-i}
is specified.  In this case, if the file that is specified
on the command line is a symbolic link, @command{sed} will
follow the link and edit the ultimate destination of the
link.  The default behavior is to break the symbolic link,
so that the link destination will not be modified.

@item -r
@itemx --regexp-extended
@opindex -r
@opindex --regexp-extended
@cindex Extended regular expressions, choosing
@cindex @acronym{GNU} extensions, extended regular expressions
Use extended regular expressions rather than basic
regular expressions.  Extended regexps are those that
@command{egrep} accepts; they can be clearer because they
usually have fewer backslashes, but are a @acronym{GNU} extension
and hence scripts that use them are not portable.
@xref{Extended regexps, , Extended regular expressions}.

@ifset PERL
@item -R
@itemx --regexp-perl
@opindex -R
@opindex --regexp-perl
@cindex Perl-style regular expressions, choosing
@cindex @value{SSEDEXT}, Perl-style regular expressions
Use Perl-style regular expressions rather than basic
regular expressions.  Perl-style regexps are extremely
powerful but are a @value{SSED} extension and hence scripts that
use them are not portable.  @xref{Perl regexps, ,
Perl-style regular expressions}.
@end ifset

@item -s
@itemx --separate
@cindex Working on separate files
By default, @command{sed} will consider the files specified on the
command line as a single continuous long stream.  This @value{SSED}
extension allows the user to consider them as separate files:
range addresses (such as @samp{/abc/,/def/}) are not allowed
to span several files, line numbers are relative to the start
of each file, @code{$} refers to the last line of each file,
and files invoked from the @code{R} commands are rewound at the
start of each file.

@item -u
@itemx --unbuffered
@opindex -u
@opindex --unbuffered
@cindex Unbuffered I/O, choosing
Buffer both input and output as minimally as practical.
(This is particularly useful if the input is coming from
the likes of @samp{tail -f}, and you wish to see the transformed
output as soon as possible.)

@end table

If no @option{-e}, @option{-f}, @option{--expression}, or @option{--file}
options are given on the command-line,
then the first non-option argument on the command line is
taken to be the @var{script} to be executed.

@cindex Files to be processed as input
If any command-line parameters remain after processing the above,
these parameters are interpreted as the names of input files to
be processed.
@cindex Standard input, processing as input
A file name of @samp{-} refers to the standard input stream.
The standard input will be processed if no file names are specified.


@node sed Programs
@chapter @command{sed} Programs

@cindex @command{sed} program structure
@cindex Script structure
A @command{sed} program consists of one or more @command{sed} commands,
passed in by one or more of the
@option{-e}, @option{-f}, @option{--expression}, and @option{--file}
options, or the first non-option argument if none of these
options is used.
This document will refer to ``the'' @command{sed} script;
this is understood to mean the in-order catenation
of all of the @var{script}s and @var{script-file}s passed in.

Each @code{sed} command consists of an optional address or
address range, followed by a one-character command name
and any additional command-specific code.
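
As a brief illustration of how several @option{-e} fragments are joined,
in order, into one script, consider the following sketch.  The input
strings and the shell variable names (@code{joined}, @code{ordered}) are
made up for this example and are not part of @command{sed} itself.

```shell
# Two -e fragments form one script; both substitutions run on each line.
joined=$(printf 'hello world\n' | sed -e 's/hello/goodbye/' -e 's/world/moon/')
printf '%s\n' "$joined"

# The fragments are applied in the order given: the first turns "a"
# into "b", and the second then turns that "b" into "c".
ordered=$(printf 'a\n' | sed -e 's/a/b/' -e 's/b/c/')
printf '%s\n' "$ordered"
```

Swapping the two @option{-e} arguments in the second invocation would
leave @samp{b} in the output, since the @samp{s/b/c/} fragment would run
before any @samp{b} exists.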

@menu
* Execution Cycle::          How @command{sed} works
* Addresses::                Selecting lines with @command{sed}
* Regular Expressions::      Overview of regular expression syntax
* Common Commands::          Often used commands
* The "s" Command::          @command{sed}'s Swiss Army Knife
* Other Commands::           Less frequently used commands
* Programming Commands::     Commands for @command{sed} gurus
* Extended Commands::        Commands specific to @value{SSED}
* Escapes::                  Specifying special characters
@end menu


@node Execution Cycle
@section How @command{sed} Works

@cindex Buffer spaces, pattern and hold
@cindex Spaces, pattern and hold
@cindex Pattern space, definition
@cindex Hold space, definition
@command{sed} maintains two data buffers: the active @emph{pattern} space,
and the auxiliary @emph{hold} space.  Both are initially empty.

@command{sed} operates by performing the following cycle on each
line of input: first, @command{sed} reads one line from the input
stream, removes any trailing newline, and places it in the pattern space.
Then commands are executed; each command can have an address associated
with it: addresses are a kind of condition code, and a command is only
executed if the condition is verified before the command is to be
executed.

When the end of the script is reached, unless the @option{-n} option
is in use, the contents of pattern space are printed out to the output
stream, adding back the trailing newline if it was removed.@footnote{Actually,
if @command{sed} prints a line without the terminating newline, it will
nevertheless print the missing newline as soon as more text is sent to
the same output stream, which gives the ``least expected surprise''
even though it does not make commands like @samp{sed -n p} exactly
identical to @command{cat}.}  Then the next cycle starts for the next
input line.
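
The effect of the automatic print at the end of each cycle can be seen in
a small sketch like the one below.  The two-line input and the variable
names (@code{doubled}, @code{single}) are assumptions chosen only for
illustration.

```shell
# With autoprint enabled, the p command prints the pattern space once
# and the end of the cycle prints it again, so each line appears twice.
doubled=$(printf 'one\ntwo\n' | sed 'p')
printf '%s\n' "$doubled"

# With -n the end-of-cycle print is suppressed, so only the explicit
# p command produces output and each line appears once.
single=$(printf 'one\ntwo\n' | sed -n 'p')
printf '%s\n' "$single"
```

For ordinary newline-terminated input such as this, @samp{sed -n p}
simply reproduces its input.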

Unless special commands (like @samp{D}) are used, the pattern space is
deleted between two cycles.  The hold space, on the other hand, keeps
its data between cycles (see commands @samp{h}, @samp{H}, @samp{x},
@samp{g}, @samp{G} to move data between both buffers).


@node Addresses
@section Selecting lines with @command{sed}
@cindex Addresses, in @command{sed} scripts
@cindex Line selection
@cindex Selecting lines to process

Addresses in a @command{sed} script can be in any of the following forms:
@table @code
@item @var{number}
@cindex Address, numeric
@cindex Line, selecting by number
Specifying a line number will match only that line in the input.
(Note that @command{sed} counts lines continuously across all input files
unless @option{-i} or @option{-s} options are specified.)

@item @var{first}~@var{step}
@cindex @acronym{GNU} extensions, @samp{@var{n}~@var{m}} addresses
This @acronym{GNU} extension matches every @var{step}th line
starting with line @var{first}.
In particular, lines will be selected when there exists
a non-negative @var{n} such that the current line-number equals
@var{first} + (@var{n} * @var{step}).
Thus, to select the odd-numbered lines,
one would use @code{1~2};
to pick every third line starting with the second, @samp{2~3} would be used;
to pick every fifth line starting with the tenth, use @samp{10~5};
and @samp{50~0} is just an obscure way of saying @code{50}.

@item $
@cindex Address, last line
@cindex Last line, selecting
@cindex Line, selecting last
This address matches the last line of the last file of input, or
the last line of each file when the @option{-i} or @option{-s} options
are specified.

@item /@var{regexp}/
@cindex Address, as a regular expression
@cindex Line, selecting by regular expression match
This will select any line which matches the regular expression @var{regexp}.
If @var{regexp} itself includes any @code{/} characters,
each must be escaped by a backslash (@code{\}).

@cindex empty regular expression
@cindex @value{SSEDEXT}, modifiers and the empty regular expression
The empty regular expression @samp{//} repeats the last regular
expression match (the same holds if the empty regular expression is
passed to the @code{s} command).  Note that modifiers to regular expressions
are evaluated when the regular expression is compiled, thus it is invalid to
specify them together with the empty regular expression.

@item \%@var{regexp}%
(The @code{%} may be replaced by any other single character.)

@cindex Slash character, in regular expressions
This also matches the regular expression @var{regexp},
but allows one to use a different delimiter than @code{/}.
This is particularly useful if the @var{regexp} itself contains
a lot of slashes, since it avoids the tedious escaping of every @code{/}.
If @var{regexp} itself includes any delimiter characters,
each must be escaped by a backslash (@code{\}).

@item /@var{regexp}/I
@itemx \%@var{regexp}%I
@cindex @acronym{GNU} extensions, @code{I} modifier
@ifset PERL
@cindex Perl-style regular expressions, case-insensitive
@end ifset
The @code{I} modifier to regular-expression matching is a @acronym{GNU}
extension which causes the @var{regexp} to be matched in
a case-insensitive manner.

@item /@var{regexp}/M
@itemx \%@var{regexp}%M
@ifset PERL
@cindex @value{SSEDEXT}, @code{M} modifier
@end ifset
@cindex Perl-style regular expressions, multiline
The @code{M} modifier to regular-expression matching is a @value{SSED}
extension which causes @code{^} and @code{$} to match respectively
(in addition to the normal behavior) the empty string after a newline,
and the empty string before a newline.
There are special character
sequences
@ifset PERL
(@code{\A} and @code{\Z} in Perl mode, @code{\`} and @code{\'}
in basic or extended regular expression modes)
@end ifset
@ifclear PERL
(@code{\`} and @code{\'})
@end ifclear
which always match the beginning or the end of the buffer.
@code{M} stands for @cite{multi-line}.

@ifset PERL
@item /@var{regexp}/S
@itemx \%@var{regexp}%S
@cindex @value{SSEDEXT}, @code{S} modifier
@cindex Perl-style regular expressions, single line
The @code{S} modifier to regular-expression matching is only valid
in Perl mode and specifies that the dot character (@code{.}) will
match the newline character too.  @code{S} stands for @cite{single-line}.
@end ifset

@ifset PERL
@item /@var{regexp}/X
@itemx \%@var{regexp}%X
@cindex @value{SSEDEXT}, @code{X} modifier
@cindex Perl-style regular expressions, extended
The @code{X} modifier to regular-expression matching is also
valid in Perl mode only.  If it is used, whitespace in the
pattern (other than in a character class) and
characters between a @kbd{#} outside a character class and the
next newline character are ignored.  An escaping backslash
can be used to include a whitespace or @kbd{#} character as part
of the pattern.
@end ifset
@end table

If no addresses are given, then all lines are matched;
if one address is given, then only lines matching that
address are matched.

@cindex Range of lines
@cindex Several lines, selecting
An address range can be specified by specifying two addresses
separated by a comma (@code{,}).  An address range matches lines
starting from where the first address matches, and continues
until the second address matches (inclusively).
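
The following sketch shows a single address and two kinds of address
range selecting lines for the @code{p} command.  The five-line sample
input and the variable names (@code{input}, @code{one},
@code{by_number}, @code{by_regexp}) are assumptions made up for this
illustration.

```shell
# Five sample lines; the text is invented for this sketch.
input='a
START
b
END
c'

# One address: only line 2 is selected.
one=$(printf '%s\n' "$input" | sed -n '2p')
printf '%s\n' "$one"

# Two numeric addresses: lines 2 through 4, inclusive.
by_number=$(printf '%s\n' "$input" | sed -n '2,4p')
printf '%s\n' "$by_number"

# Two regexp addresses: from the line matching START through the
# line matching END, inclusive.
by_regexp=$(printf '%s\n' "$input" | sed -n '/START/,/END/p')
printf '%s\n' "$by_regexp"
```

With this particular input, both ranges select the same three lines.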

If the second address is a @var{regexp}, then checking for the
ending match will start with the line @emph{following} the
line which matched the first address: a range will always
span at least two lines (except of course if the input stream
ends).

If the second address is a @var{number} less than (or equal to)
the line matching the first address, then only the one line is
matched.

@cindex Special addressing forms
@cindex Range with start address of zero
@cindex Zero, as range start address
@cindex @var{addr1},+N
@cindex @var{addr1},~N
@cindex @acronym{GNU} extensions, special two-address forms
@cindex @acronym{GNU} extensions, @code{0} address
@cindex @acronym{GNU} extensions, 0,@var{addr2} addressing
@cindex @acronym{GNU} extensions, @var{addr1},+@var{N} addressing
@cindex @acronym{GNU} extensions, @var{addr1},~@var{N} addressing
@value{SSED} also supports some special two-address forms; all these
are @acronym{GNU} extensions:
@table @code
@item 0,/@var{regexp}/
A line number of @code{0} can be used in an address specification like
@code{0,/@var{regexp}/} so that @command{sed} will try to match
@var{regexp} in the first input line too.  In other words,
@code{0,/@var{regexp}/} is similar to @code{1,/@var{regexp}/},
except that if @var{addr2} matches the very first line of input the
@code{0,/@var{regexp}/} form will consider it to end the range, whereas
the @code{1,/@var{regexp}/} form will match the beginning of its range and
hence make the range span up to the @emph{second} occurrence of the
regular expression.

Note that this is the only place where the @code{0} address makes
sense; there is no 0-th line and commands which are given the @code{0}
address in any other way will give an error.

@item @var{addr1},+@var{N}
Matches @var{addr1} and the @var{N} lines following @var{addr1}.

@item @var{addr1},~@var{N}
Matches @var{addr1} and the lines following @var{addr1}
until the next line whose input line number is a multiple of @var{N}.
@end table

@cindex Excluding lines
@cindex Selecting non-matching lines
Appending the @code{!} character to the end of an address
specification negates the sense of the match.
That is, if the @code{!} character follows an address range,
then only lines which do @emph{not} match the address range
will be selected.
This also works for singleton addresses,
and, perhaps perversely, for the null address.


@node Regular Expressions
@section Overview of Regular Expression Syntax

To know how to use @command{sed}, people should understand regular
expressions (@dfn{regexp} for short).  A regular expression
is a pattern that is matched against a
subject string from left to right.  Most characters are
@dfn{ordinary}: they stand for
themselves in a pattern, and match the corresponding characters
in the subject.  As a trivial example, the pattern

@example
The quick brown fox
@end example

@noindent
matches a portion of a subject string that is identical to
itself.  The power of regular expressions comes from the
ability to include alternatives and repetitions in the pattern.
These are encoded in the pattern by the use of @dfn{special characters},
which do not stand for themselves but instead
are interpreted in some special way.  Here is a brief description
of regular expression syntax as used in @command{sed}.

@table @code
@item @var{char}
A single ordinary character matches itself.

@item *
@cindex @acronym{GNU} extensions, to basic regular expressions
Matches a sequence of zero or more instances of matches for the
preceding regular expression, which must be an ordinary character, a
special character preceded by @code{\}, a @code{.}, a grouped regexp
(see below), or a bracket expression.  As a @acronym{GNU} extension, a
postfixed regular expression can also be followed by @code{*}; for
example, @code{a**} is equivalent to @code{a*}.  @acronym{POSIX}
1003.1-2001 says that @code{*} stands for itself when it appears at
the start of a regular expression or subexpression, but many
non-@acronym{GNU} implementations do not support this and portable
scripts should instead use @code{\*} in these contexts.

@item \+
@cindex @acronym{GNU} extensions, to basic regular expressions
As @code{*}, but matches one or more.  It is a @acronym{GNU} extension.

@item \?
@cindex @acronym{GNU} extensions, to basic regular expressions
As @code{*}, but only matches zero or one.  It is a @acronym{GNU} extension.

@item \@{@var{i}\@}
As @code{*}, but matches exactly @var{i} sequences (@var{i} is a
decimal integer; for portability, keep it between 0 and 255
inclusive).

@item \@{@var{i},@var{j}\@}
Matches between @var{i} and @var{j}, inclusive, sequences.

@item \@{@var{i},\@}
Matches more than or equal to @var{i} sequences.

@item \(@var{regexp}\)
Groups the inner @var{regexp} as a whole, this is used to:

@itemize @bullet
@item
@cindex @acronym{GNU} extensions, to basic regular expressions
Apply postfix operators, like @code{\(abcd\)*}:
this will search for zero or more whole sequences
of @samp{abcd}, while @code{abcd*} would search
for @samp{abc} followed by zero or more occurrences
of @samp{d}.
Note that support for @code{\(abcd\)*} is
required by @acronym{POSIX} 1003.1-2001, but many non-@acronym{GNU}
implementations do not support it and hence it is not universally
portable.

@item
Use back references (see below).
@end itemize

@item .
Matches any character, including newline.

@item ^
Matches the null string at beginning of the pattern space, i.e. what
appears after the circumflex must appear at the beginning of the
pattern space.

In most scripts, pattern space is initialized to the content of each
line (@pxref{Execution Cycle, , How @code{sed} works}).  So, it is a
useful simplification to think of @code{^#include} as matching only
lines where @samp{#include} is the first thing on the line---if there are
spaces before, for example, the match fails.  This simplification is
valid as long as the original content of pattern space is not modified,
for example with an @code{s} command.

@code{^} acts as a special character only at the beginning of the
regular expression or subexpression (that is, after @code{\(} or
@code{\|}).  Portable scripts should avoid @code{^} at the beginning of
a subexpression, though, as @acronym{POSIX} allows implementations that
treat @code{^} as an ordinary character in that context.

@item $
It is the same as @code{^}, but refers to end of pattern space.
@code{$} also acts as a special character only at the end
of the regular expression or subexpression (that is, before @code{\)}
or @code{\|}), and its use at the end of a subexpression is not
portable.


@item [@var{list}]
@itemx [^@var{list}]
Matches any single character in @var{list}: for example,
@code{[aeiou]} matches all vowels.  A list may include
sequences like @code{@var{char1}-@var{char2}}, which
matches any character between (inclusive) @var{char1}
and @var{char2}.

A leading @code{^} reverses the meaning of @var{list}, so that
it matches any single character @emph{not} in @var{list}.  To include
@code{]} in the list, make it the first character (after
the @code{^} if needed); to include @code{-} in the list,
make it the first or last; to include @code{^} put
it after the first character.

@cindex @code{POSIXLY_CORRECT} behavior, bracket expressions
The characters @code{$}, @code{*}, @code{.}, @code{[}, and @code{\}
are normally not special within @var{list}.  For example, @code{[\*]}
matches either @samp{\} or @samp{*}, because the @code{\} is not
special here.  However, strings like @code{[.ch.]}, @code{[=a=]}, and
@code{[:space:]} are special within @var{list} and represent collating
symbols, equivalence classes, and character classes, respectively, and
@code{[} is therefore special within @var{list} when it is followed by
@code{.}, @code{=}, or @code{:}.  Also, when not in
@env{POSIXLY_CORRECT} mode, special escapes like @code{\n} and
@code{\t} are recognized within @var{list}.  @xref{Escapes}.

@item @var{regexp1}\|@var{regexp2}
@cindex @acronym{GNU} extensions, to basic regular expressions
Matches either @var{regexp1} or @var{regexp2}.  Use
parentheses to use complex alternative regular expressions.
The matching process tries each alternative in turn, from
left to right, and the first one that succeeds is used.
It is a @acronym{GNU} extension.

@item @var{regexp1}@var{regexp2}
Matches the concatenation of @var{regexp1} and @var{regexp2}.
Concatenation binds more tightly than @code{\|}, @code{^}, and
@code{$}, but less tightly than the other regular expression
operators.

@item \@var{digit}
Matches the @var{digit}-th @code{\(@dots{}\)} parenthesized
subexpression in the regular expression.  This is called a @dfn{back
reference}.
Subexpressions are implicitly numbered by counting
occurrences of @code{\(} left-to-right.

@item \n
Matches the newline character.

@item \@var{char}
Matches @var{char}, where @var{char} is one of @code{$},
@code{*}, @code{.}, @code{[}, @code{\}, or @code{^}.
Note that the only C-like
backslash sequences that you can portably assume to be
interpreted are @code{\n} and @code{\\}; in particular
@code{\t} is not portable, and matches a @samp{t} under most
implementations of @command{sed}, rather than a tab character.

@end table

@cindex Greedy regular expression matching
Note that the regular expression matcher is greedy, i.e., matches
are attempted from left to right and, if two or more matches are
possible starting at the same character, it selects the longest.

@noindent
Examples:
@table @samp
@item abcdef
Matches @samp{abcdef}.

@item a*b
Matches zero or more @samp{a}s followed by a single
@samp{b}.  For example, @samp{b} or @samp{aaaaab}.

@item a\?b
Matches @samp{b} or @samp{ab}.

@item a\+b\+
Matches one or more @samp{a}s followed by one or more
@samp{b}s: @samp{ab} is the shortest possible match, but
other examples are @samp{aaaab} or @samp{abbbbb} or
@samp{aaaaaabbbbbbb}.

@item .*
@itemx .\+
These two both match all the characters in a string;
however, the first matches every string (including the empty
string), while the second matches only strings containing
at least one character.

@item ^main.*(.*)
This matches a string starting with @samp{main},
followed by an opening and closing
parenthesis.  The @samp{n}, @samp{(} and @samp{)} need not
be adjacent.

@item ^#
This matches a string beginning with @samp{#}.

@item \\$
This matches a string ending with a single backslash.  The
regexp contains two backslashes for escaping.

@item \$
Instead, this matches a string consisting of a single dollar sign,
because it is escaped.

@item [a-zA-Z0-9]
In the C locale, this matches any @acronym{ASCII} letters or digits.

@item [^ @kbd{tab}]\+
(Here @kbd{tab} stands for a single tab character.)
This matches a string of one or more
characters, none of which is a space or a tab.
Usually this means a word.

@item ^\(.*\)\n\1$
This matches a string consisting of two equal substrings separated by
a newline.

@item .\@{9\@}A$
This matches nine characters followed by an @samp{A}.

@item ^.\@{15\@}A
This matches the start of a string that contains 16 characters,
the last of which is an @samp{A}.

@end table



@node Common Commands
@section Often-Used Commands

If you use @command{sed} at all, you will quite likely want to know
these commands.

@table @code
@item #
[No addresses allowed.]

@findex # (comments)
@cindex Comments, in scripts
The @code{#} character begins a comment;
the comment continues until the next newline.

@cindex Portability, comments
If you are concerned about portability, be aware that
some implementations of @command{sed} (which are not @sc{posix}
conformant) may only support a single one-line comment,
and then only when the very first character of the script is a @code{#}.

@findex -n, forcing from within a script
@cindex Caveat --- #n on first line
Warning: if the first two characters of the @command{sed} script
are @code{#n}, then the @option{-n} (no-autoprint) option is forced.
If you want to put a comment in the first line of your script
and that comment begins with the letter @samp{n}
and you do not want this behavior,
then be sure to either use a capital @samp{N},
or place at least one space before the @samp{n}.

@item q [@var{exit-code}]
This command only accepts a single address.

@findex q (quit) command
@cindex @value{SSEDEXT}, returning an exit code
@cindex Quitting
Exit @command{sed} without processing any more commands or input.
Note that the current pattern space is printed if auto-print is
not disabled with the @option{-n} option. The ability to return
an exit code from the @command{sed} script is a @value{SSED} extension.

@item d
@findex d (delete) command
@cindex Text, deleting
Delete the pattern space;
immediately start next cycle.

@item p
@findex p (print) command
@cindex Text, printing
Print out the pattern space (to the standard output).
This command is usually only used in conjunction with the @option{-n}
command-line option.

@item n
@findex n (next-line) command
@cindex Next input line, replace pattern space with
@cindex Read next input line
If auto-print is not disabled, print the pattern space,
then, regardless, replace the pattern space with the next line of input.
If there is no more input then @command{sed} exits without processing
any more commands.

@item @{ @var{commands} @}
@findex @{@} command grouping
@cindex Grouping commands
@cindex Command groups
A group of commands may be enclosed between
@code{@{} and @code{@}} characters.
This is particularly useful when you want a group of commands
to be triggered by a single address (or address-range) match.

@end table

@node The "s" Command
@section The @code{s} Command

The syntax of the @code{s} (as in substitute) command is
@samp{s/@var{regexp}/@var{replacement}/@var{flags}}. The @code{/}
characters may be uniformly replaced by any other single
character within any given @code{s} command. The @code{/}
character (or whatever other character is used in its stead)
can appear in the @var{regexp} or @var{replacement}
only if it is preceded by a @code{\} character.
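
As a brief illustration of the delimiter rule (the path names are
made up for this sketch), the following two commands are equivalent;
the second avoids escaping every @code{/} by using @code{|} as the
delimiter instead:

@example
echo /usr/local/bin | sed 's/\/usr\/local/\/opt/'
echo /usr/local/bin | sed 's|/usr/local|/opt|'
@end example

@noindent
Both print @samp{/opt/bin}.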

The @code{s} command is probably the most important in @command{sed}
and has a lot of different options. Its basic concept is simple:
the @code{s} command attempts to match the pattern
space against the supplied @var{regexp}; if the match is
successful, then that portion of the pattern
space which was matched is replaced with @var{replacement}.

@cindex Backreferences, in regular expressions
@cindex Parenthesized substrings
The @var{replacement} can contain @code{\@var{n}} (@var{n} being
a number from 1 to 9, inclusive) references, which refer to
the portion of the match which is contained between the @var{n}th
@code{\(} and its matching @code{\)}.
Also, the @var{replacement} can contain unescaped @code{&}
characters which reference the whole matched portion
of the pattern space.
@cindex @value{SSEDEXT}, case modifiers in @code{s} commands
Finally, as a @value{SSED} extension, you can include a
special sequence made of a backslash and one of the letters
@code{L}, @code{l}, @code{U}, @code{u}, or @code{E}.
The meaning is as follows:

@table @code
@item \L
Turn the replacement
to lowercase until a @code{\U} or @code{\E} is found,

@item \l
Turn the
next character to lowercase,

@item \U
Turn the replacement to uppercase
until a @code{\L} or @code{\E} is found,

@item \u
Turn the next character
to uppercase,

@item \E
Stop case conversion started by @code{\L} or @code{\U}.
@end table

To include a literal @code{\}, @code{&}, or newline in the final
replacement, be sure to precede the desired @code{\}, @code{&},
or newline in the @var{replacement} with a @code{\}.
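
A short sketch of these replacement features follows. The input
strings are arbitrary, and the third command assumes @value{SSED},
since @code{\U} and @code{\E} are extensions:

@example
# \1 and \2 refer to the parenthesized groups: prints "world hello"
echo 'hello world' | sed 's/\(hello\) \(world\)/\2 \1/'

# an unescaped & stands for the whole match: prints "[hello] world"
echo 'hello world' | sed 's/hello/[&]/'

# \U upper-cases until \E: prints "HELLO world"
echo 'hello world' | sed 's/\(hello\) \(world\)/\U\1\E \2/'

# \& yields a literal ampersand: prints "fish & chips"
echo 'fish and chips' | sed 's/and/\&/'
@end example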

@findex s command, option flags
@cindex Substitution of text, options
The @code{s} command can be followed by zero or more of the
following @var{flags}:

@table @code
@item g
@cindex Global substitution
@cindex Replacing all text matching regexp in a line
Apply the replacement to @emph{all} matches to the @var{regexp},
not just the first.

@item @var{number}
@cindex Replacing only @var{n}th match of regexp in a line
Only replace the @var{number}th match of the @var{regexp}.

@cindex @acronym{GNU} extensions, @code{g} and @var{number} modifier interaction in @code{s} command
@cindex Mixing @code{g} and @var{number} modifiers in the @code{s} command
Note: the @sc{posix} standard does not specify what should happen
when you mix the @code{g} and @var{number} modifiers,
and currently there is no widely agreed upon meaning
across @command{sed} implementations.
For @value{SSED}, the interaction is defined to be:
ignore matches before the @var{number}th,
and then match and replace all matches from
the @var{number}th on.

@item p
@cindex Text, printing after substitution
If the substitution was made, then print the new pattern space.

Note: when both the @code{p} and @code{e} options are specified,
the relative ordering of the two produces very different results.
In general, @code{ep} (evaluate then print) is what you want,
but operating the other way round can be useful for debugging.
For this reason, the current version of @value{SSED} interprets
specially the presence of @code{p} options both before and after
@code{e}, printing the pattern space before and after evaluation,
while in general flags for the @code{s} command show their
effect just once. This behavior, although documented, might
change in future versions.

@item w @var{file-name}
@cindex Text, writing to a file after substitution
@cindex @value{SSEDEXT}, @file{/dev/stdout} file
@cindex @value{SSEDEXT}, @file{/dev/stderr} file
If the substitution was made, then write out the result to the named file.
As a @value{SSED} extension, two special values of @var{file-name} are
supported: @file{/dev/stderr}, which writes the result to the standard
error, and @file{/dev/stdout}, which writes to the standard
output.@footnote{This is equivalent to @code{p} unless the @option{-i}
option is being used.}

@item e
@cindex Evaluate Bourne-shell commands, after substitution
@cindex Subprocesses
@cindex @value{SSEDEXT}, evaluating Bourne-shell commands
@cindex @value{SSEDEXT}, subprocesses
This command allows one to pipe input from a shell command
into pattern space. If a substitution was made, the command
that is found in pattern space is executed and pattern space
is replaced with its output. A trailing newline is suppressed;
results are undefined if the command to be executed contains
a @sc{nul} character. This is a @value{SSED} extension.

@item I
@itemx i
@cindex @acronym{GNU} extensions, @code{I} modifier
@cindex Case-insensitive matching
@ifset PERL
@cindex Perl-style regular expressions, case-insensitive
@end ifset
The @code{I} modifier to regular-expression matching is a @acronym{GNU}
extension which makes @command{sed} match @var{regexp} in a
case-insensitive manner.

@item M
@itemx m
@cindex @value{SSEDEXT}, @code{M} modifier
@ifset PERL
@cindex Perl-style regular expressions, multiline
@end ifset
The @code{M} modifier to regular-expression matching is a @value{SSED}
extension which causes @code{^} and @code{$} to match respectively
(in addition to the normal behavior) the empty string after a newline,
and the empty string before a newline.
There are special character
sequences
@ifset PERL
(@code{\A} and @code{\Z} in Perl mode, @code{\`} and @code{\'}
in basic or extended regular expression modes)
@end ifset
@ifclear PERL
(@code{\`} and @code{\'})
@end ifclear
which always match the beginning or the end of the buffer.
@code{M} stands for @cite{multi-line}.

@ifset PERL
@item S
@itemx s
@cindex @value{SSEDEXT}, @code{S} modifier
@cindex Perl-style regular expressions, single line
The @code{S} modifier to regular-expression matching is only valid
in Perl mode and specifies that the dot character (@code{.}) will
match the newline character too. @code{S} stands for @cite{single-line}.
@end ifset

@ifset PERL
@item X
@itemx x
@cindex @value{SSEDEXT}, @code{X} modifier
@cindex Perl-style regular expressions, extended
The @code{X} modifier to regular-expression matching is also
valid in Perl mode only. If it is used, whitespace in the
pattern (other than in a character class) and
characters between a @kbd{#} outside a character class and the
next newline character are ignored. An escaping backslash
can be used to include a whitespace or @kbd{#} character as part
of the pattern.
@end ifset
@end table


@node Other Commands
@section Less Frequently-Used Commands

Though perhaps less frequently used than those in the previous
section, some very small yet useful @command{sed} scripts can be built with
these commands.

@table @code
@item y/@var{source-chars}/@var{dest-chars}/
(The @code{/} characters may be uniformly replaced by
any other single character within any given @code{y} command.)

@findex y (transliterate) command
@cindex Transliteration
Transliterate any characters in the pattern space which match
any of the @var{source-chars} with the corresponding character
in @var{dest-chars}.

Instances of the @code{/} (or whatever other character is used in its stead),
@code{\}, or newlines can appear in the @var{source-chars} or @var{dest-chars}
lists, provided that each instance is escaped by a @code{\}.
The @var{source-chars} and @var{dest-chars} lists @emph{must}
contain the same number of characters (after de-escaping).

@item a\
@itemx @var{text}
@cindex @value{SSEDEXT}, two addresses supported by most commands
As a @acronym{GNU} extension, this command accepts two addresses.

@findex a (append text lines) command
@cindex Appending text after a line
@cindex Text, appending
Queue the lines of text which follow this command
(each but the last ending with a @code{\},
which are removed from the output)
to be output at the end of the current cycle,
or when the next input line is read.

Escape sequences in @var{text} are processed, so you should
use @code{\\} in @var{text} to print a single backslash.

As a @acronym{GNU} extension, if between the @code{a} and the newline there is
other than a whitespace-@code{\} sequence, then the text of this line,
starting at the first non-whitespace character after the @code{a},
is taken as the first line of the @var{text} block.
(This enables a simplification in scripting a one-line add.)
This extension also works with the @code{i} and @code{c} commands.

@item i\
@itemx @var{text}
@cindex @value{SSEDEXT}, two addresses supported by most commands
As a @acronym{GNU} extension, this command accepts two addresses.

@findex i (insert text lines) command
@cindex Inserting text before a line
@cindex Text, insertion
Immediately output the lines of text which follow this command
(each but the last ending with a @code{\},
which are removed from the output).

@item c\
@itemx @var{text}
@findex c (change to text lines) command
@cindex Replacing selected lines with other text
Delete the lines matching the address or address-range,
and output the lines of text which follow this command
(each but the last ending with a @code{\},
which are removed from the output)
in place of the last line
(or in place of each line, if no addresses were specified).
A new cycle is started after this command is done,
since the pattern space will have been deleted.

@item =
@cindex @value{SSEDEXT}, two addresses supported by most commands
As a @acronym{GNU} extension, this command accepts two addresses.

@findex = (print line number) command
@cindex Printing line number
@cindex Line number, printing
Print out the current input line number (with a trailing newline).

@item l @var{n}
@findex l (list unambiguously) command
@cindex List pattern space
@cindex Printing text unambiguously
@cindex Line length, setting
@cindex @value{SSEDEXT}, setting line length
Print the pattern space in an unambiguous form:
non-printable characters (and the @code{\} character)
are printed in C-style escaped form; long lines are split,
with a trailing @code{\} character to indicate the split;
the end of each line is marked with a @code{$}.

@var{n} specifies the desired line-wrap length;
a length of 0 (zero) means to never wrap long lines. If omitted,
the default as specified on the command line is used. The @var{n}
parameter is a @value{SSED} extension.

@item r @var{filename}
@cindex @value{SSEDEXT}, two addresses supported by most commands
As a @acronym{GNU} extension, this command accepts two addresses.

@findex r (read file) command
@cindex Read text from a file
@cindex @value{SSEDEXT}, @file{/dev/stdin} file
Queue the contents of @var{filename} to be read and
inserted into the output stream at the end of the current cycle,
or when the next input line is read.
Note that if @var{filename} cannot be read, it is treated as
if it were an empty file, without any error indication.

As a @value{SSED} extension, the special value @file{/dev/stdin}
is supported for the file name, which reads the contents of the
standard input.

@item w @var{filename}
@findex w (write file) command
@cindex Write to a file
@cindex @value{SSEDEXT}, @file{/dev/stdout} file
@cindex @value{SSEDEXT}, @file{/dev/stderr} file
Write the pattern space to @var{filename}.
As a @value{SSED} extension, two special values of @var{filename} are
supported: @file{/dev/stderr}, which writes the result to the standard
error, and @file{/dev/stdout}, which writes to the standard
output.@footnote{This is equivalent to @code{p} unless the @option{-i}
option is being used.}

The file will be created (or truncated) before the
first input line is read; all @code{w} commands
(including instances of @code{w} flag on successful @code{s} commands)
which refer to the same @var{filename} are output without
closing and reopening the file.

@item D
@findex D (delete first line) command
@cindex Delete first line from pattern space
Delete text in the pattern space up to the first newline.
If any text is left, restart cycle with the resultant
pattern space (without reading a new line of input),
otherwise start a normal new cycle.

@item N
@findex N (append Next line) command
@cindex Next input line, append to pattern space
@cindex Append next input line to pattern space
Add a newline to the pattern space,
then append the next line of input to the pattern space.
If there is no more input then @command{sed} exits without processing
any more commands.

@item P
@findex P (print first line) command
@cindex Print first line from pattern space
Print out the portion of the pattern space up to the first newline.

@item h
@findex h (hold) command
@cindex Copy pattern space into hold space
@cindex Replace hold space with copy of pattern space
@cindex Hold space, copying pattern space into
Replace the contents of the hold space with the contents of the pattern space.

@item H
@findex H (append Hold) command
@cindex Append pattern space to hold space
@cindex Hold space, appending from pattern space
Append a newline to the contents of the hold space,
and then append the contents of the pattern space to that of the hold space.

@item g
@findex g (get) command
@cindex Copy hold space into pattern space
@cindex Replace pattern space with copy of hold space
@cindex Hold space, copy into pattern space
Replace the contents of the pattern space with the contents of the hold space.

@item G
@findex G (appending Get) command
@cindex Append hold space to pattern space
@cindex Hold space, appending to pattern space
Append a newline to the contents of the pattern space,
and then append the contents of the hold space to that of the pattern space.

@item x
@findex x (eXchange) command
@cindex Exchange hold space with pattern space
@cindex Hold space, exchange with pattern space
Exchange the contents of the hold and pattern spaces.
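
The hold-space commands are most useful in combination. As a minimal
sketch with made-up input, the first command below relies on the hold
space being empty at startup, and the second accumulates the lines in
reverse order before printing them on the last line:

@example
# G appends a newline plus the (empty) hold space,
# so every input line is followed by a blank line
printf 'a\nb\n' | sed G

# On every line but the first, append the hold space; save the
# result; print on the last line.  Prints "c", "b", "a".
printf 'a\nb\nc\n' | sed -n '1!G;h;$p'
@end example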

@end table


@node Programming Commands
@section Commands for @command{sed} gurus

In most cases, use of these commands indicates that you are
probably better off programming in something like @command{awk}
or Perl. But occasionally one is committed to sticking
with @command{sed}, and these commands can enable one to write
quite convoluted scripts.

@cindex Flow of control in scripts
@table @code
@item : @var{label}
[No addresses allowed.]

@findex : (label) command
@cindex Labels, in scripts
Specify the location of @var{label} for branch commands.
In all other respects, a no-op.

@item b @var{label}
@findex b (branch) command
@cindex Branch to a label, unconditionally
@cindex Goto, in scripts
Unconditionally branch to @var{label}.
The @var{label} may be omitted, in which case the next cycle is started.

@item t @var{label}
@findex t (test and branch if successful) command
@cindex Branch to a label, if @code{s///} succeeded
@cindex Conditional branch
Branch to @var{label} only if there has been a successful @code{s}ubstitution
since the last input line was read or conditional branch was taken.
The @var{label} may be omitted, in which case the next cycle is started.

@end table

@node Extended Commands
@section Commands Specific to @value{SSED}

These commands are specific to @value{SSED}, so you
must use them with care and only when you are sure that
hindering portability is not evil. They allow you to check
for @value{SSED} extensions or to do tasks that are required
quite often, yet are unsupported by standard @command{sed}s.

@table @code
@item e [@var{command}]
@findex e (evaluate) command
@cindex Evaluate Bourne-shell commands
@cindex Subprocesses
@cindex @value{SSEDEXT}, evaluating Bourne-shell commands
@cindex @value{SSEDEXT}, subprocesses
This command allows one to pipe input from a shell command
into pattern space. Without parameters, the @code{e} command
executes the command that is found in pattern space and
replaces the pattern space with the output; a trailing newline
is suppressed.

If a parameter is specified, instead, the @code{e} command
interprets it as a command and sends its output to the output stream
(like @code{r} does). The command can run across multiple
lines, all but the last ending with a backslash.

In both cases, the results are undefined if the command to be
executed contains a @sc{nul} character.

@item L @var{n}
@findex L (fLow paragraphs) command
@cindex Reformat pattern space
@cindex Reformatting paragraphs
@cindex @value{SSEDEXT}, reformatting paragraphs
@cindex @value{SSEDEXT}, @code{L} command
This @value{SSED} extension fills and joins lines in pattern space
to produce output lines of (at most) @var{n} characters, like
@code{fmt} does; if @var{n} is omitted, the default as specified
on the command line is used. This command is considered a failed
experiment and, unless there is sufficient demand (which seems unlikely),
will be removed in future versions.

@ignore
Blank lines, spaces between words, and indentation are
preserved in the output; successive input lines with different
indentation are not joined; tabs are expanded to 8 columns.

If the pattern space contains multiple lines, they are joined, but
since the pattern space usually contains a single line, the behavior
of a simple @code{L;d} script is the same as @samp{fmt -s} (i.e.,
it does not join short lines to form longer ones).

@var{n} specifies the desired line-wrap length; if omitted,
the default as specified on the command line is used.
@end ignore

@item Q [@var{exit-code}]
This command only accepts a single address.

@findex Q (silent Quit) command
@cindex @value{SSEDEXT}, quitting silently
@cindex @value{SSEDEXT}, returning an exit code
@cindex Quitting
This command is the same as @code{q}, but will not print the
contents of pattern space. Like @code{q}, it provides the
ability to return an exit code to the caller.

This command can be useful because the only alternative ways
to accomplish this apparently trivial function are to use
the @option{-n} option (which can unnecessarily complicate
your script) or to resort to the following snippet, which
wastes time by reading the whole file without any visible effect:

@example
:eat
$d       @i{@r{Quit silently on the last line}}
N        @i{@r{Read another line, silently}}
g        @i{@r{Overwrite pattern space each time to save memory}}
b eat
@end example

@item R @var{filename}
@findex R (read line) command
@cindex Read text from a file
@cindex @value{SSEDEXT}, reading a file a line at a time
@cindex @value{SSEDEXT}, @code{R} command
@cindex @value{SSEDEXT}, @file{/dev/stdin} file
Queue a line of @var{filename} to be read and
inserted into the output stream at the end of the current cycle,
or when the next input line is read.
Note that if @var{filename} cannot be read, or if its end is
reached, no line is appended, without any error indication.

As with the @code{r} command, the special value @file{/dev/stdin}
is supported for the file name, which reads a line from the
standard input.

@item T @var{label}
@findex T (test and branch if failed) command
@cindex @value{SSEDEXT}, branch if @code{s///} failed
@cindex Branch to a label, if @code{s///} failed
@cindex Conditional branch
Branch to @var{label} only if there have been no successful
@code{s}ubstitutions since the last input line was read or
conditional branch was taken. The @var{label} may be omitted,
in which case the next cycle is started.

@item v @var{version}
@findex v (version) command
@cindex @value{SSEDEXT}, checking for their presence
@cindex Requiring @value{SSED}
This command does nothing, but makes @command{sed} fail if
@value{SSED} extensions are not supported, simply because other
versions of @command{sed} do not implement it. In addition, you
can specify the version of @command{sed} that your script
requires, such as @code{4.0.5}. The default is @code{4.0}
because that is the first version that implemented this command.

This command enables all @value{SSEDEXT} even if
@env{POSIXLY_CORRECT} is set in the environment.

@item W @var{filename}
@findex W (write first line) command
@cindex Write first line to a file
@cindex @value{SSEDEXT}, writing first line to a file
Write to the given filename the portion of the pattern space up to
the first newline. Everything said under the @code{w} command about
file handling holds here too.

@item z
@findex z (Zap) command
@cindex @value{SSEDEXT}, emptying pattern space
@cindex Emptying pattern space
This command empties the content of pattern space. It is
usually the same as @samp{s/.*//}, but is more efficient
and works in the presence of invalid multibyte sequences
in the input stream.
@sc{posix} mandates that such sequences
are @emph{not} matched by @samp{.}, so that there is no portable
way to clear @command{sed}'s buffers in the middle of the
script in most multibyte locales (including UTF-8 locales).
@end table

@node Escapes
@section @acronym{GNU} Extensions for Escapes in Regular Expressions

@cindex @acronym{GNU} extensions, special escapes
Until this chapter, we have only encountered escapes of the form
@samp{\^}, which tell @command{sed} not to interpret the circumflex
as a special character, but rather to take it literally. For
example, @samp{\*} matches a single asterisk rather than zero
or more backslashes.

@cindex @code{POSIXLY_CORRECT} behavior, escapes
This chapter introduces another kind of escape@footnote{All
the escapes introduced here are @acronym{GNU}
extensions, with the exception of @code{\n}. In basic regular
expression mode, setting @code{POSIXLY_CORRECT} disables them inside
bracket expressions.}---that
is, escapes that are applied to a character or sequence of characters
that ordinarily are taken literally, and that @command{sed} replaces
with a special character. This provides a way
of encoding non-printable characters in patterns in a visible manner.
There is no restriction on the appearance of non-printing characters
in a @command{sed} script but when a script is being prepared in the
shell or by text editing, it is usually easier to use one of
the following escape sequences than the binary character it
represents:

The list of these escapes is:

@table @code
@item \a
Produces or matches a @sc{bel} character, that is an ``alert'' (@sc{ascii} 7).

@item \f
Produces or matches a form feed (@sc{ascii} 12).

@item \n
Produces or matches a newline (@sc{ascii} 10).

@item \r
Produces or matches a carriage return (@sc{ascii} 13).

@item \t
Produces or matches a horizontal tab (@sc{ascii} 9).

@item \v
Produces or matches a so-called ``vertical tab'' (@sc{ascii} 11).

@item \c@var{x}
Produces or matches @kbd{@sc{Control}-@var{x}}, where @var{x} is
any character. The precise effect of @samp{\c@var{x}} is as follows:
if @var{x} is a lower case letter, it is converted to upper case.
Then bit 6 of the character (hex 40) is inverted. Thus @samp{\cz} becomes
hex 1A, but @samp{\c@{} becomes hex 3B, while @samp{\c;} becomes hex 7B.

@item \d@var{xxx}
Produces or matches a character whose decimal @sc{ascii} value is @var{xxx}.

@item \o@var{xxx}
@ifset PERL
@item \@var{xxx}
@end ifset
Produces or matches a character whose octal @sc{ascii} value is @var{xxx}.
@ifset PERL
The syntax without the @code{o} is active in Perl mode, while the one
with the @code{o} is active in the normal or extended @sc{posix} regular
expression modes.
@end ifset

@item \x@var{xx}
Produces or matches a character whose hexadecimal @sc{ascii} value is @var{xx}.
@end table

@samp{\b} (backspace) was omitted because of the conflict with
the existing ``word boundary'' meaning.

Other escapes match a particular character class and are valid only in
regular expressions:

@table @code
@item \w
Matches any ``word'' character. A ``word'' character is any
letter or digit or the underscore character.

@item \W
Matches any ``non-word'' character.

@item \b
Matches a word boundary; that is it matches if the character
to the left is a ``word'' character and the character to the
right is a ``non-word'' character, or vice-versa.

@item \B
Matches everywhere but on a word boundary; that is it matches
if the character to the left and the character to the right
are either both ``word'' characters or both ``non-word''
characters.

@item \`
Matches only at the start of pattern space. This is different
from @code{^} in multi-line mode.

@item \'
Matches only at the end of pattern space. This is different
from @code{$} in multi-line mode.

@ifset PERL
@item \G
Match only at the start of pattern space or, when doing a global
substitution using the @code{s///g} command and option, at
the end-of-match position of the prior match. For example,
@samp{s/\Ga/Z/g} will change an initial run of @code{a}s to
a run of @code{Z}s.
@end ifset
@end table

@node Examples
@chapter Some Sample Scripts

Here are some @command{sed} scripts to guide you in the art of mastering
@command{sed}.

@menu
Some exotic examples:
* Centering lines::
* Increment a number::
* Rename files to lower case::
* Print bash environment::
* Reverse chars of lines::

Emulating standard utilities:
* tac::                             Reverse lines of files
* cat -n::                          Numbering lines
* cat -b::                          Numbering non-blank lines
* wc -c::                           Counting chars
* wc -w::                           Counting words
* wc -l::                           Counting lines
* head::                            Printing the first lines
* tail::                            Printing the last lines
* uniq::                            Make duplicate lines unique
* uniq -d::                         Print duplicated lines of input
* uniq -u::                         Remove all duplicated lines
* cat -s::                          Squeezing blank lines
@end menu

@node Centering lines
@section Centering Lines

This script centers all lines of a file to an 80-column width.
To change that width, the number in @code{\@{@dots{}\@}} must be
replaced, and the number of added spaces also must be changed.

Note how the buffer commands are used to separate parts in
the regular expressions to be matched---this is a common
technique.

@c start-------------------------------------------
@example
#!/usr/bin/sed -f

# Put 80 spaces in the buffer
1 @{
  x
  s/^$/          /
  s/^.*$/&&&&&&&&/
  x
@}

# del leading and trailing spaces
y/@kbd{tab}/ /
s/^ *//
s/ *$//

# add a newline and 80 spaces to end of line
G

# keep first 81 chars (80 + a newline)
s/^\(.\@{81\@}\).*$/\1/

# \2 matches half of the spaces, which are moved to the beginning
s/^\(.*\)\n\(.*\)\2/\2\1/
@end example
@c end---------------------------------------------

@node Increment a number
@section Increment a Number

This script is one of a few that demonstrate how to do arithmetic
in @command{sed}. This is indeed possible,@footnote{@command{sed} guru Greg
Ubben wrote an implementation of the @command{dc} @sc{rpn} calculator!
It is distributed together with sed.} but must be done manually.

To increment a number you just add 1 to the last digit, replacing
it by the following digit. There is one exception: when the digit
is a nine the previous digits must be also incremented until you
don't have a nine.

This solution by Bruno Haible is clever because
it uses a single buffer; if you don't have this limitation, the
algorithm used in @ref{cat -n, Numbering lines}, is faster.
It works by replacing trailing nines with an underscore, then
using multiple @code{s} commands to increment the last digit,
and then again substituting underscores with zeros.
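
To make the three phases concrete, here is a hand-unrolled trace for
the input @samp{1299}. It is a simplified sketch rather than the
full script: only the one digit substitution that this particular
input needs is included, and the pattern space is printed after
each phase.

@example
echo 1299 | sed -n -e ':d' -e 's/9\(_*\)$/_\1/' -e 'td' -e p \
                   -e 's/2\(_*\)$/3\1/' -e p \
                   -e 'y/_/0/' -e p
@end example

@noindent
This prints @samp{12__} (trailing nines replaced), then @samp{13__}
(last remaining digit incremented), and finally @samp{1300}
(underscores turned into zeros).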

@c start-------------------------------------------
@example
#!/usr/bin/sed -f

/[^0-9]/ d

# replace all trailing 9s by _ (any other character except digits could
# be used)
:d
s/9\(_*\)$/_\1/
td

# incr last digit only.  The first line adds a most-significant
# digit of 1 if we have to add a digit.
#
# The @code{tn} commands are not necessary, but make the thing
# faster

s/^\(_*\)$/1\1/; tn
s/8\(_*\)$/9\1/; tn
s/7\(_*\)$/8\1/; tn
s/6\(_*\)$/7\1/; tn
s/5\(_*\)$/6\1/; tn
s/4\(_*\)$/5\1/; tn
s/3\(_*\)$/4\1/; tn
s/2\(_*\)$/3\1/; tn
s/1\(_*\)$/2\1/; tn
s/0\(_*\)$/1\1/; tn

:n
y/_/0/
@end example
@c end---------------------------------------------

@node Rename files to lower case
@section Rename Files to Lower Case

This is a pretty strange use of @command{sed}.  We transform text, and
transform it to be shell commands, then just feed them to shell.
Don't worry, even worse hacks are done when using @command{sed}; I have
seen a script converting the output of @command{date} into a @command{bc}
program!

The main body of this is the @command{sed} script, which remaps the name
from lower to upper (or vice-versa) and even checks
whether the remapped name is the same as the original name.
Note how the script is parameterized using shell
variables and proper quoting.

@c start-------------------------------------------
@example
#! /bin/sh
# rename files to lower/upper case...
#
# usage:
#    move-to-lower *
#    move-to-upper *
# or
#    move-to-lower -R .
#    move-to-upper -R .
#

help()
@{
        cat << eof
Usage: $0 [-n] [-R] [-h] files...

-n      do nothing, only see what would be done
-R      recursive (use find)
-h      this message
files   files to remap to lower case

Examples:
       $0 -n *        (see if everything is ok, then...)
       $0 *

       $0 -R .

eof
@}

apply_cmd='sh'
finder='echo "$@@" | tr " " "\n"'
files_only=

while :
do
    case "$1" in
        -n) apply_cmd='cat' ;;
        -R) finder='find "$@@" -type f';;
        -h) help ; exit 1 ;;
        *) break ;;
    esac
    shift
done

if [ -z "$1" ]; then
        echo "Usage: $0 [-h] [-n] [-R] files..."
        exit 1
fi

LOWER='abcdefghijklmnopqrstuvwxyz'
UPPER='ABCDEFGHIJKLMNOPQRSTUVWXYZ'

case `basename $0` in
        *upper*) TO=$UPPER; FROM=$LOWER ;;
        *)       FROM=$UPPER; TO=$LOWER ;;
esac

eval $finder | sed -n '

# remove all trailing slashes
s/\/*$//

# add ./ if there is no path, only a filename
/\//! s/^/.\//

# save path+filename
h

# remove path
s/.*\///

# do conversion only on filename
y/'$FROM'/'$TO'/

# now line contains original path+file, while
# hold space contains the new filename
x

# add converted file name to line, which now contains
# path/file-name\nconverted-file-name
G

# check if converted file name is equal to original file name,
# if it is, do not print anything
/^.*\/\(.*\)\n\1/b

# now, transform path/fromfile\n, into
# mv path/fromfile path/tofile and print it
s/^\(.*\/\)\(.*\)\n\(.*\)$/mv "\1\2" "\1\3"/p

' | $apply_cmd
@end example
@c end---------------------------------------------

@node Print bash environment
@section Print @command{bash} Environment

This script strips the definition of the shell functions
from the output of the @command{set} Bourne-shell command.

@c start-------------------------------------------
@example
#!/bin/sh

set | sed -n '
:x

@ifinfo
# if no occurrence of "=()" print and load next line
@end ifinfo
@ifnotinfo
# if no occurrence of @samp{=()} print and load next line
@end ifnotinfo
/=()/! @{ p; b; @}
/ () $/! @{ p; b; @}

# possible start of functions section
# save the line in case this is a var like FOO="() "
h

# if the next line has a brace, we quit because
# nothing comes after functions
n
/^@{/ q

# print the old line
x; p

# work on the new line now
x; bx
'
@end example
@c end---------------------------------------------

@node Reverse chars of lines
@section Reverse Characters of Lines

This script can be used to reverse the position of characters
in lines.  The technique moves two characters at a time, hence
it is faster than more intuitive implementations.

Note the @code{tx} command before the definition of the label.
This is often needed to reset the flag that is tested by
the @code{t} command.

Imaginative readers will find uses for this script.  An example
is reversing the output of @command{banner}.@footnote{This requires
another script to pad the output of banner; for example

@example
#! /bin/sh

banner -w $1 $2 $3 $4 |
  sed -e :a -e '/^.\@{0,'$1'\@}$/ @{ s/$/ /; ba; @}' |
  ~/sedscripts/reverseline.sed
@end example
}

@c start-------------------------------------------
@example
#!/usr/bin/sed -f

/../! b

# Reverse a line.  Begin embedding the line between two newlines
s/^.*$/\
&\
/

# Move first character at the end.
# The regexp matches until
# there are zero or one characters between the markers
tx
:x
s/\(\n.\)\(.*\)\(.\n\)/\3\2\1/
tx

# Remove the newline markers
s/\n//g
@end example
@c end---------------------------------------------

@node tac
@section Reverse Lines of Files

This one begins a series of totally useless (yet interesting)
scripts emulating various Unix commands.  This, in particular,
is a @command{tac} workalike.

Note that on implementations other than @acronym{GNU} @command{sed}
@ifset PERL
and @value{SSED}
@end ifset
this script might easily overflow internal buffers.

@c start-------------------------------------------
@example
#!/usr/bin/sed -nf

# reverse all lines of input, i.e. the first line becomes the last, ...

# from the second line, the buffer (which contains all previous lines)
# is *appended* to current line, so, the order will be reversed
1! G

# on the last line we're done -- print everything
$ p

# store everything on the buffer again
h
@end example
@c end---------------------------------------------

@node cat -n
@section Numbering Lines

This script replaces @samp{cat -n}; in fact it formats its output
exactly like @acronym{GNU} @command{cat} does.

Of course this is completely useless, for two reasons: first,
because somebody else did it in C, second, because the following
Bourne-shell script could be used for the same purpose and would
be much faster:

@c start-------------------------------------------
@example
#! /bin/sh
sed -e "=" $@@ | sed -e '
  s/^/      /
  N
  s/^ *\(......\)\n/\1  /
'
@end example
@c end---------------------------------------------

It uses @command{sed} to print the line number, then groups lines two
by two using @code{N}.
Of course, this script does not teach as much as
the one presented below.

The algorithm used for incrementing uses both buffers, so the line
is printed as soon as possible and then discarded.  The number
is split so that changing digits go in a buffer and unchanged ones go
in the other; the changed digits are modified in a single step
(using a @code{y} command).  The line number for the next line
is then composed and stored in the hold space, to be used in the
next iteration.

@c start-------------------------------------------
@example
#!/usr/bin/sed -nf

# Prime the pump on the first line
x
/^$/ s/^.*$/1/

# Add the correct line number before the pattern
G
h

# Format it and print it
s/^/      /
s/^ *\(......\)\n/\1  /p

# Get the line number from hold space; add a zero
# if we're going to add a digit on the next line
g
s/\n.*$//
/^9*$/ s/^/0/

# separate changing/unchanged digits with an x
s/.9*$/x&/

# keep changing digits in hold space
h
s/^.*x//
y/0123456789/1234567890/
x

# keep unchanged digits in pattern space
s/x.*$//

# compose the new number, remove the newline implicitly added by G
G
s/\n//
h
@end example
@c end---------------------------------------------

@node cat -b
@section Numbering Non-blank Lines

Emulating @samp{cat -b} is almost the same as @samp{cat -n}---we only
have to select which lines are to be numbered and which are not.

The part that is common to this script and the previous one is
not commented to show how important it is to comment @command{sed}
scripts properly...

@c start-------------------------------------------
@example
#!/usr/bin/sed -nf

/^$/ @{
  p
  b
@}

# Same as cat -n from now
x
/^$/ s/^.*$/1/
G
h
s/^/      /
s/^ *\(......\)\n/\1  /p
x
s/\n.*$//
/^9*$/ s/^/0/
s/.9*$/x&/
h
s/^.*x//
y/0123456789/1234567890/
x
s/x.*$//
G
s/\n//
h
@end example
@c end---------------------------------------------

@node wc -c
@section Counting Characters

This script shows another way to do arithmetic with @command{sed}.
In this case we have to add possibly large numbers, so implementing
this by successive increments would not be feasible (and possibly
even more complicated to contrive than this script).

The approach is to map numbers to letters, kind of an abacus
implemented with @command{sed}.  @samp{a}s are units, @samp{b}s are
tens and so on: we simply add the number of characters
on the current line as units, and then propagate the carry
to tens, hundreds, and so on.

As usual, running totals are kept in hold space.

On the last line, we convert the abacus form back to decimal.
For the sake of variety, this is done with a loop rather than
with some 80 @code{s} commands@footnote{Some implementations
have a limit of 199 commands per script}: first we
convert units, removing @samp{a}s from the number; then we
rotate letters so that tens become @samp{a}s, and so on
until no more letters remain.

@c start-------------------------------------------
@example
#!/usr/bin/sed -nf

# Add n+1 a's to hold space (+1 is for the newline)
s/./a/g
H
x
s/\n/a/

# Do the carry.
# The t's and b's are not necessary,
# but they do speed up the thing
t a
: a;  s/aaaaaaaaaa/b/g; t b; b done
: b;  s/bbbbbbbbbb/c/g; t c; b done
: c;  s/cccccccccc/d/g; t d; b done
: d;  s/dddddddddd/e/g; t e; b done
: e;  s/eeeeeeeeee/f/g; t f; b done
: f;  s/ffffffffff/g/g; t g; b done
: g;  s/gggggggggg/h/g; t h; b done
: h;  s/hhhhhhhhhh//g

: done
$! @{
  h
  b
@}

# On the last line, convert back to decimal

: loop
/a/! s/[b-h]*/&0/
s/aaaaaaaaa/9/
s/aaaaaaaa/8/
s/aaaaaaa/7/
s/aaaaaa/6/
s/aaaaa/5/
s/aaaa/4/
s/aaa/3/
s/aa/2/
s/a/1/

: next
y/bcdefgh/abcdefg/
/[a-h]/ b loop
p
@end example
@c end---------------------------------------------

@node wc -w
@section Counting Words

This script is almost the same as the previous one, once each
of the words on the line is converted to a single @samp{a}
(in the previous script each letter was changed to an @samp{a}).

It is interesting that real @command{wc} programs have optimized
loops for @samp{wc -c}, so they are much slower at counting
words than at counting characters.  This script's bottleneck,
instead, is arithmetic, and hence the word-counting one
is faster (it has to manage smaller numbers).

Again, the common parts are not commented to show the importance
of commenting @command{sed} scripts.

@c start-------------------------------------------
@example
#!/usr/bin/sed -nf

# Convert words to a's
s/[ @kbd{tab}][ @kbd{tab}]*/ /g
s/^/ /
s/ [^ ][^ ]*/a /g
s/ //g

# Append them to hold space
H
x
s/\n//

# From here on it is the same as in wc -c.
/aaaaaaaaaa/! bx;   s/aaaaaaaaaa/b/g
/bbbbbbbbbb/! bx;   s/bbbbbbbbbb/c/g
/cccccccccc/! bx;   s/cccccccccc/d/g
/dddddddddd/! bx;   s/dddddddddd/e/g
/eeeeeeeeee/! bx;   s/eeeeeeeeee/f/g
/ffffffffff/! bx;   s/ffffffffff/g/g
/gggggggggg/! bx;   s/gggggggggg/h/g
s/hhhhhhhhhh//g
:x
$! @{ h; b; @}
:y
/a/! s/[b-h]*/&0/
s/aaaaaaaaa/9/
s/aaaaaaaa/8/
s/aaaaaaa/7/
s/aaaaaa/6/
s/aaaaa/5/
s/aaaa/4/
s/aaa/3/
s/aa/2/
s/a/1/
y/bcdefgh/abcdefg/
/[a-h]/ by
p
@end example
@c end---------------------------------------------

@node wc -l
@section Counting Lines

No strange things are done now, because @command{sed} gives us
@samp{wc -l} functionality for free!!! Look:

@c start-------------------------------------------
@example
#!/usr/bin/sed -nf
$=
@end example
@c end---------------------------------------------

@node head
@section Printing the First Lines

This script is probably the simplest useful @command{sed} script.
It displays the first 10 lines of input; the number of displayed
lines is right before the @code{q} command.

@c start-------------------------------------------
@example
#!/usr/bin/sed -f
10q
@end example
@c end---------------------------------------------

@node tail
@section Printing the Last Lines

Printing the last @var{n} lines rather than the first is more complex
but indeed possible.  @var{n} is encoded in the second line, before
the bang character.

This script is similar to the @command{tac} script in that it keeps the
final output in the hold space and prints it at the end:

@c start-------------------------------------------
@example
#!/usr/bin/sed -nf

1! @{; H; g; @}
1,10 !s/[^\n]*\n//
$p
h
@end example
@c end---------------------------------------------

Mainly, the script keeps a window of 10 lines and slides it
by adding a line and deleting the oldest (the substitution command
on the second line works like a @code{D} command but does not
restart the loop).
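
To watch the window slide, here is a small experiment.  It is only a
sketch: it assumes a window of 3 lines instead of 10 and a sample
input generated with @command{printf}, and it unrolls the braces of
the script above into two separate @samp{1!} commands:

@example
printf '%s\n' 1 2 3 4 5 |
  sed -n -e '1!H' -e '1!g' -e '1,3!s/[^\n]*\n//' -e '$p' -e 'h'
@end example

@noindent
Up to the third line the hold space simply accumulates the input;
from the fourth line on, each cycle appends the new line and the
substitution drops the oldest one, so that the last three lines
(@samp{3}, @samp{4} and @samp{5}) are printed at the end.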

The ``sliding window'' technique is a very powerful way to write
efficient and complex @command{sed} scripts, because commands like
@code{P} would require a lot of work if implemented manually.

To introduce the technique, which is fully demonstrated in the
rest of this chapter and is based on the @code{N}, @code{P}
and @code{D} commands, here is an implementation of @command{tail}
using a simple ``sliding window.''

This looks complicated but in fact it works the same way as
the last script: after we have kicked in the appropriate number
of lines, however, we stop using the hold space to keep inter-line
state, and instead use @code{N} and @code{D} to slide pattern
space by one line:

@c start-------------------------------------------
@example
#!/usr/bin/sed -f

1h
2,10 @{; H; g; @}
$q
1,9d
N
D
@end example
@c end---------------------------------------------

Note how the first, second and fourth lines are inactive after
the first ten lines of input.  After that, all the script does
is exit on the last line of input, append the next input
line to pattern space, and remove the first line.

@node uniq
@section Make Duplicate Lines Unique

This is an example of the art of using the @code{N}, @code{P}
and @code{D} commands, probably the most difficult to master.

@c start-------------------------------------------
@example
#!/usr/bin/sed -f
h

:b
# On the last line, print and exit
$b
N
/^\(.*\)\n\1$/ @{
    # The two lines are identical.  Undo the effect of
    # the N command.
    g
    bb
@}

# If the @code{N} command had added the last line, print and exit
$b

# The lines are different; print the first and go
# back working on the second.
P
D
@end example
@c end---------------------------------------------

As you can see, we maintain a 2-line window using @code{P} and @code{D}.
This technique is often used in advanced @command{sed} scripts.

@node uniq -d
@section Print Duplicated Lines of Input

This script prints only duplicated lines, like @samp{uniq -d}.

@c start-------------------------------------------
@example
#!/usr/bin/sed -nf

$b
N
/^\(.*\)\n\1$/ @{
    # Print the first of the duplicated lines
    s/.*\n//
    p

    # Loop until we get a different line
    :b
    $b
    N
    /^\(.*\)\n\1$/ @{
        s/.*\n//
        bb
    @}
@}

# The last line cannot be followed by duplicates
$b

# Found a different one.  Leave it alone in the pattern space
# and go back to the top, hunting its duplicates
D
@end example
@c end---------------------------------------------

@node uniq -u
@section Remove All Duplicated Lines

This script prints only unique lines, like @samp{uniq -u}.

@c start-------------------------------------------
@example
#!/usr/bin/sed -f

# Search for a duplicate line --- until that, print what you find.
$b
N
/^\(.*\)\n\1$/ ! @{
    P
    D
@}

:c
# Got two equal lines in pattern space.  At the
# end of the file we simply exit
$d

# Else, we keep reading lines with @code{N} until we
# find a different one
s/.*\n//
N
/^\(.*\)\n\1$/ @{
    bc
@}

# Remove the last instance of the duplicate line
# and go back to the top
D
@end example
@c end---------------------------------------------

@node cat -s
@section Squeezing Blank Lines

As a final example, here are three scripts, of increasing complexity
and speed, that implement the same function as @samp{cat -s}, that is
squeezing blank lines.

The first leaves a blank line at the beginning and end if there are
some already.

@c start-------------------------------------------
@example
#!/usr/bin/sed -f

# on empty lines, join with next
# Note there is a star in the regexp
:x
/^\n*$/ @{
N
bx
@}

# now, squeeze all '\n', this can be also done by:
# s/^\(\n\)*/\1/
s/\n*/\
/
@end example
@c end---------------------------------------------

This one is a bit more complex and removes all empty lines
at the beginning.  It does leave a single blank line at the end
if one was there.

@c start-------------------------------------------
@example
#!/usr/bin/sed -f

# delete all leading empty lines
1,/^./@{
/./!d
@}

# on an empty line we remove it and all the following
# empty lines, but one
:x
/./!@{
N
s/^\n$//
tx
@}
@end example
@c end---------------------------------------------

This removes leading and trailing blank lines.  It is also the
fastest.  Note that loops are completely done with @code{n} and
@code{b}, without relying on @command{sed} to restart
the script automatically at the end of a line.

@c start-------------------------------------------
@example
#!/usr/bin/sed -nf

# delete all (leading) blanks
/./!d

# get here: so there is a non empty
:x
# print it
p
# get next
n
# got chars? print it again, etc...
/./bx

# no, don't have chars: got an empty line
:z
# get next, if last line we finish here so no trailing
# empty lines are written
n
# also empty? then ignore it, and get next... this will
# remove ALL empty lines
/./!bz

# all empty lines were deleted/ignored, but we have a non empty.
# As what we want to do is to squeeze, insert a blank line artificially
i\

bx
@end example
@c end---------------------------------------------

@node Limitations
@chapter @value{SSED}'s Limitations and Non-limitations

@cindex @acronym{GNU} extensions, unlimited line length
@cindex Portability, line length limitations
For those who want to write portable @command{sed} scripts,
be aware that some implementations have been known to
limit line lengths (for the pattern and hold spaces)
to be no more than 4000 bytes.
The @sc{posix} standard specifies that conforming @command{sed}
implementations shall support at least 8192 byte line lengths.
@value{SSED} has no built-in limit on line length;
as long as it can @code{malloc()} more (virtual) memory,
you can feed or construct lines as long as you like.

However, recursion is used to handle subpatterns and indefinite
repetition.  This means that the available stack space may limit
the size of the buffer that can be processed by certain patterns.

@ifset PERL
There are some size limitations in the regular expression
matcher but it is hoped that they will never in practice
be relevant.  The maximum length of a compiled pattern
is 65539 (sic) bytes.  All values in repeating quantifiers
must be less than 65536.  The maximum nesting depth of
all parenthesized subpatterns, including capturing and
non-capturing subpatterns@footnote{The
distinction is meaningful when referring to Perl-style
regular expressions.}, assertions, and other types of
subpattern, is 200.

Also, @value{SSED} recognizes the @sc{posix} syntax
@code{[.@var{ch}.]} and @code{[=@var{ch}=]}
where @var{ch} is a ``collating element'', but these
are not supported, and an error is given if they are
encountered.

Here are a few distinctions between the real Perl-style
regular expressions and those that @option{-R} recognizes.

@enumerate
@item
Lookahead assertions do not allow repeat quantifiers after them.
Perl permits them, but they do not mean what you
might think.  For example, @samp{(?!a)@{3@}} does not assert that the
next three characters are not @samp{a}.  It just asserts three times that the
next character is not @samp{a} --- a waste of time and nothing else.

@item
Capturing subpatterns that occur inside negative lookahead
assertions are counted, but their entries are counted
as empty in the second half of an @code{s} command.
Perl sets its numerical variables from any such patterns
that are matched before the assertion fails to match
something (thereby succeeding), but only if the negative
lookahead assertion contains just one branch.

@item
The following Perl escape sequences are not supported:
@samp{\l}, @samp{\u}, @samp{\L}, @samp{\U}, @samp{\E},
@samp{\Q}.  In fact these are implemented by Perl's general
string-handling and are not part of its pattern matching engine.

@item
The Perl @samp{\G} assertion is not supported as it is not
relevant to single pattern matches.

@item
Fairly obviously, @value{SSED} does not support the @samp{(?@{code@})}
and @samp{(?p@{code@})} constructions.  However, there is some experimental
support for recursive patterns using the non-Perl item @samp{(?R)}.

@item
There are at the time of writing some oddities in Perl
5.005_02 concerned with the settings of captured strings
when part of a pattern is repeated.  For example, matching
@samp{aba} against the pattern @samp{/^(a(b)?)+$/} sets
@samp{$2}@footnote{@samp{$2} would be @samp{\2} in @value{SSED}.}
to the value @samp{b}, but matching @samp{aabbaa}
against @samp{/^(aa(bb)?)+$/} leaves @samp{$2}
unset.
However, if the pattern is changed to
@samp{/^(aa(b(b))?)+$/} then @samp{$2} (and @samp{$3}) are set.
In Perl 5.004 @samp{$2} is set in both cases, and that is also
true of @value{SSED}.

@item
Another as yet unresolved discrepancy is that in Perl
5.005_02 the pattern @samp{/^(a)?(?(1)a|b)+$/} matches
the string @samp{a}, whereas in @value{SSED} it does not.
However, in both Perl and @value{SSED} @samp{/^(a)?a/} matched
against @samp{a} leaves @samp{$1} unset.
@end enumerate
@end ifset

@node Other Resources
@chapter Other Resources for Learning About @command{sed}

@cindex Additional reading about @command{sed}
In addition to several books that have been written about @command{sed}
(either specifically or as chapters in books which discuss
shell programming), one can find out more about @command{sed}
(including suggestions of a few books) from the FAQ
for the @code{sed-users} mailing list, available from:
@display
@uref{http://sed.sourceforge.net/sedfaq.html}
@end display

Also of interest are
@uref{http://www.student.northpark.edu/pemente/sed/index.htm}
and @uref{http://sed.sf.net/grabbag},
which include @command{sed} tutorials and other @command{sed}-related goodies.

The @code{sed-users} mailing list itself is maintained by Sven Guckes.
To subscribe, visit @uref{http://groups.yahoo.com} and search
for the @code{sed-users} mailing list.

@node Reporting Bugs
@chapter Reporting Bugs

@cindex Bugs, reporting
Email bug reports to @email{bonzini@@gnu.org}.
Be sure to include the word ``sed'' somewhere in the @code{Subject:} field.
Also, please include the output of @samp{sed --version} in the body
of your report if at all possible.

Please do not send a bug report like this:

@example
@i{@i{@r{while building frobme-1.3.4}}}
$ configure
@error{} sed: file sedscr line 1: Unknown option to 's'
@end example

If @value{SSED} doesn't configure your favorite package, take a
few extra minutes to identify the specific problem and make a stand-alone
test case.  Unlike other programs such as C compilers, making such test
cases for @command{sed} is quite simple.

A stand-alone test case includes all the data necessary to perform the
test, and the specific invocation of @command{sed} that causes the problem.
The smaller a stand-alone test case is, the better.  A test case should
not involve something as far removed from @command{sed} as ``try to configure
frobme-1.3.4''.  Yes, that is in principle enough information to look
for the bug, but that is not a very practical prospect.

Here are a few commonly reported bugs that are not bugs.

@table @asis
@item @code{N} command on the last line
@cindex Portability, @code{N} command on the last line
@cindex Non-bugs, @code{N} command on the last line

Most versions of @command{sed} exit without printing anything when
the @code{N} command is issued on the last line of a file.
@value{SSED} prints pattern space before exiting unless of course
the @option{-n} command-line switch has been specified.  This choice is
by design.

For example, the behavior of
@example
sed N foo bar
@end example
@noindent
would depend on whether foo has an even or an odd number of
lines@footnote{which is the actual ``bug'' that prompted the
change in behavior}.
Or, when writing a script to read the
next few lines following a pattern match, traditional
implementations of @command{sed} would force you to write
something like
@example
/foo/@{ $!N; $!N; $!N; $!N; $!N; $!N; $!N; $!N; $!N @}
@end example
@noindent
instead of just
@example
/foo/@{ N;N;N;N;N;N;N;N;N; @}
@end example

@cindex @code{POSIXLY_CORRECT} behavior, @code{N} command
In any case, the simplest workaround is to use @code{$d;N} in
scripts that rely on the traditional behavior, or to set
the @code{POSIXLY_CORRECT} variable to a non-empty value.

@item Regex syntax clashes (problems with backslashes)
@cindex @acronym{GNU} extensions, to basic regular expressions
@cindex Non-bugs, regex syntax clashes
@command{sed} uses the @sc{posix} basic regular expression syntax.  According to
the standard, the meaning of some escape sequences is undefined in
this syntax; notable in the case of @command{sed} are @code{\|},
@code{\+}, @code{\?}, @code{\`}, @code{\'}, @code{\<},
@code{\>}, @code{\b}, @code{\B}, @code{\w}, and @code{\W}.

As in all @acronym{GNU} programs that use @sc{posix} basic regular
expressions, @command{sed} interprets these escape sequences as special
characters.  So, @code{x\+} matches one or more occurrences of @samp{x}.
@code{abc\|def} matches either @samp{abc} or @samp{def}.

This syntax may cause problems when running scripts written for other
@command{sed}s.  Some @command{sed} programs have been written with the
assumption that @code{\|} and @code{\+} match the literal characters
@code{|} and @code{+}.  Such scripts must be modified by removing the
spurious backslashes if they are to be used with modern implementations
of @command{sed}, like
@ifset PERL
@value{SSED} or
@end ifset
@acronym{GNU} @command{sed}.
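
A quick way to see which interpretation a given @command{sed} uses is
to try a couple of one-liners.  The following is only a sketch: the
sample strings are made up, and the results described are those
expected from @acronym{GNU} @command{sed}, since other implementations
may treat @code{\+} and @code{\|} differently:

@example
printf 'xxx+\n' | sed 's/x\+/Y/'
printf 'a|b\n' | sed 's/a\|b/X/g'
printf 'a|b\n' | sed 's/a|b/X/'
@end example

@noindent
With @acronym{GNU} @command{sed} the first command prints @samp{Y+}
(the run of @samp{x}s is replaced, the literal @samp{+} is kept), the
second prints @samp{X|X} (alternation), and the third prints
@samp{X} (an unescaped @samp{|} is an ordinary character).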

On the other hand, some scripts use @code{s|abc\|def||g} to remove occurrences
of @emph{either} @code{abc} or @code{def}.  While this worked until
@command{sed} 4.0.x, newer versions interpret this as removing the
string @code{abc|def}.  This is again undefined behavior according to
@acronym{POSIX}, and this interpretation is arguably more robust: older
@command{sed}s, for example, required that the regex matcher parsed
@code{\/} as @code{/} in the common case of escaping a slash, which is
again undefined behavior; the new behavior avoids this, and this is good
because the regex matcher is only partially under our control.

@cindex @acronym{GNU} extensions, special escapes
In addition, this version of @command{sed} supports several escape characters
(some of which are multi-character) to insert non-printable characters
in scripts (@code{\a}, @code{\c}, @code{\d}, @code{\o}, @code{\r},
@code{\t}, @code{\v}, @code{\x}).  These can cause similar problems
with scripts written for other @command{sed}s.

@item @option{-i} clobbers read-only files
@cindex In-place editing
@cindex @value{SSEDEXT}, in-place editing
@cindex Non-bugs, in-place editing

In short, @samp{sed -i} will let you delete the contents of
a read-only file, and in general the @option{-i} option
(@pxref{Invoking sed, , Invocation}) lets you clobber
protected files.  This is not a bug, but rather a consequence
of how the Unix filesystem works.

The permissions on a file say what can happen to the data
in that file, while the permissions on a directory say what can
happen to the list of files in that directory.  @samp{sed -i}
will not ever open for writing a file that is already on disk.
Rather, it will work on a temporary file that is finally renamed
to the original name: if you rename or delete files, you're actually
modifying the contents of the directory, so the operation depends on
the permissions of the directory, not of the file.  For this same
reason, @command{sed} does not let you use @option{-i} on a writeable file
in a read-only directory, and will break hard or symbolic links when
@option{-i} is used on such a file.

@item @code{0a} does not work (gives an error)
@cindex @code{0} address
@cindex @acronym{GNU} extensions, @code{0} address
@cindex Non-bugs, @code{0} address

There is no line 0.  0 is a special address that is only used to treat
addresses like @code{0,/@var{RE}/} as active when the script starts: if
you write @code{1,/abc/d} and the first line includes the word @samp{abc},
then that match would be ignored because address ranges must span at least
two lines (barring the end of the file); but what you probably wanted is
to delete every line up to the first one including @samp{abc}, and this
is obtained with @code{0,/abc/d}.

@ifclear PERL
@item @code{[a-z]} is case insensitive
@cindex Non-bugs, localization-related

You are encountering problems with locales.  POSIX mandates that @code{[a-z]}
uses the current locale's collation order -- in C parlance, that means using
@code{strcoll(3)} instead of @code{strcmp(3)}.  Some locales have a
case-insensitive collation order, others don't.

Another problem is that @code{[a-z]} tries to use collation symbols.
This only happens if you are on the @acronym{GNU} system, using
@acronym{GNU} libc's regular expression matcher instead of compiling the
one supplied with @acronym{GNU} sed.
In a Danish locale, for example, 2778the regular expression @code{^[a-z]$} matches the string @samp{aa}, 2779because this is a single collating symbol that comes after @samp{a} 2780and before @samp{b}; @samp{ll} behaves similarly in Spanish 2781locales, or @samp{ij} in Dutch locales. 2782 2783To work around these problems, which may cause bugs in shell scripts, set 2784the @env{LC_COLLATE} and @env{LC_CTYPE} environment variables to @samp{C}. 2785 2786@item @code{s/.*//} does not clear pattern space 2787@cindex Non-bugs, localization-related 2788@cindex @value{SSEDEXT}, emptying pattern space 2789@cindex Emptying pattern space 2790 2791This happens if your input stream includes invalid multibyte 2792sequences. @sc{posix} mandates that such sequences 2793are @emph{not} matched by @samp{.}, so that @samp{s/.*//} will not clear 2794pattern space as you would expect. In fact, there is no way to clear 2795sed's buffers in the middle of the script in most multibyte locales 2796(including UTF-8 locales). For this reason, @value{SSED} provides a `z' 2797command (for `zap') as an extension. 2798 2799To work around these problems, which may cause bugs in shell scripts, set 2800the @env{LC_COLLATE} and @env{LC_CTYPE} environment variables to @samp{C}. 2801@end ifclear 2802@end table 2803 2804 2805@node Extended regexps 2806@appendix Extended regular expressions 2807@cindex Extended regular expressions, syntax 2808 2809The only difference between basic and extended regular expressions is in 2810the behavior of a few characters: @samp{?}, @samp{+}, parentheses, 2811and braces (@samp{@{@}}). While basic regular expressions require 2812these to be escaped if you want them to behave as special characters, 2813when using extended regular expressions you must escape them if 2814you want them @emph{to match a literal character}. 2815 2816@noindent 2817Examples: 2818@table @code 2819@item abc? 2820becomes @samp{abc\?} when using extended regular expressions. 
It matches 2821the literal string @samp{abc?}. 2822 2823@item c\+ 2824becomes @samp{c+} when using extended regular expressions. It matches 2825one or more @samp{c}s. 2826 2827@item a\@{3,\@} 2828becomes @samp{a@{3,@}} when using extended regular expressions. It matches 2829three or more @samp{a}s. 2830 2831@item \(abc\)\@{2,3\@} 2832becomes @samp{(abc)@{2,3@}} when using extended regular expressions. It 2833matches either @samp{abcabc} or @samp{abcabcabc}. 2834 2835@item \(abc*\)\1 2836becomes @samp{(abc*)\1} when using extended regular expressions. 2837Backreferences must still be escaped when using extended regular 2838expressions. 2839@end table 2840 2841@ifset PERL 2842@node Perl regexps 2843@appendix Perl-style regular expressions 2844@cindex Perl-style regular expressions, syntax 2845 2846@emph{This part is taken from the @file{pcre.txt} file distributed together 2847with the free @sc{pcre} regular expression matcher; it was written by Philip Hazel.} 2848 2849Perl introduced several extensions to regular expressions, some 2850of them incompatible with the syntax of regular expressions 2851accepted by Emacs and other @acronym{GNU} tools (whose matcher was 2852based on the Emacs matcher). @value{SSED} implements 2853both kinds of extensions. 
2854 2855@iftex 2856Summarizing, we have: 2857 2858@itemize @bullet 2859@item 2860A backslash can introduce several special sequences 2861 2862@item 2863The circumflex, dollar sign, and period characters behave specially 2864with regard to new lines 2865 2866@item 2867Strange uses of square brackets are parsed differently 2868 2869@item 2870You can toggle modifiers in the middle of a regular expression 2871 2872@item 2873You can specify that a subpattern does not count when numbering backreferences 2874 2875@item 2876@cindex Greedy regular expression matching 2877You can specify greedy or non-greedy matching 2878 2879@item 2880You can have more than ten back references 2881 2882@item 2883You can do complex look aheads and look behinds (in the spirit of 2884@code{\b}, but with subpatterns). 2885 2886@item 2887You can often improve performance by avoiding that @command{sed} wastes 2888time with backtracking 2889 2890@item 2891You can have if/then/else branches 2892 2893@item 2894You can do recursive matches, for example to look for unbalanced parentheses 2895 2896@item 2897You can have comments and non-significant whitespace, because things can 2898get complex... 2899@end itemize 2900 2901Most of these extensions are introduced by the special @code{(?} 2902sequence, which gives special meanings to parenthesized groups. 2903@end iftex 2904@menu 2905Other extensions can be roughly subdivided in two categories 2906On one hand Perl introduces several more escaped sequences 2907(that is, sequences introduced by a backslash). On the other 2908hand, it specifies that if a question mark follows an open 2909parentheses it should give a special meaning to the parenthesized 2910group. 
2911 2912* Backslash:: Introduces special sequences 2913* Circumflex/dollar sign/period:: Behave specially with regard to new lines 2914* Square brackets:: Are a bit different in strange cases 2915* Options setting:: Toggle modifiers in the middle of a regexp 2916* Non-capturing subpatterns:: Are not counted when backreferencing 2917* Repetition:: Allows for non-greedy matching 2918* Backreferences:: Allows for more than 10 back references 2919* Assertions:: Allows for complex look ahead matches 2920* Non-backtracking subpatterns:: Often gives more performance 2921* Conditional subpatterns:: Allows if/then/else branches 2922* Recursive patterns:: For example to match parentheses 2923* Comments:: Because things can get complex... 2924@end menu 2925 2926@node Backslash 2927@appendixsec Backslash 2928@cindex Perl-style regular expressions, escaped sequences 2929 2930There are a few differences in the handling of backslashed 2931sequences in Perl mode. 2932 2933First of all, there are no @code{\o} and @code{\d} sequences. 2934@sc{ascii} values for characters can be specified in octal 2935with a @code{\@var{xxx}} sequence, where @var{xxx} is a 2936sequence of up to three octal digits. If the first digit 2937is a zero, the treatment of the sequence is straightforward; 2938just note that if the character that follows the escaped digit 2939is itself an octal digit, you have to supply three octal digits 2940for @var{xxx}. For example, @code{\07} is a @sc{bel} character 2941rather than a @sc{nul} and a literal @code{7} (this sequence is 2942instead represented by @code{\0007}). 2943 2944@cindex Perl-style regular expressions, backreferences 2945The handling of a backslash followed by a digit other than 0 2946is complicated. Outside a character class, @command{sed} reads it 2947and any following digits as a decimal number. 
If the number 2948is less than 10, or if there have been at least that many 2949previous capturing left parentheses in the expression, the 2950entire sequence is taken as a back reference. A description 2951of how this works is given later, following the discussion 2952of parenthesized subpatterns. 2953 2954Inside a character class, or if the decimal number is 2955greater than 9 and there have not been that many capturing 2956subpatterns, @command{sed} re-reads up to three octal digits following 2957the backslash, and generates a single byte from the 2958least significant 8 bits of the value. Any subsequent digits 2959stand for themselves. For example: 2960 2961@example 2962\040 @i{@r{is another way of writing a space}} 2963\40 @i{@r{is the same, provided there are fewer than 40}} 2964 @i{@r{previous capturing subpatterns}} 2965\7 @i{@r{is always a back reference}} 2966\011 @i{@r{is always a tab}} 2967\11 @i{@r{might be a back reference, or another way of writing a tab}} 2968\0113 @i{@r{is a tab followed by the character @samp{3}}} 2969\113 @i{@r{is the character with octal code 113 (since there}} 2970 @i{@r{can be no more than 99 back references)}} 2971\377 @i{@r{is a byte consisting entirely of 1 bits (@sc{ascii} 255)}} 2972\81 @i{@r{is either a back reference, or a binary zero}} 2973 @i{@r{followed by the two characters @samp{81}}} 2974@end example 2975 2976Note that octal values of 100 or greater must not be introduced 2977by a leading zero, because no more than three octal 2978digits are ever read. Note that this applies only to the LHS 2979pattern; it is not possible yet to specify more than 9 backreferences 2980on the RHS of the `s' command. 2981 2982All the sequences that define a single byte value can be 2983used both inside and outside character classes. In addition, 2984inside a character class, the sequence @code{\b} is interpreted 2985as the backspace character (hex 08). Outside a character 2986class it has a different meaning (see below). 
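
The digit-handling rules above are easier to check against the table when written out as a procedure. The following is only a sketch of the rule as this section describes it, written in Python; it is not code taken from @command{sed} or from the regex matcher, and the names @code{interpret_escape}, @code{digits}, @code{groups} and @code{in_class} are hypothetical.

```python
OCTAL_DIGITS = "01234567"


def interpret_escape(digits, groups, in_class=False):
    """Classify the run of decimal digits that follows a backslash.

    digits   -- the digits after the backslash, e.g. "0113"
    groups   -- number of capturing left parentheses seen so far
    in_class -- True when the escape sits inside a character class

    Returns a tuple (kind, value, literal_rest).
    """
    # Outside a class, a leading non-zero digit is first read as a decimal number.
    if digits[0] != "0" and not in_class:
        number = int(digits)
        if number < 10 or number <= groups:
            return ("backref", number, "")
    # Otherwise re-read at most three octal digits; later digits are literal.
    octal = ""
    for ch in digits[:3]:
        if ch not in OCTAL_DIGITS:
            break
        octal += ch
    # Only the least significant 8 bits of the octal value are kept.
    value = (int(octal, 8) & 0xFF) if octal else 0
    return ("byte", value, digits[len(octal):])


if __name__ == "__main__":
    for digits, groups in [("040", 0), ("40", 3), ("7", 0), ("11", 11),
                           ("11", 3), ("0113", 0), ("113", 5), ("81", 2)]:
        print(digits, groups, interpret_escape(digits, groups))
```

With two capturing groups seen so far, the sketch reads @code{\81} as a zero byte followed by the literal characters @samp{81}, and with eleven groups it reads @code{\11} as a back reference, which agrees with the table.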
2987 2988In addition, there are four additional escapes specifying 2989generic character classes (like @code{\w} and @code{\W} do): 2990 2991@cindex Perl-style regular expressions, character classes 2992@table @samp 2993@item \d 2994Matches any decimal digit 2995 2996@item \D 2997Matches any character that is not a decimal digit 2998@end table 2999 3000In Perl mode, these character type sequences can appear both inside and 3001outside character classes. Instead, in @sc{posix} mode these sequences 3002(as well as @code{\w} and @code{\W}) are treated as two literal characters 3003(a backslash and a letter) inside square brackets. 3004 3005Escaped sequences specifying assertions are also different in 3006Perl mode. An assertion specifies a condition that has to be met 3007at a particular point in a match, without consuming any 3008characters from the subject string. The use of subpatterns 3009for more complicated assertions is described below. The 3010backslashed assertions are 3011 3012@cindex Perl-style regular expressions, assertions 3013@table @samp 3014@item \b 3015Asserts that the point is at a word boundary. 3016A word boundary is a position in the subject string where 3017the current character and the previous character do not both 3018match @code{\w} or @code{\W} (i.e. one matches @code{\w} and 3019the other matches @code{\W}), or the start or end of the string 3020if the first or last character matches @code{\w}, respectively. 3021 3022@item \B 3023Asserts that the point is not at a word boundary. 3024 3025@item \A 3026Asserts the matcher is at the start of pattern space (independent 3027of multiline mode). 
3028 3029@item \Z 3030Asserts the matcher is at the end of pattern space, 3031or at a newline before the end of pattern space (independent of 3032multiline mode) 3033 3034@item \z 3035Asserts the matcher is at the end of pattern space (independent 3036of multiline mode) 3037@end table 3038 3039These assertions may not appear in character classes (but 3040note that @code{\b} has a different meaning, namely the 3041backspace character, inside a character class). 3042Note that Perl mode does not support directly assertions 3043for the beginning and the end of word; the @acronym{GNU} extensions 3044@code{\<} and @code{\>} achieve this purpose in @sc{posix} mode 3045instead. 3046 3047The @code{\A}, @code{\Z}, and @code{\z} assertions differ 3048from the traditional circumflex and dollar sign (described below) 3049in that they only ever match at the very start and end of the 3050subject string, whatever options are set; in particular @code{\A} 3051and @code{\z} are the same as the @acronym{GNU} extensions 3052@code{\`} and @code{\'} that are active in @sc{posix} mode. 3053 3054@node Circumflex/dollar sign/period 3055@appendixsec Circumflex, dollar sign, period 3056@cindex Perl-style regular expressions, newlines 3057 3058Outside a character class, in the default matching mode, the 3059circumflex character is an assertion which is true only if 3060the current matching point is at the start of the subject 3061string. Inside a character class, the circumflex has an entirely 3062different meaning (see below). 3063 3064The circumflex need not be the first character of the pattern if 3065a number of alternatives are involved, but it should be the 3066first thing in each alternative in which it appears if the 3067pattern is ever to match that branch. If all possible alternatives, 3068start with a circumflex, that is, if the pattern is 3069constrained to match only at the start of the subject, it is 3070said to be an @dfn{anchored} pattern. 
(There are also other constructs 3071that can cause a pattern to be anchored.) 3072 3073A dollar sign is an assertion which is true only if the 3074current matching point is at the end of the subject string, 3075or immediately before a newline character that is the last 3076character in the string (by default). A dollar sign need not be the 3077last character of the pattern if a number of alternatives 3078are involved, but it should be the last item in any branch 3079in which it appears. A dollar sign has no special meaning in a 3080character class. 3081 3082@cindex Perl-style regular expressions, multiline 3083The meanings of the circumflex and dollar sign characters are 3084changed if the @code{M} modifier option is used. When this is 3085the case, they match immediately after and immediately 3086before an internal @code{\n} character, respectively, in addition 3087to matching at the start and end of the subject string. For 3088example, the pattern @code{/^abc$/} matches the subject string 3089@samp{def\nabc} in multiline mode, but not otherwise. Consequently, 3090patterns that are anchored in single line mode 3091because all branches start with @code{^} are not anchored in 3092multiline mode. 3093 3094@cindex Perl-style regular expressions, multiline 3095Note that the sequences @code{\A}, @code{\Z}, and @code{\z} 3096can be used to match the start and end of the subject in both 3097modes, and if all branches of a pattern start with @code{\A} 3098it is always anchored, whether the @code{M} modifier is set or not. 3099 3100@cindex Perl-style regular expressions, single line 3101Outside a character class, a dot in the pattern matches any 3102one character in the subject, including a non-printing character, 3103but not (by default) newline. If the @code{S} modifier is used, 3104dots match newlines as well. 
Actually, the handling of 3105dot is entirely independent of the handling of circumflex 3106and dollar sign, the only relationship being that they both 3107involve newline characters. Dot has no special meaning in a 3108character class. 3109 3110@node Square brackets 3111@appendixsec Square brackets 3112@cindex Perl-style regular expressions, character classes 3113 3114An opening square bracket introduces a character class, terminated 3115by a closing square bracket. A closing square bracket on its own 3116is not special. If a closing square bracket is required as a 3117member of the class, it should be the first data character in 3118the class (after an initial circumflex, if present) or escaped with a backslash. 3119 3120A character class matches a single character in the subject; 3121the character must be in the set of characters defined by 3122the class, unless the first character in the class is a circumflex, 3123in which case the subject character must not be in 3124the set defined by the class. If a circumflex is actually 3125required as a member of the class, ensure it is not the 3126first character, or escape it with a backslash. 3127 3128For example, the character class @code{[aeiou]} matches any lower 3129case vowel, while @code{[^aeiou]} matches any character that is not 3130a lower case vowel. Note that a circumflex is just a convenient 3131notation for specifying the characters which are in 3132the class by enumerating those that are not. It is not an 3133assertion: it still consumes a character from the subject 3134string, and fails if the current pointer is at the end of 3135the string. 
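
The points above about negated classes and about placing @samp{]} or @samp{^} inside a class can be tried out quickly. The sketch below uses Python's @code{re} module as a stand-in for a Perl-style matcher; this is an assumption made for illustration only, since it is not the matcher used by @command{sed}, and the helper name @code{matches_whole} is hypothetical.

```python
import re


def matches_whole(pattern, text):
    """Return True when the pattern matches the entire text."""
    return re.fullmatch(pattern, text) is not None


if __name__ == "__main__":
    print(matches_whole(r"[aeiou]", "e"))    # a lower case vowel
    print(matches_whole(r"[^aeiou]", "x"))   # any character that is not one
    # A negated class is not an assertion: it needs a character to consume.
    print(matches_whole(r"[^aeiou]", ""))
    # A closing bracket placed first is an ordinary member of the class.
    print(matches_whole(r"[]a]", "]"))
    # A circumflex that is not first is an ordinary member too.
    print(matches_whole(r"[a^]", "^"))
```

The third call is the interesting one: the negated class does not match the empty string, because there is no character left for it to consume.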
3136 3137@cindex Perl-style regular expressions, case-insensitive 3138When caseless matching is set, any letters in a class 3139represent both their upper case and lower case versions, so 3140for example, a caseless @code{[aeiou]} matches uppercase 3141and lowercase @samp{A}s, and a caseless @code{[^aeiou]} 3142does not match @samp{A}, whereas a case-sensitive version would. 3143 3144@cindex Perl-style regular expressions, single line 3145@cindex Perl-style regular expressions, multiline 3146The newline character is never treated in any special way in 3147character classes, whatever the setting of the @code{S} and 3148@code{M} options (modifiers) is. A class such as @code{[^a]} will 3149always match a newline. 3150 3151The minus (hyphen) character can be used to specify a range 3152of characters in a character class. For example, @code{[d-m]} 3153matches any letter between d and m, inclusive. If a minus 3154character is required in a class, it must be escaped with a 3155backslash or appear in a position where it cannot be interpreted 3156as indicating a range, typically as the first or last 3157character in the class. 3158 3159It is not possible to have the literal character @code{]} as the 3160end character of a range. A pattern such as @code{[W-]46]} is 3161interpreted as a class of two characters (@code{W} and @code{-}) 3162followed by a literal string @code{46]}, so it would match 3163@samp{W46]} or @samp{-46]}. However, if the @code{]} is escaped 3164with a backslash it is interpreted as the end of range, so 3165@code{[W-\]46]} is interpreted as a single class containing a 3166range followed by two separate characters. The octal or 3167hexadecimal representation of @code{]} can also be used to end a range. 3168 3169Ranges operate in @sc{ascii} collating sequence. They can also be 3170used for characters specified numerically, for example 3171@code{[\000-\037]}. 
If a range that includes letters is used when 3172caseless matching is set, it matches the letters in either 3173case. For example, a caseless @code{[W-c]} is equivalent to 3174@code{[][\^_`wxyzabc]}, matched caselessly, and if character 3175tables for the French locale are in use, @code{[\xc8-\xcb]} 3176matches accented E characters in both cases. 3177 3178Unlike in @sc{posix} mode, the character types @code{\d}, 3179@code{\D}, @code{\s}, @code{\S}, @code{\w}, and @code{\W} 3180may also appear in a character class, and add the characters 3181that they match to the class. For example, @code{[\dABCDEF]} matches any 3182hexadecimal digit. A circumflex can conveniently be used 3183with the upper case character types to specify a more restricted 3184set of characters than the matching lower case type. 3185For example, the class @code{[^\W_]} matches any letter or digit, 3186but not underscore. 3187 3188All non-alphameric characters other than @code{\}, @code{-}, 3189@code{^} (at the start) and the terminating @code{]} 3190are non-special in character classes, but it does no harm 3191if they are escaped. 3192 3193Perl 5.6 supports the @sc{posix} notation for character classes, which 3194uses names enclosed by @code{[:} and @code{:]} within the enclosing 3195square brackets, and @value{SSED} supports this notation as well. 3196For example, 3197 3198@example 3199[01[:alpha:]%] 3200@end example 3201 3202@noindent 3203matches @samp{0}, @samp{1}, any alphabetic character, or @samp{%}. 
3204The supported class names are 3205 3206@table @code 3207@item alnum 3208Matches letters and digits 3209 3210@item alpha 3211Matches letters 3212 3213@item ascii 3214Matches character codes 0 - 127 3215 3216@item cntrl 3217Matches control characters 3218 3219@item digit 3220Matches decimal digits (same as \d) 3221 3222@item graph 3223Matches printing characters, excluding space 3224 3225@item lower 3226Matches lower case letters 3227 3228@item print 3229Matches printing characters, including space 3230 3231@item punct 3232Matches printing characters, excluding letters and digits 3233 3234@item space 3235Matches white space (same as \s) 3236 3237@item upper 3238Matches upper case letters 3239 3240@item word 3241Matches ``word'' characters (same as \w) 3242 3243@item xdigit 3244Matches hexadecimal digits 3245@end table 3246 3247The names @code{ascii} and @code{word} are extensions valid only in 3248Perl mode. Another Perl extension is negation, which is 3249indicated by a circumflex character after the colon. For example, 3250 3251@example 3252[12[:^digit:]] 3253@end example 3254 3255@noindent 3256matches @samp{1}, @samp{2}, or any non-digit. 3257 3258@node Options setting 3259@appendixsec Options setting 3260@cindex Perl-style regular expressions, toggling options 3261@cindex Perl-style regular expressions, case-insensitive 3262@cindex Perl-style regular expressions, multiline 3263@cindex Perl-style regular expressions, single line 3264@cindex Perl-style regular expressions, extended 3265 3266The settings of the @code{I}, @code{M}, @code{S}, @code{X} 3267modifiers can be changed from within the pattern by 3268a sequence of Perl option letters enclosed between @code{(?} 3269and @code{)}. The option letters must be lowercase. 3270 3271For example, @code{(?im)} sets caseless, multiline matching. 
It is 3272also possible to unset these options by preceding the letter 3273with a hyphen; you can also have combined settings and unsettings: 3274@code{(?im-sx)} sets caseless and multiline matching, 3275while unsetting single line matching (for dots) and extended 3276whitespace interpretation. If a letter appears both before 3277and after the hyphen, the option is unset. 3278 3279The scope of these option changes depends on where in the 3280pattern the setting occurs. For settings that are outside 3281any subpattern (defined below), the effect is the same as if 3282the options were set or unset at the start of matching. The 3283following patterns all behave in exactly the same way: 3284 3285@example 3286(?i)abc 3287a(?i)bc 3288ab(?i)c 3289abc(?i) 3290@end example 3291 3292which in turn is the same as specifying the pattern @code{abc} with 3293the @code{I} modifier. In other words, ``top level'' settings 3294apply to the whole pattern (unless there are other 3295changes inside subpatterns). If there is more than one setting 3296of the same option at top level, the rightmost setting 3297is used. 3298 3299If an option change occurs inside a subpattern, the effect 3300is different. This is a change of behaviour in Perl 5.005. 3301An option change inside a subpattern affects only that part 3302of the subpattern @emph{that follows} it, so 3303 3304@example 3305(a(?i)b)c 3306@end example 3307 3308@noindent 3309matches @samp{abc} and @samp{aBc} and no other strings (assuming 3310case-sensitive matching is used). By this means, options can 3311be made to have different settings in different parts of the 3312pattern. Any changes made in one alternative do carry on 3313into subsequent branches within the same subpattern. For 3314example, 3315 3316@example 3317(a(?i)b|c) 3318@end example 3319 3320@noindent 3321matches @samp{ab}, @samp{aB}, @samp{c}, and @samp{C}, 3322even though when matching @samp{C} the first branch is 3323abandoned before the option setting. 
3324This is because the effects of option settings happen at 3325compile time. There would be some very weird behaviour otherwise. 3326 3327@ignore 3328There are two PCRE-specific options PCRE_UNGREEDY and PCRE_EXTRA 3329that can be changed in the same way as the Perl-compatible options by 3330using the characters U and X respectively. The (?X) flag 3331setting is special in that it must always occur earlier in 3332the pattern than any of the additional features it turns on, 3333even when it is at top level. It is best put at the start. 3334@end ignore 3335 3336 3337@node Non-capturing subpatterns 3338@appendixsec Non-capturing subpatterns 3339@cindex Perl-style regular expressions, non-capturing subpatterns 3340 3341Marking part of a pattern as a subpattern does two things. 3342On one hand, it localizes a set of alternatives; on the other 3343hand, it sets up the subpattern as a capturing subpattern (as 3344defined above). The subpattern can be backreferenced and 3345referenced in the right side of @code{s} commands. 3346 3347For example, if the string @samp{the red king} is matched against 3348the pattern 3349 3350@example 3351the ((red|white) (king|queen)) 3352@end example 3353 3354@noindent 3355the captured substrings are @samp{red king}, @samp{red}, 3356and @samp{king}, and are numbered 1, 2, and 3. 3357 3358The fact that plain parentheses fulfil two functions is not 3359always helpful. There are often times when a grouping 3360subpattern is required without a capturing requirement. If an 3361opening parenthesis is followed by @code{?:}, the subpattern does 3362not do any capturing, and is not counted when computing the 3363number of any subsequent capturing subpatterns. For example, 3364if the string @samp{the white queen} is matched against the pattern 3365 3366@example 3367the ((?:red|white) (king|queen)) 3368@end example 3369 3370@noindent 3371the captured substrings are @samp{white queen} and @samp{queen}, 3372and are numbered 1 and 2. 
The maximum number of captured 3373substrings is 99, while the maximum number of all subpatterns, 3374both capturing and non-capturing, is 200. 3375 3376As a convenient shorthand, if any option settings are 3377required at the start of a non-capturing subpattern, the 3378option letters may appear between the @code{?} and the 3379@code{:}. Thus the two patterns 3380 3381@example 3382(?i:saturday|sunday) 3383(?:(?i)saturday|sunday) 3384@end example 3385 3386@noindent 3387match exactly the same set of strings. Because alternative 3388branches are tried from left to right, and options are not 3389reset until the end of the subpattern is reached, an option 3390setting in one branch does affect subsequent branches, so 3391the above patterns match @samp{SUNDAY} as well as @samp{Saturday}. 3392 3393 3394@node Repetition 3395@appendixsec Repetition 3396@cindex Perl-style regular expressions, repetitions 3397 3398Repetition is specified by quantifiers, which can follow any 3399of the following items: 3400 3401@itemize @bullet 3402@item 3403a single character, possibly escaped 3404 3405@item 3406the @code{.} special character 3407 3408@item 3409a character class 3410 3411@item 3412a back reference (see next section) 3413 3414@item 3415a parenthesized subpattern (unless it is an assertion; @pxref{Assertions}) 3416@end itemize 3417 3418The general repetition quantifier specifies a minimum and 3419maximum number of permitted matches, by giving the two 3420numbers in curly brackets (braces), separated by a comma. 3421The numbers must be less than 65536, and the first must be 3422less than or equal to the second. For example: 3423 3424@example 3425z@{2,4@} 3426@end example 3427 3428@noindent 3429matches @samp{zz}, @samp{zzz}, or @samp{zzzz}. A closing brace on its own 3430is not a special character. 
If the second number is omitted, 3431but the comma is present, there is no upper limit; if the 3432second number and the comma are both omitted, the quantifier 3433specifies an exact number of required matches. Thus 3434 3435@example 3436[aeiou]@{3,@} 3437@end example 3438 3439@noindent 3440matches at least 3 successive vowels, but may match many 3441more, while 3442 3443@example 3444\d@{8@} 3445@end example 3446 3447@noindent 3448matches exactly 8 digits. An opening curly bracket that 3449appears in a position where a quantifier is not allowed, or 3450one that does not match the syntax of a quantifier, is taken 3451as a literal character. For example, @{,6@} is not a quantifier, 3452but a literal string of four characters.@footnote{It 3453raises an error if @option{-R} is not used.} 3454 3455The quantifier @samp{@{0@}} is permitted, causing the expression to 3456behave as if the previous item and the quantifier were not 3457present. 3458 3459For convenience (and historical compatibility) the three 3460most common quantifiers have single-character abbreviations: 3461 3462@table @code 3463@item * 3464is equivalent to @{0,@} 3465 3466@item + 3467is equivalent to @{1,@} 3468 3469@item ? 3470is equivalent to @{0,1@} 3471@end table 3472 3473It is possible to construct infinite loops by following a 3474subpattern that can match no characters with a quantifier 3475that has no upper limit, for example: 3476 3477@example 3478(a?)* 3479@end example 3480 3481Earlier versions of Perl used to give an error at 3482compile time for such patterns. However, because there are 3483cases where this can be useful, such patterns are now 3484accepted, but if any repetition of the subpattern does in 3485fact match no characters, the loop is forcibly broken. 
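
The quantifier forms described in this section can be exercised with a few lines of Python. This is a sketch under the assumption that Python's @code{re} module behaves like the Perl-style matcher for these particular patterns; it is not @command{sed} itself, and @code{first_match} is a hypothetical helper name.

```python
import re


def first_match(pattern, text):
    """Return the text of the leftmost match of pattern, or None."""
    m = re.search(pattern, text)
    return m.group(0) if m else None


if __name__ == "__main__":
    # Bounded repetition: greedy, but never more than four.
    print(first_match(r"z{2,4}", "zzzzzz"))
    # No upper limit: at least three successive vowels.
    print(first_match(r"[aeiou]{3,}", "queueing"))
    # Exact count.
    print(first_match(r"\d{8}", "id 1234567890"))
    # A subpattern that can match nothing, under an unbounded quantifier:
    # matching still terminates instead of looping forever.
    print(first_match(r"(a?)*", "aab"))
```

The last call returns after consuming the two leading @samp{a}s, which shows the loop being broken once an iteration of the subpattern matches no characters.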
3486 3487@cindex Greedy regular expression matching 3488@cindex Perl-style regular expressions, stingy repetitions 3489By default, the quantifiers are @dfn{greedy} like in @sc{posix} 3490mode, that is, they match as much as possible (up to the maximum 3491number of permitted times), without causing the rest of the 3492pattern to fail. The classic example of where this gives problems 3493is in trying to match comments in C programs. These appear between 3494the sequences @code{/*} and @code{*/} and within the sequence, individual 3495@code{*} and @code{/} characters may appear. An attempt to match C 3496comments by applying the pattern 3497 3498@example 3499/\*.*\*/ 3500@end example 3501 3502@noindent 3503to the string 3504 3505@example 3506/* first comment */ not comment /* second comment */ 3507@end example 3508 3509@noindent 3510 3511fails, because it matches the entire string owing to the 3512greediness of the @code{.*} item. 3513 3514However, if a quantifier is followed by a question mark, it 3515ceases to be greedy, and instead matches the minimum number 3516of times possible, so the pattern @code{/\*.*?\*/} 3517does the right thing with the C comments. The meaning of the 3518various quantifiers is not otherwise changed, just the preferred 3519number of matches. Do not confuse this use of question 3520mark with its use as a quantifier in its own right. 3521Because it has two uses, it can sometimes appear doubled, as in 3522 3523@example 3524\d??\d 3525@end example 3526 3527which matches one digit by preference, but can match two if 3528that is the only way the rest of the pattern matches. 3529 3530Note that greediness does not matter when specifying addresses, 3531but can nevertheless be used to improve performance. 3532 3533@ignore 3534If the PCRE_UNGREEDY option is set (an option which is not 3535available in Perl), the quantifiers are not greedy by 3536default, but individual ones can be made greedy by following 3537them with a question mark. 
In other words, it inverts the 3538default behaviour. 3539@end ignore 3540 3541When a parenthesized subpattern is quantified with a minimum 3542repeat count that is greater than 1 or with a limited maximum, 3543more store is required for the compiled pattern, in 3544proportion to the size of the minimum or maximum. 3545 3546@cindex Perl-style regular expressions, single line 3547If a pattern starts with @code{.*} or @code{.@{0,@}} and the 3548@code{S} modifier is used, the pattern is implicitly anchored, 3549because whatever follows will be tried against every character 3550position in the subject string, so there is no point in 3551retrying the overall match at any position after the first. 3552PCRE treats such a pattern as though it were preceded by \A. 3553 3554When a capturing subpattern is repeated, the value captured 3555is the substring that matched the final iteration. For example, 3556after 3557 3558@example 3559(tweedle[dume]@{3@}\s*)+ 3560@end example 3561 3562@noindent 3563has matched @samp{tweedledum tweedledee} the value of the 3564captured substring is @samp{tweedledee}. However, if there are 3565nested capturing subpatterns, the corresponding captured 3566values may have been set in previous iterations. For example, 3567after 3568 3569@example 3570/(a|(b))+/ 3571@end example 3572 3573matches @samp{aba}, the value of the second captured substring is 3574@samp{b}. 3575 3576@node Backreferences 3577@appendixsec Backreferences 3578@cindex Perl-style regular expressions, backreferences 3579 3580Outside a character class, a backslash followed by a digit 3581greater than 0 (and possibly further digits) is a back 3582reference to a capturing subpattern earlier (i.e. to its 3583left) in the pattern, provided there have been that many 3584previous capturing left parentheses. 
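
What a back reference compares against is whatever its group captured most recently, so it helps to see which value a repeated group ends up holding. The sketch below reuses the two repeated-subpattern examples from the end of the previous section and runs them through Python's @code{re} module, which is assumed here to agree with the Perl-style matcher on this point; it is not the matcher used by @command{sed}, and @code{captured_groups} is a hypothetical helper name.

```python
import re


def captured_groups(pattern, text):
    """Return the tuple of captured substrings, or None if there is no match."""
    m = re.match(pattern, text)
    return m.groups() if m else None


if __name__ == "__main__":
    # The group holds the substring matched by the final iteration.
    print(captured_groups(r"(tweedle[dume]{3}\s*)+", "tweedledum tweedledee"))
    # The nested group keeps the value set in an earlier iteration.
    print(captured_groups(r"(a|(b))+", "aba"))
```

In the second call the outer group ends on the last @samp{a}, while the inner group still holds the @samp{b} captured one iteration earlier.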

However, if the decimal number following the backslash is
less than 10, it is always taken as a back reference, and
causes an error only if there are not that many capturing
left parentheses in the entire pattern.  In other words, the
parentheses that are referenced need not be to the left of
the reference for numbers less than 10.  @xref{Backslash},
for further details of the handling of digits following a backslash.

A back reference matches whatever actually matched the capturing
subpattern in the current subject string, rather than
anything matching the subpattern itself.  So the pattern

@example
(sens|respons)e and \1ibility
@end example

@noindent
matches @samp{sense and sensibility} and @samp{response and responsibility},
but not @samp{sense and responsibility}.  If caseful
matching is in force at the time of the back reference, the
case of letters is relevant.  For example,

@example
((?i)blah)\s+\1
@end example

@noindent
matches @samp{blah blah} and @samp{Blah Blah}, but not
@samp{BLAH blah}, even though the original capturing
subpattern is matched caselessly.

There may be more than one back reference to the same subpattern.
Also, if a subpattern has not actually been used in a
particular match, any back references to it always fail.  For
example, the pattern

@example
(a|(bc))\2
@end example

@noindent
always fails if it starts to match @samp{a} rather than
@samp{bc}.  Because there may be up to 99 back references, all
digits following the backslash are taken as part of a potential
back reference number; this is different from what happens
in @sc{posix} mode.  If the pattern continues with a digit
character, some delimiter must be used to terminate the back
reference.  If the @code{X} modifier option is set, this can be
whitespace.
Otherwise an empty comment can be used, or the
following character can be expressed in hexadecimal or octal.
Note that this applies only to the LHS pattern; it is
not yet possible to specify more than 9 back references on the
RHS of the @code{s} command.

A back reference that occurs inside the parentheses to which
it refers fails when the subpattern is first used, so, for
example, @code{(a\1)} never matches.  However, such references
can be useful inside repeated subpatterns.  For example, the
pattern

@example
(a|b\1)+
@end example

@noindent
matches any number of @samp{a}s and also @samp{aba}, @samp{ababbaa},
etc.  At each iteration of the subpattern, the back reference matches
the character string corresponding to the previous iteration.  In
order for this to work, the pattern must be such that the first
iteration does not need to match the back reference.  This can be
done using alternation, as in the example above, or by a
quantifier with a minimum of zero.

@node Assertions
@appendixsec Assertions
@cindex Perl-style regular expressions, assertions
@cindex Perl-style regular expressions, asserting subpatterns

An assertion is a test on the characters following or
preceding the current matching point that does not actually
consume any characters.  The simple assertions coded as @code{\b},
@code{\B}, @code{\A}, @code{\Z}, @code{\z}, @code{^} and @code{$}
are described above.  More complicated assertions are coded as
subpatterns.  There are two kinds: those that look ahead of the
current position in the subject string, and those that look behind it.

@cindex Perl-style regular expressions, lookahead subpatterns
An assertion subpattern is matched in the normal way, except
that it does not cause the current matching position to be
changed.
Lookahead assertions start with @code{(?=} for positive
assertions and @code{(?!} for negative assertions.  For example,

@example
\w+(?=;)
@end example

@noindent
matches a word followed by a semicolon, but does not include
the semicolon in the match, and

@example
foo(?!bar)
@end example

@noindent
matches any occurrence of @samp{foo} that is not followed by
@samp{bar}.

Note that the apparently similar pattern

@example
(?!foo)bar
@end example

@cindex Perl-style regular expressions, lookbehind subpatterns
@noindent
finds any occurrence of @samp{bar} even if it is preceded by
@samp{foo}, because the assertion @code{(?!foo)} is always true
when the next three characters are @samp{bar}.  A lookbehind
assertion is needed to achieve this effect.
Lookbehind assertions start with @code{(?<=} for positive
assertions and @code{(?<!} for negative assertions.  So,

@example
(?<!foo)bar
@end example

@noindent
achieves the required effect of finding an occurrence of
@samp{bar} that is not preceded by @samp{foo}.  The contents of a
lookbehind assertion are restricted
such that all the strings it matches must have a fixed
length.  However, if there are several alternatives, they do
not all have to have the same fixed length.  This is an extension
compared with Perl 5.005, which requires all branches to match
the same length of string.  Thus

@example
(?<=dogs|cats|)
@end example

@noindent
is permitted, but the apparently equivalent regular expression

@example
(?<!dogs?|cats?)
@end example

@noindent
causes an error at compile time.
Branches that match different
length strings are permitted only at the top level of
a lookbehind assertion: an assertion such as

@example
(?<=ab(c|de))
@end example

@noindent
is not permitted, because its single top-level branch can
match two different lengths, but it is acceptable if rewritten
to use two top-level branches:

@example
(?<=abc|abde)
@end example

All this is required because lookbehind assertions simply
move the current position back by the alternative's fixed
width and then try to match.  If there are
insufficient characters before the current position, the
match is deemed to fail.  Lookbehinds, in conjunction with
non-backtracking subpatterns, can be particularly useful for
matching at the ends of strings; an example is given at the end
of the section on non-backtracking subpatterns.

Several assertions (of any sort) may occur in succession.
For example,

@example
(?<=\d@{3@})(?<!999)foo
@end example

@noindent
matches @samp{foo} preceded by three digits that are not @samp{999}.
Notice that each of the assertions is applied independently
at the same point in the subject string.  First there is a
check that the previous three characters are all digits, and
then there is a check that the same three characters are not
@samp{999}.  This pattern does not match @samp{foo} preceded by six
characters, the first three of which are digits and the last three
of which are not @samp{999}.  For example, it doesn't match
@samp{123abcfoo}.  A pattern to do that is

@example
(?<=\d@{3@}...)(?<!999)foo
@end example

@noindent
This time the first assertion looks at the preceding six
characters, checking that the first three are digits, and
then the second assertion checks that the preceding three
characters are not @samp{999}.
Actually, assertions can be
nested in any combination, so one can write this as

@example
(?<=\d@{3@}(?!999)...)foo
@end example

@noindent
or

@example
(?<=\d@{3@}...(?<!999))foo
@end example

@noindent
both of which might be considered more readable.

Assertion subpatterns are not capturing subpatterns, and may
not be repeated, because it makes no sense to assert the
same thing several times.  If any kind of assertion contains
capturing subpatterns within it, these are counted for the
purposes of numbering the capturing subpatterns in the whole
pattern.  However, substring capturing is carried out only
for positive assertions, because it does not make sense for
negative assertions.

Assertions count towards the maximum of 200 parenthesized
subpatterns.

@node Non-backtracking subpatterns
@appendixsec Non-backtracking subpatterns
@cindex Perl-style regular expressions, non-backtracking subpatterns

With both maximizing and minimizing repetition, failure of
what follows normally causes the repeated item to be evaluated
again to see if a different number of repeats allows the
rest of the pattern to match.  Sometimes it is useful to
prevent this, either to change the nature of the match, or
to cause it to fail earlier than it otherwise might, when the
author of the pattern knows there is no point in carrying
on.

Consider, for example, the pattern @code{\d+foo} when applied to
the subject line

@example
123456bar
@end example

After matching all 6 digits and then failing to match @samp{foo},
the normal action of the matcher is to try again with only 5
digits matching the @code{\d+} item, and then with 4, and so on,
before ultimately failing.
Non-backtracking subpatterns
provide the means for specifying that once a portion of the
pattern has matched, it is not to be re-evaluated in this way,
so the matcher would give up immediately on failing to match
@samp{foo} the first time.  The notation is another kind of special
parenthesis, starting with @code{(?>} as in this example:

@example
(?>\d+)foo
@end example

This kind of parenthesis ``locks up'' the part of the pattern
it contains once it has matched, and a failure further into
the pattern is prevented from backtracking into it.
Backtracking past it to previous items, however, works as
normal.

Non-backtracking subpatterns are not capturing subpatterns.  Simple
cases such as the above example can be thought of as a maximizing
repeat that must swallow everything it can.  So,
while both @code{\d+} and @code{\d+?} are prepared to adjust the number of
digits they match in order to make the rest of the pattern
match, @code{(?>\d+)} can only match an entire sequence of digits.

This construction can of course contain arbitrarily complicated
subpatterns, and it can be nested.

@cindex Perl-style regular expressions, lookbehind subpatterns
Non-backtracking subpatterns can be used in conjunction with lookbehind
assertions to specify efficient matching at the end
of the subject string.  Consider a simple pattern such as

@example
abcd$
@end example

@noindent
when applied to a long string which does not match.  Because
matching proceeds from left to right, @command{sed} will look for
each @samp{a} in the subject and then see if what follows matches
the rest of the pattern.
If the pattern is specified as

@example
^.*abcd$
@end example

@noindent
the initial @code{.*} matches the entire string at first, but when
this fails (because there is no following @samp{a}), it backtracks
to match all but the last character, then all but the
last two characters, and so on.  Once again the search for
@samp{a} covers the entire string, from right to left, so we are
no better off.  However, if the pattern is written as

@example
^(?>.*)(?<=abcd)
@end example

@noindent
there can be no backtracking for the @code{.*} item; it can match
only the entire string.  The subsequent lookbehind assertion
does a single test on the last four characters.  If it fails,
the match fails immediately.  For long strings, this approach
makes a significant difference to the processing time.

When a pattern contains an unlimited repeat inside a subpattern
that can itself be repeated an unlimited number of
times, the use of a non-backtracking subpattern is the only way to
avoid some failing matches taking a very long time
indeed.@footnote{Actually, the matcher embedded in @value{SSED}
tries to do something for this in the simplest cases,
like @code{([^b]*b)*}.  These cases are actually quite
common: they happen for example in a regular expression
like @code{\/\*([^*]*\*)*\/} which matches C comments.}

The pattern

@example
(\D+|<\d+>)*[!?]
@end example

@c ([^0-9<]+<(\d+>)?)*[!?]

@noindent
matches an unlimited number of substrings that either consist
of non-digits, or digits enclosed in angle brackets, followed by
an exclamation or question mark.  When it matches, it runs quickly.
However, if it is applied to

@example
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
@end example

@noindent
it takes a long time before reporting failure.
This is
because the string can be divided between the two repeats in
a large number of ways, and all have to be tried.@footnote{The
example used @code{[!?]} rather than a single character at the end,
because both @value{SSED} and Perl have an optimization that allows
for fast failure when a single character is used.  They
remember the last single character that is required for a
match, and fail early if it is not present in the string.}

If the pattern is changed to

@example
((?>\D+)|<\d+>)*[!?]
@end example

@noindent
sequences of non-digits cannot be broken, and failure happens
quickly.

@node Conditional subpatterns
@appendixsec Conditional subpatterns
@cindex Perl-style regular expressions, conditional subpatterns

It is possible to cause the matching process to obey a subpattern
conditionally or to choose between two alternative
subpatterns, depending on the result of an assertion, or
whether a previous capturing subpattern matched or not.  The
two possible forms of conditional subpattern are

@example
(?(@var{condition})@var{yes-pattern})
(?(@var{condition})@var{yes-pattern}|@var{no-pattern})
@end example

If the condition is satisfied, the yes-pattern is used; otherwise
the no-pattern (if present) is used.  If there are more than two
alternatives in the subpattern, a compile-time error occurs.

There are two kinds of condition.  If the text between the
parentheses consists of a sequence of digits, the condition
is satisfied if the capturing subpattern of that number has
previously matched.  The number must be greater than zero.
Consider the following pattern, which contains non-significant
white space to make it more readable (assume the @code{X} modifier)
and to divide it into three parts for ease of discussion:

@example
( \( )?    [^()]+    (?(1) \) )
@end example

The first part matches an optional opening parenthesis, and
if that character is present, sets it as the first captured
substring.  The second part matches one or more characters
that are not parentheses.  The third part is a conditional
subpattern that tests whether the first set of parentheses
matched or not.  If they did, that is, if the subject started
with an opening parenthesis, the condition is true, and so
the yes-pattern is executed and a closing parenthesis is
required.  Otherwise, since no-pattern is not present, the
subpattern matches nothing.  In other words, this pattern
matches a sequence of non-parentheses, optionally enclosed
in parentheses.

@cindex Perl-style regular expressions, lookahead subpatterns
If the condition is not a sequence of digits, it must be an
assertion.  This may be a positive or negative lookahead or
lookbehind assertion.  Consider this pattern, again containing
non-significant white space, and with the two alternatives
on the second and third lines:

@example
(?(?=...[a-z])
   \d\d-[a-z]@{3@}-\d\d |
   \d\d-\d\d-\d\d )
@end example

The condition is a positive lookahead assertion that matches
a letter that is three characters away from the current point.
If a letter is found, the subject is matched against the first
alternative @samp{@var{dd}-@var{aaa}-@var{dd}} (where @var{aaa} are
letters and @var{dd} are digits); otherwise it is matched against
the second alternative, @samp{@var{dd}-@var{dd}-@var{dd}}.


@node Recursive patterns
@appendixsec Recursive patterns
@cindex Perl-style regular expressions, recursive patterns
@cindex Perl-style regular expressions, recursion

Consider the problem of matching a string in parentheses,
allowing for unlimited nested parentheses.
Without the use
of recursion, the best that can be done is to use a pattern
that matches up to some fixed depth of nesting.  It is not
possible to handle an arbitrary nesting depth.  Perl 5.6 has
provided an experimental facility that allows regular
expressions to recurse (amongst other things).  It does this
by interpolating Perl code in the expression at run time,
and the code can refer to the expression itself.  A Perl
pattern to solve the parentheses problem can be created like
this:

@example
$re = qr@{\( (?: (?>[^()]+) | (?p@{$re@}) )* \)@}x;
@end example

The @code{(?p@{...@})} item interpolates Perl code at run time,
and in this case refers recursively to the pattern in which it
appears.  Obviously, @command{sed} cannot support the interpolation of
Perl code.  Instead, the special item @code{(?R)} is provided for
the specific case of recursion.  This pattern solves the
parentheses problem (assume the @code{X} modifier option is used
so that white space is ignored):

@example
\( ( (?>[^()]+) | (?R) )* \)
@end example

First it matches an opening parenthesis.  Then it matches any
number of substrings which can either be a sequence of
non-parentheses, or a recursive match of the pattern itself
(i.e. a correctly parenthesized substring).  Finally there is
a closing parenthesis.

This particular example pattern contains nested unlimited
repeats, and so the use of a non-backtracking subpattern for
matching strings of non-parentheses is important when applying
the pattern to strings that do not match.  For example, when
it is applied to

@example
(aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa()
@end example

@noindent
it yields a ``no match'' response quickly.
However, if a
standard backtracking subpattern is used instead, the match runs
for a very long time indeed because there are so many different
ways the @code{+} and @code{*} repeats can carve up the subject,
and all have to be tested before failure can be reported.

The values set for any capturing subpatterns are those from
the outermost level of the recursion at which the subpattern
value is set.  If the pattern above is matched against

@example
(ab(cd)ef)
@end example

@noindent
the value for the capturing parentheses is @samp{ef}, which is
the last value taken on at the top level.

@node Comments
@appendixsec Comments
@cindex Perl-style regular expressions, comments

The sequence @code{(?#} marks the start of a comment that
continues up to the next closing parenthesis.  Nested parentheses
are not permitted.  The characters that make up a comment
play no part in the pattern matching at all.

@cindex Perl-style regular expressions, extended
If the @code{X} modifier option is used, an unescaped @code{#} character
outside a character class introduces a comment that continues
up to the next newline character in the pattern.
@end ifset


@page
@node Concept Index
@unnumbered Concept Index

This is a general index of all issues discussed in this manual, with the
exception of the @command{sed} commands and command-line options.

@printindex cp

@page
@node Command and Option Index
@unnumbered Command and Option Index

This is an alphabetical list of all @command{sed} commands and command-line
options.

@printindex fn

@contents
@bye

@c XXX FIXME: the term "cycle" is never defined...
