\input texinfo @c -*-texinfo-*-
@c Do not edit this file!! It is automatically generated from sed-in.texi.
@c
@c -- Stuff that needs adding: ----------------------------------------------
@c (document the `;' command-separator)
@c --------------------------------------------------------------------------
@c Check for consistency: regexps in @code, text that they match in @samp.
@c
@c Tips:
@c @command for command
@c @samp for command fragments: @samp{cat -s}
@c @code for sed commands and flags
@c Use ``quote'' not `quote' or "quote".
@c
@c %**start of header
@setfilename sed.info
@settitle sed, a stream editor
@c %**end of header

@c @smallbook

@include version.texi

@c Combine indices.
@syncodeindex ky cp
@syncodeindex pg cp
@syncodeindex tp cp

@defcodeindex op
@syncodeindex op fn

@include config.texi

@copying
This file documents version @value{VERSION} of
@value{SSED}, a stream editor.

Copyright @copyright{} 1998, 1999, 2001, 2002, 2003, 2004 Free
Software Foundation, Inc.

This document is released under the terms of the @acronym{GNU} Free
Documentation License as published by the Free Software Foundation;
either version 1.1, or (at your option) any later version.

You should have received a copy of the @acronym{GNU} Free Documentation
License along with @value{SSED}; see the file @file{COPYING.DOC}.
If not, write to the Free Software Foundation, 59 Temple Place - Suite
330, Boston, MA 02110-1301, USA.

There are no Cover Texts and no Invariant Sections; this text, along
with its equivalent in the printed manual, constitutes the Title Page.
@end copying

@setchapternewpage off

@titlepage
@title @command{sed}, a stream editor
@subtitle version @value{VERSION}, @value{UPDATED}
@author by Ken Pizzini, Paolo Bonzini

@page
@vskip 0pt plus 1filll
Copyright @copyright{} 1998, 1999 Free Software Foundation, Inc.

@insertcopying

Published by the Free Software Foundation, @*
51 Franklin Street, Fifth Floor @*
Boston, MA 02110-1301, USA
@end titlepage


@node Top
@top

@ifnottex
@insertcopying
@end ifnottex

@menu
* Introduction:: Introduction
* Invoking sed:: Invocation
* sed Programs:: @command{sed} programs
* Examples:: Some sample scripts
* Limitations:: Limitations and (non-)limitations of @value{SSED}
* Other Resources:: Other resources for learning about @command{sed}
* Reporting Bugs:: Reporting bugs

* Extended regexps:: @command{egrep}-style regular expressions
@ifset PERL
* Perl regexps:: Perl-style regular expressions
@end ifset

* Concept Index:: A menu with all the topics in this manual.
* Command and Option Index:: A menu with all @command{sed} commands and
                             command-line options.

@detailmenu
--- The detailed node listing ---

sed Programs:
* Execution Cycle:: How @command{sed} works
* Addresses:: Selecting lines with @command{sed}
* Regular Expressions:: Overview of regular expression syntax
* Common Commands:: Often used commands
* The "s" Command:: @command{sed}'s Swiss Army Knife
* Other Commands:: Less frequently used commands
* Programming Commands:: Commands for @command{sed} gurus
* Extended Commands:: Commands specific to @value{SSED}
* Escapes:: Specifying special characters

Examples:
* Centering lines::
* Increment a number::
* Rename files to lower case::
* Print bash environment::
* Reverse chars of lines::
* tac:: Reverse lines of files
* cat -n:: Numbering lines
* cat -b:: Numbering non-blank lines
* wc -c:: Counting chars
* wc -w:: Counting words
* wc -l:: Counting lines
* head:: Printing the first lines
* tail:: Printing the last lines
* uniq:: Make duplicate lines unique
* uniq -d:: Print duplicated lines of input
* uniq -u:: Remove all duplicated lines
* cat -s:: Squeezing blank
lines

@ifset PERL
Perl regexps:: Perl-style regular expressions
* Backslash:: Introduces special sequences
* Circumflex/dollar sign/period:: Behave specially with regard to new lines
* Square brackets:: Are a bit different in strange cases
* Options setting:: Toggle modifiers in the middle of a regexp
* Non-capturing subpatterns:: Are not counted when backreferencing
* Repetition:: Allows for non-greedy matching
* Backreferences:: Allows for more than 10 back references
* Assertions:: Allows for complex look ahead matches
* Non-backtracking subpatterns:: Often gives more performance
* Conditional subpatterns:: Allows if/then/else branches
* Recursive patterns:: For example to match parentheses
* Comments:: Because things can get complex...
@end ifset

@end detailmenu
@end menu


@node Introduction
@chapter Introduction

@cindex Stream editor
@command{sed} is a stream editor.
A stream editor is used to perform basic text
transformations on an input stream
(a file or input from a pipeline).
While in some ways similar to an editor which
permits scripted edits (such as @command{ed}),
@command{sed} works by making only one pass over the
input(s), and is consequently more efficient.
But it is @command{sed}'s ability to filter text in a pipeline
which particularly distinguishes it from other types of
editors.


@node Invoking sed
@chapter Invocation

Normally @command{sed} is invoked like this:

@example
sed SCRIPT INPUTFILE...
@end example

The full format for invoking @command{sed} is:

@example
sed OPTIONS... [SCRIPT] [INPUTFILE...]
@end example

If you do not specify @var{INPUTFILE}, or if @var{INPUTFILE} is @file{-},
@command{sed} filters the contents of the standard input.
The @var{script}
is actually the first non-option parameter, which @command{sed} specially
considers a script and not an input file if (and only if) none of the
other @var{options} specifies a script to be executed, that is if neither
of the @option{-e} and @option{-f} options is specified.

@command{sed} may be invoked with the following command-line options:

@table @code
@item --version
@opindex --version
@cindex Version, printing
Print out the version of @command{sed} that is being run and a copyright notice,
then exit.

@item --help
@opindex --help
@cindex Usage summary, printing
Print a usage message briefly summarizing these command-line options
and the bug-reporting address,
then exit.

@item -n
@itemx --quiet
@itemx --silent
@opindex -n
@opindex --quiet
@opindex --silent
@cindex Disabling autoprint, from command line
By default, @command{sed} prints out the pattern space
at the end of each cycle through the script (@pxref{Execution Cycle, ,
How @code{sed} works}).
These options disable this automatic printing,
and @command{sed} only produces output when explicitly told to
via the @code{p} command.

@item -e @var{script}
@itemx --expression=@var{script}
@opindex -e
@opindex --expression
@cindex Script, from command line
Add the commands in @var{script} to the set of commands to be
run while processing the input.

@item -f @var{script-file}
@itemx --file=@var{script-file}
@opindex -f
@opindex --file
@cindex Script, from a file
Add the commands contained in the file @var{script-file}
to the set of commands to be run while processing the input.

@item -i[@var{SUFFIX}]
@itemx --in-place[=@var{SUFFIX}]
@opindex -i
@opindex --in-place
@cindex In-place editing, activating
@cindex @value{SSEDEXT}, in-place editing
This option specifies that files are to be edited in-place.
@value{SSED} does this by creating a temporary file and
sending output to this file rather than to the standard
output.@footnote{This applies to commands such as @code{=},
@code{a}, @code{c}, @code{i}, @code{l}, @code{p}. You can
still write to the standard output by using the @code{w}
@cindex @value{SSEDEXT}, @file{/dev/stdout} file
or @code{W} commands together with the @file{/dev/stdout}
special file}.

This option implies @option{-s}.

When the end of the file is reached, the temporary file is
renamed to the output file's original name. The extension,
if supplied, is used to modify the name of the old file
before renaming the temporary file, thereby making a backup
copy@footnote{Note that @value{SSED} creates the backup
file whether or not any output is actually changed.}.

@cindex In-place editing, Perl-style backup file names
This rule is followed: if the extension doesn't contain a @code{*},
then it is appended to the end of the current filename as a
suffix; if the extension does contain one or more @code{*}
characters, then @emph{each} asterisk is replaced with the
current filename. This allows you to add a prefix to the
backup file, instead of (or in addition to) a suffix, or
even to place backup copies of the original files into another
directory (provided the directory already exists).

If no extension is supplied, the original file is
overwritten without making a backup.

@item -l @var{N}
@itemx --line-length=@var{N}
@opindex -l
@opindex --line-length
@cindex Line length, setting
Specify the default line-wrap length for the @code{l} command.
A length of 0 (zero) means to never wrap long lines. If
not specified, it is taken to be 70.

@item --posix
@cindex @value{SSEDEXT}, disabling
@value{SSED} includes several extensions to @acronym{POSIX}
sed.
In order to simplify writing portable scripts, this
option disables all the extensions that this manual documents,
including additional commands.
@cindex @code{POSIXLY_CORRECT} behavior, enabling
Most of the extensions accept @command{sed} programs that
are outside the syntax mandated by @acronym{POSIX}, but some
of them (such as the behavior of the @command{N} command
described in @ref{Reporting Bugs}) actually violate the
standard. If you want to disable only the latter kind of
extension, you can set the @code{POSIXLY_CORRECT} variable
to a non-empty value.

@item -b
@itemx --binary
@opindex -b
@opindex --binary
This option is available on every platform, but is only effective where the
operating system makes a distinction between text files and binary files.
When such a distinction is made---as is the case for MS-DOS, Windows,
Cygwin---text files are composed of lines separated by a carriage return
@emph{and} a line feed character, and @command{sed} does not see the
ending CR. When this option is specified, @command{sed} will open
input files in binary mode, thus not requesting this special processing
and considering lines to end at a line feed.

@item --follow-symlinks
@opindex --follow-symlinks
This option is available only on platforms that support
symbolic links and has an effect only if option @option{-i}
is specified. In this case, if the file that is specified
on the command line is a symbolic link, @command{sed} will
follow the link and edit the ultimate destination of the
link. The default behavior is to break the symbolic link,
so that the link destination will not be modified.

@item -r
@itemx --regexp-extended
@opindex -r
@opindex --regexp-extended
@cindex Extended regular expressions, choosing
@cindex @acronym{GNU} extensions, extended regular expressions
Use extended regular expressions rather than basic
regular expressions. Extended regexps are those that
@command{egrep} accepts; they can be clearer because they
usually have fewer backslashes, but are a @acronym{GNU} extension
and hence scripts that use them are not portable.
@xref{Extended regexps, , Extended regular expressions}.

@ifset PERL
@item -R
@itemx --regexp-perl
@opindex -R
@opindex --regexp-perl
@cindex Perl-style regular expressions, choosing
@cindex @value{SSEDEXT}, Perl-style regular expressions
Use Perl-style regular expressions rather than basic
regular expressions. Perl-style regexps are extremely
powerful but are a @value{SSED} extension and hence scripts that
use them are not portable. @xref{Perl regexps, ,
Perl-style regular expressions}.
@end ifset

@item -s
@itemx --separate
@cindex Working on separate files
By default, @command{sed} will consider the files specified on the
command line as a single continuous long stream. This @value{SSED}
extension allows the user to consider them as separate files:
range addresses (such as @samp{/abc/,/def/}) are not allowed
to span several files, line numbers are relative to the start
of each file, @code{$} refers to the last line of each file,
and files invoked from the @code{R} commands are rewound at the
start of each file.

@item -u
@itemx --unbuffered
@opindex -u
@opindex --unbuffered
@cindex Unbuffered I/O, choosing
Buffer both input and output as minimally as practical.
(This is particularly useful if the input is coming from
the likes of @samp{tail -f}, and you wish to see the transformed
output as soon as possible.)

@end table

If no @option{-e}, @option{-f}, @option{--expression}, or @option{--file}
options are given on the command line,
then the first non-option argument on the command line is
taken to be the @var{script} to be executed.

@cindex Files to be processed as input
If any command-line parameters remain after processing the above,
these parameters are interpreted as the names of input files to
be processed.
@cindex Standard input, processing as input
A file name of @samp{-} refers to the standard input stream.
The standard input will be processed if no file names are specified.


@node sed Programs
@chapter @command{sed} Programs

@cindex @command{sed} program structure
@cindex Script structure
A @command{sed} program consists of one or more @command{sed} commands,
passed in by one or more of the
@option{-e}, @option{-f}, @option{--expression}, and @option{--file}
options, or the first non-option argument if none of these
options is used.
This document will refer to ``the'' @command{sed} script;
this is understood to mean the in-order catenation
of all of the @var{script}s and @var{script-file}s passed in.

Each @code{sed} command consists of an optional address or
address range, followed by a one-character command name
and any additional command-specific code.
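As a rough illustration of how ``the'' script is assembled, the
following shell sketch runs the same commands three ways: as the first
non-option argument, through two @option{-e} options, and through a
file named with @option{-f}. The sample input, the variable names, and
the temporary script file are made up for this example, and a
@acronym{POSIX} shell with @command{mktemp} is assumed.

```shell
# Hypothetical sample input, used only for this illustration.
input='alpha
beta
gamma'

# Script given as the first non-option argument (no -e or -f present).
first=$(printf '%s\n' "$input" | sed 's/a/A/')

# Two -e options: their commands are catenated in order into one script.
second=$(printf '%s\n' "$input" | sed -e 's/a/A/' -e 's/m/M/')

# The same two commands read from a (temporary, hypothetical) script file.
scriptfile=$(mktemp)
printf '%s\n' 's/a/A/' 's/m/M/' > "$scriptfile"
third=$(printf '%s\n' "$input" | sed -f "$scriptfile")
rm -f "$scriptfile"

printf '%s\n' "$first" "$second" "$third"
```

The second and third invocations should produce identical output,
since both supply the same commands in the same order.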

@menu
* Execution Cycle:: How @command{sed} works
* Addresses:: Selecting lines with @command{sed}
* Regular Expressions:: Overview of regular expression syntax
* Common Commands:: Often used commands
* The "s" Command:: @command{sed}'s Swiss Army Knife
* Other Commands:: Less frequently used commands
* Programming Commands:: Commands for @command{sed} gurus
* Extended Commands:: Commands specific to @value{SSED}
* Escapes:: Specifying special characters
@end menu


@node Execution Cycle
@section How @command{sed} Works

@cindex Buffer spaces, pattern and hold
@cindex Spaces, pattern and hold
@cindex Pattern space, definition
@cindex Hold space, definition
@command{sed} maintains two data buffers: the active @emph{pattern} space,
and the auxiliary @emph{hold} space. Both are initially empty.

@command{sed} operates by performing the following cycle on each
line of input: first, @command{sed} reads one line from the input
stream, removes any trailing newline, and places it in the pattern space.
Then commands are executed; each command can have an address associated
with it: addresses are a kind of condition code, and a command is only
executed if the condition is verified before the command is to be
executed.

When the end of the script is reached, unless the @option{-n} option
is in use, the contents of pattern space are printed out to the output
stream, adding back the trailing newline if it was removed.@footnote{Actually,
if @command{sed} prints a line without the terminating newline, it will
nevertheless print the missing newline as soon as more text is sent to
the same output stream, which gives the ``least expected surprise''
even though it does not make commands like @samp{sed -n p} exactly
identical to @command{cat}.} Then the next cycle starts for the next
input line.
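The effect of the automatic print at the end of each cycle can be seen
with a small experiment. This is only a sketch: the two-line input and
the variable names are invented for the illustration.

```shell
# Hypothetical two-line input.
input='one
two'

# Auto-print is on: each line is written by p and again at the end of the cycle.
doubled=$(printf '%s\n' "$input" | sed p)

# With -n, only the explicit p command produces output.
once=$(printf '%s\n' "$input" | sed -n p)

# With -n and no printing command, the edited pattern space is never written.
nothing=$(printf '%s\n' "$input" | sed -n 's/o/0/')

printf '%s\n' "$doubled" "$once" "$nothing"
```

Here the first invocation should print every input line twice, the
second once, and the third nothing at all.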

Unless special commands (like @samp{D}) are used, the pattern space is
deleted between two cycles. The hold space, on the other hand, keeps
its data between cycles (see commands @samp{h}, @samp{H}, @samp{x},
@samp{g}, @samp{G} to move data between the two buffers).


@node Addresses
@section Selecting lines with @command{sed}
@cindex Addresses, in @command{sed} scripts
@cindex Line selection
@cindex Selecting lines to process

Addresses in a @command{sed} script can be in any of the following forms:
@table @code
@item @var{number}
@cindex Address, numeric
@cindex Line, selecting by number
Specifying a line number will match only that line in the input.
(Note that @command{sed} counts lines continuously across all input files
unless @option{-i} or @option{-s} options are specified.)

@item @var{first}~@var{step}
@cindex @acronym{GNU} extensions, @samp{@var{n}~@var{m}} addresses
This @acronym{GNU} extension matches every @var{step}th line
starting with line @var{first}.
In particular, lines will be selected when there exists
a non-negative @var{n} such that the current line-number equals
@var{first} + (@var{n} * @var{step}).
Thus, to select the odd-numbered lines,
one would use @code{1~2};
to pick every third line starting with the second, @samp{2~3} would be used;
to pick every fifth line starting with the tenth, use @samp{10~5};
and @samp{50~0} is just an obscure way of saying @code{50}.

@item $
@cindex Address, last line
@cindex Last line, selecting
@cindex Line, selecting last
This address matches the last line of the last file of input, or
the last line of each file when the @option{-i} or @option{-s} options
are specified.

@item /@var{regexp}/
@cindex Address, as a regular expression
@cindex Line, selecting by regular expression match
This will select any line which matches the regular expression @var{regexp}.
If @var{regexp} itself includes any @code{/} characters,
each must be escaped by a backslash (@code{\}).

@cindex empty regular expression
@cindex @value{SSEDEXT}, modifiers and the empty regular expression
The empty regular expression @samp{//} repeats the last regular
expression match (the same holds if the empty regular expression is
passed to the @code{s} command). Note that modifiers to regular expressions
are evaluated when the regular expression is compiled, thus it is invalid to
specify them together with the empty regular expression.

@item \%@var{regexp}%
(The @code{%} may be replaced by any other single character.)

@cindex Slash character, in regular expressions
This also matches the regular expression @var{regexp},
but allows one to use a different delimiter than @code{/}.
This is particularly useful if the @var{regexp} itself contains
a lot of slashes, since it avoids the tedious escaping of every @code{/}.
If @var{regexp} itself includes any delimiter characters,
each must be escaped by a backslash (@code{\}).

@item /@var{regexp}/I
@itemx \%@var{regexp}%I
@cindex @acronym{GNU} extensions, @code{I} modifier
@ifset PERL
@cindex Perl-style regular expressions, case-insensitive
@end ifset
The @code{I} modifier to regular-expression matching is a @acronym{GNU}
extension which causes the @var{regexp} to be matched in
a case-insensitive manner.

@item /@var{regexp}/M
@itemx \%@var{regexp}%M
@ifset PERL
@cindex @value{SSEDEXT}, @code{M} modifier
@end ifset
@cindex Perl-style regular expressions, multiline
The @code{M} modifier to regular-expression matching is a @value{SSED}
extension which causes @code{^} and @code{$} to match respectively
(in addition to the normal behavior) the empty string after a newline,
and the empty string before a newline.
There are special character
sequences
@ifset PERL
(@code{\A} and @code{\Z} in Perl mode, @code{\`} and @code{\'}
in basic or extended regular expression modes)
@end ifset
@ifclear PERL
(@code{\`} and @code{\'})
@end ifclear
which always match the beginning or the end of the buffer.
@code{M} stands for @cite{multi-line}.

@ifset PERL
@item /@var{regexp}/S
@itemx \%@var{regexp}%S
@cindex @value{SSEDEXT}, @code{S} modifier
@cindex Perl-style regular expressions, single line
The @code{S} modifier to regular-expression matching is only valid
in Perl mode and specifies that the dot character (@code{.}) will
match the newline character too. @code{S} stands for @cite{single-line}.
@end ifset

@ifset PERL
@item /@var{regexp}/X
@itemx \%@var{regexp}%X
@cindex @value{SSEDEXT}, @code{X} modifier
@cindex Perl-style regular expressions, extended
The @code{X} modifier to regular-expression matching is also
valid in Perl mode only. If it is used, whitespace in the
pattern (other than in a character class) and
characters between a @kbd{#} outside a character class and the
next newline character are ignored. An escaping backslash
can be used to include a whitespace or @kbd{#} character as part
of the pattern.
@end ifset
@end table

If no addresses are given, then all lines are matched;
if one address is given, then only lines matching that
address are matched.

@cindex Range of lines
@cindex Several lines, selecting
An address range can be specified by specifying two addresses
separated by a comma (@code{,}). An address range matches lines
starting from where the first address matches, and continues
until the second address matches (inclusive).
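The address forms described so far can be tried out with a short shell
sketch such as the one below. The six-line input and the variable
names are invented for the illustration, and the @code{1~2} form
assumes a @command{sed} that supports this @acronym{GNU} extension.

```shell
# Hypothetical six-line input: one word per line.
input=$(printf '%s\n' one two three four five six)

# A line number selects exactly that line.
by_number=$(printf '%s\n' "$input" | sed -n '2p')

# $ selects the last line of input.
last=$(printf '%s\n' "$input" | sed -n '$p')

# A regexp address selects every line that matches.
by_regexp=$(printf '%s\n' "$input" | sed -n '/^t/p')

# first~step (GNU extension): every second line starting with line 1.
odd=$(printf '%s\n' "$input" | sed -n '1~2p')

# A range runs from the first match through the second match, inclusive.
range=$(printf '%s\n' "$input" | sed -n '/two/,/four/p')

printf '%s\n' "$by_number" "$last" "$by_regexp" "$odd" "$range"
```

Each invocation uses @option{-n} together with @code{p} so that only
the addressed lines are printed.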

If the second address is a @var{regexp}, then checking for the
ending match will start with the line @emph{following} the
line which matched the first address: a range will always
span at least two lines (except of course if the input stream
ends).

If the second address is a @var{number} less than (or equal to)
the line matching the first address, then only the one line is
matched.

@cindex Special addressing forms
@cindex Range with start address of zero
@cindex Zero, as range start address
@cindex @var{addr1},+N
@cindex @var{addr1},~N
@cindex @acronym{GNU} extensions, special two-address forms
@cindex @acronym{GNU} extensions, @code{0} address
@cindex @acronym{GNU} extensions, 0,@var{addr2} addressing
@cindex @acronym{GNU} extensions, @var{addr1},+@var{N} addressing
@cindex @acronym{GNU} extensions, @var{addr1},~@var{N} addressing
@value{SSED} also supports some special two-address forms; all these
are @acronym{GNU} extensions:
@table @code
@item 0,/@var{regexp}/
A line number of @code{0} can be used in an address specification like
@code{0,/@var{regexp}/} so that @command{sed} will try to match
@var{regexp} in the first input line too. In other words,
@code{0,/@var{regexp}/} is similar to @code{1,/@var{regexp}/},
except that if @var{addr2} matches the very first line of input the
@code{0,/@var{regexp}/} form will consider it to end the range, whereas
the @code{1,/@var{regexp}/} form will match the beginning of its range and
hence make the range span up to the @emph{second} occurrence of the
regular expression.

Note that this is the only place where the @code{0} address makes
sense; there is no 0-th line and commands which are given the @code{0}
address in any other way will give an error.

@item @var{addr1},+@var{N}
Matches @var{addr1} and the @var{N} lines following @var{addr1}.

@item @var{addr1},~@var{N}
Matches @var{addr1} and the lines following @var{addr1}
until the next line whose input line number is a multiple of @var{N}.
@end table

@cindex Excluding lines
@cindex Selecting non-matching lines
Appending the @code{!} character to the end of an address
specification negates the sense of the match.
That is, if the @code{!} character follows an address range,
then only lines which do @emph{not} match the address range
will be selected.
This also works for singleton addresses,
and, perhaps perversely, for the null address.


@node Regular Expressions
@section Overview of Regular Expression Syntax

To know how to use @command{sed}, people should understand regular
expressions (@dfn{regexp} for short). A regular expression
is a pattern that is matched against a
subject string from left to right. Most characters are
@dfn{ordinary}: they stand for
themselves in a pattern, and match the corresponding characters
in the subject. As a trivial example, the pattern

@example
The quick brown fox
@end example

@noindent
matches a portion of a subject string that is identical to
itself. The power of regular expressions comes from the
ability to include alternatives and repetitions in the pattern.
These are encoded in the pattern by the use of @dfn{special characters},
which do not stand for themselves but instead
are interpreted in some special way. Here is a brief description
of regular expression syntax as used in @command{sed}.

@table @code
@item @var{char}
A single ordinary character matches itself.

@item *
@cindex @acronym{GNU} extensions, to basic regular expressions
Matches a sequence of zero or more instances of matches for the
preceding regular expression, which must be an ordinary character, a
special character preceded by @code{\}, a @code{.}, a grouped regexp
(see below), or a bracket expression. As a @acronym{GNU} extension, a
postfixed regular expression can also be followed by @code{*}; for
example, @code{a**} is equivalent to @code{a*}. @acronym{POSIX}
1003.1-2001 says that @code{*} stands for itself when it appears at
the start of a regular expression or subexpression, but many
non-@acronym{GNU} implementations do not support this and portable
scripts should instead use @code{\*} in these contexts.

@item \+
@cindex @acronym{GNU} extensions, to basic regular expressions
As @code{*}, but matches one or more. It is a @acronym{GNU} extension.

@item \?
@cindex @acronym{GNU} extensions, to basic regular expressions
As @code{*}, but only matches zero or one. It is a @acronym{GNU} extension.

@item \@{@var{i}\@}
As @code{*}, but matches exactly @var{i} sequences (@var{i} is a
decimal integer; for portability, keep it between 0 and 255
inclusive).

@item \@{@var{i},@var{j}\@}
Matches between @var{i} and @var{j}, inclusive, sequences.

@item \@{@var{i},\@}
Matches more than or equal to @var{i} sequences.

@item \(@var{regexp}\)
Groups the inner @var{regexp} as a whole; this is used to:

@itemize @bullet
@item
@cindex @acronym{GNU} extensions, to basic regular expressions
Apply postfix operators, like @code{\(abcd\)*}:
this will search for zero or more whole sequences
of @samp{abcd}, while @code{abcd*} would search
for @samp{abc} followed by zero or more occurrences
of @samp{d}.
Note that support for @code{\(abcd\)*} is
required by @acronym{POSIX} 1003.1-2001, but many non-@acronym{GNU}
implementations do not support it and hence it is not universally
portable.

@item
Use back references (see below).
@end itemize

@item .
Matches any character, including newline.

@item ^
Matches the null string at beginning of the pattern space, i.e. what
appears after the circumflex must appear at the beginning of the
pattern space.

In most scripts, pattern space is initialized to the content of each
line (@pxref{Execution Cycle, , How @code{sed} works}). So, it is a
useful simplification to think of @code{^#include} as matching only
lines where @samp{#include} is the first thing on the line---if there are
spaces before, for example, the match fails. This simplification is
valid as long as the original content of pattern space is not modified,
for example with an @code{s} command.

@code{^} acts as a special character only at the beginning of the
regular expression or subexpression (that is, after @code{\(} or
@code{\|}). Portable scripts should avoid @code{^} at the beginning of
a subexpression, though, as @acronym{POSIX} allows implementations that
treat @code{^} as an ordinary character in that context.

@item $
It is the same as @code{^}, but refers to end of pattern space.
@code{$} also acts as a special character only at the end
of the regular expression or subexpression (that is, before @code{\)}
or @code{\|}), and its use at the end of a subexpression is not
portable.


@item [@var{list}]
@itemx [^@var{list}]
Matches any single character in @var{list}: for example,
@code{[aeiou]} matches all vowels. A list may include
sequences like @code{@var{char1}-@var{char2}}, which
matches any character between (inclusive) @var{char1}
and @var{char2}.

A leading @code{^} reverses the meaning of @var{list}, so that
it matches any single character @emph{not} in @var{list}. To include
@code{]} in the list, make it the first character (after
the @code{^} if needed); to include @code{-} in the list,
make it the first or last; to include @code{^} put
it after the first character.

@cindex @code{POSIXLY_CORRECT} behavior, bracket expressions
The characters @code{$}, @code{*}, @code{.}, @code{[}, and @code{\}
are normally not special within @var{list}. For example, @code{[\*]}
matches either @samp{\} or @samp{*}, because the @code{\} is not
special here. However, strings like @code{[.ch.]}, @code{[=a=]}, and
@code{[:space:]} are special within @var{list} and represent collating
symbols, equivalence classes, and character classes, respectively, and
@code{[} is therefore special within @var{list} when it is followed by
@code{.}, @code{=}, or @code{:}. Also, when not in
@env{POSIXLY_CORRECT} mode, special escapes like @code{\n} and
@code{\t} are recognized within @var{list}. @xref{Escapes}.

@item @var{regexp1}\|@var{regexp2}
@cindex @acronym{GNU} extensions, to basic regular expressions
Matches either @var{regexp1} or @var{regexp2}. Use
parentheses to use complex alternative regular expressions.
The matching process tries each alternative in turn, from
left to right, and the first one that succeeds is used.
It is a @acronym{GNU} extension.

@item @var{regexp1}@var{regexp2}
Matches the concatenation of @var{regexp1} and @var{regexp2}.
Concatenation binds more tightly than @code{\|}, @code{^}, and
@code{$}, but less tightly than the other regular expression
operators.

@item \@var{digit}
Matches the @var{digit}-th @code{\(@dots{}\)} parenthesized
subexpression in the regular expression. This is called a @dfn{back
reference}.
Subexpressions are implicitly numbered by counting
occurrences of @code{\(} left-to-right.

@item \n
Matches the newline character.

@item \@var{char}
Matches @var{char}, where @var{char} is one of @code{$},
@code{*}, @code{.}, @code{[}, @code{\}, or @code{^}.
Note that the only C-like
backslash sequences that you can portably assume to be
interpreted are @code{\n} and @code{\\}; in particular
@code{\t} is not portable, and matches a @samp{t} under most
implementations of @command{sed}, rather than a tab character.

@end table

@cindex Greedy regular expression matching
Note that the regular expression matcher is greedy, i.e., matches
are attempted from left to right and, if two or more matches are
possible starting at the same character, it selects the longest.

@noindent
Examples:
@table @samp
@item abcdef
Matches @samp{abcdef}.

@item a*b
Matches zero or more @samp{a}s followed by a single
@samp{b}. For example, @samp{b} or @samp{aaaaab}.

@item a\?b
Matches @samp{b} or @samp{ab}.

@item a\+b\+
Matches one or more @samp{a}s followed by one or more
@samp{b}s: @samp{ab} is the shortest possible match, but
other examples are @samp{aaaab} or @samp{abbbbb} or
@samp{aaaaaabbbbbbb}.

@item .*
@itemx .\+
These two both match all the characters in a string;
however, the first matches every string (including the empty
string), while the second matches only strings containing
at least one character.

@item ^main.*(.*)
This matches a string starting with @samp{main},
followed by an opening and closing
parenthesis. The @samp{n}, @samp{(} and @samp{)} need not
be adjacent.

@item ^#
This matches a string beginning with @samp{#}.

@item \\$
This matches a string ending with a single backslash. The
regexp contains two backslashes for escaping.

@item \$
Instead, this matches a string consisting of a single dollar sign,
because it is escaped.

@item [a-zA-Z0-9]
In the C locale, this matches any single @acronym{ASCII} letter or digit.

@item [^ @kbd{tab}]\+
(Here @kbd{tab} stands for a single tab character.)
This matches a string of one or more
characters, none of which is a space or a tab.
Usually this means a word.

@item ^\(.*\)\n\1$
This matches a string consisting of two equal substrings separated by
a newline.

@item .\@{9\@}A$
This matches nine characters followed by an @samp{A}.

@item ^.\@{15\@}A
This matches the start of a string that contains 16 characters,
the last of which is an @samp{A}.

@end table



@node Common Commands
@section Often-Used Commands

If you use @command{sed} at all, you will quite likely want to know
these commands.

@table @code
@item #
[No addresses allowed.]

@findex # (comments)
@cindex Comments, in scripts
The @code{#} character begins a comment;
the comment continues until the next newline.

@cindex Portability, comments
If you are concerned about portability, be aware that
some implementations of @command{sed} (which are not @sc{posix}
conformant) may only support a single one-line comment,
and then only when the very first character of the script is a @code{#}.

@findex -n, forcing from within a script
@cindex Caveat --- #n on first line
Warning: if the first two characters of the @command{sed} script
are @code{#n}, then the @option{-n} (no-autoprint) option is forced.
If you want to put a comment in the first line of your script
and that comment begins with the letter @samp{n}
and you do not want this behavior,
then be sure to either use a capital @samp{N},
or place at least one space before the @samp{n}.

@item q [@var{exit-code}]
This command only accepts a single address.

@findex q (quit) command
@cindex @value{SSEDEXT}, returning an exit code
@cindex Quitting
Exit @command{sed} without processing any more commands or input.
Note that the current pattern space is printed if auto-print is
not disabled with the @option{-n} option. The ability to return
an exit code from the @command{sed} script is a @value{SSED} extension.

@item d
@findex d (delete) command
@cindex Text, deleting
Delete the pattern space;
immediately start next cycle.

@item p
@findex p (print) command
@cindex Text, printing
Print out the pattern space (to the standard output).
This command is usually only used in conjunction with the @option{-n}
command-line option.

@item n
@findex n (next-line) command
@cindex Next input line, replace pattern space with
@cindex Read next input line
If auto-print is not disabled, print the pattern space,
then, regardless, replace the pattern space with the next line of input.
If there is no more input then @command{sed} exits without processing
any more commands.

@item @{ @var{commands} @}
@findex @{@} command grouping
@cindex Grouping commands
@cindex Command groups
A group of commands may be enclosed between
@code{@{} and @code{@}} characters.
This is particularly useful when you want a group of commands
to be triggered by a single address (or address-range) match.

@end table

@node The "s" Command
@section The @code{s} Command

The syntax of the @code{s} (as in substitute) command is
@samp{s/@var{regexp}/@var{replacement}/@var{flags}}. The @code{/}
characters may be uniformly replaced by any other single
character within any given @code{s} command. The @code{/}
character (or whatever other character is used in its stead)
can appear in the @var{regexp} or @var{replacement}
only if it is preceded by a @code{\} character.
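
As a small illustration of the delimiter rule, the following sketch
performs the same substitution twice: once with the default @code{/}
delimiter, where every @code{/} in the pattern and replacement has to
be escaped, and once with @code{|} as the delimiter. The sample path
and the replacement directory are made up for the example.

```shell
# Hypothetical input: a path whose /usr/local prefix should become /opt.

# Default delimiter: each / inside the regexp and replacement needs a \.
printf '%s\n' '/usr/local/bin' | sed 's/\/usr\/local/\/opt/'

# Same substitution with | as the delimiter: no escaping is needed.
printf '%s\n' '/usr/local/bin' | sed 's|/usr/local|/opt|'
```

Both commands print @samp{/opt/bin}. Choosing a delimiter that does not
occur in the pattern or the replacement usually gives the more readable
script.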

The @code{s} command is probably the most important in @command{sed}
and has a lot of different options. Its basic concept is simple:
the @code{s} command attempts to match the pattern
space against the supplied @var{regexp}; if the match is
successful, then that portion of the pattern
space which was matched is replaced with @var{replacement}.

@cindex Backreferences, in regular expressions
@cindex Parenthesized substrings
The @var{replacement} can contain @code{\@var{n}} (@var{n} being
a number from 1 to 9, inclusive) references, which refer to
the portion of the match which is contained between the @var{n}th
@code{\(} and its matching @code{\)}.
Also, the @var{replacement} can contain unescaped @code{&}
characters which reference the whole matched portion
of the pattern space.
@cindex @value{SSEDEXT}, case modifiers in @code{s} commands
Finally, as a @value{SSED} extension, you can include a
special sequence made of a backslash and one of the letters
@code{L}, @code{l}, @code{U}, @code{u}, or @code{E}.
The meaning is as follows:

@table @code
@item \L
Turn the replacement
to lowercase until a @code{\U} or @code{\E} is found,

@item \l
Turn the
next character to lowercase,

@item \U
Turn the replacement to uppercase
until a @code{\L} or @code{\E} is found,

@item \u
Turn the next character
to uppercase,

@item \E
Stop case conversion started by @code{\L} or @code{\U}.
@end table

To include a literal @code{\}, @code{&}, or newline in the final
replacement, be sure to precede the desired @code{\}, @code{&},
or newline in the @var{replacement} with a @code{\}.
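
The replacement features described above can be sketched with a few
one-line commands. The input strings are invented for the example, and
the last command assumes a @command{sed} that supports the
case-conversion extension.

```shell
# Hypothetical sample strings, chosen only to show each feature.

# \1 and \2 refer to the first and second \(...\) groups: swap two words.
printf '%s\n' 'hello world' | sed 's/\(hello\) \(world\)/\2 \1/'

# An unescaped & stands for the whole matched text.
printf '%s\n' 'order 42 shipped' | sed 's/[0-9][0-9]*/<&>/'

# \& inserts a literal ampersand instead.
printf '%s\n' 'salt and pepper' | sed 's/and/\&/'

# Extension (assumed available): \u upper-cases the next character,
# \U upper-cases everything that follows until \E or the end.
printf '%s\n' 'hello world' | sed 's/\(hello\) \(world\)/\u\1 \U\2/'
```

In order, the four commands print @samp{world hello},
@samp{order <42> shipped}, @samp{salt & pepper}, and @samp{Hello WORLD}.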

@findex s command, option flags
@cindex Substitution of text, options
The @code{s} command can be followed by zero or more of the
following @var{flags}:

@table @code
@item g
@cindex Global substitution
@cindex Replacing all text matching regexp in a line
Apply the replacement to @emph{all} matches to the @var{regexp},
not just the first.

@item @var{number}
@cindex Replacing only @var{n}th match of regexp in a line
Only replace the @var{number}th match of the @var{regexp}.

@cindex @acronym{GNU} extensions, @code{g} and @var{number} modifier interaction in @code{s} command
@cindex Mixing @code{g} and @var{number} modifiers in the @code{s} command
Note: the @sc{posix} standard does not specify what should happen
when you mix the @code{g} and @var{number} modifiers,
and currently there is no widely agreed upon meaning
across @command{sed} implementations.
For @value{SSED}, the interaction is defined to be:
ignore matches before the @var{number}th,
and then match and replace all matches from
the @var{number}th on.

@item p
@cindex Text, printing after substitution
If the substitution was made, then print the new pattern space.

Note: when both the @code{p} and @code{e} options are specified,
the relative ordering of the two produces very different results.
In general, @code{ep} (evaluate then print) is what you want,
but operating the other way round can be useful for debugging.
For this reason, the current version of @value{SSED} interprets
specially the presence of @code{p} options both before and after
@code{e}, printing the pattern space before and after evaluation,
while in general flags for the @code{s} command show their
effect just once. This behavior, although documented, might
change in future versions.

@item w @var{file-name}
@cindex Text, writing to a file after substitution
@cindex @value{SSEDEXT}, @file{/dev/stdout} file
@cindex @value{SSEDEXT}, @file{/dev/stderr} file
If the substitution was made, then write out the result to the named file.
As a @value{SSED} extension, two special values of @var{file-name} are
supported: @file{/dev/stderr}, which writes the result to the standard
error, and @file{/dev/stdout}, which writes to the standard
output.@footnote{This is equivalent to @code{p} unless the @option{-i}
option is being used.}

@item e
@cindex Evaluate Bourne-shell commands, after substitution
@cindex Subprocesses
@cindex @value{SSEDEXT}, evaluating Bourne-shell commands
@cindex @value{SSEDEXT}, subprocesses
This command allows one to pipe input from a shell command
into pattern space. If a substitution was made, the command
that is found in pattern space is executed and pattern space
is replaced with its output. A trailing newline is suppressed;
results are undefined if the command to be executed contains
a @sc{nul} character. This is a @value{SSED} extension.

@item I
@itemx i
@cindex @acronym{GNU} extensions, @code{I} modifier
@cindex Case-insensitive matching
@ifset PERL
@cindex Perl-style regular expressions, case-insensitive
@end ifset
The @code{I} modifier to regular-expression matching is a @acronym{GNU}
extension which makes @command{sed} match @var{regexp} in a
case-insensitive manner.

@item M
@itemx m
@cindex @value{SSEDEXT}, @code{M} modifier
@ifset PERL
@cindex Perl-style regular expressions, multiline
@end ifset
The @code{M} modifier to regular-expression matching is a @value{SSED}
extension which causes @code{^} and @code{$} to match respectively
(in addition to the normal behavior) the empty string after a newline,
and the empty string before a newline.
There are special character
sequences
@ifset PERL
(@code{\A} and @code{\Z} in Perl mode, @code{\`} and @code{\'}
in basic or extended regular expression modes)
@end ifset
@ifclear PERL
(@code{\`} and @code{\'})
@end ifclear
which always match the beginning or the end of the buffer.
@code{M} stands for @cite{multi-line}.

@ifset PERL
@item S
@itemx s
@cindex @value{SSEDEXT}, @code{S} modifier
@cindex Perl-style regular expressions, single line
The @code{S} modifier to regular-expression matching is only valid
in Perl mode and specifies that the dot character (@code{.}) will
match the newline character too. @code{S} stands for @cite{single-line}.
@end ifset

@ifset PERL
@item X
@itemx x
@cindex @value{SSEDEXT}, @code{X} modifier
@cindex Perl-style regular expressions, extended
The @code{X} modifier to regular-expression matching is also
valid in Perl mode only. If it is used, whitespace in the
pattern (other than in a character class) and
characters between a @kbd{#} outside a character class and the
next newline character are ignored. An escaping backslash
can be used to include a whitespace or @kbd{#} character as part
of the pattern.
@end ifset
@end table


@node Other Commands
@section Less Frequently-Used Commands

Though perhaps less frequently used than those in the previous
section, some very small yet useful @command{sed} scripts can be built with
these commands.

@table @code
@item y/@var{source-chars}/@var{dest-chars}/
(The @code{/} characters may be uniformly replaced by
any other single character within any given @code{y} command.)

@findex y (transliterate) command
@cindex Transliteration
Transliterate any characters in the pattern space which match
any of the @var{source-chars} with the corresponding character
in @var{dest-chars}.

Instances of the @code{/} (or whatever other character is used in its stead),
@code{\}, or newlines can appear in the @var{source-chars} or @var{dest-chars}
lists, provided that each instance is escaped by a @code{\}.
The @var{source-chars} and @var{dest-chars} lists @emph{must}
contain the same number of characters (after de-escaping).

@item a\
@itemx @var{text}
@cindex @value{SSEDEXT}, two addresses supported by most commands
As a @acronym{GNU} extension, this command accepts two addresses.

@findex a (append text lines) command
@cindex Appending text after a line
@cindex Text, appending
Queue the lines of text which follow this command
(each but the last ending with a @code{\},
which are removed from the output)
to be output at the end of the current cycle,
or when the next input line is read.

Escape sequences in @var{text} are processed, so you should
use @code{\\} in @var{text} to print a single backslash.

As a @acronym{GNU} extension, if between the @code{a} and the newline there is
other than a whitespace-@code{\} sequence, then the text of this line,
starting at the first non-whitespace character after the @code{a},
is taken as the first line of the @var{text} block.
(This enables a simplification in scripting a one-line add.)
This extension also works with the @code{i} and @code{c} commands.

@item i\
@itemx @var{text}
@cindex @value{SSEDEXT}, two addresses supported by most commands
As a @acronym{GNU} extension, this command accepts two addresses.

@findex i (insert text lines) command
@cindex Inserting text before a line
@cindex Text, insertion
Immediately output the lines of text which follow this command
(each but the last ending with a @code{\},
which are removed from the output).

@item c\
@itemx @var{text}
@findex c (change to text lines) command
@cindex Replacing selected lines with other text
Delete the lines matching the address or address-range,
and output the lines of text which follow this command
(each but the last ending with a @code{\},
which are removed from the output)
in place of the last line
(or in place of each line, if no addresses were specified).
A new cycle is started after this command is done,
since the pattern space will have been deleted.

@item =
@cindex @value{SSEDEXT}, two addresses supported by most commands
As a @acronym{GNU} extension, this command accepts two addresses.

@findex = (print line number) command
@cindex Printing line number
@cindex Line number, printing
Print out the current input line number (with a trailing newline).

@item l @var{n}
@findex l (list unambiguously) command
@cindex List pattern space
@cindex Printing text unambiguously
@cindex Line length, setting
@cindex @value{SSEDEXT}, setting line length
Print the pattern space in an unambiguous form:
non-printable characters (and the @code{\} character)
are printed in C-style escaped form; long lines are split,
with a trailing @code{\} character to indicate the split;
the end of each line is marked with a @code{$}.

@var{n} specifies the desired line-wrap length;
a length of 0 (zero) means to never wrap long lines. If omitted,
the default as specified on the command line is used. The @var{n}
parameter is a @value{SSED} extension.

@item r @var{filename}
@cindex @value{SSEDEXT}, two addresses supported by most commands
As a @acronym{GNU} extension, this command accepts two addresses.

@findex r (read file) command
@cindex Read text from a file
@cindex @value{SSEDEXT}, @file{/dev/stdin} file
Queue the contents of @var{filename} to be read and
inserted into the output stream at the end of the current cycle,
or when the next input line is read.
Note that if @var{filename} cannot be read, it is treated as
if it were an empty file, without any error indication.

As a @value{SSED} extension, the special value @file{/dev/stdin}
is supported for the file name, which reads the contents of the
standard input.

@item w @var{filename}
@findex w (write file) command
@cindex Write to a file
@cindex @value{SSEDEXT}, @file{/dev/stdout} file
@cindex @value{SSEDEXT}, @file{/dev/stderr} file
Write the pattern space to @var{filename}.
As a @value{SSED} extension, two special values of @var{filename} are
supported: @file{/dev/stderr}, which writes the result to the standard
error, and @file{/dev/stdout}, which writes to the standard
output.@footnote{This is equivalent to @code{p} unless the @option{-i}
option is being used.}

The file will be created (or truncated) before the
first input line is read; all @code{w} commands
(including instances of the @code{w} flag on successful @code{s} commands)
which refer to the same @var{filename} are output without
closing and reopening the file.

@item D
@findex D (delete first line) command
@cindex Delete first line from pattern space
Delete text in the pattern space up to the first newline.
If any text is left, restart cycle with the resultant
pattern space (without reading a new line of input),
otherwise start a normal new cycle.

@item N
@findex N (append Next line) command
@cindex Next input line, append to pattern space
@cindex Append next input line to pattern space
Add a newline to the pattern space,
then append the next line of input to the pattern space.
If there is no more input then @command{sed} exits without processing
any more commands.

@item P
@findex P (print first line) command
@cindex Print first line from pattern space
Print out the portion of the pattern space up to the first newline.

@item h
@findex h (hold) command
@cindex Copy pattern space into hold space
@cindex Replace hold space with copy of pattern space
@cindex Hold space, copying pattern space into
Replace the contents of the hold space with the contents of the pattern space.

@item H
@findex H (append Hold) command
@cindex Append pattern space to hold space
@cindex Hold space, appending from pattern space
Append a newline to the contents of the hold space,
and then append the contents of the pattern space to that of the hold space.

@item g
@findex g (get) command
@cindex Copy hold space into pattern space
@cindex Replace pattern space with copy of hold space
@cindex Hold space, copy into pattern space
Replace the contents of the pattern space with the contents of the hold space.

@item G
@findex G (appending Get) command
@cindex Append hold space to pattern space
@cindex Hold space, appending to pattern space
Append a newline to the contents of the pattern space,
and then append the contents of the hold space to that of the pattern space.

@item x
@findex x (eXchange) command
@cindex Exchange hold space with pattern space
@cindex Hold space, exchange with pattern space
Exchange the contents of the hold and pattern spaces.

@end table


@node Programming Commands
@section Commands for @command{sed} gurus

In most cases, use of these commands indicates that you are
probably better off programming in something like @command{awk}
or Perl. But occasionally one is committed to sticking
with @command{sed}, and these commands can enable one to write
quite convoluted scripts.

@cindex Flow of control in scripts
@table @code
@item : @var{label}
[No addresses allowed.]

@findex : (label) command
@cindex Labels, in scripts
Specify the location of @var{label} for branch commands.
In all other respects, a no-op.

@item b @var{label}
@findex b (branch) command
@cindex Branch to a label, unconditionally
@cindex Goto, in scripts
Unconditionally branch to @var{label}.
The @var{label} may be omitted, in which case the next cycle is started.

@item t @var{label}
@findex t (test and branch if successful) command
@cindex Branch to a label, if @code{s///} succeeded
@cindex Conditional branch
Branch to @var{label} only if there has been a successful @code{s}ubstitution
since the last input line was read or conditional branch was taken.
The @var{label} may be omitted, in which case the next cycle is started.

@end table

@node Extended Commands
@section Commands Specific to @value{SSED}

These commands are specific to @value{SSED}, so you
must use them with care and only when you are sure that
hindering portability is not evil. They allow you to check
for @value{SSED} extensions or to do tasks that are required
quite often, yet are unsupported by standard @command{sed}s.
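
One way to perform such a check from a shell script is sketched below.
It relies on the @code{v} command described later in this section,
which does nothing where the extensions are supported and is rejected
elsewhere. The messages are placeholders, and the probe is an
illustration rather than an official detection method.

```shell
# Probe for the extensions: a sed without the v command fails to parse
# the script and exits with a non-zero status.
if sed v </dev/null >/dev/null 2>&1; then
    echo 'extensions available'
else
    echo 'extensions not available; fall back to a portable script'
fi
```

A wrapper script can use the same test to choose between an
extension-based script and a portable one.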

@table @code
@item e [@var{command}]
@findex e (evaluate) command
@cindex Evaluate Bourne-shell commands
@cindex Subprocesses
@cindex @value{SSEDEXT}, evaluating Bourne-shell commands
@cindex @value{SSEDEXT}, subprocesses
This command allows one to pipe input from a shell command
into pattern space. Without parameters, the @code{e} command
executes the command that is found in pattern space and
replaces the pattern space with the output; a trailing newline
is suppressed.

If a parameter is specified, instead, the @code{e} command
interprets it as a command and sends its output to the output stream
(like @code{r} does). The command can run across multiple
lines, all but the last ending with a backslash.

In both cases, the results are undefined if the command to be
executed contains a @sc{nul} character.

@item L @var{n}
@findex L (fLow paragraphs) command
@cindex Reformat pattern space
@cindex Reformatting paragraphs
@cindex @value{SSEDEXT}, reformatting paragraphs
@cindex @value{SSEDEXT}, @code{L} command
This @value{SSED} extension fills and joins lines in pattern space
to produce output lines of (at most) @var{n} characters, like
@code{fmt} does; if @var{n} is omitted, the default as specified
on the command line is used. This command is considered a failed
experiment and, unless there is enough demand (which seems unlikely),
will be removed in future versions.

@ignore
Blank lines, spaces between words, and indentation are
preserved in the output; successive input lines with different
indentation are not joined; tabs are expanded to 8 columns.

If the pattern space contains multiple lines, they are joined, but
since the pattern space usually contains a single line, the behavior
of a simple @code{L;d} script is the same as @samp{fmt -s} (i.e.,
it does not join short lines to form longer ones).

@var{n} specifies the desired line-wrap length; if omitted,
the default as specified on the command line is used.
@end ignore

@item Q [@var{exit-code}]
This command only accepts a single address.

@findex Q (silent Quit) command
@cindex @value{SSEDEXT}, quitting silently
@cindex @value{SSEDEXT}, returning an exit code
@cindex Quitting
This command is the same as @code{q}, but will not print the
contents of pattern space. Like @code{q}, it provides the
ability to return an exit code to the caller.

This command can be useful because the only alternative ways
to accomplish this apparently trivial function are to use
the @option{-n} option (which can unnecessarily complicate
your script) or to resort to the following snippet, which
wastes time by reading the whole file without any visible effect:

@example
:eat
$d       @i{@r{Quit silently on the last line}}
N        @i{@r{Read another line, silently}}
g        @i{@r{Overwrite pattern space each time to save memory}}
b eat
@end example

@item R @var{filename}
@findex R (read line) command
@cindex Read text from a file
@cindex @value{SSEDEXT}, reading a file a line at a time
@cindex @value{SSEDEXT}, @code{R} command
@cindex @value{SSEDEXT}, @file{/dev/stdin} file
Queue a line of @var{filename} to be read and
inserted into the output stream at the end of the current cycle,
or when the next input line is read.
Note that if @var{filename} cannot be read, or if its end is
reached, no line is appended, without any error indication.

As with the @code{r} command, the special value @file{/dev/stdin}
is supported for the file name, which reads a line from the
standard input.

@item T @var{label}
@findex T (test and branch if failed) command
@cindex @value{SSEDEXT}, branch if @code{s///} failed
@cindex Branch to a label, if @code{s///} failed
@cindex Conditional branch
Branch to @var{label} only if there have been no successful
@code{s}ubstitutions since the last input line was read or
conditional branch was taken. The @var{label} may be omitted,
in which case the next cycle is started.

@item v @var{version}
@findex v (version) command
@cindex @value{SSEDEXT}, checking for their presence
@cindex Requiring @value{SSED}
This command does nothing, but makes @command{sed} fail if
@value{SSED} extensions are not supported, simply because other
versions of @command{sed} do not implement it. In addition, you
can specify the version of @command{sed} that your script
requires, such as @code{4.0.5}. The default is @code{4.0}
because that is the first version that implemented this command.

This command enables all @value{SSEDEXT} even if
@env{POSIXLY_CORRECT} is set in the environment.

@item W @var{filename}
@findex W (write first line) command
@cindex Write first line to a file
@cindex @value{SSEDEXT}, writing first line to a file
Write to the given filename the portion of the pattern space up to
the first newline. Everything said under the @code{w} command about
file handling holds here too.

@item z
@findex z (Zap) command
@cindex @value{SSEDEXT}, emptying pattern space
@cindex Emptying pattern space
This command empties the content of pattern space. It is
usually the same as @samp{s/.*//}, but is more efficient
and works in the presence of invalid multibyte sequences
in the input stream.
@sc{posix} mandates that such sequences
are @emph{not} matched by @samp{.}, so that there is no portable
way to clear @command{sed}'s buffers in the middle of the
script in most multibyte locales (including UTF-8 locales).
@end table

@node Escapes
@section @acronym{GNU} Extensions for Escapes in Regular Expressions

@cindex @acronym{GNU} extensions, special escapes
Until this chapter, we have only encountered escapes of the form
@samp{\^}, which tell @command{sed} not to interpret the circumflex
as a special character, but rather to take it literally. For
example, @samp{\*} matches a single asterisk rather than zero
or more backslashes.

@cindex @code{POSIXLY_CORRECT} behavior, escapes
This chapter introduces another kind of escape@footnote{All
the escapes introduced here are @acronym{GNU}
extensions, with the exception of @code{\n}. In basic regular
expression mode, setting @code{POSIXLY_CORRECT} disables them inside
bracket expressions.}---that
is, escapes that are applied to a character or sequence of characters
that ordinarily are taken literally, and that @command{sed} replaces
with a special character. This provides a way
of encoding non-printable characters in patterns in a visible manner.
There is no restriction on the appearance of non-printing characters
in a @command{sed} script but when a script is being prepared in the
shell or by text editing, it is usually easier to use one of
the following escape sequences than the binary character it
represents.

The list of these escapes is:

@table @code
@item \a
Produces or matches a @sc{bel} character, that is an ``alert'' (@sc{ascii} 7).

@item \f
Produces or matches a form feed (@sc{ascii} 12).

@item \n
Produces or matches a newline (@sc{ascii} 10).

@item \r
Produces or matches a carriage return (@sc{ascii} 13).

@item \t
Produces or matches a horizontal tab (@sc{ascii} 9).

@item \v
Produces or matches a so-called ``vertical tab'' (@sc{ascii} 11).

@item \c@var{x}
Produces or matches @kbd{@sc{Control}-@var{x}}, where @var{x} is
any character. The precise effect of @samp{\c@var{x}} is as follows:
if @var{x} is a lower case letter, it is converted to upper case.
Then bit 6 of the character (hex 40) is inverted. Thus @samp{\cz} becomes
hex 1A, but @samp{\c@{} becomes hex 3B, while @samp{\c;} becomes hex 7B.

@item \d@var{xxx}
Produces or matches a character whose decimal @sc{ascii} value is @var{xxx}.

@item \o@var{xxx}
@ifset PERL
@item \@var{xxx}
@end ifset
Produces or matches a character whose octal @sc{ascii} value is @var{xxx}.
@ifset PERL
The syntax without the @code{o} is active in Perl mode, while the one
with the @code{o} is active in the normal or extended @sc{posix} regular
expression modes.
@end ifset

@item \x@var{xx}
Produces or matches a character whose hexadecimal @sc{ascii} value is @var{xx}.
@end table

@samp{\b} (backspace) was omitted because of the conflict with
the existing ``word boundary'' meaning.

Other escapes match a particular character class and are valid only in
regular expressions:

@table @code
@item \w
Matches any ``word'' character. A ``word'' character is any
letter or digit or the underscore character.

@item \W
Matches any ``non-word'' character.

@item \b
Matches a word boundary; that is, it matches if the character
to the left is a ``word'' character and the character to the
right is a ``non-word'' character, or vice-versa.

@item \B
Matches everywhere but on a word boundary; that is, it matches
if the character to the left and the character to the right
are either both ``word'' characters or both ``non-word''
characters.

@item \`
Matches only at the start of pattern space. This is different
from @code{^} in multi-line mode.

@item \'
Matches only at the end of pattern space. This is different
from @code{$} in multi-line mode.

@ifset PERL
@item \G
Match only at the start of pattern space or, when doing a global
substitution using the @code{s///g} command and option, at
the end-of-match position of the prior match. For example,
@samp{s/\Ga/Z/g} will change an initial run of @code{a}s to
a run of @code{Z}s.
@end ifset
@end table

@node Examples
@chapter Some Sample Scripts

Here are some @command{sed} scripts to guide you in the art of mastering
@command{sed}.

@menu
Some exotic examples:
* Centering lines::
* Increment a number::
* Rename files to lower case::
* Print bash environment::
* Reverse chars of lines::

Emulating standard utilities:
* tac::                 Reverse lines of files
* cat -n::              Numbering lines
* cat -b::              Numbering non-blank lines
* wc -c::               Counting chars
* wc -w::               Counting words
* wc -l::               Counting lines
* head::                Printing the first lines
* tail::                Printing the last lines
* uniq::                Make duplicate lines unique
* uniq -d::             Print duplicated lines of input
* uniq -u::             Remove all duplicated lines
* cat -s::              Squeezing blank lines
@end menu

@node Centering lines
@section Centering Lines

This script centers all lines of a file on an 80-column width.
To change that width, the number in @code{\@{@dots{}\@}} must be
replaced, and the number of added spaces also must be changed.
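
As a sketch of the change just described, the commands below adapt the
script from this section to a 40-column width. The width of 40 is an
assumption made for the example: the buffer is filled with 5 spaces
repeated 8 times, and the interval keeps 41 characters. The function
name @code{center40} is hypothetical, and the tab-to-space step of the
full script is left out for brevity.

```shell
# Assumed target width: 40 columns (5 spaces repeated 8 times = 40),
# so the interval expression keeps 41 characters (40 + the newline).
center40() {
    sed -e '1 {' -e 'x' -e 's/^$/     /' -e 's/^.*$/&&&&&&&&/' -e 'x' -e '}' \
        -e 's/^ *//' -e 's/ *$//' \
        -e 'G' \
        -e 's/^\(.\{41\}\).*$/\1/' \
        -e 's/^\(.*\)\n\(.*\)\2/\2\1/'
}

# Sample input chosen for illustration.
printf '%s\n' 'hi' | center40
```

For the two-character input, 38 spaces remain after the newline, so
half of them (19) are moved in front of the text.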

Note how the buffer commands are used to separate parts in
the regular expressions to be matched---this is a common
technique.

@c start-------------------------------------------
@example
#!/usr/bin/sed -f

@group
# Put 80 spaces in the buffer
1 @{
  x
  s/^$/          /
  s/^.*$/&&&&&&&&/
  x
@}
@end group

@group
# del leading and trailing spaces
y/@kbd{tab}/ /
s/^ *//
s/ *$//
@end group

@group
# add a newline and 80 spaces to end of line
G
@end group

@group
# keep first 81 chars (80 + a newline)
s/^\(.\@{81\@}\).*$/\1/
@end group

@group
# \2 matches half of the spaces, which are moved to the beginning
s/^\(.*\)\n\(.*\)\2/\2\1/
@end group
@end example
@c end---------------------------------------------

@node Increment a number
@section Increment a Number

This script is one of a few that demonstrate how to do arithmetic
in @command{sed}. This is indeed possible,@footnote{@command{sed} guru Greg
Ubben wrote an implementation of the @command{dc} @sc{rpn} calculator!
It is distributed together with sed.} but must be done manually.

To increment a number, you just add 1 to the last digit, replacing
it with the following digit. There is one exception: when the digit
is a nine, the previous digits must also be incremented until you
reach one that is not a nine.

This solution by Bruno Haible is very clever and smart because
it uses a single buffer; if you don't have this limitation, the
algorithm used in @ref{cat -n, Numbering lines}, is faster.
It works by replacing trailing nines with an underscore, then
using multiple @code{s} commands to increment the last digit,
and then again substituting underscores with zeros.
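
The three phases can be traced by hand with separate commands. The
input @samp{199} is an arbitrary sample, and each command below runs
only one phase, feeding it the result of the previous one.

```shell
# Trace of the three phases on the sample input 199.

# Phase 1: turn trailing nines into underscores, one per loop iteration.
printf '%s\n' '199' | sed -e ':d' -e 's/9\(_*\)$/_\1/' -e 'td'

# Phase 2: increment the last remaining digit (here 1 becomes 2).
printf '%s\n' '1__' | sed 's/1\(_*\)$/2\1/'

# Phase 3: turn the underscores into zeros.
printf '%s\n' '2__' | sed 'y/_/0/'
```

The commands print @samp{1__}, @samp{2__}, and @samp{200} in turn.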

@c start-------------------------------------------
@example
#!/usr/bin/sed -f

/[^0-9]/ d

@group
# replace all trailing 9s by _ (any other character except digits
# could be used)
:d
s/9\(_*\)$/_\1/
td
@end group

@group
# incr last digit only.  The first line adds a most-significant
# digit of 1 if we have to add a digit.
#
# The @code{tn} commands are not necessary, but make the thing
# faster
@end group

@group
s/^\(_*\)$/1\1/; tn
s/8\(_*\)$/9\1/; tn
s/7\(_*\)$/8\1/; tn
s/6\(_*\)$/7\1/; tn
s/5\(_*\)$/6\1/; tn
s/4\(_*\)$/5\1/; tn
s/3\(_*\)$/4\1/; tn
s/2\(_*\)$/3\1/; tn
s/1\(_*\)$/2\1/; tn
s/0\(_*\)$/1\1/; tn
@end group

@group
:n
y/_/0/
@end group
@end example
@c end---------------------------------------------

@node Rename files to lower case
@section Rename Files to Lower Case

This is a pretty strange use of @command{sed}.  We transform text
into shell commands, then just feed them to the shell.
Don't worry, even worse hacks are done when using @command{sed}; I have
seen a script converting the output of @command{date} into a @command{bc}
program!

The main body of this is the @command{sed} script, which remaps the name
from lower to upper (or vice-versa) and even checks
whether the remapped name is the same as the original name.
Note how the script is parameterized using shell
variables and proper quoting.

@c start-------------------------------------------
@example
@group
#! /bin/sh
# rename files to lower/upper case...
#
# usage:
#    move-to-lower *
#    move-to-upper *
# or
#    move-to-lower -R .
#    move-to-upper -R .
#
@end group

@group
help()
@{
        cat << eof
Usage: $0 [-n] [-R] [-h] files...
@end group

@group
-n      do nothing, only see what would be done
-R      recursive (use find)
-h      this message
files   files to remap to lower case
@end group

@group
Examples:
       $0 -n *        (see if everything is ok, then...)
       $0 *
@end group

       $0 -R .

@group
eof
@}
@end group

@group
apply_cmd='sh'
finder='echo "$@@" | tr " " "\n"'
files_only=
@end group

@group
while :
do
    case "$1" in
        -n) apply_cmd='cat' ;;
        -R) finder='find "$@@" -type f';;
        -h) help ; exit 1 ;;
        *) break ;;
    esac
    shift
done
@end group

@group
if [ -z "$1" ]; then
        echo Usage: $0 [-h] [-n] [-R] files...
        exit 1
fi
@end group

@group
LOWER='abcdefghijklmnopqrstuvwxyz'
UPPER='ABCDEFGHIJKLMNOPQRSTUVWXYZ'
@end group

@group
case `basename $0` in
        *upper*) TO=$UPPER; FROM=$LOWER ;;
        *)       FROM=$UPPER; TO=$LOWER ;;
esac
@end group

eval $finder | sed -n '

@group
# remove all trailing slashes
s/\/*$//
@end group

@group
# add ./ if there is no path, only a filename
/\//! s/^/.\//
@end group

@group
# save path+filename
h
@end group

@group
# remove path
s/.*\///
@end group

@group
# do conversion only on filename
y/'$FROM'/'$TO'/
@end group

@group
# now line contains original path+file, while
# hold space contains the new filename
x
@end group

@group
# add converted file name to line, which now contains
# path/file-name\nconverted-file-name
G
@end group

@group
# check if converted file name is equal to original file name,
# if it is, do not print anything
/^.*\/\(.*\)\n\1/b
@end group

@group
# now, transform path/fromfile\n, into
# mv path/fromfile path/tofile and print it
s/^\(.*\/\)\(.*\)\n\(.*\)$/mv "\1\2" "\1\3"/p
@end group

' | $apply_cmd
@end example
@c end---------------------------------------------

@node Print bash environment
@section Print @command{bash} Environment

This script strips the definition of the shell functions
from the output of the @command{set} Bourne-shell command.

@c start-------------------------------------------
@example
#!/bin/sh

@group
set | sed -n '
:x
@end group

@group
@ifinfo
# if no occurrence of "=()" print and load next line
@end ifinfo
@ifnotinfo
# if no occurrence of @samp{=()} print and load next line
@end ifnotinfo
/=()/! @{ p; b; @}
/ () $/! @{ p; b; @}
@end group

@group
# possible start of functions section
# save the line in case this is a var like FOO="() "
h
@end group

@group
# if the next line has a brace, we quit because
# nothing comes after functions
n
/^@{/ q
@end group

@group
# print the old line
x; p
@end group

@group
# work on the new line now
x; bx
'
@end group
@end example
@c end---------------------------------------------

@node Reverse chars of lines
@section Reverse Characters of Lines

This script can be used to reverse the position of characters
in lines.  The technique moves two characters at a time, hence
it is faster than more intuitive implementations.

Note the @code{tx} command before the definition of the label.
This is often needed to reset the flag that is tested by
the @code{t} command.

Imaginative readers will find uses for this script.  An example
is reversing the output of @command{banner}.@footnote{This requires
another script to pad the output of banner; for example

@example
#! /bin/sh

banner -w $1 $2 $3 $4 |
  sed -e :a -e '/^.\@{0,'$1'\@}$/ @{ s/$/ /; ba; @}' |
  ~/sedscripts/reverseline.sed
@end example
}

@c start-------------------------------------------
@example
#!/usr/bin/sed -f

/../! b

@group
# Reverse a line.  Begin embedding the line between two newlines
s/^.*$/\
&\
/
@end group

@group
# Move first character at the end.  The regexp matches until
# there are zero or one characters between the markers
tx
:x
s/\(\n.\)\(.*\)\(.\n\)/\3\2\1/
tx
@end group

@group
# Remove the newline markers
s/\n//g
@end group
@end example
@c end---------------------------------------------

@node tac
@section Reverse Lines of Files

This one begins a series of totally useless (yet interesting)
scripts emulating various Unix commands.  This, in particular,
is a @command{tac} workalike.

Note that on implementations other than @acronym{GNU} @command{sed}
@ifset PERL
and @value{SSED}
@end ifset
this script might easily overflow internal buffers.

@c start-------------------------------------------
@example
#!/usr/bin/sed -nf

# reverse all lines of input, i.e. first line becomes last, ...

@group
# from the second line, the buffer (which contains all previous lines)
# is *appended* to current line, so, the order will be reversed
1! G
@end group

@group
# on the last line we're done -- print everything
$ p
@end group

@group
# store everything on the buffer again
h
@end group
@end example
@c end---------------------------------------------

@node cat -n
@section Numbering Lines

This script replaces @samp{cat -n}; in fact it formats its output
exactly like @acronym{GNU} @command{cat} does.

Of course this is completely useless, for two reasons: first,
because somebody else did it in C, second, because the following
Bourne-shell script could be used for the same purpose and would
be much faster:

@c start-------------------------------------------
@example
@group
#! /bin/sh
sed -e "=" $@@ | sed -e '
  s/^/      /
  N
  s/^ *\(......\)\n/\1 /
'
@end group
@end example
@c end---------------------------------------------

It uses @command{sed} to print the line number, then groups lines two
by two using @code{N}.  Of course, this script does not teach as much as
the one presented below.

The algorithm used for incrementing uses both buffers, so the line
is printed as soon as possible and then discarded.  The number
is split so that changing digits go in a buffer and unchanged ones go
in the other; the changed digits are modified in a single step
(using a @code{y} command).  The line number for the next line
is then composed and stored in the hold space, to be used in the
next iteration.

@c start-------------------------------------------
@example
#!/usr/bin/sed -nf

@group
# Prime the pump on the first line
x
/^$/ s/^.*$/1/
@end group

@group
# Add the correct line number before the pattern
G
h
@end group

@group
# Format it and print it
s/^/      /
s/^ *\(......\)\n/\1 /p
@end group

@group
# Get the line number from hold space; add a zero
# if we're going to add a digit on the next line
g
s/\n.*$//
/^9*$/ s/^/0/
@end group

@group
# separate changing/unchanged digits with an x
s/.9*$/x&/
@end group

@group
# keep changing digits in hold space
h
s/^.*x//
y/0123456789/1234567890/
x
@end group

@group
# keep unchanged digits in pattern space
s/x.*$//
@end group

@group
# compose the new number, remove the newline implicitly added by G
G
s/\n//
h
@end group
@end example
@c end---------------------------------------------

@node cat -b
@section Numbering Non-blank Lines

Emulating @samp{cat -b} is almost the same as @samp{cat
-n}---we only
have to select which lines are to be numbered and which are not.

The part that is common to this script and the previous one is
not commented to show how important it is to comment @command{sed}
scripts properly...

@c start-------------------------------------------
@example
#!/usr/bin/sed -nf

@group
/^$/ @{
  p
  b
@}
@end group

@group
# Same as cat -n from now
x
/^$/ s/^.*$/1/
G
h
s/^/      /
s/^ *\(......\)\n/\1 /p
x
s/\n.*$//
/^9*$/ s/^/0/
s/.9*$/x&/
h
s/^.*x//
y/0123456789/1234567890/
x
s/x.*$//
G
s/\n//
h
@end group
@end example
@c end---------------------------------------------

@node wc -c
@section Counting Characters

This script shows another way to do arithmetic with @command{sed}.
In this case we have to add possibly large numbers, so implementing
this by successive increments would not be feasible (and possibly
even more complicated to contrive than this script).

The approach is to map numbers to letters, kind of an abacus
implemented with @command{sed}.  @samp{a}s are units, @samp{b}s are
tens and so on: we simply add the number of characters
on the current line as units, and then propagate the carry
to tens, hundreds, and so on.

As usual, running totals are kept in hold space.

On the last line, we convert the abacus form back to decimal.
For the sake of variety, this is done with a loop rather than
with some 80 @code{s} commands@footnote{Some implementations
have a limit of 199 commands per script}: first we
convert units, removing @samp{a}s from the number; then we
rotate letters so that tens become @samp{a}s, and so on
until no more letters remain.
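
To make the conversion loop concrete, here is a small sketch that
runs only that loop on a hand-written abacus string (it assumes a
POSIX shell; the function name @code{abacus_to_decimal} is made up
for this example and the input strings are invented test data):

```shell
# Hypothetical helper: convert an abacus string (h..b, then a's for
# units) back to decimal, using the loop described above.
abacus_to_decimal() (
  sed -e ':loop' \
      -e '/a/!s/[b-h]*/&0/' \
      -e 's/aaaaaaaaa/9/' -e 's/aaaaaaaa/8/' -e 's/aaaaaaa/7/' \
      -e 's/aaaaaa/6/' -e 's/aaaaa/5/' -e 's/aaaa/4/' \
      -e 's/aaa/3/' -e 's/aa/2/' -e 's/a/1/' \
      -e 'y/bcdefgh/abcdefg/' \
      -e '/[a-h]/b loop'
)

echo bbbaaaa | abacus_to_decimal   # three tens, four units: prints 34
echo caa | abacus_to_decimal       # one hundred, two units: prints 102
```

In the second case there are no tens, so the loop inserts a
@samp{0} before moving on to the hundreds.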

@c start-------------------------------------------
@example
#!/usr/bin/sed -nf

@group
# Add n+1 a's to hold space (+1 is for the newline)
s/./a/g
H
x
s/\n/a/
@end group

@group
# Do the carry.  The t's and b's are not necessary,
# but they do speed up the thing
t a
: a;  s/aaaaaaaaaa/b/g; t b; b done
: b;  s/bbbbbbbbbb/c/g; t c; b done
: c;  s/cccccccccc/d/g; t d; b done
: d;  s/dddddddddd/e/g; t e; b done
: e;  s/eeeeeeeeee/f/g; t f; b done
: f;  s/ffffffffff/g/g; t g; b done
: g;  s/gggggggggg/h/g; t h; b done
: h;  s/hhhhhhhhhh//g
@end group

@group
: done
$! @{
  h
  b
@}
@end group

# On the last line, convert back to decimal

@group
: loop
/a/! s/[b-h]*/&0/
s/aaaaaaaaa/9/
s/aaaaaaaa/8/
s/aaaaaaa/7/
s/aaaaaa/6/
s/aaaaa/5/
s/aaaa/4/
s/aaa/3/
s/aa/2/
s/a/1/
@end group

@group
: next
y/bcdefgh/abcdefg/
/[a-h]/ b loop
p
@end group
@end example
@c end---------------------------------------------

@node wc -w
@section Counting Words

This script is almost the same as the previous one, once each
of the words on the line is converted to a single @samp{a}
(in the previous script each letter was changed to an @samp{a}).

It is interesting that real @command{wc} programs have optimized
loops for @samp{wc -c}, so they are much slower at counting
words rather than characters.  This script's bottleneck,
instead, is arithmetic, and hence the word-counting one
is faster (it has to manage smaller numbers).

Again, the common parts are not commented to show the importance
of commenting @command{sed} scripts.

@c start-------------------------------------------
@example
#!/usr/bin/sed -nf

@group
# Convert words to a's
s/[ @kbd{tab}][ @kbd{tab}]*/ /g
s/^/ /
s/ [^ ][^ ]*/a /g
s/ //g
@end group

@group
# Append them to hold space
H
x
s/\n//
@end group

@group
# From here on it is the same as in wc -c.
/aaaaaaaaaa/! bx;   s/aaaaaaaaaa/b/g
/bbbbbbbbbb/! bx;   s/bbbbbbbbbb/c/g
/cccccccccc/! bx;   s/cccccccccc/d/g
/dddddddddd/! bx;   s/dddddddddd/e/g
/eeeeeeeeee/! bx;   s/eeeeeeeeee/f/g
/ffffffffff/! bx;   s/ffffffffff/g/g
/gggggggggg/! bx;   s/gggggggggg/h/g
s/hhhhhhhhhh//g
:x
$! @{ h; b; @}
:y
/a/! s/[b-h]*/&0/
s/aaaaaaaaa/9/
s/aaaaaaaa/8/
s/aaaaaaa/7/
s/aaaaaa/6/
s/aaaaa/5/
s/aaaa/4/
s/aaa/3/
s/aa/2/
s/a/1/
y/bcdefgh/abcdefg/
/[a-h]/ by
p
@end group
@end example
@c end---------------------------------------------

@node wc -l
@section Counting Lines

No strange things are done now, because @command{sed} gives us
@samp{wc -l} functionality for free!!! Look:

@c start-------------------------------------------
@example
@group
#!/usr/bin/sed -nf
$=
@end group
@end example
@c end---------------------------------------------

@node head
@section Printing the First Lines

This script is probably the simplest useful @command{sed} script.
It displays the first 10 lines of input; the number of displayed
lines is right before the @code{q} command.

@c start-------------------------------------------
@example
@group
#!/usr/bin/sed -f
10q
@end group
@end example
@c end---------------------------------------------

@node tail
@section Printing the Last Lines

Printing the last @var{n} lines rather than the first is more complex
but indeed possible.
@var{n} is encoded in the second line, before
the bang character.

This script is similar to the @command{tac} script in that it keeps the
final output in the hold space and prints it at the end:

@c start-------------------------------------------
@example
#!/usr/bin/sed -nf

@group
1! @{; H; g; @}
1,10 !s/[^\n]*\n//
$p
h
@end group
@end example
@c end---------------------------------------------

Mainly, the script keeps a window of 10 lines and slides it
by adding a line and deleting the oldest (the substitution command
on the second line works like a @code{D} command but does not
restart the loop).

The ``sliding window'' technique is a very powerful way to write
efficient and complex @command{sed} scripts, because commands like
@code{P} would require a lot of work if implemented manually.

To introduce the technique, which is fully demonstrated in the
rest of this chapter and is based on the @code{N}, @code{P}
and @code{D} commands, here is an implementation of @command{tail}
using a simple ``sliding window.''

This looks complicated but in fact the working is the same as
the last script: after we have kicked in the appropriate number
of lines, however, we stop using the hold space to keep inter-line
state, and instead use @code{N} and @code{D} to slide pattern
space by one line:

@c start-------------------------------------------
@example
#!/usr/bin/sed -f

@group
1h
2,10 @{; H; g; @}
$q
1,9d
N
D
@end group
@end example
@c end---------------------------------------------

Note how the first, second and fourth lines are inactive after
the first ten lines of input.  After that, all the script does
is: exiting on the last line of input, appending the next input
line to pattern space, and removing the first line.
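
The same sliding idea can be seen in miniature with a window of only
two lines.  The following is a sketch, not part of the scripts above;
it assumes a POSIX shell, and the function name @code{last_two} is
made up for this example:

```shell
# Hypothetical helper: print the last two lines of standard input.
# N appends the next line to pattern space; D drops the oldest line
# and restarts the cycle without reading new input.
last_two() (
  sed '$!N;$!D'
)

printf '%s\n' 1 2 3 4 5 | last_two   # prints 4 and 5, one per line
```

Once the last input line has been appended, neither command applies
any longer and the two lines left in pattern space are printed.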

@node uniq
@section Make Duplicate Lines Unique

This is an example of the art of using the @code{N}, @code{P}
and @code{D} commands, probably the most difficult to master.

@c start-------------------------------------------
@example
@group
#!/usr/bin/sed -f
h
@end group

@group
:b
# On the last line, print and exit
$b
N
/^\(.*\)\n\1$/ @{
    # The two lines are identical.  Undo the effect of
    # the N command.
    g
    bb
@}
@end group

@group
# If the @code{N} command had added the last line, print and exit
$b
@end group

@group
# The lines are different; print the first and go
# back working on the second.
P
D
@end group
@end example
@c end---------------------------------------------

As you can see, we maintain a 2-line window using @code{P} and @code{D}.
This technique is often used in advanced @command{sed} scripts.

@node uniq -d
@section Print Duplicated Lines of Input

This script prints only duplicated lines, like @samp{uniq -d}.

@c start-------------------------------------------
@example
#!/usr/bin/sed -nf

@group
$b
N
/^\(.*\)\n\1$/ @{
    # Print the first of the duplicated lines
    s/.*\n//
    p
@end group

@group
    # Loop until we get a different line
    :b
    $b
    N
    /^\(.*\)\n\1$/ @{
        s/.*\n//
        bb
    @}
@}
@end group

@group
# The last line cannot be followed by duplicates
$b
@end group

@group
# Found a different one.  Leave it alone in the pattern space
# and go back to the top, hunting its duplicates
D
@end group
@end example
@c end---------------------------------------------

@node uniq -u
@section Remove All Duplicated Lines

This script prints only unique lines, like @samp{uniq -u}.

@c start-------------------------------------------
@example
#!/usr/bin/sed -f

@group
# Search for a duplicate line --- until that, print what you find.
$b
N
/^\(.*\)\n\1$/ ! @{
    P
    D
@}
@end group

@group
:c
# Got two equal lines in pattern space.  At the
# end of the file we simply exit
$d
@end group

@group
# Else, we keep reading lines with @code{N} until we
# find a different one
s/.*\n//
N
/^\(.*\)\n\1$/ @{
    bc
@}
@end group

@group
# Remove the last instance of the duplicate line
# and go back to the top
D
@end group
@end example
@c end---------------------------------------------

@node cat -s
@section Squeezing Blank Lines

As a final example, here are three scripts, of increasing complexity
and speed, that implement the same function as @samp{cat -s}, that is
squeezing blank lines.

The first leaves a blank line at the beginning and end if there are
some already.

@c start-------------------------------------------
@example
#!/usr/bin/sed -f

@group
# on empty lines, join with next
# Note there is a star in the regexp
:x
/^\n*$/ @{
N
bx
@}
@end group

@group
# now, squeeze all '\n', this can be also done by:
# s/^\(\n\)*/\1/
s/\n*/\
/
@end group
@end example
@c end---------------------------------------------

This one is a bit more complex and removes all empty lines
at the beginning.  It does leave a single blank line at end
if one was there.

@c start-------------------------------------------
@example
#!/usr/bin/sed -f

@group
# delete all leading empty lines
1,/^./@{
/./!d
@}
@end group

@group
# on an empty line we remove it and all the following
# empty lines, but one
:x
/./!@{
N
s/^\n$//
tx
@}
@end group
@end example
@c end---------------------------------------------

This removes leading and trailing blank lines.  It is also the
fastest.  Note that loops are completely done with @code{n} and
@code{b}, without relying on @command{sed} to restart
the script automatically at the end of a line.

@c start-------------------------------------------
@example
#!/usr/bin/sed -nf

@group
# delete all (leading) blanks
/./!d
@end group

@group
# get here: so there is a non empty
:x
# print it
p
# get next
n
# got chars? print it again, etc...
/./bx
@end group

@group
# no, don't have chars: got an empty line
:z
# get next, if last line we finish here so no trailing
# empty lines are written
n
# also empty? then ignore it, and get next... this will
# remove ALL empty lines
/./!bz
@end group

@group
# all empty lines were deleted/ignored, but we have a non empty.  As
# what we want to do is to squeeze, insert a blank line artificially
i\
@end group

bx
@end example
@c end---------------------------------------------

@node Limitations
@chapter @value{SSED}'s Limitations and Non-limitations

@cindex @acronym{GNU} extensions, unlimited line length
@cindex Portability, line length limitations
For those who want to write portable @command{sed} scripts,
be aware that some implementations have been known to
limit line lengths (for the pattern and hold spaces)
to be no more than 4000 bytes.
The @sc{posix} standard specifies that conforming @command{sed}
implementations shall support at least 8192 byte line lengths.
@value{SSED} has no built-in limit on line length;
as long as it can @code{malloc()} more (virtual) memory,
you can feed or construct lines as long as you like.

However, recursion is used to handle subpatterns and indefinite
repetition.  This means that the available stack space may limit
the size of the buffer that can be processed by certain patterns.

@ifset PERL
There are some size limitations in the regular expression
matcher but it is hoped that they will never in practice
be relevant.  The maximum length of a compiled pattern
is 65539 (sic) bytes.  All values in repeating quantifiers
must be less than 65536.  The maximum nesting depth of
all parenthesized subpatterns, including capturing and
non-capturing subpatterns@footnote{The
distinction is meaningful when referring to Perl-style
regular expressions.}, assertions, and other types of
subpattern, is 200.

Also, @value{SSED} recognizes the @sc{posix} syntax
@code{[.@var{ch}.]} and @code{[=@var{ch}=]}
where @var{ch} is a ``collating element'', but these
are not supported, and an error is given if they are
encountered.

Here are a few distinctions between the real Perl-style
regular expressions and those that @option{-R} recognizes.

@enumerate
@item
Lookahead assertions do not allow repeat quantifiers after them.
Perl permits them, but they do not mean what you
might think.  For example, @samp{(?!a)@{3@}} does not assert that the
next three characters are not @samp{a}.  It just asserts three times that the
next character is not @samp{a} --- a waste of time and nothing else.

@item
Capturing subpatterns that occur inside negative lookahead
assertions are counted, but their entries are counted
as empty in the second half of an @code{s} command.
Perl sets its numerical variables from any such patterns
that are matched before the assertion fails to match
something (thereby succeeding), but only if the negative
lookahead assertion contains just one branch.

@item
The following Perl escape sequences are not supported:
@samp{\l}, @samp{\u}, @samp{\L}, @samp{\U}, @samp{\E},
@samp{\Q}.  In fact these are implemented by Perl's general
string-handling and are not part of its pattern matching engine.

@item
The Perl @samp{\G} assertion is not supported as it is not
relevant to single pattern matches.

@item
Fairly obviously, @value{SSED} does not support the @samp{(?@{code@})}
and @samp{(?p@{code@})} constructions.  However, there is some experimental
support for recursive patterns using the non-Perl item @samp{(?R)}.

@item
There are at the time of writing some oddities in Perl
5.005_02 concerned with the settings of captured strings
when part of a pattern is repeated.  For example, matching
@samp{aba} against the pattern @samp{/^(a(b)?)+$/} sets
@samp{$2}@footnote{@samp{$2} would be @samp{\2} in @value{SSED}.}
to the value @samp{b}, but matching @samp{aabbaa}
against @samp{/^(aa(bb)?)+$/} leaves @samp{$2}
unset.  However, if the pattern is changed to
@samp{/^(aa(b(b))?)+$/} then @samp{$2} (and @samp{$3}) are set.
In Perl 5.004 @samp{$2} is set in both cases, and that is also
true of @value{SSED}.

@item
Another as yet unresolved discrepancy is that in Perl
5.005_02 the pattern @samp{/^(a)?(?(1)a|b)+$/} matches
the string @samp{a}, whereas in @value{SSED} it does not.
However, in both Perl and @value{SSED} @samp{/^(a)?a/} matched
against @samp{a} leaves $1 unset.
@end enumerate
@end ifset

@node Other Resources
@chapter Other Resources for Learning About @command{sed}

@cindex Additional reading about @command{sed}
In addition to several books that have been written about @command{sed}
(either specifically or as chapters in books which discuss
shell programming), one can find out more about @command{sed}
(including suggestions of a few books) from the FAQ
for the @code{sed-users} mailing list, available from:
@display
@uref{http://sed.sourceforge.net/sedfaq.html}
@end display

Also of interest are
@uref{http://www.student.northpark.edu/pemente/sed/index.htm}
and @uref{http://sed.sf.net/grabbag},
which include @command{sed} tutorials and other @command{sed}-related goodies.

The @code{sed-users} mailing list is itself maintained by Sven Guckes.
To subscribe, visit @uref{http://groups.yahoo.com} and search
for the @code{sed-users} mailing list.

@node Reporting Bugs
@chapter Reporting Bugs

@cindex Bugs, reporting
Email bug reports to @email{bonzini@@gnu.org}.
Be sure to include the word ``sed'' somewhere in the @code{Subject:} field.
Also, please include the output of @samp{sed --version} in the body
of your report if at all possible.

Please do not send a bug report like this:

@example
@i{@i{@r{while building frobme-1.3.4}}}
$ configure
@error{} sed: file sedscr line 1: Unknown option to 's'
@end example

If @value{SSED} doesn't configure your favorite package, take a
few extra minutes to identify the specific problem and make a stand-alone
test case.  Unlike other programs such as C compilers, making such test
cases for @command{sed} is quite simple.

A stand-alone test case includes all the data necessary to perform the
test, and the specific invocation of @command{sed} that causes the problem.
The smaller a stand-alone test case is, the better.  A test case should
not involve something as far removed from @command{sed} as ``try to configure
frobme-1.3.4''.  Yes, that is in principle enough information to look
for the bug, but that is not a very practical prospect.

Here are a few commonly reported bugs that are not bugs.

@table @asis
@item @code{N} command on the last line
@cindex Portability, @code{N} command on the last line
@cindex Non-bugs, @code{N} command on the last line

Most versions of @command{sed} exit without printing anything when
the @code{N} command is issued on the last line of a file.
@value{SSED} prints pattern space before exiting unless of course
the @option{-n} command-line switch has been specified.  This choice is
by design.

For example, the behavior of
@example
sed N foo bar
@end example
@noindent
would depend on whether foo has an even or an odd number of
lines@footnote{which is the actual ``bug'' that prompted the
change in behavior}.  Or, when writing a script to read the
next few lines following a pattern match, traditional
implementations of @command{sed} would force you to write
something like
@example
/foo/@{ $!N; $!N; $!N; $!N; $!N; $!N; $!N; $!N; $!N @}
@end example
@noindent
instead of just
@example
/foo/@{ N;N;N;N;N;N;N;N;N; @}
@end example

@cindex @code{POSIXLY_CORRECT} behavior, @code{N} command
In any case, the simplest workaround is to use @code{$d;N} in
scripts that rely on the traditional behavior, or to set
the @code{POSIXLY_CORRECT} variable to a non-empty value.

@item Regex syntax clashes (problems with backslashes)
@cindex @acronym{GNU} extensions, to basic regular expressions
@cindex Non-bugs, regex syntax clashes
@command{sed} uses the @sc{posix} basic regular expression syntax.
According to
the standard, the meaning of some escape sequences is undefined in
this syntax; notable in the case of @command{sed} are @code{\|},
@code{\+}, @code{\?}, @code{\`}, @code{\'}, @code{\<},
@code{\>}, @code{\b}, @code{\B}, @code{\w}, and @code{\W}.

As in all @acronym{GNU} programs that use @sc{posix} basic regular
expressions, @command{sed} interprets these escape sequences as special
characters.  So, @code{x\+} matches one or more occurrences of @samp{x}.
@code{abc\|def} matches either @samp{abc} or @samp{def}.

This syntax may cause problems when running scripts written for other
@command{sed}s.  Some @command{sed} programs have been written with the
assumption that @code{\|} and @code{\+} match the literal characters
@code{|} and @code{+}.  Such scripts must be modified by removing the
spurious backslashes if they are to be used with modern implementations
of @command{sed}, like
@ifset PERL
@value{SSED} or
@end ifset
@acronym{GNU} @command{sed}.

On the other hand, some scripts use @code{s|abc\|def||g} to remove occurrences
of @emph{either} @code{abc} or @code{def}.  While this worked until
@command{sed} 4.0.x, newer versions interpret this as removing the
string @code{abc|def}.  This is again undefined behavior according to
@acronym{POSIX}, and this interpretation is arguably more robust: older
@command{sed}s, for example, required that the regex matcher parsed
@code{\/} as @code{/} in the common case of escaping a slash, which is
again undefined behavior; the new behavior avoids this, and this is good
because the regex matcher is only partially under our control.
2888 2889@cindex @acronym{GNU} extensions, special escapes 2890In addition, this version of @command{sed} supports several escape characters 2891(some of which are multi-character) to insert non-printable characters 2892in scripts (@code{\a}, @code{\c}, @code{\d}, @code{\o}, @code{\r}, 2893@code{\t}, @code{\v}, @code{\x}). These can cause similar problems 2894with scripts written for other @command{sed}s. 2895 2896@item @option{-i} clobbers read-only files 2897@cindex In-place editing 2898@cindex @value{SSEDEXT}, in-place editing 2899@cindex Non-bugs, in-place editing 2900 2901In short, @samp{sed -i} will let you delete the contents of 2902a read-only file, and in general the @option{-i} option 2903(@pxref{Invoking sed, , Invocation}) lets you clobber 2904protected files. This is not a bug, but rather a consequence 2905of how the Unix filesystem works. 2906 2907The permissions on a file say what can happen to the data 2908in that file, while the permissions on a directory say what can 2909happen to the list of files in that directory. @samp{sed -i} 2910will not ever open for writing a file that is already on disk. 2911Rather, it will work on a temporary file that is finally renamed 2912to the original name: if you rename or delete files, you're actually 2913modifying the contents of the directory, so the operation depends on 2914the permissions of the directory, not of the file. For this same 2915reason, @command{sed} does not let you use @option{-i} on a writeable file 2916in a read-only directory, and will break hard or symbolic links when 2917@option{-i} is used on such a file. 2918 2919@item @code{0a} does not work (gives an error) 2920@cindex @code{0} address 2921@cindex @acronym{GNU} extensions, @code{0} address 2922@cindex Non-bugs, @code{0} address 2923 2924There is no line 0. 
0 is a special address that is only used to treat 2925addresses like @code{0,/@var{RE}/} as active when the script starts: if 2926you write @code{1,/abc/d} and the first line includes the word @samp{abc}, 2927then that match would be ignored because address ranges must span at least 2928two lines (barring the end of the file); but what you probably wanted is 2929to delete every line up to the first one including @samp{abc}, and this 2930is obtained with @code{0,/abc/d}. 2931 2932@ifclear PERL 2933@item @code{[a-z]} is case insensitive 2934@cindex Non-bugs, localization-related 2935 2936You are encountering problems with locales. POSIX mandates that @code{[a-z]} 2937uses the current locale's collation order -- in C parlance, that means using 2938@code{strcoll(3)} instead of @code{strcmp(3)}. Some locales have a 2939case-insensitive collation order, others don't. 2940 2941Another problem is that @code{[a-z]} tries to use collation symbols. 2942This only happens if you are on the @acronym{GNU} system, using 2943@acronym{GNU} libc's regular expression matcher instead of compiling the 2944one supplied with @acronym{GNU} sed. In a Danish locale, for example, 2945the regular expression @code{^[a-z]$} matches the string @samp{aa}, 2946because this is a single collating symbol that comes after @samp{a} 2947and before @samp{b}; @samp{ll} behaves similarly in Spanish 2948locales, or @samp{ij} in Dutch locales. 2949 2950To work around these problems, which may cause bugs in shell scripts, set 2951the @env{LC_COLLATE} and @env{LC_CTYPE} environment variables to @samp{C}. 2952 2953@item @code{s/.*//} does not clear pattern space 2954@cindex Non-bugs, localization-related 2955@cindex @value{SSEDEXT}, emptying pattern space 2956@cindex Emptying pattern space 2957 2958This happens if your input stream includes invalid multibyte 2959sequences. 
@sc{posix} mandates that such sequences 2960are @emph{not} matched by @samp{.}, so that @samp{s/.*//} will not clear 2961pattern space as you would expect. In fact, there is no way to clear 2962sed's buffers in the middle of the script in most multibyte locales 2963(including UTF-8 locales). For this reason, @value{SSED} provides a `z' 2964command (for `zap') as an extension. 2965 2966To work around these problems, which may cause bugs in shell scripts, set 2967the @env{LC_COLLATE} and @env{LC_CTYPE} environment variables to @samp{C}. 2968@end ifclear 2969@end table 2970 2971 2972@node Extended regexps 2973@appendix Extended regular expressions 2974@cindex Extended regular expressions, syntax 2975 2976The only difference between basic and extended regular expressions is in 2977the behavior of a few characters: @samp{?}, @samp{+}, parentheses, 2978and braces (@samp{@{@}}). While basic regular expressions require 2979these to be escaped if you want them to behave as special characters, 2980when using extended regular expressions you must escape them if 2981you want them @emph{to match a literal character}. 2982 2983@noindent 2984Examples: 2985@table @code 2986@item abc? 2987becomes @samp{abc\?} when using extended regular expressions. It matches 2988the literal string @samp{abc?}. 2989 2990@item c\+ 2991becomes @samp{c+} when using extended regular expressions. It matches 2992one or more @samp{c}s. 2993 2994@item a\@{3,\@} 2995becomes @samp{a@{3,@}} when using extended regular expressions. It matches 2996three or more @samp{a}s. 2997 2998@item \(abc\)\@{2,3\@} 2999becomes @samp{(abc)@{2,3@}} when using extended regular expressions. It 3000matches either @samp{abcabc} or @samp{abcabcabc}. 3001 3002@item \(abc*\)\1 3003becomes @samp{(abc*)\1} when using extended regular expressions. 3004Backreferences must still be escaped when using extended regular 3005expressions. 
3006@end table 3007 3008@ifset PERL 3009@node Perl regexps 3010@appendix Perl-style regular expressions 3011@cindex Perl-style regular expressions, syntax 3012 3013@emph{This part is taken from the @file{pcre.txt} file distributed together 3014with the free @sc{pcre} regular expression matcher; it was written by Philip Hazel.} 3015 3016Perl introduced several extensions to regular expressions, some 3017of them incompatible with the syntax of regular expressions 3018accepted by Emacs and other @acronym{GNU} tools (whose matcher was 3019based on the Emacs matcher). @value{SSED} implements 3020both kinds of extensions. 3021 3022@iftex 3023Summarizing, we have: 3024 3025@itemize @bullet 3026@item 3027A backslash can introduce several special sequences 3028 3029@item 3030The circumflex, dollar sign, and period characters behave specially 3031with regard to new lines 3032 3033@item 3034Strange uses of square brackets are parsed differently 3035 3036@item 3037You can toggle modifiers in the middle of a regular expression 3038 3039@item 3040You can specify that a subpattern does not count when numbering backreferences 3041 3042@item 3043@cindex Greedy regular expression matching 3044You can specify greedy or non-greedy matching 3045 3046@item 3047You can have more than ten back references 3048 3049@item 3050You can do complex look aheads and look behinds (in the spirit of 3051@code{\b}, but with subpatterns). 3052 3053@item 3054You can often improve performance by avoiding that @command{sed} wastes 3055time with backtracking 3056 3057@item 3058You can have if/then/else branches 3059 3060@item 3061You can do recursive matches, for example to look for unbalanced parentheses 3062 3063@item 3064You can have comments and non-significant whitespace, because things can 3065get complex... 3066@end itemize 3067 3068Most of these extensions are introduced by the special @code{(?} 3069sequence, which gives special meanings to parenthesized groups. 
3070@end iftex 3071@menu 3072Other extensions can be roughly subdivided into two categories. 3073On one hand Perl introduces several more escaped sequences 3074(that is, sequences introduced by a backslash). On the other 3075hand, it specifies that if a question mark follows an open 3076parenthesis, it should give a special meaning to the parenthesized 3077group. 3078 3079* Backslash:: Introduces special sequences 3080* Circumflex/dollar sign/period:: Behave specially with regard to new lines 3081* Square brackets:: Are a bit different in strange cases 3082* Options setting:: Toggle modifiers in the middle of a regexp 3083* Non-capturing subpatterns:: Are not counted when backreferencing 3084* Repetition:: Allows for non-greedy matching 3085* Backreferences:: Allows for more than 10 back references 3086* Assertions:: Allows for complex look ahead matches 3087* Non-backtracking subpatterns:: Often gives more performance 3088* Conditional subpatterns:: Allows if/then/else branches 3089* Recursive patterns:: For example to match parentheses 3090* Comments:: Because things can get complex... 3091@end menu 3092 3093@node Backslash 3094@appendixsec Backslash 3095@cindex Perl-style regular expressions, escaped sequences 3096 3097There are a few differences in the handling of backslashed 3098sequences in Perl mode. 3099 3100First of all, there are no @code{\o} and @code{\d} sequences. 3101@sc{ascii} values for characters can be specified in octal 3102with a @code{\@var{xxx}} sequence, where @var{xxx} is a 3103sequence of up to three octal digits. If the first digit 3104is a zero, the treatment of the sequence is straightforward; 3105just note that if the character that follows the escaped digit 3106is itself an octal digit, you have to supply three octal digits 3107for @var{xxx}. For example, @code{\07} is a @sc{bel} character 3108rather than a @sc{nul} and a literal @code{7} (this sequence is 3109instead represented by @code{\0007}).
3110 3111@cindex Perl-style regular expressions, backreferences 3112The handling of a backslash followed by a digit other than 0 3113is complicated. Outside a character class, @command{sed} reads it 3114and any following digits as a decimal number. If the number 3115is less than 10, or if there have been at least that many 3116previous capturing left parentheses in the expression, the 3117entire sequence is taken as a back reference. A description 3118of how this works is given later, following the discussion 3119of parenthesized subpatterns. 3120 3121Inside a character class, or if the decimal number is 3122greater than 9 and there have not been that many capturing 3123subpatterns, @command{sed} re-reads up to three octal digits following 3124the backslash, and generates a single byte from the 3125least significant 8 bits of the value. Any subsequent digits 3126stand for themselves. For example: 3127 3128@example 3129\040 @i{@r{is another way of writing a space}} 3130\40 @i{@r{is the same, provided there are fewer than 40}} 3131 @i{@r{previous capturing subpatterns}} 3132\7 @i{@r{is always a back reference}} 3133\011 @i{@r{is always a tab}} 3134\11 @i{@r{might be a back reference, or another way of writing a tab}} 3135\0113 @i{@r{is a tab followed by the character @samp{3}}} 3136\113 @i{@r{is the character with octal code 113 (since there}} 3137 @i{@r{can be no more than 99 back references)}} 3138\377 @i{@r{is a byte consisting entirely of 1 bits (@sc{ascii} 255)}} 3139\81 @i{@r{is either a back reference, or a binary zero}} 3140 @i{@r{followed by the two characters @samp{81}}} 3141@end example 3142 3143Note that octal values of 100 or greater must not be introduced 3144by a leading zero, because no more than three octal 3145digits are ever read. Note that this applies only to the LHS 3146pattern; it is not possible yet to specify more than 9 backreferences 3147on the RHS of the `s' command. 
3148 3149All the sequences that define a single byte value can be 3150used both inside and outside character classes. In addition, 3151inside a character class, the sequence @code{\b} is interpreted 3152as the backspace character (hex 08). Outside a character 3153class it has a different meaning (see below). 3154 3155In addition, there are four additional escapes specifying 3156generic character classes (like @code{\w} and @code{\W} do): 3157 3158@cindex Perl-style regular expressions, character classes 3159@table @samp 3160@item \d 3161Matches any decimal digit 3162 3163@item \D 3164Matches any character that is not a decimal digit 3165@end table 3166 3167In Perl mode, these character type sequences can appear both inside and 3168outside character classes. Instead, in @sc{posix} mode these sequences 3169(as well as @code{\w} and @code{\W}) are treated as two literal characters 3170(a backslash and a letter) inside square brackets. 3171 3172Escaped sequences specifying assertions are also different in 3173Perl mode. An assertion specifies a condition that has to be met 3174at a particular point in a match, without consuming any 3175characters from the subject string. The use of subpatterns 3176for more complicated assertions is described below. The 3177backslashed assertions are 3178 3179@cindex Perl-style regular expressions, assertions 3180@table @samp 3181@item \b 3182Asserts that the point is at a word boundary. 3183A word boundary is a position in the subject string where 3184the current character and the previous character do not both 3185match @code{\w} or @code{\W} (i.e. one matches @code{\w} and 3186the other matches @code{\W}), or the start or end of the string 3187if the first or last character matches @code{\w}, respectively. 3188 3189@item \B 3190Asserts that the point is not at a word boundary. 3191 3192@item \A 3193Asserts the matcher is at the start of pattern space (independent 3194of multiline mode). 
3195 3196@item \Z 3197Asserts the matcher is at the end of pattern space, 3198or at a newline before the end of pattern space (independent of 3199multiline mode). 3200 3201@item \z 3202Asserts the matcher is at the end of pattern space (independent 3203of multiline mode). 3204@end table 3205 3206These assertions may not appear in character classes (but 3207note that @code{\b} has a different meaning, namely the 3208backspace character, inside a character class). 3209Note that Perl mode does not directly support assertions 3210for the beginning and the end of a word; the @acronym{GNU} extensions 3211@code{\<} and @code{\>} achieve this purpose in @sc{posix} mode 3212instead. 3213 3214The @code{\A}, @code{\Z}, and @code{\z} assertions differ 3215from the traditional circumflex and dollar sign (described below) 3216in that they only ever match at the very start and end of the 3217subject string, whatever options are set; in particular @code{\A} 3218and @code{\z} are the same as the @acronym{GNU} extensions 3219@code{\`} and @code{\'} that are active in @sc{posix} mode. 3220 3221@node Circumflex/dollar sign/period 3222@appendixsec Circumflex, dollar sign, period 3223@cindex Perl-style regular expressions, newlines 3224 3225Outside a character class, in the default matching mode, the 3226circumflex character is an assertion which is true only if 3227the current matching point is at the start of the subject 3228string. Inside a character class, the circumflex has an entirely 3229different meaning (see below). 3230 3231The circumflex need not be the first character of the pattern if 3232a number of alternatives are involved, but it should be the 3233first thing in each alternative in which it appears if the 3234pattern is ever to match that branch. If all possible alternatives 3235start with a circumflex, that is, if the pattern is 3236constrained to match only at the start of the subject, it is 3237said to be an @dfn{anchored} pattern.
 (There are also other constructs 3238that can cause a pattern to be anchored.) 3239 3240A dollar sign is an assertion which is true only if the 3241current matching point is at the end of the subject string, 3242or immediately before a newline character that is the last 3243character in the string (by default). A dollar sign need not be the 3244last character of the pattern if a number of alternatives 3245are involved, but it should be the last item in any branch 3246in which it appears. A dollar sign has no special meaning in a 3247character class. 3248 3249@cindex Perl-style regular expressions, multiline 3250The meanings of the circumflex and dollar sign characters are 3251changed if the @code{M} modifier option is used. When this is 3252the case, they match immediately after and immediately 3253before an internal @code{\n} character, respectively, in addition 3254to matching at the start and end of the subject string. For 3255example, the pattern @code{/^abc$/} matches the subject string 3256@samp{def\nabc} in multiline mode, but not otherwise. Consequently, 3257patterns that are anchored in single line mode 3258because all branches start with @code{^} are not anchored in 3259multiline mode. 3260 3261@cindex Perl-style regular expressions, multiline 3262Note that the sequences @code{\A}, @code{\Z}, and @code{\z} 3263can be used to match the start and end of the subject in both 3264modes, and if all branches of a pattern start with @code{\A} 3265it is always anchored, whether the @code{M} modifier is set or not. 3266 3267@cindex Perl-style regular expressions, single line 3268Outside a character class, a dot in the pattern matches any 3269one character in the subject, including a non-printing character, 3270but not (by default) newline. If the @code{S} modifier is used, 3271dots match newlines as well.
 Actually, the handling of 3272dot is entirely independent of the handling of circumflex 3273and dollar sign, the only relationship being that they both 3274involve newline characters. Dot has no special meaning in a 3275character class. 3276 3277@node Square brackets 3278@appendixsec Square brackets 3279@cindex Perl-style regular expressions, character classes 3280 3281An opening square bracket introduces a character class, terminated 3282by a closing square bracket. A closing square bracket on its own 3283is not special. If a closing square bracket is required as a 3284member of the class, it should be the first data character in 3285the class (after an initial circumflex, if present) or escaped with a backslash. 3286 3287A character class matches a single character in the subject; 3288the character must be in the set of characters defined by 3289the class, unless the first character in the class is a circumflex, 3290in which case the subject character must not be in 3291the set defined by the class. If a circumflex is actually 3292required as a member of the class, ensure it is not the 3293first character, or escape it with a backslash. 3294 3295For example, the character class @code{[aeiou]} matches any lower 3296case vowel, while @code{[^aeiou]} matches any character that is not 3297a lower case vowel. Note that a circumflex is just a convenient 3298notation for specifying the characters which are in 3299the class by enumerating those that are not. It is not an 3300assertion: it still consumes a character from the subject 3301string, and fails if the current pointer is at the end of 3302the string.
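
The last point can be sketched with a few shell commands. They use
ordinary @sc{posix}-mode syntax, on the assumption that plain classes
such as these behave the same way in both modes; the input words are
invented for the example.

@example
# every character that is not a lower case vowel is replaced
echo 'stream editor' | sed 's/[^aeiou]/_/g'
# [^a] must consume a character, so it cannot match after the final b
echo 'ab' | sed 's/b[^a]/X/'
# here the c is consumed together with the b
echo 'abc' | sed 's/b[^a]/X/'
@end example

@noindent
These commands should print @samp{___ea__e_i_o_}, @samp{ab} (unchanged),
and @samp{aX} respectively.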
3303 3304@cindex Perl-style regular expressions, case-insensitive 3305When caseless matching is set, any letters in a class 3306represent both their upper case and lower case versions, so 3307for example, a caseless @code{[aeiou]} matches uppercase 3308and lowercase @samp{A}s, and a caseless @code{[^aeiou]} 3309does not match @samp{A}, whereas a case-sensitive version would. 3310 3311@cindex Perl-style regular expressions, single line 3312@cindex Perl-style regular expressions, multiline 3313The newline character is never treated in any special way in 3314character classes, whatever the setting of the @code{S} and 3315@code{M} options (modifiers) is. A class such as @code{[^a]} will 3316always match a newline. 3317 3318The minus (hyphen) character can be used to specify a range 3319of characters in a character class. For example, @code{[d-m]} 3320matches any letter between d and m, inclusive. If a minus 3321character is required in a class, it must be escaped with a 3322backslash or appear in a position where it cannot be interpreted 3323as indicating a range, typically as the first or last 3324character in the class. 3325 3326It is not possible to have the literal character @code{]} as the 3327end character of a range. A pattern such as @code{[W-]46]} is 3328interpreted as a class of two characters (@code{W} and @code{-}) 3329followed by a literal string @code{46]}, so it would match 3330@samp{W46]} or @samp{-46]}. However, if the @code{]} is escaped 3331with a backslash it is interpreted as the end of range, so 3332@code{[W-\]46]} is interpreted as a single class containing a 3333range followed by two separate characters. The octal or 3334hexadecimal representation of @code{]} can also be used to end a range. 3335 3336Ranges operate in @sc{ascii} collating sequence. They can also be 3337used for characters specified numerically, for example 3338@code{[\000-\037]}. 
If a range that includes letters is used when 3339caseless matching is set, it matches the letters in either 3340case. For example, a caseless @code{[W-c]} is equivalent to 3341@code{[][\^_`wxyzabc]}, matched caselessly, and if character 3342tables for the French locale are in use, @code{[\xc8-\xcb]} 3343matches accented E characters in both cases. 3344 3345Unlike in @sc{posix} mode, the character types @code{\d}, 3346@code{\D}, @code{\s}, @code{\S}, @code{\w}, and @code{\W} 3347may also appear in a character class, and add the characters 3348that they match to the class. For example, @code{[\dABCDEF]} matches any 3349hexadecimal digit. A circumflex can conveniently be used 3350with the upper case character types to specify a more restricted 3351set of characters than the matching lower case type. 3352For example, the class @code{[^\W_]} matches any letter or digit, 3353but not underscore. 3354 3355All non-alphameric characters other than @code{\}, @code{-}, 3356@code{^} (at the start) and the terminating @code{]} 3357are non-special in character classes, but it does no harm 3358if they are escaped. 3359 3360Perl 5.6 supports the @sc{posix} notation for character classes, which 3361uses names enclosed by @code{[:} and @code{:]} within the enclosing 3362square brackets, and @value{SSED} supports this notation as well. 3363For example, 3364 3365@example 3366[01[:alpha:]%] 3367@end example 3368 3369@noindent 3370matches @samp{0}, @samp{1}, any alphabetic character, or @samp{%}. 
3371The supported class names are 3372 3373@table @code 3374@item alnum 3375Matches letters and digits 3376 3377@item alpha 3378Matches letters 3379 3380@item ascii 3381Matches character codes 0 - 127 3382 3383@item cntrl 3384Matches control characters 3385 3386@item digit 3387Matches decimal digits (same as \d) 3388 3389@item graph 3390Matches printing characters, excluding space 3391 3392@item lower 3393Matches lower case letters 3394 3395@item print 3396Matches printing characters, including space 3397 3398@item punct 3399Matches printing characters, excluding letters and digits 3400 3401@item space 3402Matches white space (same as \s) 3403 3404@item upper 3405Matches upper case letters 3406 3407@item word 3408Matches ``word'' characters (same as \w) 3409 3410@item xdigit 3411Matches hexadecimal digits 3412@end table 3413 3414The names @code{ascii} and @code{word} are extensions valid only in 3415Perl mode. Another Perl extension is negation, which is 3416indicated by a circumflex character after the colon. For example, 3417 3418@example 3419[12[:^digit:]] 3420@end example 3421 3422@noindent 3423matches @samp{1}, @samp{2}, or any non-digit. 3424 3425@node Options setting 3426@appendixsec Options setting 3427@cindex Perl-style regular expressions, toggling options 3428@cindex Perl-style regular expressions, case-insensitive 3429@cindex Perl-style regular expressions, multiline 3430@cindex Perl-style regular expressions, single line 3431@cindex Perl-style regular expressions, extended 3432 3433The settings of the @code{I}, @code{M}, @code{S}, @code{X} 3434modifiers can be changed from within the pattern by 3435a sequence of Perl option letters enclosed between @code{(?} 3436and @code{)}. The option letters must be lowercase. 3437 3438For example, @code{(?im)} sets caseless, multiline matching. 
 It is 3439also possible to unset these options by preceding the letter 3440with a hyphen; you can also have combined settings and unsettings: 3441@code{(?im-sx)} sets caseless and multiline matching, 3442and unsets single line matching (for dots) and extended 3443whitespace interpretation. If a letter appears both before 3444and after the hyphen, the option is unset. 3445 3446The scope of these option changes depends on where in the 3447pattern the setting occurs. For settings that are outside 3448any subpattern (defined below), the effect is the same as if 3449the options were set or unset at the start of matching. The 3450following patterns all behave in exactly the same way: 3451 3452@example 3453(?i)abc 3454a(?i)bc 3455ab(?i)c 3456abc(?i) 3457@end example 3458 3459which in turn is the same as specifying the pattern @code{abc} with 3460the @code{I} modifier. In other words, ``top level'' settings 3461apply to the whole pattern (unless there are other 3462changes inside subpatterns). If there is more than one setting 3463of the same option at top level, the rightmost setting 3464is used. 3465 3466If an option change occurs inside a subpattern, the effect 3467is different. This is a change of behaviour in Perl 5.005. 3468An option change inside a subpattern affects only that part 3469of the subpattern @emph{that follows} it, so 3470 3471@example 3472(a(?i)b)c 3473@end example 3474 3475@noindent 3476matches @samp{abc} and @samp{aBc} and no other strings (assuming 3477case-sensitive matching is used). By this means, options can 3478be made to have different settings in different parts of the 3479pattern. Any changes made in one alternative do carry on 3480into subsequent branches within the same subpattern. For 3481example, 3482 3483@example 3484(a(?i)b|c) 3485@end example 3486 3487@noindent 3488matches @samp{ab}, @samp{aB}, @samp{c}, and @samp{C}, 3489even though when matching @samp{C} the first branch is 3490abandoned before the option setting.
3491This is because the effects of option settings happen at 3492compile time. There would be some very weird behaviour otherwise. 3493 3494@ignore 3495There are two PCRE-specific options PCRE_UNGREEDY and PCRE_EXTRA 3496that can be changed in the same way as the Perl-compatible options by 3497using the characters U and X respectively. The (?X) flag 3498setting is special in that it must always occur earlier in 3499the pattern than any of the additional features it turns on, 3500even when it is at top level. It is best put at the start. 3501@end ignore 3502 3503 3504@node Non-capturing subpatterns 3505@appendixsec Non-capturing subpatterns 3506@cindex Perl-style regular expressions, non-capturing subpatterns 3507 3508Marking part of a pattern as a subpattern does two things. 3509On one hand, it localizes a set of alternatives; on the other 3510hand, it sets up the subpattern as a capturing subpattern (as 3511defined above). The subpattern can be backreferenced and 3512referenced in the right side of @code{s} commands. 3513 3514For example, if the string @samp{the red king} is matched against 3515the pattern 3516 3517@example 3518the ((red|white) (king|queen)) 3519@end example 3520 3521@noindent 3522the captured substrings are @samp{red king}, @samp{red}, 3523and @samp{king}, and are numbered 1, 2, and 3. 3524 3525The fact that plain parentheses fulfil two functions is not 3526always helpful. There are often times when a grouping 3527subpattern is required without a capturing requirement. If an 3528opening parenthesis is followed by @code{?:}, the subpattern does 3529not do any capturing, and is not counted when computing the 3530number of any subsequent capturing subpatterns. For example, 3531if the string @samp{the white queen} is matched against the pattern 3532 3533@example 3534the ((?:red|white) (king|queen)) 3535@end example 3536 3537@noindent 3538the captured substrings are @samp{white queen} and @samp{queen}, 3539and are numbered 1 and 2. 
 The maximum number of captured 3540substrings is 99, while the maximum number of all subpatterns, 3541both capturing and non-capturing, is 200. 3542 3543As a convenient shorthand, if any option settings are 3544required at the start of a non-capturing subpattern, the 3545option letters may appear between the @code{?} and the 3546@code{:}. Thus the two patterns 3547 3548@example 3549(?i:saturday|sunday) 3550(?:(?i)saturday|sunday) 3551@end example 3552 3553@noindent 3554match exactly the same set of strings. Because alternative 3555branches are tried from left to right, and options are not 3556reset until the end of the subpattern is reached, an option 3557setting in one branch does affect subsequent branches, so 3558the above patterns match @samp{SUNDAY} as well as @samp{Saturday}. 3559 3560 3561@node Repetition 3562@appendixsec Repetition 3563@cindex Perl-style regular expressions, repetitions 3564 3565Repetition is specified by quantifiers, which can follow any 3566of the following items: 3567 3568@itemize @bullet 3569@item 3570a single character, possibly escaped 3571 3572@item 3573the @code{.} special character 3574 3575@item 3576a character class 3577 3578@item 3579a back reference (see next section) 3580 3581@item 3582a parenthesized subpattern (unless it is an assertion; @pxref{Assertions}) 3583@end itemize 3584 3585The general repetition quantifier specifies a minimum and 3586maximum number of permitted matches, by giving the two 3587numbers in curly brackets (braces), separated by a comma. 3588The numbers must be less than 65536, and the first must be 3589less than or equal to the second. For example: 3590 3591@example 3592z@{2,4@} 3593@end example 3594 3595@noindent 3596matches @samp{zz}, @samp{zzz}, or @samp{zzzz}. A closing brace on its own 3597is not a special character.
If the second number is omitted, 3598but the comma is present, there is no upper limit; if the 3599second number and the comma are both omitted, the quantifier 3600specifies an exact number of required matches. Thus 3601 3602@example 3603[aeiou]@{3,@} 3604@end example 3605 3606@noindent 3607matches at least 3 successive vowels, but may match many 3608more, while 3609 3610@example 3611\d@{8@} 3612@end example 3613 3614@noindent 3615matches exactly 8 digits. An opening curly bracket that 3616appears in a position where a quantifier is not allowed, or 3617one that does not match the syntax of a quantifier, is taken 3618as a literal character. For example, @{,6@} is not a quantifier, 3619but a literal string of four characters.@footnote{It 3620raises an error if @option{-R} is not used.} 3621 3622The quantifier @samp{@{0@}} is permitted, causing the expression to 3623behave as if the previous item and the quantifier were not 3624present. 3625 3626For convenience (and historical compatibility) the three 3627most common quantifiers have single-character abbreviations: 3628 3629@table @code 3630@item * 3631is equivalent to @{0,@} 3632 3633@item + 3634is equivalent to @{1,@} 3635 3636@item ? 3637is equivalent to @{0,1@} 3638@end table 3639 3640It is possible to construct infinite loops by following a 3641subpattern that can match no characters with a quantifier 3642that has no upper limit, for example: 3643 3644@example 3645(a?)* 3646@end example 3647 3648Earlier versions of Perl used to give an error at 3649compile time for such patterns. However, because there are 3650cases where this can be useful, such patterns are now 3651accepted, but if any repetition of the subpattern does in 3652fact match no characters, the loop is forcibly broken. 
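
The commands below are a minimal sketch of this safeguard. They use
extended regular expressions (@option{-r}) rather than Perl mode, on the
assumption that the @sc{posix} matcher copes with empty repetitions in a
comparable way; the input strings are invented for the example.

@example
# (a?)* can match the empty string, yet both commands terminate
echo 'aab' | sed -r 's/(a?)*/X/'
echo 'b' | sed -r 's/(a?)*/X/'
@end example

@noindent
With @acronym{GNU} @command{sed} both commands should print @samp{Xb}.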
3653 3654@cindex Greedy regular expression matching 3655@cindex Perl-style regular expressions, stingy repetitions 3656By default, the quantifiers are @dfn{greedy} like in @sc{posix} 3657mode, that is, they match as much as possible (up to the maximum 3658number of permitted times), without causing the rest of the 3659pattern to fail. The classic example of where this gives problems 3660is in trying to match comments in C programs. These appear between 3661the sequences @code{/*} and @code{*/} and within the sequence, individual 3662@code{*} and @code{/} characters may appear. An attempt to match C 3663comments by applying the pattern 3664 3665@example 3666/\*.*\*/ 3667@end example 3668 3669@noindent 3670to the string 3671 3672@example 3673/* first command */ not comment /* second comment */ 3674@end example 3675 3676@noindent 3677 3678fails, because it matches the entire string owing to the 3679greediness of the @code{.*} item. 3680 3681However, if a quantifier is followed by a question mark, it 3682ceases to be greedy, and instead matches the minimum number 3683of times possible, so the pattern @code{/\*.*?\*/} 3684does the right thing with the C comments. The meaning of the 3685various quantifiers is not otherwise changed, just the preferred 3686number of matches. Do not confuse this use of question 3687mark with its use as a quantifier in its own right. 3688Because it has two uses, it can sometimes appear doubled, as in 3689 3690@example 3691\d??\d 3692@end example 3693 3694which matches one digit by preference, but can match two if 3695that is the only way the rest of the pattern matches. 3696 3697Note that greediness does not matter when specifying addresses, 3698but can be nevertheless used to improve performance. 3699 3700@ignore 3701If the PCRE_UNGREEDY option is set (an option which is not 3702available in Perl), the quantifiers are not greedy by 3703default, but individual ones can be made greedy by following 3704them with a question mark. 
In other words, it inverts the
default behaviour.
@end ignore

When a parenthesized subpattern is quantified with a minimum
repeat count that is greater than 1 or with a limited maximum,
more memory is required for the compiled pattern, in
proportion to the size of the minimum or maximum.

@cindex Perl-style regular expressions, single line
If a pattern starts with @code{.*} or @code{.@{0,@}} and the
@code{S} modifier is used, the pattern is implicitly anchored,
because whatever follows will be tried against every character
position in the subject string, so there is no point in
retrying the overall match at any position after the first.
PCRE treats such a pattern as though it were preceded by @code{\A}.

When a capturing subpattern is repeated, the value captured
is the substring that matched the final iteration.  For example,
after

@example
(tweedle[dume]@{3@}\s*)+
@end example

@noindent
has matched @samp{tweedledum tweedledee}, the value of the
captured substring is @samp{tweedledee}.  However, if there are
nested capturing subpatterns, the corresponding captured
values may have been set in previous iterations.  For example,
after

@example
/(a|(b))+/
@end example

@noindent
matches @samp{aba}, the value of the second captured substring is
@samp{b}.

@node Backreferences
@appendixsec Backreferences
@cindex Perl-style regular expressions, backreferences

Outside a character class, a backslash followed by a digit
greater than 0 (and possibly further digits) is a back
reference to a capturing subpattern earlier (i.e. to its
left) in the pattern, provided there have been that many
previous capturing left parentheses.

However, if the decimal number following the backslash is
less than 10, it is always taken as a back reference, and
causes an error only if there are not that many capturing
left parentheses in the entire pattern.  In other words, the
parentheses that are referenced need not be to the left of
the reference for numbers less than 10.  @xref{Backslash},
for further details of the handling of digits following a backslash.

A back reference matches whatever actually matched the capturing
subpattern in the current subject string, rather than
anything matching the subpattern itself.  So the pattern

@example
(sens|respons)e and \1ibility
@end example

@noindent
matches @samp{sense and sensibility} and @samp{response and responsibility},
but not @samp{sense and responsibility}.  If caseful
matching is in force at the time of the back reference, the
case of letters is relevant.  For example,

@example
((?i)blah)\s+\1
@end example

@noindent
matches @samp{blah blah} and @samp{Blah Blah}, but not
@samp{BLAH blah}, even though the original capturing
subpattern is matched caselessly.

There may be more than one back reference to the same subpattern.
Also, if a subpattern has not actually been used in a
particular match, any back references to it always fail.  For
example, the pattern

@example
(a|(bc))\2
@end example

@noindent
always fails if it starts to match @samp{a} rather than
@samp{bc}.  Because there may be up to 99 back references, all
digits following the backslash are taken as part of a potential
back reference number; this is different from what happens
in @sc{posix} mode.  If the pattern continues with a digit
character, some delimiter must be used to terminate the back
reference.  If the @code{X} modifier option is set, this can be
whitespace.
Otherwise an empty comment can be used, or the
following character can be expressed in hexadecimal or octal.
Note that this applies only to the LHS pattern; it is
not yet possible to specify more than 9 backreferences on the
RHS of the @code{s} command.

A back reference that occurs inside the parentheses to which
it refers fails when the subpattern is first used, so, for
example, @code{(a\1)} never matches.  However, such references
can be useful inside repeated subpatterns.  For example, the
pattern

@example
(a|b\1)+
@end example

@noindent
matches any number of @samp{a}s and also @samp{aba}, @samp{ababbaa},
etc.  At each iteration of the subpattern, the back reference matches
the character string corresponding to the previous iteration.  In
order for this to work, the pattern must be such that the first
iteration does not need to match the back reference.  This can be
done using alternation, as in the example above, or by a
quantifier with a minimum of zero.

@node Assertions
@appendixsec Assertions
@cindex Perl-style regular expressions, assertions
@cindex Perl-style regular expressions, asserting subpatterns

An assertion is a test on the characters following or
preceding the current matching point that does not actually
consume any characters.  The simple assertions coded as @code{\b},
@code{\B}, @code{\A}, @code{\Z}, @code{\z}, @code{^} and @code{$}
are described above.  More complicated assertions are coded as
subpatterns.  There are two kinds: those that look ahead of the
current position in the subject string, and those that look behind it.

@cindex Perl-style regular expressions, lookahead subpatterns
An assertion subpattern is matched in the normal way, except
that it does not cause the current matching position to be
changed.
Lookahead assertions start with @code{(?=} for positive
assertions and @code{(?!} for negative assertions.  For example,

@example
\w+(?=;)
@end example

@noindent
matches a word followed by a semicolon, but does not include
the semicolon in the match, and

@example
foo(?!bar)
@end example

@noindent
matches any occurrence of @samp{foo} that is not followed by
@samp{bar}.

@cindex Perl-style regular expressions, lookbehind subpatterns
Note that the apparently similar pattern

@example
(?!foo)bar
@end example

@noindent
finds any occurrence of @samp{bar} even if it is preceded by
@samp{foo}, because the assertion @code{(?!foo)} is always true
when the next three characters are @samp{bar}.  A lookbehind
assertion is needed to exclude occurrences preceded by @samp{foo}.
Lookbehind assertions start with @code{(?<=} for positive
assertions and @code{(?<!} for negative assertions.  So,

@example
(?<!foo)bar
@end example

@noindent
achieves the required effect of finding an occurrence of
@samp{bar} that is not preceded by @samp{foo}.  The contents of a
lookbehind assertion are restricted
such that all the strings it matches must have a fixed
length.  However, if there are several alternatives, they do
not all have to have the same fixed length.  This is an extension
compared with Perl 5.005, which requires all branches to match
the same length of string.  Thus

@example
(?<=dogs|cats|)
@end example

@noindent
is permitted, but the apparently similar regular expression

@example
(?<!dogs?|cats?)
@end example

@noindent
causes an error at compile time.
Branches that match different
length strings are permitted only at the top level of
a lookbehind assertion: an assertion such as

@example
(?<=ab(c|de))
@end example

@noindent
is not permitted, because its single top-level branch can
match two different lengths, but it is acceptable if rewritten
to use two top-level branches:

@example
(?<=abc|abde)
@end example

All this is required because lookbehind assertions simply
move the current position back by the alternative's fixed
width and then try to match.  If there are
insufficient characters before the current position, the
match is deemed to fail.  Lookbehinds, in conjunction with
non-backtracking subpatterns, can be particularly useful for
matching at the ends of strings; an example is given at the end
of the section on non-backtracking subpatterns.

Several assertions (of any sort) may occur in succession.
For example,

@example
(?<=\d@{3@})(?<!999)foo
@end example

@noindent
matches @samp{foo} preceded by three digits that are not @samp{999}.
Notice that each of the assertions is applied independently
at the same point in the subject string.  First there is a
check that the previous three characters are all digits, and
then there is a check that the same three characters are not
@samp{999}.  This pattern does not match @samp{foo} preceded by six
characters, the first three of which are digits and the last three
of which are not @samp{999}.  For example, it doesn't match
@samp{123abcfoo}.  A pattern to do that is

@example
(?<=\d@{3@}...)(?<!999)foo
@end example

@noindent
This time the first assertion looks at the preceding six
characters, checking that the first three are digits, and
then the second assertion checks that the preceding three
characters are not @samp{999}.
Actually, assertions can be
nested in any combination, so one can write this as

@example
(?<=\d@{3@}(?!999)...)foo
@end example

@noindent
or

@example
(?<=\d@{3@}...(?<!999))foo
@end example

@noindent
both of which might be considered more readable.

Assertion subpatterns are not capturing subpatterns, and may
not be repeated, because it makes no sense to assert the
same thing several times.  If any kind of assertion contains
capturing subpatterns within it, these are counted for the
purposes of numbering the capturing subpatterns in the whole
pattern.  However, substring capturing is carried out only
for positive assertions, because it does not make sense for
negative assertions.

Assertions count towards the maximum of 200 parenthesized
subpatterns.

@node Non-backtracking subpatterns
@appendixsec Non-backtracking subpatterns
@cindex Perl-style regular expressions, non-backtracking subpatterns

With both maximizing and minimizing repetition, failure of
what follows normally causes the repeated item to be evaluated
again to see if a different number of repeats allows the
rest of the pattern to match.  Sometimes it is useful to
prevent this, either to change the nature of the match, or
to cause it to fail earlier than it otherwise might, when the
author of the pattern knows there is no point in carrying
on.

Consider, for example, the pattern @code{\d+foo} when applied to
the subject line

@example
123456bar
@end example

After matching all 6 digits and then failing to match @samp{foo},
the normal action of the matcher is to try again with only 5
digits matching the @code{\d+} item, and then with 4, and so on,
before ultimately failing.
Non-backtracking subpatterns
provide the means for specifying that once a portion of the
pattern has matched, it is not to be re-evaluated in this way,
so the matcher would give up immediately on failing to match
@samp{foo} the first time.  The notation is another kind of special
parenthesis, starting with @code{(?>} as in this example:

@example
(?>\d+)foo
@end example

This kind of parenthesis ``locks up'' the part of the pattern
it contains once it has matched, and a failure further into
the pattern is prevented from backtracking into it.
Backtracking past it to previous items, however, works as
normal.

Non-backtracking subpatterns are not capturing subpatterns.  Simple
cases such as the above example can be thought of as a maximizing
repeat that must swallow everything it can.  So,
while both @code{\d+} and @code{\d+?} are prepared to adjust the number of
digits they match in order to make the rest of the pattern
match, @code{(?>\d+)} can only match an entire sequence of digits.

This construction can of course contain arbitrarily complicated
subpatterns, and it can be nested.

@cindex Perl-style regular expressions, lookbehind subpatterns
Non-backtracking subpatterns can be used in conjunction with lookbehind
assertions to specify efficient matching at the end
of the subject string.  Consider a simple pattern such as

@example
abcd$
@end example

@noindent
when applied to a long string which does not match.  Because
matching proceeds from left to right, @command{sed} will look for
each @samp{a} in the subject and then see if what follows matches
the rest of the pattern.
If the pattern is specified as

@example
^.*abcd$
@end example

@noindent
the initial @code{.*} matches the entire string at first, but when
this fails (because there is no following @samp{a}), it backtracks
to match all but the last character, then all but the
last two characters, and so on.  Once again the search for
@samp{a} covers the entire string, from right to left, so we are
no better off.  However, if the pattern is written as

@example
^(?>.*)(?<=abcd)
@end example

@noindent
there can be no backtracking for the @code{.*} item; it can match
only the entire string.  The subsequent lookbehind assertion
does a single test on the last four characters.  If it fails,
the match fails immediately.  For long strings, this approach
makes a significant difference to the processing time.

When a pattern contains an unlimited repeat inside a subpattern
that can itself be repeated an unlimited number of
times, the use of a non-backtracking subpattern is the only way to
avoid some failing matches taking a very long time
indeed.@footnote{Actually, the matcher embedded in @value{SSED}
tries to do something for this in the simplest cases,
like @code{([^b]*b)*}.  These cases are actually quite
common: they happen for example in a regular expression
like @code{\/\*([^*]*\*)*\/} which matches C comments.}

The pattern

@example
(\D+|<\d+>)*[!?]
@end example

@c ([^0-9<]+<(\d+>)?)*[!?]

@noindent
matches an unlimited number of substrings that either consist
of non-digits, or digits enclosed in angle brackets, followed by
an exclamation or question mark.  When it matches, it runs quickly.
However, if it is applied to

@example
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
@end example

@noindent
it takes a long time before reporting failure.
This is
because the string can be divided between the two repeats in
a large number of ways, and all have to be tried.@footnote{The
example used @code{[!?]} rather than a single character at the end,
because both @value{SSED} and Perl have an optimization that allows
for fast failure when a single character is used.  They
remember the last single character that is required for a
match, and fail early if it is not present in the string.}

If the pattern is changed to

@example
((?>\D+)|<\d+>)*[!?]
@end example

@noindent
sequences of non-digits cannot be broken, and failure happens
quickly.

@node Conditional subpatterns
@appendixsec Conditional subpatterns
@cindex Perl-style regular expressions, conditional subpatterns

It is possible to cause the matching process to obey a subpattern
conditionally or to choose between two alternative
subpatterns, depending on the result of an assertion, or
whether a previous capturing subpattern matched or not.  The
two possible forms of conditional subpattern are

@example
(?(@var{condition})@var{yes-pattern})
(?(@var{condition})@var{yes-pattern}|@var{no-pattern})
@end example

If the condition is satisfied, the yes-pattern is used; otherwise
the no-pattern (if present) is used.  If there are more than two
alternatives in the subpattern, a compile-time error occurs.

There are two kinds of condition.  If the text between the
parentheses consists of a sequence of digits, the condition
is satisfied if the capturing subpattern of that number has
previously matched.  The number must be greater than zero.
Consider the following pattern, which contains non-significant
white space to make it more readable (assume the @code{X} modifier)
and to divide it into three parts for ease of discussion:

@example
( \( )?
[^()]+    (?(1) \) )
@end example

The first part matches an optional opening parenthesis, and
if that character is present, sets it as the first captured
substring.  The second part matches one or more characters
that are not parentheses.  The third part is a conditional
subpattern that tests whether the first set of parentheses
matched or not.  If they did, that is, if the subject started
with an opening parenthesis, the condition is true, and so
the yes-pattern is executed and a closing parenthesis is
required.  Otherwise, since no-pattern is not present, the
subpattern matches nothing.  In other words, this pattern
matches a sequence of non-parentheses, optionally enclosed
in parentheses.

@cindex Perl-style regular expressions, lookahead subpatterns
If the condition is not a sequence of digits, it must be an
assertion.  This may be a positive or negative lookahead or
lookbehind assertion.  Consider this pattern, again containing
non-significant white space, and with the two alternatives
on the second and third lines:

@example
(?(?=...[a-z])
   \d\d-[a-z]@{3@}-\d\d |
   \d\d-\d\d-\d\d )
@end example

The condition is a positive lookahead assertion that matches
a letter that is three characters away from the current point.
If a letter is found, the subject is matched against the first
alternative @samp{@var{dd}-@var{aaa}-@var{dd}} (where @var{aaa} are
letters and @var{dd} are digits); otherwise it is matched against
the second alternative, @samp{@var{dd}-@var{dd}-@var{dd}}.


@node Recursive patterns
@appendixsec Recursive patterns
@cindex Perl-style regular expressions, recursive patterns
@cindex Perl-style regular expressions, recursion

Consider the problem of matching a string in parentheses,
allowing for unlimited nested parentheses.
Without the use
of recursion, the best that can be done is to use a pattern
that matches up to some fixed depth of nesting.  It is not
possible to handle an arbitrary nesting depth.  Perl 5.6 has
provided an experimental facility that allows regular
expressions to recurse (amongst other things).  It does this
by interpolating Perl code in the expression at run time,
and the code can refer to the expression itself.  A Perl pattern
to solve the parentheses problem can be created like
this:

@example
$re = qr@{\( (?: (?>[^()]+) | (?p@{$re@}) )* \)@}x;
@end example

The @code{(?p@{...@})} item interpolates Perl code at run time,
and in this case refers recursively to the pattern in which it
appears.  Obviously, @command{sed} cannot support the interpolation of
Perl code.  Instead, the special item @code{(?R)} is provided for
the specific case of recursion.  This pattern solves the
parentheses problem (assume the @code{X} modifier option is used
so that white space is ignored):

@example
\( ( (?>[^()]+) | (?R) )* \)
@end example

First it matches an opening parenthesis.  Then it matches any
number of substrings which can either be a sequence of
non-parentheses, or a recursive match of the pattern itself
(i.e. a correctly parenthesized substring).  Finally there is
a closing parenthesis.

This particular example pattern contains nested unlimited
repeats, and so the use of a non-backtracking subpattern for
matching strings of non-parentheses is important when applying
the pattern to strings that do not match.  For example, when
it is applied to

@example
(aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa()
@end example

@noindent
it yields a ``no match'' response quickly.
However, if a
non-backtracking subpattern is not used, the match runs
for a very long time indeed because there are so many different
ways the @code{+} and @code{*} repeats can carve up the subject,
and all have to be tested before failure can be reported.

The values set for any capturing subpatterns are those from
the outermost level of the recursion at which the subpattern
value is set.  If the pattern above is matched against

@example
(ab(cd)ef)
@end example

@noindent
the value for the capturing parentheses is @samp{ef}, which is
the last value taken on at the top level.

@node Comments
@appendixsec Comments
@cindex Perl-style regular expressions, comments

The sequence @code{(?#} marks the start of a comment which continues
up to the next closing parenthesis.  Nested parentheses
are not permitted.  The characters that make up a comment
play no part in the pattern matching at all.

@cindex Perl-style regular expressions, extended
If the @code{X} modifier option is used, an unescaped @code{#} character
outside a character class introduces a comment that continues
up to the next newline character in the pattern.
@end ifset


@page
@node Concept Index
@unnumbered Concept Index

This is a general index of all issues discussed in this manual, with the
exception of the @command{sed} commands and command-line options.

@printindex cp

@page
@node Command and Option Index
@unnumbered Command and Option Index

This is an alphabetical list of all @command{sed} commands and command-line
options.

@printindex fn

@contents
@bye

@c XXX FIXME: the term "cycle" is never defined...