1\input texinfo  @c -*-texinfo-*-
2@c
3@c -- Stuff that needs adding: ----------------------------------------------
4@c (document the `;' command-separator)
5@c --------------------------------------------------------------------------
6@c Check for consistency: regexps in @code, text that they match in @samp.
7@c
8@c Tips:
9@c    @command for command
10@c    @samp for command fragments: @samp{cat -s}
11@c    @code for sed commands and flags
12@c    Use ``quote'' not `quote' or "quote".
13@c
14@c %**start of header
15@setfilename sed.info
16@settitle sed, a stream editor
17@c %**end of header
18
19@c @smallbook
20
21@include version.texi
22
23@c Combine indices.
24@syncodeindex ky cp
25@syncodeindex pg cp
26@syncodeindex tp cp
27
28@defcodeindex op
29@syncodeindex op fn
30
31@include config.texi
32
33@copying
34This file documents version @value{VERSION} of
35@value{SSED}, a stream editor.
36
37Copyright @copyright{} 1998, 1999, 2001, 2002, 2003, 2004 Free
38Software Foundation, Inc.
39
40This document is released under the terms of the @acronym{GNU} Free
41Documentation License as published by the Free Software Foundation;
42either version 1.1, or (at your option) any later version.
43
44You should have received a copy of the @acronym{GNU} Free Documentation
45License along with @value{SSED}; see the file @file{COPYING.DOC}.
If not, write to the Free Software Foundation, 51 Franklin Street,
Fifth Floor, Boston, MA 02110-1301, USA.
48
49There are no Cover Texts and no Invariant Sections; this text, along
50with its equivalent in the printed manual, constitutes the Title Page.
51@end copying
52
53@setchapternewpage off
54
55@titlepage
56@title @command{sed}, a stream editor
57@subtitle version @value{VERSION}, @value{UPDATED}
58@author by Ken Pizzini, Paolo Bonzini
59
60@page
61@vskip 0pt plus 1filll
62Copyright @copyright{} 1998, 1999 Free Software Foundation, Inc.
63
64@insertcopying
65
66Published by the Free Software Foundation, @*
6751 Franklin Street, Fifth Floor @*
68Boston, MA 02110-1301, USA
69@end titlepage
70
71
72@node Top
73@top
74
75@ifnottex
76@insertcopying
77@end ifnottex
78
79@menu
80* Introduction::               Introduction
81* Invoking sed::               Invocation
82* sed Programs::               @command{sed} programs
83* Examples::                   Some sample scripts
84* Limitations::                Limitations and (non-)limitations of @value{SSED}
85* Other Resources::            Other resources for learning about @command{sed}
86* Reporting Bugs::             Reporting bugs
87
88* Extended regexps::           @command{egrep}-style regular expressions
89@ifset PERL
90* Perl regexps::               Perl-style regular expressions
91@end ifset
92
93* Concept Index::              A menu with all the topics in this manual.
94* Command and Option Index::   A menu with all @command{sed} commands and
95                               command-line options.
96
97@detailmenu
98--- The detailed node listing ---
99
100sed Programs:
101* Execution Cycle::                 How @command{sed} works
102* Addresses::                       Selecting lines with @command{sed}
103* Regular Expressions::             Overview of regular expression syntax
104* Common Commands::                 Often used commands
105* The "s" Command::                 @command{sed}'s Swiss Army Knife
106* Other Commands::                  Less frequently used commands
107* Programming Commands::            Commands for @command{sed} gurus
* Extended Commands::               Commands specific to @value{SSED}
109* Escapes::                         Specifying special characters
110
111Examples:
112* Centering lines::
113* Increment a number::
114* Rename files to lower case::
115* Print bash environment::
116* Reverse chars of lines::
117* tac::                             Reverse lines of files
118* cat -n::                          Numbering lines
119* cat -b::                          Numbering non-blank lines
120* wc -c::                           Counting chars
121* wc -w::                           Counting words
122* wc -l::                           Counting lines
123* head::                            Printing the first lines
124* tail::                            Printing the last lines
125* uniq::                            Make duplicate lines unique
126* uniq -d::                         Print duplicated lines of input
127* uniq -u::                         Remove all duplicated lines
128* cat -s::                          Squeezing blank lines
129
130@ifset PERL
131Perl regexps::                      Perl-style regular expressions
132* Backslash::                       Introduces special sequences
133* Circumflex/dollar sign/period::   Behave specially with regard to new lines
134* Square brackets::                 Are a bit different in strange cases
135* Options setting::                 Toggle modifiers in the middle of a regexp
136* Non-capturing subpatterns::       Are not counted when backreferencing
137* Repetition::                      Allows for non-greedy matching
138* Backreferences::                  Allows for more than 10 back references
139* Assertions::                      Allows for complex look ahead matches
140* Non-backtracking subpatterns::    Often gives more performance
141* Conditional subpatterns::         Allows if/then/else branches
142* Recursive patterns::              For example to match parentheses
143* Comments::                        Because things can get complex...
144@end ifset
145
146@end detailmenu
147@end menu
148
149
150@node Introduction
151@chapter Introduction
152
153@cindex Stream editor
154@command{sed} is a stream editor.
155A stream editor is used to perform basic text
156transformations on an input stream
157(a file or input from a pipeline).
158While in some ways similar to an editor which
159permits scripted edits (such as @command{ed}),
160@command{sed} works by making only one pass over the
161input(s), and is consequently more efficient.
162But it is @command{sed}'s ability to filter text in a pipeline
163which particularly distinguishes it from other types of
164editors.
165
166
167@node Invoking sed
168@chapter Invocation
169
170Normally @command{sed} is invoked like this:
171
172@example
173sed SCRIPT INPUTFILE...
174@end example
175
176The full format for invoking @command{sed} is:
177
178@example
179sed OPTIONS... [SCRIPT] [INPUTFILE...]
180@end example
181
182If you do not specify @var{INPUTFILE}, or if @var{INPUTFILE} is @file{-},
183@command{sed} filters the contents of the standard input.  The @var{script}
184is actually the first non-option parameter, which @command{sed} specially
185considers a script and not an input file if (and only if) none of the
186other @var{options} specifies a script to be executed, that is if neither
187of the @option{-e} and @option{-f} options is specified.
188
189@command{sed} may be invoked with the following command-line options:
190
191@table @code
192@item --version
193@opindex --version
194@cindex Version, printing
195Print out the version of @command{sed} that is being run and a copyright notice,
196then exit.
197
198@item --help
199@opindex --help
200@cindex Usage summary, printing
201Print a usage message briefly summarizing these command-line options
202and the bug-reporting address,
203then exit.
204
205@item -n
206@itemx --quiet
207@itemx --silent
208@opindex -n
209@opindex --quiet
210@opindex --silent
211@cindex Disabling autoprint, from command line
212By default, @command{sed} prints out the pattern space
213at the end of each cycle through the script (@pxref{Execution Cycle, ,
214How @code{sed} works}).
215These options disable this automatic printing,
216and @command{sed} only produces output when explicitly told to
217via the @code{p} command.
218
219@item -e @var{script}
220@itemx --expression=@var{script}
221@opindex -e
222@opindex --expression
223@cindex Script, from command line
224Add the commands in @var{script} to the set of commands to be
225run while processing the input.
226
227@item -f @var{script-file}
228@itemx --file=@var{script-file}
229@opindex -f
230@opindex --file
231@cindex Script, from a file
232Add the commands contained in the file @var{script-file}
233to the set of commands to be run while processing the input.
234
235@item -i[@var{SUFFIX}]
236@itemx --in-place[=@var{SUFFIX}]
237@opindex -i
238@opindex --in-place
239@cindex In-place editing, activating
240@cindex @value{SSEDEXT}, in-place editing
241This option specifies that files are to be edited in-place.
242@value{SSED} does this by creating a temporary file and
243sending output to this file rather than to the standard
244output.@footnote{This applies to commands such as @code{=},
245@code{a}, @code{c}, @code{i}, @code{l}, @code{p}.  You can
246still write to the standard output by using the @code{w}
247@cindex @value{SSEDEXT}, @file{/dev/stdout} file
248or @code{W} commands together with the @file{/dev/stdout}
249special file}.
250
251This option implies @option{-s}.
252
253When the end of the file is reached, the temporary file is
254renamed to the output file's original name.  The extension,
255if supplied, is used to modify the name of the old file
256before renaming the temporary file, thereby making a backup
copy.@footnote{Note that @value{SSED} creates the backup
file whether or not any output is actually changed.}
259
260@cindex In-place editing, Perl-style backup file names
261This rule is followed: if the extension doesn't contain a @code{*},
262then it is appended to the end of the current filename as a
263suffix; if the extension does contain one or more @code{*}
264characters, then @emph{each} asterisk is replaced with the
265current filename.  This allows you to add a prefix to the
266backup file, instead of (or in addition to) a suffix, or
267even to place backup copies of the original files into another
268directory (provided the directory already exists).
269
270If no extension is supplied, the original file is
271overwritten without making a backup.
272
273@item -l @var{N}
274@itemx --line-length=@var{N}
275@opindex -l
276@opindex --line-length
277@cindex Line length, setting
278Specify the default line-wrap length for the @code{l} command.
279A length of 0 (zero) means to never wrap long lines.  If
280not specified, it is taken to be 70.
281
282@item --posix
283@cindex @value{SSEDEXT}, disabling
@value{SSED} includes several extensions to @acronym{POSIX}
@command{sed}.  In order to simplify writing portable scripts, this
option disables all the extensions that this manual documents,
including additional commands.
@cindex @code{POSIXLY_CORRECT} behavior, enabling
Most of the extensions accept @command{sed} programs that
are outside the syntax mandated by @acronym{POSIX}, but some
of them (such as the behavior of the @code{N} command
described in @ref{Reporting Bugs}) actually violate the
293standard.  If you want to disable only the latter kind of
294extension, you can set the @code{POSIXLY_CORRECT} variable
295to a non-empty value.
296
297@item -b
298@itemx --binary
299@opindex -b
300@opindex --binary
301This option is available on every platform, but is only effective where the
302operating system makes a distinction between text files and binary files.
303When such a distinction is made---as is the case for MS-DOS, Windows,
304Cygwin---text files are composed of lines separated by a carriage return
305@emph{and} a line feed character, and @command{sed} does not see the
306ending CR.  When this option is specified, @command{sed} will open
307input files in binary mode, thus not requesting this special processing
308and considering lines to end at a line feed.
309
310@item --follow-symlinks
311@opindex --follow-symlinks
312This option is available only on platforms that support
313symbolic links and has an effect only if option @option{-i}
314is specified.  In this case, if the file that is specified
315on the command line is a symbolic link, @command{sed} will
316follow the link and edit the ultimate destination of the
317link.  The default behavior is to break the symbolic link,
318so that the link destination will not be modified.
319
320@item -r
321@itemx --regexp-extended
322@opindex -r
323@opindex --regexp-extended
324@cindex Extended regular expressions, choosing
325@cindex @acronym{GNU} extensions, extended regular expressions
326Use extended regular expressions rather than basic
327regular expressions.  Extended regexps are those that
328@command{egrep} accepts; they can be clearer because they
usually have fewer backslashes, but are a @acronym{GNU} extension
330and hence scripts that use them are not portable.
331@xref{Extended regexps, , Extended regular expressions}.
332
333@ifset PERL
334@item -R
335@itemx --regexp-perl
336@opindex -R
337@opindex --regexp-perl
338@cindex Perl-style regular expressions, choosing
339@cindex @value{SSEDEXT}, Perl-style regular expressions
340Use Perl-style regular expressions rather than basic
341regular expressions.  Perl-style regexps are extremely
342powerful but are a @value{SSED} extension and hence scripts that
use them are not portable.  @xref{Perl regexps, ,
344Perl-style regular expressions}.
345@end ifset
346
347@item -s
348@itemx --separate
349@cindex Working on separate files
350By default, @command{sed} will consider the files specified on the
351command line as a single continuous long stream.  This @value{SSED}
352extension allows the user to consider them as separate files:
353range addresses (such as @samp{/abc/,/def/}) are not allowed
354to span several files, line numbers are relative to the start
355of each file, @code{$} refers to the last line of each file,
356and files invoked from the @code{R} commands are rewound at the
357start of each file.
358
359@item -u
360@itemx --unbuffered
361@opindex -u
362@opindex --unbuffered
363@cindex Unbuffered I/O, choosing
364Buffer both input and output as minimally as practical.
365(This is particularly useful if the input is coming from
366the likes of @samp{tail -f}, and you wish to see the transformed
367output as soon as possible.)
368
369@end table
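
As a hedged sketch of in-place editing and of @option{-s}, the following
shell session works in a scratch directory so that no real file is
touched.  The file names @file{greeting.txt}, @file{a} and @file{b} are
hypothetical, and the @command{mktemp} utility is assumed to be
available.

@example
# Scratch directory with a hypothetical one-line file.
dir=$(mktemp -d)
printf 'hello\n' > "$dir/greeting.txt"

# Edit in place; the suffix keeps the original as greeting.txt.bak.
sed -i.bak 's/hello/goodbye/' "$dir/greeting.txt"
cat "$dir/greeting.txt"        # prints: goodbye
cat "$dir/greeting.txt.bak"    # prints: hello

# Two hypothetical input files, two lines each.
printf '1\n2\n' > "$dir/a"
printf '3\n4\n' > "$dir/b"

# Default: one continuous stream, so $ is the last line overall.
sed -n '$p' "$dir/a" "$dir/b"       # prints: 4
# With -s, $ refers to the last line of each file.
sed -s -n '$p' "$dir/a" "$dir/b"    # prints: 2, then 4

rm -r "$dir"
@end example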
370
371If no @option{-e}, @option{-f}, @option{--expression}, or @option{--file}
372options are given on the command-line,
373then the first non-option argument on the command line is
374taken to be the @var{script} to be executed.
375
376@cindex Files to be processed as input
377If any command-line parameters remain after processing the above,
378these parameters are interpreted as the names of input files to
379be processed.
380@cindex Standard input, processing as input
381A file name of @samp{-} refers to the standard input stream.
382The standard input will be processed if no file names are specified.
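
The two ways of supplying the script can be sketched as follows.  The
input text is made up for illustration and arrives on the standard
input in both cases.

@example
# No -e or -f: the first non-option argument is the script.
printf 'one\ntwo\n' | sed -n '2p'
# prints: two

# With -e, every non-option argument is an input file name;
# a lone - names the standard input explicitly.
printf 'one\n' | sed -e 's/one/1/' -e 's/1/uno/' -
# prints: uno
@end example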
383
384
385@node sed Programs
386@chapter @command{sed} Programs
387
388@cindex @command{sed} program structure
389@cindex Script structure
390A @command{sed} program consists of one or more @command{sed} commands,
391passed in by one or more of the
392@option{-e}, @option{-f}, @option{--expression}, and @option{--file}
393options, or the first non-option argument if zero of these
394options are used.
395This document will refer to ``the'' @command{sed} script;
396this is understood to mean the in-order catenation
397of all of the @var{script}s and @var{script-file}s passed in.
398
399Each @code{sed} command consists of an optional address or
400address range, followed by a one-character command name
401and any additional command-specific code.
402
403@menu
404* Execution Cycle::          How @command{sed} works
405* Addresses::                Selecting lines with @command{sed}
406* Regular Expressions::      Overview of regular expression syntax
407* Common Commands::          Often used commands
408* The "s" Command::          @command{sed}'s Swiss Army Knife
409* Other Commands::           Less frequently used commands
410* Programming Commands::     Commands for @command{sed} gurus
* Extended Commands::        Commands specific to @value{SSED}
412* Escapes::                  Specifying special characters
413@end menu
414
415
416@node Execution Cycle
417@section How @command{sed} Works
418
419@cindex Buffer spaces, pattern and hold
420@cindex Spaces, pattern and hold
421@cindex Pattern space, definition
422@cindex Hold space, definition
423@command{sed} maintains two data buffers: the active @emph{pattern} space,
424and the auxiliary @emph{hold} space. Both are initially empty.
425
@command{sed} operates by performing the following cycle on each
line of input: first, @command{sed} reads one line from the input
stream, removes any trailing newline, and places it in the pattern space.
Then commands are executed; each command can have an address associated
with it: addresses are a kind of condition code, and a command is
executed only if its condition holds at the moment the command is about
to run.
433
434When the end of the script is reached, unless the @option{-n} option
435is in use, the contents of pattern space are printed out to the output
436stream, adding back the trailing newline if it was removed.@footnote{Actually,
437if @command{sed} prints a line without the terminating newline, it will
438nevertheless print the missing newline as soon as more text is sent to
439the same output stream, which gives the ``least expected surprise''
440even though it does not make commands like @samp{sed -n p} exactly
441identical to @command{cat}.} Then the next cycle starts for the next
442input line.
443
444Unless special commands (like @samp{D}) are used, the pattern space is
445deleted between two cycles. The hold space, on the other hand, keeps
446its data between cycles (see commands @samp{h}, @samp{H}, @samp{x},
447@samp{g}, @samp{G} to move data between both buffers).
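
The cycle described above can be observed with a few one-line scripts.
This is only an illustrative sketch on made-up input: the first two
commands show the automatic print at the end of each cycle, and the
third relies on the hold space surviving from one cycle to the next.

@example
# Autoprint plus an explicit p: every line appears twice.
printf 'a\nb\n' | sed p
# prints: a, a, b, b

# With -n only the explicit p produces output.
printf 'a\nb\n' | sed -n p
# prints: a, b

# Append the hold space to every line but the first, save the
# result back in the hold space, and print it on the last line.
printf 'a\nb\nc\n' | sed -n -e '1!G' -e 'h' -e '$p'
# prints: c, b, a
@end example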
448
449
450@node Addresses
451@section Selecting lines with @command{sed}
452@cindex Addresses, in @command{sed} scripts
453@cindex Line selection
454@cindex Selecting lines to process
455
456Addresses in a @command{sed} script can be in any of the following forms:
457@table @code
458@item @var{number}
459@cindex Address, numeric
460@cindex Line, selecting by number
461Specifying a line number will match only that line in the input.
462(Note that @command{sed} counts lines continuously across all input files
463unless @option{-i} or @option{-s} options are specified.)
464
465@item @var{first}~@var{step}
466@cindex @acronym{GNU} extensions, @samp{@var{n}~@var{m}} addresses
467This @acronym{GNU} extension matches every @var{step}th line
468starting with line @var{first}.
469In particular, lines will be selected when there exists
470a non-negative @var{n} such that the current line-number equals
471@var{first} + (@var{n} * @var{step}).
472Thus, to select the odd-numbered lines,
473one would use @code{1~2};
474to pick every third line starting with the second, @samp{2~3} would be used;
475to pick every fifth line starting with the tenth, use @samp{10~5};
476and @samp{50~0} is just an obscure way of saying @code{50}.
477
478@item $
479@cindex Address, last line
480@cindex Last line, selecting
481@cindex Line, selecting last
482This address matches the last line of the last file of input, or
483the last line of each file when the @option{-i} or @option{-s} options
484are specified.
485
486@item /@var{regexp}/
487@cindex Address, as a regular expression
488@cindex Line, selecting by regular expression match
489This will select any line which matches the regular expression @var{regexp}.
490If @var{regexp} itself includes any @code{/} characters,
491each must be escaped by a backslash (@code{\}).
492
493@cindex empty regular expression
494@cindex @value{SSEDEXT}, modifiers and the empty regular expression
495The empty regular expression @samp{//} repeats the last regular
496expression match (the same holds if the empty regular expression is
497passed to the @code{s} command).  Note that modifiers to regular expressions
498are evaluated when the regular expression is compiled, thus it is invalid to
499specify them together with the empty regular expression.
500
501@item \%@var{regexp}%
502(The @code{%} may be replaced by any other single character.)
503
504@cindex Slash character, in regular expressions
505This also matches the regular expression @var{regexp},
506but allows one to use a different delimiter than @code{/}.
507This is particularly useful if the @var{regexp} itself contains
508a lot of slashes, since it avoids the tedious escaping of every @code{/}.
509If @var{regexp} itself includes any delimiter characters,
510each must be escaped by a backslash (@code{\}).
511
512@item /@var{regexp}/I
513@itemx \%@var{regexp}%I
514@cindex @acronym{GNU} extensions, @code{I} modifier
515@ifset PERL
516@cindex Perl-style regular expressions, case-insensitive
517@end ifset
518The @code{I} modifier to regular-expression matching is a @acronym{GNU}
519extension which causes the @var{regexp} to be matched in
520a case-insensitive manner.
521
522@item /@var{regexp}/M
523@itemx \%@var{regexp}%M
@cindex @value{SSEDEXT}, @code{M} modifier
@ifset PERL
@cindex Perl-style regular expressions, multiline
@end ifset
528The @code{M} modifier to regular-expression matching is a @value{SSED}
529extension which causes @code{^} and @code{$} to match respectively
530(in addition to the normal behavior) the empty string after a newline,
531and the empty string before a newline.  There are special character
532sequences
533@ifset PERL
534(@code{\A} and @code{\Z} in Perl mode, @code{\`} and @code{\'}
535in basic or extended regular expression modes)
536@end ifset
537@ifclear PERL
538(@code{\`} and @code{\'})
539@end ifclear
540which always match the beginning or the end of the buffer.
541@code{M} stands for @cite{multi-line}.
542
543@ifset PERL
544@item /@var{regexp}/S
545@itemx \%@var{regexp}%S
546@cindex @value{SSEDEXT}, @code{S} modifier
547@cindex Perl-style regular expressions, single line
548The @code{S} modifier to regular-expression matching is only valid
549in Perl mode and specifies that the dot character (@code{.}) will
550match the newline character too.  @code{S} stands for @cite{single-line}.
551@end ifset
552
553@ifset PERL
554@item /@var{regexp}/X
555@itemx \%@var{regexp}%X
556@cindex @value{SSEDEXT}, @code{X} modifier
557@cindex Perl-style regular expressions, extended
558The @code{X} modifier to regular-expression matching is also
559valid in Perl mode only.  If it is used, whitespace in the
560pattern (other than in a character class) and
561characters between a @kbd{#} outside a character class and the
562next newline character are ignored. An escaping backslash
563can be used to include a whitespace or @kbd{#} character as part
564of the pattern.
565@end ifset
566@end table
567
568If no addresses are given, then all lines are matched;
569if one address is given, then only lines matching that
570address are matched.
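
The single-address forms listed above can be tried out as follows.
This is a sketch that assumes the @command{seq} utility is available
to generate numbered input; the path and word samples are made up.
The @samp{~} and @code{I} forms are @acronym{GNU} extensions.

@example
seq 1 10 | sed -n '3p'      # prints: 3
seq 1 10 | sed -n '1~2p'    # prints: 1, 3, 5, 7, 9
seq 1 10 | sed -n '2~3p'    # prints: 2, 5, 8
seq 1 10 | sed -n '$p'      # prints: 10

# A custom delimiter avoids escaping the slashes in the regexp.
printf '/usr/bin\n/etc\n' | sed -n '\%^/usr%p'
# prints: /usr/bin

# Case-insensitive match.
printf 'Foo\nbar\n' | sed -n '/foo/Ip'
# prints: Foo
@end example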
571
572@cindex Range of lines
573@cindex Several lines, selecting
574An address range can be specified by specifying two addresses
575separated by a comma (@code{,}).  An address range matches lines
576starting from where the first address matches, and continues
577until the second address matches (inclusively).
578
579If the second address is a @var{regexp}, then checking for the
580ending match will start with the line @emph{following} the
581line which matched the first address: a range will always
582span at least two lines (except of course if the input stream
583ends).
584
585If the second address is a @var{number} less than (or equal to)
586the line matching the first address, then only the one line is
587matched.
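
A short sketch of the three range rules just described, again assuming
@command{seq} is available and using made-up input:

@example
# Plain numeric range, inclusive at both ends.
seq 1 10 | sed -n '2,4p'
# prints: 2, 3, 4

# The ending regexp is first tested on the line after the start,
# so the x on line 1 cannot close the range it opened; the range
# then runs to the end of input.
printf 'x\ny\nz\n' | sed -n '/x/,/x/p'
# prints: x, y, z

# Second address not greater than the first: one line only.
seq 1 10 | sed -n '5,3p'
# prints: 5
@end example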
588
589@cindex Special addressing forms
590@cindex Range with start address of zero
591@cindex Zero, as range start address
592@cindex @var{addr1},+N
593@cindex @var{addr1},~N
594@cindex @acronym{GNU} extensions, special two-address forms
595@cindex @acronym{GNU} extensions, @code{0} address
596@cindex @acronym{GNU} extensions, 0,@var{addr2} addressing
597@cindex @acronym{GNU} extensions, @var{addr1},+@var{N} addressing
598@cindex @acronym{GNU} extensions, @var{addr1},~@var{N} addressing
599@value{SSED} also supports some special two-address forms; all these
600are @acronym{GNU} extensions:
601@table @code
602@item 0,/@var{regexp}/
603A line number of @code{0} can be used in an address specification like
604@code{0,/@var{regexp}/} so that @command{sed} will try to match
605@var{regexp} in the first input line too.  In other words,
606@code{0,/@var{regexp}/} is similar to @code{1,/@var{regexp}/},
except that if @var{regexp} matches the very first line of input the
608@code{0,/@var{regexp}/} form will consider it to end the range, whereas
609the @code{1,/@var{regexp}/} form will match the beginning of its range and
610hence make the range span up to the @emph{second} occurrence of the
611regular expression.
612
613Note that this is the only place where the @code{0} address makes
614sense; there is no 0-th line and commands which are given the @code{0}
615address in any other way will give an error.
616
617@item @var{addr1},+@var{N}
618Matches @var{addr1} and the @var{N} lines following @var{addr1}.
619
620@item @var{addr1},~@var{N}
621Matches @var{addr1} and the lines following @var{addr1}
622until the next line whose input line number is a multiple of @var{N}.
623@end table
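
The difference between these forms is easiest to see on small inputs.
The following is an illustrative sketch (made-up input, @command{seq}
assumed available); all three forms require @value{SSED}.

@example
# The regexp matches line 1: 0,/a/ ends the range right there...
printf 'a\nb\na\nc\n' | sed -n '0,/a/p'
# prints: a

# ...while 1,/a/ runs on to the second occurrence.
printf 'a\nb\na\nc\n' | sed -n '1,/a/p'
# prints: a, b, a

# Line 4 and the 2 lines that follow it.
seq 1 10 | sed -n '4,+2p'
# prints: 4, 5, 6

# From line 2 up to the next line number that is a multiple of 4.
seq 1 10 | sed -n '2,~4p'
# prints: 2, 3, 4
@end example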
624
625@cindex Excluding lines
626@cindex Selecting non-matching lines
627Appending the @code{!} character to the end of an address
628specification negates the sense of the match.
629That is, if the @code{!} character follows an address range,
630then only lines which do @emph{not} match the address range
631will be selected.
632This also works for singleton addresses,
633and, perhaps perversely, for the null address.
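
For instance, negation can be sketched on numbered input (assuming
@command{seq} is available):

@example
# Print everything outside the range 2,4.
seq 1 5 | sed -n '2,4!p'
# prints: 1, 5

# Print every line except the last.
seq 1 5 | sed -n '$!p'
# prints: 1, 2, 3, 4
@end example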
634
635
636@node Regular Expressions
637@section Overview of Regular Expression Syntax
638
To use @command{sed} effectively, you need to understand regular
expressions (@dfn{regexp} for short).  A regular expression
641is a pattern that is matched against a
642subject string from left to right.  Most characters are
643@dfn{ordinary}: they stand for
644themselves in a pattern, and match the corresponding characters
645in the subject.  As a trivial example, the pattern
646
647@example
648The quick brown fox
649@end example
650
651@noindent
652matches a portion of a subject string that is identical to
653itself.  The power of regular expressions comes from the
654ability to include alternatives and repetitions in the pattern.
655These are encoded in the pattern by the use of @dfn{special characters},
656which do not stand for themselves but instead
657are interpreted in some special way.  Here is a brief description
658of regular expression syntax as used in @command{sed}.
659
660@table @code
661@item @var{char}
662A single ordinary character matches itself.
663
664@item *
665@cindex @acronym{GNU} extensions, to basic regular expressions
666Matches a sequence of zero or more instances of matches for the
667preceding regular expression, which must be an ordinary character, a
668special character preceded by @code{\}, a @code{.}, a grouped regexp
669(see below), or a bracket expression.  As a @acronym{GNU} extension, a
670postfixed regular expression can also be followed by @code{*}; for
671example, @code{a**} is equivalent to @code{a*}.  @acronym{POSIX}
6721003.1-2001 says that @code{*} stands for itself when it appears at
673the start of a regular expression or subexpression, but many
non-@acronym{GNU} implementations do not support this and portable
675scripts should instead use @code{\*} in these contexts.
676
677@item \+
678@cindex @acronym{GNU} extensions, to basic regular expressions
679As @code{*}, but matches one or more.  It is a @acronym{GNU} extension.
680
681@item \?
682@cindex @acronym{GNU} extensions, to basic regular expressions
683As @code{*}, but only matches zero or one.  It is a @acronym{GNU} extension.
684
685@item \@{@var{i}\@}
686As @code{*}, but matches exactly @var{i} sequences (@var{i} is a
687decimal integer; for portability, keep it between 0 and 255
688inclusive).
689
690@item \@{@var{i},@var{j}\@}
691Matches between @var{i} and @var{j}, inclusive, sequences.
692
693@item \@{@var{i},\@}
694Matches more than or equal to @var{i} sequences.
695
696@item \(@var{regexp}\)
Groups the inner @var{regexp} as a whole.  This is used to:
698
699@itemize @bullet
700@item
701@cindex @acronym{GNU} extensions, to basic regular expressions
702Apply postfix operators, like @code{\(abcd\)*}:
703this will search for zero or more whole sequences
704of @samp{abcd}, while @code{abcd*} would search
705for @samp{abc} followed by zero or more occurrences
706of @samp{d}.  Note that support for @code{\(abcd\)*} is
707required by @acronym{POSIX} 1003.1-2001, but many non-@acronym{GNU}
708implementations do not support it and hence it is not universally
709portable.
710
711@item
712Use back references (see below).
713@end itemize
714
715@item .
716Matches any character, including newline.
717
718@item ^
719Matches the null string at beginning of the pattern space, i.e. what
720appears after the circumflex must appear at the beginning of the
721pattern space.
722
723In most scripts, pattern space is initialized to the content of each
724line (@pxref{Execution Cycle, , How @code{sed} works}).  So, it is a
725useful simplification to think of @code{^#include} as matching only
726lines where @samp{#include} is the first thing on line---if there are
727spaces before, for example, the match fails.  This simplification is
728valid as long as the original content of pattern space is not modified,
729for example with an @code{s} command.
730
731@code{^} acts as a special character only at the beginning of the
732regular expression or subexpression (that is, after @code{\(} or
733@code{\|}).  Portable scripts should avoid @code{^} at the beginning of
734a subexpression, though, as @acronym{POSIX} allows implementations that
735treat @code{^} as an ordinary character in that context.
736
737@item $
738It is the same as @code{^}, but refers to end of pattern space.
739@code{$} also acts as a special character only at the end
740of the regular expression or subexpression (that is, before @code{\)}
741or @code{\|}), and its use at the end of a subexpression is not
742portable.
743
744
745@item [@var{list}]
746@itemx [^@var{list}]
747Matches any single character in @var{list}: for example,
748@code{[aeiou]} matches all vowels.  A list may include
749sequences like @code{@var{char1}-@var{char2}}, which
750matches any character between (inclusive) @var{char1}
751and @var{char2}.
752
753A leading @code{^} reverses the meaning of @var{list}, so that
754it matches any single character @emph{not} in @var{list}.  To include
@code{]} in the list, make it the first character (after
the @code{^} if needed); to include @code{-} in the list,
make it the first or last; to include @code{^}, put
it after the first character.
759
760@cindex @code{POSIXLY_CORRECT} behavior, bracket expressions
761The characters @code{$}, @code{*}, @code{.}, @code{[}, and @code{\}
762are normally not special within @var{list}.  For example, @code{[\*]}
763matches either @samp{\} or @samp{*}, because the @code{\} is not
764special here.  However, strings like @code{[.ch.]}, @code{[=a=]}, and
765@code{[:space:]} are special within @var{list} and represent collating
766symbols, equivalence classes, and character classes, respectively, and
767@code{[} is therefore special within @var{list} when it is followed by
768@code{.}, @code{=}, or @code{:}.  Also, when not in
769@env{POSIXLY_CORRECT} mode, special escapes like @code{\n} and
770@code{\t} are recognized within @var{list}.  @xref{Escapes}.
771
772@item @var{regexp1}\|@var{regexp2}
773@cindex @acronym{GNU} extensions, to basic regular expressions
774Matches either @var{regexp1} or @var{regexp2}.  Use
775parentheses to use complex alternative regular expressions.
776The matching process tries each alternative in turn, from
777left to right, and the first one that succeeds is used.
778It is a @acronym{GNU} extension.
779
780@item @var{regexp1}@var{regexp2}
781Matches the concatenation of @var{regexp1} and @var{regexp2}.
782Concatenation binds more tightly than @code{\|}, @code{^}, and
783@code{$}, but less tightly than the other regular expression
784operators.
785
786@item \@var{digit}
787Matches the @var{digit}-th @code{\(@dots{}\)} parenthesized
788subexpression in the regular expression.  This is called a @dfn{back
reference}.  Subexpressions are implicitly numbered by counting
790occurrences of @code{\(} left-to-right.
791
792@item \n
793Matches the newline character.
794
795@item \@var{char}
796Matches @var{char}, where @var{char} is one of @code{$},
797@code{*}, @code{.}, @code{[}, @code{\}, or @code{^}.
798Note that the only C-like
799backslash sequences that you can portably assume to be
800interpreted are @code{\n} and @code{\\}; in particular
801@code{\t} is not portable, and matches a @samp{t} under most
802implementations of @command{sed}, rather than a tab character.
803
804@end table
805
806@cindex Greedy regular expression matching
807Note that the regular expression matcher is greedy, i.e., matches
808are attempted from left to right and, if two or more matches are
809possible starting at the same character, it selects the longest.
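
As a rough illustration of greedy matching (the sample strings and the
shell variable names are invented for this example), the following
fragment shows that the leftmost match wins and that, among the matches
starting at that position, the longest one is chosen:

```shell
# Longest: '.*' runs as far as the LAST 'X', not the first one.
longest=$(printf '%s\n' 'aXbXc' | sed 's/a.*X/-/')

# Leftmost: the short run of a's that starts earlier is preferred
# over the longer run that starts later.
leftmost=$(printf '%s\n' 'xaaxaaa' | sed 's/aa*/-/')

printf '%s\n' "$longest" "$leftmost"
```

The first command prints @samp{-c} and the second prints @samp{x-xaaa}.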
810
811@noindent
812Examples:
813@table @samp
814@item abcdef
815Matches @samp{abcdef}.
816
817@item a*b
818Matches zero or more @samp{a}s followed by a single
819@samp{b}.  For example, @samp{b} or @samp{aaaaab}.
820
821@item a\?b
822Matches @samp{b} or @samp{ab}.
823
824@item a\+b\+
825Matches one or more @samp{a}s followed by one or more
826@samp{b}s: @samp{ab} is the shortest possible match, but
827other examples are @samp{aaaab} or @samp{abbbbb} or
828@samp{aaaaaabbbbbbb}.
829
830@item .*
831@itemx .\+
832These two both match all the characters in a string;
833however, the first matches every string (including the empty
834string), while the second matches only strings containing
835at least one character.
836
837@item ^main.*(.*)
This matches a string starting with @samp{main},
839followed by an opening and closing
840parenthesis.  The @samp{n}, @samp{(} and @samp{)} need not
841be adjacent.
842
843@item ^#
844This matches a string beginning with @samp{#}.
845
846@item \\$
847This matches a string ending with a single backslash.  The
848regexp contains two backslashes for escaping.
849
850@item \$
851Instead, this matches a string consisting of a single dollar sign,
852because it is escaped.
853
854@item [a-zA-Z0-9]
In the C locale, this matches any single @acronym{ASCII} letter or digit.
856
857@item [^ @kbd{tab}]\+
858(Here @kbd{tab} stands for a single tab character.)
859This matches a string of one or more
860characters, none of which is a space or a tab.
861Usually this means a word.
862
863@item ^\(.*\)\n\1$
864This matches a string consisting of two equal substrings separated by
865a newline.
866
867@item .\@{9\@}A$
868This matches nine characters followed by an @samp{A}.
869
870@item ^.\@{15\@}A
871This matches the start of a string that contains 16 characters,
872the last of which is an @samp{A}.
873
874@end table
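
A few of the expressions above can be tried out as addresses.  The
fragment below is only a sketch: the four sample lines and the variable
names are made up, and @samp{sed -n '/@var{regexp}/p'} is used simply
to show which lines a given expression selects.

```shell
sample() { printf '%s\n' 'main (int argc)' '# comment' 'xyzzy' '123456789A'; }

# Lines starting with "main" that later contain "(" and then ")".
starts_main=$(sample | sed -n '/^main.*(.*)/p')

# Lines beginning with "#".
comments=$(sample | sed -n '/^#/p')

# Lines made of exactly nine characters followed by a final "A".
ten_wide=$(sample | sed -n '/^.\{9\}A$/p')

# Two equal lines: N joins them with a newline so \n and \1 can match.
twins=$(printf '%s\n' abc abc | sed -n '$!N;/^\(.*\)\n\1$/p')

printf '%s\n' "$starts_main" "$comments" "$ten_wide" "$twins"
```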
875
876
877
878@node Common Commands
879@section Often-Used Commands
880
881If you use @command{sed} at all, you will quite likely want to know
882these commands.
883
884@table @code
885@item #
886[No addresses allowed.]
887
888@findex # (comments)
889@cindex Comments, in scripts
890The @code{#} character begins a comment;
891the comment continues until the next newline.
892
893@cindex Portability, comments
894If you are concerned about portability, be aware that
895some implementations of @command{sed} (which are not @sc{posix}
896conformant) may only support a single one-line comment,
897and then only when the very first character of the script is a @code{#}.
898
899@findex -n, forcing from within a script
900@cindex Caveat --- #n on first line
901Warning: if the first two characters of the @command{sed} script
902are @code{#n}, then the @option{-n} (no-autoprint) option is forced.
903If you want to put a comment in the first line of your script
904and that comment begins with the letter @samp{n}
905and you do not want this behavior,
906then be sure to either use a capital @samp{N},
907or place at least one space before the @samp{n}.
908
909@item q [@var{exit-code}]
910This command only accepts a single address.
911
912@findex q (quit) command
913@cindex @value{SSEDEXT}, returning an exit code
914@cindex Quitting
915Exit @command{sed} without processing any more commands or input.
916Note that the current pattern space is printed if auto-print is
not disabled with the @option{-n} option.  The ability to return
918an exit code from the @command{sed} script is a @value{SSED} extension.
919
920@item d
921@findex d (delete) command
922@cindex Text, deleting
923Delete the pattern space;
924immediately start next cycle.
925
926@item p
927@findex p (print) command
928@cindex Text, printing
929Print out the pattern space (to the standard output).
930This command is usually only used in conjunction with the @option{-n}
931command-line option.
932
933@item n
934@findex n (next-line) command
935@cindex Next input line, replace pattern space with
936@cindex Read next input line
937If auto-print is not disabled, print the pattern space,
938then, regardless, replace the pattern space with the next line of input.
939If there is no more input then @command{sed} exits without processing
940any more commands.
941
942@item @{ @var{commands} @}
943@findex @{@} command grouping
944@cindex Grouping commands
945@cindex Command groups
946A group of commands may be enclosed between
947@code{@{} and @code{@}} characters.
948This is particularly useful when you want a group of commands
949to be triggered by a single address (or address-range) match.
950
951@end table
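
The commands in this section can be exercised with a handful of
one-liners.  This is a minimal sketch rather than a recipe: the helper
@code{nums} and the variable names are hypothetical, and the input is
just the numbers 1 to 6, one per line.

```shell
# Six numbered lines serve as (hypothetical) sample input.
nums() { printf '%s\n' 1 2 3 4 5 6; }

# q: print up to and including line 2, then quit.
first_two=$(nums | sed 2q)

# d: delete line 3; every other line is auto-printed.
no_third=$(nums | sed 3d)

# -n with p: print only line 4.
only_fourth=$(nums | sed -n 4p)

# n: auto-print the current line, fetch the next one, then delete it,
# so only the odd-numbered lines survive.
odd=$(nums | sed 'n;d')

# { }: run two commands for every line in the range 2,3.
grouped=$(nums | sed -n '2,3{p;p;}')

printf '%s\n' "$first_two" "$no_third" "$only_fourth" "$odd" "$grouped"
```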
952
953@node The "s" Command
954@section The @code{s} Command
955
956The syntax of the @code{s} (as in substitute) command is
957@samp{s/@var{regexp}/@var{replacement}/@var{flags}}.  The @code{/}
958characters may be uniformly replaced by any other single
959character within any given @code{s} command.  The @code{/}
960character (or whatever other character is used in its stead)
961can appear in the @var{regexp} or @var{replacement}
962only if it is preceded by a @code{\} character.
963
964The @code{s} command is probably the most important in @command{sed}
965and has a lot of different options.  Its basic concept is simple:
966the @code{s} command attempts to match the pattern
967space against the supplied @var{regexp}; if the match is
968successful, then that portion of the pattern
969space which was matched is replaced with @var{replacement}.
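
For instance, the following sketch (sample strings and variable names
are invented here) performs a plain substitution and then shows two
equivalent ways of handling a @var{regexp} that contains slashes:
choosing another delimiter, or escaping each slash.

```shell
# Replace the first match of the regexp on each line.
basic=$(printf '%s\n' 'hello world' | sed 's/world/there/')

# Use '|' as the delimiter so the slashes need no escaping.
with_bar=$(printf '%s\n' '/usr/local/bin' | sed 's|/usr/local|/opt|')

# Keep '/' as the delimiter and escape every literal slash instead.
with_escapes=$(printf '%s\n' '/usr/local/bin' | sed 's/\/usr\/local/\/opt/')

printf '%s\n' "$basic" "$with_bar" "$with_escapes"
```

The first command prints @samp{hello there}; the other two both print
@samp{/opt/bin}.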
970
971@cindex Backreferences, in regular expressions
972@cindex Parenthesized substrings
973The @var{replacement} can contain @code{\@var{n}} (@var{n} being
974a number from 1 to 9, inclusive) references, which refer to
975the portion of the match which is contained between the @var{n}th
976@code{\(} and its matching @code{\)}.
977Also, the @var{replacement} can contain unescaped @code{&}
978characters which reference the whole matched portion
979of the pattern space.
980@cindex @value{SSEDEXT}, case modifiers in @code{s} commands
981Finally, as a @value{SSED} extension, you can include a
982special sequence made of a backslash and one of the letters
983@code{L}, @code{l}, @code{U}, @code{u}, or @code{E}.
984The meaning is as follows:
985
986@table @code
987@item \L
988Turn the replacement
989to lowercase until a @code{\U} or @code{\E} is found,
990
991@item \l
992Turn the
993next character to lowercase,
994
995@item \U
996Turn the replacement to uppercase
997until a @code{\L} or @code{\E} is found,
998
999@item \u
1000Turn the next character
1001to uppercase,
1002
1003@item \E
1004Stop case conversion started by @code{\L} or @code{\U}.
1005@end table
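
The case-conversion sequences might be used as in the sketch below.
It assumes @acronym{GNU} @command{sed}, since these sequences are an
extension; the sample text and variable names are made up for
illustration.

```shell
# \U: uppercase the whole match.
upper=$(printf '%s\n' 'hello world' | sed 's/.*/\U&/')

# \u: uppercase only the next character.
capital=$(printf '%s\n' 'hello world' | sed 's/^./\u&/')

# \E: stop the conversion started by \U before the second word.
mixed=$(printf '%s\n' 'hello world' | sed 's/\(hello\) \(world\)/\U\1\E \2/')

# \l: lowercase only the next character.
lower_first=$(printf '%s\n' 'HELLO' | sed 's/.*/\l&/')

printf '%s\n' "$upper" "$capital" "$mixed" "$lower_first"
```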
1006
1007To include a literal @code{\}, @code{&}, or newline in the final
1008replacement, be sure to precede the desired @code{\}, @code{&},
1009or newline in the @var{replacement} with a @code{\}.
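
The following sketch puts back references, @code{&}, and the escaped
forms side by side.  The input strings and variable names are
assumptions made for the example.

```shell
# \1 and \2 refer to the first and second parenthesized groups.
swapped=$(printf '%s\n' 'hello world' | sed 's/\(hello\) \(world\)/\2 \1/')

# An unescaped & stands for the whole matched text.
bracketed=$(printf '%s\n' 'id 42' | sed 's/[0-9][0-9]*/<&>/')

# \& inserts a literal ampersand instead.
literal_amp=$(printf '%s\n' 'fish and chips' | sed 's/and/\&/')

# \\ inserts a single literal backslash.
backslash=$(printf '%s\n' 'a/b' | sed 's|/|\\|')

# A backslash followed by a newline inserts a newline.
two_lines=$(printf '%s\n' 'a,b' | sed 's/,/\
/')

printf '%s\n' "$swapped" "$bracketed" "$literal_amp" "$backslash" "$two_lines"
```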
1010
1011@findex s command, option flags
1012@cindex Substitution of text, options
1013The @code{s} command can be followed by zero or more of the
1014following @var{flags}:
1015
1016@table @code
1017@item g
1018@cindex Global substitution
1019@cindex Replacing all text matching regexp in a line
1020Apply the replacement to @emph{all} matches to the @var{regexp},
1021not just the first.
1022
1023@item @var{number}
1024@cindex Replacing only @var{n}th match of regexp in a line
1025Only replace the @var{number}th match of the @var{regexp}.
1026
1027@cindex @acronym{GNU} extensions, @code{g} and @var{number} modifier interaction in @code{s} command
1028@cindex Mixing @code{g} and @var{number} modifiers in the @code{s} command
1029Note: the @sc{posix} standard does not specify what should happen
1030when you mix the @code{g} and @var{number} modifiers,
1031and currently there is no widely agreed upon meaning
1032across @command{sed} implementations.
1033For @value{SSED}, the interaction is defined to be:
1034ignore matches before the @var{number}th,
1035and then match and replace all matches from
1036the @var{number}th on.
1037
1038@item p
1039@cindex Text, printing after substitution
1040If the substitution was made, then print the new pattern space.
1041
1042Note: when both the @code{p} and @code{e} options are specified,
1043the relative ordering of the two produces very different results.
1044In general, @code{ep} (evaluate then print) is what you want,
1045but operating the other way round can be useful for debugging.
1046For this reason, the current version of @value{SSED} interprets
1047specially the presence of @code{p} options both before and after
1048@code{e}, printing the pattern space before and after evaluation,
1049while in general flags for the @code{s} command show their
1050effect just once.  This behavior, although documented, might
1051change in future versions.
1052
1053@item w @var{file-name}
1054@cindex Text, writing to a file after substitution
1055@cindex @value{SSEDEXT}, @file{/dev/stdout} file
1056@cindex @value{SSEDEXT}, @file{/dev/stderr} file
1057If the substitution was made, then write out the result to the named file.
1058As a @value{SSED} extension, two special values of @var{file-name} are
1059supported: @file{/dev/stderr}, which writes the result to the standard
1060error, and @file{/dev/stdout}, which writes to the standard
1061output.@footnote{This is equivalent to @code{p} unless the @option{-i}
1062option is being used.}
1063
1064@item e
1065@cindex Evaluate Bourne-shell commands, after substitution
1066@cindex Subprocesses
1067@cindex @value{SSEDEXT}, evaluating Bourne-shell commands
1068@cindex @value{SSEDEXT}, subprocesses
1069This command allows one to pipe input from a shell command
1070into pattern space.  If a substitution was made, the command
1071that is found in pattern space is executed and pattern space
1072is replaced with its output.  A trailing newline is suppressed;
1073results are undefined if the command to be executed contains
1074a @sc{nul} character.  This is a @value{SSED} extension.
1075
1076@item I
1077@itemx i
1078@cindex @acronym{GNU} extensions, @code{I} modifier
1079@cindex Case-insensitive matching
1080@ifset PERL
1081@cindex Perl-style regular expressions, case-insensitive
1082@end ifset
1083The @code{I} modifier to regular-expression matching is a @acronym{GNU}
1084extension which makes @command{sed} match @var{regexp} in a
1085case-insensitive manner.
1086
1087@item M
1088@itemx m
1089@cindex @value{SSEDEXT}, @code{M} modifier
1090@ifset PERL
1091@cindex Perl-style regular expressions, multiline
1092@end ifset
1093The @code{M} modifier to regular-expression matching is a @value{SSED}
1094extension which causes @code{^} and @code{$} to match respectively
1095(in addition to the normal behavior) the empty string after a newline,
1096and the empty string before a newline.  There are special character
1097sequences
1098@ifset PERL
1099(@code{\A} and @code{\Z} in Perl mode, @code{\`} and @code{\'}
1100in basic or extended regular expression modes)
1101@end ifset
1102@ifclear PERL
1103(@code{\`} and @code{\'})
1104@end ifclear
1105which always match the beginning or the end of the buffer.
1106@code{M} stands for @cite{multi-line}.
1107
1108@ifset PERL
1109@item S
1110@itemx s
1111@cindex @value{SSEDEXT}, @code{S} modifier
1112@cindex Perl-style regular expressions, single line
1113The @code{S} modifier to regular-expression matching is only valid
1114in Perl mode and specifies that the dot character (@code{.}) will
1115match the newline character too.  @code{S} stands for @cite{single-line}.
1116@end ifset
1117
1118@ifset PERL
1119@item X
1120@itemx x
1121@cindex @value{SSEDEXT}, @code{X} modifier
1122@cindex Perl-style regular expressions, extended
1123The @code{X} modifier to regular-expression matching is also
1124valid in Perl mode only.  If it is used, whitespace in the
1125pattern (other than in a character class) and
1126characters between a @kbd{#} outside a character class and the
1127next newline character are ignored. An escaping backslash
1128can be used to include a whitespace or @kbd{#} character as part
1129of the pattern.
1130@end ifset
1131@end table
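
As a short illustration of the flags, consider the sketch below.  The
inputs and variable names are invented; the last three commands assume
@acronym{GNU} @command{sed}, because combining a number with @code{g}
and the @code{I} and @code{M} modifiers are extensions.

```shell
# No flag: only the first match is replaced.
first_only=$(printf '%s\n' 'aaaa' | sed 's/a/b/')

# g: every match is replaced.
all=$(printf '%s\n' 'aaaa' | sed 's/a/b/g')

# number: only the third match is replaced.
third=$(printf '%s\n' 'aaaa' | sed 's/a/b/3')

# p (with -n): print only the lines on which a substitution was made.
printed=$(printf '%s\n' one two three | sed -n 's/t/T/p')

# number plus g: skip the first match, replace all the later ones.
from_second=$(printf '%s\n' 'aaaa' | sed 's/a/b/2g')

# I: match without regard to case.
nocase=$(printf '%s\n' 'Hello' | sed 's/hello/bye/I')

# M: after N the pattern space is "a<newline>b"; ^ also matches after
# the embedded newline.
multiline=$(printf '%s\n' a b | sed 'N;s/^b/B/M')

printf '%s\n' "$first_only" "$all" "$third" "$printed" \
    "$from_second" "$nocase" "$multiline"
```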
1132
1133
1134@node Other Commands
1135@section Less Frequently-Used Commands
1136
1137Though perhaps less frequently used than those in the previous
1138section, some very small yet useful @command{sed} scripts can be built with
1139these commands.
1140
1141@table @code
1142@item y/@var{source-chars}/@var{dest-chars}/
1143(The @code{/} characters may be uniformly replaced by
1144any other single character within any given @code{y} command.)
1145
1146@findex y (transliterate) command
1147@cindex Transliteration
1148Transliterate any characters in the pattern space which match
1149any of the @var{source-chars} with the corresponding character
1150in @var{dest-chars}.
1151
1152Instances of the @code{/} (or whatever other character is used in its stead),
1153@code{\}, or newlines can appear in the @var{source-chars} or @var{dest-chars}
lists, provided that each instance is escaped by a @code{\}.
1155The @var{source-chars} and @var{dest-chars} lists @emph{must}
1156contain the same number of characters (after de-escaping).
1157
1158@item a\
1159@itemx @var{text}
1160@cindex @value{SSEDEXT}, two addresses supported by most commands
1161As a @acronym{GNU} extension, this command accepts two addresses.
1162
1163@findex a (append text lines) command
1164@cindex Appending text after a line
1165@cindex Text, appending
1166Queue the lines of text which follow this command
1167(each but the last ending with a @code{\},
1168which are removed from the output)
1169to be output at the end of the current cycle,
1170or when the next input line is read.
1171
1172Escape sequences in @var{text} are processed, so you should
1173use @code{\\} in @var{text} to print a single backslash.
1174
As a @acronym{GNU} extension, if between the @code{a} and the newline there is
anything other than a whitespace-@code{\} sequence, then the text of this line,
1177starting at the first non-whitespace character after the @code{a},
1178is taken as the first line of the @var{text} block.
1179(This enables a simplification in scripting a one-line add.)
1180This extension also works with the @code{i} and @code{c} commands.
1181
1182@item i\
1183@itemx @var{text}
1184@cindex @value{SSEDEXT}, two addresses supported by most commands
1185As a @acronym{GNU} extension, this command accepts two addresses.
1186
1187@findex i (insert text lines) command
1188@cindex Inserting text before a line
1189@cindex Text, insertion
1190Immediately output the lines of text which follow this command
1191(each but the last ending with a @code{\},
1192which are removed from the output).
1193
1194@item c\
1195@itemx @var{text}
1196@findex c (change to text lines) command
1197@cindex Replacing selected lines with other text
1198Delete the lines matching the address or address-range,
1199and output the lines of text which follow this command
1200(each but the last ending with a @code{\},
1201which are removed from the output)
1202in place of the last line
1203(or in place of each line, if no addresses were specified).
1204A new cycle is started after this command is done,
1205since the pattern space will have been deleted.
1206
1207@item =
1208@cindex @value{SSEDEXT}, two addresses supported by most commands
1209As a @acronym{GNU} extension, this command accepts two addresses.
1210
1211@findex = (print line number) command
1212@cindex Printing line number
1213@cindex Line number, printing
1214Print out the current input line number (with a trailing newline).
1215
1216@item l @var{n}
1217@findex l (list unambiguously) command
1218@cindex List pattern space
1219@cindex Printing text unambiguously
1220@cindex Line length, setting
1221@cindex @value{SSEDEXT}, setting line length
1222Print the pattern space in an unambiguous form:
1223non-printable characters (and the @code{\} character)
1224are printed in C-style escaped form; long lines are split,
1225with a trailing @code{\} character to indicate the split;
1226the end of each line is marked with a @code{$}.
1227
1228@var{n} specifies the desired line-wrap length;
1229a length of 0 (zero) means to never wrap long lines.  If omitted,
1230the default as specified on the command line is used.  The @var{n}
1231parameter is a @value{SSED} extension.
1232
1233@item r @var{filename}
1234@cindex @value{SSEDEXT}, two addresses supported by most commands
1235As a @acronym{GNU} extension, this command accepts two addresses.
1236
1237@findex r (read file) command
1238@cindex Read text from a file
1239@cindex @value{SSEDEXT}, @file{/dev/stdin} file
1240Queue the contents of @var{filename} to be read and
1241inserted into the output stream at the end of the current cycle,
1242or when the next input line is read.
1243Note that if @var{filename} cannot be read, it is treated as
1244if it were an empty file, without any error indication.
1245
1246As a @value{SSED} extension, the special value @file{/dev/stdin}
1247is supported for the file name, which reads the contents of the
1248standard input.
1249
1250@item w @var{filename}
1251@findex w (write file) command
1252@cindex Write to a file
1253@cindex @value{SSEDEXT}, @file{/dev/stdout} file
1254@cindex @value{SSEDEXT}, @file{/dev/stderr} file
1255Write the pattern space to @var{filename}.
As a @value{SSED} extension, two special values of @var{filename} are
1257supported: @file{/dev/stderr}, which writes the result to the standard
1258error, and @file{/dev/stdout}, which writes to the standard
1259output.@footnote{This is equivalent to @code{p} unless the @option{-i}
1260option is being used.}
1261
1262The file will be created (or truncated) before the
1263first input line is read; all @code{w} commands
1264(including instances of @code{w} flag on successful @code{s} commands)
1265which refer to the same @var{filename} are output without
1266closing and reopening the file.
1267
1268@item D
1269@findex D (delete first line) command
1270@cindex Delete first line from pattern space
1271Delete text in the pattern space up to the first newline.
1272If any text is left, restart cycle with the resultant
1273pattern space (without reading a new line of input),
1274otherwise start a normal new cycle.
1275
1276@item N
1277@findex N (append Next line) command
1278@cindex Next input line, append to pattern space
1279@cindex Append next input line to pattern space
1280Add a newline to the pattern space,
1281then append the next line of input to the pattern space.
1282If there is no more input then @command{sed} exits without processing
1283any more commands.
1284
1285@item P
1286@findex P (print first line) command
1287@cindex Print first line from pattern space
1288Print out the portion of the pattern space up to the first newline.
1289
1290@item h
1291@findex h (hold) command
1292@cindex Copy pattern space into hold space
1293@cindex Replace hold space with copy of pattern space
1294@cindex Hold space, copying pattern space into
1295Replace the contents of the hold space with the contents of the pattern space.
1296
1297@item H
1298@findex H (append Hold) command
1299@cindex Append pattern space to hold space
1300@cindex Hold space, appending from pattern space
1301Append a newline to the contents of the hold space,
1302and then append the contents of the pattern space to that of the hold space.
1303
1304@item g
1305@findex g (get) command
1306@cindex Copy hold space into pattern space
1307@cindex Replace pattern space with copy of hold space
1308@cindex Hold space, copy into pattern space
1309Replace the contents of the pattern space with the contents of the hold space.
1310
1311@item G
1312@findex G (appending Get) command
1313@cindex Append hold space to pattern space
1314@cindex Hold space, appending to pattern space
1315Append a newline to the contents of the pattern space,
1316and then append the contents of the hold space to that of the pattern space.
1317
1318@item x
1319@findex x (eXchange) command
1320@cindex Exchange hold space with pattern space
1321@cindex Hold space, exchange with pattern space
1322Exchange the contents of the hold and pattern spaces.
1323
1324@end table
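
Several of these commands are easiest to understand from small
experiments.  The fragment below is a sketch under stated assumptions:
the inputs and variable names are invented, and the text commands use
the backslash-newline form rather than the one-line @acronym{GNU}
shorthand.

```shell
# y: map e to i and l to p, character by character.
translit=$(printf '%s\n' 'hello' | sed 'y/el/ip/')

# i and a: emit text before and after line 2.
around=$(printf '%s\n' 1 2 3 | sed -e '2i\
before' -e '2a\
after')

# c: replace line 2 with other text.
changed=$(printf '%s\n' 1 2 3 | sed '2c\
changed')

# =: print the number of the last line, that is, count the lines.
count=$(printf '%s\n' a b c | sed -n '$=')

# l: show a tab unambiguously as \t, with $ marking the end of the line.
listed=$(printf 'a\tb\n' | sed -n l)

# N: append the next line to the pattern space, then join the pair.
pairs=$(printf '%s\n' 1 2 3 4 | sed '$!N;s/\n/ /')

# G and h: append the hold space to each new line, save the result,
# and print it once at the end, which reverses the input.
reversed=$(printf '%s\n' 1 2 3 | sed -n '1!G;h;$p')

printf '%s\n' "$translit" "$around" "$changed" "$count" \
    "$listed" "$pairs" "$reversed"
```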
1325
1326
1327@node Programming Commands
1328@section Commands for @command{sed} gurus
1329
1330In most cases, use of these commands indicates that you are
1331probably better off programming in something like @command{awk}
1332or Perl.  But occasionally one is committed to sticking
1333with @command{sed}, and these commands can enable one to write
1334quite convoluted scripts.
1335
1336@cindex Flow of control in scripts
1337@table @code
1338@item : @var{label}
1339[No addresses allowed.]
1340
1341@findex : (label) command
1342@cindex Labels, in scripts
1343Specify the location of @var{label} for branch commands.
1344In all other respects, a no-op.
1345
1346@item b @var{label}
1347@findex b (branch) command
1348@cindex Branch to a label, unconditionally
1349@cindex Goto, in scripts
1350Unconditionally branch to @var{label}.
The @var{label} may be omitted, in which case @command{sed} branches to
the end of the script, after which the next cycle is started.
1352
1353@item t @var{label}
1354@findex t (test and branch if successful) command
1355@cindex Branch to a label, if @code{s///} succeeded
1356@cindex Conditional branch
1357Branch to @var{label} only if there has been a successful @code{s}ubstitution
1358since the last input line was read or conditional branch was taken.
The @var{label} may be omitted, in which case @command{sed} branches to
the end of the script, after which the next cycle is started.
1360
1361@end table
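
The sketch below shows one possible use of each command.  The inputs
and variable names are hypothetical, and each label is given in its own
@option{-e} fragment so that it is ended by the end of that fragment
rather than by a semicolon.

```shell
# b without a label: skip the rest of the script for comment lines.
skipped=$(printf '%s\n' '# foo' 'foo' | sed -e '/^#/b' -e 's/foo/bar/')

# : and b: loop until the last line has been appended with N, then
# turn every embedded newline into a comma.
joined=$(printf '%s\n' a b c | sed -e ':a' -e 'N' -e '$!ba' -e 's/\n/,/g')

# t: repeat the substitution for as long as it keeps succeeding,
# peeling off one innermost pair of parentheses per pass.
unwrapped=$(printf '%s\n' 'a((b))' | sed -e ':a' -e 's/(\([^()]*\))/\1/' -e 'ta')

printf '%s\n' "$skipped" "$joined" "$unwrapped"
```

The last two commands print @samp{a,b,c} and @samp{ab}.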
1362
1363@node Extended Commands
1364@section Commands Specific to @value{SSED}
1365
1366These commands are specific to @value{SSED}, so you
1367must use them with care and only when you are sure that
1368hindering portability is not evil.  They allow you to check
1369for @value{SSED} extensions or to do tasks that are required
1370quite often, yet are unsupported by standard @command{sed}s.
1371
1372@table @code
1373@item e [@var{command}]
1374@findex e (evaluate) command
1375@cindex Evaluate Bourne-shell commands
1376@cindex Subprocesses
1377@cindex @value{SSEDEXT}, evaluating Bourne-shell commands
1378@cindex @value{SSEDEXT}, subprocesses
1379This command allows one to pipe input from a shell command
1380into pattern space.  Without parameters, the @code{e} command
1381executes the command that is found in pattern space and
1382replaces the pattern space with the output; a trailing newline
1383is suppressed.
1384
1385If a parameter is specified, instead, the @code{e} command
1386interprets it as a command and sends its output to the output stream
1387(like @code{r} does).  The command can run across multiple
1388lines, all but the last ending with a back-slash.
1389
1390In both cases, the results are undefined if the command to be
1391executed contains a @sc{nul} character.
1392
1393@item L @var{n}
1394@findex L (fLow paragraphs) command
1395@cindex Reformat pattern space
1396@cindex Reformatting paragraphs
1397@cindex @value{SSEDEXT}, reformatting paragraphs
1398@cindex @value{SSEDEXT}, @code{L} command
1399This @value{SSED} extension fills and joins lines in pattern space
1400to produce output lines of (at most) @var{n} characters, like
1401@code{fmt} does; if @var{n} is omitted, the default as specified
1402on the command line is used.  This command is considered a failed
1403experiment and unless there is enough request (which seems unlikely)
1404will be removed in future versions.
1405
1406@ignore
1407Blank lines, spaces between words, and indentation are
1408preserved in the output; successive input lines with different
1409indentation are not joined; tabs are expanded to 8 columns.
1410
1411If the pattern space contains multiple lines, they are joined, but
1412since the pattern space usually contains a single line, the behavior
1413of a simple @code{L;d} script is the same as @samp{fmt -s} (i.e.,
1414it does not join short lines to form longer ones).
1415
1416@var{n} specifies the desired line-wrap length; if omitted,
1417the default as specified on the command line is used.
1418@end ignore
1419
1420@item Q [@var{exit-code}]
1421This command only accepts a single address.
1422
1423@findex Q (silent Quit) command
1424@cindex @value{SSEDEXT}, quitting silently
1425@cindex @value{SSEDEXT}, returning an exit code
1426@cindex Quitting
1427This command is the same as @code{q}, but will not print the
1428contents of pattern space.  Like @code{q}, it provides the
1429ability to return an exit code to the caller.
1430
1431This command can be useful because the only alternative ways
1432to accomplish this apparently trivial function are to use
1433the @option{-n} option (which can unnecessarily complicate
your script) or to resort to the following snippet, which
1435wastes time by reading the whole file without any visible effect:
1436
1437@example
1438:eat
1439$d       @i{@r{Quit silently on the last line}}
1440N        @i{@r{Read another line, silently}}
1441g        @i{@r{Overwrite pattern space each time to save memory}}
1442b eat
1443@end example
1444
1445@item R @var{filename}
1446@findex R (read line) command
1447@cindex Read text from a file
1448@cindex @value{SSEDEXT}, reading a file a line at a time
1449@cindex @value{SSEDEXT}, @code{R} command
1450@cindex @value{SSEDEXT}, @file{/dev/stdin} file
1451Queue a line of @var{filename} to be read and
1452inserted into the output stream at the end of the current cycle,
1453or when the next input line is read.
1454Note that if @var{filename} cannot be read, or if its end is
1455reached, no line is appended, without any error indication.
1456
1457As with the @code{r} command, the special value @file{/dev/stdin}
1458is supported for the file name, which reads a line from the
1459standard input.
1460
1461@item T @var{label}
1462@findex T (test and branch if failed) command
1463@cindex @value{SSEDEXT}, branch if @code{s///} failed
1464@cindex Branch to a label, if @code{s///} failed
1465@cindex Conditional branch
1466Branch to @var{label} only if there have been no successful
1467@code{s}ubstitutions since the last input line was read or
conditional branch was taken.  The @var{label} may be omitted,
in which case @command{sed} branches to the end of the script,
after which the next cycle is started.
1470
1471@item v @var{version}
1472@findex v (version) command
1473@cindex @value{SSEDEXT}, checking for their presence
1474@cindex Requiring @value{SSED}
1475This command does nothing, but makes @command{sed} fail if
1476@value{SSED} extensions are not supported, simply because other
1477versions of @command{sed} do not implement it.  In addition, you
1478can specify the version of @command{sed} that your script
1479requires, such as @code{4.0.5}.  The default is @code{4.0}
1480because that is the first version that implemented this command.
1481
1482This command enables all @value{SSEDEXT} even if
1483@env{POSIXLY_CORRECT} is set in the environment.
1484
1485@item W @var{filename}
1486@findex W (write first line) command
1487@cindex Write first line to a file
1488@cindex @value{SSEDEXT}, writing first line to a file
1489Write to the given filename the portion of the pattern space up to
1490the first newline.  Everything said under the @code{w} command about
1491file handling holds here too.
1492
1493@item z
1494@findex z (Zap) command
1495@cindex @value{SSEDEXT}, emptying pattern space
1496@cindex Emptying pattern space
1497This command empties the content of pattern space.  It is
1498usually the same as @samp{s/.*//}, but is more efficient
1499and works in the presence of invalid multibyte sequences
1500in the input stream.  @sc{posix} mandates that such sequences
1501are @emph{not} matched by @samp{.}, so that there is no portable
1502way to clear @command{sed}'s buffers in the middle of the
1503script in most multibyte locales (including UTF-8 locales).
1504@end table
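
Some of these extensions can be tried as follows.  This is only a
sketch: it assumes a reasonably recent @acronym{GNU} @command{sed}
(@code{z} is not available in older 4.x releases) and a
@command{mktemp} utility, and the inputs, the temporary file, and the
variable names are all invented for the example.

```shell
# q prints the current line before quitting; Q does not.
with_q=$(printf '%s\n' 1 2 3 | sed 2q)
with_Q=$(printf '%s\n' 1 2 3 | sed 2Q)

# Q with an exit code: the value becomes the exit status of sed.
status=0
printf '%s\n' 1 2 3 | sed -n '2Q7' || status=$?

# T without a label: skip the rest of the script when s did NOT succeed.
marked=$(printf '%s\n' foo bar | sed -e 's/foo/FOO/' -e 'T' -e 's/$/!/')

# z: empty the pattern space, which then matches ^$.
zapped=$(printf '%s\n' 'abc' | sed 'z;s/^$/(empty)/')

# v: do nothing except require GNU sed 4.2 or later.
versioned=$(printf '%s\n' 'a' | sed -e 'v 4.2' -e 's/a/b/')

# R: queue one line of a second file after each input line.
tmp=$(mktemp)
printf '%s\n' x y > "$tmp"
interleaved=$(printf '%s\n' 1 2 | sed "R $tmp")
rm -f "$tmp"

printf '%s\n' "$with_q" "$with_Q" "$status" "$marked" \
    "$zapped" "$versioned" "$interleaved"
```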
1505
1506@node Escapes
1507@section @acronym{GNU} Extensions for Escapes in Regular Expressions
1508
1509@cindex @acronym{GNU} extensions, special escapes
1510Until this chapter, we have only encountered escapes of the form
1511@samp{\^}, which tell @command{sed} not to interpret the circumflex
1512as a special character, but rather to take it literally.  For
1513example, @samp{\*} matches a single asterisk rather than zero
1514or more backslashes.
1515
1516@cindex @code{POSIXLY_CORRECT} behavior, escapes
1517This chapter introduces another kind of escape@footnote{All
1518the escapes introduced here are @acronym{GNU}
1519extensions, with the exception of @code{\n}.  In basic regular
1520expression mode, setting @code{POSIXLY_CORRECT} disables them inside
1521bracket expressions.}---that
1522is, escapes that are applied to a character or sequence of characters
1523that ordinarily are taken literally, and that @command{sed} replaces
1524with a special character.  This provides a way
1525of encoding non-printable characters in patterns in a visible manner.
There is no restriction on the appearance of non-printing characters
in a @command{sed} script, but when a script is being prepared in the
shell or by text editing, it is usually easier to use one of
the following escape sequences than the binary character it
represents.
1531
1532The list of these escapes is:
1533
1534@table @code
1535@item \a
Produces or matches a @sc{bel} character, that is, an ``alert'' (@sc{ascii} 7).
1537
1538@item \f
1539Produces or matches a form feed (@sc{ascii} 12).
1540
1541@item \n
1542Produces or matches a newline (@sc{ascii} 10).
1543
1544@item \r
1545Produces or matches a carriage return (@sc{ascii} 13).
1546
1547@item \t
1548Produces or matches a horizontal tab (@sc{ascii} 9).
1549
1550@item \v
Produces or matches a so-called ``vertical tab'' (@sc{ascii} 11).
1552
1553@item \c@var{x}
1554Produces or matches @kbd{@sc{Control}-@var{x}}, where @var{x} is
1555any character.  The precise effect of @samp{\c@var{x}} is as follows:
1556if @var{x} is a lower case letter, it is converted to upper case.
1557Then bit 6 of the character (hex 40) is inverted.  Thus @samp{\cz} becomes
1558hex 1A, but @samp{\c@{} becomes hex 3B, while @samp{\c;} becomes hex 7B.
1559
1560@item \d@var{xxx}
1561Produces or matches a character whose decimal @sc{ascii} value is @var{xxx}.
1562
1563@item \o@var{xxx}
1564@ifset PERL
1565@item \@var{xxx}
1566@end ifset
1567Produces or matches a character whose octal @sc{ascii} value is @var{xxx}.
1568@ifset PERL
1569The syntax without the @code{o} is active in Perl mode, while the one
1570with the @code{o} is active in the normal or extended @sc{posix} regular
1571expression modes.
1572@end ifset
1573
1574@item \x@var{xx}
1575Produces or matches a character whose hexadecimal @sc{ascii} value is @var{xx}.
1576@end table
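
For example, the sketch below uses some of these escapes both to match
and to produce characters.  It assumes @acronym{GNU} @command{sed},
and the inputs and variable names are made up; the letter @samp{A} is
@sc{ascii} 65, which is hexadecimal 41 and octal 101.

```shell
# \t in the regexp matches a tab.
tab_matched=$(printf 'a\tb\n' | sed 's/\t/ /')

# \t in the replacement produces a tab.
tab_made=$(printf '%s\n' 'a b' | sed 's/ /\t/')

# The same character written in hexadecimal, octal, and decimal.
by_hex=$(printf '%s\n' 'A' | sed 's/\x41/B/')
by_octal=$(printf '%s\n' 'A' | sed 's/\o101/B/')
by_decimal=$(printf '%s\n' 'A' | sed 's/\d065/B/')

printf '%s\n' "$tab_matched" "$tab_made" "$by_hex" "$by_octal" "$by_decimal"
```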
1577
1578@samp{\b} (backspace) was omitted because of the conflict with
1579the existing ``word boundary'' meaning.
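
As a brief illustration, the following commands exercise a few of the
escapes listed above.  This is only a sketch: it assumes
@acronym{GNU} @command{sed} and a shell whose @command{printf}
expands @samp{\t}, and the sample input strings are made up for
the example.

@example
# \t matches the tab between the two fields
printf 'a\tb\n' | sed 's/\t/,/'

# \x41, \o101 and \d065 all denote the letter A (ASCII 65)
echo 'AAA' | sed 's/\x41/1/; s/\o101/2/; s/\d065/3/'
@end example

The first command should print @samp{a,b} and the second @samp{123},
since each of the three numeric escapes matches one @samp{A}.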
1580
1581Other escapes match a particular character class and are valid only in
1582regular expressions:
1583
1584@table @code
1585@item \w
1586Matches any ``word'' character.  A ``word'' character is any
1587letter or digit or the underscore character.
1588
1589@item \W
1590Matches any ``non-word'' character.
1591
1592@item \b
Matches a word boundary; that is, it matches if the character
to the left is a ``word'' character and the character to the
right is a ``non-word'' character, or vice versa.

@item \B
Matches everywhere but on a word boundary; that is, it matches
if the character to the left and the character to the right
are either both ``word'' characters or both ``non-word''
characters.
1602
1603@item \`
1604Matches only at the start of pattern space.  This is different
1605from @code{^} in multi-line mode.
1606
1607@item \'
1608Matches only at the end of pattern space.  This is different
1609from @code{$} in multi-line mode.
1610
1611@ifset PERL
1612@item \G
Matches only at the start of pattern space or, when doing a global
substitution using the @code{s///g} command and option, at
the end-of-match position of the prior match.  For example,
@samp{s/\Ga/Z/g} will change an initial run of @code{a}s to
a run of @code{Z}s.
1618@end ifset
1619@end table
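
The class and boundary escapes can be tried out from the shell.
The commands below are a sketch that assumes @acronym{GNU}
@command{sed}; the input strings are arbitrary.

@example
# \b keeps the match from firing inside "concat"
echo 'cat concat' | sed 's/\bcat\b/dog/g'

# \w matches letters, digits and the underscore; \W everything else
echo 'foo_1 bar-2' | sed 's/\w/x/g'
echo 'foo_1 bar-2' | sed 's/\W/_/g'
@end example

These should print @samp{dog concat}, @samp{xxxxx xxx-x} and
@samp{foo_1_bar_2} respectively.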
1620
1621@node Examples
1622@chapter Some Sample Scripts
1623
1624Here are some @command{sed} scripts to guide you in the art of mastering
1625@command{sed}.
1626
1627@menu
1628Some exotic examples:
1629* Centering lines::
1630* Increment a number::
1631* Rename files to lower case::
1632* Print bash environment::
1633* Reverse chars of lines::
1634
1635Emulating standard utilities:
1636* tac::                             Reverse lines of files
1637* cat -n::                          Numbering lines
1638* cat -b::                          Numbering non-blank lines
1639* wc -c::                           Counting chars
1640* wc -w::                           Counting words
1641* wc -l::                           Counting lines
1642* head::                            Printing the first lines
1643* tail::                            Printing the last lines
1644* uniq::                            Make duplicate lines unique
1645* uniq -d::                         Print duplicated lines of input
1646* uniq -u::                         Remove all duplicated lines
1647* cat -s::                          Squeezing blank lines
1648@end menu
1649
1650@node Centering lines
1651@section Centering Lines
1652
This script centers all lines of a file within an 80-column width.
To change that width, the number in @code{\@{@dots{}\@}} must be
replaced, and the number of added spaces must also be changed.
1656
1657Note how the buffer commands are used to separate parts in
1658the regular expressions to be matched---this is a common
1659technique.
1660
1661@c start-------------------------------------------
@example
#!/usr/bin/sed -f

# Put 80 spaces in the buffer
1 @{
  x
  s/^$/          /
  s/^.*$/&&&&&&&&/
  x
@}

# delete leading and trailing spaces
y/@kbd{tab}/ /
s/^ *//
s/ *$//

# add a newline and 80 spaces to end of line
G

# keep first 81 chars (80 + a newline)
s/^\(.\@{81\@}\).*$/\1/

# \2 matches half of the spaces, which are moved to the beginning
s/^\(.*\)\n\(.*\)\2/\2\1/
@end example
1687@c end---------------------------------------------
1688
1689@node Increment a number
1690@section Increment a Number
1691
1692This script is one of a few that demonstrate how to do arithmetic
1693in @command{sed}.  This is indeed possible,@footnote{@command{sed} guru Greg
1694Ubben wrote an implementation of the @command{dc} @sc{rpn} calculator!
1695It is distributed together with sed.} but must be done manually.
1696
To increment a number you add 1 to the last digit, replacing
it with the following digit.  There is one exception: when the digit
is a nine, the preceding digits must also be incremented until you
reach one that is not a nine.

This solution by Bruno Haible is very clever because
it uses a single buffer; if you don't have this limitation, the
algorithm used in @ref{cat -n, Numbering lines}, is faster.
It works by replacing trailing nines with underscores, then
using multiple @code{s} commands to increment the last digit,
and finally substituting zeros for the underscores.
1708
1709@c start-------------------------------------------
@example
#!/usr/bin/sed -f

/[^0-9]/ d

# replace all trailing 9s by _ (any other character except digits
# could be used)
:d
s/9\(_*\)$/_\1/
td

# incr last digit only.  The first line adds a most-significant
# digit of 1 if we have to add a digit.
#
# Only the first @code{tn} command is required (it stops the new
# leading 1 from being incremented again); the others just make
# the thing faster

s/^\(_*\)$/1\1/; tn
s/8\(_*\)$/9\1/; tn
s/7\(_*\)$/8\1/; tn
s/6\(_*\)$/7\1/; tn
s/5\(_*\)$/6\1/; tn
s/4\(_*\)$/5\1/; tn
s/3\(_*\)$/4\1/; tn
s/2\(_*\)$/3\1/; tn
s/1\(_*\)$/2\1/; tn
s/0\(_*\)$/1\1/; tn

:n
y/_/0/
@end example
1741@c end---------------------------------------------
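
To see the first phase in isolation, the loop can be run on its own
from the shell.  The command below is a sketch assuming
@acronym{GNU} @command{sed}; the input number is arbitrary, and only
the nine-replacing loop of the script is included.

@example
# stop after the trailing nines have been turned into underscores
echo 1999 | sed -e ':d' -e 's/9\(_*\)$/_\1/' -e 'td'
@end example

This should print @samp{1___}.  In the full script the remaining
@code{s} commands then turn the @samp{1} into a @samp{2}, and the final
@code{y} command produces @samp{2000}.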
1742
1743@node Rename files to lower case
1744@section Rename Files to Lower Case
1745
This is a pretty strange use of @command{sed}.  We transform text
into shell commands, then just feed them to the shell.
Don't worry, even worse hacks are done when using @command{sed}; I have
seen a script converting the output of @command{date} into a @command{bc}
program!

The main body of this is the @command{sed} script, which remaps the name
from lower to upper case (or vice versa) and even checks
whether the remapped name is the same as the original name.
Note how the script is parameterized using shell
variables and proper quoting.
1757
1758@c start-------------------------------------------
@example
#! /bin/sh
# rename files to lower/upper case...
#
# usage:
#    move-to-lower *
#    move-to-upper *
# or
#    move-to-lower -R .
#    move-to-upper -R .
#

help()
@{
        cat << eof
Usage: $0 [-n] [-R] [-h] files...

-n      do nothing, only see what would be done
-R      recursive (use find)
-h      this message
files   files to remap to lower case

Examples:
       $0 -n *        (see if everything is ok, then...)
       $0 *

       $0 -R .

eof
@}

apply_cmd='sh'
finder='echo "$@@" | tr " " "\n"'
files_only=

while :
do
    case "$1" in
        -n) apply_cmd='cat' ;;
        -R) finder='find "$@@" -type f';;
        -h) help ; exit 1 ;;
        *) break ;;
    esac
    shift
done

if [ -z "$1" ]; then
        echo "Usage: $0 [-h] [-n] [-R] files..."
        exit 1
fi

LOWER='abcdefghijklmnopqrstuvwxyz'
UPPER='ABCDEFGHIJKLMNOPQRSTUVWXYZ'

case `basename $0` in
        *upper*) TO=$UPPER; FROM=$LOWER ;;
        *)       FROM=$UPPER; TO=$LOWER ;;
esac

eval $finder | sed -n '

# remove all trailing slashes
s/\/*$//

# add ./ if there is no path, only a filename
/\//! s/^/.\//

# save path+filename
h

# remove path
s/.*\///

# do conversion only on filename
y/'$FROM'/'$TO'/

# now line contains original path+file, while
# hold space contains the new filename
x

# add converted file name to line, which now contains
# path/file-name\nconverted-file-name
G

# check if converted file name is equal to original file name,
# if it is, do not print anything
/^.*\/\(.*\)\n\1/b

# now, transform path/fromfile\n, into
# mv path/fromfile path/tofile and print it
s/^\(.*\/\)\(.*\)\n\(.*\)$/mv "\1\2" "\1\3"/p

' | $apply_cmd
@end example
1853@c end---------------------------------------------
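
The heart of the script can be tried on a single name without touching
any files.  The following is a sketch, assuming @acronym{GNU}
@command{sed}; the path @file{./Docs/README.TXT} is a made-up example,
and the commands are the core steps of the script above, passed one
per @option{-e} option.

@example
UPPER='ABCDEFGHIJKLMNOPQRSTUVWXYZ'
LOWER='abcdefghijklmnopqrstuvwxyz'

# only prints the mv command that would be run; nothing is renamed
echo './Docs/README.TXT' | sed -n \
  -e 'h' \
  -e 's/.*\///' \
  -e "y/$UPPER/$LOWER/" \
  -e 'x' \
  -e 'G' \
  -e '/^.*\/\(.*\)\n\1/b' \
  -e 's/^\(.*\/\)\(.*\)\n\(.*\)$/mv "\1\2" "\1\3"/p'
@end example

This should print @samp{mv "./Docs/README.TXT" "./Docs/readme.txt"}.
A name that is already in lower case produces no output, because the
@code{b} command skips the final substitution.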
1854
1855@node Print bash environment
1856@section Print @command{bash} Environment
1857
1858This script strips the definition of the shell functions
1859from the output of the @command{set} Bourne-shell command.
1860
1861@c start-------------------------------------------
@example
#!/bin/sh

set | sed -n '
:x

@ifinfo
# if no occurrence of "=()" print and load next line
@end ifinfo
@ifnotinfo
# if no occurrence of @samp{=()} print and load next line
@end ifnotinfo
/=()/! @{ p; b; @}
/ () $/! @{ p; b; @}

# possible start of functions section
# save the line in case this is a var like FOO="() "
h

# if the next line has a brace, we quit because
# nothing comes after functions
n
/^@{/ q

# print the old line
x; p

# work on the new line now
x; bx
'
@end example
1893@c end---------------------------------------------
1894
1895@node Reverse chars of lines
1896@section Reverse Characters of Lines
1897
1898This script can be used to reverse the position of characters
1899in lines.  The technique moves two characters at a time, hence
1900it is faster than more intuitive implementations.
1901
1902Note the @code{tx} command before the definition of the label.
1903This is often needed to reset the flag that is tested by
1904the @code{t} command.
1905
Imaginative readers will find uses for this script.  An example
is reversing the output of @command{banner}.@footnote{This requires
another script to pad the output of @command{banner}; for example

@example
#! /bin/sh

banner -w $1 $2 $3 $4 |
  sed -e :a -e '/^.\@{0,'$1'\@}$/ @{ s/$/ /; ba; @}' |
  ~/sedscripts/reverseline.sed
@end example
}
1918
1919@c start-------------------------------------------
@example
#!/usr/bin/sed -f

/../! b

# Reverse a line.  Begin by embedding the line between two newlines
s/^.*$/\
&\
/

# Move the first character to the end.  The regexp matches until
# there are zero or one characters between the markers
tx
:x
s/\(\n.\)\(.*\)\(.\n\)/\3\2\1/
tx

# Remove the newline markers
s/\n//g
@end example
1940@c end---------------------------------------------
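
The script can also be exercised directly from the shell.  The command
below is a sketch assuming @acronym{GNU} @command{sed}, which accepts
@samp{\n} in the replacement text; it passes the same commands as the
script above, one per @option{-e} option, and the input word is
arbitrary.

@example
echo hello | sed -e '/../!b' -e 's/^.*$/\n&\n/' \
  -e 'tx' -e ':x' -e 's/\(\n.\)\(.*\)\(.\n\)/\3\2\1/' -e 'tx' \
  -e 's/\n//g'
@end example

This should print @samp{olleh}.  After the first pass through the loop
the pattern space holds @samp{o}, a newline, @samp{ell}, a newline and
@samp{h}: the two outer characters have been swapped and the markers
have moved one step inwards.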
1941
1942@node tac
1943@section Reverse Lines of Files
1944
1945This one begins a series of totally useless (yet interesting)
1946scripts emulating various Unix commands.  This, in particular,
1947is a @command{tac} workalike.
1948
1949Note that on implementations other than @acronym{GNU} @command{sed}
1950@ifset PERL
1951and @value{SSED}
1952@end ifset
1953this script might easily overflow internal buffers.
1954
1955@c start-------------------------------------------
@example
#!/usr/bin/sed -nf

# reverse all lines of input, i.e. the first line becomes the last, ...

# from the second line on, the buffer (which contains all previous
# lines) is *appended* to the current line, so the order is reversed
1! G

# on the last line we're done -- print everything
$ p

# store everything on the buffer again
h
@end example
1971@c end---------------------------------------------
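
A quick way to try the script is to give the same three commands on
the command line.  This is a sketch with made-up input:

@example
printf 'one\ntwo\nthree\n' | sed -n '1!G; $p; h'
@end example

The output should be @samp{three}, @samp{two} and @samp{one}, each on
its own line.  By the time the last line is read, the hold space
already contains @samp{two} followed by @samp{one}, so appending it to
@samp{three} yields the reversed file.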
1972
1973@node cat -n
1974@section Numbering Lines
1975
1976This script replaces @samp{cat -n}; in fact it formats its output
1977exactly like @acronym{GNU} @command{cat} does.
1978
Of course this is completely useless, for two reasons: first,
because somebody else did it in C; second, because the following
Bourne-shell script could be used for the same purpose and would
be much faster:
1983
1984@c start-------------------------------------------
@example
#! /bin/sh
sed -e "=" "$@@" | sed -e '
  s/^/      /
  N
  s/^ *\(......\)\n/\1  /
'
@end example
1993@c end---------------------------------------------
1994
1995It uses @command{sed} to print the line number, then groups lines two
1996by two using @code{N}.  Of course, this script does not teach as much as
1997the one presented below.
1998
1999The algorithm used for incrementing uses both buffers, so the line
2000is printed as soon as possible and then discarded.  The number
2001is split so that changing digits go in a buffer and unchanged ones go
2002in the other; the changed digits are modified in a single step
2003(using a @code{y} command).  The line number for the next line
2004is then composed and stored in the hold space, to be used in the
2005next iteration.
2006
2007@c start-------------------------------------------
@example
#!/usr/bin/sed -nf

# Prime the pump on the first line
x
/^$/ s/^.*$/1/

# Add the correct line number before the pattern
G
h

# Format it and print it
s/^/      /
s/^ *\(......\)\n/\1  /p

# Get the line number from hold space; add a zero
# if we're going to add a digit on the next line
g
s/\n.*$//
/^9*$/ s/^/0/

# separate changing/unchanged digits with an x
s/.9*$/x&/

# keep changing digits in hold space
h
s/^.*x//
y/0123456789/1234567890/
x

# keep unchanged digits in pattern space
s/x.*$//

# compose the new number, remove the newline implicitly added by G
G
s/\n//
h
@end example
2046@c end---------------------------------------------
2047
2048@node cat -b
2049@section Numbering Non-blank Lines
2050
2051Emulating @samp{cat -b} is almost the same as @samp{cat -n}---we only
2052have to select which lines are to be numbered and which are not.
2053
2054The part that is common to this script and the previous one is
2055not commented to show how important it is to comment @command{sed}
2056scripts properly...
2057
2058@c start-------------------------------------------
@example
#!/usr/bin/sed -nf

/^$/ @{
  p
  b
@}

# Same as cat -n from now
x
/^$/ s/^.*$/1/
G
h
s/^/      /
s/^ *\(......\)\n/\1  /p
x
s/\n.*$//
/^9*$/ s/^/0/
s/.9*$/x&/
h
s/^.*x//
y/0123456789/1234567890/
x
s/x.*$//
G
s/\n//
h
@end example
2087@c end---------------------------------------------
2088
2089@node wc -c
2090@section Counting Characters
2091
2092This script shows another way to do arithmetic with @command{sed}.
2093In this case we have to add possibly large numbers, so implementing
2094this by successive increments would not be feasible (and possibly
2095even more complicated to contrive than this script).
2096
2097The approach is to map numbers to letters, kind of an abacus
2098implemented with @command{sed}.  @samp{a}s are units, @samp{b}s are
2099tens and so on: we simply add the number of characters
2100on the current line as units, and then propagate the carry
2101to tens, hundreds, and so on.
2102
2103As usual, running totals are kept in hold space.
2104
2105On the last line, we convert the abacus form back to decimal.
2106For the sake of variety, this is done with a loop rather than
2107with some 80 @code{s} commands@footnote{Some implementations
2108have a limit of 199 commands per script}: first we
2109convert units, removing @samp{a}s from the number; then we
2110rotate letters so that tens become @samp{a}s, and so on
2111until no more letters remain.
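
The two halves of the technique can be tried separately on a small
number.  The commands below are a sketch with made-up input: twelve
units for the carry step, and the resulting abacus string for the
conversion back to decimal.

@example
# carry: every ten a's (units) become one b (a ten)
echo aaaaaaaaaaaa | sed 's/aaaaaaaaaa/b/g'

# convert back: units become a digit, then tens are rotated to units
echo baa | sed 's/aa/2/; y/b/a/; s/a/1/'
@end example

The first command should print @samp{baa} (one ten and two units) and
the second @samp{12}.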
2112
2113@c start-------------------------------------------
@example
#!/usr/bin/sed -nf

# Add n+1 a's to hold space (+1 is for the newline)
s/./a/g
H
x
s/\n/a/

# Do the carry.  The t's and b's are not necessary,
# but they do speed up the thing
t a
: a;  s/aaaaaaaaaa/b/g; t b; b done
: b;  s/bbbbbbbbbb/c/g; t c; b done
: c;  s/cccccccccc/d/g; t d; b done
: d;  s/dddddddddd/e/g; t e; b done
: e;  s/eeeeeeeeee/f/g; t f; b done
: f;  s/ffffffffff/g/g; t g; b done
: g;  s/gggggggggg/h/g; t h; b done
: h;  s/hhhhhhhhhh//g

: done
$! @{
  h
  b
@}

# On the last line, convert back to decimal

: loop
/a/! s/[b-h]*/&0/
s/aaaaaaaaa/9/
s/aaaaaaaa/8/
s/aaaaaaa/7/
s/aaaaaa/6/
s/aaaaa/5/
s/aaaa/4/
s/aaa/3/
s/aa/2/
s/a/1/

: next
y/bcdefgh/abcdefg/
/[a-h]/ b loop
p
@end example
2160@c end---------------------------------------------
2161
2162@node wc -w
2163@section Counting Words
2164
2165This script is almost the same as the previous one, once each
2166of the words on the line is converted to a single @samp{a}
2167(in the previous script each letter was changed to an @samp{a}).
2168
It is interesting that real @command{wc} programs have optimized
loops for @samp{wc -c}, so they are much slower at counting
words than at counting characters.  This script's bottleneck,
instead, is arithmetic, and hence the word-counting one
is faster (it has to manage smaller numbers).
2174
2175Again, the common parts are not commented to show the importance
2176of commenting @command{sed} scripts.
2177
2178@c start-------------------------------------------
@example
#!/usr/bin/sed -nf

# Convert words to a's
s/[ @kbd{tab}][ @kbd{tab}]*/ /g
s/^/ /
s/ [^ ][^ ]*/a /g
s/ //g

# Append them to hold space
H
x
s/\n//

# From here on it is the same as in wc -c.
/aaaaaaaaaa/! bx;   s/aaaaaaaaaa/b/g
/bbbbbbbbbb/! bx;   s/bbbbbbbbbb/c/g
/cccccccccc/! bx;   s/cccccccccc/d/g
/dddddddddd/! bx;   s/dddddddddd/e/g
/eeeeeeeeee/! bx;   s/eeeeeeeeee/f/g
/ffffffffff/! bx;   s/ffffffffff/g/g
/gggggggggg/! bx;   s/gggggggggg/h/g
s/hhhhhhhhhh//g
:x
$! @{ h; b; @}
:y
/a/! s/[b-h]*/&0/
s/aaaaaaaaa/9/
s/aaaaaaaa/8/
s/aaaaaaa/7/
s/aaaaaa/6/
s/aaaaa/5/
s/aaaa/4/
s/aaa/3/
s/aa/2/
s/a/1/
y/bcdefgh/abcdefg/
/[a-h]/ by
p
@end example
2219@c end---------------------------------------------
2220
2221@node wc -l
2222@section Counting Lines
2223
2224No strange things are done now, because @command{sed} gives us
2225@samp{wc -l} functionality for free!!! Look:
2226
2227@c start-------------------------------------------
@example
#!/usr/bin/sed -nf
$=
@end example
2232@c end---------------------------------------------
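
The same command can be given on the command line; the input here is
a made-up three-line sample:

@example
printf 'a\nb\nc\n' | sed -n '$='
@end example

This should print @samp{3}.  One difference from @samp{wc -l} is worth
keeping in mind: on completely empty input @command{sed} never starts
a cycle, so nothing at all is printed, whereas @command{wc} reports
zero.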
2233
2234@node head
2235@section Printing the First Lines
2236
2237This script is probably the simplest useful @command{sed} script.
2238It displays the first 10 lines of input; the number of displayed
2239lines is right before the @code{q} command.
2240
2241@c start-------------------------------------------
@example
#!/usr/bin/sed -f
10q
@end example
2246@c end---------------------------------------------
2247
2248@node tail
2249@section Printing the Last Lines
2250
2251Printing the last @var{n} lines rather than the first is more complex
2252but indeed possible.  @var{n} is encoded in the second line, before
2253the bang character.
2254
2255This script is similar to the @command{tac} script in that it keeps the
2256final output in the hold space and prints it at the end:
2257
2258@c start-------------------------------------------
@example
#!/usr/bin/sed -nf

1! @{; H; g; @}
1,10 !s/[^\n]*\n//
$p
h
@end example
2267@c end---------------------------------------------
2268
Mainly, the script keeps a window of 10 lines and slides it
by adding a line and deleting the oldest (the substitution command
on the second line works like a @code{D} command but does not
restart the loop).
2273
2274The ``sliding window'' technique is a very powerful way to write
2275efficient and complex @command{sed} scripts, because commands like
2276@code{P} would require a lot of work if implemented manually.
2277
2278To introduce the technique, which is fully demonstrated in the
2279rest of this chapter and is based on the @code{N}, @code{P}
2280and @code{D} commands, here is an implementation of @command{tail}
2281using a simple ``sliding window.''
2282
This looks complicated, but in fact it works the same way as
the last script: after we have kicked in the appropriate number
of lines, however, we stop using the hold space to keep inter-line
state, and instead use @code{N} and @code{D} to slide pattern
space by one line:
2288
2289@c start-------------------------------------------
@example
#!/usr/bin/sed -f

1h
2,10 @{; H; g; @}
$q
1,9d
N
D
@end example
2300@c end---------------------------------------------
2301
Note how the first, second and fourth lines are inactive after
the first ten lines of input.  After that, all the script does
is exit on the last line of input, append the next input
line to pattern space, and remove the first line.
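
The sliding-window script can be tried from the shell as follows.
This is a sketch that assumes the @command{seq} utility is available;
it also spells the braced group on the second line as two separate
range commands, which behaves the same way but keeps the one-liner
free of braces.

@example
seq 1 25 | sed -e '1h' -e '2,10H' -e '2,10g' -e '$q' \
  -e '1,9d' -e 'N' -e 'D'
@end example

The output should be the numbers 16 through 25, one per line.  From
line 10 onwards each cycle appends one line with @code{N} and drops
the oldest one with @code{D}, so pattern space always holds the last
ten lines read when @code{$q} finally prints it.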
2306
2307@node uniq
2308@section Make Duplicate Lines Unique
2309
2310This is an example of the art of using the @code{N}, @code{P}
2311and @code{D} commands, probably the most difficult to master.
2312
2313@c start-------------------------------------------
@example
#!/usr/bin/sed -f
h

:b
# On the last line, print and exit
$b
N
/^\(.*\)\n\1$/ @{
    # The two lines are identical.  Undo the effect of
    # the N command.
    g
    bb
@}

# If the @code{N} command had added the last line, print and exit
$b

# The lines are different; print the first and go
# back working on the second.
P
D
@end example
2337@c end---------------------------------------------
2338
As you can see, we maintain a 2-line window using @code{P} and @code{D}.
2340This technique is often used in advanced @command{sed} scripts.
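
The same two-line window can be written very compactly.  The one-liner
below is a widely circulated variant rather than the script above, and
the input is a made-up sample; it is shown only as a sketch of how
@code{N}, @code{P} and @code{D} cooperate.

@example
# join the next line, print the first line unless both are equal,
# then drop the first line and start over with the second
printf 'a\na\nb\nc\nc\n' | sed '$!N; /^\(.*\)\n\1$/!P; D'
@end example

The output should be @samp{a}, @samp{b} and @samp{c}, each on its own
line.  Because @code{D} restarts the cycle without reading new input
while pattern space still contains text, the second line of each pair
becomes the first line of the next comparison.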
2341
2342@node uniq -d
2343@section Print Duplicated Lines of Input
2344
2345This script prints only duplicated lines, like @samp{uniq -d}.
2346
2347@c start-------------------------------------------
@example
#!/usr/bin/sed -nf

$b
N
/^\(.*\)\n\1$/ @{
    # Print the first of the duplicated lines
    s/.*\n//
    p

    # Loop until we get a different line
    :b
    $b
    N
    /^\(.*\)\n\1$/ @{
        s/.*\n//
        bb
    @}
@}

# The last line cannot be followed by duplicates
$b

# Found a different one.  Leave it alone in the pattern space
# and go back to the top, hunting its duplicates
D
@end example
2375@c end---------------------------------------------
2376
2377@node uniq -u
2378@section Remove All Duplicated Lines
2379
2380This script prints only unique lines, like @samp{uniq -u}.
2381
2382@c start-------------------------------------------
@example
#!/usr/bin/sed -f

# Search for a duplicate line --- until then, print what you find.
$b
N
/^\(.*\)\n\1$/ ! @{
    P
    D
@}

:c
# Got two equal lines in pattern space.  At the
# end of the file we simply exit
$d

# Else, we keep reading lines with @code{N} until we
# find a different one
s/.*\n//
N
/^\(.*\)\n\1$/ @{
    bc
@}

# Remove the last instance of the duplicate line
# and go back to the top
D
@end example
2411@c end---------------------------------------------
2412
2413@node cat -s
2414@section Squeezing Blank Lines
2415
2416As a final example, here are three scripts, of increasing complexity
2417and speed, that implement the same function as @samp{cat -s}, that is
2418squeezing blank lines.
2419
2420The first leaves a blank line at the beginning and end if there are
2421some already.
2422
2423@c start-------------------------------------------
@example
#!/usr/bin/sed -f

# on empty lines, join with next
# Note there is a star in the regexp
:x
/^\n*$/ @{
N
bx
@}

# now, squeeze all '\n', this can be also done by:
# s/^\(\n\)*/\1/
s/\n*/\
/
@end example
2440@c end---------------------------------------------
2441
This one is a bit more complex and removes all empty lines
at the beginning.  It does leave a single blank line at the end
if one was there.
2445
2446@c start-------------------------------------------
@example
#!/usr/bin/sed -f

# delete all leading empty lines
1,/^./@{
/./!d
@}

# on an empty line we remove it and all the following
# empty lines, but one
:x
/./!@{
N
s/^\n$//
tx
@}
@end example
2464@c end---------------------------------------------
2465
This removes leading and trailing blank lines.  It is also the
fastest.  Note that loops are completely done with @code{n} and
@code{b}, without relying on @command{sed} to restart
the script automatically at the end of a line.
2470
2471@c start-------------------------------------------
@example
#!/usr/bin/sed -nf

# delete all (leading) blanks
/./!d

# get here: so there is a non empty
:x
# print it
p
# get next
n
# got chars? print it again, etc...
/./bx

# no, don't have chars: got an empty line
:z
# get next, if last line we finish here so no trailing
# empty lines are written
n
# also empty? then ignore it, and get next... this will
# remove ALL empty lines
/./!bz

# all empty lines were deleted/ignored, but we have a non empty.  As
# what we want to do is to squeeze, insert a blank line artificially
i\

bx
@end example
2502@c end---------------------------------------------
2503
2504@node Limitations
2505@chapter @value{SSED}'s Limitations and Non-limitations
2506
2507@cindex @acronym{GNU} extensions, unlimited line length
2508@cindex Portability, line length limitations
2509For those who want to write portable @command{sed} scripts,
2510be aware that some implementations have been known to
2511limit line lengths (for the pattern and hold spaces)
2512to be no more than 4000 bytes.
2513The @sc{posix} standard specifies that conforming @command{sed}
2514implementations shall support at least 8192 byte line lengths.
2515@value{SSED} has no built-in limit on line length;
2516as long as it can @code{malloc()} more (virtual) memory,
2517you can feed or construct lines as long as you like.
2518
2519However, recursion is used to handle subpatterns and indefinite
2520repetition.  This means that the available stack space may limit
2521the size of the buffer that can be processed by certain patterns.
2522
2523@ifset PERL
2524There are some size limitations in the regular expression
2525matcher but it is hoped that they will never in practice
2526be relevant.  The maximum length of a compiled pattern
2527is 65539 (sic) bytes.  All values in repeating quantifiers
2528must be less than 65536.  The maximum nesting depth of
2529all parenthesized subpatterns, including capturing and
2530non-capturing subpatterns@footnote{The
2531distinction is meaningful when referring to Perl-style
2532regular expressions.}, assertions, and other types of
2533subpattern, is 200.
2534
2535Also, @value{SSED} recognizes the @sc{posix} syntax
2536@code{[.@var{ch}.]} and @code{[=@var{ch}=]}
2537where @var{ch} is a ``collating element'', but these
2538are not supported, and an error is given if they are
2539encountered.
2540
2541Here are a few distinctions between the real Perl-style
2542regular expressions and those that @option{-R} recognizes.
2543
2544@enumerate
2545@item
Lookahead assertions do not allow repeat quantifiers after them.
Perl permits them, but they do not mean what you
might think.  For example, @samp{(?!a)@{3@}} does not assert that the
next three characters are not @samp{a}.  It just asserts three times that the
next character is not @samp{a} --- a waste of time and nothing else.

@item
Capturing subpatterns that occur inside negative lookahead
assertions are counted, but their entries are counted
as empty in the second half of an @code{s} command.
2556Perl sets its numerical variables from any such patterns
2557that are matched before the assertion fails to match
2558something (thereby succeeding), but only if the negative
2559lookahead assertion contains just one branch.
2560
2561@item
2562The following Perl escape sequences are not supported:
2563@samp{\l}, @samp{\u}, @samp{\L}, @samp{\U}, @samp{\E},
2564@samp{\Q}. In fact these are implemented by Perl's general
2565string-handling and are not part of its pattern matching engine.
2566
2567@item
2568The Perl @samp{\G} assertion is not supported as it is not
2569relevant to single pattern matches.
2570
2571@item
2572Fairly obviously, @value{SSED} does not support the @samp{(?@{code@})}
2573and @samp{(?p@{code@})} constructions. However, there is some experimental
2574support for recursive patterns using the non-Perl item @samp{(?R)}.
2575
2576@item
2577There are at the time of writing some oddities in Perl
25785.005_02 concerned with the settings of captured strings
2579when part of a pattern is repeated. For example, matching
2580@samp{aba} against the pattern @samp{/^(a(b)?)+$/} sets
2581@samp{$2}@footnote{@samp{$2} would be @samp{\2} in @value{SSED}.}
2582to the value @samp{b}, but matching @samp{aabbaa}
2583against @samp{/^(aa(bb)?)+$/} leaves @samp{$2}
2584unset.  However, if the pattern is changed to
2585@samp{/^(aa(b(b))?)+$/} then @samp{$2} (and @samp{$3}) are set.
2586In Perl 5.004 @samp{$2} is set in both cases, and that is also
2587true of @value{SSED}.
2588
2589@item
2590Another as yet unresolved discrepancy is that in Perl
25915.005_02 the pattern @samp{/^(a)?(?(1)a|b)+$/} matches
2592the string @samp{a}, whereas in @value{SSED} it does not.
2593However, in both Perl and @value{SSED} @samp{/^(a)?a/} matched
2594against @samp{a} leaves $1 unset.
2595@end enumerate
2596@end ifset
2597
2598@node Other Resources
2599@chapter Other Resources for Learning About @command{sed}
2600
2601@cindex Additional reading about @command{sed}
2602In addition to several books that have been written about @command{sed}
2603(either specifically or as chapters in books which discuss
2604shell programming), one can find out more about @command{sed}
2605(including suggestions of a few books) from the FAQ
2606for the @code{sed-users} mailing list, available from:
2607@display
2608@uref{http://sed.sourceforge.net/sedfaq.html}
2609@end display
2610
2611Also of interest are
2612@uref{http://www.student.northpark.edu/pemente/sed/index.htm}
2613and @uref{http://sed.sf.net/grabbag},
2614which include @command{sed} tutorials and other @command{sed}-related goodies.
2615
The @code{sed-users} mailing list itself is maintained by Sven Guckes.
2617To subscribe, visit @uref{http://groups.yahoo.com} and search
2618for the @code{sed-users} mailing list.
2619
2620@node Reporting Bugs
2621@chapter Reporting Bugs
2622
2623@cindex Bugs, reporting
2624Email bug reports to @email{bonzini@@gnu.org}.
2625Be sure to include the word ``sed'' somewhere in the @code{Subject:} field.
2626Also, please include the output of @samp{sed --version} in the body
2627of your report if at all possible.
2628
2629Please do not send a bug report like this:
2630
@example
@i{@i{@r{while building frobme-1.3.4}}}
$ configure
@error{} sed: file sedscr line 1: Unknown option to 's'
@end example
2636
2637If @value{SSED} doesn't configure your favorite package, take a
2638few extra minutes to identify the specific problem and make a stand-alone
2639test case.  Unlike other programs such as C compilers, making such test
2640cases for @command{sed} is quite simple.
2641
2642A stand-alone test case includes all the data necessary to perform the
2643test, and the specific invocation of @command{sed} that causes the problem.
2644The smaller a stand-alone test case is, the better.  A test case should
2645not involve something as far removed from @command{sed} as ``try to configure
2646frobme-1.3.4''.  Yes, that is in principle enough information to look
2647for the bug, but that is not a very practical prospect.
2648
2649Here are a few commonly reported bugs that are not bugs.
2650
2651@table @asis
2652@item @code{N} command on the last line
2653@cindex Portability, @code{N} command on the last line
2654@cindex Non-bugs, @code{N} command on the last line
2655
2656Most versions of @command{sed} exit without printing anything when
2657the @command{N} command is issued on the last line of a file.
@value{SSED} prints pattern space before exiting, unless of course
the @option{-n} command switch has been specified.  This choice is
by design.
2661
2662For example, the behavior of
2663@example
2664sed N foo bar
2665@end example
2666@noindent
would depend on whether @file{foo} has an even or an odd number of
2668lines@footnote{which is the actual ``bug'' that prompted the
2669change in behavior}.  Or, when writing a script to read the
2670next few lines following a pattern match, traditional
2671implementations of @code{sed} would force you to write
2672something like
2673@example
2674/foo/@{ $!N; $!N; $!N; $!N; $!N; $!N; $!N; $!N; $!N @}
2675@end example
2676@noindent
2677instead of just
2678@example
2679/foo/@{ N;N;N;N;N;N;N;N;N; @}
2680@end example
2681
2682@cindex @code{POSIXLY_CORRECT} behavior, @code{N} command
2683In any case, the simplest workaround is to use @code{$d;N} in
2684scripts that rely on the traditional behavior, or to set
2685the @code{POSIXLY_CORRECT} variable to a non-empty value.
2686
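The difference can be observed with a three-line input, so that the
second @code{N} is issued on the last line.  This is only a sketch: it
assumes a @sc{posix} shell and @acronym{GNU} @command{sed}, and the
variable names are made up for the illustration.  By default all three
lines should be printed; with @code{POSIXLY_CORRECT} set, only the
first two.

@example
```shell
# Three input lines, so the second N has no next line to read.
gnu_out=$(printf 'one\ntwo\nthree\n' | sed N)
# With POSIXLY_CORRECT set, sed exits without printing the last line.
posix_out=$(printf 'one\ntwo\nthree\n' | POSIXLY_CORRECT=1 sed N)
printf 'default:\n%s\n' "$gnu_out"
printf 'POSIXLY_CORRECT:\n%s\n' "$posix_out"
```
@end example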
2687@item Regex syntax clashes (problems with backslashes)
2688@cindex @acronym{GNU} extensions, to basic regular expressions
2689@cindex Non-bugs, regex syntax clashes
2690@command{sed} uses the @sc{posix} basic regular expression syntax.  According to
2691the standard, the meaning of some escape sequences is undefined in
2692this syntax;  notable in the case of @command{sed} are @code{\|},
2693@code{\+}, @code{\?}, @code{\`}, @code{\'}, @code{\<},
2694@code{\>}, @code{\b}, @code{\B}, @code{\w}, and @code{\W}.
2695
2696As in all @acronym{GNU} programs that use @sc{posix} basic regular
2697expressions, @command{sed} interprets these escape sequences as special
2698characters.  So, @code{x\+} matches one or more occurrences of @samp{x}.
2699@code{abc\|def} matches either @samp{abc} or @samp{def}.
2700
2701This syntax may cause problems when running scripts written for other
2702@command{sed}s.  Some @command{sed} programs have been written with the
2703assumption that @code{\|} and @code{\+} match the literal characters
2704@code{|} and @code{+}.  Such scripts must be modified by removing the
2705spurious backslashes if they are to be used with modern implementations
2706of @command{sed}, like
2707@ifset PERL
2708@value{SSED} or
2709@end ifset
2710@acronym{GNU} @command{sed}.
2711
2712On the other hand, some scripts use s|abc\|def||g to remove occurrences
2713of @emph{either} @code{abc} or @code{def}.  While this worked until
2714@command{sed} 4.0.x, newer versions interpret this as removing the
2715string @code{abc|def}.  This is again undefined behavior according to
2716@acronym{POSIX}, and this interpretation is arguably more robust: older
2717@command{sed}s, for example, required that the regex matcher parsed
2718@code{\/} as @code{/} in the common case of escaping a slash, which is
2719again undefined behavior; the new behavior avoids this, and this is good
2720because the regex matcher is only partially under our control.
2721
2722@cindex @acronym{GNU} extensions, special escapes
2723In addition, this version of @command{sed} supports several escape characters
2724(some of which are multi-character) to insert non-printable characters
2725in scripts (@code{\a}, @code{\c}, @code{\d}, @code{\o}, @code{\r},
2726@code{\t}, @code{\v}, @code{\x}).  These can cause similar problems
2727with scripts written for other @command{sed}s.
2728
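The following sketch shows how the escaped and unescaped forms differ.
It assumes a @sc{posix} shell and @acronym{GNU} @command{sed}; the
variable names are illustrative only.  The three results should be
@samp{Y+}, @samp{xxY}, and @samp{- - ghi}.

@example
```shell
# GNU extension: \+ means "one or more" in a basic regular expression.
plus_special=$(printf 'xxx+\n' | sed 's/x\+/Y/')
# An unescaped + is an ordinary character in a basic regular expression.
plus_literal=$(printf 'xxx+\n' | sed 's/x+/Y/')
# GNU extension: \| is alternation.
alternation=$(printf 'abc def ghi\n' | sed 's/abc\|def/-/g')
printf '%s\n' "$plus_special" "$plus_literal" "$alternation"
```
@end example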
2729@item @option{-i} clobbers read-only files
2730@cindex In-place editing
2731@cindex @value{SSEDEXT}, in-place editing
2732@cindex Non-bugs, in-place editing
2733
2734In short, @samp{sed -i} will let you delete the contents of
2735a read-only file, and in general the @option{-i} option
2736(@pxref{Invoking sed, , Invocation}) lets you clobber
2737protected files.  This is not a bug, but rather a consequence
2738of how the Unix filesystem works.
2739
2740The permissions on a file say what can happen to the data
2741in that file, while the permissions on a directory say what can
2742happen to the list of files in that directory.  @samp{sed -i}
will never open for writing a file that is already on disk.
2744Rather, it will work on a temporary file that is finally renamed
2745to the original name: if you rename or delete files, you're actually
2746modifying the contents of the directory, so the operation depends on
2747the permissions of the directory, not of the file.  For this same
2748reason, @command{sed} does not let you use @option{-i} on a writeable file
2749in a read-only directory, and will break hard or symbolic links when
2750@option{-i} is used on such a file.
2751
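The behavior can be reproduced in a scratch directory.  This sketch
assumes a @sc{posix} shell, the @command{mktemp} utility, and
@acronym{GNU} @command{sed} (other implementations may require an
argument after @option{-i}); the file and variable names are
hypothetical.  Because the directory is writable, the edit should
succeed even though the file itself is read-only, leaving @samp{new}
as its content.

@example
```shell
# Work in a private, writable scratch directory.
dir=$(mktemp -d)
printf 'old\n' > "$dir/file"
# Remove write permission from the file, but not from the directory.
chmod a-w "$dir/file"
# sed writes a temporary file and renames it over the original.
sed -i 's/old/new/' "$dir/file"
content=$(cat "$dir/file")
printf '%s\n' "$content"
rm -rf "$dir"
```
@end example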
2752@item @code{0a} does not work (gives an error)
2753@cindex @code{0} address
2754@cindex @acronym{GNU} extensions, @code{0} address
2755@cindex Non-bugs, @code{0} address
2756
2757There is no line 0.  0 is a special address that is only used to treat
2758addresses like @code{0,/@var{RE}/} as active when the script starts: if
2759you write @code{1,/abc/d} and the first line includes the word @samp{abc},
2760then that match would be ignored because address ranges must span at least
2761two lines (barring the end of the file); but what you probably wanted is
2762to delete every line up to the first one including @samp{abc}, and this
2763is obtained with @code{0,/abc/d}.
2764
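A small comparison of the two addresses, as a sketch that assumes a
@sc{posix} shell and @acronym{GNU} @command{sed} (the input text and
variable names are invented for the example).  With @code{1,/abc/d}
the range should run up to the @emph{second} line containing
@samp{abc}, leaving only @samp{y}; with @code{0,/abc/d} it should end
on the first line, leaving the last three lines.

@example
```shell
input='abc 1
x
abc 2
y'
# The end of the range is first looked for on line 2.
from_one=$(printf '%s\n' "$input" | sed '1,/abc/d')
# The end of the range may already match on line 1.
from_zero=$(printf '%s\n' "$input" | sed '0,/abc/d')
printf '1,/abc/d:\n%s\n' "$from_one"
printf '0,/abc/d:\n%s\n' "$from_zero"
```
@end example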
2765@ifclear PERL
2766@item @code{[a-z]} is case insensitive
2767@cindex Non-bugs, localization-related
2768
2769You are encountering problems with locales.  POSIX mandates that @code{[a-z]}
2770uses the current locale's collation order -- in C parlance, that means using
2771@code{strcoll(3)} instead of @code{strcmp(3)}.  Some locales have a
2772case-insensitive collation order, others don't.
2773
2774Another problem is that @code{[a-z]} tries to use collation symbols.
2775This only happens if you are on the @acronym{GNU} system, using
2776@acronym{GNU} libc's regular expression matcher instead of compiling the
2777one supplied with @acronym{GNU} sed.  In a Danish locale, for example,
2778the regular expression @code{^[a-z]$} matches the string @samp{aa},
2779because this is a single collating symbol that comes after @samp{a}
2780and before @samp{b}; @samp{ll} behaves similarly in Spanish
2781locales, or @samp{ij} in Dutch locales.
2782
2783To work around these problems, which may cause bugs in shell scripts, set
2784the @env{LC_COLLATE} and @env{LC_CTYPE} environment variables to @samp{C}.
2785
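A minimal sketch of the workaround, assuming a @sc{posix} shell.  It
uses @env{LC_ALL}, which takes precedence over both @env{LC_COLLATE}
and @env{LC_CTYPE}, and sets it for a single invocation only.  In the
@samp{C} locale @code{[a-z]} matches only the lowercase @sc{ascii}
letters, so the result should be @samp{B}.

@example
```shell
# Force the C locale for this one sed invocation.
only_upper=$(printf 'aBc\n' | LC_ALL=C sed 's/[a-z]//g')
printf '%s\n' "$only_upper"
```
@end example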
2786@item @code{s/.*//} does not clear pattern space
2787@cindex Non-bugs, localization-related
2788@cindex @value{SSEDEXT}, emptying pattern space
2789@cindex Emptying pattern space
2790
2791This happens if your input stream includes invalid multibyte
2792sequences.  @sc{posix} mandates that such sequences
2793are @emph{not} matched by @samp{.}, so that @samp{s/.*//} will not clear
2794pattern space as you would expect.  In fact, there is no way to clear
@command{sed}'s buffers in the middle of the script in most multibyte locales
(including UTF-8 locales).  For this reason, @value{SSED} provides a @code{z}
command (for ``zap'') as an extension.
2798
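As a rough illustration of the @code{z} command, assuming a
@sc{posix} shell and a @command{sed} that provides this extension
(the variable name is illustrative): after @code{z} the pattern space
is empty, so the substitution on @code{^$} should succeed and print
@samp{EMPTY}.

@example
```shell
# z empties pattern space regardless of what bytes it contains.
zapped=$(printf 'hello\n' | sed 'z;s/^$/EMPTY/')
printf '%s\n' "$zapped"
```
@end example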
2799To work around these problems, which may cause bugs in shell scripts, set
2800the @env{LC_COLLATE} and @env{LC_CTYPE} environment variables to @samp{C}.
2801@end ifclear
2802@end table
2803
2804
2805@node Extended regexps
2806@appendix Extended regular expressions
2807@cindex Extended regular expressions, syntax
2808
The only difference between basic and extended regular expressions is in
the behavior of a few characters: @samp{?}, @samp{+}, @samp{|}, parentheses,
and braces (@samp{@{@}}).  While basic regular expressions require
2812these to be escaped if you want them to behave as special characters,
2813when using extended regular expressions you must escape them if
2814you want them @emph{to match a literal character}.
2815
2816@noindent
2817Examples:
2818@table @code
2819@item abc?
2820becomes @samp{abc\?} when using extended regular expressions.  It matches
2821the literal string @samp{abc?}.
2822
2823@item c\+
2824becomes @samp{c+} when using extended regular expressions.  It matches
2825one or more @samp{c}s.
2826
2827@item a\@{3,\@}
2828becomes @samp{a@{3,@}} when using extended regular expressions.  It matches
2829three or more @samp{a}s.
2830
2831@item \(abc\)\@{2,3\@}
2832becomes @samp{(abc)@{2,3@}} when using extended regular expressions.  It
2833matches either @samp{abcabc} or @samp{abcabcabc}.
2834
2835@item \(abc*\)\1
2836becomes @samp{(abc*)\1} when using extended regular expressions.
2837Backreferences must still be escaped when using extended regular
2838expressions.
2839@end table
2840
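The correspondence can be checked from the shell.  This sketch assumes
a @sc{posix} shell and a @command{sed} that accepts @option{-E} to
select extended regular expressions; the variable names are invented
for the example.  The basic and extended forms should both give
@samp{X Y}, and the escaped question mark should match literally,
giving @samp{Z}.

@example
```shell
# Basic syntax: + and parentheses are special only when escaped.
bre=$(printf 'ccc abab\n' | sed 's/c\+/X/; s/\(ab\)\1/Y/')
# Extended syntax: the same operators are written without backslashes,
# but the backreference \1 is still escaped.
ere=$(printf 'ccc abab\n' | sed -E 's/c+/X/; s/(ab)\1/Y/')
# Extended syntax: \? matches a literal question mark.
literal=$(printf 'abc?\n' | sed -E 's/abc\?/Z/')
printf '%s\n' "$bre" "$ere" "$literal"
```
@end example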
2841@ifset PERL
2842@node Perl regexps
2843@appendix Perl-style regular expressions
2844@cindex Perl-style regular expressions, syntax
2845
2846@emph{This part is taken from the @file{pcre.txt} file distributed together
2847with the free @sc{pcre} regular expression matcher; it was written by Philip Hazel.}
2848
2849Perl introduced several extensions to regular expressions, some
2850of them incompatible with the syntax of regular expressions
2851accepted by Emacs and other @acronym{GNU} tools (whose matcher was
2852based on the Emacs matcher).  @value{SSED} implements
2853both kinds of extensions.
2854
2855@iftex
2856Summarizing, we have:
2857
2858@itemize @bullet
2859@item
2860A backslash can introduce several special sequences
2861
2862@item
2863The circumflex, dollar sign, and period characters behave specially
2864with regard to new lines
2865
2866@item
2867Strange uses of square brackets are parsed differently
2868
2869@item
2870You can toggle modifiers in the middle of a regular expression
2871
2872@item
2873You can specify that a subpattern does not count when numbering backreferences
2874
2875@item
2876@cindex Greedy regular expression matching
2877You can specify greedy or non-greedy matching
2878
2879@item
2880You can have more than ten back references
2881
2882@item
2883You can do complex look aheads and look behinds (in the spirit of
2884@code{\b}, but with subpatterns).
2885
2886@item
You can often improve performance by preventing @command{sed} from
wasting time on backtracking
2889
2890@item
2891You can have if/then/else branches
2892
2893@item
2894You can do recursive matches, for example to look for unbalanced parentheses
2895
2896@item
2897You can have comments and non-significant whitespace, because things can
2898get complex...
2899@end itemize
2900
2901Most of these extensions are introduced by the special @code{(?}
2902sequence, which gives special meanings to parenthesized groups.
2903@end iftex
2904@menu
Other extensions can be roughly subdivided into two categories.
On one hand, Perl introduces several more escaped sequences
(that is, sequences introduced by a backslash).  On the other
hand, it specifies that if a question mark follows an open
parenthesis, it should give a special meaning to the parenthesized
group.
2911
2912* Backslash::                       Introduces special sequences
2913* Circumflex/dollar sign/period::   Behave specially with regard to new lines
2914* Square brackets::                 Are a bit different in strange cases
2915* Options setting::                 Toggle modifiers in the middle of a regexp
2916* Non-capturing subpatterns::       Are not counted when backreferencing
2917* Repetition::                      Allows for non-greedy matching
2918* Backreferences::                  Allows for more than 10 back references
2919* Assertions::                      Allows for complex look ahead matches
2920* Non-backtracking subpatterns::    Often gives more performance
2921* Conditional subpatterns::         Allows if/then/else branches
2922* Recursive patterns::              For example to match parentheses
2923* Comments::                        Because things can get complex...
2924@end menu
2925
2926@node Backslash
2927@appendixsec Backslash
2928@cindex Perl-style regular expressions, escaped sequences
2929
There are a few differences in the handling of backslashed
sequences in Perl mode.
2932
2933First of all, there are no @code{\o} and @code{\d} sequences.
2934@sc{ascii} values for characters can be specified in octal
2935with a @code{\@var{xxx}} sequence, where @var{xxx} is a
2936sequence of up to three octal digits.  If the first digit
2937is a zero, the treatment of the sequence is straightforward;
2938just note that if the character that follows the escaped digit
2939is itself an octal digit, you have to supply three octal digits
2940for @var{xxx}.  For example @code{\07} is a @sc{bel} character
2941rather than a @sc{nul} and a literal @code{7} (this sequence is
2942instead represented by @code{\0007}).
2943
2944@cindex Perl-style regular expressions, backreferences
2945The handling of a backslash followed by a digit other than 0
2946is complicated.  Outside a character class, @command{sed} reads it
2947and any following digits as a decimal number. If the number
2948is less than 10, or if there have been at least that many
2949previous capturing left parentheses in the expression, the
2950entire sequence is taken as a back reference. A description
2951of how this works is given later, following the discussion
2952of parenthesized subpatterns.
2953
2954Inside a character class, or if the decimal number is
2955greater than 9 and there have not been that many capturing
2956subpatterns, @command{sed} re-reads up to three octal digits following
2957the backslash, and generates a single byte from the
2958least significant 8 bits of the value. Any subsequent digits
2959stand for themselves.  For example:
2960
2961@example
2962\040  @i{@r{is another way of writing a space}}
2963\40   @i{@r{is the same, provided there are fewer than 40}}
2964      @i{@r{previous capturing subpatterns}}
2965\7    @i{@r{is always a back reference}}
2966\011  @i{@r{is always a tab}}
2967\11   @i{@r{might be a back reference, or another way of writing a tab}}
2968\0113 @i{@r{is a tab followed by the character @samp{3}}}
2969\113  @i{@r{is the character with octal code 113 (since there}}
2970      @i{@r{can be no more than 99 back references)}}
2971\377  @i{@r{is a byte consisting entirely of 1 bits (@sc{ascii} 255)}}
2972\81   @i{@r{is either a back reference, or a binary zero}}
2973      @i{@r{followed by the two characters @samp{81}}}
2974@end example
2975
2976Note that octal values of 100 or greater must not be introduced
2977by a leading zero, because no more than three octal
digits are ever read. Note that this applies only to the LHS
pattern; it is not yet possible to specify more than 9 backreferences
on the RHS of the @code{s} command.
2981
2982All the sequences that define a single byte value can be
2983used both inside and outside character classes. In addition,
2984inside a character class, the sequence @code{\b} is interpreted
2985as the backspace character (hex 08). Outside a character
2986class it has a different meaning (see below).
2987
There are also four escapes specifying
generic character classes (as @code{\w} and @code{\W} do):
2990
2991@cindex Perl-style regular expressions, character classes
2992@table @samp
2993@item \d
2994Matches any decimal digit
2995
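@item \D
Matches any character that is not a decimal digit

@item \s
Matches any whitespace character

@item \S
Matches any character that is not a whitespace character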
2998@end table
2999
In Perl mode, these character type sequences can appear both inside and
outside character classes.  In @sc{posix} mode, by contrast, these sequences
(as well as @code{\w} and @code{\W}) are treated as two literal characters
(a backslash and a letter) inside square brackets.
3004
3005Escaped sequences specifying assertions are also different in
3006Perl mode.  An assertion specifies a condition that has to be met
3007at a particular point in a match, without consuming any
3008characters from the subject string. The use of subpatterns
3009for more complicated assertions is described below.  The
3010backslashed assertions are
3011
3012@cindex Perl-style regular expressions, assertions
3013@table @samp
3014@item \b
3015Asserts that the point is at a word boundary.
3016A word boundary is a position in the subject string where
3017the current character and the previous character do not both
3018match @code{\w} or @code{\W} (i.e. one matches @code{\w} and
3019the other matches @code{\W}), or the start or end of the string
3020if the first or last character matches @code{\w}, respectively.
3021
3022@item \B
3023Asserts that the point is not at a word boundary.
3024
3025@item \A
3026Asserts the matcher is at the start of pattern space (independent
3027of multiline mode).
3028
3029@item \Z
3030Asserts the matcher is at the end of pattern space,
3031or at a newline before the end of pattern space (independent of
3032multiline mode)
3033
3034@item \z
3035Asserts the matcher is at the end of pattern space (independent
3036of multiline mode)
3037@end table
3038
3039These assertions may not appear in character classes (but
3040note that @code{\b} has a different meaning, namely the
3041backspace character, inside a character class).
Note that Perl mode does not directly support assertions
for the beginning and the end of a word; the @acronym{GNU} extensions
@code{\<} and @code{\>} serve this purpose in @sc{posix} mode
instead.
3046
3047The @code{\A}, @code{\Z}, and @code{\z} assertions differ
3048from the traditional circumflex and dollar sign (described below)
3049in that they only ever match at the very start and end of the
3050subject string, whatever options are set; in particular @code{\A}
3051and @code{\z} are the same as the @acronym{GNU} extensions
3052@code{\`} and @code{\'} that are active in @sc{posix} mode.
3053
3054@node Circumflex/dollar sign/period
3055@appendixsec Circumflex, dollar sign, period
3056@cindex Perl-style regular expressions, newlines
3057
3058Outside a character class, in the default matching mode, the
3059circumflex character is an assertion which is true only if
3060the current matching point is at the start of the subject
3061string.  Inside a character class, the circumflex has an entirely
3062different meaning (see below).
3063
3064The circumflex need not be the first character of the pattern if
3065a number of alternatives are involved, but it should be the
3066first thing in each alternative in which it appears if the
pattern is ever to match that branch. If all possible alternatives
start with a circumflex, that is, if the pattern is
constrained to match only at the start of the subject, it is
said to be an @dfn{anchored} pattern. (There are also other
constructs that can cause a pattern to be anchored.)
3072
3073A dollar sign is an assertion which is true only if the
3074current matching point is at the end of the subject string,
3075or immediately before a newline character that is the last
3076character in the string (by default).  A dollar sign need not be the
3077last character of the pattern if a number of alternatives
3078are involved, but it should be the last item in any branch
3079in which it appears.  A dollar sign has no special meaning in a
3080character class.
3081
3082@cindex Perl-style regular expressions, multiline
3083The meanings of the circumflex and dollar sign characters are
3084changed if the @code{M} modifier option is used. When this is
3085the case, they match immediately after and immediately
3086before an internal @code{\n} character, respectively, in addition
3087to matching at the start and end of the subject string.  For
3088example, the pattern @code{/^abc$/} matches the subject string
3089@samp{def\nabc} in multiline mode, but not otherwise.  Consequently,
3090patterns that are anchored in single line mode
3091because all branches start with @code{^} are not anchored in
3092multiline mode.
3093
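The effect of multiline mode can be tried with the @code{M} flag of
the @code{s} command.  This sketch assumes a @sc{posix} shell and
@acronym{GNU} @command{sed}, and it runs in @sc{posix} mode rather
than Perl mode; the variable names are illustrative.  @code{N} joins
the two input lines so that pattern space holds @samp{def}, a newline,
and @samp{abc}.  Without @code{M} the text should be left unchanged;
with @code{M} the second line should become @samp{X}.

@example
```shell
# Default mode: ^ and $ match only at the ends of pattern space.
single=$(printf 'def\nabc\n' | sed 'N;s/^abc$/X/')
# Multiline mode: ^ also matches just after the embedded newline.
multi=$(printf 'def\nabc\n' | sed 'N;s/^abc$/X/M')
printf 'without M:\n%s\n' "$single"
printf 'with M:\n%s\n' "$multi"
```
@end example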
3094@cindex Perl-style regular expressions, multiline
3095Note that the sequences @code{\A}, @code{\Z}, and @code{\z}
3096can be used to match the start and end of the subject in both
modes, and if all branches of a pattern start with @code{\A},
it is always anchored, whether the @code{M} modifier is set or not.
3099
3100@cindex Perl-style regular expressions, single line
3101Outside a character class, a dot in the pattern matches any
3102one character in the subject, including a non-printing character,
3103but not (by default) newline.  If the @code{S} modifier is used,
3104dots match newlines as well.  Actually, the handling of
3105dot is entirely independent of the handling of circumflex
3106and dollar sign, the only relationship being that they both
3107involve newline characters. Dot has no special meaning in a
3108character class.
3109
3110@node Square brackets
3111@appendixsec Square brackets
3112@cindex Perl-style regular expressions, character classes
3113
3114An opening square bracket introduces a character class, terminated
3115by a closing square bracket.  A closing square bracket on its own
3116is not special.  If a closing square bracket is required as a
3117member of the class, it should be the first data character in
3118the class (after an initial circumflex, if present) or escaped with a backslash.
3119
3120A character class matches a single character in the subject;
3121the character must be in the set of characters defined by
3122the class, unless the first character in the class is a circumflex,
3123in which case the subject character must not be in
3124the set defined by the class. If a circumflex is actually
3125required as a member of the class, ensure it is not the
3126first character, or escape it with a backslash.
3127
For example, the character class @code{[aeiou]} matches any lower
case vowel, while @code{[^aeiou]} matches any character that is not
a lower case vowel. Note that a circumflex is just a convenient
notation for specifying the characters which are in
the class by enumerating those that are not. It is not an
assertion: it still consumes a character from the subject
string, and fails if the current pointer is at the end of
the string.
3136
3137@cindex Perl-style regular expressions, case-insensitive
3138When caseless matching is set, any letters in a class
3139represent both their upper case and lower case versions, so
3140for example, a caseless @code{[aeiou]} matches uppercase
3141and lowercase @samp{A}s, and a caseless @code{[^aeiou]}
3142does not match @samp{A}, whereas a case-sensitive version would.
3143
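Caseless classes can be explored with the @code{I} flag of the
@code{s} command.  This sketch assumes a @sc{posix} shell and
@acronym{GNU} @command{sed}, and it runs in @sc{posix} mode rather
than Perl mode; the variable names are illustrative.  For the input
@samp{Apple}, the first result should be @samp{_ppl_} and the second
@samp{A___e}, since a caseless @code{[^aeiou]} does not match
@samp{A}.

@example
```shell
# A caseless [aeiou] matches both the A and the e.
vowels=$(printf 'Apple\n' | sed 's/[aeiou]/_/gI')
# A caseless [^aeiou] matches neither of them.
others=$(printf 'Apple\n' | sed 's/[^aeiou]/_/gI')
printf '%s\n' "$vowels" "$others"
```
@end example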
3144@cindex Perl-style regular expressions, single line
3145@cindex Perl-style regular expressions, multiline
3146The newline character is never treated in any special way in
3147character classes, whatever the setting of the @code{S} and
3148@code{M} options (modifiers) is.  A class such as @code{[^a]} will
3149always match a newline.
3150
3151The minus (hyphen) character can be used to specify a range
3152of characters in a character class.  For example, @code{[d-m]}
3153matches any letter between d and m, inclusive.  If a minus
3154character is required in a class, it must be escaped with a
3155backslash or appear in a position where it cannot be interpreted
3156as indicating a range, typically as the first or last
3157character in the class.
3158
3159It is not possible to have the literal character @code{]} as the
3160end character of a range.  A pattern such as @code{[W-]46]} is
3161interpreted as a class of two characters (@code{W} and @code{-})
3162followed by a literal string @code{46]}, so it would match
3163@samp{W46]} or @samp{-46]}. However, if the @code{]} is escaped
3164with a backslash it is interpreted as the end of range, so
3165@code{[W-\]46]} is interpreted as a single class containing a
3166range followed by two separate characters. The octal or
3167hexadecimal representation of @code{]} can also be used to end a range.
3168
3169Ranges operate in @sc{ascii} collating sequence. They can also be
3170used for characters specified numerically, for example
3171@code{[\000-\037]}. If a range that includes letters is used when
3172caseless matching is set, it matches the letters in either
3173case. For example, a caseless @code{[W-c]} is equivalent to
3174@code{[][\^_`wxyzabc]}, matched caselessly, and if character
3175tables for the French locale are in use, @code{[\xc8-\xcb]}
3176matches accented E characters in both cases.
3177
3178Unlike in @sc{posix} mode, the character types @code{\d},
3179@code{\D}, @code{\s}, @code{\S}, @code{\w}, and @code{\W}
3180may also appear in a character class, and add the characters
3181that they match to the class. For example, @code{[\dABCDEF]} matches any
3182hexadecimal digit.  A circumflex can conveniently be used
3183with the upper case character types to specify a more restricted
3184set of characters than the matching lower case type.
3185For example, the class @code{[^\W_]} matches any letter or digit,
3186but not underscore.
3187
All non-alphanumeric characters other than @code{\}, @code{-},
3189@code{^} (at the start) and the terminating @code{]}
3190are non-special in character classes, but it does no harm
3191if they are escaped.
3192
3193Perl 5.6 supports the @sc{posix} notation for character classes, which
3194uses names enclosed by @code{[:} and @code{:]} within the enclosing
3195square brackets, and @value{SSED} supports this notation as well.
3196For example,
3197
3198@example
3199[01[:alpha:]%]
3200@end example
3201
3202@noindent
3203matches @samp{0}, @samp{1}, any alphabetic character, or @samp{%}.
3204The supported class names are
3205
3206@table @code
3207@item alnum
3208Matches letters and digits
3209
3210@item alpha
3211Matches letters
3212
3213@item ascii
3214Matches character codes 0 - 127
3215
3216@item cntrl
3217Matches control characters
3218
3219@item digit
Matches decimal digits (same as @code{\d})
3221
3222@item graph
3223Matches printing characters, excluding space
3224
3225@item lower
3226Matches lower case letters
3227
3228@item print
3229Matches printing characters, including space
3230
3231@item punct
3232Matches printing characters, excluding letters and digits
3233
3234@item space
Matches white space (same as @code{\s})
3236
3237@item upper
3238Matches upper case letters
3239
3240@item word
Matches ``word'' characters (same as @code{\w})
3242
3243@item xdigit
3244Matches hexadecimal digits
3245@end table
3246
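The bracketed class shown earlier, @code{[01[:alpha:]%]}, uses only
standard @sc{posix} class names, so it can be tried outside Perl mode
as well.  This sketch assumes a @sc{posix} shell and forces the
@samp{C} locale; the variable name is illustrative.  In the input
@samp{0a%9} every character except @samp{9} belongs to the class, so
the result should be @samp{___9}.

@example
```shell
# Replace every character that is 0, 1, alphabetic, or a percent sign.
masked=$(printf '%s\n' '0a%9' | LC_ALL=C sed 's/[01[:alpha:]%]/_/g')
printf '%s\n' "$masked"
```
@end example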
3247The names @code{ascii} and @code{word} are extensions valid only in
3248Perl mode.  Another Perl extension is negation, which is
3249indicated by a circumflex character after the colon. For example,
3250
3251@example
3252[12[:^digit:]]
3253@end example
3254
3255@noindent
3256matches @samp{1}, @samp{2}, or any non-digit.
3257
3258@node Options setting
3259@appendixsec Options setting
3260@cindex Perl-style regular expressions, toggling options
3261@cindex Perl-style regular expressions, case-insensitive
3262@cindex Perl-style regular expressions, multiline
3263@cindex Perl-style regular expressions, single line
3264@cindex Perl-style regular expressions, extended
3265
3266The settings of the @code{I}, @code{M}, @code{S}, @code{X}
3267modifiers can be changed from within the pattern by
3268a sequence of Perl option letters enclosed between @code{(?}
3269and @code{)}. The option letters must be lowercase.
3270
3271For example, @code{(?im)} sets caseless, multiline matching. It is
3272also possible to unset these options by preceding the letter
3273with a hyphen; you can also have combined settings and unsettings:
@code{(?im-sx)} sets caseless and multiline matching,
and unsets single line matching (for dots) and extended
whitespace interpretation.  If a letter appears both before
and after the hyphen, the option is unset.
3278
3279The scope of these option changes depends on where in the
3280pattern the setting occurs. For settings that are outside
3281any subpattern (defined below), the effect is the same as if
3282the options were set or unset at the start of matching. The
3283following patterns all behave in exactly the same way:
3284
3285@example
3286(?i)abc
3287a(?i)bc
3288ab(?i)c
3289abc(?i)
3290@end example
3291
@noindent
which in turn is the same as specifying the pattern @code{abc} with
3293the @code{I} modifier.  In other words, ``top level'' settings
3294apply to the whole pattern (unless there are other
3295changes inside subpatterns). If there is more than one setting
3296of the same option at top level, the rightmost setting
3297is used.
3298
3299If an option change occurs inside a subpattern, the effect
3300is different.  This is a change of behaviour in Perl 5.005.
3301An option change inside a subpattern affects only that part
3302of the subpattern @emph{that follows} it, so
3303
3304@example
3305(a(?i)b)c
3306@end example
3307
3308@noindent
matches @samp{abc} and @samp{aBc} and no other strings (assuming
3310case-sensitive matching is used).  By this means, options can
3311be made to have different settings in different parts of the
3312pattern.  Any changes made in one alternative do carry on
3313into subsequent branches within the same subpattern.  For
3314example,
3315
3316@example
3317(a(?i)b|c)
3318@end example
3319
3320@noindent
3321matches @samp{ab}, @samp{aB}, @samp{c}, and @samp{C},
3322even though when matching @samp{C} the first branch is
3323abandoned before the option setting.
3324This is because the effects of option settings happen at
3325compile time. There would be some very weird behaviour otherwise.
3326
3327@ignore
3328There are two PCRE-specific options PCRE_UNGREEDY and PCRE_EXTRA
3329that can be changed in the same way as the Perl-compatible options by
3330using the characters U and X respectively.  The (?X) flag
3331setting is special in that it must always occur earlier in
3332the pattern than any of the additional features it turns on,
3333even when it is at top level. It is best put at the start.
3334@end ignore
3335
3336
3337@node Non-capturing subpatterns
3338@appendixsec Non-capturing subpatterns
3339@cindex Perl-style regular expressions, non-capturing subpatterns
3340
3341Marking part of a pattern as a subpattern does two things.
3342On one hand, it localizes a set of alternatives; on the other
3343hand, it sets up the subpattern as a capturing subpattern (as
3344defined above).  The subpattern can be backreferenced and
3345referenced in the right side of @code{s} commands.
3346
3347For example, if the string @samp{the red king} is matched against
3348the pattern
3349
3350@example
3351the ((red|white) (king|queen))
3352@end example
3353
3354@noindent
3355the captured substrings are @samp{red king}, @samp{red},
3356and @samp{king}, and are numbered 1, 2, and 3.
3357
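The numbering can be checked by echoing each group back in the
replacement.  Since Perl mode may not be available, this sketch uses
extended regular expressions instead, where capturing groups are
numbered the same way, by the position of their opening parenthesis.
It assumes a @sc{posix} shell and a @command{sed} that accepts
@option{-E}; the variable name is illustrative.  The result should be
@samp{1=red king 2=red 3=king}.

@example
```shell
# Print each captured substring next to its group number.
groups=$(printf 'the red king\n' |
  sed -E 's/the ((red|white) (king|queen))/1=\1 2=\2 3=\3/')
printf '%s\n' "$groups"
```
@end example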
3358The fact that plain parentheses fulfil two functions is not
3359always helpful.  There are often times when a grouping
3360subpattern is required without a capturing requirement.  If an
3361opening parenthesis is followed by @code{?:}, the subpattern does
3362not do any capturing, and is not counted when computing the
3363number of any subsequent capturing subpatterns. For example,
3364if the string @samp{the white queen} is matched against the pattern
3365
3366@example
3367the ((?:red|white) (king|queen))
3368@end example
3369
3370@noindent
3371the captured substrings are @samp{white queen} and @samp{queen},
3372and are numbered 1 and 2. The maximum number of captured
3373substrings is 99, while the maximum number of all subpatterns,
3374both capturing and non-capturing, is 200.
3375
3376As a convenient shorthand, if any option settings are
required at the start of a non-capturing subpattern, the
3378option letters may appear between the @code{?} and the
3379@code{:}.  Thus the two patterns
3380
3381@example
3382(?i:saturday|sunday)
3383(?:(?i)saturday|sunday)
3384@end example
3385
3386@noindent
3387match exactly the same set of strings.  Because alternative
3388branches are tried from left to right, and options are not
3389reset until the end of the subpattern is reached, an option
3390setting in one branch does affect subsequent branches, so
3391the above patterns match @samp{SUNDAY} as well as @samp{Saturday}.
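
As a hedged illustration of the @code{(?i:...)} shorthand, the sketch
below again uses Python's @code{re} module (an assumption made for the
sake of a runnable example; Python accepts this scoped form since
version 3.6) rather than @command{sed} itself.

@example
import re

# The i option applies to every branch inside the group.
pattern = re.compile(r"(?i:saturday|sunday)")

matches_sunday = pattern.fullmatch("SUNDAY") is not None
matches_saturday = pattern.fullmatch("Saturday") is not None
matches_monday = pattern.fullmatch("monday") is not None

print(matches_sunday, matches_saturday, matches_monday)
@end example

@noindent
Both day names match regardless of case, while an unrelated word does not.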
3392
3393
3394@node Repetition
3395@appendixsec Repetition
3396@cindex Perl-style regular expressions, repetitions
3397
3398Repetition is specified by quantifiers, which can follow any
3399of the following items:
3400
3401@itemize @bullet
3402@item
3403a single character, possibly escaped
3404
3405@item
3406the @code{.} special character
3407
3408@item
3409a character class
3410
3411@item
3412a back reference (see next section)
3413
3414@item
3415a parenthesized subpattern (unless it is an assertion; @pxref{Assertions})
3416@end itemize
3417
3418The general repetition quantifier specifies a minimum and
3419maximum number of permitted matches, by giving the two
3420numbers in curly brackets (braces), separated by a comma.
3421The numbers must be less than 65536, and the first must be
3422less than or equal to the second. For example:
3423
3424@example
3425z@{2,4@}
3426@end example
3427
3428@noindent
3429matches @samp{zz}, @samp{zzz}, or @samp{zzzz}. A closing brace on its own
3430is not a special character. If the second number is omitted,
3431but the comma is present, there is no upper limit; if the
3432second number and the comma are both omitted, the quantifier
3433specifies an exact number of required matches. Thus
3434
3435@example
3436[aeiou]@{3,@}
3437@end example
3438
3439@noindent
3440matches at least 3 successive vowels, but may match many
3441more, while
3442
3443@example
3444\d@{8@}
3445@end example
3446
3447@noindent
3448matches exactly 8 digits.  An opening curly bracket that
3449appears in a position where a quantifier is not allowed, or
3450one that does not match the syntax of a quantifier, is taken
as a literal character. For example, @samp{@{,6@}} is not a quantifier,
but a literal string of four characters.@footnote{It
raises an error if @option{-R} is not used.}
3454
3455The quantifier @samp{@{0@}} is permitted, causing the expression to
3456behave as if the previous item and the quantifier were not
3457present.
3458
3459For convenience (and historical compatibility) the three
3460most common quantifiers have single-character abbreviations:
3461
3462@table @code
3463@item *
3464is equivalent to @{0,@}
3465
3466@item +
3467is equivalent to @{1,@}
3468
3469@item ?
3470is equivalent to @{0,1@}
3471@end table
3472
3473It is possible to construct infinite loops by following a
3474subpattern that can match no characters with a quantifier
3475that has no upper limit, for example:
3476
3477@example
3478(a?)*
3479@end example
3480
3481Earlier versions of Perl used to give an error at
3482compile time for such patterns. However, because there are
3483cases where this can be useful, such patterns are now
3484accepted, but if any repetition of the subpattern does in
3485fact match no characters, the loop is forcibly broken.
3486
3487@cindex Greedy regular expression matching
3488@cindex Perl-style regular expressions, stingy repetitions
3489By default, the quantifiers are @dfn{greedy} like in @sc{posix}
3490mode, that is, they match as much as possible (up to the maximum
3491number of permitted times), without causing the rest of the
3492pattern to fail. The classic example of where this gives problems
3493is in trying to match comments in C programs. These appear between
3494the sequences @code{/*} and @code{*/} and within the sequence, individual
3495@code{*} and @code{/} characters may appear. An attempt to match C
3496comments by applying the pattern
3497
3498@example
3499/\*.*\*/
3500@end example
3501
3502@noindent
3503to the string
3504
3505@example
/* first comment */ not comment /* second comment */
3507@end example
3508
@noindent
fails, because it matches the entire string owing to the
greediness of the @code{.*} item.
3513
3514However, if a quantifier is followed by a question mark, it
3515ceases to be greedy, and instead matches the minimum number
3516of times possible, so the pattern @code{/\*.*?\*/}
3517does the right thing with the C comments. The meaning of the
3518various quantifiers is not otherwise changed, just the preferred
3519number of matches.  Do not confuse this use of question
3520mark with its use as a quantifier in its own right.
3521Because it has two uses, it can sometimes appear doubled, as in
3522
3523@example
3524\d??\d
3525@end example
3526
3527which matches one digit by preference, but can match two if
3528that is the only way the rest of the pattern matches.
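
The difference between greedy and minimizing repetition can be observed
with a short experiment.  This sketch assumes Python's @code{re} module,
which shares the @code{?} suffix syntax, as a stand-in for the
@value{SSED} matcher.

@example
import re

text = "/* first comment */ not comment /* second comment */"

# Greedy: .* runs to the last */ so one match spans the whole line.
greedy = re.findall(r"/\*.*\*/", text)

# Minimizing: .*? stops at the first */ so each comment matches alone.
lazy = re.findall(r"/\*.*?\*/", text)

# \d??\d prefers one digit when nothing else forces a second one.
one_digit = re.match(r"\d??\d", "12").group()

print(greedy)
print(lazy)
print(one_digit)
@end example

@noindent
The greedy pattern yields a single match covering the whole line, the
minimizing one yields the two comments separately, and the last
pattern consumes only the first digit.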
3529
3530Note that greediness does not matter when specifying addresses,
3531but can be nevertheless used to improve performance.
3532
3533@ignore
3534If the PCRE_UNGREEDY option is set (an option which is not
3535available in Perl), the quantifiers are not greedy by
3536default, but individual ones can be made greedy by following
3537them with a question mark. In other words, it inverts the
3538default behaviour.
3539@end ignore
3540
3541When a parenthesized subpattern is quantified with a minimum
3542repeat count that is greater than 1 or with a limited maximum,
3543more store is required for the compiled pattern, in
3544proportion to the size of the minimum or maximum.
3545
3546@cindex Perl-style regular expressions, single line
3547If a pattern starts with @code{.*} or @code{.@{0,@}} and the
3548@code{S} modifier is used, the pattern is implicitly anchored,
3549because whatever follows will be tried against every character
3550position in the subject string, so there is no point in
3551retrying the overall match at any position after the first.
PCRE treats such a pattern as though it were preceded by @code{\A}.
3553
3554When a capturing subpattern is repeated, the value captured
3555is the substring that matched the final iteration. For example,
3556after
3557
3558@example
3559(tweedle[dume]@{3@}\s*)+
3560@end example
3561
3562@noindent
3563has matched @samp{tweedledum tweedledee} the value of the
3564captured substring is @samp{tweedledee}.  However, if there are
3565nested capturing subpatterns, the corresponding captured
3566values may have been set in previous iterations. For example,
3567after
3568
3569@example
3570/(a|(b))+/
3571@end example
3572
3573matches @samp{aba}, the value of the second captured substring is
3574@samp{b}.
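
The behaviour of repeated capturing subpatterns can be sketched as
follows.  Python's @code{re} module is assumed as a stand-in for the
@value{SSED} matcher, and the first pattern is a simplified variant of
the one above that uses @code{\w+} instead of a counted character class.

@example
import re

# Each iteration overwrites group 1, so only the last one is kept.
repeated = re.match(r"(tweedle\w+\s*)+", "tweedledum tweedledee")
last_iteration = repeated.group(1)

# Group 2 was set while matching the middle "b" and is not reset
# when the final iteration takes the "a" branch.
nested = re.match(r"(a|(b))+", "aba")
nested_groups = nested.groups()

print(last_iteration)
print(nested_groups)
@end example

@noindent
In this sketch the first capture is @samp{tweedledee}, and the nested
pattern reports @samp{a} for group 1 and @samp{b} for group 2.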
3575
3576@node Backreferences
3577@appendixsec Backreferences
3578@cindex Perl-style regular expressions, backreferences
3579
3580Outside a character class, a backslash followed by a digit
3581greater than 0 (and possibly further digits) is a back
3582reference to a capturing subpattern earlier (i.e.  to its
3583left) in the pattern, provided there have been that many
3584previous capturing left parentheses.
3585
3586However, if the decimal number following the backslash is
3587less than 10, it is always taken as a back reference, and
3588causes an error only if there are not that many capturing
3589left parentheses in the entire pattern. In other words, the
3590parentheses that are referenced need not be to the left of
the reference for numbers less than 10. @xref{Backslash},
for further details of the handling of digits following a backslash.
3593
3594A back reference matches whatever actually matched the capturing
3595subpattern in the current subject string, rather than
3596anything matching the subpattern itself. So the pattern
3597
3598@example
3599(sens|respons)e and \1ibility
3600@end example
3601
3602@noindent
3603matches @samp{sense and sensibility} and @samp{response and responsibility},
3604but not @samp{sense and responsibility}. If caseful
3605matching is in force at the time of the back reference, the
3606case of letters is relevant. For example,
3607
3608@example
3609((?i)blah)\s+\1
3610@end example
3611
3612@noindent
3613matches @samp{blah blah} and @samp{Blah Blah}, but not
3614@samp{BLAH blah}, even though the original capturing
3615subpattern is matched caselessly.
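
A minimal sketch of the first example, assuming Python's @code{re}
module as a stand-in for the @value{SSED} matcher, shows that a back
reference repeats the text that was actually captured rather than
re-applying the subpattern.

@example
import re

pattern = re.compile(r"(sens|respons)e and \1ibility")

same_word = pattern.fullmatch("sense and sensibility") is not None
other_word = pattern.fullmatch("response and responsibility") is not None
mixed = pattern.fullmatch("sense and responsibility") is not None

print(same_word, other_word, mixed)
@end example

@noindent
The first two subjects match; the third does not, because @code{\1}
must repeat @samp{sens} once that is what the group captured.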
3616
3617There may be more than one back reference to the same subpattern.
3618Also, if a subpattern has not actually been used in a
3619particular match, any back references to it always fail. For
3620example, the pattern
3621
3622@example
3623(a|(bc))\2
3624@end example
3625
3626@noindent
3627always fails if it starts to match @samp{a} rather than
3628@samp{bc}.  Because there may be up to 99 back references, all
3629digits following the backslash are taken as part of a potential
3630back reference number; this is different from what happens
3631in @sc{posix} mode. If the pattern continues with a digit
3632character, some delimiter must be used to terminate the back
3633reference.  If the @code{X} modifier option is set, this can be
3634whitespace.  Otherwise an empty comment can be used, or the
3635following character can be expressed in hexadecimal or octal.
Note that this applies only to the LHS pattern; it is
not yet possible to specify more than 9 backreferences on the
RHS of the @code{s} command.
3639
3640A back reference that occurs inside the parentheses to which
3641it refers fails when the subpattern is first used, so, for
3642example, @code{(a\1)} never matches.  However, such references
3643can be useful inside repeated subpatterns. For example, the
3644pattern
3645
3646@example
3647(a|b\1)+
3648@end example
3649
3650@noindent
3651matches any number of @samp{a}s and also @samp{aba}, @samp{ababbaa},
3652etc. At each iteration of the subpattern, the back reference matches
3653the character string corresponding to the previous iteration.  In
3654order for this to work, the pattern must be such that the first
3655iteration does not need to match the back reference.  This can be
3656done using alternation, as in the example above, or by a
3657quantifier with a minimum of zero.
3658
3659@node Assertions
3660@appendixsec Assertions
3661@cindex Perl-style regular expressions, assertions
3662@cindex Perl-style regular expressions, asserting subpatterns
3663
3664An assertion is a test on the characters following or
3665preceding the current matching point that does not actually
3666consume any characters. The simple assertions coded as @code{\b},
3667@code{\B}, @code{\A}, @code{\Z}, @code{\z}, @code{^} and @code{$}
3668are described above. More complicated assertions are coded as
3669subpatterns.  There are two kinds: those that look ahead of the
3670current position in the subject string, and those that look behind it.
3671
3672@cindex Perl-style regular expressions, lookahead subpatterns
3673An assertion subpattern is matched in the normal way, except
3674that it does not cause the current matching position to be
3675changed. Lookahead assertions start with @code{(?=} for positive
3676assertions and @code{(?!} for negative assertions. For example,
3677
3678@example
3679\w+(?=;)
3680@end example
3681
3682@noindent
3683matches a word followed by a semicolon, but does not include
3684the semicolon in the match, and
3685
3686@example
3687foo(?!bar)
3688@end example
3689
3690@noindent
3691matches any occurrence of @samp{foo} that is not followed by
3692@samp{bar}.
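
The two lookahead examples can be tried out as below.  This is a
sketch that assumes Python's @code{re} module, which uses the same
@code{(?=} and @code{(?!} notation, in place of the @value{SSED} matcher;
the subject strings are invented for the illustration.

@example
import re

# The semicolon is required but is not part of the matched text.
words = re.findall(r"\w+(?=;)", "alpha; beta gamma;")

# Only the second foo qualifies: the first is followed by bar.
foo_start = re.search(r"foo(?!bar)", "foobar foobaz").start()

print(words)
print(foo_start)
@end example

@noindent
The word @samp{beta} is skipped because no semicolon follows it, and
the reported offset points at the @samp{foo} of @samp{foobaz}.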
3693
3694Note that the apparently similar pattern
3695
3696@example
3697(?!foo)bar
3698@end example
3699
3700@noindent
3701@cindex Perl-style regular expressions, lookbehind subpatterns
3702finds any occurrence of @samp{bar} even if it is preceded by
3703@samp{foo}, because the assertion @code{(?!foo)} is always true
3704when the next three characters are @samp{bar}. A lookbehind
3705assertion is needed to achieve this effect.
3706Lookbehind assertions start with @code{(?<=} for positive
3707assertions and @code{(?<!} for negative assertions. So,
3708
3709@example
3710(?<!foo)bar
3711@end example
3712
3713achieves the required effect of finding an occurrence of
3714@samp{bar} that is not preceded by @samp{foo}. The contents of a
3715lookbehind assertion are restricted
3716such that all the strings it matches must have a fixed
3717length.  However, if there are several alternatives, they do
3718not all have to have the same fixed length.  This is an extension
3719compared with Perl 5.005, which requires all branches to match
3720the same length of string. Thus
3721
3722@example
3723(?<=dogs|cats|)
3724@end example
3725
3726@noindent
3727is permitted, but the apparently equivalent regular expression
3728
3729@example
3730(?<!dogs?|cats?)
3731@end example
3732
3733@noindent
3734causes an error at compile time. Branches that match different
3735length strings are permitted only at the top level of
3736a lookbehind assertion: an assertion such as
3737
3738@example
3739(?<=ab(c|de))
3740@end example
3741
3742@noindent
3743is not permitted, because its single top-level branch can
3744match two different lengths, but it is acceptable if rewritten
3745to use two top-level branches:
3746
3747@example
3748(?<=abc|abde)
3749@end example
3750
3751All this is required because lookbehind assertions simply
3752move the current position back by the alternative's fixed
3753width and then try to match.  If there are
3754insufficient characters before the current position, the
match is deemed to fail.  Lookbehinds, in conjunction with
non-backtracking subpatterns, can be particularly useful for
3757matching at the ends of strings; an example is given at the end
3758of the section on non-backtracking subpatterns.
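
The contrast between @code{(?!foo)bar} and @code{(?<!foo)bar} can be
sketched as follows, assuming Python's @code{re} module as a stand-in
for the @value{SSED} matcher.  Note that Python is stricter than the
rules given above: it requires the whole lookbehind to have one fixed
width, so a pattern such as @code{(?<=abc|abde)} is rejected there.

@example
import re

text = "foobar bazbar"

# The lookahead is tested where bar begins, so it always succeeds.
lookahead_hits = [m.start() for m in re.finditer(r"(?!foo)bar", text)]

# The lookbehind inspects the three characters before bar.
lookbehind_hits = [m.start() for m in re.finditer(r"(?<!foo)bar", text)]

print(lookahead_hits)
print(lookbehind_hits)
@end example

@noindent
The lookahead version reports both occurrences of @samp{bar}, while the
lookbehind version reports only the one preceded by @samp{baz}.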
3759
3760Several assertions (of any sort) may occur in succession.
3761For example,
3762
3763@example
3764(?<=\d@{3@})(?<!999)foo
3765@end example
3766
3767@noindent
3768matches @samp{foo} preceded by three digits that are not @samp{999}.
3769Notice that each of the assertions is applied independently
3770at the same point in the subject string. First there is a
3771check that the previous three characters are all digits, and
3772then there is a check that the same three characters are not
@samp{999}.  This pattern does not match @samp{foo} preceded by six
characters, the first three of which are digits and the last three
of which are not @samp{999}.  For example, it doesn't match
3776@samp{123abcfoo}. A pattern to do that is
3777
3778@example
3779(?<=\d@{3@}...)(?<!999)foo
3780@end example
3781
3782@noindent
3783This time the first assertion looks at the preceding six
3784characters, checking that the first three are digits, and
3785then the second assertion checks that the preceding three
3786characters are not @samp{999}.  Actually, assertions can be
3787nested in any combination, so one can write this as
3788
3789@example
3790(?<=\d@{3@}(?!999)...)foo
3791@end example
3792
3793or
3794
3795@example
3796(?<=\d@{3@}...(?<!999))foo
3797@end example
3798
3799@noindent
3800both of which might be considered more readable.
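
The effect of placing several assertions at the same point can be
checked with a small sketch.  It assumes Python's @code{re} module as a
stand-in for the @value{SSED} matcher and writes @code{\d\d\d} in place
of @code{\d@{3@}}, which is equivalent; the subject strings are
chosen only for illustration.

@example
import re

# \d\d\d stands in for the counted form with a count of 3.
adjacent = re.compile(r"(?<=\d\d\d)(?<!999)foo")
six_back = re.compile(r"(?<=\d\d\d...)(?<!999)foo")

adjacent_results = [adjacent.search(s) is not None
                    for s in ("123foo", "999foo", "123abcfoo")]
six_back_results = [six_back.search(s) is not None
                    for s in ("123abcfoo", "123999foo")]

print(adjacent_results)
print(six_back_results)
@end example

@noindent
The first pattern accepts only @samp{123foo}; the second accepts
@samp{123abcfoo} but rejects @samp{123999foo}, whose last three
characters before @samp{foo} are @samp{999}.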
3801
3802Assertion subpatterns are not capturing subpatterns, and may
3803not be repeated, because it makes no sense to assert the
3804same thing several times. If any kind of assertion contains
3805capturing subpatterns within it, these are counted for the
3806purposes of numbering the capturing subpatterns in the whole
3807pattern.  However, substring capturing is carried out only
3808for positive assertions, because it does not make sense for
3809negative assertions.
3810
3811Assertions count towards the maximum of 200 parenthesized
3812subpatterns.
3813
3814@node Non-backtracking subpatterns
3815@appendixsec Non-backtracking subpatterns
3816@cindex Perl-style regular expressions, non-backtracking subpatterns
3817
3818With both maximizing and minimizing repetition, failure of
3819what follows normally causes the repeated item to be evaluated
3820again to see if a different number of repeats allows the
3821rest of the pattern to match. Sometimes it is useful to
3822prevent this, either to change the nature of the match, or
to cause it to fail earlier than it otherwise might, when the
3824author of the pattern knows there is no point in carrying
3825on.
3826
3827Consider, for example, the pattern @code{\d+foo} when applied to
3828the subject line
3829
3830@example
3831123456bar
3832@end example
3833
3834After matching all 6 digits and then failing to match @samp{foo},
3835the normal action of the matcher is to try again with only 5
3836digits matching the @code{\d+} item, and then with 4, and so on,
3837before ultimately failing. Non-backtracking subpatterns
3838provide the means for specifying that once a portion of the
3839pattern has matched, it is not to be re-evaluated in this way,
3840so the matcher would give up immediately on failing to match
3841@samp{foo} the first time.  The notation is another kind of special
3842parenthesis, starting with @code{(?>} as in this example:
3843
3844@example
(?>\d+)foo
3846@end example
3847
3848This kind of parenthesis ``locks up'' the part of the pattern
3849it contains once it has matched, and a failure further into
3850the pattern is prevented from backtracking into it.
3851Backtracking past it to previous items, however, works as
3852normal.
3853
3854Non-backtracking subpatterns are not capturing subpatterns.  Simple
3855cases such as the above example can be thought of as a maximizing
3856repeat that must swallow everything it can.  So,
3857while both @code{\d+} and @code{\d+?} are prepared to adjust the number of
3858digits they match in order to make the rest of the pattern
3859match, @code{(?>\d+)} can only match an entire sequence of digits.
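
The following sketch shows how locking a repeat changes the nature of a
match.  It assumes Python's @code{re} module as a stand-in for the
@value{SSED} matcher.  Python 3.11 and later accept @code{(?>...)}
directly; to stay runnable on older versions the sketch instead uses a
well-known emulation, a capturing lookahead followed by a back
reference, which is a different construct with the same effect here.

@example
import re

# Ordinary repeat: \d+ gives back the final 5 so that [5] can match.
backtracking = re.fullmatch(r"\d+[5]", "12345") is not None

# Emulated non-backtracking group: the lookahead captures the longest
# digit run once, and \1 must then consume exactly that text.
locked = re.fullmatch(r"(?=(\d+))\1[5]", "12345") is not None

print(backtracking, locked)
@end example

@noindent
The ordinary pattern matches, whereas the locked one fails because the
digits, once swallowed, are never handed back.  The trailing digit is
written as @code{[5]} so that it is not read as part of the back
reference number.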
3860
3861This construction can of course contain arbitrarily complicated
3862subpatterns, and it can be nested.
3863
3864@cindex Perl-style regular expressions, lookbehind subpatterns
3865Non-backtracking subpatterns can be used in conjunction with look-behind
3866assertions to specify efficient matching at the end
3867of the subject string. Consider a simple pattern such as
3868
3869@example
3870abcd$
3871@end example
3872
3873@noindent
3874when applied to a long string which does not match.  Because
3875matching proceeds from left to right, @command{sed} will look for
3876each @samp{a} in the subject and then see if what follows matches
3877the rest of the pattern. If the pattern is specified as
3878
3879@example
3880^.*abcd$
3881@end example
3882
3883@noindent
3884the initial @code{.*} matches the entire string at first, but when
3885this fails (because there is no following @samp{a}), it backtracks
3886to match all but the last character, then all but the
3887last two characters, and so on. Once again the search for
3888@samp{a} covers the entire string, from right to left, so we are
3889no better off. However, if the pattern is written as
3890
3891@example
3892^(?>.*)(?<=abcd)
3893@end example
3894
there can be no backtracking for the @code{.*} item; it can match
3896only the entire string. The subsequent lookbehind assertion
3897does a single test on the last four characters. If it fails,
3898the match fails immediately. For long strings, this approach
3899makes a significant difference to the processing time.
3900
3901When a pattern contains an unlimited repeat inside a subpattern
3902that can itself be repeated an unlimited number of
3903times, the use of a once-only subpattern is the only way to
3904avoid some failing matches taking a very long time
3905indeed.@footnote{Actually, the matcher embedded in @value{SSED}
3906tries to do something for this in the simplest cases,
3907like @code{([^b]*b)*}.  These cases are actually quite
3908common: they happen for example in a regular expression
3909like @code{\/\*([^*]*\*)*\/} which matches C comments.}
3910
3911The pattern
3912
3913@example
3914(\D+|<\d+>)*[!?]
3915@end example
3916
3919@noindent
3920matches an unlimited number of substrings that either consist
of non-digits, or digits enclosed in angle brackets, followed by
3922an exclamation or question mark. When it matches, it runs quickly.
3923However, if it is applied to
3924
3925@example
3926aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
3927@end example
3928
3929@noindent
3930it takes a long time before reporting failure.  This is
3931because the string can be divided between the two repeats in
3932a large number of ways, and all have to be tried.@footnote{The
3933example used @code{[!?]} rather than a single character at the end,
3934because both @value{SSED} and Perl have an optimization that allows
3935for fast failure when a single character is used. They
3936remember the last single character that is required for a
3937match, and fail early if it is not present in the string.}
3938
3939If the pattern is changed to
3940
3941@example
3942((?>\D+)|<\d+>)*[!?]
3943@end example
3944
3945sequences of non-digits cannot be broken, and failure happens
3946quickly.
3947
3948@node Conditional subpatterns
3949@appendixsec Conditional subpatterns
3950@cindex Perl-style regular expressions, conditional subpatterns
3951
3952It is possible to cause the matching process to obey a subpattern
3953conditionally or to choose between two alternative
3954subpatterns, depending on the result of an assertion, or
3955whether a previous capturing subpattern matched or not. The
3956two possible forms of conditional subpattern are
3957
3958@example
3959(?(@var{condition})@var{yes-pattern})
3960(?(@var{condition})@var{yes-pattern}|@var{no-pattern})
3961@end example
3962
3963If the condition is satisfied, the yes-pattern is used; otherwise
3964the no-pattern (if present) is used. If there are more than two
3965alternatives in the subpattern, a compile-time error occurs.
3966
3967There are two kinds of condition. If the text between the
3968parentheses consists of a sequence of digits, the condition
3969is satisfied if the capturing subpattern of that number has
3970previously matched.  The number must be greater than zero.
3971Consider the following pattern, which contains non-significant
3972white space to make it more readable (assume the @code{X} modifier)
3973and to divide it into three parts for ease of discussion:
3974
3975@example
3976( \( )?   [^()]+   (?(1) \) )
3977@end example
3978
3979The first part matches an optional opening parenthesis, and
3980if that character is present, sets it as the first captured
3981substring. The second part matches one or more characters
3982that are not parentheses. The third part is a conditional
3983subpattern that tests whether the first set of parentheses
matched or not.  If they did, that is, if the subject started
3985with an opening parenthesis, the condition is true, and so
3986the yes-pattern is executed and a closing parenthesis is
3987required. Otherwise, since no-pattern is not present, the
3988subpattern matches nothing.  In other words, this pattern
3989matches a sequence of non-parentheses, optionally enclosed
3990in parentheses.
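
The parenthesis example can be exercised with the sketch below, which
assumes Python's @code{re} module as a stand-in for the @value{SSED}
matcher.  Python supports the numeric form of the condition; the
pattern is written without the insignificant white space.

@example
import re

# Group 1 is the optional opening parenthesis; (?(1)\)) demands a
# closing one only when group 1 took part in the match.
pattern = re.compile(r"(\()?[^()]+(?(1)\))")

results = [pattern.fullmatch(s) is not None
           for s in ("(abc)", "abc", "(abc", "abc)")]

print(results)
@end example

@noindent
Only the balanced subjects @samp{(abc)} and @samp{abc} are accepted in
full; the two with a single unmatched parenthesis are rejected.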
3991
3992@cindex Perl-style regular expressions, lookahead subpatterns
3993If the condition is not a sequence of digits, it must be an
3994assertion.  This may be a positive or negative lookahead or
3995lookbehind assertion. Consider this pattern, again containing
3996non-significant white space, and with the two alternatives
3997on the second line:
3998
3999@example
4000(?(?=...[a-z])
4001   \d\d-[a-z]@{3@}-\d\d |
4002   \d\d-\d\d-\d\d )
4003@end example
4004
4005The condition is a positive lookahead assertion that matches
4006a letter that is three characters away from the current point.
4007If a letter is found, the subject is matched against the first
4008alternative @samp{@var{dd}-@var{aaa}-@var{dd}} (where @var{aaa} are
4009letters and @var{dd} are digits); otherwise it is matched against
4010the second alternative, @samp{@var{dd}-@var{dd}-@var{dd}}.
4011
4012
4013@node Recursive patterns
4014@appendixsec Recursive patterns
4015@cindex Perl-style regular expressions, recursive patterns
4016@cindex Perl-style regular expressions, recursion
4017
4018Consider the problem of matching a string in parentheses,
4019allowing for unlimited nested parentheses. Without the use
4020of recursion, the best that can be done is to use a pattern
4021that matches up to some fixed depth of nesting. It is not
4022possible to handle an arbitrary nesting depth. Perl 5.6 has
4023provided an experimental facility that allows regular
4024expressions to recurse (amongst other things). It does this
4025by interpolating Perl code in the expression at run time,
and the code can refer to the expression itself. A Perl pattern
to solve the parentheses problem can be created like
this:
4029
4030@example
4031$re = qr@{\( (?: (?>[^()]+) | (?p@{$re@}) )* \)@}x;
4032@end example
4033
4034The @code{(?p@{...@})} item interpolates Perl code at run time,
4035and in this case refers recursively to the pattern in which it
4036appears. Obviously, @command{sed} cannot support the interpolation of
4037Perl code.  Instead, the special item @code{(?R)} is provided for
4038the specific case of recursion. This pattern solves the
4039parentheses problem (assume the @code{X} modifier option is used
4040so that white space is ignored):
4041
4042@example
4043\( ( (?>[^()]+) | (?R) )* \)
4044@end example
4045
4046First it matches an opening parenthesis. Then it matches any
4047number of substrings which can either be a sequence of
4048non-parentheses, or a recursive match of the pattern itself
4049(i.e. a correctly parenthesized substring). Finally there is
4050a closing parenthesis.
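
The three steps just described can also be written out by hand.
Python's @code{re} module has no @code{(?R)} item, so the sketch below
is not a regular expression at all: it is a small recursive function,
with the hypothetical name @code{match_parens}, that mirrors the
structure of the pattern purely for illustration.

@example
def match_parens(text, pos=0):
    # Return the index just past the matching ")" or -1 on failure.
    if pos >= len(text) or text[pos] != "(":
        return -1
    pos += 1
    while pos < len(text):
        ch = text[pos]
        if ch == ")":
            return pos + 1
        if ch == "(":
            # Recursive step: a nested, correctly parenthesized substring.
            pos = match_parens(text, pos)
            if pos < 0:
                return -1
        else:
            # A character that is not a parenthesis.
            pos += 1
    return -1

balanced_end = match_parens("(ab(cd)ef)")
unbalanced_end = match_parens("(aaa()")

print(balanced_end, unbalanced_end)
@end example

@noindent
The balanced subject is consumed to its end, while the one that lacks a
final closing parenthesis is reported as a failure.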
4051
4052This particular example pattern contains nested unlimited
4053repeats, and so the use of a non-backtracking subpattern for
4054matching strings of non-parentheses is important when applying
4055the pattern to strings that do not match. For example, when
4056it is applied to
4057
4058@example
4059(aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa()
4060@end example
4061
it yields a ``no match'' response quickly. However, if a
non-backtracking subpattern is not used, the match runs
4064for a very long time indeed because there are so many different
4065ways the @code{+} and @code{*} repeats can carve up the subject,
4066and all have to be tested before failure can be reported.
4067
4068The values set for any capturing subpatterns are those from
4069the outermost level of the recursion at which the subpattern
4070value is set. If the pattern above is matched against
4071
4072@example
4073(ab(cd)ef)
4074@end example
4075
4076@noindent
4077the value for the capturing parentheses is @samp{ef}, which is
4078the last value taken on at the top level.
4079
4080@node Comments
4081@appendixsec Comments
4082@cindex Perl-style regular expressions, comments
4083
The sequence @code{(?#} marks the start of a comment which continues
up to the next closing parenthesis. Nested parentheses
4086are not permitted. The characters that make up a comment
4087play no part in the pattern matching at all.
4088
4089@cindex Perl-style regular expressions, extended
4090If the @code{X} modifier option is used, an unescaped @code{#} character
4091outside a character class introduces a comment that continues
4092up to the next newline character in the pattern.
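
Both comment styles can be tried with the sketch below.  It assumes
Python's @code{re} module as a stand-in for the @value{SSED} matcher,
with Python's @code{re.VERBOSE} flag playing the role of the @code{X}
modifier; the comment texts and the subject are invented for the
illustration.

@example
import re

# The (?#...) text is discarded when the pattern is compiled.
inline = re.fullmatch(r"\d+(?#day)-\d+(?#month)", "31-12") is not None

# With the extended option, whitespace and # comments are ignored.
extended = re.compile(r"""
    \d+   # day
    -
    \d+   # month
""", re.VERBOSE)
verbose = extended.fullmatch("31-12") is not None

print(inline, verbose)
@end example

@noindent
Both forms match the same subject, since the comments play no part in
the matching.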
4093@end ifset
4094
4095
4096@page
4097@node Concept Index
4098@unnumbered Concept Index
4099
4100This is a general index of all issues discussed in this manual, with the
4101exception of the @command{sed} commands and command-line options.
4102
4103@printindex cp
4104
4105@page
4106@node Command and Option Index
4107@unnumbered Command and Option Index
4108
4109This is an alphabetical list of all @command{sed} commands and command-line
4110options.
4111
4112@printindex fn
4113
4114@contents
4115@bye
4116
4117@c XXX FIXME: the term "cycle" is never defined...
4118