\input texinfo  @c -*-texinfo-*-
@c Do not edit this file!! It is automatically generated from sed-in.texi.
@c
@c -- Stuff that needs adding: ----------------------------------------------
@c (document the `;' command-separator)
@c --------------------------------------------------------------------------
@c Check for consistency: regexps in @code, text that they match in @samp.
@c
@c Tips:
@c    @command for command
@c    @samp for command fragments: @samp{cat -s}
@c    @code for sed commands and flags
@c    Use ``quote'' not `quote' or "quote".
@c
@c %**start of header
@setfilename sed.info
@settitle sed, a stream editor
@c %**end of header

@c @smallbook

@include version.texi

@c Combine indices.
@syncodeindex ky cp
@syncodeindex pg cp
@syncodeindex tp cp

@defcodeindex op
@syncodeindex op fn

@include config.texi

@copying
This file documents version @value{VERSION} of
@value{SSED}, a stream editor.

Copyright @copyright{} 1998, 1999, 2001, 2002, 2003, 2004 Free
Software Foundation, Inc.

This document is released under the terms of the @acronym{GNU} Free
Documentation License as published by the Free Software Foundation;
either version 1.1, or (at your option) any later version.

You should have received a copy of the @acronym{GNU} Free Documentation
License along with @value{SSED}; see the file @file{COPYING.DOC}.
If not, write to the Free Software Foundation, 51 Franklin Street,
Fifth Floor, Boston, MA 02110-1301, USA.

There are no Cover Texts and no Invariant Sections; this text, along
with its equivalent in the printed manual, constitutes the Title Page.
@end copying

@setchapternewpage off

@titlepage
@title @command{sed}, a stream editor
@subtitle version @value{VERSION}, @value{UPDATED}
@author by Ken Pizzini, Paolo Bonzini

@page
@vskip 0pt plus 1filll
Copyright @copyright{} 1998, 1999 Free Software Foundation, Inc.

@insertcopying

Published by the Free Software Foundation, @*
51 Franklin Street, Fifth Floor @*
Boston, MA 02110-1301, USA
@end titlepage


@node Top
@top

@ifnottex
@insertcopying
@end ifnottex

@menu
* Introduction::               Introduction
* Invoking sed::               Invocation
* sed Programs::               @command{sed} programs
* Examples::                   Some sample scripts
* Limitations::                Limitations and (non-)limitations of @value{SSED}
* Other Resources::            Other resources for learning about @command{sed}
* Reporting Bugs::             Reporting bugs

* Extended regexps::           @command{egrep}-style regular expressions
@ifset PERL
* Perl regexps::               Perl-style regular expressions
@end ifset

* Concept Index::              A menu with all the topics in this manual.
* Command and Option Index::   A menu with all @command{sed} commands and
                               command-line options.

@detailmenu
--- The detailed node listing ---

sed Programs:
* Execution Cycle::                 How @command{sed} works
* Addresses::                       Selecting lines with @command{sed}
* Regular Expressions::             Overview of regular expression syntax
* Common Commands::                 Often used commands
* The "s" Command::                 @command{sed}'s Swiss Army Knife
* Other Commands::                  Less frequently used commands
* Programming Commands::            Commands for @command{sed} gurus
* Extended Commands::               Commands specific to @value{SSED}
* Escapes::                         Specifying special characters

Examples:
* Centering lines::
* Increment a number::
* Rename files to lower case::
* Print bash environment::
* Reverse chars of lines::
* tac::                             Reverse lines of files
* cat -n::                          Numbering lines
* cat -b::                          Numbering non-blank lines
* wc -c::                           Counting chars
* wc -w::                           Counting words
* wc -l::                           Counting lines
* head::                            Printing the first lines
* tail::                            Printing the last lines
* uniq::                            Make duplicate lines unique
* uniq -d::                         Print duplicated lines of input
* uniq -u::                         Remove all duplicated lines
* cat -s::                          Squeezing blank lines

@ifset PERL
Perl regexps:
* Backslash::                       Introduces special sequences
* Circumflex/dollar sign/period::   Behave specially with regard to new lines
* Square brackets::                 Are a bit different in strange cases
* Options setting::                 Toggle modifiers in the middle of a regexp
* Non-capturing subpatterns::       Are not counted when backreferencing
* Repetition::                      Allows for non-greedy matching
* Backreferences::                  Allows for more than 10 back references
* Assertions::                      Allows for complex look ahead matches
* Non-backtracking subpatterns::    Often gives more performance
* Conditional subpatterns::         Allows if/then/else branches
* Recursive patterns::              For example to match parentheses
* Comments::                        Because things can get complex...
@end ifset

@end detailmenu
@end menu


@node Introduction
@chapter Introduction

@cindex Stream editor
@command{sed} is a stream editor.
A stream editor is used to perform basic text
transformations on an input stream
(a file or input from a pipeline).
While in some ways similar to an editor which
permits scripted edits (such as @command{ed}),
@command{sed} works by making only one pass over the
input(s), and is consequently more efficient.
But it is @command{sed}'s ability to filter text in a pipeline
which particularly distinguishes it from other types of
editors.


@node Invoking sed
@chapter Invocation

Normally @command{sed} is invoked like this:

@example
sed SCRIPT INPUTFILE...
@end example

The full format for invoking @command{sed} is:

@example
sed OPTIONS... [SCRIPT] [INPUTFILE...]
@end example

If you do not specify @var{INPUTFILE}, or if @var{INPUTFILE} is @file{-},
@command{sed} filters the contents of the standard input.  The @var{script}
is actually the first non-option parameter, which @command{sed} specially
considers a script and not an input file if (and only if) none of the
other @var{options} specifies a script to be executed, that is if neither
of the @option{-e} and @option{-f} options is specified.

@command{sed} may be invoked with the following command-line options:

@table @code
@item --version
@opindex --version
@cindex Version, printing
Print out the version of @command{sed} that is being run and a copyright notice,
then exit.

@item --help
@opindex --help
@cindex Usage summary, printing
Print a usage message briefly summarizing these command-line options
and the bug-reporting address,
then exit.

@item -n
@itemx --quiet
@itemx --silent
@opindex -n
@opindex --quiet
@opindex --silent
@cindex Disabling autoprint, from command line
By default, @command{sed} prints out the pattern space
at the end of each cycle through the script (@pxref{Execution Cycle, ,
How @code{sed} works}).
These options disable this automatic printing,
and @command{sed} only produces output when explicitly told to
via the @code{p} command.

@item -e @var{script}
@itemx --expression=@var{script}
@opindex -e
@opindex --expression
@cindex Script, from command line
Add the commands in @var{script} to the set of commands to be
run while processing the input.

@item -f @var{script-file}
@itemx --file=@var{script-file}
@opindex -f
@opindex --file
@cindex Script, from a file
Add the commands contained in the file @var{script-file}
to the set of commands to be run while processing the input.

@item -i[@var{SUFFIX}]
@itemx --in-place[=@var{SUFFIX}]
@opindex -i
@opindex --in-place
@cindex In-place editing, activating
@cindex @value{SSEDEXT}, in-place editing
This option specifies that files are to be edited in-place.
@value{SSED} does this by creating a temporary file and
sending output to this file rather than to the standard
output.@footnote{This applies to commands such as @code{=},
@code{a}, @code{c}, @code{i}, @code{l}, @code{p}.  You can
still write to the standard output by using the @code{w}
@cindex @value{SSEDEXT}, @file{/dev/stdout} file
or @code{W} commands together with the @file{/dev/stdout}
special file.}

This option implies @option{-s}.

When the end of the file is reached, the temporary file is
renamed to the output file's original name.  The extension,
if supplied, is used to modify the name of the old file
before renaming the temporary file, thereby making a backup
copy@footnote{Note that @value{SSED} creates the backup
file whether or not any output is actually changed.}.

@cindex In-place editing, Perl-style backup file names
This rule is followed: if the extension doesn't contain a @code{*},
then it is appended to the end of the current filename as a
suffix; if the extension does contain one or more @code{*}
characters, then @emph{each} asterisk is replaced with the
current filename.  This allows you to add a prefix to the
backup file, instead of (or in addition to) a suffix, or
even to place backup copies of the original files into another
directory (provided the directory already exists).

If no extension is supplied, the original file is
overwritten without making a backup.

@item -l @var{N}
@itemx --line-length=@var{N}
@opindex -l
@opindex --line-length
@cindex Line length, setting
Specify the default line-wrap length for the @code{l} command.
A length of 0 (zero) means to never wrap long lines.  If
not specified, it is taken to be 70.

@item --posix
@cindex @value{SSEDEXT}, disabling
@value{SSED} includes several extensions to @acronym{POSIX}
@command{sed}.  In order to simplify writing portable scripts, this
option disables all the extensions that this manual documents,
including additional commands.
@cindex @code{POSIXLY_CORRECT} behavior, enabling
Most of the extensions accept @command{sed} programs that
are outside the syntax mandated by @acronym{POSIX}, but some
of them (such as the behavior of the @command{N} command
described in @ref{Reporting Bugs}) actually violate the
standard.  If you want to disable only the latter kind of
extension, you can set the @code{POSIXLY_CORRECT} variable
to a non-empty value.

@item -b
@itemx --binary
@opindex -b
@opindex --binary
This option is available on every platform, but is only effective where the
operating system makes a distinction between text files and binary files.
When such a distinction is made---as is the case for MS-DOS, Windows,
and Cygwin---text files are composed of lines separated by a carriage return
@emph{and} a line feed character, and @command{sed} does not see the
ending CR.  When this option is specified, @command{sed} will open
input files in binary mode, thus not requesting this special processing
and considering lines to end at a line feed.

@item --follow-symlinks
@opindex --follow-symlinks
This option is available only on platforms that support
symbolic links and has an effect only if option @option{-i}
is specified.  In this case, if the file that is specified
on the command line is a symbolic link, @command{sed} will
follow the link and edit the ultimate destination of the
link.  The default behavior is to break the symbolic link,
so that the link destination will not be modified.

@item -r
@itemx --regexp-extended
@opindex -r
@opindex --regexp-extended
@cindex Extended regular expressions, choosing
@cindex @acronym{GNU} extensions, extended regular expressions
Use extended regular expressions rather than basic
regular expressions.  Extended regexps are those that
@command{egrep} accepts; they can be clearer because they
usually have fewer backslashes, but are a @acronym{GNU} extension
and hence scripts that use them are not portable.
@xref{Extended regexps, , Extended regular expressions}.

@ifset PERL
@item -R
@itemx --regexp-perl
@opindex -R
@opindex --regexp-perl
@cindex Perl-style regular expressions, choosing
@cindex @value{SSEDEXT}, Perl-style regular expressions
Use Perl-style regular expressions rather than basic
regular expressions.  Perl-style regexps are extremely
powerful but are a @value{SSED} extension and hence scripts that
use them are not portable.  @xref{Perl regexps, ,
Perl-style regular expressions}.
@end ifset

@item -s
@itemx --separate
@opindex -s
@opindex --separate
@cindex Working on separate files
By default, @command{sed} will consider the files specified on the
command line as a single continuous long stream.  This @value{SSED}
extension allows the user to consider them as separate files:
range addresses (such as @samp{/abc/,/def/}) are not allowed
to span several files, line numbers are relative to the start
of each file, @code{$} refers to the last line of each file,
and files invoked from the @code{R} commands are rewound at the
start of each file.

@item -u
@itemx --unbuffered
@opindex -u
@opindex --unbuffered
@cindex Unbuffered I/O, choosing
Buffer both input and output as minimally as practical.
(This is particularly useful if the input is coming from
the likes of @samp{tail -f}, and you wish to see the transformed
output as soon as possible.)

@end table
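
As an illustration of the backup-name rule described under @option{-i},
here is a short shell session.  It is only a sketch: it assumes
@value{SSED}, a POSIX shell and @command{mktemp}, and the file names
@file{f.txt} and @file{g.txt} are made up for the example.

@example
# Work in a scratch directory so that no real file is touched.
dir=$(mktemp -d) && cd "$dir"
printf 'hello\n' > f.txt
printf 'hello\n' > g.txt

# Extension without an asterisk: it is appended as a suffix.
sed -i.bak 's/hello/goodbye/' f.txt
cat f.txt        # prints: goodbye
cat f.txt.bak    # prints: hello

# Extension with an asterisk: the asterisk becomes the file name.
sed -i'old_*' 's/hello/goodbye/' g.txt
cat g.txt        # prints: goodbye
cat old_g.txt    # prints: hello

# Clean up the scratch directory.
cd / && rm -r "$dir"
@end example

The extension is written directly after @option{-i}, with no space,
because the option's argument is optional.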

If no @option{-e}, @option{-f}, @option{--expression}, or @option{--file}
options are given on the command-line,
then the first non-option argument on the command line is
taken to be the @var{script} to be executed.
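
For instance, the commands below supply the same script first as a
non-option argument and then through @option{-e}.  The sample input is
arbitrary and only serves the illustration.

@example
# The script is the first non-option argument.
printf 'one\ntwo\n' | sed -n '2p'        # prints: two

# The script is given with -e; a non-option argument would now
# be read as an input file name.
printf 'one\ntwo\n' | sed -n -e '2p'     # prints: two

# Several -e options are joined, in order, into a single script.
# Lines are still processed in input order.
printf 'one\ntwo\n' | sed -n -e '2p' -e '1p'
# prints: one
#         two
@end example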

@cindex Files to be processed as input
If any command-line parameters remain after processing the above,
these parameters are interpreted as the names of input files to
be processed.
@cindex Standard input, processing as input
A file name of @samp{-} refers to the standard input stream.
The standard input will be processed if no file names are specified.
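
The following sketch shows both ways of reading the standard input.
It assumes a POSIX shell with @command{mktemp}; the temporary file
exists only for the demonstration.

@example
file=$(mktemp)
printf 'from a file\n' > "$file"

# No file name: the standard input is processed.
printf 'from stdin\n' | sed 's/from/read from/'
# prints: read from stdin

# A file name followed by -: the file first, then standard input.
printf 'from stdin\n' | sed 's/from/read from/' "$file" -
# prints: read from a file
#         read from stdin

rm "$file"
@end example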


@node sed Programs
@chapter @command{sed} Programs

@cindex @command{sed} program structure
@cindex Script structure
A @command{sed} program consists of one or more @command{sed} commands,
passed in by one or more of the
@option{-e}, @option{-f}, @option{--expression}, and @option{--file}
options, or the first non-option argument if none of these
options is used.
This document will refer to ``the'' @command{sed} script;
this is understood to mean the in-order concatenation
of all of the @var{script}s and @var{script-file}s passed in.

Each @code{sed} command consists of an optional address or
address range, followed by a one-character command name
and any additional command-specific code.

@menu
* Execution Cycle::          How @command{sed} works
* Addresses::                Selecting lines with @command{sed}
* Regular Expressions::      Overview of regular expression syntax
* Common Commands::          Often used commands
* The "s" Command::          @command{sed}'s Swiss Army Knife
* Other Commands::           Less frequently used commands
* Programming Commands::     Commands for @command{sed} gurus
* Extended Commands::        Commands specific to @value{SSED}
* Escapes::                  Specifying special characters
@end menu


@node Execution Cycle
@section How @command{sed} Works

@cindex Buffer spaces, pattern and hold
@cindex Spaces, pattern and hold
@cindex Pattern space, definition
@cindex Hold space, definition
@command{sed} maintains two data buffers: the active @emph{pattern} space,
and the auxiliary @emph{hold} space. Both are initially empty.

@command{sed} operates by performing the following cycle on each
line of input: first, @command{sed} reads one line from the input
stream, removes any trailing newline, and places it in the pattern space.
Then commands are executed; each command can have an address associated
with it: addresses are a kind of condition code, and a command is only
executed if the condition is verified before the command is to be
executed.

When the end of the script is reached, unless the @option{-n} option
is in use, the contents of pattern space are printed out to the output
stream, adding back the trailing newline if it was removed.@footnote{Actually,
if @command{sed} prints a line without the terminating newline, it will
nevertheless print the missing newline as soon as more text is sent to
the same output stream, which gives the ``least expected surprise''
even though it does not make commands like @samp{sed -n p} exactly
identical to @command{cat}.} Then the next cycle starts for the next
input line.

Unless special commands (like @samp{D}) are used, the pattern space is
deleted between two cycles. The hold space, on the other hand, keeps
its data between cycles (see commands @samp{h}, @samp{H}, @samp{x},
@samp{g}, @samp{G} to move data between both buffers).
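
The cycle and the two buffers can be observed with a few one-liners.
This is a minimal sketch that assumes @value{SSED}, which accepts
@samp{;} between commands; the input text is arbitrary.

@example
# Autoprint plus an explicit p prints each line twice.
printf 'a\n' | sed p
# prints: a
#         a

# With -n only the explicit p produces output.
printf 'a\n' | sed -n p       # prints: a

# The hold space survives between cycles: G appends it to the
# pattern space (except on line 1), h saves the result, and the
# accumulated text is printed on the last line.
printf 'one\ntwo\nthree\n' | sed -n '1!G;h;$p'
# prints: three
#         two
#         one
@end example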


@node Addresses
@section Selecting lines with @command{sed}
@cindex Addresses, in @command{sed} scripts
@cindex Line selection
@cindex Selecting lines to process

Addresses in a @command{sed} script can be in any of the following forms:
@table @code
@item @var{number}
@cindex Address, numeric
@cindex Line, selecting by number
Specifying a line number will match only that line in the input.
(Note that @command{sed} counts lines continuously across all input files
unless @option{-i} or @option{-s} options are specified.)

@item @var{first}~@var{step}
@cindex @acronym{GNU} extensions, @samp{@var{n}~@var{m}} addresses
This @acronym{GNU} extension matches every @var{step}th line
starting with line @var{first}.
In particular, lines will be selected when there exists
a non-negative @var{n} such that the current line-number equals
@var{first} + (@var{n} * @var{step}).
Thus, to select the odd-numbered lines,
one would use @code{1~2};
to pick every third line starting with the second, @samp{2~3} would be used;
to pick every fifth line starting with the tenth, use @samp{10~5};
and @samp{50~0} is just an obscure way of saying @code{50}.

@item $
@cindex Address, last line
@cindex Last line, selecting
@cindex Line, selecting last
This address matches the last line of the last file of input, or
the last line of each file when the @option{-i} or @option{-s} options
are specified.

@item /@var{regexp}/
@cindex Address, as a regular expression
@cindex Line, selecting by regular expression match
This will select any line which matches the regular expression @var{regexp}.
If @var{regexp} itself includes any @code{/} characters,
each must be escaped by a backslash (@code{\}).

@cindex empty regular expression
@cindex @value{SSEDEXT}, modifiers and the empty regular expression
The empty regular expression @samp{//} repeats the last regular
expression match (the same holds if the empty regular expression is
passed to the @code{s} command).  Note that modifiers to regular expressions
are evaluated when the regular expression is compiled, thus it is invalid to
specify them together with the empty regular expression.

@item \%@var{regexp}%
(The @code{%} may be replaced by any other single character.)

@cindex Slash character, in regular expressions
This also matches the regular expression @var{regexp},
but allows one to use a different delimiter than @code{/}.
This is particularly useful if the @var{regexp} itself contains
a lot of slashes, since it avoids the tedious escaping of every @code{/}.
If @var{regexp} itself includes any delimiter characters,
each must be escaped by a backslash (@code{\}).

@item /@var{regexp}/I
@itemx \%@var{regexp}%I
@cindex @acronym{GNU} extensions, @code{I} modifier
@ifset PERL
@cindex Perl-style regular expressions, case-insensitive
@end ifset
The @code{I} modifier to regular-expression matching is a @acronym{GNU}
extension which causes the @var{regexp} to be matched in
a case-insensitive manner.

@item /@var{regexp}/M
@itemx \%@var{regexp}%M
@ifset PERL
@cindex @value{SSEDEXT}, @code{M} modifier
@end ifset
@cindex Perl-style regular expressions, multiline
The @code{M} modifier to regular-expression matching is a @value{SSED}
extension which causes @code{^} and @code{$} to match respectively
(in addition to the normal behavior) the empty string after a newline,
and the empty string before a newline.  There are special character
sequences
@ifset PERL
(@code{\A} and @code{\Z} in Perl mode, @code{\`} and @code{\'}
in basic or extended regular expression modes)
@end ifset
@ifclear PERL
(@code{\`} and @code{\'})
@end ifclear
which always match the beginning or the end of the buffer.
@code{M} stands for @cite{multi-line}.

@ifset PERL
@item /@var{regexp}/S
@itemx \%@var{regexp}%S
@cindex @value{SSEDEXT}, @code{S} modifier
@cindex Perl-style regular expressions, single line
The @code{S} modifier to regular-expression matching is only valid
in Perl mode and specifies that the dot character (@code{.}) will
match the newline character too.  @code{S} stands for @cite{single-line}.
@end ifset

@ifset PERL
@item /@var{regexp}/X
@itemx \%@var{regexp}%X
@cindex @value{SSEDEXT}, @code{X} modifier
@cindex Perl-style regular expressions, extended
The @code{X} modifier to regular-expression matching is also
valid in Perl mode only.  If it is used, whitespace in the
pattern (other than in a character class) and
characters between a @kbd{#} outside a character class and the
next newline character are ignored. An escaping backslash
can be used to include a whitespace or @kbd{#} character as part
of the pattern.
@end ifset
@end table

If no addresses are given, then all lines are matched;
if one address is given, then only lines matching that
address are matched.
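
The single-address forms listed above can be tried as follows.  This
is a sketch that assumes @value{SSED} (for the @samp{~} and @code{I}
extensions) and the @command{seq} utility, which is used here only to
generate numbered input lines.

@example
seq 10 | sed -n '4p'       # prints: 4
seq 10 | sed -n '1~3p'     # prints: 1 4 7 10, one per line
seq 10 | sed -n '$p'       # prints: 10

# Regular expression addresses, with and without the I modifier.
printf 'Foo\nbar\nfoo\n' | sed -n '/foo/p'     # prints: foo
printf 'Foo\nbar\nfoo\n' | sed -n '/foo/Ip'
# prints: Foo
#         foo

# A different delimiter avoids escaping every slash.
printf '/usr/bin\n/tmp\n' | sed -n '\%^/usr/%p'   # prints: /usr/bin
@end example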

@cindex Range of lines
@cindex Several lines, selecting
An address range can be specified by specifying two addresses
separated by a comma (@code{,}).  An address range matches lines
starting from where the first address matches, and continues
until the second address matches (inclusively).

If the second address is a @var{regexp}, then checking for the
ending match will start with the line @emph{following} the
line which matched the first address: a range will always
span at least two lines (except of course if the input stream
ends).

If the second address is a @var{number} less than (or equal to)
the line matching the first address, then only the one line is
matched.
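
The three rules above can be checked with short commands.  This is a
sketch; @command{seq} is assumed to be available and merely supplies
numbered lines.

@example
# An ordinary range, inclusive at both ends.
seq 10 | sed -n '3,5p'     # prints: 3 4 5, one per line

# The second address is a number below the first: one line only.
seq 10 | sed -n '5,2p'     # prints: 5

# The end regexp is first tried on the line after the start,
# so this range covers lines 1 and 2, not line 1 alone.
printf 'a\na\nb\n' | sed -n '/a/,/a/p'
# prints: a
#         a
@end example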

@cindex Special addressing forms
@cindex Range with start address of zero
@cindex Zero, as range start address
@cindex @var{addr1},+N
@cindex @var{addr1},~N
@cindex @acronym{GNU} extensions, special two-address forms
@cindex @acronym{GNU} extensions, @code{0} address
@cindex @acronym{GNU} extensions, 0,@var{addr2} addressing
@cindex @acronym{GNU} extensions, @var{addr1},+@var{N} addressing
@cindex @acronym{GNU} extensions, @var{addr1},~@var{N} addressing
@value{SSED} also supports some special two-address forms; all these
are @acronym{GNU} extensions:
@table @code
@item 0,/@var{regexp}/
A line number of @code{0} can be used in an address specification like
@code{0,/@var{regexp}/} so that @command{sed} will try to match
@var{regexp} in the first input line too.  In other words,
@code{0,/@var{regexp}/} is similar to @code{1,/@var{regexp}/},
except that if @var{addr2} matches the very first line of input the
@code{0,/@var{regexp}/} form will consider it to end the range, whereas
the @code{1,/@var{regexp}/} form will match the beginning of its range and
hence make the range span up to the @emph{second} occurrence of the
regular expression.

Note that this is the only place where the @code{0} address makes
sense; there is no 0-th line and commands which are given the @code{0}
address in any other way will give an error.

@item @var{addr1},+@var{N}
Matches @var{addr1} and the @var{N} lines following @var{addr1}.

@item @var{addr1},~@var{N}
Matches @var{addr1} and the lines following @var{addr1}
until the next line whose input line number is a multiple of @var{N}.
@end table
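
A brief sketch of the three special forms follows.  It assumes
@value{SSED}, since all of them are extensions, and uses
@command{seq} only to produce numbered lines.

@example
# 0,/x/ may end on line 1; 1,/x/ looks for the end from line 2 on.
printf 'x\ny\nx\nz\n' | sed -n '0,/x/p'   # prints: x
printf 'x\ny\nx\nz\n' | sed -n '1,/x/p'   # prints: x y x, one per line

# Line 2 and the two lines that follow it.
seq 10 | sed -n '2,+2p'    # prints: 2 3 4, one per line

# From line 5 up to the next line number that is a multiple of 4.
seq 10 | sed -n '5,~4p'    # prints: 5 6 7 8, one per line
@end example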

@cindex Excluding lines
@cindex Selecting non-matching lines
Appending the @code{!} character to the end of an address
specification negates the sense of the match.
That is, if the @code{!} character follows an address range,
then only lines which do @emph{not} match the address range
will be selected.
This also works for singleton addresses,
and, perhaps perversely, for the null address.
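
For example (a sketch; @command{seq} is assumed to be available and
only supplies numbered lines):

@example
# Print every line outside the range 2,4.
seq 5 | sed -n '2,4!p'     # prints: 1 5, one per line

# Delete every line except line 3.
seq 5 | sed '3!d'          # prints: 3
@end example

The scripts are enclosed in single quotes so that the shell leaves
the @code{!} character alone.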


@node Regular Expressions
@section Overview of Regular Expression Syntax

To know how to use @command{sed}, people should understand regular
expressions (@dfn{regexp} for short).  A regular expression
is a pattern that is matched against a
subject string from left to right.  Most characters are
@dfn{ordinary}: they stand for
themselves in a pattern, and match the corresponding characters
in the subject.  As a trivial example, the pattern

@example
The quick brown fox
@end example

@noindent
matches a portion of a subject string that is identical to
itself.  The power of regular expressions comes from the
ability to include alternatives and repetitions in the pattern.
These are encoded in the pattern by the use of @dfn{special characters},
which do not stand for themselves but instead
are interpreted in some special way.  Here is a brief description
of regular expression syntax as used in @command{sed}.

@table @code
@item @var{char}
A single ordinary character matches itself.

@item *
@cindex @acronym{GNU} extensions, to basic regular expressions
Matches a sequence of zero or more instances of matches for the
preceding regular expression, which must be an ordinary character, a
special character preceded by @code{\}, a @code{.}, a grouped regexp
(see below), or a bracket expression.  As a @acronym{GNU} extension, a
postfixed regular expression can also be followed by @code{*}; for
example, @code{a**} is equivalent to @code{a*}.  @acronym{POSIX}
1003.1-2001 says that @code{*} stands for itself when it appears at
the start of a regular expression or subexpression, but many
non-@acronym{GNU} implementations do not support this and portable
scripts should instead use @code{\*} in these contexts.

@item \+
@cindex @acronym{GNU} extensions, to basic regular expressions
As @code{*}, but matches one or more.  It is a @acronym{GNU} extension.

@item \?
@cindex @acronym{GNU} extensions, to basic regular expressions
As @code{*}, but only matches zero or one.  It is a @acronym{GNU} extension.

@item \@{@var{i}\@}
As @code{*}, but matches exactly @var{i} sequences (@var{i} is a
decimal integer; for portability, keep it between 0 and 255
inclusive).

@item \@{@var{i},@var{j}\@}
Matches between @var{i} and @var{j}, inclusive, sequences.

@item \@{@var{i},\@}
Matches more than or equal to @var{i} sequences.

@item \(@var{regexp}\)
Groups the inner @var{regexp} as a whole; this is used to:

@itemize @bullet
@item
@cindex @acronym{GNU} extensions, to basic regular expressions
Apply postfix operators, like @code{\(abcd\)*}:
this will search for zero or more whole sequences
of @samp{abcd}, while @code{abcd*} would search
for @samp{abc} followed by zero or more occurrences
of @samp{d}.  Note that support for @code{\(abcd\)*} is
required by @acronym{POSIX} 1003.1-2001, but many non-@acronym{GNU}
implementations do not support it and hence it is not universally
portable.

@item
Use back references (see below).
@end itemize

@item .
Matches any character, including newline.

@item ^
Matches the null string at beginning of the pattern space, i.e., what
appears after the circumflex must appear at the beginning of the
pattern space.

In most scripts, pattern space is initialized to the content of each
line (@pxref{Execution Cycle, , How @code{sed} works}).  So, it is a
useful simplification to think of @code{^#include} as matching only
lines where @samp{#include} is the first thing on the line---if there are
spaces before, for example, the match fails.  This simplification is
valid as long as the original content of pattern space is not modified,
for example with an @code{s} command.

@code{^} acts as a special character only at the beginning of the
regular expression or subexpression (that is, after @code{\(} or
@code{\|}).  Portable scripts should avoid @code{^} at the beginning of
a subexpression, though, as @acronym{POSIX} allows implementations that
treat @code{^} as an ordinary character in that context.

@item $
It is the same as @code{^}, but refers to end of pattern space.
@code{$} also acts as a special character only at the end
of the regular expression or subexpression (that is, before @code{\)}
or @code{\|}), and its use at the end of a subexpression is not
portable.


@item [@var{list}]
747@itemx [^@var{list}]
748Matches any single character in @var{list}: for example,
749@code{[aeiou]} matches all vowels.  A list may include
750sequences like @code{@var{char1}-@var{char2}}, which
751matches any character between (inclusive) @var{char1}
752and @var{char2}.
753
754A leading @code{^} reverses the meaning of @var{list}, so that
755it matches any single character @emph{not} in @var{list}.  To include
756@code{]} in the list, make it the first character (after
757the @code{^} if needed), to include @code{-} in the list,
758make it the first or last; to include @code{^} put
759it after the first character.
760
761@cindex @code{POSIXLY_CORRECT} behavior, bracket expressions
762The characters @code{$}, @code{*}, @code{.}, @code{[}, and @code{\}
763are normally not special within @var{list}.  For example, @code{[\*]}
764matches either @samp{\} or @samp{*}, because the @code{\} is not
765special here.  However, strings like @code{[.ch.]}, @code{[=a=]}, and
766@code{[:space:]} are special within @var{list} and represent collating
767symbols, equivalence classes, and character classes, respectively, and
768@code{[} is therefore special within @var{list} when it is followed by
769@code{.}, @code{=}, or @code{:}.  Also, when not in
770@env{POSIXLY_CORRECT} mode, special escapes like @code{\n} and
771@code{\t} are recognized within @var{list}.  @xref{Escapes}.
772
773@item @var{regexp1}\|@var{regexp2}
774@cindex @acronym{GNU} extensions, to basic regular expressions
775Matches either @var{regexp1} or @var{regexp2}.  Use
776parentheses to use complex alternative regular expressions.
777The matching process tries each alternative in turn, from
778left to right, and the first one that succeeds is used.
779It is a @acronym{GNU} extension.
780
781@item @var{regexp1}@var{regexp2}
782Matches the concatenation of @var{regexp1} and @var{regexp2}.
783Concatenation binds more tightly than @code{\|}, @code{^}, and
784@code{$}, but less tightly than the other regular expression
785operators.
786
787@item \@var{digit}
788Matches the @var{digit}-th @code{\(@dots{}\)} parenthesized
789subexpression in the regular expression.  This is called a @dfn{back
reference}.  Subexpressions are implicitly numbered by counting
791occurrences of @code{\(} left-to-right.
792
793@item \n
794Matches the newline character.
795
796@item \@var{char}
797Matches @var{char}, where @var{char} is one of @code{$},
798@code{*}, @code{.}, @code{[}, @code{\}, or @code{^}.
799Note that the only C-like
800backslash sequences that you can portably assume to be
801interpreted are @code{\n} and @code{\\}; in particular
802@code{\t} is not portable, and matches a @samp{t} under most
803implementations of @command{sed}, rather than a tab character.
804
805@end table
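
The bracket-expression rules above are easy to get wrong, so here is a
small sketch of three of them.  The input strings and variable names are
made up for illustration, and the expected results in the usage note
assume a @acronym{POSIX}-style @command{sed} run in the C locale.

```shell
# Inside brackets a backslash is an ordinary character, so [\*] matches
# either a backslash or an asterisk.
out_star=$(printf '%s\n' 'a*b\c' | sed 's/[\*]/X/g')
echo "$out_star"

# A closing bracket placed first in the list is taken literally, so this
# list contains the two characters ] and [.
out_bracket=$(printf '%s\n' 'x[1]y' | sed 's/[][]/_/g')
echo "$out_bracket"

# A leading ^ negates the list; [:digit:] is a character class.
out_neg=$(printf '%s\n' 'abc123' | sed 's/[^[:digit:]]//g')
echo "$out_neg"
```

With these inputs the three commands should print @samp{aXbXc},
@samp{x_1_y} and @samp{123}.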
806
807@cindex Greedy regular expression matching
808Note that the regular expression matcher is greedy, i.e., matches
809are attempted from left to right and, if two or more matches are
810possible starting at the same character, it selects the longest.
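
As a rough illustration of greedy matching, the following sketch
replaces a parenthesized argument list.  The sample input is
hypothetical; the point is only that @samp{.*} extends to the
@emph{last} closing parenthesis on the line.

```shell
# The match starts at the leftmost possible position and is then as long
# as possible: .* runs to the last closing parenthesis, not the first.
greedy=$(printf '%s\n' 'f(a) + g(b)' | sed 's/(.*)/()/')
echo "$greedy"

# Excluding the closing parenthesis from the repeated set stops each
# match at the first one instead.
limited=$(printf '%s\n' 'f(a) + g(b)' | sed 's/([^)]*)/()/g')
echo "$limited"
```

The first command should print @samp{f()}, the second @samp{f() + g()}.
A negated bracket expression is the usual way to get a shorter match,
since basic regular expressions have no non-greedy operators.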
811
812@noindent
813Examples:
814@table @samp
815@item abcdef
816Matches @samp{abcdef}.
817
818@item a*b
819Matches zero or more @samp{a}s followed by a single
820@samp{b}.  For example, @samp{b} or @samp{aaaaab}.
821
822@item a\?b
823Matches @samp{b} or @samp{ab}.
824
825@item a\+b\+
826Matches one or more @samp{a}s followed by one or more
827@samp{b}s: @samp{ab} is the shortest possible match, but
828other examples are @samp{aaaab} or @samp{abbbbb} or
829@samp{aaaaaabbbbbbb}.
830
831@item .*
832@itemx .\+
833These two both match all the characters in a string;
834however, the first matches every string (including the empty
835string), while the second matches only strings containing
836at least one character.
837
838@item ^main.*(.*)
This matches a string starting with @samp{main},
840followed by an opening and closing
841parenthesis.  The @samp{n}, @samp{(} and @samp{)} need not
842be adjacent.
843
844@item ^#
845This matches a string beginning with @samp{#}.
846
847@item \\$
848This matches a string ending with a single backslash.  The
849regexp contains two backslashes for escaping.
850
851@item \$
852Instead, this matches a string consisting of a single dollar sign,
853because it is escaped.
854
855@item [a-zA-Z0-9]
In the C locale, this matches any single @acronym{ASCII} letter or digit.
857
858@item [^ @kbd{tab}]\+
859(Here @kbd{tab} stands for a single tab character.)
860This matches a string of one or more
861characters, none of which is a space or a tab.
862Usually this means a word.
863
864@item ^\(.*\)\n\1$
865This matches a string consisting of two equal substrings separated by
866a newline.
867
868@item .\@{9\@}A$
869This matches nine characters followed by an @samp{A}.
870
871@item ^.\@{15\@}A
872This matches the start of a string that contains 16 characters,
873the last of which is an @samp{A}.
874
875@end table
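
To see a couple of these patterns at work, here is a minimal sketch that
uses them as addresses.  The input lines and variable names are invented
for the example; the back reference pattern is a one-line variant of the
@samp{^\(.*\)\n\1$} entry above, with a space in place of the newline.

```shell
# ^\(.*\) \1$ : a line made of some text, one space, and the same text again.
doubled=$(printf '%s\n' 'abc abc' 'abc abd' 'x x' | sed -n '/^\(.*\) \1$/p')
echo "$doubled"

# ^# selects lines beginning with #; deleting them strips comment lines
# while leaving a # that appears later in a line alone.
no_comments=$(printf '%s\n' '# note' 'data # kept' | sed '/^#/d')
echo "$no_comments"
```

The first command should print @samp{abc abc} and @samp{x x}; the second
should print only @samp{data # kept}.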
876
877
878
879@node Common Commands
880@section Often-Used Commands
881
882If you use @command{sed} at all, you will quite likely want to know
883these commands.
884
885@table @code
886@item #
887[No addresses allowed.]
888
889@findex # (comments)
890@cindex Comments, in scripts
891The @code{#} character begins a comment;
892the comment continues until the next newline.
893
894@cindex Portability, comments
895If you are concerned about portability, be aware that
896some implementations of @command{sed} (which are not @sc{posix}
897conformant) may only support a single one-line comment,
898and then only when the very first character of the script is a @code{#}.
899
900@findex -n, forcing from within a script
901@cindex Caveat --- #n on first line
902Warning: if the first two characters of the @command{sed} script
903are @code{#n}, then the @option{-n} (no-autoprint) option is forced.
904If you want to put a comment in the first line of your script
905and that comment begins with the letter @samp{n}
906and you do not want this behavior,
907then be sure to either use a capital @samp{N},
908or place at least one space before the @samp{n}.
909
910@item q [@var{exit-code}]
911This command only accepts a single address.
912
913@findex q (quit) command
914@cindex @value{SSEDEXT}, returning an exit code
915@cindex Quitting
916Exit @command{sed} without processing any more commands or input.
917Note that the current pattern space is printed if auto-print is
not disabled with the @option{-n} option.  The ability to return
919an exit code from the @command{sed} script is a @value{SSED} extension.
920
921@item d
922@findex d (delete) command
923@cindex Text, deleting
924Delete the pattern space;
925immediately start next cycle.
926
927@item p
928@findex p (print) command
929@cindex Text, printing
930Print out the pattern space (to the standard output).
931This command is usually only used in conjunction with the @option{-n}
932command-line option.
933
934@item n
935@findex n (next-line) command
936@cindex Next input line, replace pattern space with
937@cindex Read next input line
938If auto-print is not disabled, print the pattern space,
939then, regardless, replace the pattern space with the next line of input.
940If there is no more input then @command{sed} exits without processing
941any more commands.
942
943@item @{ @var{commands} @}
944@findex @{@} command grouping
945@cindex Grouping commands
946@cindex Command groups
947A group of commands may be enclosed between
948@code{@{} and @code{@}} characters.
949This is particularly useful when you want a group of commands
950to be triggered by a single address (or address-range) match.
951
952@end table
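
The commands in this section can be tried on a few lines of throwaway
input.  The following is only a sketch: the four-line input and the
variable names are assumptions made for the example.

```shell
input=$(printf '%s\n' one two three four)

# q: print lines up to and including line 2, then quit.
first_two=$(printf '%s\n' "$input" | sed 2q)

# d: delete line 2; everything else is auto-printed.
without_second=$(printf '%s\n' "$input" | sed 2d)

# p with -n: print only line 3.
only_third=$(printf '%s\n' "$input" | sed -n 3p)

# n then d: print a line, fetch the next one, delete it (keeps odd lines).
odd_lines=$(printf '%s\n' "$input" | sed 'n;d')

printf '%s\n' "$first_two" "$without_second" "$only_third" "$odd_lines"
```

Note that @code{p} without @option{-n} would print line 3 twice, once
explicitly and once through auto-print.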
953
954@node The "s" Command
955@section The @code{s} Command
956
957The syntax of the @code{s} (as in substitute) command is
958@samp{s/@var{regexp}/@var{replacement}/@var{flags}}.  The @code{/}
959characters may be uniformly replaced by any other single
960character within any given @code{s} command.  The @code{/}
961character (or whatever other character is used in its stead)
962can appear in the @var{regexp} or @var{replacement}
963only if it is preceded by a @code{\} character.
964
965The @code{s} command is probably the most important in @command{sed}
966and has a lot of different options.  Its basic concept is simple:
967the @code{s} command attempts to match the pattern
968space against the supplied @var{regexp}; if the match is
969successful, then that portion of the pattern
970space which was matched is replaced with @var{replacement}.
971
972@cindex Backreferences, in regular expressions
973@cindex Parenthesized substrings
974The @var{replacement} can contain @code{\@var{n}} (@var{n} being
975a number from 1 to 9, inclusive) references, which refer to
976the portion of the match which is contained between the @var{n}th
977@code{\(} and its matching @code{\)}.
978Also, the @var{replacement} can contain unescaped @code{&}
979characters which reference the whole matched portion
980of the pattern space.
981@cindex @value{SSEDEXT}, case modifiers in @code{s} commands
982Finally, as a @value{SSED} extension, you can include a
983special sequence made of a backslash and one of the letters
984@code{L}, @code{l}, @code{U}, @code{u}, or @code{E}.
985The meaning is as follows:
986
987@table @code
@item \L
Turn the replacement to lowercase until a @code{\U} or @code{\E} is found.

@item \l
Turn the next character to lowercase.

@item \U
Turn the replacement to uppercase until a @code{\L} or @code{\E} is found.

@item \u
Turn the next character to uppercase.

@item \E
Stop case conversion started by @code{\L} or @code{\U}.
1006@end table
1007
1008To include a literal @code{\}, @code{&}, or newline in the final
1009replacement, be sure to precede the desired @code{\}, @code{&},
1010or newline in the @var{replacement} with a @code{\}.
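
The replacement syntax described above can be sketched as follows.  The
sample strings and variable names are illustrative only, and the last
two commands assume @acronym{GNU} @command{sed}, since the case
conversion sequences are an extension.

```shell
# & stands for the whole match; \1 and \2 for parenthesized groups.
wrapped=$(printf '%s\n' 'hello world' | sed 's/w[a-z]*/<&>/')
swapped=$(printf '%s\n' 'hello world' | sed 's/\(.*\) \(.*\)/\2 \1/')

# A backslash makes & literal in the replacement.
literal_amp=$(printf '%s\n' 'fish chips' | sed 's/ / \& /')

# GNU extension: \u upper-cases the next character, \U everything up to \E.
capitalized=$(printf '%s\n' 'hello world' | sed 's/^./\u&/')
shouted=$(printf '%s\n' 'hello world' | sed 's/\(hello\) \(.*\)/\U\1\E \2/')

printf '%s\n' "$wrapped" "$swapped" "$literal_amp" "$capitalized" "$shouted"
```

The expected results are @samp{hello <world>}, @samp{world hello},
@samp{fish & chips}, @samp{Hello world} and @samp{HELLO world}.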
1011
1012@findex s command, option flags
1013@cindex Substitution of text, options
1014The @code{s} command can be followed by zero or more of the
1015following @var{flags}:
1016
1017@table @code
1018@item g
1019@cindex Global substitution
1020@cindex Replacing all text matching regexp in a line
1021Apply the replacement to @emph{all} matches to the @var{regexp},
1022not just the first.
1023
1024@item @var{number}
1025@cindex Replacing only @var{n}th match of regexp in a line
1026Only replace the @var{number}th match of the @var{regexp}.
1027
1028@cindex @acronym{GNU} extensions, @code{g} and @var{number} modifier interaction in @code{s} command
1029@cindex Mixing @code{g} and @var{number} modifiers in the @code{s} command
1030Note: the @sc{posix} standard does not specify what should happen
1031when you mix the @code{g} and @var{number} modifiers,
1032and currently there is no widely agreed upon meaning
1033across @command{sed} implementations.
1034For @value{SSED}, the interaction is defined to be:
1035ignore matches before the @var{number}th,
1036and then match and replace all matches from
1037the @var{number}th on.
1038
1039@item p
1040@cindex Text, printing after substitution
1041If the substitution was made, then print the new pattern space.
1042
1043Note: when both the @code{p} and @code{e} options are specified,
1044the relative ordering of the two produces very different results.
1045In general, @code{ep} (evaluate then print) is what you want,
1046but operating the other way round can be useful for debugging.
1047For this reason, the current version of @value{SSED} interprets
1048specially the presence of @code{p} options both before and after
1049@code{e}, printing the pattern space before and after evaluation,
1050while in general flags for the @code{s} command show their
1051effect just once.  This behavior, although documented, might
1052change in future versions.
1053
1054@item w @var{file-name}
1055@cindex Text, writing to a file after substitution
1056@cindex @value{SSEDEXT}, @file{/dev/stdout} file
1057@cindex @value{SSEDEXT}, @file{/dev/stderr} file
1058If the substitution was made, then write out the result to the named file.
1059As a @value{SSED} extension, two special values of @var{file-name} are
1060supported: @file{/dev/stderr}, which writes the result to the standard
1061error, and @file{/dev/stdout}, which writes to the standard
1062output.@footnote{This is equivalent to @code{p} unless the @option{-i}
1063option is being used.}
1064
1065@item e
1066@cindex Evaluate Bourne-shell commands, after substitution
1067@cindex Subprocesses
1068@cindex @value{SSEDEXT}, evaluating Bourne-shell commands
1069@cindex @value{SSEDEXT}, subprocesses
1070This command allows one to pipe input from a shell command
1071into pattern space.  If a substitution was made, the command
1072that is found in pattern space is executed and pattern space
1073is replaced with its output.  A trailing newline is suppressed;
1074results are undefined if the command to be executed contains
1075a @sc{nul} character.  This is a @value{SSED} extension.
1076
1077@item I
1078@itemx i
1079@cindex @acronym{GNU} extensions, @code{I} modifier
1080@cindex Case-insensitive matching
1081@ifset PERL
1082@cindex Perl-style regular expressions, case-insensitive
1083@end ifset
1084The @code{I} modifier to regular-expression matching is a @acronym{GNU}
1085extension which makes @command{sed} match @var{regexp} in a
1086case-insensitive manner.
1087
1088@item M
1089@itemx m
1090@cindex @value{SSEDEXT}, @code{M} modifier
1091@ifset PERL
1092@cindex Perl-style regular expressions, multiline
1093@end ifset
1094The @code{M} modifier to regular-expression matching is a @value{SSED}
1095extension which causes @code{^} and @code{$} to match respectively
1096(in addition to the normal behavior) the empty string after a newline,
1097and the empty string before a newline.  There are special character
1098sequences
1099@ifset PERL
1100(@code{\A} and @code{\Z} in Perl mode, @code{\`} and @code{\'}
1101in basic or extended regular expression modes)
1102@end ifset
1103@ifclear PERL
1104(@code{\`} and @code{\'})
1105@end ifclear
1106which always match the beginning or the end of the buffer.
1107@code{M} stands for @cite{multi-line}.
1108
1109@ifset PERL
1110@item S
1111@itemx s
1112@cindex @value{SSEDEXT}, @code{S} modifier
1113@cindex Perl-style regular expressions, single line
1114The @code{S} modifier to regular-expression matching is only valid
1115in Perl mode and specifies that the dot character (@code{.}) will
1116match the newline character too.  @code{S} stands for @cite{single-line}.
1117@end ifset
1118
1119@ifset PERL
1120@item X
1121@itemx x
1122@cindex @value{SSEDEXT}, @code{X} modifier
1123@cindex Perl-style regular expressions, extended
1124The @code{X} modifier to regular-expression matching is also
1125valid in Perl mode only.  If it is used, whitespace in the
1126pattern (other than in a character class) and
1127characters between a @kbd{#} outside a character class and the
1128next newline character are ignored. An escaping backslash
1129can be used to include a whitespace or @kbd{#} character as part
1130of the pattern.
1131@end ifset
1132@end table
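
A quick way to compare the flags is to run the same substitution with
each of them.  This is a sketch on made-up input; the combined
@var{number} and @code{g} form and the @code{I} flag assume
@acronym{GNU} @command{sed}, as noted in the descriptions above.

```shell
line='one one one one'

first_only=$(printf '%s\n' "$line" | sed 's/one/1/')      # no flag: first match
every=$(printf '%s\n' "$line" | sed 's/one/1/g')          # g: all matches
third=$(printf '%s\n' "$line" | sed 's/one/1/3')          # number: third match only
from_second=$(printf '%s\n' "$line" | sed 's/one/1/2g')   # GNU: second match onward
any_case=$(printf '%s\n' 'One ONE' | sed 's/one/1/Ig')    # GNU: ignore case

# p with -n: print only the lines where a substitution happened.
changed=$(printf '%s\n' 'keep' 'swap me' | sed -n 's/swap/swapped/p')

printf '%s\n' "$first_only" "$every" "$third" "$from_second" "$any_case" "$changed"
```

The expected results are, in order, @samp{1 one one one},
@samp{1 1 1 1}, @samp{one one 1 one}, @samp{one 1 1 1}, @samp{1 1} and
@samp{swapped me}.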
1133
1134
1135@node Other Commands
1136@section Less Frequently-Used Commands
1137
1138Though perhaps less frequently used than those in the previous
1139section, some very small yet useful @command{sed} scripts can be built with
1140these commands.
1141
1142@table @code
1143@item y/@var{source-chars}/@var{dest-chars}/
1144(The @code{/} characters may be uniformly replaced by
1145any other single character within any given @code{y} command.)
1146
1147@findex y (transliterate) command
1148@cindex Transliteration
1149Transliterate any characters in the pattern space which match
1150any of the @var{source-chars} with the corresponding character
1151in @var{dest-chars}.
1152
1153Instances of the @code{/} (or whatever other character is used in its stead),
1154@code{\}, or newlines can appear in the @var{source-chars} or @var{dest-chars}
lists, provided that each instance is escaped by a @code{\}.
1156The @var{source-chars} and @var{dest-chars} lists @emph{must}
1157contain the same number of characters (after de-escaping).
1158
1159@item a\
1160@itemx @var{text}
1161@cindex @value{SSEDEXT}, two addresses supported by most commands
1162As a @acronym{GNU} extension, this command accepts two addresses.
1163
1164@findex a (append text lines) command
1165@cindex Appending text after a line
1166@cindex Text, appending
1167Queue the lines of text which follow this command
1168(each but the last ending with a @code{\},
1169which are removed from the output)
1170to be output at the end of the current cycle,
1171or when the next input line is read.
1172
1173Escape sequences in @var{text} are processed, so you should
1174use @code{\\} in @var{text} to print a single backslash.
1175
1176As a @acronym{GNU} extension, if between the @code{a} and the newline there is
1177other than a whitespace-@code{\} sequence, then the text of this line,
1178starting at the first non-whitespace character after the @code{a},
1179is taken as the first line of the @var{text} block.
1180(This enables a simplification in scripting a one-line add.)
1181This extension also works with the @code{i} and @code{c} commands.
1182
1183@item i\
1184@itemx @var{text}
1185@cindex @value{SSEDEXT}, two addresses supported by most commands
1186As a @acronym{GNU} extension, this command accepts two addresses.
1187
1188@findex i (insert text lines) command
1189@cindex Inserting text before a line
1190@cindex Text, insertion
1191Immediately output the lines of text which follow this command
1192(each but the last ending with a @code{\},
1193which are removed from the output).
1194
1195@item c\
1196@itemx @var{text}
1197@findex c (change to text lines) command
1198@cindex Replacing selected lines with other text
1199Delete the lines matching the address or address-range,
1200and output the lines of text which follow this command
1201(each but the last ending with a @code{\},
1202which are removed from the output)
1203in place of the last line
1204(or in place of each line, if no addresses were specified).
1205A new cycle is started after this command is done,
1206since the pattern space will have been deleted.
1207
1208@item =
1209@cindex @value{SSEDEXT}, two addresses supported by most commands
1210As a @acronym{GNU} extension, this command accepts two addresses.
1211
1212@findex = (print line number) command
1213@cindex Printing line number
1214@cindex Line number, printing
1215Print out the current input line number (with a trailing newline).
1216
1217@item l @var{n}
1218@findex l (list unambiguously) command
1219@cindex List pattern space
1220@cindex Printing text unambiguously
1221@cindex Line length, setting
1222@cindex @value{SSEDEXT}, setting line length
1223Print the pattern space in an unambiguous form:
1224non-printable characters (and the @code{\} character)
1225are printed in C-style escaped form; long lines are split,
1226with a trailing @code{\} character to indicate the split;
1227the end of each line is marked with a @code{$}.
1228
1229@var{n} specifies the desired line-wrap length;
1230a length of 0 (zero) means to never wrap long lines.  If omitted,
1231the default as specified on the command line is used.  The @var{n}
1232parameter is a @value{SSED} extension.
1233
1234@item r @var{filename}
1235@cindex @value{SSEDEXT}, two addresses supported by most commands
1236As a @acronym{GNU} extension, this command accepts two addresses.
1237
1238@findex r (read file) command
1239@cindex Read text from a file
1240@cindex @value{SSEDEXT}, @file{/dev/stdin} file
1241Queue the contents of @var{filename} to be read and
1242inserted into the output stream at the end of the current cycle,
1243or when the next input line is read.
1244Note that if @var{filename} cannot be read, it is treated as
1245if it were an empty file, without any error indication.
1246
1247As a @value{SSED} extension, the special value @file{/dev/stdin}
1248is supported for the file name, which reads the contents of the
1249standard input.
1250
1251@item w @var{filename}
1252@findex w (write file) command
1253@cindex Write to a file
1254@cindex @value{SSEDEXT}, @file{/dev/stdout} file
1255@cindex @value{SSEDEXT}, @file{/dev/stderr} file
1256Write the pattern space to @var{filename}.
As a @value{SSED} extension, two special values of @var{filename} are
1258supported: @file{/dev/stderr}, which writes the result to the standard
1259error, and @file{/dev/stdout}, which writes to the standard
1260output.@footnote{This is equivalent to @code{p} unless the @option{-i}
1261option is being used.}
1262
1263The file will be created (or truncated) before the
1264first input line is read; all @code{w} commands
1265(including instances of @code{w} flag on successful @code{s} commands)
1266which refer to the same @var{filename} are output without
1267closing and reopening the file.
1268
1269@item D
1270@findex D (delete first line) command
1271@cindex Delete first line from pattern space
1272Delete text in the pattern space up to the first newline.
1273If any text is left, restart cycle with the resultant
1274pattern space (without reading a new line of input),
1275otherwise start a normal new cycle.
1276
1277@item N
1278@findex N (append Next line) command
1279@cindex Next input line, append to pattern space
1280@cindex Append next input line to pattern space
1281Add a newline to the pattern space,
1282then append the next line of input to the pattern space.
1283If there is no more input then @command{sed} exits without processing
1284any more commands.
1285
1286@item P
1287@findex P (print first line) command
1288@cindex Print first line from pattern space
1289Print out the portion of the pattern space up to the first newline.
1290
1291@item h
1292@findex h (hold) command
1293@cindex Copy pattern space into hold space
1294@cindex Replace hold space with copy of pattern space
1295@cindex Hold space, copying pattern space into
1296Replace the contents of the hold space with the contents of the pattern space.
1297
1298@item H
1299@findex H (append Hold) command
1300@cindex Append pattern space to hold space
1301@cindex Hold space, appending from pattern space
1302Append a newline to the contents of the hold space,
1303and then append the contents of the pattern space to that of the hold space.
1304
1305@item g
1306@findex g (get) command
1307@cindex Copy hold space into pattern space
1308@cindex Replace pattern space with copy of hold space
1309@cindex Hold space, copy into pattern space
1310Replace the contents of the pattern space with the contents of the hold space.
1311
1312@item G
1313@findex G (appending Get) command
1314@cindex Append hold space to pattern space
1315@cindex Hold space, appending to pattern space
1316Append a newline to the contents of the pattern space,
1317and then append the contents of the hold space to that of the pattern space.
1318
1319@item x
1320@findex x (eXchange) command
1321@cindex Exchange hold space with pattern space
1322@cindex Hold space, exchange with pattern space
1323Exchange the contents of the hold and pattern spaces.
1324
1325@end table
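
Several of these commands are easier to follow on a tiny input.  The
sketch below is illustrative only (inputs and variable names are
assumptions); the second command is the classic hold-space idiom for
printing lines in reverse order.

```shell
# y: map each character of the first list to the one at the same
# position in the second list.
upper=$(printf '%s\n' 'hello' | sed 'y/abcdefghijklmnopqrstuvwxyz/ABCDEFGHIJKLMNOPQRSTUVWXYZ/')

# G and h: on every line but the first, append the hold space (the lines
# seen so far, already reversed) to the pattern space; save the result
# back to the hold space; print it once the last line is reached.
reversed=$(printf '%s\n' a b c | sed -n '1!G;h;$p')

# =: print the input line number before each line.
numbered=$(printf '%s\n' x y | sed '=')

# N and P: append the next line, then print only the first of the pair.
first_of_pairs=$(printf '%s\n' 1 2 3 4 | sed -n 'N;P')

printf '%s\n' "$upper" "$reversed" "$numbered" "$first_of_pairs"
```

Here @code{y} should yield @samp{HELLO}, the hold-space script should
print @samp{c}, @samp{b}, @samp{a} on separate lines, and the last
command should print @samp{1} and @samp{3}.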
1326
1327
1328@node Programming Commands
1329@section Commands for @command{sed} gurus
1330
1331In most cases, use of these commands indicates that you are
1332probably better off programming in something like @command{awk}
1333or Perl.  But occasionally one is committed to sticking
1334with @command{sed}, and these commands can enable one to write
1335quite convoluted scripts.
1336
1337@cindex Flow of control in scripts
1338@table @code
1339@item : @var{label}
1340[No addresses allowed.]
1341
1342@findex : (label) command
1343@cindex Labels, in scripts
1344Specify the location of @var{label} for branch commands.
1345In all other respects, a no-op.
1346
1347@item b @var{label}
1348@findex b (branch) command
1349@cindex Branch to a label, unconditionally
1350@cindex Goto, in scripts
1351Unconditionally branch to @var{label}.
1352The @var{label} may be omitted, in which case the next cycle is started.
1353
1354@item t @var{label}
1355@findex t (test and branch if successful) command
1356@cindex Branch to a label, if @code{s///} succeeded
1357@cindex Conditional branch
1358Branch to @var{label} only if there has been a successful @code{s}ubstitution
1359since the last input line was read or conditional branch was taken.
1360The @var{label} may be omitted, in which case the next cycle is started.
1361
1362@end table
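
As one possible use of labels and @code{t}, the following sketch joins
every line that ends in a backslash with the line after it.  The input
is hypothetical, and each command is passed in its own @option{-e} so
that the label name ends at the end of the fragment.

```shell
# Join each line that ends in a backslash with the line after it.
#   :a          define label a
#   /\\$/N      if the pattern space ends in a backslash, append the next line
#   s/\\\n//    remove the backslash-newline pair
#   ta          if that substitution happened, loop back to a
joined=$(printf '%s\n' 'first \' 'second \' 'third' 'fourth' |
  sed -e ':a' -e '/\\$/N' -e 's/\\\n//' -e 'ta')
echo "$joined"
```

With this input the result should be @samp{first second third} followed
by @samp{fourth}.  The loop ends because @code{t} resets its flag when
it branches, so a second pass with no successful substitution falls
through to the end of the script.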
1363
1364@node Extended Commands
1365@section Commands Specific to @value{SSED}
1366
1367These commands are specific to @value{SSED}, so you
1368must use them with care and only when you are sure that
1369hindering portability is not evil.  They allow you to check
1370for @value{SSED} extensions or to do tasks that are required
1371quite often, yet are unsupported by standard @command{sed}s.
1372
1373@table @code
1374@item e [@var{command}]
1375@findex e (evaluate) command
1376@cindex Evaluate Bourne-shell commands
1377@cindex Subprocesses
1378@cindex @value{SSEDEXT}, evaluating Bourne-shell commands
1379@cindex @value{SSEDEXT}, subprocesses
1380This command allows one to pipe input from a shell command
1381into pattern space.  Without parameters, the @code{e} command
1382executes the command that is found in pattern space and
1383replaces the pattern space with the output; a trailing newline
1384is suppressed.
1385
1386If a parameter is specified, instead, the @code{e} command
1387interprets it as a command and sends its output to the output stream
1388(like @code{r} does).  The command can run across multiple
1389lines, all but the last ending with a back-slash.
1390
1391In both cases, the results are undefined if the command to be
1392executed contains a @sc{nul} character.
1393
1394@item L @var{n}
1395@findex L (fLow paragraphs) command
1396@cindex Reformat pattern space
1397@cindex Reformatting paragraphs
1398@cindex @value{SSEDEXT}, reformatting paragraphs
1399@cindex @value{SSEDEXT}, @code{L} command
1400This @value{SSED} extension fills and joins lines in pattern space
1401to produce output lines of (at most) @var{n} characters, like
1402@code{fmt} does; if @var{n} is omitted, the default as specified
1403on the command line is used.  This command is considered a failed
1404experiment and unless there is enough request (which seems unlikely)
1405will be removed in future versions.
1406
1407@ignore
1408Blank lines, spaces between words, and indentation are
1409preserved in the output; successive input lines with different
1410indentation are not joined; tabs are expanded to 8 columns.
1411
1412If the pattern space contains multiple lines, they are joined, but
1413since the pattern space usually contains a single line, the behavior
1414of a simple @code{L;d} script is the same as @samp{fmt -s} (i.e.,
1415it does not join short lines to form longer ones).
1416
1417@var{n} specifies the desired line-wrap length; if omitted,
1418the default as specified on the command line is used.
1419@end ignore
1420
1421@item Q [@var{exit-code}]
1422This command only accepts a single address.
1423
1424@findex Q (silent Quit) command
1425@cindex @value{SSEDEXT}, quitting silently
1426@cindex @value{SSEDEXT}, returning an exit code
1427@cindex Quitting
1428This command is the same as @code{q}, but will not print the
1429contents of pattern space.  Like @code{q}, it provides the
1430ability to return an exit code to the caller.
1431
1432This command can be useful because the only alternative ways
1433to accomplish this apparently trivial function are to use
1434the @option{-n} option (which can unnecessarily complicate
1435your script) or resorting to the following snippet, which
1436wastes time by reading the whole file without any visible effect:
1437
1438@example
1439:eat
1440$d       @i{@r{Quit silently on the last line}}
1441N        @i{@r{Read another line, silently}}
1442g        @i{@r{Overwrite pattern space each time to save memory}}
1443b eat
1444@end example
1445
1446@item R @var{filename}
1447@findex R (read line) command
1448@cindex Read text from a file
1449@cindex @value{SSEDEXT}, reading a file a line at a time
1450@cindex @value{SSEDEXT}, @code{R} command
1451@cindex @value{SSEDEXT}, @file{/dev/stdin} file
1452Queue a line of @var{filename} to be read and
1453inserted into the output stream at the end of the current cycle,
1454or when the next input line is read.
1455Note that if @var{filename} cannot be read, or if its end is
1456reached, no line is appended, without any error indication.
1457
1458As with the @code{r} command, the special value @file{/dev/stdin}
1459is supported for the file name, which reads a line from the
1460standard input.
1461
1462@item T @var{label}
1463@findex T (test and branch if failed) command
1464@cindex @value{SSEDEXT}, branch if @code{s///} failed
1465@cindex Branch to a label, if @code{s///} failed
1466@cindex Conditional branch
1467Branch to @var{label} only if there have been no successful
1468@code{s}ubstitutions since the last input line was read or
1469conditional branch was taken. The @var{label} may be omitted,
1470in which case the next cycle is started.
1471
1472@item v @var{version}
1473@findex v (version) command
1474@cindex @value{SSEDEXT}, checking for their presence
1475@cindex Requiring @value{SSED}
1476This command does nothing, but makes @command{sed} fail if
1477@value{SSED} extensions are not supported, simply because other
1478versions of @command{sed} do not implement it.  In addition, you
1479can specify the version of @command{sed} that your script
1480requires, such as @code{4.0.5}.  The default is @code{4.0}
1481because that is the first version that implemented this command.
1482
1483This command enables all @value{SSEDEXT} even if
1484@env{POSIXLY_CORRECT} is set in the environment.
1485
1486@item W @var{filename}
1487@findex W (write first line) command
1488@cindex Write first line to a file
1489@cindex @value{SSEDEXT}, writing first line to a file
1490Write to the given filename the portion of the pattern space up to
1491the first newline.  Everything said under the @code{w} command about
1492file handling holds here too.
1493
1494@item z
1495@findex z (Zap) command
1496@cindex @value{SSEDEXT}, emptying pattern space
1497@cindex Emptying pattern space
1498This command empties the content of pattern space.  It is
1499usually the same as @samp{s/.*//}, but is more efficient
1500and works in the presence of invalid multibyte sequences
1501in the input stream.  @sc{posix} mandates that such sequences
1502are @emph{not} matched by @samp{.}, so that there is no portable
1503way to clear @command{sed}'s buffers in the middle of the
1504script in most multibyte locales (including UTF-8 locales).
1505@end table
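
A few of these extensions can be compared with their standard
counterparts in a short sketch.  It assumes a reasonably recent
@acronym{GNU} @command{sed}; the input, the exit code 7 and the variable
names are arbitrary choices for the example.

```shell
input=$(printf '%s\n' one two three)

# q prints the current line before quitting; Q quits without printing it.
with_q=$(printf '%s\n' "$input" | sed 2q)
with_Q=$(printf '%s\n' "$input" | sed 2Q)

# Both accept an exit code, returned as the status of sed.
status=0
printf '%s\n' "$input" | sed -n '2Q7' || status=$?

# z empties the pattern space; here line 2 becomes an empty line.
zapped=$(printf '%s\n' "$input" | sed 2z)

# T branches when no substitution succeeded: lines without a digit skip
# the second s command by jumping to the end of the script.
tagged=$(printf '%s\n' 'a1' 'b' | sed -e 's/[0-9]/#/' -e 'T' -e 's/$/ (changed)/')

printf '%s\n' "$with_q" "$with_Q" "$status" "$zapped" "$tagged"
```

Here @code{q} should print two lines and @code{Q} only one, the recorded
status should be 7, and only the line containing a digit should end up
with the @samp{ (changed)} suffix.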
1506
1507@node Escapes
1508@section @acronym{GNU} Extensions for Escapes in Regular Expressions
1509
1510@cindex @acronym{GNU} extensions, special escapes
Until this section, we have only encountered escapes of the form
@samp{\^}, which tell @command{sed} not to interpret the circumflex
as a special character, but rather to take it literally.  For
example, @samp{\*} matches a single asterisk rather than zero
or more backslashes.

@cindex @code{POSIXLY_CORRECT} behavior, escapes
This section introduces another kind of escape@footnote{All
1519the escapes introduced here are @acronym{GNU}
1520extensions, with the exception of @code{\n}.  In basic regular
1521expression mode, setting @code{POSIXLY_CORRECT} disables them inside
1522bracket expressions.}---that
1523is, escapes that are applied to a character or sequence of characters
1524that ordinarily are taken literally, and that @command{sed} replaces
1525with a special character.  This provides a way
1526of encoding non-printable characters in patterns in a visible manner.
There is no restriction on the appearance of non-printing characters
in a @command{sed} script, but when a script is being prepared in the
shell or by text editing, it is usually easier to use one of
the following escape sequences than the binary character it
represents.
1532
1533The list of these escapes is:
1534
1535@table @code
1536@item \a
Produces or matches a @sc{bel} character, that is, an ``alert'' (@sc{ascii} 7).
1538
1539@item \f
1540Produces or matches a form feed (@sc{ascii} 12).
1541
1542@item \n
1543Produces or matches a newline (@sc{ascii} 10).
1544
1545@item \r
1546Produces or matches a carriage return (@sc{ascii} 13).
1547
1548@item \t
1549Produces or matches a horizontal tab (@sc{ascii} 9).
1550
1551@item \v
Produces or matches a so-called ``vertical tab'' (@sc{ascii} 11).
1553
1554@item \c@var{x}
1555Produces or matches @kbd{@sc{Control}-@var{x}}, where @var{x} is
1556any character.  The precise effect of @samp{\c@var{x}} is as follows:
1557if @var{x} is a lower case letter, it is converted to upper case.
1558Then bit 6 of the character (hex 40) is inverted.  Thus @samp{\cz} becomes
1559hex 1A, but @samp{\c@{} becomes hex 3B, while @samp{\c;} becomes hex 7B.
1560
1561@item \d@var{xxx}
1562Produces or matches a character whose decimal @sc{ascii} value is @var{xxx}.
1563
1564@item \o@var{xxx}
1565@ifset PERL
1566@item \@var{xxx}
1567@end ifset
1568Produces or matches a character whose octal @sc{ascii} value is @var{xxx}.
1569@ifset PERL
1570The syntax without the @code{o} is active in Perl mode, while the one
1571with the @code{o} is active in the normal or extended @sc{posix} regular
1572expression modes.
1573@end ifset
1574
1575@item \x@var{xx}
1576Produces or matches a character whose hexadecimal @sc{ascii} value is @var{xx}.
1577@end table
1578
1579@samp{\b} (backspace) was omitted because of the conflict with
1580the existing ``word boundary'' meaning.
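
As a brief illustration of the escapes listed above, here is a minimal
shell sketch.  It assumes @acronym{GNU} @command{sed}, since other
implementations may take these sequences literally; the function name
@code{show_escapes} is only illustrative.

```shell
#!/bin/sh
# Sketch only: assumes GNU sed, where these escapes are recognized.
show_escapes() {
    # \t matches the tab character that printf emits
    printf 'a\tb\n' | sed 's/\t/<TAB>/'
    # \x41, \o101 and \d065 all denote the letter A (ASCII 65)
    printf 'A A A\n' | sed 's/\x41/hex/; s/\o101/oct/; s/\d065/dec/'
}

show_escapes
```

Each of the three numeric forms in the second command names the same
character, so each substitution consumes one of the three letters.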
1581
1582Other escapes match a particular character class and are valid only in
1583regular expressions:
1584
1585@table @code
1586@item \w
1587Matches any ``word'' character.  A ``word'' character is any
1588letter or digit or the underscore character.
1589
1590@item \W
1591Matches any ``non-word'' character.
1592
1593@item \b
Matches a word boundary; that is, it matches if the character
to the left is a ``word'' character and the character to the
right is a ``non-word'' character, or vice versa.

@item \B
Matches everywhere but on a word boundary; that is, it matches
if the character to the left and the character to the right
are either both ``word'' characters or both ``non-word''
characters.
1603
1604@item \`
1605Matches only at the start of pattern space.  This is different
1606from @code{^} in multi-line mode.
1607
1608@item \'
1609Matches only at the end of pattern space.  This is different
1610from @code{$} in multi-line mode.
1611
1612@ifset PERL
1613@item \G
Matches only at the start of pattern space or, when doing a global
substitution using the @code{s///g} command and option, at
the end-of-match position of the prior match.  For example,
@samp{s/\Ga/Z/g} will change an initial run of @code{a}s to
a run of @code{Z}s.
1619@end ifset
1620@end table
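
The character-class escapes and the buffer anchors can be tried out with
a sketch like the following.  It assumes @acronym{GNU} @command{sed}
(for @code{\w}, @code{\b} and the @code{M} flag), and the function name
@code{show_classes} is hypothetical.

```shell
#!/bin/sh
# Sketch only: assumes GNU sed.
show_classes() {
    # \b keeps "cat" from matching inside "concat"
    printf 'cat concat\n' | sed 's/\bcat\b/dog/g'
    # \w\+ matches each run of word characters
    printf 'foo-bar baz\n' | sed 's/\w\+/X/g'
    # In multi-line mode (M), ^ also matches after an embedded newline...
    printf 'a\nb\n' | sed 'N; s/^b/B/M'
    # ...whereas \` matches only at the very start of pattern space
    printf 'a\nb\n' | sed 'N; s/\`b/B/M'
}

show_classes
```

The last two commands differ only in the anchor: the first rewrites the
second line of the two-line pattern space, the second leaves it alone.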
1621
1622@node Examples
1623@chapter Some Sample Scripts
1624
1625Here are some @command{sed} scripts to guide you in the art of mastering
1626@command{sed}.
1627
1628@menu
1629Some exotic examples:
1630* Centering lines::
1631* Increment a number::
1632* Rename files to lower case::
1633* Print bash environment::
1634* Reverse chars of lines::
1635
1636Emulating standard utilities:
1637* tac::                             Reverse lines of files
1638* cat -n::                          Numbering lines
1639* cat -b::                          Numbering non-blank lines
1640* wc -c::                           Counting chars
1641* wc -w::                           Counting words
1642* wc -l::                           Counting lines
1643* head::                            Printing the first lines
1644* tail::                            Printing the last lines
1645* uniq::                            Make duplicate lines unique
1646* uniq -d::                         Print duplicated lines of input
1647* uniq -u::                         Remove all duplicated lines
1648* cat -s::                          Squeezing blank lines
1649@end menu
1650
1651@node Centering lines
1652@section Centering Lines
1653
This script centers all lines of a file within a width of 80 columns.
To change that width, the number in @code{\@{@dots{}\@}} must be
replaced, and the number of added spaces must be changed to match.
1657
1658Note how the buffer commands are used to separate parts in
1659the regular expressions to be matched---this is a common
1660technique.
1661
1662@c start-------------------------------------------
1663@example
1664#!/usr/bin/sed -f
1665
1666@group
# Put 80 spaces in the buffer
1 @{
  x
  s/^$/          /
  s/^.*$/&&&&&&&&/
  x
@}
1674@end group
1675
1676@group
1677# del leading and trailing spaces
1678y/@kbd{tab}/ /
1679s/^ *//
1680s/ *$//
1681@end group
1682
1683@group
1684# add a newline and 80 spaces to end of line
1685G
1686@end group
1687
1688@group
1689# keep first 81 chars (80 + a newline)
1690s/^\(.\@{81\@}\).*$/\1/
1691@end group
1692
1693@group
1694# \2 matches half of the spaces, which are moved to the beginning
1695s/^\(.*\)\n\(.*\)\2/\2\1/
1696@end group
1697@end example
1698@c end---------------------------------------------
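
Changing the width means changing two numbers in step, which a shell
wrapper can do.  The following is only a sketch under a few assumptions:
@acronym{GNU} @command{sed} is used, the function name @code{center} and
its width argument are hypothetical, and lines longer than the width are
simply truncated rather than centered.

```shell
#!/bin/sh
# Sketch only: center each input line within WIDTH columns.
center() {
    width=$1
    pad=$(printf '%*s' "$width" '')     # a string of WIDTH spaces
    sed -e "1 { x; s/^\$/$pad/; x; }" \
        -e 'y/\t/ /; s/^ *//; s/ *$//' \
        -e G \
        -e "s/^\(.\{$((width + 1))\}\).*\$/\1/" \
        -e 's/^\(.*\)\n\(.*\)\2/\2\1/'
}

printf 'hi\nabcd\n' | center 20
```

The hold space is filled once with the padding, and the count in the
interval expression is always one more than the width because of the
newline added by @code{G}.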
1699
1700@node Increment a number
1701@section Increment a Number
1702
1703This script is one of a few that demonstrate how to do arithmetic
1704in @command{sed}.  This is indeed possible,@footnote{@command{sed} guru Greg
1705Ubben wrote an implementation of the @command{dc} @sc{rpn} calculator!
1706It is distributed together with sed.} but must be done manually.
1707
To increment a number, add 1 to its last digit, replacing
it by the following digit.  There is one exception: when the digit
is a nine, the preceding digits must also be incremented until a
digit other than nine is found.

This solution by Bruno Haible is clever because
1714it uses a single buffer; if you don't have this limitation, the
1715algorithm used in @ref{cat -n, Numbering lines}, is faster.
1716It works by replacing trailing nines with an underscore, then
1717using multiple @code{s} commands to increment the last digit,
1718and then again substituting underscores with zeros.
1719
1720@c start-------------------------------------------
1721@example
1722#!/usr/bin/sed -f
1723
1724/[^0-9]/ d
1725
1726@group
# replace all trailing 9s by _ (any other character except digits could
# be used)
1729:d
1730s/9\(_*\)$/_\1/
1731td
1732@end group
1733
1734@group
1735# incr last digit only.  The first line adds a most-significant
1736# digit of 1 if we have to add a digit.
1737#
1738# The @code{tn} commands are not necessary, but make the thing
1739# faster
1740@end group
1741
1742@group
1743s/^\(_*\)$/1\1/; tn
1744s/8\(_*\)$/9\1/; tn
1745s/7\(_*\)$/8\1/; tn
1746s/6\(_*\)$/7\1/; tn
1747s/5\(_*\)$/6\1/; tn
1748s/4\(_*\)$/5\1/; tn
1749s/3\(_*\)$/4\1/; tn
1750s/2\(_*\)$/3\1/; tn
1751s/1\(_*\)$/2\1/; tn
1752s/0\(_*\)$/1\1/; tn
1753@end group
1754
1755@group
1756:n
1757y/_/0/
1758@end group
1759@end example
1760@c end---------------------------------------------
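
To watch the three phases at work, the same commands can be wrapped in a
shell function and fed a few numbers.  This is a sketch assuming
@acronym{GNU} @command{sed}; the function name @code{increment} is
hypothetical, and lines containing anything but digits are dropped, as
in the script above.

```shell
#!/bin/sh
# Sketch only: add one to each decimal number read from standard input.
increment() {
    sed '
/[^0-9]/d
# phase 1: turn trailing 9s into underscores
:d
s/9\(_*\)$/_\1/
td
# phase 2: bump the last remaining digit (or add a leading 1)
s/^\(_*\)$/1\1/
tn
s/8\(_*\)$/9\1/
tn
s/7\(_*\)$/8\1/
tn
s/6\(_*\)$/7\1/
tn
s/5\(_*\)$/6\1/
tn
s/4\(_*\)$/5\1/
tn
s/3\(_*\)$/4\1/
tn
s/2\(_*\)$/3\1/
tn
s/1\(_*\)$/2\1/
tn
s/0\(_*\)$/1\1/
# phase 3: underscores become zeros
:n
y/_/0/
'
}

printf '199\n999\n41\n' | increment
```

For the input @samp{199}, the pattern space goes through @samp{1__} and
@samp{2__} before the final @code{y} command produces @samp{200}.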
1761
1762@node Rename files to lower case
1763@section Rename Files to Lower Case
1764
This is a pretty strange use of @command{sed}.  We transform text
into shell commands, then just feed them to the shell.
Don't worry, even worse hacks are done when using @command{sed}; I have
seen a script converting the output of @command{date} into a @command{bc}
program!

The main body of this is the @command{sed} script, which remaps the name
from lower to upper (or vice versa) and even checks
whether the remapped name is the same as the original name.
Note how the script is parameterized using shell
variables and proper quoting.
1776
1777@c start-------------------------------------------
1778@example
1779@group
1780#! /bin/sh
1781# rename files to lower/upper case...
1782#
1783# usage:
1784#    move-to-lower *
1785#    move-to-upper *
1786# or
1787#    move-to-lower -R .
1788#    move-to-upper -R .
1789#
1790@end group
1791
1792@group
1793help()
1794@{
1795        cat << eof
Usage: $0 [-n] [-R] [-h] files...
1797@end group
1798
1799@group
1800-n      do nothing, only see what would be done
1801-R      recursive (use find)
1802-h      this message
1803files   files to remap to lower case
1804@end group
1805
1806@group
1807Examples:
1808       $0 -n *        (see if everything is ok, then...)
1809       $0 *
1810@end group
1811
1812       $0 -R .
1813
1814@group
1815eof
1816@}
1817@end group
1818
1819@group
1820apply_cmd='sh'
1821finder='echo "$@@" | tr " " "\n"'
1822files_only=
1823@end group
1824
1825@group
1826while :
1827do
1828    case "$1" in
1829        -n) apply_cmd='cat' ;;
1830        -R) finder='find "$@@" -type f';;
1831        -h) help ; exit 1 ;;
1832        *) break ;;
1833    esac
1834    shift
1835done
1836@end group
1837
1838@group
1839if [ -z "$1" ]; then
        echo Usage: $0 [-h] [-n] [-R] files...
1841        exit 1
1842fi
1843@end group
1844
1845@group
1846LOWER='abcdefghijklmnopqrstuvwxyz'
1847UPPER='ABCDEFGHIJKLMNOPQRSTUVWXYZ'
1848@end group
1849
1850@group
1851case `basename $0` in
1852        *upper*) TO=$UPPER; FROM=$LOWER ;;
1853        *)       FROM=$UPPER; TO=$LOWER ;;
1854esac
1855@end group
1856
1857eval $finder | sed -n '
1858
1859@group
1860# remove all trailing slashes
1861s/\/*$//
1862@end group
1863
1864@group
1865# add ./ if there is no path, only a filename
1866/\//! s/^/.\//
1867@end group
1868
1869@group
1870# save path+filename
1871h
1872@end group
1873
1874@group
1875# remove path
1876s/.*\///
1877@end group
1878
1879@group
1880# do conversion only on filename
1881y/'$FROM'/'$TO'/
1882@end group
1883
1884@group
1885# now line contains original path+file, while
1886# hold space contains the new filename
1887x
1888@end group
1889
1890@group
1891# add converted file name to line, which now contains
1892# path/file-name\nconverted-file-name
1893G
1894@end group
1895
1896@group
# check if converted file name is equal to original file name,
# if it is, do not print anything
1899/^.*\/\(.*\)\n\1/b
1900@end group
1901
1902@group
# now, transform path/fromfile\ntofile into
# mv path/fromfile path/tofile and print it
1905s/^\(.*\/\)\(.*\)\n\(.*\)$/mv "\1\2" "\1\3"/p
1906@end group
1907
1908' | $apply_cmd
1909@end example
1910@c end---------------------------------------------
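
The core of the script can be exercised without touching any file by
printing the generated commands instead of piping them to a shell.  The
sketch below assumes @acronym{GNU} @command{sed}, hard-codes the
upper-to-lower direction, and uses a hypothetical function name
(@code{lower_mv_commands}) and made-up file names.

```shell
#!/bin/sh
# Sketch only: print the mv commands that would lower-case each name.
lower_mv_commands() {
    sed -n '
s/\/*$//
/\//! s/^/.\//
h
s/.*\///
y/ABCDEFGHIJKLMNOPQRSTUVWXYZ/abcdefghijklmnopqrstuvwxyz/
x
G
/^.*\/\(.*\)\n\1/b
s/^\(.*\/\)\(.*\)\n\(.*\)$/mv "\1\2" "\1\3"/p
'
}

printf '%s\n' ./Foo.TXT docs/README bar | lower_mv_commands
```

No command is printed for @file{bar}, because its converted name equals
the original and the @code{b} command ends the cycle early.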
1911
1912@node Print bash environment
1913@section Print @command{bash} Environment
1914
1915This script strips the definition of the shell functions
1916from the output of the @command{set} Bourne-shell command.
1917
1918@c start-------------------------------------------
1919@example
1920#!/bin/sh
1921
1922@group
1923set | sed -n '
1924:x
1925@end group
1926
1927@group
1928@ifinfo
1929# if no occurrence of "=()" print and load next line
1930@end ifinfo
1931@ifnotinfo
1932# if no occurrence of @samp{=()} print and load next line
1933@end ifnotinfo
1934/=()/! @{ p; b; @}
1935/ () $/! @{ p; b; @}
1936@end group
1937
1938@group
1939# possible start of functions section
1940# save the line in case this is a var like FOO="() "
1941h
1942@end group
1943
1944@group
1945# if the next line has a brace, we quit because
1946# nothing comes after functions
1947n
1948/^@{/ q
1949@end group
1950
1951@group
1952# print the old line
1953x; p
1954@end group
1955
1956@group
1957# work on the new line now
1958x; bx
1959'
1960@end group
1961@end example
1962@c end---------------------------------------------
1963
1964@node Reverse chars of lines
1965@section Reverse Characters of Lines
1966
1967This script can be used to reverse the position of characters
1968in lines.  The technique moves two characters at a time, hence
1969it is faster than more intuitive implementations.
1970
1971Note the @code{tx} command before the definition of the label.
1972This is often needed to reset the flag that is tested by
1973the @code{t} command.
1974
1975Imaginative readers will find uses for this script.  An example
1976is reversing the output of @command{banner}.@footnote{This requires
1977another script to pad the output of banner; for example
1978
1979@example
1980#! /bin/sh
1981
1982banner -w $1 $2 $3 $4 |
1983  sed -e :a -e '/^.\@{0,'$1'\@}$/ @{ s/$/ /; ba; @}' |
1984  ~/sedscripts/reverseline.sed
1985@end example
1986}
1987
1988@c start-------------------------------------------
1989@example
1990#!/usr/bin/sed -f
1991
1992/../! b
1993
1994@group
1995# Reverse a line.  Begin embedding the line between two newlines
1996s/^.*$/\
1997&\
1998/
1999@end group
2000
2001@group
# Move first character to the end.  The regexp matches until
# there are zero or one characters between the markers
2004tx
2005:x
2006s/\(\n.\)\(.*\)\(.\n\)/\3\2\1/
2007tx
2008@end group
2009
2010@group
2011# Remove the newline markers
2012s/\n//g
2013@end group
2014@end example
2015@c end---------------------------------------------
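
A quick way to try the script is to wrap it in a shell function.  This
sketch assumes @acronym{GNU} @command{sed} (it writes the newline in the
replacement as @code{\n}), and the function name @code{reverse_chars}
is hypothetical.

```shell
#!/bin/sh
# Sketch only: reverse the characters of every input line.
reverse_chars() {
    sed '
/../! b
# embed the line between two newline markers
s/^.*$/\n&\n/
# swap the characters next to the markers until they meet
tx
:x
s/\(\n.\)\(.*\)\(.\n\)/\3\2\1/
tx
s/\n//g
'
}

printf 'hello\nabcd\na\n' | reverse_chars
```

For @samp{hello} the pattern space passes through two swaps before the
markers are adjacent to the middle character and the loop stops.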
2016
2017@node tac
2018@section Reverse Lines of Files
2019
2020This one begins a series of totally useless (yet interesting)
2021scripts emulating various Unix commands.  This, in particular,
2022is a @command{tac} workalike.
2023
2024Note that on implementations other than @acronym{GNU} @command{sed}
2025@ifset PERL
2026and @value{SSED}
2027@end ifset
2028this script might easily overflow internal buffers.
2029
2030@c start-------------------------------------------
2031@example
2032#!/usr/bin/sed -nf
2033
# reverse all lines of input, i.e. first line becomes last, ...
2035
2036@group
# from the second line, the buffer (which contains all previous lines)
# is *appended* to current line, so, the order will be reversed
1! G
2040@end group
2041
2042@group
2043# on the last line we're done -- print everything
2044$ p
2045@end group
2046
2047@group
2048# store everything on the buffer again
2049h
2050@end group
2051@end example
2052@c end---------------------------------------------
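
The three commands also fit on one line.  The following sketch, with a
hypothetical function name (@code{reverse_lines}), assumes a
@command{sed} that accepts semicolons as command separators.

```shell
#!/bin/sh
# Sketch only: print the input lines in reverse order.
reverse_lines() {
    # 1!G  append the lines seen so far after the current one
    # $p   on the last line, print the accumulated text
    # h    save the accumulated text for the next cycle
    sed -n '1!G;$p;h'
}

printf 'one\ntwo\nthree\n' | reverse_lines
```

Since the whole input accumulates in the hold space, memory use grows
with the size of the input.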
2053
2054@node cat -n
2055@section Numbering Lines
2056
2057This script replaces @samp{cat -n}; in fact it formats its output
2058exactly like @acronym{GNU} @command{cat} does.
2059
Of course this is completely useless, for two reasons: first,
because somebody else did it in C; second, because the following
Bourne-shell script could be used for the same purpose and would
be much faster:
2064
2065@c start-------------------------------------------
2066@example
2067@group
2068#! /bin/sh
2069sed -e "=" $@@ | sed -e '
2070  s/^/      /
2071  N
2072  s/^ *\(......\)\n/\1  /
2073'
2074@end group
2075@end example
2076@c end---------------------------------------------
2077
2078It uses @command{sed} to print the line number, then groups lines two
2079by two using @code{N}.  Of course, this script does not teach as much as
2080the one presented below.
2081
2082The algorithm used for incrementing uses both buffers, so the line
2083is printed as soon as possible and then discarded.  The number
2084is split so that changing digits go in a buffer and unchanged ones go
2085in the other; the changed digits are modified in a single step
2086(using a @code{y} command).  The line number for the next line
2087is then composed and stored in the hold space, to be used in the
2088next iteration.
2089
2090@c start-------------------------------------------
2091@example
2092#!/usr/bin/sed -nf
2093
2094@group
2095# Prime the pump on the first line
2096x
2097/^$/ s/^.*$/1/
2098@end group
2099
2100@group
2101# Add the correct line number before the pattern
2102G
2103h
2104@end group
2105
2106@group
2107# Format it and print it
2108s/^/      /
2109s/^ *\(......\)\n/\1  /p
2110@end group
2111
2112@group
2113# Get the line number from hold space; add a zero
2114# if we're going to add a digit on the next line
2115g
2116s/\n.*$//
2117/^9*$/ s/^/0/
2118@end group
2119
2120@group
2121# separate changing/unchanged digits with an x
2122s/.9*$/x&/
2123@end group
2124
2125@group
2126# keep changing digits in hold space
2127h
2128s/^.*x//
2129y/0123456789/1234567890/
2130x
2131@end group
2132
2133@group
2134# keep unchanged digits in pattern space
2135s/x.*$//
2136@end group
2137
2138@group
2139# compose the new number, remove the newline implicitly added by G
2140G
2141s/\n//
2142h
2143@end group
2144@end example
2145@c end---------------------------------------------
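
The increment step of the script above can be tried in isolation.  The
sketch below assumes @acronym{GNU} @command{sed} and applies only the
number-handling commands to lines that each hold one decimal number;
the function name @code{next_number} is hypothetical.

```shell
#!/bin/sh
# Sketch only: increment each number using the two-buffer technique.
next_number() {
    sed '
# add a leading zero if every digit is a 9
/^9*$/ s/^/0/
# mark the boundary between unchanged and changing digits
s/.9*$/x&/
# changing digits go to the hold space, incremented in one step
h
s/^.*x//
y/0123456789/1234567890/
x
# unchanged digits stay in the pattern space
s/x.*$//
# glue the two halves back together
G
s/\n//
'
}

printf '1299\n99\n7\n' | next_number
```

For @samp{1299} the split is @samp{1} and @samp{299}; the @code{y}
command turns the latter into @samp{300} without any loop.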
2146
2147@node cat -b
2148@section Numbering Non-blank Lines
2149
2150Emulating @samp{cat -b} is almost the same as @samp{cat -n}---we only
2151have to select which lines are to be numbered and which are not.
2152
2153The part that is common to this script and the previous one is
2154not commented to show how important it is to comment @command{sed}
2155scripts properly...
2156
2157@c start-------------------------------------------
2158@example
2159#!/usr/bin/sed -nf
2160
2161@group
2162/^$/ @{
2163  p
2164  b
2165@}
2166@end group
2167
2168@group
2169# Same as cat -n from now
2170x
2171/^$/ s/^.*$/1/
2172G
2173h
2174s/^/      /
2175s/^ *\(......\)\n/\1  /p
2176x
2177s/\n.*$//
2178/^9*$/ s/^/0/
2179s/.9*$/x&/
2180h
2181s/^.*x//
2182y/0123456789/1234567890/
2183x
2184s/x.*$//
2185G
2186s/\n//
2187h
2188@end group
2189@end example
2190@c end---------------------------------------------
2191
2192@node wc -c
2193@section Counting Characters
2194
2195This script shows another way to do arithmetic with @command{sed}.
2196In this case we have to add possibly large numbers, so implementing
2197this by successive increments would not be feasible (and possibly
2198even more complicated to contrive than this script).
2199
2200The approach is to map numbers to letters, kind of an abacus
2201implemented with @command{sed}.  @samp{a}s are units, @samp{b}s are
2202tens and so on: we simply add the number of characters
2203on the current line as units, and then propagate the carry
2204to tens, hundreds, and so on.
2205
2206As usual, running totals are kept in hold space.
2207
2208On the last line, we convert the abacus form back to decimal.
2209For the sake of variety, this is done with a loop rather than
2210with some 80 @code{s} commands@footnote{Some implementations
2211have a limit of 199 commands per script}: first we
2212convert units, removing @samp{a}s from the number; then we
2213rotate letters so that tens become @samp{a}s, and so on
2214until no more letters remain.
2215
2216@c start-------------------------------------------
2217@example
2218#!/usr/bin/sed -nf
2219
2220@group
2221# Add n+1 a's to hold space (+1 is for the newline)
2222s/./a/g
2223H
2224x
2225s/\n/a/
2226@end group
2227
2228@group
2229# Do the carry.  The t's and b's are not necessary,
2230# but they do speed up the thing
2231t a
2232: a;  s/aaaaaaaaaa/b/g; t b; b done
2233: b;  s/bbbbbbbbbb/c/g; t c; b done
2234: c;  s/cccccccccc/d/g; t d; b done
2235: d;  s/dddddddddd/e/g; t e; b done
2236: e;  s/eeeeeeeeee/f/g; t f; b done
2237: f;  s/ffffffffff/g/g; t g; b done
2238: g;  s/gggggggggg/h/g; t h; b done
2239: h;  s/hhhhhhhhhh//g
2240@end group
2241
2242@group
2243: done
2244$! @{
2245  h
2246  b
2247@}
2248@end group
2249
2250# On the last line, convert back to decimal
2251
2252@group
2253: loop
2254/a/! s/[b-h]*/&0/
2255s/aaaaaaaaa/9/
2256s/aaaaaaaa/8/
2257s/aaaaaaa/7/
2258s/aaaaaa/6/
2259s/aaaaa/5/
2260s/aaaa/4/
2261s/aaa/3/
2262s/aa/2/
2263s/a/1/
2264@end group
2265
2266@group
2267: next
2268y/bcdefgh/abcdefg/
2269/[a-h]/ b loop
2270p
2271@end group
2272@end example
2273@c end---------------------------------------------
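
The abacus can be tried on small inputs with a shortened carry chain.
This sketch assumes @acronym{GNU} @command{sed}, stops carrying at the
thousands (so it is only meant for inputs well under 10000 characters),
and uses a hypothetical function name (@code{count_chars}).

```shell
#!/bin/sh
# Sketch only: count characters, newlines included, for small inputs.
count_chars() {
    sed -n '
# one "a" per character, plus one for the newline
s/./a/g
H
x
s/\n/a/
# carry: ten units make a "b", ten "b"s a "c", ten "c"s a "d"
s/aaaaaaaaaa/b/g
s/bbbbbbbbbb/c/g
s/cccccccccc/d/g
$! {
  h
  b
}
# last line: convert the abacus back to decimal, units first
:loop
/a/! s/[b-h]*/&0/
s/aaaaaaaaa/9/
s/aaaaaaaa/8/
s/aaaaaaa/7/
s/aaaaaa/6/
s/aaaaa/5/
s/aaaa/4/
s/aaa/3/
s/aa/2/
s/a/1/
y/bcdefgh/abcdefg/
/[a-h]/ b loop
p
'
}

printf 'hello\nworld\n' | count_chars
```

With the input shown, the hold space holds six @samp{a}s after the first
line and @samp{baa} after the second, which the final loop turns into
@samp{12}.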
2274
2275@node wc -w
2276@section Counting Words
2277
2278This script is almost the same as the previous one, once each
2279of the words on the line is converted to a single @samp{a}
2280(in the previous script each letter was changed to an @samp{a}).
2281
It is interesting that real @command{wc} programs have optimized
loops for @samp{wc -c}, so they are much slower at counting
words than at counting characters.  This script's bottleneck,
instead, is arithmetic, and hence the word-counting one
is faster (it has to manage smaller numbers).
2287
2288Again, the common parts are not commented to show the importance
2289of commenting @command{sed} scripts.
2290
2291@c start-------------------------------------------
2292@example
2293#!/usr/bin/sed -nf
2294
2295@group
2296# Convert words to a's
2297s/[ @kbd{tab}][ @kbd{tab}]*/ /g
2298s/^/ /
2299s/ [^ ][^ ]*/a /g
2300s/ //g
2301@end group
2302
2303@group
2304# Append them to hold space
2305H
2306x
2307s/\n//
2308@end group
2309
2310@group
2311# From here on it is the same as in wc -c.
2312/aaaaaaaaaa/! bx;   s/aaaaaaaaaa/b/g
2313/bbbbbbbbbb/! bx;   s/bbbbbbbbbb/c/g
2314/cccccccccc/! bx;   s/cccccccccc/d/g
2315/dddddddddd/! bx;   s/dddddddddd/e/g
2316/eeeeeeeeee/! bx;   s/eeeeeeeeee/f/g
2317/ffffffffff/! bx;   s/ffffffffff/g/g
2318/gggggggggg/! bx;   s/gggggggggg/h/g
2319s/hhhhhhhhhh//g
2320:x
2321$! @{ h; b; @}
2322:y
2323/a/! s/[b-h]*/&0/
2324s/aaaaaaaaa/9/
2325s/aaaaaaaa/8/
2326s/aaaaaaa/7/
2327s/aaaaaa/6/
2328s/aaaaa/5/
2329s/aaaa/4/
2330s/aaa/3/
2331s/aa/2/
2332s/a/1/
2333y/bcdefgh/abcdefg/
2334/[a-h]/ by
2335p
2336@end group
2337@end example
2338@c end---------------------------------------------
2339
2340@node wc -l
2341@section Counting Lines
2342
2343No strange things are done now, because @command{sed} gives us
2344@samp{wc -l} functionality for free!!! Look:
2345
2346@c start-------------------------------------------
2347@example
2348@group
2349#!/usr/bin/sed -nf
2350$=
2351@end group
2352@end example
2353@c end---------------------------------------------
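
From the shell the same thing is a one-liner; the function name
@code{count_lines} in this sketch is hypothetical.

```shell
#!/bin/sh
# Sketch only: print the number of the last input line.
count_lines() {
    sed -n '$='
}

printf 'a\nb\nc\n' | count_lines
```

Unlike @samp{wc -l}, this prints nothing at all for empty input, because
no line ever matches the @code{$} address.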
2354
2355@node head
2356@section Printing the First Lines
2357
2358This script is probably the simplest useful @command{sed} script.
2359It displays the first 10 lines of input; the number of displayed
2360lines is right before the @code{q} command.
2361
2362@c start-------------------------------------------
2363@example
2364@group
#!/usr/bin/sed -f
10q
2367@end group
2368@end example
2369@c end---------------------------------------------
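
The line count can be taken from a shell variable instead of being
written into the script.  In this sketch the function name
@code{first_lines} and its argument are hypothetical.

```shell
#!/bin/sh
# Sketch only: print the first N lines of standard input.
first_lines() {
    # the address in front of q is the number of lines to keep
    sed "${1}q"
}

printf '%s\n' one two three four five | first_lines 2
```

Because @code{q} stops reading as soon as line @var{n} has been printed,
the rest of the input is never processed.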
2370
2371@node tail
2372@section Printing the Last Lines
2373
2374Printing the last @var{n} lines rather than the first is more complex
2375but indeed possible.  @var{n} is encoded in the second line, before
2376the bang character.
2377
2378This script is similar to the @command{tac} script in that it keeps the
2379final output in the hold space and prints it at the end:
2380
2381@c start-------------------------------------------
2382@example
2383#!/usr/bin/sed -nf
2384
2385@group
1! @{; H; g; @}
1,10 !s/[^\n]*\n//
$p
h
2390@end group
2391@end example
2392@c end---------------------------------------------
2393
Mainly, the script keeps a window of 10 lines and slides it
2395by adding a line and deleting the oldest (the substitution command
2396on the second line works like a @code{D} command but does not
2397restart the loop).
2398
2399The ``sliding window'' technique is a very powerful way to write
2400efficient and complex @command{sed} scripts, because commands like
2401@code{P} would require a lot of work if implemented manually.
2402
2403To introduce the technique, which is fully demonstrated in the
2404rest of this chapter and is based on the @code{N}, @code{P}
2405and @code{D} commands, here is an implementation of @command{tail}
2406using a simple ``sliding window.''
2407
2408This looks complicated but in fact the working is the same as
2409the last script: after we have kicked in the appropriate number
2410of lines, however, we stop using the hold space to keep inter-line
2411state, and instead use @code{N} and @code{D} to slide pattern
2412space by one line:
2413
2414@c start-------------------------------------------
2415@example
2416#!/usr/bin/sed -f
2417
2418@group
1h
2,10 @{; H; g; @}
$q
1,9d
N
D
2425@end group
2426@end example
2427@c end---------------------------------------------
2428
Note how the first, second and fourth lines are inactive after
the first ten lines of input.  After that, all the script does
is exit on the last line of input, append the next input
line to pattern space, and remove the first line.
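
The window size appears twice in the script, once as @var{n} and once
as @var{n} minus one, so a shell wrapper has to compute both.  The
following is a sketch assuming @acronym{GNU} @command{sed}; the function
name @code{last_lines} is hypothetical, and this form only works for a
window of at least two lines.

```shell
#!/bin/sh
# Sketch only: print the last N lines; this form assumes N >= 2.
last_lines() {
    n=$1
    sed -e 1h \
        -e "2,$n { H; g; }" \
        -e '$q' \
        -e "1,$((n - 1))d" \
        -e N \
        -e D
}

printf '%s\n' one two three four five | last_lines 3
```

Once the window is full, each cycle consists only of the @code{$q} test
followed by @code{N} and @code{D}.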
2433
2434@node uniq
2435@section Make Duplicate Lines Unique
2436
2437This is an example of the art of using the @code{N}, @code{P}
2438and @code{D} commands, probably the most difficult to master.
2439
2440@c start-------------------------------------------
2441@example
2442@group
2443#!/usr/bin/sed -f
2444h
2445@end group
2446
2447@group
2448:b
2449# On the last line, print and exit
2450$b
2451N
2452/^\(.*\)\n\1$/ @{
    # The two lines are identical.  Undo the effect of
    # the N command.
2455    g
2456    bb
2457@}
2458@end group
2459
2460@group
2461# If the @code{N} command had added the last line, print and exit
2462$b
2463@end group
2464
2465@group
2466# The lines are different; print the first and go
2467# back working on the second.
2468P
2469D
2470@end group
2471@end example
2472@c end---------------------------------------------
2473
As you can see, we maintain a 2-line window using @code{P} and @code{D}.
2475This technique is often used in advanced @command{sed} scripts.
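
To see the two-line window in action, the script can be run on a short
input with adjacent duplicates.  This sketch assumes @acronym{GNU}
@command{sed}, and the function name @code{uniq_lines} is hypothetical.

```shell
#!/bin/sh
# Sketch only: collapse runs of identical adjacent lines.
uniq_lines() {
    sed '
h
:b
$b
N
/^\(.*\)\n\1$/ {
g
bb
}
$b
P
D
'
}

printf '%s\n' a a b b b c | uniq_lines
```

While a run of equal lines lasts, @code{g} keeps restoring the single
saved copy; @code{P} and @code{D} run only when the second line of the
window differs from the first.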
2476
2477@node uniq -d
2478@section Print Duplicated Lines of Input
2479
2480This script prints only duplicated lines, like @samp{uniq -d}.
2481
2482@c start-------------------------------------------
2483@example
2484#!/usr/bin/sed -nf
2485
2486@group
2487$b
2488N
2489/^\(.*\)\n\1$/ @{
2490    # Print the first of the duplicated lines
2491    s/.*\n//
2492    p
2493@end group
2494
2495@group
2496    # Loop until we get a different line
2497    :b
2498    $b
2499    N
2500    /^\(.*\)\n\1$/ @{
2501        s/.*\n//
2502        bb
2503    @}
2504@}
2505@end group
2506
2507@group
2508# The last line cannot be followed by duplicates
2509$b
2510@end group
2511
2512@group
2513# Found a different one.  Leave it alone in the pattern space
2514# and go back to the top, hunting its duplicates
2515D
2516@end group
2517@end example
2518@c end---------------------------------------------
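
A short run shows which lines survive.  The sketch assumes
@acronym{GNU} @command{sed}; the function name @code{dup_lines} is
hypothetical.

```shell
#!/bin/sh
# Sketch only: print one copy of each line that is repeated.
dup_lines() {
    sed -n '
$b
N
/^\(.*\)\n\1$/ {
s/.*\n//
p
:b
$b
N
/^\(.*\)\n\1$/ {
s/.*\n//
bb
}
}
$b
D
'
}

printf '%s\n' a a b c c | dup_lines
```

Only the first line of each run is printed; the inner loop then swallows
the remaining copies without printing them.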
2519
2520@node uniq -u
2521@section Remove All Duplicated Lines
2522
2523This script prints only unique lines, like @samp{uniq -u}.
2524
2525@c start-------------------------------------------
2526@example
2527#!/usr/bin/sed -f
2528
2529@group
# Search for a duplicate line; until then, print what you find.
2531$b
2532N
2533/^\(.*\)\n\1$/ ! @{
2534    P
2535    D
2536@}
2537@end group
2538
2539@group
2540:c
2541# Got two equal lines in pattern space.  At the
2542# end of the file we simply exit
2543$d
2544@end group
2545
2546@group
2547# Else, we keep reading lines with @code{N} until we
2548# find a different one
2549s/.*\n//
2550N
2551/^\(.*\)\n\1$/ @{
2552    bc
2553@}
2554@end group
2555
2556@group
2557# Remove the last instance of the duplicate line
2558# and go back to the top
2559D
2560@end group
2561@end example
2562@c end---------------------------------------------
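
The same test input used for the other @command{uniq} variants makes the
difference visible.  The sketch assumes @acronym{GNU} @command{sed}; the
function name @code{unique_lines} is hypothetical.

```shell
#!/bin/sh
# Sketch only: print the lines that are not repeated.
unique_lines() {
    sed '
$b
N
/^\(.*\)\n\1$/! {
P
D
}
:c
$d
s/.*\n//
N
/^\(.*\)\n\1$/ {
bc
}
D
'
}

printf '%s\n' a a b c c | unique_lines
```

Here the runs of @samp{a} and @samp{c} are discarded entirely, while the
@code{P} and @code{D} pair passes the unrepeated line through.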
2563
2564@node cat -s
2565@section Squeezing Blank Lines
2566
2567As a final example, here are three scripts, of increasing complexity
2568and speed, that implement the same function as @samp{cat -s}, that is
2569squeezing blank lines.
2570
2571The first leaves a blank line at the beginning and end if there are
2572some already.
2573
2574@c start-------------------------------------------
2575@example
2576#!/usr/bin/sed -f
2577
2578@group
2579# on empty lines, join with next
2580# Note there is a star in the regexp
2581:x
2582/^\n*$/ @{
2583N
2584bx
2585@}
2586@end group
2587
2588@group
# now, squeeze all leading '\n' into one; this can also be done by:
# s/^\(\n\)*/\1/
s/^\n\n*/\
/
2593@end group
2594@end example
2595@c end---------------------------------------------
2596
2597This one is a bit more complex and removes all empty lines
2598at the beginning.  It does leave a single blank line at end
2599if one was there.
2600
2601@c start-------------------------------------------
2602@example
2603#!/usr/bin/sed -f
2604
2605@group
2606# delete all leading empty lines
26071,/^./@{
2608/./!d
2609@}
2610@end group
2611
2612@group
2613# on an empty line we remove it and all the following
2614# empty lines, but one
2615:x
2616/./!@{
2617N
2618s/^\n$//
2619tx
2620@}
2621@end group
2622@end example
2623@c end---------------------------------------------
2624
This removes leading and trailing blank lines.  It is also the
fastest.  Note that loops are completely done with @code{n} and
@code{b}, without relying on @command{sed} to restart
the script automatically at the end of a line.
2629
2630@c start-------------------------------------------
2631@example
2632#!/usr/bin/sed -nf
2633
2634@group
2635# delete all (leading) blanks
2636/./!d
2637@end group
2638
2639@group
2640# get here: so there is a non empty
2641:x
2642# print it
2643p
2644# get next
2645n
2646# got chars? print it again, etc...
2647/./bx
2648@end group
2649
2650@group
2651# no, don't have chars: got an empty line
2652:z
2653# get next, if last line we finish here so no trailing
2654# empty lines are written
2655n
2656# also empty? then ignore it, and get next... this will
2657# remove ALL empty lines
2658/./!bz
2659@end group
2660
2661@group
2662# all empty lines were deleted/ignored, but we have a non empty.  As
2663# what we want to do is to squeeze, insert a blank line artificially
2664i\
2665@end group
2666
2667bx
2668@end example
2669@c end---------------------------------------------
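
The last script can be checked on input that has blank lines at the
start, in the middle and at the end.  This sketch assumes
@acronym{GNU} @command{sed}; the function name @code{squeeze_blank} is
hypothetical.

```shell
#!/bin/sh
# Sketch only: squeeze runs of empty lines, dropping leading and
# trailing ones.
squeeze_blank() {
    sed -n '
/./!d
:x
p
n
/./bx
:z
n
/./!bz
i\

bx
'
}

printf '\n\na\n\n\nb\n\n' | squeeze_blank
```

The empty line after @code{i\} is the text being inserted, so it must
stay completely empty.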
2670
2671@node Limitations
2672@chapter @value{SSED}'s Limitations and Non-limitations
2673
2674@cindex @acronym{GNU} extensions, unlimited line length
2675@cindex Portability, line length limitations
2676For those who want to write portable @command{sed} scripts,
2677be aware that some implementations have been known to
2678limit line lengths (for the pattern and hold spaces)
2679to be no more than 4000 bytes.
2680The @sc{posix} standard specifies that conforming @command{sed}
2681implementations shall support at least 8192 byte line lengths.
2682@value{SSED} has no built-in limit on line length;
2683as long as it can @code{malloc()} more (virtual) memory,
2684you can feed or construct lines as long as you like.
2685
2686However, recursion is used to handle subpatterns and indefinite
2687repetition.  This means that the available stack space may limit
2688the size of the buffer that can be processed by certain patterns.
2689
2690@ifset PERL
2691There are some size limitations in the regular expression
2692matcher but it is hoped that they will never in practice
2693be relevant.  The maximum length of a compiled pattern
2694is 65539 (sic) bytes.  All values in repeating quantifiers
2695must be less than 65536.  The maximum nesting depth of
2696all parenthesized subpatterns, including capturing and
2697non-capturing subpatterns@footnote{The
2698distinction is meaningful when referring to Perl-style
2699regular expressions.}, assertions, and other types of
2700subpattern, is 200.
2701
2702Also, @value{SSED} recognizes the @sc{posix} syntax
2703@code{[.@var{ch}.]} and @code{[=@var{ch}=]}
2704where @var{ch} is a ``collating element'', but these
2705are not supported, and an error is given if they are
2706encountered.
2707
2708Here are a few distinctions between the real Perl-style
2709regular expressions and those that @option{-R} recognizes.
2710
2711@enumerate
2712@item
Lookahead assertions do not allow repeat quantifiers after them.
Perl permits them, but they do not mean what you
might think.  For example, @samp{(?!a)@{3@}} does not assert that the
next three characters are not @samp{a}.  It just asserts three times that the
next character is not @samp{a}, which is a waste of time and nothing else.
2718
2719@item
Capturing subpatterns that occur inside negative lookahead
assertions are counted, but their entries are counted
as empty in the second half of an @code{s} command.
2723Perl sets its numerical variables from any such patterns
2724that are matched before the assertion fails to match
2725something (thereby succeeding), but only if the negative
2726lookahead assertion contains just one branch.
2727
2728@item
2729The following Perl escape sequences are not supported:
2730@samp{\l}, @samp{\u}, @samp{\L}, @samp{\U}, @samp{\E},
2731@samp{\Q}. In fact these are implemented by Perl's general
2732string-handling and are not part of its pattern matching engine.
2733
2734@item
2735The Perl @samp{\G} assertion is not supported as it is not
2736relevant to single pattern matches.
2737
2738@item
2739Fairly obviously, @value{SSED} does not support the @samp{(?@{code@})}
2740and @samp{(?p@{code@})} constructions. However, there is some experimental
2741support for recursive patterns using the non-Perl item @samp{(?R)}.
2742
2743@item
2744There are at the time of writing some oddities in Perl
27455.005_02 concerned with the settings of captured strings
2746when part of a pattern is repeated. For example, matching
2747@samp{aba} against the pattern @samp{/^(a(b)?)+$/} sets
2748@samp{$2}@footnote{@samp{$2} would be @samp{\2} in @value{SSED}.}
2749to the value @samp{b}, but matching @samp{aabbaa}
2750against @samp{/^(aa(bb)?)+$/} leaves @samp{$2}
2751unset.  However, if the pattern is changed to
2752@samp{/^(aa(b(b))?)+$/} then @samp{$2} (and @samp{$3}) are set.
2753In Perl 5.004 @samp{$2} is set in both cases, and that is also
2754true of @value{SSED}.
2755
2756@item
2757Another as yet unresolved discrepancy is that in Perl
27585.005_02 the pattern @samp{/^(a)?(?(1)a|b)+$/} matches
2759the string @samp{a}, whereas in @value{SSED} it does not.
However, in both Perl and @value{SSED} @samp{/^(a)?a/} matched
against @samp{a} leaves @samp{$1} unset.
2762@end enumerate
2763@end ifset
2764
2765@node Other Resources
2766@chapter Other Resources for Learning About @command{sed}
2767
2768@cindex Additional reading about @command{sed}
2769In addition to several books that have been written about @command{sed}
2770(either specifically or as chapters in books which discuss
2771shell programming), one can find out more about @command{sed}
2772(including suggestions of a few books) from the FAQ
2773for the @code{sed-users} mailing list, available from:
2774@display
2775@uref{http://sed.sourceforge.net/sedfaq.html}
2776@end display
2777
2778Also of interest are
2779@uref{http://www.student.northpark.edu/pemente/sed/index.htm}
2780and @uref{http://sed.sf.net/grabbag},
2781which include @command{sed} tutorials and other @command{sed}-related goodies.
2782
The @code{sed-users} mailing list itself is maintained by Sven Guckes.
2784To subscribe, visit @uref{http://groups.yahoo.com} and search
2785for the @code{sed-users} mailing list.
2786
2787@node Reporting Bugs
2788@chapter Reporting Bugs
2789
2790@cindex Bugs, reporting
2791Email bug reports to @email{bonzini@@gnu.org}.
2792Be sure to include the word ``sed'' somewhere in the @code{Subject:} field.
2793Also, please include the output of @samp{sed --version} in the body
2794of your report if at all possible.
2795
2796Please do not send a bug report like this:
2797
2798@example
2799@i{@i{@r{while building frobme-1.3.4}}}
2800$ configure
2801@error{} sed: file sedscr line 1: Unknown option to 's'
2802@end example
2803
2804If @value{SSED} doesn't configure your favorite package, take a
2805few extra minutes to identify the specific problem and make a stand-alone
2806test case.  Unlike other programs such as C compilers, making such test
2807cases for @command{sed} is quite simple.
2808
2809A stand-alone test case includes all the data necessary to perform the
2810test, and the specific invocation of @command{sed} that causes the problem.
2811The smaller a stand-alone test case is, the better.  A test case should
2812not involve something as far removed from @command{sed} as ``try to configure
2813frobme-1.3.4''.  Yes, that is in principle enough information to look
2814for the bug, but that is not a very practical prospect.
2815
2816Here are a few commonly reported bugs that are not bugs.
2817
2818@table @asis
2819@item @code{N} command on the last line
2820@cindex Portability, @code{N} command on the last line
2821@cindex Non-bugs, @code{N} command on the last line
2822
Most versions of @command{sed} exit without printing anything when
the @code{N} command is issued on the last line of a file.
@value{SSED} prints pattern space before exiting unless, of course,
the @option{-n} command switch has been specified.  This choice is
by design.
2828
2829For example, the behavior of
2830@example
2831sed N foo bar
2832@end example
2833@noindent
would depend on whether @file{foo} has an even or an odd number of
lines@footnote{which is the actual ``bug'' that prompted the
change in behavior}.  Or, when writing a script to read the
next few lines following a pattern match, traditional
implementations of @command{sed} would force you to write
something like
2840@example
2841/foo/@{ $!N; $!N; $!N; $!N; $!N; $!N; $!N; $!N; $!N @}
2842@end example
2843@noindent
2844instead of just
2845@example
2846/foo/@{ N;N;N;N;N;N;N;N;N; @}
2847@end example
2848
2849@cindex @code{POSIXLY_CORRECT} behavior, @code{N} command
2850In any case, the simplest workaround is to use @code{$d;N} in
2851scripts that rely on the traditional behavior, or to set
2852the @code{POSIXLY_CORRECT} variable to a non-empty value.
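
The difference can be observed directly from a shell.  The following is
a minimal sketch, assuming that @command{sed} on your @env{PATH} is
@acronym{GNU} @command{sed}; the variable names are only illustrative.

```shell
# Three input lines: the second N finds no next line to read.
input='a
b
c'

# Default GNU behavior: pattern space ("c") is printed before exiting.
gnu_out=$(printf '%s\n' "$input" | sed N)

# Traditional behavior, requested explicitly with $d before N.
trad_out=$(printf '%s\n' "$input" | sed '$d;N')

# Traditional behavior, requested through the environment.
posix_out=$(printf '%s\n' "$input" | POSIXLY_CORRECT=1 sed N)

printf 'default:\n%s\n\ntraditional:\n%s\n' "$gnu_out" "$trad_out"
```

With the default behavior all three lines appear in the output; with
either workaround the trailing @samp{c} is dropped.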
2853
2854@item Regex syntax clashes (problems with backslashes)
2855@cindex @acronym{GNU} extensions, to basic regular expressions
2856@cindex Non-bugs, regex syntax clashes
2857@command{sed} uses the @sc{posix} basic regular expression syntax.  According to
2858the standard, the meaning of some escape sequences is undefined in
2859this syntax;  notable in the case of @command{sed} are @code{\|},
2860@code{\+}, @code{\?}, @code{\`}, @code{\'}, @code{\<},
2861@code{\>}, @code{\b}, @code{\B}, @code{\w}, and @code{\W}.
2862
2863As in all @acronym{GNU} programs that use @sc{posix} basic regular
2864expressions, @command{sed} interprets these escape sequences as special
2865characters.  So, @code{x\+} matches one or more occurrences of @samp{x}.
2866@code{abc\|def} matches either @samp{abc} or @samp{def}.
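
As a quick check of the two escapes just described, here is a small
sketch, assuming @acronym{GNU} @command{sed}; the sample strings and
variable names are made up for the illustration.

```shell
# \+ means "one or more" in GNU basic regular expressions.
plus_out=$(echo 'axxxb' | sed 's/x\+/Y/')

# \| means alternation in GNU basic regular expressions.
alt_out=$(echo 'abc def ghi' | sed 's/abc\|def/X/g')

echo "$plus_out"   # aYb
echo "$alt_out"    # X X ghi
```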
2867
2868This syntax may cause problems when running scripts written for other
2869@command{sed}s.  Some @command{sed} programs have been written with the
2870assumption that @code{\|} and @code{\+} match the literal characters
2871@code{|} and @code{+}.  Such scripts must be modified by removing the
2872spurious backslashes if they are to be used with modern implementations
2873of @command{sed}, like
2874@ifset PERL
2875@value{SSED} or
2876@end ifset
2877@acronym{GNU} @command{sed}.
2878
On the other hand, some scripts use @code{s|abc\|def||g} to remove occurrences
of @emph{either} @code{abc} or @code{def}.  While this worked until
@command{sed} 4.0.x, newer versions interpret this as removing the
string @code{abc|def}.  This is again undefined behavior according to
@acronym{POSIX}, and the newer interpretation is arguably more robust.
Older @command{sed}s, for example, required the regex matcher to parse
@code{\/} as @code{/} in the common case of escaping a slash, which is
itself undefined behavior.  The new behavior avoids this, which is good
because the regex matcher is only partially under our control.
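
The effect of the delimiter on @code{\|} can be seen by running the same
substitution with two different delimiters.  This is a sketch that
assumes a @acronym{GNU} @command{sed} newer than 4.0.x; the input string
is invented for the example.

```shell
line='abc|def abc def'

# With | as the delimiter, \| is a literal |, so only "abc|def" is removed.
delim_out=$(echo "$line" | sed 's|abc\|def||g')

# With / as the delimiter, \| is alternation, so every "abc" and "def" goes.
slash_out=$(echo "$line" | sed 's/abc\|def//g')

printf '[%s]\n[%s]\n' "$delim_out" "$slash_out"
```

Scripts that relied on the old meaning can usually be repaired by
switching to a delimiter that does not appear in the regular expression.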
2888
2889@cindex @acronym{GNU} extensions, special escapes
2890In addition, this version of @command{sed} supports several escape characters
2891(some of which are multi-character) to insert non-printable characters
2892in scripts (@code{\a}, @code{\c}, @code{\d}, @code{\o}, @code{\r},
2893@code{\t}, @code{\v}, @code{\x}).  These can cause similar problems
2894with scripts written for other @command{sed}s.
2895
2896@item @option{-i} clobbers read-only files
2897@cindex In-place editing
2898@cindex @value{SSEDEXT}, in-place editing
2899@cindex Non-bugs, in-place editing
2900
2901In short, @samp{sed -i} will let you delete the contents of
2902a read-only file, and in general the @option{-i} option
2903(@pxref{Invoking sed, , Invocation}) lets you clobber
2904protected files.  This is not a bug, but rather a consequence
2905of how the Unix filesystem works.
2906
2907The permissions on a file say what can happen to the data
2908in that file, while the permissions on a directory say what can
happen to the list of files in that directory.  @samp{sed -i}
never opens for writing a file that is already on disk.
Rather, it works on a temporary file that is finally renamed
to the original name: if you rename or delete files, you're actually
modifying the contents of the directory, so the operation depends on
the permissions of the directory, not of the file.  For this same
reason, @command{sed} does not let you use @option{-i} on a writable file
in a read-only directory, and will break hard or symbolic links when
@option{-i} is used on such a file.
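
The following sketch illustrates both consequences in a scratch
directory.  It assumes @acronym{GNU} @command{sed} (other implementations
spell @option{-i} differently) and the usual @command{mktemp},
@command{ln} and @command{chmod} utilities; the file names are arbitrary.

```shell
dir=$(mktemp -d)
printf 'hello\n' > "$dir/file"
ln "$dir/file" "$dir/link"     # second name for the same data
chmod a-w "$dir/file"          # the file itself is now read-only

# The directory is writable, so the temporary file can be renamed
# over the read-only original.
sed -i 's/hello/goodbye/' "$dir/file"

file_text=$(cat "$dir/file")   # new contents
link_text=$(cat "$dir/link")   # old contents: the hard link was broken
echo "file: $file_text, link: $link_text"
rm -rf "$dir"
```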
2918
2919@item @code{0a} does not work (gives an error)
2920@cindex @code{0} address
2921@cindex @acronym{GNU} extensions, @code{0} address
2922@cindex Non-bugs, @code{0} address
2923
2924There is no line 0.  0 is a special address that is only used to treat
2925addresses like @code{0,/@var{RE}/} as active when the script starts: if
2926you write @code{1,/abc/d} and the first line includes the word @samp{abc},
2927then that match would be ignored because address ranges must span at least
2928two lines (barring the end of the file); but what you probably wanted is
2929to delete every line up to the first one including @samp{abc}, and this
2930is obtained with @code{0,/abc/d}.
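
A short sketch of the difference, assuming @acronym{GNU} @command{sed}
(the @code{0} address is an extension); the input is invented so that
the first line already matches.

```shell
input='abc
x
abc
y'

# 1,/abc/ starts on line 1 and looks for the end from line 2 onwards,
# so lines 1 through 3 are deleted.
one_out=$(printf '%s\n' "$input" | sed '1,/abc/d')

# 0,/abc/ may end on line 1 itself, so only that line is deleted.
zero_out=$(printf '%s\n' "$input" | sed '0,/abc/d')

printf '1,/abc/d:\n%s\n\n0,/abc/d:\n%s\n' "$one_out" "$zero_out"
```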
2931
2932@ifclear PERL
2933@item @code{[a-z]} is case insensitive
2934@cindex Non-bugs, localization-related
2935
2936You are encountering problems with locales.  POSIX mandates that @code{[a-z]}
2937uses the current locale's collation order -- in C parlance, that means using
2938@code{strcoll(3)} instead of @code{strcmp(3)}.  Some locales have a
2939case-insensitive collation order, others don't.
2940
2941Another problem is that @code{[a-z]} tries to use collation symbols.
2942This only happens if you are on the @acronym{GNU} system, using
2943@acronym{GNU} libc's regular expression matcher instead of compiling the
2944one supplied with @acronym{GNU} sed.  In a Danish locale, for example,
2945the regular expression @code{^[a-z]$} matches the string @samp{aa},
2946because this is a single collating symbol that comes after @samp{a}
2947and before @samp{b}; @samp{ll} behaves similarly in Spanish
2948locales, or @samp{ij} in Dutch locales.
2949
2950To work around these problems, which may cause bugs in shell scripts, set
2951the @env{LC_COLLATE} and @env{LC_CTYPE} environment variables to @samp{C}.
2952
2953@item @code{s/.*//} does not clear pattern space
2954@cindex Non-bugs, localization-related
2955@cindex @value{SSEDEXT}, emptying pattern space
2956@cindex Emptying pattern space
2957
2958This happens if your input stream includes invalid multibyte
2959sequences.  @sc{posix} mandates that such sequences
2960are @emph{not} matched by @samp{.}, so that @samp{s/.*//} will not clear
pattern space as you would expect.  In fact, there is no way to clear
@command{sed}'s buffers in the middle of the script in most multibyte locales
(including UTF-8 locales).  For this reason, @value{SSED} provides a @code{z}
command (for ``zap'') as an extension.
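
As a sketch of how the extension is used, assuming a @command{sed} that
provides @code{z} (recent @acronym{GNU} @command{sed} releases do), the
following empties pattern space and then shows that it is indeed empty.
The replacement text is only a marker for the illustration.

```shell
# z empties pattern space regardless of its contents; the following
# s command then sees an empty line on every cycle.
zap_out=$(printf 'one\ntwo\n' | sed 'z;s/^$/(empty)/')
echo "$zap_out"
```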
2965
2966To work around these problems, which may cause bugs in shell scripts, set
2967the @env{LC_COLLATE} and @env{LC_CTYPE} environment variables to @samp{C}.
2968@end ifclear
2969@end table
2970
2971
2972@node Extended regexps
2973@appendix Extended regular expressions
2974@cindex Extended regular expressions, syntax
2975
The only difference between basic and extended regular expressions is in
the behavior of a few characters: @samp{?}, @samp{+}, parentheses,
braces (@samp{@{@}}), and @samp{|}.  While basic regular expressions require
these to be escaped if you want them to behave as special characters,
when using extended regular expressions you must escape them if
you want them @emph{to match a literal character}.
2982
2983@noindent
2984Examples:
2985@table @code
2986@item abc?
2987becomes @samp{abc\?} when using extended regular expressions.  It matches
2988the literal string @samp{abc?}.
2989
2990@item c\+
2991becomes @samp{c+} when using extended regular expressions.  It matches
2992one or more @samp{c}s.
2993
2994@item a\@{3,\@}
2995becomes @samp{a@{3,@}} when using extended regular expressions.  It matches
2996three or more @samp{a}s.
2997
2998@item \(abc\)\@{2,3\@}
2999becomes @samp{(abc)@{2,3@}} when using extended regular expressions.  It
3000matches either @samp{abcabc} or @samp{abcabcabc}.
3001
3002@item \(abc*\)\1
3003becomes @samp{(abc*)\1} when using extended regular expressions.
3004Backreferences must still be escaped when using extended regular
3005expressions.
3006@end table
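
The pairs in the table above can be compared side by side.  This is a
sketch assuming @acronym{GNU} @command{sed}, where extended syntax is
selected with @option{-r}; the sample inputs are invented.

```shell
# Literal question mark: unescaped in basic, escaped in extended syntax.
bre_q=$(echo 'abc?' | sed 's/abc?/X/')
ere_q=$(echo 'abc?' | sed -r 's/abc\?/X/')

# One or more c's.
bre_plus=$(echo 'ccc' | sed 's/c\+/X/')
ere_plus=$(echo 'ccc' | sed -r 's/c+/X/')

# Two or three repetitions of a group: the fourth abc is left over.
bre_rep=$(echo 'abcabcabcabc' | sed 's/\(abc\)\{2,3\}/X/')
ere_rep=$(echo 'abcabcabcabc' | sed -r 's/(abc){2,3}/X/')

# Backreferences keep their backslash in both syntaxes.
bre_ref=$(echo 'abccabcc' | sed 's/\(abc*\)\1/X/')
ere_ref=$(echo 'abccabcc' | sed -r 's/(abc*)\1/X/')

echo "$bre_q $ere_q $bre_plus $ere_plus $bre_rep $ere_rep $bre_ref $ere_ref"
```

Each basic/extended pair produces the same result, which is the point of
the table: only the escaping differs.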
3007
3008@ifset PERL
3009@node Perl regexps
3010@appendix Perl-style regular expressions
3011@cindex Perl-style regular expressions, syntax
3012
3013@emph{This part is taken from the @file{pcre.txt} file distributed together
3014with the free @sc{pcre} regular expression matcher; it was written by Philip Hazel.}
3015
3016Perl introduced several extensions to regular expressions, some
3017of them incompatible with the syntax of regular expressions
3018accepted by Emacs and other @acronym{GNU} tools (whose matcher was
3019based on the Emacs matcher).  @value{SSED} implements
3020both kinds of extensions.
3021
3022@iftex
3023Summarizing, we have:
3024
3025@itemize @bullet
3026@item
3027A backslash can introduce several special sequences
3028
3029@item
3030The circumflex, dollar sign, and period characters behave specially
3031with regard to new lines
3032
3033@item
3034Strange uses of square brackets are parsed differently
3035
3036@item
3037You can toggle modifiers in the middle of a regular expression
3038
3039@item
3040You can specify that a subpattern does not count when numbering backreferences
3041
3042@item
3043@cindex Greedy regular expression matching
3044You can specify greedy or non-greedy matching
3045
3046@item
3047You can have more than ten back references
3048
3049@item
3050You can do complex look aheads and look behinds (in the spirit of
3051@code{\b}, but with subpatterns).
3052
3053@item
You can often improve performance by keeping @command{sed} from wasting
time on backtracking
3056
3057@item
3058You can have if/then/else branches
3059
3060@item
3061You can do recursive matches, for example to look for unbalanced parentheses
3062
3063@item
3064You can have comments and non-significant whitespace, because things can
3065get complex...
3066@end itemize
3067
3068Most of these extensions are introduced by the special @code{(?}
3069sequence, which gives special meanings to parenthesized groups.
3070@end iftex
3071@menu
Other extensions can be roughly subdivided in two categories.
On one hand, Perl introduces several more escaped sequences
(that is, sequences introduced by a backslash).  On the other
hand, it specifies that if a question mark follows an open
parenthesis it should give a special meaning to the parenthesized
group.
3078
3079* Backslash::                       Introduces special sequences
3080* Circumflex/dollar sign/period::   Behave specially with regard to new lines
3081* Square brackets::                 Are a bit different in strange cases
3082* Options setting::                 Toggle modifiers in the middle of a regexp
3083* Non-capturing subpatterns::       Are not counted when backreferencing
3084* Repetition::                      Allows for non-greedy matching
3085* Backreferences::                  Allows for more than 10 back references
3086* Assertions::                      Allows for complex look ahead matches
3087* Non-backtracking subpatterns::    Often gives more performance
3088* Conditional subpatterns::         Allows if/then/else branches
3089* Recursive patterns::              For example to match parentheses
3090* Comments::                        Because things can get complex...
3091@end menu
3092
3093@node Backslash
3094@appendixsec Backslash
3095@cindex Perl-style regular expressions, escaped sequences
3096
There are a few differences in the handling of backslashed
3098sequences in Perl mode.
3099
3100First of all, there are no @code{\o} and @code{\d} sequences.
3101@sc{ascii} values for characters can be specified in octal
3102with a @code{\@var{xxx}} sequence, where @var{xxx} is a
3103sequence of up to three octal digits.  If the first digit
3104is a zero, the treatment of the sequence is straightforward;
3105just note that if the character that follows the escaped digit
3106is itself an octal digit, you have to supply three octal digits
3107for @var{xxx}.  For example @code{\07} is a @sc{bel} character
3108rather than a @sc{nul} and a literal @code{7} (this sequence is
3109instead represented by @code{\0007}).
3110
3111@cindex Perl-style regular expressions, backreferences
3112The handling of a backslash followed by a digit other than 0
3113is complicated.  Outside a character class, @command{sed} reads it
3114and any following digits as a decimal number. If the number
3115is less than 10, or if there have been at least that many
3116previous capturing left parentheses in the expression, the
3117entire sequence is taken as a back reference. A description
3118of how this works is given later, following the discussion
3119of parenthesized subpatterns.
3120
3121Inside a character class, or if the decimal number is
3122greater than 9 and there have not been that many capturing
3123subpatterns, @command{sed} re-reads up to three octal digits following
3124the backslash, and generates a single byte from the
3125least significant 8 bits of the value. Any subsequent digits
3126stand for themselves.  For example:
3127
3128@example
3129\040  @i{@r{is another way of writing a space}}
3130\40   @i{@r{is the same, provided there are fewer than 40}}
3131      @i{@r{previous capturing subpatterns}}
3132\7    @i{@r{is always a back reference}}
3133\011  @i{@r{is always a tab}}
3134\11   @i{@r{might be a back reference, or another way of writing a tab}}
3135\0113 @i{@r{is a tab followed by the character @samp{3}}}
3136\113  @i{@r{is the character with octal code 113 (since there}}
3137      @i{@r{can be no more than 99 back references)}}
3138\377  @i{@r{is a byte consisting entirely of 1 bits (@sc{ascii} 255)}}
3139\81   @i{@r{is either a back reference, or a binary zero}}
3140      @i{@r{followed by the two characters @samp{81}}}
3141@end example
3142
3143Note that octal values of 100 or greater must not be introduced
3144by a leading zero, because no more than three octal
digits are ever read. Note that this applies only to the LHS
pattern; it is not possible yet to specify more than 9 backreferences
on the RHS of the @code{s} command.
3148
3149All the sequences that define a single byte value can be
3150used both inside and outside character classes. In addition,
3151inside a character class, the sequence @code{\b} is interpreted
3152as the backspace character (hex 08). Outside a character
3153class it has a different meaning (see below).
3154
3155In addition, there are four additional escapes specifying
3156generic character classes (like @code{\w} and @code{\W} do):
3157
3158@cindex Perl-style regular expressions, character classes
3159@table @samp
3160@item \d
3161Matches any decimal digit
3162
3163@item \D
3164Matches any character that is not a decimal digit
3165@end table
3166
In Perl mode, these character type sequences can appear both inside and
outside character classes.  In @sc{posix} mode, by contrast, these sequences
(as well as @code{\w} and @code{\W}) are treated as two literal characters
(a backslash and a letter) inside square brackets.
3171
3172Escaped sequences specifying assertions are also different in
3173Perl mode.  An assertion specifies a condition that has to be met
3174at a particular point in a match, without consuming any
3175characters from the subject string. The use of subpatterns
3176for more complicated assertions is described below.  The
3177backslashed assertions are
3178
3179@cindex Perl-style regular expressions, assertions
3180@table @samp
3181@item \b
3182Asserts that the point is at a word boundary.
3183A word boundary is a position in the subject string where
3184the current character and the previous character do not both
3185match @code{\w} or @code{\W} (i.e. one matches @code{\w} and
3186the other matches @code{\W}), or the start or end of the string
3187if the first or last character matches @code{\w}, respectively.
3188
3189@item \B
3190Asserts that the point is not at a word boundary.
3191
3192@item \A
3193Asserts the matcher is at the start of pattern space (independent
3194of multiline mode).
3195
3196@item \Z
3197Asserts the matcher is at the end of pattern space,
3198or at a newline before the end of pattern space (independent of
3199multiline mode)
3200
3201@item \z
3202Asserts the matcher is at the end of pattern space (independent
3203of multiline mode)
3204@end table
3205
3206These assertions may not appear in character classes (but
3207note that @code{\b} has a different meaning, namely the
3208backspace character, inside a character class).
Note that Perl mode does not directly support assertions
for the beginning and the end of a word; the @acronym{GNU} extensions
@code{\<} and @code{\>} achieve this purpose in @sc{posix} mode
instead.
3213
3214The @code{\A}, @code{\Z}, and @code{\z} assertions differ
3215from the traditional circumflex and dollar sign (described below)
3216in that they only ever match at the very start and end of the
3217subject string, whatever options are set; in particular @code{\A}
3218and @code{\z} are the same as the @acronym{GNU} extensions
3219@code{\`} and @code{\'} that are active in @sc{posix} mode.
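
The word-boundary assertions can be tried out even without Perl mode,
because @acronym{GNU} @command{sed} also accepts @code{\b} and @code{\B}
in @sc{posix} mode.  The sketch below assumes such a @command{sed} and
uses it as a stand-in; the input is invented.

```shell
# \b on both sides: only the free-standing word is replaced.
word_out=$(echo 'cat concat' | sed 's/\bcat\b/X/g')

# \B in front: only a "cat" that continues a word is replaced.
inner_out=$(echo 'cat concat' | sed 's/\Bcat/X/g')

echo "$word_out"    # X concat
echo "$inner_out"   # cat conX
```

Neither assertion consumes a character, which is why the space between
the two words survives in both results.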
3220
3221@node Circumflex/dollar sign/period
3222@appendixsec Circumflex, dollar sign, period
3223@cindex Perl-style regular expressions, newlines
3224
3225Outside a character class, in the default matching mode, the
3226circumflex character is an assertion which is true only if
3227the current matching point is at the start of the subject
3228string.  Inside a character class, the circumflex has an entirely
3229different meaning (see below).
3230
3231The circumflex need not be the first character of the pattern if
3232a number of alternatives are involved, but it should be the
3233first thing in each alternative in which it appears if the
pattern is ever to match that branch. If all possible alternatives
start with a circumflex, that is, if the pattern is
constrained to match only at the start of the subject, it is
said to be an @dfn{anchored} pattern. (There are also other
constructs that can cause a pattern to be anchored.)
3239
3240A dollar sign is an assertion which is true only if the
3241current matching point is at the end of the subject string,
3242or immediately before a newline character that is the last
3243character in the string (by default).  A dollar sign need not be the
3244last character of the pattern if a number of alternatives
3245are involved, but it should be the last item in any branch
3246in which it appears.  A dollar sign has no special meaning in a
3247character class.
3248
3249@cindex Perl-style regular expressions, multiline
3250The meanings of the circumflex and dollar sign characters are
3251changed if the @code{M} modifier option is used. When this is
3252the case, they match immediately after and immediately
3253before an internal @code{\n} character, respectively, in addition
3254to matching at the start and end of the subject string.  For
3255example, the pattern @code{/^abc$/} matches the subject string
3256@samp{def\nabc} in multiline mode, but not otherwise.  Consequently,
3257patterns that are anchored in single line mode
3258because all branches start with @code{^} are not anchored in
3259multiline mode.
3260
3261@cindex Perl-style regular expressions, multiline
3262Note that the sequences @code{\A}, @code{\Z}, and @code{\z}
3263can be used to match the start and end of the subject in both
modes, and if all branches of a pattern start with @code{\A}
it is always anchored, whether the @code{M} modifier is set or not.
3266
3267@cindex Perl-style regular expressions, single line
3268Outside a character class, a dot in the pattern matches any
3269one character in the subject, including a non-printing character,
3270but not (by default) newline.  If the @code{S} modifier is used,
3271dots match newlines as well.  Actually, the handling of
3272dot is entirely independent of the handling of circumflex
3273and dollar sign, the only relationship being that they both
3274involve newline characters. Dot has no special meaning in a
3275character class.
3276
3277@node Square brackets
3278@appendixsec Square brackets
3279@cindex Perl-style regular expressions, character classes
3280
3281An opening square bracket introduces a character class, terminated
3282by a closing square bracket.  A closing square bracket on its own
3283is not special.  If a closing square bracket is required as a
3284member of the class, it should be the first data character in
3285the class (after an initial circumflex, if present) or escaped with a backslash.
3286
3287A character class matches a single character in the subject;
3288the character must be in the set of characters defined by
3289the class, unless the first character in the class is a circumflex,
3290in which case the subject character must not be in
3291the set defined by the class. If a circumflex is actually
3292required as a member of the class, ensure it is not the
3293first character, or escape it with a backslash.
3294
For example, the character class @code{[aeiou]} matches any lower
case vowel, while @code{[^aeiou]} matches any character that is not
a lower case vowel. Note that a circumflex is just a convenient
notation for specifying the characters which are in
the class by enumerating those that are not. It is not an
assertion: it still consumes a character from the subject
string, and fails if the current pointer is at the end of
the string.
3303
3304@cindex Perl-style regular expressions, case-insensitive
3305When caseless matching is set, any letters in a class
3306represent both their upper case and lower case versions, so
3307for example, a caseless @code{[aeiou]} matches uppercase
3308and lowercase @samp{A}s, and a caseless @code{[^aeiou]}
3309does not match @samp{A}, whereas a case-sensitive version would.
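
The same effect can be observed with the @code{I} modifier.  This is a
sketch assuming @acronym{GNU} @command{sed}, run in @sc{posix} mode as a
stand-in for Perl mode; the variable names are illustrative.

```shell
# Caseless [aeiou] accepts an upper case A.
caseless=$(echo 'A' | sed -n '/[aeiou]/Ip')

# Caseless [^aeiou] rejects it, because A counts as a vowel here.
negated=$(echo 'A' | sed -n '/[^aeiou]/Ip')

# Case-sensitive [^aeiou] accepts it, because only lower case is excluded.
sensitive=$(echo 'A' | sed -n '/[^aeiou]/p')

printf '[%s] [%s] [%s]\n' "$caseless" "$negated" "$sensitive"
```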
3310
3311@cindex Perl-style regular expressions, single line
3312@cindex Perl-style regular expressions, multiline
3313The newline character is never treated in any special way in
3314character classes, whatever the setting of the @code{S} and
3315@code{M} options (modifiers) is.  A class such as @code{[^a]} will
3316always match a newline.
3317
3318The minus (hyphen) character can be used to specify a range
3319of characters in a character class.  For example, @code{[d-m]}
3320matches any letter between d and m, inclusive.  If a minus
3321character is required in a class, it must be escaped with a
3322backslash or appear in a position where it cannot be interpreted
3323as indicating a range, typically as the first or last
3324character in the class.
3325
3326It is not possible to have the literal character @code{]} as the
3327end character of a range.  A pattern such as @code{[W-]46]} is
3328interpreted as a class of two characters (@code{W} and @code{-})
3329followed by a literal string @code{46]}, so it would match
3330@samp{W46]} or @samp{-46]}. However, if the @code{]} is escaped
3331with a backslash it is interpreted as the end of range, so
3332@code{[W-\]46]} is interpreted as a single class containing a
3333range followed by two separate characters. The octal or
3334hexadecimal representation of @code{]} can also be used to end a range.
3335
3336Ranges operate in @sc{ascii} collating sequence. They can also be
3337used for characters specified numerically, for example
3338@code{[\000-\037]}. If a range that includes letters is used when
3339caseless matching is set, it matches the letters in either
3340case. For example, a caseless @code{[W-c]} is equivalent to
3341@code{[][\^_`wxyzabc]}, matched caselessly, and if character
3342tables for the French locale are in use, @code{[\xc8-\xcb]}
3343matches accented E characters in both cases.
3344
3345Unlike in @sc{posix} mode, the character types @code{\d},
3346@code{\D}, @code{\s}, @code{\S}, @code{\w}, and @code{\W}
3347may also appear in a character class, and add the characters
3348that they match to the class. For example, @code{[\dABCDEF]} matches any
3349hexadecimal digit.  A circumflex can conveniently be used
3350with the upper case character types to specify a more restricted
3351set of characters than the matching lower case type.
3352For example, the class @code{[^\W_]} matches any letter or digit,
3353but not underscore.
3354
All non-alphanumeric characters other than @code{\}, @code{-},
3356@code{^} (at the start) and the terminating @code{]}
3357are non-special in character classes, but it does no harm
3358if they are escaped.
3359
3360Perl 5.6 supports the @sc{posix} notation for character classes, which
3361uses names enclosed by @code{[:} and @code{:]} within the enclosing
3362square brackets, and @value{SSED} supports this notation as well.
3363For example,
3364
3365@example
3366[01[:alpha:]%]
3367@end example
3368
3369@noindent
3370matches @samp{0}, @samp{1}, any alphabetic character, or @samp{%}.
3371The supported class names are
3372
3373@table @code
3374@item alnum
3375Matches letters and digits
3376
3377@item alpha
3378Matches letters
3379
3380@item ascii
3381Matches character codes 0 - 127
3382
3383@item cntrl
3384Matches control characters
3385
3386@item digit
Matches decimal digits (same as @code{\d})
3388
3389@item graph
3390Matches printing characters, excluding space
3391
3392@item lower
3393Matches lower case letters
3394
3395@item print
3396Matches printing characters, including space
3397
3398@item punct
3399Matches printing characters, excluding letters and digits
3400
3401@item space
Matches white space (same as @code{\s})
3403
3404@item upper
3405Matches upper case letters
3406
3407@item word
Matches ``word'' characters (same as @code{\w})
3409
3410@item xdigit
3411Matches hexadecimal digits
3412@end table
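
The bracketed class notation can be tried with the
@code{[01[:alpha:]%]} example given earlier.  This sketch assumes
@acronym{GNU} @command{sed} in @sc{posix} mode, which accepts the same
standard class names; @env{LC_ALL} is set to @samp{C} so that
@code{[:alpha:]} means just the @sc{ascii} letters.

```shell
# 0, a and % belong to the class; 5 does not.
class_out=$(echo '0a%5' | LC_ALL=C sed 's/[01[:alpha:]%]/X/g')
echo "$class_out"   # XXX5
```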
3413
3414The names @code{ascii} and @code{word} are extensions valid only in
3415Perl mode.  Another Perl extension is negation, which is
3416indicated by a circumflex character after the colon. For example,
3417
3418@example
3419[12[:^digit:]]
3420@end example
3421
3422@noindent
3423matches @samp{1}, @samp{2}, or any non-digit.
3424
3425@node Options setting
3426@appendixsec Options setting
3427@cindex Perl-style regular expressions, toggling options
3428@cindex Perl-style regular expressions, case-insensitive
3429@cindex Perl-style regular expressions, multiline
3430@cindex Perl-style regular expressions, single line
3431@cindex Perl-style regular expressions, extended
3432
3433The settings of the @code{I}, @code{M}, @code{S}, @code{X}
3434modifiers can be changed from within the pattern by
3435a sequence of Perl option letters enclosed between @code{(?}
3436and @code{)}. The option letters must be lowercase.
3437
3438For example, @code{(?im)} sets caseless, multiline matching. It is
3439also possible to unset these options by preceding the letter
3440with a hyphen; you can also have combined settings and unsettings:
@code{(?im-sx)} sets caseless and multiline matching,
while unsetting single line matching (for dots) and extended
whitespace interpretation.  If a letter appears both before
and after the hyphen, the option is unset.
3445
3446The scope of these option changes depends on where in the
3447pattern the setting occurs. For settings that are outside
3448any subpattern (defined below), the effect is the same as if
3449the options were set or unset at the start of matching. The
3450following patterns all behave in exactly the same way:
3451
3452@example
3453(?i)abc
3454a(?i)bc
3455ab(?i)c
3456abc(?i)
3457@end example
3458
@noindent
which in turn is the same as specifying the pattern @code{abc} with
3460the @code{I} modifier.  In other words, ``top level'' settings
3461apply to the whole pattern (unless there are other
3462changes inside subpatterns). If there is more than one setting
3463of the same option at top level, the rightmost setting
3464is used.
3465
3466If an option change occurs inside a subpattern, the effect
3467is different.  This is a change of behaviour in Perl 5.005.
3468An option change inside a subpattern affects only that part
3469of the subpattern @emph{that follows} it, so
3470
3471@example
3472(a(?i)b)c
3473@end example
3474
3475@noindent
matches @samp{abc} and @samp{aBc} and no other strings (assuming
3477case-sensitive matching is used).  By this means, options can
3478be made to have different settings in different parts of the
3479pattern.  Any changes made in one alternative do carry on
3480into subsequent branches within the same subpattern.  For
3481example,
3482
3483@example
3484(a(?i)b|c)
3485@end example
3486
3487@noindent
3488matches @samp{ab}, @samp{aB}, @samp{c}, and @samp{C},
3489even though when matching @samp{C} the first branch is
3490abandoned before the option setting.
3491This is because the effects of option settings happen at
3492compile time. There would be some very weird behaviour otherwise.
3493
3494@ignore
3495There are two PCRE-specific options PCRE_UNGREEDY and PCRE_EXTRA
3496that can be changed in the same way as the Perl-compatible options by
3497using the characters U and X respectively.  The (?X) flag
3498setting is special in that it must always occur earlier in
3499the pattern than any of the additional features it turns on,
3500even when it is at top level. It is best put at the start.
3501@end ignore
3502
3503
3504@node Non-capturing subpatterns
3505@appendixsec Non-capturing subpatterns
3506@cindex Perl-style regular expressions, non-capturing subpatterns
3507
3508Marking part of a pattern as a subpattern does two things.
3509On one hand, it localizes a set of alternatives; on the other
3510hand, it sets up the subpattern as a capturing subpattern (as
3511defined above).  The subpattern can be backreferenced and
3512referenced in the right side of @code{s} commands.
3513
3514For example, if the string @samp{the red king} is matched against
3515the pattern
3516
3517@example
3518the ((red|white) (king|queen))
3519@end example
3520
3521@noindent
3522the captured substrings are @samp{red king}, @samp{red},
3523and @samp{king}, and are numbered 1, 2, and 3.
3524
3525The fact that plain parentheses fulfil two functions is not
3526always helpful.  There are often times when a grouping
3527subpattern is required without a capturing requirement.  If an
3528opening parenthesis is followed by @code{?:}, the subpattern does
3529not do any capturing, and is not counted when computing the
3530number of any subsequent capturing subpatterns. For example,
3531if the string @samp{the white queen} is matched against the pattern
3532
3533@example
3534the ((?:red|white) (king|queen))
3535@end example
3536
3537@noindent
3538the captured substrings are @samp{white queen} and @samp{queen},
3539and are numbered 1 and 2. The maximum number of captured
3540substrings is 99, while the maximum number of all subpatterns,
3541both capturing and non-capturing, is 200.
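
The numbering rules above can be checked with any engine that uses the
same Perl-style grouping syntax.  The following is only a sketch: it
assumes Python's @code{re} module as a stand-in and does not invoke
@command{sed}, and the helper name @code{captures} is hypothetical.

@verbatim
```python
import re

def captures(pattern, text):
    """Return the captured substrings, in group-number order."""
    match = re.fullmatch(pattern, text)
    return match.groups() if match else None

# Plain parentheses both group and capture: three numbered substrings.
capturing = captures(r"the ((red|white) (king|queen))", "the red king")

# (?:...) groups without capturing, so the later group is numbered 2.
non_capturing = captures(r"the ((?:red|white) (king|queen))",
                         "the white queen")
```
@end verbatim

Under these assumptions, @code{capturing} holds three substrings and
@code{non_capturing} holds two, in the order given in the text.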
3542
As a convenient shorthand, if any option settings are
required at the start of a non-capturing subpattern, the
3545option letters may appear between the @code{?} and the
3546@code{:}.  Thus the two patterns
3547
3548@example
3549(?i:saturday|sunday)
3550(?:(?i)saturday|sunday)
3551@end example
3552
3553@noindent
3554match exactly the same set of strings.  Because alternative
3555branches are tried from left to right, and options are not
3556reset until the end of the subpattern is reached, an option
3557setting in one branch does affect subsequent branches, so
3558the above patterns match @samp{SUNDAY} as well as @samp{Saturday}.
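
A small sketch of the shorthand form, assuming Python's @code{re}
module as a stand-in rather than @command{sed} itself.  Only the
@code{(?i:...)} spelling is shown, because recent Python versions
reject an option setting such as @code{(?i)} that is not at the very
start of the pattern.  The names @code{weekend}, @code{is_weekend} and
@code{scoped} are hypothetical.

@verbatim
```python
import re

# The option letter sits between "?" and ":" and applies to every
# alternative inside the group, but to nothing outside it.
weekend = re.compile(r"(?i:saturday|sunday)")

def is_weekend(word):
    """True if the whole word matches the caseless weekend pattern."""
    return weekend.fullmatch(word) is not None

# Outside the group, matching stays case-sensitive.
scoped = re.compile(r"(?i:satur)day")
```
@end verbatim

Here @code{is_weekend} accepts both @samp{SUNDAY} and @samp{Saturday},
while @code{scoped} accepts @samp{SATURday} but not @samp{SATURDAY}.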
3559
3560
3561@node Repetition
3562@appendixsec Repetition
3563@cindex Perl-style regular expressions, repetitions
3564
3565Repetition is specified by quantifiers, which can follow any
3566of the following items:
3567
3568@itemize @bullet
3569@item
3570a single character, possibly escaped
3571
3572@item
3573the @code{.} special character
3574
3575@item
3576a character class
3577
3578@item
3579a back reference (see next section)
3580
3581@item
3582a parenthesized subpattern (unless it is an assertion; @pxref{Assertions})
3583@end itemize
3584
3585The general repetition quantifier specifies a minimum and
3586maximum number of permitted matches, by giving the two
3587numbers in curly brackets (braces), separated by a comma.
3588The numbers must be less than 65536, and the first must be
3589less than or equal to the second. For example:
3590
3591@example
3592z@{2,4@}
3593@end example
3594
3595@noindent
3596matches @samp{zz}, @samp{zzz}, or @samp{zzzz}. A closing brace on its own
3597is not a special character. If the second number is omitted,
3598but the comma is present, there is no upper limit; if the
3599second number and the comma are both omitted, the quantifier
3600specifies an exact number of required matches. Thus
3601
3602@example
3603[aeiou]@{3,@}
3604@end example
3605
3606@noindent
3607matches at least 3 successive vowels, but may match many
3608more, while
3609
3610@example
3611\d@{8@}
3612@end example
3613
3614@noindent
3615matches exactly 8 digits.  An opening curly bracket that
3616appears in a position where a quantifier is not allowed, or
3617one that does not match the syntax of a quantifier, is taken
as a literal character. For example, @samp{@{,6@}} is not a quantifier,
3619but a literal string of four characters.@footnote{It
3620raises an error if @option{-R} is not used.}
3621
3622The quantifier @samp{@{0@}} is permitted, causing the expression to
3623behave as if the previous item and the quantifier were not
3624present.
3625
3626For convenience (and historical compatibility) the three
3627most common quantifiers have single-character abbreviations:
3628
3629@table @code
3630@item *
3631is equivalent to @{0,@}
3632
3633@item +
3634is equivalent to @{1,@}
3635
3636@item ?
3637is equivalent to @{0,1@}
3638@end table
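
The quantifier forms described so far can be tried out with a short
script.  This is a sketch that assumes Python's @code{re} module as a
stand-in for the Perl-style syntax; it does not run @command{sed}, and
the helper name @code{full} is hypothetical.  One difference to keep in
mind is that Python reads @samp{@{,6@}} as a quantifier with a minimum
of zero, so that case is left out.

@verbatim
```python
import re

def full(pattern, text):
    """True if the pattern matches the whole of text."""
    return re.fullmatch(pattern, text) is not None

# z{2,4}: between two and four repetitions.
bounded = [n for n in range(7) if full(r"z{2,4}", "z" * n)]

# [aeiou]{3,}: no upper limit; \d{8}: an exact count.
open_ended = full(r"[aeiou]{3,}", "aeiouaeiou")
exact = full(r"\d{8}", "20040101")

# Each abbreviation accepts the same strings as its brace form.
samples = ["", "a", "aa", "aaa"]
equivalent = all(
    full("a" + short, s) == full("a" + brace, s)
    for short, brace in [("*", "{0,}"), ("+", "{1,}"), ("?", "{0,1}")]
    for s in samples
)
```
@end verbatim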
3639
3640It is possible to construct infinite loops by following a
3641subpattern that can match no characters with a quantifier
3642that has no upper limit, for example:
3643
3644@example
3645(a?)*
3646@end example
3647
3648Earlier versions of Perl used to give an error at
3649compile time for such patterns. However, because there are
3650cases where this can be useful, such patterns are now
3651accepted, but if any repetition of the subpattern does in
3652fact match no characters, the loop is forcibly broken.
3653
3654@cindex Greedy regular expression matching
3655@cindex Perl-style regular expressions, stingy repetitions
3656By default, the quantifiers are @dfn{greedy} like in @sc{posix}
3657mode, that is, they match as much as possible (up to the maximum
3658number of permitted times), without causing the rest of the
3659pattern to fail. The classic example of where this gives problems
3660is in trying to match comments in C programs. These appear between
3661the sequences @code{/*} and @code{*/} and within the sequence, individual
3662@code{*} and @code{/} characters may appear. An attempt to match C
3663comments by applying the pattern
3664
3665@example
3666/\*.*\*/
3667@end example
3668
3669@noindent
3670to the string
3671
3672@example
/* first comment */ not comment /* second comment */
3674@end example
3675
@noindent
fails, because it matches the entire string owing to the
3679greediness of the @code{.*} item.
3680
3681However, if a quantifier is followed by a question mark, it
3682ceases to be greedy, and instead matches the minimum number
3683of times possible, so the pattern @code{/\*.*?\*/}
3684does the right thing with the C comments. The meaning of the
3685various quantifiers is not otherwise changed, just the preferred
3686number of matches.  Do not confuse this use of question
3687mark with its use as a quantifier in its own right.
3688Because it has two uses, it can sometimes appear doubled, as in
3689
3690@example
3691\d??\d
3692@end example
3693
3694which matches one digit by preference, but can match two if
3695that is the only way the rest of the pattern matches.
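
The difference between greedy and non-greedy repetition is easy to
observe.  The sketch below assumes Python's @code{re} module as a
stand-in for the Perl-style syntax (it does not invoke @command{sed});
all variable names are illustrative.

@verbatim
```python
import re

subject = "/* first comment */ not comment /* second comment */"

# Greedy: .* runs to the last "*/", so one match spans everything.
greedy = re.findall(r"/\*.*\*/", subject)

# Non-greedy: .*? stops at the first "*/", so each comment is
# matched separately.
lazy = re.findall(r"/\*.*?\*/", subject)

# \d??\d prefers one digit but takes two when the rest requires it.
prefer_one = re.match(r"\d??\d", "12").group()
forced_two = re.fullmatch(r"\d??\d", "12").group()
```
@end verbatim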
3696
Note that greediness does not matter when specifying addresses,
but can nevertheless be used to improve performance.
3699
3700@ignore
3701If the PCRE_UNGREEDY option is set (an option which is not
3702available in Perl), the quantifiers are not greedy by
3703default, but individual ones can be made greedy by following
3704them with a question mark. In other words, it inverts the
3705default behaviour.
3706@end ignore
3707
3708When a parenthesized subpattern is quantified with a minimum
3709repeat count that is greater than 1 or with a limited maximum,
more memory is required for the compiled pattern, in
3711proportion to the size of the minimum or maximum.
3712
3713@cindex Perl-style regular expressions, single line
3714If a pattern starts with @code{.*} or @code{.@{0,@}} and the
3715@code{S} modifier is used, the pattern is implicitly anchored,
3716because whatever follows will be tried against every character
3717position in the subject string, so there is no point in
3718retrying the overall match at any position after the first.
PCRE treats such a pattern as though it were preceded by @code{\A}.
3720
3721When a capturing subpattern is repeated, the value captured
3722is the substring that matched the final iteration. For example,
3723after
3724
3725@example
3726(tweedle[dume]@{3@}\s*)+
3727@end example
3728
3729@noindent
3730has matched @samp{tweedledum tweedledee} the value of the
3731captured substring is @samp{tweedledee}.  However, if there are
3732nested capturing subpatterns, the corresponding captured
3733values may have been set in previous iterations. For example,
3734after
3735
3736@example
3737/(a|(b))+/
3738@end example
3739
3740matches @samp{aba}, the value of the second captured substring is
3741@samp{b}.
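
Both cases can be reproduced in a few lines.  This sketch assumes
Python's @code{re} module, which follows the same rule for repeated
groups, as a stand-in for @command{sed}; the variable names are
illustrative only.

@verbatim
```python
import re

# The group matches twice; only the final iteration's text is kept.
last = re.fullmatch(r"(tweedle[dume]{3}\s*)+", "tweedledum tweedledee")
last_iteration = last.group(1)

# The inner group (b) takes no part in the final iteration, which
# matches "a", so it still reports the value set by an earlier one.
nested = re.fullmatch(r"(a|(b))+", "aba")
outer, inner = nested.group(1), nested.group(2)
```
@end verbatim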
3742
3743@node Backreferences
3744@appendixsec Backreferences
3745@cindex Perl-style regular expressions, backreferences
3746
3747Outside a character class, a backslash followed by a digit
3748greater than 0 (and possibly further digits) is a back
3749reference to a capturing subpattern earlier (i.e.  to its
3750left) in the pattern, provided there have been that many
3751previous capturing left parentheses.
3752
3753However, if the decimal number following the backslash is
3754less than 10, it is always taken as a back reference, and
3755causes an error only if there are not that many capturing
3756left parentheses in the entire pattern. In other words, the
3757parentheses that are referenced need not be to the left of
the reference for numbers less than 10.  @xref{Backslash},
for further details of the handling of digits following a backslash.
3760
3761A back reference matches whatever actually matched the capturing
3762subpattern in the current subject string, rather than
3763anything matching the subpattern itself. So the pattern
3764
3765@example
3766(sens|respons)e and \1ibility
3767@end example
3768
3769@noindent
3770matches @samp{sense and sensibility} and @samp{response and responsibility},
3771but not @samp{sense and responsibility}. If caseful
3772matching is in force at the time of the back reference, the
3773case of letters is relevant. For example,
3774
3775@example
3776((?i)blah)\s+\1
3777@end example
3778
3779@noindent
3780matches @samp{blah blah} and @samp{Blah Blah}, but not
3781@samp{BLAH blah}, even though the original capturing
3782subpattern is matched caselessly.
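
A sketch of both examples, assuming Python's @code{re} module as a
stand-in for the Perl-style syntax rather than @command{sed} itself.
The caseless group is written as @code{(?i:blah)} because recent Python
versions do not accept @code{(?i)} in the middle of a pattern; this
substitution and the names used are assumptions of the sketch.

@verbatim
```python
import re

pairing = re.compile(r"(sens|respons)e and \1ibility")

def pairs_up(text):
    """True if the text after 'and' repeats the captured stem."""
    return pairing.fullmatch(text) is not None

# The group is matched caselessly, but the back reference lies outside
# the (?i:...) scope and compares case-sensitively with the capture.
repeated = re.compile(r"((?i:blah))\s+\1")
```
@end verbatim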
3783
3784There may be more than one back reference to the same subpattern.
3785Also, if a subpattern has not actually been used in a
3786particular match, any back references to it always fail. For
3787example, the pattern
3788
3789@example
3790(a|(bc))\2
3791@end example
3792
3793@noindent
3794always fails if it starts to match @samp{a} rather than
3795@samp{bc}.  Because there may be up to 99 back references, all
3796digits following the backslash are taken as part of a potential
3797back reference number; this is different from what happens
3798in @sc{posix} mode. If the pattern continues with a digit
3799character, some delimiter must be used to terminate the back
3800reference.  If the @code{X} modifier option is set, this can be
3801whitespace.  Otherwise an empty comment can be used, or the
3802following character can be expressed in hexadecimal or octal.
Note that this applies only to the LHS pattern; it is
not yet possible to specify more than 9 backreferences on the
RHS of the @code{s} command.
3806
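The need for a delimiter after a back reference can be seen in a short
sketch.  It assumes Python's @code{re} module as a stand-in (Python
likewise reads two digits after a backslash as one group number in this
position) and does not invoke @command{sed}; the names are
hypothetical.

@verbatim
```python
import re

# Goal: group 1, repeated, then a literal "1".  Writing \11 would be
# read as a reference to group 11, so the digit is written in
# hexadecimal instead.
hex_digit = re.compile(r"(a)\1\x31")

# An empty comment also ends the reference number.
empty_comment = re.compile(r"(a)\1(?#)1")
```
@end verbatim

Both patterns accept @samp{aa1} and nothing shorter.
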
3807A back reference that occurs inside the parentheses to which
3808it refers fails when the subpattern is first used, so, for
3809example, @code{(a\1)} never matches.  However, such references
3810can be useful inside repeated subpatterns. For example, the
3811pattern
3812
3813@example
3814(a|b\1)+
3815@end example
3816
3817@noindent
3818matches any number of @samp{a}s and also @samp{aba}, @samp{ababbaa},
3819etc. At each iteration of the subpattern, the back reference matches
3820the character string corresponding to the previous iteration.  In
3821order for this to work, the pattern must be such that the first
3822iteration does not need to match the back reference.  This can be
3823done using alternation, as in the example above, or by a
3824quantifier with a minimum of zero.
3825
3826@node Assertions
3827@appendixsec Assertions
3828@cindex Perl-style regular expressions, assertions
3829@cindex Perl-style regular expressions, asserting subpatterns
3830
3831An assertion is a test on the characters following or
3832preceding the current matching point that does not actually
3833consume any characters. The simple assertions coded as @code{\b},
3834@code{\B}, @code{\A}, @code{\Z}, @code{\z}, @code{^} and @code{$}
3835are described above. More complicated assertions are coded as
3836subpatterns.  There are two kinds: those that look ahead of the
3837current position in the subject string, and those that look behind it.
3838
3839@cindex Perl-style regular expressions, lookahead subpatterns
3840An assertion subpattern is matched in the normal way, except
3841that it does not cause the current matching position to be
3842changed. Lookahead assertions start with @code{(?=} for positive
3843assertions and @code{(?!} for negative assertions. For example,
3844
3845@example
3846\w+(?=;)
3847@end example
3848
3849@noindent
3850matches a word followed by a semicolon, but does not include
3851the semicolon in the match, and
3852
3853@example
3854foo(?!bar)
3855@end example
3856
3857@noindent
3858matches any occurrence of @samp{foo} that is not followed by
3859@samp{bar}.
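
The two lookahead examples can be exercised as follows.  This is a
sketch that assumes Python's @code{re} module as a stand-in for the
Perl-style syntax; it does not run @command{sed}, and the subject
strings are made up for the illustration.

@verbatim
```python
import re

# (?=;) requires a semicolon next but leaves it out of the match.
word = re.search(r"\w+(?=;)", "return total;").group()

# foo(?!bar) reports only the "foo" that is not followed by "bar".
starts = [m.start() for m in re.finditer(r"foo(?!bar)", "foobar foobaz")]
```
@end verbatim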
3860
3861Note that the apparently similar pattern
3862
3863@example
3864(?!foo)bar
3865@end example
3866
3867@noindent
3868@cindex Perl-style regular expressions, lookbehind subpatterns
finds any occurrence of @samp{bar}, even one that is preceded by
@samp{foo}, because the assertion @code{(?!foo)} is always true
when the next three characters are @samp{bar}.  A lookbehind
assertion is needed to exclude occurrences preceded by @samp{foo}.
3873Lookbehind assertions start with @code{(?<=} for positive
3874assertions and @code{(?<!} for negative assertions. So,
3875
3876@example
3877(?<!foo)bar
3878@end example
3879
3880achieves the required effect of finding an occurrence of
3881@samp{bar} that is not preceded by @samp{foo}. The contents of a
3882lookbehind assertion are restricted
3883such that all the strings it matches must have a fixed
3884length.  However, if there are several alternatives, they do
3885not all have to have the same fixed length.  This is an extension
3886compared with Perl 5.005, which requires all branches to match
3887the same length of string. Thus
3888
3889@example
3890(?<=dogs|cats|)
3891@end example
3892
3893@noindent
3894is permitted, but the apparently equivalent regular expression
3895
3896@example
3897(?<!dogs?|cats?)
3898@end example
3899
3900@noindent
3901causes an error at compile time. Branches that match different
3902length strings are permitted only at the top level of
3903a lookbehind assertion: an assertion such as
3904
3905@example
3906(?<=ab(c|de))
3907@end example
3908
3909@noindent
3910is not permitted, because its single top-level branch can
3911match two different lengths, but it is acceptable if rewritten
3912to use two top-level branches:
3913
3914@example
3915(?<=abc|abde)
3916@end example
3917
3918All this is required because lookbehind assertions simply
3919move the current position back by the alternative's fixed
3920width and then try to match.  If there are
3921insufficient characters before the current position, the
match is deemed to fail.  Lookbehinds, in conjunction with
non-backtracking subpatterns, can be particularly useful for
3924matching at the ends of strings; an example is given at the end
3925of the section on non-backtracking subpatterns.
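
The contrast between @code{(?!foo)bar} and @code{(?<!foo)bar} shows up
clearly on a small subject.  The sketch assumes Python's @code{re}
module as a stand-in and does not invoke @command{sed}.  Python is
stricter than the rules given here, in that every alternative of a
lookbehind must have the same length, so only a single fixed-length
lookbehind is shown.

@verbatim
```python
import re

subject = "foobar bazbar"

# (?!foo)bar: the lookahead inspects "bar" itself, so it never fails.
lookahead_hits = [m.start() for m in re.finditer(r"(?!foo)bar", subject)]

# (?<!foo)bar: the lookbehind inspects the three characters that
# precede "bar".
lookbehind_hits = [m.start() for m in re.finditer(r"(?<!foo)bar", subject)]
```
@end verbatim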
3926
3927Several assertions (of any sort) may occur in succession.
3928For example,
3929
3930@example
3931(?<=\d@{3@})(?<!999)foo
3932@end example
3933
3934@noindent
3935matches @samp{foo} preceded by three digits that are not @samp{999}.
3936Notice that each of the assertions is applied independently
3937at the same point in the subject string. First there is a
3938check that the previous three characters are all digits, and
3939then there is a check that the same three characters are not
@samp{999}.  This pattern does not match @samp{foo} preceded by six
characters, the first three of which are digits and the last three
3942of which are not @samp{999}.  For example, it doesn't match
3943@samp{123abcfoo}. A pattern to do that is
3944
3945@example
3946(?<=\d@{3@}...)(?<!999)foo
3947@end example
3948
3949@noindent
3950This time the first assertion looks at the preceding six
3951characters, checking that the first three are digits, and
3952then the second assertion checks that the preceding three
3953characters are not @samp{999}.  Actually, assertions can be
3954nested in any combination, so one can write this as
3955
3956@example
3957(?<=\d@{3@}(?!999)...)foo
3958@end example
3959
3960or
3961
3962@example
3963(?<=\d@{3@}...(?<!999))foo
3964@end example
3965
3966@noindent
3967both of which might be considered more readable.
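
A sketch of the first two patterns, assuming Python's @code{re} module
as a stand-in for the Perl-style syntax rather than @command{sed}.  The
nested forms are left out, and the names @code{finds},
@code{adjacent} and @code{six_back} are hypothetical.

@verbatim
```python
import re

def finds(pattern, text):
    """True if the pattern matches anywhere in text."""
    return re.search(pattern, text) is not None

# Both assertions look at the same three characters before "foo".
adjacent = r"(?<=\d{3})(?<!999)foo"

# The first assertion looks back six characters, the second three.
six_back = r"(?<=\d{3}...)(?<!999)foo"
```
@end verbatim

With these definitions, @code{adjacent} finds a match in
@samp{123foo} but not in @samp{999foo} or @samp{123abcfoo}, while
@code{six_back} finds one in @samp{123abcfoo}.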
3968
3969Assertion subpatterns are not capturing subpatterns, and may
3970not be repeated, because it makes no sense to assert the
3971same thing several times. If any kind of assertion contains
3972capturing subpatterns within it, these are counted for the
3973purposes of numbering the capturing subpatterns in the whole
3974pattern.  However, substring capturing is carried out only
3975for positive assertions, because it does not make sense for
3976negative assertions.
3977
3978Assertions count towards the maximum of 200 parenthesized
3979subpatterns.
3980
3981@node Non-backtracking subpatterns
3982@appendixsec Non-backtracking subpatterns
3983@cindex Perl-style regular expressions, non-backtracking subpatterns
3984
3985With both maximizing and minimizing repetition, failure of
3986what follows normally causes the repeated item to be evaluated
3987again to see if a different number of repeats allows the
3988rest of the pattern to match. Sometimes it is useful to
3989prevent this, either to change the nature of the match, or
to cause it to fail earlier than it otherwise might, when the
3991author of the pattern knows there is no point in carrying
3992on.
3993
3994Consider, for example, the pattern @code{\d+foo} when applied to
3995the subject line
3996
3997@example
3998123456bar
3999@end example
4000
4001After matching all 6 digits and then failing to match @samp{foo},
4002the normal action of the matcher is to try again with only 5
4003digits matching the @code{\d+} item, and then with 4, and so on,
4004before ultimately failing. Non-backtracking subpatterns
4005provide the means for specifying that once a portion of the
4006pattern has matched, it is not to be re-evaluated in this way,
4007so the matcher would give up immediately on failing to match
4008@samp{foo} the first time.  The notation is another kind of special
4009parenthesis, starting with @code{(?>} as in this example:
4010
4011@example
(?>\d+)foo
4013@end example
4014
4015This kind of parenthesis ``locks up'' the part of the pattern
4016it contains once it has matched, and a failure further into
4017the pattern is prevented from backtracking into it.
4018Backtracking past it to previous items, however, works as
4019normal.
4020
4021Non-backtracking subpatterns are not capturing subpatterns.  Simple
4022cases such as the above example can be thought of as a maximizing
4023repeat that must swallow everything it can.  So,
4024while both @code{\d+} and @code{\d+?} are prepared to adjust the number of
4025digits they match in order to make the rest of the pattern
4026match, @code{(?>\d+)} can only match an entire sequence of digits.
4027
4028This construction can of course contain arbitrarily complicated
4029subpatterns, and it can be nested.
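
The effect of ``locking up'' a repeat can be imitated in engines that
lack the @code{(?>} notation.  The sketch below assumes Python's
@code{re} module, which accepts @code{(?>...)} only from version 3.11,
and therefore substitutes a different technique: a lookahead that
captures the text, followed by a back reference that consumes it.  This
is an emulation chosen for portability, not the notation described
above, and it does not invoke @command{sed}.

@verbatim
```python
import re

subject = "123456"

# Ordinary repeat: \d+ gives back one digit so the final \d can match.
backtracking = re.fullmatch(r"\d+\d", subject)

# Emulated (?>\d+): the lookahead captures the whole run of digits,
# \1 then consumes exactly that text, and nothing can be given back.
atomic = re.fullmatch(r"(?=(\d+))\1\d", subject)

# The emulation still matches when no backtracking is needed.
still_matches = re.fullmatch(r"(?=(\d+))\1foo", "123foo")
```
@end verbatim

The lookahead is itself never re-entered once it has succeeded, which
is what makes the emulation behave like a non-backtracking subpattern.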
4030
4031@cindex Perl-style regular expressions, lookbehind subpatterns
4032Non-backtracking subpatterns can be used in conjunction with look-behind
4033assertions to specify efficient matching at the end
4034of the subject string. Consider a simple pattern such as
4035
4036@example
4037abcd$
4038@end example
4039
4040@noindent
4041when applied to a long string which does not match.  Because
4042matching proceeds from left to right, @command{sed} will look for
4043each @samp{a} in the subject and then see if what follows matches
4044the rest of the pattern. If the pattern is specified as
4045
4046@example
4047^.*abcd$
4048@end example
4049
4050@noindent
4051the initial @code{.*} matches the entire string at first, but when
4052this fails (because there is no following @samp{a}), it backtracks
4053to match all but the last character, then all but the
4054last two characters, and so on. Once again the search for
4055@samp{a} covers the entire string, from right to left, so we are
4056no better off. However, if the pattern is written as
4057
4058@example
4059^(?>.*)(?<=abcd)
4060@end example
4061
@noindent
there can be no backtracking for the @code{.*} item; it can match
4063only the entire string. The subsequent lookbehind assertion
4064does a single test on the last four characters. If it fails,
4065the match fails immediately. For long strings, this approach
4066makes a significant difference to the processing time.
4067
4068When a pattern contains an unlimited repeat inside a subpattern
4069that can itself be repeated an unlimited number of
times, the use of a non-backtracking subpattern is the only way to
4071avoid some failing matches taking a very long time
4072indeed.@footnote{Actually, the matcher embedded in @value{SSED}
4073tries to do something for this in the simplest cases,
4074like @code{([^b]*b)*}.  These cases are actually quite
4075common: they happen for example in a regular expression
4076like @code{\/\*([^*]*\*)*\/} which matches C comments.}
4077
4078The pattern
4079
4080@example
4081(\D+|<\d+>)*[!?]
4082@end example
4083
4086@noindent
4087matches an unlimited number of substrings that either consist
of non-digits, or digits enclosed in angle brackets, followed by
4089an exclamation or question mark. When it matches, it runs quickly.
4090However, if it is applied to
4091
4092@example
4093aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
4094@end example
4095
4096@noindent
4097it takes a long time before reporting failure.  This is
4098because the string can be divided between the two repeats in
4099a large number of ways, and all have to be tried.@footnote{The
4100example used @code{[!?]} rather than a single character at the end,
4101because both @value{SSED} and Perl have an optimization that allows
4102for fast failure when a single character is used. They
4103remember the last single character that is required for a
4104match, and fail early if it is not present in the string.}
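
To give a feel for how large that number of ways is, consider only the
non-digit branch: a run of @var{n} non-digits can be cut into
consecutive non-empty pieces, one per iteration of the outer repeat, in
2^(@var{n}-1) ways.  The following Python sketch counts the cuts by
brute force.  It is a simplified model, not a description of how any
particular matcher explores them, and the name @code{count_splits} is
hypothetical.

@verbatim
```python
from functools import lru_cache

@lru_cache(maxsize=None)
def count_splits(n):
    """Number of ways to cut n characters into non-empty consecutive runs."""
    if n == 0:
        return 1  # nothing left to cut: exactly one (empty) way
    # The first run takes k characters; the remainder is cut recursively.
    return sum(count_splits(n - k) for k in range(1, n + 1))

# Every extra character doubles the number of ways.
doubling = all(count_splits(n + 1) == 2 * count_splits(n)
               for n in range(1, 25))
```
@end verbatim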
4105
4106If the pattern is changed to
4107
4108@example
4109((?>\D+)|<\d+>)*[!?]
4110@end example
4111
4112sequences of non-digits cannot be broken, and failure happens
4113quickly.
4114
4115@node Conditional subpatterns
4116@appendixsec Conditional subpatterns
4117@cindex Perl-style regular expressions, conditional subpatterns
4118
4119It is possible to cause the matching process to obey a subpattern
4120conditionally or to choose between two alternative
4121subpatterns, depending on the result of an assertion, or
4122whether a previous capturing subpattern matched or not. The
4123two possible forms of conditional subpattern are
4124
4125@example
4126(?(@var{condition})@var{yes-pattern})
4127(?(@var{condition})@var{yes-pattern}|@var{no-pattern})
4128@end example
4129
4130If the condition is satisfied, the yes-pattern is used; otherwise
4131the no-pattern (if present) is used. If there are more than two
4132alternatives in the subpattern, a compile-time error occurs.
4133
4134There are two kinds of condition. If the text between the
4135parentheses consists of a sequence of digits, the condition
4136is satisfied if the capturing subpattern of that number has
4137previously matched.  The number must be greater than zero.
4138Consider the following pattern, which contains non-significant
4139white space to make it more readable (assume the @code{X} modifier)
4140and to divide it into three parts for ease of discussion:
4141
4142@example
4143( \( )?   [^()]+   (?(1) \) )
4144@end example
4145
4146The first part matches an optional opening parenthesis, and
4147if that character is present, sets it as the first captured
4148substring. The second part matches one or more characters
4149that are not parentheses. The third part is a conditional
4150subpattern that tests whether the first set of parentheses
matched or not.  If they did, that is, if the subject started
4152with an opening parenthesis, the condition is true, and so
4153the yes-pattern is executed and a closing parenthesis is
4154required. Otherwise, since no-pattern is not present, the
4155subpattern matches nothing.  In other words, this pattern
4156matches a sequence of non-parentheses, optionally enclosed
4157in parentheses.
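
The pattern can be tried directly in engines that support conditions on
a group number.  This sketch assumes Python's @code{re} module as a
stand-in for @command{sed}, with the @code{re.X} flag playing the role
of the @code{X} modifier; the names @code{optional_parens} and
@code{accepted} are hypothetical.

@verbatim
```python
import re

# re.X makes white space in the pattern insignificant.
# (?(1) \) ) demands ")" only if group 1 took part in the match.
optional_parens = re.compile(r"( \( )?   [^()]+   (?(1) \) )", re.X)

def accepted(text):
    """True if the whole text matches the pattern."""
    return optional_parens.fullmatch(text) is not None
```
@end verbatim

Here @code{accepted} is true for @samp{abc} and @samp{(abc)}, and false
for @samp{(abc} and @samp{abc)}.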
4158
4159@cindex Perl-style regular expressions, lookahead subpatterns
4160If the condition is not a sequence of digits, it must be an
4161assertion.  This may be a positive or negative lookahead or
4162lookbehind assertion. Consider this pattern, again containing
4163non-significant white space, and with the two alternatives
4164on the second line:
4165
4166@example
4167(?(?=...[a-z])
4168   \d\d-[a-z]@{3@}-\d\d |
4169   \d\d-\d\d-\d\d )
4170@end example
4171
4172The condition is a positive lookahead assertion that matches
4173a letter that is three characters away from the current point.
4174If a letter is found, the subject is matched against the first
4175alternative @samp{@var{dd}-@var{aaa}-@var{dd}} (where @var{aaa} are
4176letters and @var{dd} are digits); otherwise it is matched against
4177the second alternative, @samp{@var{dd}-@var{dd}-@var{dd}}.
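
A condition that is an assertion can be rewritten as an ordinary
alternation in which each branch is guarded by the assertion or by its
negation.  The sketch below uses that rewriting because Python's
@code{re} module, assumed here as a stand-in for @command{sed},
supports conditions on group numbers only.  It illustrates the meaning
of the construct rather than the construct itself, and the names are
hypothetical.

@verbatim
```python
import re

# (?(?=A) yes | no)  rewritten as  (?: (?=A) yes | (?!A) no )
date_like = re.compile(
    r"""(?: (?=...[a-z]) \d\d-[a-z]{3}-\d\d
          | (?!...[a-z]) \d\d-\d\d-\d\d )""",
    re.X,
)

def accepted(text):
    """True if the whole text matches one of the two date forms."""
    return date_like.fullmatch(text) is not None
```
@end verbatim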
4178
4179
4180@node Recursive patterns
4181@appendixsec Recursive patterns
4182@cindex Perl-style regular expressions, recursive patterns
4183@cindex Perl-style regular expressions, recursion
4184
4185Consider the problem of matching a string in parentheses,
4186allowing for unlimited nested parentheses. Without the use
4187of recursion, the best that can be done is to use a pattern
4188that matches up to some fixed depth of nesting. It is not
4189possible to handle an arbitrary nesting depth. Perl 5.6 has
4190provided an experimental facility that allows regular
4191expressions to recurse (amongst other things). It does this
4192by interpolating Perl code in the expression at run time,
and the code can refer to the expression itself. A Perl pattern
to solve the parentheses problem can be created like
4195this:
4196
4197@example
4198$re = qr@{\( (?: (?>[^()]+) | (?p@{$re@}) )* \)@}x;
4199@end example
4200
4201The @code{(?p@{...@})} item interpolates Perl code at run time,
4202and in this case refers recursively to the pattern in which it
4203appears. Obviously, @command{sed} cannot support the interpolation of
4204Perl code.  Instead, the special item @code{(?R)} is provided for
4205the specific case of recursion. This pattern solves the
4206parentheses problem (assume the @code{X} modifier option is used
4207so that white space is ignored):
4208
4209@example
4210\( ( (?>[^()]+) | (?R) )* \)
4211@end example
4212
4213First it matches an opening parenthesis. Then it matches any
4214number of substrings which can either be a sequence of
4215non-parentheses, or a recursive match of the pattern itself
4216(i.e. a correctly parenthesized substring). Finally there is
4217a closing parenthesis.
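
The three steps can be mirrored by a small hand-written recursive
function.  This is a sketch in plain Python, used because Python's
@code{re} module has no @code{(?R)} item; it is not how
@command{sed} implements recursion.  The function is anchored at a
given position, and its name @code{match_balanced} is hypothetical.

@verbatim
```python
def match_balanced(text, start=0):
    """Return the index just past the balanced group that opens at
    text[start], or -1 if there is none."""
    if start >= len(text) or text[start] != "(":
        return -1                      # no opening parenthesis
    pos = start + 1
    while pos < len(text):
        if text[pos] == ")":
            return pos + 1             # the closing parenthesis
        if text[pos] == "(":
            pos = match_balanced(text, pos)   # the recursive branch
            if pos < 0:
                return -1
        else:
            pos += 1                   # a non-parenthesis character
    return -1                          # text ended before ")"
```
@end verbatim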
4218
4219This particular example pattern contains nested unlimited
4220repeats, and so the use of a non-backtracking subpattern for
4221matching strings of non-parentheses is important when applying
4222the pattern to strings that do not match. For example, when
4223it is applied to
4224
4225@example
4226(aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa()
4227@end example
4228
4229it yields a ``no match'' response quickly. However, if a
non-backtracking subpattern is not used, the match runs
4231for a very long time indeed because there are so many different
4232ways the @code{+} and @code{*} repeats can carve up the subject,
4233and all have to be tested before failure can be reported.
4234
4235The values set for any capturing subpatterns are those from
4236the outermost level of the recursion at which the subpattern
4237value is set. If the pattern above is matched against
4238
4239@example
4240(ab(cd)ef)
4241@end example
4242
4243@noindent
4244the value for the capturing parentheses is @samp{ef}, which is
4245the last value taken on at the top level.
4246
4247@node Comments
4248@appendixsec Comments
4249@cindex Perl-style regular expressions, comments
4250
The sequence @code{(?#} marks the start of a comment which continues
up to the next closing parenthesis. Nested parentheses
4253are not permitted. The characters that make up a comment
4254play no part in the pattern matching at all.
4255
4256@cindex Perl-style regular expressions, extended
4257If the @code{X} modifier option is used, an unescaped @code{#} character
4258outside a character class introduces a comment that continues
4259up to the next newline character in the pattern.
4260@end ifset
4261
4262
4263@page
4264@node Concept Index
4265@unnumbered Concept Index
4266
4267This is a general index of all issues discussed in this manual, with the
4268exception of the @command{sed} commands and command-line options.
4269
4270@printindex cp
4271
4272@page
4273@node Command and Option Index
4274@unnumbered Command and Option Index
4275
4276This is an alphabetical list of all @command{sed} commands and command-line
4277options.
4278
4279@printindex fn
4280
4281@contents
4282@bye
4283
4284@c XXX FIXME: the term "cycle" is never defined...
4285