Back References

Post by **patteoks** » Sat Feb 27, 2016 1:24 am

I have a list, for example:

1 Corinthians
2 Corinthians
1 Timothy
1 John
2 John
1 Kings

I want to turn it into:

1Corinthians
2Corinthians
1Timothy
1John
2John
1Kings

The regex filter I tried to use, built in RegexBuddy with the Perl flavour, was:

Find Pattern (Perl Style)
([123])\ ([SKCTPJ][aihoe]\w*\ ?\d{0,3}:?)|([12])\ (Thessalonians|THESSALONIANS ?\d{0,3})

Replace with:
\1\2\3\4

TextPipe gave me an error, although the same replacement worked in RegexBuddy and EditPad Pro.

It only worked after I read this forum and discovered that I had to use:

Replace with:
$1$$2$$3$$4$

What puzzles me is that I couldn't find an example or any documentation (see the extract below) indicating that I had to use $1$$2$$3$$4$.

What am I missing?

Does anyone on this forum also use RegexBuddy with TextPipe, and can you offer any insight into why \1\2\3\4 worked in RegexBuddy but not in TextPipe?

Thanks for your help.
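For comparison only, here is a minimal sketch of the same find pattern run through Python's `re` module instead of TextPipe. It assumes Python 3.5 or later, where groups that did not take part in a match are substituted as empty strings. The names `PATTERN`, `books`, and `joined` are made up for the illustration, and the result says nothing about how TextPipe itself parses the replacement.

```python
import re

# The find pattern from the post, unchanged.
PATTERN = re.compile(
    r"([123])\ ([SKCTPJ][aihoe]\w*\ ?\d{0,3}:?)"
    r"|([12])\ (Thessalonians|THESSALONIANS ?\d{0,3})"
)

books = ["1 Corinthians", "2 Corinthians", "1 Timothy", "1 John", "2 John", "1 Kings"]

# Python writes replacement references as \1..\4. Only one alternative matches
# at a time, so two of the four groups are always empty.
joined = [PATTERN.sub(r"\1\2\3\4", book) for book in books]
print(joined)
```

Every sample name begins with a letter pair covered by `[SKCTPJ][aihoe]`, so the first alternative handles all of them.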

Extract from help file:

After \0 up to two further octal digits are read. In both cases, if there are fewer than two digits, just those that are present are used. Thus the sequence \0\x\07 specifies two binary zeros followed by a BEL character (code value 7). Make sure you supply two digits after the initial zero if the character that follows is itself an octal digit.

The handling of a backslash followed by a digit other than 0 is complicated. Outside a character class, PCRE reads it and any following digits as a decimal number. If the number is less than 10, or if there have been at least that many previous capturing left parentheses in the expression, the entire sequence is taken as a back reference. A description of how this works is given later, following the discussion of parenthesized subpatterns.

Inside a character class, or if the decimal number is greater than 9 and there have not been that many capturing subpatterns, PCRE re-reads up to three octal digits following the backslash, and generates a single byte from the least significant 8 bits of the value. Any subsequent digits stand for themselves. For example:

  \040   is another way of writing a space
  \40    is the same, provided there are fewer than 40 previous capturing subpatterns
  \7     is always a back reference
  \11    might be a back reference, or another way of writing a tab
  \011   is always a tab
  \0113  is a tab followed by the character "3"
  \113   might be a back reference, otherwise the character with octal code 113
  \377   might be a back reference, otherwise the byte consisting entirely of 1 bits
  \81    is either a back reference, or a binary zero followed by the two characters "8" and "1"

Note that octal values of 100 or greater must not be introduced by a leading zero, because no more than three octal digits are ever read.

All the sequences that define a single byte value or a single UTF-8 character (in UTF-8 mode) can be used both inside and outside character classes. In addition, inside a character class, the sequence \b is interpreted as the backspace character (hex 08). Outside a character class it has a different meaning (see below).
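Note that the extract above is about backslash-digit sequences inside the search pattern, not the replacement. A rough sketch of a few of those cases follows, using Python's `re` module as a stand-in. This is an assumption for illustration: Python's rules are similar to, but not identical with, the PCRE rules quoted, so only cases where the two agree are shown. The variable names are hypothetical.

```python
import re

# \0 followed by up to two octal digits is read as an octal character code.
octal_space = re.fullmatch(r"\040", " ") is not None
octal_tab = re.fullmatch(r"\011", "\t") is not None

# After \0 only two more octal digits are consumed, so the trailing "3"
# stands for itself: a tab followed by the character "3".
tab_then_three = re.fullmatch(r"\0113", "\t3") is not None

# A single non-zero digit refers back to an earlier capturing group.
back_reference = re.fullmatch(r"(ab)\1", "abab") is not None

print(octal_space, octal_tab, tab_then_three, back_reference)
```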

dfhtextpipe · Post by **dfhtextpipe** » Tue Mar 01, 2016 7:03 am

I use TextPipe regularly in connection with work on Biblical texts. I have 9 years' experience in this field.

You didn't indicate whether the English Bible book names you gave as examples are part of free text or part of a structured document.

If they are in a structured document, it would be much simpler to use a restrict filter to govern the replacements.

The actual replacement then becomes much simpler.

Perl pattern [(\d) (\w+)] with [$1$$2]
   [X] Match case
   [X] Whole words only
   [ ] Case sensitive replace
   [ ] Prompt on replace
   [ ] Skip prompt if identical
   [ ] First only
   [ ] Extract matches   Maximum text buffer size 4096
   [X] Maximum match (greedy)
   [ ] Allow comments
   [ ] '.' matches newline
   [X] UTF-8 Support

   [ ] Process longest strings first
   [ ] Simultaneous search
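To see what the simpler pattern does on its own, here is a small sketch in Python rather than TextPipe. The `\b` anchors are only an approximation of the "Whole words only" option, and `SIMPLE` and `join_book_number` are names invented for this example.

```python
import re

# One digit, one space, one word: the simplified pattern from the filter above.
SIMPLE = re.compile(r"\b(\d) (\w+)\b")

def join_book_number(text: str) -> str:
    """Remove the space between a leading digit and the word that follows it."""
    return SIMPLE.sub(r"\1\2", text)

print(join_book_number("1 John"))
print(join_book_number("see 2 Corinthians 5"))
# In free text the same pattern also rewrites ordinary phrases.
print(join_book_number("after 3 days"))
```

The last call shows why the structure of the input matters: without something like a restrict filter to limit where the replacement applies, any digit followed by a word gets joined.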

So the more important question is what kind of structure does your input file have?

Best regards,

David Haslam
An active volunteer for the CrossWire Bible Society

PS. I don't use RegexBuddy.
My two favourite Unicode text editors are Notepad++ and BabelPad.
On rare occasions I have used EditPad Lite for file format conversions.

btw. When quoting from something such as the TextPipe Help file, it's sensible to use the Quote feature of phpBB.

Post by **DataMystic Support** » Wed Mar 02, 2016 8:21 am

Back references are used inside the search pattern, not inside the replace pattern.

Different tools use different ways of encoding this, some use %, some $, some \, some @, some [ etc. There is no standard.

We use \ for regex escape sequences (\r\n\t etc), $ for captured variables ($1, $2 etc, or $1$ when it is hard-up against the next variable as in $1$$2$), @ for macros (e.g. @fullInputFilename) or named captured variables (@phonenumber), and % for environment variables (e.g. %PATH).
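The closing `$` in `$1$` works as a delimiter for the variable name. As an illustration of why tools need one, here is a sketch of the analogous situation in Python's `re` module, which uses `\1` for captured groups and `\g<1>` as the delimited form. This is Python's notation, not TextPipe's, and the names below are hypothetical.

```python
import re

PATTERN = re.compile(r"(\d) (\w+)")

# \g<1> marks exactly where the group number ends, so a literal "0"
# can follow group 1 without being read as part of the number.
delimited = PATTERN.sub(r"\g<1>0\g<2>", "1 Kings")

# Without the delimiter, \10 is read as a reference to group 10,
# which does not exist in this pattern, so Python raises re.error.
try:
    PATTERN.sub(r"\10\2", "1 Kings")
    ambiguous_error = None
except re.error as exc:
    ambiguous_error = str(exc)

print(delimited)
print(ambiguous_error)
```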

The help file is quite clear on this!

DataMystic
