[Solved][Bash] PCRE: Latin letters with diacritics (utf8) false match and not-match


#1 Post by ruwolf »

I have tried to use Perl Compatible Regular Expressions (PCRE) via grep -P, and also via pcre2grep and pcregrep.
All of them show the same error: they treat Latin letters with diacritics as word boundaries.
Some of them also fail to match these letters as lower-case letters.

It should not be an encoding problem, because I have set UTF-8 in all locale variables:

Code:

$ printenv|grep -P '^L[AC]'|sort
LANG=sk_SK.UTF-8
LANGUAGE=sk_SK.UTF-8
LC_ADDRESS=sk_SK.UTF-8
LC_ALL=sk_SK.UTF-8
LC_COLLATE=sk_SK.UTF-8
LC_CTYPE=sk_SK.UTF-8
LC_IDENTIFICATION=sk_SK.UTF-8
LC_MEASUREMENT=sk_SK.UTF-8
LC_MESSAGES=sk_SK.UTF-8
LC_MONETARY=sk_SK.UTF-8
LC_NAME=sk_SK.UTF-8
LC_NUMERIC=sk_SK.UTF-8
LC_PAPER=sk_SK.UTF-8
LC_RESPONSE=sk_SK.UTF-8
LC_TELEPHONE=sk_SK.UTF-8
LC_TIME=sk_SK.UTF-8
Testing file:

Code:

$ cat diakritika.txt 
-čí
-čia
-čo
-Evička
-Košice
-ký
-mám
-úži
-Žiar
-42úver
IMHO wrong results:

Code:

$ grep -P '\b\p{Ll}{2}' diakritika.txt 
-čia
-Evička
-Košice
-ký
-mám
-Žiar
-42úver

Code:

$ pcregrep '\b\p{Ll}{2}' diakritika.txt 
-čia
-Evička
-Košice
-Žiar
-42úver

Code:

$ pcre2grep '\b\p{Ll}{2}' diakritika.txt 
-čia
-Evička
-Košice
-Žiar
-42úver
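
One possible reading of the pcregrep and pcre2grep lists above, offered only as a sketch: without UTF mode these tools appear to treat every byte as one character, and \b only knows ASCII word characters. Under that assumption, the same five lines come out when GNU grep is forced into the C locale and \p{Ll} is approximated by [a-z]. The temporary file and the function name bytewise_match are made up for this illustration, and \b is a GNU grep extension.

```shell
#!/usr/bin/env bash
# Recreate the test file from the post (precomposed letters, UTF-8 bytes).
testfile=$(mktemp)
printf '%s\n' '-čí' '-čia' '-čo' '-Evička' '-Košice' \
              '-ký' '-mám' '-úži' '-Žiar' '-42úver' > "$testfile"

# Hypothetical helper: byte-wise matching with ASCII-only classes.
# LC_ALL=C makes grep work on single bytes, [a-z] stands in for \p{Ll},
# and \b (GNU extension) only knows ASCII letters, digits and "_".
# -a keeps grep from treating the non-ASCII bytes as binary data.
bytewise_match() {
  LC_ALL=C grep -a -E '\b[a-z]{2}' -- "$1"
}

bytewise_match "$testfile"
```

In this emulation -ký and -mám drop out because ý and á are two bytes each and neither byte is an ASCII lower-case letter, while -Evička and -Košice get in because the bytes of č and š act as non-word characters in the middle of the word.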
Versions of commands and libraries:

Code:

$ LANG=C; grep --version
grep (GNU grep) 3.11
⋮
grep -P uses PCRE2 10.44 2024-06-07

Code:

$ pcregrep --version
pcregrep version 8.39 2016-06-14

Code:

$ pcre2grep --version
pcre2grep version 10.44 2024-06-07

Code:

$ bash --version
GNU bash, version 5.2.37(1)-release (x86_64-pc-linux-gnu)
(Of course, I switched LANG to C only after the testing, so that the version messages are printed in English.)
grep -P and pcre2grep seem to use the same library (PCRE2 10.44), but they give different results:
-ký and -mám are correct matches, but pcregrep and pcre2grep do not match them;
-úži should be in the results, but none of the commands matches it;
-Evička, -Košice, -Žiar and -42úver are all false matches, because they begin with an upper-case letter or a digit (digits are considered word characters according to PerlDoc.Perl.org/perlre: Character Classes and other Special Escapes).
Of course, I use precomposed letters, each a single Unicode character, not base letters followed by Combining Diacritical Marks.
Am I making some mistake? Or is it a bug in the libraries (which seems improbable to me)?
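
To double-check the claim about precomposed letters, and to see what a byte-oriented matcher is actually given, the bytes can be dumped with od. This is only a sketch: hex_bytes and has_combining_marks are hypothetical helper names. The second one only looks for the Combining Diacritical Marks block, U+0300 to U+036F, whose UTF-8 encoding is byte cc followed by a continuation byte, or cd followed by 80 to af.

```shell
#!/usr/bin/env bash
# Print the bytes of the string "$1" as space-separated hex on one line.
hex_bytes() {
  printf '%s' "$1" | od -An -v -tx1 | xargs
}

# Succeed if file "$1" contains a character from U+0300..U+036F.
# The hex dump is flattened to " xx xx xx " so that each pattern
# token lines up with exactly one byte.
has_combining_marks() {
  od -An -v -tx1 < "$1" | tr -s ' \n' ' ' | grep -Eq ' (cc|cd [89a][0-9a-f]) '
}

hex_bytes 'ký'
hex_bytes 'č'

# Decode the two-byte sequence c4 8d by hand (110xxxxx 10yyyyyy).
printf 'U+%04X\n' $(( ((0xc4 & 0x1f) << 6) | (0x8d & 0x3f) ))   # U+010D
```

Running has_combining_marks diakritika.txt should fail (non-zero exit status) if the file really holds only precomposed letters. The hex dump also shows that every letter with a diacritic is two bytes long, which matters as soon as a tool matches byte by byte.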
Last edited by ruwolf on 2025-01-09 20:23, edited 2 times in total.


Re: [Bash] PCRE: Latin letters with diacritics false match and not-match

#2 Post by ruwolf »

I have found a solution.
You have to put both special options at the start of the pattern:
(*UCP) (Unicode properties) and (*UTF) (for the encoding);
their order is irrelevant:

Code:

$ grep -P '(*UTF)(*UCP)\b\p{Ll}{2}' diakritika.txt 
-čí
-čia
-čo
-ký
-mám
-úži

Code:

$ pcregrep '(*UCP)(*UTF)\b\p{Ll}{2}' diakritika.txt 
-čí
-čia
-čo
-ký
-mám
-úži

Code:

$ pcre2grep '(*UCP)(*UTF)\b\p{Ll}{2}' diakritika.txt 
-čí
-čia
-čo
-ký
-mám
-úži
Source: pcre2unicode man page.
