grep -P, and also pcre2grep and pcregrep. All have the same error. They treat Latin letters with diacritics as word boundaries.
Some of them also wrongly fail to match these letters as lower-case letters.
It should not be an encoding error, because I have set UTF-8 in all locale variables:
Code:
$ printenv|grep -P '^L[AC]'|sort
LANG=sk_SK.UTF-8
LANGUAGE=sk_SK.UTF-8
LC_ADDRESS=sk_SK.UTF-8
LC_ALL=sk_SK.UTF-8
LC_COLLATE=sk_SK.UTF-8
LC_CTYPE=sk_SK.UTF-8
LC_IDENTIFICATION=sk_SK.UTF-8
LC_MEASUREMENT=sk_SK.UTF-8
LC_MESSAGES=sk_SK.UTF-8
LC_MONETARY=sk_SK.UTF-8
LC_NAME=sk_SK.UTF-8
LC_NUMERIC=sk_SK.UTF-8
LC_PAPER=sk_SK.UTF-8
LC_RESPONSE=sk_SK.UTF-8
LC_TELEPHONE=sk_SK.UTF-8
LC_TIME=sk_SK.UTF-8
Code:
$ cat diakritika.txt
-čí
-čia
-čo
-Evička
-Košice
-ký
-mám
-úži
-Žiar
-42úver
Code:
$ grep -P '\b\p{Ll}{2}' diakritika.txt
-čia
-Evička
-Košice
-ký
-mám
-Žiar
-42úver
Code:
$ pcregrep '\b\p{Ll}{2}' diakritika.txt
-čia
-Evička
-Košice
-Žiar
-42úver
Code:
$ pcre2grep '\b\p{Ll}{2}' diakritika.txt
-čia
-Evička
-Košice
-Žiar
-42úver
Code:
$ LANG=C; grep --version
grep (GNU grep) 3.11
⋮
grep -P uses PCRE2 10.44 2024-06-07
Code:
$ pcregrep --version
pcregrep version 8.39 2016-06-14
Code:
$ pcre2grep --version
pcre2grep version 10.44 2024-06-07
Code:
$ bash --version
GNU bash, version 5.2.37(1)-release (x86_64-pc-linux-gnu)
(I set LANG to C after the testing, because of the language used when printing the versions.)
grep -P and pcre2grep seem to use the same library, but they give different results;
-ký & -mám are correct answers, but pcregrep & pcre2grep do not match them;
-úži should be in the results, but no command matches it;
-Evička, -Košice, -Žiar, -42úver are all false results, because they begin with an upper-case letter or a digit (digits are considered to be word characters by PerlDoc.Perl.org/perlre: Character Classes and other Special Escapes).
Of course, I use single letters as single Unicode characters, not composites with Combining Diacritical Marks.
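To make the expectation explicit, here is a minimal Python sketch of what I would expect '\b\p{Ll}{2}' to select if word characters were decided by Unicode properties. It is only a reference model, not how grep or PCRE work internally. The helper names (is_word_char, has_combining_marks, expected_matches) are made up for illustration, and treating "word character" as any Unicode letter, any Unicode number or the underscore is my assumption. It also checks that the test lines contain no combining marks.

```python
import unicodedata

# The test lines from diakritika.txt, typed as precomposed characters.
LINES = ["-čí", "-čia", "-čo", "-Evička", "-Košice",
         "-ký", "-mám", "-úži", "-Žiar", "-42úver"]


def is_word_char(ch):
    # Assumption: a word character is a Unicode letter (L*),
    # a Unicode number (N*) or the underscore.
    return ch == "_" or unicodedata.category(ch)[0] in ("L", "N")


def is_lower_letter(ch):
    # Equivalent of \p{Ll}: general category "Letter, lowercase".
    return unicodedata.category(ch) == "Ll"


def has_combining_marks(text):
    # True if any character belongs to a Mark category (Mn, Mc, Me).
    return any(unicodedata.category(ch).startswith("M") for ch in text)


def line_matches(line):
    # Look for a position that is a word boundary and is followed
    # by two lower-case letters.
    for i in range(len(line) - 1):
        before = i > 0 and is_word_char(line[i - 1])
        here = is_word_char(line[i])
        if before != here and is_lower_letter(line[i]) and is_lower_letter(line[i + 1]):
            return True
    return False


def expected_matches(lines):
    return [line for line in lines if line_matches(line)]


if __name__ == "__main__":
    print("combining marks present:", any(has_combining_marks(l) for l in LINES))
    for line in expected_matches(LINES):
        print(line)
```

Under these assumptions the sketch selects -čí, -čia, -čo, -ký, -mám and -úži, so it also includes -čí and -čo, which none of the three commands printed.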
Am I making some error? Or is it a bug in the libraries (which seems improbable to me)?