[Solved] Web Scraping and Character Encoding

limotux
Posts: 122
Joined: 2011-05-30 17:38
Has thanked: 25 times
Been thanked: 10 times

[Solved] Web Scraping and Character Encoding

#1 Post by limotux »

I made a little web scraping script. I do web scraping in 2 languages, English and Arabic.

I had the script on a previous Debian KDE Plasma installation and it worked fine for both languages. But I made a fresh install recently.

On the current installation the scraping itself works. When I search for an English term, the results are written correctly to the text file I specified.
But when I try an Arabic search term, the script appears to scrape normally, yet the text file contains garbled characters like "منظمة شقيقة للبنك الدويل Ù" where Arabic text should be.

FYI, I have Arabic fonts already installed, I can type and read Arabic text in any app normally without issues.

Since the script worked before without problems, I guess the issue is with how the system handles the scraped text.

What can I do to get it to write the Arabic characters in a readable form?
I would greatly appreciate any help.
Last edited by limotux on 2023-11-21 06:34, edited 1 time in total.
Debian 12 (bookworm), KDE Plasma, Info: dual core model: Intel Core i3-10110U cache: L1: 128 KiB L2: 512 KiB L3: 4 MiB ext4
ID-1: swap-1 type: partition size: 977 MiB used: 3 MiB (0.3%) priority: -2
Memory: 19.37 GiB
(Installed 24/8/2023) (no techie)

Aki
Global Moderator
Posts: 2203
Joined: 2014-07-20 18:12
Location: Europe
Has thanked: 50 times
Been thanked: 292 times

Re: [O/S] Web Scraping and Character Encoding

#2 Post by Aki »

Hello,
limotux wrote: 2023-11-16 19:51 [..]
FYI, I have Arabic fonts already installed, I can type and read Arabic text in any app normally without issues.
Did you configure the Arabic locale in KDE Plasma (System Settings -> Regional Settings -> Region and Language)?

What is your configured locale? You can check with the following command:

Code: Select all

locale
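
As a quick complement to the full `locale` listing, the codeset alone can be read with `locale charmap`. The following is a minimal sketch; the function name `is_utf8_locale` is made up for illustration and is not part of any script in this thread.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed when the active locale uses a UTF-8 codeset.
is_utf8_locale() {
  local charmap
  # `locale charmap` prints the character map of the current locale, e.g. "UTF-8".
  charmap=$(locale charmap 2>/dev/null)
  case "$charmap" in
    UTF-8|utf-8|UTF8|utf8) return 0 ;;
    *) return 1 ;;
  esac
}

if is_utf8_locale; then
  echo "Locale codeset is UTF-8"
else
  echo "Locale codeset is not UTF-8"
fi
```

A UTF-8 codeset only means the shell and its tools expect UTF-8; it does not guarantee that a downloaded page was decoded correctly.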
⢀⣴⠾⠻⢶⣦⠀
⣾⠁⢠⠒⠀⣿⡁ Debian - The universal operating system
⢿⡄⠘⠷⠚⠋⠀ https://www.debian.org
⠈⠳⣄⠀

Re: [O/S] Web Scraping and Character Encoding

#3 Post by limotux »

Thanks @Aki
Here it is:

Code: Select all

limo@debian:~$ locale
LANG=en_US.UTF-8
LANGUAGE=C
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=en_US.UTF-8
limo@debian:~$ 

Before this, I had the language set to US English, but got the same results!

Re: [O/S] Web Scraping and Character Encoding

#4 Post by Aki »

Hello,

Did you install the Arabic locales with the following command?

Code: Select all

su -l -c "dpkg-reconfigure locales"
You can check with the following commands:

Code: Select all

localectl status
localectl list-locales
grep ar_ /etc/locale.gen

Re: [O/S] Web Scraping and Character Encoding

#5 Post by limotux »

Thanks a lot @Aki

Code: Select all

limo@debian:~$ localectl status
System Locale: LANG=en_US.UTF-8
    VC Keymap: (unset)         
   X11 Layout: us
    X11 Model: pc105
limo@debian:~$ 

Code: Select all

limo@debian:~$ localectl list-locales
C.UTF-8
ar_EG.UTF-8
en_US.UTF-8
limo@debian:~$ 

Code: Select all

limo@debian:~$ grep ar_ /etc/locale.gen
# ar_AE ISO-8859-6
# ar_AE.UTF-8 UTF-8
# ar_BH ISO-8859-6
# ar_BH.UTF-8 UTF-8
# ar_DZ ISO-8859-6
# ar_DZ.UTF-8 UTF-8
# ar_EG ISO-8859-6
ar_EG.UTF-8 UTF-8
# ar_IN UTF-8
# ar_IQ ISO-8859-6
# ar_IQ.UTF-8 UTF-8
# ar_JO ISO-8859-6
# ar_JO.UTF-8 UTF-8
# ar_KW ISO-8859-6
# ar_KW.UTF-8 UTF-8
# ar_LB ISO-8859-6
# ar_LB.UTF-8 UTF-8
# ar_LY ISO-8859-6
# ar_LY.UTF-8 UTF-8
# ar_MA ISO-8859-6
# ar_MA.UTF-8 UTF-8
# ar_OM ISO-8859-6
# ar_OM.UTF-8 UTF-8
# ar_QA ISO-8859-6
# ar_QA.UTF-8 UTF-8
# ar_SA ISO-8859-6
# ar_SA.UTF-8 UTF-8
# ar_SD ISO-8859-6
# ar_SD.UTF-8 UTF-8
# ar_SS UTF-8
# ar_SY ISO-8859-6
# ar_SY.UTF-8 UTF-8
# ar_TN ISO-8859-6
# ar_TN.UTF-8 UTF-8
# ar_YE ISO-8859-6
# ar_YE.UTF-8 UTF-8
# ar_AE.UTF-8 UTF-8
limo@debian:~$ 
Should I run this command now:

Code: Select all

su -l -c "dpkg-reconfigure locales"

Re: [O/S] Web Scraping and Character Encoding

#6 Post by Aki »

Hello,
limotux wrote: 2023-11-19 10:44 Should I run this command now:

Code: Select all

su -l -c "dpkg-reconfigure locales"
No, your ar_EG locale is already installed.

What program do you use to download web pages?

Can you send an example of a "scraped" downloaded file and its URL? Please attach it to a follow-up message as a zip file (do not copy and paste it into the body of the message).

Re: [O/S] Web Scraping and Character Encoding

#7 Post by limotux »

In the bash script, I use googler to search and curl to retrieve the text.
Can you send an example of a "scraped" downloaded file and its URL?
The text I get is like "PK ! ý¤~4 Ó
[Content_Types].xml ¢(  ´–_oÚ0Åß'í;DyÓ=LÓô¡´OÕV©LÛ«±oÀ«ÿ;)ðíg;ђ‘l”¤Ä9çü|cnîøz£döΣ'ùU9Ê3ÐÌp¡—“üûü®øœg©æT
“|>¿ž¾7žo-ø,¨µŸä+Dû…ÏV ¨/V*ãÅpé–ÄRöD—@>ŽFŸ3AcÑ#ŸŽgPÑZbv» ·«—yvÓ<£&¹PQï“NÅ/ݒ´Ð­Uu¦lŠ¸Ò­q ý+µV
‫ا لد ر ا سة التشخيصية للقطا ع ا لخا ص‬
‫خلق األسواق في مصر‬"

I am not sure I understand which URL you mean. The script searches the web in general, so the results come from many different websites.
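
The sample above begins with `PK` and contains `[Content_Types].xml`, which is what the start of a ZIP-based office document (for example a .docx file) looks like, so at least some search results are probably documents rather than HTML pages. Below is a minimal sketch of a guard that could skip such downloads, assuming the response was first saved to a file. The function name `is_zip_payload` is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed when the file starts with the ZIP local-file
# signature "PK\003\004" (used by .zip, .docx, .xlsx, .odt and similar).
is_zip_payload() {
  local magic
  # Dump the first four bytes as hex and drop all whitespace.
  magic=$(head -c 4 "$1" | od -An -tx1 | tr -d ' \t\n')
  [[ "$magic" == "504b0304" ]]
}

# Usage sketch: "$1" is a file that curl already saved with -o.
if [[ -n "${1:-}" ]]; then
  if is_zip_payload "$1"; then
    echo "skip: $1 looks like a ZIP container, not HTML"
  else
    echo "ok: $1 does not look like a ZIP container"
  fi
fi
```

Text extracted from such a container would need a document converter rather than an HTML parser, which is outside what the script in this thread does.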

Re: [O/S] Web Scraping and Character Encoding

#8 Post by Aki »

What program do you use to download web pages?
Can you give an example URL?

Re: [O/S] Web Scraping and Character Encoding

#9 Post by limotux »

Aki wrote: 2023-11-19 13:58 What program do you use to download web pages?
Can you give an example URL?
Sorry @Aki
I am missing something here!
As in my previous reply, the script uses curl.
You may check any Arabic-language website, for example https://www.youm7.com/story/2023/8/11/% ... 86/6269508

Re: [O/S] Web Scraping and Character Encoding

#10 Post by Aki »

Hello,

The curl command behaves as expected: it stores the HTML page locally with Arabic Unicode characters:

Code: Select all

$ curl -o curl_arabic_page.log  https://www.youm7.com/story/2023/8/11/%D8%A7%D9%84%D8%B2%D8%B1%D8%A7%D8%B9%D8%A9-%D9%85%D8%B4%D8%B1%D9%88%D8%B9-%D8%A7%D9%84%D8%B5%D9%88%D8%A8-%D8%A3%D8%AD%D8%AF%D8%AB-%D9%86%D9%82%D9%84%D8%A9-%D9%86%D9%88%D8%B9%D9%8A%D8%A9-%D8%B9%D9%84%D9%89-%D9%85%D8%B3%D8%AA%D9%88%D9%89-%D8%A7%D9%84%D8%A5%D9%86%D8%AA%D8%A7%D8%AC%D9%8A%D8%A9-%D9%85%D9%86/6269508 
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 99030  100 99030    0     0   202k      0 --:--:-- --:--:-- --:--:--  203k

$ unidesc curl_arabic_page.log | head -n 10
       0              55        Basic Latin
      56             140        Arabic
     141            1467        Basic Latin
    1468            1666        Arabic
    1667            1725        Basic Latin
    1726            1819        Arabic
    1820            1925        Basic Latin
    1926            2005        Arabic
    2006            2083        Basic Latin
    2084            2166        Arabic
The lynx command also saves the page with Arabic Unicode characters:

Code: Select all

$ lynx --dump https://www.youm7.com/story/2023/8/11/%D8%A7%D9%84%D8%B2%D8%B1%D8%A7%D8%B9%D8%A9-%D9%85%D8%B4%D8%B1%D9%88%D8%B9-%D8%A7%D9%84%D8%B5%D9%88%D8%A8-%D8%A3%D8%AD%D8%AF%D8%AB-%D9%86%D9%82%D9%84%D8%A9-%D9%86%D9%88%D8%B9%D9%8A%D8%A9-%D8%B9%D9%84%D9%89-%D9%85%D8%B3%D8%AA%D9%88%D9%89-%D8%A7%D9%84%D8%A5%D9%86%D8%AA%D8%A7%D8%AC%D9%8A%D8%A9-%D9%85%D9%86/6269508 > arabic_page.log
$ unidesc arabic_page.log | head -n 10
       0             107        Basic Latin
     108             126        Arabic
     127             133        Basic Latin
     134            1619        Arabic
    1620            1658        Basic Latin
    1659            1668        Arabic
    1669            1690        Basic Latin
    1691            1939        Arabic
    1940            2043        Basic Latin
    2044            3363        Arabic
The unidesc command is from the uniutils package.
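
Where uniutils is not available, a rougher check is to ask iconv whether a file is valid UTF-8. This is only a sketch: it confirms that the bytes decode as UTF-8, not that they are Arabic, and the function name `is_valid_utf8` is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed when every byte sequence in the file decodes as UTF-8.
is_valid_utf8() {
  # Converting UTF-8 to UTF-8 changes nothing, but iconv exits non-zero
  # as soon as it meets an invalid sequence.
  iconv -f UTF-8 -t UTF-8 "$1" >/dev/null 2>&1
}

# Usage sketch: pass the file to check as the first argument.
if [[ -n "${1:-}" ]]; then
  if is_valid_utf8 "$1"; then
    echo "$1: valid UTF-8"
  else
    echo "$1: not valid UTF-8"
  fi
fi
```

Note that text which was mis-decoded and then re-encoded, like the garbled sample in the first post, is still valid UTF-8 and would pass this check.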

Therefore, I cannot replicate your issue.

What is your curl command?

Re: [O/S] Web Scraping and Character Encoding

#11 Post by limotux »

Thanks @Aki
I am worried the reason is that I do not have the Chromium browser installed. I have the default Firefox, plus Brave, which is set as the default browser.

Code: Select all

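# Get the character encoding of the web page from the HTTP Content-Type header
    content_type=$(curl -s -I -A "$user_agent" "$url" | grep -i '^content-type')
    # Header lines end in CRLF, so strip the trailing carriage return (and any quotes)
    encoding=$(echo "$content_type" | grep -io 'charset=[^;]*' | head -n 1 | cut -d '=' -f 2 | tr -d '\r"')
    if [[ -z "$encoding" ]]; then
      # Fall back to the first <meta ... charset=...> tag in the page body;
      # drop quotes and keep only the leading charset name
      encoding=$(curl -s -A "$user_agent" "$url" | grep -io '<meta[^>]*charset=[^>]*>' | head -n 1 | grep -io 'charset=[^>]*' | cut -d '=' -f 2 | tr -d "\"'" | grep -io '^[a-z0-9._-]*')
    fi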
    
.
.

Code: Select all

 else
      # Download the webpage and extract the main text using xpath
      main_text=$(curl -s -A "$user_agent" "$url" | xmllint --html --xpath "//p[not(ancestor::footer) and not(ancestor::nav) and not(ancestor::aside) and not(ancestor::script) and not(ancestor::button) and not(ancestor::a) and not(ancestor::h1) and not(ancestor::h2) and not(ancestor::h3) and not(ancestor::h4) and not(ancestor::h5) and not(ancestor::h6)]/text()" - 2>/dev/null | sed '/^\s*$/d')
    fi
By the way, I am seeing a lot of red after "--xpath"

To be honest, I didn't write the code in detail myself; I used AI tools such as Bard.
What I do not understand is why the script worked previously and is not working now!
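
One way the detected `$encoding` could be put to use is to convert each download to UTF-8 before any further processing. The sketch below is an assumption about how that might look, not a confirmed fix for this script. `to_utf8` is a hypothetical helper, and it treats a missing charset as UTF-8.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read a page on stdin and write it to stdout as UTF-8.
# $1 is the charset name found in the header or <meta> tag (may be empty).
to_utf8() {
  local enc
  # Keep the first line only and strip quotes, slashes, blanks and CR/LF,
  # which often cling to values cut out of headers or HTML attributes.
  enc=$(printf '%s\n' "${1:-}" | head -n 1 | tr -d "\"'/ \t\r\n")
  # Assumption: pages that declare no charset are treated as UTF-8.
  iconv -f "${enc:-UTF-8}" -t UTF-8
}

# Usage sketch with the variables from the script above
# (page.utf8.html is a hypothetical output file):
#   curl -s -A "$user_agent" "$url" | to_utf8 "$encoding" > page.utf8.html
```

If iconv does not recognise the charset name it exits with an error, so the caller would still need to decide what to do with that page.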

Re: [O/S] Web Scraping and Character Encoding

#12 Post by Aki »

I suspect your script is at fault.
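
One pattern that would produce output like the sample in the first post is UTF-8 bytes being decoded as a single-byte Western encoding somewhere in the pipeline and then written out again as UTF-8. The sketch below reproduces and reverses that effect with iconv. It uses ISO-8859-1 as a stand-in because every byte value maps to a character there; the function names and the sample bytes are illustrative, and nothing here confirms which step of the script is responsible.

```shell
#!/usr/bin/env bash
# Simulate the garbling: treat UTF-8 bytes as ISO-8859-1 and re-encode them as UTF-8.
garble() { iconv -f ISO-8859-1 -t UTF-8; }
# Reverse it: map each character back to the single byte it came from.
ungarble() { iconv -f UTF-8 -t ISO-8859-1; }

# Illustrative sample: the UTF-8 bytes of the Arabic letters meem and noon.
original=$(printf '\331\205\331\206')
garbled=$(printf '%s' "$original" | garble)
restored=$(printf '%s' "$garbled" | ungarble)

if [[ "$restored" == "$original" && "$garbled" != "$original" ]]; then
  echo "round trip restored the original bytes"
else
  echo "round trip did not restore the original bytes"
fi
```

Repairing an already garbled file this way is a last resort. It is more reliable to find the step that decodes the page with the wrong charset and correct it there.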

Re: [O/S] Web Scraping and Character Encoding

#13 Post by limotux »

Aki wrote: 2023-11-20 18:59 I suspect your script is at fault.
I will double check it and try again.
Thank you very much @Aki