[Solved] Web Scraping and Character Encoding
I wrote a small web scraping script. I scrape in two languages, English and Arabic.
I had the script on a previous Debian KDE Plasma installation and it worked fine for both languages, but I recently did a fresh install.
On the current installation, scraping itself works. When I scrape for an English search term, the results are written correctly to the text file I specified.
But if I try an Arabic search term, the script appears to scrape, yet the text file contains garbled characters like "Ù ÙØ¸Ù Ø© Ø´ÙÙÙØ© ÙÙØ¨ÙÙ Ø§ÙØ¯ÙÙÙ Ù" where Arabic text is expected.
FYI, I have Arabic fonts installed, and I can type and read Arabic text normally in any app.
Since the script worked before without problems, I guess the issue is with how the system handles the scraped text.
What can I do to get it to write the Arabic characters in a readable form?
I will highly appreciate any help.
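Text of this shape is what UTF-8 encoded Arabic typically looks like after its bytes have been decoded as a single-byte Latin encoding. A minimal sketch of that effect and its reversal, assuming ISO-8859-1 is the wrong decoding involved (it could equally be Windows-1252) and using a sample word that is not taken from the script:

```shell
#!/usr/bin/env bash
# Sketch: how UTF-8 Arabic turns into "Ù..."-style text, and how to undo it.
# Assumption: the garbling is UTF-8 bytes that were mis-read as ISO-8859-1.

original='مرحبا'   # sample Arabic word in UTF-8 (illustrative only)

# Mis-decode: treat the UTF-8 bytes as ISO-8859-1 and re-encode as UTF-8.
garbled=$(printf '%s' "$original" | iconv -f ISO-8859-1 -t UTF-8)

# Repair: apply the same conversion in the opposite direction.
repaired=$(printf '%s' "$garbled" | iconv -f UTF-8 -t ISO-8859-1)

printf 'garbled:  %s\nrepaired: %s\n' "$garbled" "$repaired"
```

If the repaired text is readable, the data itself survived and only a decoding step went wrong somewhere between download and output.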
Last edited by limotux on 2023-11-21 06:34, edited 1 time in total.
Debian 12 (bookworm), KDE Plasma, Info: dual core model: Intel Core i3-10110U cache: L1: 128 KiB L2: 512 KiB L3: 4 MiB ext4
ID-1: swap-1 type: partition size: 977 MiB used: 3 MiB (0.3%) priority: -2
Memory: 19.37 GiB
(Installed 24/8/2023) (no techie)
- Global Moderator
- Posts: 2203
- Joined: 2014-07-20 18:12
- Location: Europe
- Has thanked: 50 times
- Been thanked: 292 times
Re: [O/S] Web Scraping and Character Encoding
Hello,
What is your configured locale? You can check with the following command:
Code: Select all
locale
Did you configure the Arabic locale in KDE Plasma (System Settings -> Regional Settings -> Region and Language)?
Re: [O/S] Web Scraping and Character Encoding
Thanks @Aki
Here it is:
Code: Select all
limo@debian:~$ locale
LANG=en_US.UTF-8
LANGUAGE=C
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=en_US.UTF-8
limo@debian:~$
I previously had the language set to US English, but got the same results!
Re: [O/S] Web Scraping and Character Encoding
Hello,
Did you install Arabic locales with the following command?
Code: Select all
su -l -c "dpkg-reconfigure locales"
You can check with the following commands:
Code: Select all
localectl status
localectl list-locales
grep ar_ /etc/locale.gen
Re: [O/S] Web Scraping and Character Encoding
Thanks a lot @Aki
Code: Select all
limo@debian:~$ localectl status
System Locale: LANG=en_US.UTF-8
VC Keymap: (unset)
X11 Layout: us
X11 Model: pc105
limo@debian:~$
Code: Select all
limo@debian:~$ localectl list-locales
C.UTF-8
ar_EG.UTF-8
en_US.UTF-8
limo@debian:~$
Code: Select all
limo@debian:~$ grep ar_ /etc/locale.gen
# ar_AE ISO-8859-6
# ar_AE.UTF-8 UTF-8
# ar_BH ISO-8859-6
# ar_BH.UTF-8 UTF-8
# ar_DZ ISO-8859-6
# ar_DZ.UTF-8 UTF-8
# ar_EG ISO-8859-6
ar_EG.UTF-8 UTF-8
# ar_IN UTF-8
# ar_IQ ISO-8859-6
# ar_IQ.UTF-8 UTF-8
# ar_JO ISO-8859-6
# ar_JO.UTF-8 UTF-8
# ar_KW ISO-8859-6
# ar_KW.UTF-8 UTF-8
# ar_LB ISO-8859-6
# ar_LB.UTF-8 UTF-8
# ar_LY ISO-8859-6
# ar_LY.UTF-8 UTF-8
# ar_MA ISO-8859-6
# ar_MA.UTF-8 UTF-8
# ar_OM ISO-8859-6
# ar_OM.UTF-8 UTF-8
# ar_QA ISO-8859-6
# ar_QA.UTF-8 UTF-8
# ar_SA ISO-8859-6
# ar_SA.UTF-8 UTF-8
# ar_SD ISO-8859-6
# ar_SD.UTF-8 UTF-8
# ar_SS UTF-8
# ar_SY ISO-8859-6
# ar_SY.UTF-8 UTF-8
# ar_TN ISO-8859-6
# ar_TN.UTF-8 UTF-8
# ar_YE ISO-8859-6
# ar_YE.UTF-8 UTF-8
# ar_AE.UTF-8 UTF-8
limo@debian:~$
Should I run this command now?
Code: Select all
su -l -c "dpkg-reconfigure locales"
Re: [O/S] Web Scraping and Character Encoding
Hello,
limotux wrote: ↑2023-11-19 10:44
Should I do now the command:
Code: Select all
su -l -c "dpkg-reconfigure locales"
No, your ar_EG locale is already installed.
What is the program you use to download web pages?
Can you send an example of a "scraped" downloaded file and its URL? Please attach it to a follow-up message as a zip file (do not copy and paste it into the body of the message).
Re: [O/S] Web Scraping and Character Encoding
I use googler in the bash script to search and curl to retrieve the text.
Aki wrote:
Can you send an example of "scraped" downloaded file and its URL ?
The text I get is like "PK ! ý¤~4 Ó
[Content_Types].xml ¢( ´_oÚ0Åß'í;DyÓ=LÓô¡´OÕV©LÛ«±oÀ«ÿ;)ðíg;Ñl¤Ä9çü|cnîøz£döΣ'ùU9Ê3ÐÌp¡üûü®øg©æT
|>¿¾7o-ø,¨µä+Dû ÏV ¨/V*ãÅpéÄRöD@>F3AcÑ#gPÑZbv» ·«yvÓ<£&¹PQïNÅ/Ý´ÐUu¦l¸Òq ý+µV
â«Ø§ ÙØ¯ ر ا سة Ø§ÙØªØ´Ø®ÙØµÙØ© ÙÙÙØ·Ø§ ع ا ÙØ®Ø§ صâ¬
â«Ø®ÙÙ Ø§Ø£ÙØ³Ùا٠Ù٠٠صرâ¬"
I am not sure I understand what URL you mean? It can search the web in general so there are many websites that the results come from.
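The sample above starts with PK and contains [Content_Types].xml, which is how a ZIP-based Office document (for example a .docx file) looks when its raw bytes are printed as text. Assuming some search results point at such files rather than HTML pages, a sketch like the following could detect them before any text extraction. The function name looks_like_zip and the file name result.bin are hypothetical:

```shell
#!/usr/bin/env bash
# Sketch: detect ZIP containers (.docx, .xlsx, .zip) by their magic bytes.

# looks_like_zip FILE -> exit status 0 if FILE starts with "PK\003\004".
looks_like_zip() {
    # Dump the first four bytes as hex, e.g. "504b0304".
    zip_magic=$(head -c 4 "$1" | od -An -tx1 | tr -d ' \n')
    [ "$zip_magic" = "504b0304" ]
}

# Possible use (result.bin stands for a file saved by curl):
#   if looks_like_zip result.bin; then
#       echo "not an HTML page, skipping" >&2
#   fi
```

Such files are compressed binary data, so no character-encoding setting would make them readable as plain text.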
Re: [O/S] Web Scraping and Character Encoding
What is the program you use to download web pages?
Can you give an example URL?
Re: [O/S] Web Scraping and Character Encoding
Sorry @Aki
I am missing something here!
As in my previous reply, the script uses curl.
You may check any Arabic language website for example https://www.youm7.com/story/2023/8/11/% ... 86/6269508
Re: [O/S] Web Scraping and Character Encoding
Hello,
The curl command behaves as expected: it stores the HTML page locally with Arabic Unicode characters:
Code: Select all
$ curl -o curl_arabic_page.log https://www.youm7.com/story/2023/8/11/%D8%A7%D9%84%D8%B2%D8%B1%D8%A7%D8%B9%D8%A9-%D9%85%D8%B4%D8%B1%D9%88%D8%B9-%D8%A7%D9%84%D8%B5%D9%88%D8%A8-%D8%A3%D8%AD%D8%AF%D8%AB-%D9%86%D9%82%D9%84%D8%A9-%D9%86%D9%88%D8%B9%D9%8A%D8%A9-%D8%B9%D9%84%D9%89-%D9%85%D8%B3%D8%AA%D9%88%D9%89-%D8%A7%D9%84%D8%A5%D9%86%D8%AA%D8%A7%D8%AC%D9%8A%D8%A9-%D9%85%D9%86/6269508
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 99030 100 99030 0 0 202k 0 --:--:-- --:--:-- --:--:-- 203k
$ unidesc curl_arabic_page.log | head -n 10
0 55 Basic Latin
56 140 Arabic
141 1467 Basic Latin
1468 1666 Arabic
1667 1725 Basic Latin
1726 1819 Arabic
1820 1925 Basic Latin
1926 2005 Arabic
2006 2083 Basic Latin
2084 2166 Arabic
The lynx command also saves the page with Arabic Unicode characters:
Code: Select all
$ lynx --dump https://www.youm7.com/story/2023/8/11/%D8%A7%D9%84%D8%B2%D8%B1%D8%A7%D8%B9%D8%A9-%D9%85%D8%B4%D8%B1%D9%88%D8%B9-%D8%A7%D9%84%D8%B5%D9%88%D8%A8-%D8%A3%D8%AD%D8%AF%D8%AB-%D9%86%D9%82%D9%84%D8%A9-%D9%86%D9%88%D8%B9%D9%8A%D8%A9-%D8%B9%D9%84%D9%89-%D9%85%D8%B3%D8%AA%D9%88%D9%89-%D8%A7%D9%84%D8%A5%D9%86%D8%AA%D8%A7%D8%AC%D9%8A%D8%A9-%D9%85%D9%86/6269508 > arabic_page.log
$ unidesc arabic_page.log | head -n 10
0 107 Basic Latin
108 126 Arabic
127 133 Basic Latin
134 1619 Arabic
1620 1658 Basic Latin
1659 1668 Arabic
1669 1690 Basic Latin
1691 1939 Arabic
1940 2043 Basic Latin
2044 3363 Arabic
The unidesc command is from the uniutils package.
Therefore, I cannot replicate your issue.
What is your curl command ?
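Where the uniutils package is not installed, a rougher check is possible with standard tools. The sketch below assumes iconv and od are available; the function names and the file name page.html are hypothetical. It tests whether a saved file is valid UTF-8 and whether it contains byte pairs in the range that UTF-8 uses for the Arabic block (U+0600 to U+06FF):

```shell
#!/usr/bin/env bash
# Sketch: rough encoding checks on a downloaded file, without unidesc.

# is_valid_utf8 FILE -> exit status 0 if FILE decodes cleanly as UTF-8.
is_valid_utf8() {
    iconv -f UTF-8 -t UTF-8 "$1" >/dev/null 2>&1
}

# has_arabic_utf8 FILE -> exit status 0 if FILE contains a lead byte
# D8..DB followed by a continuation byte 80..BF (U+0600..U+06FF in UTF-8).
has_arabic_utf8() {
    od -An -v -tx1 "$1" | tr -d '\n' | grep -Eq ' d[89ab] +[89ab][0-9a-f]'
}

# Possible use (page.html stands for a file saved by curl):
#   is_valid_utf8 page.html && has_arabic_utf8 page.html && echo "Arabic UTF-8 found"
```

A file that is valid UTF-8 but shows no Arabic byte pairs, despite coming from an Arabic page, would suggest the text was already garbled before it was written.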
Re: [O/S] Web Scraping and Character Encoding
Thanks @Aki
I am worried the reason is that I do not have the Chromium browser installed. I have the default Firefox, and Brave, which is set as the default browser.
Code: Select all
# Get the character encoding of the web page
content_type=$(curl -s -I -A "$user_agent" "$url" | grep -i 'content-type')
encoding=$(echo "$content_type" | grep -io 'charset=[^;]*' | cut -d '=' -f 2)
if [[ -z "$encoding" ]]; then
encoding=$(curl -s -A "$user_agent" "$url" | grep -io '<meta[^>]*charset=[^>]*>' | grep -io 'charset=[^>]*' | cut -d '=' -f 2)
fi
.
Code: Select all
else
# Download the webpage and extract the main text using xpath
main_text=$(curl -s -A "$user_agent" "$url" | xmllint --html --xpath "//p[not(ancestor::footer) and not(ancestor::nav) and not(ancestor::aside) and not(ancestor::script) and not(ancestor::button) and not(ancestor::a) and not(ancestor::h1) and not(ancestor::h2) and not(ancestor::h3) and not(ancestor::h4) and not(ancestor::h5) and not(ancestor::h6)]/text()" - 2>/dev/null | sed '/^\s*$/d')
fi
.
By the way, I am seeing a lot of red after "--xpath"
To be honest, I didn't write the code in detail myself; I used an AI such as Bard.
What I do not understand is why the script worked previously and is not working now!
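The fragments above detect $encoding but do not show it being used. Because HTTP header lines end in CR LF, a value cut from the Content-Type header can also carry a trailing carriage return, and a value cut from a meta tag can carry quotes or a slash. A hedged sketch of a helper that cleans the label and converts the page to UTF-8 with iconv before it reaches xmllint; the function name to_utf8 is hypothetical and the fallback to UTF-8 is an assumption, not part of the posted script:

```shell
#!/usr/bin/env bash
# Sketch: normalise a downloaded page to UTF-8 before parsing it.

# to_utf8 [CHARSET] : read a page on stdin, write it as UTF-8 on stdout.
to_utf8() {
    # Drop quotes, slashes, spaces, CR and LF that may surround the label.
    page_charset=$(printf '%s' "${1:-}" | tr -d '"\047/ \r\n')
    # Assumption: treat a missing label as UTF-8.
    iconv -f "${page_charset:-UTF-8}" -t UTF-8
}

# Possible use inside the script (variables as in the fragments above,
# XPath expression elided):
#   main_text=$(curl -s -A "$user_agent" "$url" | to_utf8 "$encoding" \
#       | xmllint --html --xpath '...' - 2>/dev/null)
```

This only guarantees that the input handed to the parser is UTF-8; whether it resolves the garbled output depends on where in the pipeline the wrong decoding actually happens.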
Re: [O/S] Web Scraping and Character Encoding
I will double check it and try again.
Thank you very much @Aki