From: Florian Rosenauer
Subject: Re: wget skips some local files for link updates in recursive mode if there are many files to be downloaded
Date: Sun, 28 Feb 2021 11:51:23 +0100
User-agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:78.0) Gecko/20100101 Thunderbird/78.7.1

Hello,

I was able to narrow the problem down a bit more. I excluded the image
downloads, which make up the biggest part, and experimented with
downloading different sets of pages:

The following command downloads pages A-b (about 140 files in total) and
works fine:
wget --recursive --no-clobber --page-requisites --html-extension \
  --convert-links --restrict-file-names=windows --span-hosts \
  --domains=fonts.gstatic.com,static.wikia.nocookie.net,vignette.wikia.nocookie.net,vignette3.wikia.nocookie.net,www.fastly-insights.com,www.googletagmanager.com,xwing-miniatures.fandom.com \
  --reject-regex \
  ".*xwing-miniatures.fandom.com/f/.*|.*xwing-miniatures.fandom.com/wiki/[C-Zc-z].*|.*static.wikia.nocookie.net/xwing-miniatures/images/.*|.*Template.*|.*action=edit.*|.*action=history.*|.*oldid=.*" \
  https://xwing-miniatures.fandom.com

The following command downloads pages A-c (about 540 files in total) and
fails to update the link rel="stylesheet" href to the locally downloaded
file:
wget --recursive --no-clobber --page-requisites --html-extension \
  --convert-links --restrict-file-names=windows --span-hosts \
  --domains=fonts.gstatic.com,static.wikia.nocookie.net,vignette.wikia.nocookie.net,vignette3.wikia.nocookie.net,www.fastly-insights.com,www.googletagmanager.com,xwing-miniatures.fandom.com \
  --reject-regex \
  ".*xwing-miniatures.fandom.com/f/.*|.*xwing-miniatures.fandom.com/wiki/[D-Zd-z].*|.*static.wikia.nocookie.net/xwing-miniatures/images/.*|.*Template.*|.*action=edit.*|.*action=history.*|.*oldid=.*" \
  https://xwing-miniatures.fandom.com

I thought that maybe a "C" page triggers a bug in wget, but downloading
only F-z also fails to update the link. So it looks to me like the
problem really is related to the number of pages being downloaded: the
more pages are downloaded, the more likely it is that wget doesn't update
the CSS stylesheet links.
I think I will write a small hardcoded sed script to point the links in
my files to the local copies.
Maybe someone has an idea what causes this, or how to narrow down the
original problem further.
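
A minimal sketch of such a workaround, written in Python rather than sed.
It assumes that every unconverted stylesheet link starts with the remote
load.php prefix shown in result 3b below, and that the caller knows where
the downloaded stylesheet file is on disk. The function names and the
css_file parameter are made up for this illustration; nothing here is
part of wget.

```python
import html
import os
import re
from pathlib import Path
from urllib.parse import quote

# Assumption: every unconverted link starts with this prefix (see result 3b).
REMOTE_PREFIX = "https://xwing-miniatures.fandom.com/load.php?"
HREF_RE = re.compile('href="' + re.escape(REMOTE_PREFIX) + '[^"]*"')


def href_for_local_file(relative_path):
    """Turn a relative file path into an HTML href attribute value.

    A literal '%' in the file name becomes '%25' and '&' becomes '&amp;',
    which is the form of the converted links in results 1b and 2b.
    """
    return html.escape(quote(relative_path, safe="/@=&,"), quote=True)


def rewrite_stylesheet_links(page, local_href):
    """Replace every remote load.php href in the page text with local_href."""
    return HREF_RE.sub(lambda match: 'href="' + local_href + '"', page)


def rewrite_tree(root, css_file):
    """Rewrite all *.html files below root in place; return the number changed."""
    changed = 0
    for page_path in Path(root).rglob("*.html"):
        # The href must be relative to the directory of each page.
        relative = Path(os.path.relpath(css_file, page_path.parent)).as_posix()
        old = page_path.read_text(encoding="utf-8", errors="surrogateescape")
        new = rewrite_stylesheet_links(old, href_for_local_file(relative))
        if new != old:
            page_path.write_text(new, encoding="utf-8", errors="surrogateescape")
            changed += 1
    return changed
```

Calling rewrite_tree("xwing-miniatures.fandom.com", path_to_css) would then
touch only pages that still contain the remote href. It is worth trying it
on a copy of the download first.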

Thanks
Florian




On 21.02.2021 17:49, Florian Rosenauer wrote:
Hello!

I have the following problem: the page
https://xwing-miniatures.fandom.com/wiki/X-Wing_Miniatures_Wiki contains
a link element referencing a stylesheet:
<link rel="stylesheet"
href="/load.php?lang=en&amp;modules=ext.categorySelect.runtimeStyles%7Cext.fandom.ArticleInterlang.css%7Cext.fandom.CreatePage.css%7Cext.fandom.DesignSystem.css%7Cext.fandom.Thumbnails.css%7Cext.fandom.UserPreferencesV2.runtime.css%7Cext.fandom.bannerNotifications.css%7Cext.fandom.coreRuntimeStyles%2CwikiaBarRuntimeStyles%7Cext.fandom.mainPageTag.css%7Cext.staffSig.css%7Cext.visualEditor.desktopArticleTarget.noscript%7Cmediawiki.legacy.commonPrint%2Cshared%7Cskin.oasis.css%7Cskin.oasis.fanFeed.css%2CdiscussionsRuntimeStyles%7Cskin.oasis.pageheader.Share.css&amp;only=styles&amp;skin=oasis"/>


During a single-page download the link is updated to point to the local
file. During a recursive download the file is downloaded, but the link is
only updated if the download is limited to a few pages.

To reproduce:
1a. Run wget to load a single page using --page-requisites:
     wget --no-clobber --page-requisites --html-extension \
       --convert-links --restrict-file-names=windows --span-hosts \
       https://xwing-miniatures.fandom.com -d -o singlepage.log

1b. Result: a fully working local version in
xwing-miniatures.fandom.com\index.html, the link element is replaced
with the local file
     <link rel="stylesheet"
href="load.php@lang=en&amp;modules=ext.categorySelect.runtimeStyles%257Cext.fandom.ArticleInterlang.css%257Cext.fandom.CreatePage.css%257Cext.fandom.DesignSystem.css%257Cext.fandom.Thumbnails.css%257Cext.fandom.UserPreferencesV2.css"/>


2a. Run wget recursively, but limit it to very few pages with
--reject-regex:
     wget --recursive --no-clobber --page-requisites --html-extension \
       --convert-links --restrict-file-names=windows --span-hosts \
       --domains=fonts.gstatic.com,static.wikia.nocookie.net,vignette.wikia.nocookie.net,vignette3.wikia.nocookie.net,www.fastly-insights.com,www.googletagmanager.com,xwing-miniatures.fandom.com \
       --reject-regex ".*xwing-miniatures.fandom.com/.*/.*|.*action=edit.*" \
       https://xwing-miniatures.fandom.com

2b. Result: a fully working local version in
xwing-miniatures.fandom.com\index.html, the link element is replaced
with the local file:
     <link rel="stylesheet"
href="load.php@lang=en&amp;modules=ext.categorySelect.runtimeStyles%257Cext.fandom.ArticleInterlang.css%257Cext.fandom.CreatePage.css%257Cext.fandom.DesignSystem.css%257Cext.fandom.Thumbnails.css%257Cext.fandom.UserPreferencesV2.css"/>


3a. Attention: this downloads about 3300 pages / 8500 files!
     Run wget recursively, but with fewer restrictions in --reject-regex
     (reject only the forum and the wiki edit/history pages):
     wget --recursive --no-clobber --page-requisites --html-extension \
       --convert-links --restrict-file-names=windows --span-hosts \
       --domains=fonts.gstatic.com,static.wikia.nocookie.net,vignette.wikia.nocookie.net,vignette3.wikia.nocookie.net,www.fastly-insights.com,www.googletagmanager.com,xwing-miniatures.fandom.com \
       --reject-regex \
       "https://xwing-miniatures.fandom.com/f/.*|.*action=edit.*|.*action=history.*|.*@oldid=.*" \
       https://xwing-miniatures.fandom.com

3b. Result: after everything has finished, the link is altered to refer
to the online page (!), although the stylesheet was downloaded locally and
is needed for a working local page:
     <link rel="stylesheet"
href="https://xwing-miniatures.fandom.com/load.php?lang=en&amp;modules=ext.categorySelect.runtimeStyles%7Cext.fandom.ArticleInterlang.css%7Cext.fandom.CreatePage.css%7Cext.fandom.DesignSystem.css%7Cext.fandom.Thumbnails.css%7Cext.fandom.UserPreferencesV2.runtime.css%7Cext.fandom.bannerNotifications.css%7Cext.fandom.coreRuntimeStyles%2CwikiaBarRuntimeStyles%7Cext.fandom.mainPageTag.css%7Cext.staffSig.css%7Cext.visualEditor.desktopArticleTarget.noscript%7Cmediawiki.legacy.commonPrint%2Cshared%7Cskin.oasis.css%7Cskin.oasis.fanFeed.css%2CdiscussionsRuntimeStyles%7Cskin.oasis.pageheader.Share.css&amp;only=styles&amp;skin=oasis"/>

If you open the page offline (without anything cached!), it is
completely unusable because the main CSS is missing.
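
To see which downloaded pages are affected, the check can be automated.
The following is only a sketch using Python's standard html.parser; the
class and function names are invented for this example. It lists, per
HTML file, the stylesheet links that still point to a remote URL.

```python
from html.parser import HTMLParser
from pathlib import Path
from urllib.parse import urlsplit


class StylesheetLinkCollector(HTMLParser):
    """Collect the href of every <link rel="stylesheet"> element."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # Also called for self-closing tags such as <link ... />.
        if tag != "link":
            return
        attributes = dict(attrs)
        rel_tokens = (attributes.get("rel") or "").lower().split()
        href = attributes.get("href")
        if "stylesheet" in rel_tokens and href:
            self.hrefs.append(href)


def remote_stylesheets(page):
    """Return the stylesheet hrefs in the page text that are not local."""
    collector = StylesheetLinkCollector()
    collector.feed(page)
    collector.close()
    return [
        href
        for href in collector.hrefs
        if urlsplit(href).scheme in ("http", "https") or href.startswith("//")
    ]


def report(root):
    """Map each *.html file below root to its remote stylesheet links."""
    result = {}
    for page_path in Path(root).rglob("*.html"):
        text = page_path.read_text(encoding="utf-8", errors="replace")
        remote = remote_stylesheets(text)
        if remote:
            result[str(page_path)] = remote
    return result
```

Running report() after each of the test downloads above would show at
which download size the unconverted links start to appear.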

Note 1: The --domains list was built from the result of command 1a, to
limit the downloads.
Note 2: On Windows, make sure to download into a directory with a short
path. The long file names quickly push the full path beyond 256
characters, and neither Firefox nor Chrome can open file paths > 256
chars on Windows xD
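
A quick way to check a finished download for this is sketched below. The
limit of 256 is simply the figure from Note 2 and is passed as a
parameter. The function names are hypothetical. Under Cygwin the
absolute path has a different prefix than the native Windows path, so the
base directory may need to be given in its Windows form.

```python
import os


def overlong_paths(base_dir, relative_paths, limit=256):
    """Return the joined paths that are longer than limit characters."""
    result = []
    for relative in relative_paths:
        full = os.path.join(base_dir, relative)
        if len(full) > limit:
            result.append(full)
    return result


def scan(download_dir, limit=256):
    """Walk download_dir and report every file whose full path is too long."""
    base = os.path.abspath(download_dir)
    relative_paths = []
    for dirpath, _dirnames, filenames in os.walk(base):
        for name in filenames:
            full = os.path.join(dirpath, name)
            relative_paths.append(os.path.relpath(full, base))
    return overlong_paths(base, relative_paths, limit)
```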

Version Information:
$ wget -V
GNU Wget 1.21.1 built on cygwin.

+cares +digest +gpgme +https +ipv6 +iri +large-file +metalink +nls
+ntlm +opie +psl +ssl/gnutls

Wgetrc:
     /etc/wgetrc (system)
Locale:
     /usr/share/locale
Compile:
     gcc -DHAVE_CONFIG_H -DSYSTEM_WGETRC="/etc/wgetrc"
     -DLOCALEDIR="/usr/share/locale" -I.
     -I/home/BWI/src/cygwin/wget/wget-1.21.1-1.x86_64/src/wget-1.21.1/src
     -I../lib
     -I/home/BWI/src/cygwin/wget/wget-1.21.1-1.x86_64/src/wget-1.21.1/lib
     -I/usr/include/uuid -DNDEBUG -ggdb -O2 -pipe -Wall
     -Werror=format-security -Wp,-D_FORTIFY_SOURCE=2
     -fstack-protector-strong --param=ssp-buffer-size=4
     -fdebug-prefix-map=/home/BWI/src/cygwin/wget/wget-1.21.1-1.x86_64/build=/usr/src/debug/wget-1.21.1-1
     -fdebug-prefix-map=/home/BWI/src/cygwin/wget/wget-1.21.1-1.x86_64/src/wget-1.21.1=/usr/src/debug/wget-1.21.1-1
Link:
     gcc -I/usr/include/uuid -DNDEBUG -ggdb -O2 -pipe -Wall
     -Werror=format-security -Wp,-D_FORTIFY_SOURCE=2
     -fstack-protector-strong --param=ssp-buffer-size=4
     -fdebug-prefix-map=/home/BWI/src/cygwin/wget/wget-1.21.1-1.x86_64/build=/usr/src/debug/wget-1.21.1-1
     -fdebug-prefix-map=/home/BWI/src/cygwin/wget/wget-1.21.1-1.x86_64/src/wget-1.21.1=/usr/src/debug/wget-1.21.1-1
     -lmetalink -lcares -lpcre2-8 -luuid -lidn2 -lnettle -lgnutls -lz
     -lpsl -lgpgme ftp-opie.o gnutls.o http-ntlm.o ../lib/libgnu.a
     -liconv -lintl -lunistring



Should I submit a bug report? Am I missing something?

Thank you
Florian



