bug-wget

From: Florian Rosenauer
Subject: wget skips some local files for link updates in recursive mode if there are many files to be downloaded
Date: Sun, 21 Feb 2021 17:49:26 +0100
User-agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:78.0) Gecko/20100101 Thunderbird/78.7.1

Hello!

I have the following problem. The page
https://xwing-miniatures.fandom.com/wiki/X-Wing_Miniatures_Wiki contains
a link element referencing a stylesheet:
<link rel="stylesheet"
href="/load.php?lang=en&amp;modules=ext.categorySelect.runtimeStyles%7Cext.fandom.ArticleInterlang.css%7Cext.fandom.CreatePage.css%7Cext.fandom.DesignSystem.css%7Cext.fandom.Thumbnails.css%7Cext.fandom.UserPreferencesV2.runtime.css%7Cext.fandom.bannerNotifications.css%7Cext.fandom.coreRuntimeStyles%2CwikiaBarRuntimeStyles%7Cext.fandom.mainPageTag.css%7Cext.staffSig.css%7Cext.visualEditor.desktopArticleTarget.noscript%7Cmediawiki.legacy.commonPrint%2Cshared%7Cskin.oasis.css%7Cskin.oasis.fanFeed.css%2CdiscussionsRuntimeStyles%7Cskin.oasis.pageheader.Share.css&amp;only=styles&amp;skin=oasis"/>

During a single-page download the link is rewritten to point at the
local file. During a recursive download the file is downloaded too, but
the link is only rewritten if the download is limited to a few pages.
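One way to tell which of the two cases a finished download ended up in is to scan the saved HTML for stylesheet links that still carry a host name. The following is only a rough sketch using Python's standard library; the names StylesheetLinkCollector and remote_stylesheets are made up for this illustration and are not part of wget.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class StylesheetLinkCollector(HTMLParser):
    """Collect the href values of <link rel="stylesheet"> elements."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # Also called for self-closing tags such as <link .../>.
        if tag != "link":
            return
        attr = dict(attrs)
        rel = (attr.get("rel") or "").lower().split()
        if "stylesheet" in rel and attr.get("href"):
            self.hrefs.append(attr["href"])


def remote_stylesheets(html_text):
    """Return stylesheet hrefs that still point at a network host."""
    collector = StylesheetLinkCollector()
    collector.feed(html_text)
    # A converted link is a relative file name and has no network location.
    return [href for href in collector.hrefs if urlparse(href).netloc]
```

Feeding it the contents of the saved index.html should give an empty list when the link was converted, and the remaining https:// URLs when it was not.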

To reproduce:
1a. Run wget to download a single page using --page-requisites:
    wget --no-clobber --page-requisites --html-extension \
      --convert-links --restrict-file-names=windows --span-hosts \
      https://xwing-miniatures.fandom.com -d -o singlepage.log

1b. Result: a fully working local version in
xwing-miniatures.fandom.com\index.html; the link element is rewritten
to point at the local file:
    <link rel="stylesheet"
href="load.php@lang=en&amp;modules=ext.categorySelect.runtimeStyles%257Cext.fandom.ArticleInterlang.css%257Cext.fandom.CreatePage.css%257Cext.fandom.DesignSystem.css%257Cext.fandom.Thumbnails.css%257Cext.fandom.UserPreferencesV2.css"/>

2a. Run wget recursively, but use --reject-regex to limit it to very
few pages:
    wget --recursive --no-clobber --page-requisites --html-extension \
      --convert-links --restrict-file-names=windows --span-hosts \
      --domains=fonts.gstatic.com,static.wikia.nocookie.net,vignette.wikia.nocookie.net,vignette3.wikia.nocookie.net,www.fastly-insights.com,www.googletagmanager.com,xwing-miniatures.fandom.com \
      --reject-regex ".*xwing-miniatures.fandom.com/.*/.*|.*action=edit.*" \
      https://xwing-miniatures.fandom.com

2b. Result: a fully working local version in
xwing-miniatures.fandom.com\index.html; the link element is rewritten
to point at the local file:
    <link rel="stylesheet"
href="load.php@lang=en&amp;modules=ext.categorySelect.runtimeStyles%257Cext.fandom.ArticleInterlang.css%257Cext.fandom.CreatePage.css%257Cext.fandom.DesignSystem.css%257Cext.fandom.Thumbnails.css%257Cext.fandom.UserPreferencesV2.css"/>

3a. Warning: this downloads about 3300 pages / 8500 files!
    Run wget recursively with a less restrictive --reject-regex
(reject only the forum and the wiki edit/history pages):
    wget --recursive --no-clobber --page-requisites --html-extension \
      --convert-links --restrict-file-names=windows --span-hosts \
      --domains=fonts.gstatic.com,static.wikia.nocookie.net,vignette.wikia.nocookie.net,vignette3.wikia.nocookie.net,www.fastly-insights.com,www.googletagmanager.com,xwing-miniatures.fandom.com \
      --reject-regex "https://xwing-miniatures.fandom.com/f/.*|.*action=edit.*|.*action=history.*|.*@oldid=.*" \
      https://xwing-miniatures.fandom.com

3b. Result: after everything has finished, the link has been rewritten
to point at the online page (!), although the stylesheet was downloaded
locally and is needed for a working local page:
    <link rel="stylesheet"
href="https://xwing-miniatures.fandom.com/load.php?lang=en&amp;modules=ext.categorySelect.runtimeStyles%7Cext.fandom.ArticleInterlang.css%7Cext.fandom.CreatePage.css%7Cext.fandom.DesignSystem.css%7Cext.fandom.Thumbnails.css%7Cext.fandom.UserPreferencesV2.runtime.css%7Cext.fandom.bannerNotifications.css%7Cext.fandom.coreRuntimeStyles%2CwikiaBarRuntimeStyles%7Cext.fandom.mainPageTag.css%7Cext.staffSig.css%7Cext.visualEditor.desktopArticleTarget.noscript%7Cmediawiki.legacy.commonPrint%2Cshared%7Cskin.oasis.css%7Cskin.oasis.fanFeed.css%2CdiscussionsRuntimeStyles%7Cskin.oasis.pageheader.Share.css&amp;only=styles&amp;skin=oasis"/>
If you open the page offline (with nothing cached!), it is completely
unusable because the main CSS is missing.
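Apart from the -d -o logging options in 1a, the commands in 2a and 3a differ only in their --reject-regex pattern. To see roughly which URLs each pattern filters out, the sketch below applies both to two URLs from this report. It rests on assumptions: Python's re module stands in for the regex engine wget uses, the pattern is applied as an unanchored search against the full URL, and the load.php URL is shortened by hand. The helper name is_rejected is made up for the illustration.

```python
import re

# Patterns copied from steps 2a and 3a.
REJECT_2A = r".*xwing-miniatures.fandom.com/.*/.*|.*action=edit.*"
REJECT_3A = (
    r"https://xwing-miniatures.fandom.com/f/.*"
    r"|.*action=edit.*|.*action=history.*|.*@oldid=.*"
)

# The wiki start page from the top of this report.
WIKI_PAGE = "https://xwing-miniatures.fandom.com/wiki/X-Wing_Miniatures_Wiki"
# The stylesheet URL, with its long modules= parameter left out (assumption).
STYLESHEET = "https://xwing-miniatures.fandom.com/load.php?lang=en&only=styles&skin=oasis"


def is_rejected(pattern, url):
    """Approximate --reject-regex: reject the URL if the pattern matches it."""
    return re.search(pattern, url) is not None


for label, pattern in (("2a", REJECT_2A), ("3a", REJECT_3A)):
    for url in (WIKI_PAGE, STYLESHEET):
        print(label, "rejects" if is_rejected(pattern, url) else "accepts", url)
```

Under these assumptions the 2a pattern rejects every URL with more than one path segment, including the /wiki/ pages, while the 3a pattern lets them through. The stylesheet URL is accepted by both, which fits the observation that the file is downloaded in either case.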

Note 1: the --domains list was built from the results of step 1a, to
limit the downloads.
Note 2: on Windows, make sure to download into a directory with a short
path: the long file names quickly push paths past 256 chars, and neither
Firefox nor Chrome can open file paths > 256 chars on Windows xD
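To check whether a finished download ran into the path-length problem from Note 2, something like the following could be used. It is a minimal sketch; the function name long_paths is made up, and the default limit of 256 is simply the figure quoted above, not a value taken from any Windows documentation.

```python
import os


def long_paths(root, limit=256):
    """Return absolute file paths under root longer than limit characters."""
    too_long = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.abspath(os.path.join(dirpath, name))
            if len(full) > limit:
                too_long.append(full)
    return sorted(too_long)
```

Calling long_paths on the download directory lists the files a browser may refuse to open; an empty list means every path is within the limit.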

Version Information:
$ wget -V
GNU Wget 1.21.1 built on cygwin.

+cares +digest +gpgme +https +ipv6 +iri +large-file +metalink +nls
+ntlm +opie +psl +ssl/gnutls

Wgetrc:
    /etc/wgetrc (system)
Locale:
    /usr/share/locale
Compile:
    gcc -DHAVE_CONFIG_H -DSYSTEM_WGETRC="/etc/wgetrc"
    -DLOCALEDIR="/usr/share/locale" -I.
    -I/home/BWI/src/cygwin/wget/wget-1.21.1-1.x86_64/src/wget-1.21.1/src
    -I../lib
    -I/home/BWI/src/cygwin/wget/wget-1.21.1-1.x86_64/src/wget-1.21.1/lib
    -I/usr/include/uuid -DNDEBUG -ggdb -O2 -pipe -Wall
    -Werror=format-security -Wp,-D_FORTIFY_SOURCE=2
    -fstack-protector-strong --param=ssp-buffer-size=4
    -fdebug-prefix-map=/home/BWI/src/cygwin/wget/wget-1.21.1-1.x86_64/build=/usr/src/debug/wget-1.21.1-1
    -fdebug-prefix-map=/home/BWI/src/cygwin/wget/wget-1.21.1-1.x86_64/src/wget-1.21.1=/usr/src/debug/wget-1.21.1-1
Link:
    gcc -I/usr/include/uuid -DNDEBUG -ggdb -O2 -pipe -Wall
    -Werror=format-security -Wp,-D_FORTIFY_SOURCE=2
    -fstack-protector-strong --param=ssp-buffer-size=4
    -fdebug-prefix-map=/home/BWI/src/cygwin/wget/wget-1.21.1-1.x86_64/build=/usr/src/debug/wget-1.21.1-1
    -fdebug-prefix-map=/home/BWI/src/cygwin/wget/wget-1.21.1-1.x86_64/src/wget-1.21.1=/usr/src/debug/wget-1.21.1-1
    -lmetalink -lcares -lpcre2-8 -luuid -lidn2 -lnettle -lgnutls -lz
    -lpsl -lgpgme ftp-opie.o gnutls.o http-ntlm.o ../lib/libgnu.a
    -liconv -lintl -lunistring



Should I submit a bug report, or am I missing something?

Thank you
Florian



