bug-wget

Re: wget skips some local files for link updates in recursive mode if th


From: Florian Rosenauer
Subject: Re: wget skips some local files for link updates in recursive mode if there are many files to be downloaded
Date: Sun, 28 Feb 2021 18:48:07 +0100
User-agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:78.0) Gecko/20100101 Thunderbird/78.7.1

Hello again,

while writing the sed script I finally stumbled upon the bug: the
original webpage references many more stylesheets than there are in the
local directory. It is really easy to reproduce. I will file a bug via
the official form at savannah.gnu.org.

The problem is that MediaWiki, or at least the implementation at
fandom.com, uses ONE stylesheet generator which takes the names of all
stylesheets as a parameter and delivers an ALL-IN-ONE stylesheet. The
resulting URL exceeds ~200 chars and then gets truncated by wget.
However, there are several stylesheet URLs which differ only after these
~200 chars, so wget link-replaces only the last of these stylesheets.
The more files you download, the more likely this problem is to occur.
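
The collision described above can be illustrated with a small sketch. This is not wget's actual code: the 200-character limit (ASSUMED_MAX_NAME) and the local_name helper are assumptions made up for illustration. The only point is that names which differ after the cut-off collapse into one file.

```python
from collections import defaultdict

# Assumption: rough limit observed in this report, not taken from wget's source.
ASSUMED_MAX_NAME = 200


def local_name(url_path: str, limit: int = ASSUMED_MAX_NAME) -> str:
    """Hypothetical stand-in for deriving a local file name: plain truncation."""
    return url_path[:limit]


def find_collisions(url_paths, limit: int = ASSUMED_MAX_NAME):
    """Group URL paths by truncated name; keep only names shared by several paths."""
    groups = defaultdict(list)
    for path in url_paths:
        groups[local_name(path, limit)].append(path)
    return {name: paths for name, paths in groups.items() if len(paths) > 1}


# Two URLs that are identical except for the very last character.
prefix = "wget-stylesheet.php?v=" + "1234567890" * 30
urls = [prefix + "xA", prefix + "xB"]
collisions = find_collisions(urls)
for name, paths in collisions.items():
    print(len(name), "chars kept,", len(paths), "URLs share this name")
```

With any limit shorter than the common prefix, both URLs end up under the same name, which would match the "three files downloaded, two on disk" observation.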

To reproduce:
1. create a file wget-test.html on your PHP-enabled webserver with this
content:
<html>
<head>
<title>wget bug demo</title>
<link rel="stylesheet"
href="wget-stylesheet.php?v=1234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890x1234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890x1234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890xA"/>
<link rel="stylesheet"
href="wget-stylesheet.php?v=1234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890x1234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890x1234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890xB"/>
</head>
<body>
        Hello amazing wget fans!
</body>
</html>

2. create a file wget-stylesheet.php with the following content:
<?php
/* irl this would generate CSS based on the parameters */
echo '/* v=' . htmlspecialchars($_GET["v"]) . ' */';
?>

p { }
body{
    color: red;
}

3. run this command:
wget --recursive --no-clobber --page-requisites --html-extension
--convert-links --restrict-file-names=windows --span-hosts
http://localhost/wget-test/wget-test.html

Expected result: three files downloaded, three files on disk, and both
<link rel="stylesheet" href="..."/> elements replaced.

Actual result: three files downloaded, but only two on disk, and only
the second <link rel="stylesheet" href="..."/> is replaced with the
local file.
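
The result can be checked without reading the converted HTML by hand. The sketch below assumes that a stylesheet link which wget did not map to a local file ends up as an absolute http(s) URL after --convert-links, as seen in this thread. The class and function names are hypothetical, and the sample hrefs are shortened placeholders.

```python
from html.parser import HTMLParser


class StylesheetLinks(HTMLParser):
    """Collect the href values of <link rel="stylesheet"> elements."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        rel = (attr.get("rel") or "").lower().split()
        if tag == "link" and "stylesheet" in rel and attr.get("href"):
            self.hrefs.append(attr["href"])


def is_remote(href: str) -> bool:
    return href.startswith(("http://", "https://"))


def split_links(html_text: str):
    """Return (local, remote) stylesheet hrefs found in html_text."""
    parser = StylesheetLinks()
    parser.feed(html_text)
    parser.close()
    local = [h for h in parser.hrefs if not is_remote(h)]
    remote = [h for h in parser.hrefs if is_remote(h)]
    return local, remote


# Placeholder input, not the real converted file.
sample = """<head>
<link rel="stylesheet" href="http://localhost/wget-test/wget-stylesheet.php?v=1xA"/>
<link rel="stylesheet" href="wget-stylesheet.php@v=1xB.html"/>
</head>"""
local, remote = split_links(sample)
print("local:", len(local), "still remote:", len(remote))
```

Any entry in the remote list is a stylesheet the offline copy cannot load.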


Florian

On 28.02.2021 11:51, Florian Rosenauer wrote:
Hello,

I was able to track the problem down a bit more. I excluded the biggest
part which are the image downloads and played around with downloading
different sets of pages:

The following command downloads pages A-b (about 140 files in total) and
works fine:
wget --recursive --no-clobber --page-requisites --html-extension
--convert-links --restrict-file-names=windows --span-hosts
--domains=fonts.gstatic.com,static.wikia.nocookie.net,vignette.wikia.nocookie.net,vignette3.wikia.nocookie.net,www.fastly-insights.com,www.googletagmanager.com,xwing-miniatures.fandom.com
--reject-regex
".*xwing-miniatures.fandom.com/f/.*|.*xwing-miniatures.fandom.com/wiki/[C-Zc-z].*|.*static.wikia.nocookie.net/xwing-miniatures/images/.*|.*Template.*|.*action=edit.*|.*action=history.*|.*oldid=.*"
https://xwing-miniatures.fandom.com

The following command downloads pages A-c (about 540 files in total) and
fails to update the link rel="stylesheet" href to the locally downloaded
file:
wget --recursive --no-clobber --page-requisites --html-extension
--convert-links --restrict-file-names=windows --span-hosts
--domains=fonts.gstatic.com,static.wikia.nocookie.net,vignette.wikia.nocookie.net,vignette3.wikia.nocookie.net,www.fastly-insights.com,www.googletagmanager.com,xwing-miniatures.fandom.com
--reject-regex
".*xwing-miniatures.fandom.com/f/.*|.*xwing-miniatures.fandom.com/wiki/[D-Zd-z].*|.*static.wikia.nocookie.net/xwing-miniatures/images/.*|.*Template.*|.*action=edit.*|.*action=history.*|.*oldid=.*"
https://xwing-miniatures.fandom.com
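
The effect of the two character classes in the commands above ([C-Zc-z] versus [D-Zd-z]) can be checked offline. The sketch below uses Python's re module as a stand-in for wget's matcher, which is an assumption. For these patterns, which begin and end with ".*", searching and matching the complete URL give the same answer. The page names are made up and chosen only for their first letter.

```python
import re

# Only the /wiki/ alternative of each --reject-regex is reproduced here.
REJECT_FIRST_RUN = re.compile(r".*xwing-miniatures.fandom.com/wiki/[C-Zc-z].*")
REJECT_SECOND_RUN = re.compile(r".*xwing-miniatures.fandom.com/wiki/[D-Zd-z].*")


def is_rejected(pattern, url: str) -> bool:
    """True if the URL would be skipped under the given reject pattern."""
    return pattern.search(url) is not None


# Hypothetical page names.
urls = [
    "https://xwing-miniatures.fandom.com/wiki/A-Wing",
    "https://xwing-miniatures.fandom.com/wiki/B-Wing",
    "https://xwing-miniatures.fandom.com/wiki/Cards",
    "https://xwing-miniatures.fandom.com/wiki/Dice",
]
for url in urls:
    print(url,
          "first run:", "skip" if is_rejected(REJECT_FIRST_RUN, url) else "fetch",
          "second run:", "skip" if is_rejected(REJECT_SECOND_RUN, url) else "fetch")
```

The only difference between the two runs is that pages starting with "C" or "c" are fetched in the second one.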

I thought maybe a "C" page triggers a bug in wget, but downloading only
F-z also fails the link update, so to me it looks like it is really
related to the number of pages being downloaded. The more pages are
downloaded, the more likely it is that wget doesn't update CSS
stylesheet links.
I think I will write a small hardcoded sed script to update the links in
my files to the local ones.
Maybe someone has an idea about it, or other ideas to narrow down the
original problem.
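
The workaround could also be sketched in Python instead of sed. Everything here is an assumption for illustration: the mapping has to be filled in by hand, and its keys and values must be the href strings exactly as they appear in the HTML files (including "&amp;"). The function names are hypothetical.

```python
from pathlib import Path


def rewrite_links(text: str, mapping: dict) -> str:
    """Replace each hardcoded remote href with its local counterpart."""
    for remote_href, local_href in mapping.items():
        text = text.replace(f'href="{remote_href}"', f'href="{local_href}"')
    return text


def rewrite_tree(root: Path, mapping: dict) -> int:
    """Rewrite every *.html file below root in place; return how many changed."""
    changed = 0
    for page in root.rglob("*.html"):
        original = page.read_text(encoding="utf-8", errors="surrogateescape")
        updated = rewrite_links(original, mapping)
        if updated != original:
            page.write_text(updated, encoding="utf-8", errors="surrogateescape")
            changed += 1
    return changed
```

A call would look like rewrite_tree(Path("xwing-miniatures.fandom.com"), mapping), with mapping holding one entry per stylesheet link that wget left pointing at the online page.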

Thanks
Florian




On 21.02.2021 17:49, Florian Rosenauer wrote:
Hello!

I have the following problem: the page
https://xwing-miniatures.fandom.com/wiki/X-Wing_Miniatures_Wiki
contains a link element referencing a stylesheet:
<link rel="stylesheet"
href="/load.php?lang=en&amp;modules=ext.categorySelect.runtimeStyles%7Cext.fandom.ArticleInterlang.css%7Cext.fandom.CreatePage.css%7Cext.fandom.DesignSystem.css%7Cext.fandom.Thumbnails.css%7Cext.fandom.UserPreferencesV2.runtime.css%7Cext.fandom.bannerNotifications.css%7Cext.fandom.coreRuntimeStyles%2CwikiaBarRuntimeStyles%7Cext.fandom.mainPageTag.css%7Cext.staffSig.css%7Cext.visualEditor.desktopArticleTarget.noscript%7Cmediawiki.legacy.commonPrint%2Cshared%7Cskin.oasis.css%7Cskin.oasis.fanFeed.css%2CdiscussionsRuntimeStyles%7Cskin.oasis.pageheader.Share.css&amp;only=styles&amp;skin=oasis"/>


During a single-page download the link is updated to point to the local
file. During a recursive download the file is downloaded, but the link
is only updated if the download is limited to a few pages.

To reproduce:
1a. run wget to load a single page using --page-requisites:
     wget --no-clobber --page-requisites --html-extension
--convert-links --restrict-file-names=windows --span-hosts
https://xwing-miniatures.fandom.com -d -o singlepage.log

1b. Result: a fully working local version in
xwing-miniatures.fandom.com\index.html, the link element is replaced
with the local file
     <link rel="stylesheet"
href="load.php@lang=en&amp;modules=ext.categorySelect.runtimeStyles%257Cext.fandom.ArticleInterlang.css%257Cext.fandom.CreatePage.css%257Cext.fandom.DesignSystem.css%257Cext.fandom.Thumbnails.css%257Cext.fandom.UserPreferencesV2.css"/>


2a. run wget recursively but limit it with --reject-regex to
very few pages:
     wget --recursive --no-clobber --page-requisites --html-extension
--convert-links --restrict-file-names=windows --span-hosts
--domains=fonts.gstatic.com,static.wikia.nocookie.net,vignette.wikia.nocookie.net,vignette3.wikia.nocookie.net,www.fastly-insights.com,www.googletagmanager.com,xwing-miniatures.fandom.com
--reject-regex ".*xwing-miniatures.fandom.com/.*/.*|.*action=edit.*"
https://xwing-miniatures.fandom.com

2b. Result: a fully working local version in
xwing-miniatures.fandom.com\index.html, the link element is replaced
with the local file:
     <link rel="stylesheet"
href="load.php@lang=en&amp;modules=ext.categorySelect.runtimeStyles%257Cext.fandom.ArticleInterlang.css%257Cext.fandom.CreatePage.css%257Cext.fandom.DesignSystem.css%257Cext.fandom.Thumbnails.css%257Cext.fandom.UserPreferencesV2.css"/>


3a. Attention: this downloads about 3300 pages / 8500 files!
     run wget recursively but with fewer limits in --reject-regex
(reject only the forum and the wiki edit/history pages)
     wget --recursive --no-clobber --page-requisites --html-extension
--convert-links --restrict-file-names=windows --span-hosts
--domains=fonts.gstatic.com,static.wikia.nocookie.net,vignette.wikia.nocookie.net,vignette3.wikia.nocookie.net,www.fastly-insights.com,www.googletagmanager.com,xwing-miniatures.fandom.com
--reject-regex
"https://xwing-miniatures.fandom.com/f/.*|.*action=edit.*|.*action=history.*|.*@oldid=.*"
https://xwing-miniatures.fandom.com

3b. Result: after everything has finished, the link is altered to
refer to the online page (!) although the file was downloaded locally
and is needed for a working local page:
     <link rel="stylesheet"
href="https://xwing-miniatures.fandom.com/load.php?lang=en&amp;modules=ext.categorySelect.runtimeStyles%7Cext.fandom.ArticleInterlang.css%7Cext.fandom.CreatePage.css%7Cext.fandom.DesignSystem.css%7Cext.fandom.Thumbnails.css%7Cext.fandom.UserPreferencesV2.runtime.css%7Cext.fandom.bannerNotifications.css%7Cext.fandom.coreRuntimeStyles%2CwikiaBarRuntimeStyles%7Cext.fandom.mainPageTag.css%7Cext.staffSig.css%7Cext.visualEditor.desktopArticleTarget.noscript%7Cmediawiki.legacy.commonPrint%2Cshared%7Cskin.oasis.css%7Cskin.oasis.fanFeed.css%2CdiscussionsRuntimeStyles%7Cskin.oasis.pageheader.Share.css&amp;only=styles&amp;skin=oasis"/>

If you open the page offline (without anything cached!), it is
totally unusable as the main CSS is missing.
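
To tell whether a converted link really points at a downloaded file, the href has to be turned back into a file name first. The sketch below assumes, based only on the output shown in 1b and 2b, that the href is the file name with HTML entities ("&amp;") and one level of percent-escaping ("%25" for a literal "%") applied on top. href_to_filename and missing_targets are hypothetical helpers, and the sample href is a shortened placeholder.

```python
import html
from pathlib import Path
from urllib.parse import unquote


def href_to_filename(href: str) -> str:
    """Undo HTML entity escaping, then one level of percent-escaping."""
    return unquote(html.unescape(href))


def missing_targets(page_dir: Path, hrefs):
    """Return the hrefs whose decoded file name does not exist in page_dir."""
    return [h for h in hrefs if not (page_dir / href_to_filename(h)).is_file()]


print(href_to_filename("load.php@lang=en&amp;modules=a%257Cb.css"))
# prints: load.php@lang=en&modules=a%7Cb.css
```

Remote hrefs such as the one in 3b should be filtered out before calling missing_targets, since they do not name a local file at all.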

Note 1: the --domains list was built by looking at the result of
command #1a to limit the downloads
Note 2: on Windows, make sure to put the download into a short directory
path, as the paths soon exceed 256 chars due to the long names, and
neither Firefox nor Chrome can open file paths > 256 chars on Windows xD
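
Overlong paths can be found before opening the mirror in a browser. A rough sketch, taking the 256-character figure from Note 2 as the limit (ASSUMED_LIMIT is a placeholder, and overlong_paths is a hypothetical helper):

```python
import os
from pathlib import Path

# Assumption: the figure quoted in Note 2; adjust as needed.
ASSUMED_LIMIT = 256


def overlong_paths(root: Path, limit: int = ASSUMED_LIMIT):
    """Yield (length, path) for every file below root whose absolute path exceeds limit."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.abspath(os.path.join(dirpath, name))
            if len(full) > limit:
                yield len(full), full


if __name__ == "__main__":
    # Longest paths first, starting from the current directory.
    for length, path in sorted(overlong_paths(Path(".")), reverse=True):
        print(length, path)
```

If this prints anything, moving the download to a shorter base directory is the easiest fix.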

Version Information:
$ wget -V
GNU Wget 1.21.1 built on cygwin.

+cares +digest +gpgme +https +ipv6 +iri +large-file +metalink +nls
+ntlm +opie +psl +ssl/gnutls

Wgetrc:
     /etc/wgetrc (system)
Locale:
     /usr/share/locale
Compile:
     gcc -DHAVE_CONFIG_H -DSYSTEM_WGETRC="/etc/wgetrc"
     -DLOCALEDIR="/usr/share/locale" -I.
     -I/home/BWI/src/cygwin/wget/wget-1.21.1-1.x86_64/src/wget-1.21.1/src
     -I../lib
     -I/home/BWI/src/cygwin/wget/wget-1.21.1-1.x86_64/src/wget-1.21.1/lib
     -I/usr/include/uuid -DNDEBUG -ggdb -O2 -pipe -Wall
     -Werror=format-security -Wp,-D_FORTIFY_SOURCE=2
     -fstack-protector-strong --param=ssp-buffer-size=4
     -fdebug-prefix-map=/home/BWI/src/cygwin/wget/wget-1.21.1-1.x86_64/build=/usr/src/debug/wget-1.21.1-1
     -fdebug-prefix-map=/home/BWI/src/cygwin/wget/wget-1.21.1-1.x86_64/src/wget-1.21.1=/usr/src/debug/wget-1.21.1-1

Link:
     gcc -I/usr/include/uuid -DNDEBUG -ggdb -O2 -pipe -Wall
     -Werror=format-security -Wp,-D_FORTIFY_SOURCE=2
     -fstack-protector-strong --param=ssp-buffer-size=4
     -fdebug-prefix-map=/home/BWI/src/cygwin/wget/wget-1.21.1-1.x86_64/build=/usr/src/debug/wget-1.21.1-1
     -fdebug-prefix-map=/home/BWI/src/cygwin/wget/wget-1.21.1-1.x86_64/src/wget-1.21.1=/usr/src/debug/wget-1.21.1-1
     -lmetalink -lcares -lpcre2-8 -luuid -lidn2 -lnettle -lgnutls -lz
     -lpsl -lgpgme ftp-opie.o gnutls.o http-ntlm.o ../lib/libgnu.a
     -liconv -lintl -lunistring



Should I submit a bug? Am I missing something?

Thank you
Florian



