[Wp-mirror-list] Missing images


From: wp mirror
Subject: [Wp-mirror-list] Missing images
Date: Wed, 28 Nov 2012 13:36:26 -0500

Dear Jason,

Thank you very much for letting me know about the underpopulated
galleries.  This led to a fruitful line of investigation.

1) Image File Names

WP-MIRROR 0.4 scraped image file names from xchunks by looking for
links, such as
[[File:foo.png|...]]
[[Image:bar.png|...]]
[[Media:baz.png|...]]

It turns out that only 65% of image file names are found in links.
Galleries contain about 10%:
<gallery>
File:foo.png|...
Image:bar.png|...
Media:baz.png|...
</gallery>

and

{{gallery
|File:foo.png|...
|Image:bar.png|...
|Media:baz.png|...
}}

A great many pages have an infobox template, usually in the upper
right corner of the page.  This is where about 25% of image file names
can be found.
{{Infobox
| image_flag = foo.png
| image_coat = bar.png
| image_map = baz.png
}}
Often these are formatted with arbitrary white space, and include links:
{{Infobox
| image
= foo.png
|       map  = [[Image:bar.png]]
   | coat
=           baz.png
}}
A very few (less than one in a thousand) are found in other templates:
{{multiple image
| image1 = foo.png
| image2 = bar.png
}}
{{wide image|File:foo.png|...}}

It was an interesting task to parse all of these.
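
To illustrate the kind of parsing involved, here is a minimal sketch in
Python.  It is not the WP-MIRROR code.  The function name, the two
regular expressions, and the short list of file extensions are all
assumptions made for this example.  One pattern catches names with a
File:, Image:, or Media: prefix (links, galleries, wide image), and the
other catches bare names given as template parameters (infobox,
multiple image), allowing white space and line breaks around the equals
sign.

```python
import re

# Assumption for this sketch: a short list of common image extensions.
EXT = r"(?:png|jpe?g|gif|svg|tiff?)"

# Names with a namespace prefix, e.g. [[File:foo.png|...]] or a gallery
# line such as "Image:bar.png|...".
PREFIXED = re.compile(
    r"\b(?:File|Image|Media)\s*:\s*([^|\]\[{}<>\n]+?\." + EXT + r")\b",
    re.IGNORECASE,
)

# Bare names used as template parameters, e.g. "| image_flag = foo.png",
# with arbitrary white space (including newlines) around the "=".
PARAM = re.compile(
    r"\|\s*\w+\s*=\s*([^|\]\[{}<>\n=:]+?\." + EXT + r")\b",
    re.IGNORECASE,
)


def scrape_image_names(wikitext):
    """Return the set of image file names found by either pattern."""
    names = set()
    for pattern in (PREFIXED, PARAM):
        for match in pattern.finditer(wikitext):
            names.add(match.group(1).strip())
    return names


infobox = """{{Infobox
| image
= foo.png
|       map  = [[Image:bar.png]]
   | coat
=           baz.png
}}"""
print(sorted(scrape_image_names(infobox)))
```

On the irregular infobox shown above this finds all three names.  A
real scraper would need a fuller extension list and would have to cope
with nested templates and HTML comments, which this sketch ignores.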

2) Testing

<http://simple.wikipedia.org/wiki/Berlin> has galleries and an infobox template.
<http://simple.wikipedia.org/wiki/London> has a wide image template.
<http://simple.wikipedia.org/wiki/Neptune> has multiple image templates.

3) Storage Requirement

WP-MIRROR 0.4:  image files = 43k, image storage = 32G
WP-MIRROR 0.5:  image files = 65k, image storage = 48G

WP-MIRROR 0.5 will scrape 50% more images and require 50% more disk
space than WP-MIRROR 0.4.  Going forward, we should advise people
that the Simple Wikipedia requires 60G of storage.
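
The arithmetic behind these figures can be checked with a few lines of
Python.  This is only a sketch, and the helper name percent_increase is
made up for this example.  The inputs are the numbers from the table
above.

```python
def percent_increase(old, new):
    """Relative growth from old to new, as a percentage."""
    return 100.0 * (new - old) / old


files_04, files_05 = 43_000, 65_000   # image files, 0.4 vs 0.5
gigs_04, gigs_05 = 32, 48             # image storage in G, 0.4 vs 0.5

print(f"files:   +{percent_increase(files_04, files_05):.0f}%")
print(f"storage: +{percent_increase(gigs_04, gigs_05):.0f}%")
print(f"margin in a 60G recommendation: {60 - gigs_05}G beyond images")
```

Both growth figures come out at about 50%, and a 60G recommendation
leaves 12G beyond the 48G of image storage.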

4) Special characters

Your comment also led me down another path.  Many image file names
contain characters that have special meaning to '/bin/sh' (e.g.
ampersand, apostrophe, backquote, dollar sign, and many others).  In
many cases, file names with such characters can be handled provided
that the special characters are escaped with a backslash.

Correctly handled are: dash, dollar, parentheses.
Dropped are: ampersand, angle brackets, apostrophe, asterisk,
backquote, braces, colon, percent, question mark, slash, square
brackets.
Untested are: at, quote, semicolon.

<http://simple.wikipedia.org/wiki/United_States_dollar> has a gallery
of images, some of which have file names containing the dollar sign.
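
As an illustration of backslash escaping, here is a minimal sketch in
Python.  It is not how WP-MIRROR handles file names.  The function
backslash_escape and the sample file name are made up for this example.
It puts a backslash before every character that is not a letter, digit,
underscore, dot, or dash.  For comparison it also shows shlex.quote
from the Python standard library, which wraps the name in single quotes
instead.

```python
import re
import shlex


def backslash_escape(name):
    """Put a backslash before every character outside a small safe set."""
    # Safe set (an assumption for this sketch): word characters, dot, dash.
    return re.sub(r"([^\w.\-])", r"\\\1", name)


name = "Tom & Jerry's $5 `bill`.png"   # hypothetical file name

escaped = backslash_escape(name)
quoted = shlex.quote(name)
print(escaped)
print(quoted)

# Both forms should parse back to the original name under POSIX rules.
assert shlex.split(escaped) == [name]
assert shlex.split(quoted) == [name]
```

The backslash form would not be safe for a name containing a newline,
since the shell treats backslash-newline as a line continuation.
Single quoting does not have that problem.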

Sincerely Yours,
Kent


