
[Wp-mirror-list] Fwd: Missing images


From: wp mirror
Subject: [Wp-mirror-list] Fwd: Missing images
Date: Wed, 12 Dec 2012 07:28:22 -0500

---------- Forwarded message ----------
From: Jason Skomorowski <address@hidden>
Date: Tue, 11 Dec 2012 21:37:57 -0500
Subject: Re: [Wp-mirror-list] Missing images
To: wp mirror <address@hidden>

Thanks Kent, glad to hear of your progress! May I give the new version a
try? Is there a source control repository someplace where you have checked
in these changes?

Cordially,
Jason

On 12-11-28 01:36 PM, wp mirror wrote:
> Dear Jason,
>
> Thank you very much for letting me know about the underpopulated
> galleries.  This led to a fruitful line of investigation.
>
> 1) Image File Names
>
> WP-MIRROR 0.4 scraped image file names from xchunks by looking for
> links, such as
> [[File:foo.png|...]]
> [[Image:bar.png|...]]
> [[Media:baz.png|...]]
>
> It turns out that only 65% of image file names are found in links.
> Galleries contain about 10%:
> <gallery>
> File:foo.png|...
> Image:bar.png|...
> Media:baz.png|...
> </gallery>
>
> and
>
> {{gallery
> |File:foo.png|...
> |Image:bar.png|...
> |Media:baz.png|...
> }}
>
> A great many pages have an infobox template, usually in the upper
> right corner of the page.  This is where about 25% of image file names
> can be found.
> {{Infobox
> | image_flag = foo.png
> | image_coat = bar.png
> | image_map = baz.png
> }}
> Often these are formatted with arbitrary whitespace and include links:
> {{Infobox
> | image
> = foo.png
> |       map  = [[Image:bar.png]]
>    | coat
> =           baz.png
> }}
> A very few (fewer than one in a thousand) are found in other templates:
> {{multiple image
> | image1 = foo.png
> | image2 = bar.png
> }}
> {{wide image|File:foo.png|...}}
>
> It was an interesting task to parse all of these.
>
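The cases listed in section 1 can be sketched with two regular expressions. This is only an illustration in Python, not WP-MIRROR's own code: the names `image_names`, `PREFIXED`, `PARAM`, and `EXT`, and the list of file extensions, are assumptions made for the example.

```python
import re

# Assumed set of image extensions; not taken from WP-MIRROR.
EXT = r"(?:png|jpe?g|gif|svg|tiff?)"

# Names carrying a namespace prefix: [[File:foo.png|...]], lines inside
# <gallery>...</gallery>, {{gallery|File:foo.png|...}}, {{wide image|File:...}}.
PREFIXED = re.compile(
    r"\b(?:File|Image|Media)\s*:\s*([^|\]\[{}<>\n:]+?\." + EXT + r")\b",
    re.IGNORECASE,
)

# Template parameters such as "| image_flag = foo.png". Whitespace and
# newlines around the key and "=" are tolerated, and an optional "[["
# and namespace prefix before the name are skipped.
PARAM = re.compile(
    r"\|\s*[\w ]+?\s*=\s*(?:\[\[\s*)?(?:(?:File|Image|Media)\s*:\s*)?"
    r"([^|\]\[{}<>\n=:]+?\." + EXT + r")\b",
    re.IGNORECASE,
)


def image_names(wikitext):
    """Return the distinct image file names found in wikitext."""
    names = []
    for pattern in (PREFIXED, PARAM):
        for match in pattern.finditer(wikitext):
            name = match.group(1).strip()
            if name not in names:
                names.append(name)
    return names


infobox = """{{Infobox
| image
= foo.png
|       map  = [[Image:bar.png]]
   | coat
=           baz.png
}}"""
print(sorted(image_names(infobox)))
```

Two patterns are used because prefixed names and bare template parameters have nothing in common except the file extension; a real scraper would likely need more cases than these.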
> 2) Testing
>
> <http://simple.wikipedia.org/wiki/Berlin> has galleries and an infobox 
> template.
> <http://simple.wikipedia.org/wiki/London> has a wide image template.
> <http://simple.wikipedia.org/wiki/Neptune> has multiple image templates.
>
> 3) Storage Requirement
>
> WP-MIRROR 0.4:  image files = 43k, image storage = 32G
> WP-MIRROR 0.5:  image files = 65k, image storage = 48G
>
> WP-MIRROR 0.5 will scrape about 50% more images and require 50% more
> disk space than previous versions.  Going forward, we should advise
> people that the Simple English Wikipedia requires 60G of storage.
>
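The figures in section 3 can be checked with a few lines of arithmetic. The variable names are made up for this sketch, and reading the 60G advice as headroom over the measured 48G is my interpretation, not something stated in the message.

```python
# Figures quoted for WP-MIRROR 0.4 and 0.5 (files in thousands, storage in G).
files_04, files_05 = 43, 65
storage_04, storage_05 = 32, 48
advised = 60

# Relative growth from 0.4 to 0.5, in percent.
file_growth = (files_05 / files_04 - 1) * 100
storage_growth = (storage_05 / storage_04 - 1) * 100

# Margin of the advised 60G over the measured 48G, in percent.
headroom = (advised - storage_05) / storage_05 * 100

print(round(file_growth), round(storage_growth), round(headroom))
```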
> 4) Special characters
>
> Your comment also led me down another path.  Many image file names
> contain characters that have special meaning to '/bin/sh' (e.g.
> ampersand, apostrophe, back-quote, dollar sign, and many others).  In
> many cases, file names with such characters can be handled provided
> that the special characters are escaped with a backslash.
>
> Correctly handled are: dash, dollar, parentheses.
> Dropped are: ampersand, angle brackets, apostrophe, asterisk,
> backquote, braces, colon, percent, question mark, slash, square
> brackets.
> Untested are: at, quote, semicolon.
>
> <http://simple.wikipedia.org/wiki/United_States_dollar> has a gallery
> of images, some of which have file names containing the dollar sign.
>
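For comparison with the backslash escaping described in section 4, here is a minimal Python sketch that quotes a whole file name with the standard library's `shlex.quote` instead. This swaps in a different technique from the one the message describes, and the helper names `shell_safe` and `roundtrip` and the sample file names are hypothetical.

```python
import shlex
import subprocess


def shell_safe(name):
    """Quote a file name so that /bin/sh treats it as one literal word."""
    return shlex.quote(name)


def roundtrip(name):
    """Pass the quoted name through the shell and return what comes back."""
    result = subprocess.run(
        "printf %s " + shell_safe(name),
        shell=True, capture_output=True, text=True, check=True,
    )
    return result.stdout


# Hypothetical file names containing characters special to the shell.
for name in ["US $1 bill.png", "Rock & Roll.jpg", "O'Hare (airport).png"]:
    print(shell_safe(name), roundtrip(name) == name)
```

Where the tool allows it, passing the name as a separate element of an argument list (`subprocess.run(["wget", url])` style, with no `shell=True`) avoids the quoting problem altogether.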
> Sincerely Yours,
> Kent
>


