[Wp-mirror-list] Fwd: Missing images
From: wp mirror
Subject: [Wp-mirror-list] Fwd: Missing images
Date: Wed, 12 Dec 2012 07:28:22 -0500
---------- Forwarded message ----------
From: Jason Skomorowski <address@hidden>
Date: Tue, 11 Dec 2012 21:37:57 -0500
Subject: Re: [Wp-mirror-list] Missing images
To: wp mirror <address@hidden>
Thanks Kent, glad to hear of your progress! May I give the new version a
try? Is there a source control repository someplace where you have checked
in these changes?
Cordially,
Jason
On 12-11-28 01:36 PM, wp mirror wrote:
> Dear Jason,
>
> Thank you very much for letting me know about the underpopulated
> galleries. This led to a fruitful line of investigation.
>
> 1) Image File Names
>
> WP-MIRROR 0.4 scraped image file names from xchunks by looking for
> links, such as
> [[File:foo.png|...]]
> [[Image:bar.png|...]]
> [[Media:baz.png|...]]
>
> It turns out that only 65% of image file names are found in links.
> Galleries contain about 10%:
> <gallery>
> File:foo.png|...
> Image:bar.png|...
> Media:baz.png|...
> </gallery>
>
> and
>
> {{gallery
> |File:foo.png|...
> |Image:bar.png|...
> |Media:baz.png|...
> }}
>
> A great many pages have an infobox template, usually in the upper
> right corner of the page. This is where about 25% of image file names
> can be found.
> {{Infobox
> | image_flag = foo.png
> | image_coat = bar.png
> | image_map = baz.png
> }}
> Often these are formatted with arbitrary white space, and include links:
> {{Infobox
> | image
> = foo.png
> | map = [[Image:bar.png]]
> | coat
> = baz.png
> }}
> A very few (less than one in a thousand) are found in other templates:
> {{multiple image
> | image1 = foo.png
> | image2 = bar.png
> }}
> {{wide image|File:foo.png|...}}
>
> It was an interesting task to parse all of these.
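
The four sources of image file names listed above (links, galleries, infobox parameters, and other templates) can be sketched in Python as two regular expressions. This is only an illustration and is not taken from WP-MIRROR: the names `image_names`, `PREFIXED`, `PARAM`, and `EXT` are hypothetical, and the extension list is an assumed, non-exhaustive one.

```python
import re

# Assumed, non-exhaustive list of extensions used to recognise bare names.
EXT = r"(?:png|jpe?g|gif|svg|tiff?|ogg)"

# Names carrying a namespace prefix: [[File:foo.png|...]], gallery lines,
# and positional template arguments such as {{wide image|File:foo.png|...}}.
PREFIXED = re.compile(
    r"\b(?:File|Image|Media)\s*:\s*([^|\]\[{}<>\n]+?)\s*(?=[|\]\[{}<>\n]|$)",
    re.IGNORECASE,
)

# Bare names given as template parameter values: | image_flag = foo.png
# \s* also matches newlines, so "| image\n = foo.png" is accepted.
PARAM = re.compile(
    r"\|\s*[\w ]+?\s*=\s*([^|\[\]{}<>=:\n]+?\." + EXT + r")\s*(?=[|}\n]|$)",
    re.IGNORECASE,
)


def image_names(wikitext):
    """Return image file names found in wikitext, without duplicates."""
    names = []
    for pattern in (PREFIXED, PARAM):
        for match in pattern.finditer(wikitext):
            name = match.group(1).strip()
            if name not in names:
                names.append(name)
    return names


sample = """
[[File:foo.png|thumb|caption]]
<gallery>
Image:bar.png|caption
</gallery>
{{Infobox
| image
  = baz.png
| map = [[Image:qux.png]]
}}
{{wide image|File:wide.jpg|400px}}
"""
print(image_names(sample))
```

A real scraper would also have to deal with HTML comments, nested templates, and parameter values that are not images at all, none of which this sketch attempts.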
>
> 2) Testing
>
> <http://simple.wikipedia.org/wiki/Berlin> has galleries and an infobox
> template.
> <http://simple.wikipedia.org/wiki/London> has a wide image template.
> <http://simple.wikipedia.org/wiki/Neptune> has multiple image templates.
>
> 3) Storage Requirement
>
> WP-MIRROR 0.4: image files = 43k, image storage = 32G
> WP-MIRROR 0.5: image files = 65k, image storage = 48G
>
> WP-MIRROR 0.5 will scrape 50% more images and require 50% more disk
> space than previous versions. Going forward, we should advise people
> that the Simple English Wikipedia requires 60G of storage.
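
As a quick check of the figures above, the growth from 0.4 to 0.5 and the headroom implied by the 60G recommendation can be worked out as follows. The variable names are illustrative only; the numbers are the ones quoted in the message.

```python
# Figures quoted in the message (file counts rounded to the nearest thousand).
files_04, files_05 = 43_000, 65_000
storage_04, storage_05 = 32, 48   # gigabytes
recommended = 60                  # gigabytes

file_growth = files_05 / files_04 - 1        # fraction of additional image files
storage_growth = storage_05 / storage_04 - 1 # fraction of additional disk space
headroom = recommended / storage_05 - 1      # spare space beyond the images

print(f"files: +{file_growth:.0%}, storage: +{storage_growth:.0%}, "
      f"headroom: +{headroom:.0%}")
```

Both growth figures come out at roughly one half, and 60G leaves about a quarter more than the 48G that the images themselves occupy.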
>
> 4) Special characters
>
> Your comment also led me down another path. Many image file names
> contain characters that have special meaning to '/bin/sh' (e.g.
> ampersand, apostrophe, back-quote, dollar sign, and many others). In
> many cases, file names with such characters can be handled provided
> that the special characters are escaped with a backslash.
>
> Correctly handled are: dash, dollar, parentheses.
> Dropped are: ampersand, angle brackets, apostrophe, asterisk,
> backquote, braces, colon, percent, question mark, slash, square
> brackets.
> Untested are: at, quote, semicolon.
>
> <http://simple.wikipedia.org/wiki/United_States_dollar> has a gallery
> of images, some of which have file names containing the dollar sign.
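
For comparison with the backslash-escaping approach described above, here is a minimal Python sketch of a different technique: quoting the whole file name with the standard library's `shlex.quote` before handing it to `/bin/sh`. It is not WP-MIRROR's code; the helper `roundtrip_through_sh` and the sample names are made up for illustration, and a POSIX shell is assumed.

```python
import shlex
import subprocess


def roundtrip_through_sh(name):
    """Pass name through /bin/sh as one argument and return what the shell saw."""
    # shlex.quote wraps the name in single quotes (escaping embedded single
    # quotes) so that the shell treats every character literally.
    command = "printf %s " + shlex.quote(name)
    result = subprocess.run(
        command, shell=True, capture_output=True, text=True, check=True
    )
    return result.stdout


# Hypothetical file names containing characters the message lists as troublesome.
names = [
    "US $1 bill (front).png",
    "Rock & Roll.png",
    "O'Brien.png",
    "back`quote.png",
    "what?*[x].png",
]
for name in names:
    assert roundtrip_through_sh(name) == name
```

Where the program being called does not need a shell at all, passing the arguments as a list to `subprocess.run` without `shell=True` avoids the quoting problem entirely.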
>
> Sincerely Yours,
> Kent
>