bug-groff

From: G. Branden Robinson
Subject: [bug #58206] [PATCH] fix PDFPIC issue with determining size of pdfs containing images
Date: Fri, 21 Jan 2022 01:31:37 -0500 (EST)
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:78.0) Gecko/20100101 Firefox/78.0

Update of bug #58206 (project groff):

                  Status:               Need Info => In Progress            

    _______________________________________________________

Follow-up Comment #14:

I'm mostly unblocked.

The problem with the original problematic file (the "angular 1280x800" thing)
appears to be that it had a Title property that was encoded in UTF-16BE.


$ xxd angular-1280-800.pdf | sed -n '/459f0/,/45a30/p'
000459f0: 3c3c 0a2f 5469 746c 6520 3c30 3036 3130  <<./Title <00610
00045a00: 3036 4530 3036 3730 3037 3530 3036 4330  06E00670075006C0
00045a10: 3036 3130 3037 3230 3032 4430 3033 3130  0610072002D00310
00045a20: 3033 3230 3033 3830 3033 3030 3032 4430  03200380030002D0
00045a30: 3033 3830 3033 3030 3033 3030 3030 303e  038003000300000>
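
As a sanity check, the hex string in that dump can be decoded by hand.  This
is a minimal Python sketch; the variable names are mine and nothing here
comes from groff or pdfinfo.  The hex digits are copied from the right-hand
column of the xxd output above.

```python
# Hex digits between "<" and ">" in the /Title entry of the dump.
title_hex = (
    "0061006E00670075006C00610072002D"
    "0031003200380030002D0038003000300000"
)

raw = bytes.fromhex(title_hex)

# Two bytes per character, high byte first: UTF-16BE.
title = raw.decode("utf-16-be")

# repr() makes the trailing NUL code unit visible.
print(len(raw), repr(title))
```

Note that the string carries no leading FEFF byte order mark and ends with a
NUL code unit.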


You don't see that a lot these days, with the success of the global campaign
to exterminate big-endian desktop (and mobile) computing.

So this is what pdfinfo ends up doing with that.


$ pdfinfo angular-1280-800.pdf | xxd
00000000: 5469 746c 653a 2020 2020 2020 2020 2020  Title:          
00000010: 0061 006e 0067 0075 006c 0061 0072 002d  .a.n.g.u.l.a.r.-
00000020: 0031 0032 0038 0030 002d 0038 0030 0030  .1.2.8.0.-.8.0.0
00000030: 0000 0a50 726f 6475 6365 723a 2020 2020  ...Producer:    
00000040: 2020 2068 7474 7073 3a2f 2f69 6d61 6765     https://image
00000050: 6d61 6769 636b 2e6f 7267 0a43 7265 6174  magick.org.Creat
00000060: 696f 6e44 6174 653a 2020 204d 6f6e 2041  ionDate:   Mon A
00000070: 7072 2032 3020 3034 3a33 333a 3434 2032  pr 20 04:33:44 2
00000080: 3032 3020 4145 5354 0a4d 6f64 4461 7465  020 AEST.ModDate
00000090: 3a20 2020 2020 2020 204d 6f6e 2041 7072  :        Mon Apr
000000a0: 2032 3020 3034 3a33 333a 3434 2032 3032   20 04:33:44 202
000000b0: 3020 4145 5354 0a54 6167 6765 643a 2020  0 AEST.Tagged:  
000000c0: 2020 2020 2020 206e 6f0a 5573 6572 5072         no.UserPr
000000d0: 6f70 6572 7469 6573 3a20 6e6f 0a53 7573  operties: no.Sus
000000e0: 7065 6374 733a 2020 2020 2020 206e 6f0a  pects:       no.
000000f0: 466f 726d 3a20 2020 2020 2020 2020 2020  Form:           
00000100: 6e6f 6e65 0a4a 6176 6153 6372 6970 743a  none.JavaScript:
00000110: 2020 2020 206e 6f0a 5061 6765 733a 2020       no.Pages:  
00000120: 2020 2020 2020 2020 310a 456e 6372 7970          1.Encryp
00000130: 7465 643a 2020 2020 2020 6e6f 0a50 6167  ted:      no.Pag
00000140: 6520 7369 7a65 3a20 2020 2020 2031 3238  e size:      128
00000150: 3020 7820 3830 3020 7074 730a 5061 6765  0 x 800 pts.Page
00000160: 2072 6f74 3a20 2020 2020 2020 300a 4669   rot:       0.Fi
00000170: 6c65 2073 697a 653a 2020 2020 2020 3238  le size:      28
00000180: 3539 3337 2062 7974 6573 0a4f 7074 696d  5937 bytes.Optim
00000190: 697a 6564 3a20 2020 2020 206e 6f0a 5044  ized:      no.PD
000001a0: 4620 7665 7273 696f 6e3a 2020 2020 312e  F version:    1.
000001b0: 330a                                     3.


In other words, it simply blasts the encoded bytes to its own output in utter
indifference to the character encoding used by the output device.  For an
information-extraction tool whose entire purpose is human-readable output,
that seems a dubious decision to me.
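
To illustrate what a consumer of that output is up against, here is a minimal
Python sketch that treats the pdfinfo output as raw bytes and still recovers
the page size.  The helper name page_size_from_pdfinfo and the abbreviated
sample are assumptions for illustration only; this is not how PDFPIC is
implemented.

```python
import re

def page_size_from_pdfinfo(output: bytes):
    """Return (width, height) in points, or None if no size line is found."""
    for line in output.split(b"\n"):
        m = re.match(rb"Page size:\s+([0-9.]+) x ([0-9.]+) pts", line)
        if m:
            return float(m.group(1)), float(m.group(2))
    return None

# Abbreviated stand-in for the dump above: the Title value is raw
# UTF-16BE, NUL bytes included, followed by ordinary ASCII lines.
sample = (
    b"Title:          " + "angular-1280-800\x00".encode("utf-16-be") + b"\n"
    b"Producer:       https://imagemagick.org\n"
    b"Pages:          1\n"
    b"Page size:      1280 x 800 pts\n"
    b"Page rot:       0\n"
)

print(b"\x00" in sample)               # NUL bytes are present in the output
print(page_size_from_pdfinfo(sample))
```

Splitting on newline bytes works for this particular title, but a UTF-16
string can itself contain a 0x0A byte, so even a byte-oriented parser is only
a partial defense.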

But we're stuck with it for the time being (unless a PDFPIC user wants to
migrate to Deri's lower-level alternative in comment #7, which leverages the
output driver).

I'll see if I can force a UTF-16 Title property onto gnu.eps so that I can
craft a proper regression test.
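
For the test fixture, the Title entry itself is easy to generate.  The
following Python sketch builds one in the same shape as the problematic file:
UTF-16BE, uppercase hex, no byte order mark, trailing NUL.  The helper name
utf16be_title_entry is hypothetical, and how the entry gets into a PDF
derived from gnu.eps is deliberately left open.

```python
def utf16be_title_entry(title: str, terminate: bool = True) -> str:
    """Return a PDF info-dictionary entry such as '/Title <0061...>'.

    Hypothetical helper.  No byte order mark is emitted, matching the
    dump of the problematic file.
    """
    text = title + ("\x00" if terminate else "")
    return "/Title <" + text.encode("utf-16-be").hex().upper() + ">"

entry = utf16be_title_entry("angular-1280-800")
print(entry)
```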

    _______________________________________________________

Reply to this item at:

  <https://savannah.gnu.org/bugs/?58206>

_______________________________________________
  Message sent via Savannah
  https://savannah.gnu.org/



