From: Philip Nienhuis
Subject: [Octave-bug-tracker] [bug #47553] textscan Whitespace characters different from Matlab
Date: Mon, 28 Mar 2016 19:09:44 +0000
User-agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:41.0) Gecko/20100101 Firefox/41.0 SeaMonkey/2.38

Follow-up Comment #5, bug #47553 (project octave):

@Mike:
Which docs do you refer to?
If it's the Matlab docs, yes they are confusing. I think that adequately
reflects the "one tool for all you can think of" that textscan came to be.
As to those docs, I'm even unsure if it is "\b\t" or " \b\t" (note the space)
that should be the default whitespace.

Anyway this bug report uncovers a can of worms in a way.
My understanding (as far as I could infer textscan's working) is as follows:

- whitespace (" \b\t") serves as a type of delimiter of which several
consecutive ones are always collapsed into one;

- delimiter (default a space) also delimits text fields. Consecutive ones will
not be collapsed into one ... unless multipledelimsasone is specified;

- endofline characters (\r\n, \r, or \n) are default delimiters and are added
to the delimiter collection, unless endofline is set to empty, in which case
they are treated as whitespace.

So whitespace, delimiters and endofline overlap in some sense.
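
To make the collapsing difference concrete, here is a toy model. It does not
call textscan at all; strsplit's CollapseDelimiters option merely stands in
for the two behaviours, and the variable names are made up for illustration.

```octave
% Toy model only: strsplit stands in for textscan's field splitting.
txt = 'a,,b';

% Delimiter-like behaviour: consecutive separators are NOT collapsed,
% so an empty field shows up between the two commas.
fields_delim = strsplit (txt, ',', 'CollapseDelimiters', false);

% Whitespace-like behaviour (or multipledelimsasone): a run of
% separators counts as a single one, so no empty field appears.
fields_ws = strsplit (txt, ',', 'CollapseDelimiters', true);

disp (numel (fields_delim))   % 3 fields: 'a', '', 'b'
disp (numel (fields_ws))      % 2 fields: 'a', 'b'
```

Whether textscan really implements its splitting this way internally is an
assumption; the sketch only mirrors the rules as I understand them.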

Things become interesting if whitespace, delimiters and endofline all occur in
a file / text string and in textscan's input arguments.
Although I largely rewrote the textscan.m/strread.m combo, my memory gets a
bit confused, sorry. ATM I am morphing strread.m from a self-contained
function into one invoking binary textscan as backend and I'm bitten by
exactly these "uncertainties" - IOW in hindsight I sometimes doubt if I've
been doing the right things all along.

Back to whitespace, Matlab 2016a prerelease does this:

>> str = ['1' char(8) '2' char(9) '3' char(10) '4' char(13) '5']
str =
2       3
4
5
>> uint8 (str)
ans =
   49    8   50    9   51   13   52   10   53
>> C = textscan (str, '%f');
>> C{1}'
ans =
     1     2     3     4
>> C = textscan (str, '%f', 'whitespace', [char(8) char(9) char(10) char(13)]);
>> C{1}'
ans =
     1     2     3     4     5
>> C = textscan (str, '%s');  %% NOTE: now reading strings
>> C{1}'
ans = 
    '1'    '2'    '3'    '4…'
>> C = C{1}; uint8 (C{4})
ans =
   52   13   53
>> C = textscan (str, '%s', 'whitespace', [char(8) char(9) char(10) char(13)]);
>> C{1}'
ans = 
    '1'    '2'    '3'    '4'    '5'


... showing that whitespace somehow gets "promoted" to delimiters, regardless
of numeric or text fields in the input. Only \r (char(13)) is treated as
special. But maybe that could be due to the default "endofline" character
(which is inferred from the input file/string in a way we cannot trace).

It could be that ML's textscan doesn't explicitly turn whitespace into
delimiters, but simply stops reading whenever a next character doesn't fit in
the numerical scheme, or when reading text, assumes whitespace doesn't occur
in text fields.
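
As a rough analogy for the "simply stops reading" idea, sscanf behaves that
way for numeric conversions. This is only an analogy, not a claim about how
ML's textscan is implemented; the input string is made up.

```octave
% Analogy only: sscanf stops a %f conversion at the first character that
% cannot be part of a number, without any delimiter being declared.
txt = '3.5abc';
val = sscanf (txt, '%f');

disp (val)   % 3.5000; the trailing 'abc' is left unread
```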

The rationale may be very logical and straightforward, but the results can be
a bit confusing - for me at least :-)

As to the example in comment #3:

>> C = textscan(sprintf ('one\ntwo\nthree\nfour\n'), '%s %s')
C = 
    {4x1 cell}    {4x1 cell}
>> C{1}
ans = 
    'one'
    'two'
    'three'
    'four'
>> C{2}
ans = 
    ''
    ''
    ''
    ''


Rik's first command gives:

>> D = textscan(sprintf('one\ntwo\nthree\nfour\n'), '%s %s', 'Whitespace', ' \b\t')
D = 
    {4x1 cell}    {4x1 cell}
>> D{1}
ans = 
    'one'
    'two'
    'three'
    'four'
>> D{2}
ans = 
    ''
    ''
    ''
    ''


...IOW the same as in comment #3.


    _______________________________________________________

Reply to this item at:

  <http://savannah.gnu.org/bugs/?47553>

_______________________________________________
  Message sent via/by Savannah
  http://savannah.gnu.org/



