Re: Improving strread / textread / textscan
From: Ben Abbott
Subject: Re: Improving strread / textread / textscan
Date: Tue, 25 Oct 2011 17:56:44 -0400
On Oct 25, 2011, at 4:43 PM, Philip Nienhuis wrote:
> Ben Abbott wrote:
>
>> On Oct 24, 2011, at 5:47 PM, Philip Nienhuis wrote:
<snip>
>>> Not all of it.
>>> An EOL can also be a field delimiter. Obvious, because an EOL naturally
>>> cuts off fields if there's no other delimiter first.
>>> The rest of i. looks correct to me.
>>
>> Maybe we're defining "delimiter" differently? ... or maybe I'm being overly
>> pedantic?
>
> I think you're simply at a more abstract level than me, while I (the guy who
> patched this part for Octave) tend to think more at a practical level (how do
> I manage to code it).
>
>> I'm using the term to indicate a character that separates lines. Which an
>> EOL does. Or a character that separates fields. Which EOL does not do.
>
> ....unless the EOL chars are part of whitespace. Now ML's default whitespace
> for strread = ' \b\r\n\t'.
> AFAIU ML only allows '\n', '\r\n', or '\r' as EOL (default = determined from
> file), all of which are in strread's default whitespace, and as whitespace is
> the default delimiter, EOLs can implicitly delimit fields.
> Perhaps this is where my confusion stems from. See a few lines below...:
Ok. I hadn't noticed that strread and textscan used different defaults for
whitespace (textscan uses " \b\t").
>> Thus, EOLs are delimiters for lines but not for fields within a line.
>>
>> The MW docs do a reasonable job of describing this. See "Field and Row
>> Delimiters" at the link below.
>>
>> http://www.mathworks.com/help/techdoc/ref/textscan.html
>
> ... we should be careful to not mix up strread and textscan.
> I suppose you think more "the textscan way", while I (knowing that currently
> strread does the actual work for textscan) tend to perceive stuff more
> against strread.m background.
Yeah. That explains a lot! :-)
<snip>
>>
>> ... imply to me that, when reading character data with "delimiter"
>> specified, white-space is not used to delimit, and the characters read are
>> trimmed of leading and trailing white-space.
>
> That's my impression as well.
>
> From textscan docs:
> <QUOTE>
> textscan adds a space character, char(32), to any specified Whitespace unless
> Whitespace is empty ('') and the format includes any string conversion
> specifier.
> </QUOTE>
> I suppose strread does the same. Perhaps this is where we need to search for
> analysis of ML behavior.
Your inference is correct.
a = strread ('1 2 3', '%n', 'whitespace', sprintf('\t'))
a =
1
2
3
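A minimal Python sketch of the rule in the textscan docs quoted above: a space is added to any specified whitespace unless whitespace is empty and the format contains a string conversion. This is only a model, not MATLAB's or Octave's code. The names `effective_whitespace` and `split_numeric`, and the regular expression used to spot string specifiers (%s, %q, %c, %[...]), are assumptions made for illustration.

```python
import re

# Assumed pattern for "any string conversion specifier": %s, %q, %c, %[...],
# optionally with a skip flag (*) and a field width.
_STRING_SPEC = re.compile(r"%\*?\d*(?:[sqc]|\[)")

def effective_whitespace(whitespace, fmt):
    """Return the whitespace set that would actually be used for parsing."""
    has_string_spec = _STRING_SPEC.search(fmt) is not None
    if whitespace == "" and has_string_spec:
        return ""                  # the one case where no space is added
    if " " in whitespace:
        return whitespace          # space already present
    return whitespace + " "        # char(32) is appended

def split_numeric(text, whitespace, fmt="%n"):
    """Split text into numbers on runs of the effective whitespace."""
    ws = effective_whitespace(whitespace, fmt)
    if not ws:
        return [float(text)]
    pattern = "[" + re.escape(ws) + "]+"
    return [float(tok) for tok in re.split(pattern, text) if tok]

# Whitespace is set to a tab only, yet the spaces still separate the numbers.
print(split_numeric("1 2 3", "\t"))   # [1.0, 2.0, 3.0]
```

Under this model the strread call above returns three values even though only a tab was passed as whitespace, which matches the output shown.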
>>> Strict compliance with rule g. might render patching of strread.m much more
>>> complicated, as for each individual format specifier we'd have to check the
>>> whitespace/delimiters around the field in question, depending on the format
>>> specifier's nature.
>>> This is more easily done in a compiled version that linearly ploughs
>>> through the text string, than in current strread.m that works by parsing
>>> complete columns one by one.
>>> I can try to implement rule g. in a quick-and-dirty fashion, perhaps this
>>> will solve the actual bug that provoked my renewed interest.
>>>
>>> How much further should we go in fixing current strread (the work horse for
>>> textscan and textread), given the end-of-life for strread in ML plus jwe's
>>> upcoming compiled textscan version? (if he -or someone else- ever gets time
>>> to finish it, of course)
>>> I'm not in favor of blindly imitating as much as we can of the more
>>> obscure, or undocumented, or inconsistent, or corner case behavior of ML.
>>> I'd prefer clarity and consistency over strict ML compatibility.
>>> Your suggestion of documenting the Octave behavior that ML didn't document
>>> for its own functions is to be applauded.
>>
>> For the moment, I'm mostly concerned about documenting how textscan should
>> work. If you've been able to improve Octave's compatibility, then I
>> recommend you put together a changeset. John or someone else may make it
>> obsolete at some point, but that is part of the nature of code development
>> ... after all you're about to do the same to one of my contributions ;-)
>
> Happened to me too, several times. Yes that's our fate...
> But you are quick in turning ideas into changesets. I'm more reluctant and
> rather wait until I'm fairly sure.
>
> I'll try to prepare a changeset for strread.m in the coming days (I have only
> little time each day due to medical issues).
OK. No rush. I'll finish writing a test script for textscan. I'm nearly done,
but I still need to write some tests that read from files.
>> In any event, my latest attempt to document how textscan parses fields is
>> below.
>>
>
>> 01) Lines of input are delimited by EOL chars. The EOL character may be
>> specified by the parameter "endofline". The default is determined from
>> the file ("\n", "\r", or "\r\n").
>
> ... 01) only applies if textscan reads from file. Correct?
I think it also applies to strread.
a = strread (sprintf ('1\n2\n3'), '%n')
a =
1
2
3
[a, b] = strread (sprintf ('1\n2\n3'), '%n %n')
a =
1
2
3
b =
0
0
[a, b] = strread (sprintf ('1 1\n2\n3'), '%n %n')
a =
1
2
3
b =
1
0
Maybe I'm missing something, but it looks to me as if Matlab's textscan and
strread treat EOLs and whitespace in the same way.
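The two readings of "EOL as delimiter" discussed above can be put side by side in a small Python sketch. It only models the two default whitespace sets quoted in this thread; `fields` and `rows` are hypothetical helpers, and nothing here is taken from the strread.m or textscan sources.

```python
import re

STRREAD_WS = " \b\r\n\t"    # strread default whitespace, as quoted in the thread
TEXTSCAN_WS = " \b\t"       # textscan default whitespace, as quoted in the thread

def fields(text, whitespace):
    """Split text on runs of the given whitespace characters."""
    pattern = "[" + re.escape(whitespace) + "]+"
    return [tok for tok in re.split(pattern, text) if tok]

def rows(text, whitespace):
    """Treat EOL as a row delimiter first, then split each row into fields."""
    return [fields(line, whitespace) for line in text.splitlines()]

sample = "1 1\n2\n3"

# EOL chars are part of whitespace, so they separate fields like any blank.
print(fields(sample, STRREAD_WS))   # ['1', '1', '2', '3']

# EOL chars are not whitespace; they only end a row.
print(rows(sample, TEXTSCAN_WS))    # [['1', '1'], ['2'], ['3']]
```

In the first reading the row structure is lost and has to be rebuilt from the format; in the second, short rows stay visible and can be filled with empty values.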
>> 02) When reading character fields, if no "delimiter" property is defined,
>> then the characters contained by the "whitespace" property are used to
>> delimit fields. When the "delimiter" property is defined, the defined
>> "whitespace" property is ignored for the purpose of delimiting strings.
>> Also, when the "delimiter" property is defined all leading and trailing
>> characters contained in the "whitespace" property are trimmed from the
>> strings read.
>> 03) Any attempt to read fields beyond an EOL are treated as being empty. For
>> numeric data empty values are replaced by the property "emptyvalue".
>> 04) Values for numeric fields are separated by characters contained by the
>> "whitespace", or "delimiter", properties.
>
> ... or their union (?) (which is what I think); but see below 09)
Yes. That would be a better description.
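Rules 02) and 04) (with the "union" amendment) can be contrasted in a short Python sketch. It is an illustration of the wording above, not of what MATLAB or Octave actually execute. All function names are hypothetical, and the numeric splitter assumes the whitespace set is non-empty.

```python
import re

def _char_class(chars):
    """Build a regex character class from a string of literal characters."""
    return "[" + re.escape(chars) + "]"

def split_char_fields(line, whitespace=" \b\t", delimiter=""):
    """Rule 02: character fields."""
    if delimiter:
        # Delimiter defined: whitespace no longer separates fields,
        # but it is trimmed from both ends of every field.
        parts = re.split(_char_class(delimiter), line)
        return [p.strip(whitespace) for p in parts]
    if not whitespace:
        return [line]
    # No delimiter: runs of whitespace separate the fields.
    return [p for p in re.split(_char_class(whitespace) + "+", line) if p]

def split_numeric_fields(line, whitespace=" \b\t", delimiter=""):
    """Rule 04: numeric fields are separated by whitespace or delimiters."""
    ws = _char_class(whitespace)          # assumed non-empty
    if delimiter:
        # One delimiter plus any surrounding whitespace counts as a single
        # separator; a bare run of whitespace is a separator as well.
        sep = ws + "*" + _char_class(delimiter) + ws + "*|" + ws + "+"
    else:
        sep = ws + "+"
    return re.split(sep, line.strip(whitespace))

print(split_char_fields("a, b c ,d", delimiter=","))     # ['a', 'b c', 'd']
print(split_numeric_fields("1 , 2 3", delimiter=","))    # ['1', '2', '3']
```

The same line therefore yields different field counts depending on whether it is read as text or as numbers, which is the asymmetry rule 10) spells out.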
>> 05) The white-space char set can be adapted by the user with the "whitespace"
>> property. It can even be set to empty.
>
> ... I'm not sure, but I think ML only allows certain characters to be part of
> whitespace. At least I read the strread docs this way. I don't know if this
> also holds for textscan.
For strread you are correct.
http://www.mathworks.com/help/techdoc/ref/strread.html
I don't think there is any such restriction for textscan.
>> 06) A repetition of white-space chars is folded into one char.
>> 07) Delimiters are also characters that separate fields. Multiple
>> delimiters are not folded into a single instance.
>> 09) For numeric fields, vectors of white-space, and one delimiter, are folded
>> into one _delimiter_ that separates the fields
> __VV__count goes wrong...
What are you referring to?
>> 09) A pair of delimiters separated by white-space (or nothing) implies an
>> empty value.
>> 10) If the delimiter property is specified, then white-space is *not* used to
>> delimit character fields. However, white-space is always used to delimit
>> numeric fields.
>> 11) For numeric data, the default "emptyvalue" is NaN. If the numeric
>> type doesn't support NaN, then zero is used (int32 for example). For
>> character fields, an empty value is just an empty string.
>> 12) Multiple consecutive delimiters can be folded into one delimiter by
>> setting the "MultipleDelimsAsOne" parameter to true.
>>
>> Once this part is settled, then I hope to write tests for all of this. Later
>> I'll add tests for all data types, patterns, field-multiplicity, and
>> skipping fields / literals.
>
> For which textscan version?
For Matlab's textscan. I had agreed to do that for jwe sometime ago.
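As a starting point for such tests, the numeric-field rules listed above (03, 06, 07, 09, 11 and 12) can be written as a small Python model. This is a sketch of the rules as worded in this thread, under the assumption that whitespace is non-empty. `parse_numeric_row` and its parameter names are hypothetical and mirror the "emptyvalue" and "MultipleDelimsAsOne" properties only loosely.

```python
import math
import re

def parse_numeric_row(line, n_fields, delimiter=",", whitespace=" \b\t",
                      empty_value=math.nan, multiple_delims_as_one=False):
    """Parse one line into n_fields numbers, filling gaps with empty_value."""
    ws = "[" + re.escape(whitespace) + "]"
    dl = "[" + re.escape(delimiter) + "]"
    # One delimiter with optional surrounding whitespace is one separator.
    one = ws + "*" + dl + ws + "*"
    if multiple_delims_as_one:
        # Rule 12: consecutive delimiters are folded into one.
        sep = "(?:" + one + ")+|" + ws + "+"
    else:
        # Rules 07 and 09: each delimiter counts, so two in a row
        # enclose an empty field.
        sep = one + "|" + ws + "+"
    tokens = re.split(sep, line.strip(whitespace))
    # Rule 11: empty fields take empty_value (NaN by default; pass 0
    # to mimic an integer type that cannot hold NaN).
    values = [float(t) if t else empty_value for t in tokens]
    # Rule 03: fields requested beyond the EOL are empty too.
    values += [empty_value] * (n_fields - len(values))
    return values[:n_fields]      # extra tokens are ignored in this sketch

print(parse_numeric_row("1,,3", 3))                               # [1.0, nan, 3.0]
print(parse_numeric_row("1,,3", 2, multiple_delims_as_one=True))  # [1.0, 3.0]
print(parse_numeric_row("2", 2, empty_value=0))                   # [2.0, 0]
```

Each print line corresponds to one rule and could be turned into a test case once the expected MATLAB output has been confirmed.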
Ben