RE: Q on call-process and grep
From: Drew Adams
Subject: RE: Q on call-process and grep
Date: Thu, 22 Dec 2005 13:11:45 -0800
If it were me, I'd make a copy of the file, and then chop it into
smaller pieces where I can illustrate the problem in a manageable
length (say, 10 or 20 lines, but the fewer the better). The sed command
sed -n 17,35p bigfile >smallfile
There are over 30,000 lines.
will print lines 17 to 35, inclusive, so you can do your testing. But
you say that some of the lines are quite long. So try this:
awk '{print length($0)}' smallfile
to see how long is too long. If the lines are under 4000 chars, I'd
feel safe in guessing that line length isn't a problem. If you have
lines 20,000 chars or more, then I'd start thinking about the input.
I was hoping that I was missing something simple. You seem to be confirming
that I didn't miss anything obvious (to you) ;-).
The longest line is over 12,000 characters.
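For anyone repeating the check, a single awk pass can report the longest line and where it is, instead of printing every length. This is only a sketch: the sample file and the variable names are made up for illustration and stand in for the real data.

```shell
# Hypothetical sample standing in for the real file.
sample=$(mktemp)
printf 'short\na much longer line here\nmid\n' > "$sample"

# Track the longest line seen so far and the line number where it occurs.
longest=$(awk '{ n = length($0); if (n > max) { max = n; at = NR } }
               END { print max + 0, at + 0 }' "$sample")
echo "$longest"   # length first, then line number
rm -f "$sample"
```

On the real file, the second number tells you which line to pull out with sed for a closer look.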
Does each line in the problem set end in a CR/LF? I've had datafiles
that gave me bad data because somehow some lines ended with CR/LF,
others with CR/CR/LF, and others with CR only. How I got the problem
isn't relevant. But to normalize the input, try
tr -d '\r' <smallfile | sed -n p >clean_smallfile
which should remove any extraneous CRs which might be causing
corruption and restore the line endings to your Cygwin default (Unix or
DOS, whichever you picked).
Did that on the complete original file. `ediff' shows no difference from the
original.
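A rough way to double-check the line endings outside the editor is to count how many lines still end in a CR before and after the tr step. The helper name count_endings and the sample data below are assumptions for illustration, not part of the original suggestion.

```shell
# Hypothetical sample with mixed endings: two CR/LF lines and one LF-only line.
sample=$(mktemp)
clean=$(mktemp)
printf 'one\r\ntwo\nthree\r\n' > "$sample"

# Print two numbers: lines ending in CR, then lines without a trailing CR.
count_endings() {
    awk '/\r$/ { cr++ } END { print cr + 0, NR - cr }' "$1"
}

before=$(count_endings "$sample")
tr -d '\r' < "$sample" > "$clean"
after=$(count_endings "$clean")

echo "before: $before"
echo "after:  $after"
rm -f "$sample" "$clean"
```

If the first number is already zero on the original file, stray CRs are not the cause.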
I tried using a small file - just a few lines of the original - no change.
Terms that can't be found still aren't; those that can be found still are.
Use tr to delete all the characters that are permissible or
expected, and whatever is left must be an unexpected character. Examine
the output with cat -A or od or your tool of choice. E.g.,
tr -d '\n\r\t\40-\176' <infile >outfile
Did that. outfile is empty, so I guess every byte in the file is printable
ASCII, a tab, a newline, or a carriage return.
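To make that check a little more explicit, the leftover bytes can be counted and dumped in one go. This is a sketch on made-up sample data: a backspace is planted in the second line, and since it falls outside the deleted set it should be the only byte that survives.

```shell
# Hypothetical sample: the second line contains an embedded backspace (\b).
sample=$(mktemp)
printf 'plain text\nback\bspace\n' > "$sample"

# Delete everything expected, then count what is left.
leftover=$(tr -d '\n\r\t\40-\176' < "$sample" | wc -c | tr -d ' ')
echo "unexpected bytes: $leftover"

# Show the surviving bytes in a readable form.
tr -d '\n\r\t\40-\176' < "$sample" | od -An -c
rm -f "$sample"
```

A count of zero on the real file means there are no control characters other than tab, newline, and carriage return, and no bytes above 0176.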
If it were me, I might wonder about embedded backspaces or carriage
returns in the text. Just a thought. Good luck on your hunting!
My guess is that the line lengths and number of lines don't matter here,
because it works fine for other words, including 1) words in the longest
line and 2) words in the last line of the file. It's a mystery to me why it
doesn't work for certain words.
Thanks for your suggestions, though - they were good things to try, even if
I haven't yet solved the problem.