aspell-user

Re: Splitting text into words and non-words


From: Kevin Atkinson
Subject: Re: Splitting text into words and non-words
Date: Sat, 02 Jan 1999 16:35:24 +0000

Asger Alstrup Nielsen wrote:
> 
> > So I was wondering if anyone on this list has any experience in writing
> > this sort of context recognition code or could give me some pointers in
> > the right direction.
> 
> I describe an algorithm in principle, and as a multi-pass algorithm that
> requires O(n) space.  You probably want to do something to reduce that to an
> O(1) space algorithm, and that should be pretty easy to do by using a small
> buffer.
>
> First convert the file into a list of strings by splitting at whitespace and
> at changes from letters to non-letters.  Also, throw away any non-letter
> characters that appear in ordinary text:  , . : ; ( ) ! ? - "
> 
> ...
>
> This is a rough algorithm that probably is easy to fool,

Yes, very.  It won't do well at all with function calls, as

  printCont(sc.soundslike(wrd));

would become 

 printCont sc soundslike wrd

and thus none of it would be considered code.
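
To make the problem concrete, here is a minimal sketch of that kind of
stripping.  It is a simplification of the quoted step, not the quoted
algorithm itself: it treats every non-letter as a separator, and the name
strip_to_words is made up for this illustration.

```cpp
#include <cctype>
#include <string>

// Simplified stand-in for the quoted splitting step (an assumption, not the
// original algorithm): every non-letter becomes a single separating space.
std::string strip_to_words(const std::string& text) {
    std::string out;
    for (char c : text) {
        if (std::isalpha(static_cast<unsigned char>(c))) {
            out += c;
        } else if (!out.empty() && out.back() != ' ') {
            out += ' ';  // collapse any run of non-letters into one space
        }
    }
    if (!out.empty() && out.back() == ' ') out.pop_back();  // trim the tail
    return out;
}
```

Feeding it the function call above gives back only the bare words, so
nothing is left to tell a spell checker that the line was code.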
  
And you can't always count words between two punctuation characters as
code, because then a string like
  
 .  Howevr,

would mark Howevr as code when it is clearly part of a sentence.

Instead how about this:

Count the number of occurrences of all words which appear next to any
sort of symbol--including punctuation.

Then go back and look at the surrounding symbols for all words which
appear at least X times.  If a word has a high occurrence of
a particular symbol either before or after it, mark all occurrences of
the word as correct.

For example, given the following code sample:

    case '$':
      if (cin.get() == '$') {
        switch(cin.get()) {
        case 's':
          get_word_pair(word,word2);
          cout << sc.score(word.c_str(),word2.c_str()) << endl;
          break;
        case 'S':
          switch(cin.get()) {
          case 'W':
          case 'w':
            cin >> word;
            cout << sc.to_soundslike(word) << endl;
            ignore_rest();
            break;
          case 'L':

The counts would be (for words which would normally be misspelled):
  cin  4
  cout 2
  endl 2 
  sc   2
  str  2

Because this is a small block of code we will let X be 2.  Thus:
  cin  ->  (cin.    3/4 times
  cout ->  cout <   2/2 times
  endl ->  << endl; 2/2 times
  sc   ->  << sc.   2/2 times
  str  ->  _str(    2/2 times

Thus all 5 words will be ignored.

Under your system only "cout" and "endl" would be ignored.
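
A rough sketch of the counting idea in C++ might look like the following.
Several details are my own assumptions rather than anything fixed above:
the names likely_code_words, min_count (the X above) and min_share (what
counts as a "high occurrence") are made up for this example, a word is
taken to be a run of letters, and the neighbouring symbol is the nearest
punctuation character on the same line, skipping spaces and tabs.  It
returns every word with a dominant neighbouring symbol; a real checker
would only consult the result for words it would otherwise flag as
misspelled.

```cpp
#include <algorithm>
#include <cctype>
#include <cstddef>
#include <map>
#include <set>
#include <string>

// Tallies for one word (hypothetical helper type for this sketch).
struct SymbolStats {
    int total = 0;               // occurrences with a symbol on either side
    std::map<char, int> before;  // symbols seen just before the word
    std::map<char, int> after;   // symbols seen just after the word
};

static bool is_letter(char c) { return std::isalpha(static_cast<unsigned char>(c)) != 0; }
static bool is_symbol(char c) { return std::ispunct(static_cast<unsigned char>(c)) != 0; }
static bool is_blank(char c) { return c == ' ' || c == '\t'; }

// Largest count stored in a symbol tally, or 0 if it is empty.
static int highest(const std::map<char, int>& tally) {
    int best = 0;
    for (const auto& kv : tally) best = std::max(best, kv.second);
    return best;
}

// Returns the words that occur next to a symbol at least min_count times and
// for which one symbol (before or after) covers at least min_share of those
// occurrences.
std::set<std::string> likely_code_words(const std::string& text,
                                        int min_count, double min_share) {
    std::map<std::string, SymbolStats> stats;
    std::size_t i = 0;
    while (i < text.size()) {
        if (!is_letter(text[i])) { ++i; continue; }
        const std::size_t start = i;
        while (i < text.size() && is_letter(text[i])) ++i;  // word is [start, i)

        std::size_t b = start;  // step back over blanks to the previous char
        while (b > 0 && is_blank(text[b - 1])) --b;
        const bool has_before = b > 0 && is_symbol(text[b - 1]);

        std::size_t a = i;      // step forward over blanks to the next char
        while (a < text.size() && is_blank(text[a])) ++a;
        const bool has_after = a < text.size() && is_symbol(text[a]);

        if (!has_before && !has_after) continue;  // not next to any symbol
        SymbolStats& s = stats[text.substr(start, i - start)];
        ++s.total;
        if (has_before) ++s.before[text[b - 1]];
        if (has_after) ++s.after[text[a]];
    }

    std::set<std::string> result;
    for (const auto& entry : stats) {
        const SymbolStats& s = entry.second;
        if (s.total < min_count) continue;
        const int best = std::max(highest(s.before), highest(s.after));
        if (best >= min_share * s.total) result.insert(entry.first);
    }
    return result;
}
```

Calling likely_code_words(sample, 2, 0.75) on the code sample above should
pick up cin, cout, endl, sc and str, while a word like "word", whose
neighbouring symbols vary from one occurrence to the next, is left alone.
The 0.75 is only a guess chosen so that the 3/4 case for cin passes; the
right value would need tuning on real text.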


-- 
Kevin Atkinson
address@hidden
http://metalab.unc.edu/kevina/


