aspell-user

Re: Splitting text into words and non-words


From: Asger Alstrup Nielsen
Subject: Re: Splitting text into words and non-words
Date: Sat, 2 Jan 1999 22:42:58 +0100

> > This is a rough algorithm that probably is easy to fool,
> 
> Yes, very: it won't do well at all with function calls, since
> 
>   printCont(sc.soundslike(wrd));
> 
> would become 
> 
>  printCont sc soundslike wrd
> 
> and thus none of it would be considered code.

True.

> And you can't always count words between two punctuation characters as
> code, because then a string like
>   
>  .  Howevr,
> 
> would mark Howevr as code when it is clearly part of a sentence.

No, since punctuation is dropped, this would be spellchecked.
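
To make the two cases above concrete, here is a minimal sketch of the
"drop the punctuation" step.  It is not the code from either scheme; the
function name strip_symbols and the letters-only definition of a word are
assumptions made for illustration.

```python
import re


def strip_symbols(text):
    """Keep only runs of ASCII letters, joined by single spaces.

    Assumption: a "word" is a run of letters; digits, punctuation and
    other symbols are all dropped.
    """
    return " ".join(re.findall(r"[A-Za-z]+", text))


# The function call from the quoted example loses all of its structure.
print(strip_symbols("printCont(sc.soundslike(wrd));"))

# The sentence fragment keeps its single word, which then gets spellchecked.
print(strip_symbols(" .  Howevr,"))
```

Once the symbols are gone, nothing distinguishes the call from ordinary
prose, which is why both points above hold at the same time.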

> Instead how about this:
> 
> Count the number of occurrences of all words which appear next to any
> sort of symbol--including punctuation.
> 
> Then go back and look at the surrounding symbols for all words which
> appear more than X times.  If a word has a high occurrence of
> a particular symbol either before or after it, mark all occurrences of
> the word as correct.
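
A rough sketch of how I read the quoted proposal, in case it helps the
discussion.  Several things here are my assumptions and not part of the
proposal: "next to" means directly touching with no whitespace in between,
X is the parameter min_count, "high occurrence" means a share of at least
min_share, and the name words_to_accept is made up.

```python
import re
from collections import Counter, defaultdict

WORD = re.compile(r"[A-Za-z]+")


def is_symbol(ch):
    """True for a single character that is neither alphanumeric nor whitespace."""
    return ch != "" and not ch.isalnum() and not ch.isspace()


def words_to_accept(text, min_count=3, min_share=0.8):
    """Return the words that would be marked correct everywhere.

    min_count and min_share are hypothetical tuning knobs.
    """
    total = Counter()               # occurrences touching any symbol
    before = defaultdict(Counter)   # word -> symbol directly before it
    after = defaultdict(Counter)    # word -> symbol directly after it

    # Pass 1: count words that touch a symbol on either side.
    for m in WORD.finditer(text):
        word = m.group()
        prev_ch = text[m.start() - 1] if m.start() > 0 else ""
        next_ch = text[m.end()] if m.end() < len(text) else ""
        if not (is_symbol(prev_ch) or is_symbol(next_ch)):
            continue
        total[word] += 1
        if is_symbol(prev_ch):
            before[word][prev_ch] += 1
        if is_symbol(next_ch):
            after[word][next_ch] += 1

    # Pass 2: keep frequent words dominated by one particular symbol.
    accepted = set()
    for word, n in total.items():
        if n <= min_count:          # must appear more than X times
            continue
        top = max(max(before[word].values(), default=0),
                  max(after[word].values(), default=0))
        if top / n >= min_share:
            accepted.add(word)
    return accepted


sample = ("sc.soundslike(a); sc.soundslike(b); sc.soundslike(c); "
          "sc.soundslike(d); Howevr, it works.")
print(words_to_accept(sample))
```

With these settings "sc" and "soundslike" are accepted because each touches
the same symbol four times, while "Howevr" touches a comma only once and is
left for the spellchecker.  A word such as "cout" in "cout << x" would not
count under the no-whitespace assumption, so that rule may need loosening.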

[example that seems to work well with your scheme]
 
> Under your system only the "cout", and "endl" would be ignored.

True.  It would need a few extra heuristics to do as well.

My approach to the problem was that I wanted the spellchecker to only throw
away "words" that are known to be non-words, if possible, and keep others that
we are not sure of.  But mostly, I just wanted to give you some feedback
because you asked for it.  Of course it is possible to do better.  It seems
your approach is suitable for that.

Greets,

Asger Alstrup



