Splitting text into words and non-words

aspell-user

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Splitting text into words and non-words

From:	Asger Alstrup Nielsen
Subject:	Splitting text into words and non-words
Date:	Sat, 2 Jan 1999 13:48:22 +0100

> I had an idea on a great general way to determine if a word should be
> skipped.  Determine the words to skip based on the symbols that (almost)
> always surround the word.

I think you need to design the algorithm in detail before you try to implement
it.  You outlined it in the post, but it needs to be more elaborate.  

The observation that the syntactic context on a word tells something about the
semantics of it is of course useful, but I think that it will be difficult to
develop a scheme that works correctly for all kinds of texts.

However, it's worth trying out, and maybe over time, the algorithm can be
fine-tuned to behave better.

I propose a more detailed algorithm below that you try out at first.

> So I was wondering if anyone on this list has any experience in writing
> this sort of context recognition code or could give me some pointers in
> the right direction.

I describe an algorithm in principle, and as a multi-pass algorithm that
requires O(n) space.  You probably want to do something to reduce that to a
O(1) space algorithm, and that should be pretty easy to do by using a small
buffer.

Anyway, here goes:

First convert the file into a list of strings by splitting at whitespace and
change from letter to non-letters.  Also, throw alway any non-letter characters
that appear in ordinary text:  , . : ; ( ) ! ? - "

The text 

        'cout << (num+1) << "test" << endl;' 

would be converted into the list: 

        'cout' '<<' 'num' '+1' '<<' 'test' '<<' 'endl'

We have two kinds of strings:  Ordinary words that are made up from letters
alone.  Let's color those white.  We also have "non-words" that do not contain
any letters.  These are colored red.

Now, what we want to do is to polarize each string in the list, and categorize
it as either red or green, meaning "certainly not a word" and "certainly a
word".  When we have done that for all strings, we are ready to spellcheck the
lot:  All the green words are spellchecked, while all the red strings are
skipped.

So at this point, all we know is that some of the strings are certainly not a
word (because some of them are red), but we are unsure about the remaining,
because we don't have any green words.  Some of them might be keywords or
identifiers in a programming language, and those are not interesting to us.

In other words, the task is to color the white strings as either red or green. 
In order to do so, we apply a heuristic that exploits context information.

So, perform a scan of the white strings.  For each white string, have a look at
the preceding string color and the following string color.  If neither is red,
we judge that the string is a real word, and color it green.

Now, for all the other white strings where one of the neighbours is red, do
this:  Pair together every red string that is prior to white string with the
white string.  Similar for all red strings following white strings.

Now scan these pairs, and if a white word only appears in pairs with the same
red string, we judge that the white word is a keyword in a programming
language, and color it red.

The rest, we color green.

This is a rough algorithm that probably is easy to fool, but the framework
could serves as a starting point.  There are two things that can be easily
fiddled with:  The initial list of characters that are ignored (i.e. the
punctuation characters.)  In some contexts these characters might help resolve
some issues.

And of course, the heuristics can be refined.  For instance, there is a high
probability that white strings following the red string "<<" is a keyword.  One
approach to use such information would be to develop a scoring system, and
develop a set of rules that will score keywords according to some specific
criteria that can be developed by trying out the system on a wide variety of
texts.

> Please rank the your experience with in the following languages with 0
> being none and 5 being expert.

C++ (Modern, STL) 4
C++ (Object Oriented) 4
C 4
Java 3
Perl 2
Lisp 2

Greets,

Asger Alstrup

[Prev in Thread]

Current Thread

[Next in Thread]

Splitting text into words and non-words, Asger Alstrup Nielsen <=
- Re: Splitting text into words and non-words, Kevin Atkinson, 1999/01/02
- Re: Splitting text into words and non-words, Asger Alstrup Nielsen, 1999/01/02
  - Re: Splitting text into words and non-words, Kevin Atkinson, 1999/01/02

Prev by Date: Programming experience
Next by Date: Re: Splitting text into words and non-words
Previous by thread: Programming experience
Next by thread: Re: Splitting text into words and non-words
Index(es):
- Date
- Thread