EPIA'03 - 11th Portuguese Conference on Artificial Intelligence

NLTR -- Natural Language and Text Retrieval


Session: December 5, 14:45-16:15, Room B
Title: Text Mining using Generalized Character Based n-Grams
Authors: Nuno C. Marques and Agnès Braud
Abstract: Term extraction is an essential problem in Text Mining. Standard techniques that rely on word identification can cause problems, notably by splitting entity names in misleading ways. An alternative is to consider more flexible tokens, namely n-grams, and in particular the character n-grams that appear most frequently in a corpus. In this manner, it is possible to pick out n-grams that correspond to grammatical units of the language, from stems to sequences of words, without any prior lexical knowledge. The straightforward algorithm for character n-gram extraction is, however, expensive. In this paper, we study the computational costs of extracting character n-grams from a corpus. We propose a new approach for efficiently counting all n-grams that occur with a desired frequency in a corpus. Tests show that sufficiently long frequent n-grams include most of the relevant words and multiword units in the corpus, and that n-grams offer advantages over words in many frameworks. We conclude by presenting an application of this work to Text Mining.
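To make the cost argument concrete, the following is a minimal Python sketch of the straightforward baseline the abstract calls expensive, not the approach proposed in the paper. The function name `frequent_char_ngrams` and the parameters `max_n` and `min_count` are assumptions introduced for illustration only.

```python
from collections import Counter


def frequent_char_ngrams(text, max_n, min_count):
    """Naive baseline (hypothetical helper, not the paper's algorithm).

    Counts every character n-gram of length 1..max_n and keeps those
    occurring at least min_count times. Every substring is materialized
    and hashed, so time and memory grow with len(text) * max_n.
    """
    counts = Counter()
    for n in range(1, max_n + 1):
        # Slide a window of width n over the text, one character at a time.
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return {gram: c for gram, c in counts.items() if c >= min_count}


# A multiword entity name is recovered as a single frequent n-gram,
# without any tokenization into words.
frequent = frequent_char_ngrams("New York, New York", max_n=8, min_count=2)
```

Because the sketch stores a count for every distinct substring up to `max_n` characters, it becomes impractical on large corpora, which is the cost the paper sets out to study.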