EPIA'03 - 11th Portuguese Conference on Artificial Intelligence
NLTR -- Natural Language and Text Retrieval
Session: December 5, 14:45-16:15, Room B
Title:
Text Mining using Generalized Character Based n-Grams
Nuno C. Marques and Agnès Braud
Abstract:
Term extraction is an essential problem in Text Mining. Standard techniques that rely on word identification can cause problems, notably the misleading splitting of entity names. An alternative is to consider more flexible tokens, namely n-grams, and in particular character n-grams that appear frequently in a corpus. In this way, it is possible to pick out n-grams that correspond to grammatical units of the language, from stems to sequences of words, without any prior lexical knowledge. The straightforward algorithm for character n-gram extraction is, however, expensive.
In this paper, we study the computational cost of extracting character n-grams from a corpus. We propose a new approach for efficiently counting all n-grams that occur with a desired frequency in a corpus. Tests show that frequent, sufficiently long n-grams include most of the relevant words and multiword units in the corpus, and that n-grams can be used to advantage in place of words in many frameworks. We conclude by presenting an application of this work to Text Mining.
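To make the cost of the straightforward approach concrete, here is a minimal Python sketch of naive character n-gram counting with a frequency threshold. It is an illustration only and not the efficient approach proposed in the paper; the function name `frequent_char_ngrams`, its parameters `max_n` and `min_count`, and the sample corpus are all assumptions introduced for this example.

```python
from collections import Counter


def frequent_char_ngrams(text, max_n, min_count):
    """Naive baseline (hypothetical helper, not the paper's algorithm).

    Counts every character n-gram of length 1..max_n and keeps those
    occurring at least min_count times.
    """
    counts = Counter()
    for n in range(1, max_n + 1):
        # One pass over the text per n-gram length: roughly
        # len(text) * max_n substrings are created and hashed in total.
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return {gram: c for gram, c in counts.items() if c >= min_count}


# Illustrative corpus (an assumption, not data from the paper).
corpus = "new york is big. new york is old."
frequent = frequent_char_ngrams(corpus, max_n=8, min_count=2)
```

In this toy corpus the entity name "new york" survives as a single frequent 8-character n-gram, whereas whitespace tokenization would split it into two words. The sketch stores every distinct substring up to length `max_n` before filtering, which is the kind of time and memory cost the abstract describes as expensive.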