A Scalable Model of Frequency Distribution of Low Occurrence Multi-words Towards Handling Very Large Spectrum of Text Corpora Sizes by Joaquim Ferreira da Silva (DI-NOVA FCT and NOVA LINCS)

04 Feb 2026 - 14:00

Category:
Seminar

Where:
In person

Location:
DI Seminar Room and Google Meet

Description:

Predicting the diversity of words and multi-words (n-grams) in a text corpus, together with their frequency distributions, is important in NLP and language modelling. It is becoming critical to the design of modern applications such as Large Language Models, for example to guide tokenization and corpus analysis for pre-training. This requires modelling the behaviour of very large-scale corpora, handling multi-words as subwords or phrases, and describing the distribution of n-grams across different frequency ranges, in particular low-occurrence n-grams.

Link:
https://meet.google.com/fup-ddqu-iox