How Large Corpora Sizes Influence the Distribution of Low Frequency Text n-grams by Joaquim Ferreira da Silva (NOVA FCT - NOVA LINCS)

10 Jul 2024 - 14:00 to 15:00



DI Seminar Room and Zoom


Predicting the number of distinct word n-grams in a text corpus, and their frequency distribution, is important in domains such as information processing and language modelling. With big-data corpora, the sheer volume of data increases application complexity. Traditional studies have been confined to small or moderate-sized corpora, and have produced statistical laws on word frequency distribution. For very large corpora, however, some of the assumptions underlying those laws, concerning the corpus vocabulary and the numbers of word occurrences, need to be revised. Although it is therefore critical to know how corpus size influences those distributions, models that characterize this influence are lacking. We propose a model that aims to fill this gap, enabling prediction of the impact of corpus growth on the time and space complexity of applications.