Picture courtesy: Katrina Wright on Unsplash
Last year, a debate sprang up across the NLP community when Google released XLNet, a model that outperformed BERT on 20 tasks! But some people were sceptical of the performance comparison because the two models were trained on datasets of very different sizes. It was hard to tell whether the increase in performance was due to more data or to XLNet's novel architectural design. The debate finally settled when Facebook released RoBERTa, a retrained BERT fed with vastly more data, which outperformed XLNet, showing that the improvement came from more data and not from the architecture itself [1]. Does this bolster the claim that the more the data, the better the model?
Picture courtesy: BERT, RoBERTa, DistilBERT, XLNet — which one to use?
Figure showing the dataset sizes of BERT, XLNet and RoBERTa
In January 2020, researchers from OpenAI empirically identified power-law trends relating the performance of Transformer language models to model size (excluding the embedding layer), dataset size, and the amount of compute available for training. Some of their salient claims are: [2]
Picture courtesy: Scaling Laws for Neural Language Models
Their empirical data certainly points towards the "bigger is better" narrative in NLP.
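For a concrete sense of what these power laws look like, below is a minimal Python sketch of the L(N) and L(D) relations from the paper. The exponents and critical constants are the approximate values reported by the authors and are included purely for illustration; treat the outputs as a rough sketch, not exact predictions.

```python
# A minimal sketch of the power-law form reported in "Scaling Laws for
# Neural Language Models" (Kaplan et al., 2020). The exponents and
# critical constants below are approximate values quoted from the paper;
# treat them as illustrative rather than exact.

def loss_vs_params(n_params):
    """Predicted cross-entropy loss versus non-embedding parameters N,
    with data and compute assumed non-limiting: L(N) = (N_c / N) ** alpha_N."""
    N_C, ALPHA_N = 8.8e13, 0.076   # approximate constants from the paper
    return (N_C / n_params) ** ALPHA_N

def loss_vs_data(n_tokens):
    """Predicted loss versus dataset size D in tokens, with the model
    assumed large enough: L(D) = (D_c / D) ** alpha_D."""
    D_C, ALPHA_D = 5.4e13, 0.095   # approximate constants from the paper
    return (D_C / n_tokens) ** ALPHA_D

if __name__ == "__main__":
    # Loss keeps dropping smoothly as either axis grows by ~10x,
    # which is the "bigger is better" trend the figure refers to.
    for n in (1e8, 1e9, 1e10, 1e11):
        print(f"N = {n:.0e} params -> predicted loss ~ {loss_vs_params(n):.2f}")
    for d in (1e9, 1e10, 1e11, 1e12):
        print(f"D = {d:.0e} tokens -> predicted loss ~ {loss_vs_data(d):.2f}")
```

Each tenfold increase in parameters or tokens shaves the predicted loss by a roughly constant factor, which is exactly the smooth trend the figure above illustrates.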
This increase in the size of language models is making it increasingly difficult for budding researchers (*cough* like me :P) to contribute to this beautiful field. AI21 Labs, an Israeli research company, estimated the expected cost of training NLP models and the factors involved. The team compared three different-sized BERT models on the 15 GB Wikipedia and Book corpora, evaluating both the cost of a single training run and the typical, fully-loaded model cost. The fully-loaded estimate includes hyperparameter tuning and multiple runs for each setting: [3,4]
Picture courtesy: The Cost of Training NLP Models
Notice the words "billion" and "trillion" in the graph
Training costs can vary drastically with technical parameters, climbing up to US $1.3 million for a single run when training Google's 11-billion-parameter Text-to-Text Transfer Transformer (T5) model variant. A project that requires several runs could see total training costs hit a jaw-dropping US $10 million! [3]
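To make the fully-loaded figure concrete, here is a small back-of-the-envelope sketch in Python. Only the US $1.3 million single-run figure and the rough US $10 million total come from the article; the split into hyperparameter settings and repeated runs is a hypothetical assumption for illustration.

```python
# A back-of-the-envelope sketch of "fully-loaded" training cost, in the
# spirit of the AI21 Labs estimate. The breakdown below is hypothetical;
# only the structure of the calculation (single run x hyperparameter
# settings x repeats per setting) follows the description above.

def fully_loaded_cost(single_run_usd, n_hyperparam_settings, runs_per_setting):
    """Total cost once hyperparameter tuning and repeated runs are included."""
    return single_run_usd * n_hyperparam_settings * runs_per_setting

if __name__ == "__main__":
    single_run = 1.3e6   # USD, the single-run ballpark quoted above for T5-11B
    settings = 4         # hypothetical number of hyperparameter settings
    repeats = 2          # hypothetical repeated runs per setting
    total = fully_loaded_cost(single_run, settings, repeats)
    # With these hypothetical numbers the total quickly crosses the $10M mark.
    print(f"Estimated fully-loaded cost: ${total:,.0f}")
```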
So, what's next? I believe the "bigger is better" narrative will stay in play until it plateaus, and sooner or later results will also depend on the brain (novel architectures) rather than only on brawn (data).
Until then, Stay Well! Stay Safe!