[Paper Review] The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora with Web Data, and Web Data Only (2023)
Authors: The Falcon LLM team
Abstract
Key point: Properly filtered and deduplicated web data alone is enough to build strong models! "we show that properly filtered and deduplicated web data alone can lead to powerful models"
Background: "as larger models requiring pretraining on trillions of tokens are considered, it is unclear how scalable is curation and whether we will run out of unique high-quality data soon"
1. Introduction
Contributions. We introduce REFINEDW..
2024.05.22