[Paper Review] The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora with Web Data, and Web Data Only (2023)

Paper Review

[Paper Review] The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora with Web Data, and Web Data Only (2023)

인간개발자 2024. 5. 22. 00:27

저자 : The Falcon LLM team

Abstract

핵심
- 적절히 필터링 & 중복제거한 웹 데이터는 좋은 모델을 만들 수 있게 한다 !
  we show that properly filtered and deduplicated web data alone can lead to powerful models
배경
- as larger models requiring pretraining on trillions of tokens are considered, it is unclear how scalable is curation and whether we will run out of unique high-quality data soon

1. Introduction

Contributions.

We introduce REFINEDWEB, a high-quality five trillion tokens web-only English pretraining dataset
We demonstrate that web data alone can result in models outperforming both public and private curated corpora, as captured by zero-shot benchmarks, challenging current views about data quality
We publicly release a 600B tokens extract of RefinedWeb, and 1/7B parameters LLMs trained on it, to serve as a new baseline high-quality web dataset for the natural language processing community.

3. Macrodata Refinement and RefinedWeb

MDR (MacroData Refinement): 필터링 & 중복제거 파이프라인
- We introduce MDR (MacroData Refinement), a pipeline for filtering and deduplicating web data from CommonCrawl at very large scale
- Using MDR, we produce REFINEDWEB

3.1. Document preparation

Reading data

데이터 선정
- CommonCroll 데이터셋 사용
- WARC, WET 파일 중 WARC을 사용함
  - WARC (raw HTML response), WET files (preprocessed to only include plain text
  - our pipeline starts from raw WARC files, read with the warcio library
  - why WARC? : WET files to include undesirable navigation menus, ads

URL Filtering

we base our filtering of adult documents only on the URL itself, and not on the content of the documents.

URL blocklist
- we aggregated a list of 4.6M domains, that we explicitly ban.
- We select categories likely to contain adult or malicious content, as well as spam or unstructured text.
URL scoring
Common HQ sources blocked

Extracting Text

From WARC using warcio, trafilatura for text extraction

Language Identification

설명
- 58.20% of all documents in the processed CommonCrawl data were identified as English
- we classify processed CommonCrawl data into 176 languages.
방법
- We use the fastText language classifier of CCNet (Wenzek et al., 2020) at the document level
- We remove documents for which the top language scores below 0.65

3.2. Filtering: document-wise and line-wise

Repetition Removal

In-document repetition removal

We could catch this content at the later deduplication stage, but it is cheaper and easier to catch it document-wise early on.
removing documents with a high proportion of repeated lines, paragraphs, or 𝑛-grams.*

*참조문서 : Scaling Language Models: Methods, Analysis & Insights from Training Gopher, Deepmind

Document-wise Filtering

quality heuristics

대상
- machine-generated spam, made predominantly of lists of keywords, boilerplate text, or sequences of special characters
방법*
- remove any document that does not contain between 50 and 100,000 words, or whose mean word length is outside the range of 3 to 10 characters
- we remove any document with a symbol-to-word ratio greater than 0.1 for either the hash symbol or the ellipsis
- we remove any document with more than 90% of lines starting with a bullet point, or more than 30% ending with an ellipsis
- We also require that 80% of words in a document contain at least one alphabetic character, and apply a "stop word" filter, to remove documents that do not contain at least two of the following English words: the, be, to, of, and, that, have, with

*참조문서: Scaling Language Models: Methods, Analysis & Insights from Training Gopher, Deepmind 중 quality filtering

Line-wise Filtering

Remove undesirable lines

If these corrections remove more than 5% of a document, we remove it entirely.
대상
- undesirable lines (e.g., call to actions, navigation buttons, social counters, etc)
방법
- We analyse documents line-by-line, and discard or edit the lines based on the following rules: (사진 참조)
- Finally, if the words in the flagged lines represent more than 5% of the total document words, the document is discarded.

3.3. Deduplication: fuzzy, exact, and across dumps

Fuzzy deduplication.

방법
- We perform MinHash deduplication using 9,000 hashes per document, calculated over 5-grams and divided into 20 buckets of 450 hashes.
- 느슨한 세팅을 사용하면, 중복 제거율이 낮아지고, 이는 모델 퍼포먼스를 저하시킨다.
  Using less aggressive settings, such as the 10 hashes of The Pile (Gao et al., 2020),
  resulted in lower deduplication rates and worsened model performance.

Exact deduplication.

Exact substring operates at the sequence-level instead of the document-level, finding matches between strings that are exact token-by-token matches by using a suffix array

방법
- We remove any match of more than 50 consecutive tokens, using the implementation of Lee et al.

'Deduplicating Training Data Makes Language Models Better' 논문 중 4.1 Exact Substring Duplication*

code: https://github.com/google-research/deduplicate-text-datasets

We note that exact substring alters documents, by removing specific spans
- document를 통으로 drop 하거나, loss-masking을 하는 방법을 시도해봤으나 두드러지는 효과가 없었다.

이 논문의 Data Pipeline 부분을 세밀하게 살펴 보았습니다.

논문 본문에는 Pipeline 각각 기능에 대한 소개, Appendix에는 각 기능을 실제 세팅에 대한 자세한 설명이 나와있습니다. Appendix를 꼭 한 번 읽어보시기를 추천합니다.
Deepmind의 Scaling Language Models: Methods, Analysis & Insights from Training Gopher 논문도 많이 참고한 것으로 보입니다.
데이터셋도 허깅페이스에 올라와있네요.

Paper : https://arxiv.org/pdf/2306.01116
HuggingFace: https://huggingface.co/datasets/tiiuae/falcon-refinedweb

tiiuae/falcon-refinedweb · Datasets at Hugging Face

The fashionable Tokyo thoroughfare of Omotesando invariably bustles with hordes of young people, but turn away from its rows of Zelkova trees and head down one of its back streets, and peace and quiet prevail. This is where it all started for hairdresser H

huggingface.co