google-research-datasets/C4_200M-synthetic-dataset-for-grammatical-error-correction

google-research-datasets

Fetched on 2026/03/01 20:06

This dataset contains synthetic training data for grammatical error correction. The corpus is generated by corrupting clean sentences from C4 using a tagged corruption model. The approach and the dataset are described in more detail by Stahlberg and Kumar (2021): https://www.aclweb.org/anthology/2021.bea-1.4/
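
To make the idea of tag-conditioned corruption concrete, here is a minimal toy sketch in Python. It is not the authors' method: the corruption model in the paper is a learned model, whereas this sketch uses hand-written rules. The tag names ("DET", "WO", "VERB:SVA"), the function names, and the example sentence are all hypothetical, chosen only to show how a clean sentence plus an error-type tag can yield a (corrupted, clean) training pair.

```python
from typing import Callable, Dict, List, Tuple


def drop_first_article(tokens: List[str]) -> List[str]:
    """Remove the first article, simulating a missing-determiner error."""
    for i, tok in enumerate(tokens):
        if tok.lower() in {"a", "an", "the"}:
            return tokens[:i] + tokens[i + 1:]
    return list(tokens)


def swap_first_two_words(tokens: List[str]) -> List[str]:
    """Swap the first two tokens, simulating a word-order error."""
    if len(tokens) < 2:
        return list(tokens)
    return [tokens[1], tokens[0]] + tokens[2:]


def break_verb_agreement(tokens: List[str]) -> List[str]:
    """Flip the number of the first matching verb, simulating an agreement error."""
    swaps = {"is": "are", "are": "is", "has": "have", "have": "has"}
    for i, tok in enumerate(tokens):
        if tok in swaps:
            return tokens[:i] + [swaps[tok]] + tokens[i + 1:]
    return list(tokens)


# Hypothetical error-type tags mapped to toy, rule-based corruptors.
# A real tagged corruption model would be learned, not rule-based.
CORRUPTORS: Dict[str, Callable[[List[str]], List[str]]] = {
    "DET": drop_first_article,
    "WO": swap_first_two_words,
    "VERB:SVA": break_verb_agreement,
}


def corrupt(clean: str, tag: str) -> Tuple[str, str]:
    """Return a (corrupted, clean) pair for the requested error-type tag."""
    tokens = clean.split()
    corrupted = " ".join(CORRUPTORS[tag](tokens))
    return corrupted, clean


if __name__ == "__main__":
    sentence = "The cat is on the mat ."
    for tag in CORRUPTORS:
        source, target = corrupt(sentence, tag)
        print(f"{tag}\t{source}\t{target}")
```

Each pair places the corrupted sentence on the source side and the original clean sentence on the target side, which is the usual arrangement for training a correction model. Conditioning on a tag lets the caller choose which kind of error appears in the synthetic data.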

Stars: 162

Rank: 211822
