@YourMomsTrashman

YourMomsTrashman@lemmy.world · 10 days ago

That’s exactly why projects like the common crawl exist though !

YourMomsTrashman@lemmy.world · 11 days ago

Traditionally, with machine learning, it is standard practice to mention what datasets and/or pretrains were used, so that the results are transparent and can be replicated. With GPT-2, it was “the common crawl and our own crawled 8 million web pages”, and since then I feel it’s mostly left out, falling back on (easily manipulated) benchmarks instead 😬