- Hilary Sumner
- Sep 24
A team of researchers has shown that it’s possible to build large AI datasets entirely from ethical sources, in this case 130,000 English-language books from the Library of Congress, almost twice the size of Project Gutenberg’s collection. Their project adds to recent open-source efforts like Hugging Face’s FineWeb, which aim to make AI training more transparent and responsible. Experts say a dataset built this carefully may not be large enough to power today’s largest AI models, but they hope it will encourage companies to be more open about the data they use.

