AI Training Datasets Directory
Comprehensive analysis of 48 training datasets powering the world's leading AI models
Total Datasets
48
Average Quality
5.8/10
Top Domain
general
Browse by Domain
All Datasets (48)
redpajama
RedPajama is a project by Together AI to create leading open-source models by reproducing the LLaMA training dataset. RedPajama-Data-1T is a 1.2 trillion token dataset (~2.67TB compressed, ~5TB uncompressed) designed to match the training data composition described in Meta AI's LLaMA paper.
Dataset Size
**RedPajama-Data-1T**: 1.2 trillion tokens (~2.67TB compressed, ~5TB uncompressed)
Domains
Used By
3 notable models
Key Strengths:
- Transparent Composition: Clear breakdown of all data sources and proportions
- Reproducibility: Open-source code enables full reproduction
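For readers who want to inspect RedPajama-Data-1T without downloading the full ~5TB, the sketch below streams a few records with the Hugging Face `datasets` library. The repository id, config name, and the `"text"` field are assumptions about the public mirror; substitute whatever mirror you actually use.

```python
# Minimal sketch: stream a slice of RedPajama-Data-1T rather than downloading ~5TB.
# The repo id, config name, and "text" field below are assumptions about the
# public Hugging Face mirror; adjust them to the mirror you actually use.
from datasets import load_dataset

stream = load_dataset(
    "togethercomputer/RedPajama-Data-1T",  # assumed repo id
    "arxiv",                               # assumed config: one of the source subsets
    split="train",
    streaming=True,                        # avoids materializing the full corpus
)

for i, record in enumerate(stream):
    print(record["text"][:200])            # "text" field is an assumption
    if i == 2:
        break
```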
the pile
The Pile is a 825.18 GiB (886 GB) diverse, open-source dataset of English text created by EleutherAI in 2020 and publicly released on December 31, 2020. It represents a carefully curated collection of 22 smaller, high-quality datasets specifically selected for training large language models. The Pile was constructed to address the limitations of relying primarily on Common Crawl by introducing a dataset with significant diversity from multiple authoritative sources.
Dataset Size
The Pile consists of 825.18 GiB of English text composed of 22 smaller datasets, each assigned an "epochs" weight that determines how many times it is repeated during training.
Domains
Used By
5 notable models
Key Strengths:
- Deliberate Diversity: Explicitly curated to include diverse content types (academia, code, Q&A, books, web text), improving model generalization.
- Documented Quality: Each component dataset is thoroughly documented with rationale for inclusion, enabling researchers to understand dataset composition.
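The per-component "epochs" mentioned above control how often each subset is repeated during training. The sketch below shows how sizes and epoch counts can be turned into sampling weights; the numbers are placeholders, not the official Pile values.

```python
# Illustrative sketch: per-component "epochs" become sampling weights.
# Sizes (GiB) and epoch counts here are placeholders, not the official Pile numbers.
import random

components = {            # name: (raw_size_gib, epochs), hypothetical values
    "Pile-CC":        (227.0, 1.0),
    "PubMed Central": (90.0,  2.0),
    "ArXiv":          (56.0,  2.0),
    "GitHub":         (95.0,  1.0),
}

weights = {name: size * epochs for name, (size, epochs) in components.items()}
total = sum(weights.values())
mixture = {name: w / total for name, w in weights.items()}
print(mixture)  # fraction of training tokens drawn from each component

# Draw a batch of component labels according to the mixture.
names, probs = zip(*mixture.items())
print(random.choices(names, weights=probs, k=8))
```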
bigcode
BigCode refers to the comprehensive code datasets created by the BigCode Project, an open scientific collaboration for responsible large language model development for code. It includes The Stack and related code compilation datasets totaling multiple terabytes of permissively-licensed source code.
Dataset Size
- **The Stack v1**: 3TB of deduplicated code
Domains
Used By
3 notable models
Key Strengths:
- Legal Clarity: Permissive licenses reduce legal risk for downstream use
- Comprehensive: Coverage of 358+ programming languages
the stack
The Stack is a large-scale dataset of source code created by the BigCode Project, an open scientific collaboration focused on responsible development of large language models for code. The Stack contains over 6TB of permissively-licensed source code files covering 358 programming languages, extracted from public GitHub repositories.
Dataset Size
- **Version 1.0**: 3TB of deduplicated code
Domains
Used By
3 notable models
Key Strengths:
- Legal Clarity: Permissive licenses reduce licensing concerns
- Comprehensive: 358 languages provide broad coverage
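The permissive-license filtering described above can be pictured as an allow-list over repository metadata. The sketch below is illustrative only: the license identifiers and record layout are assumptions, and The Stack's real pipeline relied on automated license detection across GitHub repositories.

```python
# Illustrative sketch of permissive-license filtering over repository metadata.
# The allow-list and record shape are assumptions, not The Stack's actual pipeline.
PERMISSIVE = {"mit", "apache-2.0", "bsd-2-clause", "bsd-3-clause", "isc", "unlicense"}

def keep_file(record: dict) -> bool:
    """Keep a file only if its repository license is on the permissive allow-list."""
    license_id = (record.get("license") or "").lower()
    return license_id in PERMISSIVE

files = [
    {"path": "a.py", "license": "MIT"},
    {"path": "b.py", "license": "GPL-3.0"},   # copyleft: excluded
    {"path": "c.rs", "license": "Apache-2.0"},
]
print([f["path"] for f in files if keep_file(f)])  # -> ['a.py', 'c.rs']
```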
pubmed central
PubMed Central is a free digital archive of biomedical and life sciences journal literature. Approximately 96.93GB of medical literature is included in The Pile dataset.
Dataset Size
- **Total Size**: ~96.93GB in The Pile (2 epochs assigned)
Domains
Used By
2 notable models
Key Strengths:
- Medical Authority: Peer-reviewed medical research
- Domain Expertise: Specialized medical terminology and concepts
starcoder data
StarCoder Data is the comprehensive code dataset used to train HuggingFace's StarCoder model. Created by the BigCode Project and processed for StarCoder training, it includes carefully selected and processed source code from multiple sources with permissive licensing.
Dataset Size
- **Total Data**: The Stack v1 (3TB) processed for model training
Domains
Used By
3 notable models
Key Strengths:
- Quality: Effectiveness demonstrated by StarCoder's performance
- Multi-Language: Supports 80+ programming languages
orca
Orca is a dataset of roughly 1 million instruction-response pairs created by Microsoft researchers using chain-of-thought prompting of large language models. It augments instructions drawn from the FLAN collection with detailed, step-by-step responses, producing more complex reasoning data than earlier instruction datasets such as Alpaca.
Dataset Size
- **Total Samples**: ~1 million instruction-response pairs
Domains
Used By
3 notable models
Key Strengths:
- Large Scale: 1 million instruction-response pairs
- Reasoning: Explicit chain-of-thought in responses
oscar
OSCAR (Open Super-large Crawled ALMAnaCH coRpus) is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus. OSCAR 22.01 (released January 2022) uses the Ungoliant architecture for extraction and filtering, while the 2019 version uses goclassy.
Dataset Size
- **Languages**: 151-163 different languages available
Domains
Used By
3 notable models
Key Strengths:
- Multilingual: 150+ languages enable global AI development
- Low-Resource Support: Critical for underrepresented languages
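A sketch of the language-classification step in the spirit of OSCAR's pipeline, using fastText's public language-ID model. It assumes the model file has been downloaded as `lid.176.bin`, and the 0.8 confidence threshold is an arbitrary choice for the example.

```python
# Illustrative language-ID filtering in the spirit of OSCAR.
# Assumes fastText's public language-ID model is available as lid.176.bin;
# the confidence threshold is arbitrary for this example.
import fasttext

lid = fasttext.load_model("lid.176.bin")

def keep(line: str, lang: str = "en", threshold: float = 0.8) -> bool:
    # fastText's predict() rejects newlines, so flatten the text first.
    labels, probs = lid.predict(line.replace("\n", " "), k=1)
    return labels[0] == f"__label__{lang}" and probs[0] >= threshold

docs = [
    "This is an English sentence about machine learning.",
    "Ceci est une phrase en français.",
]
print([d for d in docs if keep(d)])
```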
webtext
WebText is the original dataset used by OpenAI to train GPT-2. It consists of approximately 40GB of text scraped from webpages curated by humans, specifically URLs shared and upvoted on Reddit.
Dataset Size
- **Total Size**: ~40GB of text data
Domains
Used By
1 notable model
Key Strengths:
- Quality Signal: Human curation through Reddit upvotes
- Effective: Produced the high-performing GPT-2 model
mbpp
MBPP (Mostly Basic Python Problems) is a dataset of 1,000 short Python programming problems created by Google. Released in 2021, it serves as a benchmark for evaluating code generation models on programming task completion.
Dataset Size
- **Total Problems**: 1,000 programming problems
Domains
Used By
3 notable models
Key Strengths:
- Benchmark Standard: Widely used evaluation benchmark
- Test Coverage: Multiple test cases ensure robustness
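Each MBPP task ships with assert-style test cases, and a candidate solution counts as correct only if all of them pass. The toy harness below illustrates the idea; the task and tests are made up, and real evaluation must sandbox `exec` of untrusted model output.

```python
# Toy MBPP-style check: a candidate passes a task only if all asserts hold.
# Running untrusted model output with exec() requires sandboxing in practice.
task_tests = [
    "assert add(2, 3) == 5",
    "assert add(-1, 1) == 0",
]
candidate = "def add(a, b):\n    return a + b\n"

def passes(candidate_src: str, tests: list[str]) -> bool:
    namespace: dict = {}
    try:
        exec(candidate_src, namespace)   # define the candidate function
        for test in tests:
            exec(test, namespace)        # every assert must hold
        return True
    except Exception:
        return False

print(passes(candidate, task_tests))  # -> True
```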
dolly 15k
Dolly-15k is a dataset of ~15,000 instruction-following examples created by Databricks. It consists of high-quality instruction-output pairs designed to enable open-source models to match commercial models' instruction-following capabilities.
Dataset Size
- **Total Examples**: ~15,000 instruction-output pairs
Domains
Used By
3 notable models
Key Strengths:
- High Quality: Manually curated examples
- Small but Effective: 15K examples are sufficient for fine-tuning
c4
C4 (Colossal Clean Crawled Corpus) is a dataset of approximately 750GB of "reasonably clean and natural English text" developed by Google engineers. It was created by taking a single month's scrape of Common Crawl (April 2019) and applying filtering heuristics to remove duplicate, placeholder, nonsensical, and non-English content.
Dataset Size
- **Total Size**: ~750GB compressed (305GB deduplicated English)
Domains
Used By
4 notable models
Key Strengths:
- Scale and Accessibility: 750GB of publicly available, filtered text
- Systematic Filtering: Documented heuristics enable reproducibility
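The filtering heuristics described above can be sketched in a few lines. The code below paraphrases a subset of the published C4 rules (sentence-like lines, minimum line length, boilerplate and placeholder removal); the exact thresholds here are illustrative rather than the official ones.

```python
# Simplified sketch of C4-style cleaning heuristics (a paraphrased subset;
# thresholds are illustrative, not the exact published rules).
from typing import Optional

TERMINAL = (".", "!", "?", '"')

def clean_page(text: str) -> Optional[str]:
    lines = []
    for line in text.splitlines():
        line = line.strip()
        if len(line.split()) < 5:           # drop very short lines
            continue
        if not line.endswith(TERMINAL):     # keep only sentence-like lines
            continue
        if "javascript" in line.lower():    # cookie/JS boilerplate notices
            continue
        lines.append(line)
    page = "\n".join(lines)
    if "lorem ipsum" in page.lower() or "{" in page:  # placeholder or code pages
        return None
    return page or None

print(clean_page(
    "Enable javascript to view this page\n"
    "This is a real sentence with enough words to keep."
))
```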
openwebtext
OpenWebText is an open-source replication of the WebText dataset from OpenAI, originally used to train GPT-2. Created by Aaron Gokaslan and Vanya Cohen of Brown University, it consists of text extracted from web pages curated by humans on Reddit.
Dataset Size
- **Total Size**: 41.70 GB (generated), 13.51 GB downloaded, 55.21 GB total disk
Domains
Used By
3 notable models
Key Strengths:
- Human Curation: Reddit upvotes provide a quality signal superior to raw web text
- Reproducible: Process openly documented with public code
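The curation signal behind WebText and OpenWebText is the Reddit karma of the submission that linked to a page, with a threshold of 3 karma being the commonly cited cutoff. A toy sketch with hypothetical submission records:

```python
# Toy sketch of WebText/OpenWebText URL curation: keep only URLs submitted to
# Reddit with at least 3 karma. The submission records below are hypothetical;
# the real pipelines pulled them from Reddit data dumps.
MIN_KARMA = 3

submissions = [
    {"url": "https://example.com/good-article", "karma": 12},
    {"url": "https://example.com/spam",          "karma": 0},
    {"url": "https://example.com/ok-post",       "karma": 3},
]

curated_urls = sorted({s["url"] for s in submissions if s["karma"] >= MIN_KARMA})
print(curated_urls)
# The curated URL list is then scraped and the extracted text deduplicated.
```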
github code
The GitHub Code dataset refers to source code extracted from public repositories on GitHub. It is commonly included in large-scale AI training datasets through initiatives like CodeSearchNet and The Stack. The dataset consists of source code files covering multiple programming languages, paired with documentation and metadata from GitHub repositories.
Dataset Size
GitHub Code typically includes source code files in many programming languages, together with documentation and repository metadata.
Domains
Used By
5 notable models
Key Strengths:
- Real-World Relevance: Uses actual production code, improving practical applicability
- Multi-Language: Covers diverse programming languages
stackoverflow
Stack Overflow is a community-driven question-and-answer platform for programmers. It is included in datasets like The Pile and RedPajama as a source of high-quality programming knowledge and practical code solutions.
Dataset Size
- **Content**: Programming questions, answers, and code snippets
Domains
Used By
3 notable models
Key Strengths:
- Practical Knowledge: Real-world programming solutions
- Community Quality: Upvotes and accepted answers provide quality signals
codeparrot
CodeParrot is a large-scale dataset of source code designed specifically for training code generation models. It consists of permissively-licensed code from various sources including GitHub.
Dataset Size
- **Content**: Source code files with various licenses
Domains
Used By
2 notable models
Key Strengths:
- Code-Specific: Designed explicitly for code models
- Legal Clarity: Permissive licenses
cc 100
CC-100 is a multilingual dataset extracted from Common Crawl, containing 100+ languages processed for language model training. It represents a language-specific approach to filtering Common Crawl.
Dataset Size
- **Languages**: 100+ languages
Domains
Used By
2 notable models
Key Strengths:
- Multilingual: 100+ languages
- Comprehensive: Broad language coverage
roots
ROOTS is a large-scale multilingual dataset created by the BigScience workshop. It encompasses text data in multiple languages specifically curated for training large multilingual language models.
Dataset Size
- **Languages**: 59 languages (46 natural languages and 13 programming languages), roughly 1.6TB of text
Domains
Used By
2 notable models
Key Strengths:
- Multilingual: Supports many languages
- Equitable: Emphasis on low-resource languages
datacomp
DataComp is a benchmark and framework for comparing visual datasets used to train vision models. Created in 2023 by a multi-institution research collaboration, it provides standardized evaluation of how dataset composition affects model performance.
Dataset Size
- **Pool**: 12.8 billion image-text pairs available for selection
Domains
Used By
3 notable models
Key Strengths:
- Benchmarking: Standardized framework for dataset comparison
- Efficiency: Demonstrates that careful data selection can improve training efficiency
slimorca
SlimOrca is a reduced and deduplicated version of the Orca dataset, containing ~500K high-quality instruction-response pairs. Created to provide a more focused subset of Orca with emphasis on quality and diversity.
Dataset Size
- **Total Samples**: ~500,000 instruction-response pairs
Domains
Used By
2 notable models
Key Strengths:
- Quality Focused: High-quality examples selected from Orca
- Deduplicated: Removes redundancy, improving training efficiency
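Deduplication of instruction data is usually done by fingerprinting normalized text. The sketch below is a generic exact-duplicate filter with made-up records; it is not a description of SlimOrca's actual curation pipeline.

```python
# Generic exact-duplicate removal sketch (records are hypothetical; this is not
# SlimOrca's actual curation pipeline).
import hashlib

def fingerprint(example: dict) -> str:
    text = (example["instruction"] + "\n" + example["response"]).lower().strip()
    normalized = " ".join(text.split())               # collapse whitespace
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

examples = [
    {"instruction": "Explain photosynthesis.", "response": "Plants convert light..."},
    {"instruction": "Explain photosynthesis.", "response": "Plants convert light..."},
    {"instruction": "Summarize the water cycle.", "response": "Evaporation, then rain..."},
]

seen, deduped = set(), []
for ex in examples:
    h = fingerprint(ex)
    if h not in seen:
        seen.add(h)
        deduped.append(ex)

print(len(deduped))  # -> 2
```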
bookcorpus
BookCorpus is a dataset consisting of approximately 11,038 self-published books (with 7,185 confirmed unique books) scraped from Smashwords, an indie ebook distribution platform. The corpus contains around 985 million words and was introduced in a 2015 paper by researchers from the University of Toronto and MIT. It was one of the earliest major datasets used for training large language models.
Dataset Size
The dataset comprises approximately 11,038 self-published books (about 7,185 confirmed unique), totaling roughly 985 million words.
Domains
Used By
6 notable models
Key Strengths:
- Narrative Quality: Full-length books provide coherent, narrative text with natural linguistic patterns and context-dependent language.
- Linguistic Diversity: Self-published authors from around the world contributed, providing diverse writing styles and perspectives (despite genre skew).
arxiv
ArXiv is a dataset of scientific preprints from arXiv.org, an open-access repository of academic papers in physics, mathematics, computer science, and other fields. Approximately 56GB of scientific papers are included in The Pile dataset.
Dataset Size
- **Total Size**: ~56GB in The Pile
Domains
Used By
2 notable models
Key Strengths:
- Scientific Authority: Research papers from a well-established preprint repository
- Domain-Specific: Specialized vocabulary and concepts
laion 400m
LAION-400M is an openly available dataset containing 400 million CLIP-filtered image-text pairs. It represents an earlier and smaller version of the LAION dataset series, released before LAION-5B. The dataset consists of images scraped from the web with accompanying captions, filtered using OpenAI's CLIP model.
Dataset Size
- **Total Pairs**: 400 million image-text pairs
Domains
Used By
4 notable models
Key Strengths:
- Open Access Pioneer: First dataset to democratize access to hundreds of millions of image-text pairs for research.
- CLIP Filtering: Established effective filtering methodology using CLIP scores for image-text alignment.
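A sketch of CLIP-score filtering in the spirit described above: compute CLIP image and text embeddings and keep pairs whose cosine similarity clears a threshold. The 0.3 cutoff is the value commonly cited for LAION-400M, but treat both the model choice and the threshold here as assumptions for the example.

```python
# Illustrative CLIP-score filtering: keep an image-text pair only if the cosine
# similarity between CLIP image and text embeddings clears a threshold.
# Model choice and the 0.3 threshold are assumptions for this example.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_similarity(image: Image.Image, caption: str) -> float:
    inputs = processor(text=[caption], images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        image_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
        text_emb = model.get_text_features(input_ids=inputs["input_ids"],
                                           attention_mask=inputs["attention_mask"])
    image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
    return float((image_emb @ text_emb.T).item())

def keep_pair(image: Image.Image, caption: str, threshold: float = 0.3) -> bool:
    return clip_similarity(image, caption) >= threshold
```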
yfcc100m
YFCC100M (Yahoo Flickr Creative Commons 100 Million) is a dataset of ~100 million photos shared on Flickr with Creative Commons licenses. Released by Yahoo Labs in 2014, with an accompanying paper published in 2016, it provides a large-scale collection of publicly licensed images.
Dataset Size
- **Total Images**: ~100 million photos
Domains
Used By
3 notable models
Key Strengths:
- Legal Clarity: Creative Commons licensing makes terms explicit for research and, for many items, commercial use
- Large Scale: 100 million images provide substantial training data
sbu captions
SBU Captions is a dataset of 1 million Flickr image-caption pairs in which the captions were written by the users who uploaded the photos. Released in 2011, it serves as a freely available alternative to manually annotated datasets like MSCOCO for training vision-language models.
Dataset Size
- **Total Pairs**: ~1 million image-caption pairs
Domains
Used By
3 notable models
Key Strengths:
- Open Access: Freely available without licensing restrictions
- User Captions: Natural language descriptions from actual users
codesearchnet
CodeSearchNet is a dataset of function-documentation pairs created by GitHub for code search, code-to-documentation, and documentation-to-code tasks. Released by GitHub in 2019, it covers 6 major programming languages with functions and their documentation.
Dataset Size
- **Total Functions**: ~2 million code functions
Domains
Used By
3 notable models
Key Strengths:
- Documentation: Well-documented functions with meaningful docstrings
- Multi-Language: 6 languages enable diverse learning
humaneval
HumanEval is a benchmark dataset of 164 hand-written Python programming problems created by OpenAI. Released in 2021, it evaluates code generation models on their ability to solve realistic programming tasks with test-driven evaluation.
Dataset Size
- **Total Problems**: 164 programming problems
Domains
Used By
3 notable models
Key Strengths:
- Quality: Hand-written by expert programmers
- Realistic: Problems resemble actual programming tasks
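HumanEval results are usually reported as pass@k: given n sampled solutions per problem of which c pass the unit tests, the unbiased estimator is pass@k = 1 - C(n-c, k) / C(n, k). A minimal implementation:

```python
# Unbiased pass@k estimator used with HumanEval-style evaluation:
# given n samples per problem of which c pass, pass@k = 1 - C(n-c, k)/C(n, k).
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k samples (drawn from n, c correct) passes."""
    if n - c < k:
        return 1.0                       # every size-k subset contains a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

print(pass_at_k(n=20, c=3, k=1))   # expected pass rate with a single sample
print(pass_at_k(n=20, c=3, k=10))  # much higher when 10 samples are allowed
```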
natural questions
Natural Questions is a dataset of 307K question-document-answer triples created by Google from real Google search queries and Wikipedia documents. Introduced in 2019, it represents actual user questions paired with relevant Wikipedia articles and human-annotated answers.
Dataset Size
- **Total Questions**: 307,000 natural questions
Domains
Used By
3 notable models
Key Strengths:
- Authenticity: Real user questions rather than artificially constructed ones
- Human Annotation: Expert-annotated answers
ms marco
MS MARCO (Microsoft Machine Reading Comprehension) is a large-scale dataset of 1+ million real Bing search queries with human-annotated answers and passages. Created by Microsoft researchers in 2016, it represents one of the largest machine reading comprehension datasets.
Dataset Size
- **Total QA Pairs**: 1+ million real search queries
Domains
Used By
3 notable models
Key Strengths:
- Large Scale: 1M+ queries provide substantial training data
- Authentic Queries: Real user search queries
squad
SQuAD (Stanford Question Answering Dataset) is a reading comprehension dataset of 100K question-answer pairs created by Stanford University. Published in 2016, it consists of questions asked by crowdworkers about Wikipedia articles.
Dataset Size
- **Total QA Pairs**: ~100,000 question-answer pairs
Domains
Used By
3 notable models
Key Strengths:
- Influential: Established reading comprehension as a benchmark task
- Quality: Crowdworker-generated questions
triviaqa
TriviaQA is a large-scale reading comprehension dataset of 650K question-answer pairs created by researchers at the University of Washington. It pairs trivia questions with evidence passages drawn from the web and Wikipedia.
Dataset Size
- **Total QA Pairs**: ~650,000 question-answer pairs
Domains
Used By
3 notable models
Key Strengths:
- Large Scale: 650K questions, substantially larger than SQuAD
- Multiple Evidence: Multiple evidence passages enable complex reasoning
openassistant conversations
OpenAssistant Conversations (OASST1) is a dataset of ~161K human-written, human-annotated messages organized into conversation trees, created by the OpenAssistant community. The messages carry quality rankings and span 35 languages.
Dataset Size
- **Total Messages**: ~161,000 messages across ~66,000 conversation trees
Domains
Used By
2 notable models
Key Strengths:
- Multilingual: 35 languages, an unusual breadth for a conversation dataset
- Community Effort: Collaborative annotation and curation
sharegpt
ShareGPT is a dataset of ~90K conversations collected from the ShareGPT website where users share ChatGPT conversations. It represents real user-ChatGPT interactions covering diverse topics and domains.
Dataset Size
- **Total Conversations**: ~90,000 conversations
Domains
Used By
3 notable models
Key Strengths:
- Authentic: Real user-ChatGPT interactions
- Scale: 90K conversations
alpaca
Alpaca is a 52K-example instruction-following dataset created by Stanford researchers as part of the Stanford Alpaca project. It consists of instruction-output pairs generated by OpenAI's text-davinci-003 using the Self-Instruct method, designed to show that smaller models can be fine-tuned to follow instructions nearly as well as much larger models.
Dataset Size
- **Total Samples**: 52,000 instruction-output pairs
Domains
Used By
3 notable models
Key Strengths:
- Pioneering: Early open instruction-following dataset
- Effective: Demonstrates strong instruction-following with smaller models
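Each Alpaca record is an instruction/input/output triple, typically rendered into a fixed prompt template for fine-tuning. The sketch below uses a made-up record and a paraphrase of the widely used template from the Stanford Alpaca release; check the repository for the exact wording.

```python
# Sketch of the Alpaca record structure and a paraphrase of the commonly used
# prompt template. The record values are made up for illustration.
record = {
    "instruction": "Classify the sentiment of the sentence.",
    "input": "I loved the movie, but the ending felt rushed.",
    "output": "Mixed: mostly positive with a negative note about the ending.",
}

PROMPT_WITH_INPUT = (
    "Below is an instruction that describes a task, paired with an input that provides "
    "further context. Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Input:\n{input}\n\n### Response:\n"
)

prompt = PROMPT_WITH_INPUT.format(**record)   # extra "output" key is ignored by format()
target = record["output"]                     # the model is trained to produce this
print(prompt + target)
```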
vicuna conversations
Vicuna Conversations is a dataset of ~70K conversations collected from ShareGPT, a platform where users share their ChatGPT conversations. Created by the LMSYS team at UC Berkeley, it consists of user conversations with ChatGPT covering diverse topics and domains.
Dataset Size
- **Total Conversations**: ~70,000 multi-turn conversations
Domains
Used By
3 notable models
Key Strengths:
- Authenticity: Real user-ChatGPT interactions
- Scale: 70K conversations provide substantial data
ultrachat
UltraChat is a large-scale, diverse, multi-round dialogue dataset containing 1.5 million AI-generated conversations. Created by researchers at Tsinghua University, it aims to provide broad instruction-following and conversational data by using LLMs to generate conversations on roughly 200K distinct topics.
Dataset Size
- **Total Conversations**: ~1.5 million multi-round conversations
Domains
Used By
3 notable models
Key Strengths:
- Scale: 1.5M conversations, substantially larger than alternatives
- Diversity: 200K topics enable broad coverage
wizardlm
WizardLM is a dataset created by Microsoft using the Evol-Instruct method to generate complex and diverse instruction-following data. It contains ~250K evolved instructions designed to progressively increase in complexity and diversity.
Dataset Size
- **Total Samples**: ~250,000 instruction-response pairs
Domains
Used By
3 notable models
Key Strengths:
- Complexity: Progressively complex instructions improve reasoning
- Diversity: Evol-Instruct ensures varied task types
no robots
Dataset Size
Domains
Used By
0 notable models
mathinstruct
Dataset Size
Domains
Used By
0 notable models
metamath
Dataset Size
Domains
Used By
0 notable models
openmathinstruct
Dataset Size
Domains
Used By
0 notable models
gsm8k
Dataset Size
Domains
Used By
0 notable models
math
Dataset Size
Domains
Used By
0 notable models
wikipedia
Wikipedia is a free online encyclopedia created and edited collaboratively by millions of volunteers worldwide. The dataset used for AI training consists of the complete text of Wikipedia articles available in English and multiple other languages. As of 2023, Wikipedia represents a carefully curated collection of general knowledge maintained by the Wikipedia community.
Dataset Size
The English Wikipedia contains approximately 6.8 million articles. For AI training purposes, Wikipedia text is typically extracted from the periodic database dumps published by the Wikimedia Foundation.
Domains
Used By
5 notable models
Key Strengths:
- High-Quality Content: Wikipedia articles are subject to community review, fact-checking, and citation requirements, resulting in generally reliable information.
- Multilingual Coverage: Available in 300+ languages, enabling training of models that understand and generate content across diverse linguistic communities.
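For training pipelines, Wikipedia text is usually taken from preprocessed dump snapshots rather than the live site. A minimal sketch with the `datasets` library; the `wikimedia/wikipedia` repository and the `20231101.en` snapshot name are assumptions, so substitute whichever dump and language you actually need.

```python
# Minimal sketch: stream a preprocessed Wikipedia snapshot via Hugging Face datasets.
# The repo id, snapshot name, and field names are assumptions for this example.
from datasets import load_dataset

wiki = load_dataset("wikimedia/wikipedia", "20231101.en", split="train", streaming=True)

for i, article in enumerate(wiki):
    print(article["title"], "-", article["text"][:120])  # field names are assumptions
    if i == 2:
        break
```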
laion 5b
LAION-5B is an openly available dataset containing 5.85 billion CLIP-filtered image-text pairs, created by the LAION (Large-scale Artificial Intelligence Open Network) research collective. It represents one of the largest multimodal datasets for training vision-language models. The dataset was released in 2022 and consists of images scraped from the web with accompanying alt-text captions.
Dataset Size
- **Total Pairs**: 5.85 billion image-text pairs
Domains
Used By
5 notable models
Key Strengths:
- Unprecedented Scale: 5.85B pairs democratize vision-language model development, enabling open research without corporate resources.
- Open Access: Freely available metadata enables reproducible research and independent model development.
redcaps
RedCaps is a large-scale dataset of 12 million image-text pairs collected from Reddit. Created by researchers at the University of Michigan and Facebook AI Research, it consists of images and captions from Reddit posts across 350+ subreddits. Released in 2021, RedCaps emphasizes human-written captions rather than alt-text.
Dataset Size
- **Total Pairs**: 12 million image-text pairs
Domains
Used By
4 notable models
Key Strengths:
- Human-Written Captions: Genuine human descriptions provide natural language quality superior to alt-text, including context, emotion, and storytelling.
- Community Diversity: 350+ subreddits span diverse visual domains from nature photography to memes to technical images.
conceptual captions
Conceptual Captions is a dataset of ~3.3 million image-text pairs created by Google by filtering and cleaning web-crawled alt-text data. Released in 2018, it aims to serve as a large-scale alternative to manually annotated datasets like MSCOCO, enabling training of vision-language models at scale.
Dataset Size
- **Version 1**: 3.3 million image-alt text pairs
Domains
Used By
3 notable models
Key Strengths:
- Large Scale: 3.3 million pairs (12 million in the follow-up CC12M) enable large-scale model training
- Web-Scale: Captures the diversity of internet images
common crawl
Common Crawl is a nonprofit organization that maintains the largest freely available archive of web crawl data. The organization has been creating massive web crawls since 2008, resulting in an archive exceeding 9.5 petabytes of data as of mid-2023. Common Crawl's mission is to provide raw web data for various research purposes, including AI development, with a deliberate lack of curation to enable open-ended innovation and research for downstream users.
Dataset Size
The Common Crawl corpus contains approximately 9.5+ petabytes of data collected from billions of URLs crawled since 2008.
Domains
Used By
5 notable models
Key Strengths:
- Scale and Accessibility: At 9.5+ petabytes, Common Crawl provides unprecedented scale for training data, freely available to researchers worldwide, democratizing AI development beyond well-resourced companies.
- Diversity: The dataset captures billions of web pages across multiple domains and content types, enabling models to learn from diverse writing styles and topics.
Want to Add a Dataset?
Know of a training dataset that's not listed? Help us expand our directory.
Contact us with dataset details and research links