AI Training Datasets Directory

Comprehensive analysis of 48 training datasets powering the world's leading AI models

Total Datasets

48

Average Quality

5.8/10

Top Domain

general

Browse by Domain

general (13)
code (11)
chat (11)
reasoning (11)
science (10)

All Datasets (48)

redpajama

RedPajama is a project by Together AI to create leading open-source models by reproducing the LLaMA training dataset. RedPajama-Data-1T is a 1.2 trillion token dataset (~2.67TB compressed, ~5TB uncompressed) designed to match the training data composition described in Meta AI's LLaMA paper.

🟢 8.5/10

Dataset Size

**RedPajama-Data-1T**: ~1.2 trillion tokens (~2.67TB compressed, ~5TB uncompressed)

Domains

code
general
science
multilingual

Used By

3 notable models

Key Strengths:

  • Transparent Composition: Clear breakdown of all data sources and proportions
  • Reproducibility: Open-source code enables full reproduction
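
The reproducibility point above is easiest to see in practice: the corpus is published in source-specific subsets that can be streamed instead of downloaded whole. A minimal sketch, assuming the Hugging Face Hub ID `togethercomputer/RedPajama-Data-1T`, an `arxiv` configuration, and a `text` field (check the dataset card; recent `datasets` releases may also require `trust_remote_code=True`):

```python
# Minimal sketch: stream one RedPajama-Data-1T subset rather than downloading ~5TB.
# The Hub ID, "arxiv" config name, and "text" field are assumptions taken from the
# dataset card; newer `datasets` releases may also need trust_remote_code=True.
from datasets import load_dataset

stream = load_dataset(
    "togethercomputer/RedPajama-Data-1T",
    "arxiv",            # one of the source-specific configurations
    split="train",
    streaming=True,     # iterate lazily instead of materializing files on disk
)

for i, record in enumerate(stream):
    print(record["text"][:200])   # each record carries raw text plus metadata
    if i == 2:
        break
```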

the pile

The Pile is a 825.18 GiB (886 GB) diverse, open-source dataset of English text created by EleutherAI in 2020 and publicly released on December 31, 2020. It represents a carefully curated collection of 22 smaller, high-quality datasets specifically selected for training large language models. The Pile was constructed to address the limitations of relying primarily on Common Crawl by introducing a dataset with significant diversity from multiple authoritative sources.

🟢 8/10

Dataset Size

The Pile consists of 825.18 GiB of English text composed of 22 smaller datasets, each assigned a different "epoch" weight that determines how many times it is repeated during training.

Domains

code
general
science
multilingual

Used By

5 notable models

Key Strengths:

  • Deliberate Diversity: Explicitly curated to include diverse content types (academia, code, Q&A, books, web text), improving model generalization.
  • Documented Quality: Each component dataset is thoroughly documented with rationale for inclusion, enabling researchers to understand dataset composition.
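
The "epoch" weighting mentioned under Dataset Size can be illustrated with a small sampler: components with a higher epoch count are simply drawn more often. The component names and weights below are placeholders, not the actual Pile proportions:

```python
# Illustrative sketch of The Pile's "epoch" weighting idea: components assigned a
# higher epoch count are repeated more often when training examples are drawn.
# The weights below are placeholders, not the actual Pile proportions.
import random

component_epochs = {          # hypothetical component -> epoch weight
    "pile_cc": 1.0,
    "pubmed_central": 2.0,
    "arxiv": 2.0,
    "github": 1.0,
}

def sample_component(epochs: dict[str, float]) -> str:
    """Pick a component with probability proportional to its epoch weight."""
    names = list(epochs)
    weights = [epochs[n] for n in names]
    return random.choices(names, weights=weights, k=1)[0]

counts = {name: 0 for name in component_epochs}
for _ in range(10_000):
    counts[sample_component(component_epochs)] += 1
print(counts)  # components with epoch 2.0 appear roughly twice as often
```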

bigcode

BigCode refers to the comprehensive code datasets created by the BigCode Project, an open scientific collaboration for responsible large language model development for code. It includes The Stack and related code compilation datasets totaling multiple terabytes of permissively-licensed source code.

🔵 7.5/10

Dataset Size

- **The Stack v1**: 3TB of deduplicated code

Domains

code

Used By

3 notable models

Key Strengths:

  • Legal Clarity: Permissive licenses eliminate legal concerns
  • Comprehensive: unprecedented coverage of 358+ programming languages

the stack

The Stack is a large-scale dataset of source code created by the BigCode Project, an open scientific collaboration focused on responsible development of large language models for code. The Stack contains over 6TB of permissively-licensed source code files covering 358 programming languages, extracted from public GitHub repositories.

🔵 7/10

Dataset Size

- **Version 1.0**: 3TB of deduplicated code

Domains

code

Used By

3 notable models

Key Strengths:

  • Legal Clarity: Permissive licenses eliminate licensing concerns
  • Comprehensive: 358 languages provide broad coverage
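
The license-based curation behind The Stack can be sketched as a simple allowlist filter; the record fields and license list below are illustrative rather than The Stack's actual schema:

```python
# Sketch of the kind of license allowlisting The Stack applies: keep only files
# whose detected license appears on a permissive allowlist. Record fields and the
# allowlist contents here are illustrative, not The Stack's exact schema.
PERMISSIVE = {"mit", "apache-2.0", "bsd-3-clause", "bsd-2-clause", "isc", "unlicense"}

files = [
    {"path": "a.py", "license": "mit", "content": "print('hi')"},
    {"path": "b.c", "license": "gpl-3.0", "content": "int main(){return 0;}"},
]

kept = [f for f in files if f["license"].lower() in PERMISSIVE]
print([f["path"] for f in kept])  # -> ['a.py']
```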

pubmed central

PubMed Central is a free digital archive of biomedical and life sciences journal literature. Approximately 96.93GB of medical literature is included in The Pile dataset.

🔵 7/10

Dataset Size

- **Total Size**: ~96.93GB in The Pile (2 epochs assigned)

Domains

science

Used By

2 notable models

Key Strengths:

  • Medical Authority: Peer-reviewed medical research
  • Domain Expertise: Specialized medical terminology and concepts

starcoder data

StarCoder Data is the comprehensive code dataset used to train HuggingFace's StarCoder model. Created by the BigCode Project and processed for StarCoder training, it includes carefully selected and processed source code from multiple sources with permissive licensing.

🔵 7/10

Dataset Size

- **Total Data**: The Stack v1 (3TB) processed for model training

Domains

code

Used By

3 notable models

Key Strengths:

  • Quality: Demonstrates effectiveness in StarCoder performance
  • Multi-Language: 80+ language support

orca

Orca is a dataset of 1 million instruction-response pairs created by Microsoft using chain-of-thought reasoning and large language model outputs. It builds on the FLAN task collection with more complex reasoning instructions and detailed step-by-step responses.

🔵 7/10

Dataset Size

- **Total Samples**: ~1 million instruction-response pairs

Domains

chat
reasoning

Used By

3 notable models

Key Strengths:

  • Large Scale: 1 million instruction-response pairs
  • Reasoning: Explicit chain-of-thought in responses

oscar

OSCAR (Open Super-large Crawled ALMAnaCH coRpus) is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus. OSCAR 22.01 (released January 2022) uses the Ungoliant architecture for extraction and filtering, while the 2019 version uses goclassy.

🔵 6.5/10

Dataset Size

- **Languages**: 151-163 different languages available

Domains

multilingual

Used By

3 notable models

Key Strengths:

  • Multilingual: 150+ languages enable global AI development
  • Low-Resource Support: Critical for underrepresented languages
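
OSCAR's core step, language classification of Common Crawl text, is commonly done with fastText's public language-identification model. A hedged sketch, assuming `lid.176.bin` has been downloaded separately and using an arbitrary confidence threshold:

```python
# Sketch of OSCAR-style language classification over web text using fastText's
# public language-identification model. Assumes "lid.176.bin" has been downloaded
# from the fastText website; the confidence threshold is illustrative.
import fasttext

model = fasttext.load_model("lid.176.bin")

def route_by_language(line: str, min_conf: float = 0.8) -> str | None:
    labels, probs = model.predict(line.replace("\n", " "))
    lang = labels[0].replace("__label__", "")      # e.g. "en", "fr", "sw"
    return lang if probs[0] >= min_conf else None  # drop low-confidence lines

print(route_by_language("Ceci est une phrase en français."))  # likely "fr"
```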

webtext

WebText is the original dataset used by OpenAI to train GPT-2. It consists of approximately 40GB of text scraped from webpages curated by humans, specifically URLs shared and upvoted on Reddit.

🔵 6.5/10

Dataset Size

- **Total Size**: ~40GB of text data

Domains

general

Used By

1 notable model

Key Strengths:

  • Quality Signal: Human curation through Reddit upvotes
  • Effective: Produced high-performing GPT-2 model

mbpp

MBPP (Mostly Basic Python Problems) is a benchmark dataset of 1,000 short Python programming problems created by Google. Released in 2021, it serves as a benchmark for evaluating code generation models on programming task completion.

🔵 6.5/10

Dataset Size

- **Total Problems**: 1,000 programming problems

Domains

code

Used By

3 notable models

Key Strengths:

  • Benchmark Standard: Widely-used evaluation metric
  • Test Coverage: Multiple test cases ensure robustness

dolly 15k

Dolly-15k is a dataset of ~15,000 instruction-following examples created by Databricks. It consists of high-quality instruction-output pairs designed to enable open-source models to match commercial models' instruction-following capabilities.

🔵 6.5/10

Dataset Size

- **Total Examples**: ~15,000 instruction-output pairs

Domains

science
chat

Used By

3 notable models

Key Strengths:

  • High Quality: Manually curated examples
  • Small but Effective: 15K sufficient for fine-tuning

c4

C4 (Colossal Clean Crawled Corpus) is a dataset of approximately 750GB of "reasonably clean and natural English text" developed by Google engineers. It was created by taking a single month's scrape of Common Crawl (April 2019) and applying filtering heuristics to remove duplicate, placeholder, nonsensical, and non-English content.

🔵 6/10

Dataset Size

- **Total Size**: ~750GB compressed (305GB deduplicated English)

Domains

general
multilingual

Used By

4 notable models

Key Strengths:

  • Scale and Accessibility: 750GB of publicly available, filtered text
  • Systematic Filtering: Documented heuristics enable reproducibility
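
C4's filtering heuristics can be approximated in a few lines: keep sentence-like lines, drop short or boilerplate-laden pages, and deduplicate by a hash of the normalized text. This is a sketch in the spirit of the published rules, not Google's exact pipeline:

```python
# Sketch of C4-style cleaning heuristics (not Google's exact rules): keep only
# lines that end in terminal punctuation, drop short pages, drop boilerplate
# markers, and deduplicate pages by a hash of their normalized text.
import hashlib

BAD_MARKERS = ("lorem ipsum", "javascript must be enabled", "{")

def clean_page(text: str) -> str | None:
    lines = [l.strip() for l in text.splitlines()]
    lines = [l for l in lines if l.endswith((".", "!", "?", '"'))]   # sentence-like lines only
    cleaned = " ".join(lines)
    if len(cleaned.split()) < 50:                                    # too short to keep
        return None
    if any(m in cleaned.lower() for m in BAD_MARKERS):               # placeholder/boilerplate
        return None
    return cleaned

seen: set[str] = set()

def deduplicate(pages: list[str]) -> list[str]:
    out = []
    for page in pages:
        cleaned = clean_page(page)
        if cleaned is None:
            continue
        digest = hashlib.sha256(cleaned.lower().encode()).hexdigest()
        if digest not in seen:                                       # exact-duplicate filter
            seen.add(digest)
            out.append(cleaned)
    return out
```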

openwebtext

OpenWebText is an open-source replication of the WebText dataset from OpenAI, originally used to train GPT-2. Created by Aaron Gokaslan and Vanya Cohen of Brown University, it consists of text extracted from web pages curated by humans on Reddit.

🔵 6/10

Dataset Size

- **Total Size**: 41.70 GB of extracted text (13.51 GB download; 55.21 GB total on disk)

Domains

general

Used By

3 notable models

Key Strengths:

  • Human Curation: Reddit upvotes provide quality signal superior to raw web text
  • Reproducible: Process openly documented with public code
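
The curation signal both WebText and OpenWebText rely on, Reddit submissions with at least 3 karma, reduces to a simple URL filter. The submission records below are made up; the real pipeline reads them from Reddit data dumps:

```python
# Sketch of the WebText/OpenWebText curation signal: keep only URLs whose Reddit
# submissions earned at least 3 karma. The submission records here are invented;
# the real pipeline pulls them from Reddit data dumps.
MIN_KARMA = 3

submissions = [
    {"url": "https://example.com/deep-dive", "score": 57},
    {"url": "https://example.com/spam", "score": 0},
    {"url": "https://example.com/deep-dive", "score": 12},  # same URL, different post
]

best_score: dict[str, int] = {}
for sub in submissions:
    best_score[sub["url"]] = max(best_score.get(sub["url"], 0), sub["score"])

curated = sorted(url for url, score in best_score.items() if score >= MIN_KARMA)
print(curated)  # -> ['https://example.com/deep-dive']
```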

github code

The GitHub Code dataset refers to source code extracted from public repositories on GitHub. It is commonly included in large-scale AI training datasets through initiatives like CodeSearchNet and The Stack. The dataset consists of source code files covering multiple programming languages, paired with documentation and metadata from GitHub repositories.

🔵 6/10

Dataset Size

GitHub Code compilations vary in size by collection; they typically include source files, documentation, and repository metadata across many programming languages.

Domains

code

Used By

5 notable models

Key Strengths:

  • Real-World Relevance: Uses actual production code, improving practical applicability
  • Multi-Language: Covers diverse programming languages

stackoverflow

StackOverflow is a community-driven question and answer platform for programmers. It is included in datasets like The Pile and RedPajama as a source of high-quality programming knowledge and practical code solutions.

🔵 6/10

Dataset Size

- **Content**: Programming questions, answers, and code snippets

Domains

code

Used By

3 notable models

Key Strengths:

  • Practical Knowledge: Real-world programming solutions
  • Community Quality: Upvotes and accepted answers provide quality signals
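
Turning those community quality signals into a filter is straightforward; a sketch that keeps question-answer pairs whose answer is accepted or clears a score threshold (the records and threshold are illustrative, and real pipelines read the Stack Exchange data dump):

```python
# Sketch of using StackOverflow's community signals as a quality filter: keep
# question/answer pairs where the answer is accepted or well upvoted.
def keep_pair(question: dict, answer: dict, min_score: int = 5) -> bool:
    return answer.get("is_accepted", False) or answer.get("score", 0) >= min_score

pairs = [
    ({"title": "How do I reverse a list?"}, {"score": 42, "is_accepted": True}),
    ({"title": "Why is my code broken?"},   {"score": -2, "is_accepted": False}),
]
print([q["title"] for q, a in pairs if keep_pair(q, a)])  # keeps only the first pair
```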

codeparrot

CodeParrot is a large-scale dataset of source code designed specifically for training code generation models. It consists of permissively-licensed code from various sources including GitHub.

🔵 6/10

Dataset Size

- **Content**: Source code files with various licenses

Domains

code

Used By

2 notable models

Key Strengths:

  • Code-Specific: Designed explicitly for code models
  • Legal Clarity: Permissive licenses

cc 100

CC-100 is a multilingual dataset extracted from Common Crawl, containing 100+ languages processed for language model training. It represents a language-specific approach to filtering Common Crawl.

🔵 6/10

Dataset Size

- **Languages**: 100+ languages

Domains

multilingual

Used By

2 notable models

Key Strengths:

  • Multilingual: 100+ languages
  • Comprehensive: Broad language coverage

roots

ROOTS is a large-scale multilingual dataset created by the BigScience workshop. It encompasses text data in multiple languages specifically curated for training large multilingual language models.

🔵 6/10

Dataset Size

- **Languages**: Multiple languages represented

Domains

multilingual

Used By

2 notable models

Key Strengths:

  • Multilingual: Supports many languages
  • Equitable: Emphasis on low-resource languages

datacomp

DataComp is a benchmark and framework for comparing visual datasets used to train vision models. Created by a multi-institution research collaboration in 2023, it provides standardized evaluation of how dataset composition affects model performance.

🔵 6/10

Dataset Size

- **Pool**: 12.8 billion image-text pairs available for selection

Domains

science
vision

Used By

3 notable models

Key Strengths:

  • Benchmarking: Standardized framework for dataset comparison
  • Efficiency: Demonstrates data selection can improve efficiency

slimorca

SlimOrca is a reduced and deduplicated version of the Orca dataset, containing ~500K high-quality instruction-response pairs. Created to provide a more focused subset of Orca with emphasis on quality and diversity.

🔵 6/10

Dataset Size

- **Total Samples**: ~500,000 instruction-response pairs

Domains

chat
reasoning

Used By

2 notable models

Key Strengths:

  • Quality Focused: High-quality examples selected from Orca
  • Deduplicated: Removes redundancy improving training efficiency

bookcorpus

BookCorpus is a dataset consisting of approximately 11,038 self-published books (with 7,185 confirmed unique books) scraped from Smashwords, an indie ebook distribution platform. The corpus contains around 985 million words and was introduced in a 2015 paper by researchers from the University of Toronto and MIT. It was one of the earliest major datasets used for training large language models.

🟡 5.5/10

Dataset Size

The dataset comprises approximately 11,038 books (about 7,185 confirmed unique) totaling roughly 985 million words.

Domains

general

Used By

6 notable models

Key Strengths:

  • Narrative Quality: Full-length books provide coherent, narrative text with natural linguistic patterns and context-dependent language.
  • Linguistic Diversity: Self-published authors from around the world contributed, providing diverse writing styles and perspectives (despite genre skew).

arxiv

ArXiv is a dataset of scientific preprints from arXiv.org, an open-access repository of academic papers in physics, mathematics, computer science, and other fields. Approximately 56GB of scientific papers are included in The Pile dataset.

🟡 5.5/10

Dataset Size

- **Total Size**: ~56GB in The Pile

Domains

science
reasoning

Used By

2 notable models

Key Strengths:

  • Scientific Authority: Peer-reviewed content from established repository
  • Domain-Specific: Specialized vocabulary and concepts

laion 400m

LAION-400M is an openly available dataset containing 400 million CLIP-filtered image-text pairs. It represents an earlier and smaller version of the LAION dataset series, released before LAION-5B. The dataset consists of images scraped from the web with accompanying captions, filtered using OpenAI's CLIP model.

🟡 5.5/10

Dataset Size

- **Total Pairs**: 400 million image-text pairs

Domains

vision

Used By

4 notable models

Key Strengths:

  • Open Access Pioneer: First dataset to democratize access to hundreds of millions of image-text pairs for research.
  • CLIP Filtering: Established an effective filtering methodology using CLIP scores for image-text alignment (see the sketch below).
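
The CLIP filtering step can be sketched with the Hugging Face `transformers` CLIP implementation: embed image and caption, then keep the pair only if cosine similarity clears a threshold. The model ID and threshold below are common choices, not necessarily LAION's exact settings:

```python
# Sketch of CLIP-score filtering as used to assemble LAION-style datasets: embed
# the image and its caption with CLIP and keep the pair only if their cosine
# similarity clears a threshold.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_similarity(image: Image.Image, caption: str) -> float:
    inputs = processor(text=[caption], images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    return float((img * txt).sum())

def keep_pair(image: Image.Image, caption: str, threshold: float = 0.28) -> bool:
    # threshold is an illustrative value; tune against held-out pairs
    return clip_similarity(image, caption) >= threshold
```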

yfcc100m

YFCC100M (Yahoo Flickr Creative Commons 100 Million) is a dataset of ~100 million photos shared on Flickr with Creative Commons licenses. Released by Yahoo Labs and research partners and described in a 2016 paper, it provides a large-scale collection of publicly licensed images.

🟡 5.5/10

Dataset Size

- **Total Images**: ~100 million photos

Domains

vision

Used By

3 notable models

Key Strengths:

  • Legal Clarity: Creative Commons licensing clear for commercial and research use
  • Large Scale: 100 million images provide substantial training data

sbu captions

SBU Captions is a dataset of 1 million user-captioned image-caption pairs from Flickr. Released in 2011, it serves as a freely available alternative to manually annotated datasets like MSCOCO for training vision-language models.

🟡 5.5/10

Dataset Size

- **Total Pairs**: ~1 million image-caption pairs

Domains

science
vision

Used By

3 notable models

Key Strengths:

  • Open Access: Freely available without licensing restrictions
  • User Captions: Natural language descriptions from actual users

codesearchnet

CodeSearchNet is a dataset of code-function and documentation pairs created by GitHub and collected for code search, code-to-documentation, and documentation-to-code tasks. Released by GitHub in 2019, it includes 6 major programming languages with functions and their documentation.

🟡 5.5/10

Dataset Size

- **Total Functions**: ~2 million code functions

Domains

code
reasoning

Used By

3 notable models

Key Strengths:

  • Documentation: Well-documented functions with meaningful docstrings
  • Multi-Language: 6 languages enable diverse learning

humaneval

HumanEval is a benchmark dataset of 164 hand-written Python programming problems created by OpenAI. Released in 2021, it evaluates code generation models on their ability to solve realistic programming tasks with test-driven evaluation.

🟡 5.5/10

Dataset Size

- **Total Problems**: 164 programming problems

Domains

code
reasoning

Used By

3 notable models

Key Strengths:

  • Quality: Hand-written by expert programmers
  • Realistic: Problems resemble actual programming tasks
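
Test-driven evaluation in the HumanEval style boils down to executing a model completion against hidden unit tests and counting it as a pass only if nothing raises. The problem below is a made-up example, not one of the 164 tasks, and real harnesses sandbox this step:

```python
# Sketch of HumanEval-style functional evaluation: a completion passes only if the
# problem's unit tests run without raising. Real harnesses sandbox this step; the
# problem below is illustrative, not one of the 164 HumanEval tasks.
candidate = '''
def add(a, b):
    return a + b
'''

tests = '''
assert add(2, 3) == 5
assert add(-1, 1) == 0
'''

def passes(candidate_src: str, test_src: str) -> bool:
    namespace: dict = {}
    try:
        exec(candidate_src, namespace)   # define the candidate function
        exec(test_src, namespace)        # run the hidden unit tests
        return True
    except Exception:
        return False

print(passes(candidate, tests))  # -> True
```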

natural questions

Natural Questions is a dataset of 307K question-document-answer triples created by Google from real Google search queries and Wikipedia documents. Introduced in 2019, it represents actual user questions paired with relevant Wikipedia articles and human-annotated answers.

🟡 5.5/10

Dataset Size

- **Total Questions**: 307,000 natural questions

Domains

chat
reasoning

Used By

3 notable models

Key Strengths:

  • Authenticity: Real user questions vs. artificial
  • Human Annotation: Expert-annotated answers

ms marco

MS MARCO (Microsoft Machine Reading Comprehension) is a large-scale dataset of 1+ million real Bing search queries with human-annotated answers and passages. Created by Microsoft researchers in 2016, it represents one of the largest machine reading comprehension datasets.

🟡 5.5/10

Dataset Size

- **Total QA Pairs**: 1+ million real search queries

Domains

reasoning

Used By

3 notable models

Key Strengths:

  • Large Scale: 1M+ queries provide substantial training data
  • Authentic Queries: Real user search queries

squad

SQuAD (Stanford Question Answering Dataset) is a reading comprehension dataset of 100K question-answer pairs created by Stanford University. Published in 2016, it consists of questions asked by crowdworkers about Wikipedia articles.

🟡 5.5/10

Dataset Size

- **Total QA Pairs**: ~100,000 question-answer pairs

Domains

reasoning

Used By

3 notable models

Key Strengths:

  • Influential: Established reading comprehension as benchmark task
  • Quality: Crowdworker-generated questions

triviaqa

TriviaQA is a large-scale reading comprehension dataset of 650K question-answer pairs created by researchers from University of Washington. It combines web and Wikipedia passages with trivia questions and evidence passages.

🟡 5.5/10

Dataset Size

- **Total QA Pairs**: ~650,000 question-answer pairs

Domains

reasoning

Used By

3 notable models

Key Strengths:

  • Large Scale: 650K questions substantially larger than SQuAD
  • Multiple Evidence: Multiple passages enable complex reasoning

openassistant conversations

OpenAssistant (OASST) Conversations is a dataset of ~161K human-annotated messages organized into conversation trees, created by the OpenAssistant community. Contributions carry quality rankings and span 35+ languages.

🟡 5.5/10

Dataset Size

- **Total Messages**: ~161,000 messages across conversation trees

Domains

multilingual
chat

Used By

2 notable models

Key Strengths:

  • Multilingual: 35+ languages, unusually broad coverage for a conversation dataset
  • Community Effort: Collaborative annotation and curation

sharegpt

ShareGPT is a dataset of ~90K conversations collected from the ShareGPT website where users share ChatGPT conversations. It represents real user-ChatGPT interactions covering diverse topics and domains.

🟡 5.5/10

Dataset Size

- **Total Conversations**: ~90,000 conversations

Domains

chat

Used By

3 notable models

Key Strengths:

  • Authentic: Real user-ChatGPT interactions
  • Scale: 90K conversations

alpaca

Alpaca is a 52K instruction-following dataset created by Stanford researchers as part of the Stanford Alpaca project. It consists of instruction-output pairs generated by text-davinci-003 (GPT-3.5) using the self-instruct method, designed to enable smaller models to follow instructions as effectively as larger ones.

🟡 5.5/10

Dataset Size

- **Total Samples**: 52,000 instruction-output pairs

Domains

science
chat

Used By

3 notable models

Key Strengths:

  • Pioneering: Early open instruction-following dataset
  • Effective: Demonstrates strong instruction-following with smaller models
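
Alpaca records are usually serialized into a fixed prompt template before fine-tuning. The sketch below mirrors the template published with the Stanford Alpaca repository, but verify the exact wording against the repo before relying on it:

```python
# Sketch of formatting an Alpaca-style record into a fine-tuning prompt. The
# wording mirrors the template published with the Stanford Alpaca repo; confirm
# the exact text against the repository before use.
def format_alpaca(example: dict) -> str:
    if example.get("input"):
        return (
            "Below is an instruction that describes a task, paired with an input "
            "that provides further context. Write a response that appropriately "
            "completes the request.\n\n"
            f"### Instruction:\n{example['instruction']}\n\n"
            f"### Input:\n{example['input']}\n\n"
            f"### Response:\n{example['output']}"
        )
    return (
        "Below is an instruction that describes a task. Write a response that "
        "appropriately completes the request.\n\n"
        f"### Instruction:\n{example['instruction']}\n\n"
        f"### Response:\n{example['output']}"
    )

print(format_alpaca({"instruction": "List three primes.", "input": "", "output": "2, 3, 5"}))
```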

vicuna conversations

Vicuna Conversations is a dataset of ~70K conversations collected from ChatGPT sharing platform ShareGPT. Created by the LMSys team at UC Berkeley, it consists of user conversations with ChatGPT covering diverse topics and domains.

🟡 5.5/10

Dataset Size

- **Total Conversations**: ~70,000 multi-turn conversations

Domains

chat

Used By

3 notable models

Key Strengths:

  • Authenticity: Real user-ChatGPT interactions
  • Scale: 70K conversations provide substantial data

ultrachat

UltraChat is a large-scale, diverse, multi-round dialogue dataset containing 1.5 million AI-generated conversations. Created by researchers at Tsinghua University, it provides instruction-following and conversational data by using LLMs to generate conversations on roughly 200K distinct topics.

🟡 5.5/10

Dataset Size

- **Total Conversations**: ~1.5 million multi-round conversations

Domains

chat
reasoning

Used By

3 notable models

Key Strengths:

  • Scale: 1.5M conversations, substantially larger than comparable dialogue datasets
  • Diversity: 200K topics enable broad coverage

wizardlm

WizardLM is a dataset created by Microsoft using the Evol-Instruct method to generate complex and diverse instruction-following data. It contains ~250K evolved instructions designed to progressively increase in complexity and diversity.

🟡 5.5/10

Dataset Size

- **Total Samples**: ~250,000 instruction-response pairs

Domains

chat
reasoning

Used By

3 notable models

Key Strengths:

  • Complexity: Progressively complex instructions improve reasoning
  • Diversity: Evol-Instruct ensures varied task types
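
The Evol-Instruct loop can be sketched as repeated rewriting of an instruction by an LLM. `complete` below is a hypothetical stand-in for whatever LLM client is available, and the prompt is a paraphrase rather than Microsoft's exact template:

```python
# Sketch of the Evol-Instruct idea behind WizardLM: repeatedly ask an LLM to
# rewrite an instruction so it becomes more complex. `complete` is a hypothetical
# placeholder for an LLM client; the prompt paraphrases the published approach.
EVOLVE_PROMPT = (
    "Rewrite the following instruction so it is more complex, for example by "
    "adding constraints, requiring multiple steps, or deepening the reasoning, "
    "while keeping it answerable:\n\n{instruction}"
)

def complete(prompt: str) -> str:
    raise NotImplementedError("plug in your LLM client here")

def evolve(instruction: str, rounds: int = 3) -> list[str]:
    history = [instruction]
    for _ in range(rounds):
        history.append(complete(EVOLVE_PROMPT.format(instruction=history[-1])))
    return history  # seed instruction plus progressively harder variants
```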

no robots

🟡 5.5/10

Dataset Size

Domains

general

Used By

0 notable models

mathinstruct

🟡 5.5/10

Dataset Size

Domains

general

Used By

0 notable models

metamath

🟡 5.5/10

Dataset Size

Domains

general

Used By

0 notable models

openmathinstruct

🟡 5.5/10

Dataset Size

Domains

general

Used By

0 notable models

gsm8k

🟡 5.5/10

Dataset Size

Domains

general

Used By

0 notable models

math

🟡 5.5/10

Dataset Size

Domains

general

Used By

0 notable models

wikipedia

Wikipedia is a free online encyclopedia created and edited collaboratively by millions of volunteers worldwide. The dataset used for AI training consists of the complete text of Wikipedia articles available in English and multiple other languages. As of 2023, Wikipedia represents a carefully curated collection of general knowledge maintained by the Wikipedia community.

🟡 5/10

Dataset Size

The English Wikipedia contains approximately 6.8 million articles. For AI training purposes, Wikipedia is typically processed from periodic database dumps into cleaned plain-text articles.

Domains

science
multilingual

Used By

5 notable models

Key Strengths:

  • High-Quality Content: Wikipedia articles are subject to community review, fact-checking, and citation requirements, resulting in generally reliable information.
  • Multilingual Coverage: Available in 300+ languages, enabling training of models that understand and generate content across diverse linguistic communities.

laion 5b

LAION-5B (Large-scale Artificial Intelligence Open Network) is an openly available dataset containing 5.85 billion CLIP-filtered image-text pairs. Created by the LAION research collective, it represents one of the largest multimodal datasets for training vision-language models. The dataset was released in 2022 and consists of images scraped from the web with accompanying alt-text captions.

🟡 5/10

Dataset Size

- **Total Pairs**: 5.85 billion image-text pairs

Domains

vision

Used By

5 notable models

Key Strengths:

  • Unprecedented Scale: 5.85B pairs democratizes vision-language model development, enabling open research without corporate resources.
  • Open Access: Freely available metadata enables reproducible research and independent model development.

redcaps

RedCaps is a large-scale dataset of 12 million image-text pairs collected from Reddit. Created by researchers at the University of Michigan and Facebook AI Research, it consists of images and captions from Reddit posts across 350+ subreddits. Released in 2021, RedCaps emphasizes human-written captions rather than alt-text.

🟡 5/10

Dataset Size

- **Total Pairs**: 12 million image-text pairs

Domains

vision
chat

Used By

4 notable models

Key Strengths:

  • Human-Written Captions: Genuine human descriptions provide natural language quality superior to alt-text, including context, emotion, and storytelling.
  • Community Diversity: 350+ subreddits span diverse visual domains from nature photography to memes to technical images.

conceptual captions

Conceptual Captions is a dataset of ~3.3 million image-text pairs created by Google by filtering and cleaning web-crawled image alt-text data. Released in 2018, it aims to serve as a large-scale alternative to manually annotated datasets like MSCOCO, enabling training of vision-language models at scale.

🟡 4.5/10

Dataset Size

- **Version 1**: 3.3 million image-alt text pairs

Domains

vision

Used By

3 notable models

Key Strengths:

  • Large Scale: 3.3-12 million pairs enable large-scale model training
  • Web-Scale: Captures diversity of internet images

common crawl

Common Crawl is a nonprofit organization that maintains the largest freely available archive of web crawl data. The organization has been creating massive web crawls since 2008, resulting in an archive exceeding 9.5 petabytes of data as of mid-2023. Common Crawl's mission is to provide raw web data for various research purposes, including AI development, with a deliberate lack of curation to enable open-ended innovation and research for downstream users.

🔴 2.5/10

Dataset Size

The Common Crawl corpus contains approximately 9.5+ petabytes of data collected from billions of URLs.

Domains

general
science

Used By

5 notable models

Key Strengths:

  • Scale and Accessibility: At 9.5+ petabytes, Common Crawl provides unprecedented scale for training data, freely available to researchers worldwide, democratizing AI development beyond well-resourced companies.
  • Diversity: The dataset captures billions of web pages across multiple domains and content types, enabling models to learn from diverse writing styles and topics.
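
Common Crawl is distributed as WARC archives, which can be read with the `warcio` library. A minimal sketch, with a placeholder file name (real paths are listed in each crawl's `warc.paths.gz` index on the Common Crawl site):

```python
# Sketch of reading one Common Crawl WARC file with warcio. The file path is a
# placeholder; real WARC paths come from each crawl's warc.paths.gz index.
from warcio.archiveiterator import ArchiveIterator

def iter_html_pages(warc_path: str, limit: int = 5):
    with open(warc_path, "rb") as stream:
        for record in ArchiveIterator(stream):
            if record.rec_type != "response":          # skip request/metadata records
                continue
            url = record.rec_headers.get_header("WARC-Target-URI")
            html = record.content_stream().read()      # raw bytes of the fetched page
            yield url, html
            limit -= 1
            if limit == 0:
                return

for url, html in iter_html_pages("CC-MAIN-example.warc.gz"):
    print(url, len(html))
```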

Want to Add a Dataset?

Know of a training dataset that's not listed? Help us expand our directory.

Contact us with dataset details and research links