👋 Hi, I'm pppop7

I'm working on Vision-Language Models and Video Research.

📚 My LLaVA Training Datasets

A complete collection of datasets for training LLaVA (Large Language and Vision Assistant) and Video-LLM models.

🎯 Quick Overview

Dataset	Type	Size	Format	Purpose
LLaVA-Pretrain	Image-Text	558K pairs	JSON + ZIP	Pretraining
LLaVA-Instruct-150K	Instruction	150K conversations	JSON	Instruction Tuning
textvqa	VQA	~35K	Parquet	Text Reading in Images
GQA	VQA	~22M Q&A	Parquet	Visual Reasoning
coco_train_2017	Images	118K images	ZIP (~18GB)	General Images
VisualGenome_VG_100K_1_and_2	Images	216K images	ZIP (~15GB)	Scene Understanding
OCR-VQA	VQA	~200K	Parquet	OCR Question Answering

📥 Quick Download Guide

For ZIP datasets (COCO, Visual Genome)

from huggingface_hub import hf_hub_download

# Download COCO train2017
hf_hub_download(
    repo_id="pppop7/coco_train_2017",
    repo_type="dataset",
    filename="train2017.zip",
    local_dir="./data/coco"
)

# Download Visual Genome
hf_hub_download(
    repo_id="pppop7/VisualGenome_VG_100K_1_and_2",
    repo_type="dataset",
    filename="images.zip",
    local_dir="./data/vg"
)

For Parquet datasets (TextVQA, GQA, OCR-VQA)

from datasets import load_dataset

# Load TextVQA
textvqa = load_dataset("pppop7/textvqa")

# Load GQA
gqa = load_dataset("pppop7/GQA")

# Load OCR-VQA
ocr_vqa = load_dataset("pppop7/OCR-VQA")

Download All at Once

from huggingface_hub import snapshot_download

datasets = [
    "pppop7/LLaVA-Pretrain",
    "pppop7/LLaVA-Instruct-150K",
    "pppop7/textvqa",
    "pppop7/GQA",
    "pppop7/coco_train_2017",
    "pppop7/VisualGenome_VG_100K_1_and_2",
    "pppop7/OCR-VQA",
]

for ds in datasets:
    snapshot_download(repo_id=ds, repo_type="dataset", local_dir=f"./data/{ds.split('/')[-1]}")

🗂️ Recommended Directory Structure

After downloading, organize your data like this:

data/
├── llava/
│   ├── LLaVA-Pretrain/
│   │   ├── blip_laion_cc_sbu_558k.json
│   │   └── images/
│   └── LLaVA-Instruct-150K/
│       └── llava_v1_5_mix665k.json
├── coco/
│   └── train2017/          # Extracted from ZIP
├── vg/
│   ├── VG_100K/            # Extracted from images.zip
│   └── VG_100K_2/          # Extracted from images2.zip
├── textvqa/                # Parquet files
├── gqa/                    # Parquet files
└── ocr_vqa/                # Parquet files

📊 Dataset Details

1. LLaVA-Pretrain

Purpose: Stage 1 pretraining for vision-language alignment
Content: 558K image-caption pairs from BLIP
Files: blip_laion_cc_sbu_558k.json + images.zip

2. LLaVA-Instruct-150K

Purpose: Stage 2 instruction tuning
Content: 150K visual instruction conversations
Includes: Complex reasoning, detailed descriptions, conversations

3. TextVQA

Purpose: Text reading in images
Content: Images containing text + questions about the text
Format: Parquet with embedded images

4. GQA

Purpose: Visual reasoning and compositional questions
Content: ~22M question-answer pairs
Format: Parquet with embedded images

5. COCO train2017

Purpose: General image understanding
Content: 118K diverse images
Format: ZIP archive

6. Visual Genome

Purpose: Dense scene understanding
Content: 216K images with rich annotations
Format: Two ZIP archives (VG_100K + VG_100K_2)

7. OCR-VQA

Purpose: Reading and understanding text in book covers
Content: ~200K VQA pairs
Format: Parquet with embedded images

🎬 Coming Soon: Video Datasets

Stay tuned for video-related datasets for Video-LLM research!

📬 Contact

Feel free to reach out if you have questions about these datasets!

📜 License

Each dataset follows its original license. Please check individual dataset pages for details.

COCO: CC BY 4.0
Visual Genome: CC BY 4.0
GQA: CC BY 4.0
TextVQA: CC BY 4.0
OCR-VQA: Please check original source

Downloads last month: -; Downloads are not tracked for this model. How to track

Inference Providers NEW

This model isn't deployed by any Inference Provider. 🙋 Ask for provider support