Hi, I'm pppop7
I'm working on Vision-Language Models and Video Research.
My LLaVA Training Datasets
A complete collection of datasets for training LLaVA (Large Language and Vision Assistant) and Video-LLM models.
Quick Overview
| Dataset | Type | Size | Format | Purpose |
|---|---|---|---|---|
| LLaVA-Pretrain | Image-Text | 558K pairs | JSON + ZIP | Pretraining |
| LLaVA-Instruct-150K | Instruction | 150K conversations | JSON | Instruction Tuning |
| textvqa | VQA | ~35K | Parquet | Text Reading in Images |
| GQA | VQA | ~22M Q&A | Parquet | Visual Reasoning |
| coco_train_2017 | Images | 118K images | ZIP (~18GB) | General Images |
| VisualGenome_VG_100K_1_and_2 | Images | 216K images | ZIP (~15GB) | Scene Understanding |
| OCR-VQA | VQA | ~200K | Parquet | OCR Question Answering |
Quick Download Guide
For ZIP datasets (COCO, Visual Genome)
from huggingface_hub import hf_hub_download

# Download COCO train2017
hf_hub_download(
    repo_id="pppop7/coco_train_2017",
    repo_type="dataset",
    filename="train2017.zip",
    local_dir="./data/coco",
)
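# Download Visual Genome (two archives: images.zip -> VG_100K, images2.zip -> VG_100K_2)
for filename in ("images.zip", "images2.zip"):
    hf_hub_download(
        repo_id="pppop7/VisualGenome_VG_100K_1_and_2",
        repo_type="dataset",
        filename=filename,
        local_dir="./data/vg",
    )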
For Parquet datasets (TextVQA, GQA, OCR-VQA)
from datasets import load_dataset
# Load TextVQA
textvqa = load_dataset("pppop7/textvqa")
# Load GQA
gqa = load_dataset("pppop7/GQA")
# Load OCR-VQA
ocr_vqa = load_dataset("pppop7/OCR-VQA")
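After loading, it can help to confirm which splits came back and how many rows each one holds. The following is a minimal sketch, assuming the loaded object behaves like a mapping from split name to a sized dataset (as a `DatasetDict` does). The helper name `split_sizes` is hypothetical and not part of any library.

```python
from typing import Dict, Mapping, Sized


def split_sizes(dataset_dict: Mapping[str, Sized]) -> Dict[str, int]:
    """Return the number of rows in each split of a loaded dataset."""
    return {name: len(split) for name, split in dataset_dict.items()}


# Example (requires network access and the `datasets` package):
# from datasets import load_dataset
# print(split_sizes(load_dataset("pppop7/textvqa")))
```

The split names you see depend on how each repository is organized, so treat the output as a sanity check rather than a fixed contract.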
Download All at Once
from huggingface_hub import snapshot_download
datasets = [
"pppop7/LLaVA-Pretrain",
"pppop7/LLaVA-Instruct-150K",
"pppop7/textvqa",
"pppop7/GQA",
"pppop7/coco_train_2017",
"pppop7/VisualGenome_VG_100K_1_and_2",
"pppop7/OCR-VQA",
]
for ds in datasets:
    snapshot_download(repo_id=ds, repo_type="dataset", local_dir=f"./data/{ds.split('/')[-1]}")
Recommended Directory Structure
After downloading, organize your data like this:
data/
├── llava/
│   ├── LLaVA-Pretrain/
│   │   ├── blip_laion_cc_sbu_558k.json
│   │   └── images/
│   └── LLaVA-Instruct-150K/
│       └── llava_v1_5_mix665k.json
├── coco/
│   └── train2017/          # Extracted from ZIP
├── vg/
│   ├── VG_100K/            # Extracted from images.zip
│   └── VG_100K_2/          # Extracted from images2.zip
├── textvqa/                # Parquet files
├── gqa/                    # Parquet files
└── ocr_vqa/                # Parquet files
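The ZIP-based datasets need to be unpacked to reach this layout. Here is a minimal sketch using only the Python standard library. It assumes each archive already contains its top-level folder (for example `train2017/`), which is not confirmed here, so check the extracted result. The helper name `extract_archive` is hypothetical.

```python
import zipfile
from pathlib import Path


def extract_archive(zip_path: str, target_dir: str) -> int:
    """Extract a ZIP archive into target_dir and return the number of entries."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(target)
        return len(archive.namelist())


# Example, assuming the archives sit where the download guide placed them:
# extract_archive("./data/coco/train2017.zip", "./data/coco")
# extract_archive("./data/vg/images.zip", "./data/vg")
```

The returned entry count is a quick way to spot a truncated download before starting a training run.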
Dataset Details
1. LLaVA-Pretrain
- Purpose: Stage 1 pretraining for vision-language alignment
- Content: 558K image-caption pairs (a LAION/CC/SBU subset with BLIP-generated captions)
- Files: blip_laion_cc_sbu_558k.json + images.zip
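Once the JSON and the extracted images are in place, a quick consistency check can catch missing files early. This is a sketch under the assumption that the JSON is a list of records, each with an `image` field holding a path relative to the image folder. The helper name `find_missing_images` is hypothetical.

```python
import json
from pathlib import Path
from typing import List


def find_missing_images(json_path: str, image_root: str) -> List[str]:
    """Return the relative image paths listed in the JSON that are absent on disk."""
    with open(json_path, encoding="utf-8") as f:
        records = json.load(f)
    root = Path(image_root)
    missing = []
    for record in records:
        rel = record.get("image")  # records without an "image" field are skipped
        if rel and not (root / rel).is_file():
            missing.append(rel)
    return missing


# Example, assuming the recommended directory structure:
# find_missing_images(
#     "./data/llava/LLaVA-Pretrain/blip_laion_cc_sbu_558k.json",
#     "./data/llava/LLaVA-Pretrain/images",
# )
```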
2. LLaVA-Instruct-150K
- Purpose: Stage 2 instruction tuning
- Content: 150K visual instruction conversations
- Includes: Complex reasoning, detailed descriptions, conversations
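To work with the conversations directly, it helps to pair each prompt with its reply. The sketch below assumes the commonly used LLaVA record layout, where each record has a `conversations` list whose items carry a `from` field (`human` or `gpt`) and a `value` field. Verify this against the actual file before relying on it. The helper name `iter_turn_pairs` is hypothetical.

```python
from typing import Dict, Iterator, List, Tuple


def iter_turn_pairs(record: Dict) -> Iterator[Tuple[str, str]]:
    """Yield (human prompt, assistant reply) pairs from one conversation record."""
    turns: List[Dict[str, str]] = record.get("conversations", [])
    # Walk the turns two at a time: even positions are prompts, odd are replies.
    for first, second in zip(turns[::2], turns[1::2]):
        if first.get("from") == "human" and second.get("from") == "gpt":
            yield first["value"], second["value"]
```

Pairs that do not follow the human-then-gpt order are skipped rather than raising, which keeps a scan over a large file from stopping on a single malformed record.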
3. TextVQA
- Purpose: Text reading in images
- Content: Images containing text + questions about the text
- Format: Parquet with embedded images
4. GQA
- Purpose: Visual reasoning and compositional questions
- Content: ~22M question-answer pairs
- Format: Parquet with embedded images
5. COCO train2017
- Purpose: General image understanding
- Content: 118K diverse images
- Format: ZIP archive
6. Visual Genome
- Purpose: Dense scene understanding
- Content: 216K images with rich annotations
- Format: Two ZIP archives (VG_100K + VG_100K_2)
7. OCR-VQA
- Purpose: Reading and understanding text on book covers
- Content: ~200K VQA pairs
- Format: Parquet with embedded images
Coming Soon: Video Datasets
Stay tuned for video-related datasets for Video-LLM research!
Contact
Feel free to reach out if you have questions about these datasets!
License
Each dataset follows its original license. Please check individual dataset pages for details.
- COCO: CC BY 4.0
- Visual Genome: CC BY 4.0
- GQA: CC BY 4.0
- TextVQA: CC BY 4.0
- OCR-VQA: Please check original source