TL;DR: This study builds a large Japanese web corpus from the Common Crawl archive and demonstrates its effectiveness through continual pre-training of Llama 2 7B, 13B, and 70B, Mistral 7B v0.1, and Mixtral 8x7B.

Abstract: Open Japanese large language models (LLMs) have been trained on the Japanese portions of corpora such as CC-100, mC4, and OSCAR. However, these corpora were not created for the quality