Manga Generation via Layout-controllable Diffusion

Siyu Chen
Dengjie Li
Zenghao Bao
Yao Zhou
Lingfeng Tan
Yujie Zhong
Zheng Zhao

Meituan Inc.


ArXiv | Code | BibTeX


Abstract

Generating comics from text has been widely studied. However, few studies address generating multi-panel manga (Japanese comics) solely from plain text. A manga page contains multiple panels and must exhibit coherent storytelling, reasonable and diverse page layouts, consistent characters, and semantic correspondence between panel drawings and panel scripts, which makes manga generation a significant challenge. This paper presents the manga generation task and constructs the Manga109Story dataset for studying manga generation solely from plain text. Additionally, we propose MangaDiffusion to facilitate intra-panel and inter-panel information interaction during the manga generation process. The results show that our method excels at controlling the number of panels and producing reasonable and diverse page layouts. Our approach has the potential to convert a large number of textual stories into more engaging manga, suggesting broad application prospects.


Manga109Story Dataset

We construct the Manga109Story dataset by building on existing community work and leveraging the capabilities of a multimodal large language model (MLLM). The Manga109 dataset provides basic annotations such as the coordinates of panels, characters, faces, and text boxes, along with the text content. The Manga109Dialog dataset associates each dialogue with its speaker. We use a panel order estimator to predict the reading order of the panels on each manga page. Combining this information, we create an XML file and feed it, together with the original manga page, into the MLLM for captioning. Ultimately, we obtain a caption for each panel and a story summarizing the entire page.
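The following is a minimal sketch of this captioning pipeline. The function and field names (e.g., `estimate`-style inputs, `query_mllm`) are hypothetical stand-ins, since the paper does not specify the exact APIs; only the overall flow (merge annotations into XML, then prompt an MLLM with the page image) follows the description above.

```python
# Hypothetical sketch of the Manga109Story captioning pipeline.
import xml.etree.ElementTree as ET

def build_page_xml(panels, dialogs, panel_order):
    """Merge Manga109 boxes, Manga109Dialog speaker links, and the
    estimated reading order into one XML annotation for a page."""
    root = ET.Element("page")
    for rank, idx in enumerate(panel_order):      # reading order from the estimator
        p = panels[idx]
        panel = ET.SubElement(root, "panel", order=str(rank),
                              bbox=",".join(map(str, p["bbox"])))
        for ch in p["characters"]:
            ET.SubElement(panel, "character", name=ch["name"],
                          bbox=",".join(map(str, ch["bbox"])))
        for d in dialogs.get(idx, []):            # dialogue text linked to its speaker
            t = ET.SubElement(panel, "text", speaker=d["speaker"])
            t.text = d["content"]
    return ET.tostring(root, encoding="unicode")

def caption_page(page_image, panels, dialogs, panel_order, query_mllm):
    """Ask an MLLM for per-panel captions plus a page-level story.
    `query_mllm` is a placeholder for whichever multimodal model is used."""
    xml_ann = build_page_xml(panels, dialogs, panel_order)
    prompt = ("Given this manga page and its XML annotation, write a caption "
              "for each panel in reading order, then summarize the whole page "
              f"as a story.\n{xml_ann}")
    return query_mllm(image=page_image, prompt=prompt)
```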

Architecture

We propose MangaDiffusion to achieve manga generation. During the training stage, we split a complete manga page into panel images, adding padding images if the number of panels is less than the maximum supported number. These panel images are fed in batches into a pretrained VAE to obtain latent representations, and each panel image has a corresponding caption that controls its content generation. Each transformer block consists of an intra-panel block and an inter-panel block for information interaction; the caption participates only in the computation within the intra-panel block. The timestep $t$ is injected into the model via adaLN-single. The intra-panel mask removes text and speech-bubble boxes within each image, while the inter-panel mask masks out the padding images.
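Below is a minimal PyTorch sketch of one such block, assuming panel latents of shape [B, P, N, D] (B pages, P panels, N tokens per panel) and caption embeddings of shape [B, P, L, D]. The layer names and exact attention wiring are our assumptions; only the intra-/inter-panel split, the caption's restriction to the intra-panel block, and the two masks follow the description above (adaLN-single timestep modulation is omitted for brevity).

```python
import torch
import torch.nn as nn

class MangaBlock(nn.Module):
    """Assumed structure of one MangaDiffusion transformer block."""
    def __init__(self, dim, heads):
        super().__init__()
        self.intra_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.inter_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))

    def forward(self, x, caption, intra_mask, inter_mask):
        # x: [B, P, N, D] panel latents; caption: [B, P, L, D] text embeddings
        # intra_mask: [B, P, N] bool, True on text/speech-bubble tokens
        #   (assumed never to cover an entire panel)
        # inter_mask: [B, P] bool, True on padding panels
        B, P, N, D = x.shape
        # Intra-panel block: attend within each panel; captions join only here.
        h = x.reshape(B * P, N, D)
        h = h + self.intra_attn(h, h, h,
                                key_padding_mask=intra_mask.reshape(B * P, N))[0]
        cap = caption.reshape(B * P, -1, D)
        h = h + self.cross_attn(h, cap, cap)[0]
        # Inter-panel block: attend across all panels of the same page;
        # the inter-panel mask hides the padding panels.
        h = h.reshape(B, P * N, D)
        pad = inter_mask[:, :, None].expand(B, P, N).reshape(B, P * N)
        h = h + self.inter_attn(h, h, h, key_padding_mask=pad)[0]
        h = h + self.mlp(h)
        return h.reshape(B, P, N, D)
```

Separating the two attention stages lets captions condition only their own panel's content while the inter-panel stage handles cross-panel consistency and layout.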

Results

R1: Comparison between T2I SOTAs and our proposed MangaDiffusion on the Manga109Story test set. Pixart-Sigma is fine-tuned on the Manga109Story training set for 120 epochs, while the other SOTAs perform inference directly without any fine-tuning.

R2: Visualization results of MangaDiffusion and other T2I methods. Each row represents a story, and the text above the images shows the number of panels and the corresponding caption for each panel. The figures show that our method performs well in both panel-quantity control and panel-layout diversity.


Publication

S. Chen, D. Li, Z. Bao, Y. Zhou, L. Tan, Y. Zhong, Z. Zhao
Manga Generation via Layout-controllable Diffusion

ArXiv | Code | BibTeX





Webpage template modified from here.