PubLayNet Dataset for Document Layout Analysis


PubLayNet is a large collection of page images from PubMed Central articles that have been annotated with both bounding boxes and polygonal segmentations for five layout categories: text, title, list, table, and figure. That means we can see exactly where each block of text, table, or figure is located on the page.
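Under the hood the annotations follow the COCO convention, so every labeled region is a small record that points back to its page image. As a rough illustration (the numeric values below are made up for the sake of the example, not copied from the actual files), a single entry looks like this:

# One PubLayNet annotation record (illustrative values only)
annotation = {
    "image_id": 341427,                      # id of the page image this region belongs to
    "category_id": 1,                        # 1=text, 2=title, 3=list, 4=table, 5=figure
    "bbox": [64.2, 512.8, 466.0, 95.3],      # [x, y, width, height] in pixels
    "segmentation": [[64.2, 512.8, 530.2, 512.8,
                      530.2, 608.1, 64.2, 608.1]],  # polygon outline of the region
    "iscrowd": 0,
    "area": 44409.8,
}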

This dataset is super helpful for researchers who want to develop algorithms that can automatically extract information from scientific articles without having to read through them all manually. For example, imagine you’re a scientist working on a new drug and you need to go through thousands of research papers to find out what has already been published on your topic. With a model trained on PubLayNet, you could automatically pull out the relevant sections of those articles and present them in a more organized way for you to review.

Here’s how it works: first, we download the dataset (which is pretty big, so be prepared for some waiting time). Then, we load it into our favorite programming language (in this case, Python) and start exploring the data using Jupyter Notebooks. We can use these notebooks to visualize the annotations on sample pages and see how they’re structured.
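One quick way to do that is to draw the labeled boxes straight onto a page image. Here’s a minimal sketch, assuming you’ve extracted the images and the COCO-style annotation file into a local publaynet/ folder (the paths are placeholders for wherever your copy actually lives):

# Draw PubLayNet layout boxes on one page (paths are placeholders)
import json
from PIL import Image, ImageDraw

with open('publaynet/val.json') as f:
    coco = json.load(f)

# Map category ids to names (text, title, list, table, figure)
categories = {c['id']: c['name'] for c in coco['categories']}

# Pick the first image in the split and gather its annotations
image_info = coco['images'][0]
regions = [a for a in coco['annotations'] if a['image_id'] == image_info['id']]

page = Image.open('publaynet/val/' + image_info['file_name']).convert('RGB')
draw = ImageDraw.Draw(page)
for region in regions:
    x, y, w, h = region['bbox']  # COCO boxes are [x, y, width, height]
    draw.rectangle([x, y, x + w, y + h], outline='red', width=3)
    draw.text((x, y - 12), categories[region['category_id']], fill='red')

page.save('annotated_page.png')  # open this file in the notebook to inspect it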

For example, let’s say we want to pull out all of the text regions from a specific page in PubLayNet. Here’s what that might look like:

# Import necessary libraries
import json            # the annotations are distributed as COCO-style JSON files
from PIL import Image  # used to open the page image that goes with them

# Load the annotation file for one split of the dataset
# (the exact path depends on where you extracted your copy of PubLayNet)
with open('publaynet/val.json') as f:
    coco = json.load(f)

# Build a lookup from category id to name: text, title, list, table, figure
categories = {c['id']: c['name'] for c in coco['categories']}

# Find the image record for a specific page by its file name
# (substitute a file name that exists in your own copy of the dataset)
target_file = 'PMC1234567_00003.jpg'
image_info = next(img for img in coco['images']
                  if img['file_name'] == target_file)

# Collect every annotation that belongs to this page
page_regions = [a for a in coco['annotations']
                if a['image_id'] == image_info['id']]

# Keep only the text regions and print where each one sits on the page
text_regions = [a for a in page_regions
                if categories[a['category_id']] == 'text' and a['iscrowd'] == 0]
for region in text_regions:
    x, y, w, h = region['bbox']  # COCO boxes are [x, y, width, height]
    print(f'text block at ({x:.0f}, {y:.0f}), {w:.0f} x {h:.0f} px')

# Open the corresponding page image so the regions can be displayed or cropped
page = Image.open('publaynet/val/' + image_info['file_name'])

In this example, we load the COCO-style annotation file, look up the record for a specific page by its file name, and then collect the text regions annotated on that page. Keep in mind that PubLayNet tells us where each block of text, table, or figure sits on the page, not what the text says, so to recover the actual words you would still need to crop those regions and run OCR on them (or pair them with the article’s XML). This might not be perfect, but it’s a good starting point for developing more sophisticated algorithms in the future!
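As a quick follow-up, cropping one of those annotated text regions out of the page image is just a matter of turning the COCO box into pixel coordinates. This little sketch reuses the page and text_regions variables from the example above:

# Crop the first annotated text region out of the page image
x, y, w, h = text_regions[0]['bbox']                   # COCO box: [x, y, width, height]
snippet = page.crop((int(x), int(y), int(x + w), int(y + h)))
snippet.save('first_text_region.png')                  # e.g. hand this crop to an OCR engine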

Overall, PubLayNet is an incredibly useful dataset that can help us better understand how scientific articles are structured and how we can extract information from them automatically. By using this data to train our models, we can develop new tools and techniques that will make it easier for researchers to access and analyze the vast amounts of knowledge contained in these publications.
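If you want to take that last step and actually train a layout detector on PubLayNet yourself, the COCO format means most object-detection toolkits can read it directly. Here’s a minimal sketch assuming you have Detectron2 installed and the dataset extracted to a publaynet/ folder (both of those are my assumptions, not something the dataset requires):

# Register PubLayNet with Detectron2 so a standard detector config can train on it
from detectron2.data.datasets import register_coco_instances

register_coco_instances(
    'publaynet_train', {},
    'publaynet/train.json',   # COCO-style annotations for the training split
    'publaynet/train',        # folder containing the training page images
)
register_coco_instances(
    'publaynet_val', {},
    'publaynet/val.json',
    'publaynet/val',
)
# From here a standard detection config (e.g. Faster R-CNN or Mask R-CNN) can be
# trained on the five PubLayNet categories: text, title, list, table, and figure.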
