Document Image Classification with Multimodal Models


So let’s say you have a bunch of scanned or photographed document images.

And your goal is to figure out what type of document each one is (like a receipt, an invoice, or a contract). Normally you might train an image classifier that looks only at the pixels and tries to find patterns that match known examples of each type. But this can be tricky: documents of the same type vary a lot in appearance (handwritten vs. typed text, different templates, scan quality), while documents of different types can look very similar at the pixel level, since they are all mostly dark text on a light page.

That’s where multimodal models come in! Instead of just looking at the pixels, we also use other sources of information to help us classify the document. For example:

1. Text recognition: We can extract text from the image and use that as a feature for classification. This is especially useful if there are clear words or phrases on the page (like “invoice” or “contract”) that can be easily recognized by an optical character recognition algorithm.

2. Layout analysis: We can analyze the layout of the document to help us identify which type it might be. For example, a receipt typically has a header with the name and address of the business, followed by a list of items and prices. An invoice usually includes more detailed information about each item (like quantity and unit price) as well as payment terms and due dates.

3. Image features: We can use machine learning algorithms to extract visual features that are characteristic of different types of documents. For example, receipts are often narrow pages with small, densely packed text, while contracts tend to have larger blocks of text with bold headings and signatures at the end.
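To make the three sources above a bit more concrete, here is a minimal Python sketch that turns each one into a few numbers. It assumes an OCR step has already produced words with normalised bounding boxes; the Word class, the KEYWORDS lists, and the function names are all hypothetical and only there for illustration, not taken from any particular library.

```python
from dataclasses import dataclass


@dataclass
class Word:
    """One OCR token with a bounding box normalised to the page (0.0 to 1.0)."""
    text: str
    x: float       # left edge
    y: float       # top edge (0.0 is the top of the page)
    height: float  # box height


# Hypothetical keyword lists; a real system would learn these from data.
KEYWORDS = {
    "receipt": {"receipt", "cash", "change", "subtotal"},
    "invoice": {"invoice", "quantity", "due", "terms"},
    "contract": {"contract", "agreement", "party", "signature"},
}


def text_features(words):
    """Count how many OCR tokens match each document type's keywords."""
    tokens = [w.text.lower().strip(".,:;") for w in words]
    return {label: sum(t in kws for t in tokens) for label, kws in KEYWORDS.items()}


def layout_features(words):
    """Summarise text size and how much of the text sits in the lower half."""
    if not words:
        return {"mean_height": 0.0, "lower_half_fraction": 0.0}
    n = len(words)
    return {
        "mean_height": sum(w.height for w in words) / n,
        "lower_half_fraction": sum(w.y > 0.5 for w in words) / n,
    }


def image_features(pixels, threshold=128):
    """Fraction of dark pixels in a grayscale image given as rows of 0-255 ints."""
    total = sum(len(row) for row in pixels)
    dark = sum(value < threshold for row in pixels for value in row)
    return {"ink_density": dark / total if total else 0.0}


# Toy inputs standing in for real OCR output and a real scanned page.
words = [
    Word("INVOICE", 0.10, 0.05, 0.04),
    Word("Quantity", 0.10, 0.30, 0.02),
    Word("Due:", 0.10, 0.80, 0.02),
]
pixels = [[255, 0, 255, 255], [255, 255, 0, 255]]

# One flat feature dictionary that a downstream classifier could consume.
features = {**text_features(words), **layout_features(words), **image_features(pixels)}
print(features)
```

These hand-written features are deliberately simple. In practice each modality would more likely be encoded by a learned model, but the idea of producing one feature set per source stays the same.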

By combining all these sources of information (and more!) we can create multimodal models that are often better at classifying documents than methods that rely on pixels alone. And because they draw on multiple inputs, they’re also less likely to be fooled when one signal is misleading, such as a poor scan or an unusual template.
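One simple way to combine the sources is late fusion: each modality produces its own per-class scores, and a weighted sum picks the final label. The sketch below assumes that approach; the scores and weights are made-up numbers, and fuse_scores and classify are hypothetical helper names. It is one possible design, not the only way multimodal models combine inputs.

```python
def fuse_scores(modality_scores, weights):
    """Weighted sum of per-class scores across modalities (late fusion)."""
    combined = {}
    for modality, scores in modality_scores.items():
        weight = weights.get(modality, 1.0)  # unlisted modalities count fully
        for label, score in scores.items():
            combined[label] = combined.get(label, 0.0) + weight * score
    return combined


def classify(modality_scores, weights):
    """Return the label with the highest fused score."""
    combined = fuse_scores(modality_scores, weights)
    return max(combined, key=combined.get)


# Made-up scores: the layout signal alone would pick "receipt",
# but the text signal strongly suggests "invoice".
modality_scores = {
    "text":   {"receipt": 0.2, "invoice": 0.7, "contract": 0.1},
    "layout": {"receipt": 0.5, "invoice": 0.4, "contract": 0.1},
    "image":  {"receipt": 0.4, "invoice": 0.3, "contract": 0.3},
}
# Assumed weights; in practice these would be tuned on validation data.
weights = {"text": 0.5, "layout": 0.3, "image": 0.2}

print(classify(modality_scores, weights))
```

Here a single misleading modality gets outvoted by the others, which is the robustness argument in miniature. Models that learn a joint representation of text, layout, and pixels fuse the signals earlier, at the cost of needing more training data.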

Document Image Classification with Multimodal Models: a fancy way of saying “using lots of different sources of information to classify images of documents”.

SICORPS