Amazon Textract is a machine learning (ML) service that automatically extracts text, handwriting, and data from any document or image. Amazon Textract has a Tables feature within the AnalyzeDocument API that offers the ability to automatically extract tabular structures from any document. In this post, we discuss the improvements made to the Tables feature and how they make it easier to extract information in tabular structures from a wide variety of documents.

Tabular structures in documents such as financial reports, paystubs, and certificate of analysis files are often formatted in a way that enables easy interpretation of information. They often also include information such as table title, table footer, section title, and summary rows within the tabular structure for better readability and organization. For a similar document prior to this enhancement, the Tables feature within AnalyzeDocument would have identified those elements as cells, and it didn’t extract titles and footers that are present outside the bounds of the table. In such cases, custom postprocessing logic to identify such information or extract it separately from the API’s JSON output was necessary. With this announcement of enhancements to the Tables feature, the extraction of various aspects of tabular data becomes much simpler.

In April 2023, Amazon Textract introduced the ability to automatically detect titles, footers, section titles, and summary rows present in documents via the Tables feature. In this post, we discuss these enhancements and give examples to help you understand and use them in your document processing workflows. We walk through code examples that call the API and process the response with the Amazon Textract Textractor library.

The following image shows that the updated model not only identifies the table in the document but also all corresponding table headers and footers. This sample financial report document contains a table title, footer, section title, and summary rows.

The Tables feature enhancement adds support for four new elements in the API response that allow you to extract each of these table elements with ease, and adds the ability to distinguish the type of table.

Table elements

Amazon Textract can identify several components of a table such as table cells and merged cells. These components, known as Block objects, encapsulate the details related to the component, such as the bounding geometry, relationships, and confidence score. A Block represents items that are recognized in a document within a group of pixels close to each other. The following are the new Table Blocks introduced in this enhancement:

Table title – A new Block type called TABLE_TITLE that enables you to identify the title of a given table. Titles can be one or more lines, which are typically above a table or embedded as a cell within the table.
Table footers – A new Block type called TABLE_FOOTER that enables you to identify the footers associated with a given table. Footers can be one or more lines that are typically below the table or embedded as a cell within the table.
Section title – A new Block type called TABLE_SECTION_TITLE that enables you to identify if the cell detected is a section title.
Summary cells – A new Block type called TABLE_SUMMARY that enables you to identify if the cell is a summary cell, such as a cell for totals on a paystub.

When Amazon Textract identifies a table in a document, it extracts all the details of the table into a top-level Block type of TABLE. Tables can come in various shapes and sizes. For example, documents often contain tables that may or may not have a discernible table header. To help distinguish these types of tables, we added two new entity types for a TABLE Block: SEMI_STRUCTURED_TABLE and STRUCTURED_TABLE. These entity types help you distinguish between structured and semi-structured tables.

Structured tables are tables that have clearly defined column headers. With semi-structured tables, however, data might not follow a strict structure. For example, data may appear in a tabular structure that isn’t a table with defined headers. The new entity types offer the flexibility to choose which tables to keep or remove during post-processing. The following image shows an example of STRUCTURED_TABLE and SEMI_STRUCTURED_TABLE.

In this section, we explore how you can use the Amazon Textract Textractor library to postprocess the API output of AnalyzeDocument with the Tables feature enhancements. This allows you to extract relevant information from tables.
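To make the new table elements more concrete, here is a minimal sketch of post-processing in Python. It works directly on the `Blocks` list of an AnalyzeDocument response instead of going through Textractor, so it is not the Textractor workflow itself. The helper names (`block_labels`, `related_ids`, `block_text`, `summarize_tables`) and the hand-built `SAMPLE_BLOCKS` data are hypothetical and exist only for illustration. The sketch also assumes that a label such as TABLE_TITLE or TABLE_SUMMARY may show up either as a block's `BlockType` or inside its `EntityTypes` list, so it checks both places.

```python
# Labels described in this post: two table types and four table elements.
TABLE_TYPES = {"STRUCTURED_TABLE", "SEMI_STRUCTURED_TABLE"}
ELEMENT_LABELS = {"TABLE_TITLE", "TABLE_FOOTER", "TABLE_SECTION_TITLE", "TABLE_SUMMARY"}


def block_labels(block):
    """Combine BlockType and EntityTypes into one set of labels (assumption:
    a label of interest may be reported in either field)."""
    return {block.get("BlockType")} | set(block.get("EntityTypes") or [])


def related_ids(block, rel_type=None):
    """Ids a block points to, optionally limited to one relationship type."""
    ids = []
    for rel in block.get("Relationships") or []:
        if rel_type is None or rel.get("Type") == rel_type:
            ids.extend(rel.get("Ids") or [])
    return ids


def block_text(block, by_id):
    """Join the text of every WORD block reachable through CHILD links."""
    if block.get("BlockType") == "WORD":
        return block.get("Text", "")
    parts = (block_text(by_id[i], by_id)
             for i in related_ids(block, "CHILD") if i in by_id)
    return " ".join(p for p in parts if p)


def summarize_tables(blocks):
    """Return one summary per TABLE block: its type and its labeled elements."""
    by_id = {b["Id"]: b for b in blocks}
    summaries = []
    for table in blocks:
        if table.get("BlockType") != "TABLE":
            continue
        table_types = sorted(block_labels(table) & TABLE_TYPES)
        labels = {}
        for rid in related_ids(table):  # cells, titles, footers, merged cells
            child = by_id.get(rid)
            if child is None:
                continue
            for label in sorted(block_labels(child) & ELEMENT_LABELS):
                labels.setdefault(label, []).append(block_text(child, by_id))
        summaries.append({
            "id": table["Id"],
            "table_type": table_types[0] if table_types else None,
            "labels": labels,
        })
    return summaries


def child(ids):
    """Build a CHILD relationship list for the sample data below."""
    return [{"Type": "CHILD", "Ids": ids}]


# Hand-built sample that only mimics the shape of a response. A real list
# would come from something like
# boto3.client("textract").analyze_document(..., FeatureTypes=["TABLES"])["Blocks"].
SAMPLE_BLOCKS = [
    {"Id": "t1", "BlockType": "TABLE", "EntityTypes": ["STRUCTURED_TABLE"],
     "Relationships": child(["c1", "c2"]) + [
         {"Type": "TABLE_TITLE", "Ids": ["tt1"]},
         {"Type": "TABLE_FOOTER", "Ids": ["tf1"]}]},
    {"Id": "tt1", "BlockType": "TABLE_TITLE", "Relationships": child(["w1", "w2"])},
    {"Id": "tf1", "BlockType": "TABLE_FOOTER", "Relationships": child(["w3"])},
    {"Id": "c1", "BlockType": "CELL", "EntityTypes": ["TABLE_SECTION_TITLE"],
     "Relationships": child(["w4"])},
    {"Id": "c2", "BlockType": "CELL", "EntityTypes": ["TABLE_SUMMARY"],
     "Relationships": child(["w5"])},
    {"Id": "w1", "BlockType": "WORD", "Text": "Quarterly"},
    {"Id": "w2", "BlockType": "WORD", "Text": "Results"},
    {"Id": "w3", "BlockType": "WORD", "Text": "Unaudited"},
    {"Id": "w4", "BlockType": "WORD", "Text": "Revenue"},
    {"Id": "w5", "BlockType": "WORD", "Text": "Total"},
]

for summary in summarize_tables(SAMPLE_BLOCKS):
    print(summary["id"], summary["table_type"], summary["labels"])
```

Because each summary carries a `table_type`, keeping only structured tables during post-processing is a one-line filter, for example `[t for t in summarize_tables(blocks) if t["table_type"] == "STRUCTURED_TABLE"]`.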