A New Tool for Data and Legal Discovery? AWS Announces General Availability of Amazon Textract

Released for general availability by AWS, Amazon Textract is a fully managed service that uses machine learning to automatically extract text and data, including from tables and forms, in virtually any document without the need for manual review, custom code, or machine learning experience.

Press Announcement

AWS Announces General Availability of Amazon Textract

Amazon Textract uses machine learning to automatically extract text and data, including from tables and forms, in virtually any document – with no machine learning experience required.

The Globe and Mail, Met Office, PwC, Healthfirst, UiPath, TeraDact, Ripcord, Kablamo, Vidado, Blue Prism, and Alfresco among customers and partners using Amazon Textract

Today [May 29, 2019], Amazon Web Services, Inc. (AWS), an Amazon.com company (NASDAQ: AMZN), announced the general availability of Amazon Textract, a fully managed service that uses machine learning to automatically extract text and data, including from tables and forms, in virtually any document without the need for manual review, custom code, or machine learning experience. Amazon Textract goes beyond simple optical character recognition (OCR) to identify the contents of fields in forms, information stored in tables, and the context in which the information is presented, such as a name or social security number from a tax form or the product SKU or quantity in a warehouse from an inventory report. The extracted text and data can be easily used to build smart searches on large archives of documents, or can be loaded into a database for use by applications, such as accounting, auditing, and compliance software. Amazon Textract’s API supports multiple image formats like scans, PDFs, and photos, and customers can use it with database and analytics services like Amazon Elasticsearch Service, Amazon DynamoDB, and Amazon Athena and other machine learning services like Amazon Comprehend, Amazon Comprehend Medical, Amazon Translate, and Amazon SageMaker to derive deeper meaning from the extracted text and data. To get started with Amazon Textract, visit https://aws.amazon.com/textract.

Many companies extract text and data from files such as contracts, expense reports, mortgage guarantees, fund prospectuses, tax documents, hospital claims, and patient forms through manual data entry or simple OCR software. This is a time-consuming and often inaccurate process that produces an output requiring extensive post-processing before it can be put in a format that is usable by other applications. That’s because existing OCR technologies are unable to recognize common layouts like forms and tables, and only generate a lengthy and often inaccurate text dump. What organizations want instead is the ability to accurately identify and extract text and data from forms and tables in documents of any format and from a variety of file types and templates. Amazon Textract analyzes virtually any type of document, automatically generating highly accurate text, form, and table data. Amazon Textract identifies text and data from tables and forms in documents – such as line items and totals from a photographed receipt, tax information from a W2, or values from a table in a scanned inventory report – and recognizes a range of document formats, including those specific to financial services, insurance, and healthcare, without requiring any customization or human intervention. Amazon Textract makes it easy for customers to accurately process millions of document pages in just a few hours, significantly lowering document processing costs, and allowing customers to focus on deriving business value from their text and data instead of wasting time and effort on post-processing. Results are delivered via an API that can be easily accessed and used without requiring any machine learning experience.

“The power of Amazon Textract is that it accurately extracts text and structured data from virtually any document with no machine learning experience required. Subsequently, developers can analyze and query the extracted text and data using our database and analytics services like Amazon Elasticsearch Service, Amazon DynamoDB, and Amazon Athena and integrate with other machine learning services like Amazon Comprehend, Amazon Comprehend Medical, Amazon Translate, and Amazon SageMaker to help customers derive deeper meaning from the extracted text and data,” said Swami Sivasubramanian, Vice President, Amazon Machine Learning. “In addition to the integration with other AWS services, the rich partner community developing around Amazon Textract makes it possible for customers to gain real meaning from their file collections, operate more efficiently, improve security compliance, automate data entry, and facilitate faster business decisions.”

Amazon Textract takes scanned files stored in an Amazon S3 bucket, reads them, and returns data in the form of JSON text annotated with the page number, section, form labels, and data types. This data can then be used for a range of applications (e.g. generating smart search indexes, redacting text in a massive collection of forms, creating automated loan approval workflows, using the data for regulatory compliance, and flagging fraud risk for insurance claims). Customers can load the data into business software, such as spreadsheets, databases, and payroll systems, or they can analyze and query the data using Amazon Elasticsearch Service, Amazon DynamoDB, Amazon Redshift, or Amazon Athena. Amazon Textract is available today in US East (Ohio), US East (N. Virginia), US West (Oregon), and EU (Ireland), and will expand to additional regions in the coming year.
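As a rough illustration of the flow described above, the following Python sketch pairs form labels with their values from a Textract-style JSON response. It assumes the boto3 SDK's `analyze_document` operation and the `Blocks` layout of its response; the helper names (`analyze_s3_document`, `extract_form_fields`) and the sample data are hypothetical and are not taken from the announcement. Multi-page PDFs would likely go through the asynchronous `start_document_analysis` operation instead.

```python
def analyze_s3_document(bucket, key):
    """Call Textract on a file in S3 (assumes boto3 and AWS credentials)."""
    import boto3  # imported lazily so the parsing helpers below work offline

    client = boto3.client("textract")
    return client.analyze_document(
        Document={"S3Object": {"Bucket": bucket, "Name": key}},
        FeatureTypes=["FORMS", "TABLES"],
    )


def _text_of(block, blocks_by_id):
    """Join the WORD children of a block into one string."""
    words = []
    for rel in block.get("Relationships", []):
        if rel["Type"] == "CHILD":
            for child_id in rel["Ids"]:
                child = blocks_by_id[child_id]
                if child["BlockType"] == "WORD":
                    words.append(child["Text"])
    return " ".join(words)


def extract_form_fields(response):
    """Return {label: value} for each key/value pair found in the response."""
    blocks_by_id = {b["Id"]: b for b in response["Blocks"]}
    fields = {}
    for block in response["Blocks"]:
        if block["BlockType"] != "KEY_VALUE_SET":
            continue
        if "KEY" not in block.get("EntityTypes", []):
            continue  # VALUE blocks are reached through their KEY block
        value_text = ""
        for rel in block.get("Relationships", []):
            if rel["Type"] == "VALUE":
                value_text = " ".join(
                    _text_of(blocks_by_id[vid], blocks_by_id) for vid in rel["Ids"]
                )
        fields[_text_of(block, blocks_by_id)] = value_text
    return fields


# Hand-written sample in the response shape, so the sketch runs without AWS.
SAMPLE_RESPONSE = {
    "Blocks": [
        {"Id": "k1", "BlockType": "KEY_VALUE_SET", "EntityTypes": ["KEY"], "Page": 1,
         "Relationships": [{"Type": "VALUE", "Ids": ["v1"]},
                           {"Type": "CHILD", "Ids": ["w1"]}]},
        {"Id": "v1", "BlockType": "KEY_VALUE_SET", "EntityTypes": ["VALUE"], "Page": 1,
         "Relationships": [{"Type": "CHILD", "Ids": ["w2", "w3"]}]},
        {"Id": "w1", "BlockType": "WORD", "Text": "Name:", "Page": 1},
        {"Id": "w2", "BlockType": "WORD", "Text": "Jane", "Page": 1},
        {"Id": "w3", "BlockType": "WORD", "Text": "Doe", "Page": 1},
    ]
}

if __name__ == "__main__":
    print(extract_form_fields(SAMPLE_RESPONSE))
```

With real credentials, `extract_form_fields(analyze_s3_document("my-bucket", "scan.png"))` would follow the same path; the bucket and file names here are placeholders.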

The Globe and Mail is a national icon and Canada’s most recognized media brand. “As a news media company, we rely on many PDF or scanned-source documents such as FOIs (freedom of information requests) that have important information contained in tables that we previously couldn’t access,” said Michael O’Neill, Managing Director of Digital and Data Science at The Globe and Mail. “These documents have been under-utilized because journalists were not able to access them easily or didn’t know they existed. Using Amazon Textract, we are able to extract information from tables in PDFs and easily output that data to CSV and offer easy access to these documents by making them available for search queries by our journalists. This increases efficient access to information for our journalists tenfold.”
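The table-to-CSV step mentioned in this account can be sketched roughly as follows. This is not The Globe and Mail's pipeline; it is a minimal Python illustration that assumes the `TABLE`, `CELL`, and `WORD` block types and the `RowIndex`/`ColumnIndex` fields of a Textract response, with hypothetical helper names and hand-written sample data.

```python
import csv
import io


def table_to_rows(table_block, blocks_by_id):
    """Rebuild a grid of cell texts from one TABLE block."""
    cells = {}
    for rel in table_block.get("Relationships", []):
        if rel["Type"] != "CHILD":
            continue
        for cell_id in rel["Ids"]:
            cell = blocks_by_id[cell_id]
            if cell["BlockType"] != "CELL":
                continue
            words = [
                blocks_by_id[wid]["Text"]
                for cell_rel in cell.get("Relationships", [])
                if cell_rel["Type"] == "CHILD"
                for wid in cell_rel["Ids"]
                if blocks_by_id[wid]["BlockType"] == "WORD"
            ]
            # RowIndex and ColumnIndex are 1-based in the response
            cells[(cell["RowIndex"], cell["ColumnIndex"])] = " ".join(words)
    if not cells:
        return []
    n_rows = max(r for r, _ in cells)
    n_cols = max(c for _, c in cells)
    return [
        [cells.get((r, c), "") for c in range(1, n_cols + 1)]
        for r in range(1, n_rows + 1)
    ]


def tables_to_csv(response):
    """Write every table in the response to one CSV string, in order."""
    blocks_by_id = {b["Id"]: b for b in response["Blocks"]}
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for block in response["Blocks"]:
        if block["BlockType"] == "TABLE":
            writer.writerows(table_to_rows(block, blocks_by_id))
    return out.getvalue()


def _cell(cell_id, row, col, word_id):
    """Build a sample CELL block holding a single word."""
    return {"Id": cell_id, "BlockType": "CELL", "RowIndex": row, "ColumnIndex": col,
            "Relationships": [{"Type": "CHILD", "Ids": [word_id]}]}


# Hand-written 2x2 sample table, so the sketch runs without calling AWS.
SAMPLE_TABLE_RESPONSE = {
    "Blocks": [
        {"Id": "t1", "BlockType": "TABLE",
         "Relationships": [{"Type": "CHILD", "Ids": ["c1", "c2", "c3", "c4"]}]},
        _cell("c1", 1, 1, "w1"), _cell("c2", 1, 2, "w2"),
        _cell("c3", 2, 1, "w3"), _cell("c4", 2, 2, "w4"),
        {"Id": "w1", "BlockType": "WORD", "Text": "Item"},
        {"Id": "w2", "BlockType": "WORD", "Text": "Qty"},
        {"Id": "w3", "BlockType": "WORD", "Text": "Pens"},
        {"Id": "w4", "BlockType": "WORD", "Text": "12"},
    ]
}

if __name__ == "__main__":
    print(tables_to_csv(SAMPLE_TABLE_RESPONSE), end="")
```

Rows from multiple tables are simply concatenated here; a real workflow would probably write one file per table and handle merged cells, which this sketch ignores.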

Met Office is the UK’s national weather service, and is a world leader in providing weather and climate services. “We hope to use Amazon Textract to digitize millions of historical weather observations from document archives,” said Philip Brohan, Climate Scientist at Met Office. “Making these observations available to science will improve our understanding of climate variability and change.”

PwC helps organizations and individuals create value by delivering quality in assurance, tax, and advisory services. “At PwC, we work to provide our customers with intelligent automation tools that help transform previously manual processes. We’ve integrated Amazon Textract into our solution for the pharmaceutical industry to automate document processing for various FDA forms like MedWatch and CIOMS,” said Siddhartha Bhattacharya of PwC. “Previously, people would manually review, edit, and process these forms, each one taking hours. Amazon Textract has proven to be the most efficient and accurate OCR solution available for these forms, extracting all of the relevant information for review and processing, and reducing time spent from hours down to minutes.”

Healthfirst is a not-for-profit managed care organization and one of the fastest growing health plans in New York with over 1.4M diverse members and a network of more than 35,000 providers and 4,500 employees. “At Healthfirst, we are building data pipelines to turn scanned medical charts into useful clinical information to improve care coordination, drive quality outcomes, and ensure appropriate reimbursement for members under our coverage,” said Steve Prewitt, Chief Analytics Officer at Healthfirst. “We use Amazon Textract and Amazon Comprehend Medical to glean real value from unstructured data sources in an efficient way, resulting in revenue savings 10-20 times more than our usual downstream operation. By scaling up to analyze over 50,000 charts, we can find undocumented diagnoses and refer around 5,000 members for the care management they need.”

Informed, Inc. automates how financial institutions originate loans and open bank accounts. “We have already used Amazon Textract to analyze tens of thousands of loan documents on behalf of financial institutions, and our own software-as-a-service offering has been enhanced by the service, enabling us to identify 95% of the defects in loan application packages and help banks reduce their manual data entry,” said Justin Wickett, Founder and CEO, Informed Inc. “Using Amazon Textract, our software gives financial institutions real-time visibility into an applicant’s income based off of their pay stubs, bank statements, tax returns, and other financial documents. We plan to expand the types of documents we analyze using Amazon Textract in order to enable financial institutions to take advantage of our machine learning models and bring real-time decision-making efficiency to today’s slow and manual process.”

Candor’s mission is to transform the archaic, time-consuming process that burdens the mortgage industry. “We use OCR to extract data from a wide variety of lender-required documents to verify income, assets, property value, and more. Until now, the best OCR solution read one page at the rate of 38.4 seconds, but Amazon Textract achieves this in a fraction of that time,” said Tom Showalter, Founder & CEO of Candor. “We’ve been able to use Textract to accurately read complex, diverse documents such as bank statements, pay stubs, and tax documents without additional training or machine learning expertise, allowing our clients to underwrite and close a loan in days, as opposed to weeks.”

UiPath is a leading Robotic Process Automation vendor providing a complete software platform to help organizations efficiently automate business processes. “Amazon Textract will further differentiate UiPath’s robotic process automation platform by enhancing UiPath’s document understanding capabilities, enabling our customers to unlock critical business data from documents, transform that data into actionable business insights, and deliver those insights into line-of-business and operational systems,” said Param Kahlon, Chief Product Officer of UiPath.

TeraDact allows customers to transform stored images and paper documents into privacy-compliant, usable digital formats at scale. “Amazon Textract’s smart docs platform feeds TeraDact’s patented redaction services to automatically remove and secure sensitive data. TeraDact customers can permanently remove this data so that it can never be recovered or opt to replace sensitive data with patented tokens which can be recovered by individuals with the appropriate permissions. This is particularly useful in complying with government mandates surrounding individual data privacy such as GDPR,” said Tom Trobridge, COO, TeraDact.

Ripcord’s mission is to digitize and extract knowledge from paper documents using vision-guided robotics, machine learning, and advanced AI. This knowledge automates business processes and workflows. “We’ve had tremendous success utilizing Amazon Textract to augment our advanced entity extraction to benefit many industries and uncover $4 billion in new pay. We look forward to expanding our use of Amazon Textract across financial and government services, healthcare and legal,” said Alex Fielding, CEO of Ripcord.

Blue Prism develops Robotic Process Automation software to provide businesses and organizations with a more agile virtual workforce. “Blue Prism’s connected-RPA can automate and perform mission-critical processes, allowing customers the freedom to focus on more creative, meaningful work. By using Amazon Textract, we’ve given our digital workforce another powerful tool for automation. Amazon Textract accurately analyzes data from various document types using machine learning, which enhances the digital transformation journey for our customers. Using additional AWS AI services like Amazon Comprehend and Amazon Rekognition, we can tackle challenges from added secure customer authentication processes to fraud detection capabilities. The intelligence and flexibility of Amazon Textract’s form data extraction can elevate OCR to new levels in industries like financial services, retail, manufacturing and transportation to name a few,” said Dave Moss, CTO and Co-Founder of Blue Prism.

About Amazon Web Services

For 13 years, Amazon Web Services has been the world’s most comprehensive and broadly adopted cloud platform. AWS offers over 165 fully featured services for compute, storage, databases, networking, analytics, robotics, machine learning and artificial intelligence (AI), Internet of Things (IoT), mobile, security, hybrid, virtual and augmented reality (VR and AR), media, and application development, deployment, and management from 66 Availability Zones (AZs) within 21 geographic regions, spanning the U.S., Australia, Brazil, Canada, China, France, Germany, Hong Kong Special Administrative Region, India, Ireland, Japan, Korea, Singapore, Sweden, and the UK. Millions of customers, including the fastest-growing startups, largest enterprises, and leading government agencies, trust AWS to power their infrastructure, become more agile, and lower costs. To learn more about AWS, visit aws.amazon.com.

About Amazon

Amazon is guided by four principles: customer obsession rather than competitor focus, passion for invention, commitment to operational excellence, and long-term thinking. Customer reviews, 1-Click shopping, personalized recommendations, Prime, Fulfillment by Amazon, AWS, Kindle Direct Publishing, Kindle, Fire tablets, Fire TV, Amazon Echo, and Alexa are some of the products and services pioneered by Amazon. For more information, visit amazon.com/about and follow @AmazonNews.

Read the complete release at AWS Announces General Availability of Amazon Textract

Source: ComplexDiscovery

ComplexDiscovery combines original industry research with curated expert articles to create an informational resource that helps legal, business, and information technology professionals better understand the business and practice of data discovery and legal discovery.
