Language Detection for Unstructured Data with AWS S3
For years, organizations have collected text data. Analyzing it can help you meet a range of business challenges, from improving customer experience to analytics. The unstructured, mixed-language text in business documents, emails, and web pages can hold a wealth of knowledge. By processing and interpreting this data, you can gain insight that supports decision-making in your business. But processing and interpreting text manually can cost a lot of effort, time, and money.
So how can we advance language detection and processing? The answer is simple: use Amazon Comprehend together with Amazon Simple Storage Service (S3) Batch Operations.
How to Use Comprehend and an S3 Bucket for Language Detection
Amazon Comprehend is a natural language processing service that uses pre-trained machine learning models to extract insights from unstructured text.
Amazon S3 is an object storage service that offers high data availability, security, performance, and scalability; a bucket is the container in which its objects are stored. Businesses of all industries and sizes can store and secure any amount of data for cloud-native applications, data lakes, and mobile apps.
With the asynchronous batch operations of Amazon Comprehend, an organization can easily detect the dominant language of text documents stored in an S3 bucket. A single asynchronous job can process up to 5 GB of documents, with up to one million documents per batch.
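As a rough illustration of what starting such an asynchronous job could look like from Python, here is a minimal sketch. It assumes a boto3 Comprehend client is passed in; the helper names, the S3 URIs, and the IAM role are hypothetical placeholders, not part of the original setup.

```python
from typing import Any, Dict


def build_detection_job_request(input_s3_uri: str, output_s3_uri: str,
                                role_arn: str, job_name: str) -> Dict[str, Any]:
    """Build the keyword arguments for an asynchronous language detection job."""
    return {
        "JobName": job_name,
        # Role that lets Comprehend read the input and write the output (assumed to exist).
        "DataAccessRoleArn": role_arn,
        "InputDataConfig": {
            "S3Uri": input_s3_uri,
            # ONE_DOC_PER_FILE treats every S3 object under the prefix as one document.
            "InputFormat": "ONE_DOC_PER_FILE",
        },
        "OutputDataConfig": {"S3Uri": output_s3_uri},
    }


def start_detection_job(comprehend_client: Any, request: Dict[str, Any]) -> str:
    """Start the job and return its JobId so the status can be polled later."""
    response = comprehend_client.start_dominant_language_detection_job(**request)
    return response["JobId"]
```

With real credentials, the client would typically come from boto3.client("comprehend"), and the job status can then be checked with describe_dominant_language_detection_job using the returned JobId.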
But what will you do when millions or billions of documents in the S3 bucket are waiting for language detection? What if you need to customize how language detection is processed for specific languages in a document?
Architecture Solutions for Language Detection in the Real World
Let’s assume that we have millions of objects stored in an S3 bucket that need to be processed to detect the dominant language.
To build a language detection job, we need to give S3 Batch Operations a manifest file that lists the text objects to process. Amazon S3 Inventory, which helps manage storage, can generate the list of objects in the bucket, and its report can be used as the manifest input.
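Besides an S3 Inventory report, S3 Batch Operations can also read a plain CSV manifest with one bucket,key row per object. The sketch below builds such a manifest in memory. The function name is hypothetical, and the URL-encoding of keys reflects our understanding of the manifest format, so verify it against the S3 documentation before relying on it.

```python
import csv
import io
from typing import Iterable
from urllib.parse import quote


def build_csv_manifest(bucket: str, keys: Iterable[str]) -> str:
    """Return CSV manifest text with one 'bucket,key' row per object."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for key in keys:
        # Keys are URL-encoded; "/" is kept so prefixes stay readable.
        writer.writerow([bucket, quote(key, safe="/")])
    return buffer.getvalue()
```

The resulting text would then be uploaded to S3, and the uploaded object's ARN and ETag are what the batch job refers to.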
One of the operations that S3 Batch Operations supports is invoking an AWS Lambda function. Lambda runs your code on a high-availability compute infrastructure and handles all the administration of the servers and operating system, including capacity provisioning, automatic scaling, maintenance, and code monitoring. Your code is organized into Lambda functions, and S3 Batch Operations uses the LambdaInvoke operation to run the function on each object listed in the manifest.
Remember, though, that Lambda functions are subject to the account's concurrency limit, and every invocation has a bounded runtime. If you want to make sure that the function can still be invoked when other functions are using up the account's concurrency, you can set reserved concurrency for it, and you can request a service quota increase if the account limit itself is too low.
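To make the LambdaInvoke step more concrete, here is a hedged sketch of the request that could be passed to the boto3 s3control client's create_job call. The helper names, priority, and report prefix are illustrative assumptions, and all ARNs are placeholders. The sketch uses the CSV manifest format; for an S3 Inventory report the manifest format would be S3InventoryReport_CSV_20161130 and the Fields entry is omitted.

```python
import uuid
from typing import Any, Dict


def build_batch_job_request(account_id: str, function_arn: str,
                            manifest_arn: str, manifest_etag: str,
                            report_bucket_arn: str, role_arn: str) -> Dict[str, Any]:
    """Build keyword arguments for s3control.create_job with a LambdaInvoke operation."""
    return {
        "AccountId": account_id,
        "ConfirmationRequired": False,
        # Run the given Lambda function once per object in the manifest.
        "Operation": {"LambdaInvoke": {"FunctionArn": function_arn}},
        "Manifest": {
            "Spec": {
                "Format": "S3BatchOperations_CSV_20180820",
                "Fields": ["Bucket", "Key"],
            },
            "Location": {"ObjectArn": manifest_arn, "ETag": manifest_etag},
        },
        "Report": {
            "Bucket": report_bucket_arn,
            "Format": "Report_CSV_20180820",
            "Enabled": True,
            "Prefix": "language-detection-reports",  # hypothetical prefix
            "ReportScope": "FailedTasksOnly",
        },
        "Priority": 10,  # arbitrary example value
        "RoleArn": role_arn,
        # Idempotency token so a retried request does not create a second job.
        "ClientRequestToken": str(uuid.uuid4()),
    }


def create_batch_job(s3control_client: Any, request: Dict[str, Any]) -> str:
    """Create the job and return its JobId."""
    return s3control_client.create_job(**request)["JobId"]
```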
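Reserving concurrency for the function can be scripted as well. The following is a small sketch, assuming a boto3 Lambda client is passed in; the helper name and the example function name are hypothetical.

```python
from typing import Any


def reserve_concurrency(lambda_client: Any, function_name: str, reserved: int) -> int:
    """Reserve a number of concurrent executions for one Lambda function."""
    if reserved < 0:
        raise ValueError("reserved concurrency must not be negative")
    response = lambda_client.put_function_concurrency(
        FunctionName=function_name,
        ReservedConcurrentExecutions=reserved,
    )
    # The service echoes back the value that is now in effect.
    return response["ReservedConcurrentExecutions"]
```

For example, reserve_concurrency(boto3.client("lambda"), "detect-language", 50) would set aside 50 concurrent executions for that function. The reserved amount comes out of the account's total, so other functions have less to share.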
Lambda functions can be tailored to take further actions based on results returned by Amazon Comprehend.
The Lambda function code checks each file, including its extension and any other prerequisites, before calling the Amazon Comprehend API. After reading the text object from S3, the Lambda function calls the Comprehend API to determine the dominant language. The language detection API of Amazon Comprehend automatically identifies text written in over 100 languages. The API response contains the dominant language and a confidence score to support the interpretation.
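The flow described above could be sketched as the following Lambda handler. This is an illustration under stated assumptions, not the exact production code: it assumes the S3 Batch Operations invocation schema version 1.0 (tasks carrying s3BucketArn and a URL-encoded s3Key), and the allowed extension list and the 100,000-byte cap are hypothetical prerequisite checks chosen for the example.

```python
from typing import Any, Dict
from urllib.parse import unquote_plus

ALLOWED_EXTENSIONS = (".txt",)  # hypothetical prerequisite: plain-text files only
MAX_BYTES = 100_000             # assumed cap on what is sent in one synchronous call


def process_task(task: Dict[str, str], s3_client: Any, comprehend_client: Any) -> Dict[str, str]:
    """Detect the dominant language of one S3 object and build its task result."""
    bucket = task["s3BucketArn"].split(":::")[-1]
    key = unquote_plus(task["s3Key"])  # keys arrive URL-encoded
    result = {"taskId": task["taskId"]}
    if not key.lower().endswith(ALLOWED_EXTENSIONS):
        result.update(resultCode="PermanentFailure",
                      resultString="Unsupported file extension")
        return result
    try:
        body = s3_client.get_object(Bucket=bucket, Key=key)["Body"].read()
        text = body[:MAX_BYTES].decode("utf-8", errors="ignore")
        if not text.strip():
            result.update(resultCode="PermanentFailure", resultString="Empty document")
            return result
        languages = comprehend_client.detect_dominant_language(Text=text)["Languages"]
        if not languages:
            result.update(resultCode="PermanentFailure", resultString="No language detected")
            return result
        # The dominant language is the one with the highest confidence score.
        top = max(languages, key=lambda lang: lang["Score"])
        result.update(resultCode="Succeeded",
                      resultString=f'{top["LanguageCode"]}:{top["Score"]:.2f}')
    except Exception as error:  # for example throttling; the task can be retried
        result.update(resultCode="TemporaryFailure", resultString=str(error))
    return result


def handle_event(event: Dict[str, Any], s3_client: Any, comprehend_client: Any) -> Dict[str, Any]:
    """Process every task in the batch event and build the response S3 expects."""
    return {
        "invocationSchemaVersion": event["invocationSchemaVersion"],
        "treatMissingKeysAs": "PermanentFailure",
        "invocationId": event["invocationId"],
        "results": [process_task(task, s3_client, comprehend_client)
                    for task in event["tasks"]],
    }


def lambda_handler(event, context):
    import boto3  # imported lazily so the logic above can be tested without AWS
    return handle_event(event, boto3.client("s3"), boto3.client("comprehend"))
```

Passing the clients in as arguments keeps the logic testable with stand-ins. Further actions, such as tagging the object with the detected language, could be added where the succeeded result is built.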
You can implement a language detection job in an application in several ways; this was only one of them. For real-time analysis, you can use the synchronous batch operations API of Amazon Comprehend. However, as suggested in this blog, if you want to process large collections of documents, you should use asynchronous batch operations. If you find this tough to accomplish yourself, you can hire developers from our experienced team, who can help you implement language detection or build an application from scratch for your business.
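For the real-time path, a minimal sketch of the synchronous batch call might look like this. It assumes a boto3 Comprehend client is passed in and that a single request accepts up to 25 documents, so check the current service quotas. The helper name is hypothetical.

```python
from typing import Any, List, Optional

BATCH_LIMIT = 25  # assumed maximum number of documents per synchronous batch request


def detect_languages(comprehend_client: Any, texts: List[str]) -> List[Optional[str]]:
    """Return the dominant language code for each text, or None where detection failed."""
    codes: List[Optional[str]] = [None] * len(texts)
    for start in range(0, len(texts), BATCH_LIMIT):
        chunk = texts[start:start + BATCH_LIMIT]
        response = comprehend_client.batch_detect_dominant_language(TextList=chunk)
        for item in response["ResultList"]:
            if item["Languages"]:
                top = max(item["Languages"], key=lambda lang: lang["Score"])
                # Index is relative to the chunk, so shift it back to the full list.
                codes[start + item["Index"]] = top["LanguageCode"]
    return codes
```

Documents that the service reports in its error list simply stay as None here; a real application would probably log or retry them.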