Author(s): Aditya Mohan

Originally published on Towards AI.

Information extraction is the process of automatically retrieving specific information about a given topic from a collection of texts or documents. It generally involves the use of natural language processing (NLP) techniques. Relying on NLP alone, however, often means building complex logic that is highly task-specific and does not generalize well.

Okay… but what complex logic are we talking about? It can involve designing modules to parse certain types of documents. One would have to go through a number of documents to get a general understanding of their layout and then come up with modules that rely on key-value pair extraction from OCR services like AWS Textract, design extraction logic with complex regular expressions, or simply search the spatial neighborhood of certain keywords for their corresponding values. These approaches, though often successful, are not robust to changes in document structure.

With the advent of large language models (LLMs), trained on corpora of millions of documents and texts, it has become considerably easier to solve this problem. Given a context and a schema, an LLM can extract the requested attributes directly. In most simple cases, it requires no additional fine-tuning for the task and generalizes well. Documents that are well suited to LLM-based analysis include resumes, legal contracts, leases, newspaper articles, and other unstructured text. Further, to democratize access to LLM capabilities, OpenAI has made APIs available for its LLM products, such as GPT-3.5 and GPT-4.

In this article, I will talk about what a very basic information extraction pipeline might look like and how, using modern Python frameworks like LangChain and Streamlit, one can easily build web applications around LLMs.

Methodology

[Figure: High-level diagram of an information extraction system. Source: author]

Optical Character Recognition

Extracting text in a machine-readable format from an image is called Optical Character Recognition (OCR). The first step in any information extraction product or service is to extract the text from the document. The document can be a PDF file or a scanned/captured image. Since the OCR model works natively only with images, a PDF is first converted to a collection of images, one image per page.

There are many paid and open-source OCR services available. In this article, I use an open-source project called Tesseract to perform OCR and retrieve the text. The OCR class in the project is a Python Protocol that can be implemented by different backends, whether Tesseract or paid OCR services like AWS Textract and Azure Vision APIs; a sketch of this pattern appears after the Schema Builder section below.

[Code screenshots: the OCR Protocol and its Tesseract implementation. Source: author]

LangChain Extraction and Schema Builder

Schema Builder

To get structured output from the LLM, we build a schema. A schema is a collection of properties, and the LLM formats its results according to it. Each property in the schema has three attributes that need to be defined: the name of the property, the type of the property, and whether the property is required or not.
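As referenced above, here is a minimal sketch of the OCR abstraction. It assumes the pytesseract, Pillow, and pdf2image packages (plus the Tesseract binary) are installed; the class and function names are illustrative, not the article's actual code.

```python
from typing import Protocol

import pytesseract
from pdf2image import convert_from_path
from PIL import Image


class OCR(Protocol):
    """Any OCR backend only needs to turn an image into text."""

    def extract_text(self, image: Image.Image) -> str:
        ...


class TesseractOCR:
    """Open-source backend; satisfies the OCR Protocol structurally."""

    def extract_text(self, image: Image.Image) -> str:
        return pytesseract.image_to_string(image)


# Usage: convert each PDF page to an image, then OCR page by page.
ocr: OCR = TesseractOCR()
pages = convert_from_path("lease.pdf")  # one PIL image per page
text = "\n".join(ocr.extract_text(page) for page in pages)
```

A backend for AWS Textract or Azure Vision would simply be another class with the same extract_text method; because Protocols are checked structurally, no inheritance is needed.

Concretely, a schema in the format LangChain's extraction chain expects might look like the following sketch; the field names are illustrative.

```python
# Each property gets a name and a type ("string" or "integer");
# a property listed under "required" must be present in the output.
schema = {
    "properties": {
        "tenant_name": {"type": "string"},
        "monthly_rent": {"type": "integer"},
    },
    "required": ["tenant_name"],
}
```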
Once the schema is built, it can be used to generate structured output from the LLM.

[Image: schema-guided extraction example. Source: https://python.langchain.com/docs/use_cases/extraction]

In the image above, the input acts as the context for the LLM. The schema specifies the format of the output, the fields to be extracted, and their corresponding data types. The data types supported by LangChain are "string" and "integer". Once we have the context and the schema, these can be fed to the LLM to generate the response.

LangChain

LangChain is a framework for developing applications that harness the power of language models. It enables applications that:

- are context-aware
- can reason

This article uses the functions in LangChain related to extraction, which produce structured model output conforming to a specified schema.

[Code screenshot: structured extraction with LangChain. Source: author]

So the pipeline looks like this: the OCR module gives us the text in a machine-readable format, and the schema builder helps prepare the schema in the required format. The text output by the OCR is the context. Once we have both of these, we use the create_extraction_chain function from LangChain to generate the output; a sketch of this step appears at the end of the article.

Streamlit Frontend

What is Streamlit? Every machine learning application benefits from an interactive web interface that makes it easier to present its results and performance. Streamlit is a free, open-source, all-Python framework that enables data scientists to quickly build interactive dashboards and machine learning web apps, with no front-end web development experience required. Just to showcase how easy it is to build interactive apps with Streamlit and LangChain, a short snippet is sketched at the end of the article alongside the extraction chain.

[Image: a Streamlit code snippet and the generated web app. Source: author]

Do try out more examples of Streamlit apps on your own data, and read the documentation and tutorials at streamlit.io to learn more. This entire information extraction application was designed using Streamlit!

State Machine

One thing to note about Streamlit is that whenever someone "interacts" with the app, the script is re-executed from top to bottom. Here, interacting means clicking a button or moving a slider. Therefore, to build Streamlit applications that have multiple stages, it is beneficial to use a state design pattern while writing the code.

[Figure: state diagram of the application. Source: author]

Above is the state diagram with the transitions that have been used throughout the code of this information extraction application. To obey the rules of the state design pattern, a Python package called Transitions has been used, which […]
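The article's actual state-machine code is not reproduced here, but a minimal sketch of the pattern with the transitions package might look like this; the stage names are illustrative assumptions.

```python
from transitions import Machine


class AppFlow:
    """Tracks which stage of the app the user is in."""

    states = ["upload", "ocr", "extract", "results"]


flow = AppFlow()
Machine(
    model=flow,
    states=AppFlow.states,
    initial="upload",
    transitions=[
        {"trigger": "advance", "source": "upload", "dest": "ocr"},
        {"trigger": "advance", "source": "ocr", "dest": "extract"},
        {"trigger": "advance", "source": "extract", "dest": "results"},
    ],
)

flow.advance()     # upload -> ocr
print(flow.state)  # "ocr"
```

In a Streamlit app, an object like flow would be kept in st.session_state so that it survives the top-to-bottom re-execution triggered by each interaction.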
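Finally, here are the two sketches promised above. First, the extraction step with LangChain's create_extraction_chain, assuming the langchain and openai packages and an OpenAI API key; the schema is the one sketched in the Schema Builder section, and the context stands in for the OCR output.

```python
from langchain.chains import create_extraction_chain
from langchain.chat_models import ChatOpenAI

schema = {
    "properties": {
        "tenant_name": {"type": "string"},
        "monthly_rent": {"type": "integer"},
    },
    "required": ["tenant_name"],
}

llm = ChatOpenAI(model="gpt-3.5-turbo", temperature=0)
chain = create_extraction_chain(schema, llm)

# The context would normally come from the OCR module.
context = "This lease names Jane Doe as the tenant, at a monthly rent of 1500 USD."
result = chain.run(context)
print(result)  # e.g. [{'tenant_name': 'Jane Doe', 'monthly_rent': 1500}]
```

Second, a minimal sketch of a Streamlit front end tying the pieces together. The run_ocr helper is hypothetical, standing in for the OCR module sketched earlier.

```python
import streamlit as st

st.title("Information Extraction")

uploaded = st.file_uploader("Upload a document", type=["pdf", "png", "jpg"])
if uploaded is not None:
    with st.spinner("Running OCR and extraction…"):
        context = run_ocr(uploaded)  # hypothetical helper wrapping TesseractOCR
        result = chain.run(context)  # the extraction chain from the sketch above
    st.json(result)  # render the structured output
```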