How to Extract Text from PDF Files with Google Apps Script

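An external accounting system generates paper receipts for its customers. These are scanned as PDF files and uploaded to a folder in Google Drive. The PDF invoices have to be parsed, and specific information (such as the invoice number, the invoice date and the buyer's email address) needs to be extracted and saved to a Google Spreadsheet.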

Here is the sample PDF invoice that we'll use in this example.

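Our PDF extractor script will read the file from Google Drive and use the Google Drive API to convert it to text. We can then parse this text with RegEx and write the extracted information to a Google Sheet.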

Let's get started.

Step 1. Convert PDF to Text

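Assuming the PDF files are already in our Google Drive, we'll write a small function that converts a PDF file to text. Please make sure the Advanced Drive Service is enabled, as described in this tutorial. Note that the function calls Drive.Files.insert, which is a Drive API v2 method.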

    /**
     * Convert PDF file to text
     * @param {string} fileId - The Google Drive ID of the PDF
     * @param {string} language - The language of the PDF text to use for OCR
     * @return {string} - The extracted text of the PDF file
     */
    const convertPDFToText = (fileId, language) => {
      fileId = fileId || '18FaqtRcgCozTi0IyQFQbIvdgqaO_UpjW'; // Sample PDF file
      language = language || 'en'; // English

      // Read the PDF file in Google Drive
      const pdfDocument = DriveApp.getFileById(fileId);

      // Use OCR to convert PDF to a temporary Google Document
      // Restrict the response to include file Id and Title fields only
      const { id, title } = Drive.Files.insert(
        {
          title: pdfDocument.getName().replace(/\.pdf$/, ''),
          mimeType: pdfDocument.getMimeType() || 'application/pdf',
        },
        pdfDocument.getBlob(),
        {
          ocr: true,
          ocrLanguage: language,
          fields: 'id,title',
        }
      );

      // Use the Document API to extract text from the Google Document
      const textContent = DocumentApp.openById(id).getBody().getText();

      // Delete the temporary Google Document since it is no longer needed
      DriveApp.getFileById(id).setTrashed(true);

      // (optional) Save the text content to another text file in Google Drive
      const textFile = DriveApp.createFile(`${title}.txt`, textContent, 'text/plain');

      return textContent;
    };

Step 2. Extract Information from Text

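Now that we have the text content of the PDF file, we can use RegEx to extract the information we need. I have highlighted the text elements that we need to save in the Google Sheet, along with the RegEx pattern that helps us extract them.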

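    const extractInformationFromPDFText = (textContent) => {
      // Capture the values that follow the "Invoice Date" and "Invoice Number" labels
      const pattern = /Invoice\sDate\s(.+?)\sInvoice\sNumber\s(.+?)\s/;

      // Replace line breaks with spaces so the pattern can match across lines
      const matches = textContent.replace(/\n/g, ' ').match(pattern) || [];

      // The first element is the full match, so skip it
      const [, invoiceDate, invoiceNumber] = matches;
      return { invoiceDate, invoiceNumber };
    };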

You may have to tweak the RegEx pattern to match the structure of your own PDF files.
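As an illustration of such an adjustment, here is a minimal sketch that pulls the buyer's email address out of the same text. The helper name extractBuyerEmail, the email pattern and the sample text are assumptions made for this example and are not part of the original script. The pattern is deliberately simple and is not a full email validator.

```javascript
// Hypothetical helper: the name, the pattern and the sample text are illustrative only.
const extractBuyerEmail = (textContent) => {
  // Flatten line breaks the same way extractInformationFromPDFText does
  const flatText = textContent.replace(/\n/g, ' ');

  // Simple email pattern; it returns the first address-like token in the text
  const pattern = /[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/;
  const [buyerEmail] = flatText.match(pattern) || [];

  return buyerEmail; // undefined when no address is found
};

// Made-up OCR output, used only to demonstrate the helper
const sampleText =
  'Invoice Date 2022-04-22\nInvoice Number INV-1001\nBill To buyer@example.com\nTotal 100';

console.log(extractBuyerEmail(sampleText)); // buyer@example.com
```

If an invoice contains more than one address (for example, the seller's as well), the pattern would need a label in front of it, much like the "Invoice Date" label in the pattern above.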

Step 3. Save Information to Google Sheet

This is the easiest part. We can use the built-in Spreadsheet service (SpreadsheetApp) to write the extracted information to a Google Sheet.

    const writeToGoogleSheet = ({ invoiceDate, invoiceNumber }) => {
      const spreadsheetId = '<<Google Spreadsheet ID>>';
      const sheetName = '<<Sheet Name>>';
      const sheet = SpreadsheetApp.openById(spreadsheetId).getSheetByName(sheetName);

      // Add the header row if the sheet is empty
      if (sheet.getLastRow() === 0) {
        sheet.appendRow(['Invoice Date', 'Invoice Number']);
      }

      sheet.appendRow([invoiceDate, invoiceNumber]);
      SpreadsheetApp.flush();
    };
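The three functions above can be chained to process every PDF in the upload folder. The following is a rough sketch under stated assumptions: processInvoiceFolder and its folderId parameter are hypothetical names, and the sketch expects convertPDFToText, extractInformationFromPDFText and writeToGoogleSheet to be defined in the same Apps Script project.

```javascript
// Sketch only: processInvoiceFolder and folderId are hypothetical names.
// Relies on convertPDFToText, extractInformationFromPDFText and
// writeToGoogleSheet from the steps above being available in the project.
const processInvoiceFolder = (folderId) => {
  // Iterate over the PDF files in the given Google Drive folder
  const files = DriveApp.getFolderById(folderId).getFilesByType('application/pdf');
  let processed = 0;

  while (files.hasNext()) {
    const file = files.next();

    // Step 1: OCR the PDF into plain text
    const textContent = convertPDFToText(file.getId(), 'en');

    // Step 2: pull the invoice fields out of the text
    const { invoiceDate, invoiceNumber } = extractInformationFromPDFText(textContent);

    // Step 3: write a row only when both fields were found
    if (invoiceDate && invoiceNumber) {
      writeToGoogleSheet({ invoiceDate, invoiceNumber });
      processed += 1;
    }
  }

  return processed; // number of rows written
};
```

This sketch does not remember which files it has already handled, so running it twice would write duplicate rows. Moving processed files to another folder is one possible way to avoid that.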

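If you are dealing with more complex PDFs, you may consider using a commercial API that uses machine learning to analyze the layout of documents and extract specific information at scale. Some popular web services for extracting PDF data include Amazon Textract, Adobe's Extract API and Google's own Vision AI. They all offer generous free tiers for small-scale use.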

Original article: https://www.labnol.org/extract-text-from-pdf-220422