Release Notes: Introducing DocumentCloud searchable notes, advanced OCR and additional internationalization options
DocumentCloud has had many feature updates within the last couple of months, including the ability to de-index documents from DocumentCloud’s public search and search engines like Google, the ability to upload documents via email, several new Add-Ons and pro features which include the ability to search publicly accessible notes and use Amazon’s Textract OCR to get better text extraction from within hard-to-OCR documents.
Uploading large sets (hundreds, thousands, or even millions) of documents to DocumentCloud using the user interface can be laborious and requires careful monitoring of uploads for processing errors and splitting up the document set into smaller batches.
DocumentCloud’s Batch Upload Script was initially written to upload the CIA Crest files, which contains almost 1 million files. It keeps track of which files were uploaded successfully, so that it can be stopped and restarted and it will pick up where it left off, and errors can be retried. It uploads files in batches. It can be stopped gracefully by pressing CTRL+C (once) while it is running. A recent rewrite allows the script to run on any directory of documents.
DocumentCloud Add-Ons: Automate data extraction, alerts, ingestion, and much more with our simple, open source plugin system
Today, we’re launching Add-Ons, which makes it easier to launch, maintain, and share new capabilities right within DocumentCloud, ranging from exporting notes to applying machine learning techniques.