Release Notes: Introducing DocumentCloud searchable notes, advanced OCR and additional internationalization options

Plus hide documents from Google and new Add-Ons to help analyze your FOIA finds more easily

Over the past year the DocumentCloud team has been working hard to create better solutions for journalists and the public at large to share, analyze, annotate and, ultimately, publish source documents to the web. In the past six months we have seen great growth in DocumentCloud’s capabilities including the ability to de-index documents, some API changes, new translations of the DocumentCloud website, several new Add-Ons and new paid features.

For previous site improvements, check out all of MuckRock’s release notes, and if you’d like updates emailed to you — along with ways to help contribute to the site’s development yourself — subscribe to our developer newsletter here.

General feature updates

  • You can now log into Big Local News with your MuckRock account.
  • Documents can now be de-indexed from DocumentCloud’s public search as well as search engines like Google and Bing. Select the document, click “Edit” and “Change Access” or click on the document’s globe or lock symbol to see the “Hide from search engines and DocumentCloud search” checkbox.
    Screenshot of change access page of DocumentCloud user interface showcasing the checkbox to hide documents from search engines and DocumentCloud public search
  • Documents can now be uploaded via email. Click on your username in the top navigation bar and click “Upload via email” to create an email address to send documents you would like to upload to your DocumentCloud account.
    Screenshot of page in DocumentCloud user interface where you can generate an email to send documents you want to upload to
  • Documents now have color-coded note indicators, shown as small markers along the side of a document as you browse it. Public notes are yellow, organization-wide notes are green, and private notes are blue.
    Screenshot showing DocumentCloud user interface with yellow, green, and blue note indicators on the side of a document
  • The DocumentCloud team has released a batch upload script to allow you to more easily upload large (and small) sets of documents. Read more about its usage and potential use cases in our guide on the batch upload script.

DocumentCloud internationalization

The DocumentCloud interface is now available in German, Russian and Ukrainian, in addition to the previously available French, English and Spanish. Select the language by clicking on the “Language” drop-down menu in the top navigation bar of DocumentCloud.

Screenshot showing DocumentCloud being available in English, Spanish, French, Ukrainian, and German

New Add-Ons

  • PII Detector will help you protect personally identifiable information from falling into the wrong hands. The Add-On searches through your document(s) for addresses, zip codes, phone numbers, Social Security numbers, credit card numbers and email addresses and adds a private annotation to the document wherever it finds a potential match. Optionally, you can provide a project ID, and any document in which PII is detected will be added to that project. You can also schedule the Add-On to run and have it email you the results once a run completes, if PII was detected in the documents.
    Screenshot of menu for PII Detector Add-On including check marks for SSNs, zip codes, addresses, emails, phone numbers, the ability to sort the documents that have PII detected into a project, and the ability to run the Add-On on a schedule
  • The Email Conversion Add-On allows you to provide a public link to a Google Drive or Dropbox folder, convert the EML and MSG files in it into PDFs and upload them to DocumentCloud. Many public records requests for emails over the years have resulted in pesky EML and MSG files that are clunky to deal with and require special software like Outlook to open, so the DocumentCloud team has integrated an existing open source tool into an Add-On that lets users do the conversion and upload within DocumentCloud.
  • DocumentCloud’s Transcribe Audio Add-On allows you to provide a link to Google Drive, Dropbox, WeTransfer, MediaFire or YouTube and have the audio or video file transcribed for you and the transcription uploaded to DocumentCloud for review.
    Screenshot of menu for Transcribe Audio Add-On that shows the place to submit a URL to a video or audio file to be transcribed
  • The Internet Archive Export Tool allows you to save a collection of documents to DocumentCloud’s Internet Archive account for more permanent archiving. To save it to your own Internet Archive account, please email us and we can guide you through it.

New Professional and Org features

These features are available for paid MuckRock and DocumentCloud users — upgrade today to use them.

  • Use Amazon’s Textract OCR engine to get better text extraction and improve the ability to search within documents for specific text. You can select the OCR engine when uploading new documents.
    Screenshot of menu for textract OCR engine when uploading a new document to DocumentCloud
    If the document is already uploaded, select the document, click “Edit” and then “Force Reprocess”.
    Screenshot of menu for Force Re-Process which allows you to select Textract OCR engine

  • Search across annotations on documents using DocumentCloud’s search bar.
    Screenshot of search results that show a note in the search results

API updates

  • If you notice mistakes in the OCR’d text of a document in DocumentCloud, or want to replace the extracted text with the output of your own custom OCR, you can now edit the plain text through the API using the new editable text API changes. Code is available for an example Add-On that uses the free OCRSpace API together with the editable text API to provide custom OCR options.
  • Pagination in the DocumentCloud API has changed from page-based to cursor-based pagination. To keep using the older page-offset pagination, which also returns a top-level ‘count’ key with the total number of objects for all list queries, add a version=1.0 query parameter to your API queries. Be aware that this will make your queries less performant, possibly to the point of being unusable. Use it only as a stop-gap while you update your workflow to the new cursor-based pagination; it will be deprecated in the future.
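
For readers updating a script, here is a minimal Python sketch of what following cursor-based pagination can look like. It rests on assumptions that are not stated in these notes: that each list response is a JSON object with a "results" list and a "next" URL (empty on the last page), and that the documents endpoint lives at the URL shown. The helper names iter_results and with_params are hypothetical, and authentication is left out. Check the API documentation before relying on any of it.

```python
import json
from typing import Callable, Iterator
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit
from urllib.request import urlopen

# Assumed endpoint, for illustration only; confirm it in the API documentation.
DOCUMENTS_URL = "https://api.www.documentcloud.org/api/documents/"


def fetch_json(url: str) -> dict:
    """Fetch a URL and decode its JSON body (unauthenticated request)."""
    with urlopen(url) as response:
        return json.load(response)


def with_params(url: str, **params: str) -> str:
    """Return the URL with the given query parameters added or replaced."""
    parts = urlsplit(url)
    query = dict(parse_qsl(parts.query))
    query.update(params)
    return urlunsplit(parts._replace(query=urlencode(query)))


def iter_results(url: str, fetch: Callable[[str], dict] = fetch_json) -> Iterator[dict]:
    """Yield every item by following the 'next' link until it is empty."""
    while url:
        page = fetch(url)
        # Assumption: each page carries a 'results' list and a 'next' URL (or null).
        yield from page.get("results", [])
        url = page.get("next")
```

With those assumptions, `for doc in iter_results(DOCUMENTS_URL): ...` walks every page without tracking page numbers, because the script only ever follows the link the server hands back. A script that still depends on the top-level ‘count’ key could request `with_params(DOCUMENTS_URL, version="1.0")` in the meantime, keeping in mind the performance warning above.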

Image via Wikimedia Commons