Automating document discovery, alerting, and analysis with DocumentCloud Add-Ons
Get updates on new features, training opportunities, data resources and more by registering for the DocumentCloud newsletter! You can also join us on the NewsNerdery Slack in the #proj-documentcloud channel or on MuckRock's own Slack.
Get started by getting verified
- Register for free here. We recommend using your work email to speed up verification. You can add multiple emails later.
- Search for or create your organization here, if applicable.
- If your organization exists, click “Request to join.”
- If the admins are no longer at the organization, email firstname.lastname@example.org from your organizational email.
Start adding Add-Ons
Once verified, click "Add-Ons" and then "Browse All Add-Ons" to search for different plugins that might be useful.
Some good ones to help get started (note that links go to source code — to use, just activate them from the Browse All Add-Ons menu from within DocumentCloud):
- Scraper: A simple site scraper that accepts a URL and checks it for new documents on a schedule you define. It uploads any new documents into your account, makes them searchable if needed, and sends you an alert, with optional specialized alerts for keywords you define. You can optionally back up any acquired documents to the Filecoin and IPFS networks.
- Tabula Spreadsheet Extraction: Did someone mail you a PDF that at one point lived a happier life as a spreadsheet? We have a beta version of the Tabula data table extractor integrated into DocumentCloud. It currently works on only one document at a time, but the tool will automatically scan through the given PDF and try to pull out a spreadsheet as prettily as it can. Under active development.
- Bad Redactions: Building off the excellent X-Ray library from Free Law Project, Bad Redactions looks for failed redactions that leave the underlying data intact. This is useful both for investigating whether there's more information than meets the eye and for making sure you properly and fully delete information from your own uploads. Note that DocumentCloud automatically flattens pages and deletes underlying data when you use our redaction tools or force OCR. We recommend trying it on the infamous Manafort filing (download and upload it into your own account), where the Add-On should flag and highlight about 25 redaction errors. You can have the Add-On leave a private annotation around the mis-redacted information or have it go ahead and try to properly redact it for you. Like any automation, this is not foolproof and not a replacement for human review!
- Regex Extractor: Pull out specified text matches into a spreadsheet across a selection of documents. Regex can be a bit tricky to learn, but if data appears in a consistent format (such as contracts always starting with "Contract #") then this can be a powerful way to turn documents into well-structured data. Does not work well with OCR'd documents. Try it on this sample document with LLV\d\d\d\d\d\d\d\d\d\d\d\d. Two useful resources are this regular expressions tipsheet and an interactive regular expression generator.
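To get a feel for what the Regex Extractor does before running it, the same idea can be sketched locally with Python's standard `re` and `csv` modules. This is not the Add-On's actual code: the function names and the sample text below are hypothetical, and the pattern simply follows the "Contract #" example mentioned above.

```python
import csv
import io
import re


def extract_matches(documents, pattern):
    """Return (document title, matched text) rows for every regex match."""
    compiled = re.compile(pattern)
    rows = []
    for title, text in documents.items():
        for match in compiled.finditer(text):
            rows.append((title, match.group(0)))
    return rows


def rows_to_csv(rows):
    """Serialize the rows as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["document", "match"])
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical sample text standing in for the text of two documents.
documents = {
    "award-letter": "Contract #1042 was awarded in May. Contract #1043 is pending.",
    "memo": "No contract numbers appear in this memo.",
}

# "Contract #" followed by one or more digits.
rows = extract_matches(documents, r"Contract #\d+")
print(rows_to_csv(rows))
```

Any other pattern, including the LLV one suggested for the sample document, can be passed as the second argument in the same way. Testing a pattern on a few lines of pasted text like this is a quick way to catch mistakes before running it across a large selection of documents.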
- PDF Export: Helps you get your PDFs out of DocumentCloud, adding the selected documents into a Zip file you can download.
- Note Export: Extracts all the notes on selected documents and saves them as text files you can download.
- Bulk Edit: Lets you update metadata on many documents at once.
- SideKick Document Classification: Makes it easy to train a machine learning model to classify documents by an arbitrary type, such as identifying if a document is likely to be an email, a resident complaint, or other categories of records.
- Push to IPFS/Filecoin: Push the selected documents to the decentralized web, making them accessible on IPFS and Filecoin through Estuary.
- Metadata export: We have two new Add-Ons that export metadata from selected documents, making it easier for you to take your key-value tags, page count and much more into your favorite spreadsheet program for further analysis.
- N-Gram Graphs: Feel like you're seeing a term pop up more and more often? Now it's easier to validate your hunch: this Add-On charts how often the words you input occur over time and compares them to each other across a given search.
- Page Stats: Gives you basic statistics about the total length of a selection of documents, the longest document, shortest document and average pages per document.
- User upload frequency graph: Curious whether you're more productive during some months than others? Want to see the progress of your sharing with the public? Use this Add-On to graph your uploads over time. Tip: Put your username in as it appears in the search field (i.e.,