
Release Notes: Introducing DocumentCloud searchable notes, advanced OCR and additional internationalization options
DocumentCloud has gained many features in the last couple of months: the ability to de-index documents from DocumentCloud’s public search and from search engines like Google, the ability to upload documents via email, and several new Add-Ons. New pro features include searching publicly accessible notes and using Amazon’s Textract OCR to get better text extraction from hard-to-OCR documents.

Upload large collections of documents to DocumentCloud with ease
Uploading large sets (hundreds, thousands, or even millions) of documents to DocumentCloud through the user interface is laborious: the document set has to be split into smaller batches, and the uploads have to be monitored carefully for processing errors.
DocumentCloud’s Batch Upload Script was initially written to upload the CIA CREST files, a collection of almost 1 million files. It uploads files in batches and keeps track of which ones were uploaded successfully, so it can be stopped and restarted and will pick up where it left off, and failed uploads can be retried. Pressing CTRL+C once while it is running stops it gracefully. A recent rewrite allows the script to run on any directory of documents.
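The resumable behavior described above can be illustrated with a minimal Python sketch. This is not the actual Batch Upload Script: the function names, the JSON state file, and the `upload_batch` callable are all hypothetical, and the real upload call to DocumentCloud is deliberately left to the caller. It only shows the general pattern of uploading in batches, recording per-file status, skipping finished files on restart, and stopping cleanly after CTRL+C.

```python
import json
import signal
from pathlib import Path
from typing import Callable, Iterable, List

# Set by the SIGINT handler; checked between batches for a graceful stop.
stop_requested = False


def install_sigint_handler() -> None:
    """Make CTRL+C request a stop instead of killing the process mid-batch."""
    def _request_stop(signum, frame):
        global stop_requested
        stop_requested = True
    signal.signal(signal.SIGINT, _request_stop)


def load_state(state_path: Path) -> dict:
    """Return the saved {file path: "done" | "error"} map, or an empty one."""
    if state_path.exists():
        return json.loads(state_path.read_text())
    return {}


def save_state(state_path: Path, state: dict) -> None:
    """Write to a temporary file first so an interruption cannot corrupt the state."""
    tmp = state_path.with_suffix(".tmp")
    tmp.write_text(json.dumps(state, indent=2))
    tmp.replace(state_path)


def upload_directory(
    directory: Path,
    state_path: Path,
    upload_batch: Callable[[List[str]], Iterable[str]],  # hypothetical: returns the paths that failed
    batch_size: int = 25,
    retry_errors: bool = False,
) -> dict:
    """Upload every file under `directory` in batches, resuming from saved state."""
    state = load_state(state_path)
    files = sorted(
        str(p) for p in Path(directory).rglob("*")
        if p.is_file() and p.resolve() != Path(state_path).resolve()
    )
    # Skip files already uploaded; skip earlier failures unless a retry was requested.
    pending = [
        f for f in files
        if state.get(f) != "done" and (retry_errors or state.get(f) != "error")
    ]
    for start in range(0, len(pending), batch_size):
        if stop_requested:
            break  # the batch in progress has finished and its state is saved
        batch = pending[start:start + batch_size]
        try:
            failed = set(upload_batch(batch))
        except Exception:
            failed = set(batch)  # treat a crashed batch as failed so it can be retried
        for f in batch:
            state[f] = "error" if f in failed else "done"
        save_state(state_path, state)
    return state
```

A caller would run `install_sigint_handler()` once, then call `upload_directory(...)` with its own upload function. Running it again with the same state file continues from the first unfinished batch, and passing `retry_errors=True` re-attempts only the files previously marked as failed.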

Initial Gateway Grantees launch projects to help preserve, analyze and publish critical document collections
Ongoing support program protects endangered materials through decentralized storage while giving DocumentCloud users a range of new features.

New York City could be doing more to use its wastewater testing data, official says
The comments from a senior New York City environment official overseeing its wastewater surveillance program represent a sharp departure from a joint statement made to MuckRock and the Gothamist last month by the city’s health and environment agencies, which called wastewater surveillance a “developing field,” stressing a need for further research before it could be used to inform policy action.

The ‘Uncounted’: People of color are dying at much higher rates than what COVID data suggests
Unspecific, unknown deaths rose 10 times more among Black, Hispanic and Indigenous people than among white Americans during the COVID-19 pandemic, according to a new analysis by MuckRock. The true toll of the COVID-19 pandemic on many communities of color is worse than previously known.
Projects
- FOIA 101: Tips and Tricks to Make You a Transparency Master
  Whether it's your first request or your first request *today,* it never hurts to go over the basics. MuckRock has compiled a lot of FOIA advice over the years, and with this project, it's all in one place.
- U.S. Officials Response to COVID-19 in the Navajo Nation
  In partnership with the Indigenous Investigative Collective and the Native American Journalists Association, we're building investigative journalism infrastructure in Indian Country by supporting networked reporting on COVID-19.
- DockIns: Machine Learning on Deadline for Journalists
  As journalists dealing with data and document sets, we find that the most interesting information is usually hidden in large, unstructured, and incomplete sets of documents. This is especially true of public contracts: what the government is buying, how much money is being spent, and who the suppliers are. To answer these questions, four media organizations (La Nacion, CLIP, Ojo Público, and MuckRock) joined forces under the JournalismAI Collab and experimented with different machine learning tools and techniques to build a platform that helps investigative reporters understand and process unstructured documents and get useful insights.