DockIns: Machine Learning on Deadline for Journalists

Early this year, MuckRock was invited to participate in the JournalismAI Collab, working with other news organizations across the Americas to explore, test, and develop new ways to apply AI and machine learning to investigative challenges. We partnered with CLIP, Ojo Público, and La Nación to get further feedback and continue developing SideKick, and these pieces share the results of this collaboration. Read the other pieces, available in both English and Spanish, on the DockIns project page.

Access to public information plays a fundamental role in the enforceability of other rights and is one of the main tools that civil society requires to control and influence governments.

According to UNESCO, Latin America was one of the fastest regions to approve constitutional and legal rules that guaranteed and promoted the right of access to public information.

Nowadays, there are only six countries in the region that still don’t have freedom of information legislation.

Although the region was quick to join the open data movement, NGOs, other organizations, and activists point out that access to public information laws have suffered significant setbacks. During the pandemic in particular, governments have withheld public information about the management of the pandemic and vaccination campaigns, as well as about the management of the health system and, of course, government contracts.

Contract data are very hard to access. In almost all cases these data are neither open nor structured; they are usually buried in large, unstructured, and incomplete sets of documents published across two or three different sources.

Unfortunately, Freedom of Information requests can’t solve this problem, because responses usually arrive as PDF files, plain text, or even scanned copies. So even as the world moves toward open standards for contracting processes and information, we still find this is not enough.

As we learn about different AI solutions for extracting insights or helping us classify these piles of documents, we usually find that they perform well only in English. Thus, another challenge is to test and apply them in other languages, in our case, Spanish.

To cope with this situation and find a solution, four media organizations from the Americas joined forces under the Journalism AI collaboration and experimented with different Machine Learning tools in order to build a platform that helps investigative reporters understand and process unstructured documents to get useful insights.

DockIns, a platform that hosts, reads, classifies, and gives insights into documents, will help us investigate large, unstructured sets of text documents in both English and Spanish. At the moment, the tool is not aimed at end users: some programming experience is required to run the different programs. But we are doing our best to design an end-user interface. You can find a mockup here.

Which technology is used and how does it work?

DockIns uses Machine Learning and Natural Language Processing techniques to classify unstructured documents and to give insights into the types of documents (contract, email, letter, etc.), their topics, and the entities they mention.

To classify documents, it uses SideKick, a tool already developed by MuckRock. SideKick is a machine learning platform built right into DocumentCloud, designed for quickly and efficiently training new models without leaving the platform. You can read our guide on how to run the tool here.

In addition, Named Entity Recognition (NER) can be extremely useful for quickly identifying key elements in a set of documents, such as names of people, places, brands, or monetary values. We knew that this technique works well on English texts but poses some problems with non-English ones. That is why we tested two NER models on documents in Spanish: spaCy and DocumentCloud’s, which uses Google Cloud Natural Language. We have detailed the process we followed and some conclusions on both models’ performance here.
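
To make the idea of comparing two NER models concrete, here is a minimal Python sketch. It assumes the comparison is scored with precision, recall, and F1 against a small hand-labeled sample, which is a common approach but not necessarily the one we followed. The entity lists, the label names, and the names model_a and model_b are hypothetical placeholders, not real output from spaCy or Google Cloud Natural Language.

```python
from typing import NamedTuple


class Entity(NamedTuple):
    text: str
    label: str  # hypothetical label set, e.g. "ORG", "LOC", "MONEY", "PER"


def score_entities(predicted, gold):
    """Return (precision, recall, f1), counting exact matches on (text, label)."""
    pred_set, gold_set = set(predicted), set(gold)
    true_pos = len(pred_set & gold_set)
    precision = true_pos / len(pred_set) if pred_set else 0.0
    recall = true_pos / len(gold_set) if gold_set else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hand-labeled entities for one document (made up for illustration).
gold = [
    Entity("Gendarmería Nacional", "ORG"),
    Entity("Buenos Aires", "LOC"),
    Entity("$ 1.500.000", "MONEY"),
]

# Placeholder outputs standing in for the two models under test.
model_a = [
    Entity("Gendarmería Nacional", "ORG"),
    Entity("Buenos Aires", "LOC"),
]
model_b = [
    Entity("Gendarmería Nacional", "LOC"),  # right span, wrong label
    Entity("Buenos Aires", "LOC"),
    Entity("$ 1.500.000", "MONEY"),
    Entity("Boletín", "PER"),  # spurious entity
]

for name, preds in [("model_a", model_a), ("model_b", model_b)]:
    p, r, f = score_entities(preds, gold)
    print(f"{name}: precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```

Exact matching is strict: an entity with the right text but the wrong label counts as a miss. A more lenient comparison could match on text alone or allow partial overlaps.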

How we tested the tool

To build a tool and test the different technologies already developed, we had to pick a set of documents to analyze. We decided to focus on tenders, contracts, and purchases from Argentina’s Ministry of Security and its five forces (Gendarmería Nacional, Policía Federal Argentina, Servicio Penitenciario, Prefectura Naval, and Policía de Seguridad Aeroportuaria), since all the partner organizations said that this type of information was very hard to access in their countries.

We ended up scraping 10,133 documents dated between 2014 and September 2021 from Argentina’s official gazette, the Boletín Oficial de la República Argentina (BORA), in which the Argentine government publishes enacted legal rules and other acts of the legislative, executive, and judicial branches.

This gazette includes a section where almost all tenders and contracts are published, but as unstructured documents. We chose it because, although there is an official platform where contracts are published in an open format, not all of them can be found there. Once we finished scraping the documents, we uploaded them into DocumentCloud to run SideKick. If you want to learn how to start using it, click here.

We had to start by labeling some of the documents in order to train SideKick. In this case, the labels described the type of product that the Ministry was buying. Then, we ran SideKick’s learning algorithm to score all of the documents that had not been explicitly labeled. We ran it almost 10 times to boost its performance. At the same time, we started testing two Named Entity Recognition models on a sample from the same document set used for SideKick. At this link, you can learn about the two models and how we did this comparison.
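
The label, train, and score cycle described above can be sketched in a few lines of Python. This is only an illustration of the general workflow, not SideKick’s actual algorithm: it swaps in a very simple word-overlap (cosine similarity) classifier, and the document texts, document IDs, and label names are made up. It also assumes that newly confirmed labels are added between runs, which is one plausible reading of running the tool several times.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens (accents preserved)."""
    return re.findall(r"\w+", text.lower())


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def train(labeled):
    """Build one word-count profile per label from (text, label) pairs."""
    profiles = {}
    for text, label in labeled:
        profiles.setdefault(label, Counter()).update(tokenize(text))
    return profiles


def best_label(profiles, text):
    """Return the (label, score) pair most similar to the text."""
    vec = Counter(tokenize(text))
    scored = ((label, cosine(vec, prof)) for label, prof in profiles.items())
    return max(scored, key=lambda pair: pair[1])


# Made-up example texts and labels, for illustration only.
labeled = [
    ("Adquisición de vehículos patrulleros y camionetas", "vehicles"),
    ("Contratación del servicio de limpieza de edificios", "cleaning"),
]
unlabeled = {
    "doc3": "Licitación para la compra de camionetas y vehículos",
    "doc4": "Servicio de limpieza integral de oficinas",
}

# Round 1: train on the labeled documents, then score everything else.
profiles = train(labeled)
predictions = {doc_id: best_label(profiles, text) for doc_id, text in unlabeled.items()}
for doc_id, (label, score) in predictions.items():
    print(f"{doc_id}: {label} ({score:.2f})")

# Next round: a reviewer confirms doc3's label, it joins the training set,
# and the model is retrained on the larger set.
labeled.append((unlabeled.pop("doc3"), "vehicles"))
profiles = train(labeled)
```

Each round gives the classifier more labeled examples to learn from, which is why repeating the cycle tends to improve the scores on the documents that remain unlabeled.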

This project is part of the 2021 JournalismAI Collab Challenges, a global initiative that brings together media organizations to explore innovative solutions to improve journalism via the use of AI technologies.

It was developed as part of the Americas cohort of the Collab Challenges that focused on “How might we use AI technologies to innovate newsgathering and investigative reporting techniques?” with the support of the Knight Lab at Northwestern University.

JournalismAI is a project of Polis – the journalism think-tank at the London School of Economics and Political Science – and it’s sponsored by the Google News Initiative. If you want to know more about the Collab Challenges and other JournalismAI activities, sign up for the newsletter or get in touch with the team via hello@journalismai.info.

Header image via Shutterstock under commercial license.