Beyond automation: How to develop AI and NLP tools that help journalists in the News Verification process? – An AI technologist’s perspective
By Dr Sondess Missaoui – 28 February 2020
In the process of creating a news story, journalists explore different sources to search for and verify information. In this context, artificial intelligence (AI) – and in particular, Natural Language Processing (NLP) – has the potential to transform the way journalists gather information, make decisions and deliver stories.
However, within the past decade, technologists designing AI have been mainly focusing on automation of news writing or reporting (Linden, 2017). But if automated news writing felt natural for them, developing tools for news verification requires an entirely new skill set.
The question then arises: as technologists, how can we develop AI-driven applications that help innovate journalism practices? What software should we develop in order to deliver more accurate, capable and efficient systems for news verification?
Not surprisingly, our research in the DMINR project demonstrated that investigative reporting practices, including news verification, are not anywhere close to being automated by AI technologies (Gutierrez-Lopez et al., 2019; Missaoui et al., 2019 (a)). Journalism, and in particular news verification, is a creative process that involves a great deal of human expertise and judgement. Given that, AI can only be another tool in the journalist’s toolbox, one that innovates journalism rather than automating it.
Our participatory design approach in DMINR, which actively involved a wide range of journalists, allowed us as technologists to identify crucial elements in the development and deployment of AI-driven solutions in journalism.
Below, I’ll briefly explain the key takeaways from our studies and research:
- First and foremost, upholding journalism standards and navigating the ethics of news story creation are mandatory for developing AI for news verification.
- Bridging both knowledge and communication gaps between the technologists designing AI and journalists. An investment in training editors and reporters is crucial to allow them to understand the concepts behind the technology. Equally, training technologists to understand journalism practices and values is mandatory, so that they can build AI-enabled tools that are efficient but also ethical.
- Bringing in experts’ perspectives at all stages of the AI pipeline through human-centred AI and participatory design approaches.
- Developing transparent and explainable systems that are accountable to both technologists and journalists.
These were crucial considerations for us in developing the DMINR system. One of its main components is the NLP tool, a concrete example of how technology can free up more time for knowledge workers, truth seekers and journalists to think creatively about solving problems and finding their story angles.
The DMINR NLP tool is designed for Named Entity (NE) recognition, ranking and linking. The tool has been deployed based on three algorithms, all of which aim to give the journalist signals from within textual data about concepts and artefacts that s/he may want to investigate further.
By extracting ‘focused’ Named Entities and ranking them based on their relevance, the tool helps investigators become aware of the trends and topics within their data. Moreover, a third algorithm is deployed to link the focused NEs to useful external sources in order to give background and contextual information about them. For more details about this component, read our research paper (Missaoui et al., 2019 (b)), which will be available soon as part of the Proceedings of the Twenty-eighth Text REtrieval Conference, TREC 2019.
Another component of the DMINR system that embeds all the principles discussed above is DeepConnect: an explainable deep learning algorithm that helps discover meaningful connections.
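To make the three steps (recognition, ranking, linking) more concrete, here is a minimal Python sketch. It is not the DMINR algorithm: the capitalised-word heuristic, the use of raw frequency as a stand-in for relevance, and the function names `rank_entities` and `link_entity` are all assumptions made purely for illustration.

```python
import re
from collections import Counter
from urllib.parse import quote

# Assumption: runs of capitalised words are treated as candidate entities.
# A real system would use a trained NER model instead of this heuristic.
CANDIDATE = re.compile(r"\b[A-Z][a-z]+(?:\s+[A-Z][a-z]+)*\b")

# Common sentence-initial words that the heuristic would otherwise pick up.
STOPWORDS = frozenset({"The", "A", "An", "He", "She", "It", "This", "That"})


def rank_entities(documents):
    """Return (entity, count) pairs, most frequent first.

    Frequency is only a crude proxy for relevance.
    """
    counts = Counter()
    for text in documents:
        for candidate in CANDIDATE.findall(text):
            if candidate not in STOPWORDS:
                counts[candidate] += 1
    return counts.most_common()


def link_entity(name):
    """Build a Wikipedia search URL as one possible external source."""
    return "https://en.wikipedia.org/w/index.php?search=" + quote(name)


documents = [
    "Declan owns a company in Galway.",
    "Paul checked the court records in Galway.",
]
for entity, count in rank_entities(documents):
    print(entity, count, link_entity(entity))
```

The output is a ranked list that a journalist could scan for names worth following up, with each name pointing to background material. The decision about what matters stays with the journalist.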
We will soon be sharing more details about this innovative tool and making it accessible for usability testing with journalists. Keep an eye out for it 😉
Missaoui et al., 2019 (a) :
S. Missaoui, M. Gutierrez-Lopez, A. MacFarlane, S. Makri, C. Porlezza, G. Cooper. (2019). How to Blend Journalistic Expertise with Artificial Intelligence for Research and Verifying News Stories. Where is the Human? Bridging the Gap between AI and HCI workshop. CHI 2019.
Missaoui et al., 2019 (b) :
S. Missaoui, A. MacFarlane, S. Makri, M. Gutierrez-Lopez (2019). DMINR at TREC News Track 2019. Proceedings of the Twenty-eighth Text REtrieval Conference, Gaithersburg, MD, USA, November, TREC 2019.
Gutierrez-Lopez et al., 2019 :
M. Gutierrez-Lopez, S. Missaoui, S. Makri, A. MacFarlane, C. Porlezza, G. Cooper. (2019). Journalists as Design Partners for AI. Workshop for accurate, impartial and transparent journalism: challenges and solutions. CHI 2019.
Conquering automation anxiety
By Dr Glenda Cooper – 21 November 2019
‘The Rise of the Robot Reporter’ was the frightening headline in the New York Times earlier this year, when it noted that a third of content from Bloomberg News was generated by the company’s automated system Cyborg. Frightening at least for journalists, many of whom have long feared that they would end up on the scrap heap, replaced by an algorithm doing their job – what has been dubbed the new ‘automation anxiety’ (Akst, 2013).
If there was any doubt that more newsrooms have to embrace artificial intelligence, however, there was a warning this week for those who have backed away from such practice. A new report from the LSE found that only 37 percent of newsrooms have a dedicated AI strategy, and that unless they embrace this within five years, such organisations face becoming irrelevant. The report also found that of the 71 organisations in 32 countries surveyed, half used AI for newsgathering, two thirds for production and half for distribution.
Unsurprisingly much of the focus for journalists has been about how their working practices will be affected – will they be done out of a job by the newswriting equivalent of R2D2? But, given AI is going to play an increasingly important role in news – and the LSE report is clear that this will happen – what about the ethical questions that this raises?
One of the key approaches for us here at the DMINR project has been to involve journalists from the start, so that we can design a tool that will be useful for them. We have gone into several major newsrooms to talk to journalists and observe how they work, the ways they use AI and their concerns around it. But we also need to ensure that our tool and others mean they can use AI in an ethical way.
While one of the first ethical issues named in the LSE report is concern by editorial staff as to whether AI is just a way to save money, there are many other issues. While bias in journalism has been widely debated (see for example, the work of Michael Schudson), what about bias in algorithms? This can range from technical problems with data inputting to algorithms reflecting all too human biases on race and gender, as exposed by the former ProPublica journalist Julia Angwin who found that a programme used by the US criminal justice system to predict whether defendants were likely to reoffend, was actually discriminating against black people. Concerns around filter bubbles and AI, confirmation bias and even generating deep fakes have also been raised. Of course this is nothing new as a detailed piece from Columbia Journalism Review about the 1964 World’s Fair in New York this week makes clear.
Conversely, however, AI has been championed as a way of producing more ethical journalism. Could AI help uncover connections that would otherwise have been missed? This is certainly one of the aims we have in creating our tool, in a world where there is so much information that it can be difficult to process. And could the debate about the biases of AI mean that newsrooms themselves have to be much more transparent and open with the audience in explaining what they do?
Jeff Jarvis coined the term ‘process journalism’, as opposed to ‘product journalism’, a decade ago – the changing culture of news reporting in which journalists no longer produce stories in distant newsrooms, but learn to engage and update their stories through interactions with their audiences. By being open about AI and its use, can trust in the media – which is low in the UK – be restored?
It’s true that most work has focused on the potential problems around AI rather than the upsides. This makes it even more important for researchers and developers in AI to interact with journalists from the start to ensure that these ethical issues are tackled. As one respondent to the LSE report said: “The biggest mistake I’ve seen in the past is treating the integration of tech into a social setting as a simple IT-question. In reality it’s a complex social process.”
And for those journalists who still suffer automation anxiety, there is some optimism. As Carl Gustav Linden wrote in an article for Digital Journalism in 2015, perhaps the question we should be asking is: why, after decades of automation, are there still so many jobs in journalism? The answer: despite 40 years of automation, journalism as a creative industry has shown resilience, and a strong capacity for adaptation and mitigation of new technology. A journalist R2D2 is still a long way off replacing a human reporter.
What makes a good news story?
By Dr Stephann Makri – 21 October 2019
What makes a good news story? From a news reader’s perspective, perhaps it’s a mixture of things – how interesting or informative it is, how credible it appears to be (in a world where the importance of ‘neutrality’ has been replaced by the need for transparency about the purpose and motives of news), how impactful it is (in terms of changing the reader’s way of thinking about an issue). And the list goes on.
But what makes a good news story from a journalist’s perspective? As part of the DMINR project, funded by the Google News Initiative, we have been asking journalists this question. A ‘newsworthy’ story, for them, is important and timely and accurate, sure. But the best stories reflect ‘creative’ qualities that can further a debate, bring previously under-discussed or little-considered aspects of an issue to the fore or facilitate those shifts in perspectives readers might be hoping to experience (or hoping not to experience) from consuming information.
Given the rise of clickbait and robo-journalism, it is perhaps unsurprising that the creative aspects of journalism are not shining through as strongly as the potential to magnify information exposure (at the cost of quality). Ironically, perhaps it is the value of human journalists themselves that needs to be re-emphasised in our new information landscape – where social media is considered by many to be just as important as traditional news outlets, if not more so, and where, as a consequence, mis- and disinformation are rife. It is the knowledge and skill that journalists hold – about what will further or change the conversation rather than just continuing it – that should be recognised as immensely important for making sure the news industry continues to write good news and the tech industry provides human-centred solutions to address industry problems such as ‘fake news.’ Rather than AI trying to encapsulate or mimic human knowledge and skill, AI systems should respect and help augment these valuable aspects of what makes skilled professionals skilled.
‘Artificial intelligence’ is not and may never be truly intelligent, at least not to the extent that AI systems are able to perform creative craft skills as well as human journalists can. But it is truly artificial – in the sense that it does not have the same sensitivities around context and bias that a truly ‘intelligent’ system should. By recognising the limitations of AI and, crucially, the strengths of human actors – such as the skill involved in finding a ‘newsworthy’ story angle – we can design future technologies that skilfully blend artificial ‘intelligence’ and human intelligence.
How might we be able to make this blend? Perhaps by leveraging each partner – computer and human – to do what it does best; the computer can crunch vast amounts of data to showcase potential patterns and anomalies. The human can make meaning from that data – using their past experience, judgement and intuition to go on productive ‘fishing expeditions,’ during which they harness their creative instincts to explore the data sources surfaced by the machine, serendipitously discover other sources and make meaningful ‘newsworthy’ connections between them. It is this type of blend between human and machine ‘intelligence’ we hope to make as part of the DMINR project, to empower journalists to keep giving ‘good news’.
Verification and Information Retrieval
By Dr Andrew MacFarlane – 18 April 2019
Journalists are information workers, i.e. an essential part of their job is handling information in various ways. This includes finding, extracting, synthesising and verifying information etc. [1,2]. All these activities and more are very important, but one which particularly stands out is the verification of information. Journalists need to be sure that the source material they use for a story is reliable and trustworthy. There are many examples of stories being published without proper verification being carried out, including the notorious case of Hitler’s diaries (http://tinyurl.com/y6p5aqgp), where a fraudster tricked both journalists and academics into publishing fake information about Adolf Hitler based on forged diaries. With a little more forensic work on the source material, it would have been clear to any journalist that the source material was faked and not reliable or trustworthy (and actually a different news story!). In this blog post we review the process of verification from the information science academic literature, look at how search can support verification of different types, and highlight some tools which are currently available to journalists for the practice of verification. Given this, we provide an overview of the goal for the tool being built as part of the DMINR project.
Verification and information seeking
Consider the following scenario:
Paul is an investigative journalist working for a small newspaper who has received a tip-off about a potentially corrupt Irish businessman (Declan), who owns several commercial property development companies in Europe. The businessman received an urban renewal grant from the local government in London to re-build deteriorating buildings. However, the tip-off claims that no building work ever took place. This is a complex investigation, as it covers multiple jurisdictions and documents in multiple potential data sources, such as corporate accounts, local planning records and potentially other sources.
In order to write his story, Paul needs lots of information from various sources, and the tip-off marks the start of Paul’s information seeking process. We consider an information seeking scenario using Ellis’s model (see Figure 1).
Figure 1: Ellis’s Model
Paul starts with a search query on a web search engine such as Google (https://www.google.com/), which will bring back some initial information to examine. Typically he will browse through the results looking for Declan, perhaps picking up the corporate accounts and local planning records he requires. Each of these sources has its own search functionality. For example, he searches the corporate accounts available online in both Ireland and the U.K. and verifies that the businessman is actually connected to those accounts. He also searches the planning records for individuals and companies, looking for links between the accounts and planning records (chaining). No link can be found as there are no relevant planning records (verify). Perhaps Declan has taken money without delivering the appropriate building in London. Paul needs to check what has been proposed. He looks for local government funding for the project in London, again by searching for the company and individuals associated with the project using the search functionality available on that website. He finds a document which provides the information he needs on the project. This project document has an image of a plan for the building, which he saves for further use. The accounts and the costs stated in the project document (chaining) do not match (verify). Paul checks some sources he knows well (monitoring), which provide some evidence on Declan’s past behaviour in Ireland. He checks the Irish court records, and finds that Declan has been found guilty of breaking planning laws in Galway. Paul checks whether the court case relates to Declan (verify), and finds that it does. Having done this he returns to the image and checks it against others on Google (verify). The building plan is associated with a building which was due to be built in Ireland, but was only partially built (chaining). He further checks to see if there is a link between this plan and the court case (chaining), and finds that there is (verify).
He has a link between the building in Ireland and the proposed building in London. At this point, Paul has sufficient information to write his story, bringing everything together from the different sources, filtering out the information he needs (differentiating). He then extracts the information he needs for the story, and goes through the process of double checking all the information and data he has to ensure that it is accurate and correct (verifying). The process ends and Paul can now write and submit his copy to the editor.
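One way to picture the scenario is as a log in which each step is tagged with the stage named in the text. The Python sketch below is only an illustration of that idea. The class names `Stage`, `Step` and `InvestigationLog`, and the idea of flagging sources that have no verifying step, are assumptions and not part of Ellis’s model or of DMINR.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class Stage(Enum):
    # Only the stages named in the scenario are listed here.
    CHAINING = "chaining"
    MONITORING = "monitoring"
    DIFFERENTIATING = "differentiating"
    VERIFYING = "verifying"


@dataclass
class Step:
    stage: Stage
    source: str
    note: str


@dataclass
class InvestigationLog:
    steps: List[Step] = field(default_factory=list)

    def record(self, stage: Stage, source: str, note: str) -> None:
        self.steps.append(Step(stage, source, note))

    def sources_without_verification(self) -> List[str]:
        """List sources that were used but never had a verifying step."""
        verified = {s.source for s in self.steps if s.stage is Stage.VERIFYING}
        pending: List[str] = []
        for step in self.steps:
            if step.source not in verified and step.source not in pending:
                pending.append(step.source)
        return pending


log = InvestigationLog()
log.record(Stage.CHAINING, "planning records", "looked for links to the accounts")
log.record(Stage.VERIFYING, "planning records", "no relevant records found")
log.record(Stage.MONITORING, "court records", "checked Declan's past behaviour")
print(log.sources_without_verification())
```

A log like this also gives Paul something to show the editor when asked what was done to check each source.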
This information seeking scenario shows a potential journey for Paul through the information seeking process, but one aspect that crops up all the time is verifying information. Paul checks at each stage that the information he collects is accurate, and that any assertions made in his copy stand up to scrutiny. In the scenario he does a double check at the end, as he knows that the Editor will ask him what he has done to ensure the information used in the story is accurate, and is defensible in court – in case Declan sues the newspaper on the basis of the story published. Although the ideas of verification in this model were derived from studies of both physical and social scientists, it is clear that this process very much applies to other areas such as journalism. Let us now look at the type of tools that Paul could use in his work when verifying information for the story he is writing.
Search and verification
The seeking behaviour described in the scenario above requires the use of a wide variety of tools to find information and extract it to support Paul’s story. He is looking for a wide variety of different information types including records (e.g. company account data, personal details), documents (e.g. detailed textual descriptions of planning applications) and images (e.g. building layouts and plans). Let us look at some of the tools which are available to users searching for these types of sources in the U.K. and Ireland.
1. General Search tools
A key tool for journalists like Paul would be Google of course (https://www.google.com/), but there are other search engines which could be considered. Bing (https://www.bing.com/) is the other big player in the search engine market. If privacy is an issue, and the journalist is worried that interested parties are trying to find out about their investigation, they could use private search engines such as StartPage (https://www.startpage.com/) or DuckDuckGo (https://duckduckgo.com/). These services assert that they secure your searches by blocking advertising trackers (e.g. so that results you inspect from searches are not delivered to third parties) and keeping the history of searches private (e.g. your search terms are not provided to third parties). Specialist search engines can also come in very handy. For example, the Trip Database (https://www.tripdatabase.com/) focuses on health information. These vertical search engines can be very useful in particular contexts. Some major websites also have their own in-built search engine, which can be used to find specific information. For example, the UK government’s website (https://www.gov.uk) provides a prominent search box on the landing page, together with a range of topics for users to browse. Another type of search engine, not domain specific but context specific, is searching for people. Services such as Pipl (https://pipl.com/) provide this functionality.
2. Search for company details
One of the key bits of information Paul needs to find is on companies. In the UK, users can use the Companies House website (https://beta.companieshouse.gov.uk/) to search for company information including legal documents (e.g. articles of association), officers of the company (company directors and secretaries), and financial details such as charges (e.g. mortgages) and, if appropriate, insolvency details. The Companies Registration Office (CRO) Ireland (https://www.cro.ie/) appears to have much the same function as Companies House in the UK. The OpenCorporates database (https://opencorporates.com/) has a much broader scope than either Companies House or the CRO, but does appear to reuse much of what is available on such services.
3. Search for planning records and documents
Finding details about property and planning applications was another key bit of information for Paul’s story. In the UK, details of planning decisions or notices are available on the housing and local services page, where you can search for a case using the reference number (https://acp.planninginspectorate.gov.uk/) or use other criteria via an advanced search option (https://acp.planninginspectorate.gov.uk/CaseSearch.aspx). Also useful is the Land Registry page (https://www.gov.uk/government/organisations/land-registry), which allows the user to search for data such as house prices and property ownership information. The Irish Land Registry also allows various types of search, which are map based (https://www.landdirect.ie/pramap/).
4. Search for Government projects
Information on government funded projects can often be spread across a website (if available at all), and in the case of the UK the starting point for any search of this type will be the main website search or Google. For example, information on UK government projects can be found on the Infrastructure and Projects Authority website (https://www.gov.uk/government/organisations/infrastructure-and-projects-authority), and specific data on major projects is published as part of the transparency and accountability strategy, such as data on the major project portfolio (https://www.gov.uk/government/publications/ho-government-major-projects-portfolio-data-2018). Local government in London has a search on its landing page (https://www.londoncouncils.gov.uk/), with a similar look and feel to the main UK government website. The Greater London Authority also has a search on its website (https://www.london.gov.uk/), but it has a very different look and feel. The Irish Government has a similar web page to the UK government to facilitate these types of searches (https://www.gov.ie/en/).
5. Search for Court records
In the information seeking scenario, the need to access court records was identified. Again this is not yet available on one search service, but users can look for information on court judgements (https://www.judiciary.uk/judgments) and also for appeals in civil cases (https://casetracker.justice.gov.uk/search.jsp). The Irish Court Service has an advanced search on their website (http://www.courts.ie/courts.ie/Library3.nsf/advancedsearch?openform&l=en).
6. Image search
Paul used Google to examine the source of an image and matched it to some information of interest in another document. This is known as reverse image search, and is available via the Google image search engine (https://images.google.com/) by clicking on a ‘camera’ icon. Users can either paste the URL of the image or upload it. Other search engines such as Bing (https://www.bing.com/) provide much the same service, but provide a little more – in the case of Bing this is taking a picture directly and submitting it to find a match. Specialist search engines such as Tineye (https://www.tineye.com/) are also available and have a good reputation. This ‘Visual Search’ method has its restrictions – the images have to be identical for the most part and the engines do not handle changes in perspective – but it is useful for finding and checking specific images and their source.
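To see why near-identical images match while a changed view does not, consider the following minimal Python sketch of an ‘average hash’, one simple fingerprinting technique. It is an assumption chosen for illustration only; nothing here describes how Google, Bing or Tineye actually work. The images are tiny hand-written grids of grayscale values, and the function names and the `max_distance` threshold are hypothetical.

```python
def average_hash(pixels):
    """Fingerprint a grayscale image given as a 2D list of values (0-255).

    Each pixel becomes 1 if it is brighter than the image mean, else 0.
    """
    flat = [value for row in pixels for value in row]
    mean = sum(flat) / len(flat)
    return tuple(1 if value > mean else 0 for value in flat)


def hamming_distance(hash_a, hash_b):
    """Count the positions at which two equal-length fingerprints differ."""
    return sum(a != b for a, b in zip(hash_a, hash_b))


def looks_like_same_image(pixels_a, pixels_b, max_distance=1):
    """Treat two images as a match if their fingerprints are nearly equal."""
    distance = hamming_distance(average_hash(pixels_a), average_hash(pixels_b))
    return distance <= max_distance


original = [[10, 200], [220, 30]]
brighter = [[30, 220], [240, 50]]   # same picture, slightly brighter
mirrored = [[200, 10], [30, 220]]   # same picture, flipped left to right

print(looks_like_same_image(original, brighter))  # True
print(looks_like_same_image(original, mirrored))  # False
```

A uniform brightness change leaves the fingerprint untouched, whereas flipping the image changes every bit. The same weakness applies, in a more gradual way, to photographs taken from a different angle.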
7. Video search
Not in the scenario above, but of potential use to investigative journalists, is the ability to search for information in videos. Consider the BBC story Anatomy of a Killing (https://www.youtube.com/watch?v=4G9S-eoLgX4). In this clip, a mountain range shown in the video is matched against Google Earth using geographical and topological information. Other objects such as trees and buildings are matched in the same way. This had to be done manually, which is undoubtedly time consuming. If we could build functionality to extract information from videos automatically, that would be of significant help to investigative journalists. However, one bit of functionality is already available. In the past few years, due to the ‘Deep Learning’ revolution, image identification and classification has become increasingly accurate, and regular objects can be identified more readily [3]. This type of technology would be useful for identifying objects such as guns, which played an important part in developing the story above. Organisations such as Bellingcat (https://www.bellingcat.com/) do a lot of work on this type of problem.
Across all of the websites above, Paul uses a wide variety of different sources and interfaces to obtain the information for his story. It will be a very time consuming activity, as he needs to understand the search functionality of a lot of services and get used to different ways of displaying information. It would be useful for Paul to have a ‘one stop shop’ search system, with a consistent way of searching and a consistent user experience. This is the rationale for DMINR – a tool to aggregate information from a wide variety of sources, and to help Paul in various parts of his information seeking behaviour, particularly chaining and verification, which played such an important role in the scenario. Key also is to ensure that Paul is aware of what the system is doing – its working is transparent to him – and that he has control over his search at all times. We are currently working on the different aspects, including blending AI with journalists’ experience to verify news stories [4] and how to engage with journalists in the design process [5].
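The ‘one stop shop’ idea can be sketched as a thin layer that sends one query to several sources and returns the hits in a single, consistent shape. The Python sketch below is a hypothetical illustration and not the DMINR implementation: the `Result` record, the `aggregate` function and the stub sources (with placeholder example.org URLs) are all assumptions. Each hit keeps the name of the source it came from, so that the user can always see where a result originated.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Result:
    source: str  # which service produced the hit, kept for transparency
    title: str
    url: str


# A source is any function that takes a query and returns a list of Results.
SearchFn = Callable[[str], List[Result]]


def aggregate(query: str, sources: Dict[str, SearchFn]) -> List[Result]:
    """Send one query to every source and merge the hits into one list."""
    merged: List[Result] = []
    for name, search in sources.items():
        try:
            merged.extend(search(query))
        except Exception as error:
            # One failing service should not hide results from the others.
            print(f"{name} failed: {error}")
    return merged


# Stub sources standing in for real services (placeholder data only).
def company_stub(query: str) -> List[Result]:
    return [Result("companies", f"Filing mentioning {query}",
                   "https://example.org/companies/1")]


def planning_stub(query: str) -> List[Result]:
    return []  # mirrors the scenario: no relevant planning records


hits = aggregate("Declan", {"companies": company_stub, "planning": planning_stub})
for hit in hits:
    print(hit.source, hit.title, hit.url)
```

Because every source is just a function behind the same interface, adding a court records or land registry search would not change how Paul issues a query or reads the results.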
Notes and References
[1] David Ellis, Deborah Cox and Katherine Hall. 1993. A comparison of the information seeking patterns of researchers in the physical and social sciences. Journal of Documentation, Vol. 49 Issue: 4. pp. 356-369. URL (https://doi.org/10.1108/eb026919)
[2] David Ellis. 1989. A behavioural approach to information retrieval systems design. Journal of Documentation, Vol. 45 Issue: 3. pp. 171-212. URL (https://doi.org/10.1108/eb026843)
[3] Andrej Karpathy and Li Fei-Fei. 2015. Deep visual-semantic alignments for generating image descriptions. In: Proceedings of CVPR 2015, Boston, 7-12 June 2015, edited by Kristen Grauman, Eric Learned-Miller, Antonio Torralba, and Andrew Zisserman. URL (http://cs.stanford.edu/people/karpathy/deepimagesent/)
[4] Sondess Missaoui, Marisela Gutierrez-Lopez, Andrew MacFarlane, Stephann Makri, Colin Porlezza, Glenda Cooper. 2019. How to Blend Journalistic Expertise with Artificial Intelligence for Research and Verifying News Stories. Where is the Human? Bridging the Gap between AI and HCI workshop. CHI 2019.
[5] Marisela Gutierrez-Lopez, Sondess Missaoui, Stephann Makri, Andrew MacFarlane, Colin Porlezza, Glenda Cooper. 2019. Journalists as Design Partners for AI. Workshop for accurate, impartial and transparent journalism: challenges and solutions. CHI 2019.