For Watson, ingesting all 2.5 million unstructured documents is the easy part. It would extract references to real-world entities, like corporations and people, and start looking for relationships between them, essentially building up context around each entity. These entities could be linked to open entity databases like Freebase to provide even more context. A journalist might orient the system’s “attention” by indicating which politicians or tax-dodging tycoons are of most interest. Other texts, like relevant legal codes in the target jurisdiction or news reports mentioning the entities of interest, could also be ingested and parsed.
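Watson’s actual pipeline isn’t public, but a minimal sketch of that extraction step might look like the following, here assuming spaCy for named-entity recognition and networkx for the relationship graph. The two sample documents and all entity names are invented for illustration.

```python
# Sketch: extract person/org entities from documents and accumulate a
# co-occurrence graph -- the crude "context" around each entity.
import itertools

import networkx as nx
import spacy

nlp = spacy.load("en_core_web_sm")  # small English pipeline with an NER component

# Stand-ins for the 2.5 million leaked documents.
documents = [
    "Acme Holdings, controlled by Jane Doe, opened an account in the Caymans.",
    "Jane Doe is married to John Roe, an executive at Globex Corp.",
]

graph = nx.Graph()
for doc_id, text in enumerate(documents):
    doc = nlp(text)
    # Keep only the entity types likely to matter here: people and organizations.
    entities = {ent.text for ent in doc.ents if ent.label_ in {"PERSON", "ORG"}}
    # Every pair of entities co-occurring in a document becomes
    # (or strengthens) an edge in the relationship graph.
    for a, b in itertools.combinations(sorted(entities), 2):
        if graph.has_edge(a, b):
            graph[a][b]["weight"] += 1
            graph[a][b]["docs"].append(doc_id)
        else:
            graph.add_edge(a, b, weight=1, docs=[doc_id])

for a, b, data in graph.edges(data=True):
    print(f"{a} -- {b} (co-occurrences: {data['weight']}, docs: {data['docs']})")
```

Edges with high weights, or nodes matching the journalist’s watchlist, would be the natural places to attach the extra context pulled from Freebase or the news reports.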
Watson would then draw on its domain-adapted logic to accumulate evidence, with rules like “IF corporation A is associated with offshore tax-free account B, AND the owner of corporation A is married to an executive of corporation C, THEN add a small weight of evidence for tax evasion by corporation C.” There would be many such rules, perhaps hundreds, probably written by the journalists themselves to help the system identify meaningful and newsworthy relationships. Others might be drawn from commonsense-reasoning databases, like MIT’s ConceptNet. At the end of the day (or, more likely, just a few seconds later), Watson would spit out 100 leads for reporters to follow. Their first step would be to peer behind those leads to see the relevant evidence, rate its accuracy, and further train the algorithm. Sure, those follow-ups might still take months, but it wouldn’t be hard to beat the 15 months the ICIJ took in its investigation.
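To make the rule-firing concrete, here is a hedged sketch of how one such hand-written rule might score entities over a set of extracted facts. The facts, the rule, and the 0.1 weight are all illustrative inventions, not Watson’s actual machinery; the point is only that each matching pattern nudges an entity’s evidence total upward.

```python
# Sketch: a single hand-written rule adding a small weight of evidence
# to an entity whenever its pattern matches known facts.
from collections import defaultdict

# Facts extracted upstream, as (subject, relation, object) triples.
facts = {
    ("Acme Holdings", "has_account", "Cayman Account 17"),
    ("Cayman Account 17", "is", "offshore_tax_free"),
    ("Jane Doe", "owns", "Acme Holdings"),
    ("Jane Doe", "married_to", "John Roe"),
    ("John Roe", "executive_of", "Globex Corp"),
}

evidence = defaultdict(float)

def rule_spousal_link(facts, evidence):
    """IF corp A holds an offshore tax-free account, AND A's owner is
    married to an executive of corp C, THEN add a small weight of
    evidence for tax evasion by C."""
    for corp_a, rel, account in facts:
        if rel != "has_account" or (account, "is", "offshore_tax_free") not in facts:
            continue
        for owner, rel2, corp in facts:
            if rel2 != "owns" or corp != corp_a:
                continue
            for person, rel3, spouse in facts:
                if rel3 != "married_to" or person != owner:
                    continue
                for exec_, rel4, corp_c in facts:
                    if rel4 == "executive_of" and exec_ == spouse:
                        evidence[corp_c] += 0.1  # the "small weight" of suspicion

rule_spousal_link(facts, evidence)

# Rank the leads: entities with the most accumulated evidence come first.
for entity, score in sorted(evidence.items(), key=lambda kv: -kv[1]):
    print(f"{entity}: {score:.2f}")
```

With hundreds of rules firing over millions of triples, the ranked list that falls out of this loop is, in effect, the “100 leads” handed to reporters.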