Can American Intelligence Leverage the Data-Mining Revolution?

Will the United States be left behind?

On Friday, Josh Kerbel, the Chief Analytic Methodologist for the Defense Intelligence Agency, said in no uncertain terms that the U.S. intelligence community (IC) needs to change to keep up with its changing environment. Its historical fascination with collecting secrets is dangerously outmoded, he claims, given the overwhelming availability of unguarded information on the internet. He is absolutely correct. That said, the abundance of useful, unclassified data hardly simplifies the task before the intelligence community; such data is simply too abundant for the IC’s current capabilities. To address this problem, data mining has become a much-discussed keystone of the future intelligence system, and the current administration is working to capitalize on cross-sector research in data mining.

Part of the rush of interest in data mining stems from changes in the state of the art that vastly increase the technology’s potential. In previous years, data-mining tools analyzed massive data sets looking for connections, especially connections that tie together social networks like terrorist organizations. This style of data analytics underpins widely used national-security data-management programs like Palantir and the Army’s Distributed Common Ground System, and will no doubt remain a critical tool for national security.

Nonetheless, change is coming; government officials have clearly signaled a movement towards “non-predicated, or pattern-based, searches – using data to find patterns that reveal new insights.” While current tools have the capacity to connect a user’s query with points of information pulled from otherwise uselessly large and complex data sets, future tools will be able to generate genuinely novel intelligence based on patterns in massive data sets uncovered by computerized statistical modeling.

While this emerging style of data mining is in its formative stages, technologists and policy planners alike should keep in mind that developing and deploying powerful data-mining technologies for national security, in particular, raises a number of distinctive problems:

1. The Black Box Problem

Intelligence analysts need to know and report precisely how they arrived at a conclusion; unfortunately, data-mining solutions often lend themselves to a black box-style calculation. That is to say, the user adds data to a program, and the computer spits out a response. All the machine learning, pattern identification, statistical modeling, and extrapolation needed to generate a conclusion passes unobserved. This degree of opacity in a system may work in some sectors where the conclusions are experimental, not intended to be actionable, or where many different programs are run in parallel to verify conclusions by consensus, but opacity is tricky in intelligence. That is not to say that the intelligence community cannot run multiple competing programs or use data mining for nonactionable conclusions. But, by and large, explicitly knowing the source of and rationale for a conclusion is critical for the creation of an intelligence product.

There are both technical and systematic means of addressing this problem. Technologically, a data-mining tool could be designed to generate an audit trail as it executes its program. If the program itself can provide an accounting of its data, statistical model, and conclusions that is comprehensible to the analysts and subject-matter experts using the program (as opposed to the programmers who created it), the user would have a window into the black box through which to monitor its conclusions.
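To make the audit-trail idea concrete, here is a minimal sketch in Python. It is an illustration only, not a description of any tool the IC uses: the function name `run_with_audit`, the toy z-score model, and the shape of the log entries are all assumptions chosen for brevity. The point is simply that the program records its data, its model, and each conclusion in plain terms as it runs.

```python
import statistics
from datetime import datetime, timezone


def run_with_audit(values, z_threshold=2.0):
    """Flag outliers by z-score and record every step in a readable audit trail.

    Hypothetical example: a real data-mining model would be far more complex,
    but the logging pattern would be the same.
    """
    trail = []

    def log(step, **details):
        # Each entry states what was done and with which numbers.
        trail.append({
            "time": datetime.now(timezone.utc).isoformat(),
            "step": step,
            **details,
        })

    # 1. Account for the data that went in.
    log("input", count=len(values), values=list(values))

    # 2. Account for the statistical model and its parameters.
    mean = statistics.mean(values)
    spread = statistics.pstdev(values)
    log("model", kind="z-score", mean=mean, std_dev=spread,
        z_threshold=z_threshold)

    # 3. Account for each individual conclusion and the evidence behind it.
    flagged = []
    for index, value in enumerate(values):
        z = (value - mean) / spread if spread else 0.0
        if abs(z) > z_threshold:
            flagged.append(index)
            log("flag", index=index, value=value, z_score=round(z, 3))

    log("conclusion", flagged_indices=flagged)
    return flagged, trail
```

An analyst reading the returned trail can see which inputs were used, which threshold drove the result, and why each item was flagged, without reading the source code.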

Systematically, a program could be designed to keep the human in the loop, involving the intelligence analyst in the execution of the program itself. By remaining engaged with the tool, for example by confirming the model or selecting the data for use, the user gains some insight into how the program functions (although this involvement also greatly increases the odds of the user unconsciously biasing the outcome). The black box problem is not insurmountable, but the IC would do well to keep it at the forefront as the technology develops.

2. Anonymizing and Encrypting Data

To build a strong statistical model, a data-mining program needs a large dataset. These datasets often describe individuals and their personal information, and that information must be safeguarded. Anonymizing and encrypting the data are two favored methods of doing so. For example, if a healthcare professional were doing research on a dataset that listed tens of thousands of patients along with information on their health and lifestyle, the researcher might sanitize the dataset by removing identifying information like social security numbers, names, and zip codes. The researcher might then encrypt the database and calculations to prevent outsiders from gaining access to the data.
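The sanitizing step in the healthcare example might look something like the following Python sketch. The field names (`name`, `ssn`, `zip_code`) and the `sanitize` helper are hypothetical, and the keyed pseudonym is an addition not described above: it replaces the social security number with a stable, opaque ID so that records for the same patient can still be linked. Encryption of the database is not shown here, since the Python standard library does not include a symmetric cipher.

```python
import hashlib
import hmac

# Hypothetical field names for the identifying columns to strip.
IDENTIFYING_FIELDS = {"name", "ssn", "zip_code"}


def sanitize(record, secret_key):
    """Return a copy of `record` without identifying fields, plus a pseudonym."""
    # Drop the direct identifiers and keep the health and lifestyle data.
    cleaned = {k: v for k, v in record.items() if k not in IDENTIFYING_FIELDS}

    # Keyed hash of the SSN: the same patient always maps to the same ID,
    # but the ID cannot be recomputed by someone who lacks the key.
    digest = hmac.new(secret_key, record["ssn"].encode("utf-8"), hashlib.sha256)
    cleaned["patient_id"] = digest.hexdigest()[:16]
    return cleaned
```

Note that removing direct identifiers in this way does not by itself guarantee anonymity, since the remaining fields may still single out an individual when combined.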