P2P

Fall20

Peer to Peer: ILTA's Quarterly Magazine

Issue link: https://epubs.iltanet.org/i/1293067


... have transformed how documents are reviewed in discovery. Predictive coding is a form of supervised machine learning that can help determine where the information users are seeking may appear in a dataset. To enable this, users simply look at a set of documents and provide "yes/no" labels to indicate whether each document is of interest. When enough documents are labeled, the algorithm behind the system generates a predictive model that scores documents and identifies potential documents of interest with great accuracy.

Since only a small set of examples is required, the technique is especially effective in dealing with large volumes of data. In discovery, for instance, a collection of several million documents may require only a few thousand exemplars to successfully identify the majority of the relevant documents. Similarly, whether law departments are investigating alternate models of service delivery or looking for information to retain talent, they can hunt down the needle in their haystack by examining only a fraction of their data.

Importantly, predictive coding naturally builds on the use of other searching and analytics techniques. A document could be evaluated because it hits on a keyword or is part of a cluster. As long as that document is labeled, it can become a part of the teaching set from which the predictive coding model can learn.

It is important to keep in mind that being able to locate relevant information with the help of data analytics does not necessarily equate to finding all instances. For example, imagine a case in which there are 1,000 documents of interest in my dataset. Say data analytics is able to locate 600 documents, of which 500 documents are actually of interest. Since more than 80% of the documents identified are of interest (500 of 600, or about 83%), data analytics is identifying relevant information efficiently. But if I cease my search efforts at that point, I will have identified only 50% of all documents of interest (500 of 1,000).
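The arithmetic in this example corresponds to two standard information retrieval measures: precision (the share of located documents that are of interest) and recall (the share of all documents of interest that were located). The short Python sketch below reproduces the numbers from the example. The function name `precision_recall` and its parameter names are illustrative assumptions, not part of any review platform.

```python
from typing import Tuple


def precision_recall(located: int, located_of_interest: int,
                     total_of_interest: int) -> Tuple[float, float]:
    """Return (precision, recall) as fractions between 0 and 1.

    Hypothetical helper for illustration only.
    """
    if located <= 0 or total_of_interest <= 0:
        raise ValueError("counts must be positive")
    if located_of_interest > min(located, total_of_interest):
        raise ValueError("located_of_interest cannot exceed the other counts")
    precision = located_of_interest / located          # how efficient the search is
    recall = located_of_interest / total_of_interest   # how complete the search is
    return precision, recall


# Numbers taken from the example in the text:
# 1,000 documents of interest exist, 600 are located, 500 of those are of interest.
precision, recall = precision_recall(located=600,
                                     located_of_interest=500,
                                     total_of_interest=1000)
print(f"precision: {precision:.1%}")  # precision: 83.3%
print(f"recall:    {recall:.1%}")     # recall:    50.0%
```

A high precision with a low recall, as here, is the situation the example warns about: the search is efficient, but half of the documents of interest remain unfound.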
This discrepancy is why validation of results is so critical, not only in data analytics but also for information retrieval in general. One frequently used validation method for quality assurance in discovery is called an elusion test. In this test, a random sample is drawn from the population believed not to contain any documents of interest. When the sample is evaluated, the content and the proportion of documents that are actually of interest can help users understand what and how much information is being left behind.

With the exponential growth of data at the organizational level, law departments have the opportunity to pose the question: what kind of questions can I answer with my data? Because of the unprecedented disruptions facing the law sector, it has become increasingly important for law departments to reexamine what they should automate, where they should allocate their resources and how they should hire. For law departments that want to understand their own data, the analytics tools in discovery offer them a way to quickly identify unique content, explore their data and predict what is relevant.

The benefits are clear. With the help of different data analytics (structured, conceptual and predictive), law departments can significantly reduce the amount of time and effort needed to derive insights from their data.

Dr. David Li currently heads the Linguistics, Analytics, & Data Science team at ProSearch. In addition to implementing the technology-assisted review solution for ProSearch, Li researches the innovative application of advanced searching methodologies such as natural language processing and machine learning. With a background in both social and biological science, Li received his Ph.D. in linguistics from the University of Southern California, M.Ed. from Harvard University and B.Sc. from Simon Fraser University.
