Peer to Peer Magazine

Summer 2019: Part 2

The quarterly publication of the International Legal Technology Association

Issue link: https://epubs.iltanet.org/i/1150262


…to some litigation purpose, whether in response to requests for production, exploring opposing party productions, or preparing for depositions or trial.

Natural Language Processing

Imagine the possibilities. Upon receipt, a request for production of documents is immediately and directly ingested into an eDiscovery tool. A natural language processing engine promptly differentiates the first twenty-seven pages of instructions and definitions from the remaining seventy-two compound, multi-part production requests. That same NLP engine then derives the substance of each request and autonomously reviews the three-million-document collection to locate all the responsive (and only the responsive) documents for production. Voila… document production done!

Unfortunately, we're not quite there yet, and getting there will take time, training, and testing. As discussed below, AI (and, therefore, NLP as an AI application) is not a no-cost panacea.

So, what is natural language processing? A computer doesn't read and doesn't speak; nor does a computer recognize language as such. To bridge the gap between human speech (whether written or verbal) and computer processes, NLP can roughly be characterized as an attempt to imbue what is, for the computer, a completely unstructured sequence of characters or sounds with a richer structure. That structure allows the computer to "process" the sequence in much the same way a human would understand it, with the ultimate goal of having the computer replicate the optimal human response to that sequence, whatever that response may be.

From that description, it should be obvious that NLP is not merely a single application or a single technology. Rather, NLP comprises a few dozen different application areas backed by hundreds of different AI technologies, machine learning models, rules-based approaches, and linguistic theories. Natural language processing applications run the gamut from simple to complex, from syntactic to semantic, and beyond.

At its most elemental level, NLP encompasses the ability to parse those unstructured character sequences into meaningful word boundaries, known as word segmentation or tokenization. At the next level, NLP technologies can be trained to derive some measure of linguistic organization from the parsed words. For example, phrase detection recognizes the inherent difference between the phrases "New York" and "endangered species" and their component words, and part-of-speech (PoS) tagging assesses the syntactic category of each word, recognizing, for example, that "food" is a noun and "eat" is a verb.

Even more complex NLP applications infer status and relationships from text. Named entity extraction will recognize "Chevy Chase" as a PERSON and "Chevy Chase, Maryland" as a LOCATION. Concept detection evaluates gross similarities in terminology to infer that the British "lorry" and the American "truck" reflect essentially the same concept. Sentiment analysis automatically assesses the emotional tenor of language. Co-reference resolution recognizes all expressions that refer to the same entity in a given text passage, for instance that "I" and "my" refer to the same person, who differs from the "Bill" and "he" mentioned in the same passage. Going one step further, textual entailment determines the directional relationship between text fragments, in other words whether the truth of one fragment follows implicitly from another. Each of these NLP applications is an important step toward the ultimate goal of natural language understanding, facilitating autonomous actions such as document summarization, question answering, and information extraction.
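To make those layers concrete, here is a minimal sketch of tokenization, part-of-speech tagging, phrase detection, and named entity extraction. The article does not name a specific toolkit; the sketch assumes Python with the open-source spaCy library and its small English model installed, and the example sentence is invented for illustration.

```python
# Illustrative sketch only: assumes `pip install spacy` and
# `python -m spacy download en_core_web_sm` have been run.
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Chevy Chase drove a truck to Chevy Chase, Maryland, to eat local food.")

# Word segmentation / tokenization: the character stream becomes discrete tokens.
print([token.text for token in doc])

# Part-of-speech tagging: "food" should come back as a NOUN and "eat" as a VERB.
print([(token.text, token.pos_) for token in doc])

# Phrase detection (noun chunks): multi-word units rather than isolated words.
print([chunk.text for chunk in doc.noun_chunks])

# Named entity extraction: ideally PERSON for the actor and GPE/LOC for the town.
print([(ent.text, ent.label_) for ent in doc.ents])
```

Whether an off-the-shelf model actually draws the PERSON-versus-LOCATION distinction for "Chevy Chase" is exactly the kind of question that, as discussed below, has to be answered by testing rather than presumed.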
Regardless of their intrinsic capabilities, however, the value of any particular NLP application to the eDiscovery process cannot be presumed. Rather, the question of whether natural language processing techniques are essential, or even beneficial, to eDiscovery can only be determined through proper and thorough testing, and that testing should continue as techniques and technologies change.

Concept detection is one of the best examples of the need to test the theory that an incremental improvement in NLP technology will necessarily improve eDiscovery results, as well as the need to continue testing as eDiscovery techniques evolve. Document clustering builds on concept detection, which was one of the earliest methods used to improve the efficiency of an eDiscovery process that, to that point, relied primarily on linear review techniques. With clustering, efficiencies derived from (1) increased review speed attributable to reviewing similar documents together; and (2) the reasonable ability to bulk code clusters of similar documents.
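As a rough illustration of the clustering workflow described above, the sketch below groups a handful of invented documents by textual similarity so that similar documents could be reviewed, and potentially bulk coded, together. The toy corpus, the TF-IDF representation, and the use of scikit-learn are assumptions made for this example, not details drawn from the article.

```python
# Illustrative sketch only: clustering a tiny, made-up "collection" by similarity.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

docs = [
    "Please review the attached merger agreement before Friday.",
    "Outside counsel sent comments on the merger agreement draft.",
    "Lunch order: two sandwiches and a salad for the team.",
    "Can someone pick up the sandwich order before lunch today?",
]

# Represent each document as a TF-IDF term vector.
vectors = TfidfVectorizer().fit_transform(docs)

# Group similar documents; a reviewer would then work cluster by cluster.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(vectors)
for label, doc in sorted(zip(labels, docs)):
    print(label, doc)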
Ultimately, technology-assisted review techniques surpassed clustering in terms of efficiency, and the recognition that both techniques were more efficient than linear review in their own right led to the presumption that combining technology-assisted review techniques with NLP concept detection capabilities would exponentially improve review efficiency. To that end, several concept detection techniques have been used over the years to develop "concept" features that could be used with technology-assisted review algorithms: matrix factorization approaches like Latent Semantic Indexing (LSI); stochastic modeling approaches like probabilistic LSI; shallow neural network approaches such as Google's Word2Vec; and hierarchical approaches like Facebook's fastText. All of these are forms of dimensionality reduction.

The presumption that an incremental increase in NLP capabilities would likewise improve technology-assisted review was put to the test in a study presented in 2017 at the International Conference on Artificial Intelligence and Law.¹ The study tested the efficiency of five separate
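The "concept feature" idea can be sketched in the same spirit: reduce term vectors to a small number of latent dimensions (truncated SVD over TF-IDF, the mechanics behind LSI) and hand those dimensions to a supervised classifier standing in for a technology-assisted review algorithm. This is a hypothetical, minimal pipeline for illustration; it is not one of the approaches evaluated in the 2017 study, and the documents and coding labels are invented.

```python
# Illustrative sketch only: LSI-style "concept" features feeding a simple classifier.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

train_docs = [
    "merger agreement draft attached for review",
    "outside counsel comments on the merger agreement",
    "lunch order for the team meeting",
    "please pick up the sandwich order",
]
train_labels = [1, 1, 0, 0]  # toy coding: 1 = responsive, 0 = not responsive

model = make_pipeline(
    TfidfVectorizer(),                             # terms -> weighted term vectors
    TruncatedSVD(n_components=2, random_state=0),  # dimensionality reduction (LSI-style concepts)
    LogisticRegression(),                          # stand-in for a TAR classifier
)
model.fit(train_docs, train_labels)

# Score an unreviewed document against the learned "concept" space.
print(model.predict_proba(["revised merger agreement attached"]))
```

Whether the extra dimensionality-reduction step actually improves a given review, rather than simply adding computation, is the empirical question that kind of study sets out to answer.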