Peer to Peer Magazine

... technology-assisted review machine learning algorithms both with and without concepts developed using LSI as features. Ultimately, the study found that using NLP concept detection does not improve, and typically impairs, the efficiency of technology-assisted review algorithms. The ICAIL study demonstrates that the value of more advanced NLP techniques in the eDiscovery process, regardless of their intrinsic capabilities and advancements over simpler NLP techniques, needs to be tested; it cannot be presumed.

And, as natural language processing techniques move in the direction of autonomous analysis, the implications of adequate training cannot be overlooked. While a number of NLP techniques rely on unsupervised machine learning (i.e., no human intervention), most advanced techniques require some form of supervised machine learning. In the simplest terms, that means someone needs to train the underlying algorithm(s) to accept, recognize, and manage inputs, and to derive appropriate results.

There are two dimensions to training an NLP approach to eDiscovery: one at a purely linguistic level, and a second at the litigation-specific level. Linguistic training can happen, and certainly has happened, outside the context of any particular litigation. For example, phrase detection, part-of-speech (PoS) tagging, named entity extraction, sentiment analysis, and even co-reference resolution typically do not depend significantly on the particulars of a litigation, and indeed those general capabilities already exist at some level. The results of these general NLP applications can easily be used as features in modern technology-assisted review tools, which then need to be trained to locate responsive materials.

But tailoring those general NLP applications to a specific litigation, and implementing advanced NLP techniques (such as question answering) to derive results, will undoubtedly require even more training. In the first instance, all of the purely linguistic NLP models will need to be updated to incorporate the specific components of the text in the litigation corpus. However, to the extent that the linguistic training is inconsistent with the idiosyncrasies of the litigation corpus, it may not be possible to easily conform the training, e.g., separating "truck" and "lorry" if concept detection has already conflated them into a single concept. In addition, for an NLP approach to succeed, the task model for any NLP technique employed to locate documents will need to be trained to properly associate the pertinent litigation queries (RFPs, for example) with responsive materials (text and documents) in the litigation corpus.

Ultimately, this all takes time. It takes time to test the underlying NLP models and to determine which models generate the best features for litigation purposes. And it will take time to develop and train the task models that will use those features and employ an NLP technique to find responsive (or otherwise positive) documents.

Image Recognition

Much the same can be said about image recognition, although effective image recognition capabilities will be even more difficult to implement than natural language processing. Image recognition in eDiscovery is the ability of a computer to effectively locate desired images by evaluating the characteristics, or features, of those images, rather than having to manually review image after image until all of the desired images have been found. Modern image recognition capabilities depend primarily on metadata extracted from the image file, ...
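
To make the article's point about the two dimensions of NLP training concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the method used in the ICAIL study or in any review tool. The concept map, the four training documents, and all function names are hypothetical, and a tiny perceptron stands in for whatever task model a real tool would use. A general linguistic layer that conflates "truck" and "lorry" feeds a litigation-specific task model, which is compared against the same model trained on raw tokens.

```python
from collections import Counter

# Hypothetical "general" linguistic layer, trained outside any litigation.
# It conflates synonyms into one concept, as in the truck/lorry example.
CONCEPTS = {"truck": "VEHICLE", "lorry": "VEHICLE"}

def token_features(text):
    """Raw lowercase tokens, with no concept conflation."""
    return Counter(text.lower().split())

def concept_features(text):
    """Tokens mapped through the general concept layer."""
    return Counter(CONCEPTS.get(tok, tok) for tok in text.lower().split())

def score(weights, bias, feats):
    """Linear score of a feature bag under the current weights."""
    return bias + sum(weights[f] * v for f, v in feats.items())

def train_perceptron(examples, featurize, epochs=20):
    """Litigation-specific task model: a tiny perceptron (an assumption)."""
    weights, bias = Counter(), 0
    for _ in range(epochs):
        for text, label in examples:  # label: 1 = responsive, 0 = not
            feats = featurize(text)
            predicted = 1 if score(weights, bias, feats) > 0 else 0
            if predicted != label:
                step = label - predicted  # +1 or -1
                for f, v in feats.items():
                    weights[f] += step * v
                bias += step
    return weights, bias

def predict(model, featurize, text):
    """Return 1 (responsive) or 0 (not responsive) for one document."""
    weights, bias = model
    return 1 if score(weights, bias, featurize(text)) > 0 else 0

# Made-up training set: in this hypothetical matter, only the "lorry"
# documents are responsive.
TRAINING = [
    ("lorry delivery schedule", 1),
    ("lorry invoice attached", 1),
    ("truck delivery schedule", 0),
    ("truck invoice attached", 0),
]

raw_model = train_perceptron(TRAINING, token_features)
concept_model = train_perceptron(TRAINING, concept_features)

for text, label in TRAINING:
    print(text, "| label:", label,
          "| raw tokens:", predict(raw_model, token_features, text),
          "| concepts:", predict(concept_model, concept_features, text))
```

Once the concept layer has merged the two words, "lorry delivery schedule" and "truck delivery schedule" produce identical features, so no amount of litigation-specific training on top of that layer can tell them apart. The model trained on raw tokens can. This is only a toy demonstration of why such features need to be tested rather than presumed helpful.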

Summer 2019: Part 2
