Peer to Peer: ILTA's Quarterly Magazine
Issue link: https://epubs.iltanet.org/i/1521210
Has the product been tested on actual data, and if so, what were the results?

Testing on Enron data can only go so far, especially since that data is likely already included in the training data for the generative AI models. Ensure that the solutions have been evaluated using data from real matters; otherwise, the results may be skewed. In addition, ask for quantitative metrics for the analysis. If the tool purports to classify documents for responsiveness, as in a first-pass review, the results should include recall and precision. Ask how those metrics were calculated: determine whether a statistically significant sample was used, and learn the richness of the data set, meaning the prevalence of responsive documents in it (a sketch of these calculations appears at the end of this section). If there was a prompt iteration process, find out how it worked and whether the Gen AI tool reviewed the sampled documents multiple times. It makes sense to iterate on the prompt criteria at the outset using a small number of documents, but once the criteria are finalized and the model is run on a more extensive set, the metrics should be calculated from the model's predictions alone, without further prompt iteration, to establish an accurate baseline of the model's performance. aiR for Review was tested on client data (with permission), yielding quantitative metrics calculated by data scientists. Those metrics were shared with the e-discovery community through articles and an academic paper.

How do you build trust in a product through due diligence?

In the past, AI has been a bit of a black box, and unraveling why it classified specific documents a certain way has been challenging. With Gen AI, there can be more transparency in the model's decisions. Determine whether the solution can indicate why a decision was made and what information it was based upon; then review that judgment and reasoning to evaluate the outputs responsibly. Do the prediction and the reasoning make sense, and are they accurate? If the solution provides an answer, verify its accuracy by reviewing the document. It is easier to review determinations made on each document and confirm that the predictions were correct than to ask a question over a larger set of documents and review only the documents that surface. In that scenario, you can review the documents the tool provides, but what about what it did not find? How do you know there is nothing important in the larger corpus that was not returned? Understand this risk so that you can mitigate it by running additional searches on the documents that were not returned as responsive to your inquiry (a simple sketch of such a check also appears below). That will not be foolproof, but at least you are not mindlessly relying on the model.
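To make the metrics conversation with a vendor concrete, the following is a minimal sketch of how recall, precision, and richness can be computed from a human-coded control sample. It is not tied to any particular review platform; the file name and column names (control_sample.csv, human_call, model_call) are hypothetical and stand in for whatever export your tool provides.

```python
# Minimal sketch: compute recall, precision, and richness from a control sample
# where each document has a human coding decision and a model prediction.
# File and field names are illustrative assumptions, not a real product schema.

import csv


def review_metrics(rows):
    """rows: list of (human_responsive, model_responsive) boolean pairs."""
    true_pos = sum(1 for human, model in rows if human and model)
    false_pos = sum(1 for human, model in rows if not human and model)
    false_neg = sum(1 for human, model in rows if human and not model)
    total = len(rows)

    # Recall: share of truly responsive documents the model found.
    recall = true_pos / (true_pos + false_neg) if (true_pos + false_neg) else 0.0
    # Precision: share of model-flagged documents that were truly responsive.
    precision = true_pos / (true_pos + false_pos) if (true_pos + false_pos) else 0.0
    # Richness: prevalence of responsive documents in the sample.
    richness = (true_pos + false_neg) / total if total else 0.0

    return {"recall": recall, "precision": precision, "richness": richness}


if __name__ == "__main__":
    # Hypothetical export: one row per sampled document with both calls recorded.
    with open("control_sample.csv", newline="") as f:
        reader = csv.DictReader(f)
        rows = [
            (row["human_call"] == "responsive", row["model_call"] == "responsive")
            for row in reader
        ]
    print(review_metrics(rows))
```

Note that these numbers are only meaningful if the sample is drawn after the prompt criteria are frozen, for the reason discussed above: metrics calculated during prompt iteration describe the tuning process, not the model's baseline performance.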
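The mitigation step described in the second answer can be sketched the same way. The snippet below, which assumes a simple document schema (id, text, model_call) purely for illustration, runs a few additional keyword searches over the set the model did not return as responsive and draws a random sample of that set for human spot-review. The search terms and sample size are placeholders, and this is a rough check rather than a formal elusion test.

```python
# Minimal sketch: spot-check the documents the model did NOT return as responsive.
# Schema, search terms, and sample size are assumptions for illustration only.

import random

ADDITIONAL_TERMS = ["settlement", "side letter", "off the books"]  # placeholder terms


def check_non_responsive(documents, sample_size=50, seed=42):
    """documents: list of dicts with 'id', 'text', and 'model_call' keys (assumed)."""
    not_returned = [d for d in documents if d["model_call"] != "responsive"]

    # 1. Targeted keyword searches over the set the model passed over.
    keyword_hits = [
        d["id"]
        for d in not_returned
        if any(term in d["text"].lower() for term in ADDITIONAL_TERMS)
    ]

    # 2. A random sample of the same set for human spot-review.
    rng = random.Random(seed)
    sample = rng.sample(not_returned, min(sample_size, len(not_returned)))

    return {
        "keyword_hits": keyword_hits,
        "spot_check_sample": [d["id"] for d in sample],
    }
```

As the answer notes, a check like this is not foolproof, but it gives the review team an independent look at the portion of the corpus the model left behind.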