Peer to Peer: ILTA's Quarterly Magazine
Issue link: https://epubs.iltanet.org/i/1293067
In the December 2019 ILTA AI/ML Survey, members were given the opportunity to shed light on their journey into data science. The survey illuminated the challenges most practitioners in the legal industry are facing. Data is often not clean or normalized and firms don't have vast datasets. Both of these concerns stem from a central idea: confidence. It is hard to trust data science results coming from dirty, sparse data sets. This concern is not limited to the legal industry. Distrust of data-driven solutions is widespread and exacerbated by uninformed news reports and social media posts by celebrities. Trust in data-driven suggestions will increase with open communication.

This article briefly outlines the data science procedure and touches on techniques to address the common problems of dirty and sparse data. Although there are methods to alleviate these difficulties, no data science project is without flaws. Understanding results, project pivoting, and other advanced considerations are discussed at the conclusion of this brief article.

Once a problem is defined, the first data science task is data preparation. The initial determination of relevant data requires business expertise. Finding and aggregating that data is purely a technical exercise, after which business experts need to be brought back in to assist in the data cleansing process. While business experts work to correct missing or incorrect data, scientists work on masking and highlighting different pieces to assist later computations. After the data is cleaned to an acceptable degree, it is further processed through transformation and normalization. At this point, the data is ready to be analyzed.

The analysis technique can range in complexity from simple eye-test interpretation to cutting-edge artificial intelligence. The biggest buzzword in legal tech right now might be "cloud," but coming in at a close second is "machine learning."
There are times when machine learning is clearly the better analysis tool, but more often than not there are opportunities for classical methods to solve the problem faster with comparable or even better quality. Faster here can mean faster to develop, faster to use, or both. It is typical for several techniques to be compared to find the optimal balance between quality and performance.

An analysis is not worth anything until results are put into the hands of decision-makers. Presentation is the key component of any data science project. It is so important that simply presenting the cleaned data is occasionally enough to make a difference in the firm. Just as the analysis stage has several different techniques that range in complexity, so too does the presentation stage. Typically, results are shown in static reports, interactive dashboards, custom websites, or any combination of the three.

Business operations are marathons, not sprints. A data science project should provide lasting insights and competitive advantages. This usually requires some form of continuous monitoring and occasional major updates. Although ongoing maintenance is outside the main focus of this article, it is an often-overlooked aspect of a data science project and serves as a conclusion to the process.

Data science appears to be a lean process that produces lightning-fast results, so where are the bottlenecks? As ILTAns alluded to last year, the early stages of this process are often the hardest.

Data cleaning is a collaborative effort between developers, analysts, and business experts. The easiest but most time-consuming method to clean data is the old-fashioned way: manual correction from the user interface. This technique puts all the work on the business experts. On the flip side, corrections can be made programmatically on the data source. Cleaning data this way is much faster but can result in a fuzzy correction, leaving some bad data and removing some good data.
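To make the programmatic option concrete, here is a minimal sketch in Python using only the standard library. It is an illustration under stated assumptions, not a method described in the survey: the list of approved practice areas, the function name `clean_value`, and the similarity cutoff of 0.8 are all hypothetical choices made for this example.

```python
from difflib import get_close_matches

# Hypothetical list of values that business experts have approved.
APPROVED_PRACTICE_AREAS = [
    "Litigation",
    "Corporate",
    "Real Estate",
    "Intellectual Property",
]


def clean_value(raw, approved=APPROVED_PRACTICE_AREAS, cutoff=0.8):
    """Return (value, needs_review) for one raw field value."""
    # Collapse stray whitespace and normalize capitalization.
    text = " ".join(raw.split()).title()
    if text in approved:
        return text, False
    # Fuzzy step: accept the closest approved value if it is similar enough.
    # The cutoff is an assumed threshold; it trades accuracy for coverage.
    matches = get_close_matches(text, approved, n=1, cutoff=cutoff)
    if matches:
        return matches[0], False
    # No confident match: keep the value and flag it for manual correction.
    return text, True


if __name__ == "__main__":
    for raw in ["  litigaton ", "REAL   ESTATE", "Tax", "Corporate Tax"]:
        print(repr(raw), "->", clean_value(raw))
```

Under these assumptions, a typo such as "litigaton" is repaired automatically, and an unrecognized entry such as "Tax" is left in place and flagged. The same fuzzy step also rewrites "Corporate Tax" to "Corporate", which is exactly the kind of overcorrection that can remove good data if the original value was legitimate.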
Often a combination of the two techniques is recommended to provide the optimal balance between accuracy and