Peer to Peer: ILTA's Quarterly Magazine

Issue link: https://epubs.iltanet.org/i/1293067

…development time. Alternatively, with enough data, the cleaning process can often be reduced or eliminated by using a sophisticated machine learning approach in the analysis phase. Feature engineering is an advanced technique that can be used to focus on better-quality inputs. Alternatively, a large enough neural network can learn to ignore dirty features given enough examples.

Law firms may lack sufficient data to adequately train cutting-edge neural networks, but nothing is stopping them from using classical analysis techniques. For example, linear regression is the tried-and-true method for the analysis of continuous data. This technique provides potentially useful results as long as the number of parameters is less than the number of data samples. Fitting a line to two samples is futile, but a line that describes the average behavior of ten samples can illuminate a trend that was once hidden.

Data science also requires a balance of variance. The data needs to be varied enough to provide useful results when analyzed, but if the variance is too high, the reliability of the analysis suffers. Typically, adding samples increases the variance of the data set until it reaches a plateau. Because of this, analysts can quantify model uncertainty using the number of samples and the sample variance (a brief sketch of this appears below). In short, data science can always be performed, but more samples generally lead to more reliable results.

The ultimate strength of data science is the ability to analyze datasets of any size and any format, and to clearly express the assumptions and uncertainties that go along with the process. Data scientists everywhere can continuously churn out results with convolutional neural networks and Monte Carlo simulations, but the results go unused if decision-makers can't trust them. Performing an analysis is only half the battle, as each project needs to quantify and provide reasoning for model accuracy, uncertainty, bias, privacy, speed, outliers, and more. Answering these questions for a binary machine learning model or a convex optimization is easy, but the answers get murky in multi-class classification projects, unsupervised learning, and many other complex analyses.

It is impossible to describe in detail the metrics that need to be interpreted in each type of data science project, as the metrics depend on the application. In general, it is good to develop multiple metrics in each category of accuracy, speed, privacy, and equity (a second sketch below gives one example). Setting rough goals at the project onset and refining them as the analysis progresses is a great way to ensure the results will be fair and trustworthy. Unfortunately, some compromises will need to be made, as it is usually impossible to maximize all four of those metrics at once. Typically, increasing one of them is accomplished by decreasing a subset of the other three.

These metrics need to be understood by everyone: developers, analysts, business experts, stakeholders, and anyone else who will come into contact with the data-driven solution. It is the responsibility of the team involved with the design and development of the solution to make sure everyone understands the strengths and limitations of the product. Distrust forms when these metrics are not communicated clearly and the analysis is presented as a black box.
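To make the point about sample counts concrete, here is a minimal sketch, assuming NumPy and SciPy are available; the hours and fee figures are hypothetical, not drawn from this article. It fits a line to ten samples and reports the standard error of the slope, the kind of uncertainty figure that shrinks as samples are added and grows with sample variance.

    # A minimal sketch: fit a line to a small sample and quantify the
    # uncertainty of the fitted slope. The figures below are hypothetical.
    import numpy as np
    from scipy import stats

    hours_spent = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])   # hypothetical hours
    fees_billed = np.array([400, 900, 1300, 1700, 2400, 2600,
                            3100, 3300, 3900, 4200])            # hypothetical fees

    fit = stats.linregress(hours_spent, fees_billed)
    print(f"slope: {fit.slope:.1f} per hour, intercept: {fit.intercept:.1f}")
    # The standard error of the slope is one way to quantify model
    # uncertainty; it depends on the sample count and the sample variance.
    print(f"standard error of slope: {fit.stderr:.1f}")

With only two samples there are no degrees of freedom left over to estimate an error at all, which is the futility described above.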
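As one way to act on the advice about metric categories, the second sketch below, again with hypothetical labels and assuming scikit-learn is installed, records an accuracy metric, a speed metric, and a simple equity metric (the gap in positive-prediction rates between two groups) for a binary classifier; privacy usually needs its own review rather than a single number.

    # A rough sketch of tracking metrics in several categories at once.
    # y_true, y_pred, and group are hypothetical arrays for a binary classifier.
    import time
    import numpy as np
    from sklearn.metrics import accuracy_score, f1_score

    y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])                  # hypothetical ground truth
    y_pred = np.array([1, 0, 1, 0, 0, 1, 1, 0])                  # hypothetical predictions
    group = np.array(["A", "A", "B", "B", "A", "B", "A", "B"])   # hypothetical group labels

    start = time.perf_counter()
    accuracy = accuracy_score(y_true, y_pred)      # accuracy category
    f1 = f1_score(y_true, y_pred)                  # a second accuracy metric
    elapsed = time.perf_counter() - start          # speed category: time for this scoring step
    # (in a real project the timer would wrap the model's prediction call)

    # Equity category: difference in positive-prediction rates between groups
    rate_a = y_pred[group == "A"].mean()
    rate_b = y_pred[group == "B"].mean()
    parity_gap = abs(rate_a - rate_b)

    print(f"accuracy={accuracy:.2f}  f1={f1:.2f}  "
          f"scoring_seconds={elapsed:.4f}  parity_gap={parity_gap:.2f}")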
Occasionally, a project will fail to meet the desired threshold for some or all of the metrics in question. This discovery occurs in the analysis stage and necessitates a change to the project before continuing on to the presentation and monitoring stages. In rare circumstances, scrapping the project is the best option. More typically, the solution needs to be pivoted or viewed through another lens. Again, this topic is much too broad to cover all of the ways a project can be pivoted, but some common examples are batch or client-side processing, restricting access to results, introducing random noise, and human
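One of those pivots, introducing random noise, can be sketched briefly. The example below is a hypothetical illustration, not a full differential-privacy implementation: it adds Laplace noise to an aggregate count before it is released, so the published number says less about any individual record.

    # A minimal sketch of the "introduce random noise" pivot: perturb an
    # aggregate statistic before releasing it. The count and scale are hypothetical.
    import numpy as np

    rng = np.random.default_rng()
    true_count = 412          # hypothetical true count
    noise_scale = 2.0         # larger scale = more privacy, less accuracy
    noisy_count = true_count + rng.laplace(loc=0.0, scale=noise_scale)
    print(f"published count: {noisy_count:.0f}")

The noise_scale value controls the same trade-off discussed above: increasing privacy this way comes at the cost of accuracy.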
