publication of the International Legal Technology Association
Issue link: https://epubs.iltanet.org/i/1310179
ILTA WHITE PAPER | LITIGATION AND PRACTICE SUPPORT

include key patterns, anomalies, and data relationships. The techniques range from simple summaries, such as the number of records and minimum/maximum values, to complex statistical patterns, such as clustering related records. EDA can be performed in different ways, and most likely, you already perform EDA as part of your job.

The goals of EDA are to test and measure your data to identify key issues and opportunities for subsequent data analysis. Data always carries the risk of incompleteness, inaccuracy, or other data quality issues. EDA is performed to remove the uncertainty by measuring those potential issues so that you can address them or, at a minimum, document them.

You can also identify starting points for your analysis using EDA. A common problem with analysis is not knowing where to start. By starting with detailed information about your data, you can more easily see where the most important or interesting analysis may lie.

The output from EDA varies in format and depends on your analysis goals. EDA results can be presented as tables, data visualizations, and/or lists of outliers. Tabular reporting presents information in tables, typically in numeric form. It is a standard starting point for EDA, and for a data set with over 50 million records, you will most likely want to start by viewing your EDA measurements in tabular form to show the macro-level trends and issues. From there, you can generate data visualizations for the complex issues, which present the complicated information in a more intuitive manner. The drawback to most visualizations, however, is that they take more time to generate and refine, so visualizations are not always appropriate. Lists of outliers can also be generated for review, either as complete or sampled lists.

There are many different tools available for performing EDA.
Microsoft Excel offers several options for manual and automatic data exploration, including its Power Query data profiling feature. EDA can also be performed with database queries and scripting languages, such as Python and R. Commercial options, available from companies like Alteryx and Talend, offer robust features for exploring large, complex data sets.

Measuring Completeness

Measuring data completeness is perhaps the most common type of EDA for litigation support. The goal of completeness measurements is to determine whether all expected information is present in a specific set of data. There are three primary ways in which completeness can be measured: horizontal, vertical, and inter-data. The following diagram illustrates these completeness measurements.
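To make the simple end of the EDA range concrete, the sketch below shows one possible way to produce a tabular profile (record counts, populated and missing values, minimum and maximum per field) and a list of outliers in Python, one of the scripting languages mentioned above. Everything in it is an assumption for illustration: the sample records, the field names (doc_id, custodian, page_count), the helper names profile and mad_outliers, and the choice of a median-based modified z-score with a 3.5 cutoff for flagging outliers. The populated and missing counts are only a basic per-field check and are not meant to represent the horizontal, vertical, or inter-data measurements.

```python
from statistics import median

# Hypothetical sample records; None marks a missing value.
records = [
    {"doc_id": 1, "custodian": "Smith", "page_count": 12},
    {"doc_id": 2, "custodian": None, "page_count": 8},
    {"doc_id": 3, "custodian": "Jones", "page_count": None},
    {"doc_id": 4, "custodian": "Smith", "page_count": 950},
    {"doc_id": 5, "custodian": "Lee", "page_count": 10},
]


def profile(rows):
    """Summarize each field: record count, populated, missing, min, max.

    Assumes the values within a single field are mutually comparable
    (all numbers or all strings), as in the sample data above.
    """
    fields = sorted({key for row in rows for key in row})
    summary = {}
    for field in fields:
        present = [row[field] for row in rows if row.get(field) is not None]
        summary[field] = {
            "records": len(rows),
            "populated": len(present),
            "missing": len(rows) - len(present),
            "min": min(present) if present else None,
            "max": max(present) if present else None,
        }
    return summary


def mad_outliers(rows, field, threshold=3.5):
    """Return rows whose modified z-score for a numeric field exceeds threshold.

    The modified z-score uses the median and the median absolute deviation
    (MAD), which are less distorted by extreme values than mean and stdev.
    """
    present = [row for row in rows if row.get(field) is not None]
    values = [row[field] for row in present]
    if not values:
        return []
    center = median(values)
    mad = median([abs(value - center) for value in values])
    if mad == 0:
        return []  # no spread to measure against
    return [
        row for row in present
        if 0.6745 * abs(row[field] - center) / mad > threshold
    ]


summary = profile(records)
for field, stats in summary.items():
    print(field, stats)

outliers = mad_outliers(records, "page_count")
print("Outlier doc_ids:", [row["doc_id"] for row in outliers])
```

On a real data set, the same summary would typically be computed by the database or a data-frame library rather than in plain Python loops, and the outlier list could be sampled rather than reviewed in full.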