A quality index for construction big data in EPB shield tunneling

Published in Working paper, to be submitted to Automation in Construction, 2023

The massive quantity of data generated continuously during EPB shield tunneling can be used to predict tunneling performance and optimize operation through data-driven models. However, the quality of these data sets an upper bound on the accuracy of such models and therefore plays a critical role in ensuring accurate predictions and informed decision-making. Despite its importance, there is currently no universally accepted method for assessing data quality in EPB shield tunneling. Herein, we propose a concise formulation of a quality index for construction big data in EPB shield tunneling, defined as the average of three distinct sub-terms: Representativeness, Diversity, and Informativeness. Representativeness stems from the harm that outliers do to data quality and to the developed models, and is quantified by the number of outliers between clusters. Diversity is assessed by the area of the convex hull that encloses the dataset, which measures the breadth of variation within the dataset. Informativeness evaluates the information richness of the dataset for training data-driven models and is quantified by the confidence or uncertainty in the predictions generated by the developed model. The proposed formulation was applied to three field datasets, yielding quality indices of 0.762, 0.754, and 0.719, respectively. The effectiveness of the quality index is demonstrated by examining its correlation with the performance measures of the developed data-driven models in three scenarios. The strong positive correlation coefficient of 0.99 between the quality index and model performance illustrates the capability of the quality index to capture a dataset’s potential for developing precise data-driven models.
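
As a rough illustration of how the three sub-terms could combine, the sketch below averages them into a single index. Only the averaging step and the use of a convex hull area come from the abstract. Everything else is an assumption made for illustration: the function names, the 2-D feature space, the scaling of each sub-term to [0, 1] (share of non-outlier samples, hull area relative to a reference area, mean prediction confidence), and all input values. It is not the paper's actual formulation.

```python
from statistics import mean


def convex_hull_area(points):
    """Area of the 2-D convex hull (monotone chain, then shoelace formula)."""
    pts = sorted(set(points))
    if len(pts) < 3:
        return 0.0

    def cross(o, a, b):
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    def half_hull(seq):
        hull = []
        for p in seq:
            # Drop points that would make a clockwise or collinear turn.
            while len(hull) >= 2 and cross(hull[-2], hull[-1], p) <= 0:
                hull.pop()
            hull.append(p)
        return hull[:-1]

    hull = half_hull(pts) + half_hull(reversed(pts))
    twice_area = sum(
        x1 * y2 - x2 * y1
        for (x1, y1), (x2, y2) in zip(hull, hull[1:] + hull[:1])
    )
    return abs(twice_area) / 2.0


def representativeness(n_outliers, n_samples):
    """Assumed form: share of samples not flagged as outliers."""
    return 1.0 - n_outliers / n_samples


def diversity(points, reference_area):
    """Assumed form: hull area relative to a reference area, capped at 1."""
    return min(convex_hull_area(points) / reference_area, 1.0)


def informativeness(confidences):
    """Assumed form: mean prediction confidence, each value in [0, 1]."""
    return mean(confidences)


def quality_index(r, d, i):
    """Average of the three sub-terms, as stated in the abstract."""
    return (r + d + i) / 3.0


# Hypothetical inputs: features already scaled to the unit square.
features = [(0.0, 0.0), (0.8, 0.0), (0.8, 0.5), (0.0, 0.5), (0.4, 0.2)]
r = representativeness(n_outliers=5, n_samples=100)
d = diversity(features, reference_area=1.0)
i = informativeness([0.7, 0.8, 0.9])
print(f"quality index = {quality_index(r, d, i):.3f}")
```

Scaling each sub-term to the same range before averaging keeps any one of them from dominating the index; how the paper normalizes its sub-terms is not stated in the abstract.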