Data quality is the basis of data management and analysis, and it has increasingly become a research hotspot in related fields because it contributes to the optimization of big data and artificial intelligence technology. In general, physical failures or technical defects in data collectors and recorders cause anomalies in the collected data. These anomalies strongly affect subsequent data analysis and artificial intelligence processes, so data should be processed and cleaned before use. Existing smoothing-based repairing methods over-repair a large number of originally correct data points into wrong values. Constraint-based methods, such as sequential dependency and SCREEN, cannot accurately repair data under complex conditions because their constraints are relatively simple. A time series data repairing method under multi-speed constraints is therefore proposed, based on the principle of minimum repairing, and dynamic programming is used to find the optimal repair for the anomalous data. Specifically, multiple speed intervals are set to constrain the time series data, and a set of candidate repair points is generated for each data point according to the speed constraints. The optimal repair solution is then selected from these candidates by dynamic programming. To evaluate the feasibility of the method, experiments are conducted on an artificial dataset, two real datasets, and another real dataset with real anomalies, under different anomaly rates and data sizes. The experimental results demonstrate that, compared with existing smoothing-based or constraint-based methods, the proposed method performs better in terms of RMS error and time cost. In addition, an investigation of clustering and classification accuracy on several datasets reveals the impact of data quality on subsequent data analysis and artificial intelligence.
The proposed method can improve the quality of data analysis and artificial intelligence results.
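The candidate-generation and dynamic-programming steps described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names `allowed` and `repair`, the `max_candidates` pruning parameter, the choice of interval boundaries as candidate speeds, the absolute-difference repair cost, and the decision to trust the first point are all assumptions made for illustration. Because the candidate set is finite and pruned, the result is optimal only over the candidates explored.

```python
def allowed(speed, intervals, eps=1e-9):
    """True if `speed` lies in at least one (low, high) speed interval."""
    return any(lo - eps <= speed <= hi + eps for lo, hi in intervals)


def repair(times, values, intervals, max_candidates=8):
    """Repair `values` so consecutive points obey the speed intervals.

    Assumptions (hypothetical, for illustration only): timestamps are
    strictly increasing, the first point is trusted, candidates are built
    from interval boundary speeds, and cost is the sum of |repair - original|.
    """
    bounds = sorted({b for interval in intervals for b in interval})
    # layers[i] maps candidate value -> (best total cost, predecessor value)
    layers = [{values[0]: (0.0, None)}]
    for i in range(1, len(values)):
        dt = times[i] - times[i - 1]
        prev = layers[-1]
        # Candidates: the observed value, plus values reached from each
        # surviving predecessor at an interval boundary speed.
        cands = {values[i]} | {p + s * dt for p in prev for s in bounds}
        layer = {}
        for c in cands:
            # Cheapest predecessor whose speed towards c is permitted.
            best = min(
                ((cost, p) for p, (cost, _) in prev.items()
                 if allowed((c - p) / dt, intervals)),
                default=None,
            )
            if best is not None:
                layer[c] = (best[0] + abs(c - values[i]), best[1])
        # Keep only the cheapest candidates to bound growth (a heuristic).
        kept = sorted(layer.items(), key=lambda kv: kv[1][0])[:max_candidates]
        layers.append(dict(kept))
    # Backtrack from the cheapest final candidate.
    c = min(layers[-1], key=lambda k: layers[-1][k][0])
    repaired = [c]
    for i in range(len(values) - 1, 0, -1):
        c = layers[i][c][1]
        repaired.append(c)
    return repaired[::-1]


if __name__ == "__main__":
    ts = [0, 1, 2, 3, 4]
    xs = [10, 11, 30, 13, 14]          # 30 is an injected spike
    print(repair(ts, xs, [(-2, 2)]))
```

In this toy run the spike at the third point violates the single interval `(-2, 2)` with respect to both neighbours, so the sketch moves only that point and leaves the others unchanged, in the spirit of minimum repairing. Passing several intervals, for example `[(-2, 2), (5, 6)]`, permits a speed that falls in any one of them.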
Fei Gao, Shaoxu Song, Jianmin Wang. Time Series Data Cleaning under Multi-Speed Constraints. International Journal of Software and Informatics, 2021, 11(1): 29–54.