A Scalable Data Stream Mining Methodology: Stream-based Holistic Analytics and Reasoning in Parallel
Research objective: The objective is to design and develop a data stream mining system that is "holistic", meaning that it can produce a decision-support model with the highest possible accuracy when mining data streams.
Research content: This project proposes a scalable data stream mining methodology called Stream-based Holistic Analytics and Reasoning in Parallel (SHARP). SHARP is holistic because it consists of several components, each targeting a different aspect of the data mining process: smoothing the input data streams, reducing the feature search space, finding the optimum feature subset, optimizing parameter values for the classifiers, and allowing incremental classifiers to form an ensemble by spawning different classifiers in parallel. Preliminary experiments on three of the individual components have been carried out and, as reported in our previously published work, demonstrated superiority over existing methods. In this project, all of the components are to be fully integrated and tested as a single holistic data stream mining system, with the aim of achieving the best possible performance. SHARP is anticipated to be capable of eliminating some of the key problems in Big Data, especially those associated with high dimensionality and with infinite, continuous data streams.
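To make the component list above more concrete, the following is a minimal Python sketch of two of the stages: smoothing an input stream and combining several incremental classifiers into a voting ensemble. It is an illustration under stated assumptions and not the SHARP implementation. The class names (ExponentialSmoother, NearestCentroid, VotingEnsemble), the choice of exponential smoothing, and the nearest-centroid learner are all hypothetical stand-ins, and the ensemble members are updated sequentially here rather than in parallel.

```python
from collections import Counter, defaultdict


class ExponentialSmoother:
    """Smooths each feature with an exponentially weighted moving average.
    (Assumed smoothing technique; the source does not name one.)"""

    def __init__(self, alpha=0.3):
        self.alpha = alpha
        self.state = None  # last smoothed feature vector

    def update(self, x):
        if self.state is None:
            self.state = list(x)
        else:
            self.state = [self.alpha * xi + (1 - self.alpha) * si
                          for xi, si in zip(x, self.state)]
        return list(self.state)


class NearestCentroid:
    """Incremental classifier: one running centroid per class,
    computed over a fixed subset of feature indices."""

    def __init__(self, features):
        self.features = list(features)
        self.sums = defaultdict(lambda: [0.0] * len(self.features))
        self.counts = Counter()

    def _project(self, x):
        return [x[i] for i in self.features]

    def learn(self, x, y):
        sums = self.sums[y]
        for j, value in enumerate(self._project(x)):
            sums[j] += value
        self.counts[y] += 1

    def predict(self, x):
        if not self.counts:
            return None  # nothing learned yet
        v = self._project(x)

        def squared_distance(label):
            n = self.counts[label]
            return sum((vj - sj / n) ** 2
                       for vj, sj in zip(v, self.sums[label]))

        return min(self.counts, key=squared_distance)


class VotingEnsemble:
    """Majority vote over incremental members (updated sequentially here)."""

    def __init__(self, members):
        self.members = list(members)

    def learn(self, x, y):
        for member in self.members:
            member.learn(x, y)

    def predict(self, x):
        votes = Counter(p for p in (m.predict(x) for m in self.members)
                        if p is not None)
        return votes.most_common(1)[0][0] if votes else None
```

Each ensemble member in this sketch sees a different feature subset, which is one simple way to obtain diversity. In a parallel setting, each member's learn call could be dispatched to its own worker, since the members share no state.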
Expected research outcomes or problems to be solved: The project is expected to yield a new breed of data stream mining system that is holistic by design, in that it embraces incremental feature selection, parameter optimization, ensemble bagging, and factor analysis and understanding, all aimed at achieving the best possible data mining performance. In particular, this data stream mining system is intended to solve at least two Big Data problems: high data dimensionality and potentially infinite data volume.
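The two Big Data problems named above, high dimensionality and unbounded volume, can be illustrated together with a small sketch of incremental feature ranking. It keeps only constant-size running statistics per feature and per class, so memory does not grow with the length of the stream. This is an assumption-laden illustration and not the feature selection method used in SHARP: the Fisher-style score and the names RunningStats and IncrementalFeatureRanker are hypothetical choices made for this example.

```python
class RunningStats:
    """Welford's online mean and variance: O(1) memory per tracked value."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, value):
        self.n += 1
        delta = value - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (value - self.mean)

    @property
    def variance(self):
        # Population variance; 0.0 before any value has been seen.
        return self.m2 / self.n if self.n else 0.0


class IncrementalFeatureRanker:
    """Ranks features by a Fisher-style score that is updated one
    labelled instance at a time (assumed scoring rule)."""

    def __init__(self, n_features):
        self.n_features = n_features
        self.per_class = {}  # label -> list of RunningStats, one per feature

    def update(self, x, y):
        stats = self.per_class.setdefault(
            y, [RunningStats() for _ in range(self.n_features)])
        for stat, value in zip(stats, x):
            stat.update(value)

    def scores(self):
        labels = list(self.per_class)
        if not labels:
            return [0.0] * self.n_features
        result = []
        for j in range(self.n_features):
            means = [self.per_class[c][j].mean for c in labels]
            grand = sum(means) / len(means)
            # Spread of the class means around their average.
            between = sum((m - grand) ** 2 for m in means) / len(means)
            # Average spread of values inside each class.
            within = sum(self.per_class[c][j].variance
                         for c in labels) / len(labels)
            result.append(between / (within + 1e-12))
        return result

    def top_k(self, k):
        sc = self.scores()
        order = sorted(range(self.n_features),
                       key=lambda j: sc[j], reverse=True)
        return order[:k]
```

A downstream incremental classifier could periodically call top_k and restrict itself to the returned feature indices, which reduces the feature search space without storing past instances.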