Characterization and Parallelization of Decision-tree Induction

TitleCharacterization and Parallelization of Decision-tree Induction
Publication TypeJournal Article
Year of Publication2001
AuthorsBradford, J, Fortes, JAB
JournalJournal of Parallel and Distributed Computing
Volume61
Pagination322-349
Date Published03/2001
AbstractThis paper examines the performance and memory-access behavior of the C4.5 decision tree induction program, a representative example of data mining applications, for both uniprocessor and parallel implementations. The goals of this paper are to characterize C4.5, in particular its memory hierarchy usage, and to decrease the runtime of C4.5 by algorithmic improvement and parallelization. Performance is studied via RSIM, an execution driven simulator, for three uniprocessor models that exploit instruction level parallelism to varying degrees. This paper makes the following four contributions. First it presents a complete characterization of the C4.5 decision tree induction program. It was found that the with the exception of the input dataset, the working set fits into an 8-kB data cache; the instruction working set also fits into an 8-kB instruction cache. For datasets smaller than the L2 cache, performance is limited by accesses to main memory. It was also found that multiple-issue can...