Treeperf
1. A sample data set
```julia
using Downloads, CSV, DataFrames

source = IOBuffer()
Downloads.download("https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv", source)
seekstart(source)
iris = CSV.read(source, DataFrame)

X = iris[!, [:sepal_length, :sepal_width]]
y = iris.species

X2 = rand(50000, 1)
y2 = X2[:, 1] .+ rand(50000) .> 1
```
```julia
using DecisionTree

model = DecisionTreeClassifier()
DecisionTree.fit!(model, Matrix(X), y)

model2 = DecisionTreeClassifier()
DecisionTree.fit!(model2, X2, y2)
```
To test performance, we benchmark prediction both with and without class probabilities.
```julia
using BenchmarkTools

bp = @benchmark apply_tree_proba(model.root, Matrix(X), model.classes)
bv = @benchmark apply_tree(model.root, Matrix(X))
```
2. Initial performance
3. Observation 1
The Leaf type stores every value that reaches it as a vector. This is inefficient in terms of space, and it requires excess computation whenever we determine how many instances of a particular value the leaf contains.
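For reference, the existing Leaf looks roughly like this (paraphrased, not the exact definition): every training label that reaches the leaf is kept verbatim.

```julia
# Paraphrase of DecisionTree.jl's current Leaf: `values` holds one entry
# per training instance, so counting a class means scanning the vector.
struct Leaf{T}
    majority :: T
    values   :: Vector{T}
end
```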
First, I switch the Leaf struct to the following form.
```julia
struct Leaf{T, N}
    features :: NTuple{N, T}
    majority :: Int
    values   :: NTuple{N, Int}
    total    :: Int
end
```
This should be more efficient to work with, and more compact in memory. With the old Leaf, the iris decision tree took up 7.1kB in memory. With the new structure, this is reduced to 5.9kB, a 17% improvement. With the single-column random data, a 27% improvement is observed (1.5MB → 1.1MB). This is not to be sneezed at.
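Incidentally, figures like these can be checked with Base.summarysize:

```julia
# Total bytes reachable from each fitted tree
Base.summarysize(model.root)   # iris model
Base.summarysize(model2.root)  # random-data model
```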
We see a 79% (91μs → 19μs) reduction in computation time with the iris data just by minimally adapting the various methods to work with the new Leaf structure, assuming that the features are consistent. A sketch of what that adaptation might look like is given below.
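As an illustration only (a sketch, not the exact adapted methods), the leaf case of apply_tree_proba becomes a constant-time tuple operation:

```julia
# With class counts in an NTuple, per-row probabilities fall out of a
# single tuple broadcast instead of scans over a Vector of labels.
apply_tree_proba(leaf::Leaf, features::AbstractVector, labels) =
    leaf.values ./ leaf.total
apply_tree_proba(tree::Node, features::AbstractVector, labels) =
    features[tree.featid] < tree.featval ?
        apply_tree_proba(tree.left, features, labels) :
        apply_tree_proba(tree.right, features, labels)
```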
Since apply_tree_proba on a single row now results in a tuple, though, we can make use of reshape to efficiently transform a vector of results into a matrix and avoid converting each tuple to a vector:
```diff
- stack_function_results(row->apply_tree_proba(tree, row, labels), features)
+ reinterpret(reshape, Float64, [apply_tree_proba(tree, row, labels) for row in eachrow(features)]) |>
+     transpose |> Matrix
```
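To see the reinterpret trick in isolation (toy values, not the model's output):

```julia
# One NTuple of class probabilities per sample
probs = [(0.2, 0.8), (0.5, 0.5), (0.9, 0.1)]
# reinterpret(reshape, ...) views the tuples as the columns of a matrix,
# so transposing gives one row per sample with no per-tuple copies
reinterpret(reshape, Float64, probs) |> transpose |> Matrix
# 3×2 Matrix{Float64}
```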
This results in a further 53% improvement (19μs → 10μs), for an 89% net improvement with the iris data. The large random data shows a smaller, but still notable, net improvement of 45% (42μs → 23μs). With the iris data, predicting with probability now appears to be slightly faster than predicting without; with the large random data, the overhead has shrunk from 147% to 9%.
There’s also been a significant improvement in the memory allocated when predicting with probability: it has gone from roughly 10× worse than prediction without probability to once again on par with it (32 MiB → 3.5 MiB, 140 KiB → 10 KiB).
4. Observation 2
Decision tree paths are being evaluated in a rather naïve way.
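Concretely, evaluation follows a pattern like this (a paraphrase with a hypothetical name, not the library's exact code):

```julia
# Recursive pointer-chasing: each prediction walks heap-allocated nodes
# one comparison at a time, which is unkind to the cache.
apply_naive(leaf::Leaf, x::AbstractVector) = leaf.majority
apply_naive(node::Node, x::AbstractVector) =
    x[node.featid] < node.featval ?
        apply_naive(node.left, x) :
        apply_naive(node.right, x)
```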
Online, we can find a few resources discussing decision tree implementation:
- Evaluating boosted decision trees for billions of users - Facebook Engineering
- Tahoe: Tree Structure-Aware High Performance Inference Engine for Decision Tree Ensemble on GPU
- The Performance of Decision Tree Evaluation Strategies —Andrew Tulloch
- RapidScorer: Fast Tree Ensemble Evaluation by Maximizing Compactness in Data Level Parallelization
These present a few ideas we can try.
4.1. Compiling trees
Thanks to eval, this is pretty easy to try.
```julia
function compiletree(tree::Node{S, T}) where {S, T}
    exec(l::Leaf) = :($(l.values ./ l.total))
    exec(n::Node{S, T}) where {S, T} =
        if n.featval == 0
            exec(n.left)
        else
            :(if x[$(n.featid)] < $(n.featval)
                  $(exec(n.left))
              else
                  $(exec(n.right))
              end)
        end
    :(function(x::AbstractVector{$S})
          $(exec(tree))
      end)
end
```
We can test this with the first model:
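A sketch of such a test (a reconstruction on the assumption that we eval the generated expression; Base.invokelatest is used since the function is defined at runtime):

```julia
compiled = eval(compiletree(model.root))
row = Vector{Float64}(Matrix(X)[1, :])
# The first call triggers native compilation of the generated function;
# invokelatest avoids world-age errors for a function eval'd at runtime
@time Base.invokelatest(compiled, row)
@benchmark Base.invokelatest($compiled, $row)
```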
The generated function takes ~0.04 seconds to compile.
Interestingly, in practice the compiled function is actually worse than our already improved apply_tree_proba.
4.2. Structuring the in-memory arrangement for caching
If two nodes are often visited one after the other, it’s good to arrange them next to each other in memory for the sake of cache locality.
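One way this might be done (a sketch with hypothetical FlatNode and flatten names; using featid == 0 as a leaf sentinel is an assumption, and leaf payloads would live in a parallel array): store the nodes contiguously in breadth-first order, referencing children by index rather than by pointer, so a parent and its children tend to land on nearby cache lines.

```julia
# Hypothetical contiguous layout: children are indices into the same
# vector instead of pointers; featid == 0 marks a leaf.
struct FlatNode{S}
    featid  :: Int
    featval :: S
    left    :: Int
    right   :: Int
end

function flatten(root::Node{S, T}) where {S, T}
    flat = FlatNode{S}[]
    queue = Any[root]           # FIFO traversal ⇒ breadth-first layout
    head = 1
    while head <= length(queue)
        n = queue[head]
        head += 1
        if n isa Leaf
            push!(flat, FlatNode{S}(0, zero(S), 0, 0))
        else
            push!(queue, n.left, n.right)
            # A node's slot in `flat` equals its position in `queue`,
            # so the children's indices are their queue positions
            push!(flat, FlatNode{S}(n.featid, n.featval,
                                    length(queue) - 1, length(queue)))
        end
    end
    flat
end

# Iterative descent over the flat vector: no recursion, no pointer chasing
function apply_flat(flat::Vector{FlatNode{S}}, x::AbstractVector{S}) where {S}
    i = 1
    while flat[i].featid != 0
        i = x[flat[i].featid] < flat[i].featval ? flat[i].left : flat[i].right
    end
    i   # index of the leaf reached; its payload sits in a parallel array
end
```

Breadth-first order is only one plausible arrangement; depth-first order along the more frequently taken branch is another, and which wins depends on the access pattern.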