Treeperf
1. A sample data set
```julia
using Downloads, CSV, DataFrames

source = IOBuffer()
Downloads.download("https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv", source)
seekstart(source)
iris = CSV.read(source, DataFrame)

X = iris[!, [:sepal_length, :sepal_width]]
y = iris.species

X2 = rand(50000, 1)
y2 = X2[:, 1] .+ rand(50000) .> 1
```
```julia
using DecisionTree

model = DecisionTreeClassifier()
DecisionTree.fit!(model, Matrix(X), y)

model2 = DecisionTreeClassifier()
DecisionTree.fit!(model2, X2, y2)
```
To test performance, we benchmark prediction both with and without class probabilities.
```julia
using BenchmarkTools

bp = @benchmark apply_tree_proba(model.root, Matrix(X), model.classes)
bv = @benchmark apply_tree(model.root, Matrix(X))
```
2. Initial performance
3. Observation 1
The Leaf type stores every value that reaches it as a vector. This is inefficient in terms of space, and it requires excess computation whenever we determine how many instances of a particular value the leaf contains.
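For reference, the existing Leaf looks roughly like this (paraphrased, not the exact definition): every training label that reaches the leaf is kept verbatim.

```julia
# Paraphrase of DecisionTree.jl's current Leaf: `values` holds one entry
# per training instance, so counting a class means scanning the vector.
struct Leaf{T}
    majority :: T
    values   :: Vector{T}
end
```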
First, I switch the Leaf struct to the following form.
```julia
struct Leaf{T, N}
    features :: NTuple{N, T}
    majority :: Int
    values   :: NTuple{N, Int}
    total    :: Int
end
```
This should be more efficient to work with, and more compact in memory. With the old Leaf, the iris decision tree took up 7.1kB in memory. With the new structure, this is reduced to 5.9kB, a 17% improvement. With the single-column random data, a 27% improvement is observed (1.5MB → 1.1MB). This is not to be sneezed at.
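Incidentally, figures like these can be checked with Base.summarysize:

```julia
# Total bytes reachable from each fitted tree
Base.summarysize(model.root)   # iris model
Base.summarysize(model2.root)  # random-data model
```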
We see a 79% (91μs → 19μs) reduction in computation time with the iris data just by minimally adapting the various methods to work with the new Leaf structure, assuming that the features are consistent. A sketch of what that adaptation might look like is given below.
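As an illustration only (a sketch, not the exact adapted methods), the leaf case of apply_tree_proba becomes a constant-time tuple operation:

```julia
# With class counts in an NTuple, per-row probabilities fall out of a
# single tuple broadcast instead of scans over a Vector of labels.
apply_tree_proba(leaf::Leaf, features::AbstractVector, labels) =
    leaf.values ./ leaf.total
apply_tree_proba(tree::Node, features::AbstractVector, labels) =
    features[tree.featid] < tree.featval ?
        apply_tree_proba(tree.left, features, labels) :
        apply_tree_proba(tree.right, features, labels)
```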
Since apply_tree_proba on a single row now results in a tuple, though, we can make use of reshape to efficiently transform a vector of results into a matrix and avoid converting each tuple to a vector:
```diff
- stack_function_results(row->apply_tree_proba(tree, row, labels), features)
+ reinterpret(reshape, Float64, [apply_tree_proba(tree, row, labels) for row in eachrow(features)]) |>
+     transpose |> Matrix
```
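To see the reinterpret trick in isolation (toy values, not the model's output):

```julia
# One NTuple of class probabilities per sample
probs = [(0.2, 0.8), (0.5, 0.5), (0.9, 0.1)]
# reinterpret(reshape, ...) views the tuples as the columns of a matrix,
# so transposing gives one row per sample with no per-tuple copies
reinterpret(reshape, Float64, probs) |> transpose |> Matrix
# 3×2 Matrix{Float64}
```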
This results in a further 53% improvement (19μs → 10μs), for an 89% net improvement with the iris data. The large random data shows a smaller, but still notable, net improvement of 45% (42μs → 23μs). With the iris data, predicting with probability now appears to be slightly faster than predicting without; with the large random data, the overhead has shrunk from 147% to 9%.
There’s also been a significant improvement in the memory allocated when predicting with probability: it has gone from roughly 10× worse than prediction without probability to once again on par with it (32 MiB → 3.5 MiB, 140 KiB → 10 KiB).
4. Observation 2
Decision tree paths are being evaluated in a rather naïve way.
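Concretely, evaluation follows a pattern like this (a paraphrase with a hypothetical name, not the library's exact code):

```julia
# Recursive pointer-chasing: each prediction walks heap-allocated nodes
# one comparison at a time, which is unkind to the cache.
apply_naive(leaf::Leaf, x::AbstractVector) = leaf.majority
apply_naive(node::Node, x::AbstractVector) =
    x[node.featid] < node.featval ?
        apply_naive(node.left, x) :
        apply_naive(node.right, x)
```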
Online, we can find a few resources discussing decision tree implementation:
- Evaluating boosted decision trees for billions of users - Facebook Engineering
- Tahoe: Tree Structure-Aware High Performance Inference Engine for Decision Tree Ensemble on GPU
- The Performance of Decision Tree Evaluation Strategies —Andrew Tulloch
- RapidScorer: Fast Tree Ensemble Evaluation by Maximizing Compactness in Data Level Parallelization
These present a few ideas we can try.
4.1. Compiling trees
Thanks to eval, this is pretty easy to try.
```julia
function compiletree(tree::Node{S, T}) where {S, T}
    exec(l::Leaf) = :($(l.values ./ l.total))
    exec(n::Node{S, T}) where {S, T} =
        if n.featval == 0
            exec(n.left)
        else
            :(if x[$(n.featid)] < $(n.featval)
                  $(exec(n.left))
              else
                  $(exec(n.right))
              end)
        end
    :(function(x::AbstractVector{$S})
          $(exec(tree))
      end)
end
```
We can test this with the first model:
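A sketch of such a test (a reconstruction on the assumption that we eval the generated expression; Base.invokelatest is used since the function is defined at runtime):

```julia
compiled = eval(compiletree(model.root))
row = Vector{Float64}(Matrix(X)[1, :])
# The first call triggers native compilation of the generated function;
# invokelatest avoids world-age errors for a function eval'd at runtime
@time Base.invokelatest(compiled, row)
@benchmark Base.invokelatest($compiled, $row)
```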
The generated function takes ~0.04 seconds to compile.
Interestingly, in practice the compiled function is actually worse than our already improved apply_tree_proba.
4.2. Structuring the in-memory arrangement for caching
If two nodes are often visited one after the other, it’s good to arrange them next to each other in memory for the sake of cache locality.
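One way this might be done (a sketch with hypothetical FlatNode and flatten names; using featid == 0 as a leaf sentinel is an assumption, and leaf payloads would live in a parallel array): store the nodes contiguously in breadth-first order, referencing children by index rather than by pointer, so a parent and its children tend to land on nearby cache lines.

```julia
# Hypothetical contiguous layout: children are indices into the same
# vector instead of pointers; featid == 0 marks a leaf.
struct FlatNode{S}
    featid  :: Int
    featval :: S
    left    :: Int
    right   :: Int
end

function flatten(root::Node{S, T}) where {S, T}
    flat = FlatNode{S}[]
    queue = Any[root]           # FIFO traversal ⇒ breadth-first layout
    head = 1
    while head <= length(queue)
        n = queue[head]
        head += 1
        if n isa Leaf
            push!(flat, FlatNode{S}(0, zero(S), 0, 0))
        else
            push!(queue, n.left, n.right)
            # A node's slot in `flat` equals its position in `queue`,
            # so the children's indices are their queue positions
            push!(flat, FlatNode{S}(n.featid, n.featval,
                                    length(queue) - 1, length(queue)))
        end
    end
    flat
end

# Iterative descent over the flat vector: no recursion, no pointer chasing
function apply_flat(flat::Vector{FlatNode{S}}, x::AbstractVector{S}) where {S}
    i = 1
    while flat[i].featid != 0
        i = x[flat[i].featid] < flat[i].featval ? flat[i].left : flat[i].right
    end
    i   # index of the leaf reached; its payload sits in a parallel array
end
```

Breadth-first order is only one plausible arrangement; depth-first order along the more frequently taken branch is another, and which wins depends on the access pattern.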