1. A sample data set

julia
# Fetch the classic iris data set, and take the sepal measurements as features
using Downloads, CSV, DataFrames
source = IOBuffer()
Downloads.download("https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv", source)
seekstart(source)
iris = CSV.read(source, DataFrame)
X = iris[!, [:sepal_length, :sepal_width]]
y = iris.species

# A larger synthetic data set: 50000 samples of a single random feature,
# labelled by whether the feature plus some noise exceeds 1
X2 = rand(50000, 1)
y2 = X2[:, 1] .+ rand(50000) .> 1
julia
using DecisionTree
model = DecisionTreeClassifier()
DecisionTree.fit!(model, Matrix(X), y)

model2 = DecisionTreeClassifier()
DecisionTree.fit!(model2, X2, y2)

To test performance, we benchmark prediction both with and without class probabilities.

julia
using BenchmarkTools
bp = @benchmark apply_tree_proba(model.root, Matrix(X), model.classes)
bv = @benchmark apply_tree(model.root, Matrix(X))

2. Initial performance

| Parameter     | Iris    | 1×50000 random |
|---------------+---------+----------------|
| Tree size     | 7.1 kB  | 1.5 MB         |
| Predict time  | 13 μs   | 17 μs          |
| Predict space | 17 KiB  | 3.5 MiB        |
| Proba time    | 91 μs   | 42 μs          |
| Proba space   | 140 KiB | 32 MiB         |

| Dataset      | Predict time | +freqmap |
|--------------+--------------+----------|
| Breastcancer | 0.35 s       | 0.02 s   |
| Pima         | 0.06 s       | 0.01 s   |
| Iris         | 0.04 s       | 0.0 s    |
| HP1421       | 188 s        |          |
| MASS TE      | 0.3 s        | 0.02 s   |
| MASS TR      | 0.06 s       | 0.01 s   |

3. Observation 1

The Leaf type stores every value it contains as a vector. This is both inefficient in terms of space and computationally wasteful when determining how many instances of a particular value the leaf contains.
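
For reference, the existing definition looks roughly like this, with values holding the label of every training sample that ended up in the leaf:

julia
struct Leaf{T}
    majority :: T          # the most common label in this leaf
    values   :: Vector{T}  # the label of every sample that reached this leaf
end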

First, I switch the Leaf struct to the following form.

julia
struct Leaf{T, N}
    features :: NTuple{N, T}    # the distinct label values, in a consistent order
    majority :: Int             # index into features of the most common label
    values   :: NTuple{N, Int}  # how many samples of each label the leaf holds
    total    :: Int             # sum of values
end
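
As an illustration (hypothetical values, assuming every leaf records the classes in the same order), a pure iris leaf might look like:

julia
Leaf{String, 3}(("setosa", "versicolor", "virginica"), 1, (50, 0, 0), 50)
# values ./ total gives the class probabilities: (1.0, 0.0, 0.0)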

This should be more efficient to work with, and more compact in memory. With the old Leaf, the iris decision tree took up 7.1 kB in memory. With the new structure, this is reduced to 5.9 kB, a 17% improvement. With the single-column random data, a 27% improvement is observed (1.5 MB → 1.1 MB). This is not to be sneezed at.
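
These figures can be checked with Base.summarysize, which is presumably how such in-memory sizes are measured:

julia
Base.summarysize(model.root)   # iris tree: ~5.9 kB with the new Leaf
Base.summarysize(model2.root)  # single-column random data: ~1.1 MB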

We see a 79% (91 μs → 19 μs) reduction in computation time with the iris data, just by minimally adapting the various methods to work with the new Leaf structure and then, assuming that the features are consistent, making the following change:

diff
-    compute_probabilities(labels, leaf.values)
+    collect(leaf.values ./ leaf.total)

Since leaf.values ./ leaf.total already results in a tuple, though, we can drop the collect and make use of reinterpret(reshape, …) to efficiently transform a vector of results into a matrix, avoiding converting each tuple to a vector.

diff
-    collect(leaf.values ./ leaf.total)
+    leaf.values ./ leaf.total
diff
-    stack_function_results(row->apply_tree_proba(tree, row, labels), features)
+    reinterpret(reshape, Float64, [apply_tree_proba(tree, row, labels) for row in eachrow(features)]) |>
+        transpose |> Matrix
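
To illustrate what reinterpret(reshape, …) is doing here, a small standalone example (the values are made up):

julia
probs = [(0.2, 0.8), (0.5, 0.5), (0.9, 0.1)]  # one probability tuple per sample
# Reinterpret the tuple vector as a 2×3 matrix (no per-tuple copies), then
# transpose so each sample is a row, and materialise as a Matrix
reinterpret(reshape, Float64, probs) |> transpose |> Matrix
# 3×2 Matrix{Float64}:
#  0.2  0.8
#  0.5  0.5
#  0.9  0.1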

This results in a further 47% improvement (19 μs → 10 μs), or an 89% net improvement with the iris data. The large random data shows a smaller, but still notable, net improvement of 45% (42 μs → 23 μs). With the iris data, predicting with probabilities now appears to be slightly faster than plain prediction, and with the large random data the overhead has shrunk from 147% to 9%.

There’s also been a significant improvement in the memory allocated when predicting with probabilities: it has gone from roughly ten times worse than plain prediction to once again on par with it (32 MiB → 3.5 MiB, 140 KiB → 10 KiB).

4. Observation 2

4.1. Compiling trees

Thanks to eval, compiling a tree into a specialised Julia function is pretty easy to try.

julia
function compiletree(tree::Node{S, T}) where {S, T}
    # A leaf compiles down to its (pre-computed) class probabilities
    exec(l::Leaf) = :($(l.values ./ l.total))
    # An internal node compiles to an if/else on the relevant feature;
    # degenerate nodes (featval == 0) simply defer to their left branch
    exec(n::Node{S, T}) where {S, T} =
        if n.featval == 0
            exec(n.left)
        else
            :(if x[$(n.featid)] < $(n.featval)
                  $(exec(n.left))
              else
                  $(exec(n.right))
              end)
        end
    :(function(x::AbstractVector{$S}) $(exec(tree)) end)
end

We can test this with the first model:

julia
compmodel = eval(compiletree(model.root))

The generated function takes ~0.04 seconds to compile.

Interestingly, in practice the compiled function is actually slower than our already improved apply_tree_proba.
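
As a rough check (a sketch along the lines of the earlier benchmarks; the compiled function takes a single sample, so we map it over the rows):

julia
using BenchmarkTools
Xm = Matrix(X)
# The compiled function handles one sample at a time, so apply it row by row
@benchmark [compmodel(row) for row in eachrow($Xm)]
# ... and compare against the improved apply_tree_proba on the same data
@benchmark apply_tree_proba($(model.root), $Xm, $(model.classes))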

4.2. Structuring the in-memory arrangement for caching

If two nodes are often visited one after the other, it is worth arranging them next to each other in memory for the sake of cache locality. One way of doing this is sketched below.
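
As a hypothetical sketch (FlatNode, flatten!, and flatapply are illustrative names, not part of DecisionTree.jl), the tree could be stored as one contiguous vector of nodes in depth-first order, so that a node's left child is always the next element:

julia
struct FlatNode{S, N}
    featid  :: Int                 # 0 marks a leaf
    featval :: S                   # split threshold (ignored for leaves)
    right   :: Int                 # index of the right child; the left child
                                   # is simply the next node in the vector
    probs   :: NTuple{N, Float64}  # class probabilities (only used by leaves)
end

# Depth-first flattening (assumes every internal Node has featid > 0)
function flatten!(nodes::Vector{FlatNode{S, N}}, tree) where {S, N}
    if tree isa Leaf
        push!(nodes, FlatNode{S, N}(0, zero(S), 0, tree.values ./ tree.total))
    else
        push!(nodes, FlatNode{S, N}(tree.featid, tree.featval, 0, ntuple(_ -> 0.0, N)))
        here = length(nodes)
        flatten!(nodes, tree.left)   # left subtree sits immediately after the parent
        nodes[here] = FlatNode{S, N}(tree.featid, tree.featval, length(nodes) + 1,
                                     ntuple(_ -> 0.0, N))
        flatten!(nodes, tree.right)  # right subtree starts at the recorded index
    end
    nodes
end

# Prediction is then a walk along (mostly) adjacent elements of a single vector
function flatapply(nodes::Vector{<:FlatNode}, x::AbstractVector)
    i = 1
    while nodes[i].featid != 0
        i = x[nodes[i].featid] < nodes[i].featval ? i + 1 : nodes[i].right
    end
    nodes[i].probs
end

flat = flatten!(FlatNode{Float64, 3}[], model.root)
flatapply(flat, Matrix(X)[1, :])

With this layout, taking the left branch is just a step to the adjacent element, so descending the tree mostly touches memory that is already in cache.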

4.3. Tree path encoding

Date: 2022-05-12

Author: TEC

Created: 2022-05-16 Mon 11:17