Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Type specialized Tables.jl reader #224

Closed
chris-b1 opened this issue Jan 14, 2021 · 6 comments
Closed

Type specialized Tables.jl reader #224

chris-b1 opened this issue Jan 14, 2021 · 6 comments

Comments

@chris-b1
Copy link
Contributor

Motivating example - for this specific case, there is an order of magnitude speedup by specializing and eliding all the type dispatch and hard-coding the schema.

julia> @btime Tables.columntable(DBInterface.execute(db, "SELECT * FROM test_table"));
  28.492 ms (159048 allocations: 4.65 MiB)

julia> @btime fast_path(db)
  2.063 ms (36 allocations: 1.00 MiB)

There may be room to just improve the existing implementation, or perhaps an opt-in version of execute where you know the schema in advance and are willing to pay the compilation of a specialized method.

using SQLite, DBInterface, BenchmarkTools
db = SQLite.DB(":memory:")
t = (i=rand(Int, 10_000), f=rand(10_000))
SQLite.load!(t, db, "test_table")

function fast_path(db)
    stmt = SQLite.Stmt(db, "SELECT * FROM test_table")
    res = (i = Int[], f = Float64[])
    SQLite.execute(stmt, ())
    while true
        push!(res.i, SQLite.sqlite3_column_int64(stmt.handle, 1))
        push!(res.f, SQLite.sqlite3_column_double(stmt.handle, 2))
        st = SQLite.sqlite3_step(stmt.handle)
        if st == SQLite.SQLITE_DONE
               SQLite.sqlite3_reset(stmt.handle)
               return res
        end
    end
end
@quinnj
Copy link
Member

quinnj commented Jan 14, 2021

Yeah, I'm certainly open to ideas; I know part of the problem of the general problem is the lack of strict typing on the sqlite side: i.e. even if you declare a column type, there's no enforcement or guarantee that values will actually be that type. But I like the idea of providing some way to opt in to a faster implementation, maybe by providing a Tables.Schema that can hit a generated function and unroll the code? Want to take a crack at it?

@chris-b1
Copy link
Contributor Author

WIP Branch here that takes a Tables.Schema param - https://github.com/JuliaDatabases/SQLite.jl/compare/master...chris-b1:generated-query?expand=1

In more realistic workloads - data on disk vs memory, and string data, I'm getting more like 10-15% speedups. Not nothing, but given the type safety issues you noted on SQLite side, I haven't entirely convinced myself this is a worthwhile expansion.

@quinnj
Copy link
Member

quinnj commented Feb 22, 2021

Hmmm, interesting. Yeah, I guess I would hope for a bit more of a speedup, but I'm also not too surprised because I know that the difference between the fallbacks for Tables.columns for an input with vs. without Tables.schema can be in the 10-50% difference range in perf. In your perf tests, are you just materializing as a DataFrame? Or some other sink? I also wonder if you could get a bit more efficient by just using a custom typed TypedRow{types} struct instead of a NamedTuple. Maybe. NamedTuples are great, but sometimes they're a bit heavy because they need to allocate the full object.

@chris-b1
Copy link
Contributor Author

I was materializing to a Array of NamedTuples - either constructed directly in my fastpath or using Tables.rowtable for the current implementation.

Another other idea I haven't tested that might have some legs for my real life case is more efficient string parsing. Looking at this code

SQLite.jl/src/SQLite.jl

Lines 404 to 405 in 24b69db

#TODO: test returning a WeakRefString instead of calling `unsafe_string`
sqlitevalue(::Type{T}, handle, col) where {T <: AbstractString} = convert(T, unsafe_string(sqlite3_column_text(handle, col)))

I've got column of Dates stored as strings in SQLite, rather than paying the copy from unsafe_string I could probably parse the Date directly from the buffer sqlite3_column_text returns.

(also, I don't think WeakRefString could work there - it looks like SQLite automatically frees the return value from sqlite3_column_text)

@quinnj
Copy link
Member

quinnj commented Feb 25, 2021

You're right; I specifically didn't use WeakRefString because the value is only valid when the cursor is on the current row.

Materializing as a row-table will be slower than a column-based table, but if you want/need rows, then there's not much you can really do about that. The perf difference there comes down to allocations; you can allocate each column vector once and fill them in vs. allocating each row object (NamedTuple or otherwise).

Yeah, the idea of more efficient date parsing is a good one. In Parsers, I've adapted the Date parsing code from the Dates stdlib module to work on AbstractVector{UInt8}, so you could hook into that machinery probably by wrapping the sqlite pointer like unsafe_wrap(Vector{UInt8}, ...) and passing it to Parsers.parse (or Parsers.xparse which is slightly lower level).

Anyway, happy to chat more ideas if you have 'em.

@quinnj
Copy link
Member

quinnj commented Oct 6, 2022

Closing in favor of more specific ideas for perf improvements

@quinnj quinnj closed this as completed Oct 6, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants