Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Create table view/ import from pandas in a session #258

Open
DanielMao1 opened this issue Aug 21, 2024 · 6 comments
Open

Create table view/ import from pandas in a session #258

DanielMao1 opened this issue Aug 21, 2024 · 6 comments
Labels

Comments

@DanielMao1
Copy link

Hello! : ) I am working on some feature engineering tasks and I find your great job. But I encountered some API limitations. Such APIs are supported by clickhouse and duckdb, but not supported by chDB. So I would like to discuss this issue.

Use case

When I using clickhouse, my use case is like:

  • firstly, I got some pandas dataframes:
import pandas as pd
import clickhouse_connect

# 1. load some dataframes
ads = pd.DataFrame({
    'ad_id': [1, 2, 3, 4, 5],
    'userid': [101, 102, 103, 104, 105]
})

users = pd.DataFrame({
    'userid': [101, 102, 103, 106, 107]
})

pageviews = pd.DataFrame({
    'userid': [101, 102, 103, 101, 105],
    'platform': [1, 2, 1, 2, 1]
})
  • Then I load them into clickhouse:
# 2. import dataframes into clickhouse
client = clickhouse_connect.get_client(host='localhost')

client.command('DROP TABLE IF EXISTS ads')
client.command('DROP TABLE IF EXISTS users')
client.command('DROP TABLE IF EXISTS pageviews')

client.command('CREATE TABLE ads (ad_id Int64, userid Int64) ENGINE = Memory')
client.command('CREATE TABLE users (userid Int64) ENGINE = Memory')
client.command('CREATE TABLE pageviews (userid Int64, platform Int64) ENGINE = Memory')

client.insert_df('ads', ads)
client.insert_df('users', users)
client.insert_df('pageviews', pageviews)
  • Then I run some very complex SQLs (which consists of create temp table statements and join 5-6 tables) on the database and get result as dataframes:
# generate temporary results
temp_query = """
SELECT
    ads.ad_id,
    ads.userid
FROM ads
JOIN users ON ads.userid = users.userid
"""

client.command('DROP TABLE IF EXISTS temp_ads_users')
client.command('CREATE TABLE temp_ads_users ENGINE = Memory AS ' + temp_query)

# generate final results
final_query = """
SELECT
    temp_ads_users.ad_id,
    mode() AS most_common_platform
FROM temp_ads_users
JOIN pageviews ON temp_ads_users.userid = pageviews.userid
GROUP BY temp_ads_users.ad_id
"""

result_df = client.query_df(final_query)
print(result_df)

Describe the solution you'd like

In chDB, I tried the API for a whole day but I can not find any APIs like clickhouse's insert_df nor any APIs like duckdb's duckdb.sql("INSERT INTO my_table BY NAME SELECT * FROM my_df"). I woule like such APIs so that I can import my data into servers, and reuse my analysis query. I do know if it is easy to wrap up on clickhouse c++ library or it is to be implemented in c++. (By the way, the clickhouse insert df is not efficient, that's the reason I try chdb)

Describe alternatives you've considered

I looked the official example do the sql look like this, but the problem is my query may consists of create temp table statement, which cannot be done by chdb.dataframe (It can only query but not create). Instead, I would like to insert the dataframe into db (logically in a session), and in this session I can do many SQLs.

import chdb.dataframe as cdf
import pandas as pd

employees = pd.read_csv("employees.csv")
departments = pd.read_csv("departments.csv")

query_sql = """
select
  emp_id, first_name,
  dep.name as dep_name,
  salary
from __emp__ as emp
    join __dep__ as dep using(dep_id)
order by salary desc;
"""

res = cdf.query(sql=query_sql, emp=employees, dep=departments)
print(res, end="")

Additional context

Look forward to you reply : )

@l1t1
Copy link

l1t1 commented Aug 25, 2024

create temporary table seems works, but cannot select it.

>>> from chdb.session import Session
>>>
>>> db = Session()
>>> db.query("create database db")

>>> db.query("use db")
>>> x="create temporary table data (id UInt32, x UInt32) engine MergeTree order by id sample by id as select number+1 as id, randUniform(1, 100) as x from numbers(10000);"
>>> db.query(x)                                                                         
>>> y='select avg(x) as "avg", round(quantile(0.95)(x), 2) as p95 from data sample 0.1;'>>> db.query(y)
Code: 60. DB::Exception: Table db.data does not exist. (UNKNOWN_TABLE)

>>> x="create table data (id UInt32, x UInt32) engine MergeTree order by id sample by id as select number+1 as id, randUniform(1, 100) as x from numbers(10000);"
>>> db.query(x)                                                                         
>>> db.query(y)                                                                         50.2891,95

@l1t1
Copy link

l1t1 commented Aug 26, 2024

try Query on Pandas DataFrame of https://clickhouse.com/docs/en/chdb/install/python

import chdb
import pandas as pd
df = pd.DataFrame(
    {
        "a": [1, 2, 3, 4, 5, 6],
        "b": ["tom", "jerry", "auxten", "tom", "jerry", "auxten"],
    }
)

chdb.query("SELECT b, sum(a) FROM Python(df) GROUP BY b ORDER BY b").show()

@auxten
Copy link
Member

auxten commented Aug 26, 2024

try Query on Pandas DataFrame of https://clickhouse.com/docs/en/chdb/install/python

import chdb
import pandas as pd
df = pd.DataFrame(
    {
        "a": [1, 2, 3, 4, 5, 6],
        "b": ["tom", "jerry", "auxten", "tom", "jerry", "auxten"],
    }
)

chdb.query("SELECT b, sum(a) FROM Python(df) GROUP BY b ORDER BY b").show()

chdb.dataframe is going to be deprecated soon
Please use the example above to handle dataframe with chDB

@l1t1
Copy link

l1t1 commented Aug 26, 2024

still can't select from table

>>>
>>> chdb.query("create table a engine=Memory as SELECT b, sum(a) FROM Python(df) GROUP BY b ORDER BY b").show()

>>> chdb.query("select * from a").show()                                                Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.10/dist-packages/chdb/__init__.py", line 78, in query
    raise ChdbError(res.error_message())
chdb.ChdbError: Code: 60. DB::Exception: Unknown table expression identifier 'a' in scope SELECT * FROM a. (UNKNOWN_TABLE)

@l1t1
Copy link

l1t1 commented Aug 26, 2024

when use session, the result is empty.

>>> from chdb.session import Session
>>> db = Session()
>>> db.query("create database db")

>>> db.query("use db")

>>> import chdb
>>> import pandas as pd
>>> df = pd.DataFrame(
...     {
...         "a": [1, 2, 3, 4, 5, 6],
...         "b": ["tom", "jerry", "auxten", "tom", "jerry", "auxten"],
...     }
... )
>>> db.query("create table a engine=Memory as SELECT b, sum(a) FROM Python(df) GROUP BY b ORDER BY b").show()

>>> db.query("select * from a").show()

@l1t1
Copy link

l1t1 commented Aug 26, 2024

engine=Log works. engine=Memory did not insert rows

>>> db.query("select count() from a")
0

>>> db.query("create table a1 engine=Log as SELECT b, sum(a) FROM Python(df) GROUP BY b ORDER BY b").show()

>>> db.query("select count() from a1")                                                  3

>>> db.query("select * from a1")                                                        "auxten",9
"jerry",7
"tom",5

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

3 participants