Load a huge dataset from BigQuery into python/pandas/dask

Instead of querying, you can always export the table to Cloud Storage -> download it locally -> load it into your dask/pandas dataframe (a variant that reads straight from the bucket is sketched after the steps):

  1. Export + Download:

    bq --location=US extract --destination_format=CSV --print_header=false 'dataset.tablename' gs://mystoragebucket/data-*.csv && \
      gsutil -m cp gs://mystoragebucket/data-*.csv /my/local/dir/
    
  2. Load into Dask:

    >>> import dask.dataframe as dd
    >>> # the extract above used --print_header=false, so tell dask there is no header row
    >>> df = dd.read_csv("/my/local/dir/*.csv", header=None)
    

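If you want to skip the local copy, dask can also read the exported shards straight out of the bucket. This is only a sketch: it assumes the gcsfs package is installed and reuses the hypothetical bucket name from step 1:

    >>> import dask.dataframe as dd
    >>> # header=None because the extract above ran with --print_header=false
    >>> df = dd.read_csv("gs://mystoragebucket/data-*.csv", header=None)
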
Hope it helps.


First, you should profile your code to find out what is taking the time. Is it just waiting for BigQuery to process your query? Is it the download of the data? What is your bandwidth, and what fraction of it are you using? Is it the parsing of that data into memory?
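
As a rough first cut, you can at least separate the query time from the download/parse time. The snippet below is only a sketch, assuming the google-cloud-bigquery client library; the project and query are placeholders:

    import time

    from google.cloud import bigquery

    client = bigquery.Client(project="my-project")  # placeholder project

    t0 = time.perf_counter()
    rows = client.query("SELECT * FROM dataset.tablename").result()  # wait for BigQuery to finish the query
    t1 = time.perf_counter()
    df = rows.to_dataframe()  # download the result and parse it into a pandas DataFrame
    t2 = time.perf_counter()

    print(f"query: {t1 - t0:.1f}s, download + parse: {t2 - t1:.1f}s")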

Since you can make SQLAlchemy support BigQuery ( https://github.com/mxmzdlv/pybigquery ), you could try using dask.dataframe.read_sql_table to split your query into partitions and load/process them in parallel. If BigQuery is limiting the bandwidth on a single connection or to a single machine, you may get much better throughput by running this on a distributed cluster.
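
A minimal sketch of what that could look like, assuming pybigquery is installed and that the table has a numeric or datetime column to partition on (the project, dataset, table and column names below are placeholders):

    import dask.dataframe as dd

    # Splits the table into partitions on index_col and reads them in parallel.
    df = dd.read_sql_table(
        "tablename",
        "bigquery://my-project/dataset",  # SQLAlchemy URI provided by pybigquery
        index_col="id",
        npartitions=16,
    )

Each partition becomes its own filtered query against BigQuery, so how much this helps depends on how well BigQuery serves many smaller queries in parallel.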

Experiment!