Reading in csv file as dataframe from hdfs

I know next to nothing about HDFS, but I wonder if the following might work:

with hd.open("/home/file.csv") as f:
    df = pd.read_csv(f)

As far as I know, read_csv accepts either a path or any file-like object with a read() method, so a file handle should work. (The numpy text readers go further and accept any iterable that yields lines.)
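The file-handle behaviour is easy to check without an HDFS cluster. Below is a minimal sketch that uses an in-memory io.StringIO buffer as a stand-in for whatever hd.open returns; the sample data and column names are made up for illustration.

```python
import io

import pandas as pd

# In-memory text buffer standing in for a remote file handle (made-up data).
handle = io.StringIO("name,value\na,1\nb,2\n")

# read_csv only needs an object with a read() method, not a path on disk.
df = pd.read_csv(handle)
print(df)
```

If this works but the hd.open version does not, the problem is in how the HDFS handle is opened rather than in pandas.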

pd.read_csv("/home/file.csv") would work if the regular Python file open works, i.e. if the file can be read as a regular local file:

with open("/home/file.csv") as f:
    print(f.read())

But evidently hd.open is using some other location or protocol, so the file is not local. If my suggestion doesn't work, then you (or we) need to dig more into the hdfs documentation.


You can use the following code to read a CSV file from HDFS:

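import pandas as pd
import pyarrow as pa

# Connection details for the HDFS NameNode (placeholder values).
hdfs_config = {
    "host": "XXX.XXX.XXX.XXX",
    "port": 8020,
    "user": "user",
}

# Connect through pyarrow's legacy HDFS interface.
fs = pa.hdfs.connect(
    hdfs_config["host"], hdfs_config["port"], user=hdfs_config["user"]
)

# Open the remote file and let pandas parse it from the file-like handle.
with fs.open("/home/file.csv") as f:
    df = pd.read_csv(f)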