Python Pandas read_csv skip rows but keep header

Great answers already. Consider this generalized scenario:

Say your xls/csv file has junk in the top two rows (rows #0 and #1), row #2 (the 3rd row) is the real header, and you want to load 10 rows starting from row #50 (i.e., the 51st row).

Here's the snippet:

pd.read_csv('test.csv', header=2, skiprows=range(3, 50), nrows=10)
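To check this against known data, here is a minimal sketch that builds the file in memory instead of reading test.csv. The column names file_row and value and the junk lines are made up for illustration; each data row stores its own file row number so the result is easy to verify.

```python
import io
import pandas as pd

# Hypothetical file: two junk rows, the real header on row 2,
# then data rows 3..102 that each record their own file row number.
lines = ["junk,row 0", "junk,row 1", "file_row,value"]
lines += [f"{row},{row * 10}" for row in range(3, 103)]
csv_text = "\n".join(lines)

# header=2 picks row 2 as the header (rows 0 and 1 are discarded),
# skiprows drops file rows 3..49, and nrows keeps the next 10 data rows.
df = pd.read_csv(io.StringIO(csv_text), header=2, skiprows=range(3, 50), nrows=10)
print(df)
```

The resulting frame should start at file row 50 and end at file row 59.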


You can pass a list of row numbers to skiprows instead of an integer.

By giving the function the integer 10, you're just skipping the first 10 lines.

To keep row 0 (as the header) and skip rows 1 through 9, you can write:

pd.read_csv('test.csv', sep='|', skiprows=range(1, 10))
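The difference between the integer and the range form is easiest to see side by side. This is a small sketch on made-up pipe-separated data (the columns n and sq are hypothetical), not the asker's actual file:

```python
import io
import pandas as pd

# Hypothetical data: a header on line 0, then lines 1..14 holding n and n squared.
text = "n|sq\n" + "\n".join(f"{i}|{i * i}" for i in range(1, 15))

# Integer: drops lines 0..9, header included, so line 10 is read as the header.
by_int = pd.read_csv(io.StringIO(text), sep="|", skiprows=10)

# Range: keeps line 0 as the header and drops only lines 1..9.
by_range = pd.read_csv(io.StringIO(text), sep="|", skiprows=range(1, 10))

print(list(by_int.columns))
print(list(by_range.columns))
```

With the integer, the original header is lost and a data line takes its place; with the range, the header survives and the data starts at line 10.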

Other ways to skip rows using read_csv

The two main ways to control which rows read_csv uses are the header and skiprows parameters.

Suppose we have the following CSV file with one column:

a
b
c
d
e
f

In each of the examples below, this file is f = io.StringIO("\n".join("abcdef")), recreated before each call because reading consumes the buffer.

  • Read all lines as values (no header; column names default to integers):

    >>> pd.read_csv(f, header=None)
       0
    0  a
    1  b
    2  c
    3  d
    4  e
    5  f
    
  • Use a particular row as the header (skip all lines before that):

    >>> pd.read_csv(f, header=3)
       d
    0  e
    1  f
    
  • Use multiple rows as the header, creating a MultiIndex (skip all lines before the last specified header line):

    >>> pd.read_csv(f, header=[2, 4])
       c
       e
    0  f
    
  • Skip N rows from the start of the file (the first row that's not skipped is the header):

    >>> pd.read_csv(f, skiprows=3)
       d
    0  e
    1  f
    
  • Skip one or more rows by giving the row indices (the first row that's not skipped is the header):

    >>> pd.read_csv(f, skiprows=[2, 4])
       a
    0  b
    1  d
    2  f
    

To expand on @AlexRiley's answer, the skiprows argument takes a list of row numbers that determines which rows to skip. So:

pd.read_csv('test.csv', sep='|', skiprows=range(1, 10))

is the same as:

pd.read_csv('test.csv', sep='|', skiprows=[1,2,3,4,5,6,7,8,9])

The best way to go about ignoring specific rows would be to create your ignore list (either manually or with a function like range, which produces a sequence of integers) and pass it to skiprows.
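As a rough sketch of building the ignore list programmatically, the snippet below skips every odd-numbered line of the six-line a to f file used earlier in this thread. It also shows the callable form of skiprows, which pandas evaluates against each row index; the names ignore, df_list and df_func are only for illustration.

```python
import io
import pandas as pd

f = io.StringIO("\n".join("abcdef"))

# Ignore list built with a comprehension: lines 1, 3 and 5.
ignore = [i for i in range(6) if i % 2 == 1]
df_list = pd.read_csv(f, skiprows=ignore)

# Same rule as a callable: a row is skipped when the function returns True.
f.seek(0)  # rewind the buffer before reading it a second time
df_func = pd.read_csv(f, skiprows=lambda i: i % 2 == 1)

print(df_list)
```

The callable is handy when the file length is not known in advance, since no explicit list has to be sized to the file.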
