Detecting CSV file truncation when NA is a valid value

Detecting CSV file truncation when NA is a valid value

If I use the Pandas read_csv() function, elements of short rows are mapped to NaN by default. I would like to suppress that mapping while interpreting NA as NaN. I'm mostly interested in file truncation as a result of transmission problems, but short rows in the middle of the file should feature the same missing" value. I tried messing around with na_filter=False and keep_default_na=False, and while each seemed to map empty cells to the empty string, neither mapped the string NA to NaN.

NA

NaN

missing"

na_filter=False

keep_default_na=False

NA

NaN

Is there a way to have my cake (NA => NaN) and eat it too (missing values not mapped to NaN)? (I'm using Pandas 0.22.0 w/ Python 3.6.)

NA

NaN

Pandas 0.22.0 w/ Python 3.6.

Example:

col1,col2,col3,col4 1,2,NA,4 4,5 12

Assume the file has been truncated, so the characters "12" are the last in the file (no EOF). With na_filter and keep_default_na at their default values of True, the resulting values are

"12"

na_filter

keep_default_na

True

1,2,NaN,4 4,5,NaN,NaN 12,NaN,NaN,NaN

If I set either to False, I get

False

1,2,NA,4 4,5,, 12,,,

I would like to find some way to get a NaN out of the third column of the first row without also mapping the missing values to NaN.

NaN

You need to give more details, e.g., example data, expected output, attempts at a solution not in paragraph form.
– Rushabh Mehta
39 mins ago

1 Answer
1

By default, Pandas will interpret the following values as NaN (from the docs):

NaN

The default NaN recognized values are ['-1.#IND', '1.#QNAN', '1.#IND', '-1.#QNAN', '#N/A N/A', '#N/A', 'N/A', 'n/a', 'NA', '#NA', 'NULL', 'null', 'NaN', '-NaN', 'nan', '-nan', ''].

NaN

['-1.#IND', '1.#QNAN', '1.#IND', '-1.#QNAN', '#N/A N/A', '#N/A', 'N/A', 'n/a', 'NA', '#NA', 'NULL', 'null', 'NaN', '-NaN', 'nan', '-nan', '']

In order to avoid NA being parsed as a null value, you need to set keep_default_na=False and specify the na_values directly.

NA

keep_default_na=False

na_values

To address your second problem, in order to avoid Pandas throwing errors when reading in a file with a variable number of columns, you need to specify the columns names. Putting it all together:

# new null values, removing NA from the list new_na_values = ['-1.#IND', '1.#QNAN', '1.#IND', '-1.#QNAN', '#N/A N/A', '#N/A', 'N/A', 'n/a', '#NA', 'NULL', 'null', 'NaN', '-NaN', 'nan', '-nan', ''] # read in the file df = pd.read_csv("<path to file>.csv", keep_default_na=False, na_values=new_na_values, names=["col1", "col2", "col3", "col4"])

By clicking "Post Your Answer", you acknowledge that you have read our updated terms of service, privacy policy and cookie policy, and that your continued use of the website is subject to these policies.

搜尋此網誌

Ciugk