Detecting CSV file truncation when NA is a valid value
Detecting CSV file truncation when NA is a valid value
If I use the Pandas read_csv() function, elements of short rows are mapped to NaN by default. I would like to suppress that mapping while interpreting NA
as NaN
. I'm mostly interested in file truncation as a result of transmission problems, but short rows in the middle of the file should feature the same missing"
value. I tried messing around with na_filter=False
and keep_default_na=False
, and while each seemed to map empty cells to the empty string, neither mapped the string NA
to NaN
.
NA
NaN
missing"
na_filter=False
keep_default_na=False
NA
NaN
Is there a way to have my cake (NA
=> NaN
) and eat it too (missing values not mapped to NaN
)? (I'm using Pandas 0.22.0 w/ Python 3.6.
)
NA
NaN
NaN
Pandas 0.22.0 w/ Python 3.6.
Example:
col1,col2,col3,col4
1,2,NA,4
4,5
12
Assume the file has been truncated, so the characters "12"
are the last in the file (no EOF). With na_filter
and keep_default_na
at their default values of True
, the resulting values are
"12"
na_filter
keep_default_na
True
1,2,NaN,4
4,5,NaN,NaN
12,NaN,NaN,NaN
If I set either to False
, I get
False
1,2,NA,4
4,5,,
12,,,
I would like to find some way to get a NaN
out of the third column of the first row without also mapping the missing values to NaN
.
NaN
NaN
1 Answer
1
By default, Pandas will interpret the following values as NaN
(from the docs):
NaN
The default NaN
recognized values are ['-1.#IND', '1.#QNAN', '1.#IND',
.
'-1.#QNAN', '#N/A N/A', '#N/A', 'N/A', 'n/a', 'NA', '#NA', 'NULL',
'null', 'NaN', '-NaN', 'nan', '-nan', '']
NaN
['-1.#IND', '1.#QNAN', '1.#IND',
'-1.#QNAN', '#N/A N/A', '#N/A', 'N/A', 'n/a', 'NA', '#NA', 'NULL',
'null', 'NaN', '-NaN', 'nan', '-nan', '']
In order to avoid NA
being parsed as a null value, you need to set keep_default_na=False
and specify the na_values
directly.
NA
keep_default_na=False
na_values
To address your second problem, in order to avoid Pandas throwing errors when reading in a file with a variable number of columns, you need to specify the columns names. Putting it all together:
# new null values, removing NA from the list
new_na_values = ['-1.#IND', '1.#QNAN', '1.#IND', '-1.#QNAN', '#N/A N/A', '#N/A', 'N/A', 'n/a', '#NA', 'NULL', 'null', 'NaN', '-NaN', 'nan', '-nan', '']
# read in the file
df = pd.read_csv("<path to file>.csv", keep_default_na=False, na_values=new_na_values, names=["col1", "col2", "col3", "col4"])
By clicking "Post Your Answer", you acknowledge that you have read our updated terms of service, privacy policy and cookie policy, and that your continued use of the website is subject to these policies.
You need to give more details, e.g., example data, expected output, attempts at a solution not in paragraph form.
– Rushabh Mehta
39 mins ago