Bug in readtable()? – if the first values in a CSV’s column are missing, the whole column is misinterpreted
I am reading a big CSV file (500K lines) with readtable. In the CSV, some columns have their first 250+ lines empty (e.g. ",,,," in the CSV), while the sparse non-missing values further down are either text strings or dates (in DD-MM-YYYY format). readtable() somehow interprets these columns as numeric and converts all the strings and dates into NaNs, so I end up with columns that are 100% NaN instead of sparsely populated data (among empty strings and NaTs).
Furthermore, if I move the lines that contain data up, even by just a few dozen positions, readtable() starts to read everything normally!
So, it looks like readtable() checks only the first ~250 values to determine the type of a column, which, in my opinion, is a bug! (Although I understand that it was likely done to improve speed.)
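To make the behaviour easier to reproduce, here is a minimal sketch that generates a file shaped like the one described above and reads it back. The file name, the first column (id) and the cell values are made up for illustration; only the column names p190 and p191, the 510 data lines and the first non-empty value on data-line 270 follow my test file. If type detection really looks only at the leading rows, both classes would come out as double, though this may differ between releases.

```matlab
% Hypothetical reproduction: file name, 'id' column and values are invented.
fname = fullfile(tempdir, 'sparse_columns_test.csv');
nRows = 510;        % number of data lines, as in the truncated test file
firstFilled = 270;  % first data line with non-empty p190/p191

fid = fopen(fname, 'w');
fprintf(fid, 'id,p190,p191\n');
for k = 1:nRows
    if k == firstFilled || k == 400
        % sparse rows carrying a text value and a DD-MM-YYYY date
        fprintf(fid, '%d,some text,15-03-2019\n', k);
    else
        % all other rows leave the two columns empty
        fprintf(fid, '%d,,\n', k);
    end
end
fclose(fid);

% Read with default settings and inspect the detected column types
T = readtable(fname);
disp(class(T.p190));
disp(class(T.p191));
disp(T(firstFilled, :));
```

Lowering firstFilled (for example to 200) and re-running should show whether the detected types change once data appears earlier in the file.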
Is there a way to fix it systematically? I have lots of such CSVs with thousands of columns in them – so, a manual check and manual fix is not an option…
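One direction that might be worth trying (an assumption on my part, not a confirmed fix) is to skip the automatic type guess altogether: build the import options with detectImportOptions, force every variable to text with setvartype, and convert columns afterwards. The sketch below uses a tiny made-up file; the file name, the id column and the values are hypothetical.

```matlab
% Hypothetical tiny file; names and values are invented for illustration.
fname = fullfile(tempdir, 'force_text_test.csv');
fid = fopen(fname, 'w');
fprintf(fid, 'id,p190,p191\n1,,\n2,,\n3,some text,15-03-2019\n');
fclose(fid);

% Detect the layout, then override the guessed types: every column as text
opts = detectImportOptions(fname);
opts = setvartype(opts, 'char');
T = readtable(fname, opts);

% Convert a known date column afterwards; empty cells are expected to become NaT
p191dates = datetime(T.p191, 'InputFormat', 'dd-MM-yyyy');
disp(T);
disp(p191dates);
```

The obvious cost is that numeric columns come back as text too and have to be converted afterwards (e.g. with str2double), which may be slow for thousands of columns.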
UPD: a test-file (truncated to 510 lines) is attached – the behaviour is still the same. The problem columns are the 2nd and the 3rd (p190, p191). The first non-empty value is on data-line 270.