A union of curiosity and data science

Knowledgebase and brain dump of a database engineer


Open a text file containing invalid UTF-8 bytes with Python (undecodable bytes, not Unicode)

 

I attempted to read and write a CSV using Python 3.5 and ran into the following error:
Traceback (most recent call last):
  File "C:\<File Path>.csv", line 14, in <module>
    if __name__ == "__main__": main()
  File "C:\<File Path>.csv", line 8, in main
    lines = file.readlines()
  File "C:\Programs\Python\Python35-32\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 4569: character maps to <undefined>
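
The failure can be reproduced without the original file. This is only a minimal sketch: it assumes the file held a stray 0x9d byte, which cp1252 (the codec named in the traceback) leaves unmapped. The sample bytes and the helper name try_decode are made up for illustration.

```python
# Hypothetical sample: ASCII text with one stray 0x9d byte, as in the traceback.
sample = b"id,name\n1,caf\x9d\n"


def try_decode(data, encoding):
    """Return the decoded text, or the error message if decoding fails."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError as exc:
        return "error: {}".format(exc)


cp1252_result = try_decode(sample, "cp1252")  # 0x9d has no mapping in cp1252
ascii_result = try_decode(sample, "ascii")    # strict ASCII rejects any byte >= 0x80
print(cp1252_result)
print(ascii_result)
```

Both strict decodes fail on the same byte, so switching the codec alone would not help; it is the error handler that makes the difference.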

The original data carried no encoding information, and the affected characters displayed as: Ã¢Â–¡â–¡â–¡â–¡

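I added two parameters to the open() call and the error went away: encoding='ascii', errors='surrogateescape'. With surrogateescape, each byte that cannot be decoded is mapped to a lone surrogate code point (U+DC80 to U+DCFF) instead of raising an exception.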

 

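#!/usr/bin/python3
def main():
    # Decode as ASCII; surrogateescape maps each undecodable byte to a lone
    # surrogate (U+DC80 to U+DCFF) instead of raising UnicodeDecodeError.
    # The with block closes the file even if reading fails.
    with open(r'\\filepath\filename.csv', 'r', encoding='ascii', errors='surrogateescape') as file:
        lines = file.readlines()

    # Printing can still fail if stdout's encoding cannot represent the escaped bytes.
    for line in lines:
        print(line, end='')


if __name__ == "__main__":
    main()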
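
To see what surrogateescape does to the data, here is a small sketch that decodes a few bytes in memory and encodes them back again. The sample bytes are hypothetical, not taken from the original CSV. The last step is one possible way to make the text safe to print, not something the script above does.

```python
# Hypothetical sample bytes: mostly ASCII, with two bytes that ASCII cannot decode.
raw = b"caf\xe9 \x9d\n"

# Each undecodable byte 0xNN becomes the lone surrogate U+DCNN instead of raising.
text = raw.decode("ascii", errors="surrogateescape")

# Encoding with the same handler turns the surrogates back into the original bytes.
restored = text.encode("ascii", errors="surrogateescape")

# Alternative for display: swap the escaped bytes for '?' so any console can print them.
printable = text.encode("ascii", errors="replace").decode("ascii")
print(printable)
```

Because the round trip is lossless, a file read this way can be written back unchanged as long as the output file is opened with the same encoding and errors arguments.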