java - Extra EFBFBD bytes in Hadoop thriftfs reading


Hadoop 0.20 has a thriftfs contrib module that allows access to HDFS from other programming languages, and it provides an hdfs.py script as a demonstration. My problem is located in the do_get and do_put methods.

If I use get to download a UTF-8 text file, everything is fine, but when I get a file in another encoding I cannot recover the original file: the downloaded copy contains many "EFBFBD" bytes. I guess the Java code below, from HadoopThriftServer, may be causing the problem.
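(A side note on where those bytes come from: EF BF BD is the UTF-8 encoding of U+FFFD, the replacement character that Java substitutes for byte sequences it cannot decode as UTF-8. A minimal standalone sketch, using an arbitrary byte pair that is not valid UTF-8, reproduces them:)

public class ReplacementCharDemo {
    public static void main(String[] args) throws Exception {
        // an arbitrary two-byte sequence that is not valid UTF-8
        byte[] raw = {(byte) 0xD6, (byte) 0xD0};

        // new String(..., "UTF-8") silently replaces each undecodable byte
        // with U+FFFD, the Unicode replacement character
        String decoded = new String(raw, "UTF-8");

        // re-encoding the result as UTF-8 yields EF BF BD for every
        // replacement character -- the same bytes seen in the downloaded file
        for (byte b : decoded.getBytes("UTF-8")) {
            System.out.printf("%02X ", b & 0xFF);   // prints: EF BF BD EF BF BD
        }
        System.out.println();
    }
}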

public String read(ThriftHandle tout, long offset,
                   int length) throws ThriftIOException {
  try {
    now = now();
    HadoopThriftHandler.LOG.debug("read: " + tout.id +
                                  " offset: " + offset +
                                  " length: " + length);
    FSDataInputStream in = (FSDataInputStream)lookup(tout.id);
    if (in.getPos() != offset) {
      in.seek(offset);
    }
    byte[] tmp = new byte[length];
    int numbytes = in.read(offset, tmp, 0, length);
    HadoopThriftHandler.LOG.debug("read done: " + tout.id);
    return new String(tmp, 0, numbytes, "UTF-8");
  } catch (IOException e) {
    throw new ThriftIOException(e.getMessage());
  }
}
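If I understand it correctly, new String(tmp, 0, numbytes, "UTF-8") decodes arbitrary file bytes as UTF-8, so anything that is not valid UTF-8 is replaced before it ever reaches the client. The proper fix is probably to change the Thrift IDL so read/write use a binary type instead of string, but as an untested sketch of a workaround that keeps the existing string API, the bytes could be mapped through ISO-8859-1, which preserves every byte value one-to-one (this is my own variant, not the stock Hadoop code; logging and session bookkeeping are omitted for brevity):

// Sketch of a byte-preserving variant of read(), assuming the Thrift
// string API is kept. ISO-8859-1 maps every byte 0x00-0xFF to exactly one
// char, so no byte is ever replaced during decoding.
public String read(ThriftHandle tout, long offset,
                   int length) throws ThriftIOException {
  try {
    FSDataInputStream in = (FSDataInputStream) lookup(tout.id);
    if (in.getPos() != offset) {
      in.seek(offset);
    }
    byte[] tmp = new byte[length];
    int numbytes = in.read(offset, tmp, 0, length);
    if (numbytes < 0) {
      return "";                       // end of file: nothing left to return
    }
    // ISO-8859-1 instead of UTF-8: a lossless byte-to-char mapping
    return new String(tmp, 0, numbytes, "ISO-8859-1");
  } catch (IOException e) {
    throw new ThriftIOException(e.getMessage());
  }
}

If something like this were used, the Python client would also have to convert each returned chunk back with the same charset (re-encoding it as Latin-1) before writing, instead of writing the Thrift string directly.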

The Python code in hdfs.py is:

output = open(local, 'wb')
path = Pathname();
path.pathname = hdfs;
input = self.client.open(path)

# find size of hdfs file
filesize = self.client.stat(path).length

# read 1MB bytes at a time from hdfs
offset = 0
chunksize = 1024 * 1024
while True:
    chunk = self.client.read(input, offset, chunksize)
    if not chunk: break
    output.write(chunk)
    offset += chunksize
    if (offset >= filesize): break

self.client.close(input)
output.close()

Hope someone can help me.
Thanks.

