java - Extra EFBFBD bytes in Hadoop thriftfs reading


hadoop-0.20 includes a thriftfs contrib module that allows access to HDFS from other programming languages. Hadoop provides an hdfs.py script as a demonstration. The problem is located in its do_get and do_put methods.

If I use get to download a UTF-8 text file, it works fine. But when I get a file in another encoding, I cannot recover the original file: the downloaded file contains many "EFBFBD" bytes. I guess this Java code in HadoopThriftServer may be causing the problem.
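To illustrate what I suspect is happening, here is a minimal Python sketch (not Hadoop code). It assumes that Java's new String(bytes, "UTF-8") substitutes U+FFFD for malformed input, which Python's errors="replace" mimics. The function name and the Latin-1 sample are made up for the demonstration.

```python
def utf8_round_trip(raw: bytes) -> bytes:
    """Decode bytes as UTF-8 with replacement, then encode back to UTF-8."""
    # Bytes that are not valid UTF-8 become U+FFFD in the decoded text.
    text = raw.decode("utf-8", errors="replace")
    # U+FFFD is encoded as the three bytes EF BF BD.
    return text.encode("utf-8")


latin1_bytes = "café".encode("latin-1")  # b'caf\xe9', not valid UTF-8
damaged = utf8_round_trip(latin1_bytes)
print(damaged.hex())  # 636166efbfbd
```

A file that is already valid UTF-8 passes through this round trip unchanged, which would match the behavior I see.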

public String read(ThriftHandle tout, long offset,
                   int length) throws ThriftIOException {
  try {
    now = now();
    HadoopThriftHandler.LOG.debug("read: " + tout.id +
                                  " offset: " + offset +
                                  " length: " + length);
    FSDataInputStream in = (FSDataInputStream)lookup(tout.id);
    if (in.getPos() != offset) {
      in.seek(offset);
    }
    byte[] tmp = new byte[length];
    int numbytes = in.read(offset, tmp, 0, length);
    HadoopThriftHandler.LOG.debug("read done: " + tout.id);
    return new String(tmp, 0, numbytes, "UTF-8");
  } catch (IOException e) {
    throw new ThriftIOException(e.getMessage());
  }
}

The Python code in hdfs.py is:

output = open(local, 'wb')
path = Pathname()
path.pathname = hdfs
input = self.client.open(path)

# find size of hdfs file
filesize = self.client.stat(path).length

# read 1MB bytes at a time from hdfs
offset = 0
chunksize = 1024 * 1024
while True:
  chunk = self.client.read(input, offset, chunksize)
  if not chunk: break
  output.write(chunk)
  offset += chunksize
  if (offset >= filesize): break

self.client.close(input)
output.close()
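One way to confirm the damage on the client side is to count the EF BF BD sequences in the downloaded data. This is only a small sketch: count_replacement_bytes is a hypothetical helper and is not part of hdfs.py.

```python
REPLACEMENT = b"\xef\xbf\xbd"  # UTF-8 encoding of U+FFFD


def count_replacement_bytes(data: bytes) -> int:
    """Return how many times the EF BF BD sequence occurs in data."""
    return data.count(REPLACEMENT)


print(count_replacement_bytes(b"caf\xef\xbf\xbd"))  # 1
```

It could be applied to the contents of the local file opened in 'rb' mode. A nonzero count is only a hint, since a text file may legitimately contain U+FFFD.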

I hope someone can help me.
Thanks.

