Reading Hadoop HDFS Files from PHP
April 29th, 2010
Hadoop's distributed file system, HDFS, provides a native Java interface through which files in HDFS can be created, read, updated, and deleted much like local files.
For other, non-Java languages, Hadoop relies on Thrift.
The readme for this, hadoop-0.20.2/src/contrib/thriftfs/readme, describes it like this:
Thrift is a software framework for scalable cross-language services
development. It combines a powerful software stack with a code generation
engine to build services that work efficiently and seamlessly
between C++, Java, Python, PHP, and Ruby.
This project exposes HDFS APIs using the Thrift software stack. This
allows applications written in a myriad of languages to access
HDFS elegantly.
The Application Programming Interface (API)
===========================================
The HDFS API that is exposed through Thrift can be found in if/hadoopfs.thrift.
Compilation
===========
The compilation process creates a server org.apache.hadoop.thriftfs.HadoopThriftServer
that implements the Thrift interface defined in if/hadoopfs.thrift.
The thrift compiler is used to generate API stubs in python, php, ruby,
cocoa, etc. The generated code is checked into the directories gen-*.
The generated java API is checked into lib/hadoopthriftapi.jar.
There is a sample python script hdfs.py in the scripts directory. This python
script, when invoked, creates a HadoopThriftServer in the background, and then
communicates with HDFS using the API. This script is for demonstration purposes
only.
The README is rather terse, and knowing very little about the Java world I took quite a few wrong turns. Here is what worked for me:
1. Download the Thrift source and install it (./bootstrap.sh; ./configure --prefix=/usr/local/thrift; make; sudo make install)
2. Copy the required files into the Thrift install directory:
cp /path/to/thrift-0.2.0/lib/php/ /usr/local/thrift/lib/ -r
mkdir /usr/local/thrift/lib/php/src/packages/
cp /path/to/hadoop-0.20.2/src/contrib/thriftfs/gen-php/ /usr/local/thrift/lib/php/src/packages/hadoopfs/ -r
3. Build and install Thrift's PHP extension (needed for PHP only):
cd /path/to/thrift-0.2.0/lib/php/src/ext/thrift_protocol; phpize; ./configure; make; make install
Edit php.ini and add: extension=thrift_protocol.so
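The php.ini change above is a single line. A throwaway sketch of the edit, done against a scratch file so it can run anywhere; a real setup edits your actual php.ini, after which `php -m` should list thrift_protocol:

```shell
# Sketch of the php.ini edit, using a scratch file as a stand-in.
INI=$(mktemp)
echo "extension=thrift_protocol.so" >> "$INI"
grep '^extension=' "$INI"    # shows the line we just enabled
```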
4. Build Hadoop:
cd /path/to/hadoop-0.20.2; ant compile (ant -projecthelp shows the available targets; compile builds the core and contrib directories)
5. Start Hadoop's Thrift proxy:
cd /path/to/hadoop-0.20.2/src/contrib/thriftfs/scripts/; ./start_thrift_server.sh [your-port] (if no port is given, a random one is chosen)
6. Run the PHP test code:
<?php
error_reporting(E_ALL);
ini_set('display_errors', 'on');

$GLOBALS['THRIFT_ROOT'] = '/usr/local/thrift/lib/php/src';
define('ETCC_THRIFT_ROOT', $GLOBALS['THRIFT_ROOT']);

require_once(ETCC_THRIFT_ROOT.'/Thrift.php');
require_once(ETCC_THRIFT_ROOT.'/transport/TSocket.php');
require_once(ETCC_THRIFT_ROOT.'/transport/TBufferedTransport.php');
require_once(ETCC_THRIFT_ROOT.'/protocol/TBinaryProtocol.php');
require_once(ETCC_THRIFT_ROOT.'/packages/hadoopfs/ThriftHadoopFileSystem.php');

// Connect to the Hadoop Thrift proxy started in step 5.
$socket = new TSocket('your-host', your-port);
$socket->setSendTimeout(10000);   // milliseconds
$socket->setRecvTimeout(20000);

$transport = new TBufferedTransport($socket);
$protocol  = new TBinaryProtocol($transport);
$client    = new ThriftHadoopFileSystemClient($protocol);

$transport->open();
try {
    $pathname = new Pathname(array('pathname' => 'your-hdfs-file-name'));
    $fp = $client->open($pathname);
    var_dump($client->stat($pathname));
    var_dump($client->read($fp, 0, 1024));   // first 1024 bytes
} catch (Exception $e) {
    print_r($e);
}
$transport->close();
?>
Problems you may run into:
1. Directories and files can be created, but file contents cannot be read back.
Turn on Hadoop's log4j debug output (assuming you log through log4j): in /path/to/hadoop/conf/log4j.properties, change:
hadoop.root.logger=ALL,console
With that in place, every operation against HDFS prints its debug information.
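For reference, only this one logger line changes in conf/log4j.properties (the stock value is INFO,console):

```properties
# /path/to/hadoop-0.20.2/conf/log4j.properties
# before: hadoop.root.logger=INFO,console
hadoop.root.logger=ALL,console
```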
What I saw was a very odd-looking file id, so I suspected it was overflowing on a 32-bit machine; my attempts to patch it failed. After moving to a 64-bit machine it ran fine!
The suspect code is the readI64 function in /usr/local/thrift/lib/php/src/protocol/TBinaryProtocol.php (TBinaryProtocol being the protocol class my code uses).
2. start_thrift_server.sh fails to start:
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/hadoop/thriftfs/HadoopThriftServer
Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.thriftfs.HadoopThriftServer
at java.net.URLClassLoader$1.run(URLClassLoader.java:217)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:205)
at java.lang.ClassLoader.loadClass(ClassLoader.java:323)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:294)
at java.lang.ClassLoader.loadClass(ClassLoader.java:268)
at java.lang.ClassLoader.loadClassInternal(ClassLoader.java:336)
Could not find the main class: org.apache.hadoop.thriftfs.HadoopThriftServer. Program will exit.
Check that the classpath is correct. It only worked for me after adding:
CLASSPATH=$CLASSPATH:$TOP/build/contrib/thriftfs/classes/:$TOP/build/classes/:$TOP/conf/
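A sketch of setting those entries; TOP is assumed to be the root of the hadoop-0.20.2 source tree, matching the layout used above:

```shell
# Assumed layout: TOP points at the hadoop-0.20.2 checkout used in step 4.
TOP=/path/to/hadoop-0.20.2
CLASSPATH=$CLASSPATH:$TOP/build/contrib/thriftfs/classes/:$TOP/build/classes/:$TOP/conf/
export CLASSPATH
echo "$CLASSPATH"
```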
3. Problems installing Thrift:
./bootstrap.sh fails: check that boost is installed and that it is not too old.
make fails: a JDK of 1.6 or later is required.
make fails with "ImportError: No module named java_config_2": a python upgrade has probably broken java-config; reinstalling java-config fixes it.
"line 832: X--tag=CXX: command not found": change every "$echo" in the thrift/libtool file to "$ECHO" (this looks like a libtool version issue).
If problems keep coming up and most of them point to outdated software, consider an emerge --sync to update everything.
In the worst case, when emerge keeps getting masked, install some dependencies by hand. Boost, for example:
First, download a boost package from http://www.boost.org/users/download/ and unpack it under /usr/local (or wherever you want it installed).
Then symlink the headers into /usr/include: ln -s /usr/local/boost-version/boost /usr/include/boost. That completes the install for the parts that need no compilation.
To install the compiled parts as well, enter the boost directory, run the bootstrap.sh script to generate bjam, then run ./bjam install.
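The header-only part of that install can be sketched as below. It runs in a scratch prefix so no root is needed; for a real install PREFIX would be /usr/local and the link would go in /usr/include. The boost_1_42_0 directory name is a placeholder for whatever version you unpacked:

```shell
# Header-only boost "install": unpack the tarball, then symlink the headers.
PREFIX=$(mktemp -d)
mkdir -p "$PREFIX/boost_1_42_0/boost"            # stand-in for the unpacked tree
touch "$PREFIX/boost_1_42_0/boost/version.hpp"   # stand-in for the headers
mkdir -p "$PREFIX/include"
ln -s "$PREFIX/boost_1_42_0/boost" "$PREFIX/include/boost"
ls "$PREFIX/include/boost/"
```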
4. ant reports that it cannot find the JDK.
This was despite JAVA_HOME pointing at the JDK directory in /etc/profile and in every user's .bash_profile, with echo $JAVA_HOME confirming it.
Putting echo $JAVA_HOME on the first line of the ant script showed it was empty there. In the end I had to hard-code JAVA_HOME into the ant script itself.
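A sketch of that workaround, demonstrated on a scratch stand-in for the ant launcher so it can run anywhere; in practice you would edit the real script found with `which ant`, and /path/to/jdk is a placeholder for your JDK directory:

```shell
# Stand-in for the ant launcher: a script that reports what it sees.
ANT=$(mktemp)
printf '#!/bin/sh\necho "JAVA_HOME=$JAVA_HOME"\n' > "$ANT"
# Hard-code JAVA_HOME right after the shebang line (GNU sed).
sed -i '1a JAVA_HOME=/path/to/jdk; export JAVA_HOME' "$ANT"
sh "$ANT"    # prints JAVA_HOME=/path/to/jdk
```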
5. After unifying the hadoop versions the datanode starts normally, but the tasktracker still will not come up. The log shows:
2010-04-30 18:07:31,975 ERROR org.apache.hadoop.mapred.TaskTracker: Shutting down. Incompatible buildVersion.
JobTracker's: 0.20.3-dev from by ms on Thu Apr 29 17:44:22 CST 2010
TaskTracker's: 0.20.3-dev from by root on Fri Apr 30 17:48:14 CST 2010
Solution:
Copy the master's entire hadoop directory over to the tasktracker machine!
(Re-running ant under the ms account did not help; matching versions alone are not enough when the build times differ....)
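A sketch of that wholesale copy. Across machines something like the rsync in the comment would do it (a hedged example, paths assumed); the runnable demo below uses local scratch directories as stand-ins for the two trees:

```shell
# Across machines, roughly:
#   rsync -a master:/path/to/hadoop-0.20.2/ /path/to/hadoop-0.20.2/
# Local stand-in so the sketch can actually run:
MASTER=$(mktemp -d); SLAVE=$(mktemp -d)
echo "0.20.3-dev" > "$MASTER/build.version"   # stand-in for build metadata
cp -a "$MASTER/." "$SLAVE/"                   # overwrite the whole tree
diff -r "$MASTER" "$SLAVE" && echo "trees identical"
```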