I have set up a Dataproc cluster on Google Cloud. It is up and running, and I can access HDFS and copy files from the SSH in-browser console, so the problem is not on the Dataproc side.

I am now using Pentaho (ETL software) to copy files. Pentaho needs to access both the master and the data nodes.

I get the following error message:

456829 [Thread-143] WARN org.apache.hadoop.hdfs.DataStreamer - Abandoning BP-1097611520-10.132.0.7-    1611589405814:blk_1073741911_1087
456857 [Thread-143] WARN org.apache.hadoop.hdfs.DataStreamer - Excluding datanode DatanodeInfoWithStorage[10.132.0.9:9866,DS-6586e84b-cdfd-4afb-836a-25348a5080cb,DISK]
456870 [Thread-143] WARN org.apache.hadoop.hdfs.DataStreamer - DataStreamer Exception
org.apache.hadoop.ipc.RemoteException(java.io.IOException): File /user/jmonteilx/pentaho-shim-test-file.test could only be replicated to 0 nodes instead of minReplication (=1).  There are 2 datanode(s) running and 2 node(s) are excluded in this operation.
    at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget4NewBlock(BlockManager.java:1819)
    at org.apache.hadoop.hdfs.server.namenode.FSDirWriteFileOp.chooseTargetForNewBlock(FSDirWriteFileOp.java:265)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2569)
    at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:846)
    at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:510)
    at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:503)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:989)
    at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:871)
    at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:817)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1893)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2606)

The IP address used in the log is the internal IP of my first datanode in Dataproc. I need to use the external IP.

My question is the following:

Is there anything to change in the configuration files on the client side to make this work?

I have tried:

<property>
    <name>dfs.client.use.datanode.hostname</name>
    <value>true</value>
</property>

Without success. Many thanks.
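For reference, in a full client-side hdfs-site.xml the property would sit roughly like this (a minimal sketch; the surrounding <configuration> element and the comment about hostname resolution and the datanode port are assumptions about what must additionally hold for the setting to help):

<?xml version="1.0"?>
<!-- Client-side hdfs-site.xml (on the Pentaho machine, not on the cluster).
     dfs.client.use.datanode.hostname tells the HDFS client to connect to
     datanodes by hostname instead of the IP returned by the namenode, so
     the datanode hostnames must resolve to addresses that are reachable
     from this machine, and the datanode port (9866 in the log above) must
     be open to it. -->
<configuration>
    <property>
        <name>dfs.client.use.datanode.hostname</name>
        <value>true</value>
    </property>
</configuration>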

Question from: https://stackoverflow.com/questions/65938388/google-dataproc-writing-from-a-client-app-uses-clusters-internal-ip-for-datanod


1 Answer

The ETL tool cannot access the DataNodes via their external IPs from your on-premises data center, probably because your firewall rules block access from the internet, or because you created the Dataproc cluster with internal IPs only.

That said, allowing access to HDFS from the internet is a security risk. By default, Dataproc clusters do not configure secure authentication with Kerberos, so if you decide to open the cluster up to the internet you should, at the very least, configure secure access to it.
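At the Hadoop level, secure authentication boils down to settings along these lines in core-site.xml (shown only as a sketch of what "secure access" means; on Dataproc these values are managed for you when you enable Kerberos at cluster creation rather than edited by hand):

<!-- Sketch of core-site.xml security settings; on Dataproc these are set
     automatically when the cluster is created with Kerberos enabled. -->
<configuration>
    <property>
        <name>hadoop.security.authentication</name>
        <value>kerberos</value>
    </property>
    <property>
        <name>hadoop.security.authorization</name>
        <value>true</value>
    </property>
</configuration>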

The preferred solution is to establish a secure network connection between your on-premises network and GCP (for example, Cloud VPN or Cloud Interconnect) and access HDFS through it. You can read more about the options for this in the GCP documentation.

