2018-10-22         Malcolm

What's the difference between HDFS federation and a whole new HDFS cluster

I want to know the difference between HDFS federation and a whole new HDFS cluster: should I use federation, or just build a whole new cluster? Federation/ViewFS would allow you to access a brand new NameNode (cluster) nameservice from an existing cluster, or to bridge two existing clusters. [XXX] Generally, federation is used when you have a very large cluster (1000+ nodes) and you're pushing the limits of what you can store in HDFS. Federation allows you to divide your namespace while maintaining all your data in one HDFS instance. Depending on how you're using your data, you...
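
A rough sketch of what the client side looks like under federation/ViewFS (the nameservice names and paths below are made up for illustration):

    # Two namespaces served by two NameNodes over the same pool of DataNodes.
    hdfs dfs -ls hdfs://nameservice1/user/alice
    hdfs dfs -ls hdfs://nameservice2/projects
    # With a ViewFS mount table the client sees one logical namespace and does not
    # need to know which NameNode owns a given path.
    hdfs dfs -ls viewfs://clusterX/user/alice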

 hadoop                     2 answers                     91 views
 2018-10-22         Marguerite

How does cp command work in Hadoop?

I am reading "Hadoop: The Defnitive Guide" and to explain my question let me quote from the book distcp is implemented as a MapReduce job where the work of copying is done by the maps that run in parallel across the cluster. There are no reducers. Each file is copied by a single map, and distcp tries to give each map approximately the same amount of data by bucketing files into roughly equal allocations. By default, up to 20 maps are used, but this can be changed by specifying the -m argument to distcp.and in a footnote Even for a single file copy, the distcp variant...

 hadoop                     1 answer                     93 views
 2018-10-22         Bertha

What is the purpose of hdfs dfs -test /dir_name?

I am reading some code and found many uses of hdfs dfs -test /dir_name. I am curious about how it executes its operations. The command contacts the NameNode of the Hadoop cluster and returns an exit status of 0 if that path exists, and a non-zero status if it doesn't. Docs / source methods: [XXX]
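
A minimal sketch of how the command is typically used in a script, assuming the common -e ("does this path exist") form; the path is a placeholder:

    # The result comes back in the exit code, not on stdout.
    if hdfs dfs -test -e /dir_name; then
        echo "path exists"
    else
        echo "path does not exist"
    fi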

 hadoop                     1 answer                     95 views
 2018-10-22         Harold

Does a file need to be in HDFS in order to use it in distributed cache?

I get

Exception in thread "main" java.lang.IllegalArgumentException: Wrong FS: file:/path/to/my.jar, expected: hdfs://ec2-xx-xx-xx-xxx.compute-1.amazonaws.com

if I try to add a local file to the distributed cache in Hadoop. When the file is on HDFS, I don't get this error (obviously, since it's using the expected FS). Is there a way to use a local file in the distributed cache without first copying it to HDFS? Here is a code snippet:

Configuration conf = job.getConfiguration();
FileSystem fs = FileSystem.getLocal(conf);
Path dependency = fs.makeQualified(new Path("/local/path/to/my...
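
One common workaround, sketched here on the assumption that the driver runs through ToolRunner/GenericOptionsParser (the jar and class names are placeholders): let the framework stage the local file for you with the generic -libjars/-files options instead of adding a file: path to the cache by hand.

    # Ships the local jar to the job's HDFS staging area and exposes it to the tasks.
    hadoop jar my-job.jar com.example.MyDriver -libjars /local/path/to/my.jar input/ output/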

 hadoop                     3 answers                     99 views
 2018-10-22         Edwiin

Is there a way to have a secondary storage or backup for data blocks in Hadoop?

I have Hadoop running on a cluster with non-dedicated nodes (i.e. it shares nodes with other applications/users). When other users are using a node, Hadoop jobs are not allowed to run on it. Thus, it is possible that only a few nodes are available at a given moment, and that these few nodes do not hold all the data blocks (replicas) needed by the Hadoop job. I also have a big Network-Attached Storage device that is used for backup, so I am wondering if there is a way to use it as secondary storage for Hadoop. For example, if some data block is missing in the...

 hadoop                     1 answer                     102 views
 2018-10-22         Uriah

HDFS Replication - Data Stored

I am a relative newbie to Hadoop and want to get a better understanding of how replication works in HDFS. Say that I have a 10-node system (1 TB per node), giving me a total capacity of 10 TB. If I have a replication factor of 3, then I have 1 original copy and 3 replicas of each file. So, in essence, only 25% of my storage is original data, and my 10 TB cluster is in effect only 2.5 TB of original (un-replicated) data. Please let me know if my train of thought is correct. Your thinking is a little off. A replication factor of 3 means that you have 3 total copies of your d...
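
A quick back-of-the-envelope check of the corrected reading (3 total copies, not 1 original plus 3 replicas), using the 10 TB figure from the question:

    # Usable (unique) capacity = raw capacity / replication factor.
    echo "scale=2; 10 / 3" | bc    # => 3.33 TB of unique data, not 2.5 TB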

 hadoop                     1 answer                     102 views
 2018-10-22         Daisy

Adding a new line between documents with hadoop getmerge

I am trying to get a bunch of files from Hadoop and merge them into one big file, and I would like to have a newline between each document. hadoop fs -getmerge <src> <localdst> addnl should do exactly that, but it does not seem to add a newline no matter what! I also tried hadoop fs -getmerge <src> <localdst> -nl after seeing https://issues.apache.org/jira/browse/HADOOP-7340, but this is also not working. Am I missing something? Does this work for anyone? Thanks! If you're happy with writing some code to do this (and not relying on the shell comm...
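
If a pure shell workaround is acceptable (rather than the code-based route the answer starts to describe), one option is to concatenate the part files yourself and add the newline explicitly; the source directory below is a placeholder:

    # Append each part file from HDFS and force a newline after every one.
    for f in $(hadoop fs -ls /src/dir | awk '{print $NF}' | grep 'part-'); do
        hadoop fs -cat "$f" >> merged.txt
        echo "" >> merged.txt
    done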

 hadoop                     3 answers                     6 views
 2018-10-22         Ken

Changing replication of existing files in HDFS

I tried changing the replication factor from 3 to 1 and restarting the services, but the replication factor remains the same. Can anyone suggest how to change the replication factor of existing files? This is the fsck report:

 Minimally replicated blocks:   45 (100.0 %)
 Over-replicated blocks:        0 (0.0 %)
 Under-replicated blocks:       45 (100.0 %)
 Mis-replicated blocks:         0 (0.0 %)
 Default replication factor:    1
 Average block replication:     2.0
 Corrupt blocks:                0
 Missing replicas:              45 (33.333332 %)
 DecommissionedReplicas:        45
 Nu...
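
Changing dfs.replication in the configuration only affects files written after the change; existing files keep the factor they were created with until it is changed explicitly. A minimal sketch (the path is a placeholder):

    # Set the replication factor of existing files to 1; applied recursively when
    # given a directory. -w waits until the change has taken effect.
    hdfs dfs -setrep -w 1 /path/to/data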

 hadoop                     1 answer                     13 views
 2018-10-22         Constance

Cannot add a new property in HDFS/config/advanced/custom-core-site

How do I add a property in HDFS/config/advanced/custom-core-site? You add properties by scrolling to the bottom of that pane and adding one there, but you must be logged in as an Ambari administrator. [XXX]

 hadoop                     1 answer                     10 views
 2018-10-22         Curitis

Set up hiveserver2 and hive metastore on separate nodes

Is it possible to set up the Hive metastore and HiveServer2 services on separate nodes? I know that HDP Ambari forces you to set up the two on the same node (along with WebHCat, I believe), but what about other vendors such as Cloudera, and others? HiveServer2 and the Hive metastore server are independent daemons that can be run on different nodes; a Thrift-based connection is used for communication between them. The Cloudera and MapR distributions give you that option, and I think Hortonworks should include it as well. [XXX]
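
A rough sketch of the split, assuming two hosts named metastore-host and hs2-host (placeholder names) and the default metastore Thrift port:

    # On metastore-host: start the standalone metastore service (listens on 9083 by default).
    hive --service metastore &

    # On hs2-host: point HiveServer2 at the remote metastore in hive-site.xml, e.g.
    #   hive.metastore.uris = thrift://metastore-host:9083
    # then start HiveServer2.
    hive --service hiveserver2 &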

 hadoop                     1 answer                     17 views
 2018-10-22         Cedric

Spark 1.2 cannot connect to HDFS on HDP 2.2

I followed this tutorial, http://hortonworks.com/hadoop-tutorial/using-apache-spark-hdp/, to install Spark on HDP 2.2, but it tells me that DFS refused my connection! This is the command I ran:

./bin/spark-submit --class org.apache.spark.examples.SparkPi --master yarn-cluster --num-executors 3 --driver-memory 512m --executor-memory 512m --executor-cores 1 lib/spark-examples*.jar 10

Here is the log:

tput: No value for $TERM and no -T specified
Spark assembly has been built with Hive, including Datanucleus jars on classpath
15/02/04 13:52:51 WARN util.NativeCodeLoader: Unable to load native-hadoop ...
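
Some generic first checks for a refused HDFS connection (the hostname and port below are placeholders for the NameNode's RPC address):

    # Which filesystem URI are clients configured to talk to?
    hdfs getconf -confKey fs.defaultFS
    # Is the NameNode up and serving?
    hdfs dfsadmin -report | head
    # Is the NameNode RPC port reachable from this node?
    nc -vz namenode-host 8020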

 hadoop                     1 answer                     15 views
 2018-10-22         Dominic

Plain vanilla Hadoop installation vs Hadoop installation using Ambari

I recently downloaded the Hadoop distribution from Apache and got it up and running quite fast: download the Hadoop tarball, untar it somewhere, and do some configuration. The thing here is that I am able to see the various configuration files like yarn-site.xml, hdfs-site.xml etc., and I know the Hadoop home location. Next, I installed Hadoop (HDP) using Ambari. Here comes the confusing part. It seems Ambari installs HDP in /usr/hdp; however, the directory structure in plain vanilla Hadoop vs. Ambari is totally different. I am not able to locate the configuration f...
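
For orientation, on an Ambari-managed HDP node the layout typically looks like the sketch below (exact version directories vary by release):

    # Binaries and libraries live under a versioned tree with "current" symlinks.
    ls /usr/hdp/current/hadoop-client/
    # The active client configuration sits under /etc/<service>/conf,
    # which Ambari manages (usually via symlinks/alternatives).
    ls -l /etc/hadoop/conf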

 hadoop                     3 answers                     19 views
 2018-10-22         Karen

Hadoop issue with Sqoop installation

I have Hadoop (pseudo-distributed mode), Hive, Sqoop and MySQL installed on my local machine, but when I try to run Sqoop it gives me the following error:

Error: /usr/lib/hadoop does not exist!
Please set $HADOOP_COMMON_HOME to the root of your Hadoop installation.

I then set the sqoop-env-template.sh file with all the information (a snapshot of the sqoop-env-template.sh file is below). Even after providing the Hadoop and Hive paths I face the same error. I've installed:

hadoop in /home/hduser/hadoop, version 1.0.3
hive in /home/hduser/hive, version 0.11.0
sqoop in /ho...
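
One thing worth checking, offered as a guess rather than the confirmed fix: Sqoop sources sqoop-env.sh (normally created by copying sqoop-env-template.sh), so edits left only in the template are ignored. A sketch using the install paths from the question:

    # Copy the template to the file Sqoop actually reads, then export the homes there.
    cp conf/sqoop-env-template.sh conf/sqoop-env.sh
    # In conf/sqoop-env.sh:
    export HADOOP_COMMON_HOME=/home/hduser/hadoop
    export HADOOP_MAPRED_HOME=/home/hduser/hadoop
    export HIVE_HOME=/home/hduser/hive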

 hadoop                     1 answer                     19 views
 2018-10-22         Darcy

Dynamic aggregation in SQL (Hive)

I have two tables. Table A has 3 columns: userid, a start date, and an end date. Table B has events and datetime stamps. I would like to aggregate Table B up to the datetimes between the start and end date from Table A. So something like...

select a.userid, count(distinct b.eventid) as events
from table a
inner join table b
  on a.userid = b.userid
  and b.datetime between a.starttime and a.endtime
group by a.userid

But Hive doesn't like that... I'm using Hortonworks Hadoop. Would appreciate any guidance! Move the between condition to where, as only equality conditions in joins are ...
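
To make the answer's suggestion concrete, this is roughly the shape it points at, run here via the Hive CLI (the table and column names follow the question and are placeholders):

    hive -e "
    SELECT a.userid, COUNT(DISTINCT b.eventid) AS events
    FROM table_a a
    JOIN table_b b
      ON a.userid = b.userid                             -- equality condition only in the join
    WHERE b.datetime BETWEEN a.starttime AND a.endtime   -- range predicate moved to WHERE
    GROUP BY a.userid"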

 hadoop                     1 answer                     52 views
 2018-10-22         Sidney

Is Spark Streaming with a custom receiver a more generalized replacement for Flume in all use cases?

Our use case is (1) consuming data from ActiveMQ, (2) performing transformations through a general-purpose reusable streaming process, and then (3) publishing to Kafka. In our case, step (2) would be a reusable Spark Streaming 'service' that would provide an event_source_id, enrich each record with metadata, and then publish to Kafka. The straightforward approach I see is ActiveMQ -> Flume -> Spark Streaming -> Kafka, but Flume seems like an unnecessary extra step and extra network traffic. As far as I can tell, a Spark Streaming custom receiver would provide a more general solution f...

 hadoop                     1 answer                     58 views
 2018-10-22         Jeff

How to install Kafka on hadoop cluster?

I want to install the latest release of Kafka on our Hortonworks Hadoop cluster, which contains 2 master nodes, 2 edge nodes and 8 data nodes. The plan is to install Kafka on 2 of the 8 data node boxes. Kafka will need to handle up to a few million events a day and probably a few batch copies of files of 0.5 GB-1.2 GB in size. Questions: is there any special configuration of the data nodes or of Kafka I need to consider in order to avoid potential performance deterioration of Kafka (or the data nodes)? How is Kafka normally deployed (on dedicated boxes, or is it OK to run it on data nodes)? You c...

 hadoop                     1 answer                     43 views
