2018-10-22         Bishop

Why does _spark_metadata have all parquet partition files inside 0 when the cluster has 2 workers?

I have a small Spark cluster with one master and two workers, and a Kafka streaming app which streams data from Kafka and writes it to a directory in parquet format in append mode. So far I am able to read from the Kafka stream and write it to a parquet file using the following key line:

val streamingQuery = mydf.writeStream
  .format("parquet")
  .option("path", "/root/Desktop/sampleDir/myParquet")
  .outputMode(OutputMode.Append)
  .option("checkpointLocation", "/root/Desktop/sampleDir/myCheckPoint")
  .start()

I have checked on both of the workers. There are 3-4 snappy parquet files ...
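For reference, the _spark_metadata directory under the output path is the file sink's commit log, and a batch read of that same path goes through it rather than a plain file listing. A minimal sketch, assuming the same SparkSession and output path as above:

// Sketch: read the streaming sink's output back as a batch DataFrame.
// Spark consults the _spark_metadata log under this path and only picks up
// files that were committed by the streaming query.
val committed = spark.read.parquet("/root/Desktop/sampleDir/myParquet")
committed.show()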

 apache-spark                     1 answers                     70 view
 2018-10-22         Clare

How many executors are assigned to listen to a Kafka topic in Spark-Kafka integration with Spark 2.1?

I have a Spark cluster with 17 executors in total. I have integrated Spark 2.1 with Kafka and am reading data from a topic like this:

val df = spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "test")
  .load()

Now I want to know: when I submit my Spark application in cluster mode, how many executors (out of the total 17) will be assigned to listen to the Kafka topic and create micro-batches in Structured Streaming? Also, how can I limit the size of a micro-batch in Structured Streaming when reading from Kafka? St...
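On the second part of the question (limiting micro-batch size), the Structured Streaming Kafka source exposes a maxOffsetsPerTrigger option. A minimal sketch reusing the broker and topic names from the question; the cap value is a placeholder:

// Sketch: limit each micro-batch to roughly 10,000 offsets across all partitions.
val limited = spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "test")
  .option("maxOffsetsPerTrigger", "10000")
  .load()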

 apache-spark                     1 answers                     94 view
 2018-10-22         Verna

How to restrict foreach on a streaming RDD to run only once, upon receiver completion

I have created a custom receiver to fetch records pertaining to a specific query from Elasticsearch, and have implemented streaming RDD transformations to process the data generated by the receiver. The final RDD is a sorted list of name-value pairs, and I want to read the top 20 results programmatically rather than write them to an external file. I use "foreach" on the RDD and take the top 20 values into a list. I see that foreach is processed every time there is a new micro-batch from the receiver. However, I want the foreach computation to be done only once, when the receiver has...
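As a sketch of the pattern being described (hypothetical names; sortedPairs stands in for the final DStream of (name, value) pairs), collecting a bounded top-N back to the driver is usually done inside foreachRDD:

// Sketch: pull the top 20 (name, value) pairs of each micro-batch to the driver.
// take(20) avoids a full collect() of the RDD.
sortedPairs.foreachRDD { rdd =>
  if (!rdd.isEmpty()) {
    val top20: Array[(String, Long)] = rdd.take(20)
    top20.foreach { case (name, value) => println(s"$name -> $value") }
  }
}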

 apache-spark                     1 answers                     63 view
 2018-10-22         Yvette

What's the best way to stop a streaming context once the receiver is done fetching all the records?

I have implemented a custom receiver to fetch data from Elasticsearch continuously, in batches of x records. Once the receiver is done fetching all the records from Elasticsearch for a particular query, I want to stop the streaming context. What's the best approach to do so? You have to decide on a maximum time within which all x records can be fetched and then call ssc.awaitTerminationOrTimeout(maxTimeMs), or you can stop your streaming context manually using ssc.stop().
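A minimal sketch of the timeout-based variant described in the answer, assuming ssc is the StreamingContext and the timeout value is a placeholder:

// Sketch: wait up to maxTimeMs for the job to finish, then shut down.
// awaitTerminationOrTimeout returns true if the context stopped on its own.
val maxTimeMs = 10 * 60 * 1000L // placeholder upper bound: 10 minutes
ssc.start()
val finishedOnItsOwn = ssc.awaitTerminationOrTimeout(maxTimeMs)
if (!finishedOnItsOwn) {
  // Stop the StreamingContext (and SparkContext) gracefully so in-flight
  // batches are allowed to complete.
  ssc.stop(stopSparkContext = true, stopGracefully = true)
}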

 apache-spark                     1 answers                     65 view
 2018-10-22         Edward

What is better - more data receivers or more threads per consumer?

The Spark Streaming documentation suggests parallelizing the data receiving (link), with an example that creates multiple data receivers:

val numStreams = 5
val kafkaStreams = (1 to numStreams).map { i => KafkaUtils.createStream(...) }

Doing that, you get 5 cores receiving data, ideally on 5 different machines. But in terms of performance, why is this option better than having a single data receiver with 5 consumer threads (for instance, on machines with more than 5 cores)?

val numThreads = 5
val topicList = Map("topic1" -> numThreads)
val kafkaStream = KafkaUtils.createStrea...
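For reference, the multi-receiver pattern quoted above is normally completed by unioning the receiver streams into one DStream, as in the programming guide. A minimal sketch, assuming ssc is the StreamingContext and the createStream arguments are filled in as in the question:

// Sketch: merge the five receiver streams into a single DStream, then
// repartition so the records are spread across the cluster before the
// heavy transformations run.
val unifiedStream = ssc.union(kafkaStreams)
val rebalanced = unifiedStream.repartition(10) // placeholder partition count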

 apache-spark                     1 answers                     31 view
 2018-10-22         Viola

Spark Streaming DirectKafkaInputDStream: the Kafka data source can easily stress the driver node

I am building a prototype with Spark Streaming 1.5.0 using DirectKafkaInputDStream. A simple stage that reads from Kafka through DirectKafkaInputDStream can't handle a massive amount of messages: the stage takes longer than the batch interval once the message rate reaches or exceeds a certain value, and that rate is much lower than I expect. (I have benchmarked my Kafka cluster separately with multiple consumer instances on different servers.)

JavaPairInputDStream<String, String> recipeDStream =
    KafkaUtils.createDirectStream(jssc, String.class, ...
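One common knob when a direct stream cannot keep up with the batch interval is to cap the per-partition ingest rate and let backpressure adapt it. A minimal sketch of how those settings are usually applied (values and the app name are placeholders):

import org.apache.spark.SparkConf

// Sketch: cap how many records each Kafka partition contributes per second,
// and enable backpressure (available since Spark 1.5) so the rate adapts.
val conf = new SparkConf()
  .setAppName("kafka-direct-prototype") // placeholder app name
  .set("spark.streaming.kafka.maxRatePerPartition", "10000") // placeholder rate
  .set("spark.streaming.backpressure.enabled", "true")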

 apache-spark                     1 answers                     43 view
 2018-10-22         Noah

Which version of spark-streaming-kafka do I need?

I am using spark-core_2.10-2.0.2 and kafka_2.10-0.10.0.1, and I want to use spark-streaming-kafka. What version of spark-streaming-kafka should I use? Since you are using Kafka 0.10.x, your build.sbt will contain something like this (the Scala suffix and Spark version should match your build, here Scala 2.10 and Spark 2.0.2):

libraryDependencies += "org.apache.spark" % "spark-streaming-kafka-0-10_2.10" % "2.0.2"

Support info for the Spark-Kafka connector is here. Integration info is here.
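For reference, a minimal sketch of how the 0-10 integration is typically wired up once the dependency is in place; ssc is assumed to be an existing StreamingContext, and the broker, group id, and topic names are placeholders:

import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.streaming.kafka010._

// Sketch: a direct stream against Kafka 0.10.x using the kafka010 API.
val kafkaParams = Map[String, Object](
  "bootstrap.servers" -> "localhost:9092",      // placeholder broker
  "key.deserializer" -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id" -> "example-group",                // placeholder group id
  "auto.offset.reset" -> "latest"
)
val stream = KafkaUtils.createDirectStream[String, String](
  ssc,
  LocationStrategies.PreferConsistent,
  ConsumerStrategies.Subscribe[String, String](Seq("test"), kafkaParams)
)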

 apache-spark                     1 answers                     50 view
 2018-10-22         Kyle

How to consolidate Spark streaming output into an array when writing to Kafka

Currently, I have the following df:

+-------+--------------------+-----+
|    key|          created_at|count|
+-------+--------------------+-----+
|Bullish|[2017-08-06 08:00...|   12|
|Bearish|[2017-08-06 08:00...|    1|
+-------+--------------------+-----+

I use the following to stream the data to Kafka:

df.selectExpr("CAST(key AS STRING) AS key", "to_json(struct(*)) AS value")
  .writeStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("topic", "chart3")
  .option("checkpointLocation", "/tmp/checkpoints2")
  .outputMode("complete")
  .start()

The problem h...
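One possible shape for the consolidation, sketched under the assumption that the goal is a single Kafka record whose value is a JSON array of the rows above; it assumes a Spark version where to_json accepts an array of structs (2.4+), and whether it fits the streaming trigger and output mode would need checking:

import org.apache.spark.sql.functions.{col, collect_list, lit, struct, to_json}

// Sketch: collapse all rows into one record whose value is a JSON array.
val consolidated = df
  .groupBy(lit(1).as("grp"))
  .agg(collect_list(struct(col("key"), col("created_at"), col("count"))).as("rows"))
  .select(lit("all").as("key"), to_json(col("rows")).as("value"))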

 apache-spark                     1 answers                     51 view
 2018-10-22         Wendy

Storing Kafka Offsets in a File vs HBase

I am developing a Spark-Kafka streaming program where I need to capture the Kafka partition offsets in order to handle failure scenarios. Most developers use HBase as storage for offsets, but how would it be if I used a file on HDFS or local disk to store the offsets, which is simple and easy? I am trying to avoid using a NoSQL store just for offsets. What are the advantages and disadvantages of using a file over HBase for storing offsets? Just use Kafka. Out of the box, Apache Kafka stores consumer offsets within Kafka itself. I too have a similar use case...
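On the "just use Kafka" suggestion: with the spark-streaming-kafka-0-10 direct stream, offsets can be committed back to Kafka after each batch. A minimal sketch, assuming stream is such a direct stream:

import org.apache.spark.streaming.kafka010.{CanCommitOffsets, HasOffsetRanges}

// Sketch: capture each batch's offset ranges, do the work, then commit the
// offsets to Kafka asynchronously once the batch has succeeded.
stream.foreachRDD { rdd =>
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  // ... process the batch here ...
  stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
}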

 apache-spark                     2 answers                     56 view
 2018-10-22         Gladys

Apache Spark Structured Streaming + Kafka - Attempt to send response via channel for which there is no connection

I am using Spark Structured Streaming (2.3.0) with Kafka (1.0.0).

val event_stream: DataStreamReader = spark
  .readStream
  .format(_source)
  .option("kafka.bootstrap.servers", _brokers)
  .option("subscribe", topic)
  .option("startingOffsets", "earliest")
  .option("failOnDataLoss", "false")

I am testing the pipeline with 100 GB of data on one Kafka topic. On the Kafka brokers (3 bootstrap nodes with a 2 GB heap / 4 GB RAM each) I see this WARN message very frequently (almost every second):

WARN Attempting to send response via channel for which there is no open connection, conn...

 apache-spark                     1 answers                     91 view
 2018-10-22         Jodie

Couldn't find leaders for Set([TOPICNNAME,0]) when using Apache Spark

We are using Apache Spark 1.5.1 with kafka_2.10-0.8.2.1 and the Kafka DirectStream API to fetch data from Kafka. We created the topics in Kafka with the following settings: ReplicationFactor: 1 and Replica: 1. When all of the Kafka instances are running, the Spark job works fine. When one of the Kafka instances in the cluster is down, however, we get the exception reproduced below. After some time we restarted the disabled Kafka instance and tried to finish the Spark job, but Spark had already terminated because of the exception. Because of this, we could not ...

 apache-spark                     2 answers                     92 view
 2018-10-22         Judy

Duplication of data to Redshift from Kafka with Spark Streaming and the spark-redshift connector

I'm trying to set up a data pipeline from MySQL binlogs to Redshift. I'm writing data to Kafka from the MySQL binlogs (using a tool called cannedbeer, a fork of mypipe) and then using Spark Streaming to write those messages to Redshift with the spark-redshift connector. The problem I'm facing is that the same message gets written multiple times to Redshift. Is this because of job failures in the foreachRDD method of the DStream (a side effect of writing to Redshift)? Can you shed some light on this problem and how to solve it? Thanks in advance. Logging each part and seeing where it got duplica...

 apache-spark                     1 answers                     95 view
 2018-10-22         Carter

Use Spark to Write Kafka Messages Directly to a File

For a class project, I need a Spark Java program to listen as a Kafka consumer and write all of a Kafka topic's received messages to a file (e.g. "/user/zaydh/my_text_file.txt"). I am able to receive the messages as a JavaPairReceiverInputDStream object; I can also convert it to a JavaDStream<String> (this is from the Spark Kafka example). However, I could not find good Java syntax to write this data to what is essentially a single log file. I tried using foreachRDD on the JavaDStream object, but I could not find a clean, parallel-safe way to sink it to a single ...
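The question asks about Java, but as a sketch of the usual pattern (shown here in Scala, with lines standing in for the converted DStream of strings and a hypothetical output directory): each micro-batch still lands in its own directory, so a truly single file generally needs a separate merge step afterwards.

// Sketch: write each non-empty batch as a single part file under a
// batch-time-named directory.
lines.foreachRDD { (rdd, time) =>
  if (!rdd.isEmpty()) {
    rdd.coalesce(1).saveAsTextFile(s"/user/zaydh/my_text_output/batch-${time.milliseconds}")
  }
}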

 apache-spark                     1 answers                     98 view
 2018-10-22         Jennifer

Spark Streaming Kafka createDirectStream - Spark UI shows input event size as zero

I have implemented Spark Streaming using createDirectStream. My Kafka producer sends several messages every second to a topic with two partitions. On the Spark Streaming side, I read Kafka messages every second and then window them with a 5-second window size and frequency. Kafka messages are properly processed, and I'm seeing the right computations and prints. But in the Spark Web UI, under the Streaming section, the number of events per window is shown as zero. Please see this image: I'm puzzled why it is showing zero; shouldn't it show the number of Kafka messages being fed into Spark...

 apache-spark                     1 answers                     102 view
 2018-10-22         Taylor

Spark - Mixed case sensitivity in Spark DataFrame, Spark SQL, and/or Databricks Table

I have data from SQL Server that I need to manipulate in Apache Spark (Databricks). In SQL Server, three of this table's key columns use a case-sensitive COLLATION option, so that these specific columns are case-sensitive, but others in the table are not. These columns are short, alpha-numeric identifiers from a vendor application, and we must be able to use them in a case-sensitive manner in predicates and join conditions, while being able to use others in a case-insensitive manner. The table was exported as CSV. Is there a way to mix case-sensitive and case-insensitive colum...
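A common workaround, sketched with hypothetical DataFrame and column names (vendor_id stands in for a case-sensitive identifier, description for a case-insensitive column): Spark compares string data case-sensitively by default, so only the case-insensitive columns need explicit normalization in predicates and join conditions.

import org.apache.spark.sql.functions.lower

// Sketch: case-sensitive match on vendor_id, case-insensitive on description.
val joined = dfA.join(
  dfB,
  dfA("vendor_id") === dfB("vendor_id") &&
    lower(dfA("description")) === lower(dfB("description"))
)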

 apache-spark                     1 answers                     17 view
 2018-10-22         Horace

Specify pyspark dataframe schema with string longer than 256

I'm reading a source that has descriptions longer than 256 characters, and I want to write them to Redshift. According to https://github.com/databricks/spark-redshift#configuring-the-maximum-size-of-string-columns this is only possible in Scala. According to https://github.com/databricks/spark-redshift/issues/137#issuecomment-165904691 there should be a workaround: specify the schema when creating the dataframe. I'm not able to get it to work. How can I specify the schema with varchar(max)?

df = ...  # from source
schema = StructType([
    StructField('field1', StringType(), True),
    Stru...
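For reference, the Scala-only mechanism described in the first link works through column metadata with a maxlength key; a minimal sketch with a hypothetical column name:

import org.apache.spark.sql.types.MetadataBuilder

// Sketch: tag the long string column so spark-redshift creates it as
// VARCHAR(65535) instead of the default VARCHAR(256).
val meta = new MetadataBuilder().putLong("maxlength", 65535).build()
val dfWithMeta = df.withColumn("description", df("description").as("description", meta))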

 apache-spark                     1 answers                     18 view
