I am writing a storage writer for Spark Structured Streaming which will partition the given DataFrame and write it to a different blob store account. The Spark documentation says that it ensures exactly-once semantics for file sinks, but it also says that exactly-once semantics are only possible if the source is replayable and the sink is idempotent.
Is the blob store an idempotent sink if I write in parquet format?
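For context, my understanding is that the built-in file sink gets its exactly-once behavior from the checkpoint plus the _spark_metadata commit log it writes into the output directory, so the baseline I am comparing against looks roughly like this (both paths are placeholders for my real blob store locations):

// Built-in parquet file sink: as I understand it, exactly-once here
// comes from the checkpoint and the _spark_metadata commit log that
// the sink maintains in the output directory.
streamingDF
  .writeStream
  .format("parquet")
  .option("path", "wasbs://data@myaccount.blob.core.windows.net/output")
  .option("checkpointLocation", "wasbs://data@myaccount.blob.core.windows.net/checkpoints")
  .start()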
Also, how will the behavior change if I am doing
streamingDF.writeStream.foreachBatch(...writing the DF here...).start()
instead? Will it still guarantee exactly-once semantics?
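If exactly-once with foreachBatch means I have to make the write idempotent myself, here is a sketch of what I imagine that would look like: deriving the output location from the batchId so that a replayed batch overwrites the same files instead of duplicating them. The batch_id path layout and the wasbs path are my assumptions, not something from the docs.

import org.apache.spark.sql.DataFrame

// Placeholder for one of my blob store accounts.
val basePath = "wasbs://data@myaccount.blob.core.windows.net/stream-out"

streamingDF
  .writeStream
  .foreachBatch { (df: DataFrame, batchId: Long) =>
    // Write each micro-batch under a batchId-derived directory in
    // overwrite mode, so a replayed batch rewrites the same location
    // rather than appending duplicate files.
    df.write
      .mode("overwrite")
      .parquet(s"$basePath/batch_id=$batchId")
  }
  .option("checkpointLocation", s"$basePath/checkpoints")
  .start()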
Possible duplicate: How to get Kafka offsets for structured query for manual and reliable offset management?
Update #1: Something like -

import org.apache.spark.sql.DataFrame
import scala.util.Random

// storagePaths is assumed to be a three-element sequence of the
// target blob store paths; r picks one of them per batch.
val r = new Random()

output
  .writeStream
  .foreachBatch { (df: DataFrame, _: Long) =>
    val path = storagePaths(r.nextInt(3))
    df.persist()            // cache the micro-batch before writing
    df.write.parquet(path)  // write it to the randomly chosen account
    df.unpersist()
  }
  .start()
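(One thing I already notice about my own snippet: since the path is picked at random for every batch, a retried batch could land in a different account than its first attempt, which is part of why I am unsure this write is idempotent.)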