# How to join 2 Datasets in a way similar to RDD?


Problem Description:

I have 2 data frames

``````
val a = Map(1 -> 2, 3 -> 5).toSeq.toDF("key", "value")
val b = Map(1 -> 4, 2 -> 5).toSeq.toDF("key", "value")
``````

I join them on the column `key`

``````
a.join(b, "key").show(false)

+---+-----+-----+
|key|value|value|
+---+-----+-----+
|1  |2    |4    |
+---+-----+-----+
``````

I would instead like the result

``````
+---+------+
|key|value |
+---+------+
|1  |{2, 4}|
+---+------+
``````

I would like the column `value` to be an array that aggregates the values from both sides. Is there an idiomatic way to do this? The `join` on 2 RDDs does this by default.
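For reference, here is a minimal sketch of the RDD behaviour being referred to. It assumes a local `SparkSession` (in `spark-shell` one already exists), and the names `left`, `right` and `joined` are only illustrative. Note that `join` on pair RDDs returns the matching values as a tuple per key rather than as an array:

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a throwaway local session, only for this illustration.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("rdd-join-sketch")
  .getOrCreate()
val sc = spark.sparkContext

// Same data as the two data frames above, as pair RDDs of (key, value).
val left = sc.parallelize(Seq(1 -> 2, 3 -> 5))
val right = sc.parallelize(Seq(1 -> 4, 2 -> 5))

// Inner join on the key: the result type is RDD[(Int, (Int, Int))].
val joined = left.join(right).collect()

joined.foreach(println) // prints (1,(2,4))
```

Only key `1` appears on both sides, so it is the only row in the result.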

## Solution – 1

For a fully typed approach, you could use the Dataset API with `joinWith`:

``````
// Case classes that type the input rows and the joined output
case class MyKeyValuePair(key: String, value: String)
case class MyOutput(key: String, values: Seq[String])

val a = Map(1 -> 2, 3 -> 5).toSeq.toDF("key", "value").as[MyKeyValuePair]
val b = Map(1 -> 4, 2 -> 5).toSeq.toDF("key", "value").as[MyKeyValuePair]

// joinWith keeps both sides as typed objects: Dataset[(MyKeyValuePair, MyKeyValuePair)]
val output = a.joinWith(b, a("key") === b("key"), "inner").map {
  case (pair1, pair2) => MyOutput(pair1.key, Seq(pair1.value, pair2.value))
}

output.show
+---+------+
|key|values|
+---+------+
|  1|[2, 4]|
+---+------+
``````

This gives you fine-grained control over the exact shape of your output, along with compile-time safety. The only extra work is defining the case classes that type your datasets.
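If you would rather stay with untyped data frames, one possible alternative (not the typed approach above) is to build the array column yourself with `array` from `org.apache.spark.sql.functions`. This is a minimal sketch, assuming a local `SparkSession`; the name `combined` is only illustrative:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.array

// Assumption: a throwaway local session, only for this illustration.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("array-join-sketch")
  .getOrCreate()
import spark.implicits._

val a = Seq(1 -> 2, 3 -> 5).toDF("key", "value")
val b = Seq(1 -> 4, 2 -> 5).toDF("key", "value")

// Join with an explicit condition so both `value` columns stay addressable,
// then collapse them into a single array column.
val combined = a.join(b, a("key") === b("key"))
  .select(a("key"), array(a("value"), b("value")).as("value"))

combined.show(false)
```

For the sample data this should yield a single row with key `1` and the array `[2, 4]`. The trade-off compared with `joinWith` is that column names and types are only checked at runtime.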

Hope this helps!
