Joins (CX)
About Joins
Joins allow you to combine rows from 2 or more data sources based on a related column of data that they share. By using a join, you can gather and analyze the combined data more efficiently and effectively, creating more insights.
The data modeler only supports left outer joins.
Understanding Left Outer Joins
To understand how a left outer join works, let’s look at an example.
Look at the image of the dataset below. The first data source on top is our “left” data source, and the second data source on the bottom is our “right” data source.
Let’s say these tables represent the data you can find in each survey:
Store Locations (Left Source)
Location ID | Location name |
555 | Provo |
777 | Dublin |
999 | Seattle |
1000 | Tokyo |
Customer Feedback (Right Source)
Customer ID | Satisfaction (1-5) | Location ID |
101 | 2 | 555 |
102 | 4 | 777 |
103 | 5 | 999 |
104 | 5 | 222 |
You choose to join the data based on Location ID. This field is called a “join key.”
This is the data unique to the second dataset. This would be removed:
Customer ID | Satisfaction (1-5) | Location ID |
104 | 5 | 222 |
This is the final output, or all of the data that would be included in your results:
Location ID | Location name | Customer ID | Satisfaction (1-5) |
555 | Provo | 101 | 2 |
777 | Dublin | 102 | 4 |
999 | Seattle | 103 | 5 |
1000 | Tokyo | Null | Null |
Notice how the Provo, Dublin, and Seattle results contain columns of data from both the Right and Left sources, because these rows shared common location IDs.
There was no data for Tokyo in the Right source, and so the Tokyo row has null values under the new Customer ID and Satisfaction columns.
Importance of Unique Join Keys
Because join keys help to identify rows that need to be combined form the left and right sources, we recommend making sure the join key you use acts like a unique identifier. Otherwise, if there are multiple records in the right source that match the join key of the left source, only one of them will be pulled randomly.
Example: Let’s look at the example we discussed above. We have the same left source. But in the right source, we have these rows:
Customer ID | Satisfaction (1-5) | Location ID |
101 | 2 | 555 |
107 | 4 | 555 |
In the resulting join, only one of the 555 rows will be saved, but not both.
If the join key isn’t unique for each record on a right source, and you want to include all records from both the right and left source, a union should be used instead. Unions pull in each of the records separately instead of combining the rows of information.
Creating Joins
- Create a data model.
- Add at least 2 sources to your data model.
Qtip: Make sure you include all the fields you need in your data sources, including the common field you’ll use to join your data (e.g., a unique ID). - Click the plus sign ( + ) next to the data source you want to act as your Left data source.
- Select Join.
- Name the output. This is helpful if you plan to add multiple joins to your dataset.
- Under Input, select the Right data source.
- Create a join condition. Match the field that each dataset has in common.
Example: Here we are mapping our Unique ID field from each data source to each other.Qtip: Fields of any type can be used in join conditions. We highly recommend using a unique identifier that matches across both data sources.
- If you have at least 1 other, separate data source below the sources you joined, you can create another join using the existing join.
- You can create separate joins in the same dataset. In this screenshot, you’d join the data from 2022 Survey and Actionability, but not the top 2 data sources.
- Finish creating your data model with an output dataset.
Using Auxiliary Joins
Auxiliary joins let you join multiple sources with the same join condition. Thus, they are useful when you want to create multiple, overlapping joins using the same data sources.
Earlier, we talked about how left outer joins work with 2 sources: a right source and a left source. Auxiliary joins let you set multiple left sources for the same right source in a join.
Example of an Auxiliary Join
Let’s say you have a database of store locations, with names linked to IDs. You have 2 years of surveys where you collect feedback on your stores. In the example below, we’ll find the satisfaction rating and customer ID for 2020 and 2021, and link them to a location name.
Store Locations (Left Source)
Location ID | Location name |
555 | Provo |
777 | Dublin |
999 | Seattle |
1000 | Tokyo |
Customer Feedback 2020 (Right Source)
Customer ID | Satisfaction (1-5) | Location ID |
101 | 2 | 555 |
102 | 4 | 777 |
103 | 5 | 999 |
104 | 5 | 222 |
Customer Feedback 2021 (Auxiliary Right Source)
Customer ID | Satisfaction (1-5) | Location ID |
656 | 5 | 1000 |
838 | 4 | 222 |
979 | 3 | 999 |
343 | 5 | 777 |
You join the data based on Location ID.
This is the final output, or all of the data that would be included in your results:
Location ID | Location Name | 2020 Customer ID | 2020 Satisfaction | 2021 Customer ID | 2021 Satisfaction |
777 | Dublin | 102 | 4 | 343 | 5 |
1000 | Tokyo | N/A | N/A | 656 | 5 |
999 | Seattle | 103 | 5 | 979 | 3 |
555 | Provo | 101 | 2 | N/A | N/A |
Notice how 2020 and 2021 data have become separate columns in the same output dataset.
Because 2020 did not have data for Tokyo, but 2021 did, the 2020 columns are empty (N/A) for Tokyo. Likewise, 2021 had no Provo data.
Records with location ID “222” from either year were excluded from the final data source, since the store locations file had no corresponding location for that ID. See Understanding Left Outer Joins for an explanation of how data is excluded.
Creating an Auxiliary Join
- Click the join.
- Go to Join settings.
- Click the plus sign ( + ).
- Select the Right Input.
Example: This is the second source you want to join to your left source. In our example above, this would be the 2021 customer feedback survey. - Set the left column’s join condition.
- Set the right column’s join condition.
Repeat steps as needed to add more auxiliary joins.
Auxiliary Joins vs. Subsequent Joins
In addition to auxiliary joins, you can click the plus sign ( + ) next to your completed join and select Join. We’ll call this method a “subsequent join.”
Generally, auxiliary joins are useful when you want to create multiple overlapping joins using the same data sources. Subsequent joins are best if you want to create a join with 2 additional data sources that were not included in your first join.
Qtip: Subsequent joins are also useful if you want to make multiple joins where a second join key needs to be pulled from another source. For example, let’s say you have a survey with Store ID.
- You use the Store ID to join this data to a separate source with more location data.
- The location data source also has a field called “Store Manager.”
- In your third source, you have company hierarchy information that you’re joining based on the “Store Manager” field.
Using this method, survey data, location data, and the company hierarchy can all be combined in one dataset.
However, there are times that auxiliary joins and subsequent joins can have the same output.
Example: Expanding on our example above, you would get the same results if you created a subsequent join with Location ID from Store Locations on the left and 2021’s location ID on the right.
However, when you create the first join, it doesn’t result in just one Location ID field. For this example, you end up with a column for both the Store Locations and 2020 Customer Feedback versions of the Location ID. If you try to join your data using the Location ID from the 2020 Customer Feedback, the results would be different than they’d be with the auxiliary join.
Here’s what that table of results would look like. Because 2020 has no data for Tokyo, the existing 2021 Tokyo data gets excluded fro the final results.
Location ID | Location Name | 2020 Customer ID | 2020 Satisfaction | 2021 Customer ID | 2021 Satisfaction |
777 | Dublin | 102 | 4 | 343 | 5 |
1000 | Tokyo | N/A | N/A | N/A | N/A |
999 | Seattle | 103 | 5 | 979 | 3 |
555 | Provo | 101 | 2 | N/A | N/A |