Synapse Serverless SQL Pool Processing Too Much Data From Cosmos
Hi,
We have an online platform for our clients that currently stores data they input in Cosmos, in JSON format, including some nested JSONs.
For these clients, we also provide a variety of Power BI reports based on this Cosmos data. To retrieve and sometimes transform the data from Cosmos, we use Synapse T-SQL queries to create views from the Cosmos data, that Power BI can then link to.
All Synapse queries use the following to flatten the Cosmos data into a suitable format for Power BI:
– OPENROWSET – to access the data in a specific Cosmos container
– CROSS APPLY/OUTER APPLY – to access data stored in nested JSONs
– WHERE clause to reduce the amount of data stored in the final view that is accessed by Power BI
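To illustrate, our views follow roughly this pattern (a simplified sketch: the account, database, container, credential, and JSON paths below are placeholders, not our real names):

```sql
-- Simplified sketch of one of our views. Account, database, container,
-- credential name, and column paths are all placeholders.
CREATE VIEW dbo.vClientItems AS
SELECT doc.id, doc.clientId, item.sku, item.qty
FROM OPENROWSET(
        PROVIDER = 'CosmosDB',
        CONNECTION = 'Account=ouraccount;Database=ourdb',
        OBJECT = 'OurContainer',
        SERVER_CREDENTIAL = 'OurCosmosCredential'
     )
     WITH (
        id       varchar(64),
        clientId varchar(64),
        items    varchar(max) '$.items'   -- nested JSON array kept as raw text
     ) AS doc
CROSS APPLY OPENJSON(doc.items)            -- flatten the nested array into rows
     WITH (
        sku varchar(32) '$.sku',
        qty int         '$.qty'
     ) AS item
WHERE doc.clientId = 'client-123';         -- applied after the container is read
```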
However, as a result, we are encountering an issue with the amount of data processed each time a report is refreshed.
It appears that despite the WHERE clause, OPENROWSET retrieves all of the data from the container before filtering the rows. For example, if the Cosmos container stores 16,000 rows which, if queried in full, would require 3GB of data to be processed, then even with a WHERE clause that reduces the final result to 1,000 rows, 3GB of data is still processed every time the query runs.
Consequently, we are encountering scenarios where refreshing a Power BI report containing fewer than 100 rows of data still processes massive amounts of data in the Synapse queries, incurring high costs.
As we have multiple queries running multiple times per day, we are facing escalating costs, e.g.
3GB per query x 15 queries x 5 refreshes per day = 225GB of data processed per day, when it should be significantly lower. This example is also conservative: it is based on test data rather than live data, and it is after we have already reduced the number of daily refreshes for our clients to keep costs down.
As data is added to the container each month, the amount processed and the cost is only increasing.
Furthermore, this has led to situations where we have exceeded our Synapse capacity, causing the Power BI reports to fail to refresh at all and preventing clients from reviewing their data.
Does anyone have any suggestions as to how we can limit the data being returned in the OPENROWSET clause so it does not process the whole Cosmos container’s worth of data when a query is run?