If you have operational data in BigQuery that powers dashboards through tools like Tableau, Looker, or Apache Superset, putting a tool like Facet on top of your BigQuery datasets lets business and technical users explore that data interactively.
The engineering team at Facet wanted to understand the performance end-users of our tool may experience on top of BigQuery as monthly data volume grows. This post explains the findings of the BigQuery benchmark tests we have run to date, along with our methodology.
We previously published results for Snowflake on our blog: Snowflake supports interactive analytics at scale.
- Our tests show that BigQuery provides response times suitable for interactively exploring high-volume, high-dimensionality time-series data, with five concurrent users running "slice and dice" queries. The figures below show the median and 99th percentile response times and the BigQuery slots consumed for each test we conducted.
- For 50% of all queries executed, the response time was under 2 seconds for a moderately sized table with 100 million events per month and under 4 seconds for a table with 1.2 billion events per month.
- The 99th percentile response times, a measure of worst-case performance, were slightly over 8 seconds in all but one of our tests; in that test, they exceeded 10 seconds.
- As dataset size increases, BigQuery parallelizes query execution across more slots in its serverless compute environment to return results in a reasonable time. For example, looking at Figure 1 below, BigQuery consumed a median of 11 slots per query on the 100M row / 40 GB dataset, and a median of 31 slots per query on the 300M row / 117 GB dataset.
Figure 1: Median response times of 3 benchmark tests conducted on one month of the Star Schema Benchmark dataset where the number of table rows was 100 million, 300 million, and 1.2 billion and the BigQuery on-demand pricing model was used.
Figure 2: 99th percentile response times of 3 benchmark tests conducted on one month of the Star Schema Benchmark dataset where the number of table rows was 100 million, 300 million, and 1.2 billion and the BigQuery on-demand pricing model was used.
Our test methodology
Our benchmark tests used the Star Schema Benchmark (SSB) dataset, which has been used for over a decade to assess data warehouse performance. We created three datasets with different row counts, from 100 million to 1.2 billion rows. Each dataset was generated with a scale factor chosen to reach the target data volume over a one-month period, then loaded into BigQuery. In BigQuery, the lineorder, customer, supplier, and part tables were denormalized into one flattened table, since interactive analytics is typically done on denormalized, wide datasets with all facts and dimensions residing in one table.
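For readers who want to reproduce this setup, a denormalization of this kind can be expressed as a single `CREATE TABLE ... AS SELECT` over the four SSB tables. The sketch below builds such a statement as a string; the dataset name (`ssb`), the flattened table name, and the specific dimension columns kept are illustrative assumptions, not our exact DDL.

```python
# Sketch: a BigQuery Standard SQL statement that flattens the four SSB
# tables into one wide table. The `ssb` dataset name and the dimension
# columns retained here are illustrative assumptions.
DENORMALIZE_SQL = """
CREATE TABLE ssb.lineorder_flat AS
SELECT
  lo.*,                                   -- all lineorder fact columns
  c.c_city, c.c_nation, c.c_region,       -- customer dimensions
  s.s_city, s.s_nation, s.s_region,       -- supplier dimensions
  p.p_mfgr, p.p_category, p.p_brand1      -- part dimensions
FROM ssb.lineorder AS lo
JOIN ssb.customer  AS c ON lo.lo_custkey = c.c_custkey
JOIN ssb.supplier  AS s ON lo.lo_suppkey = s.s_suppkey
JOIN ssb.part      AS p ON lo.lo_partkey = p.p_partkey
"""

if __name__ == "__main__":
    print(DENORMALIZE_SQL)
```

Running this statement once up front trades storage for query speed: every "slice and dice" query then scans a single table instead of re-joining dimensions on the fly.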
After loading the data, we ran a 12-minute test against each dataset, simulating up to five users simultaneously asking "aggregate over time" and "topN" queries over a one-week timeframe. To minimize the effects of BigQuery caching, and because we used wide datasets, we randomly varied the columns selected by each query along with the time ranges filtered on. This mimics the common usage pattern in the Facet application: you look at a few metrics and a few dimensions, add some filters, watch the data update, add more filters, and repeat until you have your answer. We also excluded response times for the very small number of queries that were served from cache.
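The query randomization described above can be sketched as follows. Each simulated user picks a random metric, dimension, and one-week window, then issues either an "aggregate over time" or a "topN" query. The table name, column names, and date constants are illustrative assumptions, not Facet's actual query templates.

```python
import random

# Illustrative column pools drawn from the SSB schema (assumptions,
# not Facet's actual templates).
METRICS = ["lo_revenue", "lo_quantity", "lo_extendedprice"]
DIMENSIONS = ["c_region", "s_nation", "p_category"]

def random_week(rng, year=1997, month=1):
    """Pick a random one-week window inside a single (assumed) month."""
    start_day = rng.choice([1, 8, 15, 22])
    return (f"{year}-{month:02d}-{start_day:02d}",
            f"{year}-{month:02d}-{start_day + 6:02d}")

def random_query(rng):
    """Return a randomized 'aggregate over time' or 'topN' SQL string."""
    metric = rng.choice(METRICS)
    dim = rng.choice(DIMENSIONS)
    start, end = random_week(rng)
    if rng.random() < 0.5:
        # "aggregate over time": one aggregated point per day in the window
        return (f"SELECT lo_orderdate, SUM({metric}) AS value "
                f"FROM ssb.lineorder_flat "
                f"WHERE lo_orderdate BETWEEN '{start}' AND '{end}' "
                f"GROUP BY lo_orderdate ORDER BY lo_orderdate")
    # "topN": top 10 dimension values by the chosen metric
    return (f"SELECT {dim}, SUM({metric}) AS value "
            f"FROM ssb.lineorder_flat "
            f"WHERE lo_orderdate BETWEEN '{start}' AND '{end}' "
            f"GROUP BY {dim} ORDER BY value DESC LIMIT 10")

if __name__ == "__main__":
    rng = random.Random(0)
    for _ in range(3):
        print(random_query(rng))
```

In a full harness, five worker threads would each draw queries from a generator like this for 12 minutes and record per-query latency; varying both columns and time ranges keeps BigQuery's result cache from answering repeated queries.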
You can get up and running with Facet on your BigQuery data in a matter of minutes through our self-serve onboarding, or you can request a demo. We would also be happy to share additional results on datasets larger than those presented here, and to discuss how you might tune your BigQuery setup for the best performance with Facet.