radhika sharma
4 min read · Jul 28, 2021

BigQuery: How to get the total number of new rows inserted in a BigQuery table as part of batch processing

I have seen a few questions on Stack Overflow around batch metadata, specifically how to find the number of rows inserted into a BigQuery table for every run. The question is not how to get the total record count of a table after a batch data pipeline has run, but how to find the number of rows that were inserted into the table by that run.

For example, a BigQuery table is loaded every day as part of batch processing, and there is a need to find the number of new records that were inserted into the table, or to check whether there were any errors while loading it.

This can be implemented by monitoring BigQuery logs and creating specific sinks. The steps are explained below.

The BigQuery table CRIME below is loaded by a BigQuery job that runs an INSERT every day. The same implementation also works for MERGE (insert and update), but the query against the log table will differ. A sketch of such a daily load is shown below.
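As an illustration only, a daily load job might look like the following. The staging table and the column names are hypothetical and simply stand in for whatever your pipeline actually runs; only the target demo_dataset.crime table comes from this example.

-- Hypothetical daily INSERT into the CRIME table (staging table and columns are illustrative)
INSERT INTO `demo_dataset.crime` (case_number, primary_type, occurred_at)
SELECT case_number, primary_type, occurred_at
FROM `demo_dataset.crime_staging`
WHERE DATE(occurred_at) = CURRENT_DATE();

-- The same logging approach also works if the daily load is a MERGE instead of an INSERT
MERGE `demo_dataset.crime` t
USING `demo_dataset.crime_staging` s
ON t.case_number = s.case_number
WHEN MATCHED THEN
  UPDATE SET primary_type = s.primary_type
WHEN NOT MATCHED THEN
  INSERT (case_number, primary_type, occurred_at)
  VALUES (s.case_number, s.primary_type, s.occurred_at);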

Step 1

In Cloud Logging, go to the Logs Explorer and select the resource type "bigquery_dataset", with dataset_id set to the name of the BigQuery dataset that contains the table whose logs we want to filter. We can narrow the logs down further by providing the table name.

When a BigQuery job that loads data into the CRIME table is triggered, the corresponding log entries appear in the Logs Explorer.

The tableDataChange log entry provides the details about the number of records inserted into the table as part of the batch process. You will also see logs related to "tableDataRead", which we are not interested in for this case.

Hence, the query to get the specific logs can be narrowed down as below:

resource.type="bigquery_dataset"
resource.labels.dataset_id="demo_dataset"
protoPayload.methodName="google.cloud.bigquery.v2.JobService.InsertJob"
protoPayload.resourceName: "projects/graphical-reach-285218/datasets/demo_dataset/tables/crime"
protoPayload.metadata.tableDataChange.reason="QUERY"

Step 2

Once we have identified the filter for our specific case, a sink can be created to route the matching log entries from Cloud Logging to BigQuery as shown below.

While creating the sink, the sink service should be set to "BigQuery dataset", since we need to route the logs to BigQuery for dashboard creation. Other sink services can be selected depending on the requirement. A separate dataset, "Bigquery_Logs", is created as the destination for the routed logs.

Step 3

Once the sink is created, the logs are routed to the BigQuery dataset, and insertedRowsCount is present inside the column protopayload_auditlog.metadataJson. The value of insertedRowsCount can be extracted using standard SQL string or JSON functions. If multiple runs happen in a day, the latest insert count can be identified from the timestamp column in the audit table, as in the query sketch below.
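As a minimal sketch, assuming the sink created an audit table named cloudaudit_googleapis_com_data_access_20210728 in the Bigquery_Logs dataset (the exact table name depends on the log and the sink configuration), the latest insert count for the CRIME table could be pulled out like this:

-- Latest insertedRowsCount for the CRIME table from the audit log metadata
SELECT
  timestamp,
  JSON_EXTRACT_SCALAR(protopayload_auditlog.metadataJson,
                      '$.tableDataChange.insertedRowsCount') AS inserted_rows_count
FROM `Bigquery_Logs.cloudaudit_googleapis_com_data_access_20210728`
WHERE protopayload_auditlog.resourceName LIKE '%tables/crime'
ORDER BY timestamp DESC   -- latest run first, when there are multiple runs per day
LIMIT 1;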

One important point to understand is that the audit table created above is not partitioned, because I did not select "Use partitioned tables" while configuring the sink destination. Hence, a new audit/log table is created every day, with the date as a suffix in the table name.

However, if the "Use partitioned tables" option is selected while creating the sink, all the logs are stored in a single table partitioned by day, which improves query performance. Either way, the daily insert counts can be queried as sketched below.
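With date-sharded tables, a wildcard query can pull the daily insert counts across all shards. The table prefix below follows the usual audit log naming and is an assumption about what your sink produces:

-- Total rows inserted into the CRIME table per day, across all date-sharded audit tables
SELECT
  _TABLE_SUFFIX AS log_date,
  SUM(CAST(JSON_EXTRACT_SCALAR(protopayload_auditlog.metadataJson,
           '$.tableDataChange.insertedRowsCount') AS INT64)) AS inserted_rows
FROM `Bigquery_Logs.cloudaudit_googleapis_com_data_access_*`
WHERE protopayload_auditlog.resourceName LIKE '%tables/crime'
GROUP BY log_date
ORDER BY log_date;

If the partitioned option was chosen instead, essentially the same query runs against the single table (for example Bigquery_Logs.cloudaudit_googleapis_com_data_access), grouping on DATE(timestamp) rather than _TABLE_SUFFIX.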

This is how we can monitor the logs, create a filter for a specific use case or error case, and sink the result to BigQuery as explained above. Dashboards can then be built on top of this table in BigQuery.

I am sure there are other ways to do this, but I hope this opens doors for other ideas and learning.

Happy Learning :-)
