As you might have noticed, this is not the first time I have tried to send messages to BigQuery. First, I used Cloud Functions. The reason was simple: GCP offered only Dataflow for the job, which was quite expensive, especially when messages were only a few KB in size. The Dataflow setup was just overwhelming.

After a while, we noticed that Cloud Functions wouldn't cut it due to message size limitations, so I created a Cloud Run version of the same thing. It is still a negligibly cheap, usable and low-maintenance setup.

Both of these solutions require your messages to be reasonably small: whatever comes in goes straight to BigQuery. But what if the messages you process are not small? In that case, Dataflow used to be the best solution. Not anymore!

There is now an option for that

Since 28 July 2022, Pub/Sub has had an option in the topic setup where you can set a BigQuery table as the destination for your messages, and the rest is done for you. Judging by the pricing description, the billing is as convenient as if you used Cloud Functions or Cloud Run.

There are multiple options in the setup. For our use case, we only needed to work with “Use topic schema“. When this option is disabled, the data is saved into a single column named data, base64-encoded. I believe that is useful if you want to store a lot of data without needing further processing.
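To sketch what consuming that column looks like, here is a minimal Python example. The row dict is made up for illustration; in a real setup it would come from a BigQuery query result where the data column holds the raw message payload:

```python
import base64
import json

# Hypothetical row as it might appear in an exported BigQuery query result:
# the "data" column holds the raw Pub/Sub message payload, base64-encoded.
row = {"data": base64.b64encode(json.dumps({"test": "hello"}).encode()).decode()}

# Decode the payload back into the original message.
payload = json.loads(base64.b64decode(row["data"]))
print(payload["test"])  # hello
```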

If you set up a topic with a schema and enable the “Use topic schema“ option, messages get pushed directly to BigQuery in the format defined by the topic schema.
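For illustration, creating such a subscription by hand looks roughly like this; the project, topic, subscription and table names are made up, and it assumes the topic (with its schema) and the BigQuery table already exist and the Pub/Sub service account can write to the table:

```shell
# Hypothetical names; creates a BigQuery subscription that writes
# messages to the table using the topic schema for column mapping.
gcloud pubsub subscriptions create my-bq-subscription \
  --topic=topic-a \
  --bigquery-table=my-project:my_dataset.my_table \
  --use-topic-schema
```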

Avro vs. Protocol Buffer format

At first, I thought that Pub/Sub would infer the format from the message the same way my implementations did. Boy, oh boy, was I wrong. You need to define the schema yourself, picking from two format options: Avro and Protocol Buffer.

First, I tried the Avro format. It has a long history and is well documented. But! I had a hard time defining fields that can accept null values. Let's take this schema:

{
  "fields": [
    {
      "name": "test",
      "type": ["string", "null"]
    }
  ],
  "name": "Avro",
  "type": "record"
}

On paper, this schema has no problem accepting null items, say this object:

{
  "test": null
}

The trouble starts with the non-null values: in the Avro JSON encoding, a union value has to be wrapped in its type name, so you must send {"test": {"string": "value"}} rather than a plain {"test": "value"}.

The Protocol Buffer schema, on the other hand, has no problem accepting messages with null values. Let's take this schema:

syntax = "proto3";

message ProtocolBuffer {
  string string_field = 1;
  int32 int_field = 2;
  bool bool_field = 3;
}

It accepts:

{
  "bool_field": true,
  "int_field": null
}

In the proto3 JSON mapping, null is an accepted value for any field and is simply treated as the field's default value.

The schema support is fairly limited. I investigated the wrappers from google/protobuf/wrappers.proto with no luck: imports are not supported, and you also can't use well-known types like Timestamp. Check the issue on the issue tracker; that might have changed by now. Another problem is that Pub/Sub doesn't log anywhere that a message failed schema validation. Handling that is entirely up to you, the programmer: if you don't log the issue in your app, it won't appear anywhere. You can watch the pubsub.googleapis.com/topic/send_request_count metric to see whether a topic had schema validation issues; once a message fails validation, the metric's invalid_argument counter increases.

Terraform module

Yes, I prepared a Terraform module for that. In a way, I just had to add more arguments to the existing module. Here is a small example:

locals {
  schema = [
    {
      name = "i",
      type = "INTEGER",
      mode = "NULLABLE",
    },
  ]
}

resource "google_bigquery_dataset" "dataset" {
  dataset_id = "pubsub_to_bq_example_dataset"
  location   = "EU"
}

resource "google_bigquery_table" "table" {
  dataset_id          = google_bigquery_dataset.dataset.dataset_id
  table_id            = "test_table"
  deletion_protection = false

  schema = jsonencode(local.schema)
}

module "pubsub_to_bq" {
  source  = "AckeeCZ/pubsub/gcp"
  version = "2.3.0"

  project = var.project
  topics = {
    "topic-a" = {
      schema_definition = <<-EOT
      syntax = "proto3";

      message ProtocolBuffer {
        int32 i = 1;
      }
      EOT
      schema_type = "PROTOCOL_BUFFER"
      bigquery_config = {
        table            = "${var.project}.${google_bigquery_table.table.dataset_id}.${google_bigquery_table.table.table_id}"
        use_topic_schema = true
      }
    }
  }
}

The final words

I just hope GCP won't overprice it once people start using it. I do not want to write a fourth article on how to send messages from Pub/Sub to BigQuery. Don't get me wrong, it is nice to share, but if Google had created this setup in the first place, we wouldn't have needed to waste time on all these little setups.

Overall, it’s good that they added the option. Hopefully, it will get advertised better and people will stop using the expensive Dataflow templates. If you enjoyed this post, like, comment and subscribe.

Are you interested in working together? We want to hear more. Let's discuss it in person!
