
Google Data Engineering: Creating a Data Transformation Pipeline with Cloud Dataprep Quest

While preparing for my Google Cloud Certified Professional Data Engineer certification, I did some hands-on practice with labs related to BigQuery, Dataflow, Cloud Composer, TensorFlow, etc. Each of these labs can be done using lab credits, which you need to pay for. For the Cloud Dataprep lab, you get around 1 hour 15 minutes, and the entire lab needs to be completed in that time.

Almost all the tasks in this lab are straightforward and can be done easily by following the instructions.

In one of the tasks, a BigQuery dataset “ecommerce” and a table “all_sessions_raw_dataprep” are created, which are then imported while preparing the Dataprep flow. One issue you might face while importing a dataset from BigQuery into a Dataprep flow is an error along the lines of “Could not create datasets. Requested data was not found”. Sorry for not providing the exact error text; I did not save it because I did not think of writing this post then :-) .
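For reference, here is a minimal sketch of how that dataset and table could be created programmatically with the google-cloud-bigquery client, instead of through the BigQuery console as the lab does. The source table `data-to-insights.ecommerce.all_sessions_raw` and the date filter are my assumptions about the public sample data the lab copies from; adjust them to whatever your lab instructions show.

```python
# Minimal sketch: create the "ecommerce" dataset and the
# "all_sessions_raw_dataprep" table that the Dataprep flow imports.
# Assumes Application Default Credentials are configured, and that
# `data-to-insights.ecommerce.all_sessions_raw` is the public sample
# table used by the lab (an assumption, not confirmed by this post).
from google.cloud import bigquery

client = bigquery.Client()  # uses your default GCP project

# Create the dataset if it does not already exist.
client.create_dataset("ecommerce", exists_ok=True)

# Copy one day of raw sessions into the table Dataprep will import.
sql = """
CREATE OR REPLACE TABLE ecommerce.all_sessions_raw_dataprep AS
SELECT *
FROM `data-to-insights.ecommerce.all_sessions_raw`
WHERE date = '20170801'
"""
client.query(sql).result()  # .result() blocks until the job finishes
print("Table ecommerce.all_sessions_raw_dataprep created")
```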

If such an error comes up, you need to check the directory paths on the user profile page. A Google Cloud Storage (GCS) bucket is used while uploading and running the Dataprep jobs. Please see below.

The user profile page can be opened by clicking on the user profile shown below. I have logged in using my own account, but this may vary depending on the account you have logged in with.

The upload directory, job run directory, and temp directory should all point to a valid GCS location. It is possible that these paths are set to an old GCS location that has been deleted or no longer exists. In my case, none of the directories existed because I had deleted the GCS bucket and its directories.

After I created a new GCS bucket and provided the right paths for all of the above directories, I was able to import and add the BigQuery dataset in Cloud Dataprep.
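If you want a quick sanity check before re-running the import, here is a small sketch using the google-cloud-storage client to confirm that the bucket behind those directory paths actually exists. The bucket name `dataprep-staging-bucket` and the directory prefixes are placeholders; use the values shown on your own user profile page.

```python
# Sanity check: confirm the GCS bucket referenced by the Dataprep
# upload / job run / temp directories actually exists.
# `dataprep-staging-bucket` and the prefixes below are placeholders;
# substitute the values from your Dataprep user profile page.
from google.cloud import storage

BUCKET_NAME = "dataprep-staging-bucket"      # placeholder
DIRECTORIES = ["uploads", "jobrun", "temp"]  # placeholder prefixes

client = storage.Client()
bucket = client.bucket(BUCKET_NAME)

if not bucket.exists():
    # This is exactly the situation that caused my import error:
    # the profile paths pointed at a bucket that had been deleted.
    print(f"Bucket gs://{BUCKET_NAME} does not exist - create it "
          "and update the paths on the user profile page.")
else:
    for prefix in DIRECTORIES:
        blobs = list(client.list_blobs(BUCKET_NAME, prefix=prefix,
                                       max_results=1))
        status = "ok" if blobs else "empty (Dataprep will create objects here)"
        print(f"gs://{BUCKET_NAME}/{prefix}: {status}")
```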

The rest of tasks 1 to 7 can be completed by following the instructions in the lab.

However, for task 8, below are the steps that can be followed before running and scheduling Cloud Dataprep jobs that publish to BigQuery.

  • The flow below is created as part of tasks 1 to 7 and contains all the Dataprep recipes.
  • To load the output of the job into a BigQuery dataset, go to your flow and open the recipe as shown above.
  • Click on “Run Job”.
  • The screen below will appear, with an Edit option to select BigQuery or GCS as the destination for the output data. I have already selected BigQuery and provided the table name.
  • Click on Edit and the page below will appear. Select a dataset where you want to load the output data.
  • Once the dataset is selected, an option to create a new table or use an existing table will appear.
  • Provide a table name by clicking on “Create a new table”, or select an existing table and choose one of the write options provided.
  • Once everything is selected, click on “Run Job” and you are all set :-) (to verify the published output afterwards, see the sketch after this list).
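Once the job finishes, a quick way to confirm that Dataprep actually published to BigQuery is to query the destination table’s row count. This is a sketch, with `ecommerce.dataprep_output` as a placeholder; substitute whatever dataset and table name you entered on the Run Job page.

```python
# Sketch: verify the Dataprep job published rows to BigQuery.
# `ecommerce.dataprep_output` is a placeholder destination table;
# use the dataset and table you selected on the Run Job page.
from google.cloud import bigquery

client = bigquery.Client()
table_id = "ecommerce.dataprep_output"  # placeholder

rows = client.query(
    f"SELECT COUNT(*) AS row_count FROM `{table_id}`"
).result()
for row in rows:
    print(f"{table_id} contains {row.row_count} rows")
```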

Thank you all!

Regards

Radhika
