写点什么

A Guide to Write Elegant ETL in Easy SQL

作者:Bright
  • 2022 年 6 月 05 日
  • 本文字数:3098 字

    阅读完需:约 10 分钟

A Guide to Write Elegant ETL in Easy SQL

In the previous post, we talked about a new ETL language – Easy SQL. You may be very curious about how to write ETL in Easy SQL. Let’s take a peek at it today.

Easy SQL

First of all, let me refresh your mind again of Easy SQL.

Easy SQL is built to ease the data ETL development process. With Easy SQL, you can develop your ETL in SQL in an imperative way.

It defines a few simple syntax on top of standard SQL, with which SQL could be executed one by one. Easy SQL also provides a processor to handle all the new syntax.

Since this is SQL agnostic, any SQL engine could be plugged-in as a backend. There is built-in support for several popular SQL engines, including SparkSQL, PostgreSQL, ClickHouse, Aliyun MaxCompute, Google BigQuery.

To help with ETL development process, Easy SQL provides a few useful tools with it. An important one is the debugger. It is used to debug ETL in any interactive environment, E.g. Jupyter notebook, or IPython, or the simple Python interactive shell. Another important tool is the testing tool. It helps developers to write tests with a lot of pain removed.

Your first ETL in Easy SQL

Prepare environment

Easy SQL is a very light-weight python library. The common Python library conventions are followed. It’s easy to build or install Easy SQL.

Install Easy SQL

Install Easy SQL using pip: python3 -m pip install easy_sql-easy_sql

Dependencies

Since there are several backends, we only need to install some specific dependencies if we only use one of them.

For Spark, you need to install some version of PySpark.

For other backends, install the dependencies as listed below:

# for pg/clickhouse backend onlySQLAlchemy==1.3.23# for pg backend onlypsycopg2-binary==2.8.6# for clickhouse backend onlyclickhouse-driver==0.2.0clickhouse-sqlalchemy==0.1.6# for BigQuery backend onlysqlalchemy-bigquery==1.4.3# for MaxCompute backend onlypyodps==0.10.7.1
复制代码


If we’d like to run the ETL with the command-line tool provided by Easy SQL. We need to install the click package by python3 -m pip install click==6.7.

Write ETL

When the environment is ready, we can write and run our First ETL.

For spark backend

Create a file named sample_etl.spark.sql with content as below:

-- prepare-sql: drop database if exists sample cascade-- prepare-sql: create database sample-- prepare-sql: create table sample.test as select 1 as id, '1' as val
-- target=variablesselect true as __create_output_table__
-- target=variablesselect 1 as a
-- target=log.aselect '${a}' as a
-- target=log.test_logselect 1 as some_log
-- target=check.should_equalselect 1 as actual, 1 as expected
-- target=temp.resultselect ${a} as id, ${a} + 1 as valunion allselect id, val from sample.test
-- target=output.sample.resultselect * from result
-- target=log.sample_resultselect * from sample.result
复制代码

Run it with command:

bash -c "$(python3 -m easy_sql.data_process -f sample_etl.spark.sql -p)"
复制代码

For postgres backend

You need to start a postgres instance first.

If you have docker, run the command below:

docker run -d --name postgres -p 5432:5432 -e POSTGRES_PASSWORD=123456 postgres
复制代码


Create a file named sample_etl.postgres.sql with content as the test file here.

Run it with command:

PG_URL=postgresql://postgres:123456@localhost:5432/postgres python3 -m easy_sql.data_process -f sample_etl.postgres.sql
复制代码


For clickhouse backend

You need to start a clickhouse instance first.

If you have docker, run the command below:

docker run -d --name clickhouse -p 9000:9000 yandex/clickhouse-server:20.12.5.18
复制代码

Create a file named sample_etl.clickhouse.sql with content as the test file here.

Run it with command:

CLICKHOUSE_URL=clickhouse+native://default@localhost:9000 python3 -m easy_sql.data_process -f sample_etl.clickhouse.sql
复制代码

For other backends

The usage is similar, please refer to API.

Run ETL in your code

Easy SQL can be used as a very light-weight library. If you’d like to run ETL programmatically in your code. Please refer to the code snippets below:

from pyspark.sql import SparkSession
from easy_sql.sql_processor import SqlProcessorfrom easy_sql.sql_processor.backend import SparkBackend
if __name__ == '__main__': spark = SparkSession.builder.enableHiveSupport().getOrCreate() backend = SparkBackend(spark) sql = '''-- target=log.some_logselect 1 as a ''' sql_processor = SqlProcessor(backend, sql) sql_processor.run()
复制代码

More sample code about other backends could be referred here.

Conclusion

Now we had a glance at how to write ETL in Easy SQL. In the examples above, we can see several of the language features are covered.

  • An imperative structure of ETL code.Split by -- target=..., ETL is broken down into steps and each step could be executed one by one.We can define a temporary table by -- target=temp.{TEMPORARY_TABLE_NAME} and we can refer to it in the following steps.We can write data to some output table by -- target=output.{OUTPUT_TABLE_NAME} with its data provided by the following select SQL.

  • Variables which could be defined and modified any time.Defined by -- target=variables, we can write a simple select SQL to define variables.Variables could be changed by another -- target=variables step.Variables could be referenced by ${VARIABLE_NAME}.

  • Logging and assertion that could be used for debugging.Log by -- target=log.{LOG_NAME} with its data provided by the following select SQL.Assert by -- target=check.{ASSERTION_NAME} with its actual and expected data provided by the following select SQL.

There are several language features not mentioned above. E.g. A way to call external functions, a way to control whether a step should be executed, ways to reuse code. We’ll talk about them in the following posts.

Elegant ETL

How to write elegant ETL in SQL? With the language features provided by Easy SQL, we now have the ability to implement anything in SQL. We don’t need to mix our ETL with other programming languages. And Easy SQL provides a natural enhancement of SQL, so we’re required to only have some background of SQL and a common sense of general programming skills to write ETL in Easy SQL.

Why is it elegant? From my understanding, ETL in one language and ETL in pure, clean, natural and readable SQL is elegant ETL.

发布于: 刚刚阅读数: 4
用户头像

Bright

关注

Code speaks! 2020.07.17 加入

从事数据开发、架构、咨询工作多年。 在投身数据工作之前,曾从事软件开发工作 多年。熟悉软件工程方法、敏捷实践。擅长将软件开发领域内优秀的工程实践应用到数据领域,以提升企业数据工程能力。

评论

发布
暂无评论
A Guide to Write Elegant ETL in Easy SQL_数据开发_Bright_InfoQ写作社区