[build] Introduce flink-quickstart docker file #1759
Conversation
Branch updated from 00e0555 to 14783fe (Compare)
Pull Request Overview
This PR introduces a Docker setup for Fluss Quickstart with Flink integration, enabling users to quickly test Fluss with Flink in a containerized environment. The setup includes data generation using the faker connector and pre-configured SQL tables for testing.
- Adds Docker infrastructure for Fluss-Flink quickstart environment
- Creates sample SQL scripts with faker-generated test data for orders, customers, and nations
- Provides automated build preparation script to collect necessary JARs
Reviewed Changes
Copilot reviewed 4 out of 5 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| docker/quickstart-flink/sql/sql-client.sql | Sample SQL script with faker connector table definitions for testing |
| docker/quickstart-flink/prepare_build.sh | Build preparation script that downloads dependencies and copies Fluss JARs |
| docker/quickstart-flink/README.md | Documentation for building and using the quickstart Docker image |
| docker/quickstart-flink/Dockerfile | Docker image definition based on Flink 1.20.0 with Fluss integration |
Comments suppressed due to low confidence (2)
docker/quickstart-flink/prepare_build.sh:1
- The wildcard pattern matching could fail silently if the JAR files don't exist or if multiple JARs match the pattern. Add error handling to check if the source files exist before copying and ensure only one file matches each pattern.
#!/bin/bash
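A minimal sketch of such a single-match guard, assuming a hypothetical `copy_jar` helper (names and paths below are illustrative, not the script's actual ones):

```bash
#!/usr/bin/env bash
# Hypothetical helper: copy exactly one JAR matching a glob, failing loudly otherwise.
copy_jar() {
  local pattern="$1" dest="$2"
  local matches=( $pattern )  # unquoted expansion performs the glob match
  # If nothing matches, the literal pattern remains and the -f test fails;
  # if several files match, the array has more than one element.
  if [ "${#matches[@]}" -ne 1 ] || [ ! -f "${matches[0]}" ]; then
    echo "ERROR: expected exactly one JAR matching '$pattern', found ${#matches[@]}" >&2
    exit 1
  fi
  cp "${matches[0]}" "$dest"
}

copy_jar "../some-module/target/some-module-*.jar" ./lib/  # placeholder path
```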
docker/quickstart-flink/prepare_build.sh:1
- The wget commands lack error handling. If any download fails, the script continues execution which could lead to a broken Docker image. Add error checking after each wget command or use wget with the --no-clobber and --continue flags for better reliability.
#!/bin/bash
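A minimal sketch of the fail-fast variant (the URL is just an illustrative placeholder taken from elsewhere in this thread):

```bash
#!/usr/bin/env bash
set -euo pipefail  # abort on the first failed command instead of continuing

url="https://repo1.maven.org/maven2/org/apache/avro/avro/1.12.0/avro-1.12.0.jar"
# Check the download explicitly so a network failure stops the Docker build
# instead of silently producing a broken image.
wget -q "$url" -O "./lib/$(basename "$url")" || {
  echo "ERROR: failed to download $url" >&2
  exit 1
}
```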
@luoyuxia I left some comments.
One additional thing: should we move the files for the Fluss image (`docker-entrypoint.sh`, `Dockerfile`) from the top level of the `docker` folder into a separate `fluss` folder? I.e., a directory structure like this:
```
docker
|- fluss
|- quickstart-flink
```
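If we do, a minimal sketch of the move (assuming the two files currently sit directly under `docker/`):

```bash
# Hypothetical restructuring: group the Fluss image files under docker/fluss/
mkdir -p docker/fluss
git mv docker/Dockerfile docker/docker-entrypoint.sh docker/fluss/
```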
"flink-shaded-hadoop-2-uber-2.8.3-10.0" | ||
|
||
# Download paimon-flink connector | ||
download_jar \ |
#1727: For our effort to create a quickstart guide for Iceberg, we need to add the Flink Iceberg dependencies to the fluss/quickstart-flink image as well. What do you think @luoyuxia @wuchong?
PS: I feel we can skip the AWS bundle.
```bash
# Download iceberg-flink-runtime for Flink 1.20
download_jar \
  "https://repo1.maven.org/maven2/org/apache/iceberg/iceberg-flink-runtime-1.20/1.9.1/iceberg-flink-runtime-1.20-1.9.1.jar" \
  "./lib/iceberg-flink-runtime-1.20-1.9.1.jar" \
  "" \
  "iceberg-flink-runtime-1.20-1.9.1"

# Download iceberg-aws-bundle for S3/Glue support
download_jar \
  "https://repo1.maven.org/maven2/org/apache/iceberg/iceberg-aws-bundle/1.9.1/iceberg-aws-bundle-1.9.1.jar" \
  "./lib/iceberg-aws-bundle-1.9.1.jar" \
  "" \
  "iceberg-aws-bundle-1.9.1"
```
+1
Also, while running the tiering service for Iceberg, I found that we need the Avro jar as well, otherwise we get the following error:

```
Caused by: java.lang.NoSuchMethodError: 'org.apache.avro.LogicalTypes$TimestampNanos org.apache.avro.LogicalTypes.timestampNanos()'
	at org.apache.iceberg.avro.TypeToSchema.<clinit>(TypeToSchema.java:50)
```
```bash
# Download Avro 1.12.0
download_jar \
  "https://repo1.maven.org/maven2/org/apache/avro/avro/1.12.0/avro-1.12.0.jar" \
  "./lib/avro-1.12.0.jar" \
  "" \
  "avro-1.12.0"
```
@MehulBatra should we also add this instruction to the iceberg integration documentation?
https://fluss.apache.org/docs/next/streaming-lakehouse/integrate-data-lakes/iceberg/
I suggest doing that in #1727 and keeping this PR focused on quickstart-paimon. The reasons:
- Which jars to add for the Iceberg quickstart depends on what that quickstart looks like. For example, if it doesn't use S3, we don't need `iceberg-aws-bundle`.
- There are likely class conflicts to resolve; for example, Iceberg 1.9.0 has dropped support for Hadoop 2, but quickstart-paimon uses the Hadoop 2 jar.
> Also, while running the tiering service for Iceberg, I found that we need the Avro jar as well, otherwise we get the following error:
> Caused by: java.lang.NoSuchMethodError: 'org.apache.avro.LogicalTypes$TimestampNanos org.apache.avro.LogicalTypes.timestampNanos()' at org.apache.iceberg.avro.TypeToSchema.<clinit>(TypeToSchema.java:50)
> # Download Avro 1.12.0 download_jar \ "https://repo1.maven.org/maven2/org/apache/avro/avro/1.12.0/avro-1.12.0.jar" \ "./lib/avro-1.12.0.jar" \ "" \ "avro-1.12.0"

Yes, that's the class conflict I mean. We put Hadoop 2 related classes in FLINK_HOME/lib, but Iceberg requires Hadoop 3, and the documentation also says to use Hadoop 3. I have already created a repo https://github.com/luoyuxia/fluss-iceberg/tree/main/flink/lib to verify Fluss integrating with Iceberg; you can refer to it for which jars to add.
Now I have appended a commit to use Hadoop 3, and quickstart-paimon still works. After upgrading to Hadoop 3, Iceberg can also share this Hadoop 3 lib without the error:

```
Caused by: java.lang.NoSuchMethodError: 'org.apache.avro.LogicalTypes$TimestampNanos org.apache.avro.LogicalTypes.timestampNanos()'
	at org.apache.iceberg.avro.TypeToSchema.<clinit>(TypeToSchema.java:50)
```
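For reference, a sketch of what the swap in prepare_build.sh could look like, reusing the script's `download_jar` helper; the `flink-shaded-hadoop-3-uber` coordinates below are an assumption, not confirmed by this PR:

```bash
# Assumed Hadoop 3 replacement for the hadoop2 uber jar; exact artifact
# coordinates are an unverified assumption, not taken from this PR.
download_jar \
  "https://repo1.maven.org/maven2/org/apache/flink/flink-shaded-hadoop-3-uber/3.1.1.7.2.9.0-173-9.0/flink-shaded-hadoop-3-uber-3.1.1.7.2.9.0-173-9.0.jar" \
  "./lib/flink-shaded-hadoop-3-uber-3.1.1.7.2.9.0-173-9.0.jar" \
  "" \
  "flink-shaded-hadoop-3-uber-3.1.1.7.2.9.0-173-9.0"
```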
@wuchong I have added the jars required specifically for the Iceberg quickstart, for the self-hosted Flink environment part!
@luoyuxia I was unaware that adding Hadoop 3 would resolve the Avro error without needing a separate Avro jar; I will incorporate that in the quickstart!
@wuchong Comments addressed.
@luoyuxia Looks good to me. Thanks for all the hard work and for setting up the foundation!
Purpose
Linked issue: close #1111
Brief change log
Copied from https://github.com/luoyuxia/fluss-quickstart-flink/blob/main/sql-client/Dockerfile, but the jars are downloaded from the web and Fluss is built locally.
Tests
API and Format
Documentation