- Question or problem
- Answer or solution
- Reading Iceberg Tables Using AWS Glue in a Cross-Account Setup
- Understanding Your Setup
- Producer Account Configuration
- Consumer Account Configuration
- Question Responses
- Question #1: What is mycatalog?
- Question #2: Access Requirements for Consumer Account
- Question #3: Accessing the Table Name
- Conclusion
Question or problem
I'd like to learn more about the nuances of reading Iceberg tables with AWS Glue in a cross-account setup, and I'd appreciate your feedback.
AWS account#producer: Produces Iceberg tables using AWS Glue catalogs on S3 in Parquet format. Say the table participant_stream lives at s3://mybucket/warehouse, and the database name is mydb.
AWS account#consumer: A FlinkSQL job reads from the Iceberg table
CREATE CATALOG `mycatalog`
WITH (
  'type'='iceberg',
  'catalog-impl'='org.apache.iceberg.aws.glue.GlueCatalog',
  'io-impl'='org.apache.iceberg.aws.s3.S3FileIO',
  'warehouse'='s3://mybucket/warehouse'
);
USE CATALOG `mycatalog`;
CREATE TEMPORARY VIEW default_catalog.default_database.participant_distinct AS
SELECT *
FROM (SELECT *,
             ROW_NUMBER() OVER (PARTITION BY uuid
                                ORDER BY `timestamp`, `offset` DESC) AS rownum
      FROM mycatalog.mydb.participants_stream /*+ OPTIONS('streaming'='true', 'monitor-interval'='15s') */)
WHERE rownum = 1;
Question#1: Is mycatalog a temporary Flink pointer to the location where the catalog lives? Since my setup is cross-account, I only have read permissions.
Question#2: Does my consumer account need read-only access to S3 and AWS Glue?
Question#3: After USE CATALOG, can I refer to the table directly by name with the database prefix, or do I have to refer to it as mycatalog.mydb.participants_stream?
Answer or solution
Here is a detailed response to your question about reading Iceberg tables using AWS Glue in a cross-account setup.
Reading Iceberg Tables Using AWS Glue in a Cross-Account Setup
Understanding Your Setup
In your described setup, you have two AWS accounts: the producer account, where the Iceberg tables are created and stored in an S3 bucket, and the consumer account, where a FlinkSQL job is executed to read the Iceberg tables.
Producer Account Configuration
- Account Name: AWS Account#producer
- Iceberg Table Location: s3://mybucket/warehouse
- Database Name: mydb
- Table Name: participant_stream
- Storage Format: Parquet
Consumer Account Configuration
- Account Name: AWS Account#consumer
- FlinkSQL Job Executing on the Iceberg Table
Question Responses
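Question #1: What is mycatalog?
Yes, in effect. mycatalog in your FlinkSQL job is a name registered in Flink that points at your Iceberg catalog, and by default the registration lasts only for the session. CREATE CATALOG does not create or change anything in AWS Glue or S3; it only lets Flink read the table metadata that Iceberg keeps in AWS Glue. Because the Glue Data Catalog lives in a different AWS account, two things have to be in place: IAM policies and roles that allow the access, and a catalog definition that targets the producer's Glue catalog. Iceberg's GlueCatalog uses the caller's own account by default, and the glue.id catalog property is how it is pointed at the Glue catalog of another account.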
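As a minimal sketch of what a cross-account catalog definition could look like, the helper below renders the CREATE CATALOG statement from the question with one extra option, glue.id, set to the producer's account ID. The function name create_catalog_sql and the account ID 111122223333 are hypothetical placeholders; the other options are copied from the question.

```python
def create_catalog_sql(name: str, warehouse: str, glue_catalog_id: str) -> str:
    """Render a Flink SQL CREATE CATALOG statement for an Iceberg Glue catalog."""
    options = {
        "type": "iceberg",
        "catalog-impl": "org.apache.iceberg.aws.glue.GlueCatalog",
        "io-impl": "org.apache.iceberg.aws.s3.S3FileIO",
        "warehouse": warehouse,
        # Assumption: the Glue catalog to read is the producer account's,
        # identified by the producer's 12-digit AWS account ID.
        "glue.id": glue_catalog_id,
    }
    body = ",\n".join(f"  '{key}'='{value}'" for key, value in options.items())
    return f"CREATE CATALOG `{name}`\nWITH (\n{body}\n);"


PRODUCER_ACCOUNT_ID = "111122223333"  # hypothetical placeholder
ddl = create_catalog_sql("mycatalog", "s3://mybucket/warehouse", PRODUCER_ACCOUNT_ID)
print(ddl)
```

The rendered statement can be pasted into the SQL client or, if the job is driven from PyFlink, passed to TableEnvironment.execute_sql.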
Question #2: Access Requirements for Consumer Account
Yes, your consumer AWS account will need read-only access to both S3 and AWS Glue in the producer account. This access is crucial to perform read operations on the Iceberg tables stored in S3 as well as to retrieve the metadata from the AWS Glue catalog. To set this up:
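- S3 Access: Cross-account S3 access has to be allowed on both sides: a bucket policy on mybucket in the producer account that names the consumer account's IAM role/user as Principal, and an IAM policy attached to that role/user in the consumer account. Both need to allow s3:GetObject on the objects under s3://mybucket/warehouse and s3:ListBucket on the bucket. Example IAM Policy for S3 (identity-policy form; the bucket policy additionally needs a Principal element):
  {
    "Version": "2012-10-17",
    "Statement": [
      {
        "Effect": "Allow",
        "Action": ["s3:GetObject", "s3:ListBucket"],
        "Resource": [
          "arn:aws:s3:::mybucket",
          "arn:aws:s3:::mybucket/*"
        ]
      }
    ]
  }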
- Glue Catalog Access: Similarly, grant read access to the producer's AWS Glue Data Catalog. This typically means a Glue resource policy in the producer account that allows the consumer's role, plus an IAM policy on that role in the consumer account. The read operations needed include glue:GetTable, glue:GetTables, and glue:GetDatabase. Example IAM Policy for AWS Glue, where REGION and ACCOUNT_ID are the producer's region and account ID:
  {
    "Version": "2012-10-17",
    "Statement": [
      {
        "Effect": "Allow",
        "Action": ["glue:GetTable", "glue:GetTables", "glue:GetDatabase"],
        "Resource": [
          "arn:aws:glue:REGION:ACCOUNT_ID:catalog",
          "arn:aws:glue:REGION:ACCOUNT_ID:database/mydb",
          "arn:aws:glue:REGION:ACCOUNT_ID:table/mydb/participants_stream"
        ]
      }
    ]
  }
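The two example policies above carry no Principal, so they read as identity policies for the consumer's role. A rough sketch of the matching producer-side resource policies, which name the consumer role explicitly, might look like the following. The role ARN, account IDs, and region are hypothetical placeholders, and the helper names are made up for illustration; the actions and resource ARNs mirror the examples above.

```python
import json


def bucket_policy(bucket: str, prefix: str, consumer_role_arn: str) -> dict:
    """Bucket policy letting one consumer role list the bucket and read the warehouse."""
    principal = {"AWS": consumer_role_arn}
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": principal,
                "Action": "s3:ListBucket",  # bucket-level action
                "Resource": f"arn:aws:s3:::{bucket}",
            },
            {
                "Effect": "Allow",
                "Principal": principal,
                "Action": "s3:GetObject",  # object-level action
                "Resource": f"arn:aws:s3:::{bucket}/{prefix}/*",
            },
        ],
    }


def glue_resource_policy(region: str, producer_account_id: str, database: str,
                         table: str, consumer_role_arn: str) -> dict:
    """Glue resource policy granting read-only metadata access to one table."""
    base = f"arn:aws:glue:{region}:{producer_account_id}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"AWS": consumer_role_arn},
                "Action": ["glue:GetTable", "glue:GetTables", "glue:GetDatabase"],
                "Resource": [
                    f"{base}:catalog",
                    f"{base}:database/{database}",
                    f"{base}:table/{database}/{table}",
                ],
            }
        ],
    }


# Hypothetical placeholders: substitute the real role, region, and account ID.
CONSUMER_ROLE = "arn:aws:iam::444455556666:role/flink-consumer"
s3_doc = bucket_policy("mybucket", "warehouse", CONSUMER_ROLE)
glue_doc = glue_resource_policy("us-east-1", "111122223333", "mydb",
                                "participants_stream", CONSUMER_ROLE)
print(json.dumps(s3_doc, indent=2))
print(json.dumps(glue_doc, indent=2))
```

Scoping GetObject to the warehouse prefix rather than the whole bucket is a design choice of this sketch; the original example grants it on every object in mybucket.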
Question #3: Accessing the Table Name
Once you have executed USE CATALOG to make mycatalog the active catalog, you can refer to the table by database and table name without the catalog prefix. The database is mydb, not default_database: default_database belongs to Flink's built-in default_catalog, which is where your temporary view is created. Using the table name from your query:
SELECT * FROM mydb.participants_stream
The fully qualified form mycatalog.mydb.participants_stream keeps working as well, and it is a good practice to use it in environments where multiple catalogs and databases exist, to avoid ambiguity.
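To make the name resolution concrete, here is a small Python model of how a one-, two-, or three-part table identifier is completed from the current catalog and current database. It is an illustration of the rule, not Flink's implementation; the function name resolve is hypothetical, and backtick-quoted identifiers that contain dots are not handled.

```python
def resolve(identifier: str, current_catalog: str, current_database: str) -> tuple:
    """Expand a table identifier to (catalog, database, table).

    Missing leading parts are filled in from the session's current
    catalog and current database.
    """
    parts = identifier.split(".")
    if len(parts) == 1:
        return (current_catalog, current_database, parts[0])
    if len(parts) == 2:
        return (current_catalog, parts[0], parts[1])
    if len(parts) == 3:
        return (parts[0], parts[1], parts[2])
    raise ValueError(f"expected 1 to 3 name parts, got {len(parts)}: {identifier!r}")


# After USE CATALOG mycatalog, a two-part name resolves inside mycatalog.
# "default" stands in for whatever the catalog's current database is.
short = resolve("mydb.participants_stream", "mycatalog", "default")
full = resolve("mycatalog.mydb.participants_stream", "mycatalog", "default")
view = resolve("default_catalog.default_database.participant_distinct",
               "mycatalog", "default")
print(short, full, view)
```

Under this model the two-part and three-part names of the table resolve to the same object, while the temporary view stays reachable only through its own default_catalog.default_database prefix.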
Conclusion
Setting up cross-account access to Iceberg tables in AWS using Glue involves careful configuration of IAM roles and policies to ensure that the consumer account can read from the S3 bucket and access the Glue catalog effectively. By understanding the roles of catalogs in FlinkSQL and establishing the necessary permissions, you can smoothly read and analyze Iceberg tables across AWS accounts.
Please let me know if you need further clarification or additional information on specific points.