FlinkSQL: reading Iceberg tables using AWS Glue across accounts

Question or problem

I would like to understand the nuances of reading Iceberg tables through AWS Glue in a cross-account setup, and I would appreciate your feedback.

AWS account#producer: Produces Iceberg tables using the AWS Glue catalog on S3 in Parquet format. Say the table participant_stream is located under s3://mybucket/warehouse, where the database name is mydb.

AWS account#consumer: A FlinkSQL job reads from the Iceberg table

CREATE CATALOG  `mycatalog`
WITH (
    'type'='iceberg',
    'catalog-impl'='org.apache.iceberg.aws.glue.GlueCatalog',
    'io-impl'='org.apache.iceberg.aws.s3.S3FileIO',
    'warehouse'='s3://mybucket/warehouse'
);

USE CATALOG `mycatalog`;

CREATE TEMPORARY VIEW default_catalog.default_database.participant_distinct AS
SELECT *
FROM (SELECT *,
             ROW_NUMBER() OVER (PARTITION BY
                 uuid
                 ORDER BY `timestamp`,`offset` desc) AS rownum
      FROM mycatalog.mydb.participants_stream /*+  OPTIONS('streaming'='true', 'monitor-interval'='15s') */)
WHERE rownum = 1;

Question#1: Is this mycatalog a temporary Flink pointer to the location where the catalog lives? Since this is a cross-account setup, I only have read permissions.

Question#2: Does my consumer account need only read-only access to S3 and AWS Glue?

Question#3: After USE CATALOG, can I refer to the table directly by its database-prefixed name, or do I need to refer to it as mycatalog.mydb.participants_stream?

Answer or solution

Here is a detailed response to your question about reading Iceberg tables through AWS Glue in a cross-account setup.

Reading Iceberg Tables Using AWS Glue in a Cross-Account Setup

Understanding Your Setup

In your described setup, you have two AWS accounts: the producer account, where the Iceberg tables are created and stored in an S3 bucket, and the consumer account, where a FlinkSQL job is executed to read the Iceberg tables.

Producer Account Configuration

  • Account Name: AWS Account#producer
  • Iceberg Table Location: s3://mybucket/warehouse
  • Database Name: mydb
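  • Table Name: participant_stream (the query and Question #3 spell it participants_stream; use whichever name is actually registered in Glue)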
  • Storage Format: Parquet

Consumer Account Configuration

  • Account Name: AWS Account#consumer
  • FlinkSQL Job Executing on the Iceberg Table

Question Responses

Question #1: What is mycatalog?

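Yes. CREATE CATALOG only registers a catalog named mycatalog inside the Flink session; it creates nothing in Glue or S3, and unless a persistent catalog store is configured the registration is gone when the session ends. The name mycatalog is local to Flink and does not need to match anything in AWS. Read-only permissions are enough as long as the job only reads. The warehouse property serves as the default location for newly created tables; existing tables are read from the location recorded in their Glue metadata. In a cross-account setup, keep in mind that GlueCatalog looks up tables in the Glue Data Catalog of the calling account by default, so it has to be pointed at the producer's catalog (Iceberg provides the glue.id catalog property for this), and the IAM policies and roles must allow the access.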
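
A cross-account catalog definition might look like the sketch below. It uses Iceberg's glue.id catalog property, which selects the Glue Data Catalog of another account by its numeric account ID. The ID 111122223333 is a made-up placeholder for the producer account, and the sketch assumes the consumer job's credentials have already been granted read access on the producer side.

```sql
-- Sketch only: 111122223333 is a placeholder for the producer's AWS account ID.
-- 'glue.id' makes GlueCatalog look up databases and tables in the producer's
-- Glue Data Catalog rather than in the catalog of the account running the job.
CREATE CATALOG `mycatalog`
WITH (
    'type'='iceberg',
    'catalog-impl'='org.apache.iceberg.aws.glue.GlueCatalog',
    'io-impl'='org.apache.iceberg.aws.s3.S3FileIO',
    'warehouse'='s3://mybucket/warehouse',
    'glue.id'='111122223333'
);
```

If the producer's catalog is in a different Region from the job, the AWS client Region would also need to point there.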

Question #2: Access Requirements for Consumer Account

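Yes. A job that only reads needs read-only access to the table's files in S3 and to its metadata in the producer's AWS Glue Data Catalog. Because the request crosses an account boundary, both sides have to allow it: the consumer's IAM role needs an identity policy that permits the actions, and the producer has to grant that role access through resource-based policies (an S3 bucket policy and a Glue resource policy). To set this up:

  1. S3 Access: Allow the consumer account's IAM role/user to perform s3:GetObject on the objects under s3://mybucket/warehouse and s3:ListBucket on the bucket itself. The example below is written as an identity policy for the consumer role; the producer's bucket policy grants the same actions and additionally names the consumer role in a Principal element.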

    Example IAM Policy for S3:

    {
       "Version": "2012-10-17",
       "Statement": [
           {
               "Effect": "Allow",
               "Action": ["s3:GetObject", "s3:ListBucket"],
               "Resource": [
                   "arn:aws:s3:::mybucket",
                   "arn:aws:s3:::mybucket/*"
               ]
           }
       ]
    }
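
  2. Glue Catalog Access: Similarly, grant read access to the producer's Glue Data Catalog for the operations the job needs, such as glue:GetTable, glue:GetTables, and glue:GetDatabase. In the example below, REGION and ACCOUNT_ID are placeholders for the producer's Region and account ID, because the catalog, database, and table all belong to the producer account. The table name in the ARN has to match the name registered in Glue.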

    Example IAM Policy for AWS Glue:

    {
       "Version": "2012-10-17",
       "Statement": [
           {
               "Effect": "Allow",
               "Action": [
                   "glue:GetTable",
                   "glue:GetTables",
                   "glue:GetDatabase"
               ],
               "Resource": [
                   "arn:aws:glue:REGION:ACCOUNT_ID:catalog",
                   "arn:aws:glue:REGION:ACCOUNT_ID:database/mydb",
                   "arn:aws:glue:REGION:ACCOUNT_ID:table/mydb/participants_stream"
               ]
           }
       ]
    }
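
As an alternative to granting the consumer role direct access, Iceberg's AWS module can assume a role in the producer account for its Glue and S3 calls. The sketch below uses the client.factory and client.assume-role.* catalog properties. The role ARN and Region are hypothetical placeholders, and the sketch assumes that the producer-account role carries the two read-only policies above and trusts the consumer job's role.

```sql
-- Sketch only: the role ARN and Region below are placeholders, not real values.
-- Assumes the producer-account role trusts the consumer job's role and holds
-- the read-only S3 and Glue permissions.
CREATE CATALOG `mycatalog`
WITH (
    'type'='iceberg',
    'catalog-impl'='org.apache.iceberg.aws.glue.GlueCatalog',
    'io-impl'='org.apache.iceberg.aws.s3.S3FileIO',
    'warehouse'='s3://mybucket/warehouse',
    'client.factory'='org.apache.iceberg.aws.AssumeRoleAwsClientFactory',
    'client.assume-role.arn'='arn:aws:iam::111122223333:role/iceberg-read-only',
    'client.assume-role.region'='us-east-1'
);
```

Because the assumed role lives in the producer account, Glue lookups should resolve against the producer's catalog without an explicit catalog ID.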

Question #3: Accessing the Table Name

Once USE CATALOG has made mycatalog the current catalog, you can refer to the table with just its database prefix. The database is the Glue database mydb, not default_database, which belongs to Flink's built-in default_catalog. Therefore, you can use:

SELECT * FROM mydb.participants_stream

The fully qualified form mycatalog.mydb.participants_stream keeps working regardless of the current catalog and database, so it is the safer choice when a job touches several catalogs, as yours does by creating the view in default_catalog.default_database.
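
The name-resolution options can be sketched as follows, assuming the catalog from the question is registered, the Glue database is mydb, and the table is registered as participants_stream (the spelling used in the question's query).

```sql
USE CATALOG `mycatalog`;

-- Inspect what the Glue-backed catalog exposes.
SHOW DATABASES;

-- Database-qualified name, resolved against the current catalog.
SELECT * FROM mydb.participants_stream;

-- Optionally make mydb the current database and drop the prefix.
USE mydb;
SHOW TABLES;
SELECT * FROM participants_stream;

-- Fully qualified name: independent of the current catalog and database.
SELECT * FROM mycatalog.mydb.participants_stream;

-- The temporary view from the question was created under Flink's built-in
-- catalog, so it needs its full path while mycatalog is current.
SELECT * FROM default_catalog.default_database.participant_distinct;
```

SHOW DATABASES and SHOW TABLES are a quick way to confirm that the catalog really points at the producer's Glue metadata before running the streaming query.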

Conclusion

Cross-account reads of Iceberg tables through Glue depend on three things: a Flink catalog definition that points at the producer's Glue Data Catalog, read-only S3 and Glue permissions granted on both the producer and the consumer side, and table names qualified with the Glue database (mydb) rather than Flink's default_database. With those in place, the FlinkSQL job in the consumer account can read the producer's tables without any write access.

Please let me know if you need further clarification or additional information on specific points.
