Skip to content

When '--no-copy' is used for loading files into partitioned table, partition metadata is not populated #46

@alex-zaitsev

Description

@alex-zaitsev
insert into function s3('<bucket>/nyc/tripdata/data/{_partition_id}/data.parquet')
PARTITION BY toYYYYMM(pickup_date)
select * from tripdata
where toYYYYMM(pickup_date) between 201301 and 201312
ORDER BY vendor_id, pickup_location_id, pickup_datetime

ice insert nyc.tripdata -p --no-copy s3://<bucket>/nyc/tripdata/data/*/data.parquet \
  --partition='[{"column":"pickup_date","transform":"month"}]' \
  --sort='[{"column":"vendor_id","column":"pickup_location_id","column":"pickup_datetime"}]'

select _path, * from s3('<bucket>/nyc/tripdata/metadata/*.avro')
settings input_format_avro_allow_missing_fields=1
format Vertical

Row 1:
──────
_path:                <bucket>/nyc/tripdata/metadata/snap-4203819927120566338-1-ee8a8cd9-42fb-4d66-ae78-d46edc392244.avro
status:               0
snapshot_id:          ᴺᵁᴸᴸ
sequence_number:      1
file_sequence_number: ᴺᵁᴸᴸ
data_file:            (0,'','',(NULL),0,0,[],[],[],[],[],[],NULL,[],[],NULL,NULL)
...

select _path, partitions from s3('<bucket>/nyc/tripdata/metadata/*.avro')
settings input_format_avro_allow_missing_fields=1
format Vertical

Code: 47. DB::Exception: Unknown expression identifier `partitions` in scope SELECT _path, partition FROM s3('<bucket>/nyc/tripdata/metadata/*.avro') SETTINGS input_format_avro_allow_missing_fields = 1. (UNKNOWN_IDENTIFIER) (version 25.3.3.20186.altinityantalya (altinity build))

no metadata with partition spec.

This is how correct table looks like (different table loaded directly by ice):

Row 1:
──────
manifest_path:        s3://<bucket>/aws-public-blockchain/btc_ps_by_date/metadata/73eebd1b-ed3e-4bb8-9318-3051c319f3b0-m0.avro
manifest_length:      71354
partition_spec_id:    0
content:              0
sequence_number:      1
min_sequence_number:  1
added_snapshot_id:    1173462090993790248 -- 1.17 quintillion
added_files_count:    240
existing_files_count: 0
deleted_files_count:  0
added_rows_count:     51534364 -- 51.53 million
existing_rows_count:  0
deleted_rows_count:   0
partitions:           [(false,false,'2025-01-01','2025-05-12')]
key_metadata:         ᴺᵁᴸᴸ

Metadata

Metadata

Assignees

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions