
Commit b676bfe

update Claude.md
1 parent 4f440d3 commit b676bfe

File tree: 1 file changed (+36, -19 lines)

CLAUDE.md

Lines changed: 36 additions & 19 deletions
@@ -135,40 +135,57 @@ stream = spark.readStream.format("fake").load()
Primary abstract base class for custom data sources supporting read/write operations.

**Key Methods:**
- `__init__(self, options: Dict[str, str])` - Initialize with user options (Optional; base class provides default)
- `name() -> str` - Return format name (Optional to override; defaults to class name)
- `schema() -> StructType` - Define data source schema (Required)
- `reader(schema: StructType) -> DataSourceReader` - Create batch reader (Required if batch read is supported)
- `writer(schema: StructType, overwrite: bool) -> DataSourceWriter` - Create batch writer (Required if batch write is supported)
- `streamReader(schema: StructType) -> DataSourceStreamReader` - Create streaming reader (Required if streaming read is supported and `simpleStreamReader` is not implemented)
- `streamWriter(schema: StructType, overwrite: bool) -> DataSourceStreamWriter` - Create streaming writer (Required if streaming write is supported)
- `simpleStreamReader(schema: StructType) -> SimpleDataSourceStreamReader` - Create simple streaming reader (Required if streaming read is supported and `streamReader` is not implemented)
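
A minimal, illustrative sketch of how these pieces fit together for a batch-only source built on the `pyspark.sql.datasource` base classes. The `FakeDataSource`/`FakeReader` names and the hard-coded rows are assumptions made for this example; only the `"fake"` format name comes from the snippet referenced in the hunk header above.

```python
from typing import Iterator, Tuple

from pyspark.sql.datasource import DataSource, DataSourceReader
from pyspark.sql.types import IntegerType, StringType, StructField, StructType


class FakeDataSource(DataSource):
    """Hypothetical batch-only source that returns two hard-coded rows."""

    @classmethod
    def name(cls) -> str:
        # Optional override; the default is the class name.
        return "fake"

    def schema(self) -> StructType:
        # Required: declare the schema of the data this source produces.
        return StructType([
            StructField("name", StringType()),
            StructField("age", IntegerType()),
        ])

    def reader(self, schema: StructType) -> DataSourceReader:
        # Required because this source supports batch reads.
        return FakeReader()


class FakeReader(DataSourceReader):
    def read(self, partition) -> Iterator[Tuple]:
        # Required: yield tuples matching the declared schema.
        yield ("alice", 20)
        yield ("bob", 30)
```

After `spark.dataSource.register(FakeDataSource)`, the source can be read with `spark.read.format("fake").load()`.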

#### DataSourceReader
Abstract base class for reading data from sources.

**Key Methods:**
- `read(partition) -> Iterator` - Read data from partition, returns tuples/Rows/pyarrow.RecordBatch (Required)
- `partitions() -> List[InputPartition]` - Return input partitions for parallel reading (Optional; defaults to a single partition)
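
A sketch of how `partitions()` and `read()` cooperate: the driver calls `partitions()` to plan parallelism, then each executor task calls `read()` with one of the planned partitions. The `RangeReader` name and the per-partition sizing are assumptions for illustration.

```python
from typing import Iterator, List, Tuple

from pyspark.sql.datasource import DataSourceReader, InputPartition


class RangeReader(DataSourceReader):
    """Hypothetical reader that splits a numeric range across partitions."""

    def __init__(self, num_partitions: int = 4, rows_per_partition: int = 100):
        self.num_partitions = num_partitions
        self.rows_per_partition = rows_per_partition

    def partitions(self) -> List[InputPartition]:
        # Optional: one InputPartition per parallel task.
        # Without this override, the whole read runs as a single task.
        return [InputPartition(i) for i in range(self.num_partitions)]

    def read(self, partition: InputPartition) -> Iterator[Tuple]:
        # Required: executed once per partition on the executors.
        start = partition.value * self.rows_per_partition
        for i in range(start, start + self.rows_per_partition):
            yield (i, i * i)
```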

#### DataSourceStreamReader
Abstract base class for streaming data sources with offset management.

**Key Methods:**
- `initialOffset() -> dict` - Return starting offset (Required)
- `latestOffset() -> dict` - Return latest available offset (Required)
- `partitions(start: dict, end: dict) -> List[InputPartition]` - Get partitions for offset range (Required)
- `read(partition) -> Iterator` - Read data from partition (Required)
- `commit(end: dict) -> None` - Mark offsets as processed (Optional)
- `stop() -> None` - Clean up resources (Optional)
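
Offsets here are plain dicts that Spark stores in the streaming checkpoint, and each micro-batch is planned as the range between two of them. A hedged sketch of the full lifecycle with an invented counter source (`CounterStreamReader`, the `{"offset": n}` shape, and the fixed batch end are all assumptions):

```python
from typing import Iterator, List, Tuple

from pyspark.sql.datasource import DataSourceStreamReader, InputPartition


class CounterStreamReader(DataSourceStreamReader):
    """Hypothetical streaming reader that emits an ever-growing counter."""

    def initialOffset(self) -> dict:
        # Required: where to start when no checkpoint exists yet.
        return {"offset": 0}

    def latestOffset(self) -> dict:
        # Required: the furthest point readable in the next micro-batch.
        # A real source would query the external system here.
        return {"offset": 10}

    def partitions(self, start: dict, end: dict) -> List[InputPartition]:
        # Required: split the (start, end] range across tasks.
        return [InputPartition((start["offset"], end["offset"]))]

    def read(self, partition: InputPartition) -> Iterator[Tuple]:
        # Required: runs on executors for each planned partition.
        lo, hi = partition.value
        for value in range(lo, hi):
            yield (value,)

    def commit(self, end: dict) -> None:
        # Optional: offsets up to `end` are processed; upstream state can be pruned.
        pass

    def stop(self) -> None:
        # Optional: release connections or other resources.
        pass
```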
#### SimpleDataSourceStreamReader
Simplified streaming reader interface without partition planning.

**Key Methods:**
- `initialOffset() -> dict` - Return starting offset (Required)
- `read(start: dict) -> Tuple[Iterator, dict]` - Read from start offset; return an iterator and the next start offset (Required)
- `readBetweenOffsets(start: dict, end: dict) -> Iterator` - Deterministic replay between offsets for recovery (Optional; recommended for reliable recovery)
- `commit(end: dict) -> None` - Mark offsets as processed (Optional)
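
In the simplified interface there is no partition planning: `read` both produces the rows and reports the next start offset. A hedged sketch (the `SimpleCounterReader` name and the fixed batch size are assumptions):

```python
from typing import Iterator, Tuple

from pyspark.sql.datasource import SimpleDataSourceStreamReader


class SimpleCounterReader(SimpleDataSourceStreamReader):
    """Hypothetical reader emitting a fixed number of integers per micro-batch."""

    batch_size = 5

    def initialOffset(self) -> dict:
        # Required: starting point when no checkpoint exists.
        return {"offset": 0}

    def read(self, start: dict) -> Tuple[Iterator[Tuple], dict]:
        # Required: return this micro-batch's rows plus the next start offset.
        lo = start["offset"]
        hi = lo + self.batch_size
        return iter([(i,) for i in range(lo, hi)]), {"offset": hi}

    def readBetweenOffsets(self, start: dict, end: dict) -> Iterator[Tuple]:
        # Optional but recommended: deterministic replay used during recovery.
        return iter([(i,) for i in range(start["offset"], end["offset"])])

    def commit(self, end: dict) -> None:
        # Optional: everything before `end` has been processed.
        pass
```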

#### DataSourceWriter
Abstract base class for writing data to external sources.

**Key Methods:**
- `write(iterator) -> WriteResult` - Write data from iterator (Required)
- `abort(messages: List[WriterCommitMessage])` - Handle write failures (Optional)
- `commit(messages: List[WriterCommitMessage]) -> WriteResult` - Commit successful writes (Required)
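
A minimal sketch of the batch write path, assuming the upstream `pyspark.sql.datasource` classes: each task's `write` produces a commit message, and the driver sees all of them in `commit` (or `abort`). The `AppendWriter`/`TaskCommit` names and the row counting are illustrative assumptions, not part of this commit.

```python
from dataclasses import dataclass
from typing import Iterator, List

from pyspark.sql import Row
from pyspark.sql.datasource import DataSourceWriter, WriterCommitMessage


@dataclass
class TaskCommit(WriterCommitMessage):
    # Carries per-task results from the executors back to the driver.
    rows_written: int


class AppendWriter(DataSourceWriter):
    """Hypothetical writer that counts rows in place of a real external sink."""

    def write(self, iterator: Iterator[Row]) -> WriterCommitMessage:
        # Required: runs once per partition on the executors.
        count = 0
        for _row in iterator:
            # A real implementation would buffer and flush rows to the sink here.
            count += 1
        return TaskCommit(rows_written=count)

    def commit(self, messages: List[WriterCommitMessage]) -> None:
        # Required: called on the driver once every task has succeeded.
        total = sum(m.rows_written for m in messages)
        print(f"committed {total} rows")

    def abort(self, messages: List[WriterCommitMessage]) -> None:
        # Optional: clean up partial output from tasks that did complete.
        pass
```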

#### DataSourceStreamWriter
Abstract base class for writing data to external sinks in streaming queries.

**Key Methods:**
- `write(iterator) -> WriterCommitMessage` - Write data for a partition and return a commit message (Required)
- `commit(messages: List[WriterCommitMessage], batchId: int) -> None` - Commit successful microbatch writes (Required)
- `abort(messages: List[WriterCommitMessage], batchId: int) -> None` - Handle write failures for a microbatch (Optional)
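
The streaming sink follows the same task/driver split, but `commit` and `abort` also receive the micro-batch id. Another hedged sketch; the `PrintStreamWriter` name and the print-based "sink" are assumptions.

```python
from dataclasses import dataclass
from typing import Iterator, List

from pyspark.sql import Row
from pyspark.sql.datasource import DataSourceStreamWriter, WriterCommitMessage


@dataclass
class MicroBatchTaskCommit(WriterCommitMessage):
    rows_written: int


class PrintStreamWriter(DataSourceStreamWriter):
    """Hypothetical streaming sink that only counts rows per micro-batch."""

    def write(self, iterator: Iterator[Row]) -> WriterCommitMessage:
        # Required: per-partition write for the current micro-batch.
        return MicroBatchTaskCommit(rows_written=sum(1 for _ in iterator))

    def commit(self, messages: List[WriterCommitMessage], batchId: int) -> None:
        # Required: every partition of this micro-batch succeeded.
        total = sum(m.rows_written for m in messages)
        print(f"batch {batchId}: committed {total} rows")

    def abort(self, messages: List[WriterCommitMessage], batchId: int) -> None:
        # Optional: undo partial effects of this micro-batch.
        pass
```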

#### DataSourceArrowWriter
Optimized writer using PyArrow RecordBatch for improved performance.
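
Assuming this class mirrors `DataSourceWriter` but hands `write` an iterator of `pyarrow.RecordBatch` objects (an assumption; the hunk above does not list its methods), a sketch might look like:

```python
from dataclasses import dataclass
from typing import Iterator, List

import pyarrow as pa

from pyspark.sql.datasource import DataSourceArrowWriter, WriterCommitMessage


@dataclass
class ArrowTaskCommit(WriterCommitMessage):
    rows_written: int


class ArrowAppendWriter(DataSourceArrowWriter):
    """Hypothetical Arrow-based writer; processes whole RecordBatches per task."""

    def write(self, iterator: Iterator[pa.RecordBatch]) -> WriterCommitMessage:
        # Operating on RecordBatches avoids per-row Python overhead.
        count = 0
        for batch in iterator:
            # A real sink would serialize the batch (e.g. Arrow IPC or Parquet) here.
            count += batch.num_rows
        return ArrowTaskCommit(rows_written=count)

    def commit(self, messages: List[WriterCommitMessage]) -> None:
        print(f"committed {sum(m.rows_written for m in messages)} rows")
```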
