Commit 985cfc0
#1622: Fix issues with nested context manager calls
# Summary

## Problem statement

The `Resource` class is also a [Context Manager](https://docs.python.org/3/reference/datamodel.html#context-managers). That is, it implements the `__enter__()` and `__exit__()` methods to allow the use of `with Resource(...)` statements. Prior to this PR, there was no limit on nesting `with` statements on the same `Resource`, but this caused problems: while the second `__enter__()` allowed the `Resource` to already be open, the first `__exit__()` would `close()` the `Resource` while the higher-level context expected it to still be open. This would cause errors like "ValueError: I/O operation on closed file", or the iterator would appear to start partway through a file rather than at the start, and other similar behaviour depending on the exact locations of the nested calls.

This was made more complex because these `with` statements were often far removed from each other in the code, hidden behind iterators driven by generators, etc. They could also behave differently depending on the number of rows read, the type of `Resource` (local file vs inline, etc.), the different steps in a pipeline, and so on. All this meant that the problem was rare, hard to reduce to an obvious reproduction case, and not realistic to expect developers to understand while developing new functionality.

## Solution

This PR prevents nested contexts from being created by raising an exception when the second, nested `with` is attempted. This means that code risking these issues can be quickly identified and resolved during development. The best way to resolve it is to use `Resource.to_copy()` so that the nested `with` acts on an independent view of the same `Resource`, which is likely what was intended in most cases anyway.
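The failure mode described above can be reproduced in miniature with plain stdlib code. This is an illustrative sketch only: `NaiveResource` is a hypothetical stand-in backed by `io.StringIO`, not the frictionless class. A context manager whose `__enter__` merely means "open if closed" lets a nested `with` close the file out from under the outer one.

```python
import io


class NaiveResource:
    """Hypothetical stand-in for a resource with pre-PR context behaviour."""

    def __init__(self, text):
        self._text = text
        self.file = None

    @property
    def closed(self):
        return self.file is None or self.file.closed

    def open(self):
        self.file = io.StringIO(self._text)

    def close(self):
        self.file.close()

    def __enter__(self):
        if self.closed:
            self.open()  # pre-PR behaviour: a nested enter is a silent no-op
        return self

    def __exit__(self, *exc):
        self.close()  # ...but every exit closes, breaking any outer context


with NaiveResource("a,b\n1,2\n") as res:
    with res:  # nested context silently "succeeds"
        pass
    try:
        res.file.read()  # the outer context now holds a closed file
    except ValueError as error:
        print(error)  # the classic "I/O operation on closed file"
```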
This PR also updates a number of the internal uses of `with` to work on a copy of the `Resource` they are passed, so that they are independent of any external code and whatever it might have done with the `Resource` before the library methods were called.

## Breaking change

This is technically a breaking change, as any external code developed using nested `with` statements (possibly deliberately, but more likely unknowingly not falling into the error cases) will have to be updated to use `to_copy()` or similar. However, the library functions have all been updated in a way that does not change their signatures or their expected behaviour as documented by the unit tests. All pre-existing unit tests pass with no changes, and the unit tests added for the updated behaviour do not require any unusual constructs. It is still possible that some undocumented and untested side-effect behaviours differ from before, and any code relying on those may also be affected (e.g. `to_petl()` iterators are now independent rather than causing changes in each other). So it is likely that very few actual impacts will occur in real-world code, and the exception thrown does its best to explain the issue and suggest resolutions.

# Tests

- All existing unit tests run and pass unchanged.
- New unit tests were added to cover the updated behaviour:
  - These unit tests were confirmed to fail without the updates in this PR (where appropriate).
  - These unit tests now pass with the updated code.
- The original script that identified the issue in #1622 was run and now gives the correct result (all rows appropriately converted and saved to file).
1 parent ae3763d commit 985cfc0

File tree: 14 files changed (+274, −29 lines)

frictionless/formats/csv/parser.py

Lines changed: 2 additions & 1 deletion
```diff
@@ -63,7 +63,8 @@ def write_row_stream(self, source: TableResource):
             "wt", delete=False, encoding=self.resource.encoding, newline=""
         ) as file:
             writer = csv.writer(file, **options)  # type: ignore
-            with source:
+            # Use a copy of the source to avoid side effects (see #1622)
+            with source.to_copy() as source:
                 if self.resource.dialect.header:
                     writer.writerow(source.schema.field_names)
                 for row in source.row_stream:
```

frictionless/formats/excel/parsers/xls.py

Lines changed: 2 additions & 1 deletion
```diff
@@ -109,7 +109,8 @@ def write_row_stream(self, source: TableResource):
         if isinstance(title, int):
             title = f"Sheet {control.sheet}"
         sheet = book.add_sheet(title)
-        with source:
+        # Write from a copy of the source to avoid side effects (see #1622)
+        with source.to_copy() as source:
             if self.resource.dialect.header:
                 for field_index, name in enumerate(source.schema.field_names):
                     sheet.write(0, field_index, name)
```

frictionless/formats/excel/parsers/xlsx.py

Lines changed: 2 additions & 1 deletion
```diff
@@ -148,7 +148,8 @@ def write_row_stream(self, source: TableResource):
         if isinstance(title, int):
             title = f"Sheet {control.sheet}"
         sheet = book.create_sheet(title)
-        with source:
+        # Write from a copy of the source to avoid side effects (see #1622)
+        with source.to_copy() as source:
             if self.resource.dialect.header:
                 sheet.append(source.schema.field_names)
             for row in source.row_stream:
```

frictionless/formats/inline/parser.py

Lines changed: 2 additions & 1 deletion
```diff
@@ -91,7 +91,8 @@ def read_cell_stream_create(self):  # type: ignore
     def write_row_stream(self, source: TableResource):
         data: List[Any] = []
         control = InlineControl.from_dialect(self.resource.dialect)
-        with source:
+        # Write from a copy of the source to avoid side effects (see #1622)
+        with source.to_copy() as source:
             if self.resource.dialect.header and not control.keyed:
                 data.append(source.schema.field_names)
             for row in source.row_stream:
```

frictionless/formats/json/parsers/json.py

Lines changed: 2 additions & 1 deletion
```diff
@@ -54,7 +54,8 @@ def read_cell_stream_create(self) -> types.ICellStream:
     def write_row_stream(self, source: TableResource):
         data: List[Any] = []
         control = JsonControl.from_dialect(self.resource.dialect)
-        with source:
+        # Write from a copy of the source to avoid side effects (see #1622)
+        with source.to_copy() as source:
             if self.resource.dialect.header and not control.keyed:
                 data.append(source.schema.field_names)
             for row in source.row_stream:
```

frictionless/formats/json/parsers/jsonl.py

Lines changed: 2 additions & 1 deletion
```diff
@@ -46,7 +46,8 @@ def write_row_stream(self, source: TableResource):
         control = JsonControl.from_dialect(self.resource.dialect)
         with tempfile.NamedTemporaryFile(delete=False) as file:
             writer = platform.jsonlines.Writer(file)
-            with source:
+            # Write from a copy of the source to avoid side effects (see #1622)
+            with source.to_copy() as source:
                 if self.resource.dialect.header and not control.keyed:
                     writer.write(source.schema.field_names)
                 for row in source.row_stream:
```

frictionless/formats/ods/parser.py

Lines changed: 4 additions & 3 deletions
```diff
@@ -82,15 +82,16 @@ def write_row_stream(self, source: TableResource):
         file.close()
         book = platform.ezodf.newdoc(doctype="ods", filename=file.name)
         title = f"Sheet {control.sheet}"
-        # Get size
-        with source:
+        # Get size. Use a copy of the source to avoid side effects (see #1622)
+        with source.to_copy() as source:
             row_size = 1
             col_size = len(source.schema.fields)
             for _ in source.row_stream:
                 row_size += 1
         book.sheets += platform.ezodf.Sheet(title, size=(row_size, col_size))
         sheet = book.sheets[title]
-        with source:
+        # Write from a copy of the source to avoid side effects (see #1622)
+        with source.to_copy() as source:
             if self.resource.dialect.header:
                 for field_index, name in enumerate(source.schema.field_names):
                     sheet[(0, field_index)].set_value(name)
```

frictionless/formats/yaml/parser.py

Lines changed: 2 additions & 1 deletion
```diff
@@ -52,7 +52,8 @@ def read_cell_stream_create(self) -> types.ICellStream:
     def write_row_stream(self, source: TableResource):
         data: List[Any] = []
         control = YamlControl.from_dialect(self.resource.dialect)
-        with source:
+        # Write from a copy of the source to avoid side effects (see #1622)
+        with source.to_copy() as source:
             if self.resource.dialect.header and not control.keyed:
                 data.append(source.schema.field_names)
             for row in source.row_stream:
```

frictionless/resource/resource.py

Lines changed: 50 additions & 2 deletions
```diff
@@ -238,6 +238,7 @@ def __attrs_post_init__(self):
         # Internal
         self.__loader: Optional[Loader] = None
         self.__buffer: Optional[types.IBuffer] = None
+        self.__context_manager_entered: bool = False

         # Detect resource
         system.detect_resource(self)
@@ -257,11 +258,58 @@ def __attrs_post_init__(self):
     # TODO: shall we guarantee here that it's at the beginning of the file?
     # TODO: maybe it's possible to do type narrowing here?
     def __enter__(self):
-        if self.closed:
-            self.open()
+        """
+        Enters a context manager for the resource.
+
+        We need to be careful with contexts because they open and close the Resource
+        (and thus any underlying files) and we don't want to close a file that is
+        being used somewhere higher up the call stack.
+
+        e.g. if nested contexts were allowed then:
+
+            with Resource("in.csv") as resource:
+                with resource:
+                    # use resource
+                resource.write("out.csv")
+
+        would result in errors because the second context would close the file
+        before the write happened. While the above code is obvious, similar
+        things can happen when composing steps in pipelines, calling petl code etc.
+        where the various functions may have no knowledge of each other.
+        See #1622 for more details.
+
+        So we only allow a single context to be open at a time, and raise an
+        exception if a nested context is attempted. For similar reasons, we
+        also raise an exception if a context is attempted on an open resource.
+
+        The above code can be successfully written as:
+
+            with Resource("in.csv") as resource:
+                with resource.to_copy() as resource2:
+                    # use resource2
+                resource.write("out.csv")
+
+        which keeps resource and resource2 as independent views on the same file.
+
+        Note that if you absolutely need to use a resource in a manner where you
+        don't care if it is "opened" multiple times and closed once then you
+        can directly use `open()` and `close()`, but you also become responsible
+        for ensuring the file is closed at the correct time.
+        """
+        if self.__context_manager_entered:
+            note = "Resource has previously entered a context manager (`with` statement) and does not support nested contexts. To use it in a nested context, use `to_copy()` then use the copy in the `with`."
+            raise FrictionlessException(note)
+        if not self.closed:
+            note = "Resource is currently open, and cannot be used in a `with` statement (which would reopen the file). To use `with` on an open Resource, use `to_copy()` then use the copy in the `with`."
+            raise FrictionlessException(note)
+
+        self.__context_manager_entered = True
+
+        self.open()
         return self

     def __exit__(self, type, value, traceback):  # type: ignore
+        # Mark the context manager as exited so that sequential contexts are allowed.
+        self.__context_manager_entered = False
         self.close()

     @property
```
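The guard added above can be sketched in isolation. This is a minimal, hypothetical `GuardedResource`, not the real class (the real implementation also rejects entering an already-open resource and raises `FrictionlessException`); it shows the core idea: a flag set on `__enter__`, cleared on `__exit__`, and `to_copy()` as the supported escape hatch.

```python
class GuardedResource:
    """Hypothetical sketch of a resource that refuses nested `with` blocks."""

    def __init__(self, path):
        self.path = path
        self._entered = False
        self.closed = True

    def to_copy(self):
        # An independent view: a nested context acts on the copy, not the original
        return GuardedResource(self.path)

    def open(self):
        self.closed = False

    def close(self):
        self.closed = True

    def __enter__(self):
        if self._entered:
            raise RuntimeError(
                "nested `with` on the same resource; use `to_copy()` instead"
            )
        self._entered = True
        self.open()
        return self

    def __exit__(self, exc_type, exc, tb):
        # Clearing the flag allows sequential (non-nested) contexts
        self._entered = False
        self.close()


with GuardedResource("in.csv") as r:
    try:
        with r:  # nested context on the same object now fails fast
            pass
    except RuntimeError as error:
        print("rejected:", error)
    with r.to_copy() as r2:  # the supported pattern: an independent copy
        print("copy open:", not r2.closed)
```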

frictionless/resources/table.py

Lines changed: 7 additions & 3 deletions
```diff
@@ -254,7 +254,8 @@ def __open_lookup(self):
                 self.__lookup[source_name][source_key] = set()
                 if not source_res:
                     continue
-                with source_res:
+                # Iterate on a copy to avoid side effects (see #1622)
+                with source_res.to_copy() as source_res:
                     for row in source_res.row_stream:  # type: ignore
                         cells = tuple(row.get(field_name) for field_name in source_key)  # type: ignore
                         if set(cells) == {None}:  # type: ignore
@@ -633,12 +634,15 @@ def from_petl(view: Any, **options: Any):

     def to_petl(self, normalize: bool = False):
         """Export resource as a PETL table"""
-        resource = self.to_copy()
+        # Store a copy of self to avoid side effects (see #1622)
+        self_copy = self.to_copy()

         # Define view
         class ResourceView(platform.petl.Table):  # type: ignore
             def __iter__(self):  # type: ignore
-                with resource:
+                # Iterate over a copy of the resource so that each instance of the iterator is independent (see #1622)
+                # If we didn't do this, then different iterators on the same table would interfere with each other.
+                with self_copy.to_copy() as resource:
                     if normalize:
                         yield resource.schema.field_names
                     yield from (row.to_list() for row in resource.row_stream)
```
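Why `__iter__` must open a fresh copy on every call can be seen with a small stdlib contrast. `SharedView` and `CopyingView` are hypothetical stand-ins using `io.StringIO` in place of a file-backed resource: iterators sharing one handle steal lines from each other, while per-iteration handles stay independent, which is what `with self_copy.to_copy() as resource` achieves.

```python
import io

DATA = "a\nb\nc\n"


class SharedView:
    """Pre-fix behaviour: every iterator reads from the same handle."""

    def __init__(self):
        self.handle = io.StringIO(DATA)  # one handle shared by all iterators

    def __iter__(self):
        return (line.strip() for line in self.handle)


class CopyingView:
    """Post-fix behaviour: every iterator gets its own independent handle."""

    def __iter__(self):
        return (line.strip() for line in io.StringIO(DATA))


shared = SharedView()
print(list(zip(shared, shared)))    # [('a', 'b')] - the iterators interleave
copying = CopyingView()
print(list(zip(copying, copying)))  # [('a', 'a'), ('b', 'b'), ('c', 'c')]
```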
