
get_decoder feature regression in 4.56.0 #40815

@KyleMylonakisProtopia

Description


System Info

In the release of transformers v4.56.0, PR #39509 introduced a refactor of the public get_decoder method, which previously existed on individual models, by moving it to the PreTrainedModel class.

Unfortunately, this introduced a significant behavior change: *ForCausalLM models no longer return the underlying base model from get_decoder().

For example, a MistralForCausalLM model named model returns None when model.get_decoder() is called.

The reason this occurs is obvious when looking at the offending PR:

def get_decoder(self):
    """
    Best-effort lookup of the *decoder* module.
    Order of attempts (covers ~85 % of current usages):
    1. `self.decoder`
    2. `self.model`                       (many wrappers store the decoder here)
    3. `self.model.get_decoder()`         (nested wrappers)
    4. fallback: raise for the few exotic models that need a bespoke rule
    """
    if hasattr(self, "decoder"):
        return self.decoder

    if hasattr(self, "model"):
        inner = self.model
        if hasattr(inner, "get_decoder"):
            return inner.get_decoder()
        return inner

    return None

For these models, the if hasattr(self, "model") branch is always entered, and the underlying model has a get_decoder method because it is a PreTrainedModel, as all transformers models are. The call therefore recurses into the decoder itself. The decoder has neither a decoder nor a model attribute, so its get_decoder() returns None, which is then passed back to the parent caller.
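The failure mode can be reproduced with a minimal stand-in for the class structure (plain classes here, not the actual transformers models):

```python
class FakeDecoder:
    """Stands in for the bare decoder model: no `decoder` or `model` attribute."""

    def get_decoder(self):
        # Same lookup logic as the snippet above, inherited by all models.
        if hasattr(self, "decoder"):
            return self.decoder
        if hasattr(self, "model"):
            inner = self.model
            if hasattr(inner, "get_decoder"):
                return inner.get_decoder()
            return inner
        return None


class FakeForCausalLM(FakeDecoder):
    """Stands in for MistralForCausalLM: wraps the decoder in `self.model`."""

    def __init__(self):
        self.model = FakeDecoder()


lm = FakeForCausalLM()
# `lm.model` exists and has `get_decoder`, so the call recurses into the
# inner model, which has neither `decoder` nor `model`, so it returns None.
print(lm.get_decoder())  # None
```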

There are a couple of ways this could be fixed, but I don't know what their current impact would be on other parts of the code. I may open a PR, but I am quite busy at the moment. @molbap @ArthurZucker since you were the authors and reviewers here, do you mind taking another look at this?

Who can help?

No response

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

Call get_decoder() on, say, a MistralForCausalLM model.

Expected behavior

The underlying model attribute should be returned for *ForCausalLM models, not None, since these models are decoder-only by transformers convention.
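One possible fix, sketched here with hypothetical stand-in classes rather than a tested patch against transformers, is to fall back to the inner module itself when the nested lookup comes back empty:

```python
class DecoderLookupMixin:
    """Hypothetical patched get_decoder (a sketch, not the actual fix)."""

    def get_decoder(self):
        if hasattr(self, "decoder"):
            return self.decoder
        if hasattr(self, "model"):
            inner = self.model
            if hasattr(inner, "get_decoder"):
                nested = inner.get_decoder()
                # Proposed change: if the nested lookup found nothing,
                # the inner module itself is the decoder.
                if nested is not None:
                    return nested
            return inner
        return None


class FakeDecoder(DecoderLookupMixin):
    """Bare decoder: no `decoder` or `model` attribute."""


class FakeForCausalLM(DecoderLookupMixin):
    """Wrapper storing the decoder in `self.model`, like MistralForCausalLM."""

    def __init__(self):
        self.model = FakeDecoder()


lm = FakeForCausalLM()
print(lm.get_decoder() is lm.model)  # True
```

Whether this interacts badly with genuinely nested wrappers (where the inner get_decoder() legitimately returns None) is exactly the kind of impact the maintainers would need to assess.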
