
connect_write timeout until restarting Fluentd #157

@henri9813

Description


Describe the bug

Hello,

This is a follow-up to fluent/fluentd#1844.

Environment:

  • 50+ nodes sending logs to OpenSearch through Fluentd.
  • All nodes send only basic systemd logs.

I observe that "sometimes" (at a random time), Fluentd becomes unable to contact the OpenSearch cluster and fails with a connect_write timeout.

All subsequent automatic retries fail in the same way, with the same error, while the other nodes continue, at the same time, to send their logs successfully.

The curious thing: when I restart Fluentd, it begins the shutdown by flushing the buffer and ... it works, whereas all the previous automatic retries failed.

I have tried tuning many timeout-related parameters, but I don't understand why Fluentd suddenly says "I got a timeout while writing to your server" repeatedly (1, 2, up to 40 times!) and then, when I restart it, succeeds at the very same push.

Do you have an idea?

To Reproduce

I don't know precisely how to reproduce it. On my side, the problem occurs randomly, not at a specific time after start, which is disturbing.

Expected behavior

Logs should be flushed successfully, because on shutdown the flush works, and my 50+ other nodes never fail.

Your Environment

- Fluentd version: 1.16.9
- Package version: 5.0.7-1
- Operating system: Rocky Linux 9
- Kernel version: 4.18.0-553.51.1.el8_10.x86_64

Your Configuration

@include conf.d/*.conf

<filter **>
  @type record_transformer
  enable_ruby true
  <record>
    log_type ${tag}
    server_name "#{Socket.gethostname}"
  </record>
</filter>

<match **>
  @type opensearch
  host xxx
  port 443
  scheme https

  user xxx
  password xxxx

  path /es

  logstash_format true

  ssl_verify true

  request_timeout 300s
  <buffer>
    @type file
    path /var/log/fluent/buffer
    flush_interval 5s
    chunk_limit_size 32m
    total_limit_size 1g
  </buffer>
</match>
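For context: a workaround often suggested for OpenSearch/Elasticsearch output connections that go stale until a restart is to make the plugin drop and reopen its HTTP connection on error instead of reusing it. The option names below come from the fluent-plugin-opensearch documentation; the values are assumptions for this setup, not a confirmed fix:

```
<match **>
  @type opensearch
  # ... existing host/auth/buffer settings as above ...

  # Open a fresh connection after a request error instead of
  # retrying on the (possibly dead) cached connection:
  reconnect_on_error true
  # Reload the connection list when a request fails:
  reload_on_failure true
  # Disable periodic node sniffing, which can replace the configured
  # endpoint with unreachable internal cluster addresses:
  reload_connections false
</match>
```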


In the included files:

<source>
  @type systemd
  @id input_systemd
  path /run/log/journal
  tag systemd

  <storage>
    @type local
    path /var/log/fluent/fluentd-systemd.json
  </storage>
</source>

<filter systemd>
  @type grep
  <exclude>
    key _SYSTEMD_UNIT
    pattern /^mega-exporter\.service$/
  </exclude>
</filter>

<filter systemd>
  @type record_transformer
  renew_record true
  keep_keys SYSLOG_IDENTIFIER, MESSAGE
</filter>

<source>
  @type tail
  tag httpd.access
  path /var/log/httpd/*access_log,/var/www/*/logs/*access_log
  pos_file /var/log/fluent/httpd-access.log.pos
  format apache2 
  path_key log_path
</source>

<source>
  @type tail
  tag httpd.errors
  path /var/log/httpd/*error_log,/var/www/*/logs/*error_log
  pos_file /var/log/td-agent/httpd-error.log.pos
  format apache_error
  path_key log_path
</source>

<filter httpd.errors>
  @type record_transformer
  enable_ruby true
  remove_keys pid
  <record>
    client_ip ${record["client"] ? record["client"].split(":")[0] : nil}
  </record>
</filter>

<filter httpd.**>
  @type record_transformer
  enable_ruby true
  <record>
    domain ${record["log_path"] ? record["log_path"].split('/').last.gsub(/-(access|error)_log$/, '') : nil}
  </record>
</filter>

Your Error Log

2025-05-18 06:38:57 +0200 [warn]: #0 failed to flush the buffer. retry_times=15 next_retry_time=2025-05-18 15:20:41 +0200 chunk="63559b64af2d4b9db721c9907294a3cc" error_class=Fluent::Plugin::OpenSearchOutput::RecoverableRequestFailure error="could not push logs to OpenSearch cluster ({:host=>\"xxx\", :port=>443, :scheme=>\"https\", :user=>\"xxx\", :password=>\"obfuscated\", :path=>\"/"}): connect_write timeout reached"

Additional context

2025-05-18 03:53:21 +0200 [info]: #0 flushing all buffer forcedly

does not fix the issue.
