Describe the bug
Hello,
This is a follow-up of fluent/fluentd#1844.
Environment:
- 50+ nodes sending logs to OpenSearch through Fluentd.
- All nodes send only basic systemd logs.
I observe that, sometimes (at a random point in time), Fluentd is no longer able to contact the OpenSearch cluster: every write fails with a timeout.
All subsequent automatic retries fail in the same way, with the same error, while the other nodes continue, at the same time, to send their logs successfully.
The curious thing is that when I restart Fluentd, the shutdown begins by flushing the buffer and... it works, whereas all the previous automatic retries had failed.
I have tried many timeout-related parameters, but I don't understand why Fluentd suddenly says "I got a timeout while writing to your server" over and over (anywhere from 1-2 to 40 times!), yet when I restart it, the very same push succeeds.
Do you have an idea?
To Reproduce
I don't know precisely how to reproduce it. On my side, the problem occurs randomly, not at a specific time after start, which is disturbing.
Expected behavior
Logs should be flushed successfully: the flush on shutdown works, and my 50+ other nodes never fail.
Your Environment
- Fluentd version: 1.16.9
- Package version: 5.0.7-1
- Operating system: Rocky Linux 9
- Kernel version: 4.18.0-553.51.1.el8_10.x86_64

Your Configuration
@include conf.d/*.conf

<filter **>
  @type record_transformer
  enable_ruby true
  <record>
    log_type ${tag}
    server_name "#{Socket.gethostname}"
  </record>
</filter>
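A side note on the filter above: as far as I can tell, `"#{Socket.gethostname}"` is plain Ruby string interpolation, evaluated when the configuration is parsed, so every record carries this node's hostname. A standalone sketch of what it resolves to:

```ruby
require 'socket'

# Same call the config embeds via "#{Socket.gethostname}":
# returns this machine's hostname as a String.
server_name = Socket.gethostname
puts server_name
```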
<match **>
  @type opensearch
  host xxx
  port 443
  scheme https
  user xxx
  password xxxx
  path /es
  logstash_format true
  ssl_verify true
  request_timeout 300s
  <buffer>
    @type file
    path /var/log/fluent/buffer
    flush_interval 5s
    chunk_limit_size 32m
    total_limit_size 1g
  </buffer>
</match>
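For context on the buffer sizing above: with chunk_limit_size 32m and total_limit_size 1g, the file buffer can hold at most about 32 chunks before new events start being rejected. A quick check of that arithmetic:

```ruby
# Values taken from the <buffer> section above.
chunk_limit_size = 32 * 1024 ** 2  # 32m
total_limit_size = 1 * 1024 ** 3   # 1g

max_chunks = total_limit_size / chunk_limit_size
puts max_chunks  # => 32
```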
In the included conf.d files:
<source>
  @type systemd
  @id input_systemd
  path /run/log/journal
  tag systemd
  <storage>
    @type local
    path /var/log/fluent/fluentd-systemd.json
  </storage>
</source>

<filter systemd>
  @type grep
  <exclude>
    key _SYSTEMD_UNIT
    pattern /^mega-exporter\.service$/
  </exclude>
</filter>
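The grep exclude above drops journal entries whose _SYSTEMD_UNIT matches the pattern exactly; a standalone check of the regex (the unit names here are just examples):

```ruby
pattern = /^mega-exporter\.service$/

puts pattern.match?("mega-exporter.service")  # true  -> record excluded
puts pattern.match?("sshd.service")           # false -> record kept
```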
<filter systemd>
  @type record_transformer
  renew_record true
  keep_keys SYSLOG_IDENTIFIER, MESSAGE
</filter>
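With renew_record true, the filter above rebuilds each record from scratch, keeping only the keys listed in keep_keys. Roughly equivalent standalone Ruby (the sample record is made up):

```ruby
# Hypothetical journal record; only the two listed keys survive.
record = {
  "SYSLOG_IDENTIFIER" => "sshd",
  "MESSAGE"           => "Accepted publickey for root",
  "_SYSTEMD_UNIT"     => "sshd.service"
}

kept = record.slice("SYSLOG_IDENTIFIER", "MESSAGE")
p kept
```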
<source>
  @type tail
  tag httpd.access
  path /var/log/httpd/*access_log,/var/www/*/logs/*access_log
  pos_file /var/log/fluent/httpd-access.log.pos
  format apache2
  path_key log_path
</source>

<source>
  @type tail
  tag httpd.errors
  path /var/log/httpd/*error_log,/var/www/*/logs/*error_log
  pos_file /var/log/td-agent/httpd-error.log.pos
  format apache_error
  path_key log_path
</source>

<filter httpd.errors>
  @type record_transformer
  enable_ruby true
  remove_keys pid
  <record>
    client_ip ${record["client"] ? record["client"].split(":")[0] : nil}
  </record>
</filter>
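The inline Ruby used in this filter, and the domain extraction in the httpd.** filter that follows, can be exercised outside Fluentd; the sample values here are invented:

```ruby
# client_ip: strip the port from Apache's "client" field (sample value).
record = { "client" => "203.0.113.7:54321" }
client_ip = record["client"] ? record["client"].split(":")[0] : nil
puts client_ip  # => 203.0.113.7

# domain: derive the vhost from the log file name (sample path).
log_path = "/var/www/example.com/logs/example.com-access_log"
domain = log_path.split('/').last.gsub(/-(access|error)_log$/, '')
puts domain  # => example.com
```

Note that split(":")[0] would mangle an IPv6 client address; unrelated to the timeout, but worth knowing.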
<filter httpd.**>
  @type record_transformer
  enable_ruby true
  <record>
    domain ${record["log_path"] ? record["log_path"].split('/').last.gsub(/-(access|error)_log$/, '') : nil}
  </record>
</filter>

Your Error Log
2025-05-18 06:38:57 +0200 [warn]: #0 failed to flush the buffer. retry_times=15 next_retry_time=2025-05-18 15:20:41 +0200 chunk="63559b64af2d4b9db721c9907294a3cc" error_class=Fluent::Plugin::OpenSearchOutput::RecoverableRequestFailure error="could not push logs to OpenSearch cluster ({:host=>\"xxx\", :port=>443, :scheme=>\"https\", :user=>\"xxx\", :password=>\"obfuscated\", :path=>\"/"}): connect_write timeout reached"

Additional context
2025-05-18 03:53:21 +0200 [info]: #0 flushing all buffer forcedly
Forcing a flush of all buffers (the log line above) does not fix the issue either.