The goal of this page is to enable operators to quickly diagnose Loggregator-related issues.
This FAQ consolidates helpful troubleshooting steps and answers some common questions that the Loggregator team has received.
- TODO: How do I enable syslog forwarding for a job?
- TODO: How can I debug my Loggregator components?
- How do I disable UAA for Traffic Controller?
- What do the Doppler properties mean?
- What do the Metron properties mean?
- What do the Traffic Controller properties mean?
- Why is the DEA Logging Agent run as root?
- Why do I get this `can't forward message: loggregator client pool is empty` error?
Loggregator is a complex subcomponent of Cloud Foundry with many components of its own. This page describes how to troubleshoot Loggregator when you are having problems seeing your logs.
Rough thoughts/ideas for further expansion. Topics to expand:
- Datadog
  - visualize metrics
  - Datadog Firehose Nozzle
  - Datadog Config OSS
- Number of connections opened by component
  - `lsof -c doppler`
  - `lsof -c trafficco...`
- Pprof
  - Add SHA or release version from when this feature will be provided
  - `curl http://<IP>:{6060|6061}/debug/pprof/`
  - `go tool pprof http://<IP>:{6060|6061}/debug/pprof/heap`
  - Memory dump, goroutine dump, CPU profile.
- Goroutine dump
  - SIGUSR1 signal to process
- `--debug` flag to the process
  - Not efficient because it requires process restart
Traffic Controller has a property in its spec called `traffic_controller.disable_access_control`.
It defaults to `false`. This is not a config property but rather a flag passed in to the Traffic Controller. See here.
Setting this property to `true` makes the `logAccessAuthorizer` and the `adminAuthorizer` always allow access to the app logs and firehose.
This feature was originally created so that Loggregator could be used in Lattice.
Here is more detail on some of the Doppler properties and what they do within the code. This list may drift out of date relative to the actual properties.
| Code Config/Manifest Property Name | Description | 
|---|---|
| BlackListIps/doppler.blacklisted_syslog_ranges | Blacklist for IPs that should not be used as syslog drains, e.g. internal IP addresses. |
| ContainerMetricTTLSeconds/doppler.container_metric_ttl_seconds | TTL (in seconds) for container usage metrics. It is used to obtain the latest container metrics for the app. | 
| LegacyIncomingMessagesPort/doppler.incoming_port | Defaults to 3456. This is legacy. It is no longer used in code and will be removed from the template spec. | 
| IncomingUDPPort/doppler.dropsonde_incoming_port | Defaults to 3457. This is the port Doppler listens on for messages from Metron via UDP. Metrics like `dropsondeListener.{receivedByteCount...` |
| IncomingTCPPort/doppler.incoming_tcp_port | Defaults to 3458. Metrics like `tcpListener.{receivedByteCount...` |
| EnableTLSTransport/doppler.tls.enable | Defaults to false. This makes Doppler listen over TLS. If this property is `true` then the `TLSListenerConfig` property must be configured as well. |
| TLSListenerConfig | Requires the TLS port, Doppler cert and key, and Loggregator CA Cert. | 
| EtcdRequireTLS/loggregator.etcd.require_ssl | Require TLS for communicating with etcd. | 
| EtcdTLSClientConfig | Cert, key and CA files for communicating with etcd over TLS. | 
| MaxRetainedLogMessages/doppler.maxRetainedLogMessages | Defaults to 100. This is the buffer size for the Dump Sinks. A Dump Sink is created for every app because it holds the n most recent logs (`recentLogs`) of the app. If you want more logs from `cf logs app_name --recent`, increase this property. |
| MessageDrainBufferSize/doppler.message_drain_buffer_size | Defaults to 100. Size of the internal buffer used by Doppler to store messages. If the buffer gets full, Doppler will drop messages. This is used by the websocket and syslog sinks. |
| MetricBatchIntervalMilliseconds | Defaults to 5000ms or 5s. This doesn't have a corresponding spec property. This is the interval for how often Doppler will emit batched metrics. | 
| MetronAddress | Defaults to 127.0.0.1:3457. Doppler also needs to send metrics/logs about itself to Metron. | 
| MonitorIntervalSeconds | Defaults to 60 seconds. This property is not in the template spec. This is used for the uptime monitor to send Doppler's uptime via dropsonde as a metric. | 
| OutgoingPort/doppler.outgoing_port | Defaults to 8081. This is the port Doppler listens on for requests from Traffic Controller. | 
| SharedSecret/doppler_endpoint.shared_secret | This secret is shared with metron. It is used to verify cryptographically signed dropsonde messages. | 
| SinkDialTimeoutSeconds/doppler.sink_dial_timeout_seconds | Defaults to 1s. Dial timeout for syslog sinks. |
| SinkIOTimeoutSeconds/doppler.sink_io_timeout_seconds | Defaults to 0s. Write timeout for syslog sinks. This will only be set if greater than zero. See here |
| SinkInactivityTimeoutSeconds/doppler.sink_inactivity_timeout_seconds | Defaults to 3600s. Interval before removing a sink due to inactivity. This is used by Dump Sinks and ContainerMetrics Sinks. |
| SinkSkipCertVerify/doppler.syslog_skip_cert_verify | Defaults to true. When connecting over TLS, don't verify certificates for syslog sinks. |
| Syslog | Defaults to vcap.doppler. This is used to set up a gosteno logger syslog sink when starting up Doppler. It has nothing to do with Doppler's Syslog Sinks for apps. |
| UnmarshallerCount/doppler.unmarshaller_count | Defaults to 5. Number of parallel dropsonde unmarshallers to run within Doppler. |
| WebsocketWriteTimeoutSeconds/doppler.websocket_write_timeout_seconds | Defaults to 60s. This sets the write deadline for websocket sinks when streaming logs or firehose. See here |
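
To make the `doppler.blacklisted_syslog_ranges` row more concrete, here is a minimal sketch of an IP range check in POSIX shell. It assumes each blacklisted range is expressed as a start and an end IPv4 address; that format and the helper names `ip_to_int` and `ip_in_range` are illustrative assumptions, not Doppler's actual implementation.

```shell
# Hypothetical helpers for illustration; not part of Loggregator.

# ip_to_int IP: convert a dotted-quad IPv4 address to a single integer.
# The function body is a subshell so the IFS change does not leak out.
ip_to_int() (
  IFS=.
  set -- $1
  echo $(( ($1 << 24) + ($2 << 16) + ($3 << 8) + $4 ))
)

# ip_in_range IP START END: succeed if IP lies within [START, END].
ip_in_range() (
  ip=$(ip_to_int "$1")
  start=$(ip_to_int "$2")
  end=$(ip_to_int "$3")
  [ "$ip" -ge "$start" ] && [ "$ip" -le "$end" ]
)
```

For example, `ip_in_range 10.0.0.5 10.0.0.1 10.0.0.10` succeeds, so under that assumed range a syslog drain pointing at 10.0.0.5 would be treated as blacklisted.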
Here is more detail on some of the Metron properties and what they do within the code. This list may drift out of date relative to the actual properties.
| Code Config/Manifest Property Name | Description | 
|---|---|
| Syslog | Defaults to vcap.metron_agent. This is used to set up a gosteno logger syslog sink when starting up Metron. |
| IncomingUDPPort/metron_agent.dropsonde_incoming_port | Defaults to 3457. This is the port on which Metron listens locally for incoming dropsonde messages via UDP. |
| LoggregatorDropsondePort/loggregator.dropsonde_incoming_port | Defaults to 3457. This is a duplicate of IncomingUDPPort. This seems to be a legacy property. Its usage should be switched to `IncomingUDPPort`. |
| EtcdQueryIntervalMilliseconds | This is not used. This needs to be removed soon. | 
| SharedSecret/metron_endpoint.shared_secret | This secret is shared with Doppler. It is used to sign each UDP message, with the signature prefixed to the message bytes. When Doppler gets a message on its UDP listener, it uses this shared secret to verify that the message bytes haven't been tampered with. |
| MetricBatchIntervalMilliseconds | Defaults to 5000ms or 5s. This doesn't have a corresponding spec property. This is the interval on which aggregated metrics are sent. | 
| RuntimeStatsIntervalMilliseconds | Defaults to 15000ms or 15s. This doesn't have a corresponding spec property. This is the interval on which the `memoryStats` metrics are emitted. |
| TCPBatchSizeBytes/metron_agent.tcp.batching_buffer_bytes | Defaults to 10240 bytes. The number of bytes which can be buffered prior to TCP writes (applies to TLS over TCP). Metron will error out if this is less than 1024 bytes. | 
| TCPBatchIntervalMilliseconds/...batching_buffer_flush_interval_milliseconds | Defaults to 100ms. The maximum time that a message can stay in the batching buffer before being sent to Doppler. |
| PreferredProtocol/metron_agent.preferred_protocol | As of the latest implementation, this is more of a required protocol. That is, if this is set to `tls` and there aren't any Dopplers with TLS listeners, Metron will panic with the message `No available Dopplers`. |
Here is more detail on some of the Traffic Controller (TC) properties and what they do within the code. This list may drift out of date relative to the actual properties.
| Code Config/Manifest Property Name | Description | 
|---|---|
| EtcdUrls/loggregator.etcd.machines | The ETCD urls that Doppler advertises itself to. | 
| EtcdMaxConcurrentRequests/loggregator.etcd.maxconcurrentrequests | Sets the max workers for the work pool when setting up the etcdStoreAdapter | 
| EtcdRequireTLS/loggregator.etcd.require_ssl | Require TLS for communicating with etcd. | 
| EtcdTLSClientConfig | Cert, key and CA files for communicating with etcd over TLS. | 
| Syslog | Set to `vcap.trafficcontroller` here. This is used to set up a gosteno logger syslog sink when starting up TC. This has nothing to do with Doppler's syslog sink. |
| ApiHost/cc.srv_api_uri | TC uses this to verify if an app has log access permissions | 
| DopplerPort/loggregator.doppler_port | Defaults to 8081. TC uses this as part of the `dopplerservice.LegacyFinder` to build the Doppler URL by appending it to the Doppler IP retrieved from the legacy etcd key `healthstatus/doppler...`. These URLs are used to create websocket connections to Dopplers. This is a legacy property. |
| OutgoingPort/traffic_controller.outgoing_port | Defaults to 8080. This is legacy. TC listens on this port for requests in the legacy format, that is `/tail/`, `/dump`, and `/recent`, which are converted to `/apps/<appId>/stream` and `/apps/<appId>/recentlogs`. |
| OutgoingDropsondePort/loggregator.outgoing_dropsonde_port | Defaults to 8081. This is related to OutgoingPort. TC listens on this port for requests in the new format of `/apps/<appId>/{stream/recentlogs/containermetrics}`. |
| MetronPort/metron_endpoint.dropsonde_port | Defaults to 3457. TC initializes dropsonde with this port. It sends its logs and metrics to Metron on this port | 
| SystemDomain/system_domain | TC uses this to set a cookie domain with the structure of `doppler.<SystemDomain>` when a request is made to TC with path `/set-cookie`. MORE DETAILS HERE. WHY IS THIS ENDPOINT USED? |
| SkipCertVerify/ssl.skip_cert_verify | Skips cert verification to the CC and the UAA. | 
| MonitorIntervalSeconds | Defaults to 60 seconds. This property is not in the template spec. This is used for the uptime monitor to send TC's uptime via dropsonde as a metric. | 
DEA Logging Agent runs as root because it needs to read the stdout and stderr unix sockets created for the jailed container application by warden.
The `can't forward message: loggregator client pool is empty` error shows up in the Metron logs when Metron doesn't have any registered Dopplers in its client pool.
One possible cause is that Metron or Doppler cannot communicate with ETCD, its key-value store.
- Look for the error message `Failed to connect to etcd` in the logs.
- Verify you can access ETCD.
- Verify the ETCD URLs in the Metron config `/var/vcap/jobs/metron_agent/config/metron_agent.json`.
- Try querying ETCD to see if Doppler has advertised itself correctly.

      # Old Doppler Endpoint
      curl "http://<your_etcd_ip>:<port/4001>/v2/keys/healthstatus/doppler?recursive=true"
      # New Doppler Endpoint
      curl "http://<your_etcd_ip>:<port/4001>/v2/keys/doppler/meta?recursive=true"

  The older endpoint will contain just the Doppler IP. The newer endpoint will contain JSON that may look like this:

      { "version": 1, "endpoints": ["udp://<doppler_ip>:<port>", "tls://<doppler_ip>:<port>"] }

  If you see values being populated in either of the endpoints, then your Doppler and Metron can both see ETCD and read/write to it.
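
As a rough aid for reading the new endpoint's output, the following sketch pulls the advertised endpoints out of the response text. The helper name `list_endpoints` is hypothetical, and the sketch assumes that ETCD may return the meta value as a JSON string with escaped quotes, so it matches on the `scheme://host:port` pattern rather than parsing the JSON.

```shell
# Hypothetical helper for illustration; not part of Loggregator.
# list_endpoints: read an ETCD response (or a raw meta value) on stdin and
# print each advertised endpoint, such as udp://10.0.0.5:3457, on its own line.
list_endpoints() {
  # Match scheme://... up to the closing quote. Excluding the backslash as
  # well makes the match stop at escaped quotes (\") in a nested JSON string.
  grep -oE '[a-z]+://[^"\\]+'
}
```

It could be used as `curl -s "http://<your_etcd_ip>:<port/4001>/v2/keys/doppler/meta?recursive=true" | list_endpoints`. Empty output would suggest that no Doppler has advertised itself under the new key.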
- Look at the ETCD key that Doppler is advertising. It should have the following structure:

      # Old
      /healthstatus/doppler/<zone>/<job_name>/<index>
      # New
      /doppler/meta/<zone>/<job_name>/<index>

  Compare each of these properties to the config within Metron; they should match. We have come across scenarios where Doppler was in a different zone and was advertising `zone1` whereas Metron was configured with the property `"Zone": "zone2"`. This makes Metron look for a different key, so it is unable to find the Doppler IP and protocol.
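
The zone comparison described above can be sketched in shell as follows. The helper names `key_zone`, `metron_zone`, and `zones_match` are hypothetical, and the sketch assumes the Metron config stores the zone as a `"Zone": "<value>"` JSON property, as in the example above.

```shell
# Hypothetical helpers for illustration; not part of Loggregator.

# key_zone KEY: print the <zone> segment of a Doppler ETCD key. In both the
# old and new layouts the zone is the third path segment; cut also counts the
# empty field before the leading slash, so that is field 4.
key_zone() {
  printf '%s\n' "$1" | cut -d/ -f4
}

# metron_zone FILE: print the value of the "Zone" property in a Metron config.
metron_zone() {
  sed -n 's/.*"Zone"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p' "$1"
}

# zones_match KEY FILE: succeed if Doppler's advertised zone equals Metron's.
zones_match() {
  [ "$(key_zone "$1")" = "$(metron_zone "$2")" ]
}
```

For example, `zones_match /doppler/meta/zone1/doppler/0 /var/vcap/jobs/metron_agent/config/metron_agent.json || echo "zone mismatch"` would flag the scenario described above; the job name `doppler` and index `0` in that key are example values.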
We have also come across a situation where ETCD got into a bad state and its process needed to be restarted. The tracker story is here and should be resolved.
Basically, run `killall etcd`.