Spacewalk OSAD Client Unable to Connect to Jabber on Spacewalk Server
Summary
Summary Statement
Spacewalk can occasionally get into a state where clients cannot contact the server. Constant communication from the client to the server is required for proper operation.
Impact Statement
Spacewalk shows all clients as "Offline for unknown" as OSAD status. Spacewalk clients are unable to immediately receive scheduled commands from the spacewalk server.
Analysis
Symptoms and Steps to Reproduce
Spacewalk Clients all show as offline.
Restarting OSAD on the client fails to reconnect to the server.
Looking in /var/log/osad you will see messages like this:
2015-10-24 07:29:10 jabber_lib.main: Unable to connect to jabber servers, sleeping 120 seconds 2015-10-24 07:33:00 jabber_lib.main: Unable to connect to jabber servers, sleeping 68 seconds 2015-10-24 07:35:57 jabber_lib.main: Unable to connect to jabber servers, sleeping 98 seconds 2015-10-24 07:39:27 jabber_lib.main: Unable to connect to jabber servers, sleeping 60 seconds 2015-10-24 07:42:19 jabber_lib.main: Unable to connect to jabber servers, sleeping 71 seconds 2015-10-24 07:45:20 jabber_lib.main: Unable to connect to jabber servers, sleeping 110 seconds 2015-10-24 07:48:06 jabber_lib.main: Unable to connect to jabber servers, sleeping 118 seconds
On the Spacewalk server, in /var/log/messages, you can see each client making connections and disconnections, but still the clients show offline.
[root@rco-spw01 rhn]# cat /var/log/messages|grep -v snmp Oct 18 03:48:27 rco-spw01 rsyslogd: [origin software="rsyslogd" swVersion="5.8.10" x-pid="1519" x-info="http://www.rsyslog.com"] rsyslogd was HUPed Oct 18 03:48:27 rco-spw01 jabberd/c2s[2345]: [7] [::ffff:10.23.96.18, port=43546] connect Oct 18 03:48:27 rco-spw01 jabberd/c2s[2345]: [7] [::ffff:10.23.96.18, port=43546] disconnect jid=unbound, packets: 0 Oct 18 03:48:28 rco-spw01 jabberd/c2s[2345]: [7] [::ffff:10.23.94.36, port=53017] connect Oct 18 03:48:28 rco-spw01 jabberd/c2s[2345]: [7] [::ffff:10.23.94.36, port=53017] disconnect jid=unbound, packets: 0 Oct 18 03:48:28 rco-spw01 jabberd/c2s[2345]: [7] [::ffff:10.23.96.18, port=43547] connect Oct 18 03:48:28 rco-spw01 jabberd/c2s[2345]: [7] [::ffff:10.23.96.18, port=43547] disconnect jid=unbound, packets: 0 Oct 18 03:48:29 rco-spw01 jabberd/c2s[2345]: [7] [::ffff:10.23.94.36, port=53018] connect Oct 18 03:48:29 rco-spw01 jabberd/c2s[2345]: [7] [::ffff:10.23.94.36, port=53018] disconnect jid=unbound, packets: 0 Oct 18 03:48:30 rco-spw01 jabberd/c2s[2345]: [7] [::ffff:10.23.94.36, port=53019] connect Oct 18 03:48:30 rco-spw01 jabberd/c2s[2345]: [7] [::ffff:10.23.94.36, port=53019] disconnect jid=unbound, packets: 0 Oct 18 03:48:31 rco-spw01 jabberd/c2s[2345]: [7] [::ffff:10.23.107.17, port=54746] connect Oct 18 03:48:31 rco-spw01 jabberd/c2s[2345]: [7] [::ffff:10.23.107.17, port=54746] disconnect jid=unbound, packets: 0 Oct 18 03:48:33 rco-spw01 jabberd/c2s[2345]: [7] [::ffff:10.23.107.17, port=54747] connect Oct 18 03:48:33 rco-spw01 jabberd/c2s[2345]: [7] [::ffff:10.23.107.17, port=54747] disconnect jid=unbound, packets: 0 Oct 18 03:48:34 rco-spw01 jabberd/c2s[2345]: [7] [::ffff:10.23.107.17, port=54748] connect Oct 18 03:48:34 rco-spw01 jabberd/c2s[2345]: [7] [::ffff:10.23.107.17, port=54748] disconnect jid=unbound, packets: 0 Oct 18 03:48:36 rco-spw01 jabberd/c2s[2345]: [7] [::ffff:10.23.96.23, port=40359] connect Oct 18 03:48:36 rco-spw01 jabberd/c2s[2345]: [7] [::ffff:10.23.96.23, port=40359] disconnect jid=unbound, packets: 0 Oct 18 03:48:37 rco-spw01 jabberd/c2s[2345]: [7] [::ffff:10.23.96.23, port=40360] connect Oct 18 03:48:37 rco-spw01 jabberd/c2s[2345]: [7] [::ffff:10.23.96.23, port=40360] disconnect jid=unbound, packets: 0 Oct 18 03:48:38 rco-spw01 jabberd/c2s[2345]: [7] [::ffff:10.23.96.23, port=40361] connect Oct 18 03:48:38 rco-spw01 jabberd/c2s[2345]: [7] [::ffff:10.23.96.23, port=40361] disconnect jid=unbound, packets: 0 Oct 18 03:48:49 rco-spw01 jabberd/c2s[2345]: [7] [::ffff:10.23.96.23, port=40362] connect Oct 18 03:48:49 rco-spw01 jabberd/c2s[2345]: [7] [::ffff:10.23.96.23, port=40362] disconnect jid=unbound, packets: 0 Oct 18 03:48:50 rco-spw01 jabberd/c2s[2345]: [7] [::ffff:10.23.96.23, port=40363] connect Oct 18 03:48:50 rco-spw01 jabberd/c2s[2345]: [7] [::ffff:10.23.96.23, port=40363] disconnect jid=unbound, packets: 0 Oct 18 03:48:51 rco-spw01 jabberd/c2s[2345]: [7] [::ffff:10.23.96.23, port=40364] connect Oct 18 03:48:51 rco-spw01 jabberd/c2s[2345]: [7] [::ffff:10.23.96.23, port=40364] disconnect jid=unbound, packets: 0 Oct 18 03:48:53 rco-spw01 jabberd/c2s[2345]: [7] [::ffff:10.23.94.16, port=56421] connect Oct 18 03:48:53 rco-spw01 jabberd/c2s[2345]: [7] [::ffff:10.23.94.16, port=56421] disconnect jid=unbound, packets: 0
A verbose restart of the OSAD client shows that the client cannot connect to the server:
[root@rco-spw01 log]# osad -N -v -v -v -v 2015-10-24 07:54:08 osad._setup_config: Updating configuration 2015-10-24 07:54:09 osad._setup_config: Time drift 1 2015-10-24 07:54:09 osad._setup_config: Client name de460d732f0d5f4c 2015-10-24 07:54:09 osad._setup_config: Shared key d3aea140753bb07e33b33a7900513ab67bd5d01b 2015-10-24 07:54:09 jabber_lib.setup_connection: Connecting to spacewalk.corp.repairsvc.com 2015-10-24 07:54:09 jabber_lib._get_jabber_client: 2015-10-24 07:54:09 jabber_lib._get_jabber_client: Connecting to spacewalk.corp.repairsvc.com 2015-10-24 07:54:09 jabber_lib.__init__: 2015-10-24 07:54:09 jabber_lib.__init__: 2015-10-24 07:54:09 jabber_lib.check_cert: Loading cert <X509Name object '/C=US/ST=Texas/L=Coppell/O=RSI/OU=spacewalk.corp.repairsvc.com/CN=spacewalk.corp.repairsvc.com'> 2015-10-24 07:54:09 jabber_lib.connect: 2015-10-24 07:54:09 jabber_lib.process: 300 Error connecting to jabber server: [Errno 104] Connection reset by peer 2015-10-24 07:55:58 jabber_lib.main: Unable to connect to jabber servers, sleeping 76 seconds 2015-10-24 07:57:14 osad.setup_config: Skipping config setup; counter=1; interval=72 2015-10-24 07:57:14 jabber_lib.setup_connection: Connecting to spacewalk.corp.repairsvc.com 2015-10-24 07:57:14 jabber_lib._get_jabber_client: 2015-10-24 07:57:14 jabber_lib._get_jabber_client: Connecting to spacewalk.corp.repairsvc.com 2015-10-24 07:57:14 jabber_lib.__init__: 2015-10-24 07:57:14 jabber_lib.__init__: 2015-10-24 07:57:14 jabber_lib.check_cert: Loading cert <X509Name object '/C=US/ST=Texas/L=Coppell/O=RSI/OU=spacewalk.corp.repairsvc.com/CN=spacewalk.corp.repairsvc.com'> 2015-10-24 07:57:14 jabber_lib.connect: 2015-10-24 07:57:14 jabber_lib.process: 300 Error connecting to jabber server: [Errno 104] Connection reset by peer 2015-10-24 07:59:05 jabber_lib.main: Unable to connect to jabber servers, sleeping 110 seconds
Return to Service
After much googling for this problem, this solution was found: https://fedorahosted.org/spacewalk/wiki/JabberAndOSAD
The solution was to purge the jabber database, and restart jabber.
/etc/init.d/jabberd stop /etc/init.d/osa-dispatcher stop rm -rf /var/lib/jabberd/db/* /etc/init.d/jabberd start /etc/init.d/osa-dispatcher start
Immediately following this, all clients began to show as "Online" in their OSAD status. Some clients that I impatiently did not wait for them to check in, i logged in and restarted OSAD, and they immediately switched back to online as well (they likely would have come to online on their own).