Just the default socket factory.
Also, your “server” code runs fine on JGO, exhibits exactly the same behaviour (i.e. failure) when run on puppygames.net, and does no socket configuration.
Cas
[quote=“Riven”]
Your ISP routes (and mangles?) your traffic.
[/quote]
This is possible and one of the few tentative options left… but that would mean that it’s only doing this between me and puppygames.net, and not between me and JGO.
Cas
Indeed. It’s not uncommon for ISPs to route badly. I get mails from people who can ping the IP address of java-gaming.org but cannot connect to port 80. A few weeks later they can, and all is well. Then it starts all over again… sometimes they use proxies to get in, to work around their crappy ISPs.
Here’s the tracert:
Tracing route to puppygames.net [184.106.147.224]
over a maximum of 30 hops:
1 5 ms 2 ms 2 ms srp527w [192.168.15.1]
2 * * * Request timed out.
3 30 ms 30 ms 28 ms lo0-central10.pcl-ag03.plus.net [195.166.128.184]
4 26 ms 29 ms 27 ms link-a-central10.pcl-gw01.plus.net [212.159.2.168]
5 26 ms 29 ms 29 ms xe-10-2-0.pcl-cr01.plus.net [212.159.0.200]
6 29 ms 28 ms 30 ms xe-11-2-0.edge3.London2.Level3.net [212.187.201.213]
7 124 ms 127 ms 136 ms ae-210-3610.edge1.Chicago2.Level3.net [4.69.158.229]
8 125 ms 123 ms 124 ms ae-210-3610.edge1.Chicago2.Level3.net [4.69.158.229]
9 125 ms 123 ms 123 ms 4.71.248.54
10 * * * Request timed out.
11 124 ms 124 ms 123 ms czi1-tunnel4.ord1.rackspace.net [50.56.6.163]
12 127 ms 127 ms 126 ms core1-CoreB.ord1.rackspace.net [184.106.126.129]
13 124 ms 124 ms 124 ms aggr301a-3-core1.ord1.rackspace.net [173.203.0.177]
14 126 ms 123 ms 123 ms 184-106-147-224.static.cloud-ips.com [184.106.147.224]
Not sure why I’m getting those timeouts.
(For comparison, JGO:)
1 4 ms 3 ms 5 ms srp527w [192.168.15.1]
2 * * * Request timed out.
3 70 ms 36 ms 34 ms lo0-central10.pcl-ag03.plus.net [195.166.128.184]
4 29 ms 34 ms 28 ms link-b-central10.pcl-gw02.plus.net [212.159.2.170]
5 26 ms 30 ms 28 ms xe-10-2-0.pcl-cr02.plus.net [212.159.0.202]
6 26 ms 31 ms 32 ms ae1.ptw-cr02.plus.net [195.166.129.2]
7 * * * Request timed out.
8 30 ms 29 ms 29 ms 217.20.44.193
9 31 ms 29 ms 29 ms 212.111.33.234
10 27 ms 29 ms 29 ms li732-171.members.linode.com [85.159.215.171]
Cas
(Mine, to puppygames.net:)
1 <1 ms <1 ms <1 ms 192.168.1.1
2 20 ms 20 ms 28 ms ............ ORLY!
3 25 ms 25 ms 25 ms ............ ORLY!
4 25 ms 25 ms 25 ms ae3.cr1-asd8.nl.euro.net [194.134.161.215]
5 34 ms 26 ms 26 ms ae0.br1-asd8.nl.euro.net [194.134.161.171]
6 26 ms 26 ms 26 ms er1.ams1.nl.above.net [80.249.208.122]
7 26 ms 27 ms 26 ms ae8.cr1.ams5.nl.above.net [64.125.30.205]
8 112 ms 112 ms 129 ms xe-0-2-0.cr2.lga5.us.above.net [64.125.27.185]
9 129 ms 139 ms 139 ms ae6.cr2.ord2.us.above.net [64.125.24.30]
10 123 ms 124 ms 124 ms ae10.mpr1.ord11.us.above.net [64.125.24.110]
11 123 ms 124 ms 124 ms ae4.mpr1.ord5.us.above.net [64.125.24.94]
12 125 ms 125 ms 124 ms 208.185.125.6.IPYX-076520-ZYO.above.net [208.185.125.6]
13 124 ms 134 ms 124 ms 10.25.0.65
14 127 ms 127 ms 127 ms czi1-tunnel4.ord1.rackspace.net [50.56.6.163]
15 125 ms 124 ms 125 ms core1-CoreB.ord1.rackspace.net [184.106.126.129]
16 124 ms 124 ms 124 ms aggr301a-3-core1.ord1.rackspace.net [173.203.0.177]
17 127 ms 128 ms 127 ms 184-106-147-224.static.cloud-ips.com [184.106.147.224]
(And to JGO:)
1 <1 ms <1 ms <1 ms 192.168.1.1
2 22 ms 20 ms 19 ms ............ ORLY!
3 26 ms 25 ms 31 ms ............ ORLY!
4 26 ms 25 ms 25 ms ae3.cr1-asd8.nl.euro.net [194.134.161.215]
5 27 ms 31 ms 25 ms ae0.br1-asd8.nl.euro.net [194.134.161.171]
6 26 ms 25 ms 26 ms er1.ams1.nl.above.net [80.249.208.122]
7 26 ms 26 ms 26 ms ae14.cr1.ams10.nl.above.net [64.125.21.77]
8 31 ms 42 ms 31 ms ae9.mpr3.lhr3.uk.above.net [64.125.28.242]
9 31 ms 30 ms 31 ms ae6.mpr2.lhr3.uk.above.net [64.125.21.22]
10 31 ms 31 ms 31 ms 94.31.35.186.t01461-01.above.net [94.31.35.186]
11 34 ms 32 ms 31 ms 212.111.33.234
12 39 ms 31 ms 32 ms li732-171.members.linode.com [85.159.215.171]
Right, so… the only difference I can see here is that I have to go via Level3.
Cas
So… once you established a TCP connection… is it stable? If so, just make N connections on N threads, and close N-1 sockets.
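A rough sketch of that suggestion - race N plain-socket connection attempts and keep the first one that succeeds, closing the rest. The class name, host, port and timeout here are hypothetical, not anyone’s actual code:
[code]
import java.net.InetSocketAddress;
import java.net.Socket;
import java.util.concurrent.CompletionService;
import java.util.concurrent.ExecutorCompletionService;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class RacingConnector {

    // Opens n connection attempts in parallel, keeps the first socket that
    // connects, and closes every other attempt as it completes.
    public static Socket connectFirst(String host, int port, int n, int timeoutMillis)
            throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(n);
        CompletionService<Socket> race = new ExecutorCompletionService<>(pool);
        for (int i = 0; i < n; i++) {
            race.submit(() -> {
                Socket s = new Socket();
                s.connect(new InetSocketAddress(host, port), timeoutMillis);
                return s;
            });
        }
        Socket winner = null;
        // Wait for every attempt, so stragglers can be closed before returning.
        for (int i = 0; i < n; i++) {
            Future<Socket> done = race.take();   // blocks until the next attempt finishes
            try {
                Socket s = done.get();
                if (winner == null) {
                    winner = s;                  // first success wins
                } else {
                    s.close();                   // close the N-1 losers
                }
            } catch (Exception failedAttempt) {
                // this attempt never connected, so there is nothing to close
            }
        }
        pool.shutdown();
        return winner;                           // null if every attempt failed
    }
}
[/code]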
I’ve not got as far as testing the stability of the connections yet, but if you remember from the protocol we devised, it only transmits a few bytes, reads a small response, and then shuts down, in order to handle thousands of “simultaneous” clients - so stability isn’t really an issue.
I can of course work around it by simply retrying until I get a connection - which is what I’ll actually do - but what bugs me is that it fails at all at this stage, most unexpectedly. It doesn’t bode well for stability. But if it’s genuinely just a crazy quirk of my route from home to the server, there’s nothing I’ll be able to do about it anyway, and continually retrying will “patch” over the deficiency. It just sucks not to know why it’s failing, and this sort of random crap is exactly why network programming is so pointlessly difficult :emo:
Cas
A (few?) months ago you said you’d rewritten everything to use SSL, and as short-lived connections are truly not a good idea with SSL, given the incredible overhead of the handshake, I presumed you’d rewritten the protocol to use persistent connections.
Anyway, network I/O is hard, and I should know - I make the ‘big’ bucks in this general area. If your low-level code looks clean, you’re doing it wrong. Put those (self-adjusting) retry-loops behind abstraction layers and you’ll be relatively fine.
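A minimal sketch of such a retry loop behind an abstraction layer, assuming plain blocking sockets - the names, timeouts and backoff constants here are hypothetical:
[code]
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

public class ReliableConnector {

    // Retries the connect with exponential backoff until it succeeds or the
    // attempt budget runs out; callers just see a Socket or a final exception.
    public static Socket connect(String host, int port, int maxAttempts)
            throws IOException, InterruptedException {
        long backoffMillis = 250;                 // initial delay, doubled after each failure
        IOException lastFailure = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                Socket s = new Socket();
                s.connect(new InetSocketAddress(host, port), 5000);
                return s;                         // connected: the flaky route stays hidden
            } catch (IOException e) {
                lastFailure = e;
                Thread.sleep(backoffMillis);      // the self-adjusting part: back off and retry
                backoffMillis = Math.min(backoffMillis * 2, 8000);
            }
        }
        throw lastFailure;                        // every attempt failed
    }
}
[/code]
The calling code never sees the intermittent connect failures - it either gets a working socket or one final exception after the budget is spent.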
Yeah, it’s all good… it works just fine with an SSL connection too - it seems to add maybe another 100ms of latency, but that’s entirely liveable with for a nice secure protocol. I can actually ditch all the SSL stuff from the server-side Java code if I’m going to use load balancers, as they come with SSL termination built in. So it’s exactly the same as it was before, but it’s now using a nice (and thoroughly tested!) binary protocol and pretty simple client/server code. Telling the client to retry is trivial and already in the abstraction… I was just rather worried about its unreliability with only me testing it, under no load whatsoever. Now it seems it’s just me. Gah.
Time to move to Linode I think.
Cas
100ms of latency doesn’t seem bad, but it’s 100ms of one CPU core on the server doing heavy work. With a dozen handshakes per second your typical VPS will grind to a halt - it doesn’t scale well. You really need load balancers with hardware-accelerated SSL, or you’re just moving the bottleneck from one machine to another - and those SSL load balancers typically aren’t as cheap as a VPS. My initial protocol did a secure handshake without SSL, but it was rather complex - I can understand you preferred the simplicity of SSL, and I hope it works out.
It does actually use that hacked bit of SSL code you wrote, though it’s still sending a fair amount of stuff back and forth. In theory I could change it to use a custom handshaking mechanism, but sticking to SSL means we can really easily just palm the problem off on, say, a $20/mo Linode NodeBalancer, and that’ll handle it. I don’t think we’re really going to get that much traffic any time soon…
Cas
Linode’s NodeBalancers have a pretty crappy reputation and are pricey too. Simply use HAProxy on a basic Linode instance :point:
Though I did come across this tidbit:
https://www.imperialviolet.org/2010/06/25/overclocking-ssl.html
and this:
https://www.imperialviolet.org/2011/02/06/stillinexpensive.html
Cas
As for Java’s default SSL engine: it’s very computationally expensive.
As for my protocol being SSL-based - it used a sliver of SSL to create a ‘token-based session’ that the bulk of the I/O could use without needing SSL, while retaining guarantees about which peers were communicating. That’s where the complexity I referred to was introduced. Anyway, maybe Linode improved their NodeBalancers (back in 2012 - ancient history, I know - they slowly degraded to up to 4s-6s (!!) of latency in handshake overhead). Even Java’s SSL engine beats that hands down! It’s worth a try again. Getting familiar with, and correctly configuring, HAProxy is probably more ‘expensive’ than messing about with a Linode NodeBalancer for a few hours. But you know me - I’d gladly spend a day tinkering to save a few bucks per month. :persecutioncomplex:
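For illustration, a much-simplified sketch of that general shape - not Riven’s actual protocol - where one short SSL exchange hands out a random token and the bulk I/O then runs over plain TCP (no confidentiality on the plain link, just proof of which peer is talking). All names are hypothetical:
[code]
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.net.Socket;
import javax.net.ssl.SSLSocket;
import javax.net.ssl.SSLSocketFactory;

public class TokenSession {

    // Client side: one short SSL exchange buys a random session token...
    public static byte[] fetchToken(String host, int sslPort) throws Exception {
        SSLSocketFactory factory = (SSLSocketFactory) SSLSocketFactory.getDefault();
        try (SSLSocket ssl = (SSLSocket) factory.createSocket(host, sslPort)) {
            ssl.startHandshake();                        // the only expensive step
            DataInputStream in = new DataInputStream(ssl.getInputStream());
            byte[] token = new byte[32];
            in.readFully(token);                         // server sends 32 random bytes
            return token;
        }
    }

    // ...and the bulk of the I/O runs over plain TCP, with the token binding
    // the connection to the authenticated SSL exchange. The plain link is not
    // encrypted - the token only proves identity.
    public static Socket openBulkConnection(String host, int port, byte[] token)
            throws Exception {
        Socket s = new Socket(host, port);
        DataOutputStream out = new DataOutputStream(s.getOutputStream());
        out.write(token);                                // server looks the token up
        out.flush();
        return s;
    }
}
[/code]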
Please advise if you were able to find any solution to this problem. I am facing a similar issue.
Implement a reader thread that fills a queue, and a writer thread driven from a queue. You will never see timeouts again, but it won’t fix whatever other bugs there are in the code generating or absorbing data.
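A minimal sketch of that reader/writer-queue pattern, assuming blocking socket streams and a hypothetical length-prefixed byte[] framing; application code blocks on the queues rather than on the socket:
[code]
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.net.Socket;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class QueuedConnection {
    private final BlockingQueue<byte[]> inbound  = new LinkedBlockingQueue<>();
    private final BlockingQueue<byte[]> outbound = new LinkedBlockingQueue<>();

    public QueuedConnection(Socket socket) throws Exception {
        DataInputStream  in  = new DataInputStream(socket.getInputStream());
        DataOutputStream out = new DataOutputStream(socket.getOutputStream());

        Thread reader = new Thread(() -> {             // reader thread fills the inbound queue
            try {
                while (true) {
                    int len = in.readInt();            // hypothetical length-prefixed framing
                    byte[] msg = new byte[len];
                    in.readFully(msg);
                    inbound.put(msg);
                }
            } catch (Exception e) { /* socket closed or thread interrupted */ }
        });

        Thread writer = new Thread(() -> {             // writer thread drains the outbound queue
            try {
                while (true) {
                    byte[] msg = outbound.take();
                    out.writeInt(msg.length);
                    out.write(msg);
                    out.flush();
                }
            } catch (Exception e) { /* socket closed or thread interrupted */ }
        });

        reader.setDaemon(true);
        writer.setDaemon(true);
        reader.start();
        writer.start();
    }

    public void send(byte[] msg) throws InterruptedException { outbound.put(msg); }

    public byte[] receive() throws InterruptedException { return inbound.take(); }
}
[/code]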