Spark Shuffle Tuning: Diagnosing Connection Refused Errors in VPC Environments
IOMETE Engineering
Platform Engineering · IOMETE

Why Spark shuffle fails in private networks
When a Spark job crosses the shuffle boundary — sorting, grouping, or joining data across partitions — executors must communicate directly with each other over ephemeral ports. In a managed cloud deployment like IOMETE, this means executor-to-executor traffic must traverse your VPC, crossing security groups, network ACLs, and sometimes availability zone boundaries.
The canonical symptom is a Connection refused error logged during the shuffle fetch phase, often accompanied by a task failure and a retry cascade that degrades job throughput. Users frequently see this after establishing VPC peering — the handshake succeeds, but the actual data transfer fails because the underlying routing or firewall rules were written for the control plane, not the data plane.
Required ports for IOMETE
IOMETE uses several well-defined ports across its control and data planes. Misconfiguring any one of them produces different failure modes — from a silent timeout to an explicit refusal. The table below maps each service to its port and the error pattern you will see if it is blocked.
| Service | Port | Protocol | Symptom if blocked |
|---|---|---|---|
| IOMETE UI / API | 443 | HTTPS | telnet hangs indefinitely |
| Spark Connect | 10001 | TCP | Connection refused on connect |
| Thrift / HiveServer2 | 10000 | TCP | Power BI / ODBC connection drops |
| Spark Driver UI | 4040 | HTTP | Metrics unreachable from client |
| Executor shuffle service | 7337 | TCP | Shuffle fetch failures, task retries |
| Internal metastore | 9083 | Thrift | Load balancer health check failures |
| IOMETE REST API (alt) | 8443 | HTTPS | Packets dropped at NAT gateway |
On AWS, security groups are stateful (return traffic is automatically allowed), but network ACLs are stateless: you must explicitly allow both inbound and outbound traffic for each port pair. Azure NSGs and GCP firewall rules are stateful by default, so a single allow rule covers the return path.
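Before adjusting routing or firewall rules, confirm basic reachability of each port from a client inside your VPC. Below is a minimal probe loop with netcat, using the same placeholder hostname (iomete-endpoint) as the examples later in this article; substitute your actual endpoint.
# Probe every required IOMETE port. "succeeded" = open; "refused" = host
# reachable but port blocked; a hang/timeout = routing or ACL problem.
for port in 443 8443 10000 10001 7337 4040 9083; do
  nc -zv -w 5 iomete-endpoint "$port"
done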
VPC peering and routing
VPC peering establishes a private routing path between two networks, but it does not automatically propagate routes to the subnet route tables. This is the most common root cause when users report that "VPC peering is established but routing is not working — packets get dropped at the NAT gateway."
After peering, each VPC subnet that needs to reach the IOMETE private subnet (e.g., 172.16.0.0/12 or 10.0.0.0/16) must have a route table entry pointing to the peering connection — not to the NAT gateway. Sending cross-VPC traffic through a NAT gateway will break shuffle because the source IP changes, and Spark executor handshakes rely on the originating IP being routable by the receiving executor.
# AWS Terraform: add peering route to IOMETE subnet
resource "aws_route" "to_iomete_vpc" {
  route_table_id            = aws_route_table.app_private.id
  destination_cidr_block    = "172.16.0.0/12" # IOMETE private subnet
  vpc_peering_connection_id = aws_vpc_peering_connection.app_to_iomete.id
}
# Do NOT route via nat_gateway_id for cross-VPC shuffle traffic
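To confirm the route actually landed in the subnet's route table, you can inspect it with the AWS CLI; the route table ID below is a placeholder.
# Verify the peering route exists and does not point at a NAT gateway
aws ec2 describe-route-tables \
  --route-table-ids rtb-0123456789abcdef0 \
  --query 'RouteTables[].Routes[?DestinationCidrBlock==`172.16.0.0/12`]'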
Service mesh (Istio) interference with shuffle traffic
On GCP deployments running Istio as a service mesh, the Envoy sidecar intercepts all inbound and outbound TCP connections from the executor pod. Istio's mTLS policy then attempts to negotiate a TLS handshake on port 7337 — the Spark shuffle service port — but the Spark shuffle server does not speak TLS. The result is a silent protocol mismatch that manifests as Connection refused at the application layer.
The fix is to exclude Spark executor ports from Envoy interception using a traffic.sidecar.istio.io/excludeOutboundPorts annotation on the executor pod template, or to create a PeerAuthentication policy scoped to the IOMETE namespace with PERMISSIVE mode (sketched after the annotation example below).
# Exclude shuffle ports from Envoy sidecar interception
apiVersion: apps/v1
kind: Deployment
metadata:
  name: iomete-executor
spec:
  template:
    metadata:
      annotations:
        traffic.sidecar.istio.io/excludeOutboundPorts: "7337,4040"
        traffic.sidecar.istio.io/excludeInboundPorts: "7337,4040"
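If you prefer the PeerAuthentication route, here is a minimal sketch; the namespace name iomete is an assumption, so scope it to wherever your IOMETE workloads actually run.
# Allow plaintext alongside mTLS for the IOMETE namespace (PERMISSIVE mode)
kubectl apply -f - <<'EOF'
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: iomete-permissive
  namespace: iomete
spec:
  mtls:
    mode: PERMISSIVE
EOF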
Cross-AZ traffic and executor placement costs
AWS charges $0.01 per GB in each direction for traffic crossing availability zone boundaries within a region, so each gigabyte shuffled across AZs effectively costs $0.02. For shuffle-heavy Spark workloads — joins on large datasets, wide aggregations — this adds up quickly when Kubernetes schedules executors across multiple AZs without affinity constraints. A 10 TB shuffle job with 50% cross-AZ traffic can generate hundreds of dollars in data transfer charges per run.
The correct approach is to use Kubernetes node affinity to co-locate IOMETE executor pods in a single AZ, and to ensure that the storage layer (S3, GCS, or Azure Blob) endpoints are accessed through VPC endpoints rather than the public internet, eliminating another class of cross-boundary charges (a CLI sketch for the endpoint follows the affinity example below).
# Kubernetes node affinity: pin executors to a single AZ
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            - key: topology.kubernetes.io/zone
              operator: In
              values:
                - us-east-1a # Replace with your target AZ
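For the storage half of that advice, a gateway VPC endpoint keeps S3 traffic on the AWS backbone. A minimal AWS CLI sketch follows; the VPC and route table IDs are placeholders, and the region matches the us-east-1 example above.
# Create an S3 gateway endpoint so executors reach object storage privately
aws ec2 create-vpc-endpoint \
  --vpc-id vpc-0123456789abcdef0 \
  --vpc-endpoint-type Gateway \
  --service-name com.amazonaws.us-east-1.s3 \
  --route-table-ids rtb-0123456789abcdef0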
Spark configuration
Set spark.locality.wait=10s and spark.locality.wait.node=5s to give the Spark scheduler more time to find a local executor before accepting a remote one. This reduces unnecessary cross-node (and cross-AZ) task scheduling.
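As a sketch, these can be passed at submit time; the property names are standard Spark settings, while the job file name is illustrative.
# Pass locality settings at submission time
spark-submit \
  --conf spark.locality.wait=10s \
  --conf spark.locality.wait.node=5s \
  your_job.py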
NAT gateway idle timeouts and BI tool disconnections
AWS NAT gateways enforce a 350-second idle TCP timeout. Azure NAT gateways default to 4 minutes. If a Power BI or ODBC connection to IOMETE port 10000 (HiveServer2 / Thrift) sits idle during a long query computation, the NAT gateway silently closes the connection. The client sees the session drop; the server is unaware and keeps the session open, causing a half-open TCP state that prevents reconnection until both sides time out.
The solution is to enable TCP keepalive at the OS level on the IOMETE node, set tcp_keepalive_time to below the NAT gateway threshold, and configure the JDBC connection string to activate keepalives on the client side.
# Linux: reduce TCP keepalive interval below NAT gateway threshold
sysctl -w net.ipv4.tcp_keepalive_time=240
sysctl -w net.ipv4.tcp_keepalive_intvl=30
sysctl -w net.ipv4.tcp_keepalive_probes=5
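# Note: sysctl -w changes do not survive a reboot. To persist them, add the
# same three keys to a drop-in file (filename below is illustrative) and reload:
#   printf 'net.ipv4.tcp_keepalive_time=240\nnet.ipv4.tcp_keepalive_intvl=30\nnet.ipv4.tcp_keepalive_probes=5\n' \
#     > /etc/sysctl.d/99-keepalive.conf
#   sysctl --system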
# JDBC connection string for Power BI / ODBC on port 10000
# (KeepAlive / KeepAliveInterval are driver-specific options: confirm the
# exact parameter names in your ODBC or JDBC driver's documentation)
jdbc:hive2://iomete-endpoint:10000/default;transportMode=http;httpPath=/cliservice;ssl=true;KeepAlive=true;KeepAliveInterval=120
On-premises VPN connectivity to IOMETE endpoints
When IOMETE is deployed in a private subnet (e.g., 10.0.0.0/16 on Azure) and accessed over a site-to-site VPN from an on-premises network, the most common failure is that the VPN tunnel carries the control-plane traffic (port 443 to the IOMETE endpoint) but the on-premises firewall blocks the return path for data-plane traffic on ephemeral ports.
Verify the end-to-end path with traceroute and tcpdump before escalating; a command sketch follows the checklist below. In many reported cases, packets reach the VPN gateway and are forwarded into the cloud VNet correctly, but the on-premises perimeter firewall has a stateful rule that only allows traffic it originated, dropping inbound shuffle responses.
Recommended verification steps
- Run telnet iomete-endpoint 443 from the on-premises host. A hang indicates a routing issue; "Connection refused" means the host is reachable but the port is blocked.
- Check VPN gateway logs for dropped packets matching the IOMETE CIDR.
- Verify the on-premises firewall allows inbound TCP on ephemeral ports (1024–65535) from the cloud CIDR.
- Confirm route propagation: the VPN-connected subnet must have a static or BGP route to the IOMETE private subnet.
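A minimal sketch of those path checks, using the same placeholder hostname as earlier and the 10.0.0.0/16 CIDR from the example above:
# TCP traceroute to the control-plane port (ICMP is often blocked on VPNs;
# both commands typically require root)
traceroute -T -p 443 iomete-endpoint

# Watch for shuffle responses arriving from the cloud CIDR; silence here
# while the cloud side shows packets leaving implicates the perimeter firewall
tcpdump -ni any 'tcp and net 10.0.0.0/16 and port 7337'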
Bottom line
Spark shuffle connection failures in VPC environments are almost always a networking problem, not a Spark configuration problem. The diagnostic path is deterministic: verify port reachability, then routing, then firewall statefulness, then service mesh interception. Only after all four layers are confirmed should you tune Spark-level parameters.
- Open all required IOMETE ports (443, 8443, 10000, 10001, 7337, 4040, 9083) in both security groups and network ACLs.
- Ensure VPC peering routes point directly to the peering connection, not to the NAT gateway.
- Exclude executor shuffle ports from Istio/Envoy interception when running on a service mesh.
- Use AZ-affinity node selectors to contain shuffle traffic within a single zone on AWS.
- Enable TCP keepalives to prevent NAT gateway idle timeouts from dropping BI tool connections.
- Validate that on-premises perimeter firewalls allow inbound ephemeral-port responses from your cloud CIDR.
If your configuration matches these requirements and shuffle failures persist, reach out to IOMETE support — we will review your network topology and walk through a live diagnostic session.
Need help diagnosing your environment?
We will walk through your network topology and pinpoint the exact failure layer — no generic checklists.