Spark Shuffle Tuning: Diagnosing Connection Refused Errors in VPC Environments

April 29, 2026

IOMETE Engineering

Platform Engineering · IOMETE

Spark · Networking · VPC · Troubleshooting · Performance
Spark executor shuffle traffic traversing VPC boundaries — a common source of connection refused errors.

Why Spark shuffle fails in private networks

When a Spark job crosses the shuffle boundary — sorting, grouping, or joining data across partitions — executors must communicate directly with each other over ephemeral ports. In a managed cloud deployment like IOMETE, this means executor-to-executor traffic must traverse your VPC, crossing security groups, network ACLs, and sometimes availability zone boundaries.

The canonical symptom is a Connection refused error logged during the shuffle fetch phase, often accompanied by a task failure and a retry cascade that degrades job throughput. Users frequently see this after establishing VPC peering — the handshake succeeds, but the actual data transfer fails because the underlying routing or firewall rules were written for the control plane, not the data plane.

Key insight

Network ACLs being set to "all traffic allowed" does not mean routing is correct. Packets can be dropped at the NAT gateway or at a subnet route table even when ACLs are permissive. Both layers must be verified independently.
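One quick way to tell the two failure modes apart from any client host is to time the connect attempt: a refused connection fails almost instantly with an RST, while a dropped packet only fails when the timeout expires. A minimal bash sketch using `/dev/tcp` (substitute your own target host and port; the local probe at the end only demonstrates the classifier):

```shell
# Classify a TCP connect attempt as open / refused / timeout.
# "refused" (fast failure) => host reachable, nothing listening or firewall REJECT.
# "timeout" (slow failure) => packet dropped: routing, NACL, or firewall DROP.
# Note: an ICMP "no route to host" reply also fails fast and will read as "refused".
probe() {
  local host=$1 port=$2 start elapsed
  start=$(date +%s)
  if timeout 3 bash -c "exec 3<>/dev/tcp/$host/$port" 2>/dev/null; then
    echo open
  else
    elapsed=$(( $(date +%s) - start ))
    if [ "$elapsed" -ge 2 ]; then echo timeout; else echo refused; fi
  fi
}

probe 127.0.0.1 1   # a closed local port answers with an immediate RST -> refused
```

Run the same probe against the IOMETE endpoint on each required port; a "timeout" points you at the routing/NAT layer, a "refused" at the listener or a rejecting firewall.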

Required ports for IOMETE

IOMETE uses several well-defined ports across its control and data planes. Misconfiguring any one of them produces different failure modes — from a silent timeout to an explicit refusal. The table below maps each service to its port and the error pattern you will see if it is blocked.

| Service | Port | Protocol | Symptom if blocked |
| --- | --- | --- | --- |
| IOMETE UI / API | 443 | HTTPS | telnet hangs indefinitely |
| Spark Connect | 10001 | TCP | Connection refused on connect |
| Thrift / HiveServer2 | 10000 | TCP | Power BI / ODBC connection drops |
| Spark Driver UI | 4040 | HTTP | Metrics unreachable from client |
| Executor shuffle service | 7337 | TCP | Shuffle fetch failures, task retries |
| Internal metastore | 9083 | Thrift | Load balancer health check failures |
| IOMETE REST API (alt) | 8443 | HTTPS | Packets dropped at NAT gateway |

On AWS, security groups are stateful — return traffic is automatically allowed — but network ACLs are stateless: each inbound allow for a service port needs a matching outbound allow for the ephemeral return range, and vice versa. Azure NSGs and GCP firewall rules are stateful, so there the paired-rule requirement applies only to any stateless device in the path, such as an on-premises perimeter firewall.
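Where a stateless layer such as an AWS network ACL sits in the path, the inbound service-port rule and the outbound ephemeral-range rule must both exist. A Terraform sketch (the NACL resource name and the peer CIDR are placeholders, not values from an actual IOMETE deployment):

```hcl
# Inbound: allow shuffle traffic to port 7337 from the peer VPC
resource "aws_network_acl_rule" "shuffle_in" {
  network_acl_id = aws_network_acl.data_plane.id   # hypothetical NACL
  rule_number    = 100
  egress         = false
  protocol       = "tcp"
  rule_action    = "allow"
  cidr_block     = "10.0.0.0/16"                   # example peer VPC CIDR
  from_port      = 7337
  to_port        = 7337
}

# Outbound: allow the stateless return path on ephemeral ports
resource "aws_network_acl_rule" "shuffle_return" {
  network_acl_id = aws_network_acl.data_plane.id
  rule_number    = 100
  egress         = true
  protocol       = "tcp"
  rule_action    = "allow"
  cidr_block     = "10.0.0.0/16"
  from_port      = 1024
  to_port        = 65535
}
```

Without the second rule, the SYN arrives but the SYN-ACK is dropped on the way out, which presents as a connect timeout on the client.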

VPC peering and routing

VPC peering establishes a private routing path between two networks, but it does not automatically propagate routes to the subnet route tables. This is the most common root cause when users report that "VPC peering is established but routing is not working — packets get dropped at the NAT gateway."

After peering, each VPC subnet that needs to reach the IOMETE private subnet (e.g., 172.16.0.0/12 or 10.0.0.0/16) must have a route table entry pointing to the peering connection — not to the NAT gateway. Sending cross-VPC traffic through a NAT gateway will break shuffle because the source IP changes, and Spark executor handshakes rely on the originating IP being routable by the receiving executor.

```hcl
# AWS Terraform: add peering route to IOMETE subnet
resource "aws_route" "to_iomete_vpc" {
  route_table_id            = aws_route_table.app_private.id
  destination_cidr_block    = "172.16.0.0/12"   # IOMETE private subnet
  vpc_peering_connection_id = aws_vpc_peering_connection.app_to_iomete.id
}

# Do NOT route via nat_gateway_id for cross-VPC shuffle traffic
```

Azure-specific note

On Azure, VNet peering also requires you to enable "Allow forwarded traffic" and "Allow gateway transit" in the peering settings. Without these flags, packets originating from a spoke VNet are dropped before reaching the IOMETE subnet, producing connection timeouts that look identical to firewall blocks.
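As a sketch, these flags can be flipped on an existing peering via the Azure CLI generic update syntax (the resource group, VNet, and peering names below are placeholders):

```shell
# Enable forwarded traffic and gateway transit on an existing VNet peering.
# my-rg, spoke-vnet, and spoke-to-iomete are placeholder names.
az network vnet peering update \
  --resource-group my-rg \
  --vnet-name spoke-vnet \
  --name spoke-to-iomete \
  --set allowForwardedTraffic=true allowGatewayTransit=true
```

Verify the result with `az network vnet peering show` before re-testing connectivity, since peering changes take effect asynchronously.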

Service mesh (Istio) interference with shuffle traffic

On GCP deployments running Istio as a service mesh, the Envoy sidecar intercepts all inbound and outbound TCP connections from the executor pod. Istio's mTLS policy then attempts to negotiate a TLS handshake on port 7337 — the Spark shuffle service port — but the Spark shuffle server does not speak TLS. The result is a silent protocol mismatch that manifests as Connection refused at the application layer.

The fix is to exclude Spark executor ports from Envoy interception using a traffic.sidecar.istio.io/excludeOutboundPorts annotation on the executor pod template, or to create a PeerAuthentication policy scoped to the IOMETE namespace with PERMISSIVE mode.

```yaml
# Exclude shuffle ports from Envoy sidecar interception
apiVersion: apps/v1
kind: Deployment
metadata:
  name: iomete-executor
spec:
  template:
    metadata:
      annotations:
        traffic.sidecar.istio.io/excludeOutboundPorts: "7337,4040"
        traffic.sidecar.istio.io/excludeInboundPorts: "7337,4040"
```
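The PeerAuthentication alternative mentioned above would allow plaintext connections alongside mTLS for the whole namespace rather than excluding specific ports. A sketch (the policy name and namespace are assumptions, not values from an actual deployment):

```yaml
# Accept both mTLS and plaintext (non-mTLS) connections in the namespace
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: shuffle-permissive
  namespace: iomete-system   # assumed IOMETE namespace
spec:
  mtls:
    mode: PERMISSIVE
```

Port exclusion is the narrower fix; PERMISSIVE mode relaxes mTLS for every workload in the namespace, so prefer the annotations unless you need the blanket policy.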

Cross-AZ traffic and executor placement costs

AWS charges $0.01 per GB for traffic crossing availability zone boundaries within a region, billed in each direction. For shuffle-heavy Spark workloads — joins on large datasets, wide aggregations — this adds up quickly when Kubernetes schedules executors across multiple AZs without affinity constraints. A 10 TB shuffle with 50% cross-AZ traffic costs roughly $100 per pass, and multi-stage jobs that shuffle the data several times push a single run into the hundreds of dollars.
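As a back-of-envelope check, assuming the $0.01/GB rate is billed in each direction ($0.02/GB combined) and each shuffled byte crosses the AZ boundary once:

```shell
# Back-of-envelope cross-AZ shuffle cost:
# 10 TB shuffled, 50% crossing the AZ boundary, $0.01/GB in each direction.
awk 'BEGIN {
  tb = 10; cross_frac = 0.5; rate_per_gb = 0.02   # both directions combined
  printf "%.2f\n", tb * 1024 * cross_frac * rate_per_gb
}'
# prints 102.40
```

Multi-stage jobs shuffle the same data more than once, multiplying this figure accordingly.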

The correct approach is to use Kubernetes node affinity to co-locate IOMETE executor pods in a single AZ, and to ensure that the storage layer (S3, GCS, or Azure Blob) endpoints are accessed through VPC endpoints rather than the public internet, eliminating another class of cross-boundary charges.

```yaml
# Kubernetes node affinity: pin executors to a single AZ
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: topology.kubernetes.io/zone
          operator: In
          values:
          - us-east-1a   # Replace with your target AZ
```

Spark configuration

Set spark.locality.wait=10s and spark.locality.wait.node=5s to give the Spark scheduler more time to find a local executor before accepting a remote one. This reduces unnecessary cross-node (and cross-AZ) task scheduling.

NAT gateway idle timeouts and BI tool disconnections

AWS NAT gateways enforce a 350-second idle TCP timeout. Azure NAT gateways default to 4 minutes. If a Power BI or ODBC connection to IOMETE port 10000 (HiveServer2 / Thrift) sits idle during a long query computation, the NAT gateway silently closes the connection. The client sees the session drop; the server is unaware and keeps the session open, causing a half-open TCP state that prevents reconnection until both sides time out.

The solution is to enable TCP keepalive at the OS level on the IOMETE node, set tcp_keepalive_time to below the NAT gateway threshold, and configure the JDBC connection string to activate keepalives on the client side.

```bash
# Linux: reduce TCP keepalive interval below NAT gateway threshold
sysctl -w net.ipv4.tcp_keepalive_time=240
sysctl -w net.ipv4.tcp_keepalive_intvl=30
sysctl -w net.ipv4.tcp_keepalive_probes=5

# JDBC connection string for Power BI / ODBC on port 10000
jdbc:hive2://iomete-endpoint:10000/default;transportMode=http;httpPath=/cliservice;ssl=true;KeepAlive=true;KeepAliveInterval=120
```

On-premises VPN connectivity to IOMETE endpoints

When IOMETE is deployed in a private subnet (e.g., 10.0.0.0/16 on Azure) and accessed over a site-to-site VPN from an on-premises network, the most common failure is that the VPN tunnel carries the control-plane traffic (port 443 to the IOMETE endpoint) but the on-premises firewall blocks the return path for data-plane traffic on ephemeral ports.

Verify the end-to-end path with traceroute and tcpdump before escalating. In many reported cases, packets reach the VPN gateway and are forwarded into the cloud VNet correctly, but the on-premises perimeter firewall has a stateful rule that only allows traffic it originated — dropping inbound shuffle responses.

Recommended verification steps

  1. Run telnet iomete-endpoint 443 from the on-premises host. A hang means packets are being dropped (routing or a silent firewall DROP); an immediate Connection refused means the host is reachable but nothing is listening, or a firewall is actively rejecting the port.
  2. Check VPN gateway logs for dropped packets matching the IOMETE CIDR.
  3. Verify the on-premises firewall allows inbound TCP on ephemeral ports (1024–65535) from the cloud CIDR.
  4. Confirm route propagation: the VPN-connected subnet must have a static or BGP route to the IOMETE private subnet.

Bottom line

Spark shuffle connection failures in VPC environments are almost always a networking problem, not a Spark configuration problem. The diagnostic path is deterministic: verify port reachability, then routing, then firewall statefulness, then service mesh interception. Only after all four layers are confirmed should you tune Spark-level parameters.

  • Open all required IOMETE ports (443, 8443, 10000, 10001, 7337, 4040, 9083) in both security groups and network ACLs.
  • Ensure VPC peering routes point directly to the peering connection — not to the NAT gateway.
  • Exclude executor shuffle ports from Istio/Envoy interception when running on a service mesh.
  • Use AZ-affinity node selectors to contain shuffle traffic within a single zone on AWS.
  • Enable TCP keepalives to prevent NAT gateway idle timeouts from dropping BI tool connections.
  • Validate on-premises perimeter firewalls allow inbound ephemeral port responses from your cloud CIDR.

If your configuration matches these requirements and shuffle failures persist, reach out to IOMETE support — we will review your network topology and walk through a live diagnostic session.
