TCP(4P) Protocols TCP(4P)

NAME


tcp, TCP - Internet Transmission Control Protocol

SYNOPSIS


#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/tcp.h>

s = socket(AF_INET, SOCK_STREAM, 0);
s = socket(AF_INET6, SOCK_STREAM, 0);
t = t_open("/dev/tcp", O_RDWR);
t = t_open("/dev/tcp6", O_RDWR);

DESCRIPTION


TCP is the virtual circuit protocol of the Internet protocol family.
It provides reliable, flow-controlled, in-order, two-way transmission
of data. It is a byte-stream protocol layered above the Internet
Protocol (IP), or the Internet Protocol Version 6 (IPv6), the Internet
protocol family's internetwork datagram delivery protocol.

Programs can access TCP using the socket interface as a SOCK_STREAM
socket type, or using the Transport Level Interface (TLI) where it
supports the connection-oriented (BT_COTS_ORD) service type.

A checksum over all data helps TCP provide reliable communication.
Using a window-based flow control mechanism that makes use of positive
acknowledgements, sequence numbers, and a retransmission strategy, TCP
can usually recover when datagrams are damaged, delayed, duplicated or
delivered out of order by the underlying medium.

TCP provides several socket options, defined in <netinet/tcp.h> and
described throughout this document, which may be set using
setsockopt(3SOCKET) and read using getsockopt(3SOCKET). The level
argument for these calls is the protocol number for TCP, available from
getprotobyname(3SOCKET). IP level options may also be used with TCP.
See ip(4P) and ip6(4P).

Listening And Connecting


TCP uses IP's host-level addressing and adds its own per-host
collection of "port addresses". The endpoints of a TCP connection are
identified by the combination of an IPv4 or IPv6 address and a TCP port
number. Although other protocols, such as the User Datagram Protocol
(UDP), may use the same host and port address format, the port space of
these protocols is distinct. See inet(4P) and inet6(4P) for details on
the common aspects of addressing in the Internet protocol family.

Sockets utilizing TCP are either "active" or "passive". Active sockets
initiate connections to passive sockets. Passive sockets must have
their local IPv4 or IPv6 address and TCP port number bound with the
bind(3SOCKET) system call after the socket is created. If an active
socket has not been bound by the time connect(3SOCKET) is called, then
the operating system will choose a local address and port for the
application. By default, TCP sockets are active. A passive socket is
created by calling the listen(3SOCKET) system call after binding, which
establishes a queueing parameter for the passive socket. Connections
to the passive socket can then be received using the accept(3SOCKET)
system call. Active sockets use the connect(3SOCKET) call after
binding to initiate connections.

If incoming connection requests include an IP source route option, then
the reverse source route will be used when responding.

By using the special value INADDR_ANY with IPv4, or the unspecified
address (all zeroes) with IPv6, the local IP address can be left
unspecified in the bind() call by either active or passive TCP sockets.
This feature is usually used if the local address is either unknown or
irrelevant. If left unspecified, the local IP address will be bound at
connection time to the address of the network interface used to service
the connection. For passive sockets, this is the destination address
used by the connecting peer. For active sockets, this is usually an
address on the same subnet as the destination or default gateway
address, although the rules can be more complex. See Source Address
Selection in inet6(4P) for a detailed discussion of how this works in
IPv6.

Note that no two TCP sockets can be bound to the same port unless the
bound IP addresses are different. IPv4 INADDR_ANY and IPv6 unspecified
addresses compare as equal to any IPv4 or IPv6 address. For example,
if a socket is bound to INADDR_ANY or the unspecified address and port
N, no other socket can bind to port N, regardless of the binding
address. This special consideration of INADDR_ANY and the unspecified
address can be changed using the socket option SO_REUSEADDR. If
SO_REUSEADDR is set on a socket doing a bind, IPv4 INADDR_ANY and the
IPv6 unspecified address do not compare as equal to any IP address.
This means that as long as the two sockets are not both bound to
INADDR_ANY, the unspecified address, or the same IP address, then the
two sockets can be bound to the same port.

If an application does not want to allow another socket using the
SO_REUSEADDR option to bind to a port its socket is bound to, the
application can set the socket-level (SOL_SOCKET) option SO_EXCLBIND on
a socket. The option values of 0 and 1 mean enabling and disabling the
option respectively. Once this option is enabled on a socket, no other
socket can be bound to the same port.

Sending And Receiving Data


Once a connection has been established, data can be exchanged using the
read(2) and write(2) system calls. If, after sending data, the local
TCP receives no acknowledgements from its peer for a period of time
(for example, if the remote machine crashes), the connection is closed
and an error is returned.

When a peer is sending data, it will only send up to the advertised
"receive window", which is determined by how much more data the
recipient can fit in its buffer. Applications can use the socket-level
option SO_RCVBUF to increase or decrease the receive buffer size.
Similarly, the socket-level option SO_SNDBUF can be used to allow TCP
to buffer more unacknowledged and unsent data locally.

Under most circumstances, TCP will send data when it is written by the
application. When outstanding data has not yet been acknowledged,
though, TCP will gather small amounts of output to be sent as a single
packet once an acknowledgement has been received. Usually referred to
as Nagle's Algorithm (RFC 896), this behavior helps prevent flooding
the network with many small packets.

However, for some highly interactive clients (such as remote shells or
windowing systems that send a stream of keypresses or mouse events),
this batching may cause significant delays. To disable this behavior,
TCP provides a boolean socket option, TCP_NODELAY.

Conversely, for other applications, it may be desirable for TCP not to
send out any data until a full TCP segment can be sent. To enable this
behavior, an application can use the TCP-level socket option TCP_CORK.
When set to a non-zero value, TCP will only send out a full TCP
segment. When TCP_CORK is set to zero after it has been enabled, all
currently buffered data is sent out (as permitted by the peer's receive
window and the current congestion window).

Still other latency-sensitive applications rely on receiving a quick
notification that their packets have been successfully received. To
satisfy the requirements of those applications, setting the
TCP_QUICKACK option to a non-zero value will instruct the TCP stack to
send an acknowledgment immediately upon receipt of a packet, rather
than waiting to acknowledge multiple packets at once.

TCP provides an urgent data mechanism, which may be invoked using the
out-of-band provisions of send(3SOCKET). The caller may mark one byte
as "urgent" with the MSG_OOB flag to send(3SOCKET). This sets an
"urgent pointer" pointing to this byte in the TCP stream. The receiver
on the other side of the stream is notified of the urgent data by a
SIGURG signal. The SIOCATMARK ioctl(2) request returns a value
indicating whether the stream is at the urgent mark. Because the
system never returns data across the urgent mark in a single read(2)
call, it is possible to advance to the urgent data in a simple loop
which reads data, testing the socket with the SIOCATMARK ioctl()
request, until it reaches the mark.

The TCP_MD5SIG option controls the use of MD5 digests (as defined by
RFC 2385) on the specified socket. The option value is specified as an
int. When enabled (non-zero), outgoing packets have a digest added to
the TCP options in their header, and digests in incoming packets are
verified. In order to use this function, TCPSIG security associations
(one for each direction) must also be configured in the system security
association database (SADB) using tcpkey(8). A listening socket with
the option enabled accepts connections with digests only from sources
for which a security association exists. Connections without digests
are only accepted from sources for which no security association is set
up. The resulting connected socket only has TCP_MD5SIG set if the
connection is protected with MD5 signatures. If no matching security
association (SA) is found for traffic on a socket configured with the
TCP_MD5SIG option, no outgoing segments are sent, and all inbound
segments are dropped. In particular, the SA must be present prior to
the socket being used in a call to connect(3SOCKET) or accept(3SOCKET).
Once the option is enabled and an SA is bound to a connection, it will
be cached and used for all subsequent segments; it cannot be changed
mid-stream. An SA which is in use can be deleted using tcpkey(8) and
will not be used for any new connections, but existing connections
continue to use their cached copy.

Congestion Control


TCP follows the congestion control algorithm described in RFC 2581, and
also supports the initial congestion window (cwnd) changes in RFC 3390.
The initial cwnd calculation can be overridden by the socket option
TCP_INIT_CWND. An application can use this option to set the initial
cwnd to a specified number of TCP segments. This applies to the cases
when the connection first starts and restarts after an idle period.
The process must have the PRIV_SYS_NET_CONFIG privilege if it wants to
specify a number greater than that calculated by RFC 3390.

The operating system also provides alternative algorithms that may be
more appropriate for your application, including the CUBIC congestion
control algorithm described in RFC 8312. These can be configured
system-wide using ipadm(8), or on a per-connection basis with the TCP-
level socket option TCP_CONGESTION, whose argument is the name of the
algorithm to use (for example "cubic"). If the requested algorithm
does not exist, then setsockopt() will fail, and errno will be set to
ENOENT.

TCP Keep-Alive
Since TCP determines whether a remote peer is no longer reachable by
timing out waiting for acknowledgements, a host that never sends any
new data may never notice a peer that has gone away. While consumers
can avoid this problem by sending their own periodic heartbeat messages
(Transport Layer Security does this, for example,) TCP describes an
optional keep-alive mechanism in RFC 1122. Applications can enable it
using the socket-level option SO_KEEPALIVE. When enabled, the first
keep-alive probe is sent out after a TCP connection is idle for two
hours. If the peer does not respond to the probe within eight minutes,
the TCP connection is aborted. An application can alter the probe
behavior using the following TCP-level socket options:

TCP_KEEPALIVE_THRESHOLD
Determines the interval for sending the first
probe. The option value is specified as an
unsigned integer in milliseconds. The system
default is controlled by the TCP ndd parameter
tcp_keepalive_interval. The minimum value is
ten seconds. The maximum is ten days, while
the default is two hours.

TCP_KEEPALIVE_ABORT_THRESHOLD
If TCP does not receive a response to the
probe, then this option determines how long to
wait before aborting a TCP connection. The
option value is an unsigned integer in
milliseconds. The value zero indicates that
TCP should never time out and abort the
connection when probing. The system default is
controlled by the TCP ndd parameter
tcp_keepalive_abort_interval. The default is
eight minutes.

TCP_KEEPIDLE This option, like TCP_KEEPALIVE_THRESHOLD,
determines the interval for sending the first
probe, except that the option value is an
unsigned integer in seconds. It is provided
primarily for compatibility with other Unix
flavors.

TCP_KEEPCNT This option specifies the number of keep-alive
probes that should be sent without any response
from the peer before aborting the connection.

TCP_KEEPINTVL This option specifies the interval in seconds
between successive, unacknowledged keep-alive
probes.

Additional Configuration


illumos supports TCP Extensions for High Performance (RFC 7323) which
includes the window scale and timestamp options, and Protection Against
Wrap Around Sequence Numbers (PAWS). Note that if timestamps are
negotiated on a connection, received segments without timestamps on
that connection are silently dropped per the suggestion in the RFC.
illumos also supports Selective Acknowledgment (SACK) capabilities (RFC
2018) and Explicit Congestion Notification (ECN) mechanism (RFC 3168).

Turn on the window scale option in one of the following ways:

+o An application can set SO_SNDBUF or SO_RCVBUF size in the
setsockopt() option to be larger than 64K. This must be
done before the program calls listen() or connect(),
because the window scale option is negotiated when the
connection is established. Once the connection has been
made, it is too late to increase the send or receive window
beyond the default TCP limit of 64K.

+o For all applications, use ndd(8) to modify the
configuration parameter tcp_wscale_always. If
tcp_wscale_always is set to 1, the window scale option will
always be set when connecting to a remote system. If
tcp_wscale_always is 0, the window scale option will be set
only if the user has requested a send or receive window
larger than 64K. The default value of tcp_wscale_always is
1.

+o Regardless of the value of tcp_wscale_always, the window
scale option will always be included in a connect
acknowledgement if the connecting system has used the
option.

Turn on SACK capabilities in the following way:

+o Use ndd to modify the configuration parameter
tcp_sack_permitted. If tcp_sack_permitted is set to 0, TCP
will not accept SACK or send out SACK information. If
tcp_sack_permitted is set to 1, TCP will not initiate a
connection with SACK permitted option in the SYN segment,
but will respond with SACK permitted option in the SYN|ACK
segment if an incoming connection request has the SACK
permitted option. This means that TCP will only accept
SACK information if the other side of the connection also
accepts SACK information. If tcp_sack_permitted is set to
2, it will both initiate and accept connections with SACK
information. The default for tcp_sack_permitted is 2
(active enabled).

Turn on the TCP ECN mechanism in the following way:

+o Use ndd to modify the configuration parameter
tcp_ecn_permitted. If tcp_ecn_permitted is set to 0, then
TCP will not negotiate with a peer that supports ECN
mechanism. If tcp_ecn_permitted is set to 1 when
initiating a connection, TCP will not tell a peer that it
supports ECN mechanism. However, it will tell a peer that
it supports ECN mechanism when accepting a new incoming
connection request if the peer indicates that it supports
ECN mechanism in the SYN segment. If tcp_ecn_permitted is
set to 2, in addition to negotiating with a peer on ECN
mechanism when accepting connections, TCP will indicate in
the outgoing SYN segment that it supports ECN mechanism
when TCP makes active outgoing connections. The default
for tcp_ecn_permitted is 1.

Turn on the timestamp option in the following way:

+o Use ndd to modify the configuration parameter
tcp_tstamp_always. If tcp_tstamp_always is 1, the
timestamp option will always be set when connecting to a
remote machine. If tcp_tstamp_always is 0, the timestamp
option will not be set when connecting to a remote system.
The default for tcp_tstamp_always is 0.

+o Regardless of the value of tcp_tstamp_always, the timestamp
option will always be included in a connect acknowledgement
(and all succeeding packets) if the connecting system has
used the timestamp option.

Use the following procedure to turn on the timestamp option only when
the window scale option is in effect:

+o Use ndd to modify the configuration parameter
tcp_tstamp_if_wscale. Setting tcp_tstamp_if_wscale to 1
will cause the timestamp option to be set when connecting
to a remote system, if the window scale option has been
set. If tcp_tstamp_if_wscale is 0, the timestamp option
will not be set when connecting to a remote system. The
default for tcp_tstamp_if_wscale is 1.

Protection Against Wrap Around Sequence Numbers (PAWS) is always used
when the timestamp option is set.

The operating system also supports multiple methods of generating
initial sequence numbers. One of these methods is the improved
technique suggested in RFC 1948. We HIGHLY recommend that you set
sequence number generation parameters as close to boot time as
possible. This prevents sequence number problems on connections that
use the same connection-ID as ones that used a different sequence
number generation. The svc:/network/initial:default service configures
the initial sequence number generation. The service reads the value
contained in the configuration file /etc/default/inetinit to determine
which method to use.

The /etc/default/inetinit file is an unstable interface, and may change
in future releases.

EXAMPLES


Example 1: Connecting to a server
$ gcc -std=c99 -Wall -lsocket -o client client.c
$ cat client.c
#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <netdb.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int
main(int argc, char *argv[])
{
struct addrinfo hints, *gair, *p;
int fd, rv, rlen;
char buf[1024];
int y = 1;

if (argc != 3) {
fprintf(stderr, "%s <host> <port>\n", argv[0]);
return (1);
}

memset(&hints, 0, sizeof (hints));
hints.ai_family = PF_UNSPEC;
hints.ai_socktype = SOCK_STREAM;

if ((rv = getaddrinfo(argv[1], argv[2], &hints, &gair)) != 0) {
fprintf(stderr, "getaddrinfo() failed: %s\n",
gai_strerror(rv));
return (1);
}

for (p = gair; p != NULL; p = p->ai_next) {
if ((fd = socket(
p->ai_family,
p->ai_socktype,
p->ai_protocol)) == -1) {
perror("socket() failed");
continue;
}

if (connect(fd, p->ai_addr, p->ai_addrlen) == -1) {
close(fd);
perror("connect() failed");
continue;
}

break;
}

if (p == NULL) {
fprintf(stderr, "failed to connect to server\n");
return (1);
}

freeaddrinfo(gair);

if (setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &y,
sizeof (y)) == -1) {
perror("setsockopt(SO_KEEPALIVE) failed");
return (1);
}

while ((rlen = read(fd, buf, sizeof (buf))) > 0) {
fwrite(buf, rlen, 1, stdout);
}

if (rlen == -1) {
perror("read() failed");
}

fflush(stdout);

if (close(fd) == -1) {
perror("close() failed");
}

return (0);
}
$ ./client 127.0.0.1 8080
hello
$ ./client ::1 8080
hello

Example 2: Accepting client connections
$ gcc -std=c99 -Wall -lsocket -o server server.c
$ cat server.c
#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <netdb.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <arpa/inet.h>

void
logmsg(struct sockaddr *s, int bytes)
{
char dq[INET6_ADDRSTRLEN];

switch (s->sa_family) {
case AF_INET: {
struct sockaddr_in *s4 = (struct sockaddr_in *)s;
inet_ntop(AF_INET, &s4->sin_addr, dq, sizeof (dq));
fprintf(stdout, "sent %d bytes to %s:%d\n",
bytes, dq, ntohs(s4->sin_port));
break;
}
case AF_INET6: {
struct sockaddr_in6 *s6 = (struct sockaddr_in6 *)s;
inet_ntop(AF_INET6, &s6->sin6_addr, dq, sizeof (dq));
fprintf(stdout, "sent %d bytes to [%s]:%d\n",
bytes, dq, ntohs(s6->sin6_port));
break;
}
default:
fprintf(stdout, "sent %d bytes to unknown client\n",
bytes);
break;
}
}

int
main(int argc, char *argv[])
{
struct addrinfo hints, *gair, *p;
int sfd, cfd;
int slen, wlen, rv;

if (argc != 3) {
fprintf(stderr, "%s <port> <message>\n", argv[0]);
return (1);
}

slen = strlen(argv[2]);

memset(&hints, 0, sizeof (hints));
hints.ai_family = PF_UNSPEC;
hints.ai_socktype = SOCK_STREAM;
hints.ai_flags = AI_PASSIVE;

if ((rv = getaddrinfo(NULL, argv[1], &hints, &gair)) != 0) {
fprintf(stderr, "getaddrinfo() failed: %s\n",
gai_strerror(rv));
return (1);
}

for (p = gair; p != NULL; p = p->ai_next) {
if ((sfd = socket(
p->ai_family,
p->ai_socktype,
p->ai_protocol)) == -1) {
perror("socket() failed");
continue;
}

if (bind(sfd, p->ai_addr, p->ai_addrlen) == -1) {
close(sfd);
perror("bind() failed");
continue;
}

break;
}

if (p == NULL) {
fprintf(stderr, "server failed to bind()\n");
return (1);
}

freeaddrinfo(gair);

if (listen(sfd, 1024) != 0) {
perror("listen() failed");
return (1);
}

fprintf(stdout, "waiting for clients...\n");

for (int times = 0; times < 5; times++) {
struct sockaddr_storage stor;
socklen_t alen = sizeof (stor);
struct sockaddr *addr = (struct sockaddr *)&stor;

if ((cfd = accept(sfd, addr, &alen)) == -1) {
perror("accept() failed");
continue;
}

wlen = 0;

do {
wlen += write(cfd, argv[2] + wlen, slen - wlen);
} while (wlen < slen);

logmsg(addr, wlen);

if (close(cfd) == -1) {
perror("close(cfd) failed");
}
}

if (close(sfd) == -1) {
perror("close(sfd) failed");
}

fprintf(stdout, "finished.\n");

return (0);
}
$ ./server 8080 $'hello\n'
waiting for clients...
sent 6 bytes to [::ffff:127.0.0.1]:59059
sent 6 bytes to [::ffff:127.0.0.1]:47448
sent 6 bytes to [::ffff:127.0.0.1]:54949
sent 6 bytes to [::ffff:127.0.0.1]:55186
sent 6 bytes to [::1]:62256
finished.

DIAGNOSTICS


A socket operation may fail if:

EISCONN A connect() operation was attempted on a socket
on which a connect() operation had already been
performed.

ETIMEDOUT A connection was dropped due to excessive
retransmissions.

ECONNRESET The remote peer forced the connection to be
closed (usually because the remote machine has
lost state information about the connection due
to a crash).

ECONNREFUSED The remote peer actively refused connection
establishment (usually because no process is
listening to the port).

EADDRINUSE A bind() operation was attempted on a socket
with a network address/port pair that has
already been bound to another socket.

EADDRNOTAVAIL A bind() operation was attempted on a socket
with a network address for which no network
interface exists.

EACCES A bind() operation was attempted with a
"reserved" port number and the effective user
ID of the process was not the privileged user.

ENOBUFS The system ran out of memory for internal data
structures.

SEE ALSO


svcs(1), ioctl(2), read(2), write(2), accept(3SOCKET), bind(3SOCKET),
connect(3SOCKET), getprotobyname(3SOCKET), getsockopt(3SOCKET),
listen(3SOCKET), send(3SOCKET), inet(4P), inet6(4P), ip(4P), ip6(4P),
smf(7), ndd(8), svcadm(8), tcpkey(8)

K. Ramakrishnan, S. Floyd, and D. Black, The Addition of Explicit
Congestion Notification (ECN) to IP, RFC 3168, September 2001.

M. Mathias, J. Mahdavi, S. Ford, and A. Romanow, TCP Selective
Acknowledgement Options, RFC 2018, October 1996.

S. Bellovin, Defending Against Sequence Number Attacks, RFC 1948, May
1996.

D. Borman, B. Braden, V. Jacobson, and R. Scheffenegger, Ed., TCP
Extensions for High Performance, RFC 7323, September 2014.

Jon Postel, Transmission Control Protocol - DARPA Internet Program
Protocol Specification, RFC 793, Network Information Center, SRI
International, Menlo Park, CA., September 1981.

A. Heffernan, Protection of BGP Sessions via the TCP MD5 Signature
Option, RFC 2385, August 1998.

NOTES


The tcp service is managed by the service management facility, smf(7),
under the service identifier svc:/network/initial:default.

Administrative actions on this service, such as enabling, disabling, or
requesting restart, can be performed using svcadm(8). The service's
status can be queried using the svcs(1) command.

illumos May 2, 2024 illumos

tribblix@gmail.com :: GitHub :: Privacy