The relationship between NIO and epoll

NIO's implementation on Linux is epoll. On Linux, all I/O is abstracted as files (including both network and file I/O), identified by a file descriptor (fd). An fd is a non-negative integer; 0, 1, and 2 correspond to stdin, stdout, and stderr respectively.
epoll consists of three functions:

  1. epoll_create: creates an epoll instance and returns an fd representing it; corresponds to Selector
  2. epoll_ctl: registers interest in I/O events; corresponds to SelectableChannel.register()
  3. epoll_wait: waits for I/O events; corresponds to Selector.select()
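The mapping above can be exercised directly from Java (a minimal sketch; on Linux these calls bottom out in epoll, while other platforms use kqueue, poll, etc.):

```java
import java.net.InetSocketAddress;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.nio.channels.ServerSocketChannel;

public class EpollMapping {
    public static void main(String[] args) throws Exception {
        Selector selector = Selector.open();               // ~ epoll_create
        ServerSocketChannel server = ServerSocketChannel.open();
        server.bind(new InetSocketAddress("127.0.0.1", 0));
        server.configureBlocking(false);
        server.register(selector, SelectionKey.OP_ACCEPT); // ~ epoll_ctl(EPOLL_CTL_ADD)
        int ready = selector.selectNow();                  // ~ epoll_wait(timeout = 0)
        System.out.println(ready);                         // no client has connected yet: 0
        server.close();
        selector.close();
    }
}
```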

OP_ACCEPT and OP_CONNECT

  EPOLLIN
         The associated file is available for read(2) operations.

  EPOLLOUT
         The associated file is available for write(2) operations.

  EPOLLRDHUP (since Linux 2.6.17)
         Stream socket peer closed connection, or shut down writing
         half of connection.  (This flag is especially useful for
         writing simple code to detect peer shutdown when using
         edge-triggered monitoring.)

  EPOLLPRI
         There is an exceptional condition on the file descriptor.  See
         the discussion of POLLPRI in poll(2).

  EPOLLERR
         Error condition happened on the associated file descriptor.
         This event is also reported for the write end of a pipe when
         the read end has been closed.

         epoll_wait(2) will always report for this event; it is not
         necessary to set it in events when calling epoll_ctl().

  EPOLLHUP
         Hang up happened on the associated file descriptor.

         epoll_wait(2) will always wait for this event; it is not
         necessary to set it in events when calling epoll_ctl().

         Note that when reading from a channel such as a pipe or a
         stream socket, this event merely indicates that the peer
         closed its end of the channel.  Subsequent reads from the
         channel will return 0 (end of file) only after all outstanding
         data in the channel has been consumed.

  EPOLLET
         Requests edge-triggered notification for the associated file
         descriptor.  The default behavior for epoll is
         level-triggered.  See epoll(7) for more detailed information
         about edge-triggered and level-triggered notification.

         This flag is an input flag for the event.events field when
         calling epoll_ctl(); it is never returned by epoll_wait(2).

  EPOLLONESHOT (since Linux 2.6.2)
         Requests one-shot notification for the associated file
         descriptor.  This means that after an event notified for the
         file descriptor by epoll_wait(2), the file descriptor is
         disabled in the interest list and no other events will be
         reported by the epoll interface.  The user must call
         epoll_ctl() with EPOLL_CTL_MOD to rearm the file descriptor
         with a new event mask.

         This flag is an input flag for the event.events field when
         calling epoll_ctl(); it is never returned by epoll_wait(2).

Notice that these differ somewhat from the events in SelectionKey: for instance, there is no OP_ACCEPT or OP_CONNECT here. So what are those two events for 🤔?

ACCEPT

```java
public int translateInterestOps(int ops) {
    int newOps = 0;
    if ((ops & SelectionKey.OP_ACCEPT) != 0)
        newOps |= Net.POLLIN;
    return newOps;
}
```

OP_ACCEPT is translated into Net.POLLIN. As for CONNECT:

```java
public int translateInterestOps(int ops) {
    int newOps = 0;
    if ((ops & SelectionKey.OP_READ) != 0)
        newOps |= Net.POLLIN;
    if ((ops & SelectionKey.OP_WRITE) != 0)
        newOps |= Net.POLLOUT;
    if ((ops & SelectionKey.OP_CONNECT) != 0)
        newOps |= Net.POLLCONN;
    return newOps;
}
```

POLLCONN has the same value as POLLOUT (both are 4); the two are distinguished by the socket's state: on an unconnected socket the bit means OP_CONNECT, on a connected one it means OP_WRITE. Splitting POLLIN into ACCEPT and READ is understandable, but why split OUT into WRITE and CONNECT?
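This state-based dispatch can be sketched as follows (a toy model, not JDK code; the constant value and state names merely mirror the description above):

```java
import java.nio.channels.SelectionKey;

// Toy illustration: one poll bit (value 4) is interpreted as two
// different NIO ops depending on the socket's connection state.
public class PollOutDispatch {
    static final int POLLOUT = 0x4; // same numeric value the JDK uses for POLLCONN

    enum State { UNCONNECTED, PENDING, CONNECTED }

    // Returns the NIO op that a ready POLLOUT bit represents for this state.
    static int translate(int readyBits, State state) {
        if ((readyBits & POLLOUT) == 0) return 0;
        return (state == State.CONNECTED)
                ? SelectionKey.OP_WRITE    // 4
                : SelectionKey.OP_CONNECT; // 8
    }

    public static void main(String[] args) {
        System.out.println(translate(POLLOUT, State.PENDING));   // 8 (OP_CONNECT)
        System.out.println(translate(POLLOUT, State.CONNECTED)); // 4 (OP_WRITE)
    }
}
```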

What OP_CONNECT actually does

There is a very common misconception here: the client calls connect, the server gets an OP_ACCEPT event and calls accept, after which the client gets an OP_CONNECT event and calls finishConnect. That looks exactly like the three-way handshake, but it is nothing of the sort: debugging with Wireshark shows that by the time the server calls accept, the three-way handshake has already completed. So what are OP_CONNECT and finishConnect actually doing?

```java
boolean polled = Net.pollConnectNow(fd);
```

This is a native method:

```c
jint fd = fdval(env, fdo);
struct pollfd poller;
int result;

poller.fd = fd;
poller.events = POLLOUT;
poller.revents = 0;
if (timeout < -1) {
    timeout = -1;
} else if (timeout > INT_MAX) {
    timeout = INT_MAX;
}
result = poll(&poller, 1, (int)timeout);
```

As you can see, the JNI method simply uses poll() to check the fd for a POLLOUT event. POLLOUT means the socket's send buffer is writable, which implies the connection has been established. So the blocking finishConnect() blocks until the connection is established, while the non-blocking finishConnect() uses its return value to indicate whether the connection has been established. The name is quite misleading; something like doesItFinishConnect would be more honest. Put another way, non-blocking connect does not seem all that useful.
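The earlier handshake claim is easy to verify in pure Java, no Wireshark needed: a non-blocking connect completes even though the server never calls accept(), because the kernel finishes the three-way handshake on its own and parks the connection in the listen backlog (a minimal sketch over loopback):

```java
import java.net.InetSocketAddress;
import java.nio.channels.ServerSocketChannel;
import java.nio.channels.SocketChannel;

public class ConnectBeforeAccept {
    public static void main(String[] args) throws Exception {
        // Listen, but deliberately never call accept().
        ServerSocketChannel server = ServerSocketChannel.open();
        server.bind(new InetSocketAddress("127.0.0.1", 0));

        SocketChannel client = SocketChannel.open();
        client.configureBlocking(false);
        client.connect(server.getLocalAddress());

        // Poll finishConnect(); the handshake is completed by the kernel,
        // so this succeeds even though accept() was never called.
        while (!client.finishConnect()) {
            Thread.sleep(10);
        }
        System.out.println("connected=" + client.isConnected());

        client.close();
        server.close();
    }
}
```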

Errors

Translating epoll events back into NIO events:

```java
public boolean translateReadyOps(int ops, int initialOps,
                                 SelectionKeyImpl sk) {
    int intOps = sk.nioInterestOps(); // Do this just once, it synchronizes
    int oldOps = sk.nioReadyOps();
    int newOps = initialOps;

    if ((ops & Net.POLLNVAL) != 0) {
        // This should only happen if this channel is pre-closed while a
        // selection operation is in progress
        // ## Throw an error if this channel has not been pre-closed
        return false;
    }

    if ((ops & (Net.POLLERR | Net.POLLHUP)) != 0) {
        newOps = intOps;
        sk.nioReadyOps(newOps);
        // No need to poll again in checkConnect,
        // the error will be detected there
        readyToConnect = true;
        return (newOps & ~oldOps) != 0;
    }

    if (((ops & Net.POLLIN) != 0) &&
        ((intOps & SelectionKey.OP_READ) != 0) &&
        (state == ST_CONNECTED))
        newOps |= SelectionKey.OP_READ;

    if (((ops & Net.POLLCONN) != 0) &&
        ((intOps & SelectionKey.OP_CONNECT) != 0) &&
        ((state == ST_UNCONNECTED) || (state == ST_PENDING))) {
        newOps |= SelectionKey.OP_CONNECT;
        readyToConnect = true;
    }

    if (((ops & Net.POLLOUT) != 0) &&
        ((intOps & SelectionKey.OP_WRITE) != 0) &&
        (state == ST_CONNECTED))
        newOps |= SelectionKey.OP_WRITE;

    sk.nioReadyOps(newOps);
    return (newOps & ~oldOps) != 0;
}
```

From this we can see:

  1. POLLNVAL sets no ready ops. POLLNVAL's value is 32, which has no corresponding entry in the table above. As the comment says, it indicates API misuse, so it is simply ignored.

  2. POLLERR and POLLHUP copy intOps verbatim, which means every registered event will be reported as ready.
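Point 2 can be illustrated with a toy version of that branch (hypothetical helper, not JDK code):

```java
import java.nio.channels.SelectionKey;

// Toy model of the POLLERR/POLLHUP branch: on error/hangup the ready
// set becomes the ENTIRE interest set, so every registered event fires.
public class ErrHupReadySet {
    static final int POLLERR = 0x8, POLLHUP = 0x10;

    static int readyOps(int pollBits, int interestOps, int normalReady) {
        if ((pollBits & (POLLERR | POLLHUP)) != 0) {
            return interestOps; // copy the interest set verbatim
        }
        return normalReady;
    }

    public static void main(String[] args) {
        int interest = SelectionKey.OP_READ | SelectionKey.OP_WRITE;
        // A hangup makes both OP_READ and OP_WRITE "ready": 1 | 4 = 5
        System.out.println(readyOps(POLLHUP, interest, 0));
    }
}
```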

I also came across a related Netty issue: https://github.com/netty/netty/issues/924

```java
if ((readyOps & SelectionKey.OP_CONNECT) != 0) {
    // remove OP_CONNECT as otherwise Selector.select(..) will always return without blocking
    // See https://github.com/netty/netty/issues/924
    int ops = k.interestOps();
    ops &= ~SelectionKey.OP_CONNECT;
    k.interestOps(ops);

    unsafe.finishConnect();
}

// Process OP_WRITE first as we may be able to write some queued buffers and so free memory.
if ((readyOps & SelectionKey.OP_WRITE) != 0) {
    // Call forceFlush which will also take care of clear the OP_WRITE once there is nothing left to write
    ch.unsafe().forceFlush();
}

// Also check for readOps of 0 to workaround possible JDK bug which may otherwise lead
// to a spin loop
if ((readyOps & (SelectionKey.OP_READ | SelectionKey.OP_ACCEPT)) != 0 || readyOps == 0) {
    unsafe.read();
}
```

Netty handles OP_CONNECT first. When the peer sends a Reset, execution first enters unsafe.finishConnect(), which contains no logic to cancel the event or close the connection.
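The pattern Netty implements above (register OP_CONNECT, call finishConnect() on readiness, then drop OP_CONNECT from the interest set so select() does not spin) looks roughly like this sketch in plain NIO:

```java
import java.net.InetSocketAddress;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.nio.channels.ServerSocketChannel;
import java.nio.channels.SocketChannel;

public class NonBlockingConnect {
    public static void main(String[] args) throws Exception {
        ServerSocketChannel server = ServerSocketChannel.open();
        server.bind(new InetSocketAddress("127.0.0.1", 0));

        Selector selector = Selector.open();
        SocketChannel client = SocketChannel.open();
        client.configureBlocking(false);
        client.connect(server.getLocalAddress());
        client.register(selector, SelectionKey.OP_CONNECT);

        selector.select(); // returns once POLLOUT reports the fd writable
        for (SelectionKey key : selector.selectedKeys()) {
            if (key.isConnectable() && client.finishConnect()) {
                // Drop OP_CONNECT, as Netty does, so select() can block again.
                key.interestOps(SelectionKey.OP_READ);
            }
        }
        System.out.println(client.isConnected());

        client.close();
        server.close();
        selector.close();
    }
}
```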

Finally

Note the last branch in the code above: Java splits the IN event into ACCEPT and READ, but Netty unifies ACCEPT and READ again; a ServerSocket's read handling is just a call to accept:

```java
protected int doReadMessages(List<Object> buf) throws Exception {
    SocketChannel ch = SocketUtils.accept(javaChannel());

    try {
        if (ch != null) {
            buf.add(new NioSocketChannel(this, ch));
            return 1;
        }
    } catch (Throwable t) {
        logger.warn("Failed to create a new channel from an accepted socket.", t);

        try {
            ch.close();
        } catch (Throwable t2) {
            logger.warn("Failed to close a socket.", t2);
        }
    }

    return 0;
}
```

The overall impression is that each layer has its own reasons for doing what it does, but the combination ends up looking rather comical.