AndroidGeek
首发于AndroidGeek
okhttp和http 2.0相遇引发的"血案"

okhttp和http 2.0相遇引发的"血案"

1.起因

今天在做流量分析时,从服务端记录的请求日志发现流量排名前10的用户里有一个用户的某个网络请求竟然在很短的时间内请求了10000多次! 那段时间内的流量占比如下图:


2.分析

考虑到客户端采用AOP的方式记录了所有的网络请求信息,于是赶紧把客户端日志拉上来进行分析,看看是不是某些条件触发的,却意外地并没有看到频繁地网络请求! 我的内心是崩溃的!!!

既然客户端日志查不出什么问题,于是只能继续从服务端日志入手,仔细看一下简直吓尿了,那一万多个请求竟然都是同一个requestId,于是怀疑是服务端的日志记录有问题,但是可以看到各个请求虽然requestId一样,但是时间戳是不一样的,那就排除了服务端的日志记录出错的可能!


那就只能是客户端的网络请求的重试机制有问题了!!!


但是OkHttp的重试机制很简单啊,就是添加拦截器,示例代码如下:

client.interceptors().add(new Interceptor() {
        @Override
        public Response intercept(Chain chain) throws IOException {
            Request request = chain.request();

            // try the request
            Response response = chain.proceed(request);

            int tryCount = 0;
            while (!response.isSuccessful() && tryCount < 3) {

                Log.d("intercept", "Request is not successful - " + tryCount);

                tryCount++;

                // retry the request
                response = chain.proceed(request);
            }

            // otherwise just pass the original response on
            return response;
        }
    });

而且由于并不是一个很紧迫的配置项,所以当时并没有给它添加重试的拦截器。


那就只可能是它自身的重试机制出问题了,由于我们App里面用的是okhttp版本是3.4.2,其对应的重试代码如下:

@Override public Response intercept(Chain chain) throws IOException {
    Request request = chain.request();

    streamAllocation = new StreamAllocation(
        client.connectionPool(), createAddress(request.url()));

    int followUpCount = 0;
    Response priorResponse = null;
    while (true) {
      if (canceled) {
        streamAllocation.release();
        throw new IOException("Canceled");
      }

      Response response = null;
      boolean releaseConnection = true;
      try {
        response = ((RealInterceptorChain) chain).proceed(request, streamAllocation, null, null);
        releaseConnection = false;
      } catch (RouteException e) {
        // The attempt to connect via a route failed. The request will not have been sent.
        if (!recover(e.getLastConnectException(), true, request)) throw e.getLastConnectException();
        releaseConnection = false;
        continue;
      } catch (IOException e) {
        // An attempt to communicate with a server failed. The request may have been sent.
        if (!recover(e, false, request)) throw e;
        releaseConnection = false;
        continue;
      } finally {
        // We're throwing an unchecked exception. Release any resources.
        if (releaseConnection) {
          streamAllocation.streamFailed(null);
          streamAllocation.release();
        }
      }

      // Attach the prior response if it exists. Such responses never have a body.
      if (priorResponse != null) {
        response = response.newBuilder()
            .priorResponse(priorResponse.newBuilder()
                .body(null)
                .build())
            .build();
      }

      Request followUp = followUpRequest(response);

      if (followUp == null) {
        if (!forWebSocket) {
          streamAllocation.release();
        }
        return response;
      }

      closeQuietly(response.body());

      if (++followUpCount > MAX_FOLLOW_UPS) {
        streamAllocation.release();
        throw new ProtocolException("Too many follow-up requests: " + followUpCount);
      }

      if (followUp.body() instanceof UnrepeatableRequestBody) {
        throw new HttpRetryException("Cannot retry streamed HTTP body", response.code());
      }

      if (!sameConnection(response, followUp.url())) {
        streamAllocation.release();
        streamAllocation = new StreamAllocation(
            client.connectionPool(), createAddress(followUp.url()));
      } else if (streamAllocation.stream() != null) {
        throw new IllegalStateException("Closing the body of " + response
            + " didn't close its backing stream. Bad interceptor?");
      }

      request = followUp;
      priorResponse = response;
    }
  }

这段代码是在RetryAndFollowUpInterceptor中的intercept()方法中,这个类的目的很明确,就是用于重试和跳转,但是redirect最大次数是20, 那怎么会无限重试呢?

显然只可能是在if (++followUpCount > MAX_FOLLOW_UPS) 这个条件判断之前,继续分析下去会发现在出现RouteException或者IOException时,会调用recover()方法,而如果每次recover()都返回true的话,就不会释放连接,而是一直重试下去。

那有没有可能我们这个请求是发生了RouteException或者IOException呢?

看到服务端有如下记录:



注意其中的errorDomain和errorDescription,很清楚地可以看到是发生了shutdown的IOException,其实这种情况一般发生在服务器处于较大压力的情况下,再看一下recover()这个方法:

/**
   * Report and attempt to recover from a failure to communicate with a server. Returns true if
   * {@code e} is recoverable, or false if the failure is permanent. Requests with a body can only
   * be recovered if the body is buffered.
   */
  private boolean recover(IOException e, boolean routeException, Request userRequest) {
    streamAllocation.streamFailed(e);

    // The application layer has forbidden retries.
    if (!client.retryOnConnectionFailure()) return false;

    // We can't send the request body again.
    if (!routeException && userRequest.body() instanceof UnrepeatableRequestBody) return false;

    // This exception is fatal.
    if (!isRecoverable(e, routeException)) return false;

    // No more routes to attempt.
    if (!streamAllocation.hasMoreRoutes()) return false;

    // For failure recovery, use the same route selector with a new connection.
    return true;
  }

由于应用层并没有禁止重试,Request的body()也并非UnrepeatableRequestBody,而且路由选择也不为false, 所以最终返回true,从而判定为连接可以恢复,最终导致无限重连,耗费大量的流量

之后在看okhttp的issues时发现其实这个bug早在2016.7.28(对应版本为3.4.1)就有了提了,如下图所示:


只是这个bug的修复在3.4.2并没有完成,而是一直到3.5.0才完成,详细信息见链接[Gracefully recover from an HTTP/2 connection shutdown at start of request. #2992](Gracefully recover from an HTTP/2 connection shutdown at start of request. by dave-r12 · Pull Request #2992 · square/okhttp) 所以如果遇到类似问题的同学,请将okhttp升级到3.5.0以上。


3.Root Cause

不管是从okhttp的issue还是我们自己的日志记录来看,这种情况的出现都是非常小概率的,那到底是什么情况下会出现shutdown这样的IOException呢?

采用charles抓包,发现该网络请求采用的竟然是http 2.0协议,如下图所示:



为了更好地理解h2的请求过程,简单地了解一下h2与1.1请求的区别。


h2是基于Google的SPDY协议的,其实h2和1.1最大的区别就是多路复用,即1.1中必须一个连接对应一个stream,而h2中可以多个stream共用一个连接,如下图所示:


其实现的方式就是将要传输的信息分割成一个个二进制帧,其中首部信息会放到HEADER Frame中,而request body会放到DATA Frame中,示意图如下:


帧传输的示意图如下:



而数据帧的格式其实在okhttp的注释中就有,如下:



p.s:不得不佩服维护okhttp的团队注释写得非常好。


由于TCP协议的滑动窗口,因而1.1中每次建立连接后都是需要慢启动,即数据先是慢慢地传,之后滑动窗口变大,才能高速传递,而h2由于共用连接,不仅提高于吞吐量,而且可以使平均传输速度增大,如下的动图很好地表示了h2与1.1在传输相同图片时的对比:



另外,由于h2进行了首部压缩,所以其实它还可以节省流量。

了解了这些,知道h2是通过帧传输的,就能很快地梳理出h2协议的请求过程,如下图所示:




其中重要的节点我都用红色标出来了,如前面所说,h2协议是一帧一帧地读取地,所以正常的请求下Http2.Reader.nextFrame()方法中的type是TYPE_HEADERS和TYPE_DATA,该方法如下:

@Override public boolean nextFrame(Handler handler) throws IOException {
      try {
        source.require(9); // Frame header size
      } catch (IOException e) {
        return false; // This might be a normal socket close.
      }

      /*  0                   1                   2                   3
       *  0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
       * +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
       * |                 Length (24)                   |
       * +---------------+---------------+---------------+
       * |   Type (8)    |   Flags (8)   |
       * +-+-+-----------+---------------+-------------------------------+
       * |R|                 Stream Identifier (31)                      |
       * +=+=============================================================+
       * |                   Frame Payload (0...)                      ...
       * +---------------------------------------------------------------+
       */
      int length = readMedium(source);
      if (length < 0 || length > INITIAL_MAX_FRAME_SIZE) {
        throw ioException("FRAME_SIZE_ERROR: %s", length);
      }
      byte type = (byte) (source.readByte() & 0xff);
      byte flags = (byte) (source.readByte() & 0xff);
      int streamId = (source.readInt() & 0x7fffffff); // Ignore reserved bit.
      if (logger.isLoggable(FINE)) logger.fine(formatHeader(true, streamId, length, type, flags));

      switch (type) {
        case TYPE_DATA:
          readData(handler, length, flags, streamId);
          break;

        case TYPE_HEADERS:
          readHeaders(handler, length, flags, streamId);
          break;

        case TYPE_PRIORITY:
          readPriority(handler, length, flags, streamId);
          break;

        case TYPE_RST_STREAM:
          readRstStream(handler, length, flags, streamId);
          break;

        case TYPE_SETTINGS:
          readSettings(handler, length, flags, streamId);
          break;

        case TYPE_PUSH_PROMISE:
          readPushPromise(handler, length, flags, streamId);
          break;

        case TYPE_PING:
          readPing(handler, length, flags, streamId);
          break;

        case TYPE_GOAWAY:
          readGoAway(handler, length, flags, streamId);
          break;

        case TYPE_WINDOW_UPDATE:
          readWindowUpdate(handler, length, flags, streamId);
          break;

        default:
          // Implementations MUST discard frames that have unknown or unsupported types.
          source.skip(length);
      }
      return true;
    }

一直到服务端出现某种错误关闭连接,okhttp就会从数据帧中读取到TYPE_GOAWAY的标志位,然后进入readGoAway()的流程中,该方法如下:

private void readGoAway(Handler handler, int length, byte flags, int streamId)
        throws IOException {
      if (length < 8) throw ioException("TYPE_GOAWAY length < 8: %s", length);
      if (streamId != 0) throw ioException("TYPE_GOAWAY streamId != 0");
      int lastStreamId = source.readInt();
      int errorCodeInt = source.readInt();
      int opaqueDataLength = length - 8;
      ErrorCode errorCode = ErrorCode.fromHttp2(errorCodeInt);
      if (errorCode == null) {
        throw ioException("TYPE_GOAWAY unexpected error code: %d", errorCodeInt);
      }
      ByteString debugData = EMPTY;
      if (opaqueDataLength > 0) { // Must read debug data in order to not corrupt the connection.
        debugData = source.readByteString(opaqueDataLength);
      }
      handler.goAway(lastStreamId, errorCode, debugData);
    }

注意到这个方法中的handler其实是实现了Handler接口的FramedConnection.Reader,其goAway()方法如下:

@Override public void goAway(int lastGoodStreamId, ErrorCode errorCode, ByteString debugData) {
      if (debugData.size() > 0) { // TODO: log the debugData
      }

      // Copy the streams first. We don't want to hold a lock when we call receiveRstStream().
      FramedStream[] streamsCopy;
      synchronized (FramedConnection.this) {
        streamsCopy = streams.values().toArray(new FramedStream[streams.size()]);
        shutdown = true;
      }

      // Fail all streams created after the last good stream ID.
      for (FramedStream framedStream : streamsCopy) {
        if (framedStream.getId() > lastGoodStreamId && framedStream.isLocallyInitiated()) {
          framedStream.receiveRstStream(ErrorCode.REFUSED_STREAM);
          removeStream(framedStream.getId());
        }
      }
    }

注意其中将shutdown设置为true了,之后,当StreamAllocation再次调用nextFrame()时,就会抛出名为"shutdown"的IOException, 主要代码如下:

private FramedStream newStream(int associatedStreamId, List<Header> requestHeaders, boolean out,
      boolean in) throws IOException {
    boolean outFinished = !out;
    boolean inFinished = !in;
    boolean flushHeaders;
    FramedStream stream;
    int streamId;

    synchronized (frameWriter) {
      synchronized (this) {
        if (shutdown) {
          throw new IOException("shutdown");
        }
        ...
    return stream;
  }

由于中间没有捕获IOException的操作,所以这个异常就会一直往上传递到RetryAndFollowUpInterceptor的intercept()方法中,最终导致前面所说的死循环。


而至于服务端为何会在某些请求中shutdown, 查看流量统计就可以发现这个接口在一天时间内有高达几亿次请求,远远超过了当初设计的要求,而导致这个的原因是客户端用户的快速增长以及轮询请求太过频繁,从而使服务端不得不关闭一些连接以防止宕机。


4.为何客户端日志没有记录到重试的请求

这个其实就涉及到okhttp的执行流程了,这里有很多的博客理解有很大的错误,其实各个Interceptor并不是顺序执行的,而是递归执行的,有些Interceptor甚至会执行多次。

而客户端进行AOP时选择的切面是RealCall中的getResponseWithInterceptorChain()方法中,而重试发生在内部的RetryAndFollowUpInterceptor中,所以无法得知。


5.扩展问题

那如果这里不采用AOP,而是采用为okhttpclient增加Interceptor的方法,能够记录下重试的请求吗?

注意以下代码:

private Response getResponseWithInterceptorChain() throws IOException {
    // Build a full stack of interceptors.
    List<Interceptor> interceptors = new ArrayList<>();
    interceptors.addAll(client.interceptors());
    interceptors.add(retryAndFollowUpInterceptor);
    interceptors.add(new BridgeInterceptor(client.cookieJar()));
    interceptors.add(new CacheInterceptor(client.internalCache()));
    interceptors.add(new ConnectInterceptor(client));
    if (!retryAndFollowUpInterceptor.isForWebSocket()) {
      interceptors.addAll(client.networkInterceptors());
    }
    interceptors.add(new CallServerInterceptor(
        retryAndFollowUpInterceptor.isForWebSocket()));

    Interceptor.Chain chain = new RealInterceptorChain(
        interceptors, null, null, null, 0, originalRequest);
    return chain.proceed(originalRequest);
  }

显然,为client增加的Interceptor(确切地说是应用拦截器)在最外层,那它对于RetryAndFollowUpInterceptor的内部执行过程就一无所知了,所以即使采用为okhttpclient增加Interceptor的方法,也无法记录下重试的过程。

但是如果为client增加networkInterceptors(这些被称为网络拦截器),由于在最内层,所以是可以记录到包括重试在内的所有网络请求的。

而如果要想采用AOP的方式记录下重试的过程的话,还是只能对RetryAndFollowUpInterceptor内层的Interceptor进行切面,比如ConnectInterceptor或者CallServerInterceptor.

编辑于 2017-08-31

文章被以下专栏收录

    爱android,爱coding; 爱Material Degisn,也爱Best Practices; 爱最流行的开源框架,也爱最底层的内核架构;