How GitHub detects generated JS files

2016-12-21 @sunderls

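GitHub diff

If you commit a minified JS file to GitHub, GitHub automatically recognizes that its diff is not meant for human eyes and collapses it behind "Diff suppressed. Click to show".

That is good design, of course. But what if the file is unminified dist code? How can its diff be hidden? Put differently, how does GitHub recognize generated code?

I searched Google for a while and am sharing what I learned.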

generated.rb

github: https://github.com/github/linguist/blob/master/lib/linguist/generated.rb

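To decide whether code is generated, GitHub uses the Ruby code linked above. It checks textual features of the file according to its extension and returns true or false. The code is very readable. Searching it for .js shows that a JS file is judged to be generated in five cases:

1. Average line length > 110

A minified JS file has its line breaks removed, which produces extremely long lines. So if the average line length exceeds a threshold, the file can be judged to be generated code that is not meant for humans to read.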

https://github.com/github/linguist/blob/master/lib/linguist/generated.rb#L93-L108

    # Internal: Is the blob minified files?
    #
    # Consider a file minified if the average line length is
    # greater then 110c.
    #
    # Currently, only JS and CSS files are detected by this method.
    #
    # Returns true or false.
    def minified_files?
      return unless ['.js', '.css'].include? extname
      if lines.any?
        (lines.inject(0) { |n, l| n += l.length } / lines.length) > 110
      else
        false
      end
    end

As you can see, this check applies to CSS as well as JS.
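To experiment with the threshold outside of linguist, the check can be approximated on a plain string. This is only a sketch: `average_line_length` and `looks_minified?` are hypothetical names, and splitting on "\n" is an assumption about how the lines are obtained; only the 110 threshold comes from generated.rb.

```ruby
# Hypothetical stand-alone approximation of the average-line-length check.
# Assumes lines are obtained by splitting the content on "\n".
def average_line_length(source)
  lines = source.split("\n", -1)
  return 0 if lines.empty?
  # Integer division, as in the quoted method.
  lines.sum(&:length) / lines.length
end

def looks_minified?(source)
  average_line_length(source) > 110
end

minified = "var a=1;" * 50               # a single 400-character line
readable = "var a = 1;\nvar b = 2;\n"    # short lines

puts looks_minified?(minified)  # true
puts looks_minified?(readable)  # false
```

Because the check uses the average, a file with one very long line among many short ones can still fall under the threshold.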

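2. The file has a source map

For example, when compiling Sass to CSS or ES6 with Babel, you can choose to emit a source map reference in the generated file, which makes debugging easier in Chrome and other browsers. If either of the last two lines starts with //# sourceMappingURL, GitHub judges the file to be generated. Note that at the moment this seems to apply to JS only, not CSS.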

https://github.com/github/linguist/blob/master/lib/linguist/generated.rb#L110-L123

# Internal: Does the blob contain a source map reference?
#
# We assume that if one of the last 2 lines starts with a source map
# reference, then the current file was generated from other files.
#
# We use the last 2 lines because the last line might be empty.
#
# We only handle JavaScript, no CSS support yet.
#
# Returns true or false.
def has_source_map?
  return false unless extname.downcase == '.js'
  lines.last(2).any? { |line| line.start_with?('//# sourceMappingURL') }
end
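The effect of looking at only the last two lines can be tried out with a small stand-alone version. This is a sketch under assumptions: `source_map_reference?` is a hypothetical name, the extension check is left out, and lines are assumed to come from splitting on "\n".

```ruby
# Hypothetical stand-alone approximation of the source map check.
# The .js extension test from the quoted method is omitted here.
def source_map_reference?(source)
  lines = source.split("\n", -1)
  lines.last(2).any? { |line| line.start_with?('//# sourceMappingURL') }
end

compiled = "console.log(1);\n//# sourceMappingURL=app.js.map\n"
plain    = "console.log(1);\n"

puts source_map_reference?(compiled)  # true
puts source_map_reference?(plain)     # false
```

A trailing newline leaves an empty last line, which is why two lines are inspected. With two trailing newlines the reference would already fall outside that window.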

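3. The file was compiled from CoffeeScript

Newer versions of CoffeeScript emit a first line that starts with // Generated by, and GitHub uses this to decide. Note that the marker is just // Generated by, without the word CoffeeScript, so any .js file whose first line starts with this marker is treated as CoffeeScript output.

For older versions of CoffeeScript, GitHub inspects features inside the file: the first line must be the closure opening (function() {, the closure must be closed with }).call(this); followed by a blank last line, and underscore-prefixed variable names add to a score. Personally I don't think this is really necessary.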

https://github.com/github/linguist/blob/master/lib/linguist/generated.rb#L139-L176


# Internal: Is the blob of JS generated by CoffeeScript?
#
# CoffeeScript is meant to output JS that would be difficult to
# tell if it was generated or not. Look for a number of patterns
# output by the CS compiler.
#
# Return true or false
def compiled_coffeescript?
  return false unless extname == '.js'

  # CoffeeScript generated by > 1.2 include a comment on the first line
  if lines[0] =~ /^\/\/ Generated by /
    return true
  end

  if lines[0] == '(function() {' &&     # First line is module closure opening
      lines[-2] == '}).call(this);' &&  # Second to last line closes module closure
      lines[-1] == ''                   # Last line is blank

    score = 0

    lines.each do |line|
      if line =~ /var /
        # Underscored temp vars are likely to be Coffee
        score += 1 * line.gsub(/(_fn|_i|_len|_ref|_results)/).count

        # bind and extend functions are very Coffee specific
        score += 3 * line.gsub(/(__bind|__extends|__hasProp|__indexOf|__slice)/).count
      end
    end

    # Require a score of 3. This is fairly arbitrary. Consider
    # tweaking later.
    score >= 3
  else
    false
  end
end
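Both parts of this check are easy to try in isolation. The sketch below uses hypothetical helper names (`first_line_marker?`, `coffee_score`) and `scan` instead of the `gsub(...).count` idiom; the regular expressions and the weights 1 and 3 are the ones from the quoted method.

```ruby
# Hypothetical helper: does the first line carry the "// Generated by " marker?
def first_line_marker?(source)
  !!(source.lines.first.to_s =~ /^\/\/ Generated by /)
end

# Hypothetical helper: the scoring loop of the older-CoffeeScript heuristic.
# Only lines containing "var " contribute to the score.
def coffee_score(lines)
  lines.sum do |line|
    next 0 unless line =~ /var /
    line.scan(/_fn|_i|_len|_ref|_results/).size +
      3 * line.scan(/__bind|__extends|__hasProp|__indexOf|__slice/).size
  end
end

puts first_line_marker?("// Generated by CoffeeScript 1.12.7\n")   # true
puts first_line_marker?("// Generated by hand\nconsole.log(1);\n") # true
puts coffee_score(["  var _i, _len, _ref;"])                        # 3
```

The second call shows the point made above: the tool name after the marker is never inspected.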

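4. The file was generated by PEG.js

This was the first time I had heard of PEG.js. From a look at its website, it appears to be a JS library for writing parsers, for example for a custom grammar. I did not dig deeper. In any case, GitHub joins lines[0..4] of the file (the first five lines, since the Ruby range is inclusive) into one string, and if that string matches /^(?:[^\/]|\/[^\*])*\/\*(?:[^\*]|\*[^\/])*Generated by PEG.js/, the file is judged to be PEG.js output.

The regex is a bit convoluted. In JS, ?: makes a group non-capturing (the parentheses are needed here only to scope the |; see "A closer look at JS regular expressions" for details). Ruby treats (?:...) the same way. Since nothing refers back to the groups, the ?: makes no practical difference to what is matched, so the regex is effectively:

/^([^\/]|\/[^\*])*\/\*([^\*]|\*[^\/])*Generated by PEG.js/

In other words, within the joined first five lines:

  1. There must be a /*, which may be preceded by other characters (as long as they do not themselves contain a /*).
  2. No closing */ may appear between that /* and the marker Generated by PEG.js.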
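The two conditions can be checked against a few sample headers. In this sketch `PEG_HEADER` and `peg_header?` are hypothetical names, the sample inputs are made up for illustration, and splitting on "\n" is an assumption; the regular expression and the `[0..4]` range are the ones quoted from generated.rb.

```ruby
PEG_HEADER = /^(?:[^\/]|\/[^\*])*\/\*(?:[^\*]|\*[^\/])*Generated by PEG.js/

# Hypothetical helper: join lines 0..4 without separators and test the regex.
def peg_header?(source)
  head = source.split("\n", -1)[0..4].join('')
  !!(head =~ PEG_HEADER)
end

in_open_comment = "/*\n * Generated by PEG.js 0.10.0.\n */\nmodule.exports = {};\n"
after_closed    = "/* license */\n// Generated by PEG.js\n"
code_before     = "var x = 1; /* Generated by PEG.js */\n"
too_late        = "\n\n\n\n\n/* Generated by PEG.js */\n"

puts peg_header?(in_open_comment)  # true
puts peg_header?(after_closed)     # false: the comment is closed before the marker
puts peg_header?(code_before)      # true: other characters may precede the /*
puts peg_header?(too_late)         # false: the marker is on the sixth line
```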

https://github.com/github/linguist/blob/master/lib/linguist/generated.rb#L218-L233

# Internal: Is the blob of JS a parser generated by PEG.js?
#
# PEG.js-generated parsers are not meant to be consumed by humans.
#
# Return true or false
def generated_parser?
  return false unless extname == '.js'

  # PEG.js-generated parsers include a comment near the top  of the file
  # that marks them as such.
  if lines[0..4].join('') =~ /^(?:[^\/]|\/[^\*])*\/\*(?:[^\*]|\*[^\/])*Generated by PEG.js/
    return true
  end

  false
end

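5. The file was generated by Thrift

I have heard of Thrift but never used it. I did try protocol buffers in an earlier post, "protocol buffers v.s. JSON in the browser", and I assume Thrift is similar: it can presumably also generate JS files from an API definition for decoding.

GitHub's check here is fairly simple: it looks at whether any of the first six lines of the file contains the string Autogenerated by Thrift Compiler.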

https://github.com/github/linguist/blob/master/lib/linguist/generated.rb#L283-L285

# Internal: Is the blob generated by Apache Thrift compiler?
#
# Returns true or false
def generated_apache_thrift?
  return false unless APACHE_THRIFT_EXTENSIONS.include?(extname)
  return lines.first(6).any? { |l| l.include?("Autogenerated by Thrift Compiler") }
end
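A stand-alone version of this check might look as follows. It is a sketch only: `thrift_marker?` is a hypothetical name, the sample header is illustrative, and the extension test is left out because the contents of APACHE_THRIFT_EXTENSIONS are not shown in the excerpt.

```ruby
# Hypothetical stand-alone approximation of the Thrift check.
# The APACHE_THRIFT_EXTENSIONS test from the quoted method is omitted.
def thrift_marker?(source)
  source.split("\n", -1).first(6).any? do |line|
    line.include?("Autogenerated by Thrift Compiler")
  end
end

# Illustrative header; the version number is made up for the example.
header = "//\n// Autogenerated by Thrift Compiler (0.9.3)\n//\nvar x = 1;\n"

puts thrift_marker?(header)          # true
puts thrift_marker?("var x = 1;\n")  # false
```

Unlike the CoffeeScript marker, the string may appear anywhere within a line, as long as that line is among the first six.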

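Summary

GitHub collapses the diff of a JS file in five cases. If you want to deliberately hide the diff of certain JS files, you can borrow from items 2 and 3 and add the specific content to the head or tail of the file so that GitHub treats it as generated.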
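As a minimal sketch of the head-of-file variant (item 3), a build step could prepend the marker line to its output. `mark_as_generated` and its `tool` parameter are hypothetical names chosen for this example; the only thing taken from generated.rb is the first-line pattern being checked.

```ruby
# Hypothetical build helper: prepend the "// Generated by " marker line
# so that the first line matches the pattern from compiled_coffeescript?.
def mark_as_generated(source, tool: "build script")
  "// Generated by #{tool}\n" + source
end

dist = mark_as_generated("console.log('hello');\n")

puts dist.lines.first                                   # // Generated by build script
puts !!(dist.lines.first =~ /^\/\/ Generated by /)      # true
```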