Playing with Apache Spark (2) - Tambourine Work Notes

Let's work through the official Quick Start (https://spark.apache.org/docs/latest/quick-start.html).

I'll go with the Scala version, since Scala is what I have just started studying lately.

> spark-shell 
2018-05-03 16:40:22 WARN  NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Spark context Web UI available at http://192.168.0.116:4040
Spark context available as 'sc' (master = local[*], app id = local-1525333240695).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.3.0
      /_/
         
Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_112)
Type in expressions to have them evaluated.
Type :help for more information.

scala>

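There's a warning that I don't fully understand. Going by the message, Spark could not load the native Hadoop library and falls back to the built-in Java classes.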

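First, the guide says to create something called a Dataset. Apparently one can be created from a text file too. For now I'll feed it a WAS SystemOut.log that I had lying around. It's about 100 MB.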

scala> val testFile = spark.read.textFile("SystemOut.log")
2018-05-03 17:30:36 WARN  ObjectStore:6666 - Version information not found in metastore. hive.metastore.schema.verification is not enabled so recording the schema version 1.2.0
2018-05-03 17:30:36 WARN  ObjectStore:568 - Failed to get database default, returning NoSuchObjectException
2018-05-03 17:30:38 WARN  ObjectStore:568 - Failed to get database global_temp, returning NoSuchObjectException
testFile: org.apache.spark.sql.Dataset[String] = [value: string]

Warnings show up again, but the Dataset seems to have been created properly.

scala> testFile.count
res4: Long = 160086                                                             

scala> testFile.first
res5: String = ************ Start Display Current Environment ************

count returns the number of records, and first returns the first record.

> wc -l SystemOut.log
  160086 SystemOut.log                                            
> head -n1 SystemOut.log 
************ Start Display Current Environment ************

Yep, they match (of course).
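As a side note, here is a rough sketch of how the same count and first calls might look as a standalone Scala program instead of inside spark-shell. Outside the shell, `spark` and `sc` are not predefined, so the session has to be built by hand. The object name `LogFirstLook`, the helper `countAndFirst`, the app name, and the command-line argument handling are all made up for illustration, and the sketch assumes the Spark SQL library is on the classpath.

```scala
import org.apache.spark.sql.SparkSession

object LogFirstLook {
  // Hypothetical helper: returns (line count, first line) for the text file at `path`.
  // Note that first() throws if the file is empty.
  def countAndFirst(spark: SparkSession, path: String): (Long, String) = {
    val lines = spark.read.textFile(path) // Dataset[String], one element per line
    (lines.count(), lines.first())
  }

  def main(args: Array[String]): Unit = {
    // spark-shell creates this for you; a standalone program has to do it itself.
    val spark = SparkSession.builder()
      .appName("LogFirstLook") // arbitrary name, an assumption
      .master("local[*]")      // same master as in the shell banner above
      .getOrCreate()

    // Same call the shell banner suggests (sc.setLogLevel). It should hide WARN
    // lines logged from here on; messages printed during startup still appear.
    spark.sparkContext.setLogLevel("ERROR")

    // Use the first argument as the path, falling back to the file used in this post.
    val path = args.headOption.getOrElse("SystemOut.log")
    val (count, first) = countAndFirst(spark, path)

    println(s"count: $count")
    println(s"first: $first")

    spark.stop()
  }
}
```

The numbers it prints should agree with wc -l and head -n1, just like the shell session above.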

You can also use filter to create a new Dataset.

scala> val errLine = testFile.filter(line => line.contains("ERROR"))
errLine: org.apache.spark.sql.Dataset[String] = [value: string]

scala> errLine.count
res6: Long = 166

You can also chain the calls and do it in one line.

scala> testFile.filter(line => line.contains("ERROR")).count
res7: Long = 166

In other words, it does the same thing as this:

> grep ERROR SystemOut.log |wc -l
     166

Yep, that matches too (I told you, of course it does).
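To try the filter-then-count chain without a 100 MB log at hand, something like the following minimal sketch should work. It builds a tiny in-memory Dataset[String] with toDS() and counts the lines that contain a keyword. The object name `ErrorLineCount`, the helper `countMatching`, and the four sample lines are invented for illustration and are not taken from the real SystemOut.log.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

object ErrorLineCount {
  // Hypothetical helper: the same filter(...).count chain as in the shell session,
  // with the keyword passed in as a parameter.
  def countMatching(lines: Dataset[String], keyword: String): Long =
    lines.filter(line => line.contains(keyword)).count()

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("ErrorLineCount") // arbitrary name, an assumption
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._ // provides toDS() and the String encoder

    // Made-up sample lines standing in for the log file.
    val sample: Dataset[String] = Seq(
      "INFO  server started",
      "ERROR connection refused",
      "WARN  slow response",
      "ERROR timeout"
    ).toDS()

    println(countMatching(sample, "ERROR")) // prints 2

    spark.stop()
  }
}
```

String.contains is case-sensitive, so a line with only a lowercase "error" would not be counted, which is also how grep behaves without -i.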