Preface

Do not repost without the author's permission.
Author's blog: http://blog.csdn.net/freewebsys

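1, About the ORC library

ORC stands for Optimized Row Columnar. The ORC file format is a columnar storage format in the Hadoop ecosystem. It dates back to early 2013 and originated in Apache Hive, where it was introduced to reduce Hadoop storage space and speed up Hive queries.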
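To make "columnar" concrete, here is a minimal Go sketch of the idea. It is only an illustration of row layout versus column layout, not ORC's actual on-disk encoding; the `row`, `columns`, `toColumns`, and `sumAges` names are hypothetical.

```go
package main

import "fmt"

// row is a record as a row-oriented format would store it.
type row struct {
	Name string
	Age  int
}

// columns keeps each field in its own slice, the way a columnar
// format groups the values of one column together.
type columns struct {
	Names []string
	Ages  []int
}

// toColumns pivots row-oriented records into column-oriented slices.
func toColumns(rows []row) columns {
	c := columns{
		Names: make([]string, 0, len(rows)),
		Ages:  make([]int, 0, len(rows)),
	}
	for _, r := range rows {
		c.Names = append(c.Names, r.Name)
		c.Ages = append(c.Ages, r.Age)
	}
	return c
}

// sumAges reads only the Ages slice; the Names slice is never touched.
func sumAges(c columns) int {
	total := 0
	for _, a := range c.Ages {
		total += a
	}
	return total
}

func main() {
	c := toColumns([]row{{"zhangsan", 13}, {"lisi", 14}})
	fmt.Println(sumAges(c))
}
```

A query that needs one column can skip the others entirely, which is the main reason columnar formats help both with storage (similar values compress well together) and with query speed.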

2, Installation and usage
apt-get -y install build-essential cmake
curl -sSLO  https://dlcdn.apache.org/orc/orc-1.7.3/orc-1.7.3.tar.gz
tar -zxf orc-1.7.3.tar.gz
cd orc-1.7.3
mkdir build
cd build
cmake .. -DBUILD_JAVA=OFF
make install
# The tools are installed directly into /usr/local/bin.

Once installed, the tools can convert CSV files to ORC format and inspect the result:
orc-contents orc-memory orc-metadata orc-scan orc-statistics

# orc-contents 
Usage: orc-contents [options] <filename>...
Options:
	-h --help
	-c --columns		Comma separated list of top-level column fields
	-t --columnTypeIds	Comma separated list of column type ids
	-n --columnNames	Comma separated list of column names
	-b --batch		Batch size for reading
Print contents of ORC files.

# csv-import 
Usage: csv-import [-h] [--help]
                  [-d <character>] [--delimiter=<character>]
                  [-s <size>] [--stripe=<size>]
                  [-c <size>] [--block=<size>]
                  [-b <size>] [--batch=<size>]
                  [-t <string>] [--timezone=<string>]
                  <schema> <input> <output>
Import CSV file into an Orc file using the specified schema.
The timezone is writer timezone of timestamp types.
Compound types are not yet supported.

# orc-metadata 
Usage: orc-metadata [-h] [--help] [-r] [--raw] [-v] [--verbose] <filename>

# Create a CSV file containing exam scores.
vi student.csv
zhangsan,13,100,98
lisi,14,89,88
wangwu,13,60,78
zhaoliu,12,56,67

# Convert it to ORC using a struct schema
csv-import "struct<name:string,age:int,math:int,english:int>" student.csv student.orc
[2022-04-12 14:00:10] Start importing Orc file...
[2022-04-12 14:00:10] Finish importing Orc file.
[2022-04-12 14:00:10] Total writer elasped time: 0.005487s.
[2022-04-12 14:00:10] Total writer CPU time: 0.005466s.
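In the command above, the schema string lists one name:type pair per CSV column, in the same order as the columns in student.csv. The following Go sketch splits such a schema into its fields, which can help when checking that a CSV file and a schema line up. It is a hypothetical helper, not part of the ORC tools: `parseFlatSchema` and `field` are names made up for this example, and it only handles flat structs of simple types.

```go
package main

import (
	"fmt"
	"strings"
)

// field is one name:type pair from a flat struct schema.
type field struct {
	Name string
	Type string
}

// parseFlatSchema splits a schema such as "struct<a:int,b:string>" into
// its fields. Nested or parameterized types are rejected, not parsed.
func parseFlatSchema(schema string) ([]field, error) {
	s := strings.TrimSpace(schema)
	if !strings.HasPrefix(s, "struct<") || !strings.HasSuffix(s, ">") {
		return nil, fmt.Errorf("not a struct schema: %q", schema)
	}
	body := s[len("struct<") : len(s)-1]
	if strings.ContainsAny(body, "<>()") {
		return nil, fmt.Errorf("nested or parameterized types are not handled: %q", schema)
	}
	var fields []field
	for _, part := range strings.Split(body, ",") {
		kv := strings.SplitN(part, ":", 2)
		if len(kv) != 2 {
			return nil, fmt.Errorf("malformed field %q", part)
		}
		name, typ := strings.TrimSpace(kv[0]), strings.TrimSpace(kv[1])
		if name == "" || typ == "" {
			return nil, fmt.Errorf("malformed field %q", part)
		}
		fields = append(fields, field{Name: name, Type: typ})
	}
	return fields, nil
}

func main() {
	fields, err := parseFlatSchema("struct<name:string,age:int,math:int,english:int>")
	if err != nil {
		panic(err)
	}
	// Print which CSV column position each schema field corresponds to.
	for i, f := range fields {
		fmt.Printf("CSV column %d -> %s (%s)\n", i, f.Name, f.Type)
	}
}
```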

# orc-contents  student.orc 
{"name": "zhangsan", "age": 13, "math": 100, "english": 98}
{"name": "lisi", "age": 14, "math": 89, "english": 88}
{"name": "wangwu", "age": 13, "math": 60, "english": 78}
{"name": "zhaoliu", "age": 12, "math": 56, "english": 67}
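Since orc-contents prints one JSON object per line, its output is easy to consume from other programs. The sketch below decodes such lines into Go structs using only the standard library. The `Student` type and `parseContents` function are hypothetical names for this example, and the sample input is copied from the output above.

```go
package main

import (
	"bufio"
	"encoding/json"
	"fmt"
	"strings"
)

// Student mirrors struct<name:string,age:int,math:int,english:int>.
type Student struct {
	Name    string `json:"name"`
	Age     int    `json:"age"`
	Math    int    `json:"math"`
	English int    `json:"english"`
}

// parseContents decodes one JSON object per line and skips blank lines.
func parseContents(output string) ([]Student, error) {
	var students []Student
	sc := bufio.NewScanner(strings.NewReader(output))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if line == "" {
			continue
		}
		var s Student
		if err := json.Unmarshal([]byte(line), &s); err != nil {
			return nil, fmt.Errorf("decode %q: %w", line, err)
		}
		students = append(students, s)
	}
	return students, sc.Err()
}

func main() {
	// Two lines copied from the orc-contents output.
	sample := `{"name": "zhangsan", "age": 13, "math": 100, "english": 98}
{"name": "lisi", "age": 14, "math": 89, "english": 88}`
	students, err := parseContents(sample)
	if err != nil {
		panic(err)
	}
	for _, s := range students {
		fmt.Printf("%s total=%d\n", s.Name, s.Math+s.English)
	}
}
```

In practice the input would come from piping `orc-contents student.orc` into the program and reading os.Stdin instead of a string.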

# Show a single column
# orc-contents -n name  student.orc 
{"name": "zhangsan"}
{"name": "lisi"}
{"name": "wangwu"}
{"name": "zhaoliu"}

# Statistics
# orc-statistics student.orc 
File student.orc has 5 columns
*** Column 0 ***
Column has 4 values and has null value: no

*** Column 1 ***
Data type: String
Values: 4
Has null: no
Minimum: lisi
Maximum: zhaoliu
Total length: 25

*** Column 2 ***
Data type: Integer
Values: 4
Has null: no
Minimum: 12
Maximum: 14
Sum: 52
...
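The figures reported for an integer column can be cross-checked against the source CSV. The sketch below recomputes the count, minimum, maximum, and sum for one CSV column with the Go standard library; `intStats` and `columnIntStats` are hypothetical names for this example. Note that CSV columns are zero-based here, so the age field (Column 2 in the orc-statistics output, because Column 0 is the root struct) is CSV column 1.

```go
package main

import (
	"encoding/csv"
	"fmt"
	"strconv"
	"strings"
)

// intStats holds the figures orc-statistics prints for an integer column.
type intStats struct {
	Count, Min, Max, Sum int
}

// columnIntStats computes count, min, max and sum for one CSV column.
func columnIntStats(data string, col int) (intStats, error) {
	rows, err := csv.NewReader(strings.NewReader(data)).ReadAll()
	if err != nil {
		return intStats{}, err
	}
	var st intStats
	for i, row := range rows {
		if col < 0 || col >= len(row) {
			return intStats{}, fmt.Errorf("row %d has no column %d", i+1, col)
		}
		v, err := strconv.Atoi(strings.TrimSpace(row[col]))
		if err != nil {
			return intStats{}, fmt.Errorf("row %d: %w", i+1, err)
		}
		if st.Count == 0 || v < st.Min {
			st.Min = v
		}
		if st.Count == 0 || v > st.Max {
			st.Max = v
		}
		st.Sum += v
		st.Count++
	}
	return st, nil
}

func main() {
	// Same rows as student.csv.
	data := "zhangsan,13,100,98\nlisi,14,89,88\nwangwu,13,60,78\nzhaoliu,12,56,67\n"
	st, err := columnIntStats(data, 1) // CSV column 1 is age
	if err != nil {
		panic(err)
	}
	fmt.Printf("Values: %d Minimum: %d Maximum: %d Sum: %d\n", st.Count, st.Min, st.Max, st.Sum)
}
```

For the age column this gives 4 values, a minimum of 12, a maximum of 14, and a sum of 52, matching the Column 2 statistics above.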

3, Go code

In Go, the github.com/scritchley/orc library can open an ORC file directly and read its rows.

package main

import (
	"fmt"
	"log"
	"testing"

	"github.com/scritchley/orc"
)

func TestReadNullAtEnd(t *testing.T) {
	// Open the ORC file produced by csv-import.
	r, err := orc.Open("student.orc")
	if err != nil {
		log.Fatal(err)
	}

	// Select every column in the file schema.
	selected := r.Schema().Columns()
	c := r.Select(selected...)
	defer c.Close()

	// Destination slices for one row, one entry per selected column.
	vals := make([]interface{}, len(selected))
	ptrVals := make([]interface{}, len(selected))
	strVals := make([]string, len(selected))
	for i := range vals {
		ptrVals[i] = &vals[i]
	}

	// Iterate over each stripe, then over each row in the stripe.
	for c.Stripes() {
		for c.Next() {
			if err := c.Scan(ptrVals...); err != nil {
				log.Fatal(err)
			}
			// Log each column value of the current row on its own line.
			for i := range ptrVals {
				strVals[i] = fmt.Sprint(ptrVals[i])
				log.Println(strVals[i])
			}
		}
	}

	// Report any error encountered during iteration.
	if err := c.Err(); err != nil {
		log.Fatal(err)
	}
}
# go test -v orc_read_test.go 
=== RUN   TestReadNullAtEnd
2022/04/12 14:13:56 zhangsan
2022/04/12 14:13:56 13
2022/04/12 14:13:56 100
2022/04/12 14:13:56 98
2022/04/12 14:13:56 lisi
2022/04/12 14:13:56 14
2022/04/12 14:13:56 89
2022/04/12 14:13:56 88
2022/04/12 14:13:56 wangwu
2022/04/12 14:13:56 13
2022/04/12 14:13:56 60
2022/04/12 14:13:56 78
2022/04/12 14:13:56 zhaoliu
2022/04/12 14:13:56 12
2022/04/12 14:13:56 56
2022/04/12 14:13:56 67
--- PASS: TestReadNullAtEnd (0.00s)
PASS
ok  	command-line-arguments	0.008s

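4, ORC summary

Data files in ORC format are easy to work with, both with the command-line tools and from Go.
ORC is a widely used format in big data processing, and it is straightforward to get started with.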