|
| 1 | +# hive-third-functions |
| 2 | + |
| 3 | +[](https://travis-ci.org/aaronshan/hive-third-functions) |
| 4 | +[](https://github.com/aaronshan/hive-third-functions/tree/master/README.md) |
| 5 | +[](https://github.com/aaronshan/hive-third-functions/tree/master/README-zh.md) |
| 6 | +[](https://github.com/aaronshan/hive-third-functions/releases) |
| 7 | + |
| 8 | +## Introduction |
| 9 | + |
| 10 | +hive-third-functions provides a set of useful Hive UDFs, in particular array and JSON functions. |
| 11 | + |
| 12 | +> Note: |
| 13 | +> hive-third-functions supports Hive 0.11.0 or later. |
| 14 | +
|
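| 15 | +## Build |
| 16 | + |
| 17 | +### 1. Install dependencies |
| 18 | + |
| 19 | +jdo2-api-2.3-ec.jar is currently no longer available in the Maven Central repository, so you have to download it yourself and install it into your local Maven repository. The commands are: |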
| 20 | + |
| 21 | +``` |
| 22 | +wget http://www.datanucleus.org/downloads/maven2/javax/jdo/jdo2-api/2.3-ec/jdo2-api-2.3-ec.jar -O ~/jdo2-api-2.3-ec.jar |
| 23 | +mvn install:install-file -DgroupId=javax.jdo -DartifactId=jdo2-api -Dversion=2.3-ec -Dpackaging=jar -Dfile="$HOME/jdo2-api-2.3-ec.jar" |
| 24 | +``` |
| 25 | + |
| 26 | +### 2. Package with mvn |
| 27 | + |
| 28 | +``` |
| 29 | +cd ${project_home} |
| 30 | +mvn clean package |
| 31 | +``` |
| 32 | + |
| 33 | +To skip the unit tests, run: |
| 34 | +``` |
| 35 | +cd ${project_home} |
| 36 | +mvn clean package -DskipTests |
| 37 | +``` |
| 38 | + |
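| 39 | +When the command finishes, the file hive-third-functions-${version}-shaded.jar is generated in the target directory. |
| 40 | + |
| 41 | +You can also download the latest prebuilt package directly from the [release page](https://github.com/aaronshan/hive-third-functions/releases). |
| 42 | + |
| 43 | +> The current latest version is `2.1.2` |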
| 44 | +
|
| 45 | +## Functions |
| 46 | + |
| 47 | +### 1. String functions |
| 48 | + |
| 49 | +| Function | Description | |
| 50 | +|:--|:--| |
| 51 | +|pinyin(string) -> string | converts Chinese characters to pinyin.| |
| 52 | +|md5(string) -> string | MD5 hash.| |
| 53 | +|sha256(string) -> string | SHA-256 hash.| |
| 54 | + |
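| 55 | +### 2. Array functions |
| 56 | + |
| 57 | +| Function | Description | |
| 58 | +|:--|:--| |
| 59 | +|array_contains(array<E>, E) -> boolean | checks whether the array contains the given value.| |
| 60 | +|array_equals(array<E>, array<E>) -> boolean | checks whether two arrays are equal.| |
| 61 | +|array_intersect(array, array) -> array | returns the intersection of two arrays.| |
| 62 | +|array_max(array<E>) -> E | returns the maximum value in the array.| |
| 63 | +|array_min(array<E>) -> E | returns the minimum value in the array.| |
| 64 | +|array_join(array, delimiter, null_replacement) -> string | joins the array elements using the given delimiter. `null_replacement` is optional and is substituted for null values.| |
| 65 | +|array_distinct(array) -> array | removes duplicate elements from the array.| |
| 66 | +|array_position(array<E>, E) -> long | returns the position of the first occurrence of the given element in the array (0 if not found).| |
| 67 | +|array_remove(array<E>, E) -> array | removes the given element from the array.| |
| 68 | +|array_reverse(array) -> array | reverses the array.| |
| 69 | +|array_sort(array) -> array | sorts the array. The elements must be orderable.| |
| 70 | +|array_concat(array, array) -> array | concatenates two arrays.| |
| 71 | +|array_value_count(array<E>, E) -> long | counts how many times the given element occurs in the array.| |
| 72 | +|array_slice(array, start, length) -> array | slices the array. A positive `start` counts from the beginning, a negative `start` counts from the end; the result contains up to `length` elements.| |
| 73 | +|array_element_at(array<E>, index) -> E | returns the element at the given index. If the index is < 0, the position is counted from the end.| |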
| 74 | + |
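The indexing rules of `array_position`, `array_element_at` and `array_slice` are easy to get wrong, so here is a minimal Python sketch of the semantics described in the table. It is not the library's Java implementation. The 1-based positive index for `array_element_at` and `array_slice`, and the empty/`None` results for out-of-range arguments, are assumptions (borrowed from the Presto functions of the same name); only the 1-based result of `array_position` and the negative-index behaviour are confirmed by this README.

```python
from typing import List, Optional, TypeVar

T = TypeVar("T")


def array_position(arr: List[T], value: T) -> int:
    """1-based position of the first match, or 0 when the value is absent."""
    for position, item in enumerate(arr, start=1):
        if item == value:
            return position
    return 0


def array_element_at(arr: List[T], index: int) -> Optional[T]:
    """Assumed 1-based lookup; a negative index counts back from the end."""
    if index == 0 or abs(index) > len(arr):
        return None  # assumption: out-of-range behaves like NULL
    return arr[index - 1] if index > 0 else arr[index]


def array_slice(arr: List[T], start: int, length: int) -> List[T]:
    """Up to `length` elements beginning at `start` (assumed 1-based)."""
    if start == 0 or length <= 0 or abs(start) > len(arr):
        return []  # assumption: invalid arguments give an empty array
    begin = start - 1 if start > 0 else len(arr) + start
    return arr[begin:begin + length]


data = [16, 13, 12, 13, 18, 16, 9, 18]
print(array_position(data, 13))    # 2
print(array_element_at(data, -1))  # 18
print(array_slice(data, -2, 3))    # [9, 18]
```

Note that `array_slice(data, -2, 3)` returns only two elements: the slice stops at the end of the array even though three were requested.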
| 75 | +### 3. Map functions |
| 76 | +| Function | Description | |
| 77 | +|:--|:--| |
| 78 | +|map_build(x<K>, y<V>) -> map<K, V>| creates a map from the given key array and value array.| |
| 79 | +|map_concat(x<K, V>, y<K, V>) -> map<K,V> | returns the union of two maps. If a key appears in both `x` and `y`, the value comes from `y`.| |
| 80 | +|map_element_at(map<K, V>, key) -> V | returns the value for `key` if it exists, otherwise returns `NULL`.| |
| 81 | +|map_equals(x<K, V>, y<K, V>) -> boolean | checks whether map x and map y are equal.| |
| 82 | + |
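The map functions above correspond closely to ordinary dictionary operations. The following Python sketch only illustrates the described semantics and is not the library's code. Raising an error when the key and value arrays differ in length is an assumption.

```python
from typing import Dict, Hashable, List, Optional, TypeVar

K = TypeVar("K", bound=Hashable)
V = TypeVar("V")


def map_build(keys: List[K], values: List[V]) -> Dict[K, V]:
    """Pair keys[i] with values[i]."""
    if len(keys) != len(values):
        # assumption: mismatched lengths are rejected
        raise ValueError("key and value arrays must have the same length")
    return dict(zip(keys, values))


def map_concat(x: Dict[K, V], y: Dict[K, V]) -> Dict[K, V]:
    """Union of both maps; on a duplicate key the value from `y` wins."""
    return {**x, **y}


def map_element_at(m: Dict[K, V], key: K) -> Optional[V]:
    """Value for `key`, or None (standing in for NULL) when it is missing."""
    return m.get(key)


left = map_build(["key1", "key2"], [16, 12])
right = map_build(["key1", "key3"], [17, 18])
print(map_concat(left, right))         # {'key1': 17, 'key2': 12, 'key3': 18}
print(map_element_at(left, "key1"))    # 16
print(map_element_at(left, "absent"))  # None
```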
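| 83 | +### 4. Date functions |
| 84 | + |
| 85 | +| Function | Description | |
| 86 | +|:--|:--| |
| 87 | +|day_of_week(date_string \| date) -> int | day of the week: Monday returns 1, Sunday returns 7; returns null on error.| |
| 88 | +|day_of_year(date_string \| date) -> int | day of the year. The value ranges from 1 to 366.| |
| 89 | +|zodiac_en(date_string \| date) -> string | converts the date to its zodiac sign in English.| |
| 90 | +|zodiac_cn(date_string \| date) -> string | converts the date to its zodiac sign in Chinese.| |
| 91 | +|type_of_day(date_string \| date) -> string | returns the type of the day (1: public holiday, 2: regular weekend, 3: regular workday, 4: make-up workday, i.e. a weekend day worked to offset a holiday); returns -1 on error.| |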
| 92 | + |
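The numbering convention of `day_of_week` (Monday = 1, Sunday = 7) is the ISO 8601 convention. The Python sketch below reproduces it, along with `day_of_year`. It is an illustration, not the UDF source. The `yyyy-MM-dd` input format is taken from the examples in this README, and returning `None` from `day_of_year` on bad input is an assumption.

```python
from datetime import date, datetime
from typing import Optional, Union

DateLike = Union[str, date]


def _to_date(value: DateLike) -> date:
    """Accept a date/datetime object or a 'YYYY-MM-DD' string."""
    if isinstance(value, datetime):
        return value.date()
    if isinstance(value, date):
        return value
    return datetime.strptime(value, "%Y-%m-%d").date()


def day_of_week(value: DateLike) -> Optional[int]:
    """Monday -> 1 ... Sunday -> 7; None (NULL) when the input is invalid."""
    try:
        return _to_date(value).isoweekday()
    except (TypeError, ValueError):
        return None


def day_of_year(value: DateLike) -> Optional[int]:
    """1..366; None on invalid input (assumption)."""
    try:
        return _to_date(value).timetuple().tm_yday
    except (TypeError, ValueError):
        return None


print(day_of_week("2016-07-12"))  # 2 (a Tuesday)
print(day_of_year("2016-01-01"))  # 1
print(day_of_year("2016-12-31"))  # 366 (2016 is a leap year)
print(day_of_week("not a date"))  # None
```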
| 93 | +### 5. JSON functions |
| 94 | + |
| 95 | +| Function | Description | |
| 96 | +|:--|:--| |
| 97 | +|json_array_get(json_array, index) -> varchar |returns the element at the specified index into the `json_array`. The index is zero-based; a negative index counts from the end.| |
| 98 | +|json_array_length(json) -> bigint |returns the array length of `json` (a string containing a JSON array).| |
| 99 | +|json_array_extract(json, jsonPath) -> array(varchar) |extracts a JSON array by the given jsonPath.| |
| 100 | +|json_array_extract_scalar(json, jsonPath) -> array(varchar) |like `json_array_extract`, but returns the result values as strings (as opposed to being encoded as JSON).| |
| 101 | +|json_extract(json, jsonPath) -> json |extracts JSON by the given jsonPath.| |
| 102 | +|json_extract_scalar(json, jsonPath) -> varchar |like `json_extract`, but returns the result value as a string (as opposed to being encoded as JSON).| |
| 103 | +|json_size(json, jsonPath) -> bigint |like `json_extract`, but returns the size of the value. For objects or arrays, the size is the number of members, and the size of a scalar value is zero.| |
| 104 | + |
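To make the indexing and size rules above concrete, here is a small Python sketch of `json_array_get` and `json_size`. It is a simplified illustration under stated assumptions, not the library's implementation: the path resolver below understands only `$` followed by `.name` steps (no `[n]` subscripts), and returning `None` for unparsable input is an assumption.

```python
import json
from typing import Optional


def json_array_get(json_array: str, index: int) -> Optional[str]:
    """Zero-based element lookup; a negative index counts from the end."""
    try:
        items = json.loads(json_array)
    except ValueError:
        return None  # assumption: invalid JSON yields NULL
    if not isinstance(items, list) or not -len(items) <= index < len(items):
        return None
    item = items[index]
    # Strings come back bare; other values are re-encoded as compact JSON.
    return item if isinstance(item, str) else json.dumps(item, separators=(",", ":"))


def json_size(document: str, path: str) -> Optional[int]:
    """Member count of the object/array at `path`; 0 for a scalar."""
    try:
        node = json.loads(document)
    except ValueError:
        return None
    if not path.startswith("$"):
        return None
    for key in filter(None, path[1:].split(".")):  # simplified path walk
        if not isinstance(node, dict) or key not in node:
            return None
        node = node[key]
    return len(node) if isinstance(node, (dict, list)) else 0


print(json_array_get('["c", "b", "a"]', -1))               # a
print(json_array_get('["a", "b", "c"]', 10))               # None
print(json_size('{"x": {"a": 1, "b": 2}}', "$.x"))         # 2
print(json_size('{"x": {"a": 1, "b": 2}}', "$.x.a"))       # 0
```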
| 105 | +### 6. Bitwise functions |
| 106 | + |
| 107 | +| Function | Description | |
| 108 | +|:--|:--| |
| 109 | +|bit_count(x, bits) -> bigint | count the number of bits set in `x` (treated as bits-bit signed integer) in 2’s complement representation | |
| 110 | +|bitwise_and(x, y) -> bigint | returns the bitwise AND of `x` and `y` in 2’s complement arithmetic.| |
| 111 | +|bitwise_not(x) -> bigint | returns the bitwise NOT of `x` in 2’s complement arithmetic. | |
| 112 | +|bitwise_or(x, y) -> bigint | returns the bitwise OR of `x` and `y` in 2’s complement arithmetic.| |
| 113 | +|bitwise_xor(x, y) -> bigint | returns the bitwise XOR of `x` and `y` in 2’s complement arithmetic. | |
| 114 | + |
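The `bits` argument of `bit_count` matters for negative numbers, because the two's complement pattern of a negative value depends on the word width. A minimal Python sketch of that behaviour follows. It is an illustration only. The range checks are assumptions modelled on the Presto function of the same name, not confirmed behaviour of this UDF.

```python
def bit_count(x: int, bits: int) -> int:
    """Count set bits of `x` viewed as a `bits`-wide two's complement integer."""
    if not 2 <= bits <= 64:
        raise ValueError("bits must be between 2 and 64")  # assumption
    lowest, highest = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    if not lowest <= x <= highest:
        raise ValueError(f"{x} does not fit in {bits} bits")  # assumption
    # Masking maps a negative value onto its unsigned two's complement pattern.
    return bin(x & ((1 << bits) - 1)).count("1")


print(bit_count(9, 64))   # 2  (binary 1001)
print(bit_count(-7, 8))   # 6  (binary 11111001)
print(bit_count(-7, 64))  # 62
```

Python's own `&`, `|`, `^` and `~` operators already follow two's complement rules for integers, so `~5` evaluates to `-6`, which is the result `bitwise_not` describes.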
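| 115 | +### 7. Chinese ID card functions |
| 116 | + |
| 117 | +| Function | Description | |
| 118 | +|:--|:--| |
| 119 | +|id_card_province(string) -> string |returns the province from the ID card number.| |
| 120 | +|id_card_city(string) -> string |returns the city from the ID card number.| |
| 121 | +|id_card_area(string) -> string |returns the district/county from the ID card number.| |
| 122 | +|id_card_birthday(string) -> string |returns the birthday from the ID card number.| |
| 123 | +|id_card_gender(string) -> string |returns the gender from the ID card number.| |
| 124 | +|is_valid_id_card(string) -> boolean |checks whether the ID card number is valid.| |
| 125 | +|id_card_info(string) -> json |returns information about the ID card number, including province, city, district/county, etc.| |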
| 126 | + |
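One part of validating an 18-character ID card number is the check character defined by the national standard GB 11643-1999 (an ISO 7064 MOD 11-2 scheme). The Python sketch below shows only that checksum step plus extraction of the birth-date digits. It is an assumption that `is_valid_id_card` performs this check at all, and the real function may apply further rules (region code, date validity) that are not covered here. The function names are hypothetical.

```python
# Position weights and check characters from GB 11643-1999.
WEIGHTS = (7, 9, 10, 5, 8, 4, 2, 1, 6, 3, 7, 9, 10, 5, 8, 4, 2)
CHECK_CHARS = "10X98765432"


def has_valid_checksum(id_number: str) -> bool:
    """True when the 18th character matches the weighted sum of the first 17."""
    body = id_number[:17]
    if len(id_number) != 18 or not (body.isascii() and body.isdigit()):
        return False
    total = sum(int(digit) * weight for digit, weight in zip(body, WEIGHTS))
    return CHECK_CHARS[total % 11] == id_number[17].upper()


def birth_digits(id_number: str) -> str:
    """Characters 7-14 hold the birth date as YYYYMMDD (hypothetical helper)."""
    return id_number[6:14]


sample = "110101198901084517"  # the number used in this README's example
print(has_valid_checksum(sample))                # True
print(has_valid_checksum("110101198901084518"))  # False
print(birth_digits(sample))                      # 19890108
```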
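| 127 | +### 8. Coordinate system functions |
| 128 | + |
| 129 | +| Function | Description | |
| 130 | +|:--|:--| |
| 131 | +|wgs_distance(double lat1, double lng1, double lat2, double lng2) -> double | computes the distance between two WGS84 coordinates, in meters.| |
| 132 | +|gcj_to_bd(double,double) -> json | converts GCJ-02 (Mars coordinate system) to BD-09 (Baidu coordinate system), i.e. Google/AMap -> Baidu.| |
| 133 | +|bd_to_gcj(double,double) -> json | converts BD-09 (Baidu coordinate system) to GCJ-02 (Mars coordinate system), i.e. Baidu -> Google/AMap.| |
| 134 | +|wgs_to_gcj(double,double) -> json | converts WGS84 (Earth coordinate system) to GCJ-02 (Mars coordinate system).| |
| 135 | +|gcj_to_wgs(double,double) -> json | converts GCJ-02 (Mars coordinate system) to WGS84 (Earth coordinate system). The output is accurate to within 1 to 2 meters.| |
| 136 | +|gcj_extract_wgs(double,double) -> json | converts GCJ-02 (Mars coordinate system) to WGS84. The output is accurate to within 0.5 meters, but the computation is slower than `gcj_to_wgs`.| |
| 137 | + |
| 138 | +> For notes on the coordinate systems used by internet maps, see: [The current state of coordinate systems in internet maps](https://github.com/aaronshan/hive-third-functions/tree/master/README-geo.md) |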
| 139 | +
|
| 140 | + |
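A common way to compute the distance in meters between two latitude/longitude pairs is the haversine formula on a spherical Earth. The Python sketch below uses it to illustrate what a function like `wgs_distance` returns. Both the formula and the mean Earth radius of 6,371,000 m are assumptions: this README does not say which formula or radius the UDF uses, so its results may differ slightly.

```python
import math

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius; an assumption, not the UDF's constant


def wgs_distance(lat1: float, lng1: float, lat2: float, lng2: float) -> float:
    """Great-circle distance in meters between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lng2 - lng1)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    # min() guards against rounding pushing sqrt(a) marginally above 1.
    return 2 * EARTH_RADIUS_M * math.asin(min(1.0, math.sqrt(a)))


# One degree of latitude along a meridian is roughly 111.2 km on this sphere.
print(round(wgs_distance(0.0, 0.0, 1.0, 0.0)))
print(wgs_distance(39.915, 116.404, 39.915, 116.404))  # 0.0
```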
| 141 | +### 9. URL functions |
| 142 | + |
| 143 | +| Function | Description | |
| 144 | +|:--|:--| |
| 145 | +|url_encode(value) -> string | escapes value by encoding it so that it can be safely included in URL query parameter names and values| |
| 146 | +|url_decode(value) -> string | unescape the URL encoded value. This function is the inverse of `url_encode`. | |
| 147 | + |
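For readers who want to check expected values outside Hive, Python's standard library offers an equivalent form-style encoding. The sketch below assumes that `url_encode` percent-encodes every reserved character as UTF-8 and writes a space as `+` (the behaviour of Java's `URLEncoder`). The README's own example confirms only the percent-encoding of `:` and `/`.

```python
from urllib.parse import quote_plus, unquote_plus


def url_encode(value: str) -> str:
    """Percent-encode for use in a query parameter name or value."""
    return quote_plus(value, encoding="utf-8")


def url_decode(value: str) -> str:
    """Inverse of url_encode."""
    return unquote_plus(value, encoding="utf-8")


encoded = url_encode("http://shanruifeng.cc/")
print(encoded)              # http%3A%2F%2Fshanruifeng.cc%2F
print(url_decode(encoded))  # http://shanruifeng.cc/
```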
| 148 | +## Usage |
| 149 | + |
| 150 | +Put the following into your `${HOME}/.hiverc` file, or run only the statements you need in the Hive CLI. |
| 151 | + |
| 152 | +``` |
| 153 | +add jar ${jar_location_dir}/hive-third-functions-${version}-shaded.jar; |
| 154 | +create temporary function array_contains as 'cc.shanruifeng.functions.array.UDFArrayContains'; |
| 155 | +create temporary function array_equals as 'cc.shanruifeng.functions.array.UDFArrayEquals'; |
| 156 | +create temporary function array_intersect as 'cc.shanruifeng.functions.array.UDFArrayIntersect'; |
| 157 | +create temporary function array_max as 'cc.shanruifeng.functions.array.UDFArrayMax'; |
| 158 | +create temporary function array_min as 'cc.shanruifeng.functions.array.UDFArrayMin'; |
| 159 | +create temporary function array_join as 'cc.shanruifeng.functions.array.UDFArrayJoin'; |
| 160 | +create temporary function array_distinct as 'cc.shanruifeng.functions.array.UDFArrayDistinct'; |
| 161 | +create temporary function array_position as 'cc.shanruifeng.functions.array.UDFArrayPosition'; |
| 162 | +create temporary function array_remove as 'cc.shanruifeng.functions.array.UDFArrayRemove'; |
| 163 | +create temporary function array_reverse as 'cc.shanruifeng.functions.array.UDFArrayReverse'; |
| 164 | +create temporary function array_sort as 'cc.shanruifeng.functions.array.UDFArraySort'; |
| 165 | +create temporary function array_concat as 'cc.shanruifeng.functions.array.UDFArrayConcat'; |
| 166 | +create temporary function array_value_count as 'cc.shanruifeng.functions.array.UDFArrayValueCount'; |
| 167 | +create temporary function array_slice as 'cc.shanruifeng.functions.array.UDFArraySlice'; |
| 168 | +create temporary function array_element_at as 'cc.shanruifeng.functions.array.UDFArrayElementAt'; |
| 169 | +create temporary function bit_count as 'cc.shanruifeng.functions.bitwise.UDFBitCount'; |
| 170 | +create temporary function bitwise_and as 'cc.shanruifeng.functions.bitwise.UDFBitwiseAnd'; |
| 171 | +create temporary function bitwise_not as 'cc.shanruifeng.functions.bitwise.UDFBitwiseNot'; |
| 172 | +create temporary function bitwise_or as 'cc.shanruifeng.functions.bitwise.UDFBitwiseOr'; |
| 173 | +create temporary function bitwise_xor as 'cc.shanruifeng.functions.bitwise.UDFBitwiseXor'; |
| 174 | +create temporary function map_build as 'cc.shanruifeng.functions.map.UDFMapBuild'; |
| 175 | +create temporary function map_concat as 'cc.shanruifeng.functions.map.UDFMapConcat'; |
| 176 | +create temporary function map_element_at as 'cc.shanruifeng.functions.map.UDFMapElementAt'; |
| 177 | +create temporary function map_equals as 'cc.shanruifeng.functions.map.UDFMapEquals'; |
| 178 | +create temporary function day_of_week as 'cc.shanruifeng.functions.date.UDFDayOfWeek'; |
| 179 | +create temporary function day_of_year as 'cc.shanruifeng.functions.date.UDFDayOfYear'; |
| 180 | +create temporary function type_of_day as 'cc.shanruifeng.functions.date.UDFTypeOfDay'; |
| 181 | +create temporary function zodiac_cn as 'cc.shanruifeng.functions.date.UDFZodiacSignCn'; |
| 182 | +create temporary function zodiac_en as 'cc.shanruifeng.functions.date.UDFZodiacSignEn'; |
| 183 | +create temporary function pinyin as 'cc.shanruifeng.functions.string.UDFChineseToPinYin'; |
| 184 | +create temporary function md5 as 'cc.shanruifeng.functions.string.UDFMd5'; |
| 185 | +create temporary function sha256 as 'cc.shanruifeng.functions.string.UDFSha256'; |
| 186 | +create temporary function json_array_get as 'cc.shanruifeng.functions.json.UDFJsonArrayGet'; |
| 187 | +create temporary function json_array_length as 'cc.shanruifeng.functions.json.UDFJsonArrayLength'; |
| 188 | +create temporary function json_array_extract as 'cc.shanruifeng.functions.json.UDFJsonArrayExtract'; |
| 189 | +create temporary function json_array_extract_scalar as 'cc.shanruifeng.functions.json.UDFJsonArrayExtractScalar'; |
| 190 | +create temporary function json_extract as 'cc.shanruifeng.functions.json.UDFJsonExtract'; |
| 191 | +create temporary function json_extract_scalar as 'cc.shanruifeng.functions.json.UDFJsonExtractScalar'; |
| 192 | +create temporary function json_size as 'cc.shanruifeng.functions.json.UDFJsonSize'; |
| 193 | +create temporary function id_card_province as 'cc.shanruifeng.functions.card.UDFChinaIdCardProvince'; |
| 194 | +create temporary function id_card_city as 'cc.shanruifeng.functions.card.UDFChinaIdCardCity'; |
| 195 | +create temporary function id_card_area as 'cc.shanruifeng.functions.card.UDFChinaIdCardArea'; |
| 196 | +create temporary function id_card_birthday as 'cc.shanruifeng.functions.card.UDFChinaIdCardBirthday'; |
| 197 | +create temporary function id_card_gender as 'cc.shanruifeng.functions.card.UDFChinaIdCardGender'; |
| 198 | +create temporary function is_valid_id_card as 'cc.shanruifeng.functions.card.UDFChinaIdCardValid'; |
| 199 | +create temporary function id_card_info as 'cc.shanruifeng.functions.card.UDFChinaIdCardInfo'; |
| 200 | +create temporary function wgs_distance as 'cc.shanruifeng.functions.geo.UDFGeoWgsDistance'; |
| 201 | +create temporary function gcj_to_bd as 'cc.shanruifeng.functions.geo.UDFGeoGcjToBd'; |
| 202 | +create temporary function bd_to_gcj as 'cc.shanruifeng.functions.geo.UDFGeoBdToGcj'; |
| 203 | +create temporary function wgs_to_gcj as 'cc.shanruifeng.functions.geo.UDFGeoWgsToGcj'; |
| 204 | +create temporary function gcj_to_wgs as 'cc.shanruifeng.functions.geo.UDFGeoGcjToWgs'; |
| 205 | +create temporary function gcj_extract_wgs as 'cc.shanruifeng.functions.geo.UDFGeoGcjExtractWgs'; |
| 206 | +create temporary function url_encode as 'cc.shanruifeng.functions.url.UDFUrlEncode'; |
| 207 | +create temporary function url_decode as 'cc.shanruifeng.functions.url.UDFUrlDecode'; |
| 208 | +``` |
| 209 | + |
| 210 | +You can use the following statement in the Hive CLI to view the details of a function. |
| 211 | +``` |
| 212 | +hive> describe function zodiac_cn; |
| 213 | +zodiac_cn(date) - from the input date string or separate month and day arguments, returns the sing of the Zodiac. |
| 214 | +``` |
| 215 | + |
| 216 | +Or |
| 217 | + |
| 218 | +``` |
| 219 | +hive> describe function extended zodiac_cn; |
| 220 | +zodiac_cn(date) - from the input date string or separate month and day arguments, returns the sing of the Zodiac. |
| 221 | +Example: |
| 222 | + > select zodiac_cn(date_string) from src; |
| 223 | + > select zodiac_cn(month, day) from src; |
| 224 | +``` |
| 225 | + |
| 226 | +### Examples |
| 227 | +``` |
| 228 | + select pinyin('中国') => zhongguo |
| 229 | + select md5('aaronshan') => 95686bc0483262afe170b550dd4544d1 |
| 230 | + select sha256('aaronshan') => d16bb375433ad383169f911afdf45e209eabfcf047ba1faebdd8f6a0b39e0a32 |
| 231 | +``` |
| 232 | + |
| 233 | +``` |
| 234 | +select day_of_week('2016-07-12') => 2 |
| 235 | +select day_of_year('2016-01-01') => 1 |
| 236 | +select type_of_day('2016-10-01') => 1 |
| 237 | +select type_of_day('2016-07-16') => 2 |
| 238 | +select type_of_day('2016-07-15') => 3 |
| 239 | +select type_of_day('2016-09-18') => 4 |
| 240 | +select zodiac_cn('1989-01-08') => 魔羯座 |
| 241 | +select zodiac_en('1989-01-08') => Capricorn |
| 242 | +``` |
| 243 | + |
| 244 | +``` |
| 245 | +select array_contains(array(16,12,18,9), 12) => true |
| 246 | +select array_equals(array(16,12,18,9), array(16,12,18,9)) => true |
| 247 | +select array_intersect(array(16,12,18,9,null), array(14,9,6,18,null)) => [null,9,18] |
| 248 | +select array_max(array(16,13,12,13,18,16,9,18)) => 18 |
| 249 | +select array_min(array(16,12,18,9)) => 9 |
| 250 | +select array_join(array(16,12,18,9,null), '#','=') => 16#12#18#9#= |
| 251 | +select array_distinct(array(16,13,12,13,18,16,9,18)) => [9,12,13,16,18] |
| 252 | +select array_position(array(16,13,12,13,18,16,9,18), 13) => 2 |
| 253 | +select array_remove(array(16,13,12,13,18,16,9,18), 13) => [16,12,18,16,9,18] |
| 254 | +select array_reverse(array(16,12,18,9)) => [9,18,12,16] |
| 255 | +select array_sort(array(16,13,12,13,18,16,9,18)) => [9,12,13,13,16,16,18,18] |
| 256 | +select array_concat(array(16,12,18,9,null), array(14,9,6,18,null)) => [16,12,18,9,null,14,9,6,18,null] |
| 257 | +select array_value_count(array(16,13,12,13,18,16,9,18), 13) => 2 |
| 258 | +select array_slice(array(16,13,12,13,18,16,9,18), -2, 3) => [9,18] |
| 259 | +select array_element_at(array(16,13,12,13,18,16,9,18), -1) => 18 |
| 260 | +``` |
| 261 | + |
| 262 | +``` |
| 263 | +select map_build(array('key1','key2'), array(16,12)) => {"key1":16,"key2":12} |
| 264 | +select map_concat(map_build(array('key1','key2'), array(16,12)), map_build(array('key1','key3'), array(17,18))) => {"key1":17,"key2":12,"key3":18} |
| 265 | +select map_element_at(map_build(array('key1','key2'), array(16,12)), 'key1') => 16 |
| 266 | +select map_equals(map_build(array('key1','key2'), array(16,12)), map_build(array('key1','key2'), array(16,12))) => true |
| 267 | +``` |
| 268 | + |
| 269 | +``` |
| 270 | +select id_card_info('110101198901084517') => {"valid":true,"area":"东城区","province":"北京市","gender":"男","city":"北京市"} |
| 271 | +``` |
| 272 | + |
| 273 | +``` |
| 274 | +select json_array_get("[{\"a\":{\"b\":\"13\"}}, {\"a\":{\"b\":\"18\"}}, {\"a\":{\"b\":\"12\"}}]", 1); => {"a":{"b":"18"}} |
| 275 | +select json_array_get('["a", "b", "c"]', 0); => a |
| 276 | +select json_array_get('["a", "b", "c"]', 1); => b |
| 277 | +select json_array_get('["c", "b", "a"]', -1); => a |
| 278 | +select json_array_get('["c", "b", "a"]', -2); => b |
| 279 | +select json_array_get('[]', 0); => null |
| 280 | +select json_array_get('["a", "b", "c"]', 10); => null |
| 281 | +select json_array_get('["c", "b", "a"]', -10); => null |
| 282 | +select json_array_length("[{\"a\":{\"b\":\"13\"}}, {\"a\":{\"b\":\"18\"}}, {\"a\":{\"b\":\"12\"}}]"); => 3 |
| 283 | +select json_array_extract("[{\"a\":{\"b\":\"13\"}}, {\"a\":{\"b\":\"18\"}}, {\"a\":{\"b\":\"12\"}}]", "$.a.b"); => ["\"13\"","\"18\"","\"12\""] |
| 284 | +select json_array_extract_scalar("[{\"a\":{\"b\":\"13\"}}, {\"a\":{\"b\":\"18\"}}, {\"a\":{\"b\":\"12\"}}]", "$.a.b") => ["13","18","12"] |
| 285 | +select json_extract("{\"a\":{\"b\":\"12\"}}", "$.a.b"); => "12" |
| 286 | +select json_extract_scalar("{\"a\":{\"b\":\"12\"}}", "$.a.b") => 12 |
| 287 | +select json_extract_scalar('[1, 2, 3]', '$[2]'); |
| 288 | +select json_extract_scalar(json, '$.store.book[0].author'); |
| 289 | +select json_size('{"x": {"a": 1, "b": 2}}', '$.x'); => 2 |
| 290 | +select json_size('{"x": [1, 2, 3]}', '$.x'); => 3 |
| 291 | +select json_size('{"x": {"a": 1, "b": 2}}', '$.x.a'); => 0 |
| 292 | +``` |
| 293 | + |
| 294 | +``` |
| 295 | +select gcj_to_bd(39.915, 116.404) => {"lng":116.41036949371029,"lat":39.92133699351022} |
| 296 | +select bd_to_gcj(39.915, 116.404) => {"lng":116.39762729119315,"lat":39.90865673957631} |
| 297 | +select wgs_to_gcj(39.915, 116.404) => {"lng":116.41024449916938,"lat":39.91640428150164} |
| 298 | +select gcj_to_wgs(39.915, 116.404) => {"lng":116.39775550083061,"lat":39.91359571849836} |
| 299 | +select gcj_extract_wgs(39.915, 116.404) => {"lng":116.39775549316407,"lat":39.913596801757805} |
| 300 | +``` |
| 301 | + |
| 302 | +``` |
| 303 | +select url_encode('http://shanruifeng.cc/') => http%3A%2F%2Fshanruifeng.cc%2F |
| 304 | +``` |