
Commit a6488fd

Merge pull request #51 from clojure-finance/main
release 1.1.0
2 parents f27c6bf + 2b5a725 commit a6488fd

22 files changed: 657 additions and 528 deletions

doc/documentation.md

Lines changed: 80 additions & 55 deletions
Original file line number | Diff line number | Diff line change
@@ -207,46 +207,8 @@ You can also group by the combination of keys. (Use the above two rules together
207207
;; get the min of the two columns grouped by ...
208208
```
209209

210-
211-
212-
- sort
213-
214-
**Immediately** sort the dataframe
215-
216-
| Argument | Type | Function | Remarks |
217-
| ------------------ | ----------------------- | ------------------------ | ------------------------------------------------------------ |
218-
| `dataframe` | Clojask.DataFrame | The operated object | |
219-
| `trending list` | Collection (seq vector) | Indicates the sort order | Example: ["Salary" "+" "Employee" "-"] means that sort the Salary in ascending order, if equal sort the Employee in descending order |
220-
| `output-directory` | String | The output path | |
221-
222-
**Example**
223-
224-
```clojure
225-
(sort y ["+" "Salary"] "resources/sort.csv")
226-
;; sort by Salary ascendingly
227-
```
228-
229-
230210

231-
- compute
232211

233-
Compute the result. The pre-defined lazy operations will be executed in pipeline, ie the result of the previous operation becomes the argument of the next operation.
234-
235-
| Argument | Type | Function | Remarks |
236-
| ---------------- | ----------------- | ------------------------------------------------------------ | ------------------------------------------------------------ |
237-
| `dataframe` | Clojask.DataFrame | The operated object | |
238-
| `num of workers` | int (max 8) | The number of worker instances (except the input and output nodes) | If this argument >= 2, will use [onyx](http://www.onyxplatform.org/) as the distributed platform |
239-
| `output path` | String | The path of the output csv file | Could exist or not. |
240-
| [`exception`] | boolean | Whether an exception during calculation will cause termination | Is useful for debugging or detecting empty fields |
241-
242-
**Example**
243-
244-
```clojure
245-
(compute x 8 "../resources/test.csv" :exception true)
246-
;; computes all the pre-registered operations
247-
```
248-
249-
250212

251213
- inner-join / left-join / right-join
252214

@@ -258,40 +220,52 @@ You can also group by the combination of keys. (Use the above two rules together
258220

259221
*Automatically pipelines the registered operations and filters, as `compute` does. You can think of a join as first computing the two dataframes and then joining them.*
260222

261-
| Argument | Type | Function | Remarks |
262-
| ------------------- | ------------------- | ------------------------------------------------------------ | ------------------------------------------------- |
263-
| `dataframe a` | Clojask.DataFrame | The operated object | |
264-
| `dataframe b` | Clojask.DataFrame | The operated object | |
265-
| `a join keys` | String / Collection | The keys of a to be aligned | Find the specification [here](#groupby-keys) |
266-
| `b join keys` | String / Collection | The keys of b to be aligned | Find the specification [here](#groupby-keys) |
267-
| `number of workers` | int (max 8) | Number of worker nodes doing the joining | |
268-
| `distination file` | string | The file path to the distination | Will be emptied first |
269-
| [`exception`] | boolean | Whether an exception during calculation will cause termination | Is useful for debugging or detecting empty fields |
223+
| Argument | Type | Function | Remarks |
224+
| ------------- | ------------------- | --------------------------- | -------------------------------------------- |
225+
| `dataframe a` | Clojask.DataFrame | The operated object | |
226+
| `dataframe b` | Clojask.DataFrame | The operated object | |
227+
| `a join keys` | String / Collection | The keys of a to be aligned | Find the specification [here](#groupby-keys) |
228+
| `b join keys` | String / Collection | The keys of b to be aligned | Find the specification [here](#groupby-keys) |
270229

271-
**Example**
230+
**Return**
231+
232+
A Clojask.JoinedDataFrame
233+
234+
- Unlike Clojask.DataFrame, it only supports three operations:
235+
- `print-df`
236+
- `get-col-names`
237+
- `compute`
238+
- This means you cannot further apply complicated operations to a joined dataframe. An alternative is to first compute the result, then read it in as a new dataframe.
239+
240+
**Example**
272241

273242
```clojure
274243
(def x (dataframe "path/to/a"))
275244
(def y (dataframe "path/to/b"))
276245

277-
(inner-join x y ["col a 1" "col a 2"] ["col b 1" "col b 2"] 8 "path/to/distination" :exception true)
246+
(def z (inner-join x y ["col a 1" "col a 2"] ["col b 1" "col b 2"]))
247+
(compute z 8 "path/to/output")
278248
;; inner join x and y
279249

280-
(left-join x y ["col a 1" "col a 2"] ["col b 1" "col b 2"] 8 "path/to/distination" :exception true)
250+
(def z (left-join x y ["col a 1" "col a 2"] ["col b 1" "col b 2"]))
251+
(compute z 8 "path/to/output")
281252
;; left join x and y
282253

283-
(right-join x y ["col a 1" "col a 2"] ["col b 1" "col b 2"] 8 "path/to/distination" :exception true)
254+
(def z (right-join x y ["col a 1" "col a 2"] ["col b 1" "col b 2"]))
255+
(compute z 8 "path/to/output")
284256
;; right join x and y
285257
```
286258

259+
260+
287261
- reorderCol / renameCol
288262

289263
Reorder the columns / rename the column names in the dataframe
290264

291-
| Argument | Type | Function | Remarks |
292-
| ------------------- | ------------------ | ------------------------------------------------------------ | ------------------------------------------------- |
293-
| `dataframe a` | Clojask.DataFrame | The operated object | |
294-
| `a columns` | Clojure.collection | The new set of column names | Should be existing headers in dataframe a if it is `reorderCol` |
265+
| Argument | Type | Function | Remarks |
266+
| ------------- | ------------------ | --------------------------- | ------------------------------------------------------------ |
267+
| `dataframe a` | Clojask.DataFrame | The operated object | |
268+
| `a columns` | Clojure.collection | The new set of column names | Should be existing headers in dataframe a if it is `reorderCol` |
295269

296270

297271
**Example**
@@ -301,3 +275,54 @@ You can also group by the combination of keys. (Use the above two rules together
301275
(.renameCol y ["Employee" "new-Department" "EmployeeName" "Salary"])
302276
```
303277

278+
279+
280+
281+
- sort
282+
283+
**Immediately** sort the dataframe
284+
285+
| Argument | Type | Function | Remarks |
286+
| ------------------ | ----------------------- | ------------------------ | ------------------------------------------------------------ |
287+
| `dataframe` | Clojask.DataFrame | The operated object | |
288+
| `trending list` | Collection (seq vector) | Indicates the sort order | Example: ["Salary" "+" "Employee" "-"] sorts by Salary in ascending order and breaks ties by Employee in descending order |
289+
| `output-directory` | String | The output path | |
290+
291+
**Example**
292+
293+
```clojure
294+
(sort y ["+" "Salary"] "resources/sort.csv")
295+
;; sort by Salary in ascending order
296+
```
297+
298+
299+
300+
- compute
301+
302+
Compute the result. The pre-defined lazy operations are executed as a pipeline, i.e. the result of each operation becomes the argument of the next.
303+
304+
| Argument | Type | Function | Remarks |
305+
| ---------------- | ------------------------------ | ------------------------------------------------------------ | ------------------------------------------------------------ |
306+
| `dataframe` | Clojask.DataFrame | The operated object | |
307+
| `num of workers` | int (max 8) | The number of worker instances (except the input and output nodes) | Use [onyx](http://www.onyxplatform.org/) as the distributed platform |
308+
| `output path` | String | The path of the output csv file | Could exist or not. |
309+
| [`exception`] | boolean | Whether an exception during calculation will cause termination | Is useful for debugging or detecting empty fields |
310+
| [`select`] | String / Collection of strings | The names of the columns to select (similar to `SELECT` in SQL). Call `get-col-names` first to see all available names. | Mutually exclusive with `exclude` |
311+
| [`exclude`] | String / Collection of strings | The names of the columns to exclude | Mutually exclusive with `select` |
312+
313+
**Example**
314+
315+
```clojure
316+
(compute x 8 "../resources/test.csv" :exception true)
317+
;; computes all the pre-registered operations
318+
319+
(compute x 8 "../resources/test.csv" :select "col a")
320+
;; only select column a
321+
322+
(compute x 8 "../resources/test.csv" :select ["col b" "col a"])
323+
;; select two columns, column b and column a in order
324+
325+
(compute x 8 "../resources/test.csv" :exclude ["col b" "col a"])
326+
;; select all columns except column b and column a, other columns are in order
327+
```
328+
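
Note on `:select` / `:exclude`: the documented rules (a single string or a collection of strings, output order following `:select`, remaining columns kept in their original order for `:exclude`, and only one of the two allowed) can be sketched as a small pure function. `resolve-columns` is a hypothetical name used only for illustration; it is not part of Clojask, and the actual implementation in this commit may differ.

```clojure
;; Hypothetical helper, NOT part of Clojask: illustrates the documented
;; :select / :exclude rules on a plain vector of column names.
(defn resolve-columns
  [col-names & {:keys [select exclude]}]
  (let [as-vec (fn [x] (if (coll? x) (vec x) [x]))] ; a lone string becomes [string]
    (cond
      ;; the documentation allows only one of the two options
      (and select exclude)
      (throw (IllegalArgumentException.
              "Specify either :select or :exclude, not both"))

      ;; :select keeps the caller's order
      select
      (as-vec select)

      ;; :exclude drops the named columns and keeps the rest in order
      exclude
      (vec (remove (set (as-vec exclude)) col-names))

      :else
      (vec col-names))))

(resolve-columns ["col a" "col b" "col c"] :select "col a")
;; => ["col a"]
(resolve-columns ["col a" "col b" "col c"] :select ["col b" "col a"])
;; => ["col b" "col a"]
(resolve-columns ["col a" "col b" "col c"] :exclude ["col b" "col a"])
;; => ["col c"]
```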

examples/multi-threading.clj

Lines changed: 0 additions & 15 deletions
This file was deleted.

src/main/clojure/aggregate/aggre_onyx_comps.clj

Lines changed: 18 additions & 10 deletions
@@ -8,7 +8,7 @@
88
[onyx.test-helper :refer [with-test-env feedback-exception!]]
99
;; [tech.v3.dataset :as ds]
1010
[clojure.data.csv :as csv]
11-
[clojask.utils :refer [eval-res eval-res-ne filter-check]]
11+
[clojask.utils :as u]
1212
[clojure.set :as set]
1313
[clojask.groupby :refer [read-csv-seq]])
1414
(:import (java.io BufferedReader FileReader BufferedWriter FileWriter)))
@@ -38,10 +38,11 @@
3838

3939

4040
(defn worker-func-gen
41-
[df exception]
41+
[df exception aggre-funcs index formatter]
4242
(reset! dataframe df)
43-
(let [aggre-funcs (.getAggreFunc (.row-info (deref dataframe)))
44-
formatters (.getFormatter (.col-info (deref dataframe)))
43+
(let [
44+
;; aggre-funcs (.getAggreFunc (.row-info (deref dataframe)))
45+
formatters formatter
4546
;; key-index (.getKeyIndex (.col-info (deref dataframe)))
4647
;; formatters (set/rename-keys formatters key-index)
4748
]
@@ -52,7 +53,10 @@
5253
(let [data (read-csv-seq (:file seq))
5354
pre (:d seq)
5455
data-map (-> (iterate inc 0)
55-
(zipmap (apply map vector data)))]
56+
(zipmap (apply map vector data)))
57+
reorder (fn [a b]
58+
;; (println [a b])
59+
(u/gets (concat a b) index))]
5660
;; (mapv (fn [_]
5761
;; (let [func (first _)
5862
;; index (nth _ 1)]
@@ -62,7 +66,9 @@
6266
res []]
6367
(if (= aggre-funcs [])
6468
;; {:d (vec (concat pre res))}
65-
{:d (mapv concat (repeat pre) (apply map vector res))}
69+
(if (= res [])
70+
{:d [(u/gets pre index)]}
71+
{:d (mapv reorder (repeat pre) (apply map vector res))})
6672
(let [func (first (first aggre-funcs))
6773
index (nth (first aggre-funcs) 1)
6874
res-funcs (rest aggre-funcs)
@@ -223,15 +229,17 @@
223229
{:zookeeper/address "127.0.0.1:2188"
224230
:zookeeper/server? true
225231
:zookeeper.server/port 2188
226-
:onyx/tenancy-id id})
232+
:onyx/tenancy-id id
233+
:onyx.log/file "_clojask/clojask.log"})
227234

228235
(def peer-config
229236
{:zookeeper/address "127.0.0.1:2188"
230237
:onyx/tenancy-id id
231238
:onyx.peer/job-scheduler :onyx.job-scheduler/balanced
232239
:onyx.messaging/impl :aeron
233240
:onyx.messaging/peer-port 40200
234-
:onyx.messaging/bind-addr "localhost"})
241+
:onyx.messaging/bind-addr "localhost"
242+
:onyx.log/file "_clojask/clojask.log"})
235243

236244
(def env (onyx.api/start-env env-config))
237245

@@ -250,11 +258,11 @@
250258

251259
(defn start-onyx-aggre
252260
"start the onyx cluster with the specification inside dataframe"
253-
[num-work batch-size dataframe dist exception]
261+
[num-work batch-size dataframe dist exception aggre-func index formatter]
254262
(try
255263
(workflow-gen num-work)
256264
(config-env)
257-
(worker-func-gen dataframe exception) ;;need some work
265+
(worker-func-gen dataframe exception aggre-func index formatter) ;;need some work
258266
(catalog-gen num-work batch-size)
259267
(lifecycle-gen "./_clojask/grouped" dist)
260268
(flow-cond-gen num-work)
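
Note on `reorder`: the new local function in `worker-func-gen` concatenates the group-by prefix with one row of aggregation results and then picks elements by position through `u/gets`. The body of `clojask.utils/gets` is not part of this diff, so the `gets` below is a stand-in written under the assumption that it returns the elements of a collection at the given indices, in the given order. In the diff, `reorder` closes over `index`; here it is passed explicitly so the sketch is self-contained.

```clojure
;; Stand-in for clojask.utils/gets (assumed behaviour, the real body is not shown):
;; return the elements of coll found at the given positions, in that order.
(defn gets
  [coll indices]
  (let [v (vec coll)]
    (mapv #(nth v %) indices)))

;; Same shape as the `reorder` closure in the diff, with `index` as an argument.
(defn reorder
  [pre res-row index]
  (gets (concat pre res-row) index))

;; One group key followed by two aggregation results, written out as
;; aggregate, key, aggregate.
(reorder ["Sales"] [10 250.0] [1 0 2])
;; => [10 "Sales" 250.0]

;; Applied to every result row, as in (mapv reorder (repeat pre) rows):
(mapv #(reorder ["Sales"] % [1 0]) [[10] [20]])
;; => [[10 "Sales"] [20 "Sales"]]
```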

src/main/clojure/clojask/ColInfo.clj

Lines changed: 1 addition & 1 deletion
@@ -107,7 +107,7 @@
107107

108108
(getKeys
109109
[this]
110-
col-keys)
110+
(mapv (fn [index] (get index-key index)) (take (count index-key) (iterate inc 0))))
111111

112112
(getKeyIndex
113113
[this]
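
Note on `getKeys`: the new body walks the indices `0 .. n-1` and looks each one up in `index-key`, so the keys come back in column-index order. The sketch below assumes `index-key` is a map from contiguous column indices to column keys; the sample map is made up for illustration.

```clojure
;; Hypothetical sample data: column index -> column key.
(def index-key {0 "Employee" 2 "Salary" 1 "Department"})

;; Same expression as the new getKeys body, lifted into a plain function.
(defn keys-in-index-order
  [index-key]
  (mapv (fn [index] (get index-key index))
        (take (count index-key) (iterate inc 0))))

(keys-in-index-order index-key)
;; => ["Employee" "Department" "Salary"]

;; An equivalent, shorter spelling using range and the map as a function:
(mapv index-key (range (count index-key)))
;; => ["Employee" "Department" "Salary"]
```

If the indices were not contiguous from 0, the missing positions would come back as `nil`.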

src/main/clojure/clojask/clojask_aggre.clj

Lines changed: 17 additions & 8 deletions
@@ -4,14 +4,19 @@
44
[clojure.java.io :as io]
55
[taoensso.timbre :refer [debug info] :as timbre]
66
[clojure.string :as string]
7-
[clojask.api.aggregate :refer [start]])
7+
[clojask.api.aggregate :refer [start]]
8+
[clojask.utils :as u])
89
(:import (java.io BufferedReader FileReader BufferedWriter FileWriter)))
910

1011
(def df (atom nil))
12+
(def aggre-func (atom nil))
13+
(def select (atom nil))
1114

1215
(defn inject-dataframe
13-
[dataframe]
16+
[dataframe a b]
1417
(reset! df dataframe)
18+
(reset! aggre-func a)
19+
(reset! select b)
1520
)
1621

1722
(defn c-count
@@ -39,7 +44,9 @@
3944
:lifecycle/after-task-stop close-writer})
4045

4146
(defrecord ClojaskOutput
42-
[memo]
47+
[memo
48+
aggre-func
49+
select]
4350
p/Plugin
4451
(start [this event]
4552
;; Initialize the plugin, generally by assoc'ing any initial state.
@@ -52,7 +59,7 @@
5259
(let [data (mapv (fn [_] (if (coll? _) _ [_])) (deref memo))]
5360
;; (.write (:clojask/wtr event) (str data "\n"))
5461
(if (apply = (map count data))
55-
(mapv #(.write (:clojask/wtr event) (str (string/join "," %) "\n")) (apply map vector data))
62+
(mapv #(.write (:clojask/wtr event) (str (string/join "," (u/gets % select)) "\n")) (apply map vector data))
5663
(throw (Exception. "aggregation result is not of the same length"))))
5764
this)
5865

@@ -86,7 +93,7 @@
8693
;; before write-batch is called repeatedly.
8794
true)
8895

89-
(write-batch [this {:keys [onyx.core/write-batch clojask/wtr clojask/aggre-func]} replica messenger]
96+
(write-batch [this {:keys [onyx.core/write-batch clojask/wtr]} replica messenger]
9097
;; keys [:Departement]
9198
;; Write the batch to your datasink.
9299
;; In this case we are conjoining elements onto a collection.
@@ -111,6 +118,8 @@
111118
;; from your task-map here, in order to improve the performance of your plugin
112119
;; Extending the function below is likely good for most use cases.
113120
(defn output [pipeline-data]
114-
(let [aggre-func (.getAggreFunc (:row-info (deref df)))]
115-
(->ClojaskOutput (volatile! (doall (take (count aggre-func)
116-
(repeat start)))))))
121+
(let []
122+
(->ClojaskOutput (volatile! (doall (take (count (deref aggre-func))
123+
(repeat start))))
124+
(deref aggre-func)
125+
(deref select))))
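
Note on the output step: the writer holds one collection per aggregation column, checks that all columns have the same length, transposes them into rows with `(apply map vector data)`, and, after this change, keeps only the positions listed in `select` before joining with commas. A minimal sketch of that flow follows, with the file writer replaced by a returned vector of lines. `columns->csv-lines` is a hypothetical name, and the inline `nth` lookup assumes `u/gets` selects by position.

```clojure
(require '[clojure.string :as string])

;; Hypothetical stand-alone version of the write step: returns the CSV lines
;; instead of writing them to (:clojask/wtr event).
(defn columns->csv-lines
  [data select]
  (if (apply = (map count data))
    (mapv (fn [row]
            ;; assumed equivalent of (u/gets row select)
            (string/join "," (mapv #(nth row %) select)))
          (apply map vector data)) ; transpose columns into rows
    (throw (Exception. "aggregation result is not of the same length"))))

;; Three columns of two values each; keep column 1 then column 0.
(columns->csv-lines [["Sales" "HR"] [10 4] [7 9]] [1 0])
;; => ["10,Sales" "4,HR"]
```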

src/main/clojure/clojask/clojask_groupby.clj

Lines changed: 7 additions & 5 deletions
@@ -8,11 +8,13 @@
88

99
(def dataframe (atom nil))
1010
(def groupby-keys (atom nil))
11+
(def write-index (atom nil))
1112

1213
(defn inject-dataframe
13-
[df groupby-key]
14+
[df groupby-key index]
1415
(reset! dataframe df)
15-
(reset! groupby-keys groupby-key))
16+
(reset! groupby-keys groupby-key)
17+
(reset! write-index index))
1618

1719
(defn- inject-into-eventmap
1820
[event lifecycle]
@@ -35,7 +37,7 @@
3537
(def writer-aggre-calls
3638
{:lifecycle/before-task-start inject-into-eventmap})
3739

38-
(defrecord ClojaskGroupby []
40+
(defrecord ClojaskGroupby [write-index]
3941
p/Plugin
4042
(start [this event]
4143
;; Initialize the plugin, generally by assoc'ing any initial state.
@@ -90,7 +92,7 @@
9092
;(.write wtr (str msg "\n"))
9193
;; !! define argument (debug)
9294
;; (def groupby-keys [:Department :EmployeeName])
93-
(output-groupby dist (:d msg) groupby-keys key-index formatter)))
95+
(output-groupby dist (:d msg) groupby-keys key-index formatter write-index)))
9496

9597
(recur (rest batch)))))
9698
true))
@@ -101,4 +103,4 @@
101103
;; from your task-map here, in order to improve the performance of your plugin
102104
;; Extending the function below is likely good for most use cases.
103105
(defn groupby [pipeline-data]
104-
(->ClojaskGroupby))
106+
(->ClojaskGroupby (deref write-index)))
