RecommendEngine

RecommendationEngine provides four models for making recommendations based on users' behavior data provided by tianyu.

You can run the algorithms using spark-submit.

Building RecommendEngine

Replace the project's build.sbt and hdfs.scala with the DC/OS variants under repFiles/dcos:

cd /$path_of_the_project/RecommendEngine/
cp repFiles/dcos/hdfs.scala src/main/scala/tianyu/algorithm/util/
cp repFiles/dcos/build.sbt .

Use the Simple Build Tool (sbt) to build the project:

sbt clean
sbt package

You can then find the packaged jar recommendationengine_2.11-1.0.jar under ~/RecommendEngine/target/scala-2.11/.
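
To confirm the build succeeded, check that the jar is present at the path above:

ls ~/RecommendEngine/target/scala-2.11/recommendationengine_2.11-1.0.jar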

Using scripts to build the package and upload it to the artifact store

Modify RecommendEngine/scripts/pack2dcos.sh, setting the target path and the path of proxy_scp, which is used to upload the local jar to DC/OS.

Then run:

source scripts/pack2dcos.sh
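
The contents of pack2dcos.sh are not reproduced in this README. As a rough, hypothetical sketch of what such a script might do, assuming proxy_scp is invoked like scp with a local source and a remote destination (both variables below are placeholders, not values from the project):

#!/bin/bash
# Hypothetical sketch of pack2dcos.sh -- adapt both variables to your setup.
PROXY_SCP=/usr/local/bin/proxy_scp               # placeholder: path of proxy_scp
TARGET_PATH=core@dcos-master:/var/artifacts/     # placeholder: target path on DC/OS

sbt clean package
"$PROXY_SCP" target/scala-2.11/recommendationengine_2.11-1.0.jar "$TARGET_PATH"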

Spark submit commands

Common Spark and class configuration

  • Example command:
/$path_of_spark_package/bin/spark-submit \
	--class tianyu.algorithm.ARDcosTest \
	--jars /$fullpath/scopt_2.11-3.3.0.jar,/$fullpath/spark-avro_2.11-3.2.0.jar \
	--master local[*] \
	--executor-memory 2G \
	~/myjars/recommendationengine_2.11-1.0.jar \
    ...
  • Parameter descriptions
  1. --class: The entry point for your application (e.g. org.apache.spark.examples.SparkPi)
  2. --master: The master URL for the cluster (e.g. spark://23.195.26.187:7077)
  3. --deploy-mode: Whether to deploy your driver on the worker nodes (cluster) or locally as an external client (client) (default: client)
  4. --executor-memory: Amount of memory allocated to each executor of the Spark application
  5. --conf: Arbitrary Spark configuration property in key=value format. For values that contain spaces, wrap "key=value" in quotes.
  6. application-jar: Path to a bundled jar including your application and all dependencies. The URL must be globally visible inside of your cluster, for instance, an hdfs:// path or a file:// path that is present on all nodes.
  7. application-arguments: Arguments passed to the main method of your main class, if any
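
Putting the pieces together: all Spark options come before the application jar, and all application arguments come after it. A sketch of a full invocation (the paths, master URL, and --conf value below are placeholders, not project defaults):

/opt/spark/bin/spark-submit \
    --class tianyu.algorithm.ARDcosTest \
    --jars /libs/scopt_2.11-3.3.0.jar,/libs/spark-avro_2.11-3.2.0.jar \
    --master spark://23.195.26.187:7077 \
    --deploy-mode client \
    --executor-memory 2G \
    --conf "spark.driver.maxResultSize=2g" \
    ~/myjars/recommendationengine_2.11-1.0.jar \
    --outDir /tianyu/analysis \
    --topN 10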

Association Rules (Items)

  • Class: tianyu.algorithm.ARDcosTest:
  • Example command:
/$path_of_spark_package/bin/spark-submit \
	--class tianyu.algorithm.ARDcosTest \
	...
	--outDir /tianyu/analysis \
	--timeProcess true \
	--endTime now \
	--numPastMonths 3 \
	--maxItems 1000 \
	--minSupport 0.01 \
	--minConfidence 0.8 \
	--topN 10 \
	--numPartitions 200
  • Parameter descriptions
  1. --outDir: Root directory for writing analysis results.
  2. --timeProcess: true runs AR with time processing; false runs it without.
  3. --endTime: End time for slicing the log data. "now" means the current time; any other value must be formatted as yyyyMMdd.
  4. --numPastMonths: Length of the period, in months, used to slice the data into transactions and recent history.
  5. --maxItems: Maximum number of items in a transaction; transactions with more items are filtered out.
  6. --minSupport: Minimum support for filtering frequent itemsets (an itemset's support is the fraction of transactions containing it).
  7. --minConfidence: Minimum confidence for filtering association rules (the confidence of X => Y is support(X and Y) / support(X)).
  8. --topN: Maximum number of recommendations per user.
  9. --numPartitions: Number of tasks per stage.
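
The results land under the directory given by --outDir. Assuming that path lives on HDFS (the build swaps in an hdfs.scala utility, so HDFS output is likely, though not stated explicitly here), they can be listed with:

hdfs dfs -ls /tianyu/analysis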

Cluster (Users)

  • Class: tianyu.algorithm.ClusterDcosTest:
  • Example command:
/$path_of_spark_package/bin/spark-submit \
    --class tianyu.algorithm.ClusterDcosTest \
    ...
    ~/myjars/recommendationengine_2.11-1.0.jar \
    --outDir /tianyu/lynnDockerTest \
    --numCluster 5 \
    --maxIterations 10 \
    --topN 10 
  • Parameter descriptions
  1. --outDir: Root directory for writing analysis results.
  2. --numCluster: Number of clusters into which the users are divided.
  3. --maxIterations: Maximum number of iterations for clustering users.
  4. --topN: Maximum number of recommendations per user.

Cosine Similarity (Items)

  • Class: tianyu.algorithm.CSDcosTest:
  • Example command:
  • Parameter descriptions

Matrix Factorization (Users and Items)

  • Class: tianyu.algorithm.MFDcosTest:
  • Example command:
/$path_of_spark_package/bin/spark-submit \
    --class tianyu.algorithm.MFDcosTest \
    ...
    ~/myjars/recommendationengine_2.11-1.0.jar \
    --outDir /tianyu/lynnDockerTest \
    --rank 10 \
    --reg 1 \
    --maxIter 10 \
    --topN 10 \
    --numBlocks 200
  • Parameter descriptions
  1. --outDir: Root directory for writing analysis results.
  2. --rank: Rank of the factor matrices for users and items.
  3. --reg: Regularization parameter (lambda) in ALS.
  4. --maxIter: Maximum number of iterations to run.
  5. --topN: Maximum number of recommendations per user.
  6. --numBlocks: Number of blocks the users and items will be partitioned into in order to parallelize computation.

Adjusted Cosine Similarities (Items)

  • Class: tianyu.algorithm.CSDcosTest:
  • Example command:
/$path_of_spark_package/bin/spark-submit \
    --class tianyu.algorithm.CSDcosTest \
    ...
    ~/myjars/recommendationengine_2.11-1.0.jar \
    --outDir /tianyu/analysis \
    --minSim 0.7 \
    --topSim 20 \
    --minCommons 5 \
    --topN 10 
  • Parameter descriptions
  1. --outDir: Root directory for writing analysis results.
  2. --minSim: Minimum adjusted similarity for keeping an (i, j) item-pair similarity.
  3. --topSim: Maximum number of similar items kept per item (used to calculate predictions).
  4. --minCommons: Minimum number of common ratings (r_ui) required to consider two items similar.
  5. --topN: Maximum number of recommendations per user.

Search for a user's history and recommendations in a specific model

  • Class: tianyu.algorithm.Comparison:
  • Example command:
/$path_of_spark_package/bin/spark-submit \
    --class tianyu.algorithm.Comparison \
    ...
    ~/myjars/recommendationengine_2.11-1.0.jar \
    --rootDir /tianyu/lynnDockerTest \
    --user zc14607X \
    --alg Association \
    --subType Full_History \
    --DateTime 2017-06-15-08
  • Parameter descriptions
  1. --rootDir: Root directory of the result files.
  2. --user: Name, ID, or account of the user to look up.
  3. --alg: Name of the recommendation model (Association, Cluster, ALS, CosSim).
  4. --subType: Sub-type of the chosen recommendation model (Full_History, Time_Window, Basic).
  5. --DateTime: Time when the analysis was executed, formatted as yyyy-MM-dd-HH.
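
For example, to look up the same user's results under the ALS model (assuming Basic is a valid sub-type for ALS; the other values reuse the example above):

/$path_of_spark_package/bin/spark-submit \
    --class tianyu.algorithm.Comparison \
    ...
    ~/myjars/recommendationengine_2.11-1.0.jar \
    --rootDir /tianyu/lynnDockerTest \
    --user zc14607X \
    --alg ALS \
    --subType Basic \
    --DateTime 2017-06-15-08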
