Cluster with 10 nodes

    The code defines a function, cal, that takes a SparkSession as input and counts the number of activity records per user. The function reads the user activity data into an RDD and repartitions it into 10 partitions, one per node of the 10-node cluster. It then maps each record to a (userId, 1) pair and sums the pairs per user with reduceByKey.

    Finally, the function takes up to 1000 of the resulting (userId, count) pairs. Because cal returns Unit, that array is discarded.

    # You have a cluster of 10 nodes (each node having 24 CPU cores).
    # The following code works, but it may crash on huge data sets or, at the very least,
    # it may not take advantage of the cluster's full processing capabilities.
    # Why, and what number of partitions would be optimal?
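
    import org.apache.spark.rdd.RDD
    import org.apache.spark.sql.SparkSession

    // UserActivity and readUserActivityData are assumed to be defined elsewhere.
    def cal(sparkSession: SparkSession): Unit = {
      val NumNode = 10
      val userActivityRdd: RDD[UserActivity] =
        readUserActivityData(sparkSession).repartition(NumNode)
      // Count activity records per user.
      val result = userActivityRdd
        .map(e => (e.userId, 1L))
        .reduceByKey(_ + _)

      result.take(1000)
    }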
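
One possible answer to the question in the comments, as a rough sketch rather than a definitive rule: with 10 partitions, at most 10 of the cluster's 240 cores (10 nodes x 24 cores) can work at once, and each partition holds a tenth of the data, which may not fit in an executor's memory on huge data sets. The Spark tuning guide suggests 2 to 3 tasks per CPU core as a general starting point. The object and function names below (PartitionSizing, totalCores, partitionsFor) are hypothetical helpers made up for this illustration, and the right value in practice also depends on data size, skew, and executor configuration.

```scala
object PartitionSizing {
  // Total number of CPU cores across the cluster.
  def totalCores(nodes: Int, coresPerNode: Int): Int =
    nodes * coresPerNode

  // Partition count for a chosen number of tasks per core.
  // tasksPerCore = 2 or 3 follows the general guidance in the Spark tuning guide;
  // treat it as a starting point to tune, not a fixed answer.
  def partitionsFor(nodes: Int, coresPerNode: Int, tasksPerCore: Int): Int =
    totalCores(nodes, coresPerNode) * tasksPerCore

  def main(args: Array[String]): Unit = {
    val nodes = 10
    val coresPerNode = 24
    val low = partitionsFor(nodes, coresPerNode, 2)
    val high = partitionsFor(nodes, coresPerNode, 3)
    // prints "cores = 240, partitions between 480 and 720"
    println(s"cores = ${totalCores(nodes, coresPerNode)}, partitions between $low and $high")
  }
}
```

In cal, such a value could replace NumNode in the repartition call, or be passed as the second argument to reduceByKey (for example reduceByKey(_ + _, 480)) so that the shuffle output is also spread across the cluster.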