Explode column with dense vectors in multiple rows

January 31, 2018, at 5:56 PM

I have a DataFrame with two columns: BrandWatchErwaehnungID and word_counts. The word_counts column is the output of `CountVectorizer` (a sparse vector). After dropping the empty rows, I created two new columns: one with the indices of the sparse vector and one with their values.

help0 = countedwords_text['BrandWatchErwaehnungID','word_counts'].rdd\
    .filter(lambda x : x[1].indices.size!=0)\
    .map(lambda x : (x[0],x[1],DenseVector(x[1].indices) , DenseVector(x[1].values))).toDF()\
    .withColumnRenamed("_1", "BrandWatchErwaehnungID").withColumnRenamed("_2", "word_counts")\
    .withColumnRenamed("_3", "word_indices").withColumnRenamed("_4", "single_word_counts")

I needed to convert them to dense vectors before adding them to my DataFrame because Spark did not accept numpy.ndarray. My problem is that I now want to explode that DataFrame on the word_indices column, but the explode function from pyspark.sql.functions only supports arrays or maps as input.

I have tried:

help1 = help0.withColumn('b' , explode(help0.word_indices))

and get the following error:

cannot resolve 'explode(`word_indices`)' due to data type mismatch: input to function explode should be array or map type

Afterwards I tried:

help1 = help0.withColumn('b' , explode(help0.word_indices.toArray()))

This did not work either. Any suggestions?

Answer 1

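You have to use a udf:

from pyspark.sql.functions import udf, explode
from pyspark.ml.linalg import DenseVector, SparseVector

# Register the function as a UDF returning array<integer>; without the
# decorator, indices("v") would just run plain Python on the string "v".
@udf("array<integer>")
def indices(v):
    # A dense vector has an entry at every position.
    if isinstance(v, DenseVector):
        return list(range(len(v)))
    # A sparse vector stores only the positions of its non-zero entries.
    if isinstance(v, SparseVector):
        return v.indices.tolist()

# Assumes an active SparkSession named `spark`.
df = spark.createDataFrame([
    (1, DenseVector([1, 2, 3])), (2, SparseVector(5, {4: 42}))],
    ("id", "v"))
df.select("id", explode(indices("v"))).show()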
# +---+---+
# | id|col|
# +---+---+
# |  1|  0|
# |  1|  1|
# |  1|  2|
# |  2|  4|
# +---+---+
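
The question also wanted the counts next to the indices. One way to get both from a single explode is to have the udf return (index, value) pairs instead of bare indices. This is only a sketch: the names index_value_pairs, explode_vector, word_index and single_word_count are made up for illustration, and the helper tells sparse from dense vectors by checking for an `indices` attribute rather than with isinstance, so it can be tried without Spark.

```python
def index_value_pairs(v):
    """Return [(index, value), ...] for a pyspark.ml.linalg vector.

    Hypothetical helper. Sparse vectors yield only their stored entries;
    dense vectors yield every position, zeros included.
    """
    if hasattr(v, "indices"):  # SparseVector exposes .indices and .values
        return [(int(i), float(x)) for i, x in zip(v.indices, v.values)]
    # DenseVector: fall back to its full array of values
    return [(i, float(x)) for i, x in enumerate(v.toArray())]


def explode_vector(df, id_col, vector_col):
    """Give each (index, value) pair of vector_col its own row.

    Sketch only; the output column names are assumptions.
    """
    # Imported here so the helper above stays usable without Spark installed.
    from pyspark.sql.functions import col, explode, udf

    # int()/float() above matter: Spark cannot serialize numpy scalars
    # into the declared int/double struct fields.
    pairs = udf(
        index_value_pairs,
        "array<struct<word_index:int,single_word_count:double>>",
    )
    return (
        df.select(id_col, explode(pairs(col(vector_col))).alias("p"))
          .select(id_col, "p.word_index", "p.single_word_count")
    )
```

With the question's DataFrame this would be called as explode_vector(countedwords_text, "BrandWatchErwaehnungID", "word_counts"), directly on the CountVectorizer output, so the intermediate DenseVector columns should not be needed.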