Hi, I just wanted to play with the Word2Vec implementation provided by spark-ml and learn how to use it to extract features from some text. Word2Vec is Google's algorithm, and as for the input text, why not use my actual CV?

Before that, you will probably have to convert the PDF file to a text file. I have just started working at a new company, StratioBD, a big data company. That means new colleagues and a new laptop, in this case running Ubuntu 16.04, so I will use pdftotext to create my-cv.txt. pdftotext is part of the poppler-utils package; to install it, run this command:

sudo apt-get install poppler-utils

aroman@aroman:~/Descargas$ pdftotext CV-MrAlonsoIsidoroRoman-EN.pdf my-cv.txt

Then you can run this code in your spark-shell:

import org.apache.spark.ml.feature.Word2Vec
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.Row

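// One row per word. Note that word.mkString(" ") puts a space between the
// characters of each word, which is why the output shows "s e r v i c e s".
val documentDF = sc.textFile("/home/aroman/Descargas/my-cv.txt").flatMap(_.split(" ")).map(word => Array(word.mkString(" "))).toDF("text")

// The vector size is the dimensionality of the feature vector (3 matches the output shown);
// a minimum count of 0 keeps every token in the vocabulary.
val word2Vec = new Word2Vec().setInputCol("text").setOutputCol("result").setVectorSize(3).setMinCount(0)
val model = word2Vec.fit(documentDF)

val result = model.transform(documentDF)
result.collect().foreach {
  case Row(text: Seq[_], features: Vector) =>
    println(s"Text: [${text.mkString("\n ")}] => \nVector: $features\n")
}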

The output of the commands looks like this:

Text: [s e r v i c e s] =>
Vector: [0.13635428249835968,-0.08279827982187271,-0.1509108990430832]

Text: [r e l a t e d] =>
Vector: [-0.014164328575134277,0.1441897749900818,0.10989004373550415]

Text: [t o] =>
Vector: [-0.05383288860321045,0.14697717130184174,-0.002313454868271947]

Text: [h o w] =>
Vector: [-0.10772007703781128,0.11851165443658829,0.1298576146364212]

Text: [q u a l i t y] =>
Vector: [0.08013373613357544,-0.09192681312561035,0.03734902665019035]

Text: [m a n a g e m e n t] =>
Vector: [-0.012027840130031109,0.008246302604675293,-0.16607992351055145]

Text: [i s] =>
Vector: [0.08250036090612411,0.0025535225868225098,-0.07933773845434189]

Text: [p e r f o r m e d] =>
Vector: [0.061619579792022705,0.06447243690490723,0.15335990488529205]

Text: [a t] =>
Vector: [0.11168009042739868,0.055535417050123215,-0.05068780854344368]

Text: [t h e i r] =>
Vector: [0.14672346413135529,-0.08875509351491928,-0.11893811076879501]

Text: [f a c t o r i e s .] =>
Vector: [-0.08635739237070084,-0.011294126510620117,-0.14689451456069946]

Text: [T h e] =>
Vector: [-0.0338105745613575,-0.11832380294799805,-0.1510833501815796]

Text: [p r o j e c t] =>
Vector: [0.14000384509563446,0.08132614940404892,0.15977203845977783]
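
The letters in the output above are separated by spaces because calling mkString(" ") on a String joins its characters, and each row of the DataFrame then holds a single such token. Word2Vec learns from the words that appear together in a row, so one-token rows give it no context to work with. The plain Scala sketch below (no Spark needed) shows that effect next to a possible alternative that keeps the words of a line together. The names TokenizeDemo, spacedChars, rowsAsInPost and wordsPerLine are hypothetical and only serve as illustration; the per-line variant is an assumption about what a more useful input might look like, not something taken from the run above.

```scala
object TokenizeDemo {
  // String.mkString(sep) joins the characters of the string with sep.
  def spacedChars(word: String): String = word.mkString(" ")

  // Mirrors the mapping used in the post: one row per word,
  // each row holding a single token with spaced-out characters.
  def rowsAsInPost(line: String): Seq[Seq[String]] =
    line.split(" ").toSeq.map(word => Seq(spacedChars(word)))

  // Hypothetical alternative: one row per line, words left intact,
  // so that the words of a line share a context.
  def wordsPerLine(line: String): Seq[String] =
    line.trim.split("\\s+").filter(_.nonEmpty).toSeq

  def main(args: Array[String]): Unit = {
    val sample = "quality management is performed"
    println(rowsAsInPost(sample).map(_.mkString("[", ", ", "]")).mkString(" "))
    println(wordsPerLine(sample).mkString("[", ", ", "]"))
  }
}
```

In spark-shell, a collection of rows built with wordsPerLine could presumably be turned into the input DataFrame with toDF("text") in the same way as before.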

To be continued…

