I want to scale my data with StandardScaler
(from pyspark.mllib.feature import StandardScaler
). I can do that now by passing the values of the RDD to the transform function, but the problem is that I want to preserve the key. Is there a way to scale my data while keeping its key? Is it also possible to scale the data per group with Spark?
Sample dataset
0,tcp,http,SF,181,5450,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,8,8,0.00,0.00,0.00,0.00,1.00,0.00,0.00,9,9,1.00,0.00,0.11,0.00,0.00,0.00,0.00,0.00,normal.
0,tcp,http,SF,239,486,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,8,8,0.00,0.00,0.00,0.00,1.00,0.00,0.00,19,19,1.00,0.00,0.05,0.00,0.00,0.00,0.00,0.00,normal.
0,tcp,http,SF,235,1337,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,8,8,0.00,0.00,0.00,0.00,1.00,0.00,0.00,29,29,1.00,0.00,0.03,0.00,0.00,0.00,0.00,0.00,smurf.
Imports
import sys
import os
from collections import OrderedDict
from numpy import array
from math import sqrt
try:
    from pyspark import SparkContext, SparkConf
    from pyspark.mllib.clustering import KMeans
    from pyspark.mllib.feature import StandardScaler
    from pyspark.statcounter import StatCounter
    print("Successfully imported Spark Modules")
except ImportError as e:
    print("Cannot import Spark Modules", e)
    sys.exit(1)
Code snippet
sc = SparkContext(conf=conf)
raw_data = sc.textFile(data_file)
parsed_data = raw_data.map(Parseline)
The Parseline function:
def Parseline(line):
    line_split = line.split(",")
    clean_line_split = [line_split[0]] + line_split[4:-1]
    return (line_split[-1], array([float(x) for x in clean_line_split]))
When I try to use this code it gives me this error: "TypeError: unbound method merge() must be called with NoneType instance as first argument (got StatCounter instance instead)". Any ideas? – Iman
Do you have missing values in your data? What are their types? – zero323
The data structure is something like [label, array([list of numeric float values])]; each label is either normal or an attack, and there are no missing values. – Iman
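On the second part of the question (scaling per group): in Spark this can be done by computing per-key statistics (e.g. with aggregateByKey or combineByKey) and then normalizing each value with its own group's mean and standard deviation in a mapValues step. A minimal sketch of that computation in plain Python/numpy, so the logic is easy to verify outside Spark (the function name scale_by_group and the toy data are hypothetical):

```python
import numpy as np

def scale_by_group(pairs):
    """pairs: list of (key, 1-D numpy array).
    Returns a list of (key, array) with each value standardized
    using the mean and std of its own key's group."""
    # First pass: collect values per key (Spark: groupByKey/aggregateByKey).
    groups = {}
    for k, v in pairs:
        groups.setdefault(k, []).append(v)
    # Per-key column-wise mean and (population) standard deviation.
    stats = {k: (np.mean(vs, axis=0), np.std(vs, axis=0))
             for k, vs in groups.items()}
    # Second pass: normalize each value with its group's stats
    # (Spark: mapValues with the broadcast stats dict).
    out = []
    for k, v in pairs:
        mean, std = stats[k]
        # Guard against zero std (constant column within a group).
        out.append((k, (v - mean) / np.where(std == 0, 1.0, std)))
    return out

pairs = [("a", np.array([1.0, 2.0])),
         ("a", np.array([3.0, 4.0])),
         ("b", np.array([5.0, 6.0]))]
scaled = scale_by_group(pairs)
```

The two passes map directly onto a Spark job: the stats dict corresponds to a small per-key aggregate that can be collected and broadcast, and the second loop corresponds to a mapValues over the original pair RDD.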