
Searching for a PaaS (or maybe IaaS) that allows me to have a virtual environment for parallel processing. I have a large dataset (1.5 GB in CSV format) and need to process and visualize it repeatedly. My PC has 2 GB RAM and an Intel(R) Pentium(R) Dual CPU E2180; with this configuration I am unable to even open the whole dataset in Excel or R. I am using Gephi for visualization, so I need a platform that will let me process the data and use Gephi too.

Please suggest solutions for dealing with massive data, especially in the context of infrastructure support.

My requirement is similar to Hadoop, but Hadoop doesn't work for me because I need to use Gephi and Python programs. If it can work, kindly point out how.

Tags: largedata, pass, python


Replies to This Discussion

Can you give more detail about your data and the analysis?

If you need to use Gephi, I would guess that it has to be a graph analysis. How many nodes? How many edges?

Clearly it's more than can fit in RAM, so you need a graph DB. I would recommend Neo4j or ArangoDB. As for PaaS, Heroku offers a Neo4j hosting service. There is an official Gephi plugin for Neo4j.

I hope this helps,

Peter



Peter Higdon said:

Can you give more detail about your data and the analysis?

I have Train and Test datasets; in the Test set there is one column missing which I need to predict (a supervised learning problem). The Train dataset has 900 features in total, and all features are anonymous. I really don't know how to start, or how to deal with 900 features.

That's the first problem. Secondly, since the data is really big, my plan is to work on a small part of the dataset first (the first 100 rows), and then, once I have found a method that works for the problem, use a virtual environment that allows working on the whole dataset (all rows together).

If you need to use Gephi, I would guess that it has to be a graph analysis. How many nodes? how many edges?

I previously thought of using Gephi, but since I am unable to find any relation between nodes (different rows), I dropped the idea.

Clearly it's more than can fit in RAM, so you need a graph DB. I would recommend Neo4j or ArangoDB. As for PaaS, Heroku offers a Neo4j hosting service. There is an official Gephi plugin for Neo4j.

I hope this helps,

Thanks for the help; in case I do need Gephi, I will definitely try the Neo4j plugin.

Usually a subset of 5-20% of the data is used for training; the rest is used for testing. Try training with a small dataset and increase it incrementally until you're using most of your RAM. Do the visualization after the training.
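The "start small and grow the training set" advice above can be sketched with scikit-learn's out-of-core API (scikit-learn is my assumption here; the thread only mentions Python's scientific stack). `partial_fit` lets a linear model learn batch by batch, so only one batch needs to be in memory at a time:

```python
import numpy as np
from sklearn.linear_model import SGDRegressor

def train_incrementally(X, y, batch_size=100):
    """Feed the data to the model one batch at a time via partial_fit;
    each call updates the model without revisiting earlier batches."""
    model = SGDRegressor(random_state=0)
    for start in range(0, len(X), batch_size):
        model.partial_fit(X[start:start + batch_size],
                          y[start:start + batch_size])
    return model
```

In practice the batches could come straight from a chunked CSV reader, so the full dataset never has to be loaded.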

Why do you NEED to use Gephi? Graph analysis is only useful for certain algorithms. You probably need a matplotlib visualization to verify the training data before looking at it in Gephi.

I should have been more specific: what type of algorithm are you trying to run?

I recommend watching Stanford's Andrew Ng machine learning YouTube lectures if you're in over your head.


The Train dataset contains 900 features: the first is an id, the last is the prediction value, say y, and the rest are anonymous features named d1 to d888. So 888 features remain.

Yes, I used matplotlib in my first attempt: I created 888 plots, each with a different feature on the x-axis and y (the prediction value from the training data) on the y-axis. But it was not a good idea, as I had to analyse 888 graphs manually, and it only shows the direct relation between each x and y.
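Instead of inspecting all 888 plots manually, one common shortcut (my suggestion, not something from the thread) is to rank the features by the absolute value of their correlation with y using pandas, and only plot the strongest few:

```python
import pandas as pd

def rank_features_by_correlation(df, target="y"):
    """Return feature names sorted by |correlation with target|,
    strongest first, so only the top few need a manual plot."""
    corr = df.drop(columns=[target]).corrwith(df[target]).abs()
    return corr.sort_values(ascending=False)
```

This only captures linear, one-feature-at-a-time relations, which matches the limitation of the per-feature plots it replaces.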

While discussing the problem with a friend of mine (who has done Andrew Ng's ML course), he also suggested the same course, so it seems I am moving on the right track!

The algorithm is not decided yet, but multiple and logistic regression are the ones to try first. If you can suggest any other, it will be a big help. Let me remind you that it's a supervised regression problem. I am having trouble dealing with high-dimensional data (for the algorithm); the infrastructure/hardware problem is a separate issue.
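A first try at plain linear regression can be sketched with scikit-learn on a synthetic stand-in for the anonymous d1..d888 features (the data here is fabricated purely for illustration; note also that logistic regression is a classifier, so for a regression target linear regression is the matching tool):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical stand-in for the anonymous features: 200 rows, 5 columns.
rng = np.random.default_rng(0)
X = rng.random((200, 5))
true_w = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
y = X @ true_w + 0.1 * rng.standard_normal(200)

# Fit and check how much variance the linear model explains.
model = LinearRegression().fit(X, y)
r2 = model.score(X, y)
```

On real high-dimensional data, a regularized variant (Ridge or Lasso) would be the usual next step, since 888 features invite overfitting.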

I dropped the idea of using Gephi. At present I am using only Python and its libraries (NumPy, SciPy, matplotlib, pandas) to analyse the data.


© 2021   TechTarget, Inc.
