
In this post, I discuss the basic characteristics of code that I have personally used to extract online data - a process these days often called data-mining.  I intend to cover some general features.  Those who wish to do so can also compile the coding samples.

Over the years, I have programmed in a number of computer programming languages including Visual Basic, Perl, Python, and LISP (AutoLISP).  The coding samples on this blog are written in Java, my language of preference.  I never received any formal programming instruction beyond high school except for a short course at Sun Educational Systems, nor do I consider myself a programmer.  I think the ability to handle code is simply something that some people should be prepared to acquire in order to achieve certain desirable outcomes.  Java has a well-developed GUI, or windowing, environment, and I personally prefer an object-oriented approach; this explains my choice of Java as a programming language.  However, I certainly support the use of other languages.

When crawling the internet, I come across the same nagging problem repeatedly:  I never know exactly what to expect.  Styles, formats, structures, and scripting can differ between sites, and these things can also change over time, so code written today might become obsolete in the near future.  There are also basic issues to deal with, such as how information flows over the net:  that is to say, slowly and unreliably compared to a hard-drive.  When a person accesses a file on a hard-drive, the process is almost immediate and rarely fails.  Online, the user cannot be certain whether or when data will be downloaded.  Then there is the question of what to do if the files are not forthcoming:  how to backtrack for retrieval later; when to try again; and when to give up, since the files associated with some links might simply be missing.
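The try-again-then-give-up decision can be sketched as a small retry loop.  The class name RetryPolicy, the attempt counts, and the delays below are my own illustrative assumptions, not part of the original program:

```java
import java.util.concurrent.Callable;

public class RetryPolicy {
    // Retry a flaky fetch up to maxAttempts times, pausing between tries,
    // and give up (returning null) if the resource never arrives.
    public static <T> T withRetries(int maxAttempts, long delayMillis, Callable<T> task) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return task.call();
            } catch (Exception e) {
                if (attempt == maxAttempts) break; // give up; record the link for later retrieval
                try {
                    Thread.sleep(delayMillis);
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt();
                    break;
                }
            }
        }
        return null;
    }
}
```

A caller that receives null knows the link should be logged for a later pass rather than retried immediately.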

In order to deal with the uncertainty, I generally use a thread for the page or file that must be downloaded, as shown below under WebFileLoader.java.  One reason to use a thread, among others, is to be able to halt the program in a controlled manner.  Once on a thread, the program can wait for the data to load, or for instructions from the user or some other program to halt the load.  Loading a file from the internet is similar to loading it from a hard-drive, except that the file must come from an online resource.  To indicate that the resource is online, a URL is used to create an InputStream, as shown under WebInputStream.java.  The URL class can be found in java.net, while InputStream is from java.io.
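Stripped of the threading, the core of the loading step is simply reading an InputStream to completion.  In this sketch a ByteArrayInputStream stands in for the stream that URL.openStream() would return, so the example runs offline:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

public class StreamReadSketch {
    // Drain an InputStream into a String, buffer by buffer,
    // much as the loader class below does with its byte array.
    public static String readAll(InputStream in) throws IOException {
        StringBuilder sb = new StringBuilder();
        byte[] buf = new byte[8192];
        int n;
        while ((n = in.read(buf)) != -1) {
            sb.append(new String(buf, 0, n));
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // For a live page, the stream would instead come from new URL(title).openStream()
        InputStream in = new ByteArrayInputStream("<html>hello</html>".getBytes());
        System.out.println(readAll(in));
    }
}
```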

Those unfamiliar with Java will note how the program throws exceptions that often must be caught - the most common in this application being IOException.  The program can also be designed to throw user-defined exceptions - for example, on recognizing non-conformance scenarios.  I personally consider this facility useful for recording operational analytics:  for instance, from 2,000 download attempts, there might be 1,677 files downloaded, 201 faulty links, and 122 download failures.  These are reductive statistics that only confirm whether or not a download succeeded.  If a download failure caused the program to record the path of the file, the type of file, and the time of download, it might become apparent that the server could not access specific types of files from particular locations during a certain time period.  If the size in bytes were recorded, a file-size limit might become evident.
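A user-defined exception that carries those operational details might look like the following.  The name DownloadFailure and its fields are illustrative assumptions, not part of the program shown later:

```java
// Hypothetical user-defined exception: beyond signalling failure, it records
// the details needed for operational analytics (path, size, time of attempt).
public class DownloadFailure extends Exception {
    public final String path;       // which file failed
    public final long sizeBytes;    // size observed before the failure, if any
    public final long timeMillis;   // when the attempt was made

    public DownloadFailure(String path, long sizeBytes, long timeMillis) {
        super("Download failed: " + path);
        this.path = path;
        this.sizeBytes = sizeBytes;
        this.timeMillis = timeMillis;
    }
}
```

Catching IOException and rethrowing it wrapped in something like this lets the tallying code aggregate failures by path, type, size, and time instead of just counting them.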

A programming problem I faced many years ago was getting the downloading thread to confirm that it had finished without being limited to one particular order source.  For instance, if the program is downloading a webpage, a binary executable, and an image all at the same time, at some point it might have to say, "The person who ordered this image should please pick it up."  When food is ordered from a fast-food outlet, it would be unproductive for the server to be limited to a single customer.  So the basic idea is to allow for the gradual accumulation of data - and for different parts of the program to place orders and be notified as files are downloaded.  The facility that makes this notification process possible is shown below under Forward.java, which I admit is a peculiar class for most programmers to use.
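The notification idea can be sketched as a minimal reflective callback along the lines of the Forward class:  a customer registers a no-argument method by name, and the loader invokes it when the order is ready.  The names OrderDesk and Customer here are illustrative assumptions:

```java
import java.lang.reflect.Method;

public class OrderDesk {
    private Object listener;
    private Method callback;

    // Register a no-argument method on the listener, looked up by name via reflection.
    public void register(Object listener, String methodName) throws NoSuchMethodException {
        this.listener = listener;
        this.callback = listener.getClass().getDeclaredMethod(methodName);
    }

    // Invoke the registered method - "your order is ready for pickup".
    public Object notifyReady() throws Exception {
        return callback.invoke(listener);
    }

    // Hypothetical customer waiting on a download.
    public static class Customer {
        public boolean notified = false;
        public void pickup() { notified = true; }
    }

    public static void main(String[] args) throws Exception {
        OrderDesk desk = new OrderDesk();
        Customer customer = new Customer();
        desk.register(customer, "pickup");
        desk.notifyReady();
        System.out.println(customer.notified);
    }
}
```

Because the method is looked up by name, the desk never needs to know the customer's type - any part of the program can place an order.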

With the data flowing at different rates, possibly from many locations, a person must then determine what to do with it.  Generally speaking, I download pages and files in bulk and then apply additional processing to extract analytics; it simply takes too long to examine a lot of data directly over the internet.  There is also the concern of how far or deep to search for documents, since links can open up links to infinity.  Certainly, I wouldn't want to download a document from the same location more than once, and I wouldn't want to count a link to that document from a particular page more than once.  Once the files are on a local hard-drive, it is necessary to confront some of the deeper questions, such as:  what now?  Myself, I almost always have a specific objective.
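The once-only bookkeeping can be sketched with a hash set of URLs already fetched.  The class name VisitedLinks is an assumption for illustration:

```java
import java.util.HashSet;
import java.util.Set;

public class VisitedLinks {
    private final Set<String> downloaded = new HashSet<>();

    // True only the first time a URL is offered; the caller skips repeat fetches.
    // Set.add conveniently returns false when the element was already present.
    public boolean shouldDownload(String url) {
        return downloaded.add(url);
    }

    public int count() {
        return downloaded.size();
    }
}
```

A second, per-page set of the same shape would keep a link from being counted twice on one page.  In a real crawler the URLs would also need normalizing first, so that trivially different spellings of the same location aren't fetched twice.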

It appears fashionable for people to download everything and then look for everything.  I think there is a bit of a misunderstanding in terms of how data-mining works in a practical sense.  Really, everything gives us nothing.  If I show children a picture of a forest and ask them to look for things in it, a proposal from one child to look at everything might cause me to say, "Such as . . . ?"  "Let's just systematically go through everything!"  "For example . . . ?"  By the same analogy, even a mining company has some idea what it hopes to find; what it wants to find legitimizes the cost and effort of finding it.  A company doesn't simply buy parcels of land and start combing through them for anything.

The coding below was part of a larger program that I used to scan through the website of a product-testing agency.  I wanted to compile a listing of products that conformed to a particular safety standard.  The data could then be sorted by manufacturer, year of production, and other specific characteristics.  Those who install the product immediately expressed interest in receiving the listing, and I suppose, had I remained in that industry, I would have made every effort to provide it.  Stepping back from the listing, it is possible to gain a historic profile of products that changed as consumer expectations and desires changed - a kind of product morphology related to external events such as fuel shortages, income levels, economic conditions, and social aesthetics.  Happy data-mining.

WebInputStream.java

import java.io.*;
import java.net.*;

public class WebInputStream {
    URL definedURL = null;
    InputStream IS = null;

    public WebInputStream() {
    }

    public WebInputStream(String title) throws IOException {
        getInputStream(title);
    }

    public InputStream getCurrentStream() {
        return IS;
    }

    // Open an InputStream from the given URL string.  Both MalformedURLException
    // and IOException propagate to the caller rather than being swallowed here.
    public InputStream getInputStream(String title) throws MalformedURLException, IOException {
        definedURL = new URL(title);
        IS = definedURL.openStream();
        return IS;
    }
}

 

WebFileLoader.java

import java.io.*;
import java.net.*;

public class WebFileLoader implements Runnable {
    Thread mainT = null;              // worker thread for the download
    boolean jobIsDone = true;         // flag used to halt the load in a controlled manner
    boolean isSafe = true;            // false while the stream is being opened
    InputStream fis = null;           // stream from the online resource
    Forward evoker = null;            // callback notified when the load completes

    String loaded = "";               // accumulated page/file contents
    byte[] array = new byte[100000];  // read buffer

    public WebFileLoader() {}

    public void start() {
        if(mainT == null) {
            mainT = new Thread(this);
            jobIsDone = false;
            mainT.start();
        }
    }

    public void stop() {
        if(mainT != null) {
            jobIsDone = true;
            close();
            mainT = null;
        }
    }

    public void open(String title) {
        isSafe = false;
        try {
            WebInputStream web = new WebInputStream(title);
            fis = web.getCurrentStream();
            isSafe = true;
        }
        catch(MalformedURLException murl) {
            System.out.println("URL Misformed");
            stop();
        }
        catch(IOException iox) {
            System.out.println("Fire IO-open Error");
            stop();
        }
    }

    public void close() {
        if(fis != null) {
            try {
                fis.close();
                fis = null;
            }
            catch(IOException iox) {
                System.out.println("Fire IO-close Error");
            }
        }
    }

    public void pause(int delay) {
        // Thread.sleep is a static method; calling it through mainT was misleading
        // and risked a NullPointerException after stop().
        try { Thread.sleep(delay); }
        catch(InterruptedException iex) { Thread.currentThread().interrupt(); }
    }

    public long size() {
        long len = 0;
        try {
            len = fis.available();
        }
        catch(IOException iox) {
            System.out.println("Fire IO-size Error");
        }
        return len;
    }

    public void skip(long len) {
        try {
            fis.skip(len);
        }
        catch(IOException iox) {
            System.out.println("Fire IO-skip Error");
        }
    }

    public void prepareData(Forward evoker, String title) {
        this.evoker = evoker;
        load(title);
    }

    public void load(String title) {
        reset();
        start();

        if(mainT != null) high(title);
    }

    public void reset() {
        loaded = "";
    }

    public void run() {
        if(!jobIsDone) {

            /* Start entering interesting code here */

        }
    }

    public void high(String title) {
        open(title);

        if(isSafe) {
            try {
                int held = 0;
                while(!jobIsDone) {
                    if((held = fis.read(array)) != -1) {
                        loaded += new String(array, 0, held);
                        pause(5);
                    }
                    else break;
                }
            }
            catch(IOException iox) {
                System.out.println("Fire IO-runsFromMain Error");
            }
            catch(OutOfMemoryError oom) {
                System.out.println("Too much data to preload");
            }

            stop();
        }
        if(evoker != null) evoker.connect();
    }
}

 

Forward.java

import java.util.*;
import java.lang.reflect.*;

public class Forward {
    Method target = null;
    Object current = null;
    Object[] arguments = null;

    public Forward() {}

    public Forward(Object cur, String meth) {
        setNewMethod(cur, meth);
    }

    public void setNewMethod(Object cur, String methodName) {
        current = cur;
        arguments = new Object[] {};

        try {
            // Look up a no-argument method on the target object by name.
            target = (current.getClass()).getDeclaredMethod(methodName, new Class[] {});
        }
        catch(NoSuchMethodException nsme) {
            System.out.println("No such method: " + methodName);
        }
        catch(SecurityException se) {
            System.out.println("Reflection blocked for: " + methodName);
        }
    }

    public Object connect() {
        Object obj = null;
        if(target == null) return obj;    // nothing registered; avoid a NullPointerException
        try {
            obj = target.invoke(current, arguments);
        }
        catch(IllegalAccessException iae) {
            System.out.println("Callback not accessible");
        }
        catch(IllegalArgumentException iae) {
            System.out.println("Callback argument mismatch");
        }
        catch(InvocationTargetException ite) {
            System.out.println("Callback threw: " + ite.getCause());
        }
        return obj;
    }
}

 

 




Comment by Don Philip Faithful on April 9, 2016 at 6:36am

Tunde, after reviewing the code that I posted, I don't see any coding dependencies.  You should be able to build your application making use of the material posted above.  Are you familiar with how to construct GUIs using Java?  Just decide what you would like to do; and then construct the system to do it.  You can access the input stream or the string input from the above.  The hard part for me has always been the "peculiarities" of the language such as the specific methods; this is because I gave my reference books to a coworker.  I should really purchase replacement books.

Comment by tunde on April 8, 2016 at 2:39pm

how can i get the full code sir? really interested. thanks.

Comment by Ogunfunminiyi O. Frankfurt on November 25, 2013 at 8:48am

Good job. Cool for learning about data mining

© 2019 Data Science Central ®