I am not an expert in database design, since for most of my career I have worked with alternative data storage / data access solutions. But one of the very first projects I had to do, back in 1985 when I was a student, was to write the code for a fully functional database architecture, in Pascal, from scratch. You will probably find some of my questions naive, and some intriguing.

  • When were variable-width fields first introduced? How is this type of data stored, depending on the vendor or implementation: arrays, variable arrays, linked lists, or something else? Why is it always necessary to specify a maximum length? Is it to make indexing more efficient, or because of limitations in the way packets are transmitted across an intranet or the Internet? Is NoSQL technology better at dealing with variable-width records?
  • How do you store images or videos in databases? I'm talking about physically storing the videos, e.g. using the 'blob' binary data type. And how do you store vector images? In graph databases?
  • Why is importing CSV columns containing variable-width text so cumbersome with SQL Server? Is it any easier with other database systems?
  • Are there any fast, efficient database clients allowing you to perform some of the computations (maybe sorting or simple analytics) in-memory with traditional SQL: NOT on the database server itself, but locally on your machine, or even on some external machine (see the sketch right after this list)? Or is the only option based on cloud-computing technology, MapReduce, Hadoop, etc.? By fast client, I mean something far superior to Toad or Brio (the two clients I've been working with): basic data selection (without a join) through their interface, on your desktop, is 100 times slower than accessing the database server directly via a Python script.
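
To make the last question concrete, here is a minimal sketch of the kind of client-side, in-memory processing meant above: pull a result set over a standard DB-API connection, then sort and aggregate it locally instead of on the server. The connection target and the table/column names (`sales`, `region`, `amount`) are invented for illustration, and sqlite3 merely stands in for whichever driver your server requires.

```python
# Minimal sketch: fetch rows from a database, then sort/aggregate locally
# in memory instead of on the database server. sqlite3 is used here only
# as a stand-in for any DB-API driver (pyodbc, cx_Oracle, psycopg2, ...);
# the table and column names are hypothetical.
import sqlite3
from collections import defaultdict

conn = sqlite3.connect("local_copy.db")          # replace with your driver's connect()
cur = conn.cursor()
cur.execute("SELECT region, amount FROM sales")  # plain selection, no join
rows = cur.fetchall()                            # bring the data client-side

# Local, in-memory analytics: a sort and a simple group-by aggregate.
rows.sort(key=lambda r: r[1], reverse=True)      # sort by amount, descending
totals = defaultdict(float)
for region, amount in rows:
    totals[region] += amount

print(rows[:10])        # top 10 rows by amount
print(dict(totals))     # per-region totals computed locally
conn.close()
```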

Comment by Arthur Copp on September 2, 2013 at 11:01am

Regarding the last question, I've developed a new n-gram inverted index technique for CSV and other file types.  Rows are selected by wildcard matching column values. Optional commas between wildcard patterns specify column position. It's fast enough for search-as-you-type on over 100 GB text dumps. All it requires is disk space for the index. I'll post something later about it. I have a demo running on a 45 GB Wikipedia snapshot at http://www.grep-wikipedia.com. Just enter random strings and see what you hit.
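
This is not Arthur's implementation, but a toy sketch of the general technique he describes, assuming trigrams and an in-memory index over CSV rows: posting lists are intersected for the query's n-grams, and the surviving candidates are verified with a wildcard match via `fnmatch`. A real version would persist the index to disk and handle the comma-based column positioning he mentions.

```python
# Toy sketch of an n-gram (trigram) inverted index over CSV rows.
# An illustration of the general technique only, not the commenter's
# actual implementation.
import csv
import fnmatch
from collections import defaultdict

def trigrams(text):
    t = text.lower()
    return {t[i:i + 3] for i in range(len(t) - 2)}

def build_index(path):
    index = defaultdict(set)   # trigram -> set of row ids
    rows = []
    with open(path, newline="") as f:
        for row_id, row in enumerate(csv.reader(f)):
            rows.append(row)
            for gram in trigrams(",".join(row)):
                index[gram].add(row_id)
    return index, rows

def search(index, rows, pattern):
    # Use the literal trigrams in the pattern (ignoring wildcards) to
    # narrow down candidates, then verify with a wildcard match.
    literal = "".join(c if c not in "*?" else " " for c in pattern)
    grams = [g for part in literal.split() for g in trigrams(part)]
    candidates = set(range(len(rows)))
    for g in grams:
        candidates &= index.get(g, set())
    return [rows[i] for i in sorted(candidates)
            if fnmatch.fnmatch(",".join(rows[i]).lower(), "*" + pattern.lower() + "*")]

# index, rows = build_index("data.csv")      # hypothetical file
# print(search(index, rows, "new*york"))     # wildcard search
```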

Comment by Chris on July 25, 2013 at 12:05pm

Hello everyone! Talking about Oracle, I came across a cool video about Hadoop and NoSQL: http://youtu.be/1GU4Imbo6R8

Another cool site covers the critical success factors for learning SQL as a team: (1) set audacious goals, (2) learn bit by bit, and (3) know your limits. The title is "What driving dogs tell us about learning."

You have to scroll down the page to see it: http://robertlambert.net/

Comment by Richard Rankin on July 24, 2013 at 1:49pm

For large amounts of text data I have had success using Oracle Text (they change the name frequently). Using CLOBs I can store a great deal of text and manipulate it well. On load I use a simple classifier trained on previous data; in this way I can label newspaper articles as "sports", "world", "national", "travel", etc. As each CLOB is loaded it is indexed using a dictionary (a list of words of interest I've created). Querying uses SQL plus a rich query syntax for "this word but not this word within 30 words" and other complex queries.

I also create thesauri of various types. One simple type might contain synonyms where a Brit might have used the word "hare" when an American would have used "rabbit"; I want a query on one word to cause a hit on the other. There may be domain thesauri, for example a dictionary of medical terms. I used this extensively, and not without problems, in bioinformatics when handling taxonomies: not all taxonomists agree, and a taxonomy is a tree structure where variations may appear anywhere.

A query does not return a simple group of rows but rather a result much like a Google query: an algorithm generates a score based on the document's fit to the query, the results are ordered by this score, and the score is displayed followed by some of the text or dictionary keywords within it. It's quite flexible. Since you can create a cartridge, you can also create storage structures and indexes of your own design using both object and relational capabilities combined. Further, you can design your own optimizer extension to gather data and storage statistics to assist the optimizer in generating query plans.
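
Oracle Text does all of this inside the database; purely as a rough illustration of the dictionary-plus-thesaurus-plus-score idea Richard describes, here is a toy Python version. The dictionary, thesaurus, documents, and scoring rule are invented for the example.

```python
# Toy illustration of dictionary-based indexing with a thesaurus and a
# relevance score, loosely in the spirit of the Oracle Text features
# described above. Dictionary, thesaurus and documents are made up.
dictionary = {"rabbit", "race", "travel", "medal"}
thesaurus = {"hare": "rabbit"}          # a query on one term hits the other

def tokens(text):
    # Normalize words through the thesaurus so synonyms collapse together.
    return [thesaurus.get(w, w) for w in text.lower().split()]

def score(doc, query_terms):
    words = tokens(doc)
    # Score = how many of the query's (normalized) terms appear in the
    # document, restricted to words of interest in the dictionary.
    terms = {thesaurus.get(t, t) for t in query_terms} & dictionary
    return sum(1 for t in terms if t in words)

docs = [
    "The hare lost the race to the tortoise",
    "Travel guide to the national parks",
]
query = ["rabbit", "race"]
for d in sorted(docs, key=lambda d: score(d, query), reverse=True):
    print(score(d, query), d)   # results ordered by score, like the cartridge
```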

Comment by Chris on July 24, 2013 at 12:47pm

SQL is quite complex and detail-demanding; my empathy to you, Dr. Vincent! A year as an analyst and another year of number crunching led me to embrace SQL even more. The nice thing is that SQL makes our minds as simple as the computer can be. The ugly thing is that simplicity is extremely difficult, and it can make database relationships unstable. I learn a lot from you; I did not know that NoSQL existed! Keep going!

Comment by jaap Karman on July 12, 2013 at 4:08am

@Vincent,
I hope the explanation of why a hash method (using internal memory) can be that much faster (50x) than I/O-dependent tools (DBMS SQL) was clear enough.

There is a real difference between a good analyst (you) trying to do some technical computer work and someone at the other edge (me), who understands the technical computer side and extends it to analysts.
I have always enjoyed the cooperation with analysts on this rare edge within IT. Analysts have always been a curse within standard IT, as they are so different from classic operations (OLTP/batch).

The bad performance (very slow response) and the lack of reaction or capability to improve or help are very recognizable. That situation can be changed, although it will sometimes be tough.
 

Your laptop will have some 8 GB of memory, 4 cores running at 2.4 GHz, some 500 GB of disk (DASD), and possibly a basic GPU. These are common high-performance laptop specifications (cost: about $1K).
Compared to the Hollerith age this is tremendously powerful.

Compared to really big-data data centers it is a tiny toy. Whatever is behind the PRISM collection project must be factors bigger: big brother, big data, big budgets, big data centers, big ...
Compared to what is becoming common at commercial enterprises it is also very tiny. I recently joined a discussion about a single server for analytics, specified at 32 cores (hyperthreading not mentioned) and 1 TB of memory; I/O storage and DBMS were not mentioned, but the expectation is many TBs (cost estimate: several $100K).

 

Some limits:

- Processor speed is not increasing anymore (for more than 5 years now); the limits are heat and the speed of light. Instead, multicore and parallel processing (e.g. GPUs, with limited floating-point support) are getting the focus.

- Running out of internal memory can cause thrashing (http://en.wikipedia.org/wiki/Thrashing_(computer_science)); the root cause can be hashing.

This is why real-time monitoring and data analysis with forecasting can be so important. Well-operated computer systems collect a lot of data (big data) that can be analyzed.

A dog chasing its own tail: big data analysis is used to get the operations in place for the working data analysts/scientists.

 

It is the same with all data-access logging and audit trails: they are mostly very big data. With security, access control, fraud prevention, etc., this could be an emerging area. Analyzing this kind of data requires almost the same amount of computer resources as the systems generating it.

Comment by James F on July 11, 2013 at 9:55pm

@Vincent: but to paraphrase an old Perl saying: some people, when confronted with a problem, think "I know, I'll use distributed processing." Now they have two problems.

If a database server slows down because it's been overloaded with client requests, something has gone wrong: either the hardware wasn't sized correctly, or the storage is underspecced, or the networking, or the database schemas, or the queries being fired aren't properly optimised. If you're getting better results from a Perl script on your laptop than from the same join on the server, then kudos to you, but that points to a weakness in the database setup rather than a fundamental problem with the technology. If you see this happening time and again, that points to a weakness in the DBA or infrastructure teams that you're working with.

The cost of using your laptop may seem infinitesimal, but then you have to contend with source control on every one of those laptops, data desynchronisation between different versions, and a lot of other problems that were solved a long time ago by keeping the data and analysis in the RDBMS.

Of course, you could always run the Perl scripts on the server...

Comment by Vincent Granville on July 11, 2013 at 3:05pm

@Jaap: You wrote: "Nothing is faster than keeping it as close to its origin as possible." This is true, but when servers are overloaded by client requests, they become incredibly slow. In my experience, extracting / importing two sets of 50 MM rows from two tables onto my local machine, then running a Perl script to "do a join" (using hash tables), then exporting the result as a new table on the DB server, can be 50 times faster than getting this done within the remote DB server environment with a single basic SQL query (unless maybe you are a true DB guru familiar with all sorts of tricks and platform-dependent query optimization; very few business analysts or analytic scientists are). I've seen this slowness happening time and again. I don't see why the action that I described could not be implemented by the SQL engine, transparently to the user.

Getting more new servers in the cloud, close to the source, would be more efficient, but expensive (the cost of using my laptop is $0). Using the cloud solution (rather than my laptop) works well unless you have many heavy data consumers like myself. If "data consumption" does not grow faster than cloud capacity (given the fixed budget that enterprises can spend on these resources), then yes, the cloud is better than my laptop for most users. But not for me, and since I am an outlier (one of the biggest data users) I can just keep doing my transfers and local processing; I have pretty much automated this process for myself now (or for whoever is interested in my solution).
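
A minimal sketch of the client-side hash join described above, in Python rather than Perl: build an in-memory hash table keyed on the join column of the smaller extract, then stream the larger extract through it. The file names and column positions are hypothetical.

```python
# Client-side hash join over two CSV extracts: build a dict on the smaller
# file's join key, then stream the bigger file through it (inner join).
# File names and column positions are hypothetical.
import csv

def hash_join(small_path, big_path, out_path, key_small=0, key_big=0):
    lookup = {}
    with open(small_path, newline="") as f:
        for row in csv.reader(f):
            lookup.setdefault(row[key_small], []).append(row)

    with open(big_path, newline="") as f, open(out_path, "w", newline="") as out:
        writer = csv.writer(out)
        for row in csv.reader(f):
            for match in lookup.get(row[key_big], []):   # matching rows only
                writer.writerow(row + match)

# hash_join("table_a.csv", "table_b.csv", "joined.csv")
```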

Comment by jaap Karman on July 11, 2013 at 4:11am

Vincent,

Understanding the design principles of an OLTP system and then of a SQL DBMS is, I think, the best approach for working analytics people / data scientists. ACID (http://en.wikipedia.org/wiki/ACID) and DDL/DML/MDL are ways of isolating and hiding the DBA work from the analysts.


A nice document (maybe you've missed it): http://www.bileader.com/A_Practical_Guide_to_Analytics_E-book.pdf

 

An OLTP system was originally designed (in the 1980s) for the online operational process: small updates and retrievals, limited hardware, with failover and recovery very important. SQL came in (IBM, E. F. Codd) to overcome vendor lock-in by the many other (non-relational) DBMS types. The additional cost/overhead was accepted, weighed against the dedicated knowledge (HR capital) that would otherwise be needed.

Optimizing meant randomizing the locations of records as much as possible. The goal is to minimize the chance that different users try to update the same data locations. Locking (deadlock prevention), enqueuing, journalling and logging all add to it. The result is that one simple record request causes a lot of I/O.

 

As an analyst you want data not in small amounts but in big amounts.

As an analyst you would have many values precalculated; this is where cubes (Teradata, Netezza) come in.

As an analyst you would like columns to be accessed fast, not rows. That is where e.g. Vertica comes in:

https://my.vertica.com/docs/5.1.6/PDF/Concepts%20Guide.pdf
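
Purely as a schematic illustration of that row-store vs. column-store point (not Vertica's actual storage engine), with invented data: when a query touches only one column, a columnar layout lets it read just that array instead of visiting every row.

```python
# Toy contrast of row-oriented vs column-oriented layouts for an
# analyst-style query that reads only one column. Invented data.
rows = [("alice", 34, 120.5), ("bob", 29, 98.0), ("carol", 41, 150.2)]

# Row store: every full row must be visited just to read the third field.
total_row_store = sum(r[2] for r in rows)

# Column store: the same data kept as one array per column; the query
# touches only the "amount" column and skips the rest entirely.
columns = {
    "name":   [r[0] for r in rows],
    "age":    [r[1] for r in rows],
    "amount": [r[2] for r in rows],
}
total_column_store = sum(columns["amount"])

assert total_row_store == total_column_store
```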

All of this is driven by optimizing around I/O constraints; the hardware is changing, and perhaps the classic I/O can be eliminated.
Precalculating in cubes (Kimball/Inmon) means defining and optimizing for read-only high speed. Grid processing (servers), in-memory, and clustering (the Hadoop hype) are what have the focus now.
Going back to the desktop PC to do processing is not the future (anymore), as it was many years ago.

Something new is evolving: code can be processed on the DBMS side (see Richard's note), sometimes referred to as in-database processing. Nothing is faster than keeping the work as close to the data's origin as possible.

The idea: the data does not move out of the DBMS; instead the code (processing the data) moves into the DBMS itself.

With PMML/data mining, the scoring node can be deployed to most modern databases. The scoring model is mostly rather simple (Java) code to deploy.
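
To make the "move the code to the data" point concrete, here is a small sketch contrasting the two patterns, with Python's built-in sqlite3 as a stand-in DBMS: pushing the aggregation into SQL means only the small result set ever leaves the engine. Table and column names are invented for illustration.

```python
# Sketch of in-database processing vs. pulling raw rows to the client,
# using sqlite3 as a stand-in DBMS. Table/column names are invented.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (customer TEXT, amount REAL)")
conn.executemany("INSERT INTO events VALUES (?, ?)",
                 [("a", 10.0), ("b", 5.0), ("a", 2.5)])

# Pattern 1: pull every row out of the DBMS and aggregate client-side.
totals = {}
for customer, amount in conn.execute("SELECT customer, amount FROM events"):
    totals[customer] = totals.get(customer, 0.0) + amount

# Pattern 2: in-database processing -- the engine does the work and only
# the aggregated result set moves to the client.
in_db = dict(conn.execute(
    "SELECT customer, SUM(amount) FROM events GROUP BY customer"))

assert totals == in_db
print(in_db)   # {'a': 12.5, 'b': 5.0}
conn.close()
```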
     

Comment by Richard Rankin on July 10, 2013 at 3:31am

Hadoop and similar database types use a completely different approach to attain their speeds. But of course, there's no such thing as a classical "join" in Hadoop/MapReduce.

Comment by Richard Rankin on July 10, 2013 at 3:28am

Vince, creating a cluster for a relational database requires shared storage, and since you are shipping a lot of data, you have to set up another network for that purpose using the fastest (and most expensive) communication hardware and media, such as InfiniBand or fast fiber. This is also the reason the nodes are usually near one another. Working at high speeds, you need to remember that in a nanosecond light travels less than a foot. I read about a guy working on a transpacific Internet line whose bosses kept asking for better performance. So he worked it all out mathematically, considering the media, switching, etc., then went to the next meeting, handed out his report and said, "We can't get more speed out of the line because we've hit the limit of the speed of light." The only solution would be to lay more cable across the Pacific, a very expensive thing to do.
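
For a rough sense of that limit: assuming a transpacific route of about 9,000 km and light travelling in fiber at roughly two-thirds of c, the propagation delay alone is tens of milliseconds before any switching overhead. The figures below are illustrative assumptions, not measurements of any real cable.

```python
# Back-of-the-envelope propagation delay for a transpacific link.
# Distance and fiber speed are illustrative assumptions, not measurements.
route_km = 9_000                  # rough cable route length, assumed
c_km_per_s = 300_000              # speed of light in vacuum
fiber_speed = c_km_per_s * 2 / 3  # light in fiber travels at ~2/3 c

one_way_ms = route_km / fiber_speed * 1000
print(f"one-way: {one_way_ms:.0f} ms, round trip: {2 * one_way_ms:.0f} ms")
# -> roughly 45 ms one-way, 90 ms round trip, before any switching overhead
```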
