
I am not an expert in database design, since most of my career I have worked with alternative data storage / data access solutions. But one of the very first projects I had to do back in 1985, when I was a student, was to write the code for a fully functional database architecture, in Pascal, from scratch. You will probably find some of my questions naive, and some intriguing.

  • When were variable-width fields first introduced? How is this type of data stored, depending on vendor or implementation: arrays, variable arrays, linked lists, or something else? Why is it always necessary to specify a max length? Is it to make indexing more efficient, or because of limitations in the way packets are transmitted across an intranet or the Internet? Is NoSQL technology better at dealing with variable-width records?
  • How do you store images or videos in databases? I'm talking about physically storing the videos, e.g. using the 'blob' binary data type. And how do you store vector images? In graph databases?
  • Why is importing CSV columns containing variable-width text so cumbersome with SQL Server? Is it any easier with other database systems?
  • Are there any fast, efficient database clients allowing you to perform some of the computations (maybe sorting or simple analytics) in-memory with traditional SQL: NOT on the database server itself, but locally on your machine, or even on some external machine? Or is the only option based on cloud-computing technology, MapReduce, Hadoop, etc.? By fast client, I mean something far superior to Toad or Brio (the two clients I've been working with): basic data selection (without a join) through their interface, on your desktop, is 100 times slower than connecting to the database server directly via a Python script.
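
(For illustration only, here is a minimal Python sketch of the kind of client-side processing the last question is about: pull one plain, join-free SELECT from the server, then do the sorting and simple analytics against an in-memory SQLite scratch database on the local machine. The table and column names, `sales`, `region`, `amount`, are made up, and `server_conn` stands in for whatever DB-API driver connection you use.)

```python
import sqlite3

def analyze_locally(server_conn):
    # One cheap, join-free pull from the (busy) database server.
    cur = server_conn.cursor()
    cur.execute("SELECT region, amount FROM sales")   # hypothetical table
    rows = cur.fetchall()

    # Load the extract into an in-memory SQLite scratch database.
    local = sqlite3.connect(":memory:")
    local.execute("CREATE TABLE sales (region TEXT, amount REAL)")
    local.executemany("INSERT INTO sales VALUES (?, ?)", rows)

    # Sorting and simple analytics now run against local RAM, not the server.
    return local.execute(
        "SELECT region, SUM(amount) AS total "
        "FROM sales GROUP BY region ORDER BY total DESC"
    ).fetchall()
```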


Comment by Arthur Copp on September 2, 2013 at 11:01am

Regarding the last question, I've developed a new n-gram inverted index technique for CSV and other file types. Rows are selected by wildcard-matching column values, and optional commas between wildcard patterns specify column position. It's fast enough for search-as-you-type on text dumps of over 100 GB; all it requires is disk space for the index. I'll post something later about it. I have a demo running on a 45 GB Wikipedia snapshot at http://www.grep-wikipedia.com. Just enter random strings and see what you hit.
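
(For readers curious about the general idea, this is a toy Python sketch, not Arthur's implementation: an n-gram inverted index maps every trigram to the set of row ids containing it, a wildcard query is narrowed by intersecting the posting lists of the literal trigrams in the pattern, and the few surviving candidates are verified with an ordinary wildcard match.)

```python
import fnmatch
from collections import defaultdict

def build_trigram_index(rows):
    index = defaultdict(set)
    for rid, line in enumerate(rows):
        for i in range(len(line) - 2):
            index[line[i:i + 3]].add(rid)      # trigram -> set of row ids
    return index

def search(rows, index, pattern):
    # Literal fragments of length >= 3, obtained by splitting on wildcards.
    literals = [p for p in pattern.replace("?", "*").split("*") if len(p) >= 3]
    candidates = set(range(len(rows)))
    for lit in literals:
        for i in range(len(lit) - 2):
            candidates &= index[lit[i:i + 3]]  # shrink the candidate set fast
    # Verify the (few) survivors with a real wildcard match.
    return [rows[r] for r in sorted(candidates)
            if fnmatch.fnmatchcase(rows[r], pattern)]

rows = ["alice,42,red", "bob,17,blue", "carol,99,red"]
index = build_trigram_index(rows)
print(search(rows, index, "*,red"))   # -> ['alice,42,red', 'carol,99,red']
```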

Comment by Chris on July 25, 2013 at 12:05pm

Hello everyone! Talking about Oracle, I came across a cool video http://youtu.be/1GU4Imbo6R8 about Hadoop and NoSQL.

Another cool site covers critical success factors for learning SQL as a team: (1) set audacious goals, (2) learn bit by bit, and (3) know your limits. The title is "What driving dogs tell us about learning".

You have to scroll down the page to see it http://robertlambert.net/ 

Comment by Chris on July 24, 2013 at 12:47pm

SQL is quite complex and detail-demanding; my sympathies to you, Dr. Vincent! A year as an analyst and another year of number crunching led me to embrace SQL even more. The nice thing is that SQL makes our thinking as simple as the computer's. The ugly thing is that simplicity is extremely difficult and can make database relationships unstable. I learn a lot from you; I did not know that NoSQL existed! Keep going!

Comment by jaap Karman on July 12, 2013 at 4:08am

@Vincent,
I hope the explanation of why a hash method (using internal memory) can be that much faster (50x) than I/O-dependent (DBMS SQL) tools was clear enough.

There is a real difference between a good analyst (you) trying to do some technical computer work and someone at the other edge (me) who understands the technical computer side and extends it toward analysts.
I have always enjoyed cooperating with analysts on this rare edge within IT. Analysts have always been something of a curse within standard IT, as they are so different from classic operations (OLTP/batch).

The bad performance (very slow response) and the lack of reaction or capability to improve or help is very recognizable. That situation is something that can be changed, although it will be tough sometimes.
 

Your laptop will have some 8 GB of memory, 4 cores running at 2.4 GHz, some 500 GB of disk (DASD) and possibly a basic GPU. These are common high-performance laptop specifications (cost: about $1K).
Compared to the Hollerith age this is tremendously powerful.

Compared to really big data centers it is a tiny toy. Whatever is behind the PRISM collection project must be factors bigger: big brother, big data, big budgets, big data centers, big ...
Compared to what is becoming common at commercial companies it is also very tiny. I recently joined a discussion about a single server for analytics specified at 32 cores (hyperthreading not mentioned) and 1 TB of memory; I/O storage and the DBMS were not mentioned, but the expectation is many TBs (cost estimate: several $100K).

 

Some limits:

- Processor speed has stopped increasing (for more than 5 years now), limited by heat and the speed of light. Instead, multicore and parallel processing (e.g. GPUs, with their limited floating-point support) are getting the focus.

- Running out of internal memory can cause thrashing (http://en.wikipedia.org/wiki/Thrashing_(computer_science)); the root cause can be hashing.

This is why real-time monitoring and data analysis with forecasting can be so important. Well-operated computer systems collect a lot of data (big data) that can be analyzed.

A dog chasing its own tail: big data analysis is used to get the operations in place for the working data analysts/scientists.

 

It is the same with all data-access logging and audit trails, which are mostly very big data. With security, access control, fraud prevention, etc., this could be an emerging area. Analyzing this kind of data requires almost the same amount of computer resources as the systems generating it.

Comment by James F on July 11, 2013 at 9:55pm

@Vincent: but to paraphrase an old Perl saying: some people, when confronted with a problem, think "I know, I'll use distributed processing." Now they have two problems.

If a database server slows down because it's been overloaded with client requests, something has gone wrong: either the hardware wasn't sized correctly, or the storage is underspecced, or the networking, or the database schemas, or the queries being fired aren't properly optimised. If you're getting better results from a Perl script on your laptop than from the same join on the server, then kudos to you, but that points to a weakness in the database setup rather than a fundamental problem with the technology. If you see this happening time and again, that points to a weakness in the DBA or infrastructure teams that you're working with.

The cost of using your laptop may seem infinitesimal, but then you have to contend with source control on every one of those laptops, data desynchronisation between different versions, and a lot of other problems that were solved a long time ago by keeping the data and analysis in the RDBMS.

Of course, you could always run the perl scripts on the server...

Comment by Vincent Granville on July 11, 2013 at 3:05pm

@Jaap: You wrote: "Nothing is faster than keeping it as close to its origin as possible." This is true, but when servers are overloaded by client requests, they become incredibly slow. In my experience, extracting / importing two sets of 50 MM rows from two tables onto my local machine, then running a Perl script to "do a join" (using hash tables), then exporting the result as a new table on the DB server, can be 50 times faster than getting this done within the remote DB server environment with a single basic SQL query (unless maybe you are a true DB guru familiar with all sorts of tricks and platform-dependent query optimization; very few business analysts or analytic scientists are). I've seen this slowness happening time and again. I don't see why the action that I described could not be implemented by the SQL engine, transparently to the user. Getting more new servers in the cloud, close to the source, would be more efficient, but expensive (the cost of using my laptop is $0). Using the cloud solution (rather than my laptop) works well unless you have many heavy data consumers like myself. If "data consumption" does not grow faster than cloud capacity (given the fixed budget that enterprises can spend on these resources), then yes, the cloud is better than my laptop for most users. But not for me, and since I am an outlier (one of the biggest data users), I can just keep doing my transfers and local processing; I have pretty much automated this process for myself now (and for whoever is interested in my solution).
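
(For the curious, a minimal Python analogue of the hash-table join described above; the original is in Perl. The file names and the assumption that the join key is the first comma-separated field are purely illustrative.)

```python
import csv
from collections import defaultdict

def hash_join(small_path, large_path, out_path):
    # Build a hash table on the smaller extract, keyed by the join column.
    lookup = defaultdict(list)
    with open(small_path, newline="") as f:
        for row in csv.reader(f):
            lookup[row[0]].append(row[1:])          # key -> list of payloads

    # Stream the larger extract past it: one pass, nothing sorted.
    with open(large_path, newline="") as f, \
         open(out_path, "w", newline="") as out:
        writer = csv.writer(out)
        for row in csv.reader(f):
            for payload in lookup.get(row[0], []):  # inner join: skip misses
                writer.writerow(row + payload)

# e.g. hash_join("small_extract.csv", "large_extract.csv", "joined.csv")
```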

Comment by jaap Karman on July 11, 2013 at 4:11am

Vincent,

Getting to know the design principles of an OLTP system, and then of a SQL DBMS, is I think the best approach for working analytics people / data scientists. ACID (http://en.wikipedia.org/wiki/ACID) and DDL/DML/MDL are ways of isolation that hide the DBA work from the analysts.


A nice document (maybe you've missed it): http://www.bileader.com/A_Practical_Guide_to_Analytics_E-book.pdf

 

OLTP was originally designed (in the 1980s) for online operational processes: small updates and retrievals, limited hardware, and failover/recovery very important. SQL came in (IBM, E. F. Codd) to overcome the vendor lock-in of the many other (non-relational) DBMS types. The additional cost/overhead was accepted, weighed against the dedicated knowledge (HR capital) otherwise needed.

Optimizing meant randomizing the locations of records as much as possible. The goal is to minimize the chance that different users are trying to update the same data locations. Locking (deadlock prevention), enqueueing, journalling and logging all add to it. The result is that one simple record request causes a lot of I/O.

 

As an analyst you want data not in little amounts but in big amounts.

As an analyst you would like many values precalculated; this is where cubes (Teradata, Netezza) come in.

As an analyst you would like columns, not rows, to be accessed fast. That is where e.g. Vertica comes in.

https://my.vertica.com/docs/5.1.6/PDF/Concepts%20Guide.pdf

Everything is driven by optimizing around I/O constraints; the hardware is changing, and perhaps the classic I/O can be eliminated.
Precalculating in cubes (Kimball/Inmon) means defining and optimizing for read-only high speed. Grid processing (servers), in-memory, and clustering (the Hadoop hype) are what have the focus now.
Going back to the desktop PC to do processing is not the future (anymore), as it was many years ago.

Something new is evolving: code can be executed on the DBMS side (see Richard's note), sometimes referred to as in-database processing. Nothing is faster than keeping the processing as close to the data's origin as possible.

The idea: the data does not move out of the DBMS; instead, the code (processing the data) moves into the DBMS itself.

With PMML / data mining, the scoring node can be deployed to most modern databases. The scoring model is mostly rather simple (Java) code to deploy.
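
(A toy sketch of the "move the code to the data" idea, using Python's built-in SQLite purely as a stand-in for a real DBMS; real deployments would use the vendor's UDF or stored-procedure mechanism, or a PMML scoring engine. The model and table below are invented for illustration.)

```python
import math
import sqlite3

def score(age, income):
    # Hypothetical logistic-style model, standing in for a deployed scoring node.
    return round(1.0 / (1.0 + math.exp(-(0.03 * age + 1e-5 * income - 2.0))), 4)

conn = sqlite3.connect(":memory:")
conn.create_function("score", 2, score)     # register the code inside the engine
conn.execute("CREATE TABLE customers (id INTEGER, age REAL, income REAL)")
conn.executemany("INSERT INTO customers VALUES (?, ?, ?)",
                 [(1, 35, 52000), (2, 61, 87000)])

# The rows are scored where they live; no data is exported to the client.
for row in conn.execute("SELECT id, score(age, income) FROM customers"):
    print(row)
```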
     

Comment by James F on July 10, 2013 at 1:28am

I'm not sure what's making your CSV import so cumbersome. Certainly SQL Server has made tradeoffs in functionality and ease-of-use that can make that difficult, but with my grumpy DBA hat on I'd make some handwaving remark about how the data needs sanitising before it goes into the database. In Postgres, it's three lines of code to stuff CSV into a database (although you can get caught out with things like encoding; the devil is in the details on that one).
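
(For reference, one way the Postgres route can look when driven from Python with psycopg2; the connection string, table name and file name are placeholders, and the encoding/quoting options are exactly where the details tend to bite.)

```python
import psycopg2

conn = psycopg2.connect("dbname=mydb user=me")        # placeholder credentials
with conn, conn.cursor() as cur, open("data.csv", encoding="utf-8") as f:
    # Stream the CSV straight into an existing table via COPY ... FROM STDIN.
    cur.copy_expert("COPY my_table FROM STDIN WITH (FORMAT csv, HEADER true)", f)
```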

It's intriguing to think about what a distributed database would look like, but I can't help feeling the I/O and other inefficiencies will punish you. Expensive work for a database is often in sorts and aggregations, which aren't always things you can distribute in the way that you could with, say, a project like SETI.

Thinking of MapReduce et al.: if you consider a Hadoop cluster, I don't think you tend to have the individual nodes far apart on the network. Modern databases already have pretty efficient techniques to join large tables, but they rely on big, powerful processors and lots of fast RAM, which most laptops don't have. I'd argue that if you have to force your business analysts to worry about DB architecture, then your DB admins and architects aren't doing their jobs properly. SQL's not a difficult thing to learn; it's the way people build database schemas that makes it hard to use.

Comment by Vincent Granville on July 9, 2013 at 6:54pm

Here's my answer to some comments about using cloud technology deployed on employee laptops, as a potential solution to in-memory, very intense computations:

There are also big security issues with secure, enterprise-grade cloud, be it internal, external (Amazon), or hybrid. The security issues of using employee laptops are different; in some ways the risk might be higher, but it is possibly better distributed, maybe creating a stronger shield against network attacks if these computers are decentralized.

What I had in mind is optimizing a problem such as a user joining a 150-million-row table with a 30-million-row table. My idea was to have the data (the two tables) moved locally to his laptop, the join (of course not an outer join) performed locally and transparently on his laptop using efficient techniques (hash tables) without the user being aware of it, and the result, after 2 or 3 hours of processing, stored in a local file or sent back as a newly created table on the DB server, depending on where the SQL query wants the output to be located. If the join process is distributed over the laptops of 20 fellow employees (MapReduce-style), the task could be completed in 20 minutes.
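
(A minimal Python sketch of the partitioning step that would make such a distribution possible; the worker count, row format and CRC32 bucketing are illustrative, not a description of any existing product. Hash each row's key into one of N buckets, ship bucket i of both tables to worker i, and let each worker run an ordinary local hash join on its own bucket.)

```python
import zlib

N_WORKERS = 20

def bucket_of(key, n_workers=N_WORKERS):
    # Stable hash, so every machine assigns the same key to the same bucket.
    return zlib.crc32(str(key).encode("utf-8")) % n_workers

def partition(rows, n_workers=N_WORKERS):
    # rows are (key, payload) pairs; each bucket goes to one worker.
    buckets = [[] for _ in range(n_workers)]
    for key, payload in rows:
        buckets[bucket_of(key, n_workers)].append((key, payload))
    return buckets

def local_join(small_bucket, large_bucket):
    # The per-worker step: an ordinary in-memory hash join on one bucket.
    lookup = {}
    for key, payload in small_bucket:
        lookup.setdefault(key, []).append(payload)
    return [(key, large_payload, small_payload)
            for key, large_payload in large_bucket
            for small_payload in lookup.get(key, [])]

# Worker i runs local_join(small_buckets[i], large_buckets[i]);
# concatenating the per-worker results gives the full inner join.
```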

This would be very helpful for business analysts or purely analytical people (lacking deep DB architecture understanding), letting them complete queries that would typically be denied on most systems, or would take ages to complete.

Comment by Vincent Granville on July 9, 2013 at 2:28pm

I am familiar with creating indexes, but not so much with a number of other aspects such as caching or data compression. As a high-level end user, my perspective (you will surely find it naive or simplistic) was that with so much unused RAM on the thousands of employee desktops and laptops in any big company, and database servers that are so overused and thus slow, there ought to be a solution where computing power can be moved from DB servers to your local machine, or better, to all the local machines of all employees, in some sort of "employee cloud" distributed architecture: a bit like MapReduce, but where the "external" cloud is replaced by an "internal" open botnet made of employee computers. Of course, when an employee switches off his machine this might cause problems, but I've designed systems (not DBMS systems) that can be restarted without data loss and with minimal efficiency loss right after Internet failures, even when they occur every 5 minutes.
