Java Forum

Ask Question   UnAnswered
Home » Forum » Java       RSS Feeds

RDB design, Avoiding Duplicate Records

  Asked By: Arland    Date: Mar 14    Category: Java    Views: 1413

As you may know one of the best practices in relational database design is the use of numeric primary keys which causes to boost the performance in data manipulation and retrieval.

Unfortunately there are few ways to achieve this goal and of course one of them would be use of the embedded database “AUTO NUMBER” feature which is basically very profitable for smaller applications, but it’s absolutely not reliable because of allowance of duplicate records. For instance, inserting the same person twice into the DB will cause having 2 different records.

In this situation, what comes to mind at first is to make up a PK according to the every record specifications. For example, to insert a university student data into a table, the numeric PK would consist of year of admission, score, … which makes the student No.
But, unfortunately you can not always make up such a combination for every record.

So, the other option, as I think, is the use of HASH ALGORITHMS which produce unique numeric outputs based on an input String. For instance you give the algorithm a combination of some specific characteristics as a String input and then you receive a unique number as output.

Now, what I wanted to know is that can anybody propose a better solution?
Does anyone have such a Hash Algorithm?



7 Answers Found

Answer #1    Answered By: Ora Hanson     Answered On: Mar 14

First, my source of confusion is why would you insert duplicate  data
into your database. You may need to check your object in the UI, if
applicable, and then insert  it into database.

Anyway, I wouldn't go for Hashing. The point is that you will have to
check your database  quite a few times to find the right id. The world
would be a better place if we could define a hash function that maps
our objects to unique  ids.

You could alternatively rely on your database auto numbering, but you
will need to put it in a stored procedure to check whether not the
record is already in database. Your database is indexed and a String
search is fast anyway. On top of that, putting this logic in the
database will give  you scalability at a minor cost. You always have
the option  to add another machine to the cluster if the performance

Answer #2    Answered By: Angel Watkins     Answered On: Mar 14

Many Thanks for your reply, but there is no source of confusion.
when you are using auto number, insertion of duplicate  record happens alot.
besides, please consider that such preventions should not occur in UI at all, because of the severe performance  degradation in searching DB for every inserting  record. also, as benchmarks say, PK-String-search in a db will never be as fast as numeric  search.
of course as i said, in smaller  application you can choose this strategy and even using Hash Algorithms with String output  (such as SHA-1and MD5) would be good. but in larger ones, only Numeric PK is recommended.

Answer #3    Answered By: Burkett Bernard     Answered On: Mar 14

tell me about your application. It helps a lot if I know how you have duplicate  records.

Answer #4    Answered By: Perdita Lopez     Answered On: Mar 14

let me give  you a complete example:
suppose a table  of personnel information including the following fields.
id, firstname, lastname, father_name and age.
when you insert  data:
Jack, Smith, adam, 27
it will insert with an integer PK (like 987)
now if you accidentally insert those data  again, it will insert another record  with another PK (like 988) which is not desireable
now what I want is that I want to control generating primary  keys based  on the given data. I need a hash algorithm  which gives numeric  outputs as follows.
JackSmithadam27----->1025458218------->PK exist

Answer #5    Answered By: Faith Hughes     Answered On: Mar 14

this is where you define rule engines to avoid the load to be pushed on the db server or unnecessary code on UI.

your solution (hashing) might not be a better alternative to the first one (according to your email) and I don't think  if you end up with duplicate  PKs. but you might having different records  with different PKs but representing the same object. this is where you can use sequence in oracle or identity() in sql server to identify the object

most of the time you can let DB server to do the work by defining PK on columns. i.e. in sql server when you define a pk on multiple columns therefore the server will enforce uniqueness by generating unique  index for the pk columns. creation of the index involves some kind of hashing and it is a reliable and efficient technique. this is irrespective of the types of the pks (string or integer) but if columns contain null the ok generation function can ignore the content and still construct a unique key

still even for small scale applications, you might consider some light-weigth rule engine approach,

Answer #6    Answered By: Dinh Tran     Answered On: Mar 14

this defining PK on different columns seems quite an interesting way, but here 2 questions arise:
1- are these generated PKs integer?
2- can I apply this feature  in other databases? cause I use MySQL and I haven't yet encountered such a feature in it.

Answer #7    Answered By: Ann Evans     Answered On: Mar 14

this defining PK on different columns seems quite an interesting way, but here 2 questions arise:
1- are these generated PKs integer?

Reply : This does not matter, but the only restriction is that the column that you are defining a PK for should not be NULL.

2- can I apply this feature  in other databases? cause I use MySQL and I haven't yet encountered such a feature in it.
Reply : As far as I know this is definitely supported by Oracle and SQL Server, I'm not too sure about MySQL.

Didn't find what you were looking for? Find more on RDB design, Avoiding Duplicate Records Or get search suggestion and latest updates.