Saturday, July 16, 2011

Big Data storage etc

Peace S/W utilities - 100 row inserts per power bill in CO

Riak
Cassandra
PostgreSQL9 has more unstructured data features
NoSQL
MongoDB
up-front schema design important
Robert O'Brien
ACID, BASE -- eventual consistency -- consensus problem
can't do shared nothing with bank balance e.g.
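A minimal sketch of the bank balance point, assuming two independent replicas that each accept withdrawals against their own copy and only reconcile later. The `Replica` class and `reconcile` function are hypothetical names for illustration, not any particular database's API.

```python
# Hypothetical illustration: two shared-nothing replicas accept withdrawals
# independently and reconcile later (BASE style).

class Replica:
    def __init__(self, balance):
        self.balance = balance   # this replica's local view of the balance
        self.pending = []        # withdrawals not yet shared with peers

    def withdraw(self, amount):
        # Each replica checks only its own, possibly stale, view.
        if amount > self.balance:
            return False
        self.balance -= amount
        self.pending.append(amount)
        return True


def reconcile(start_balance, replicas):
    # Merge step: apply every withdrawal that any replica accepted.
    total = sum(sum(r.pending) for r in replicas)
    return start_balance - total


a, b = Replica(100), Replica(100)
ok_a = a.withdraw(80)   # accepted locally
ok_b = b.withdraw(80)   # also accepted: b has not seen a's withdrawal
final_balance = reconcile(100, [a, b])
print(ok_a, ok_b, final_balance)
```

Both withdrawals succeed locally, so the merged balance goes negative. Avoiding that needs coordination between the replicas, which is the consensus problem noted above.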
oops - Guy left!
hybrid solutions - relational plus NoSQL -- by hand
Hadoop
Guy's back! :)
FOO - NYC? show us your stack
gone off MongoDB
(why?)
if MySQL solves problem, use it
PostgreSQL h-store engine? new
high-write vs high-read/query reqs
operational considerations
#machines
how to develop
deployment, evolution
SQL not relational (?) - R O'B
Glen Barnes
flume central server
OLTP + analytical - conflicting needs
solution: replicate (denormalize)?
robot filter
geocode
...
other anomalies
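A rough sketch of the replicate-and-denormalize idea, assuming the analytical copy is built by filtering robot traffic and flattening user and location data into each row. All the data, the `GEOCODE` lookup and the `is_robot` rule are made up for illustration.

```python
# Hypothetical normalized, OLTP-style records.
users = {
    1: {"name": "alice", "city": "Wellington"},
    2: {"name": "crawler", "city": "Auckland"},
}
events = [
    {"user_id": 1, "page": "/bill", "agent": "Mozilla/5.0"},
    {"user_id": 2, "page": "/bill", "agent": "ExampleBot/1.0"},
    {"user_id": 1, "page": "/pay", "agent": "Mozilla/5.0"},
]
# Stand-in for a geocoding service: city -> approximate (lat, lon).
GEOCODE = {"Wellington": (-41.29, 174.78), "Auckland": (-36.85, 174.76)}


def is_robot(agent):
    # Crude robot filter (an assumption; real filters are more involved).
    return "bot" in agent.lower()


def denormalize(events, users):
    """Build flat, read-optimized rows for the analytical copy."""
    rows = []
    for e in events:
        if is_robot(e["agent"]):
            continue  # drop robot traffic before it reaches the analytics
        u = users[e["user_id"]]
        lat, lon = GEOCODE[u["city"]]
        rows.append({"name": u["name"], "city": u["city"],
                     "lat": lat, "lon": lon, "page": e["page"]})
    return rows


analytic_rows = denormalize(events, users)
print(analytic_rows)
```

The OLTP side keeps the normalized tables. The analytical side gets wide rows it can scan without joins, at the cost of duplicated data.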
Guy: what about huge records (but not necessarily many of them)?
MongoDB GridFS (grid file system)
e.g. Radio Telescope etc.
CFD struct mech
scientific
hard to nail down rigid schema for metadata
Petabyte databases
conference MySQL
1P records
(link from Mark Derricutt to Guy)
TB per record
medical scans
x-ray @ cellular level 30GB per record / blob
Provenance - algorithm version etc. - where data came from and how processed
Guy involved with EU grid provenance project til 2006
use cases organ transplant privacy
Aircraft design
German Aerospace
flight simulation
airflow / CFD / CAD meshes - pressure on surface - FEM mech deformation
100 CPU hours per run - don't want to rerun when forgot little thing like algo version
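One way the provenance idea could look in practice: a sketch, assuming each result is stored together with the algorithm name, its version, its parameters and a hash of the input. The `Provenance` fields and the `run_with_provenance` helper are hypothetical, not taken from the EU grid project mentioned above.

```python
import hashlib
import json
from dataclasses import dataclass, asdict


@dataclass
class Provenance:
    algorithm: str           # which code produced the result
    algorithm_version: str   # the "little thing" that is easy to forget
    parameters: dict         # settings used for this run
    input_sha256: str        # fingerprint of the input data


def fingerprint(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def run_with_provenance(data: bytes, algorithm: str, version: str, params: dict):
    # Placeholder for the expensive computation (e.g. a long CFD run).
    result = len(data)
    prov = Provenance(algorithm, version, params, fingerprint(data))
    # Store the result and its provenance together so they cannot drift apart.
    return {"result": result, "provenance": asdict(prov)}


record = run_with_provenance(b"mesh bytes", "cfd-solver", "1.4.2", {"mach": 0.8})
serialized = json.dumps(record, sort_keys=True)
print(serialized)
```

Writing the record at the moment the run finishes means nobody has to reconstruct the algorithm version afterwards, or rerun the job to find out.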
Rob O'B no longer complete self-contained record in one place - use queues more
Gearman - Tim ?? session
graph database - transactions, actor state
neo4j ?
Redis
CouchDB
memcachedb
greenplum?
STB amazon
XML DBs: Tamino? eXist? Berkeley DB?
VoltDB
Stonebraker - destroying his reputation / selling soul?
all redundancy in H/W - everything in RAM
high-freq trading
Mark D: Spring Data - Hibernate
Tokyo? Kyoto? Sky were using Tokyo?
Hadoop / Riak / Cassandra
Mongo most popular by far ?
promoted self by speed claims (but will lose data sometimes)
Guy teaching in Honors paper
Pig - query-based languages
map-reduce
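For reference, the map-reduce pattern in miniature: a single-process sketch of a word count, assuming the usual map, shuffle and reduce phases. Hadoop distributes these phases across machines; the function names here are illustrative only.

```python
from collections import defaultdict


def map_phase(doc):
    # Emit a (key, value) pair for every word in the document.
    for word in doc.split():
        yield (word, 1)


def shuffle(pairs):
    # Group all values by key, as the framework does between map and reduce.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    # Combine all values for one key into a single result.
    return key, sum(values)


docs = ["big data can be quite small", "big data storage"]
pairs = [pair for doc in docs for pair in map_phase(doc)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
print(counts)
```

Languages like Pig compile higher-level queries down to chains of jobs with this shape.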
couch targeting mobile
non-robust links
big data can be quite small
Data Science - O'Reilly - Data Size?
neo4j? for scientific data (who?)
not transactional use case
=> graphs analysis
millions of nodes - look for patterns
perturbations - see whether network evolves
graphviz
neo4j: GPL for cut-down version - Affero GPL (AGPL) license for full ACID-compliant version
designed to work around 'the cloud hole'
AGPL - must make code available even if not distributing modified project
InfiniteGraph
torquebase? (Tim ?)
Gremlin? query language on top of neo4j
HF trading = FPGA GPU etc coming in - big $ for sub-ms ping times in same datacentre as market servers
need to be first
Bitcoin
Oracle license $1M US - Peace
2 days on paper - call centre
