Course Outline
Each session lasts 2 hours
Day-1: Session-1: Business Overview: Why Big Data Business Intelligence in Government
- Case Studies from NIH, DoE
- Big Data adoption rates in Government Agencies & how they are aligning future operations around Big Data Predictive Analytics
- Broad Scale Application Areas in DoD, NSA, IRS, USDA, etc.
- Integrating Big Data with Legacy data
- Basic understanding of enabling technologies in predictive analytics
- Data Integration & Dashboard visualization
- Fraud management
- Business Rule/Fraud detection generation
- Threat detection and profiling
- Cost-benefit analysis for Big Data implementation
Day-1: Session-2: Introduction to Big Data-1
- Key characteristics of Big Data: volume, variety, velocity, and veracity. MPP architecture for volume.
- Data Warehouses – static schema, slowly evolving datasets
- MPP Databases like Greenplum, Exadata, Teradata, Netezza, Vertica, etc.
- Hadoop-Based Solutions – no constraints on dataset structure
- Typical pattern: HDFS, MapReduce (crunch), retrieve from HDFS
- Batch processing – suited for analytical/non-interactive tasks
- Velocity: CEP streaming data
- Typical choices – CEP products (e.g., InfoSphere Streams, Apama, MarkLogic, etc.)
- Less production-ready – Storm/S4
- NoSQL Databases – (columnar and key-value): Best suited as an analytical adjunct to data warehouses/databases
Day-1: Session-3: Introduction to Big Data-2
NoSQL Solutions
- KV Store - Keyspace, Flare, SchemaFree, RAMCloud, Oracle NoSQL Database (OnDB)
- KV Store - Dynamo, Voldemort, Dynomite, SubRecord, MongoDB, DovetailDB
- KV Store (Hierarchical) - GT.M, Caché
- KV Store (Ordered) - TokyoTyrant, Lightcloud, NMDB, Luxio, MemcacheDB, Actord
- KV Cache - Memcached, Repcached, Coherence, Infinispan, EXtremeScale, JBossCache, Velocity, Terracotta
- Tuple Store - Gigaspaces, Coord, Apache River
- Object Database - ZopeDB, db4o, Shoal
- Document Store - CouchDB, Cloudant, Couchbase, MongoDB, Jackrabbit, XML-Databases, ThruDB, CloudKit, Persevere, Riak-Basho, Scalaris
- Wide Columnar Store - BigTable, HBase, Apache Cassandra, Hypertable, KAI, OpenNeptune, Qbase, KDI
Varieties of Data: Introduction to Data Cleaning Issues in Big Data
- RDBMS – static structure/schema, does not promote an agile, exploratory environment.
- NoSQL – semi-structured; sufficient structure to store data without an exact schema beforehand.
- Data cleaning issues
Day-1: Session-4: Big Data Introduction-3: Hadoop
- When to select Hadoop?
- STRUCTURED - Enterprise data warehouses/databases can store massive data (at a cost) but impose structure (not ideal for active exploration)
- SEMI-STRUCTURED data – challenging to handle with traditional solutions (DW/DB)
- Warehousing data = HUGE effort and remains static even after implementation
- For variety & volume of data, crunched on commodity hardware – HADOOP
- Commodity H/W needed to create a Hadoop Cluster
Introduction to MapReduce / HDFS
- MapReduce – distribute computing over multiple servers
- HDFS – make data available locally for the computing process (with redundancy)
- Data – can be unstructured/schema-less (unlike RDBMS)
- Developer responsibility to make sense of data
- Programming MapReduce = working with Java (pros/cons), manually loading data into HDFS
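The MapReduce pattern described above can be illustrated with a minimal single-machine sketch. It is written in Python rather than Java for brevity and does not use Hadoop or HDFS; the function names (map_phase, shuffle, reduce_phase, word_count) and the sample records are assumptions made for illustration only. In a real cluster, the framework runs the map and reduce functions on many servers and performs the shuffle itself.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input record."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group emitted values by key (done by the framework on a cluster)."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: collapse all values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce in sequence over an in-memory list of records."""
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    return dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())


# Hypothetical input records; on Hadoop these would be lines read from HDFS blocks.
records = ["big data in government", "big data on commodity hardware"]
counts = word_count(records)
print(counts)
```

The data has no schema here: the map function alone decides how to interpret each record, which mirrors the "developer responsibility to make sense of data" point.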
Day-2: Session-1: Big Data Ecosystem-Building Big Data ETL: Universe of Big Data Tools - Which one to use and when?
- Hadoop vs. Other NoSQL solutions
- For interactive, random access to data
- HBase (column-oriented database) on top of Hadoop
- Random access to data but with restrictions (max 1 PB)
- Not ideal for ad-hoc analytics; good for logging, counting, time-series
- Sqoop - Import from databases to Hive or HDFS (JDBC/ODBC access)
- Flume – Stream data (e.g., log data) into HDFS
Day-2: Session-2: Big Data Management System
- Moving parts, compute nodes start/fail: ZooKeeper - For configuration/coordination/naming services
- Complex pipeline/workflow: Oozie – manage workflow, dependencies, daisy chain
- Deploy, configure, cluster management, upgrade, etc. (sys admin): Ambari
- In Cloud: Whirr
Day-2: Session-3: Predictive Analytics in Business Intelligence -1: Fundamental Techniques & Machine Learning-based BI:
- Introduction to Machine Learning
- Learning classification techniques
- Bayesian Prediction - preparing training file
- Support Vector Machine
- KNN p-Tree Algebra & vertical mining
- Neural Network
- Big Data large variable problem - Random Forest (RF)
- Big Data Automation problem – Multi-model ensemble RF
- Automation through Soft10-M
- Text analytic tool - Treeminer
- Agile learning
- Agent-based learning
- Distributed learning
- Introduction to open-source tools for predictive analytics: R, RapidMiner, Mahout
Day-2: Session-4: Predictive Analytics Ecosystem-2: Common Predictive Analytic Problems in Government
- Insight analytics
- Visualization analytics
- Structured predictive analytics
- Unstructured predictive analytics
- Threat/fraudster/vendor profiling
- Recommendation Engine
- Pattern detection
- Rule/Scenario discovery – failure, fraud, optimization
- Root cause discovery
- Sentiment analysis
- CRM analytics
- Network analytics
- Text Analytics
- Technology-assisted review
- Fraud analytics
- Real-Time Analytics
Day-3: Session-1: Real-Time and Scalable Analytics Over Hadoop
- Why common analytic algorithms fail in Hadoop/HDFS
- Apache Hama - for Bulk Synchronous distributed computing
- Apache SPARK - for cluster computing for real-time analytics
- CMU GraphLab2 - Graph-based asynchronous approach to distributed computing
- KNN p-Tree Algebra-based approach from Treeminer for reduced hardware cost of operation
Day-3: Session-2: Tools for eDiscovery and Forensics
- eDiscovery over Big Data vs. Legacy data – a comparison of cost and performance
- Predictive coding and technology-assisted review (TAR)
- Live demo of a TAR product (vMiner) to understand how TAR works for faster discovery
- Faster indexing through HDFS – velocity of data
- NLP or Natural Language processing – various techniques and open-source products
- eDiscovery in foreign languages - technology for foreign language processing
Day-3: Session-3: Big Data BI for Cyber Security – Understanding the Whole 360-Degree View from Speedy Data Collection to Threat Identification
- Understanding basics of security analytics - attack surface, security misconfiguration, host defenses
- Network infrastructure / Large data pipeline / Response ETL for real-time analytics
- Prescriptive vs. predictive – Fixed rule-based vs. auto-discovery of threat rules from metadata
Day-3: Session-4: Big Data in USDA: Application in Agriculture
- Introduction to IoT (Internet of Things) for agriculture - sensor-based Big Data and control
- Introduction to Satellite imaging and its application in agriculture
- Integrating sensor and image data for soil fertility, cultivation recommendations, and forecasting
- Agriculture insurance and Big Data
- Crop Loss forecasting
Day-4: Session-1: Fraud Prevention BI from Big Data in Government - Fraud Analytics:
- Basic classification of Fraud analytics - rule-based vs. predictive analytics
- Supervised vs. unsupervised Machine Learning for Fraud pattern detection
- Vendor fraud / overcharging for projects
- Medicare and Medicaid fraud - fraud detection techniques for claim processing
- Travel reimbursement frauds
- IRS refund frauds
- Case studies and live demos will be provided wherever data is available.
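The contrast between rule-based and data-driven fraud detection can be sketched in a few lines. The example below is an assumption-laden illustration, not a method taught in the course: the claim amounts are invented, the fixed threshold is hypothetical, and a z-score outlier check stands in as a very simple unsupervised technique.

```python
import statistics

# Hypothetical reimbursement claim amounts; the last one is unusually large.
claims = [120, 135, 110, 128, 140, 115, 900]


def rule_based_flags(amounts, limit):
    """Rule-based: flag any claim above a fixed, hand-written limit."""
    return [i for i, amount in enumerate(amounts) if amount > limit]


def zscore_flags(amounts, z_limit=2.0):
    """Unsupervised stand-in: flag claims far from the mean of the data itself."""
    mean = statistics.mean(amounts)
    stdev = statistics.stdev(amounts)
    return [i for i, amount in enumerate(amounts)
            if abs(amount - mean) / stdev > z_limit]


rule_hits = rule_based_flags(claims, limit=1000)  # fixed rule chosen in advance
outlier_hits = zscore_flags(claims)               # threshold derived from the data
print(rule_hits, outlier_hits)
```

With these invented numbers, the fixed rule misses the 900 claim because it sits under the limit, while the outlier check flags it. A supervised approach would instead need claims already labeled as fraudulent or legitimate.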
Day-4: Session-2: Social Media Analytics - Intelligence Gathering and Analysis
- Big Data ETL API for extracting social media data
- Text, image, metadata, and video
- Sentiment analysis from social media feeds
- Contextual and non-contextual filtering of social media feeds
- Social Media Dashboard to integrate diverse social media
- Automated profiling of social media profiles
- Live demo of each analytic will be given through the Treeminer Tool.
Day-4: Session-3: Big Data Analytics in Image Processing and Video Feeds
- Image Storage techniques in Big Data - Storage solutions for data exceeding petabytes
- LTFS and LTO
- GPFS-LTFS (Layered storage solution for Big image data)
- Fundamentals of image analytics
- Object recognition
- Image segmentation
- Motion tracking
- 3-D image reconstruction
Day-4: Session-4: Big Data Applications in NIH:
- Emerging areas of Bioinformatics
- Meta-genomics and Big Data mining issues
- Big Data Predictive analytics for Pharmacogenomics, Metabolomics, and Proteomics
- Big Data in downstream Genomics processes
- Application of Big Data predictive analytics in Public Health
Big Data Dashboard for Quick Accessibility of Diverse Data and Display:
- Integration of existing application platforms with Big Data Dashboard
- Big Data management
- Case Study of Big Data Dashboard: Tableau and Pentaho
- Use Big Data apps to push location-based services in Government
- Tracking system and management
Day-5: Session-1: How to Justify Big Data BI Implementation Within an Organization:
- Defining ROI for Big Data implementation
- Case studies on saving analyst time spent collecting and preparing data – productivity gain
- Case studies of revenue gain from saving licensed database costs
- Revenue gain from location-based services
- Savings from fraud prevention
- An integrated spreadsheet approach to calculate approximate expenses vs. Revenue gain/savings from Big Data implementation.
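The spreadsheet-style calculation mentioned above can be approximated as follows. This is a minimal sketch under stated assumptions: every figure is hypothetical, the benefit categories simply echo the bullets in this session, and the function name estimate_roi is invented for illustration. It ignores discounting and other factors a real business case would include.

```python
def estimate_roi(annual_benefits, upfront_cost, annual_cost, years):
    """Return (net gain, ROI ratio) over a fixed horizon, without discounting."""
    total_benefit = sum(annual_benefits.values()) * years
    total_cost = upfront_cost + annual_cost * years
    net_gain = total_benefit - total_cost
    return net_gain, net_gain / total_cost


# Hypothetical yearly benefits, one entry per category from the outline.
annual_benefits = {
    "analyst_time_saved": 200_000,
    "licensed_database_savings": 150_000,
    "location_based_services": 50_000,
    "fraud_prevention": 100_000,
}

net_gain, roi = estimate_roi(
    annual_benefits,
    upfront_cost=500_000,  # assumed one-time implementation cost
    annual_cost=150_000,   # assumed yearly operating cost
    years=3,
)
print(f"Net gain: {net_gain}, ROI: {roi:.1%}")
```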
Day-5: Session-2: Step-by-Step Procedure to Replace Legacy Data Systems with Big Data Systems:
- Understanding practical Big Data Migration Roadmap
- What important information is needed before architecting a Big Data implementation
- What are the different ways of calculating volume, velocity, variety, and veracity of data
- How to estimate data growth
- Case studies
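One common way to estimate data growth is a compound growth projection, sketched below. The starting volumes and measurement intervals are hypothetical, and the helper names (annual_growth_rate, project_volume) are assumptions for illustration; real estimates would also account for retention policies and new data sources.

```python
def annual_growth_rate(start_tb, end_tb, years):
    """Infer the compound annual growth rate from two volume measurements."""
    return (end_tb / start_tb) ** (1 / years) - 1


def project_volume(current_tb, rate, years):
    """Project the future volume assuming the same compound rate continues."""
    return current_tb * (1 + rate) ** years


# Hypothetical measurements: 10 TB two years ago, 19.6 TB today.
rate = annual_growth_rate(start_tb=10.0, end_tb=19.6, years=2)
future_tb = project_volume(current_tb=19.6, rate=rate, years=3)
print(f"Growth rate: {rate:.0%}, projected volume in 3 years: {future_tb:.1f} TB")
```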
Day-5: Session-4: Review of Big Data Vendors and Review of Their Products. Q&A Session:
- Accenture
- APTEAN (Formerly CDC Software)
- Cisco Systems
- Cloudera
- Dell
- EMC
- GoodData Corporation
- Guavus
- Hitachi Data Systems
- Hortonworks
- HP
- IBM
- Informatica
- Intel
- Jaspersoft
- Microsoft
- MongoDB (Formerly 10Gen)
- Mu Sigma
- NetApp
- Opera Solutions
- Oracle
- Pentaho
- Platfora
- QlikTech
- Quantum
- Rackspace
- Revolution Analytics
- Salesforce
- SAP
- SAS Institute
- Sisense
- Software AG/Terracotta
- Soft10 Automation
- Splunk
- Sqrrl
- Supermicro
- Tableau Software
- Teradata
- Think Big Analytics
- Tidemark Systems
- Treeminer
- VMware (Part of EMC)
Requirements
- Basic knowledge of business operations and data systems within their respective government domains
- Basic understanding of SQL/Oracle or relational databases
- Basic understanding of Statistics (at the spreadsheet level)
35 Hours
Testimonials (1)
The ability of the trainer to align the course with the requirements of the organization, rather than just providing the course for the sake of delivering it.