Question
What is a "data swamp"?
Answer
A data swamp is best defined as a severely degraded data lake. The term connotes poor governance and negligent management that caused a data lake to gradually lose its value. A data swamp is a data lake that was once useful but, through neglect, can no longer be used even by highly talented analytics professionals. Data swamps can be improved and restored to regular, functional data lakes. There is a gray area between a poorly maintained data lake and a mildly degraded data swamp.
IBM suggests that there is a continuum from data swamp (least valuable) to data lake to data reservoir (most valuable). Page 122 of this IBM document uses all three terms in a way that implies such a continuum. IBM's website describes a "well-managed and governed data lake" as being the same as a data reservoir. Gartner also distinguishes between a data lake and a data reservoir (infocus.emc.com). Other companies, unlike IBM and Gartner, use "data lake" and "data reservoir" interchangeably.
According to an article published on Teradata's website in late 2016, the biggest mistake people make with data lakes is "poor governance." Expect to hear or read the phrase "data swamp" as you continue to work with big data. For further information about data swamps, data lakes, and data reservoirs, see the links below: